How to scrape and clean wikipedia tables.
I’ve scraped many websites, and scraping Wikipedia is by far the simplest. This post will teach you how to scrape tables from Wikipedia and how to clean the data that has been scraped.
The two urls we’ll use are given below.
The index is based on 60 indicators grouped in five different categories, measuring pluralism, civil liberties and political culture.
The index which ranks countries by their perceived levels of public sector corruption, as determined by expert assessments and opinion surveys.
For more information, please see the URLs above.
The index scores of countries participating in the Shanghai Cooperation Organization Summit in 2022 will be compared to those of countries that are not members of the organization.
url1 <- "https://en.wikipedia.org/wiki/Democracy_Index"
url2 <- "https://en.wikipedia.org/wiki/Corruption_Perceptions_Index"
participants <- c(
"China",
"Russia",
"Kyrgyzstan",
"Tajikistan",
"Kazakhstan",
"Uzbekistan",
"India",
"Pakistan",
"Iran",
"Turkmenistan",
"Mongolia",
"Belarus",
"Turkey",
"Azerbaijan"
)
While scraping the Democracy Index table, we followed the steps outlined below.
Read HTML code using the read_html() function
Get tables using the html_table() function
The 6th table in the list is the one we want (by running the first two lines of code, you can see all of the tables)
Select country and score columns
Divide the countries based on the previously stated criteria
Rename score column
While scraping the Corruption Index table, we followed the steps outlined below.
Read HTML code using the read_html() function
Get tables using the html_table() function
The 5th table in the list is the one we want (by running the first two lines of code, you can see all of the tables)
Select country and score columns
Remove the first line
Rename country and score columns
Convert character to numeric for the score column
The inner_join() function can be used to join common countries from two tables.
master <- democracy %>%
inner_join(corruption, by = "Country") %>% # the common rows between two tables
select(Country,Democracy,Corruption,Shanghai)
As an example, the ten countries in the table are listed below.
Country | Democracy | Corruption | Shanghai |
---|---|---|---|
Canada | 8.87 | 74 | Non-Shanghai |
United States | 7.85 | 67 | Non-Shanghai |
Austria | 8.07 | 74 | Non-Shanghai |
Belgium | 7.51 | 73 | Non-Shanghai |
Cyprus | 7.43 | 53 | Non-Shanghai |
Denmark | 9.09 | 88 | Non-Shanghai |
Finland | 9.27 | 88 | Non-Shanghai |
France | 7.99 | 71 | Non-Shanghai |
Germany | 8.67 | 80 | Non-Shanghai |
Greece | 7.56 | 49 | Non-Shanghai |
Let’s visualize the data in the last step.
ggplot(master, aes(x = Democracy, y = Corruption, color = Shanghai)) +
geom_point(size = 5, alpha = .5) +
ggrepel::geom_text_repel(data = master %>% filter(Shanghai == "Shanghai"),
aes(label = Country), size = 7, show.legend = FALSE) +
geom_vline(xintercept = 5, linetype = "dashed") +
geom_hline(yintercept = 50, linetype = "dashed") +
ggthemes::theme_fivethirtyeight() +
theme(
axis.title = element_text(size = 25),
axis.text = element_text(size = 20),
legend.title = element_blank(),
legend.position = "top",
legend.text = element_text(size = 20)
) +
scale_color_manual(values = c("gray60","red"))