Uraz's R Blog: Scraping Data from Wikipedia Tables

I’ve scraped many websites, and scraping Wikipedia is by far the simplest. This post will teach you how to scrape tables from Wikipedia and how to clean the data that has been scraped.

The two urls we’ll use are given below.

https://en.wikipedia.org/wiki/Democracy_Index

The index is based on 60 indicators grouped in five different categories, measuring pluralism, civil liberties and political culture.

https://en.wikipedia.org/wiki/Corruption_Perceptions_Index

The index which ranks countries by their perceived levels of public sector corruption, as determined by expert assessments and opinion surveys.

For more information, please see the URLs above.

The index scores of countries participating in the Shanghai Cooperation Organization Summit in 2022 will be compared to those of countries that are not members of the organization.

library(rvest)
library(tidyverse)

url1 <- "https://en.wikipedia.org/wiki/Democracy_Index"
url2 <- "https://en.wikipedia.org/wiki/Corruption_Perceptions_Index"

participants <- c(
  "China",
  "Russia",
  "Kyrgyzstan",
  "Tajikistan",
  "Kazakhstan",
  "Uzbekistan",
  "India",
  "Pakistan",
  "Iran",
  "Turkmenistan",
  "Mongolia",
  "Belarus",
  "Turkey",
  "Azerbaijan"
)

While scraping the Democracy Index table, we followed the steps outlined below.

Read HTML code using the read_html() function
Get tables using the html_table() function
The 6th table in the list is the one we want (by running the first two lines of code, you can see all of the tables)
Select country and score columns
Divide the countries based on the previously stated criteria
Rename score column

democracy <- read_html(url1) %>% 
  html_table() %>% 
  .[[6]] %>% 
  select(3,5) %>% 
  mutate(
    Shanghai = ifelse(
      Country %in% participants, "Shanghai", "Non-Shanghai"
    )
  ) %>% 
  rename(
    "Democracy"=2
  )

While scraping the Corruption Index table, we followed the steps outlined below.

Read HTML code using the read_html() function
Get tables using the html_table() function
The 5th table in the list is the one we want (by running the first two lines of code, you can see all of the tables)
Select country and score columns
Remove the first line
Rename country and score columns
Convert character to numeric for the score column

corruption <- read_html(url2) %>% 
  html_table() %>% 
  .[[5]] %>% 
  select(2,3) %>% 
  slice(-1) %>% 
  rename(
    "Country"=1,
    "Corruption"=2
  ) %>% 
  mutate(
    Corruption = as.numeric(Corruption)
  )

The inner_join() function can be used to join common countries from two tables.

master <- democracy %>% 
  inner_join(corruption, by = "Country") %>% # the common rows between two tables
  select(Country,Democracy,Corruption,Shanghai)

As an example, the ten countries in the table are listed below.

Country	Democracy	Corruption	Shanghai
Canada	8.87	74	Non-Shanghai
United States	7.85	67	Non-Shanghai
Austria	8.07	74	Non-Shanghai
Belgium	7.51	73	Non-Shanghai
Cyprus	7.43	53	Non-Shanghai
Denmark	9.09	88	Non-Shanghai
Finland	9.27	88	Non-Shanghai
France	7.99	71	Non-Shanghai
Germany	8.67	80	Non-Shanghai
Greece	7.56	49	Non-Shanghai

Let’s visualize the data in the last step.

ggplot(master, aes(x = Democracy, y = Corruption, color = Shanghai)) +
  geom_point(size = 5, alpha = .5) +
  ggrepel::geom_text_repel(data = master %>% filter(Shanghai == "Shanghai"),
                           aes(label = Country), size = 7, show.legend = FALSE) +
  geom_vline(xintercept = 5, linetype = "dashed") +
  geom_hline(yintercept = 50, linetype = "dashed") +
  ggthemes::theme_fivethirtyeight() +
  theme(
    axis.title = element_text(size = 25),
    axis.text = element_text(size = 20),
    legend.title = element_blank(),
    legend.position = "top",
    legend.text = element_text(size = 20)
  ) +
  scale_color_manual(values = c("gray60","red"))