Scraping Data from Wikipedia Tables

Web Data Manipulation

How to scrape and clean wikipedia tables.

Uraz Akgül
2022-09-18

I’ve scraped many websites, and scraping Wikipedia is by far the simplest. This post will teach you how to scrape tables from Wikipedia and how to clean the data that has been scraped.

The two urls we’ll use are given below.

  1. https://en.wikipedia.org/wiki/Democracy_Index

The index is based on 60 indicators grouped in five different categories, measuring pluralism, civil liberties and political culture.

  1. https://en.wikipedia.org/wiki/Corruption_Perceptions_Index

The index which ranks countries by their perceived levels of public sector corruption, as determined by expert assessments and opinion surveys.

For more information, please see the URLs above.

The index scores of countries participating in the Shanghai Cooperation Organization Summit in 2022 will be compared to those of countries that are not members of the organization.

url1 <- "https://en.wikipedia.org/wiki/Democracy_Index"
url2 <- "https://en.wikipedia.org/wiki/Corruption_Perceptions_Index"

participants <- c(
  "China",
  "Russia",
  "Kyrgyzstan",
  "Tajikistan",
  "Kazakhstan",
  "Uzbekistan",
  "India",
  "Pakistan",
  "Iran",
  "Turkmenistan",
  "Mongolia",
  "Belarus",
  "Turkey",
  "Azerbaijan"
)

While scraping the Democracy Index table, we followed the steps outlined below.

  1. Read HTML code using the read_html() function

  2. Get tables using the html_table() function

  3. The 6th table in the list is the one we want (by running the first two lines of code, you can see all of the tables)

  4. Select country and score columns

  5. Divide the countries based on the previously stated criteria

  6. Rename score column

democracy <- read_html(url1) %>% 
  html_table() %>% 
  .[[6]] %>% 
  select(3,5) %>% 
  mutate(
    Shanghai = ifelse(
      Country %in% participants, "Shanghai", "Non-Shanghai"
    )
  ) %>% 
  rename(
    "Democracy"=2
  )

While scraping the Corruption Index table, we followed the steps outlined below.

  1. Read HTML code using the read_html() function

  2. Get tables using the html_table() function

  3. The 5th table in the list is the one we want (by running the first two lines of code, you can see all of the tables)

  4. Select country and score columns

  5. Remove the first line

  6. Rename country and score columns

  7. Convert character to numeric for the score column

corruption <- read_html(url2) %>% 
  html_table() %>% 
  .[[5]] %>% 
  select(2,3) %>% 
  slice(-1) %>% 
  rename(
    "Country"=1,
    "Corruption"=2
  ) %>% 
  mutate(
    Corruption = as.numeric(Corruption)
  )

The inner_join() function can be used to join common countries from two tables.

master <- democracy %>% 
  inner_join(corruption, by = "Country") %>% # the common rows between two tables
  select(Country,Democracy,Corruption,Shanghai)

As an example, the ten countries in the table are listed below.

Country Democracy Corruption Shanghai
Canada 8.87 74 Non-Shanghai
United States 7.85 67 Non-Shanghai
Austria 8.07 74 Non-Shanghai
Belgium 7.51 73 Non-Shanghai
Cyprus 7.43 53 Non-Shanghai
Denmark 9.09 88 Non-Shanghai
Finland 9.27 88 Non-Shanghai
France 7.99 71 Non-Shanghai
Germany 8.67 80 Non-Shanghai
Greece 7.56 49 Non-Shanghai

Let’s visualize the data in the last step.

ggplot(master, aes(x = Democracy, y = Corruption, color = Shanghai)) +
  geom_point(size = 5, alpha = .5) +
  ggrepel::geom_text_repel(data = master %>% filter(Shanghai == "Shanghai"),
                           aes(label = Country), size = 7, show.legend = FALSE) +
  geom_vline(xintercept = 5, linetype = "dashed") +
  geom_hline(yintercept = 50, linetype = "dashed") +
  ggthemes::theme_fivethirtyeight() +
  theme(
    axis.title = element_text(size = 25),
    axis.text = element_text(size = 20),
    legend.title = element_blank(),
    legend.position = "top",
    legend.text = element_text(size = 20)
  ) +
  scale_color_manual(values = c("gray60","red"))