Uraz's R Blog: Using Cosine Similarity to Build a Movie Recommendation System

Have you ever fallen into a void after finishing a movie and tried to find similar ones? After a movie that impressed me, I immediately try to find alternatives. If we really enjoy working with data and are good at algorithms, why not build our own system?

We used the cosine similarity metric to achieve our results. I think I’ve been obsessed with it lately. I use it especially when I study on text similarities. In this study, we are going to use the method I mentioned. Let’s dive into it.

Cosine similarity measures the similarity between two vectors and is the cosine of the angle between two vectors. The smaller the angle between two vectors, the more similar they are to each other.

Consider two vectors, x and y. We can calculate the cosine similarity between the vectors as follows:

\(cos(\theta) = \frac{x.y}{||x||||y||} = \frac{\sum_{i=1}^{n}x_iy_i}{\sqrt{\sum_{i=1}^{n}x_i^2}\sqrt{\sum_{i=1}^{n}y_i^2}}\)

See how it’s calculated with a simple example.

x <- c(3,4,1,0)
y <- c(3,4,4,8)

\(x.y = 3*3 + 4*4 + 1*4 + 0*8 = 29\)

\(||x|| = \sqrt{3^2 + 4^2 + 1^2 + 0^2} = 5.09902\)

\(||y|| = \sqrt{3^2 + 4^2 + 4^2 + 8^2} = 10.24695\)

\(cos(x,y) = cos(\theta) = \frac{29}{5.09902*10.24695} = 0.5550303\)

Remember: The smaller the angle between two vectors, the more similar they are to each other. So, if the angle between vectors is zero degrees, then the cosine similarity value is 1, which means the two vectors are similar or relevant. The similarity can be any value in the range [−1,1]. In addition to this, the cosine distance is as:

\(Cosine\ Distance\ =\ 1\ -\ Cosine\ Similarity\)

Cosine similarity can also be calculated with the help of the {lsa} package.

as.numeric(lsa::cosine(x,y))

[1] 0.5550303

So, how can similarity be calculated when it comes to texts? Let me give you an example.

text1 <- "I am learning R programming language"
text2 <- "I am learning Python programming language"

We can create two vectors with the word frequencies corresponding to each other as follows:

# I, am, learning, R, Python, programming, language

text1_n <- c(1,1,1,1,0,1,1)
text2_n <- c(1,1,1,0,1,1,1)

as.numeric(lsa::cosine(text1_n,text2_n))

[1] 0.8333333

We can move on to building a movie recommender system. The data to be used in the study will be obtained by web scraping from IMDB.

The following steps can be followed:

https://www.imdb.com/
Menu/Browse Movies by Genre
Popular Movies by Genre/Sci-Fi (Scroll down to see)
https://www.imdb.com/search/title/?title_type=feature&genres=sci-fi&start=1&explore=genres&ref_=adv_nxt

Focus on last URL. The value you see as 1 next to start will increase by 50 each time you switch to the next page. In fact, you do not encounter this URL at first, but when you move to the next page, you can obtain this URL and edit the first page yourself. The movie list starts with the value you type in the value to the right of start.

I want to get the first 100 URLs in the list sorted by popularity.

library(rvest)
library(tidyverse)

URLs need to be created first.

urls <- str_c(
  "https://www.imdb.com/search/title/?title_type=feature&genres=sci-fi&start=",
  seq(1,5000,50),
  "&explore=genres&ref_=adv_nxt"
)

urls[c(1,length(urls))]

[1] "https://www.imdb.com/search/title/?title_type=feature&genres=sci-fi&start=1&explore=genres&ref_=adv_nxt"   
[2] "https://www.imdb.com/search/title/?title_type=feature&genres=sci-fi&start=4951&explore=genres&ref_=adv_nxt"

We are going to get the titles and descriptions of the movies using a for loop. See the examples below for the first page.

firstURL <- urls[1]

title <- read_html(firstURL) %>% 
  html_nodes("div.lister-item-content h3.lister-item-header a") %>% 
  html_text()

head(title)

[1] "Nope"                             
[2] "Prey"                             
[3] "Samaritan"                        
[4] "DC League of Super-Pets"          
[5] "Thor: Love and Thunder"           
[6] "Everything Everywhere All at Once"

description <- read_html(firstURL) %>% 
  html_nodes("p.text-muted") %>% 
  html_text() %>% 
  .[c(FALSE,TRUE)] %>% 
  gsub("[\n]", "", .)

head(paste0(substr(description,1,100),"..."))

[1] "The residents of a lonely gulch in inland California bear witness to an uncanny and chilling discove..."
[2] "The origin story of the Predator in the world of the Comanche Nation 300 years ago. Naru, a skilled ..."
[3] "A young boy learns that a superhero who was thought to have gone missing after an epic battle twenty..."
[4] "Krypto the Super-Dog and Superman are inseparable best friends, sharing the same superpowers and fig..."
[5] "Thor enlists the help of Valkyrie, Korg and ex-girlfriend Jane Foster to fight Gorr the God Butcher,..."
[6] "An aging Chinese immigrant is swept up in an insane adventure, where she alone can save the world by..."

df <- data.frame(
  title = title,
  description = description
)

title	description
Nope	The residents of a lonely gulch in inland California bear witness to an uncanny and chilling discovery.
Prey	The origin story of the Predator in the world of the Comanche Nation 300 years ago. Naru, a skilled warrior, fights to protect her tribe against one of the first highly-evolved Predators to land on Earth.
Samaritan	A young boy learns that a superhero who was thought to have gone missing after an epic battle twenty years ago may in fact still be around.
DC League of Super-Pets	Krypto the Super-Dog and Superman are inseparable best friends, sharing the same superpowers and fighting crime side by side in Metropolis. However, Krypto must master his own powers for a rescue mission when Superman is kidnapped.
Thor: Love and Thunder	Thor enlists the help of Valkyrie, Korg and ex-girlfriend Jane Foster to fight Gorr the God Butcher, who intends to make the gods extinct.
Everything Everywhere All at Once	An aging Chinese immigrant is swept up in an insane adventure, where she alone can save the world by exploring other universes connecting with the lives she could have led.

Creating the dataset…

master <- data.frame()

for(i in seq_along(urls)){
  
  title <- read_html(urls[i]) %>% 
  html_nodes("div.lister-item-content h3.lister-item-header a") %>% 
  html_text()
  
  description <- read_html(urls[i]) %>% 
  html_nodes("p.text-muted") %>% 
  html_text() %>% 
  .[c(FALSE,TRUE)] %>% 
  gsub("[\n]", "", .)
  
  master <- master %>% 
    bind_rows(
      data.frame(
        title = title,
        description = description
      )
    )
  
  #print(i)
  Sys.sleep(1)
  
}

A dataset should be examined after it is created, but in this work I ignore this step. Ultimately, we have 5000 movies.

Finally we can move on to the cosine similarity calculation. The {widyr} package can be used to calculate it. The 2 packages shown below will be used.

library(tidytext)
library(widyr)

We are going to calculate cosine similarity based on two methods: Word Frequencies and TF-IDF.

master2 <- master %>% 
  unnest_tokens(output = "word", input = "description") %>% 
  anti_join(get_stopwords()) %>% 
  count(word, title) %>% 
  bind_tf_idf(word, title, n)

Cosine Similarity - Word Frequencies:

master_freq <- master2 %>% 
  pairwise_similarity(title, word, n, upper = FALSE, sort = TRUE) %>% 
  mutate(item = paste0(item1,"---",item2), .before = similarity) %>% 
  select(-c(item1,item2)) %>% 
  rename("similarity_freq"=2)

Cosine Similarity - TF-IDF:

TF-IDF stands for Term Frequency-Inverse Document Frequency, and it’s intended to measure how important a word is to a document in a collection or corpus.

i: term (word),

j: document (set of words),

N: count of corpus (the total document set)

tf(i,j) = count of i in j / number of words in j

df(i) = occurrence of i in documents

idf(i) = log(N/(df + 1))

tf-idf(i,j) = tf(i,j) * log(N/(df + 1))

master_tf_idf <- master2 %>% 
  pairwise_similarity(title, word, tf_idf, upper = FALSE, sort = TRUE) %>% 
  mutate(item = paste0(item1,"---",item2), .before = similarity) %>% 
  select(-c(item1,item2)) %>% 
  rename("similarity_tf_idf"=2)

I used Minority Report, of the Sci-Fi genre as the item to be used for cosine similarity calculations when recommending the top 10 movies.

Based on word frequency:

df_wf <- master_freq %>% 
  filter(grepl("Minority Report",item)) %>% 
  top_n(10, similarity_freq) %>% 
  arrange(desc(similarity_freq))

item	similarity_freq
Minority Report—Judge Dredd	0.3077935
Cyborg 2087—Minority Report	0.2766858
Minority Report—Omega Cop	0.2601330
Le dernier combat—Minority Report	0.2545139
Para iz buduschego—Minority Report	0.2445580
Minority Report—Eternity	0.2445580
Minority Report—Maayavan	0.2433321
Minority Report—Knights	0.2433321
Minority Report—Narcopolis	0.2391824
Minority Report—Trudno byt bogom	0.2369396

The title with the closest similarity was Judge Dredd based on word frequency.

Based on TF-IDF:

df_tfidf <- master_tf_idf %>% 
  filter(grepl("Minority Report",item)) %>% 
  top_n(10, similarity_tf_idf) %>% 
  arrange(desc(similarity_tf_idf))

item	similarity_tf_idf
Minority Report—Time Frame	0.2416788
Minority Report—Strange Behavior	0.1993392
Minority Report—Outpost 37	0.1971800
Minority Report—Resident Evil	0.1873535
Minority Report—Im Stahlnetz des Dr. Mabuse	0.1752203
Kidô keisatsu patorebâ: The Movie 2—Minority Report	0.1744674
Minority Report—In the Shadow of the Moon	0.1643750
Minority Report—Time Under Fire	0.1642217
Minority Report—Judge Dredd	0.1626650
Minority Report—Maayavan	0.1608787

The title with the closest similarity was Time Frame based on TF-IDF.

df_wf %>% 
  ggplot(aes(x = reorder(item, similarity_freq), y = similarity_freq, fill = similarity_freq)) +
  geom_col() +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "none") +
  scale_fill_gradient(low = "orange", high = "red") +
  coord_flip() +
  labs(title = "Word Frequency") -> g1

df_tfidf %>% 
  ggplot(aes(x = reorder(item, similarity_tf_idf), y = similarity_tf_idf, fill = similarity_tf_idf)) +
  geom_col() +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "none") +
  scale_fill_gradient(low = "orange", high = "red") +
  coord_flip() +
  labs(title = "TF-IDF") -> g2

gridExtra::grid.arrange(g1,g2,ncol=2)

title	description
Minority Report	In a future where a special police unit is able to arrest murderers before they commit their crimes, an officer from that unit is himself accused of a future murder.
Judge Dredd	In a dystopian future, Joseph Dredd, the most famous Judge (a police officer with instant field judiciary powers), is convicted for a crime he did not commit and must face his murderous counterpart.
Time Frame	What if you were eleven hours away from being executed for a murder you didn’t commit, and the only way to save yourself was to go back in time, and commit the murder?

If everything is clear so far, you can do more using your imagination. These studies don’t work without story and imagination.