The Data

The data this week comes from Kaggle, with credit to Shivam Bansal.

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

library(tidyverse)

nf_titles <- read_csv("netflix_titles.csv")

Note:

After posting this entry, I noticed that Julia Silge had done a similar (and, obviously, way better!) analysis on her blog. I've gone through her code and replicated it here; the majority of the code below is copied directly from her blog.

Creating topics using LDA

By unnesting the descriptions (into separate words), we can identify the words that describe each of the shows.

My version:

library(tidytext)

unnest_desc <- nf_titles %>%
  unnest_tokens(word, description) %>%      # one row per word per show
  filter(!word %in% stop_words$word) %>%    # drop common stop words
  group_by(word) %>%
  mutate(word_total = n()) %>%
  ungroup() %>%
  filter(word_total > 50)                   # keep only reasonably common words

netflix_dtm <- unnest_desc %>%
  unite(type, show_id) %>%                  # use show_id as the document identifier
  count(type, word) %>%
  cast_dtm(type, word, n)                   # document-term matrix for LDA
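
As a quick sanity check (my addition, not in the original post), we can peek at which words survive the word_total > 50 cutoff:

# most common description words after filtering
unnest_desc %>%
  distinct(word, word_total) %>%
  arrange(desc(word_total)) %>%
  head()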

Julia Silge’s version:

nf_titles %>%
  unnest_tokens(word, description) %>%
  anti_join(tidytext::get_stopwords(), by = "word") %>%   # remove stop words
  count(type, word, sort = TRUE) %>%
  group_by(type) %>%
  slice_max(n, n = 15) %>%                                # top 15 words per type
  ungroup() %>%
  mutate(word = reorder_within(word, n, type)) %>%
  ggplot(aes(x = n, y = word, fill = type)) +
  geom_col(show.legend = FALSE, alpha = 0.8) +
  scale_y_reordered() +
  facet_wrap(~type, scales = "free") +
  labs(
    x = "Word frequency", y = NULL,
    title = "Top words in Netflix descriptions by frequency",
    subtitle = "after removing stop words"
  )

Jo's LDA

library(topicmodels)

netflix_lda <- LDA(netflix_dtm, k = 10, control = list(seed = 4747))
netflix_lda
## A LDA_VEM topic model with 10 topics.
netflix_topics <- tidy(netflix_lda, matrix = "beta")
netflix_topics
## # A tibble: 2,580 x 3
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     1 chance 0.00505  
##  2     2 chance 0.00205  
##  3     3 chance 0.00233  
##  4     4 chance 0.0000912
##  5     5 chance 0.00644  
##  6     6 chance 0.00147  
##  7     7 chance 0.000496 
##  8     8 chance 0.00527  
##  9     9 chance 0.00451  
## 10    10 chance 0.00307  
## # … with 2,570 more rows

Latent Dirichlet Allocation creates groups (topics) of words that are commonly seen together in the same document; here, each show's description is a document. I created 10 different topics.

netflix_top_terms <- netflix_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% 
  ungroup() %>%
  arrange(topic, -beta)

netflix_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
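
For a quicker, text-only view of the same topics (a convenience check on my part, not from the original analysis), topicmodels provides terms(), which pulls the top terms per topic straight from the fitted model:

terms(netflix_lda, 5)   # top 5 terms for each of the 10 topics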

Predicting

Using the topics above, we can identify the topic to which each show most likely belongs.

netflix_gamma <- tidy(netflix_lda, matrix = "gamma")
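
The gamma values are per-document topic probabilities, so the ten values for each show should sum to 1. A quick check (my addition):

# each document's topic probabilities should sum to 1
netflix_gamma %>%
  group_by(document) %>%
  summarize(total = sum(gamma)) %>%
  summary()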

show_class <- netflix_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%   # keep each show's single most probable topic
  ungroup()

titles <- nf_titles %>%
  full_join(show_class, by = c("show_id" = "document"))

By grouping the shows by genre, we can look for trends across the LDA topics. There are many more international movies than other genres, but there do not appear to be any strong associations between genre and LDA topic. Dramas are possibly enriched in topic 4, but even that trend is very weak.

titles %>%
  filter(!is.na(topic)) %>%
  mutate(lda_topic = as.factor(topic)) %>%
  # split the comma-separated genre list into one row per genre
  unnest_tokens(genre, listed_in, token = stringr::str_split, pattern = ",") %>%
  group_by(genre) %>%
  mutate(genre_total = n()) %>%
  ungroup() %>%
  filter(genre_total > 500) %>%   # keep only the common genres
  group_by(genre, lda_topic) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  ggplot() +
  geom_tile(aes(x = genre, y = reorder(lda_topic, desc(lda_topic)), fill = count)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        axis.title = element_text(size = 15, face = "bold"),
        plot.title = element_text(size = 20, face = "bold"),
        plot.caption = element_text(size = 15)) +
  ylab("LDA topic") +
  scale_fill_distiller(palette = "RdPu", trans = "reverse") +
  labs(
    title = "LDA groups from show descriptions",
    caption = "Tidy Tuesday Plot: @hardin47 | Data: Netflix Shows") 

Julia’s modeling

Building

library(tidymodels)

set.seed(47)
netflix_split <- nf_titles %>%
  select(type, description) %>%
  initial_split(strata = type)

# first split into test/train
netflix_train <- training(netflix_split)
netflix_test <- testing(netflix_split)

# then use CV on the training data
netflix_folds <- vfold_cv(netflix_train, strata = type)
netflix_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 x 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [5254/585]> Fold01
##  2 <split [5254/585]> Fold02
##  3 <split [5255/584]> Fold03
##  4 <split [5255/584]> Fold04
##  5 <split [5255/584]> Fold05
##  6 <split [5255/584]> Fold06
##  7 <split [5255/584]> Fold07
##  8 <split [5256/583]> Fold08
##  9 <split [5256/583]> Fold09
## 10 <split [5256/583]> Fold10

Some things I noticed in Julia’s code:

  • Why do we need the two additional packages? What do they do beyond tidytext?
  • She tokenizes the words after building the training data, which makes sense, of course. But why not after cross validating?
  • I want to know more about what the normalize function does.
  • I have no idea what SMOTE (Synthetic Minority Over-sampling Technique) does. I need to learn about SMOTE; see the sketch after the recipe printout below.
  • Why do all these functions start with step_?

library(textrecipes)   # text-preprocessing steps for recipes
library(themis)        # steps for class imbalance (e.g., SMOTE)

netflix_rec <- recipe(type ~ description, data = netflix_train) %>%
  step_tokenize(description) %>%                        # split descriptions into words
  step_tokenfilter(description, max_tokens = 1e3) %>%   # keep the 1000 most frequent tokens
  step_tfidf(description) %>%                           # one tf-idf column per token
  step_normalize(all_numeric_predictors()) %>%          # center and scale
  step_smote(type)                                      # balance Movie / TV Show classes

netflix_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Operations:
## 
## Tokenization for description
## Text filtering for description
## Term frequency-inverse document frequency with description
## Centering and scaling for all_numeric_predictors()
## SMOTE based on type
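
One way to demystify the step_ functions (my own exploration, not from Julia's post) is to prep() the recipe, which estimates and applies each step to the training data, and then bake() it to retrieve the processed data. In particular, counting the classes afterward shows what step_smote does: it synthesizes new minority-class rows until Movie and TV Show are balanced. (And step_normalize simply centers and scales each numeric predictor.)

netflix_prep <- prep(netflix_rec)                      # estimate/apply each step
netflix_baked <- bake(netflix_prep, new_data = NULL)   # processed training data

dim(netflix_baked)           # ~1000 tf-idf predictors plus the outcome
count(netflix_baked, type)   # classes now balanced, courtesy of step_smote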

A linear support vector machine actually seems like a good choice here. After step_tfidf there are a thousand predictor variables (one tf-idf score per retained word), and any single description uses only a handful of those words, so the feature space is high-dimensional and sparse. In that setting, a nonlinear (transformed) decision boundary is unlikely to improve much on a linear one.

Again, lots to learn here! In particular, how are models set up without any information about the CV folds? (A partial answer appears after the workflow printout below.) And why add a recipe; what does that mean? In terms of students working through this code, how can it be pared down to as few new functions as possible?

svm_spec <- parsnip::svm_linear() %>%
  set_mode("classification") %>%
  set_engine("LiblineaR")

netflix_wf <- workflow() %>%   # bundle the recipe and the model specification
  add_recipe(netflix_rec) %>%
  add_model(svm_spec)

netflix_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_linear()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_tokenize()
## • step_tokenfilter()
## • step_tfidf()
## • step_normalize()
## • step_smote()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Support Vector Machine Specification (classification)
## 
## Computational engine: LiblineaR
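
A partial answer to my earlier question about the CV folds: the workflow is only an unfitted specification bundling the recipe and the model. No data is touched until fitting time, either all at once or fold-by-fold (this snippet is just for illustration, not part of the pipeline):

# nothing has been fit yet; data arrives only when we fit, e.g.
netflix_wf %>% fit(data = netflix_train)   # a single fit on the full training set
# fit_resamples() below instead refits the workflow on each fold
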
doParallel::registerDoParallel()   # parallelize the resampling over folds

set.seed(4747)
svm_rs <- fit_resamples(
  netflix_wf,
  netflix_folds,
  metrics = metric_set(accuracy, recall, precision),
  control = control_resamples(save_pred = TRUE)   # keep per-fold predictions
)

collect_metrics(svm_rs)
## # A tibble: 3 x 6
##   .metric   .estimator  mean     n std_err .config             
##   <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy  binary     0.677    10 0.00677 Preprocessor1_Model1
## 2 precision binary     0.786    10 0.00561 Preprocessor1_Model1
## 3 recall    binary     0.732    10 0.00540 Preprocessor1_Model1

library(tune)
svm_rs %>%
  conf_mat_resampled(tidy = FALSE) %>% 
  autoplot()

# last_fit() fits on the full training set and evaluates once on the held-out test set
final_fitted <- last_fit(
  netflix_wf,
  netflix_split,
  metrics = metric_set(accuracy, recall, precision)
)

collect_metrics(final_fitted)
## # A tibble: 3 x 4
##   .metric   .estimator .estimate .config             
##   <chr>     <chr>          <dbl> <chr>               
## 1 accuracy  binary         0.667 Preprocessor1_Model1
## 2 recall    binary         0.708 Preprocessor1_Model1
## 3 precision binary         0.788 Preprocessor1_Model1

collect_predictions(final_fitted) %>%
  conf_mat(type, .pred_class)
##           Truth
## Prediction Movie TV Show
##    Movie     952     256
##    TV Show   393     347
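
As a check (my arithmetic, treating Movie as the positive class), the test-set metrics can be recomputed directly from this confusion matrix:

(952 + 347) / (952 + 256 + 393 + 347)   # accuracy  ~ 0.667
952 / (952 + 393)                       # recall    ~ 0.708
952 / (952 + 256)                       # precision ~ 0.788
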
netflix_fit <- pull_workflow_fit(final_fitted$.workflow[[1]])   # extract the fitted parsnip model

tidy(netflix_fit) %>%
  arrange(estimate)
## # A tibble: 1,001 x 2
##    term                         estimate
##    <chr>                           <dbl>
##  1 tfidf_description_series      -0.293 
##  2 tfidf_description_docuseries  -0.179 
##  3 tfidf_description_world       -0.126 
##  4 tfidf_description_arts        -0.0929
##  5 tfidf_description_and         -0.0906
##  6 tfidf_description_crimes      -0.0892
##  7 tfidf_description_adventures  -0.0862
##  8 tfidf_description_group       -0.0840
##  9 tfidf_description_fight       -0.0796
## 10 tfidf_description_murder      -0.0777
## # … with 991 more rows

tidy(netflix_fit) %>%
  filter(term != "Bias") %>%
  group_by(sign = estimate > 0) %>%   # split coefficients by sign
  slice_max(abs(estimate), n = 15) %>%
  ungroup() %>%
  mutate(
    term = str_remove(term, "tfidf_description_"),
    sign = if_else(!sign, "More from TV shows", "More from movies")
  ) %>%
  ggplot(aes(x = abs(estimate),
             y = fct_reorder(term, abs(estimate)),
             fill = sign)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~sign, scales = "free") +
  labs(x = "Coefficient from linear SVM", y = NULL,
       title = "Which words are most predictive of movies vs. TV shows?",
       subtitle = "For description text of movies and TV shows on Netflix")