The data this week comes from Kaggle w/ credit to Shivam Bansal.
This dataset consists of tv shows and movies available on Netflix as of 2019. The dataset is collected from Flixable which is a third-party Netflix search engine.
nf_titles <- read_csv("netflix_titles.csv")
After posting this entry, I noticed that Julia Silge had done a similar (and obs, way better!) analysis on her blog. I’ve gone through her code and replicated it here. The majority of the code here is copied directly from her blog.
By unnesting the descriptions (into separate words), we can identify the words that describe each of the shows.
My version:
unnest_desc <- nf_titles %>%
unnest_tokens(word, description) %>%
filter(!word %in% stop_words$word) %>%
group_by(word) %>%
mutate(word_total = n()) %>%
ungroup() %>%
filter(word_total > 50)
netflix_dtm <- unnest_desc %>%
unite(type, show_id) %>%
count(type, word) %>%
cast_dtm(type, word, n)
Julia Silge’s version:
nf_titles %>%
unnest_tokens(word, description) %>%
anti_join(tidytext::get_stopwords()) %>%
count(type, word, sort = TRUE) %>%
group_by(type) %>%
slice_max(n, n = 15) %>%
ungroup() %>%
mutate(word = reorder_within(word, n, type)) %>%
ggplot(aes(x = n, y = word, fill = type)) +
geom_col(show.legend = FALSE, alpha = 0.8) +
scale_y_reordered() +
facet_wrap(~type, scales = "free") +
labs(
x = "Word frequency", y = NULL,
title = "Top words in Netflix descriptions by frequency",
subtitle = "after removing stop words"
)
netflix_lda <- LDA(netflix_dtm, k = 10, control = list(seed = 4747))
netflix_lda
## A LDA_VEM topic model with 10 topics.
netflix_topics <- tidy(netflix_lda, matrix = "beta")
netflix_topics
## # A tibble: 2,580 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 chance 0.00505
## 2 2 chance 0.00205
## 3 3 chance 0.00233
## 4 4 chance 0.0000912
## 5 5 chance 0.00644
## 6 6 chance 0.00147
## 7 7 chance 0.000496
## 8 8 chance 0.00527
## 9 9 chance 0.00451
## 10 10 chance 0.00307
## # … with 2,570 more rows
Latent Dirichlet Analysis creates groups of words which are commonly seen together (in the same document). I created 10 different groups of words.
netflix_top_terms <- netflix_topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
arrange(topic, -beta)
netflix_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
Using the word groups above, we can predict which group each show is more likely to belong.
netflix_gamma <- tidy(netflix_lda, matrix = "gamma")
show_class <- netflix_gamma %>%
group_by(document) %>%
slice_max(gamma) %>%
ungroup()
titles <- nf_titles %>%
full_join(show_class, by = c("show_id" = "document"))
By grouping the shows (by genre), we can see if there are any trends to the LDA groups. We can see that there are many more international movies than other genres, but there do not appear to be any trends with respect to the genres and LDA groups. Possibly dramas are enriches in group 4 (?), but even that trend is very minor.
titles %>%
filter(!is.na(topic)) %>%
mutate(lda_topic = as.factor(topic)) %>%
unnest_tokens(genre, listed_in, token = stringr::str_split, pattern = ",") %>%
group_by(genre) %>%
mutate(genre_total = n()) %>%
ungroup() %>%
filter(genre_total > 500) %>%
#select(genre, topic) %>%
group_by(genre, lda_topic) %>%
summarize(count = n()) %>%
ungroup() %>%
#table()
ggplot() +
geom_tile(aes(x = genre, y = reorder(lda_topic, desc(lda_topic)), fill = count)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.title=element_text(size=15,face="bold"),
plot.title = element_text(size = 20, face = "bold"),
plot.caption = element_text(size = 15)) +
ylab("LDA topic")+
scale_fill_distiller(palette = "RdPu", trans = "reverse") +
labs(
title = "LDA groups from show descriptions",
caption = "Tidy Tuesday Plot: @hardin47 | Data: Netflix Shows")
library(tidymodels)
set.seed(47)
netflix_split <- nf_titles %>%
select(type, description) %>%
initial_split(strata = type)
# first split into test/train
netflix_train <- training(netflix_split)
netflix_test <- testing(netflix_split)
# then use CV on the training data
netflix_folds <- vfold_cv(netflix_train, strata = type)
netflix_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <split [5254/585]> Fold01
## 2 <split [5254/585]> Fold02
## 3 <split [5255/584]> Fold03
## 4 <split [5255/584]> Fold04
## 5 <split [5255/584]> Fold05
## 6 <split [5255/584]> Fold06
## 7 <split [5255/584]> Fold07
## 8 <split [5256/583]> Fold08
## 9 <split [5256/583]> Fold09
## 10 <split [5256/583]> Fold10
Some things I noticed in Julia’s code:
step_
?library(textrecipes)
library(themis)
netflix_rec <- recipe(type ~ description, data = netflix_train) %>%
step_tokenize(description) %>%
step_tokenfilter(description, max_tokens = 1e3) %>%
step_tfidf(description) %>%
step_normalize(all_numeric_predictors()) %>%
step_smote(type)
netflix_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 1
##
## Operations:
##
## Tokenization for description
## Text filtering for description
## Term frequency-inverse document frequency with description
## Centering and scaling for all_numeric_predictors()
## SMOTE based on type
A linear support vector machine actually seems really good here. There are a zillion binary variables (each word is one-hot coded). Also, because the predictor variables are all binary, it doesn’t seem like any transformations (i.e., transformed boundary) would be any different than a linear model.
Again, lots to learn here! In particular, how are models being set up without information about the CV folds? And why add a recipe? What does that mean? In terms of students working through this code, how can it be parsed down for as few new functions as possible?
svm_spec <- parsnip::svm_linear() %>%
set_mode("classification") %>%
set_engine("LiblineaR")
netflix_wf <- workflow() %>%
add_recipe(netflix_rec) %>%
add_model(svm_spec)
netflix_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_linear()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_tokenize()
## • step_tokenfilter()
## • step_tfidf()
## • step_normalize()
## • step_smote()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Support Vector Machine Specification (classification)
##
## Computational engine: LiblineaR
doParallel::registerDoParallel()
set.seed(4747)
svm_rs <- fit_resamples(
netflix_wf,
netflix_folds,
metrics = metric_set(accuracy, recall, precision),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(svm_rs)
## # A tibble: 3 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.677 10 0.00677 Preprocessor1_Model1
## 2 precision binary 0.786 10 0.00561 Preprocessor1_Model1
## 3 recall binary 0.732 10 0.00540 Preprocessor1_Model1
library(tune)
svm_rs %>%
conf_mat_resampled(tidy = FALSE) %>%
autoplot()
final_fitted <- last_fit(
netflix_wf,
netflix_split,
metrics = metric_set(accuracy, recall, precision)
)
collect_metrics(final_fitted)
## # A tibble: 3 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.667 Preprocessor1_Model1
## 2 recall binary 0.708 Preprocessor1_Model1
## 3 precision binary 0.788 Preprocessor1_Model1
collect_predictions(final_fitted) %>%
conf_mat(type, .pred_class)
## Truth
## Prediction Movie TV Show
## Movie 952 256
## TV Show 393 347
netflix_fit <- pull_workflow_fit(final_fitted$.workflow[[1]])
tidy(netflix_fit) %>%
arrange(estimate)
## # A tibble: 1,001 x 2
## term estimate
## <chr> <dbl>
## 1 tfidf_description_series -0.293
## 2 tfidf_description_docuseries -0.179
## 3 tfidf_description_world -0.126
## 4 tfidf_description_arts -0.0929
## 5 tfidf_description_and -0.0906
## 6 tfidf_description_crimes -0.0892
## 7 tfidf_description_adventures -0.0862
## 8 tfidf_description_group -0.0840
## 9 tfidf_description_fight -0.0796
## 10 tfidf_description_murder -0.0777
## # … with 991 more rows
tidy(netflix_fit) %>%
filter(term != "Bias") %>%
group_by(sign = estimate > 0) %>%
slice_max(abs(estimate), n = 15) %>%
ungroup() %>%
mutate(
term = str_remove(term, "tfidf_description_"),
sign = if_else(!sign, "More from TV shows", "More from movies")
) %>%
ggplot(aes(x = abs(estimate),
y = fct_reorder(term, abs(estimate)),
fill = sign)) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~sign, scales = "free") +
labs( x = "Coefficient from linear SVM", y = NULL,
title = "Which words are most predictive of movies vs. TV shows?",
subtitle = "For description text of movies and TV shows on Netflix")