Alt text!

Thanks to the work of those at RStudio, alt text is easy to add to a figure chunk. I’ve added alt text to each of the figures in this week’s #TidyTuesday. To see the alt text, hover the mouse over the figure, and a text box with a plot description will pop up.
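In R Markdown, the fig.alt chunk option in knitr attaches the description to the rendered image. A minimal sketch (the chunk label and alt-text string here are just illustrative, and fig.alt requires a recent version of knitr):

```{r imdb-scatter, fig.alt = "Scatterplot of IMDB rating over time, colored by motive."}
ggplot(scooby) +
  geom_point(aes(x = date_aired, y = imdb, color = motive2))
```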

The Data

The data this week comes from Kaggle thanks to manual data aggregation by plummye. Hat tip to Sara Stoudt for recommending this dataset!

library(tidyverse)   # read_csv(), dplyr verbs, stringr, ggplot2
library(lubridate)   # year()

scooby <- read_csv("scoobydoo.csv")

scooby <- scooby %>%
  mutate(imdb = as.numeric(imdb), engagement = as.numeric(engagement),
         number_of_snacks = as.numeric(number_of_snacks)) %>%
  mutate(monsters_gen = case_when(
    str_detect(monster_gender, "Male") & str_detect(monster_gender, "Female") ~ "Both",
    str_detect(monster_gender, "Male") ~ "Male",
    str_detect(monster_gender, "Female") ~ "Female",
    TRUE ~ "NoGender")) %>%
  mutate(motive2 = case_when(
    motive %in% c("Food", "Preservation", "Abduction", "Trespassing",
                  "Smuggling", "Natural Resource", "Conquer", "Treasure",
                  "Theft", "Competition") ~ motive,
    is.na(motive) ~ "None",
    TRUE ~ "Other")) %>%
  mutate(year = year(date_aired))

Predicting IMDB

Can we figure out which characteristics of the show best predict a high IMDB rating?

IMDB as a function of …

Before starting to build a model to predict the rating, I’m going to visualize the IMDB rating as a function of various other variables.

ggplot(scooby) +
  geom_point(aes(x = date_aired, y = imdb, color = motive2)) + 
  facet_wrap(~ monsters_gen) +
  labs(caption = "Tidy Tuesday Plot: @hardin47 | Data: Scooby Doo") 

Scatterplot with date aired on the x-axis and IMDB rating on the y-axis. The points are colored by motive and faceted by the gender of the monster(s). The majority of the points are in the male monster facet. There are no substantial patterns to discern in the scatterplots.

scooby %>%
  group_by(year, monsters_gen) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  ggplot() +
  geom_col(aes(x = year, y = count, fill = monsters_gen)) +
  labs(caption = "Tidy Tuesday Plot: @hardin47 | Data: Scooby Doo")

A model

Back to practicing tidymodels…

test & training data

scooby_small <- scooby %>%
  select(imdb, season, run_time, monsters_gen, monster_amount,
         starts_with("caught"), starts_with("captured"),
         starts_with("unmask"), starts_with("snack"),
         suspects_amount, motive2, culprit_amount,
         number_of_snacks, batman, scooby_dum, scrappy_doo,
         hex_girls, blue_falcon) %>%
  drop_na()

library(tidymodels)  # rsample, recipes, parsnip, workflows, tune, yardstick

set.seed(47)

data_split <- initial_split(scooby_small, prop = 2/3)

scooby_train <- training(data_split)
scooby_test <- testing(data_split)

recipe

scooby_rec <-
  recipe(imdb ~ ., data = scooby_train) 

summary(scooby_rec)
## # A tibble: 37 x 4
##    variable       type    role      source  
##    <chr>          <chr>   <chr>     <chr>   
##  1 season         nominal predictor original
##  2 run_time       numeric predictor original
##  3 monsters_gen   nominal predictor original
##  4 monster_amount numeric predictor original
##  5 caught_fred    nominal predictor original
##  6 caught_daphnie nominal predictor original
##  7 caught_velma   nominal predictor original
##  8 caught_shaggy  nominal predictor original
##  9 caught_scooby  nominal predictor original
## 10 caught_other   logical predictor original
## # … with 27 more rows
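The recipe has no preprocessing steps because ranger can split on factor predictors directly. If we later swapped in an engine that needs numeric predictors, the same recipe could be extended; a sketch (scooby_rec_dummy isn't used below):

scooby_rec_dummy <- scooby_rec %>%
  step_dummy(all_nominal_predictors())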

set the model

rf_mod1 <-
  rand_forest(mtry = 5,
              trees = 500,
              min_n = 5) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression") 

fit the model

After we’ve specified the engine, we can then fit() the model:

set.seed(4747)
scooby_rf1 <-
  workflow() %>%
  add_model(rf_mod1) %>%
  add_recipe(scooby_rec) %>%
  fit(data = scooby_train)

scooby_rf1
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5,      x), num.trees = ~500, min.node.size = min_rows(~5, x), importance = ~"permutation",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      356 
## Number of independent variables:  36 
## Mtry:                             5 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.2416005 
## R squared (OOB):                  0.5561264
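The forest explains about 56% of the out-of-bag variance in the ratings. Taking the square root of the OOB MSE puts the prediction error back into rating units:

sqrt(0.2416005)  # about 0.49, so OOB predictions are off by roughly half an IMDB point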

model fit

vip

The variable importance plot gives a measure of which variables were typically chosen first as the most important information for predicting the IMDB rating.

library(vip)

scooby_rf1 %>%
  pull_workflow_fit() %>%
  vip()

Barplot of importance values for each of the variables in the random forest. run_time has the highest importance (roughly 0.11), followed by monster_amount (about 0.08).

observed vs predicted

scooby_test %>%
  cbind(predict(scooby_rf1, new_data = scooby_test)) %>%
  ggplot() +
  geom_point(aes(x = imdb, y = .pred, color = season)) +
  geom_abline(slope = 1, intercept = 0)
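To quantify the fit on the test set, a quick sketch using yardstick's metrics(), which reports RMSE, R-squared, and MAE for a numeric outcome:

scooby_test %>%
  cbind(predict(scooby_rf1, new_data = scooby_test)) %>%
  metrics(truth = imdb, estimate = .pred)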

Julia Silge’s screencast

Julia used tidymodels to predict whether or not the monster is real. I’m just going to copy her code directly to walk through the process of building a tidymodels workflow.

scooby %>%
  filter(monster_amount > 0) %>%
  count(monster_real)
## # A tibble: 2 x 2
##   monster_real     n
##   <chr>        <int>
## 1 FALSE          404
## 2 TRUE           112

How many monsters are real, in five-year buckets? Notice the trick for the bars at the ends: position_dodge(preserve = "single") keeps every bar the same width even when one of the groups is empty.

scooby %>%
  filter(monster_amount > 0) %>%
  count(
    year_aired = 5 * ((lubridate::year(date_aired) + 1) %/% 5),
    monster_real
  ) %>%
  mutate(year_aired = factor(year_aired)) %>%
  ggplot(aes(year_aired, n, fill = monster_real)) +
  geom_col(position = position_dodge(preserve = "single"), alpha = 0.8) +
  labs(x = "Date aired", y = "Monsters per 5 years", fill = "Real monster?")

wrangling the data

  • keep only episodes with at least one monster (and a non-missing IMDB rating)
  • format variable types
  • create training, test, and bootstrap resamples

Note that the resampling is set up in an interesting way here. First, the initial split is stratified by monster_real, so the fake/real balance is preserved in both the training and the test set. Then, instead of cross-validation folds, bootstraps() draws 25 bootstrap samples from the training set (rows sampled with replacement, again stratified); in each resample, the out-of-bag rows form the assessment set, which is why the assessment sizes vary in the output below. A peek at one resample follows the output.

set.seed(123)
scooby_split <- scooby %>%
  mutate(
    year_aired = lubridate::year(date_aired)
  ) %>%
  filter(monster_amount > 0, !is.na(imdb)) %>%
  mutate(
    monster_real = case_when(
      monster_real == "FALSE" ~ "fake",
      TRUE ~ "real"
    ),
    monster_real = factor(monster_real)
  ) %>%
  select(year_aired, imdb, monster_real, title) %>%
  initial_split(strata = monster_real)
scooby_train <- training(scooby_split)
scooby_test <- testing(scooby_split)

set.seed(234)
scooby_folds <- bootstraps(scooby_train, strata = monster_real)
scooby_folds
## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits            id         
##    <list>            <chr>      
##  1 <split [375/133]> Bootstrap01
##  2 <split [375/144]> Bootstrap02
##  3 <split [375/140]> Bootstrap03
##  4 <split [375/132]> Bootstrap04
##  5 <split [375/139]> Bootstrap05
##  6 <split [375/134]> Bootstrap06
##  7 <split [375/146]> Bootstrap07
##  8 <split [375/132]> Bootstrap08
##  9 <split [375/143]> Bootstrap09
## 10 <split [375/143]> Bootstrap10
## # … with 15 more rows
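To see what one resample contains, rsample's analysis() and assessment() pull out the two pieces; a quick sketch:

first_split <- scooby_folds$splits[[1]]
nrow(analysis(first_split))    # the bootstrap sample: same size as scooby_train (375)
nrow(assessment(first_split))  # the out-of-bag rows (133 for Bootstrap01)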

set the model

Note that we couldn’t fit the model after this step: the parameters are only tune() placeholders, so we still need to tune them.

tree_spec <-
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("rpart")

tree_spec
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
##   min_n = tune()
## 
## Computational engine: rpart

So we set up a regular grid of values to try (using the default range for each parameter):

tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 4)
tree_grid
## # A tibble: 64 x 3
##    cost_complexity tree_depth min_n
##              <dbl>      <int> <int>
##  1    0.0000000001          1     2
##  2    0.0000001             1     2
##  3    0.0001                1     2
##  4    0.1                   1     2
##  5    0.0000000001          5     2
##  6    0.0000001             5     2
##  7    0.0001                5     2
##  8    0.1                   5     2
##  9    0.0000000001         10     2
## 10    0.0000001            10     2
## # … with 54 more rows

find the model

Now let’s fit each possible parameter combination to each resample. By putting non-default metrics into metric_set(), we can specify which metrics are computed for each resample. There are 25 bootstrap resamples x 64 parameter combinations, so 1,600 model fits in all; each 256-row .metrics tibble below is the 64 combinations x 4 metrics.

doParallel::registerDoParallel()

set.seed(345)
tree_rs <-
  tune_grid(
    tree_spec,
    monster_real ~ year_aired + imdb,
    resamples = scooby_folds,
    grid = tree_grid,
    metrics = metric_set(accuracy, roc_auc, sensitivity, specificity)
  )

tree_rs
## # Tuning results
## # Bootstrap sampling using stratification 
## # A tibble: 25 x 4
##    splits            id          .metrics           .notes          
##    <list>            <chr>       <list>             <list>          
##  1 <split [375/133]> Bootstrap01 <tibble [256 × 7]> <tibble [0 × 1]>
##  2 <split [375/144]> Bootstrap02 <tibble [256 × 7]> <tibble [0 × 1]>
##  3 <split [375/140]> Bootstrap03 <tibble [256 × 7]> <tibble [0 × 1]>
##  4 <split [375/132]> Bootstrap04 <tibble [256 × 7]> <tibble [0 × 1]>
##  5 <split [375/139]> Bootstrap05 <tibble [256 × 7]> <tibble [0 × 1]>
##  6 <split [375/134]> Bootstrap06 <tibble [256 × 7]> <tibble [0 × 1]>
##  7 <split [375/146]> Bootstrap07 <tibble [256 × 7]> <tibble [0 × 1]>
##  8 <split [375/132]> Bootstrap08 <tibble [256 × 7]> <tibble [0 × 1]>
##  9 <split [375/143]> Bootstrap09 <tibble [256 × 7]> <tibble [0 × 1]>
## 10 <split [375/143]> Bootstrap10 <tibble [256 × 7]> <tibble [0 × 1]>
## # … with 15 more rows

evaluate model

show_best(tree_rs)
## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric  .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>    <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         10     2 accuracy binary     0.872    25 0.00481
## 2    0.0000001            10     2 accuracy binary     0.872    25 0.00481
## 3    0.0001               10     2 accuracy binary     0.872    25 0.00481
## 4    0.0000000001         15     2 accuracy binary     0.871    25 0.00456
## 5    0.0000001            15     2 accuracy binary     0.871    25 0.00456
## # … with 1 more variable: .config <chr>
autoplot(tree_rs)

Lots of ways to choose “best” … select_best() gives the numerically best model (for a given metric). Alternatively, select_by_one_std_err() chooses the simplest parameter values within one standard error of the best; a less complex tree might keep us from overfitting.
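For comparison, the numerically best parameters would be pulled out like this (a sketch; best_tree isn't used below):

best_tree <- select_best(tree_rs, metric = "roc_auc")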

Below, select_by_one_std_err() picks the simpler parameter values and finalize_model() plugs them into the specification, so final_tree is ready to fit to the data!

simpler_tree <- select_by_one_std_err(tree_rs,
  -cost_complexity,
  metric = "roc_auc"
)

final_tree <- finalize_model(tree_spec, simpler_tree)

fit the model

Note the use of the fit() function instead of the last_fit() function. fit() uses only the training data; last_fit() fits the final best model to the training set and evaluates it on the test set!

final_fit <- fit(final_tree, monster_real ~ year_aired + imdb, scooby_train)

on the test data

But we do want to see how the model will work on the test data…

final_rs <- last_fit(final_tree, monster_real ~ year_aired + imdb, scooby_split)
collect_metrics(final_rs)
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.857 Preprocessor1_Model1
## 2 roc_auc  binary         0.780 Preprocessor1_Model1
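We can also pull the test-set predictions out of final_rs and tabulate them; a quick sketch using yardstick's conf_mat():

collect_predictions(final_rs) %>%
  conf_mat(monster_real, .pred_class)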
# remotes::install_github("grantmcdermott/parttree")
library(parttree)

scooby_train %>%
  ggplot(aes(imdb, year_aired)) +
  geom_parttree(data = final_fit, aes(fill = monster_real), alpha = 0.2) +
  geom_jitter(alpha = 0.7, width = 0.05, height = 0.2, aes(color = monster_real))

pivoting and summarizing

There were some great plots today on Twitter, so I’m trying them out. The code below is mostly due to [@jack_davison](https://github.com/jack-davison/TidyTuesday/tree/master/R).

colors <- tribble(
  ~name, ~color,
  "daphnie", "#7C6AA8",
  "shaggy", "#B2BE34",
  "scooby", "#B1752C",
  "velma", "#FA9C39", 
  "fred", "#01A0DA"
)

scooby_plot <- scooby %>%
  select(index, year, contains("unmask"), contains("captured"), contains("snack")) %>%
  select(-number_of_snacks) %>%
  mutate(across(-c(index,year), as.logical)) %>% 
  pivot_longer(where(is.logical)) %>% 
  mutate(activity = str_extract(name, "unmask|captured|snack"),
         name = str_remove(name, "unmask_|captured_|snack_")) %>% 
  group_by(year) %>%
  count(name, value, activity) %>% 
  filter(value, name != "other") %>% 
  left_join(colors, by = "name") %>% 
  mutate(name = fct_reorder(name, n, max)) %>%
  mutate(decade = floor(year/10)*10) %>%
  ungroup() %>%
  group_by(decade, name, activity) %>%
  summarise(n_decade = sum(n))

scooby_plot %>%
  ggplot() +
  geom_line(aes(x = decade, y = n_decade, group = name,
                color = name)) +
  scale_color_manual(values = deframe(colors)) +  # apply the character palette defined above
  facet_grid(activity ~ .)