Thanks to the work of those at RStudio, alt text is easy to add to a figure chunk. I’ve added alt text to each of the figures in this week’s #TidyTuesday. To see the alt text, hover the mouse over the figure, and a text box with a plot description will pop up.
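For example, the alt text goes right in the chunk header via the fig.alt chunk option (a minimal sketch; the chunk label and the description text here are made up):

```{r imdb-scatter, fig.alt = "Scatterplot of IMDB rating versus air date, colored by the monster's motive."}
ggplot(scooby) +
  geom_point(aes(x = date_aired, y = imdb, color = motive2))
```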
The data this week comes from Kaggle thanks to manual data aggregation by plummye. Hat tip to Sara Stoudt for recommending this dataset!
library(tidyverse)   # readr, dplyr, stringr, ggplot2, forcats, ...
library(lubridate)   # year()

scooby <- read_csv("scoobydoo.csv")

scooby <- scooby %>%
  # make sure the rating, engagement, and snack count columns are numeric
  mutate(imdb = as.numeric(imdb),
         engagement = as.numeric(engagement),
         number_of_snacks = as.numeric(number_of_snacks)) %>%
  # collapse monster_gender into Male / Female / Both / NoGender
  mutate(monsters_gen = case_when(
    str_detect(monster_gender, "Male") & str_detect(monster_gender, "Female") ~ "Both",
    str_detect(monster_gender, "Male") ~ "Male",
    str_detect(monster_gender, "Female") ~ "Female",
    TRUE ~ "NoGender")) %>%
  # keep the listed motives as-is, code missing motives as "None", lump the rest into "Other"
  mutate(motive2 = case_when(
    motive == "Food" ~ "Food",
    motive == "Preservation" ~ "Preservation",
    motive == "Abduction" ~ "Abduction",
    motive == "Trespassing" ~ "Trespassing",
    motive == "Smuggling" ~ "Smuggling",
    motive == "Natural Resource" ~ "Natural Resource",
    motive == "Conquer" ~ "Conquer",
    motive == "Treasure" ~ "Treasure",
    motive == "Theft" ~ "Theft",
    motive == "Competition" ~ "Competition",
    is.na(motive) ~ "None",
    TRUE ~ "Other"
  )) %>%
  mutate(year = year(date_aired))
Can we figure out what characteristics of the show will best lead to a high IMDB rating?
Before starting to build a model to predict the rating, I’m going to visualize the IMDB rating as a function of various other variables.
ggplot(scooby) +
  geom_point(aes(x = date_aired, y = imdb, color = motive2)) +
  facet_wrap(~ monsters_gen) +
  labs(caption = "Tidy Tuesday Plot: @hardin47 | Data: Scooby Doo")
scooby %>%
  group_by(year, monsters_gen) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  ggplot() +
  geom_col(aes(x = year, y = count,
               group = monsters_gen,
               fill = monsters_gen)) +
  labs(caption = "Tidy Tuesday Plot: @hardin47 | Data: Scooby Doo")
Back to practicing tidymodels…
library(tidymodels)

scooby_small <- scooby %>%
  select(imdb, season, run_time, monsters_gen, monster_amount,
         starts_with("caught"), starts_with("captured"),
         starts_with("unmask"), starts_with("snack"),
         suspects_amount, motive2, culprit_amount,
         number_of_snacks, batman, scooby_dum, scrappy_doo,
         hex_girls, blue_falcon) %>%
  drop_na()

set.seed(47)
data_split <- initial_split(scooby_small, prop = 2/3)
scooby_train <- training(data_split)
scooby_test <- testing(data_split)
scooby_rec <- recipe(imdb ~ ., data = scooby_train)
summary(scooby_rec)
## # A tibble: 37 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 season nominal predictor original
## 2 run_time numeric predictor original
## 3 monsters_gen nominal predictor original
## 4 monster_amount numeric predictor original
## 5 caught_fred nominal predictor original
## 6 caught_daphnie nominal predictor original
## 7 caught_velma nominal predictor original
## 8 caught_shaggy nominal predictor original
## 9 caught_scooby nominal predictor original
## 10 caught_other logical predictor original
## # … with 27 more rows
rf_mod1 <-
  rand_forest(mtry = 5,
              trees = 500,
              min_n = 5) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression")
After we’ve specified the engine, we can then fit() the model:
set.seed(4747)
scooby_rf1 <-
  workflow() %>%
  add_model(rf_mod1) %>%
  add_recipe(scooby_rec) %>%
  fit(data = scooby_train)
scooby_rf1
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5, x), num.trees = ~500, min.node.size = min_rows(~5, x), importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 500
## Sample size: 356
## Number of independent variables: 36
## Mtry: 5
## Target node size: 5
## Variable importance mode: permutation
## Splitrule: variance
## OOB prediction error (MSE): 0.2416005
## R squared (OOB): 0.5561264
The variable importance plot gives a measure of which variables were typically chosen as carrying the most important information for predicting the IMDB rating.
library(vip)
scooby_rf1 %>%
  pull_workflow_fit() %>%
  vip()
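If you want the importance scores as numbers rather than a plot, vip’s vi() should return them as a tibble (a quick sketch, using the same fitted workflow as above):

scooby_rf1 %>%
  pull_workflow_fit() %>%   # underlying parsnip/ranger fit
  vi()                      # tibble of Variable / Importance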
##### observed vs predicted
scooby_test %>%
  cbind(predict(scooby_rf1, new_data = scooby_test)) %>%
  ggplot() +
  geom_point(aes(x = imdb, y = .pred, color = season)) +
  geom_abline(slope = 1, intercept = 0)
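To attach numbers to that plot, here is a minimal sketch of test-set performance using yardstick’s metrics(), which reports RMSE, R-squared, and MAE for a numeric outcome:

scooby_test %>%
  bind_cols(predict(scooby_rf1, new_data = scooby_test)) %>%   # add .pred column
  metrics(truth = imdb, estimate = .pred)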
Julia built a tidymodels analysis to predict whether or not the monster is real. I’m just going to copy her code directly to walk through the process of building a model with tidymodels.
scooby %>%
  filter(monster_amount > 0) %>%
  count(monster_real)
## # A tibble: 2 x 2
## monster_real n
## <chr> <int>
## 1 FALSE 404
## 2 TRUE 112
How many monsters are real in buckets of five years? Notice the trick (position_dodge(preserve = "single")) for handling the empty bars at the ends.
scooby %>%
  filter(monster_amount > 0) %>%
  count(
    year_aired = 5 * ((lubridate::year(date_aired) + 1) %/% 5),
    monster_real
  ) %>%
  mutate(year_aired = factor(year_aired)) %>%
  ggplot(aes(year_aired, n, fill = monster_real)) +
  geom_col(position = position_dodge(preserve = "single"), alpha = 0.8) +
  labs(x = "Date aired", y = "Monsters per 5-year period", fill = "Real monster?")
Note that the resampling is set up in an interesting way here. First, the initial train/test split is stratified on monster_real so that the two classes are represented proportionally in both sets. Then the training set itself is bootstrapped with bootstraps() (again stratified on monster_real) to create the 25 resamples used for tuning below.
set.seed(123)
scooby_split <- scooby %>%
  mutate(year_aired = lubridate::year(date_aired)) %>%
  filter(monster_amount > 0, !is.na(imdb)) %>%
  mutate(
    monster_real = case_when(
      monster_real == "FALSE" ~ "fake",
      TRUE ~ "real"
    ),
    monster_real = factor(monster_real)
  ) %>%
  select(year_aired, imdb, monster_real, title) %>%
  initial_split(strata = monster_real)

scooby_train <- training(scooby_split)
scooby_test <- testing(scooby_split)

set.seed(234)
scooby_folds <- bootstraps(scooby_train, strata = monster_real)
scooby_folds
## # Bootstrap sampling using stratification
## # A tibble: 25 x 2
## splits id
## <list> <chr>
## 1 <split [375/133]> Bootstrap01
## 2 <split [375/144]> Bootstrap02
## 3 <split [375/140]> Bootstrap03
## 4 <split [375/132]> Bootstrap04
## 5 <split [375/139]> Bootstrap05
## 6 <split [375/134]> Bootstrap06
## 7 <split [375/146]> Bootstrap07
## 8 <split [375/132]> Bootstrap08
## 9 <split [375/143]> Bootstrap09
## 10 <split [375/143]> Bootstrap10
## # … with 15 more rows
Note that we can’t fit the model right after this step because we haven’t set the parameter values yet; they are only flagged with tune(). We’ll need to tune them.
tree_spec <-
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("rpart")
tree_spec
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = tune()
## tree_depth = tune()
## min_n = tune()
##
## Computational engine: rpart
So we set up a regular grid of possible values to try (mostly the default ranges, with four levels for each parameter):
tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 4)
tree_grid
## # A tibble: 64 x 3
## cost_complexity tree_depth min_n
## <dbl> <int> <int>
## 1 0.0000000001 1 2
## 2 0.0000001 1 2
## 3 0.0001 1 2
## 4 0.1 1 2
## 5 0.0000000001 5 2
## 6 0.0000001 5 2
## 7 0.0001 5 2
## 8 0.1 5 2
## 9 0.0000000001 10 2
## 10 0.0000001 10 2
## # … with 54 more rows
Now let’s fit each possible parameter combination to each resample. By putting non-default metrics into metric_set(), we can specify which metrics are computed for each resample. That’s 25 bootstrap resamples x 64 possible parameter settings, or 1,600 model fits (each time keeping track of the metrics).
doParallel::registerDoParallel()

set.seed(345)
tree_rs <-
  tune_grid(
    tree_spec,
    monster_real ~ year_aired + imdb,
    resamples = scooby_folds,
    grid = tree_grid,
    metrics = metric_set(accuracy, roc_auc, sensitivity, specificity)
  )
tree_rs
## # Tuning results
## # Bootstrap sampling using stratification
## # A tibble: 25 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [375/133]> Bootstrap01 <tibble [256 × 7]> <tibble [0 × 1]>
## 2 <split [375/144]> Bootstrap02 <tibble [256 × 7]> <tibble [0 × 1]>
## 3 <split [375/140]> Bootstrap03 <tibble [256 × 7]> <tibble [0 × 1]>
## 4 <split [375/132]> Bootstrap04 <tibble [256 × 7]> <tibble [0 × 1]>
## 5 <split [375/139]> Bootstrap05 <tibble [256 × 7]> <tibble [0 × 1]>
## 6 <split [375/134]> Bootstrap06 <tibble [256 × 7]> <tibble [0 × 1]>
## 7 <split [375/146]> Bootstrap07 <tibble [256 × 7]> <tibble [0 × 1]>
## 8 <split [375/132]> Bootstrap08 <tibble [256 × 7]> <tibble [0 × 1]>
## 9 <split [375/143]> Bootstrap09 <tibble [256 × 7]> <tibble [0 × 1]>
## 10 <split [375/143]> Bootstrap10 <tibble [256 × 7]> <tibble [0 × 1]>
## # … with 15 more rows
show_best(tree_rs)
## # A tibble: 5 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 10 2 accuracy binary 0.872 25 0.00481
## 2 0.0000001 10 2 accuracy binary 0.872 25 0.00481
## 3 0.0001 10 2 accuracy binary 0.872 25 0.00481
## 4 0.0000000001 15 2 accuracy binary 0.871 25 0.00456
## 5 0.0000001 15 2 accuracy binary 0.871 25 0.00456
## # … with 1 more variable: .config <chr>
autoplot(tree_rs)
There are lots of ways to choose “best”. select_best() gives the numerically best model (for some metric). Alternatively, we can choose a simpler tree whose performance is within a small amount (here, one standard error) of the best metric; a less complex tree might keep us from overfitting.
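For comparison, here is a sketch of the purely numeric route with select_best() (the names best_params and best_tree are just illustrative; it is not the route taken below):

best_params <- select_best(tree_rs, metric = "roc_auc")   # numerically best parameter combination
best_tree <- finalize_model(tree_spec, best_params)        # plug those values into the spec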
Now, final_tree has the parameter values, so we can fit it to the data!
simpler_tree <- select_by_one_std_err(
  tree_rs,
  -cost_complexity,
  metric = "roc_auc"
)
final_tree <- finalize_model(tree_spec, simpler_tree)
Note the use of the fit() function instead of the last_fit() function. The fit() function uses only the training data. The last_fit() function fits the final best model to the training set and evaluates it on the test set!
final_fit <- fit(final_tree, monster_real ~ year_aired + imdb, scooby_train)
But we do want to see how the model will work on the test data…
final_rs <- last_fit(final_tree, monster_real ~ year_aired + imdb, scooby_split)
collect_metrics(final_rs)
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.857 Preprocessor1_Model1
## 2 roc_auc binary 0.780 Preprocessor1_Model1
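Beyond accuracy and ROC AUC, here is a quick sketch of a confusion matrix built from the held-out predictions stored in final_rs:

collect_predictions(final_rs) %>%                           # test-set predictions from last_fit()
  conf_mat(truth = monster_real, estimate = .pred_class)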
# remotes::install_github("grantmcdermott/parttree")
library(parttree)
scooby_train %>%
  ggplot(aes(imdb, year_aired)) +
  geom_parttree(data = final_fit, aes(fill = monster_real), alpha = 0.2) +
  geom_jitter(alpha = 0.7, width = 0.05, height = 0.2, aes(color = monster_real))
There were some great plots today on Twitter, so I’m trying them out. The code below is mostly due to [@jack_davison](https://github.com/jack-davison/TidyTuesday/tree/master/R).
colors = tribble(
~name, ~color,
"daphnie", "#7C6AA8",
"shaggy", "#B2BE34",
"scooby", "#B1752C",
"velma", "#FA9C39",
"fred", "#01A0DA"
)
scooby_plot <- scooby %>%
  select(index, year, contains("unmask"), contains("captured"), contains("snack")) %>%
  select(-number_of_snacks) %>%
  mutate(across(-c(index, year), as.logical)) %>%
  pivot_longer(where(is.logical)) %>%
  mutate(activity = str_extract(name, "unmask|captured|snack"),
         name = str_remove(name, "unmask_|captured_|snack_")) %>%
  group_by(year) %>%
  count(name, value, activity) %>%
  filter(value, name != "other") %>%
  left_join(colors, by = "name") %>%
  mutate(name = fct_reorder(name, n, max)) %>%
  mutate(decade = floor(year / 10) * 10) %>%
  ungroup() %>%
  group_by(decade, name, activity) %>%
  summarise(n_decade = sum(n))
scooby_plot %>%
  ggplot() +
  geom_line(aes(x = decade, y = n_decade, group = name,
                color = name)) +
  facet_grid(activity ~ 1)
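One note: the hex codes joined in from the colors tribble get dropped in the final summarise() and never make it into the plot. A sketch of how they could be mapped, assuming tibble::deframe() to turn the two-column tribble into a named name-to-color vector:

scooby_plot %>%
  ggplot(aes(x = decade, y = n_decade, group = name, color = name)) +
  geom_line() +
  scale_color_manual(values = deframe(colors)) +   # character name -> hex code lookup
  facet_grid(activity ~ .)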