Thanks to the work of those at RStudio, alt text is easy to add to a figure chunk. I’ve added alt text to each of the figures in this week’s #TidyTuesday. To see the alt text, hover the mouse over the figure, and a text box with a plot description will pop up.
The data this week comes from Great Lakes Fishery Commission. Full details on the data can be found on their statistic notes and background notes.
fishing <- read_csv("fishing.csv", na = c("", "NA", "*")) %>%
clean_names()
stocked <- read_csv("stocked.csv", na = c("", "NA", "*")) %>%
clean_names()
fishing %>%
filter(str_detect(region, "U.S. Total")) %>%
filter(!is.na(region)) %>%
mutate(species = fct_lump(species, 12)) %>%
filter(species != "Other") %>%
ggplot(aes(x = year, y = values, color = lake)) +
geom_line() +
facet_wrap(~species) +
theme(legend.position = "top")
fishing %>%
filter(str_detect(region, "U.S. Total")) %>%
filter(!is.na(region)) %>%
mutate(species = fct_lump(species, 12)) %>%
filter(species != "Other") %>%
ggplot(aes(x = year, y = grand_total, color = lake)) +
geom_line() +
facet_wrap(~species) +
theme(legend.position = "top")
I’m on a quest to learn tidymodels
, so I’m going to try again today to model the given data. However, I am slightly uncomfortable with the idea of modeling this data because the idea of comparing things over time seems to violate all sorts of modeling conditions. Also, in the fishing dataset, one-third to one-half of the observations are missing. The data can’t possibly be missing at random! And last, I don’t understand the data. For example, the variable agemonth
is described as the “Age in months.” The age of what? Presumably of the stocked fish, but a single row can’t be a single fish. So the stocked fish all come from the same clutch? Does that make sense? Also, what is mark
(“marking method”) and validation
(“validation method”) ? Seems like I should have much more knowledge about fishing and about these data before trying to model the data.
stocked_small <- stocked %>%
select(year, lake, stat_dist, no_stocked, agemonth, species, sid) %>%
mutate(lake = factor(lake)) %>%
#mutate(lake = ifelse(lake == "MI", "MI", "notMI")) %>%
filter(!(species %in% c("LAH", "LAS", "SMB", "STN", "YEP"))) %>%
drop_na()
set.seed(47)
data_split <- initial_split(stocked_small, prop = 2/3)
stocked_train <- training(data_split)
stocked_test <- testing(data_split)
Note to figure out: the step_dummy()
line caused the random forest to not run. The outcome variable cannot have been formatted using step_dummy()
. Also, if step_dummy()
is applied to a variable, the role will be predictor
even if the update_role()
is used.
stock_rec <-
recipe(lake ~ ., data = stocked_train) %>%
step_dummy(species) %>%
update_role(stat_dist, sid, new_role = "ID")
summary(stock_rec)
## # A tibble: 7 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 year numeric predictor original
## 2 stat_dist nominal ID original
## 3 no_stocked numeric predictor original
## 4 agemonth numeric predictor original
## 5 species nominal predictor original
## 6 sid numeric ID original
## 7 lake nominal outcome original
Just to mix it up, we’re going to try to classify the data into the different lakes. There are five Great Lakes. First we specify that we’re going to use a Random Forest as the engine / model of interest:
rf_mod1 <-
rand_forest(mtry = 5,
trees = 500,
min_n = 5) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("classification")
After we’ve specified the engine, we can then fit()
the model:
stock_rf1 <-
workflow() %>%
add_model(rf_mod1) %>%
add_recipe(stock_rec) %>%
fit(data = stocked_train)
stock_rf1
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5, x), num.trees = ~500, min.node.size = min_rows(~5, x), importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 16638
## Number of independent variables: 12
## Mtry: 5
## Target node size: 5
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error (Brier s.): 0.2538034
The variable importance plot gives a measure of which variable(s) was/were typically chosen first as the most important information to discriminate between the lakes.
library(vip)
stock_rf1 %>%
pull_workflow_fit() %>%
vip()
mtry
and min_n
in the random forest.rf_mod2 <-
rand_forest(mtry = tune(),
trees = 500,
min_n = tune()) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("classification")
After we’ve specified the engine, we can then fit()
the model:
stock_rf2 <-
workflow() %>%
add_model(rf_mod2) %>%
add_recipe(stock_rec) %>%
fit(data = stocked_train)
stock_rf2
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~tune(), x), num.trees = ~500, min.node.size = min_rows(~tune(), x), importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 16638
## Number of independent variables: 12
## Mtry: 12
## Target node size: 16638
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error (Brier s.): 0.623763
Now we’ll CV the data so that we can run the random forest on different subsets of the training data.
set.seed(4747)
stock_folds <- vfold_cv(stocked_train, v = 3)
stock_grid <- grid_regular(mtry(range = c(2, 13)),
min_n(range = c(5, 40)),
levels = 5)
stock_grid
## # A tibble: 25 x 2
## mtry min_n
## <int> <int>
## 1 2 5
## 2 4 5
## 3 7 5
## 4 10 5
## 5 13 5
## 6 2 13
## 7 4 13
## 8 7 13
## 9 10 13
## 10 13 13
## # … with 15 more rows
Now we tune the model, this time using the grid of parameter values. Note that our problem is classifying the lakes which is not binary. Therefore the default ROC metric will fail. We use metric_set(accuracy)
as the measure of fit.
doParallel::registerDoParallel()
set.seed(470)
tune_result <- tune_grid(
stock_rf2,
resamples = stock_folds,
grid = stock_grid,
metrics = metric_set(accuracy)
)
tune_result
## # Tuning results
## # 3-fold cross-validation
## # A tibble: 3 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [11092/5546]> Fold1 <tibble [25 × 6]> <tibble [25 × 1]>
## 2 <split [11092/5546]> Fold2 <tibble [25 × 6]> <tibble [25 × 1]>
## 3 <split [11092/5546]> Fold3 <tibble [25 × 6]> <tibble [25 × 1]>
tune_result %>%
collect_metrics() %>%
filter(.metric == "accuracy") %>%
mutate(min_n = factor(min_n)) %>%
ggplot(aes(x = mtry, y = mean, color = min_n)) +
geom_line(alpha = 0.5, size = 1.5) +
geom_point() +
labs(y = "accuracy",
color = "minimum node size")
rf_best <-
rand_forest(mtry = 8,
trees = 500,
min_n = 5) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("classification")
stock_best <-
workflow() %>%
add_model(rf_best) %>%
add_recipe(stock_rec) %>%
fit(data = stocked_train)
stock_rf_final <-
stock_best %>%
last_fit(data_split, metrics = metric_set(accuracy))
stock_rf_final %>%
collect_metrics()
## # A tibble: 1 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.809 Preprocessor1_Model1
stock_rf_final %>%
collect_predictions()
## # A tibble: 8,319 x 5
## id .pred_class .row lake .config
## <chr> <fct> <int> <fct> <chr>
## 1 train/test split SU 2 SU Preprocessor1_Model1
## 2 train/test split SU 3 SU Preprocessor1_Model1
## 3 train/test split SU 5 SU Preprocessor1_Model1
## 4 train/test split SU 6 SU Preprocessor1_Model1
## 5 train/test split SU 8 SU Preprocessor1_Model1
## 6 train/test split SU 9 MI Preprocessor1_Model1
## 7 train/test split SU 10 MI Preprocessor1_Model1
## 8 train/test split SU 11 SU Preprocessor1_Model1
## 9 train/test split SU 13 SU Preprocessor1_Model1
## 10 train/test split SU 24 MI Preprocessor1_Model1
## # … with 8,309 more rows
stock_rf_final %>%
collect_predictions() %>%
select(.pred_class, lake) %>%
group_by(.pred_class, lake) %>%
summarize(count = n()) %>%
ungroup() %>%
ggplot() +
geom_tile(aes(x = lake, y = reorder(.pred_class, desc(.pred_class)), fill = count)) +
theme(
axis.title=element_text(size=15,face="bold"),
plot.title = element_text(size = 20, face = "bold"),
plot.caption = element_text(size = 15)) +
ylab("LDA topic")+
scale_fill_distiller(palette = "RdPu", trans = "reverse") +
labs(
x = "Stocked lake", y = "",
title = "Predicting Great Lakes",
caption = "Tidy Tuesday Plot: @hardin47 | Data: Commercial Fishing",
subtitle = "Predicted lake")