Alt text!

Thanks to the work of those at RStudio, alt text is easy to add to a figure chunk. I’ve added alt text to each of the figures in this week’s #TidyTuesday. To see the alt text, hover the mouse over the figure, and a text box with a plot description will pop up.
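For example, knitr's fig.alt chunk option carries a description through to the alt attribute of the rendered image (a minimal sketch; the chunk label and description here are placeholders):

```{r my-plot, fig.alt = "One or two sentences describing what the plot shows."}
# plotting code for the figure goes here
```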

The Data

The data this week comes from the Great Lakes Fishery Commission. Full details on the data can be found in their statistics notes and background notes.
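The tidytuesdayR package can also pull this week's files directly (a sketch, assuming this is the 2021-06-08 TidyTuesday drop; note it won't apply the custom na codes used in the read_csv() calls below):

tuesdata <- tidytuesdayR::tt_load("2021-06-08")
fishing <- tuesdata$fishing
stocked <- tuesdata$stocked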

library(tidyverse)   # readr, dplyr, ggplot2, forcats
library(janitor)     # clean_names()
library(tidymodels)  # rsample, recipes, parsnip, tune, yardstick

fishing <- read_csv("fishing.csv", na = c("", "NA", "*")) %>%
  clean_names()
stocked <- read_csv("stocked.csv", na = c("", "NA", "*")) %>%
  clean_names()
fishing %>%
  filter(str_detect(region, "U.S. Total")) %>%   # keep only the U.S. totals
  filter(!is.na(region)) %>% 
  mutate(species = fct_lump(species, 12)) %>%    # lump all but the 12 most common species
  filter(species != "Other") %>% 
  ggplot(aes(x = year, y = values, color = lake)) +
  geom_line() +
  facet_wrap(~species) +
  theme(legend.position = "top")

A line plot with year on the x-axis, values on the y-axis, lines colored by Great Lake, and faceted by species. Values seem to be highest for the Chubs species in Lake Michigan in the 1950s, and for the Cisco species in Lake Erie in the early 1900s and in Lake Superior in the mid-1900s.

fishing %>%
  filter(str_detect(region, "U.S. Total")) %>% 
  filter(!is.na(region)) %>% 
  mutate(species = fct_lump(species, 12)) %>% 
  filter(species != "Other") %>% 
  ggplot(aes(x = year, y = grand_total, color = lake)) +
  geom_line() +
  facet_wrap(~species) +
  theme(legend.position = "top")

A line plot with year on the x-axis, grand total on the y-axis, lines colored by Great Lake, and faceted by species. Lake Erie seems to have the largest totals. In the early 1900s the Lake Erie catch was dominated by Cisco; more recently it has been dominated by Yellow Perch. There was also a fair amount of Cisco caught in Lake Superior in the mid-1900s.

A model

I’m on a quest to learn tidymodels, so I’m going to try again today to model the given data. However, I am slightly uncomfortable with the idea of modeling these data, because comparing things over time seems to violate all sorts of modeling conditions. Also, in the fishing dataset, one-third to one-half of the observations are missing, and the data can’t possibly be missing at random! And last, I don’t fully understand the data. For example, the variable agemonth is described as the “Age in months.” The age of what? Presumably of the stocked fish, but a single row can’t be a single fish. So do the stocked fish all come from the same clutch? Does that make sense? Also, what are mark (“marking method”) and validation (“validation method”)? It seems like I should have much more knowledge about fishing and about these data before trying to model them.
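As a quick check on that missingness claim, the proportion of NA values per column is easy to compute (a sketch using the fishing data loaded above):

fishing %>%
  summarize(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "prop_na") %>%
  arrange(desc(prop_na))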

stocked_small <- stocked %>%
  select(year, lake, stat_dist, no_stocked, agemonth, species, sid) %>%
  mutate(lake = factor(lake)) %>%
  #mutate(lake = ifelse(lake == "MI", "MI", "notMI")) %>%
  filter(!(species %in% c("LAH", "LAS", "SMB", "STN", "YEP"))) %>%
  drop_na()

set.seed(47)

data_split <- initial_split(stocked_small, prop = 2/3)

stocked_train <- training(data_split)
stocked_test <- testing(data_split)
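A possible refinement (not used here): stratify the split by the outcome so that each lake shows up in similar proportions in the training and testing sets:

data_split <- initial_split(stocked_small, prop = 2/3, strata = lake)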

recipe

Note to figure out: the step_dummy() line caused the random forest not to run; the outcome variable cannot be among the variables passed to step_dummy(). Also, if step_dummy() is applied to a variable, that variable’s role will be predictor even if update_role() is used.
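One way to avoid both gotchas (a sketch of an alternative ordering, not the recipe used below): assign the ID roles first, then dummy only the nominal predictors, so the outcome and ID columns are never touched:

recipe(lake ~ ., data = stocked_train) %>%
  update_role(stat_dist, sid, new_role = "ID") %>%  # set roles before any steps
  step_dummy(all_nominal_predictors())              # skips the outcome and ID columns

The recipe actually used keeps the original ordering: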

stock_rec <-
  recipe(lake ~ ., data = stocked_train) %>%
  step_dummy(species) %>%                        # indicator columns for species
  update_role(stat_dist, sid, new_role = "ID")   # identifiers, not predictors

summary(stock_rec)
## # A tibble: 7 x 4
##   variable   type    role      source  
##   <chr>      <chr>   <chr>     <chr>   
## 1 year       numeric predictor original
## 2 stat_dist  nominal ID        original
## 3 no_stocked numeric predictor original
## 4 agemonth   numeric predictor original
## 5 species    nominal predictor original
## 6 sid        numeric ID        original
## 7 lake       nominal outcome   original

Just to mix it up, we’re going to try to classify the observations into the different lakes. There are five Great Lakes. First we specify a random forest as the model of interest, with ranger as the engine:

rf_mod1 <-
  rand_forest(mtry = 5,
              trees = 500,
              min_n = 5) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification") 

After we’ve specified the engine, we can then fit() the model:

stock_rf1 <-
  workflow() %>%
  add_model(rf_mod1) %>%
  add_recipe(stock_rec) %>%
  fit(data = stocked_train)

stock_rf1
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5,      x), num.trees = ~500, min.node.size = min_rows(~5, x), importance = ~"permutation",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      16638 
## Number of independent variables:  12 
## Mtry:                             5 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.2538034

vip

The variable importance plot gives a measure of which variables carried the most information for discriminating between the lakes (computed here as permutation importance, per the set_engine() argument above).

library(vip)

stock_rf1 %>%
  pull_workflow_fit() %>%
  vip()

Barplot of importance values for each of the variables in the random forest. agemonth has the highest importance (of roughly 0.22), followed by species = LAT (of about 0.2) and species = RBT (of about 0.15).
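If you want the numbers behind the plot, vip::vi() returns the importance scores as a tibble (a sketch; vi() should accept the same fitted object that vip() does):

stock_rf1 %>%
  pull_workflow_fit() %>%
  vi()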

We use cross-validation to find better values of mtry and min_n for the random forest.

rf_mod2 <-
  rand_forest(mtry = tune(),
              trees = 500,
              min_n = tune()) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification") 

Because the specification now contains tune() placeholders, calling fit() directly no longer produces a meaningful model. The output below shows what goes wrong: mtry becomes all 12 predictors, the target node size balloons to the full sample size, and the OOB Brier score is far worse. The parameters need to be chosen by tuning instead:

stock_rf2 <-
  workflow() %>%
  add_model(rf_mod2) %>%
  add_recipe(stock_rec) %>%
  fit(data = stocked_train)

stock_rf2
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~tune(),      x), num.trees = ~500, min.node.size = min_rows(~tune(), x),      importance = ~"permutation", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      16638 
## Number of independent variables:  12 
## Mtry:                             12 
## Target node size:                 16638 
## Variable importance mode:         permutation 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.623763

Now we’ll create cross-validation folds so that we can run the random forest on different subsets of the training data.

set.seed(4747)
stock_folds <- vfold_cv(stocked_train, v = 3)

stock_grid <- grid_regular(mtry(range = c(2, 13)),
                          min_n(range = c(5, 40)),
                          levels = 5)

stock_grid
## # A tibble: 25 x 2
##     mtry min_n
##    <int> <int>
##  1     2     5
##  2     4     5
##  3     7     5
##  4    10     5
##  5    13     5
##  6     2    13
##  7     4    13
##  8     7    13
##  9    10    13
## 10    13    13
## # … with 15 more rows

Now we tune the model, this time using the grid of parameter values. Note that our problem, classifying the lakes, is not binary, so the default ROC metric will fail. We use metric_set(accuracy) as the measure of fit instead.

doParallel::registerDoParallel()

set.seed(470)
tune_result <- tune_grid(
  stock_rf2,
  resamples = stock_folds,
  grid = stock_grid,
  metrics = metric_set(accuracy)
)

tune_result
## # Tuning results
## # 3-fold cross-validation 
## # A tibble: 3 x 4
##   splits               id    .metrics          .notes           
##   <list>               <chr> <list>            <list>           
## 1 <split [11092/5546]> Fold1 <tibble [25 × 6]> <tibble [25 × 1]>
## 2 <split [11092/5546]> Fold2 <tibble [25 × 6]> <tibble [25 × 1]>
## 3 <split [11092/5546]> Fold3 <tibble [25 × 6]> <tibble [25 × 1]>
tune_result %>%
  collect_metrics() %>%
  filter(.metric == "accuracy") %>%
  mutate(min_n = factor(min_n)) %>%
  ggplot(aes(x = mtry, y = mean, color = min_n)) +
  geom_line(alpha = 0.5, size = 1.5) +
  geom_point() + 
  labs(y = "accuracy",
       color = "minimum node size")

Line plot of mtry by accuracy, colored by the minimum terminal node size.  The smallest node size of 5 has the highest accuracy, and the accuracy levels off as mtry goes from 7 to 10 to 13.
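Rather than reading the winning combination off the plot, select_best() and finalize_workflow() can extract and apply it programmatically (a sketch; stock_wf2 is an illustrative name for an untrained copy of the tunable workflow):

stock_wf2 <- workflow() %>%
  add_model(rf_mod2) %>%
  add_recipe(stock_rec)

best_params <- select_best(tune_result, metric = "accuracy")
rf_final_wf <- finalize_workflow(stock_wf2, best_params)

Note that the hand-picked mtry = 8 used below sits between the grid values, so select_best() would return a nearby grid combination rather than exactly those values.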

Fit to the test data

rf_best <-
  rand_forest(mtry = 8,     # values chosen from the CV results above
              trees = 500,
              min_n = 5) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification") 

stock_best <-
  workflow() %>%
  add_model(rf_best) %>%
  add_recipe(stock_rec) %>%
  fit(data = stocked_train)

stock_rf_final <-
  stock_best %>%
  last_fit(data_split, metrics = metric_set(accuracy))

stock_rf_final %>%
  collect_metrics()
## # A tibble: 1 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy multiclass     0.809 Preprocessor1_Model1
stock_rf_final %>%
  collect_predictions() 
## # A tibble: 8,319 x 5
##    id               .pred_class  .row lake  .config             
##    <chr>            <fct>       <int> <fct> <chr>               
##  1 train/test split SU              2 SU    Preprocessor1_Model1
##  2 train/test split SU              3 SU    Preprocessor1_Model1
##  3 train/test split SU              5 SU    Preprocessor1_Model1
##  4 train/test split SU              6 SU    Preprocessor1_Model1
##  5 train/test split SU              8 SU    Preprocessor1_Model1
##  6 train/test split SU              9 MI    Preprocessor1_Model1
##  7 train/test split SU             10 MI    Preprocessor1_Model1
##  8 train/test split SU             11 SU    Preprocessor1_Model1
##  9 train/test split SU             13 SU    Preprocessor1_Model1
## 10 train/test split SU             24 MI    Preprocessor1_Model1
## # … with 8,309 more rows
stock_rf_final %>%
  collect_predictions() %>%
  select(.pred_class, lake) %>%
  group_by(.pred_class, lake) %>%
  summarize(count = n()) %>%   # counts for each (predicted, observed) pair
  ungroup() %>%
  ggplot() + 
  geom_tile(aes(x = lake, y = reorder(.pred_class, desc(.pred_class)), fill = count)) +
  theme(
    axis.title = element_text(size = 15, face = "bold"),
    plot.title = element_text(size = 20, face = "bold"),
    plot.caption = element_text(size = 15)) +
  scale_fill_distiller(palette = "RdPu", trans = "reverse") +
  labs(
    x = "Stocked lake", y = "",
    title = "Predicting Great Lakes",
    caption = "Tidy Tuesday Plot: @hardin47 | Data: Commercial Fishing",
    subtitle = "Predicted lake") 

A heatmap showing the observed and predicted Great Lakes. The model is fairly accurate, so the darker colors (higher counts) fall on the diagonal, with only a handful of observations misclassified. The misclassifications happen fairly evenly across all of the classes.
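For comparison, yardstick can compute the same confusion matrix directly, which is a handy check on the hand-rolled heatmap above (a sketch):

stock_rf_final %>%
  collect_predictions() %>%
  conf_mat(truth = lake, estimate = .pred_class) %>%
  autoplot(type = "heatmap")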