I’ve tried for a long time now to get the SVM to run on these data. I recognize that there isn’t a lot of signal here to predict winners! I think the lack of signal is what is keeping the model from converging. The code / steps below seem to be correct, but the file won’t knit, even after many hours. I’ll try SVMs again in a future week where maybe there is more signal!
The data this week comes from Kaggle. From the dataset description there:
"This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis."
library(tidyverse)   # read_csv(), dplyr verbs, tidyr, ggplot2
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune, yardstick
library(patchwork)   # plot_layout() for combining ggplots
library(viridis)     # scale_color_viridis()
library(praise)      # praise()

olympics <- read_csv("olympics.csv") %>%
mutate(win = ifelse(is.na(medal), "no", "yes")) %>%
mutate(country = case_when(
noc == "USA" ~ "USA",
noc == "FRA" ~ "FRA",
noc == "GBR" ~ "GBR",
noc == "ITA" ~ "ITA",
noc == "GER" ~ "GER",
noc == "CAN" ~ "CAN",
noc == "JPN" ~ "JPN",
noc == "EUN" ~ "RUS",
noc == "RUS" ~ "RUS",
noc == "URS" ~ "URS",
noc == "BUL" ~ "BUL",
TRUE ~ "other"
))
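Side note: since most of those case_when() lines just map a NOC code to itself, the same recoding could be written a bit more compactly with %in%. A sketch (keep_noc is just an illustrative name, not an object used below):

# equivalent recoding (sketch): fold EUN into RUS, keep a handful of NOCs as-is,
# and lump everything else into "other"
keep_noc <- c("USA", "FRA", "GBR", "ITA", "GER", "CAN", "JPN", "RUS", "URS", "BUL")

olympics <- read_csv("olympics.csv") %>%
  mutate(
    win = ifelse(is.na(medal), "no", "yes"),
    country = case_when(
      noc == "EUN" ~ "RUS",
      noc %in% keep_noc ~ noc,
      TRUE ~ "other"
    )
  )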
In my ongoing pursuit to understand tidymodels, I’m going to use a support vector machine to predict whether or not a medal was won. I’m finally starting to get the hang of the tidymodels structure: split the data, write a recipe, specify the model, bundle the recipe and model into a workflow, then fit and validate. If the model includes parameters that need to be tuned, the flow is slightly different: the model specification gets tune() placeholders, and the values are chosen by cross validation before the final fit.
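To keep those steps straight, here is a bare-bones sketch of the non-tuned flow on the small two_class_dat example data from the modeldata package (a toy illustration only; the object names are not the ones used below, and it assumes kernlab is installed):

library(tidymodels)
data(two_class_dat, package = "modeldata")

set.seed(1)
ex_split <- initial_split(two_class_dat, strata = Class)

# recipe: preprocessing, defined on the training data
ex_rec <- recipe(Class ~ ., data = training(ex_split)) %>%
  step_normalize(all_numeric_predictors())

# model specification: type of model, mode, and engine
ex_mod <- svm_poly(cost = 1, degree = 1, scale_factor = 1) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

# workflow: recipe + model, then fit on the training data
ex_fit <- workflow() %>%
  add_recipe(ex_rec) %>%
  add_model(ex_mod) %>%
  fit(data = training(ex_split))

# validate: predict on the held-out data
predict(ex_fit, new_data = testing(ex_split), type = "prob")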
Notice that most people / teams don’t win. But with almost 40,000 winning observations, we should be able to fit a good model. Unfortunately, SVMs are really slow, so I can’t run the model on the full dataset. Instead, I’m taking a much smaller random sample.
olympics %>%
count(win)
## # A tibble: 2 × 2
## win n
## <chr> <int>
## 1 no 231333
## 2 yes 39783
Let’s look at the information contained in the data about winning over time. Notice position_dodge(preserve = "single"), which keeps the bar widths consistent even when one group is empty at the ends of the range, and the integer division used to get bins of size 12.
olympics %>%
count(
year12 = 12 * ((year + 1) %/% 12),
win
) %>%
mutate(year12 = factor(year12)) %>%
ggplot(aes(x = year12, y = n, fill = win)) +
geom_col(position = position_dodge(preserve = "single"), alpha = 0.8) +
labs(x = "Year of Olympics (bins of size 12)", y = "Count", fill = "Win?")
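To convince myself the binning expression does what I want, here is what it returns for a few example years (a quick illustration, not part of the analysis):

# each year is shifted by 1 and integer-divided into blocks of 12:
# 1896 and 1906 land in the same bin (1896); 1908 starts the next bin
tibble(year = c(1896, 1906, 1908, 1992, 2016)) %>%
  mutate(year12 = 12 * ((year + 1) %/% 12))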
I had to use drop_na() early on because otherwise the predictions didn’t line up with the cases in the original dataset.
set.seed(123)
olympics_split <- olympics %>%
drop_na(sex, age, height, weight, year, season, win, country) %>%
# sampling 9% of the non-winners and 50% of the winners
# (proportions chosen just to get the two groups to roughly balance;
# the order in prop below matches the group order: "no" first, then "yes")
group_by(win) %>%
nest() %>%
ungroup() %>%
mutate(prop = c(0.09, 0.5)) %>%
mutate(samp = map2(data, prop, sample_frac)) %>%
select(-data, -prop) %>%
unnest(samp) %>%
mutate_if(is.character, as.factor) %>%
initial_split(strata = win)
olympics_train <- training(olympics_split)
olympics_test <- testing(olympics_split)
olympics_train %>%
count(win)
## # A tibble: 2 × 2
## win n
## <fct> <int>
## 1 no 11879
## 2 yes 11317
olympics_test %>%
count(win)
## # A tibble: 2 × 2
## win n
## <fct> <int>
## 1 no 3960
## 2 yes 3773
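As an aside, the nest() / map2() downsampling above could also be written a bit more directly with slice_sample() inside group_modify(). A sketch, assuming the same proportions (9% of non-winners, 50% of winners); olympics_small is just an illustrative name:

# alternative downsampling sketch: sample a different proportion within each win group
set.seed(123)
olympics_small <- olympics %>%
  drop_na(sex, age, height, weight, year, season, win, country) %>%
  group_by(win) %>%
  group_modify(~ slice_sample(.x, prop = ifelse(.y$win == "no", 0.09, 0.5))) %>%
  ungroup() %>%
  mutate_if(is.character, as.factor)

# the split would then be created the same way:
# olympics_split <- initial_split(olympics_small, strata = win)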
First, a model specification with the parameter values set explicitly (cost, degree, and scale factor); later we’ll use tune() to choose the values instead.
olympics_mod <-
svm_poly(
cost = 0.1,
degree = 3,
scale_factor = 1
) %>%
set_mode("classification") %>%
set_engine("kernlab")
olympics_mod
olympics_rec <- recipe(win ~ ., data = olympics_train) %>%
# don't want to use name, etc. to predict win
update_role(name, team, city, sport, event, medal, new_role = "ID") %>%
# remove variables we don't need at all
step_rm(noc, id) %>%
# I'm going to remove rows with NA values (the SVM can't use them anyway);
# drop_na() above already handled that, so the step below stays commented out.
# skip = TRUE would mean the step is skipped at bake/predict time, so that
# win (the outcome) isn't needed when generating predictions
#step_naomit(all_predictors(), win, skip = TRUE) %>%
# make dummy variables
step_dummy(all_nominal_predictors()) %>%
# remove zero variance predictors
step_zv(all_predictors()) %>%
# SVMs need centered & scaled variables
step_normalize(all_predictors())
prep(olympics_rec) #check to make sure recipe worked
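To actually see the preprocessed predictors (dummy variables, centered and scaled), the prepped recipe can be baked; a quick check:

# inspect the processed training data produced by the recipe
olympics_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()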
olympics_wflow <- workflow() %>%
add_model(olympics_mod) %>%
add_recipe(olympics_rec)
olympics_wflow
For details on predicting from a fitted workflow or parsnip model, look at: https://workflows.tidymodels.org/reference/predict-workflow.html and https://parsnip.tidymodels.org/reference/predict.model_fit.html
olympics_fit <- olympics_wflow %>%
fit(data = olympics_train)
# predict() with type = "prob" returns a two-column tibble (.pred_no, .pred_yes);
# the as_tibble() below is just a safety net before binding the columns
olympics_pred_train <- olympics_train %>%
select(win, name) %>%
bind_cols(as_tibble(predict(olympics_fit,
new_data = olympics_train,
type = "prob")))
olympics_pred_test <- olympics_test %>%
select(win, name) %>%
bind_cols(as_tibble(predict(olympics_fit,
new_data = olympics_test,
type = "prob")))
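For class predictions rather than probabilities, the same predict() call works with type = "class"; a quick sketch that also checks test-set accuracy with yardstick (olympics_class_test is just an illustrative name):

olympics_class_test <- olympics_test %>%
  select(win) %>%
  bind_cols(predict(olympics_fit, new_data = olympics_test, type = "class"))

olympics_class_test %>%
  accuracy(truth = win, estimate = .pred_class)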
p1 <- ggplot(olympics_pred_train) +
geom_jitter(aes(y = .pred_yes, x = win), color = "#114B5F",
alpha = 0.2, height = 0.1, width = 0.1) +
ggtitle("Prediction probability (of win), training data") +
ylab("")
p2 <- ggplot(olympics_pred_test) +
geom_jitter(aes(y = .pred_yes, x = win), color = "#114B5F",
alpha = 0.2, height = 0.1, width = 0.1) +
ggtitle("Prediction probability (of win), testing data") +
ylab("")
p1 + p2 + plot_layout(ncol = 2)
Cross validation gives us a built-in independent sample for assessing the model:
olympics_folds <- vfold_cv(olympics_train, strata = win, v = 5)
olympics_folds
doParallel::registerDoParallel()
set.seed(4747)
olympics_cv <- olympics_wflow %>%
fit_resamples(
resamples = olympics_folds,
metrics = metric_set(roc_auc, accuracy, sens, spec),
control = control_grid(save_pred = TRUE)
)
collect_metrics(olympics_cv)
Note that with the following model, we can’t run anything because we haven’t set the parameters yet. We’ll need to tune them.
olympics_mod_tune <-
svm_poly(
cost = tune(),
degree = tune(),
scale_factor = 1
) %>%
set_mode("classification") %>%
set_engine("kernlab")
olympics_mod_tune
We set up a grid of possible values to try (using mostly the default tuning values):
svm_grid <- grid_regular(cost(), degree())
svm_grid
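If the defaults aren’t what you want, grid_regular() also accepts custom ranges and a levels argument. A sketch (the ranges below are just an illustration; note that cost() is parameterized on a log2 scale in dials):

# hypothetical custom grid: 5 cost values crossed with 3 polynomial degrees
svm_grid_custom <- grid_regular(
  cost(range = c(-6, 4)),
  degree(range = c(1, 3)),
  levels = c(5, 3)
)
svm_grid_custom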
doParallel::registerDoParallel()
set.seed(4747)
olympics_res_tune <-
tune_grid(
olympics_mod_tune, # the model
olympics_rec, # the recipe
resamples = olympics_folds,
grid = svm_grid,
metrics = metric_set(roc_auc, accuracy, sens, spec),
control = control_grid(save_pred = TRUE)
)
autoplot(olympics_res_tune)
show_best(olympics_res_tune, metric = "accuracy")
show_best(olympics_res_tune, metric = "roc_auc")
collect_metrics(olympics_res_tune)
best_values <- select_best(olympics_res_tune, metric = "roc_auc")
best_values
conf_mat_resampled(olympics_res_tune, parameters = best_values)
Note that to create the ROC curve, yardstick treats the first level of win as the event of interest, and R orders the levels alphabetically ("no" before "yes"). So to get the curve oriented correctly, we use .pred_no instead of .pred_yes.
olympics_res_tune %>%
collect_predictions() %>%
group_by(id) %>%
roc_curve(win, .pred_no) %>%
ggplot(aes(1 - specificity, sensitivity, color = id)) +
geom_abline(lty = 2, color = "gray80", size = 1.5) +
geom_path(show.legend = TRUE, alpha = 0.6, size = 1.2) +
scale_color_viridis(discrete = TRUE) +
labs(color = "CV Fold") +
coord_equal()
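If you’d rather work with .pred_yes directly, yardstick can be told that the second factor level ("yes") is the event of interest via the event_level argument (assuming a yardstick version recent enough to have it); a sketch:

# same ROC curves, explicitly treating "yes" (the second level) as the event
olympics_res_tune %>%
  collect_predictions() %>%
  group_by(id) %>%
  roc_curve(win, .pred_yes, event_level = "second") %>%
  ggplot(aes(1 - specificity, sensitivity, color = id)) +
  geom_abline(lty = 2, color = "gray80") +
  geom_path(alpha = 0.6, size = 1.2) +
  scale_color_viridis(discrete = TRUE) +
  labs(color = "CV Fold") +
  coord_equal()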
collect_predictions(olympics_res_tune) %>%
arrange(.row)
olympics_mod_best <-
svm_poly(
cost = 32,
degree = 1,
scale_factor = 1
) %>%
set_mode("classification") %>%
set_engine("kernlab")
olympics_final <- workflow() %>%
add_recipe(olympics_rec) %>%
add_model(olympics_mod_best) %>%
last_fit(olympics_split)
olympics_final %>%
collect_metrics()
olympics_final %>%
collect_predictions() %>%
conf_mat(win, .pred_class)
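As a side note, instead of retyping cost = 32 and degree = 1 by hand, the values chosen by select_best() can be plugged into the tuned specification with finalize_workflow(); a sketch using the objects defined above:

# alternative final fit: finalize the tuned workflow with the selected parameters
olympics_final_alt <- workflow() %>%
  add_recipe(olympics_rec) %>%
  add_model(olympics_mod_tune) %>%
  finalize_workflow(best_values) %>%
  last_fit(olympics_split)

collect_metrics(olympics_final_alt)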
praise()
## [1] "You are smashing!"