My initial approach to these data is to see how well we can build a model that predicts the rating of a cup of coffee. First, let's explore the data graphically to understand the variables and their relationships. Then we'll apply random forests, with prediction intervals from a recent project I've worked on with Benji Lu: A Unified Framework for Random Forest Prediction Error Estimation.
In looking at the data, we notice a few things:

- One of the observations has a total_cup_points of zero. That can't be right: its aroma, flavor, and other subscores are also all zero, so it seems that coffee simply wasn't rated.
- Sweetness is strongly associated with species.
- Some of the numeric variables are much more correlated with the overall coffee rating than others.
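As a quick sanity check on that first observation, we can look at the raw file before any filtering (a minimal sketch; it assumes coffee_ratings.csv sits in the working directory, as in the read below):

# inspect the suspicious all-zero row before filtering
readr::read_csv("coffee_ratings.csv") %>%
  dplyr::filter(total_cup_points == 0) %>%
  dplyr::select(total_cup_points, aroma, flavor, aftertaste, acidity)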
library(tidyverse)  # read_csv, dplyr verbs, ggplot2

# read the ratings and drop the unrated (all-zero) observation
cofrat <- read_csv("coffee_ratings.csv") %>%
  filter(total_cup_points > 0)
cofrat %>% skimr::skim()
Name | Piped data |
---|---|
Number of rows | 1338 |
Number of columns | 43 |
Column type frequency: | |
character | 24 |
numeric | 19 |
Group variables | |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
species | 0 | 1.00 | 7 | 7 | 0 | 2 | 0 |
owner | 7 | 0.99 | 3 | 50 | 0 | 315 | 0 |
country_of_origin | 1 | 1.00 | 4 | 28 | 0 | 36 | 0 |
farm_name | 359 | 0.73 | 1 | 73 | 0 | 571 | 0 |
lot_number | 1063 | 0.21 | 1 | 71 | 0 | 227 | 0 |
mill | 315 | 0.76 | 1 | 77 | 0 | 460 | 0 |
ico_number | 151 | 0.89 | 1 | 40 | 0 | 847 | 0 |
company | 209 | 0.84 | 3 | 73 | 0 | 281 | 0 |
altitude | 226 | 0.83 | 1 | 41 | 0 | 396 | 0 |
region | 59 | 0.96 | 2 | 76 | 0 | 356 | 0 |
producer | 231 | 0.83 | 1 | 100 | 0 | 691 | 0 |
bag_weight | 0 | 1.00 | 1 | 8 | 0 | 56 | 0 |
in_country_partner | 0 | 1.00 | 7 | 85 | 0 | 27 | 0 |
harvest_year | 47 | 0.96 | 3 | 24 | 0 | 46 | 0 |
grading_date | 0 | 1.00 | 13 | 20 | 0 | 567 | 0 |
owner_1 | 7 | 0.99 | 3 | 50 | 0 | 319 | 0 |
variety | 226 | 0.83 | 4 | 21 | 0 | 29 | 0 |
processing_method | 169 | 0.87 | 5 | 25 | 0 | 5 | 0 |
color | 218 | 0.84 | 4 | 12 | 0 | 4 | 0 |
expiration | 0 | 1.00 | 13 | 20 | 0 | 566 | 0 |
certification_body | 0 | 1.00 | 7 | 85 | 0 | 26 | 0 |
certification_address | 0 | 1.00 | 40 | 40 | 0 | 32 | 0 |
certification_contact | 0 | 1.00 | 40 | 40 | 0 | 29 | 0 |
unit_of_measurement | 0 | 1.00 | 1 | 2 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total_cup_points | 0 | 1.00 | 82.15 | 2.69 | 59.83 | 81.10 | 82.50 | 83.67 | 90.58 | ▁▁▁▇▁ |
number_of_bags | 0 | 1.00 | 154.09 | 129.99 | 0.00 | 14.00 | 172.50 | 275.00 | 1062.00 | ▇▇▁▁▁ |
aroma | 0 | 1.00 | 7.57 | 0.32 | 5.08 | 7.42 | 7.58 | 7.75 | 8.75 | ▁▁▂▇▁ |
flavor | 0 | 1.00 | 7.53 | 0.34 | 6.08 | 7.33 | 7.58 | 7.75 | 8.83 | ▁▂▇▃▁ |
aftertaste | 0 | 1.00 | 7.41 | 0.35 | 6.17 | 7.25 | 7.42 | 7.58 | 8.67 | ▁▃▇▂▁ |
acidity | 0 | 1.00 | 7.54 | 0.32 | 5.25 | 7.33 | 7.58 | 7.75 | 8.75 | ▁▁▃▇▁ |
body | 0 | 1.00 | 7.52 | 0.31 | 5.08 | 7.33 | 7.50 | 7.67 | 8.58 | ▁▁▁▇▁ |
balance | 0 | 1.00 | 7.52 | 0.35 | 5.25 | 7.33 | 7.50 | 7.75 | 8.75 | ▁▁▃▇▁ |
uniformity | 0 | 1.00 | 9.84 | 0.49 | 6.00 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
clean_cup | 0 | 1.00 | 9.84 | 0.72 | 0.00 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
sweetness | 0 | 1.00 | 9.86 | 0.55 | 1.33 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
cupper_points | 0 | 1.00 | 7.51 | 0.43 | 5.17 | 7.25 | 7.50 | 7.75 | 10.00 | ▁▂▇▁▁ |
moisture | 0 | 1.00 | 0.09 | 0.05 | 0.00 | 0.09 | 0.11 | 0.12 | 0.28 | ▃▇▅▁▁ |
category_one_defects | 0 | 1.00 | 0.48 | 2.55 | 0.00 | 0.00 | 0.00 | 0.00 | 63.00 | ▇▁▁▁▁ |
quakers | 1 | 1.00 | 0.17 | 0.83 | 0.00 | 0.00 | 0.00 | 0.00 | 11.00 | ▇▁▁▁▁ |
category_two_defects | 0 | 1.00 | 3.56 | 5.31 | 0.00 | 0.00 | 2.00 | 4.00 | 55.00 | ▇▁▁▁▁ |
altitude_low_meters | 230 | 0.83 | 1751.03 | 8673.35 | 1.00 | 1100.00 | 1310.64 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
altitude_high_meters | 230 | 0.83 | 1799.71 | 8672.71 | 1.00 | 1100.00 | 1350.00 | 1650.00 | 190164.00 | ▇▁▁▁▁ |
altitude_mean_meters | 230 | 0.83 | 1775.37 | 8672.53 | 1.00 | 1100.00 | 1310.64 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
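The strong association between sweetness and species noted above is easy to check directly (a quick sketch using the cofrat object):

# average sweetness by species, with group sizes
cofrat %>%
  group_by(species) %>%
  summarize(mean_sweetness = mean(sweetness), sd_sweetness = sd(sweetness), n = n())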
# confirm the zero-rated observation was removed by the filter above
cofrat %>% summarize(sum(total_cup_points == 0))
## # A tibble: 1 x 1
## `sum(total_cup_points == 0)`
## <int>
## 1 0
# pairs plot of the first set of tasting subscores against the total rating,
# colored by species
cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                  balance, uniformity, clean_cup, sweetness, cupper_points,
                  moisture, category_one_defects, quakers, species) %>%
  GGally::ggpairs(mapping = ggplot2::aes(color = species), columns = c(2:7, 1))
# pairs plot of the remaining subscores and defect counts against the rating
cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                  balance, uniformity, clean_cup, sweetness, cupper_points,
                  moisture, category_one_defects, quakers, species) %>%
  GGally::ggpairs(mapping = ggplot2::aes(color = species), columns = c(8:14, 1))
For a first pass at predicting, I'm going to hold out 40% of the observations for later prediction and use the remaining 60% to build a random forest, following the instructions on the forestError GitHub repo.
I'll start by using only the numeric variables (and species) to predict the coffee rating. Note that random forests cannot handle missing values, so if we want to use other variables, a more sophisticated analysis would require imputing the missing values or dealing with them in some other way.
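To see which variables would force us to drop observations, here is a quick count of missing values per column (base R on the cofrat object):

# columns with the most missing values
head(sort(colSums(is.na(cofrat)), decreasing = TRUE), 10)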
# keep the numeric subscores plus species; drop incomplete rows
cofrat.rf <- cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                               balance, uniformity, clean_cup, sweetness, cupper_points,
                               moisture, category_one_defects, quakers, species) %>%
  mutate(species = as.factor(species)) %>%
  filter(complete.cases(.))
# get number of observations and the response column index
n.rf <- nrow(cofrat.rf)
response.col <- 1
# split data into training and test sets
set.seed(4747)
train.rate <- 0.6
train.ind <- sample(1:n.rf, floor(n.rf * train.rate), replace = FALSE)
Xtrain <- cofrat.rf[train.ind, -response.col]
Ytrain <- unlist(cofrat.rf[train.ind, response.col])
Xtest <- cofrat.rf[-train.ind, -response.col]
Ytest <- unlist(cofrat.rf[-train.ind, response.col])
# fit random forest to the training data
cof_rf <- randomForest::randomForest(x = Xtrain, y = Ytrain,
nodesize = 5, ntree = 500, keep.inbag = TRUE)
# estimate conditional mean squared prediction errors, conditional
# biases, conditional prediction intervals, and conditional error
# distribution functions for the test observations
output <- forestError::quantForestError(cof_rf, Xtrain, Xtest, alpha = 0.05)
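The estimates data frame returned by quantForestError holds the point predictions along with the interval endpoints; the column names used below follow the forestError README for alpha = 0.05:

# peek at the test-set predictions and their 95% interval bounds
head(output$estimates[, c("pred", "lower_0.05", "upper_0.05")])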
Now that we've fit the random forest and predicted the test outcomes, let's visualize how well the prediction intervals captured the actual observed ratings. Likely because there are far fewer observations at the low end of the rating scale, the prediction intervals there are much wider.
output$estimates %>% cbind(Ytest) %>%
  mutate(capture = case_when(Ytest < lower_0.05 ~ "no",
                             Ytest > upper_0.05 ~ "no",
                             TRUE ~ "yes")) %>%
  ggplot() +
  geom_point(aes(x = Ytest, y = pred), size = 0.5) +
  geom_segment(aes(x = Ytest, xend = Ytest, y = lower_0.05, yend = upper_0.05,
                   color = capture),
               linewidth = 0.5) +
  geom_abline(slope = 1, intercept = 0) +
  geom_text(aes(label = paste("percent capture =",
                              round(100 * mean(capture == "yes"), 2))),
            x = 85, y = 70, check_overlap = TRUE) +
  ggtitle("Prediction Intervals for coffee ratings (numeric variables only)") +
  xlab("True rating") +
  ylab("Predicted rating") +
  scale_color_brewer(type = "qual", palette = "Set1")
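The empirical coverage can also be computed directly for comparison against the nominal 95% level (equivalent to the percent-capture annotation in the plot):

# proportion of test ratings that land inside their prediction intervals
mean(Ytest >= output$estimates$lower_0.05 & Ytest <= output$estimates$upper_0.05)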
We now predict with more variables. I keep only variables without substantial missingness, and I don't use unique identifiers like owner, farm_name, lot_number, etc. I also lump the less common countries of origin into an "other country" level. Note that there are substantially fewer complete observations (due to missingness in the added variables), so although I set the same seed, the test observations will differ across the two analyses.
cofrat.rf2 <- cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                                balance, uniformity, clean_cup, sweetness, cupper_points,
                                moisture, category_one_defects, quakers, species,
                                country_of_origin, processing_method, color,
                                altitude_mean_meters) %>%
  # lump the less common countries of origin into "other country"
  mutate(country = case_when(country_of_origin == "Brazil" ~ "Brazil",
                             country_of_origin == "Costa Rica" ~ "Costa Rica",
                             country_of_origin == "Colombia" ~ "Colombia",
                             country_of_origin == "Honduras" ~ "Honduras",
                             country_of_origin == "El Salvador" ~ "El Salvador",
                             country_of_origin == "Guatemala" ~ "Guatemala",
                             country_of_origin == "Mexico" ~ "Mexico",
                             str_detect(country_of_origin, "United States") ~ "United States",
                             country_of_origin == "Taiwan" ~ "Taiwan",
                             TRUE ~ "other country")) %>%
  mutate_if(is.character, as.factor) %>%
  select(-country_of_origin) %>%
  filter(complete.cases(.))
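As an aside, the manual case_when lumping above could likely be written more compactly with forcats::fct_lump_n; a sketch (not what I ran here) that keeps the nine most frequent origins:

# hypothetical alternative: lump all but the 9 most common countries
cofrat %>%
  mutate(country = forcats::fct_lump_n(country_of_origin, n = 9,
                                       other_level = "other country"))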
# get number of observations and the response column index
n.rf2 <- nrow(cofrat.rf2)
response.col <- 1
# split data into training and test sets
set.seed(4747)
train.rate <- 0.6
train.ind <- sample(1:n.rf2, floor(n.rf2 * train.rate), replace = FALSE)
Xtrain <- cofrat.rf2[train.ind, -response.col]
Ytrain <- unlist(cofrat.rf2[train.ind, response.col])
Xtest <- cofrat.rf2[-train.ind, -response.col]
Ytest <- unlist(cofrat.rf2[-train.ind, response.col])
# fit random forest to the training data
cof_rf2 <- randomForest::randomForest(x = Xtrain, y = Ytrain,
nodesize = 5, ntree = 500, keep.inbag = TRUE)
# estimate conditional mean squared prediction errors, conditional
# biases, conditional prediction intervals, and conditional error
# distribution functions for the test observations
output2 <- forestError::quantForestError(cof_rf2, Xtrain, Xtest, alpha = 0.05)
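Before looking at the intervals, it's worth a glance at which predictors this forest leans on; for regression, randomForest reports the total decrease in node impurity:

# variable importance (IncNodePurity), largest first
sort(randomForest::importance(cof_rf2)[, 1], decreasing = TRUE)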
Again, let's visualize how well the prediction intervals captured the actual observed ratings. We see the same trends as above (fewer observations and wider intervals for lower-rated coffees). Both sets of variables provide good interval predictions, with capture rates above 97%.
output2$estimates %>% cbind(Ytest) %>%
  mutate(capture = case_when(Ytest < lower_0.05 ~ "no",
                             Ytest > upper_0.05 ~ "no",
                             TRUE ~ "yes")) %>%
  ggplot() +
  geom_point(aes(x = Ytest, y = pred), size = 0.5) +
  geom_segment(aes(x = Ytest, xend = Ytest, y = lower_0.05, yend = upper_0.05,
                   color = capture),
               linewidth = 0.5) +
  geom_abline(slope = 1, intercept = 0) +
  geom_text(aes(label = paste("percent capture =",
                              round(100 * mean(capture == "yes"), 2))),
            x = 85, y = 70, check_overlap = TRUE) +
  ggtitle("Prediction Intervals for coffee ratings (including factor variables)") +
  xlab("True rating") +
  ylab("Predicted rating") +
  scale_color_brewer(type = "qual", palette = "Set1")