My initial approach to these data is to see how well we can build a model that predicts the rating of the cup of coffee. First, let’s explore the data graphically to understand variables and relationships. Then we’ll apply random forests and prediction intervals on random forests from a recent project I’ve worked on with Benji Lu: A Unified Framework for Random Forest Prediction Error Estimation

Loading in the Data

In looking at the data, we notice a few things:

cofrat <- read_csv("coffee_ratings.csv") %>%
  filter(total_cup_points > 0)
cofrat %>% skimr::skim()
Data summary
Name Piped data
Number of rows 1338
Number of columns 43
_______________________
Column type frequency:
character 24
numeric 19
________________________
Group variables

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1.00 7 7 0 2 0
owner 7 0.99 3 50 0 315 0
country_of_origin 1 1.00 4 28 0 36 0
farm_name 359 0.73 1 73 0 571 0
lot_number 1063 0.21 1 71 0 227 0
mill 315 0.76 1 77 0 460 0
ico_number 151 0.89 1 40 0 847 0
company 209 0.84 3 73 0 281 0
altitude 226 0.83 1 41 0 396 0
region 59 0.96 2 76 0 356 0
producer 231 0.83 1 100 0 691 0
bag_weight 0 1.00 1 8 0 56 0
in_country_partner 0 1.00 7 85 0 27 0
harvest_year 47 0.96 3 24 0 46 0
grading_date 0 1.00 13 20 0 567 0
owner_1 7 0.99 3 50 0 319 0
variety 226 0.83 4 21 0 29 0
processing_method 169 0.87 5 25 0 5 0
color 218 0.84 4 12 0 4 0
expiration 0 1.00 13 20 0 566 0
certification_body 0 1.00 7 85 0 26 0
certification_address 0 1.00 40 40 0 32 0
certification_contact 0 1.00 40 40 0 29 0
unit_of_measurement 0 1.00 1 2 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1.00 82.15 2.69 59.83 81.10 82.50 83.67 90.58 ▁▁▁▇▁
number_of_bags 0 1.00 154.09 129.99 0.00 14.00 172.50 275.00 1062.00 ▇▇▁▁▁
aroma 0 1.00 7.57 0.32 5.08 7.42 7.58 7.75 8.75 ▁▁▂▇▁
flavor 0 1.00 7.53 0.34 6.08 7.33 7.58 7.75 8.83 ▁▂▇▃▁
aftertaste 0 1.00 7.41 0.35 6.17 7.25 7.42 7.58 8.67 ▁▃▇▂▁
acidity 0 1.00 7.54 0.32 5.25 7.33 7.58 7.75 8.75 ▁▁▃▇▁
body 0 1.00 7.52 0.31 5.08 7.33 7.50 7.67 8.58 ▁▁▁▇▁
balance 0 1.00 7.52 0.35 5.25 7.33 7.50 7.75 8.75 ▁▁▃▇▁
uniformity 0 1.00 9.84 0.49 6.00 10.00 10.00 10.00 10.00 ▁▁▁▁▇
clean_cup 0 1.00 9.84 0.72 0.00 10.00 10.00 10.00 10.00 ▁▁▁▁▇
sweetness 0 1.00 9.86 0.55 1.33 10.00 10.00 10.00 10.00 ▁▁▁▁▇
cupper_points 0 1.00 7.51 0.43 5.17 7.25 7.50 7.75 10.00 ▁▂▇▁▁
moisture 0 1.00 0.09 0.05 0.00 0.09 0.11 0.12 0.28 ▃▇▅▁▁
category_one_defects 0 1.00 0.48 2.55 0.00 0.00 0.00 0.00 63.00 ▇▁▁▁▁
quakers 1 1.00 0.17 0.83 0.00 0.00 0.00 0.00 11.00 ▇▁▁▁▁
category_two_defects 0 1.00 3.56 5.31 0.00 0.00 2.00 4.00 55.00 ▇▁▁▁▁
altitude_low_meters 230 0.83 1751.03 8673.35 1.00 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
altitude_high_meters 230 0.83 1799.71 8672.71 1.00 1100.00 1350.00 1650.00 190164.00 ▇▁▁▁▁
altitude_mean_meters 230 0.83 1775.37 8672.53 1.00 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
cofrat %>% summarize(sum(total_cup_points ==0))
## # A tibble: 1 x 1
##   `sum(total_cup_points == 0)`
##                          <int>
## 1                            0
cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                  balance, uniformity, clean_cup, sweetness, cupper_points, 
                  moisture, category_one_defects, quakers, species) %>%
  ggpairs(mapping = ggplot2::aes(color = species), columns = c(2:7,1)) 

cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                  balance, uniformity, clean_cup, sweetness, cupper_points, 
                  moisture, category_one_defects, quakers, species) %>%
  ggpairs(mapping = ggplot2::aes(color = species), columns = c(8:14,1)) 

Predicting

For a first pass at predicting, I’m going to remove 40% of the observations to hold out for later prediction. I’ll use 60% to build a random forest. Following along with the instructions on the forestError GitHub repo.

Numerical Variables

I’ll start by using only the numerical variables (and species) for predicting the coffee rating. Note that random forests require no missing variables, so if we use other variables, a more sophisticated analysis would require imputing the missing values or dealing with them in some other way.

cofrat.rf <- cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                  balance, uniformity, clean_cup, sweetness, cupper_points, 
                  moisture, category_one_defects, quakers, species) %>%
  mutate(species = as.factor(species)) %>%
  filter(complete.cases(.))


# get number of observations and the response column index
n.rf <- nrow(cofrat.rf)
response.col <- 1

# split data into training and test sets
set.seed(4747)
train.rate = 0.6
train.ind <- sample(1:n.rf, n.rf * train.rate, replace = FALSE)
Xtrain <- cofrat.rf[train.ind, -response.col]
Ytrain <- unlist(cofrat.rf[train.ind, response.col])
Xtest <- cofrat.rf[-train.ind, -response.col]
Ytest <- unlist(cofrat.rf[-train.ind, response.col])

# fit random forest to the training data
cof_rf <- randomForest::randomForest(x = Xtrain, y = Ytrain, 
                                     nodesize = 5, ntree = 500, keep.inbag = TRUE)

# estimate conditional mean squared prediction errors, conditional
# biases, conditional prediction intervals, and conditional error
# distribution functions for the test observations
output <- quantForestError(cof_rf, Xtrain, Xtest, alpha = 0.05)

Now that we’ve run the random forest and predicted the outcomes, let’s visualize how well the prediction intervals captured the actual observed ratings. Likely due to the many fewer observations at the low ratings, the prediction intervals are much wider.

output$estimates %>% cbind(Ytest) %>%
  mutate(capture = case_when(Ytest < lower_0.05 ~ "no",
                             Ytest > upper_0.05 ~ "no",
                             TRUE ~ "yes")) %>%
  ggplot() +
  geom_point(aes(x = Ytest, y = pred), size = 0.5) +
  geom_segment(aes(x = Ytest, xend = Ytest, y = lower_0.05, yend = upper_0.05, 
                   color = capture),
               lwd=0.5) + 
  geom_abline(slope = 1, intercept = 0) +
  geom_text(aes(label = paste("percent capture = ",  
                              round(100*sum(capture=="yes")/(n.rf*(1 - train.rate)),2))), 
            x = 85, y = 70) +
  ggtitle("Prediction Intervals for coffee ratings (numeric variables only)") +
  xlab("True rating") +
  ylab("Predicted rating") +
  scale_color_brewer(type='div', palette = "Set1")

All the variables

We now predict with more variables. I keep only those variables without substantial missing observations. We also do not use unique identifiers like owner, farm_name, lot_number, etc. Note that there are substantially fewer observations (due to missingness). So although I set the seed the same, the test observations will be different across the two analyses.

cofrat.rf2 <- cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                  balance, uniformity, clean_cup, sweetness, cupper_points, 
                  moisture, category_one_defects, quakers, species,
                  country_of_origin, processing_method, color, 
                  altitude_mean_meters) %>%
  mutate(country = case_when(country_of_origin == "Brazil" ~ "Brazil",
                             country_of_origin == "Costa Rica" ~ "Costa Rica",
                             country_of_origin == "Columbia" ~ "Columbia",
                             country_of_origin == "Honduras" ~ "Honduras",
                             country_of_origin == "El Salvador" ~ "El Salvador",
                             country_of_origin == "Guatemala" ~ "Guatemala",
                             country_of_origin == "Mexico" ~ "Mexico",
                             str_detect(country_of_origin, "United States") ~ "United States",
                             country_of_origin == "Taiwan" ~ "Taiwan",
                             TRUE ~ "other country")) %>%
  mutate_if(is.character, as.factor) %>%
  select(-country_of_origin) %>%
  filter(complete.cases(.))


# get number of observations and the response column index
n.rf2 <- nrow(cofrat.rf2)
response.col <- 1

# split data into training and test sets
set.seed(4747)
train.rate = 0.6
train.ind <- sample(1:n.rf2, n.rf2 * train.rate, replace = FALSE)
Xtrain <- cofrat.rf2[train.ind, -response.col]
Ytrain <- unlist(cofrat.rf2[train.ind, response.col])
Xtest <- cofrat.rf2[-train.ind, -response.col]
Ytest <- unlist(cofrat.rf2[-train.ind, response.col])

# fit random forest to the training data
cof_rf2 <- randomForest::randomForest(x = Xtrain, y = Ytrain, 
                                      nodesize = 5, ntree = 500, keep.inbag = TRUE)

# estimate conditional mean squared prediction errors, conditional
# biases, conditional prediction intervals, and conditional error
# distribution functions for the test observations
output2 <- quantForestError(cof_rf2, Xtrain, Xtest, alpha = 0.05)

Now that we’ve run the random forest and predicted the outcomes, let’s visualize how well the prediction intervals captured the actual observed ratings. Likely due to the many fewer observations at the low ratings, the prediction intervals are much wider.

We see the same trends as above (fewer observations and wider intervals for lower rated coffees). Both sets of variables provide good interval predictions with capture rates above 97%.

output2$estimates %>% cbind(Ytest) %>%
  mutate(capture = case_when(Ytest < lower_0.05 ~ "no",
                             Ytest > upper_0.05 ~ "no",
                             TRUE ~ "yes")) %>%
  ggplot() +
  geom_point(aes(x = Ytest, y = pred), size = 0.5) +
  geom_segment(aes(x = Ytest, xend = Ytest, y = lower_0.05, yend = upper_0.05, 
                   color = capture),
               lwd=0.5) + 
  geom_abline(slope = 1, intercept = 0) +
  geom_text(aes(label = paste("percent capture = ",  
                              round(100*sum(capture=="yes")/(n.rf2*(1 - train.rate)),2))), 
            x = 85, y = 70) +
  ggtitle("Prediction Intervals for coffee ratings (including factor variables)") +
  xlab("True rating") +
  ylab("Predicted rating") +
  scale_color_brewer(type='qual')