My initial approach to these data is to see how well we can build a model that predicts the rating of a cup of coffee. First, let's explore the data graphically to understand the variables and their relationships. Then we'll apply random forests, with prediction intervals from a recent project I've worked on with Benji Lu: A Unified Framework for Random Forest Prediction Error Estimation.
In looking at the data, we notice a few things:

- One of the observations has a total_cup_points of zero. That can't be right: its aroma, flavor, and other subscores are also all zero, so it seems that coffee simply wasn't rated.
- Sweetness is strongly associated with species.
- Some of the numeric variables are much more correlated with the overall coffee rating than others.
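As a quick sanity check on that first observation, we can look at the raw file before any filtering (a minimal sketch; it assumes coffee_ratings.csv sits in the working directory, as in the read below):

# inspect the suspicious all-zero row before filtering
readr::read_csv("coffee_ratings.csv") %>%
  dplyr::filter(total_cup_points == 0) %>%
  dplyr::select(total_cup_points, aroma, flavor, aftertaste, acidity)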
library(tidyverse)  # read_csv, dplyr verbs, ggplot2

# read the ratings and drop the unrated (all-zero) observation
cofrat <- read_csv("coffee_ratings.csv") %>%
  filter(total_cup_points > 0)
cofrat %>% skimr::skim()
Name | Piped data |
---|---|
Number of rows | 1338 |
Number of columns | 43 |
Column type frequency: | |
character | 24 |
numeric | 19 |
Group variables | |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
species | 0 | 1.00 | 7 | 7 | 0 | 2 | 0 |
owner | 7 | 0.99 | 3 | 50 | 0 | 315 | 0 |
country_of_origin | 1 | 1.00 | 4 | 28 | 0 | 36 | 0 |
farm_name | 359 | 0.73 | 1 | 73 | 0 | 571 | 0 |
lot_number | 1063 | 0.21 | 1 | 71 | 0 | 227 | 0 |
mill | 315 | 0.76 | 1 | 77 | 0 | 460 | 0 |
ico_number | 151 | 0.89 | 1 | 40 | 0 | 847 | 0 |
company | 209 | 0.84 | 3 | 73 | 0 | 281 | 0 |
altitude | 226 | 0.83 | 1 | 41 | 0 | 396 | 0 |
region | 59 | 0.96 | 2 | 76 | 0 | 356 | 0 |
producer | 231 | 0.83 | 1 | 100 | 0 | 691 | 0 |
bag_weight | 0 | 1.00 | 1 | 8 | 0 | 56 | 0 |
in_country_partner | 0 | 1.00 | 7 | 85 | 0 | 27 | 0 |
harvest_year | 47 | 0.96 | 3 | 24 | 0 | 46 | 0 |
grading_date | 0 | 1.00 | 13 | 20 | 0 | 567 | 0 |
owner_1 | 7 | 0.99 | 3 | 50 | 0 | 319 | 0 |
variety | 226 | 0.83 | 4 | 21 | 0 | 29 | 0 |
processing_method | 169 | 0.87 | 5 | 25 | 0 | 5 | 0 |
color | 218 | 0.84 | 4 | 12 | 0 | 4 | 0 |
expiration | 0 | 1.00 | 13 | 20 | 0 | 566 | 0 |
certification_body | 0 | 1.00 | 7 | 85 | 0 | 26 | 0 |
certification_address | 0 | 1.00 | 40 | 40 | 0 | 32 | 0 |
certification_contact | 0 | 1.00 | 40 | 40 | 0 | 29 | 0 |
unit_of_measurement | 0 | 1.00 | 1 | 2 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total_cup_points | 0 | 1.00 | 82.15 | 2.69 | 59.83 | 81.10 | 82.50 | 83.67 | 90.58 | ▁▁▁▇▁ |
number_of_bags | 0 | 1.00 | 154.09 | 129.99 | 0.00 | 14.00 | 172.50 | 275.00 | 1062.00 | ▇▇▁▁▁ |
aroma | 0 | 1.00 | 7.57 | 0.32 | 5.08 | 7.42 | 7.58 | 7.75 | 8.75 | ▁▁▂▇▁ |
flavor | 0 | 1.00 | 7.53 | 0.34 | 6.08 | 7.33 | 7.58 | 7.75 | 8.83 | ▁▂▇▃▁ |
aftertaste | 0 | 1.00 | 7.41 | 0.35 | 6.17 | 7.25 | 7.42 | 7.58 | 8.67 | ▁▃▇▂▁ |
acidity | 0 | 1.00 | 7.54 | 0.32 | 5.25 | 7.33 | 7.58 | 7.75 | 8.75 | ▁▁▃▇▁ |
body | 0 | 1.00 | 7.52 | 0.31 | 5.08 | 7.33 | 7.50 | 7.67 | 8.58 | ▁▁▁▇▁ |
balance | 0 | 1.00 | 7.52 | 0.35 | 5.25 | 7.33 | 7.50 | 7.75 | 8.75 | ▁▁▃▇▁ |
uniformity | 0 | 1.00 | 9.84 | 0.49 | 6.00 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
clean_cup | 0 | 1.00 | 9.84 | 0.72 | 0.00 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
sweetness | 0 | 1.00 | 9.86 | 0.55 | 1.33 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
cupper_points | 0 | 1.00 | 7.51 | 0.43 | 5.17 | 7.25 | 7.50 | 7.75 | 10.00 | ▁▂▇▁▁ |
moisture | 0 | 1.00 | 0.09 | 0.05 | 0.00 | 0.09 | 0.11 | 0.12 | 0.28 | ▃▇▅▁▁ |
category_one_defects | 0 | 1.00 | 0.48 | 2.55 | 0.00 | 0.00 | 0.00 | 0.00 | 63.00 | ▇▁▁▁▁ |
quakers | 1 | 1.00 | 0.17 | 0.83 | 0.00 | 0.00 | 0.00 | 0.00 | 11.00 | ▇▁▁▁▁ |
category_two_defects | 0 | 1.00 | 3.56 | 5.31 | 0.00 | 0.00 | 2.00 | 4.00 | 55.00 | ▇▁▁▁▁ |
altitude_low_meters | 230 | 0.83 | 1751.03 | 8673.35 | 1.00 | 1100.00 | 1310.64 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
altitude_high_meters | 230 | 0.83 | 1799.71 | 8672.71 | 1.00 | 1100.00 | 1350.00 | 1650.00 | 190164.00 | ▇▁▁▁▁ |
altitude_mean_meters | 230 | 0.83 | 1775.37 | 8672.53 | 1.00 | 1100.00 | 1310.64 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
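The strong association between sweetness and species noted above is easy to check directly (a quick sketch using the cofrat object):

# average sweetness by species, with group sizes
cofrat %>%
  group_by(species) %>%
  summarize(mean_sweetness = mean(sweetness), sd_sweetness = sd(sweetness), n = n())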
# confirm the zero-rated observation was removed by the filter above
cofrat %>% summarize(sum(total_cup_points == 0))
## # A tibble: 1 x 1
## `sum(total_cup_points == 0)`
## <int>
## 1 0
# pairs plot of the first set of tasting subscores against the total rating,
# colored by species
cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                  balance, uniformity, clean_cup, sweetness, cupper_points,
                  moisture, category_one_defects, quakers, species) %>%
  GGally::ggpairs(mapping = ggplot2::aes(color = species), columns = c(2:7, 1))
# pairs plot of the remaining subscores and defect counts against the rating
cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                  balance, uniformity, clean_cup, sweetness, cupper_points,
                  moisture, category_one_defects, quakers, species) %>%
  GGally::ggpairs(mapping = ggplot2::aes(color = species), columns = c(8:14, 1))
For a first pass at predicting, I'm going to hold out 40% of the observations for later prediction and use the remaining 60% to build a random forest, following the instructions on the forestError GitHub repo.
I'll start by using only the numeric variables (and species) to predict the coffee rating. Note that random forests cannot handle missing values, so if we want to use other variables, a more sophisticated analysis would require imputing the missing values or dealing with them in some other way.
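To see which variables would force us to drop observations, here is a quick count of missing values per column (base R on the cofrat object):

# columns with the most missing values
head(sort(colSums(is.na(cofrat)), decreasing = TRUE), 10)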
# keep the numeric subscores plus species; drop incomplete rows
cofrat.rf <- cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                               balance, uniformity, clean_cup, sweetness, cupper_points,
                               moisture, category_one_defects, quakers, species) %>%
  mutate(species = as.factor(species)) %>%
  filter(complete.cases(.))
# get number of observations and the response column index
n.rf <- nrow(cofrat.rf)
response.col <- 1
# split data into training and test sets
set.seed(4747)
train.rate <- 0.6
train.ind <- sample(1:n.rf, floor(n.rf * train.rate), replace = FALSE)
Xtrain <- cofrat.rf[train.ind, -response.col]
Ytrain <- unlist(cofrat.rf[train.ind, response.col])
Xtest <- cofrat.rf[-train.ind, -response.col]
Ytest <- unlist(cofrat.rf[-train.ind, response.col])
# fit random forest to the training data
cof_rf <- randomForest::randomForest(x = Xtrain, y = Ytrain,
nodesize = 5, ntree = 500, keep.inbag = TRUE)
# estimate conditional mean squared prediction errors, conditional
# biases, conditional prediction intervals, and conditional error
# distribution functions for the test observations
output <- forestError::quantForestError(cof_rf, Xtrain, Xtest, alpha = 0.05)
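The estimates data frame returned by quantForestError holds the point predictions along with the interval endpoints; the column names used below follow the forestError README for alpha = 0.05:

# peek at the test-set predictions and their 95% interval bounds
head(output$estimates[, c("pred", "lower_0.05", "upper_0.05")])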
Now that we've fit the random forest and predicted the test outcomes, let's visualize how well the prediction intervals captured the actual observed ratings. Likely because there are far fewer observations at the low end of the rating scale, the prediction intervals there are much wider.
output$estimates %>% cbind(Ytest) %>%
  mutate(capture = case_when(Ytest < lower_0.05 ~ "no",
                             Ytest > upper_0.05 ~ "no",
                             TRUE ~ "yes")) %>%
  ggplot() +
  geom_point(aes(x = Ytest, y = pred), size = 0.5) +
  geom_segment(aes(x = Ytest, xend = Ytest, y = lower_0.05, yend = upper_0.05,
                   color = capture),
               linewidth = 0.5) +
  geom_abline(slope = 1, intercept = 0) +
  geom_text(aes(label = paste("percent capture =",
                              round(100 * mean(capture == "yes"), 2))),
            x = 85, y = 70, check_overlap = TRUE) +
  ggtitle("Prediction Intervals for coffee ratings (numeric variables only)") +
  xlab("True rating") +
  ylab("Predicted rating") +
  scale_color_brewer(type = "qual", palette = "Set1")
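The empirical coverage can also be computed directly for comparison against the nominal 95% level (equivalent to the percent-capture annotation in the plot):

# proportion of test ratings that land inside their prediction intervals
mean(Ytest >= output$estimates$lower_0.05 & Ytest <= output$estimates$upper_0.05)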
We now predict with more variables. I keep only variables without substantial missingness, and I don't use unique identifiers like owner, farm_name, lot_number, etc. I also lump the less common countries of origin into an "other country" level. Note that there are substantially fewer complete observations (due to missingness in the added variables), so although I set the same seed, the test observations will differ across the two analyses.
cofrat.rf2 <- cofrat %>% select(total_cup_points, aroma, flavor, aftertaste, acidity, body,
                                balance, uniformity, clean_cup, sweetness, cupper_points,
                                moisture, category_one_defects, quakers, species,
                                country_of_origin, processing_method, color,
                                altitude_mean_meters) %>%
  # lump the less common countries of origin into "other country"
  mutate(country = case_when(country_of_origin == "Brazil" ~ "Brazil",
                             country_of_origin == "Costa Rica" ~ "Costa Rica",
                             country_of_origin == "Colombia" ~ "Colombia",
                             country_of_origin == "Honduras" ~ "Honduras",
                             country_of_origin == "El Salvador" ~ "El Salvador",
                             country_of_origin == "Guatemala" ~ "Guatemala",
                             country_of_origin == "Mexico" ~ "Mexico",
                             str_detect(country_of_origin, "United States") ~ "United States",
                             country_of_origin == "Taiwan" ~ "Taiwan",
                             TRUE ~ "other country")) %>%
  mutate_if(is.character, as.factor) %>%
  select(-country_of_origin) %>%
  filter(complete.cases(.))
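As an aside, the manual case_when lumping above could likely be written more compactly with forcats::fct_lump_n; a sketch (not what I ran here) that keeps the nine most frequent origins:

# hypothetical alternative: lump all but the 9 most common countries
cofrat %>%
  mutate(country = forcats::fct_lump_n(country_of_origin, n = 9,
                                       other_level = "other country"))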
# get number of observations and the response column index
n.rf2 <- nrow(cofrat.rf2)
response.col <- 1
# split data into training and test sets
set.seed(4747)
train.rate <- 0.6
train.ind <- sample(1:n.rf2, floor(n.rf2 * train.rate), replace = FALSE)
Xtrain <- cofrat.rf2[train.ind, -response.col]
Ytrain <- unlist(cofrat.rf2[train.ind, response.col])
Xtest <- cofrat.rf2[-train.ind, -response.col]
Ytest <- unlist(cofrat.rf2[-train.ind, response.col])
# fit random forest to the training data
cof_rf2 <- randomForest::randomForest(x = Xtrain, y = Ytrain,
nodesize = 5, ntree = 500, keep.inbag = TRUE)
# estimate conditional mean squared prediction errors, conditional
# biases, conditional prediction intervals, and conditional error
# distribution functions for the test observations
output2 <- forestError::quantForestError(cof_rf2, Xtrain, Xtest, alpha = 0.05)
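Before looking at the intervals, it's worth a glance at which predictors this forest leans on; for regression, randomForest reports the total decrease in node impurity:

# variable importance (IncNodePurity), largest first
sort(randomForest::importance(cof_rf2)[, 1], decreasing = TRUE)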
Again, let's visualize how well the prediction intervals captured the actual observed ratings. We see the same trends as above (fewer observations and wider intervals for lower-rated coffees). Both sets of variables provide good interval predictions, with capture rates above 97%.
output2$estimates %>% cbind(Ytest) %>%
  mutate(capture = case_when(Ytest < lower_0.05 ~ "no",
                             Ytest > upper_0.05 ~ "no",
                             TRUE ~ "yes")) %>%
  ggplot() +
  geom_point(aes(x = Ytest, y = pred), size = 0.5) +
  geom_segment(aes(x = Ytest, xend = Ytest, y = lower_0.05, yend = upper_0.05,
                   color = capture),
               linewidth = 0.5) +
  geom_abline(slope = 1, intercept = 0) +
  geom_text(aes(label = paste("percent capture =",
                              round(100 * mean(capture == "yes"), 2))),
            x = 85, y = 70, check_overlap = TRUE) +
  ggtitle("Prediction Intervals for coffee ratings (including factor variables)") +
  xlab("True rating") +
  ylab("Predicted rating") +
  scale_color_brewer(type = "qual", palette = "Set1")