The data today come from the Tate Art Museum.

Input data

artists <- readr::read_csv("artists.csv")
artwork <- readr::read_csv("artwork.csv")

Note, over half of the pieces in the collection are by Joseph Turner. We might be curious to analyze the data without his pieces.

artwork %>%
  group_by(artistId) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  head()
## # A tibble: 6 x 2
##   artistId count
##      <dbl> <int>
## 1      558 39389
## 2      300  1046
## 3     1659   623
## 4      138   612
## 5      747   578
## 6     2638   388
artists %>%
  filter(id == 558) 
## # A tibble: 1 x 9
##      id name  gender dates yearOfBirth yearOfDeath placeOfBirth placeOfDeath
##   <dbl> <chr> <chr>  <chr>       <dbl>       <dbl> <chr>        <chr>       
## 1   558 Turn… Male   1775…        1775        1851 London, Uni… Chelsea, Un…
## # … with 1 more variable: url <chr>

Let’s combine the datasets so that if we want to use gender to ask questions about the art, the information is available. Note there are 6 pieces of art without a connection to the artists database. And there are 194 artists without art in the dataset.

artartist1 <- inner_join(artists, artwork, by = c("id" = "artistId"))
artartist2 <- full_join(artists, artwork, by = c("id" = "artistId"))

nrow(artwork)
## [1] 69201
nrow(artartist1)
## [1] 69195
nrow(artartist2)
## [1] 69395

Location

Both placeOfBirth and placeOfDeath are given by “City, Country”, and it might be easier to visualize or model if we had only country.

artists <- artists %>%
  separate(placeOfBirth, c("cityBirth", "countryBirth") ,sep = ", ", fill = "left") %>%
  separate(placeOfDeath, c("cityDeath", "countryDeath") ,sep = ", ", fill = "left") 

artists %>% group_by(countryBirth) %>%
  mutate(count = n()) %>%
  filter(count >= 20) %>%
  ungroup() %>%
  ggplot(aes(x = countryBirth, y = yearOfBirth)) + 
  geom_boxplot()

Viz artwork

artartist1 %>%
  filter(id != 558) %>%
  ggplot(aes(x = year, color = gender, fill = gender)) + 
  geom_histogram()

Random Forest

In the interest of working through a Random Forest, not in the interest of an actual good prediction model, we’ll create a Random Forest. Note that I used year of birth and year of death to predict year of art, and so I’m surely going to overfit / have too much information.

set.seed(47)
library(tidymodels)
library(vip)

artRF <- artartist1 %>%
  filter(id != 558)

data_split <- initial_split(artRF, prop = 0.75)

art_train <- training(data_split)
art_test  <- testing(data_split) %>%
  filter(!is.na(gender)) %>%
  filter(!is.na(width)) %>%
  filter(!is.na(height)) %>%
  filter(!is.na(acquisitionYear)) %>%
  filter(!is.na(yearOfBirth)) %>%
  filter(!is.na(yearOfDeath))

rand_forest(mode = "regression") %>% 
   set_args(importance = "permutation") %>%
   fit(year ~ gender + width + height + acquisitionYear +
         yearOfBirth + yearOfDeath, 
       data = art_train) %>% 
   vip::vip() 

rand_forest(mode = "regression") %>% 
   fit(year ~ gender + width + height + acquisitionYear +
         yearOfBirth + yearOfDeath, data = art_train) %>% 
   predict(new_data = art_test) %>%
   ggplot(aes(y = .pred, x = art_test$year)) + 
   geom_point() +
   geom_abline(intercept = 0, slope = 1) +
  ylab("Predicted Year") +
  xlab("Year of Creation")

praise()
## [1] "You are classy!"