The data today come from the Tate Art Museum.
artists <- readr::read_csv("artists.csv")
artwork <- readr::read_csv("artwork.csv")
Note, over half of the pieces in the collection are by Joseph Turner. We might be curious to analyze the data without his pieces.
artwork %>%
group_by(artistId) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head()
## # A tibble: 6 x 2
## artistId count
## <dbl> <int>
## 1 558 39389
## 2 300 1046
## 3 1659 623
## 4 138 612
## 5 747 578
## 6 2638 388
artists %>%
filter(id == 558)
## # A tibble: 1 x 9
## id name gender dates yearOfBirth yearOfDeath placeOfBirth placeOfDeath
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 558 Turn… Male 1775… 1775 1851 London, Uni… Chelsea, Un…
## # … with 1 more variable: url <chr>
Let’s combine the datasets so that if we want to use gender to ask questions about the art, the information is available. Note there are 6 pieces of art without a connection to the artists database. And there are 194 artists without art in the dataset.
artartist1 <- inner_join(artists, artwork, by = c("id" = "artistId"))
artartist2 <- full_join(artists, artwork, by = c("id" = "artistId"))
nrow(artwork)
## [1] 69201
nrow(artartist1)
## [1] 69195
nrow(artartist2)
## [1] 69395
Both placeOfBirth
and placeOfDeath
are given by “City, Country”, and it might be easier to visualize or model if we had only country.
artists <- artists %>%
separate(placeOfBirth, c("cityBirth", "countryBirth") ,sep = ", ", fill = "left") %>%
separate(placeOfDeath, c("cityDeath", "countryDeath") ,sep = ", ", fill = "left")
artists %>% group_by(countryBirth) %>%
mutate(count = n()) %>%
filter(count >= 20) %>%
ungroup() %>%
ggplot(aes(x = countryBirth, y = yearOfBirth)) +
geom_boxplot()
artartist1 %>%
filter(id != 558) %>%
ggplot(aes(x = year, color = gender, fill = gender)) +
geom_histogram()
In the interest of working through a Random Forest, not in the interest of an actual good prediction model, we’ll create a Random Forest. Note that I used year of birth and year of death to predict year of art, and so I’m surely going to overfit / have too much information.
set.seed(47)
library(tidymodels)
library(vip)
artRF <- artartist1 %>%
filter(id != 558)
data_split <- initial_split(artRF, prop = 0.75)
art_train <- training(data_split)
art_test <- testing(data_split) %>%
filter(!is.na(gender)) %>%
filter(!is.na(width)) %>%
filter(!is.na(height)) %>%
filter(!is.na(acquisitionYear)) %>%
filter(!is.na(yearOfBirth)) %>%
filter(!is.na(yearOfDeath))
rand_forest(mode = "regression") %>%
set_args(importance = "permutation") %>%
fit(year ~ gender + width + height + acquisitionYear +
yearOfBirth + yearOfDeath,
data = art_train) %>%
vip::vip()
rand_forest(mode = "regression") %>%
fit(year ~ gender + width + height + acquisitionYear +
yearOfBirth + yearOfDeath, data = art_train) %>%
predict(new_data = art_test) %>%
ggplot(aes(y = .pred, x = art_test$year)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ylab("Predicted Year") +
xlab("Year of Creation")
praise()
## [1] "You are classy!"