The data this week comes from FiveThirtyEight and the corresponding article from FiveThirtyEight.
538 claims to have analyzed 1,615 films, but they don’t say how those particular films were chosen. But I can’t imagine this is every possible film. So I wonder whether the proportions calculated below are actually representative of a larger population or not.
movies <- read_csv("movies.csv") %>%
mutate(pass_num = as.numeric(ifelse(binary == "FAIL", 0, ifelse(binary == "PASS", 1, binary)))) %>%
mutate(rated2 = case_when(
rated == "N/A" ~ NA_character_,
rated == "Not Rated" ~ "no rating",
rated == "Unrated" ~ "no rating",
rated == "PG-13" ~ "PG",
rated == "TV-14" ~ "TV",
rated == "TV-PG" ~"TV",
TRUE ~ as.character(rated)
)) %>%
mutate(year_blck = case_when(
year <= 1975 ~ "1970-75",
year <= 1980 ~ "1976-80",
year <= 1985 ~ "1981-85",
year <= 1990 ~ "1986-90",
year <= 1995 ~ "1991-95",
year <= 2000 ~ "1996-2000",
year <= 2005 ~ "2001-05",
year <= 2010 ~ "2006-10",
year <= 2015 ~ "2011-15",
year <= 2020 ~ "2016-20",
TRUE ~ as.character(year)
)) %>%
mutate(year_blck2 = case_when(
year <= 1975 ~ "'70",
year <= 1980 ~ "'76",
year <= 1985 ~ "'81",
year <= 1990 ~ "'86",
year <= 1995 ~ "'91",
year <= 2000 ~ "'96",
year <= 2005 ~ "'01",
year <= 2010 ~ "'06",
year <= 2015 ~ "'11",
year <= 2020 ~ "'16",
TRUE ~ as.character(year)
))
raw_bechdel <- read_csv("raw_bechdel.csv")
same_films <- movies %>%
inner_join(raw_bechdel, by = c("imdb_id" = "imdb_id"))
all_films <- movies %>%
full_join(raw_bechdel, by = c("imdb_id" = "imdb_id"))
One thing to note is that there are two datatsets (one with information beyond just the test, and one with movies over a longer time period). For the movies which are listed in both datasets, the Bechdel Test results are slightly different. I’m not totally sure I understand the rating
value of 0, 1, 2, 3 (is it “and”? is it “or”? is it “at least”?), but I would have expected the following tabulation to line up more cleanly:
same_films %>% select(rating, clean_test) %>% table(useNA = "ifany")
## clean_test
## rating dubious men notalk nowomen ok
## 0 0 0 1 128 0
## 1 1 0 478 6 1
## 2 0 170 9 2 4
## 3 141 24 26 5 797
all_films %>% select(rating, clean_test) %>% table(useNA = "ifany")
## clean_test
## rating dubious men notalk nowomen ok <NA>
## 0 0 0 1 128 0 765
## 1 1 0 478 6 1 1454
## 2 0 170 9 2 4 711
## 3 141 24 26 5 797 4116
## <NA> 0 0 0 0 1 0
And while the right thing to do would be to go look at the individual films to determine the true status of the Bechdel Test, I’m going to move forward with a “3” rating on the large dataset and an “ok” rating on the smaller dataset.
The first thing I want to know is whether the proportion of movies that pass The Bechdel Test has improved over time. Are movie makers getting any better at recognizing how offensive movies are which don’t pass the test? It does seem that over time, the proportion of movies that pass the Bechdel Test is increasing.
all_films %>%
filter(!is.na(binary) & !is.na(year.x)) %>%
group_by(year.x) %>%
summarize(prop_pass = mean(binary == "PASS"), n_film = n()) %>%
ggplot(aes(x = year.x, y = prop_pass)) +
geom_line() +
geom_point(aes(size = n_film))
Let’s go back to the dataset with more variables to see if the test is different across different ratings. Because we are looking across ratings, I grouped the years to smooth out the variability a little bit. It seems like of the PG and R+ movies (which is the majority of the films), there is a slight increase in the proportion of movies that pass the Bechdel Test.
all_films %>%
filter(!is.na(rating) & !is.na(year_blck)) %>%
filter(rated2 != "TV" & !is.na(rated2)) %>%
mutate(rated3 = as.factor(ifelse(rated2 == "R", "R+", ifelse(rated2 == "NC-17", "R+",
ifelse(rated2 == "X", "R+", rated2))))) %>%
mutate(`movie rating` = fct_relevel(rated3, "no rating", "G", "PG", "R+")) %>%
group_by(year_blck, `movie rating`) %>%
summarize(prop_pass = mean(rating == 3), n_film = n()) %>%
ggplot(aes(x = year_blck, y = prop_pass, group = `movie rating`, color = `movie rating`)) +
geom_line() +
geom_point(aes(size = n_film)) +
xlab("year") + ylab("") +
ggtitle("Proportion of films that pass the Bechdel Test, broken down by rating")
What about factoring by genre? Thanks to code from @javendano585 at https://github.com/javendano585/TidyTuesday/tree/main/2021_Week_11 for the tidytext code which allowed the gene token to be pivoted to longer.
all_films %>%
filter(!is.na(genre) & !is.na(year_blck2)) %>%
unnest_tokens(input = genre, output = "genre", "words") %>%
mutate(genre = if_else(genre == "sci", "sci-fi", genre)) %>%
filter(genre != "fi") %>%
group_by(year_blck2, genre) %>%
summarize(prop_pass = mean(rating == 3), n_film = n()) %>%
ggplot(aes(x = year_blck2, y = prop_pass, group = genre, color = genre)) +
geom_line() +
geom_point(aes(size = n_film)) +
xlab("year") + ylab("") +
facet_wrap(~genre) +
ggtitle("Proportion of films that pass the Bechdel Test, faceted by genre")
praise()
## [1] "You are solid!"