The Data

This week's data comes from Data.Europa, hat tip to Data is Plural.

Wimdu wrote a short blog post on the most popular ERASMUS destinations.

library(tidyverse)
library(countrycode)

# Translate ISO2 country codes into country names.
# "EL", "UK", and "XK" are not matched by countrycode's 'iso2c' scheme,
# so they come back as NA and are filled in by hand.
erasmus <- read_csv("erasmus.csv") %>%
  mutate(send = countrycode(sending_country_code,
                            origin = 'iso2c',
                            destination = 'country.name')) %>%
  mutate(send = case_when(
    sending_country_code == "EL" ~ "Greece",
    sending_country_code == "UK" ~ "UK",
    sending_country_code == "XK" ~ "Kosovo",
    TRUE ~ send
  )) %>%
  mutate(receive = countrycode(receiving_country_code,
                               origin = 'iso2c',
                               destination = 'country.name')) %>%
  mutate(receive = case_when(
    receiving_country_code == "EL" ~ "Greece",
    receiving_country_code == "UK" ~ "UK",
    receiving_country_code == "XK" ~ "Kosovo",
    TRUE ~ receive
  ))

mobility <- read_csv("mobility.csv")

Wrangling

erasmus_country <- erasmus %>%
  select(academic_year, participant_gender, send, receive,
         participants, special_needs) %>%
  pivot_longer(cols = send:receive, names_to = "how", values_to = "country") %>%
  group_by(country, how, special_needs, participant_gender, academic_year) %>%
  summarize(total = sum(participants), .groups = "drop")

I was hoping to do more with this plot (including ordering, segmenting the barplots, etc.). Alas, gotta be done for now!

erasmus_country %>%
  filter(total > 100) %>%
  mutate(total = ifelse(how == "send", -total, total)) %>%
  ggplot() + 
  geom_col(aes(x = country, y = total, fill = participant_gender)) + 
  geom_hline(yintercept = 0) + 
  coord_flip() + 
  scale_fill_viridis_d()
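
As a rough idea of how the ordering could go, here is a minimal sketch on a made-up toy data frame (the numbers and the `toy` / `toy_ordered` names are hypothetical, not from the ERASMUS data). It assumes that ordering countries by their combined sent and received totals is what we'd want, and uses base R's `reorder()` to set the factor levels.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for `erasmus_country`
toy <- data.frame(
  country = c("Spain", "Spain", "Germany", "Germany", "Italy", "Italy"),
  how     = c("send", "receive", "send", "receive", "send", "receive"),
  total   = c(300, 500, 200, 100, 50, 60)
)

toy_ordered <- toy %>%
  # flip the sign for sending counts, as in the plot above
  mutate(total = ifelse(how == "send", -total, total),
         # order countries by combined (absolute) sent + received totals
         country = reorder(country, abs(total), FUN = sum))

# Build the plot; with coord_flip() the largest country ends up at the top
p <- ggplot(toy_ordered) +
  geom_col(aes(x = country, y = total, fill = how)) +
  geom_hline(yintercept = 0) +
  coord_flip()

levels(toy_ordered$country)
```

Ordering by the absolute total is just one choice; ordering by net flow (received minus sent) would only need `abs(total)` swapped for `total`.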

There are some odd data characteristics: how can a participant's age be -7184 years? Or 1049 years?

erasmus %>%
  select(participant_age) %>%
  summary()
##  participant_age   
##  Min.   :-7184.00  
##  1st Qu.:   17.00  
##  Median :   21.00  
##  Mean   :   24.54  
##  3rd Qu.:   28.00  
##  Max.   : 1049.00
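
One way to deal with ages like these would be to flag anything outside a plausible range before summarizing. This is only a sketch on a toy data frame (the `toy` values are made up), and the 0 to 120 window is my own assumption, not something the dataset documents.

```r
library(dplyr)

# Hypothetical toy data standing in for `erasmus`
toy <- data.frame(participant_age = c(-7184, 17, 21, 28, 1049))

toy_flagged <- toy %>%
  # assumed plausibility window: 0 to 120 years (inclusive)
  mutate(age_plausible = between(participant_age, 0, 120))

# How many rows fall inside / outside the window?
toy_flagged %>%
  count(age_plausible)

# Summary restricted to the plausible ages
toy_flagged %>%
  filter(age_plausible) %>%
  select(participant_age) %>%
  summary()
```

Flagging instead of dropping keeps the odd rows around, in case they turn out to be a systematic data-entry problem worth looking at.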

I’m also not totally sure what a single row represents. It seems to be a particular combination of demographic characteristics. But are there really 17 participants with identical demographic characteristics, all making up a single row?

erasmus %>%
  filter(special_needs == "Yes") %>%
  select(special_needs, participants) %>%
  table()
##              participants
## special_needs    1    2    3    4    5    6    7    8    9   17
##           Yes 1956  351  114   60   31   15    3    4    1    1
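
A possible check on what a row represents: if each row really were one unique combination of characteristics, no combination should show up in more than one row. Here is a minimal sketch of that check on a made-up toy data frame. The column names follow the ones used above, but the values are hypothetical, and the choice of which columns count as "the demographic characteristics" is my assumption.

```r
library(dplyr)

# Hypothetical toy data; the first two rows share every characteristic
toy <- data.frame(
  academic_year          = c("2018-2019", "2018-2019", "2019-2020"),
  sending_country_code   = c("DE", "DE", "FR"),
  receiving_country_code = c("ES", "ES", "IT"),
  participant_gender     = c("Female", "Female", "Male"),
  participant_age        = c(21, 21, 23),
  special_needs          = c("No", "No", "Yes"),
  participants           = c(1, 3, 17)
)

# Count rows per combination and keep the combinations that repeat
dupes <- toy %>%
  count(academic_year, sending_country_code, receiving_country_code,
        participant_gender, participant_age, special_needs,
        name = "n_rows") %>%
  filter(n_rows > 1)

dupes
```

If `dupes` comes back non-empty on the real data, a row is probably something finer than a demographic combination (for example, one project or one mobility period), which I have not verified.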