Parfumo Fragrance

Author

Jo Hardin

Published

January 7, 2025

library(tidyverse) # ggplot, lubridate, dplyr, stringr, readr...
library(praise)

library(reticulate)
use_python("/Users/jsh04747/miniforge3/bin/python3", required = TRUE)
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

A fantastic gallery of seaborn plots: https://seaborn.pydata.org/examples/index.html

The Data

(The first dataset of 2025 was “bring your own data!”, so I went back and found this one from December 10, 2024: https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-12-10)

This week we’re diving into the fascinating world of fragrances with a dataset sourced from Parfumo, a vibrant community of perfume enthusiasts. Olga G. webscraped these data from the various fragrance sections on the Parfumo website. Here is a description from the author:

This dataset contains detailed information about perfumes sourced from Parfumo, obtained through web scraping. It includes data on perfume ratings, olfactory notes (top, middle, and base notes), perfumers, year of release and other relevant characteristics of the perfumes listed on the Parfumo website.

The data provides a comprehensive look at how various perfumes are rated, which families of scents they belong to, and detailed breakdowns of the key olfactory components that define their overall profile.

parfumo = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-12-10/parfumo_data_clean.csv')

print(parfumo)
      Number  ...                                                URL
0        455  ...  https://www.parfumo.com/Perfumes/Le_Re_Noir/45...
1       0071  ...  https://www.parfumo.com/Perfumes/CB_I_Hate_Per...
2       0154  ...  https://www.parfumo.com/Perfumes/CB_I_Hate_Per...
3       0162  ...  https://www.parfumo.com/Perfumes/CB_I_Hate_Per...
4       0171  ...  https://www.parfumo.com/Perfumes/CB_I_Hate_Per...
...      ...  ...                                                ...
59320    NaN  ...  https://www.parfumo.com/Perfumes/Pascal_Morabi...
59321    NaN  ...  https://www.parfumo.com/Perfumes/Pascal_Morabi...
59322    NaN  ...  https://www.parfumo.com/Perfumes/Pascal_Morabi...
59323    NaN  ...  https://www.parfumo.com/Perfumes/Pascal_Morabi...
59324    NaN  ...  https://www.parfumo.com/Perfumes/Pascal_Morabi...

[59325 rows x 13 columns]

Making a barplot

First, I want to take the top 10 brands and label everything else “other.”

brand_counts = parfumo['Brand'].value_counts()
top_brands = brand_counts.nlargest(10).index


parfumo['TopBrand'] = parfumo['Brand'].apply(lambda x: x if x in top_brands else 'Other')

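# Keep rows with a known release year, from 1900 onward.
parfumo_sub = parfumo[np.isfinite(parfumo['Release_Year'])]
# .copy() so adding the Decade column below does not write to a view of parfumo
parfumo_sub = parfumo_sub[parfumo_sub['Release_Year'] >= 1900].copy()

# Round each year down to the start of its decade (e.g., 1987 -> 1980).
parfumo_sub['Decade'] = (np.floor(parfumo_sub['Release_Year'] / 10) * 10).astype(int)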
plt.figure(figsize=(10,6))

sns.countplot(
    data=parfumo_sub, 
    x="Decade",  hue="TopBrand",
    palette="dark", alpha=.6
)
plt.xlabel('')
plt.ylabel('Number of perfumes released')
plt.title('Most popular brands of perfume, release count per decade')
plt.yscale('log')    
plt.legend(title='Top Brand', loc='upper center', 
           bbox_to_anchor=(0.5, -0.15), ncol=2, frameon=False)
plt.tight_layout()

plt.savefig("parfumo.png", dpi=300, bbox_inches='tight')

plt.show()

Bar plot with decade on the x-axis and the number of perfumes released on the y-axis (log scale). The 10 most popular brands are shown as colored bars, with the vast majority of brands falling into "Other." In more recent decades, there are more perfumes described in the Parfumo database.

Data scraped from Parfumo and filtered to release years 1900 and later. Number of perfumes in the Parfumo database by decade of release.
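The two data-prep steps behind the plot (lumping everything outside the top brands into "Other" and binning release years into decades) are easy to sanity-check on a tiny table. The sketch below is not the code used for the figure: the toy data are made up, and the helper names `lump_top_brands` and `add_decade` are hypothetical. It swaps the `apply` + lambda for the vectorized `Series.where` / `isin` and uses floor division for the decade, which should give the same result under these assumptions.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from the Parfumo dataset).
toy = pd.DataFrame({
    "Brand": ["A", "A", "A", "B", "B", "C", "D"],
    "Release_Year": [1995.0, 2003.0, 2011.0, 1987.0, None, 2019.0, 1899.0],
})

def lump_top_brands(df, n):
    """Hypothetical helper: keep the n most frequent brands, relabel the rest 'Other'."""
    top = df["Brand"].value_counts().nlargest(n).index
    out = df.copy()
    # where() keeps the brand when the condition holds and fills 'Other' otherwise
    out["TopBrand"] = out["Brand"].where(out["Brand"].isin(top), "Other")
    return out

def add_decade(df, min_year=1900):
    """Hypothetical helper: drop missing or pre-min_year rows, then add a Decade column."""
    keep = df["Release_Year"].notna() & (df["Release_Year"] >= min_year)
    out = df[keep].copy()
    # Floor division rounds each year down to the start of its decade
    out["Decade"] = (out["Release_Year"] // 10 * 10).astype(int)
    return out

toy_sub = add_decade(lump_top_brands(toy, n=2))
print(toy_sub[["Brand", "TopBrand", "Decade"]])
```

As in the post, the brands are ranked on the full table before the year filter is applied, so "top" means most frequent overall rather than most frequent since 1900.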
praise()
[1] "You are incredible!"