Oct. 15, 2020
Samuel Mignot

Diachronic Analysis of Billboard Top 100 Songs (1958-2019)


A dataset of Billboard Top 100 songs since 1958 (the chart's advent) was collected, containing year, rank, title, artist, lyrics, and audio features. Lyrics were scraped from genius.com, and audio features were added to the dataset using Spotify's API.

Feature Engineering:

Lyric sentiment was extracted from each song's lyrics using TextBlob. TextBlob divides sentiment into two categories, 'polarity' and 'subjectivity'; both were added to the dataset.

Exploratory Data Analysis:

Various features, averaged across songs, were plotted against year, including modality, energy, danceability, and explicitness. See the section below for a full list.

Data Preprocessing

An sklearn preprocessing pipeline was created to standard-scale numeric features and count-vectorize each song's lyrics. This is a preparatory step for training a Random Forest model to predict release year from lyrics and audio features.


A Random Forest was trained and hyperparameter-tuned with a randomized grid search. I then looked at the model's feature importances to get a sense of the most date-indicative audio features and words.

Bonus Explorations:

I was also interested in looking for songs with opposite valence and lyric sentiment scores: depressing songs with happy lyrics, or happy songs with depressing lyrics. I decided to look into this by finding the songs with the largest difference between lyric sentiment and valence.

I call these songs Sonic Chimeras.

Next Steps:

  • Look at how the racial and gender distribution of Billboard Top 100 artists has evolved.
  • Look at whether specific historical events are latently (or concretely) present in the lyrics or sentiment of songs.
  • Test other models, including RNN variants.

Imports/Constants/Data Instantiation

In [224]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sklearn
import warnings

%matplotlib inline

rc = {'figure.figsize': (8, 8),
      'axes.grid': True,
      'grid.color': '.8',
      'font.size': 13}
plt.rcParams.update(rc)  # apply the rc settings (previously defined but never used)

NUMBER_OF_SONGS = 10  # number of rows to show in the ranking tables below

sns.set_style({'axes.facecolor': 'white', 'grid.color': '.8', 'font.family': 'IBM Plex Mono'})
In [225]:
key_map = {
    0: 'C',
    1: 'C#/Db',
    2: 'D',
    3: 'D#/Eb',
    4: 'E',
    5: 'F',
    6: 'F#/Gb',
    7: 'G',
    8: 'G#/Ab',
    9: 'A',
    10: 'A#/Bb',
    11: 'B',
}
In [226]:
df = pd.read_csv('bb_top_100.csv', index_col=0) 
In [229]:
df.drop(columns=['id', 'title', 'artist', 'lyric_url', 'lyrics', 'spotify_id', 'audio_features', 'uri', 'track_href', 'type'])
      year  rank  title                         artist
0     1980    73  "One Fine Day"                Carole King
1     1963    77  "That Sunday, That Summer"    Nat King Cole
2     1964    69  "Leader of the Pack"          The Shangri-Las
3     1972    78  "Use Me"                      Bill Withers
4     1994    80  "Stay"                        Eternal

5 rows × 33 columns (lyric text, URL, and Spotify metadata columns elided)

Extract Lyric Sentiment

Use TextBlob to extract the polarity of each song's lyrics: positive and negative polarity map to positive and negative lyrics. Provided the data has been properly processed and collected, this should correlate with Spotify's valence and modality audio-feature metrics (which register the sonic positivity or negativity of a song).

In [21]:
from textblob import TextBlob
import math
In [22]:
df['lyric_sent'] = df.c_lyrics.apply(lambda x: TextBlob(x).sentiment if pd.notnull(x) else None)
In [23]:
df[['polarity','subjectivity']] = pd.DataFrame(df.lyric_sent.tolist(), index= df.index)
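The two-column assignment works because TextBlob's Sentiment object is a namedtuple of (polarity, subjectivity). A minimal sketch of the same pattern on plain tuples (toy data, no TextBlob required):

```python
import pandas as pd

# toy stand-in for the lyric_sent column of (polarity, subjectivity) tuples
toy = pd.DataFrame({'lyric_sent': [(0.3, 0.5), (-0.1, 0.6)]})

# expand each tuple into its own columns, aligned on the original index
toy[['polarity', 'subjectivity']] = pd.DataFrame(toy['lyric_sent'].tolist(), index=toy.index)
```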

Correlation Matrix

In [154]:
corr = df[['year', 'loudness', 'danceability', 'valence', 'polarity', 'subjectivity', 'instrumentalness', 'mode', 'acousticness', 'liveness', 'explicit', 'energy', 'duration_ms']].corr()
sns.heatmap(corr, mask=np.triu(np.ones(corr.shape)), linewidths=1.2)
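The mask argument hides the redundant upper half of the symmetric matrix: np.triu builds ones on and above the diagonal, and heatmap suppresses cells where the mask is truthy. A quick sketch of what the mask looks like:

```python
import numpy as np

# ones on and above the diagonal (hidden by heatmap), zeros below (shown)
mask = np.triu(np.ones((3, 3)))
```

Note this also masks the diagonal itself, which only ever holds 1.0 in a correlation matrix.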

Most Common Keys

In [233]:
ax = df['key'].map(key_map).value_counts().plot(kind='bar', figsize=(8,8))
ax.set_title('Most Common Song Keys')
Text(0.5, 1.0, 'Most Common Song Keys')

Major and Minor Tendencies

There is a very noticeable decline, since the late 1950s, in the fraction of major-key songs (though it's hard not to decline from 90+%).

In [113]:
maj = df.groupby('year')['mode'].mean()
ax = sns.scatterplot('year', 'mode', data=maj.to_frame().reset_index())
ax.set_title("Fraction of Billboard Top 100 Songs in Major Key for each Year")
Text(0.5, 1.0, 'Fraction of Billboard Top 100 Songs in Major Key for each Year')
In [112]:
ax = sns.regplot('year', 'mode', data=maj.to_frame().reset_index())
ax.set_title("Fraction of Billboard Top 100 Songs in Major Key for each Year")
Text(0.5, 1.0, 'Fraction of Billboard Top 100 Songs in Major Key for each Year')

Average Valence

Valence is another Spotify metric that captures musical positiveness. In the Spotify documentation, it is defined as follows:

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

In [111]:
val = df.groupby('year')['valence'].mean()
ax = sns.regplot('year', 'valence', data=val.to_frame().reset_index())
ax.set_title("Average Valence of Billboard Top 100 Songs per Year")

Lyric Sentiment

In [116]:
val = df.groupby('year')['polarity'].mean()
ax = sns.regplot('year', 'polarity', data=val.to_frame().reset_index())
ax.set_title("Average Polarity of Billboard Top 100 Songs per Year")
In [144]:
val = df.groupby('year')['subjectivity'].mean()
ax = sns.regplot('year', 'subjectivity', data=val.to_frame().reset_index())
ax.set_title("Average Subjectivity of Billboard Top 100 Songs per Year")


In [147]:
val = df.groupby('year')['danceability'].mean()
ax = sns.regplot('year', 'danceability', data=val.to_frame().reset_index())
ax.set_title("Average Danceability of Billboard Top 100 Songs per Year")

Average Loudness of Billboard Top 100 Songs per Year

Average loudness of songs per year goes noticeably up. However, this could easily result from the evolution of recording technology rather than the actual features of the songs themselves (this, of course, could be a larger problem for all of Spotify's metrics, which I don't believe are controlled for release year).

In [230]:
val = df.groupby('year')['loudness'].mean()
ax = sns.scatterplot('year', 'loudness', data=val.to_frame().reset_index())
ax.set_title("Average Loudness of Billboard Top 100 Songs per Year")


Duration shows an interesting, definitely non-linear trend. A next step would be looking into possible explanations for it.

In [160]:
val = df.groupby('year')['duration_ms'].mean()
ax = sns.regplot('year', 'duration_ms', data=val.to_frame().reset_index())
ax.set_title("Average Duration (in ms) of Billboard Top 100 Songs per Year")


In [163]:
val = df.groupby('year')['speechiness'].mean()
ax = sns.regplot('year', 'speechiness', data=val.to_frame().reset_index())
ax.set_title("Average Speechiness of Billboard Top 100 Songs per Year")


In [159]:
val = df.groupby('year')['acousticness'].mean()
ax = sns.regplot('year', 'acousticness', data=val.to_frame().reset_index())
ax.set_title("Average Acousticness of Billboard Top 100 Songs per Year")


In [157]:
val = df.groupby('year')['instrumentalness'].mean()
ax = sns.regplot('year', 'instrumentalness', data=val.to_frame().reset_index())
ax.set_title("Average Instrumentalness of Billboard Top 100 Songs per Year")


In [158]:
val = df.groupby('year')['energy'].mean()
ax = sns.regplot('year', 'energy', data=val.to_frame().reset_index())
ax.set_title("Average Energy of Billboard Top 100 Songs per Year")


The fraction of explicit songs per year goes noticeably up. However, the early values could result from how recordings were labeled rather than the actual content of the songs themselves.

In [134]:
ax = df.groupby('year')['explicit'].mean().plot()
ax.set_ylabel('Fraction of Explicit Songs per Year')
Text(0, 0.5, 'Fraction of Explicit Songs per Year')
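Since explicit is boolean, the sum-over-count ratio above is simply the per-year mean. A toy check of that equivalence (hypothetical years and flags):

```python
import pandas as pd

toy = pd.DataFrame({'year': [1990, 1990, 1991, 1991],
                    'explicit': [True, False, True, True]})

# mean of a boolean column is the fraction of True values per group
frac = toy.groupby('year')['explicit'].mean()
```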
In [234]:

df[(df.explicit) & (df.year<1990)]
      year  rank  title                     artist
1520  1984    21  "Let's Go Crazy"          Prince and The Revolution
2999  1989    21  "Blame It on the Rain"    Milli Vanilli
5785  1975    29  "Fight the Power"         The Isley Brothers

3 rows × 33 columns (lyric text, URL, and Spotify metadata columns elided)

Create Preprocessing Pipeline

In [31]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer
In [49]:
from sklearn.preprocessing import FunctionTransformer

text_vars = ['c_lyrics']
num_vars = ['polarity', 'subjectivity', 'duration_ms', 'time_signature', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'tempo']

preprocess = ColumnTransformer([
    ('nv', Pipeline([
        ('imp', SimpleImputer()),
        ('ss', StandardScaler())
    ]), num_vars),
    ('tv', Pipeline([
        ('imp', SimpleImputer(strategy='constant', fill_value='None')),
        ('oned', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
        ('count_v', CountVectorizer())
    ]), text_vars)
])

res = preprocess.fit_transform(df)
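CountVectorizer expects a 1-D sequence of strings, but ColumnTransformer hands each sub-pipeline a 2-D (n, 1) column slice; the FunctionTransformer(np.reshape, ...) step flattens it. A minimal sketch isolating that step on toy data (hypothetical lyrics; the imputer is omitted for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

toy = pd.DataFrame({'c_lyrics': ['love me do', 'love love me']})

ct = ColumnTransformer([
    ('tv', Pipeline([
        # flatten the (n, 1) column slice to a 1-D array of strings
        ('oned', FunctionTransformer(np.reshape, kw_args={'newshape': -1})),
        ('count_v', CountVectorizer())
    ]), ['c_lyrics'])
])

X = ct.fit_transform(toy)  # document-term matrix, one row per song
if hasattr(X, 'toarray'):  # output may be sparse depending on density
    X = X.toarray()
```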

Train Random Forest to Predict Release Year

In [138]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
In [35]:
X_train, X_test, y_train, y_test = train_test_split(res, df.year, random_state=14, stratify=df.year, test_size=.2)
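Stratifying on df.year keeps each year's share of songs roughly equal across the train and test splits (which requires at least two songs per year). A toy illustration with two hypothetical years:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.array([1960] * 5 + [1990] * 5)

# stratify=y preserves the 50/50 year balance in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=14)
```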
In [40]:
from joblib import dump, load

MODEL_SAVE_FILE = 'rcv.joblib'

if MODEL_SAVE_FILE not in os.listdir():
    params = {
        'bootstrap': [True, False],
        'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
        'max_features': ['auto', 'sqrt'],
        'min_samples_leaf': [1, 2, 4],
        'min_samples_split': [2, 5, 10],
        'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
    }

    rcv = RandomizedSearchCV(RandomForestRegressor(), params, cv=5, verbose=2)
    rcv.fit(X_train, y_train)
    dump(rcv, MODEL_SAVE_FILE)
else:
    rcv = load(MODEL_SAVE_FILE)
In [41]:
RandomForestRegressor(bootstrap=False, max_depth=40, max_features='sqrt',
In [42]:

Most Important Features for Year Prediction

The function get_feature_names below is adapted from Venkatachalam's answer on Stack Overflow.

In [138]:
def get_feature_names(column_transformer):
    """Get feature names from all transformers.

    Returns
    -------
    feature_names : list of strings
        Names of the features produced by transform.
    """
    def get_names(trans):
        # Reproduce the logic of the built-in get_feature_names
        if trans == 'drop' or (
                hasattr(column, '__len__') and not len(column)):
            return []
        if trans == 'passthrough':
            if hasattr(column_transformer, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    return column
                else:
                    return column_transformer._df_columns[column]
            else:
                indices = np.arange(column_transformer._n_features)
                return ['x%d' % i for i in indices[column]]
        if not hasattr(trans, 'get_feature_names'):
            # Fall back to the input column names when the transformer
            # does not expose get_feature_names
            warnings.warn("Transformer %s (type %s) does not "
                          "provide get_feature_names. "
                          "Will return input column names if available"
                          % (str(name), type(trans).__name__))
            if column is None:
                return []
            else:
                return [name + "__" + f for f in column]

        return [name + "__" + f for f in trans.get_feature_names()]

    feature_names = []

    # Allow transformers to be pipelines; pipeline steps are named differently
    if type(column_transformer) == sklearn.pipeline.Pipeline:
        l_transformers = [(name, trans, None, None) for step, name, trans in column_transformer._iter()]
    else:
        # For column transformers, follow the original method
        l_transformers = list(column_transformer._iter(fitted=True))

    for name, trans, column, _ in l_transformers:
        if type(trans) == sklearn.pipeline.Pipeline:
            # Recursive call on pipeline
            _names = get_feature_names(trans)
            # if pipeline has no transformer that returns names
            if len(_names) == 0:
                _names = [name + "__" + f for f in column]
            feature_names.extend(_names)
        else:
            feature_names.extend(get_names(trans))

    return feature_names
In [139]:
sorted(zip(get_feature_names(preprocess), rcv.best_estimator_.feature_importances_), key=lambda x: x[1],reverse=True)[:50]
NameError                      Traceback (most recent call last)
<ipython-input-139-efa78ccdbf1a> in <module>
----> 1 sorted(zip(get_feature_names(preprocess), rcv.best_estimator_.feature_importances_), key=lambda x: x[1],reverse=True)[:50]

NameError: name 'preprocess' is not defined
In [43]:
from sklearn.metrics import mean_squared_error
In [44]:
np.sqrt(mean_squared_error(rcv.best_estimator_.predict(X_train), y_train))
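Note that the line above reports training-set RMSE; held-out RMSE would use X_test and y_test in the same pattern. A self-contained toy check of the RMSE arithmetic (hypothetical predicted and true years):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1960, 1975, 1990, 2005])
y_pred = np.array([1962, 1970, 1992, 2001])

# errors are 2, -5, 2, -4 -> mean squared error 49/4, RMSE 3.5
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```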

Identifying Sonic Chimeras

Since this dataset contains lyrical and musical sentiment indicators (polarity for lyrics, valence/modality for music), it would be fun to find contradictory songs: ones that are musically optimistic and lyrically sad, or vice versa. There are a couple of potential ways to accomplish this:

  1. Looking at Minor/Major Songs with Positive and Negative polarity (respectively)
  2. Looking at songs with the greatest difference between polarity and valence (the scales are comparable, though note that valence lies in [0, 1] while polarity lies in [-1, 1])
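The second approach can be sketched on toy data (hypothetical titles and scores; the difference column mirrors the valp column computed below):

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Happy Sound, Sad Words', 'Plain Song'],
    'valence': [0.9, 0.5],
    'polarity': [-0.6, 0.4],
})

# absolute gap between musical and lyrical sentiment; larger = more chimeric
toy['valp'] = (toy['valence'] - toy['polarity']).abs()
chimeras = toy.sort_values('valp', ascending=False)
```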

Minor Songs with the Happiest Lyrics

In [216]:
df.loc[(df['mode']==0.0)].sort_values('polarity', ascending=False).iloc[:NUMBER_OF_SONGS][['title', 'artist', 'valence', 'polarity', 'val-p']]
title artist valence polarity val-p
2968 "Up, Up and Away" The 5th Dimension 0.515 0.805000 0.441480
5264 "Fly, Robin, Fly" Silver Convention 0.939 0.800000 0.014372
4618 "Best of You" Foo Fighters 0.369 0.779530 0.571645
4421 "We Are Family" Sister Sledge 0.819 0.700000 0.072200
1551 "Beautiful" Akon featuring Colby O'Donis and Kardinal Offi... 0.614 0.632569 0.235277
4308 "Daddy's Home" Jermaine Jackson 0.604 0.625000 0.240572
1385 "Mi Gente" J Balvin and Willy William featuring Beyoncé 0.469 0.625000 0.375572
1384 "Mi Gente" J Balvin and Willy William featuring Beyoncé 0.469 0.625000 0.375572
5155 "Beautiful Life" Ace of Base 0.749 0.621605 0.093461
3534 "I'll Be Good to You" The Brothers Johnson 0.930 0.619559 0.088811

Major Songs with the Saddest Lyrics

In [217]:
df.loc[(df['mode']==1.0)].sort_values('polarity', ascending=True).iloc[:NUMBER_OF_SONGS][['title', 'artist', 'valence', 'polarity', 'val-p']]
title artist valence polarity val-p
1526 "Music" Madonna 0.871 -0.675000 0.834655
2551 "Bad Moon Rising" Creedence Clearwater Revival 0.942 -0.675000 0.905655
815 "Everything About You" Ugly Kid Joe 0.738 -0.633212 0.675674
432 "Insane in the Brain" Cypress Hill 0.767 -0.613137 0.692193
3409 "Bad Boys" (theme from Cops) Inner Circle 0.533 -0.594397 0.446542
4923 "Shake It Off" Taylor Swift 0.943 -0.507692 0.802637
1957 "Shake It Off" Taylor Swift 0.943 -0.480196 0.785543
3715 "Don't Call Us, We'll Call You" Sugarloaf 0.760 -0.475000 0.599312
2479 "Crazy" K-Ci & JoJo 0.448 -0.474432 0.286959
1919 "Jump" Van Halen 0.796 -0.451229 0.620533

Songs With the Greatest Difference Between Valence and Polarity

In [218]:
df['valp'] = abs(df['valence']-df['polarity'])
In [219]:
df.sort_values('valp', ascending=False).iloc[:NUMBER_OF_SONGS][['title', 'artist', 'valence', 'polarity', 'valp']]
title artist valence polarity valp
2551 "Bad Moon Rising" Creedence Clearwater Revival 0.942 -0.675000 1.617000
1526 "Music" Madonna 0.871 -0.675000 1.546000
2438 "Cruel Summer" Bananarama 0.936 -0.569234 1.505234
2437 "Cruel Summer" Ace of Base 0.882 -0.586741 1.468741
4923 "Shake It Off" Taylor Swift 0.943 -0.507692 1.450692
1957 "Shake It Off" Taylor Swift 0.943 -0.480196 1.423196
432 "Insane in the Brain" Cypress Hill 0.767 -0.613137 1.380137
324 "Another Saturday Night" Sam Cooke 0.969 -0.402778 1.371778
1013 "Bad Boy" Miami Sound Machine 0.858 -0.513571 1.371571
815 "Everything About You" Ugly Kid Joe 0.738 -0.633212 1.371212