Feature engineering

In this step we extract new features that are relevant to the problem. For the Spotify data set, three additional features (artist genres, artist popularity, and song lyrics) are extracted from the Spotify API and the Genius API. Two scripts were developed for the data extraction.

  1. Extract artist genres and popularity using the Spotify API (Java)

  2. Extract song lyrics using the self-developed package pylyrics2 (Python); a demo can be found in the appendix

import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer

Reading the data CSV

Read in the data CSV and store it as a pandas dataframe named spotify_df.

spotify_df = pd.read_csv('data/spotify_data.csv', index_col=0)
spotify_df.head(6)
acousticness danceability duration_ms energy instrumentalness key liveness loudness mode speechiness tempo time_signature valence target song_title artist
0 0.01020 0.833 204600 0.434 0.021900 2 0.1650 -8.795 1 0.4310 150.062 4.0 0.286 1 Mask Off Future
1 0.19900 0.743 326933 0.359 0.006110 1 0.1370 -10.401 1 0.0794 160.083 4.0 0.588 1 Redbone Childish Gambino
2 0.03440 0.838 185707 0.412 0.000234 2 0.1590 -7.148 1 0.2890 75.044 4.0 0.173 1 Xanny Family Future
3 0.60400 0.494 199413 0.338 0.510000 5 0.0922 -15.236 1 0.0261 86.468 4.0 0.230 1 Master Of None Beach House
4 0.18000 0.678 392893 0.561 0.512000 5 0.4390 -11.648 0 0.0694 174.004 4.0 0.904 1 Parallel Lines Junior Boys
5 0.00479 0.804 251333 0.560 0.000000 8 0.1640 -6.682 1 0.1850 85.023 4.0 0.264 1 Sneakin’ Drake

Artist Information

The genres and popularity columns are extracted from the Spotify API and describe the artist of each song. The table is in long format, with one row per artist and genre.

artist_df = pd.read_csv('data/artist_info.csv', index_col=0)
artist_df.head(6)
artist_id               genres            name    popularity
1RyvyyTE3xzB2ZywiAwp0i  atl hip hop       Future  91
1RyvyyTE3xzB2ZywiAwp0i  hip hop           Future  91
1RyvyyTE3xzB2ZywiAwp0i  pop rap           Future  91
1RyvyyTE3xzB2ZywiAwp0i  rap               Future  91
1RyvyyTE3xzB2ZywiAwp0i  southern hip hop  Future  91
1RyvyyTE3xzB2ZywiAwp0i  trap              Future  91
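
The extraction script itself is not shown in this section. As a rough sketch of how a long table like the one above could be built, assume each artist record arrives with its genres as a list. The `artist_records` structure and its field layout are assumptions made for illustration; the values are copied from the table above.

```python
import pandas as pd

# Hypothetical artist record; the field layout is an assumption,
# the values mirror the rows shown in the table above.
artist_records = [
    {
        "artist_id": "1RyvyyTE3xzB2ZywiAwp0i",
        "name": "Future",
        "popularity": 91,
        "genres": ["atl hip hop", "hip hop", "pop rap", "rap",
                   "southern hip hop", "trap"],
    },
]

# One row per artist, with the whole genre list in a single cell.
wide = pd.DataFrame(artist_records)

# explode() repeats the artist's other fields once per genre in the list.
long_df = (
    wide.explode("genres")
    .set_index("artist_id")[["genres", "name", "popularity"]]
)
print(long_df)
```

An artist with an empty genre list would come out of `explode` as a single row with a missing genre, so such rows may need to be dropped or labelled before pivoting.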

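Pivot the artist table so that each artist is a row and each genre is a column. Each cell counts how many times that artist appears with that genre; combinations that never occur are missing after the pivot and are filled with 0 below.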

artist_df_pivot = (
    artist_df.pivot_table(
        index="name",
        columns="genres",
        values="popularity",
        aggfunc="count",  # number of rows per (artist, genre) pair
    )
    .add_prefix("genres_")
    .reset_index()
)

# Genres an artist does not have come out of the pivot as NaN; treat them as 0.
artist_df_pivot = artist_df_pivot.fillna(0)
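
To make the effect of the pivot concrete, here is a small sketch on made-up data. The artist names, genres, and popularity values in `toy` are invented for illustration; only the chain of pandas calls mirrors the cell above.

```python
import pandas as pd

# Made-up long-format artist table: one row per (artist, genre) pair.
toy = pd.DataFrame(
    {
        "name": ["Artist A", "Artist A", "Artist B"],
        "genres": ["rap", "trap", "dream pop"],
        "popularity": [90, 90, 60],
    }
)

toy_pivot = (
    toy.pivot_table(
        index="name", columns="genres", values="popularity", aggfunc="count"
    )
    .add_prefix("genres_")
    .fillna(0)      # absent (artist, genre) pairs become 0
    .astype(int)    # counts are whole numbers
    .reset_index()
)
print(toy_pivot)
```

Each artist ends up with one row and one indicator-like column per genre. Because the pivot is keyed on `name`, two different artists who share a name would be counted together; keying on the artist ID instead is one way to avoid that.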

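Join the pivoted artist table to the original table on the artist name. Note that merge performs an inner join by default, so songs whose artist is missing from the artist table are dropped.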

spotify_df = spotify_df.merge(artist_df_pivot, left_on='artist', right_on='name')
spotify_df = spotify_df.drop(['name'], axis=1)
spotify_df.head(6)
acousticness danceability duration_ms energy instrumentalness key liveness loudness mode speechiness ... genres_vocaloid genres_west coast rap genres_west coast trap genres_wonky genres_worcester ma indie genres_world genres_world fusion genres_world worship genres_worship genres_zolo
0 0.0102 0.833 204600 0.434 0.021900 2 0.1650 -8.795 1 0.4310 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0344 0.838 185707 0.412 0.000234 2 0.1590 -7.148 1 0.2890 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.1850 0.704 282253 0.431 0.097200 8 0.2490 -7.893 1 0.1310 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0796 0.868 342773 0.627 0.000000 1 0.0983 -4.843 0 0.1160 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.2180 0.793 194240 0.607 0.000005 4 0.3480 -6.488 0 0.0821 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.3740 0.772 167158 0.612 0.000002 7 0.1080 -7.274 0 0.2540 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

6 rows × 877 columns
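
Since an inner join drops unmatched songs without warning, it can be worth checking which artists fail to match before committing to it. The sketch below uses made-up frames (`songs` and `genres` are hypothetical stand-ins for `spotify_df` and `artist_df_pivot`) and the `indicator` option of `merge`.

```python
import pandas as pd

# Made-up stand-ins for the song table and the pivoted artist table.
songs = pd.DataFrame(
    {
        "song_title": ["Song 1", "Song 2", "Song 3"],
        "artist": ["Artist A", "Artist B", "Artist C"],
    }
)
genres = pd.DataFrame({"name": ["Artist A", "Artist B"], "genres_rap": [1, 0]})

# A left join with indicator=True adds a `_merge` column that records
# whether each song found a matching artist row.
checked = songs.merge(
    genres, left_on="artist", right_on="name", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "artist"].tolist()
print(unmatched)  # artists that an inner join would drop
```

If the list is empty, the inner join loses nothing; otherwise the missing artists can be looked up again or the join switched to `how="left"`.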

Song Information

The lyrics column is extracted with pylyrics2 and holds the lyrics of the corresponding song.
A Python script was developed for scraping the lyrics.

lyrics_df = pd.read_csv('data/lyrics_info_clean.csv', index_col=0)
lyrics_df.head(6)
artist lyrics
song_title
Mask Off Future call call hendrix promise swear swear heard s...
Xanny Family Future three exotic broads got em soakin panties tol...
Blood On the Money Future gave lil trotty thangs nigga walk give friend...
Move That Dope Future real dope dealers real haha hahaha young nigg...
Blow a Bag Future yeah woke feeling like fucking paper freeband...
Lay Up Future beast mode zaytoven fuck bitch lay fuck bitch...
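
The cleaning that produced lyrics_info_clean.csv is not shown in this section. The lyrics above look lowercased, with punctuation and common stop words removed, so a minimal sketch of such a step might look like the following. The function `clean_lyrics`, the `STOP_WORDS` set, and the handling of bracketed section tags are all assumptions, not the actual script.

```python
import re

# Hypothetical stop-word list; a real pipeline would likely use a fuller one.
STOP_WORDS = {"i", "the", "a", "you", "and", "to", "it", "me", "my"}


def clean_lyrics(text: str) -> str:
    """Lowercase, strip tags and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", " ", text)   # drop section tags such as [Chorus]
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_lyrics("[Chorus]\nCall it, call it! I promise, I swear."))
# call call promise swear
```

Replacing punctuation with a space splits contractions into fragments (for example "don't" becomes "don t"), which would be consistent with the stray single letters visible in the table above.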

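Join the lyrics with the dataframe. Because song_title is the index of lyrics_df rather than a column, the only column the two tables share is artist, so this merge matches on artist alone: every song is paired with every lyrics row of the same artist, which is why the audio features repeat across rows in the output. Resetting the index of lyrics_df and merging on both song_title and artist would give one lyrics row per song.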

spotify_df = spotify_df.merge(lyrics_df)
spotify_df.head(6)
acousticness danceability duration_ms energy instrumentalness key liveness loudness mode speechiness ... genres_west coast rap genres_west coast trap genres_wonky genres_worcester ma indie genres_world genres_world fusion genres_world worship genres_worship genres_zolo lyrics
0 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 call call hendrix promise swear swear heard s...
1 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 three exotic broads got em soakin panties tol...
2 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 gave lil trotty thangs nigga walk give friend...
3 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 real dope dealers real haha hahaha young nigg...
4 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 yeah woke feeling like fucking paper freeband...
5 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 beast mode zaytoven fuck bitch lay fuck bitch...

6 rows × 878 columns
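
The repeated audio features in the output above are what a merge on artist alone produces. The sketch below reproduces the effect on made-up data and shows one possible alternative that matches on both song title and artist. The frames `songs` and `lyrics` are hypothetical stand-ins, and the keyed merge is a suggestion rather than what the notebook ran.

```python
import pandas as pd

# Made-up stand-ins: two songs by the same artist.
songs = pd.DataFrame(
    {
        "song_title": ["Song 1", "Song 2"],
        "artist": ["Artist A", "Artist A"],
        "energy": [0.4, 0.6],
    }
)
# As in lyrics_df, song_title is the index, not a column.
lyrics = pd.DataFrame(
    {"artist": ["Artist A", "Artist A"], "lyrics": ["first lyric", "second lyric"]},
    index=pd.Index(["Song 1", "Song 2"], name="song_title"),
)

# With no keys given, merge uses the shared columns, here only `artist`,
# so each song is paired with every lyric of that artist.
by_artist = songs.merge(lyrics)

# Matching on both keys keeps one row per song; validate raises an error
# if either side has duplicate (song_title, artist) pairs.
by_song = songs.merge(
    lyrics.reset_index(),
    on=["song_title", "artist"],
    how="left",
    validate="one_to_one",
)
print(len(by_artist), len(by_song))  # 4 2
```

Using `how="left"` keeps songs without lyrics (with a missing `lyrics` value), which fits the later step that skips rows lacking lyrics before encoding.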

Lyrics analysis (NLP)

We use the paraphrase-distilroberta-base-v1 model from sentence-transformers to map each song's lyrics to a 768-dimensional dense vector. Rows with missing lyrics are skipped.

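# Keep only rows that have lyrics; the encoder cannot handle missing values.
spotify_df_dropna = spotify_df.dropna(subset=["lyrics"])
embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")
# One 768-dimensional vector per song, returned as a 2-D array.
emb_sents = embedder.encode(spotify_df_dropna["lyrics"].to_list())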

Join the embeddings back to the original dataframe on the row index. Rows without lyrics get NaN in the embedding columns.

emb_sent_df = pd.DataFrame(emb_sents, index=spotify_df_dropna.index).add_prefix('emb_sent_')
spotify_df = spotify_df.join(emb_sent_df)
spotify_df.head(6)
acousticness danceability duration_ms energy instrumentalness key liveness loudness mode speechiness ... emb_sent_758 emb_sent_759 emb_sent_760 emb_sent_761 emb_sent_762 emb_sent_763 emb_sent_764 emb_sent_765 emb_sent_766 emb_sent_767
0 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... -0.043803 -0.447969 0.751878 -0.143682 -0.581486 -0.016169 0.619145 0.287845 -0.234614 -0.471101
1 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... -0.242343 0.043233 0.199484 0.161576 -0.384538 0.126095 0.462813 0.112398 0.002923 -0.256447
2 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... -0.316598 -0.339585 0.321482 0.280396 -0.003586 0.132081 -0.445193 0.301628 -0.287356 -0.504784
3 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... 0.087176 -0.227342 0.278918 0.479892 -0.431301 0.224406 1.213418 -0.474493 -0.316002 -0.365130
4 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... 0.117087 -0.418446 -0.040964 0.428880 -0.310667 -0.276378 0.260933 -0.361207 -0.267050 -0.332415
5 0.0102 0.833 204600 0.434 0.0219 2 0.165 -8.795 1 0.431 ... -0.182009 -0.468775 0.299990 0.213028 0.011709 -0.149237 0.507415 0.057419 -0.013157 -0.161389

6 rows × 1646 columns
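
The index alignment in this step is easy to get wrong, so here is a small sketch of the same pattern that runs without downloading the model. The array `fake_emb` is a random stand-in for the encoder output, and the 4-dimensional size and the toy `df` are assumptions chosen to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy frame where the middle row has no lyrics.
df = pd.DataFrame({"lyrics": ["some words", None, "more words"]})
has_lyrics = df.dropna(subset=["lyrics"])

# Stand-in for embedder.encode(...): one row of floats per kept song.
rng = np.random.default_rng(0)
fake_emb = rng.normal(size=(len(has_lyrics), 4))

# Reusing the filtered index ties each vector to its original row.
emb_df = pd.DataFrame(fake_emb, index=has_lyrics.index).add_prefix("emb_sent_")
joined = df.join(emb_df)
print(joined)
```

Building `emb_df` without `index=has_lyrics.index` would give it a fresh 0..n-1 index, and the join would then attach vectors to the wrong rows whenever any lyrics are missing.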

Export CSV

Export the new CSV with the additional features for the downstream machine learning steps.

spotify_df.to_csv('data/spotify_df_processed.csv', index=False)
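
One detail to keep in mind when loading this file later: it is written with index=False, so it has no index column. The sketch below shows the round trip on a toy frame, using an in-memory buffer in place of the file on disk; the frame `df` and its values are made up for illustration.

```python
import io

import pandas as pd

# Toy frame with a non-default index, as left behind by earlier filtering.
df = pd.DataFrame({"energy": [0.434, 0.359], "target": [1, 1]}, index=[10, 20])

# Write without the index, as in the export above.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read back without index_col; passing index_col=0 here would turn the
# first feature column into the index.
restored = pd.read_csv(buffer)
print(restored)
```

The restored frame gets a fresh 0-based index and keeps every feature column, unlike the input files above, which were read with `index_col=0`.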