Feature engineering
In this step we extract new features that are relevant to the problem. For the Spotify data set, three additional features (artist genres, artist popularity, and song lyrics) are extracted from the Spotify API and the Genius API. Two scripts were developed for the data extraction:
Extract artist genres using the Spotify API (Java): Here
Extract song lyrics using the self-developed package pylyrics2 (Python): Here; a demo can be found in the appendix
import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer
Reading the data CSV
Read the data CSV into a pandas dataframe named spotify_df, using the first column as the index.
spotify_df = pd.read_csv('data/spotify_data.csv', index_col = 0 )
spotify_df.head(6)
| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | target | song_title | artist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01020 | 0.833 | 204600 | 0.434 | 0.021900 | 2 | 0.1650 | -8.795 | 1 | 0.4310 | 150.062 | 4.0 | 0.286 | 1 | Mask Off | Future |
| 1 | 0.19900 | 0.743 | 326933 | 0.359 | 0.006110 | 1 | 0.1370 | -10.401 | 1 | 0.0794 | 160.083 | 4.0 | 0.588 | 1 | Redbone | Childish Gambino |
| 2 | 0.03440 | 0.838 | 185707 | 0.412 | 0.000234 | 2 | 0.1590 | -7.148 | 1 | 0.2890 | 75.044 | 4.0 | 0.173 | 1 | Xanny Family | Future |
| 3 | 0.60400 | 0.494 | 199413 | 0.338 | 0.510000 | 5 | 0.0922 | -15.236 | 1 | 0.0261 | 86.468 | 4.0 | 0.230 | 1 | Master Of None | Beach House |
| 4 | 0.18000 | 0.678 | 392893 | 0.561 | 0.512000 | 5 | 0.4390 | -11.648 | 0 | 0.0694 | 174.004 | 4.0 | 0.904 | 1 | Parallel Lines | Junior Boys |
| 5 | 0.00479 | 0.804 | 251333 | 0.560 | 0.000000 | 8 | 0.1640 | -6.682 | 1 | 0.1850 | 85.023 | 4.0 | 0.264 | 1 | Sneakin’ | Drake |
Artist Information
genres and popularity are extracted from the Spotify API and describe the artist of each song. The table is in long format: an artist with several genres appears on one row per genre.
artist_df = pd.read_csv('data/artist_info.csv', index_col = 0 )
artist_df.head(6)
| artist_id | genres | name | popularity |
|---|---|---|---|
| 1RyvyyTE3xzB2ZywiAwp0i | atl hip hop | Future | 91 |
| 1RyvyyTE3xzB2ZywiAwp0i | hip hop | Future | 91 |
| 1RyvyyTE3xzB2ZywiAwp0i | pop rap | Future | 91 |
| 1RyvyyTE3xzB2ZywiAwp0i | rap | Future | 91 |
| 1RyvyyTE3xzB2ZywiAwp0i | southern hip hop | Future | 91 |
| 1RyvyyTE3xzB2ZywiAwp0i | trap | Future | 91 |
Pivot the artist table so that each artist is a row and each genre is a column. Each cell counts the rows for that artist and genre pair, and pairs that never occur are filled with 0.
artist_df_pivot = (
    artist_df.pivot_table(
        index="name",
        columns="genres",
        values="popularity",
        # aggfunc=lambda x: len(x.unique()),
        aggfunc="count",
    )
    .add_prefix("genres_")
    .reset_index()
)
artist_df_pivot.fillna(0, inplace=True)
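As a small illustration of the pivot step, the sketch below applies the same `pivot_table` call to a hypothetical toy table (the artist names and values are made up for the example and are not from the data set). Each artist and genre pair becomes a count in a `genres_`-prefixed column, and missing pairs become 0 after `fillna`.

```python
import pandas as pd

# Hypothetical toy data in the same long format as the artist table:
# one row per (artist, genre) pair.
toy_artist_df = pd.DataFrame(
    {
        "genres": ["rap", "trap", "dream pop"],
        "name": ["Artist A", "Artist A", "Artist B"],
        "popularity": [91, 91, 70],
    }
)

toy_pivot = (
    toy_artist_df.pivot_table(
        index="name",         # one row per artist
        columns="genres",     # one column per genre
        values="popularity",  # any non-null column works as the thing to count
        aggfunc="count",
    )
    .add_prefix("genres_")
    .reset_index()
    .fillna(0)                # artist/genre pairs that never occur
)

print(toy_pivot)
```

The result has one `genres_...` indicator column per distinct genre, which is why the real table grows to several hundred columns after the join.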
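Join the pivoted artist table to the original table on the artist name. merge performs an inner join by default, so songs whose artist has no row in artist_df_pivot are dropped.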
spotify_df = spotify_df.merge(artist_df_pivot, left_on='artist', right_on='name')
spotify_df = spotify_df.drop(['name'], axis=1)
spotify_df.head(6)
| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | ... | genres_vocaloid | genres_west coast rap | genres_west coast trap | genres_wonky | genres_worcester ma indie | genres_world | genres_world fusion | genres_world worship | genres_worship | genres_zolo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0102 | 0.833 | 204600 | 0.434 | 0.021900 | 2 | 0.1650 | -8.795 | 1 | 0.4310 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0344 | 0.838 | 185707 | 0.412 | 0.000234 | 2 | 0.1590 | -7.148 | 1 | 0.2890 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.1850 | 0.704 | 282253 | 0.431 | 0.097200 | 8 | 0.2490 | -7.893 | 1 | 0.1310 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0796 | 0.868 | 342773 | 0.627 | 0.000000 | 1 | 0.0983 | -4.843 | 0 | 0.1160 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.2180 | 0.793 | 194240 | 0.607 | 0.000005 | 4 | 0.3480 | -6.488 | 0 | 0.0821 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.3740 | 0.772 | 167158 | 0.612 | 0.000002 | 7 | 0.1080 | -7.274 | 0 | 0.2540 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
6 rows × 877 columns
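Because the join above relies on pandas' default inner merge, it can be useful to check how many songs have no matching artist row. The following is a minimal sketch on hypothetical toy frames (the names `songs`, `artists`, and the values are assumptions for illustration), using the `indicator` option of `merge`:

```python
import pandas as pd

# Hypothetical toy frames standing in for spotify_df and artist_df_pivot
songs = pd.DataFrame(
    {
        "song_title": ["S1", "S2", "S3"],
        "artist": ["Artist A", "Artist B", "Artist C"],
    }
)
artists = pd.DataFrame({"name": ["Artist A", "Artist B"], "genres_rap": [1.0, 0.0]})

# A left join with indicator=True adds a "_merge" column that records
# whether each song found a matching artist row.
checked = songs.merge(
    artists, left_on="artist", right_on="name", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "artist"].tolist()

# The default (inner) join silently drops the unmatched song.
inner = songs.merge(artists, left_on="artist", right_on="name")

print("unmatched artists:", unmatched)
print("rows kept by inner join:", len(inner), "of", len(songs))
```

Whether dropping unmatched songs is acceptable depends on the downstream model; the check only makes the loss visible.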
Song Information
lyrics is extracted with the pylyrics2 package and contains the lyrics of the corresponding song.
A Python script was developed for scraping the lyrics: here
lyrics_df = pd.read_csv('data/lyrics_info_clean.csv', index_col = 0 )
lyrics_df.head(6)
| song_title | artist | lyrics |
|---|---|---|
| Mask Off | Future | call call hendrix promise swear swear heard s... |
| Xanny Family | Future | three exotic broads got em soakin panties tol... |
| Blood On the Money | Future | gave lil trotty thangs nigga walk give friend... |
| Move That Dope | Future | real dope dealers real haha hahaha young nigg... |
| Blow a Bag | Future | yeah woke feeling like fucking paper freeband... |
| Lay Up | Future | beast mode zaytoven fuck bitch lay fuck bitch... |
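Join the lyrics with the dataframe. Note that lyrics_df is indexed by song_title, so the only column the two tables share is artist, and merge joins on artist alone. Each song is therefore paired with every lyric by the same artist, which is why the audio features of the first song repeat across the rows below while the lyrics differ. Joining on both song_title and artist, for example after lyrics_df.reset_index(), would give one lyric per song.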
spotify_df = spotify_df.merge(lyrics_df)
spotify_df.head(6)
| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | ... | genres_west coast rap | genres_west coast trap | genres_wonky | genres_worcester ma indie | genres_world | genres_world fusion | genres_world worship | genres_worship | genres_zolo | lyrics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | call call hendrix promise swear swear heard s... |
| 1 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | three exotic broads got em soakin panties tol... |
| 2 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | gave lil trotty thangs nigga walk give friend... |
| 3 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | real dope dealers real haha hahaha young nigg... |
| 4 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | yeah woke feeling like fucking paper freeband... |
| 5 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | beast mode zaytoven fuck bitch lay fuck bitch... |
6 rows × 878 columns
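The repeated audio features in the table above come from which columns the merge uses as keys. The sketch below contrasts the two behaviours on hypothetical toy frames (all names and values are made up). The choice of `how="left"` in the second merge is an assumption, made here so that songs without lyrics are kept with NaN rather than dropped.

```python
import pandas as pd

# Hypothetical toy frames: songs has song_title as a column,
# lyrics has song_title as its index (as lyrics_df does).
songs = pd.DataFrame(
    {
        "song_title": ["Song 1", "Song 2", "Song 3"],
        "artist": ["Artist A", "Artist A", "Artist B"],
        "energy": [0.4, 0.6, 0.5],
    }
)
lyrics = pd.DataFrame(
    {
        "artist": ["Artist A", "Artist A"],
        "lyrics": ["first lyric", "second lyric"],
    },
    index=pd.Index(["Song 1", "Song 2"], name="song_title"),
)

# With no keys given, merge uses the shared columns only ("artist"),
# so each Artist A song is paired with both Artist A lyrics.
by_artist = songs.merge(lyrics)

# Moving song_title out of the index and naming both keys
# yields one row per song.
by_song = songs.merge(
    lyrics.reset_index(), on=["song_title", "artist"], how="left"
)

print(len(by_artist), "rows when joining on artist only")
print(len(by_song), "rows when joining on song_title and artist")
```

If two different artists share a song title, joining on both keys also keeps their lyrics apart, which a join on song_title alone would not.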
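Lyrics analysis (NLP)
Use paraphrase-distilroberta-base-v1 to map each song's lyrics to a 768-dimensional dense vector. Rows with missing lyrics are filtered out first: the query lyrics == lyrics keeps only rows whose lyrics are not NaN, since NaN is never equal to itself.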
spotify_df_dropna = spotify_df.query("lyrics == lyrics")
embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")
emb_sents = embedder.encode(spotify_df_dropna["lyrics"].to_list())
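Wrap the embeddings in a dataframe that reuses the index of the rows that were encoded, then join it back to the original dataframe by index. Rows without lyrics get NaN in the embedding columns.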
emb_sent_df = pd.DataFrame(emb_sents, index=spotify_df_dropna.index).add_prefix('emb_sent_')
spotify_df = spotify_df.join(emb_sent_df)
spotify_df.head(6)
| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | ... | emb_sent_758 | emb_sent_759 | emb_sent_760 | emb_sent_761 | emb_sent_762 | emb_sent_763 | emb_sent_764 | emb_sent_765 | emb_sent_766 | emb_sent_767 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | -0.043803 | -0.447969 | 0.751878 | -0.143682 | -0.581486 | -0.016169 | 0.619145 | 0.287845 | -0.234614 | -0.471101 |
| 1 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | -0.242343 | 0.043233 | 0.199484 | 0.161576 | -0.384538 | 0.126095 | 0.462813 | 0.112398 | 0.002923 | -0.256447 |
| 2 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | -0.316598 | -0.339585 | 0.321482 | 0.280396 | -0.003586 | 0.132081 | -0.445193 | 0.301628 | -0.287356 | -0.504784 |
| 3 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0.087176 | -0.227342 | 0.278918 | 0.479892 | -0.431301 | 0.224406 | 1.213418 | -0.474493 | -0.316002 | -0.365130 |
| 4 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0.117087 | -0.418446 | -0.040964 | 0.428880 | -0.310667 | -0.276378 | 0.260933 | -0.361207 | -0.267050 | -0.332415 |
| 5 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | -0.182009 | -0.468775 | 0.299990 | 0.213028 | 0.011709 | -0.149237 | 0.507415 | 0.057419 | -0.013157 | -0.161389 |
6 rows × 1646 columns
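The index-based join used for the embeddings can be sketched without downloading the model. In the example below, random vectors of dimension 4 stand in for the output of `embedder.encode` (an assumption made purely so the example runs offline; the real vectors have 768 dimensions), and the toy dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with one missing lyric
df = pd.DataFrame({"lyrics": ["some words", np.nan, "more words"]})

# Keep only rows that have lyrics; their original index labels (0 and 2) survive
with_lyrics = df.dropna(subset=["lyrics"])

# Stand-in for embedder.encode(...): one random 4-dimensional vector per row
rng = np.random.default_rng(0)
fake_emb = rng.normal(size=(len(with_lyrics), 4))

# Reusing with_lyrics.index is what lets the join line the vectors up
# with the right songs.
emb_df = pd.DataFrame(fake_emb, index=with_lyrics.index).add_prefix("emb_sent_")
joined = df.join(emb_df)

print(joined)
```

Row 1, which has no lyrics, ends up with NaN in every `emb_sent_` column, so a later modelling step has to impute or drop those rows.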
Export CSV
Export a new CSV with the additional features for the downstream machine learning steps.
spotify_df.to_csv('data/spotify_df_processed.csv',index=False)