Feature engineering

This section extracts new features that are relevant to the problem. For the Spotify data set, three additional features (artist genres, artist popularity, and song lyrics) are extracted from the Spotify API and the Genius API. Two scripts were developed for the data extraction:

Extract artist genres using the Spotify API (Java): Here
Extract song lyrics using the self-developed package pylyrics2 (Python): Here; a demo can be found in the appendix

import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer

Reading the data CSV

Read in the data CSV and store it as a pandas dataframe named spotify_df.

spotify_df = pd.read_csv('data/spotify_data.csv', index_col = 0 )
spotify_df.head(6)

	acousticness	danceability	duration_ms	energy	instrumentalness	key	liveness	loudness	mode	speechiness	tempo	time_signature	valence	target	song_title	artist
0	0.01020	0.833	204600	0.434	0.021900	2	0.1650	-8.795	1	0.4310	150.062	4.0	0.286	1	Mask Off	Future
1	0.19900	0.743	326933	0.359	0.006110	1	0.1370	-10.401	1	0.0794	160.083	4.0	0.588	1	Redbone	Childish Gambino
2	0.03440	0.838	185707	0.412	0.000234	2	0.1590	-7.148	1	0.2890	75.044	4.0	0.173	1	Xanny Family	Future
3	0.60400	0.494	199413	0.338	0.510000	5	0.0922	-15.236	1	0.0261	86.468	4.0	0.230	1	Master Of None	Beach House
4	0.18000	0.678	392893	0.561	0.512000	5	0.4390	-11.648	0	0.0694	174.004	4.0	0.904	1	Parallel Lines	Junior Boys
5	0.00479	0.804	251333	0.560	0.000000	8	0.1640	-6.682	1	0.1850	85.023	4.0	0.264	1	Sneakin’	Drake

Artist Information

genres and popularity are extracted from the Spotify API and describe the artist of each song. The table is in long format, with one row per artist and genre, so an artist with several genres appears several times.

artist_df = pd.read_csv('data/artist_info.csv', index_col = 0 )
artist_df.head(6)

	genres	name	popularity
artist_id
1RyvyyTE3xzB2ZywiAwp0i	atl hip hop	Future	91
1RyvyyTE3xzB2ZywiAwp0i	hip hop	Future	91
1RyvyyTE3xzB2ZywiAwp0i	pop rap	Future	91
1RyvyyTE3xzB2ZywiAwp0i	rap	Future	91
1RyvyyTE3xzB2ZywiAwp0i	southern hip hop	Future	91
1RyvyyTE3xzB2ZywiAwp0i	trap	Future	91

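Pivot the artist table so that each row is an artist and each column is a genre. Each cell counts the rows for that artist and genre, so it is 1 when the artist has the genre and missing otherwise; the missing cells are then filled with 0.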

artist_df_pivot = (
    artist_df.pivot_table(
        index="name",
        columns="genres",
        values="popularity",
        #aggfunc=lambda x: len(x.unique()),
        aggfunc="count",
    )
    .add_prefix("genres_")
    .reset_index()
)

artist_df_pivot.fillna(0, inplace=True)
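The pivot can be tried on a small table first. This is a minimal sketch on made-up rows that follow the column layout of artist_info.csv; the names toy_artist_df and toy_pivot and the Beach House row are illustrative assumptions, not part of the data set.

```python
import pandas as pd

# Toy long-format artist table (made-up rows, same columns as artist_info.csv).
toy_artist_df = pd.DataFrame(
    {
        "genres": ["rap", "trap", "dream pop"],
        "name": ["Future", "Future", "Beach House"],
        "popularity": [91, 91, 75],
    }
)

# One row per artist, one column per genre; each cell counts matching rows.
# Artist/genre pairs that never occur come out as NaN and are filled with 0.
toy_pivot = (
    toy_artist_df.pivot_table(
        index="name", columns="genres", values="popularity", aggfunc="count"
    )
    .fillna(0)
    .add_prefix("genres_")
    .reset_index()
)
print(toy_pivot)
```

Each genres_ column then acts as a 0/1 indicator for whether the artist is tagged with that genre.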

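Join the pivoted artist table to the original table, matching artist against name. merge defaults to an inner join, so any song whose artist has no row in the pivoted table is dropped.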

spotify_df = spotify_df.merge(artist_df_pivot, left_on='artist', right_on='name')
spotify_df = spotify_df.drop(['name'], axis=1)
spotify_df.head(6)

	acousticness	danceability	duration_ms	energy	instrumentalness	key	liveness	loudness	mode	speechiness	...
0	0.0102	0.833	204600	0.434	0.021900	2	0.1650	-8.795	1	0.4310	...
1	0.0344	0.838	185707	0.412	0.000234	2	0.1590	-7.148	1	0.2890	...
2	0.1850	0.704	282253	0.431	0.097200	8	0.2490	-7.893	1	0.1310	...
3	0.0796	0.868	342773	0.627	0.000000	1	0.0983	-4.843	0	0.1160	...
4	0.2180	0.793	194240	0.607	0.000005	4	0.3480	-6.488	0	0.0821	...
5	0.3740	0.772	167158	0.612	0.000002	7	0.1080	-7.274	0	0.2540	...

6 rows × 877 columns
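To see which songs an inner join would drop, one option is a left join with indicator=True. The sketch below uses made-up stand-ins for spotify_df and artist_df_pivot; toy_songs, toy_genres, and the "Some Artist" row are hypothetical.

```python
import pandas as pd

# Toy stand-ins (made-up rows) for spotify_df and artist_df_pivot.
toy_songs = pd.DataFrame(
    {"song_title": ["Mask Off", "Some Song"], "artist": ["Future", "Some Artist"]}
)
toy_genres = pd.DataFrame({"name": ["Future"], "genres_trap": [1.0]})

# how="left" keeps every song; indicator=True adds a _merge column that
# records whether a matching artist row was found.
checked = toy_songs.merge(
    toy_genres, left_on="artist", right_on="name", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "artist"].tolist()
print(unmatched)
```

The artists listed in unmatched are the ones whose songs would disappear under the default inner join.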

Song Information

lyrics are extracted with pylyrics2 and contain the cleaned lyrics of the corresponding song. The table is indexed by song_title.
A Python script was developed for scraping the lyrics: here

lyrics_df = pd.read_csv('data/lyrics_info_clean.csv', index_col = 0 )
lyrics_df.head(6)

	artist	lyrics
song_title
Mask Off	Future	call call hendrix promise swear swear heard s...
Xanny Family	Future	three exotic broads got em soakin panties tol...
Blood On the Money	Future	gave lil trotty thangs nigga walk give friend...
Move That Dope	Future	real dope dealers real haha hahaha young nigg...
Blow a Bag	Future	yeah woke feeling like fucking paper freeband...
Lay Up	Future	beast mode zaytoven fuck bitch lay fuck bitch...

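Join the lyrics with the dataframe. Because lyrics_df is indexed by song_title, the only column it shares with spotify_df is artist, so a merge without explicit keys matches on artist alone. Every song is then paired with every lyric by the same artist, which is why the Mask Off features repeat in the output below, each time with a different lyric. Keeping one row per song would require joining on both song_title and artist.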

spotify_df = spotify_df.merge(lyrics_df)
spotify_df.head(6)

	acousticness	danceability	duration_ms	energy	instrumentalness	key	liveness	loudness	mode	speechiness	...	lyrics
0	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	call call hendrix promise swear swear heard s...
1	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	three exotic broads got em soakin panties tol...
2	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	gave lil trotty thangs nigga walk give friend...
3	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	real dope dealers real haha hahaha young nigg...
4	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	yeah woke feeling like fucking paper freeband...
5	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	beast mode zaytoven fuck bitch lay fuck bitch...

6 rows × 878 columns
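The difference between matching on artist alone and matching on song title plus artist can be checked on a toy example. This is a sketch under assumptions: toy_songs and toy_lyrics are made-up frames with placeholder lyrics, and the keyed merge is one possible way to get one row per song, not the notebook's own code.

```python
import pandas as pd

toy_songs = pd.DataFrame(
    {
        "song_title": ["Mask Off", "Xanny Family"],
        "artist": ["Future", "Future"],
        "energy": [0.434, 0.412],
    }
)
# Indexed by song_title, like lyrics_df after read_csv(..., index_col=0).
toy_lyrics = pd.DataFrame(
    {"artist": ["Future", "Future"], "lyrics": ["lyrics a", "lyrics b"]},
    index=pd.Index(["Mask Off", "Xanny Family"], name="song_title"),
)

# Default merge: only "artist" is a shared column, so 2 songs x 2 lyrics = 4 rows.
by_artist = toy_songs.merge(toy_lyrics)

# Keyed merge: song_title becomes a column, then both keys must match.
by_song = toy_songs.merge(
    toy_lyrics.reset_index(), on=["song_title", "artist"], how="left"
)
print(len(by_artist), len(by_song))
```

With how="left", songs without scraped lyrics stay in the table with a missing lyrics value instead of being dropped.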

Lyrics analysis (NLP)

Use the paraphrase-distilroberta-base-v1 sentence-transformers model to map the lyrics of each song to a 768-dimensional dense vector. Rows without lyrics are filtered out before encoding.

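# NaN != NaN, so this keeps only the rows whose lyrics are not missing.
spotify_df_dropna = spotify_df.query("lyrics == lyrics")
embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")
# encode returns one 768-dimensional vector per lyric, in input order.
emb_sents = embedder.encode(spotify_df_dropna["lyrics"].to_list())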

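Wrap the embeddings in a dataframe that reuses the index of the rows that had lyrics, prefix its columns with emb_sent_, and join it back to the original dataframe. The join aligns on the index, so rows without lyrics get missing values in the embedding columns.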

emb_sent_df = pd.DataFrame(emb_sents, index=spotify_df_dropna.index).add_prefix('emb_sent_')
spotify_df = spotify_df.join(emb_sent_df)
spotify_df.head(6)

	acousticness	danceability	duration_ms	energy	instrumentalness	key	liveness	loudness	mode	speechiness	...	emb_sent_758	emb_sent_759	emb_sent_760	emb_sent_761	emb_sent_762	emb_sent_763	emb_sent_764	emb_sent_765	emb_sent_766	emb_sent_767
0	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	-0.043803	-0.447969	0.751878	-0.143682	-0.581486	-0.016169	0.619145	0.287845	-0.234614	-0.471101
1	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	-0.242343	0.043233	0.199484	0.161576	-0.384538	0.126095	0.462813	0.112398	0.002923	-0.256447
2	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	-0.316598	-0.339585	0.321482	0.280396	-0.003586	0.132081	-0.445193	0.301628	-0.287356	-0.504784
3	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	0.087176	-0.227342	0.278918	0.479892	-0.431301	0.224406	1.213418	-0.474493	-0.316002	-0.365130
4	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	0.117087	-0.418446	-0.040964	0.428880	-0.310667	-0.276378	0.260933	-0.361207	-0.267050	-0.332415
5	0.0102	0.833	204600	0.434	0.0219	2	0.165	-8.795	1	0.431	...	-0.182009	-0.468775	0.299990	0.213028	0.011709	-0.149237	0.507415	0.057419	-0.013157	-0.161389

6 rows × 1646 columns
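The index alignment in the join can be illustrated without loading the model. In this sketch the embeddings are random stand-ins of width 4 rather than real 768-dimensional vectors, and toy_df, fake_emb, and the placeholder lyrics are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing lyric (placeholder strings, not real lyrics).
toy_df = pd.DataFrame({"lyrics": ["first lyric", np.nan, "third lyric"]})
has_lyrics = toy_df.dropna(subset=["lyrics"])

# Stand-in for embedder.encode(...): random vectors of width 4 instead of 768.
rng = np.random.default_rng(0)
fake_emb = rng.normal(size=(len(has_lyrics), 4))

# Reusing has_lyrics.index keeps each vector attached to its original row.
emb_df = pd.DataFrame(fake_emb, index=has_lyrics.index).add_prefix("emb_sent_")
joined = toy_df.join(emb_df)
print(joined)
```

Row 1 has no lyric, so its emb_sent_ columns are missing after the join, while rows 0 and 2 receive their vectors.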

Export CSV

Export the dataframe with the additional features to a new CSV for the later machine learning steps.

spotify_df.to_csv('data/spotify_df_processed.csv',index=False)
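Since the file is written with index=False, it has no index column, so reading it back does not need index_col=0, unlike the input files above. A minimal round-trip sketch on a made-up two-row frame, using an in-memory buffer in place of a file on disk:

```python
import io

import pandas as pd

toy = pd.DataFrame({"energy": [0.434, 0.359], "target": [1, 1]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # no index column is written
buffer.seek(0)
restored = pd.read_csv(buffer)  # so no index_col is needed when reading back

print(restored.columns.tolist())
```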

Spotify User Behaviour Predictor