Conclusion

  • This is a Python-based notebook

Summary of the whole project

import pandas as pd

# Load the cross-validated results and label the metric rows
results = pd.read_csv('data/model_results.csv', index_col=0)
results.rename(index={0: 'fit time', 1: 'score time',
                      2: 'test accuracy', 3: 'train accuracy',
                      4: 'test ROC AUC', 5: 'train ROC AUC'})
|                | dummy             | LogisticReg        | LogisticReg_opt     | RandomForest      | XGBoost           | LGBM              | Cat_Boost          | averaging          |
|----------------|-------------------|--------------------|---------------------|-------------------|-------------------|-------------------|--------------------|--------------------|
| fit time       | 0.138 (+/- 0.015) | 13.436 (+/- 0.415) | 729.602 (+/- 1.470) | 3.025 (+/- 0.158) | 6.629 (+/- 0.337) | 2.496 (+/- 0.136) | 43.591 (+/- 0.065) | 14.263 (+/- 0.906) |
| score time     | 0.054 (+/- 0.003) | 0.138 (+/- 0.048)  | 0.126 (+/- 0.062)   | 0.096 (+/- 0.003) | 0.072 (+/- 0.003) | 0.061 (+/- 0.003) | 0.675 (+/- 0.045)  | 0.278 (+/- 0.010)  |
| test accuracy  | 0.520 (+/- 0.001) | 0.887 (+/- 0.006)  | 0.888 (+/- 0.008)   | 0.900 (+/- 0.011) | 0.944 (+/- 0.007) | 0.950 (+/- 0.003) | 0.954 (+/- 0.005)  | 0.951 (+/- 0.004)  |
| train accuracy | 0.520 (+/- 0.000) | 0.972 (+/- 0.002)  | 0.974 (+/- 0.003)   | 0.997 (+/- 0.001) | 0.997 (+/- 0.001) | 0.997 (+/- 0.001) | 0.995 (+/- 0.001)  | 0.997 (+/- 0.001)  |
| test ROC AUC   | 0.500 (+/- 0.000) | 0.955 (+/- 0.005)  | 0.954 (+/- 0.005)   | 0.960 (+/- 0.002) | 0.986 (+/- 0.001) | 0.991 (+/- 0.002) | 0.991 (+/- 0.002)  | 0.986 (+/- 0.001)  |
| train ROC AUC  | 0.500 (+/- 0.000) | 0.997 (+/- 0.000)  | 0.998 (+/- 0.000)   | 1.000 (+/- 0.000) | 1.000 (+/- 0.000) | 1.000 (+/- 0.000) | 1.000 (+/- 0.000)  | 1.000 (+/- 0.000)  |

Among the models, LGBMClassifier offers the best trade-off. Although CatBoostClassifier has the highest test accuracy (0.954) and ties LGBM for the best test ROC AUC (0.991), its fit time is slow. That is a concern here because the model is likely to be refit on each user's latest song listening history whenever the user wants to update their playlist. LGBMClassifier's test accuracy is 0.950, only 0.004 lower than CatBoostClassifier's, while its fit time is roughly 17 times shorter (2.5 s vs. 43.6 s).
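The trade-off above can be quantified directly from the results table. As a minimal sketch (the two rows below are copied by hand from the summary table rather than reloaded from `data/model_results.csv`), the speed-up and accuracy gap between the two leading models are:

```python
import pandas as pd

# Mean fit time (s) and mean test accuracy for the two leading models,
# copied from the cross-validation summary table above.
metrics = pd.DataFrame(
    {'LGBM': [2.496, 0.950], 'Cat_Boost': [43.591, 0.954]},
    index=['fit time', 'test accuracy'],
)

# How many times slower CatBoost is to fit, and how much accuracy it buys
speedup = metrics.loc['fit time', 'Cat_Boost'] / metrics.loc['fit time', 'LGBM']
acc_gap = metrics.loc['test accuracy', 'Cat_Boost'] - metrics.loc['test accuracy', 'LGBM']

print(f"CatBoost fit is ~{speedup:.1f}x slower for only {acc_gap:.3f} more test accuracy")
```

This makes the selection criterion explicit: a ~17x increase in fit time for a 0.004 accuracy gain is a poor deal when models must be refit frequently per user.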

Therefore, LGBMClassifier will be used for the Spotify user behavior prediction.