Default Credit Card Client Predictor

Author: Jacqueline Chong, Junrong Zhu, Lianna Hovhannisyan, Macy Chan

Imports

1. Introduction


In this mini project, we tackle the classification problem of predicting whether a credit card client will default or not. For this problem, the Default of Credit Card Clients Dataset is used. The data set contains 30,000 examples and 24 features, and the goal is to estimate whether a person will default on (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. For additional information about the data set, an associated research paper is available through the UBC library.




The dataset is based on Taiwan's credit card client default cases from April to September. It has 30,000 examples, and each example represents a particular client's information. Each example has 24 features with respective values such as gender, age, marital status, the last 6 months' bill statements, the last 6 months' payments, etc., including the target column, default payment next month: labeled 1 (the client will default) and 0 (the client will not default). A detailed description of each feature can be found here.

As seen above, the data set does not contain any missing values.

We decided not to change the feature names, as we find them sufficiently self-explanatory.



2. Data splitting


  1. Split the data into train and test portions.

We decided to put 30% of the observations in the test set and 70% in the train set. Overall, the data set has 30,000 observations, so the test set should have enough examples to provide a reliable assessment of the model: more precisely, the train set will have 21,000 observations and the test set 9,000.
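
A minimal sketch of this split, assuming the raw data is loaded into a pandas DataFrame named `credit_df` (the variable name, random seed, and stratification are illustrative choices; the target column name comes from the data set):

```python
from sklearn.model_selection import train_test_split

# 70/30 train/test split, stratified on the target to preserve the class ratio
train_df, test_df = train_test_split(
    credit_df,
    test_size=0.3,
    random_state=123,
    stratify=credit_df["default.payment.next.month"],
)
```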



3. EDA


  1. The count, as well as the percentage distribution of the classes, indicates that there is an imbalance between the No (0) and Yes (1) classes. Overall, we are more interested in minimizing false negatives (predicting no default payment when in reality the client defaults the following month) than false positives (predicting a default payment when in reality no default is made by the client). For creditors, it is of utmost importance to have a model that correctly predicts an account's status next month, especially if the client is going to default. Correct predictions help creditors plan their risk and budget management better and take steps before the situation gets out of control.
  1. Therefore, as we have class imbalance, accuracy will not be used to evaluate the model. The chosen scoring metrics are F1 score, recall, and average precision. The F1 score is the harmonic mean of precision and recall and gives a good sense of both, so it will be our primary scoring metric throughout the evaluation; a short sketch of how these metrics are passed to scikit-learn follows this list.
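
The sketch below only bundles the chosen metrics using scikit-learn's built-in scorer names; the cross-validation call is shown as a commented usage example because the models are defined in later sections:

```python
from sklearn.model_selection import cross_validate

# Scoring metrics chosen for the imbalanced target
scoring_metrics = ["f1", "recall", "average_precision"]

# Used later as, for example:
# cross_validate(model, X_train, y_train, cv=5, scoring=scoring_metrics, return_train_score=True)
```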

Let's examine the relation between the amount of given credit and defaulting accounts. Since this analysis is done from the perspective of credit companies, the amount of credit granted to a person should reflect the trust the company has in that individual. Thus, the relation between defaults and the granted credit amount is of utmost interest for our analysis.

Indeed, we can see from the plot that as the amount of given credit increases, fewer accounts tend to default.

Looking at the relation between bill statements and the frequency of defaults, we see the same pattern: higher bill statements are associated with fewer default accounts.

We should be very careful when examining sensitive data such as gender. As shown in the table below, we have more females than males in our data set.

However, we can see from the diagram and the summary statistics table that males are more likely to make a default payment than females. This is an interesting observation; we acknowledge, however, that we are working with a limited amount of data and that other factors may have influenced this statistic. Sensitive data such as gender should be handled with the utmost care, but as our model does not cause direct harm to a particular group, we decided to keep it among our features.



4. Feature engineering


We decided to add the features 'avg_bill_amt' and 'avg_pay_amt'. In addition to the individual payment/bill columns (PAY_AMT1 to PAY_AMT6 and BILL_AMT1 to BILL_AMT6), the average payment column ('avg_pay_amt') and the average bill column ('avg_bill_amt') better reflect the overall credit card payment/spending pattern and may provide useful information for training the model.
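
A minimal sketch of adding the two averaged features, assuming `train_df`/`test_df` from the split above and that the averages are taken over the BILL_AMT and PAY_AMT columns:

```python
bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]    # BILL_AMT1 ... BILL_AMT6
pay_amt_cols = [f"PAY_AMT{i}" for i in range(1, 7)]  # PAY_AMT1 ... PAY_AMT6

for split in (train_df, test_df):
    # Row-wise averages of the six monthly bill statements and payment amounts
    split["avg_bill_amt"] = split[bill_cols].mean(axis=1)
    split["avg_pay_amt"] = split[pay_amt_cols].mean(axis=1)

# Feature/target split used in the sections below
X_train = train_df.drop(columns=["default.payment.next.month"])
y_train = train_df["default.payment.next.month"]
X_test = test_df.drop(columns=["default.payment.next.month"])
y_test = test_df["default.payment.next.month"]
```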



5. Preprocessing and transformations


| Type | Reason | Features |
|------|--------|----------|
| category | All of these features have a fixed number of categories. MARRIAGE has 4 classes, and we treat it as categorical rather than ordinal because we did not want to introduce bias by ranking the different statuses in any particular order; similar reasoning applies to SEX. For EDUCATION, some values are labeled "unknown" or "others"; since their meanings are vague, we do not want to group them together, as they may each carry unique patterns, so we treat this feature as categorical as well. | SEX, MARRIAGE, EDUCATION |
| ordinal | Sequential values ranging from -2 to 8 (-2 as the best, 8 as the worst). | PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6 |
| numeric | Numeric columns need standardization. | LIMIT_BAL, AGE, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6, avg_bill_amt, avg_pay_amt |
| drop | Unique identifier for every record in the data set (would not help model training). | ID |
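
A sketch of the corresponding preprocessor; the specific transformer classes are assumptions consistent with the table above:

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

categorical_features = ["SEX", "MARRIAGE", "EDUCATION"]
ordinal_features = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
drop_features = ["ID"]
numeric_features = [
    "LIMIT_BAL", "AGE",
    "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
    "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
    "avg_bill_amt", "avg_pay_amt",
]

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
    (OrdinalEncoder(), ordinal_features),  # PAY_* values are already ordered (-2 ... 8)
    (StandardScaler(), numeric_features),
    ("drop", drop_features),
)
```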





6. Baseline model
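
A minimal baseline sketch, assuming a most-frequent-class DummyClassifier scored with the metrics chosen above (`X_train`, `y_train`, and `scoring_metrics` come from the earlier sketches):

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dummy = DummyClassifier(strategy="most_frequent")

# The baseline predicts the majority class ("no default") for every example, so
# recall and f1 for the default class are 0 and average precision equals the
# prevalence of the positive class.
dummy_results = cross_validate(
    dummy, X_train, y_train, cv=5, scoring=scoring_metrics, return_train_score=True
)
```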




7. Linear models


We will carry out hyperparameter optimization: C controls the regularization strength, and the class_weight hyperparameter helps tackle the class imbalance.
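
A sketch of this search; the use of RandomizedSearchCV and the search ranges are illustrative assumptions, while `preprocessor`, `X_train`, and `y_train` come from the earlier sketches:

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

lr_pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# Search over regularization strength and class weighting (bounds are illustrative)
lr_param_dist = {
    "logisticregression__C": loguniform(1e-3, 1e3),
    "logisticregression__class_weight": [None, "balanced"],
}

lr_search = RandomizedSearchCV(
    lr_pipe, lr_param_dist, n_iter=20, scoring="f1", cv=5, n_jobs=-1, random_state=123
)
lr_search.fit(X_train, y_train)
```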

We can see that with optimized hyperparameters, Logistic Regression is doing much better. However, it is clear that we are dealing with underfitting (there is almost no gap between the train and validation scores). The standard deviation is very small, within about ±0.01.

From the confusion matrix, we can see that we have many more false positives than false negatives, which is also reflected in the average precision and f1 scores. Our overall goal is to maximize the f1 score, so in the later sections we will look at non-linear models to see if we can beat this score.
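
For reference, the confusion matrix discussed above can be obtained from cross-validated predictions, for example (a sketch using the tuned search object from above):

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Cross-validated predictions, so the matrix is not computed on data the model was fit on
y_pred_cv = cross_val_predict(lr_search.best_estimator_, X_train, y_train, cv=5)
print(confusion_matrix(y_train, y_pred_cv))
```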



8. Different models


Default parameters reasoning:
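
As a sketch of how the compared models could be set up and cross-validated with their default parameters (library imports and the fixed random seeds are assumptions; `preprocessor`, `scoring_metrics`, and the data splits come from earlier sketches):

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

models = {
    "RandomForest": RandomForestClassifier(random_state=123),
    "XGBoost": XGBClassifier(random_state=123),
    "LGBM": LGBMClassifier(random_state=123),
    "CatBoost": CatBoostClassifier(random_state=123, verbose=0),
}

results = {}
for name, model in models.items():
    # Each model sits behind the same preprocessing pipeline
    pipe = make_pipeline(preprocessor, model)
    results[name] = cross_validate(
        pipe, X_train, y_train, cv=5, scoring=scoring_metrics, return_train_score=True
    )
```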

All the non-linear models are overfitting, but compared to RandomForest and XGBoost, LGBM and CatBoost overfit the training data less, as the gap between their train and validation scores is smaller. The worst model in terms of overfitting is RandomForest, with an almost perfect score on the train set and scores lower than 0.5 on the cross-validation set (for the f1 score). Logistic Regression is underfitting: its training score is low and the gap between its train and validation scores is very small. Similarly, the dummy classifier underfits.

The LGBM Classifier has the lowest fit time, and compared to the other non-linear models the difference is quite large. Random Forest and CatBoost have the longest fit times, with the latter being the slowest overall.

Score time is fast for all classifiers except random forest. It is still fast, but compared to the other models we can notice a difference of around 0.1 s.

The scores are more or less stable, with standard deviations of around 0.01.

RandomForest gives better recall and average precision cross-validation scores than XGBoost; however, its f1 score is lower than that of the other non-linear models.

The best model among the non-linear models is the LGBM Classifier. Despite overfitting, it gives us the best f1 and average precision scores, as well as the lowest fit and score times. The worst models are Random Forest and XGBoost, as they have the lowest f1 scores among the non-linear models and take a long time to fit.

As for Logistic Regression, it gives us the best recall and f1 scores (though not the best average precision). Regarding recall, f1 score, and speed, LGBM is the only non-linear model comparable to Logistic Regression.

Thus, we cannot conclude that non-linear models beat linear models.



9. Feature selection


After including feature selection, the cv scores did not improve compared to our best non-linear model, LGBM: all three cv scores (f1, average precision, and recall) decreased. Even though overfitting is reduced (the gap between the train and cv scores is smaller), we do not see an improvement in overall performance from including feature selection.
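
The report does not spell out the selection method here, so the following is only one possible sketch, using recursive feature elimination (RFE) in front of the LGBM classifier; the selector and the number of features kept are illustrative assumptions:

```python
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Feature-selection step placed between the preprocessor and the classifier
fs_pipe = make_pipeline(
    preprocessor,
    RFE(LGBMClassifier(random_state=123), n_features_to_select=15),
    LGBMClassifier(random_state=123),
)

fs_results = cross_validate(
    fs_pipe, X_train, y_train, cv=5, scoring=scoring_metrics, return_train_score=True
)
```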



10. Hyperparameter optimization


We will start with the RandomForest Classifier and optimize three of its hyperparameters: n_estimators, which determines the number of trees in the forest; max_depth, which determines the number of levels in each tree; and class_weight, which is used to tackle class imbalance (it controls the weights associated with the classes).
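
A sketch of this search; the use of RandomizedSearchCV and the distribution bounds are illustrative assumptions:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

rf_pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state=123))

rf_param_dist = {
    "randomforestclassifier__n_estimators": randint(100, 500),
    "randomforestclassifier__max_depth": randint(3, 20),
    "randomforestclassifier__class_weight": [None, "balanced"],
}

rf_search = RandomizedSearchCV(
    rf_pipe, rf_param_dist, n_iter=20, scoring="f1", cv=5, n_jobs=-1, random_state=123
)
rf_search.fit(X_train, y_train)
```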

Next, we will optimize a hyperparameter for XGBoost: max_depth, which determines the maximum depth of a tree (more depth means a more complex model).

Next, we will optimize hyperparameters for the LGBM Classifier: max_depth, which determines the depth of each tree; num_leaves, which determines the number of leaves for the base learners; and class_weight for tackling class imbalance (it controls the weights associated with the classes).

Lastly, we will perform hyperparameter optimization for the CatBoost Classifier: max_depth, which determines the depth of each tree; learning_rate, which controls the gradient step size (smaller steps, more iterations) and can help minimize the error associated with the loss function; and auto_class_weights, which determines the weight given to each class.
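
A compact sketch of the search spaces for the three boosting models; the hyperparameter names match the paragraphs above, while the distribution bounds are illustrative assumptions:

```python
from scipy.stats import loguniform, randint

boosting_param_dists = {
    "XGBoost": {"xgbclassifier__max_depth": randint(2, 10)},
    "LGBM": {
        "lgbmclassifier__max_depth": randint(2, 10),
        "lgbmclassifier__num_leaves": randint(10, 200),
        "lgbmclassifier__class_weight": [None, "balanced"],
    },
    "CatBoost": {
        "catboostclassifier__max_depth": randint(2, 10),
        "catboostclassifier__learning_rate": loguniform(1e-3, 0.3),
        "catboostclassifier__auto_class_weights": [None, "Balanced"],
    },
}
```

Each of these spaces would be passed to a RandomizedSearchCV over the corresponding pipeline, exactly as in the RandomForest sketch above.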

We are going to group the classifiers with optimized hyperparameters in a dictionary and compute the mean cross-validation scores for each, summarizing them in the results dictionary. They will appear alongside the 'unoptimized' versions, making them easier to compare.
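
A sketch of that summary step; the search objects for the boosting models (`xgb_search`, `lgbm_search`, `cat_search`) are assumed to have been fit analogously to `rf_search` above, and their names are illustrative:

```python
import pandas as pd
from sklearn.model_selection import cross_validate

optimized_models = {
    "RandomForest_opt": rf_search.best_estimator_,
    "XGBoost_opt": xgb_search.best_estimator_,
    "LightGBM_opt": lgbm_search.best_estimator_,
    "CatBoost_opt": cat_search.best_estimator_,
}

summary = {}
for name, model in optimized_models.items():
    cv = cross_validate(
        model, X_train, y_train, cv=5, scoring=scoring_metrics, return_train_score=True
    )
    # Keep only the mean of each metric for a compact comparison table
    summary[name] = {metric: scores.mean() for metric, scores in cv.items()}

pd.DataFrame(summary).T
```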

We notice that after optimization, we get better cross-validation scores for all non-linear models.

Overfitting is reduced in comparison with the "unoptimized" models; however, all models still seem to overfit. RandomForest gives much better results: the gap between its cv and train scores is much smaller, and all its other scores are up. In fact, its scores became competitive with those of the other non-linear models. After optimization, CatBoost seems to overfit more.

The standard deviation is low for all models.

The best model after hyperparameter optimization is the LGBM Classifier, which gives us the highest cv f1 score and is the fastest among all models. Even though Logistic Regression has a higher recall score than LGBM, in overall performance we prioritize the f1 score. Also, the difference in score and fit time between LGBM and Logistic Regression is not very large, so both models can be considered quick. Thus we are choosing LGBM as our final model, with hyperparameters as follows: 'lgbmclassifier__num_leaves': 100, 'lgbmclassifier__max_depth': 5, 'lgbmclassifier__class_weight': 'balanced', and an f1 score of 0.546.
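
A sketch of refitting the final model with the selected hyperparameters (the hyperparameter values come from the search above; the random seed is an illustrative choice):

```python
from lightgbm import LGBMClassifier
from sklearn.pipeline import make_pipeline

# Final pipeline with the selected hyperparameters
final_lgbm = make_pipeline(
    preprocessor,
    LGBMClassifier(
        num_leaves=100, max_depth=5, class_weight="balanced", random_state=123
    ),
)
final_lgbm.fit(X_train, y_train)
```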



11. Interpretation and feature importances


We examined the most important features of the LGBMClassifier with both the SHAP and eli5 methods. The results from these two methods are somewhat similar, as 4 of the top 5 features from eli5 are also the top 4 features from SHAP (namely PAY_0, LIMIT_BAL, BILL_AMT1, and PAY_AMT2).
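
A sketch of the SHAP side of this analysis, assuming the `final_lgbm` pipeline from the previous section (step names follow scikit-learn's make_pipeline naming convention):

```python
import shap

# Pull the fitted classifier and the transformed training features out of the pipeline
column_transformer = final_lgbm.named_steps["columntransformer"]
lgbm_model = final_lgbm.named_steps["lgbmclassifier"]

X_train_enc = column_transformer.transform(X_train)
feature_names = column_transformer.get_feature_names_out()

explainer = shap.TreeExplainer(lgbm_model)
shap_values = explainer.shap_values(X_train_enc)
shap.summary_plot(shap_values, X_train_enc, feature_names=feature_names)
```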

From the SHAP summary plot, it is easy to spot that PAY_0 has a significantly larger mean absolute SHAP value compared to all other features. Similarly, PAY_0 has a much larger weight than any other feature in the eli5 output table.

On the other hand, some features that eli5 ranks as roughly equally important are ranked differently by SHAP.

An interesting aspect of this result is that treating PAY_0 as the most important feature implies that the repayment status in September is the most essential variable of all, which does not necessarily align with reality.



12. Results on the test set


The test set scores were slightly lower than the validation scores. As the training and test sets are rather large (21,000 and 9,000 observations respectively), we trust our results. Optimization bias occurs when the training set is small and many models are evaluated during hyperparameter optimization, which might cause us to get a good validation score "by chance". However, since the validation and test scores are comparable, we do not think we had issues with optimization bias.
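
A sketch of how the final pipeline is scored on the held-out test set with the chosen metrics (`final_lgbm`, `X_test`, and `y_test` come from the earlier sketches):

```python
from sklearn.metrics import average_precision_score, f1_score, recall_score

# Final evaluation on the held-out test set
y_pred = final_lgbm.predict(X_test)
y_proba = final_lgbm.predict_proba(X_test)[:, 1]

print("f1:               ", f1_score(y_test, y_pred))
print("recall:           ", recall_score(y_test, y_pred))
print("average precision:", average_precision_score(y_test, y_proba))
```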

Part 3

Splitting the classes into two separate lists. One for no_default and the other for yes_default:
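
The exact code is not shown here, so the following is only one possible reading of this step (splitting the test examples by their true class; splitting on predictions would be the other option):

```python
# Test examples grouped by their true class label
no_default = X_test[y_test == 0]
yes_default = X_test[y_test == 1]
```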



13. Summary of results


Amongst the models, it is clear that the model we have chosen, LightGBM_opt, is optimal in terms of the total time required to fit and score, and has the best f1, recall, and average precision scores. Moreover, the differences in scores between the test set and the validation set are minimal, suggesting that there is minimal optimization bias. Based on our test scores, which are highlighted in part 12 of this lab, our model is able to identify about 60.9% of the true default cases and correctly identifies 54.0% of all default cases.

First, it is important to highlight that the discussion of the scores and the model should not be misinterpreted. For instance, in the feature importance interpretation, our model suggested that PAY_0 carries much more weight than the other repayment-status features (PAY_2, PAY_3, etc.). In reality, customer credit is assessed in a longer and more consistent manner, meaning payments made in every month should be treated more equally. Since the model suggests that the payment status of one particular month matters far more than the remaining months, we should not be overly confident when interpreting the model's top important features.

Moreover, as discussed in the EDA, a gender imbalance is present in the dataset. To avoid further debate and possible ethical misinterpretation, we believe it would be better to include an analysis of results broken down by group in later studies.

There is definitely room for improvement in our project, both in terms of performance (CV scores) and interpretability.

In terms of the choice of models, we could have attempted various ensemble methods, such as averaging and stacking, to improve the performance of our model. This would allow us to combine different models and potentially get better results. However, it would decrease interpretability, since randomness is injected, and also reduce code maintainability, as each model needs to be optimized individually.

In terms of the actual dataset, although we attempted to create new features, we did not test for correlation between pairs of the explanatory variables. We could have calculated the variance inflation factor and manually removed several features. While this is not a foolproof method and could lead to underfitting, we could have explored that option.