

DBA3803: Predictive Analytics in Business

Credit Card Churn

1. Abstract

In a rapidly changing economic environment, bankers are key stakeholders in the country's economy who must be able to make quick and accurate decisions, especially those concerning their customers. As churn affects the gross revenue of corporate banks, it is crucial to know which customers are likely to attrite and to take action to prevent a higher churn rate. As part of our feature engineering, we identified three customer segments using K-Prototypes clustering to further our analysis. Using selected features from the credit card churn dataset, we tested several models; overall, the XGBoost model performs best at predicting customer churn, with the highest F1 score of 0.878957. To reduce the churn rate, targeted marketing campaigns towards the identified segments would be pivotal.

2. Introduction

In financial services, customer churn is of particular concern to companies such as credit unions, banks, insurance agencies, and credit card companies. Attrition rates can reach as high as 25-30% for these companies (Kaemingk, 2018). Research has shown that corporate banks lose 10-15% of gross revenue annually to customer churn (Karthikeyan et al., 2017). In the highly competitive banking scene, consumers are spoilt for choice for their credit card services, alarming corporate banks that wish to increase their profits. Moreover, lofty transaction fees and poor exchange rates deter users from using their cards (Teope, 2021). Hence, we aim to predict the likelihood of customer cancellation so that banks can take measures to proactively retain customers and remain relevant in this competitive industry.

Our analysis is based on a credit card churn dataset from Kaggle (Goyal, 2021). The dataset includes 10,127 data points and 23 attributes (Appendix, Table A1). The classification problem of predicting credit card cancellation has a binary target variable (Attrition Flag), and four distinct models will be explored: Logistic Regression, Random Forest, Gradient Boosting, and XGBoost. The most suitable model will be chosen by analysing the trade-offs between model performance and the potential business loss from inaccurate predictions.

3. Data Preprocessing

3.1 Removing Data Points

The data is relatively clean, with no missing values. However, it contained interim workings from previous projects, which we removed for the purpose of this report. The affected columns are (1) the Naive Bayes classifier for month 1 and (2) the Naive Bayes classifier for month 2. Next, we removed the “client number” column, as it is merely a customer index and distorted our prediction models. The team also removed the “Unknown” responses in Marital Status and Income Category because, without a clear marital status and income range, the outcome of the prediction would be affected. After data cleaning, 20 features remain for our predictive models, of which 6 are categorical and 14 numerical, with 8,348 rows.

3.3 Train-Test Split

We then split the data into training and test sets with an 80/20 split, where the target variable (y) is the Attrition Flag and the predictor variables (X) comprise all other features. The training and test sets contain 6,678 and 1,670 data points respectively.
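A minimal sketch of this step, assuming the cleaned data sits in a pandas DataFrame `df`; the stratification and `random_state` are our assumptions, since the report only states the 80/20 ratio:

```python
from sklearn.model_selection import train_test_split

# 'df' is the cleaned dataset (8,348 rows); map the textual flag to 1 = attrited, 0 = existing.
y = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
X = df.drop(columns=["Attrition_Flag"])

# 80/20 split, giving roughly 6,678 training and 1,670 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```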

3.4 Data Scaling & One-Hot Encoding

Next, we normalised the dataset after the train/test split using the MinMaxScaler in SKlearn, which rescales each numerical feature individually to a range between zero and one. We chose MinMaxScaler over StandardScaler because of the non-negative nature of our data points.

Normalising the data is important for logistic regression with regularisation, as it assumes that features are on a similar scale. Furthermore, it reduces runtime and supports faster convergence of the optimizer. Scaling also ensures that regularisation penalties are applied fairly across all features.

After scaling the numerical data, we one-hot encoded the categorical variables prior to our predictive analysis. We also tried label-encoding all ordinal categorical features, reasoning that the resulting dimensionality reduction might improve performance. However, this intuition was proved wrong after cross-validation.
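A sketch of the scaling and encoding step, assuming `num_cols` and `cat_cols` hold the names of the 14 numerical and 6 categorical features; the scaler is fitted on the training split only to avoid leakage:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Fit MinMaxScaler on the training numerical features, then apply it to both splits.
scaler = MinMaxScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# One-hot encode the categorical features and align the test columns with training.
X_train = pd.get_dummies(X_train, columns=cat_cols)
X_test = pd.get_dummies(X_test, columns=cat_cols)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```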

4. Exploratory Data Analysis

We began by creating a pair plot of all continuous numerical variables (Appendix, Figure A1), which yielded valuable insights that we explore further in the following three subsections.

4.1 Univariate Analysis

As can be seen from Figure 1, the proportion of customers who churn is 15.9% (1,328), indicating an imbalance in the dataset. It should also be highlighted that, as most customers pay off their credit before the end of the month, many customers have a total revolving balance of zero (Figure 2).

Figure 1. Proportion of Attrited Customers

Figure 2. Distribution of Total Revolving Balance

4.2 Bivariate Analysis

To understand the relationships among the numerical variables in the dataset, we created a heatmap (Appendix, Figure A2) across all numerical variables. It can be observed that there is a perfect correlation between the average open-to-buy and credit limit variables; this is expected, as the open-to-buy amount is derived directly from the credit limit. The high correlation of 0.79 between months on book and age is intuitive, as older customers naturally have a longer relationship with the bank. The team used the chi-square test of independence from the “scipy.stats” package, converted to Cramér’s V, to measure the association between categorical variables. As can be seen in Table A2 in the Appendix, “gender” and “income category” have the highest degree of association, at 0.839.
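A sketch of the Cramér's V calculation used for the categorical associations, built on `scipy.stats.chi2_contingency`; the bias-uncorrected formula is shown, which may differ slightly from the exact variant used for Table A2:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return (chi2 / (n * (min(r, c) - 1))) ** 0.5

# Example (column names as in the Kaggle dataset, assumed here):
# cramers_v(df["Gender"], df["Income_Category"])
```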

4.3 Multivariate Analysis

Taking into consideration the total transaction amount, total transaction count, and attrition flag variables, we created a scatterplot to visualise how these attributes interact. Based on the spread in Figure 3, there appear to be three distinct segments. Moreover, attrited customers generally fall among those with low-to-mid total transaction amounts and transaction counts.

Figure 3. Scatter Plot of Transaction Amount and Transaction Count Based on Attrition Flag

The clear separation of the data suggests that there are hidden customer segments within our dataset. Thus, we decided to leverage clustering techniques to identify these clusters and their unique properties, which are further elaborated in Section 5.

4.4 Feature Engineering

Based on these observations, we initially introduced five interaction-effect variables into our dataset for the feature pairs with the highest correlation (Appendix, Table A2). Any additional feature engineering is covered under the respective methods.

5. Clustering

5.1 Methodology

The algorithm the group decided on to cluster our dataset is K-Prototypes from the ‘kmodes’ package. Unlike K-means clustering, which handles only numerical data, K-Prototypes can account for both categorical and numerical variables, which is more representative of our data.

The data was clustered on a copy of the main dataset, scaled using the “MinMaxScaler”. We decided on scaling because our dataset contains numerical values of different units. Once the K-Prototypes model identified the customer segments, we added a new feature, “cluster”, to our main dataset for use in our predictive models.

Our initial model clustered the data using the target variable “attrition_flag”. Two of the three clusters were either 100% attrited or 100% retained, which severely skewed our predictive model results. Hence, we decided to drop the “attrition_flag” column before running the K-Prototypes model.
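A sketch of the clustering step with the `kmodes` package; `KPrototypes` needs the positions of the categorical columns, and `n_clusters=3` anticipates the elbow analysis described in the next paragraph. The `init` and `random_state` values are illustrative assumptions:

```python
from kmodes.kprototypes import KPrototypes
from sklearn.preprocessing import MinMaxScaler

# Work on a copy with the target dropped so the clusters are not driven by churn itself.
cluster_df = df.drop(columns=["Attrition_Flag"]).copy()
cluster_df[num_cols] = MinMaxScaler().fit_transform(cluster_df[num_cols])

# K-Prototypes takes the integer positions of the categorical columns.
cat_idx = [cluster_df.columns.get_loc(c) for c in cat_cols]

kproto = KPrototypes(n_clusters=3, init="Huang", random_state=42)
# The resulting labels become the new "cluster" feature; for the elbow plot,
# this fit is repeated over a range of n_clusters and kproto.cost_ is plotted.
df["cluster"] = kproto.fit_predict(cluster_df.to_numpy(), categorical=cat_idx)
```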

In reference to the elbow plot in Figure A3 of the Appendix, we decided to break the data into three clusters. Table 1 shows the breakdown of the more significant variables used to identify the properties of each cluster, which we labelled as follows:

5.2 Customer Segments

(1) High-Risk Customers: lowest credit limit but high accumulated unpaid credit (‘Total_Revolving_Bal’). It would be wise to monitor this group of customers closely, as they have accrued a high credit balance with the bank.

(2) Premium Customers: healthiest credit score with a moderate credit balance. These clients fall into the higher income categories and have higher transaction amounts on their credit cards.

(3) Low-Risk Customers: moderate credit score with low usage. This group of customers is punctual with credit billing and poses a low threat of defaulting on credit loans.

Table 1. Parameters and Decision Variable

5.3 Data Exploration

Figure 4. Percentage of Attrited Customers for Different Clusters Across Income Categories

As can be seen from Figure A4 in the Appendix, the ‘High-Risk’ cluster has the highest number of customers who stayed, despite its high median total revolving balance. Interestingly, customers in the ‘Low-Risk’ cluster have the highest churn rate; a possible reason is dissatisfaction with the bank's customer service, prompting a switch to competitors (Kaemingk, 2018). Furthermore, looking at the ‘Premium’ cluster for the income category above $120,000 in Figure 4, 56.67% of the customers attrited. This is higher than in the ‘Low-Risk’ and ‘High-Risk’ clusters, signifying that the bank lost many of its premium customers.

6. Models

The models are trained on the normalised dataset and cross-validated using “GridSearchCV” in the SKlearn package to find the hyperparameters that give the best F1 score. The F1 score was chosen over accuracy as it captures both the precision and the sensitivity of our model on an imbalanced dataset. Because attrited customers are a small minority, accuracy alone will always yield high values. Optimising F1 ensures our model is robust to the class imbalance in our data and prevents it from being too naive.
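A toy illustration of why F1 was preferred over accuracy, with made-up numbers that only mimic the roughly 16% churn rate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# A naive model that predicts "not attrited" for everyone already looks accurate
# on an imbalanced sample, yet detects no churner at all.
y_true = np.array([1] * 16 + [0] * 84)   # toy sample: 16% attrited
y_naive = np.zeros_like(y_true)          # always predict "not attrited"

print(accuracy_score(y_true, y_naive))   # 0.84 -- misleadingly high
print(f1_score(y_true, y_naive))         # 0.0  -- no churner detected
```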

6.0 Models ruled out

Taking into consideration the imbalanced dataset and the number of attributes used in the analysis, which makes it subject to the curse of dimensionality, the team decided not to proceed with a KNN classifier or a Support Vector Classifier. With a large dataset of 8,348 observations after data cleaning, these classifiers showed long run times (an SVC with an RBF kernel has a time complexity of O(n^2 * p)), which may not be practical for banks that want to forecast churn frequently to identify potentially churning customers.

6.1 Logistic Regression

Logistic Regression stands out for its high interpretability, allowing direct interpretation of coefficients that indicate the direction and strength of the relationships between features and the target. L1 and L2 penalties were applied with cross-validation and different solvers to avoid overfitting. As logistic regression assumes linear relationships, we tried a logarithmic transformation of the variables Credit_Limit and Avg_Open_To_Buy, given that these variables have a negative exponential relationship. However, only Credit_Limit improved model performance after cross-validation; for the other variable, the transformation was not able to normalise the distribution. For logistic regression, we fitted 5 folds for each of 30 candidates, totalling 150 fits.
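A sketch of this setup; the log transform targets `Credit_Limit`, and the parameter grid below is only one possible grid that reproduces the reported 30 candidates (5 folds yielding 150 fits), since the actual grid is not listed in the report:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Logarithmic transformation of Credit_Limit (applied before scaling in practice);
# log1p is used to stay defined at zero.
X_train["Credit_Limit"] = np.log1p(X_train["Credit_Limit"])
X_test["Credit_Limit"] = np.log1p(X_test["Credit_Limit"])

# Illustrative 30-candidate grid: L1 only with solvers that support it, L2 with the rest.
param_grid = [
    {"penalty": ["l1"], "solver": ["liblinear", "saga"], "C": [0.01, 0.1, 1, 10, 100]},
    {"penalty": ["l2"], "solver": ["liblinear", "saga", "lbfgs", "newton-cg"],
     "C": [0.01, 0.1, 1, 10, 100]},
]

logreg_cv = GridSearchCV(
    LogisticRegression(max_iter=5000), param_grid, scoring="f1", cv=5, n_jobs=-1
)
logreg_cv.fit(X_train, y_train)
```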

6.2 Random Forest

Random Forests are suited to capturing complex relationships in data, making them appropriate for analysing churn, where factors might interact in a non-linear manner. With the inclusion of diverse control variables, Random Forest can handle the complexity and identify the key features contributing to churn. It can also handle the ordinal nature of card types effectively. However, being an ensemble method, Random Forest sacrifices interpretability for improved predictive performance. We fitted the training data using the random forest method, which searches for the best feature among a random subset of features at each split. We fitted 5 folds for each of 162 candidates, totalling 810 fits. The tuned model has a maximum depth of 15, a minimum split size of 2, and 150 estimators.
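The reported best configuration expressed as a scikit-learn estimator; reading the split size as `min_samples_split=2` is our interpretation:

```python
from sklearn.ensemble import RandomForestClassifier

# Best configuration after the 162-candidate grid search (5 folds, 810 fits).
rf_best = RandomForestClassifier(
    n_estimators=150,      # 150 estimators
    max_depth=15,          # maximum depth of 15
    min_samples_split=2,   # split size of 2, read as the minimum samples required to split
    random_state=42,
)
rf_best.fit(X_train, y_train)
```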

6.3 Gradient Boost

Gradient Boosting Machines sequentially correct errors and handle subtle relationships in the data, capturing non-linearities in the factors that contribute to churn which the logistic regression model cannot capture. Given that gradient boosting works well with imbalanced, structured datasets, it is an efficient model for our data. We fitted 5 folds for each of 324 candidates, totalling 1,620 fits. The tuned model has a maximum depth of 4, a minimum split size of 10, and 200 estimators.
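The corresponding sketch for gradient boosting, again reading the split size as `min_samples_split`:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Best configuration after the 324-candidate grid search (5 folds, 1,620 fits).
gb_best = GradientBoostingClassifier(
    n_estimators=200,       # 200 estimators
    max_depth=4,            # maximum depth of 4
    min_samples_split=10,   # split size of 10, read as the minimum samples required to split
    random_state=42,
)
gb_best.fit(X_train, y_train)
```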

6.4 XGBoost

XGBoost is a more regularised form of gradient boosting, which typically makes it perform better than the standard Gradient Boosting model. We fitted 5 folds for each of 243 candidates, totalling 1,215 fits. The tuned model has a maximum depth of 5 and 300 estimators.
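The reported XGBoost configuration as a sketch; `eval_metric` and `random_state` are our additions for reproducibility:

```python
from xgboost import XGBClassifier

# Best configuration after the 243-candidate grid search (5 folds, 1,215 fits).
xgb_best = XGBClassifier(
    n_estimators=300,        # 300 estimators
    max_depth=5,             # maximum depth of 5
    eval_metric="logloss",   # explicit metric to avoid the default-metric warning
    random_state=42,
)
xgb_best.fit(X_train, y_train)
```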

7. Model Evaluation

7.1 Loss Matrix

The team decided to include another dimension of model performance by analysing the cost of false predictions. The cost of a False Negative (FN) is high, as a card cancellation translates to the loss of the entire customer's lifetime value. On the other hand, the cost of a False Positive (FP) is lower, as a falsely predicted attrition only means the bank allocates retention resources inefficiently. Table 2 shows the full breakdown of the cost matrix.

 

                              Actual Attrited (1)    Actual Not Attrited (0)
Predicted Attrited (1)        $0 (TP)                $500 (FP)
Predicted Not Attrited (0)    $46,284 (FN)           $0 (TN)

Table 2. Loss Matrix in Dollar Terms
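A sketch of how each model's estimated dollar loss can be computed from its confusion matrix using the costs in Table 2:

```python
from sklearn.metrics import confusion_matrix

FP_COST = 500      # retention effort wasted on a customer who would have stayed
FN_COST = 46_284   # lifetime value lost when a churner is missed

def estimated_loss(y_true, y_pred):
    """Total dollar loss implied by Table 2 for a set of binary predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * FP_COST + fn * FN_COST

# Example: estimated_loss(y_test, xgb_best.predict(X_test))
```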

7.2 Model Selection

Using the F1 score and estimated losses as our performance metrics, we selected XGBoost as our predictive algorithm, as it outperformed all other models both after cross-validation and on the test set, as seen in Table 3. A more in-depth analysis of the individual models can be found in Appendix B. Logistic Regression performed poorly on sensitivity and F1 score while still having comparable accuracy. As detecting positives (and therefore sensitivity) is highly important to the bank, this model is not useful for the company. Meanwhile, Random Forest and Gradient Boosting could still be applied, as they deliver good performance on those metrics.

 

                           Log Regression   Random Forest   Gradient Boost   XGBoost
Accuracy                   0.914970         0.950299        0.958084         0.960479
Sensitivity                0.627178         0.780488        0.815331         0.818815
Specificity                0.974693         0.985539        0.987708         0.989877
F1 Score                   0.717131         0.843691        0.869888         0.876866
AUC                        0.948146         0.984231        0.984186         0.989194
Estimated Losses (‘000)    -4969.888        -2925.892       -2461.552        -2413.768

Table 3. Summary of Model Performance on Out-of-Sample Data

7.3 Model Interpretation

Figure 5 highlights the top 10 factors that determine credit card attrition according to our XGBoost model. We can conclude that credit limit and customer loyalty are the main factors determining whether a customer will churn. In Section 8, we elaborate on how the bank can leverage this information and employ marketing efforts to address these areas.

 

Figure 5. Top 10 Feature Importance

7.4 Threshold

As mentioned in Section 7.1, false predictions are costly for banks. The models above use the default threshold of 0.5 to predict classes, which might not be optimal. As a result, we decided to push our model a step further by optimising the threshold of our XGBoost model.

Recall, by its nature, is negatively related to the threshold and only decreases as the threshold increases. Thus, we used the F1 score as a constraint of sorts, finding the point of intersection between the F1 score and recall curves after cross-validation. This equilibrium point is a mathematically derived “sweet spot” between model performance and threshold. The optimal threshold is 0.222 (Appendix, Figure A5).
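A sketch of this threshold search, sweeping candidate thresholds over the predicted churn probabilities and taking the point where the recall and F1 curves meet; the report does this with cross-validated predictions, which is omitted here for brevity:

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Predicted probabilities of churn from the tuned XGBoost model.
proba = xgb_best.predict_proba(X_test)[:, 1]

thresholds = np.linspace(0.01, 0.99, 99)
f1s = np.array([f1_score(y_test, proba >= t) for t in thresholds])
recalls = np.array([recall_score(y_test, proba >= t) for t in thresholds])

# Recall falls monotonically as the threshold rises while F1 initially rises,
# so the two curves cross; pick the threshold where they are closest.
best_t = thresholds[np.argmin(np.abs(f1s - recalls))]

# Final class predictions with the tuned threshold.
y_pred_tuned = (proba >= best_t).astype(int)
```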

After tuning the threshold of our model, we see improvements in the F1 score and estimated losses of the improved XGBoost model. Understandably, lowering the threshold reduced the specificity and accuracy of the model, but we felt the slight decrease in these metrics is compensated by a much higher detection rate of churning customers, which is crucial to the bank.

However, the tuned threshold of 0.222 is relatively low, going against the standard that thresholds should be higher for financial institutions because of the high cost of bad debt, which emphasises the need to reduce false negative cases. Despite this, we believe our tuning method is sound, although more data would be needed to further refine the model parameters. All in all, the selected tuned XGBoost model delivers strong out-of-sample performance and can be confidently deployed by the bank to predict churn.

 

                           XGBoost (Original)   XGBoost (adj. Threshold)
Accuracy                   0.957485             0.961677
Sensitivity                0.808362             0.895470
Specificity                0.988431             0.975416
F1 Score                   0.867290             0.889273
AUC                        0.988486             0.988486
Estimated Losses (‘000)    -2553.620            -1405.520

Table 4. XGBoost Performance Before & After Threshold Tuning

7.5 Limitations

Our XGBoost model is only useful for banks, not for financial service companies in general, as it relies on internal, bank-specific features such as the credit card type. The model also does not reveal the reasons why a specific customer churns; the bank can therefore predict which customers will churn but may not be able to take the right corrective actions.

8. Recommendation

In today’s world, personalized services and marketing have become the norm across diverse industries, heightening expectations within the banking sector. A survey conducted by Bain & Company reveals that respondents who perceive their bank as tailoring their experience are inclined to grant higher customer-loyalty ratings (Du Toit et al., 2023). Personalization builds trust, making customers reluctant to switch to other providers and potentially increasing credit usage as customers stay. This is aligned with our finding that credit limit and customer loyalty are strong determinants of churn (Section 7.3). Hence, it is crucial for banks to accurately predict customer churn and provide targeted offerings to their customers.

Our recommendations centre on providing personalized and timely offerings to the three customer segments we identified: Low-Risk, High-Risk, and Premium.

(1) For low-risk customers, banks could provide birthday or holiday promotions to boost spending, since these customers have low credit card usage. The promotions could be based on their usual credit card spending habits.

(2) For high-risk customers, focusing on financial-literacy advertisements would allow customers to better manage their credit, reducing churn due to high unpaid balances.

(3) For premium customers, privileges such as free access to airport lounges and partnerships with luxury brands offering gifts could be advertised via email once customers spend beyond a certain amount.

Furthermore, as ‘Premium’ customers in the upper income bracket have the highest percentage of attrition, banks should push out tailored promotions to the ‘Premium’ segment to retain them.

9. Future considerations

Additionally, we identified the three most important features: Months_on_book, Credit_Limit, and Education_Level. These features are the most crucial in predicting customer churn. The bank can conduct further research to dive deeper into these variables and break them down, surfacing more of the reasons that cause customers to attrite. This would enable the bank to find more specific and effective measures to retain credit card customers.

10. Conclusion

As illustrated by the data and models, a high churn rate is costly for banks, both in lost profits and in a worsened reputation. To tackle this issue, we tested different classification models to find the best one based on the F1 score and selected the XGBoost model for its F1 score of 0.889273. By identifying customers who are likely to attrite based on these factors, the bank can target these customers and improve customer satisfaction, deterring attrition and hence reducing losses.


