

DBA3803: Predictive Analytics in Business

Credit Card Churn

1. Abstract

In a rapidly changing economic environment, bankers are key stakeholders in the country's economy who must be able to make quick and accurate decisions, especially those concerning their customers. As churn affects the gross revenue of corporate banks, it is crucial to know which customers are likely to attrite and to take action to prevent a higher churn rate. As part of our feature engineering, we identified three customer segments using K-Prototypes clustering to further our analysis. Using selected features from the credit card churn dataset, we tested several models; overall, the XGBoost model performs best at predicting customer churn, with the highest F1 score of 0.878957. To reduce the churn rate, targeted marketing campaigns towards the identified segments would be pivotal.

2. Introduction

In financial services, customer churn is of particular concern to companies such as credit unions, banks, insurance agencies, and credit card companies. Attrition rates can reach as high as 25-30% for these companies (Kaemingk, 2018). Research has shown that corporate banks lose 10-15% of gross revenue annually to customer churn (Karthikeyan et al., 2017). In the highly competitive banking scene, consumers are spoilt for choice for their credit card services, alarming corporate banks that wish to increase their profits. Moreover, lofty transaction fees and poor exchange rates deter users from using their cards (Teope, 2021). Hence, we aim to predict the likelihood of customer cancellation so that banks can take measures to proactively retain customers and remain relevant in this competitive industry.

Our analysis is based on a credit card churn dataset from Kaggle (Goyal, 2021). The dataset includes 10,127 data points and 23 attributes (Appendix, Table A1). The classification problem of predicting credit card cancellation has a binary target variable (Attrition Flag), and four distinct models will be explored: Logistic Regression, Random Forest, Gradient Boosting, and XGBoost. The most suitable model will be chosen by analysing the trade-offs between model performance and the potential business loss from inaccurate predictions.

3. Data Preprocessing

3.1 Removing Data Points

The data is relatively clean, with no missing values. However, it contained interim workings from previous projects, which we removed for the purpose of this report. The affected columns are (1) the Naive Bayes classifier for month 1 and (2) the Naive Bayes classifier for month 2. Next, we removed the “client number” column, as it is merely a customer index and distorted our prediction models. The team also removed the “Unknown” responses in Marital Status and Income Category because, without a clear marital status and income range, the outcome of the prediction would be affected. After data cleaning, 20 features remain for our predictive models, of which 6 are categorical and 14 numerical, with 8,348 rows.

3.3 Train-Test Split

We then split the data into training and test sets with an 80/20 split, where the target variable (y) is the Attrition Flag and the predictor variables (X) comprise all other features. The training and test sets contain 6,678 and 1,670 data points respectively.
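A minimal sketch of this step, assuming the cleaned data sits in a pandas DataFrame `df`; the stratification and `random_state` are our assumptions, since the report only states the 80/20 ratio:

```python
from sklearn.model_selection import train_test_split

# 'df' is the cleaned dataset (8,348 rows); map the textual flag to 1 = attrited, 0 = existing.
y = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
X = df.drop(columns=["Attrition_Flag"])

# 80/20 split, giving roughly 6,678 training and 1,670 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```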

3.4 Data Scaling & One-Hot Encoding

Next, we normalised the dataset after the train/test split using the MinMaxScaler in SKlearn, which rescales each numerical feature individually to a range between zero and one. We chose MinMaxScaler over StandardScaler because of the non-negative nature of our data points.

Normalising the data is important for logistic regression with regularisation, as it assumes that features are on a similar scale. Furthermore, it reduces runtime and supports faster convergence of the optimizer. Scaling also ensures that regularisation penalties are applied fairly across all features.

After scaling the numerical data, we one-hot encoded the categorical variables prior to our predictive analysis. We also tried label-encoding all ordinal categorical features, reasoning that the resulting dimensionality reduction might improve performance. However, this intuition was proved wrong after cross-validation.
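A sketch of the scaling and encoding step, assuming `num_cols` and `cat_cols` hold the names of the 14 numerical and 6 categorical features; the scaler is fitted on the training split only to avoid leakage:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Fit MinMaxScaler on the training numerical features, then apply it to both splits.
scaler = MinMaxScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# One-hot encode the categorical features and align the test columns with training.
X_train = pd.get_dummies(X_train, columns=cat_cols)
X_test = pd.get_dummies(X_test, columns=cat_cols)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```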

4. Exploratory Data Analysis

We began by creating a pair plot of all continuous numerical variables (Appendix, Figure A1), which yielded valuable insights that we explore further in the following three subsections.

4.1 Univariate Analysis

As can be seen from Figure 1, the proportion of customers who churn is 15.9% (1,328), indicating an imbalance in the dataset. It should also be highlighted that, as most customers pay off their credit before the end of the month, many customers have a total revolving balance of zero (Figure 2).

Figure 1. Proportion of Attrited Customers

Figure 2. Distribution of Total Revolving Balance

4.2 Bivariate Analysis

To understand the relationships among the numerical variables in the dataset, we created a heatmap (Appendix, Figure A2) across all numerical variables. It can be observed that there is a perfect correlation between the average open-to-buy and credit limit variables; this is expected, as the open-to-buy amount is derived directly from the credit limit. The high correlation of 0.79 between months on book and age is intuitive, as older customers naturally have a longer relationship with the bank. The team used the chi-square test of independence from the “scipy.stats” package, converted to Cramér’s V, to measure the association between categorical variables. As can be seen in Table A2 in the Appendix, “gender” and “income category” have the highest degree of association, at 0.839.
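A sketch of the Cramér's V calculation used for the categorical associations, built on `scipy.stats.chi2_contingency`; the bias-uncorrected formula is shown, which may differ slightly from the exact variant used for Table A2:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return (chi2 / (n * (min(r, c) - 1))) ** 0.5

# Example (column names as in the Kaggle dataset, assumed here):
# cramers_v(df["Gender"], df["Income_Category"])
```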

4.3 Multivariate Analysis

Taking into consideration the total transaction amount, total transaction count, and attrition flag variables, we created a scatterplot to visualise how these attributes interact. Based on the spread in Figure 3, there appear to be three distinct segments. Moreover, attrited customers generally fall among those with low-to-mid total transaction amounts and transaction counts.

Figure 3. Scatter Plot of Transaction Amount and Transaction Count Based on Attrition Flag

The clear separation of the data suggests that there are hidden customer segments within our dataset. Thus, we decided to leverage clustering techniques to identify these clusters and their unique properties, which are further elaborated in Section 5.

4.4 Feature Engineering

Based on these observations, we initially introduced five interaction-effect variables into our dataset for the feature pairs with the highest correlation (Appendix, Table A2). Any additional feature engineering is covered under the respective methods.

5. Clustering

5.1 Methodology

The algorithm the group decided on to cluster our dataset is K-Prototypes from the ‘kmodes’ package. Unlike K-means clustering, which handles only numerical data, K-Prototypes can account for both categorical and numerical variables, which is more representative of our data.

The data was clustered on a copy of the main dataset, scaled using the “MinMaxScaler”. We decided on scaling because our dataset contains numerical values of different units. Once the K-Prototypes model identified the customer segments, we added a new feature, “cluster”, to our main dataset for use in our predictive models.

Our initial model clustered the data using the target variable “attrition_flag”. Two of the three clusters were either 100% attrited or 100% retained, which severely skewed our predictive model results. Hence, we decided to drop the “attrition_flag” column before running the K-Prototypes model.
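A sketch of the clustering step with the `kmodes` package; `KPrototypes` needs the positions of the categorical columns, and `n_clusters=3` anticipates the elbow analysis described in the next paragraph. The `init` and `random_state` values are illustrative assumptions:

```python
from kmodes.kprototypes import KPrototypes
from sklearn.preprocessing import MinMaxScaler

# Work on a copy with the target dropped so the clusters are not driven by churn itself.
cluster_df = df.drop(columns=["Attrition_Flag"]).copy()
cluster_df[num_cols] = MinMaxScaler().fit_transform(cluster_df[num_cols])

# K-Prototypes takes the integer positions of the categorical columns.
cat_idx = [cluster_df.columns.get_loc(c) for c in cat_cols]

kproto = KPrototypes(n_clusters=3, init="Huang", random_state=42)
# The resulting labels become the new "cluster" feature; for the elbow plot,
# this fit is repeated over a range of n_clusters and kproto.cost_ is plotted.
df["cluster"] = kproto.fit_predict(cluster_df.to_numpy(), categorical=cat_idx)
```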

In reference to the elbow plot in Figure A3 of the Appendix, we decided to break the data into three clusters. Table 1 shows the breakdown of the more significant variables used to identify the properties of each cluster, which we labelled as follows:

5.2 Customer Segments

(1) High-Risk Customers: lowest credit limit but high accumulated unpaid credit (‘Total_Revolving_Bal’). It would be wise to monitor this group of customers closely, as they have accrued a high credit balance with the bank.

(2) Premium Customers: healthiest credit score with a moderate credit balance. These clients fall into the higher income categories and have higher transaction amounts on their credit cards.

(3) Low-Risk Customers: moderate credit score with low usage. This group of customers is punctual with credit billing and poses a low threat of defaulting on credit loans.

Table 1. Parameters and Decision Variable

5.3 Data Exploration

Figure 4. Percentage of Attrited Customers for Different Clusters Across Income Categories

As can be seen from Figure A4 in the Appendix, the ‘High-Risk’ cluster has the highest number of customers who stayed, despite its high median total revolving balance. Interestingly, customers in the ‘Low-Risk’ cluster have the highest churn rate; a possible reason is dissatisfaction with the bank's customer service, prompting a switch to competitors (Kaemingk, 2018). Furthermore, looking at the ‘Premium’ cluster for the income category above $120,000 in Figure 4, 56.67% of the customers attrited. This is higher than in the ‘Low-Risk’ and ‘High-Risk’ clusters, signifying that the bank lost many of its premium customers.

6. Models

The models are trained on the normalised dataset and cross-validated using “GridSearchCV” in the SKlearn package to find the hyperparameters that give the best F1 score. The F1 score was chosen over accuracy as it captures both the precision and the sensitivity of our model on an imbalanced dataset. Because attrited customers are a small minority, accuracy alone will always yield high values. Optimising F1 ensures our model is robust to the class imbalance in our data and prevents it from being too naive.
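A toy illustration of why F1 was preferred over accuracy, with made-up numbers that only mimic the roughly 16% churn rate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# A naive model that predicts "not attrited" for everyone already looks accurate
# on an imbalanced sample, yet detects no churner at all.
y_true = np.array([1] * 16 + [0] * 84)   # toy sample: 16% attrited
y_naive = np.zeros_like(y_true)          # always predict "not attrited"

print(accuracy_score(y_true, y_naive))   # 0.84 -- misleadingly high
print(f1_score(y_true, y_naive))         # 0.0  -- no churner detected
```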

6.0 Models ruled out

Taking into consideration the imbalanced dataset and the number of attributes used in the analysis, which makes it subject to the curse of dimensionality, the team decided not to proceed with a KNN classifier or a Support Vector Classifier. With a large dataset of 8,348 observations after data cleaning, these classifiers showed long run times (an SVC with an RBF kernel has a time complexity of O(n^2 * p)), which may not be practical for banks that want to forecast churn frequently to identify potentially churning customers.

6.1 Logistic Regression

Logistic Regression stands out for its high interpretability, allowing direct interpretation of coefficients that indicate the direction and strength of the relationships between features and the target. L1 and L2 penalties were applied with cross-validation and different solvers to avoid overfitting. As logistic regression assumes linear relationships, we tried a logarithmic transformation of the variables Credit_Limit and Avg_Open_To_Buy, given that these variables have a negative exponential relationship. However, only Credit_Limit improved model performance after cross-validation; for the other variable, the transformation was not able to normalise the distribution. For logistic regression, we fitted 5 folds for each of 30 candidates, totalling 150 fits.
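A sketch of this setup; the log transform targets `Credit_Limit`, and the parameter grid below is only one possible grid that reproduces the reported 30 candidates (5 folds yielding 150 fits), since the actual grid is not listed in the report:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Logarithmic transformation of Credit_Limit (applied before scaling in practice);
# log1p is used to stay defined at zero.
X_train["Credit_Limit"] = np.log1p(X_train["Credit_Limit"])
X_test["Credit_Limit"] = np.log1p(X_test["Credit_Limit"])

# Illustrative 30-candidate grid: L1 only with solvers that support it, L2 with the rest.
param_grid = [
    {"penalty": ["l1"], "solver": ["liblinear", "saga"], "C": [0.01, 0.1, 1, 10, 100]},
    {"penalty": ["l2"], "solver": ["liblinear", "saga", "lbfgs", "newton-cg"],
     "C": [0.01, 0.1, 1, 10, 100]},
]

logreg_cv = GridSearchCV(
    LogisticRegression(max_iter=5000), param_grid, scoring="f1", cv=5, n_jobs=-1
)
logreg_cv.fit(X_train, y_train)
```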

6.2 Random Forest

Random Forests are suited to capturing complex relationships in data, making them appropriate for analysing churn, where factors might interact in a non-linear manner. With the inclusion of diverse control variables, Random Forest can handle the complexity and identify the key features contributing to churn. It can also handle the ordinal nature of card types effectively. However, being an ensemble method, Random Forest sacrifices interpretability for improved predictive performance. We fitted the training data using the random forest method, which searches for the best feature among a random subset of features at each split. We fitted 5 folds for each of 162 candidates, totalling 810 fits. The tuned model has a maximum depth of 15, a minimum split size of 2, and 150 estimators.
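The reported best configuration expressed as a scikit-learn estimator; reading the split size as `min_samples_split=2` is our interpretation:

```python
from sklearn.ensemble import RandomForestClassifier

# Best configuration after the 162-candidate grid search (5 folds, 810 fits).
rf_best = RandomForestClassifier(
    n_estimators=150,      # 150 estimators
    max_depth=15,          # maximum depth of 15
    min_samples_split=2,   # split size of 2, read as the minimum samples required to split
    random_state=42,
)
rf_best.fit(X_train, y_train)
```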

6.3 Gradient Boost

Gradient Boosting Machines sequentially correct errors and handle subtle relationships in the data, capturing non-linearities in the factors that contribute to churn which the logistic regression model cannot capture. Given that gradient boosting works well with imbalanced, structured datasets, it is an efficient model for our data. We fitted 5 folds for each of 324 candidates, totalling 1,620 fits. The tuned model has a maximum depth of 4, a minimum split size of 10, and 200 estimators.
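The corresponding sketch for gradient boosting, again reading the split size as `min_samples_split`:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Best configuration after the 324-candidate grid search (5 folds, 1,620 fits).
gb_best = GradientBoostingClassifier(
    n_estimators=200,       # 200 estimators
    max_depth=4,            # maximum depth of 4
    min_samples_split=10,   # split size of 10, read as the minimum samples required to split
    random_state=42,
)
gb_best.fit(X_train, y_train)
```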

6.4 XGBoost

XGBoost is a more regularised form of gradient boosting, which typically makes it perform better than the standard Gradient Boosting model. We fitted 5 folds for each of 243 candidates, totalling 1,215 fits. The tuned model has a maximum depth of 5 and 300 estimators.
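The reported XGBoost configuration as a sketch; `eval_metric` and `random_state` are our additions for reproducibility:

```python
from xgboost import XGBClassifier

# Best configuration after the 243-candidate grid search (5 folds, 1,215 fits).
xgb_best = XGBClassifier(
    n_estimators=300,        # 300 estimators
    max_depth=5,             # maximum depth of 5
    eval_metric="logloss",   # explicit metric to avoid the default-metric warning
    random_state=42,
)
xgb_best.fit(X_train, y_train)
```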

7. Model Evaluation

7.1 Loss Matrix

The team decided to include another dimension of model performance by analysing the cost of false predictions. The cost of a False Negative (FN) is high, as a card cancellation translates to the loss of the entire customer's lifetime value. On the other hand, the cost of a False Positive (FP) is lower, as a falsely predicted attrition only means the bank allocates retention resources inefficiently. Table 2 shows the full breakdown of the cost matrix.

 

                              Actual Attrited (1)    Actual Not Attrited (0)
Predicted Attrited (1)        $0 (TP)                $500 (FP)
Predicted Not Attrited (0)    $46,284 (FN)           $0 (TN)

Table 2. Loss Matrix in Dollar Terms
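A sketch of how each model's estimated dollar loss can be computed from its confusion matrix using the costs in Table 2:

```python
from sklearn.metrics import confusion_matrix

FP_COST = 500      # retention effort wasted on a customer who would have stayed
FN_COST = 46_284   # lifetime value lost when a churner is missed

def estimated_loss(y_true, y_pred):
    """Total dollar loss implied by Table 2 for a set of binary predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * FP_COST + fn * FN_COST

# Example: estimated_loss(y_test, xgb_best.predict(X_test))
```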

7.2 Model Selection

Using the F1 score and estimated losses as our performance metrics, we selected XGBoost as our predictive algorithm, as it outperformed all other models both after cross-validation and on the test set, as seen in Table 3. A more in-depth analysis of the individual models can be found in Appendix B. Logistic Regression performed poorly on sensitivity and F1 score while still having comparable accuracy. As detecting positives (and therefore sensitivity) is highly important to the bank, this model is not useful for the company. Meanwhile, Random Forest and Gradient Boosting could still be applied, as they deliver good performance on those metrics.

 

                           Log Regression   Random Forest   Gradient Boost   XGBoost
Accuracy                   0.914970         0.950299        0.958084         0.960479
Sensitivity                0.627178         0.780488        0.815331         0.818815
Specificity                0.974693         0.985539        0.987708         0.989877
F1 Score                   0.717131         0.843691        0.869888         0.876866
AUC                        0.948146         0.984231        0.984186         0.989194
Estimated Losses (‘000)    -4969.888        -2925.892       -2461.552        -2413.768

Table 3. Summary of Model Performance on Out-of-Sample Data

7.3 Model Interpretation

Figure 5 highlights the top 10 factors that determine credit card attrition according to our XGBoost model. We can conclude that credit limit and customer loyalty are the main factors determining whether a customer will churn. In Section 8, we elaborate on how the bank can leverage this information and employ marketing efforts to address these areas.

 

Figure 5. Top 10 Feature Importance

7.4 Threshold

As mentioned in Section 7.1, false predictions are costly for banks. The models above use the default threshold of 0.5 to predict classes, which might not be optimal. As a result, we decided to push our model a step further by optimising the threshold of our XGBoost model.

Recall, by its nature, is negatively related to the threshold and only decreases as the threshold increases. Thus, we used the F1 score as a constraint of sorts, finding the point of intersection between the F1 score and recall curves after cross-validation. This equilibrium point is a mathematically derived “sweet spot” between model performance and threshold. The optimal threshold is 0.222 (Appendix, Figure A5).
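A sketch of this threshold search, sweeping candidate thresholds over the predicted churn probabilities and taking the point where the recall and F1 curves meet; the report does this with cross-validated predictions, which is omitted here for brevity:

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Predicted probabilities of churn from the tuned XGBoost model.
proba = xgb_best.predict_proba(X_test)[:, 1]

thresholds = np.linspace(0.01, 0.99, 99)
f1s = np.array([f1_score(y_test, proba >= t) for t in thresholds])
recalls = np.array([recall_score(y_test, proba >= t) for t in thresholds])

# Recall falls monotonically as the threshold rises while F1 initially rises,
# so the two curves cross; pick the threshold where they are closest.
best_t = thresholds[np.argmin(np.abs(f1s - recalls))]

# Final class predictions with the tuned threshold.
y_pred_tuned = (proba >= best_t).astype(int)
```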

After tuning the threshold of our model, we see improvements in the F1 score and estimated losses of the improved XGBoost model. Understandably, lowering the threshold reduced the specificity and accuracy of the model, but we felt the slight decrease in these metrics is compensated by a much higher detection rate of churning customers, which is crucial to the bank.

However, the tuned threshold of 0.222 is relatively low, going against the standard that thresholds should be higher for financial institutions because of the high cost of bad debt, which emphasises the need to reduce false negative cases. Despite this, we believe our tuning method is sound, although more data would be needed to further refine the model parameters. All in all, the selected tuned XGBoost model delivers strong out-of-sample performance and can be confidently deployed by the bank to predict churn.

 

                           XGBoost (Original)   XGBoost (adj. Threshold)
Accuracy                   0.957485             0.961677
Sensitivity                0.808362             0.895470
Specificity                0.988431             0.975416
F1 Score                   0.867290             0.889273
AUC                        0.988486             0.988486
Estimated Losses (‘000)    -2553.620            -1405.520

Table 4. XGBoost Performance Before & After Threshold Tuning

7.5 Limitations

Our XGBoost model is only useful for banks, not for financial service companies in general, as it relies on internal, bank-specific features such as the credit card type. The model also does not reveal the reasons why a specific customer churns; the bank can therefore predict which customers will churn but may not be able to take the right corrective actions.

8. Recommendation

In today’s world, personalized services and marketing have become the norm across diverse industries, heightening expectations within the banking sector. A survey conducted by Bain & Company reveals that respondents who perceive their bank as tailoring their experience are inclined to grant higher customer-loyalty ratings (Du Toit et al., 2023). Personalization builds trust, making customers reluctant to switch to other providers and potentially increasing credit usage as customers stay. This is aligned with our finding that credit limit and customer loyalty are strong determinants of churn (Section 7.3). Hence, it is crucial for banks to accurately predict customer churn and provide targeted offerings to their customers.

Our recommendations centre on providing personalized and timely offerings to the three customer segments we identified: Low-Risk, High-Risk, and Premium.

(1) For low-risk customers, banks could provide birthday or holiday promotions to boost spending, since these customers have low credit card usage. The promotions could be based on their usual credit card spending habits.

(2) For high-risk customers, focusing on financial-literacy advertisements would allow customers to better manage their credit, reducing churn due to high unpaid balances.

(3) For premium customers, privileges such as free access to airport lounges and partnerships with luxury brands offering gifts could be advertised via email once customers spend beyond a certain amount.

Furthermore, as ‘Premium’ customers in the upper income bracket have the highest percentage of attrition, banks should push out tailored promotions to the ‘Premium’ segment to retain them.

9. Future considerations

Additionally, we identified the three most important features: Months_on_book, Credit_Limit, and Education_Level. These features are the most crucial in predicting customer churn. The bank can conduct further research to dive deeper into these variables and break them down, surfacing more of the reasons that cause customers to attrite. This would enable the bank to find more specific and effective measures to retain credit card customers.

10. Conclusion

As illustrated by the data and models, a high churn rate is costly for banks, both in lost profits and in a worsened reputation. To tackle this issue, we tested different classification models to find the best one based on the F1 score and selected the XGBoost model for its F1 score of 0.889273. By identifying customers who are likely to attrite based on these factors, the bank can target these customers and improve customer satisfaction, deterring attrition and hence reducing losses.


