Proceeding Paper

Application of XGBoost Algorithm to Develop Mutual Fund Marketing Prediction Model for Banks’ Wealth Management †

Graduate Institute of Global Business and Strategy, National Taiwan Normal University, Taipei 106, Taiwan
Presented at the 2024 IEEE 7th International Conference on Knowledge Innovation and Invention, Nagoya, Japan, 16–18 August 2024.
Eng. Proc. 2025, 89(1), 3; https://doi.org/10.3390/engproc2025089003
Published: 21 February 2025

Abstract

Competition in Taiwan’s banking industry is becoming fierce, and banks’ traditional income based on interest rates is insufficient to support their growth. Therefore, banks are eager to expand their wealth management business to increase profits. The fee income from the sale of mutual funds is one of the major sources of banks’ wealth management business. The challenge is how to identify the right customers and contact them effectively. It is therefore necessary to develop classification prediction models that evaluate customers’ potential to buy the mutual fund products sold by commercial banks, so that marketing resources can be deployed on those customers to increase banks’ profits. Recently, the eXtreme Gradient Boosting (XGBoost) algorithm has been widely used for classification tasks. In this study, a mutual fund marketing prediction model is therefore developed with the XGBoost algorithm based on a commercial bank’s data. The results show that whether a customer has an unsecured loan, a customer’s amount of assets in the bank, the number of months for transactions, the place of residence, and whether the bank is the main bank for the total amount of credit card bills in the past six months are the top five factors for the models, providing valuable information for effective wealth management and marketing.

1. Introduction

Competition in Taiwan’s banking industry is becoming intense because 70 banks are serving 23 million people in Taiwan [1]. Conventional income from interest rate spreads between deposits and loans is no longer sufficient to sustain growth for Taiwan’s banks. Therefore, banks are eager to expand their wealth management business to increase profits. The fee income from the sale of mutual funds and after-sale asset management services is one of the major sources of banks’ wealth management business. However, this fee income is obtained only after banks have successfully sold the mutual funds to customers. Therefore, banks must identify the right customers and contact them effectively to realize the fee income.
Direct marketing models are used to choose the right customers [2]. The factors that explain whether a bank’s customers buy mutual fund products recommended by the bank are the model’s input variables. The classification algorithm is important for studying the relationship between the target variable (whether or not the customer buys the mutual fund) and the explanatory factors. Marketing resources can then be precisely allocated to customers with a high potential to buy mutual funds. As a result, banks can earn fee income from the sale of mutual funds and thereby increase their profits.
In this research, a classification prediction model is developed for banks’ marketing decisions. The model evaluates their customers’ potential to buy mutual fund products and then deploys marketing resources to these customers.
Recently, the extreme gradient boosting (XGBoost) algorithm has been widely used in credit scoring, bankruptcy prediction, and fraud detection in the banking sector owing to its exceptional classification performance [3,4]. By recognizing the potential of the XGBoost algorithm and applying it to develop mutual fund marketing prediction models based on a commercial bank’s data, the application and effectiveness of the algorithm in the banking industry are validated.
In this research, the following are explored:
  • Factors to develop a direct marketing model.
  • Effectiveness of the XGBoost algorithm in building a mutual fund marketing model.
  • Comparison of the XGBoost algorithm with the other algorithms.
The organization of this paper is as follows: After introducing the research background and purpose in Section 1, Section 2 reviews the related literature. Section 3 describes the data set and research process. Section 4 presents the results and discussion. Finally, concluding remarks are presented in Section 5.

2. Literature Review

2.1. Direct Marketing of Financial Commodities

Past research highlighted the importance of direct marketing for businesses to engage customers and drive sales [2,3]. Reference [2] adopted a systems perspective to understand how different models interact within the broader marketing system and proposed that quantitative models help optimize direct marketing efforts. The quantitative models of direct marketing include statistical methods, such as logistic regression, and machine learning techniques, including support vector machines (SVMs), artificial neural networks (ANNs), decision trees, and related ensemble algorithms (e.g., random forest (RF), which constructs many decision trees at training time and outputs the mode of their predicted classes). These algorithms are used to predict the likelihood of a customer buying a product or service.
Commercial banks are eager to apply financial technology (Fintech) and marketing technology (Martech) to boost their revenues and profits [3]. Reference [4] found that marketing activities have positive impacts on commercial banks’ profits.

2.2. XGBoost Algorithm

The XGBoost model, a tree-based gradient boosting ensemble, is renowned for its high efficiency and predictive accuracy. Its applications in banking include credit card fraud detection, bankruptcy prediction, and personal credit risk assessment, underscoring its significance and potential. Reference [5] evaluated the performance of various algorithms for credit card fraud detection on two data sets (a European data set and a German data set), including SVMs, logistic regression, RFs, K-nearest neighbors, J48 decision trees, ANNs, and boosting frameworks for decision trees (such as XGBoost). The results show that XGBoost cannot outperform ANNs and RFs based on the F1 measure. The XGBoost algorithm with a feature importance selection mechanism was applied to construct a bankruptcy prediction model [6], and its performance was compared with other algorithms and feature selection approaches. The feature importance selection mechanism in XGBoost calculates how much each feature contributes to the model’s predictions. When forecast accuracy was evaluated using the area under the curve (AUC), XGBoost demonstrated superior discrimination power compared with other feature selection techniques, including stepwise discriminant analysis, stepwise logistic regression, and partial least squares discriminant analysis. Reference [7] used XGBoost to explain bank default, suggesting that lower values of retained earnings to average equity, pretax return on assets, and total risk-based capital ratio are linked to a higher risk of bank failure. Additionally, an excessively high yield on earning assets increases banks’ likelihood of financial distress.
Logistic regression is applied to select features, and the XGBoost algorithm is used to construct a classification model for assessing personal credit risk [8]. Decision tree and K-nearest neighbor algorithms were used as benchmarks for comparison. Their results demonstrated that XGBoost performed well in terms of AUC, accuracy, and Type II error.
Past research on the application of the XGBoost algorithm has focused on credit evaluation, whether for individual or corporate customers. Few studies have reported the algorithm’s performance on marketing decisions in the banking industry. In this study, the algorithm’s performance and feasibility for banks’ marketing decisions were evaluated using a real data set from a commercial bank.

3. Data Set and Research Process

In this research, the XGBoost Python package was used to develop a classification model.

3.1. Data Set

The data set was obtained from a commercial bank. It contained 156,666 records: 14,075 records for customers who bought mutual funds within three months and 142,591 records for customers who did not. A 1:1 undersampling approach was used to avoid the problem of rare events being ignored by the training model [9]. As a result, 28,150 records remained and were split into a training set (21,112 records) and a testing set (7038 records).
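The sampling and split described above can be sketched as follows. Since the bank’s data are not public, only the labels are simulated here, and the random seed is an arbitrary assumption:

```python
import numpy as np

# Hypothetical sketch of the 1:1 undersampling and 75/25 split described above.
rng = np.random.default_rng(42)
y = np.r_[np.ones(14_075, dtype=int), np.zeros(142_591, dtype=int)]

pos = np.flatnonzero(y == 1)                 # all 14,075 buyers
neg = rng.choice(np.flatnonzero(y == 0),     # undersample non-buyers 1:1
                 size=len(pos), replace=False)
idx = rng.permutation(np.r_[pos, neg])       # 28,150 balanced records

n_train = int(len(idx) * 0.75)               # 75% training split
train_idx, test_idx = idx[:n_train], idx[n_train:]
print(len(idx), len(train_idx), len(test_idx))   # 28150 21112 7038
```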
The summary of the data set is shown in Table 1. The target variable is a binary variable that defines whether a customer buys a mutual fund within three months. Four facets were used to determine whether a customer would buy mutual funds: the customer’s transactions with the bank, the customer’s demographic profile, the channel usage profile, and the holdings of financial commodities. For example, Reference [10] identified that e-banking channels could enhance banks’ market shares. Thus, features reflecting channel preference were included in this study.
A total of 36 features were determined to describe every data point (i.e., every potential customer and the customer’s outcome), as shown in Table 2. After screening out features with more than 20% missing values, 32 features remained. Then, categorical variables were transformed into dummy variables. Finally, 56 variables were used to develop the XGBoost model.
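A minimal sketch of this preprocessing, using hypothetical values for three of the features in Table 2 (the real data are not public):

```python
import pandas as pd

# Hypothetical sketch: drop features with more than 20% missing values, then
# transform categorical variables into dummy variables.
df = pd.DataFrame({
    "add_now": ["NANTOU", "MIAOLI", "NANTOU", "MIAOLI"],  # nominal feature
    "asset": [1_200_000, None, None, None],               # 75% missing: dropped
    "age": [34, 51, 42, 29],                              # interval feature
})

keep = df.columns[df.isna().mean() <= 0.20]               # screen by missing ratio
screened = df[keep]
encoded = pd.get_dummies(screened, columns=["add_now"])   # dummy variables
print(list(encoded.columns))
```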

3.2. Research Process

The XGBoost algorithm was used to develop the classification model. Using the trained XGBoost model, the feature importance was determined to explain the impact of each feature on the target variable (whether a customer buys the mutual fund within three months). The testing data set was used for model evaluation. Finally, to assess the XGBoost model’s performance, logistic regression, decision tree, support vector machine, and neural network algorithms were applied for model comparison. Figure 1 shows the research process of this research.

3.3. XGBoost

XGBoost operates in a series of steps to enhance predictive accuracy and efficiency. The input data and parameters for the algorithm are as follows:
Input data: Explanatory feature matrix (X) and target variable (y).

3.3.1. Parameters

XGBoost has many parameters for optimizing the model’s performance. These parameters are categorized into three groups: general parameters, booster parameters, and learning task parameters. They offer the flexibility to fine-tune the XGBoost model for various types of data and tasks to achieve optimal performance and generalization. The parameter-tuning strategy followed the web resource [11].
The parameters in this research included the following:
  • booster specifies the booster type, including tree-based models and linear models. Tree-based models were used as the booster in this research.
  • eta (learning_rate) is the step-size shrinkage used to prevent overfitting. Its value ranges between 0 and 1, and it controls the weight of each new tree’s contribution.
  • lambda (reg_lambda) specifies the L2 regularization term on the leaf weights to avoid overfitting.
  • objective specifies the learning task and the corresponding loss function. It was set to “binary:logistic” for logistic regression on the binary classification task in this research.
  • eval_metric specifies the evaluation metric for validation data. In this research, it was set to “error”, which uses the binary classification error rate.
  • n_estimators specifies the number of boosting rounds (trees to grow).
  • early_stopping_rounds stops training when the evaluation metric has not improved for the specified number of rounds.
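The settings above can be collected into a parameter dictionary in the style used by the XGBoost Python package. The tuned values used in this study are not reported, so the numbers below are placeholder assumptions:

```python
# Illustrative parameter set for the options listed above; the actual tuned
# values are not reported in the paper, so these numbers are assumptions.
params = {
    "booster": "gbtree",             # tree-based booster
    "eta": 0.1,                      # learning_rate: shrinks each tree's contribution
    "lambda": 1.0,                   # reg_lambda: L2 regularization on leaf weights
    "objective": "binary:logistic",  # logistic loss for binary classification
    "eval_metric": "error",          # binary classification error rate
}
num_boost_round = 500                # n_estimators: boosting rounds
early_stopping_rounds = 20           # stop when "error" stalls for 20 rounds
print(params["objective"])
```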

3.3.2. Algorithm

The algorithm works in the following steps:
  1. Initialization: The process starts with an initial prediction for each data point, often using the base class probabilities. For binary classification, the initial probability of class 1 is typically set to 0.5.
  2. Iterative Boosting: The residuals are computed for each data point. In this research’s classification task, the residuals were defined as the difference between the actual class labels (0 or 1) and the predicted probabilities. The gradient of the loss function is then calculated with respect to the current predictions; for the logistic loss, this involves computing the derivative of the logistic loss function.
  3. Model Training on Gradients: A new decision tree is trained on these gradients. The goal of the new tree is to predict the gradient of the loss function, effectively focusing on the errors made by the current iteration’s model.
  4. Weighted Summation: Predictions are updated by adding the weighted output of the new tree to the current predictions. The learning rate controls the weight of the new tree’s contribution. For the logistic objective, the model adds the predicted log odds (logits) from the new tree to the existing log odds and then applies the logistic function to obtain updated probabilities.
  5. Regularization: XGBoost applies regularization to penalize the complexity of the trees (e.g., the number of leaves and the leaf weights), prevent overfitting, and ensure that the model generalizes well to new data.
  6. Tree Pruning: The algorithm grows each tree to a specified maximum depth and then prunes it by removing branches that do not significantly improve the loss function. This further helps prevent overfitting.
  7. Iteration: Steps 2–6 are repeated to add new trees until the specified number of trees is reached or until further improvements in the loss function become minimal.
  8. Final Prediction: For classification, the final prediction is the probability of class 1, obtained by applying the logistic function to the sum of the log odds predicted by all trees. For binary classification, a threshold (commonly 0.5) is applied to assign class labels.
By iteratively adding trees that focus on the residuals (errors) of the current model, XGBoost enhances the model’s ability to correctly classify data points, leading to high accuracy and robustness.
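The boosting loop above can be sketched on toy one-dimensional data. Each round fits a depth-1 “stump” to the gradient of the logistic loss, and the learning rate scales the new tree’s contribution. This is an illustrative simplification, not the paper’s model, and it omits XGBoost’s regularization and pruning:

```python
import numpy as np

# Toy gradient boosting with logistic loss and depth-1 regression stumps.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = (X > 0.2).astype(float)

def stump_fit(x, g):
    """Fit a depth-1 regression tree to gradients g: choose the threshold that
    minimizes squared error, returning (threshold, left_value, right_value)."""
    best = (np.inf, 0.0, g.mean(), g.mean())
    for t in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        left, right = g[x <= t], g[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1], best[2], best[3]

logits = np.zeros_like(y)                      # step 1: initial prediction, p = 0.5
eta = 0.3                                      # learning rate
for _ in range(50):                            # steps 2-7: iterative boosting
    p = 1.0 / (1.0 + np.exp(-logits))          # current probabilities
    grad = y - p                               # residual = negative gradient of log loss
    t, lv, rv = stump_fit(X, grad)             # step 3: train a tree on the gradients
    logits += eta * np.where(X <= t, lv, rv)   # step 4: weighted summation

p = 1.0 / (1.0 + np.exp(-logits))              # step 8: logistic of summed log odds
acc = ((p > 0.5).astype(float) == y).mean()
print(acc)
```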

3.3.3. Feature Importance

In XGBoost, feature importance is measured using gain, cover, frequency, SHapley Additive exPlanations (SHAP) values, and permutation importance. These methods are used to understand which features contribute the most to the model’s predictions.
In this research, the gain was measured as the improvement in accuracy brought by a feature to the branches. It measures how much each feature contributes to reducing the loss function (error) in the model. The higher the gain, the more important the feature is considered.
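As a toy illustration of gain, simplified to a squared-error criterion rather than XGBoost’s exact regularized formula, the improvement contributed by one split can be computed directly:

```python
import numpy as np

# Toy illustration of split "gain": the reduction in (squared-error) loss from
# one split. XGBoost accumulates such gains per feature and normalizes them to
# rank features; its exact formula also involves regularization terms.
g = np.array([0.9, 0.8, -0.7, -0.6])      # gradients at a tree node
left, right = g[:2], g[2:]                # candidate split of the node

def sse(v):
    return ((v - v.mean()) ** 2).sum()

gain = sse(g) - (sse(left) + sse(right))  # loss before minus loss after the split
print(round(gain, 3))                     # 2.25
```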

4. Results and Discussion

In the data exploration phase, interesting relationships between the explanatory features and the target variable were observed. For numerical features, a statistical F-test was conducted on the two groups of data points (y = 0 and y = 1). The most significant feature was “asset” (F-value = 6923.65, p < 0.01): the y = 1 group had a higher asset level than the y = 0 group, revealing that customers with more assets were more likely to buy mutual funds. The second most significant feature was “unsecured_loan” (F-value = 3598.33, p < 0.01): the y = 0 group had a higher unsecured loan level than the y = 1 group, showing that customers with more unsecured loans were less likely to buy mutual funds. The third most significant feature was “secured_loan” (F-value = 893.70, p < 0.01): the y = 1 group had a higher secured loan level than the y = 0 group, revealing that customers with more secured loans were more likely to buy mutual funds. As these customers have assets as collateral, they have the financial resources to purchase funds for financial purposes.
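The two-group F-test used above is equivalent to a one-way ANOVA with two groups and can be sketched on synthetic “asset” values (the bank’s actual data are not public, so the group means and sizes below are assumptions):

```python
import numpy as np

# One-way ANOVA F statistic for two groups, on synthetic data.
rng = np.random.default_rng(1)
buy = rng.normal(2.0, 1.0, 500)      # y = 1 group: higher asset level
nobuy = rng.normal(1.0, 1.0, 500)    # y = 0 group

grand = np.concatenate([buy, nobuy]).mean()
ss_between = 500 * ((buy.mean() - grand) ** 2 + (nobuy.mean() - grand) ** 2)
ss_within = ((buy - buy.mean()) ** 2).sum() + ((nobuy - nobuy.mean()) ** 2).sum()
F = (ss_between / 1) / (ss_within / (1000 - 2))   # df_between = 1, df_within = 998
print(F)   # a large F indicates the group means differ significantly
```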
Table 3 shows the top 15 important features. Whether a customer has an unsecured loan, the customer’s amount of assets in the bank, the number of months the customer has dealt with the bank, the place of residence, and whether the bank is the main bank for the total amount of credit card bills in the past six months are the top five factors for the models, providing crucial insights for effective wealth management marketing. The first two important features calculated by the XGBoost model also echo the results of the F-test mentioned earlier.
Figure 2 shows the confusion matrix of the XGBoost model. The accuracy was 0.6371, the precision was 0.5956, the recall rate was 0.8756, and the F1 score (the harmonic mean of precision and recall) was 0.7090. The classification performance was slightly better than the other statistical methods and machine learning models (Table 4).
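As a quick consistency check, the reported F1 score is indeed the harmonic mean of the reported precision and recall:

```python
# F1 as the harmonic mean of the precision and recall reported above.
precision, recall = 0.5956, 0.8756
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))   # 0.709, consistent with the reported 0.7090
```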
Lift indicates how much better a model predicts positive outcomes than random guessing. It is the ratio of the cumulative gain (true positives captured) to the expected gain if no model is used. Based on this metric, the top 10% lift of the XGBoost model was 1.6, which was also slightly better than that of the other models. Figure 3 shows the lift curve of the XGBoost model.
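The top-10% lift defined above can be sketched as the positive rate among the 10% of customers with the highest predicted scores, divided by the overall positive rate. Toy scores are used here; they are not the paper’s predictions:

```python
import numpy as np

# Top-decile lift on synthetic scores and labels.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 1000)            # true labels, base rate ~0.5
scores = 0.3 * y + rng.random(1000)     # scores mildly correlated with y

order = np.argsort(-scores)             # rank customers, highest score first
top = y[order][:100]                    # top 10% of 1,000 customers
lift = top.mean() / y.mean()
print(lift)                             # > 1 means better than random targeting
```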

5. Conclusions

Past research on direct marketing has mainly focused on the retailing sector and has seldom discussed financial commodities in the banking industry. This research’s findings provide information to support direct marketing decisions in the banking industry. The five most influential factors in the model are whether a customer has an unsecured loan, the amount of assets the customer holds in the bank, the duration of the customer’s relationship with the bank, the customer’s place of residence, and whether the bank is the primary institution for the customer’s total credit card bill payments over the past six months. These factors are important for formulating effective wealth management marketing strategies. The XGBoost algorithm’s classification performance in judging whether a potential customer buys mutual funds is slightly better than that of the other models.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from the case bank and are available from the author with the permission of the case bank.

Acknowledgments

The author appreciates the case bank for providing data and domain expertise for this research.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. WLIST.pdf. Available online: https://www.cbc.gov.tw/public/data/EBOOKXLS/WLIST.pdf (accessed on 29 July 2024).
  2. Olson, D.L.; Chae, B. Direct marketing decision support through predictive customer response modeling. Decis. Support Syst. 2012, 54, 443–451.
  3. Shih, J.-Y.; Chen, W.-H.; Chang, Y.-J. Developing target marketing models for personal loans. In Proceedings of the 2014 IEEE International Conference on Industrial Engineering and Engineering Management, Selangor, Malaysia, 9–12 December 2014; pp. 1347–1351.
  4. Chen, K. The effects of marketing on commercial banks’ operating businesses and profitability: Evidence from US bank holding companies. Int. J. Bank Mark. 2020, 38, 1059–1079.
  5. Sontakke, A.; Yewale, M.; Zambare, S.; Tendulkar, S.; Chaudhari, A. Credit Card Fraud Detection Using Machine Learning and Predictive Models: A Comparative Study. In Hybrid Intelligent Systems; Abraham, A., Siarry, P., Piuri, V., Gandhi, N., Casalino, G., Castillo, O., Hung, P., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 171–180.
  6. Jabeur, S.B.; Stef, N.; Carmona, P. Bankruptcy Prediction using the XGBoost Algorithm and Variable Importance Feature Engineering. Comput. Econ. 2023, 61, 715–741.
  7. Carmona, P.; Climent, F.; Momparler, A. Predicting failure in the U.S. banking industry: An extreme gradient boosting approach. Int. Rev. Econ. Financ. 2019, 61, 304–323.
  8. Wang, K.; Li, M.; Cheng, J.; Zhou, X.; Li, G. Research on personal credit risk evaluation based on XGBoost. Procedia Comput. Sci. 2022, 199, 1128–1135.
  9. Ling, C.X.; Li, C. Data Mining for Direct Marketing: Problems and Solutions. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 27–31 August 1998.
  10. Nazaritehrani, A.; Mashali, B. Development of E-banking channels and market share in developing countries. Financ. Innov. 2020, 6, 12.
  11. Notes on Parameter Tuning—XGBoost 2.1.1 Documentation. Available online: https://xgboost.readthedocs.io/en/stable/tutorials/param_tuning.html (accessed on 11 August 2024).
Figure 1. Research process.
Figure 2. Confusion matrix of XGBoost model.
Figure 3. Lift of XGBoost model.
Table 1. Summary of data set.

Description | Number of Records
Customers who do not buy mutual funds within three months (Y = 0) | 142,591
Customers who buy mutual funds within three months (Y = 1) | 14,075
1:1 undersampling (14,075 × 2) | 28,150
Training set (75% of data points) | 21,112
Test set (25% of data points) | 7038
Table 2. Definition of features (input variables).

Feature Name | Definition | Data Type
add_now | Place of residence | NOMINAL
age | Customer’s age | INTERVAL
allprice_6mon | Total amount of credit card bills in the past six months | INTERVAL
allprice_6mon_bank | Whether the total amount of credit card bills in the past six months was incurred in this bank | BINARY
asset | Customer’s assets in the bank | INTERVAL
card | Whether the customer has a debit card in the bank | BINARY
chi_num | Number of customers’ children | INTERVAL
credit | Whether the customer is a new credit card user within three months in the bank | BINARY
credit_old | Whether the customer has a credit card in the bank | BINARY
cus_mon | Number of months of transactions with this bank | INTERVAL
degree | Customer’s education level | ORDINAL
delete_insurance_creditcard | Whether the customer used a credit card to pay insurance premiums in the past year | BINARY
delete_insurance_deposit | Whether premiums are withheld from the customer’s deposits | BINARY
ebank_mon | The customer’s online banking usage period | INTERVAL
f_rate_atm | ATM usage ratio by customers in the past year—foreign currency deposits | INTERVAL
f_rate_counter | Customer usage ratio of over-the-counter counters in the past year—foreign currency deposits | INTERVAL
f_rate_ebank | Internet banking usage ratio in the past year—foreign currency deposits | INTERVAL
f_rate_phone | Rate of voice calls used by customers in the past year—foreign currency deposits | INTERVAL
foreign_currency_loan | Whether the customer has a foreign currency loan | BINARY
gender | Gender | BINARY
income | Income | INTERVAL
insurance | Whether the customer has insurance products | BINARY
job_g | Customer’s occupation | NOMINAL
liability | Customer’s liabilities within the bank | INTERVAL
marry | Customer’s marital status | BINARY
prefer_f | Preferred channel in the past year—foreign currency deposits | NOMINAL
prefer_tw | Preferred channel in the past year—Taiwan dollar deposits | NOMINAL
pro_num | Total number of items held by customers | INTERVAL
sec_1monprice_1yr | Average monthly security transaction amount in the past year | INTERVAL
sec_freq_1yr | Number of security transactions in the past year | INTERVAL
secured_loan | Whether the customer has a secured loan | BINARY
tw_rate_atm | ATM usage ratio in the past year—New Taiwan dollar deposits | INTERVAL
tw_rate_counter | Ratio of customers using over-the-counter counters in the past year—New Taiwan dollar deposits | INTERVAL
tw_rate_ebank | Internet banking usage ratio in the past year—New Taiwan dollar deposits | INTERVAL
tw_rate_phone | Rate of voice calls used by customers in the past year—New Taiwan dollar deposits | INTERVAL
unsecured_loan | Whether the customer has an unsecured loan | BINARY
Table 3. Feature importance.

Features | Feature Importance
unsecured_loan | 0.5033
asset | 0.0728
cus_mon | 0.0189
add_now_NANTOU | 0.0170
add_now_MIAOLI | 0.0142
add_now_PINGTUNG | 0.0136
add_now_KAOHSIUNG CITY | 0.0135
foreign_currency_loan | 0.0123
gender_F | 0.0122
prefer_f_3 | 0.0122
marry | 0.0117
prefer_f_2 | 0.0116
allprice_6mon_bank | 0.0114
liability | 0.0108
delete_insurance_deposit | 0.0108
Table 4. Model performance on test set.

Model | Accuracy Rate | F1 | Lift (Top 10%)
Logistic Regression | 0.6279 | 0.70 | 1.5
Decision Tree | 0.6367 | 0.71 | 1.6
XGBoost | 0.6371 | 0.71 | 1.6
Neural Networks | 0.6294 | 0.70 | 1.5
Support Vector Machines | 0.6265 | 0.70 | 1.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Shih, J.-Y. Application of XGBoost Algorithm to Develop Mutual Fund Marketing Prediction Model for Banks’ Wealth Management. Eng. Proc. 2025, 89, 3. https://doi.org/10.3390/engproc2025089003
