Proceeding Paper

Application of XGBoost Algorithm to Develop Mutual Fund Marketing Prediction Model for Banks’ Wealth Management †

Graduate Institute of Global Business and Strategy, National Taiwan Normal University, Taipei 106, Taiwan
Presented at the 2024 IEEE 7th International Conference on Knowledge Innovation and Invention, Nagoya, Japan, 16–18 August 2024.
Eng. Proc. 2025, 89(1), 3; https://doi.org/10.3390/engproc2025089003
Published: 21 February 2025

Abstract

Competition in Taiwan’s banking industry is becoming fierce, and banks’ traditional income based on interest rates is insufficient to support their growth. Therefore, banks are eager to expand their wealth management business to increase profits. The fee income from the sale of mutual funds is one of the major sources of banks’ wealth management business. The challenge is how to identify the right customers and contact them effectively. It is therefore necessary to develop classification prediction models that evaluate customers’ potential to buy the mutual fund products sold by commercial banks, so that marketing resources can be deployed on those customers to increase banks’ profits. Recently, the eXtreme Gradient Boosting (XGBoost) algorithm has been widely used for classification tasks. In this study, a mutual fund marketing prediction model is therefore developed with the XGBoost algorithm based on a commercial bank’s data. The results show that whether a customer has an unsecured loan, a customer’s amount of assets in the bank, the number of months for transactions, the place of residence, and whether the bank is the main bank for the total amount of credit card bills in the past six months are the top five factors for the models, providing valuable information for effective wealth management and marketing.

1. Introduction

Competition in Taiwan’s banking industry is becoming intense because 70 banks are serving 23 million people in Taiwan [1]. Conventional income from interest rate spreads between deposits and loans is no longer sufficient to sustain growth for Taiwan’s banks. Therefore, banks are eager to expand their wealth management business to increase profits. The fee income from the sale of mutual funds and after-sale asset management services is one of the major sources of banks’ wealth management business. However, this fee income is obtained only after banks have successfully sold the mutual funds to customers. Therefore, banks must identify the right customers and contact them effectively to realize the fee income.
Direct marketing models are used to choose the right customers [2]. The factors that explain whether a bank’s customers buy mutual fund products recommended by the bank are the model’s input variables. The classification algorithm is important for studying the relationship between the target variable (whether or not the customer buys the mutual fund) and the explanatory factors. Marketing resources can then be precisely allocated to customers with a high potential to buy mutual funds. As a result, banks can earn fee income from the sale of mutual funds and thereby increase their profits.
In this research, a classification prediction model is developed for banks’ marketing decisions. The model evaluates their customers’ potential to buy mutual fund products and then deploys marketing resources to these customers.
Recently, the extreme gradient boosting (XGBoost) algorithm has been widely used in credit scoring, bankruptcy prediction, and fraud detection in the banking sector owing to its exceptional classification performance [3,4]. By recognizing the potential of the XGBoost algorithm and applying it to develop mutual fund marketing prediction models based on a commercial bank’s data, the application and effectiveness of the algorithm in the banking industry are validated.
In this research, the following are explored:
  • Factors to develop a direct marketing model.
  • Effectiveness of the XGBoost algorithm in building a mutual fund marketing model.
  • Comparison of the XGBoost algorithm with the other algorithms.
The organization of this paper is as follows: After introducing the research background and purpose in Section 1, Section 2 reviews the related literature. Section 3 describes the data set and research process. Section 4 presents the results and discussion. Finally, concluding remarks are presented in Section 5.

2. Literature Review

2.1. Direct Marketing of Financial Commodities

Past research highlighted the importance of direct marketing for businesses to engage customers and drive sales [2,3]. Reference [2] adopted a systems perspective to understand how different models interact within the broader marketing system and proposed that quantitative models help optimize direct marketing efforts. The quantitative models of direct marketing include statistical methods, such as logistic regression, and machine learning techniques, including support vector machines (SVMs), artificial neural networks (ANNs), decision trees, and related ensemble algorithms (e.g., random forest (RF), which constructs many decision trees at training time and outputs the mode of their predicted classes). These algorithms are used to predict the likelihood of a customer buying a product or service.
Commercial banks are eager to apply financial technology (Fintech) and marketing technology (Martech) to boost their revenues and profits [3]. Reference [4] found that marketing activities have positive impacts on commercial banks’ profits.

2.2. XGBoost Algorithm

The XGBoost model, a tree-based gradient boosting ensemble, is renowned for its high efficiency and predictive accuracy. Its applications in banking include credit card fraud detection, bankruptcy prediction, and personal credit risk assessment, underscoring its significance and potential. Reference [5] evaluated the performance of various algorithms for credit card fraud detection on two data sets (a European data set and a German data set), including SVMs, logistic regression, RFs, K-nearest neighbors, J48 decision trees, ANNs, and boosting frameworks for decision trees (such as XGBoost). The results show that XGBoost cannot outperform ANNs and RFs based on the F1 measure. The XGBoost algorithm with a feature importance selection mechanism was applied to construct a bankruptcy prediction model [6], and its performance was compared with other algorithms and feature selection approaches. The feature importance selection mechanism in XGBoost calculates how much each feature contributes to the model’s predictions. When forecast accuracy was evaluated using the area under the curve (AUC), XGBoost demonstrated superior discrimination power compared with other feature selection techniques, including stepwise discriminant analysis, stepwise logistic regression, and partial least squares discriminant analysis. Reference [7] used XGBoost to explain bank default, suggesting that lower values of retained earnings to average equity, pretax return on assets, and total risk-based capital ratio are linked to a higher risk of bank failure. Additionally, an excessively high yield on earning assets increases banks’ likelihood of financial distress.
Logistic regression is applied to select features, and the XGBoost algorithm is used to construct a classification model for assessing personal credit risk [8]. Decision tree and K-nearest neighbor algorithms were used as benchmarks for comparison. Their results demonstrated that XGBoost performed well in terms of AUC, accuracy, and Type II error.
Past research on the application of the XGBoost algorithm has focused on credit evaluation, whether for individual or corporate customers. Few studies have reported the algorithm’s performance on marketing decisions in the banking industry. In this study, the algorithm’s performance and feasibility for banks’ marketing decisions were evaluated using a real data set from a commercial bank.

3. Data Set and Research Process

In this research, the XGBoost Python package was used to develop a classification model.

3.1. Data Set

The data set was obtained from a commercial bank. It contained 156,666 records: 14,075 records for customers who bought mutual funds within three months and 142,591 records for customers who did not. A 1:1 undersampling approach was used to avoid the problem of rare events being ignored by the training model [9]. As a result, 28,150 records remained and were split into a training set (21,112 records) and a testing set (7038 records).
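The sampling and split described above can be sketched as follows. Since the bank’s data are not public, only the labels are simulated here, and the random seed is an arbitrary assumption:

```python
import numpy as np

# Hypothetical sketch of the 1:1 undersampling and 75/25 split described above.
rng = np.random.default_rng(42)
y = np.r_[np.ones(14_075, dtype=int), np.zeros(142_591, dtype=int)]

pos = np.flatnonzero(y == 1)                 # all 14,075 buyers
neg = rng.choice(np.flatnonzero(y == 0),     # undersample non-buyers 1:1
                 size=len(pos), replace=False)
idx = rng.permutation(np.r_[pos, neg])       # 28,150 balanced records

n_train = int(len(idx) * 0.75)               # 75% training split
train_idx, test_idx = idx[:n_train], idx[n_train:]
print(len(idx), len(train_idx), len(test_idx))   # 28150 21112 7038
```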
The summary of the data set is shown in Table 1. The target variable is a binary variable that defines whether a customer buys a mutual fund within three months. Four facets were used to determine whether a customer would buy mutual funds: the customer’s transactions with the bank, the customer’s demographic profile, the channel usage profile, and the holdings of financial commodities. For example, Reference [10] identified that e-banking channels could enhance banks’ market shares. Thus, features reflecting channel preference were included in this study.
A total of 36 features were determined to describe every data point (i.e., every potential customer and the customer’s outcome), as shown in Table 2. After screening out features with more than 20% missing values, 32 features remained. Then, categorical variables were transformed into dummy variables. Finally, 56 variables were used to develop the XGBoost model.
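A minimal sketch of this preprocessing, using hypothetical values for three of the features in Table 2 (the real data are not public):

```python
import pandas as pd

# Hypothetical sketch: drop features with more than 20% missing values, then
# transform categorical variables into dummy variables.
df = pd.DataFrame({
    "add_now": ["NANTOU", "MIAOLI", "NANTOU", "MIAOLI"],  # nominal feature
    "asset": [1_200_000, None, None, None],               # 75% missing: dropped
    "age": [34, 51, 42, 29],                              # interval feature
})

keep = df.columns[df.isna().mean() <= 0.20]               # screen by missing ratio
screened = df[keep]
encoded = pd.get_dummies(screened, columns=["add_now"])   # dummy variables
print(list(encoded.columns))
```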

3.2. Research Process

The XGBoost algorithm was used to develop the classification model. Using the trained XGBoost model, the feature importance was determined to explain the impact of each feature on the target variable (whether a customer buys the mutual fund within three months). The testing data set was used for model evaluation. Finally, to assess the XGBoost model’s performance, logistic regression, decision tree, support vector machine, and neural network algorithms were applied for model comparison. Figure 1 shows the research process of this research.

3.3. XGBoost

XGBoost operates in a series of steps to enhance predictive accuracy and efficiency. The input data and parameters for the algorithm are as follows:
Input data: Explanatory feature matrix (X) and target variable (y).

3.3.1. Parameters

XGBoost has many parameters for optimizing the model’s performance. These parameters are categorized into three groups: general parameters, booster parameters, and learning task parameters. They offer the flexibility to fine-tune the XGBoost model for various types of data and tasks to achieve optimal performance and generalization. The parameter-tuning strategy followed the web resource [11].
The parameters in this research included the following:
  • booster specifies the booster type, including tree-based models and linear models. Tree-based models were used as the booster in this research.
  • eta (learning_rate) is the step-size shrinkage used to prevent overfitting. Its value ranges between 0 and 1, and it controls the weight of each new tree’s contribution.
  • lambda (reg_lambda) specifies the L2 regularization term on the leaf weights to avoid overfitting.
  • objective specifies the learning task and the corresponding loss function. It was set to “binary:logistic” for logistic regression on the binary classification task in this research.
  • eval_metric specifies the evaluation metric for validation data. In this research, it was set to “error”, which uses the binary classification error rate.
  • n_estimators specifies the number of boosting rounds (trees to grow).
  • early_stopping_rounds stops training when the evaluation metric has not improved for the specified number of rounds.
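The settings above can be collected into a parameter dictionary in the style used by the XGBoost Python package. The tuned values used in this study are not reported, so the numbers below are placeholder assumptions:

```python
# Illustrative parameter set for the options listed above; the actual tuned
# values are not reported in the paper, so these numbers are assumptions.
params = {
    "booster": "gbtree",             # tree-based booster
    "eta": 0.1,                      # learning_rate: shrinks each tree's contribution
    "lambda": 1.0,                   # reg_lambda: L2 regularization on leaf weights
    "objective": "binary:logistic",  # logistic loss for binary classification
    "eval_metric": "error",          # binary classification error rate
}
num_boost_round = 500                # n_estimators: boosting rounds
early_stopping_rounds = 20           # stop when "error" stalls for 20 rounds
print(params["objective"])
```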

3.3.2. Algorithm

The algorithm works in the following steps:
  1. Initialization: The process starts with an initial prediction for each data point, often using the base class probabilities. For binary classification, the initial probability of class 1 is typically set to 0.5.
  2. Iterative Boosting: The residuals are computed for each data point. In this research’s classification task, the residuals were defined as the difference between the actual class labels (0 or 1) and the predicted probabilities. The gradient of the loss function is then calculated with respect to the current predictions; for the logistic loss, this involves computing the derivative of the logistic loss function.
  3. Model Training on Gradients: A new decision tree is trained on these gradients. The goal of the new tree is to predict the gradient of the loss function, effectively focusing on the errors made by the current iteration’s model.
  4. Weighted Summation: Predictions are updated by adding the weighted output of the new tree to the current predictions. The learning rate controls the weight of the new tree’s contribution. For the logistic objective, the model adds the predicted log odds (logits) from the new tree to the existing log odds and then applies the logistic function to obtain updated probabilities.
  5. Regularization: XGBoost applies regularization to penalize the complexity of the trees (e.g., the number of leaves and the leaf weights), prevent overfitting, and ensure that the model generalizes well to new data.
  6. Tree Pruning: The algorithm grows each tree to a specified maximum depth and then prunes it by removing branches that do not significantly improve the loss function. This further helps prevent overfitting.
  7. Iteration: Steps 2–6 are repeated to add new trees until the specified number of trees is reached or until further improvements in the loss function become minimal.
  8. Final Prediction: For classification, the final prediction is the probability of class 1, obtained by applying the logistic function to the sum of the log odds predicted by all trees. For binary classification, a threshold (commonly 0.5) is applied to assign class labels.
By iteratively adding trees that focus on the residuals (errors) of the current model, XGBoost enhances the model’s ability to correctly classify data points, leading to high accuracy and robustness.
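The boosting loop above can be sketched on toy one-dimensional data. Each round fits a depth-1 “stump” to the gradient of the logistic loss, and the learning rate scales the new tree’s contribution. This is an illustrative simplification, not the paper’s model, and it omits XGBoost’s regularization and pruning:

```python
import numpy as np

# Toy gradient boosting with logistic loss and depth-1 regression stumps.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = (X > 0.2).astype(float)

def stump_fit(x, g):
    """Fit a depth-1 regression tree to gradients g: choose the threshold that
    minimizes squared error, returning (threshold, left_value, right_value)."""
    best = (np.inf, 0.0, g.mean(), g.mean())
    for t in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        left, right = g[x <= t], g[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1], best[2], best[3]

logits = np.zeros_like(y)                      # step 1: initial prediction, p = 0.5
eta = 0.3                                      # learning rate
for _ in range(50):                            # steps 2-7: iterative boosting
    p = 1.0 / (1.0 + np.exp(-logits))          # current probabilities
    grad = y - p                               # residual = negative gradient of log loss
    t, lv, rv = stump_fit(X, grad)             # step 3: train a tree on the gradients
    logits += eta * np.where(X <= t, lv, rv)   # step 4: weighted summation

p = 1.0 / (1.0 + np.exp(-logits))              # step 8: logistic of summed log odds
acc = ((p > 0.5).astype(float) == y).mean()
print(acc)
```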

3.3.3. Feature Importance

In XGBoost, feature importance is measured using gain, cover, frequency, SHapley Additive exPlanations (SHAP) values, and permutation importance. These methods are used to understand which features contribute the most to the model’s predictions.
In this research, the gain was measured as the improvement in accuracy brought by a feature to the branches. It measures how much each feature contributes to reducing the loss function (error) in the model. The higher the gain, the more important the feature is considered.
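As a toy illustration of gain, simplified to a squared-error criterion rather than XGBoost’s exact regularized formula, the improvement contributed by one split can be computed directly:

```python
import numpy as np

# Toy illustration of split "gain": the reduction in (squared-error) loss from
# one split. XGBoost accumulates such gains per feature and normalizes them to
# rank features; its exact formula also involves regularization terms.
g = np.array([0.9, 0.8, -0.7, -0.6])      # gradients at a tree node
left, right = g[:2], g[2:]                # candidate split of the node

def sse(v):
    return ((v - v.mean()) ** 2).sum()

gain = sse(g) - (sse(left) + sse(right))  # loss before minus loss after the split
print(round(gain, 3))                     # 2.25
```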

4. Results and Discussion

In the data exploration phase, interesting relationships between the explanatory features and the target variable were observed. For numerical features, a statistical F-test was conducted on the two groups of data points (y = 0 and y = 1). The most significant feature was “asset” (F-value = 6923.65, p < 0.01): the y = 1 group had a higher asset level than the y = 0 group, revealing that customers with more assets were more likely to buy mutual funds. The second most significant feature was “unsecured_loan” (F-value = 3598.33, p < 0.01): the y = 0 group had a higher unsecured loan level than the y = 1 group, showing that customers with more unsecured loans were less likely to buy mutual funds. The third most significant feature was “secured_loan” (F-value = 893.70, p < 0.01): the y = 1 group had a higher secured loan level than the y = 0 group, revealing that customers with more secured loans were more likely to buy mutual funds. As these customers have assets as collateral, they have the financial resources to purchase funds for financial purposes.
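The two-group F-test used above is equivalent to a one-way ANOVA with two groups and can be sketched on synthetic “asset” values (the bank’s actual data are not public, so the group means and sizes below are assumptions):

```python
import numpy as np

# One-way ANOVA F statistic for two groups, on synthetic data.
rng = np.random.default_rng(1)
buy = rng.normal(2.0, 1.0, 500)      # y = 1 group: higher asset level
nobuy = rng.normal(1.0, 1.0, 500)    # y = 0 group

grand = np.concatenate([buy, nobuy]).mean()
ss_between = 500 * ((buy.mean() - grand) ** 2 + (nobuy.mean() - grand) ** 2)
ss_within = ((buy - buy.mean()) ** 2).sum() + ((nobuy - nobuy.mean()) ** 2).sum()
F = (ss_between / 1) / (ss_within / (1000 - 2))   # df_between = 1, df_within = 998
print(F)   # a large F indicates the group means differ significantly
```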
Table 3 shows the top 15 important features. Whether a customer has an unsecured loan, the customer’s amount of assets in the bank, the number of months the customer has dealt with the bank, the place of residence, and whether the bank is the main bank for the total amount of credit card bills in the past six months are the top five factors for the models, providing crucial insights for effective wealth management marketing. The first two important features calculated by the XGBoost model also echo the results of the F-test mentioned earlier.
Figure 2 shows the confusion matrix of the XGBoost model. The accuracy was 0.6371, the precision was 0.5956, the recall rate was 0.8756, and the F1 score (the harmonic mean of precision and recall) was 0.7090. The classification performance was slightly better than the other statistical methods and machine learning models (Table 4).
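As a quick consistency check, the reported F1 score is indeed the harmonic mean of the reported precision and recall:

```python
# F1 as the harmonic mean of the precision and recall reported above.
precision, recall = 0.5956, 0.8756
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))   # 0.709, consistent with the reported 0.7090
```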
Lift indicates how much better a model predicts positive outcomes than random guessing. It is the ratio of the cumulative gain (true positives captured) to the expected gain if no model is used. Based on this metric, the top 10% lift of the XGBoost model was 1.6, which was also slightly better than that of the other models. Figure 3 shows the lift curve of the XGBoost model.
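The top-10% lift defined above can be sketched as the positive rate among the 10% of customers with the highest predicted scores, divided by the overall positive rate. Toy scores are used here; they are not the paper’s predictions:

```python
import numpy as np

# Top-decile lift on synthetic scores and labels.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 1000)            # true labels, base rate ~0.5
scores = 0.3 * y + rng.random(1000)     # scores mildly correlated with y

order = np.argsort(-scores)             # rank customers, highest score first
top = y[order][:100]                    # top 10% of 1,000 customers
lift = top.mean() / y.mean()
print(lift)                             # > 1 means better than random targeting
```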

5. Conclusions

Past research on direct marketing has mainly focused on the retailing sector and has seldom discussed financial commodities in the banking industry. This research’s findings provide information to support direct marketing decisions in the banking industry. The five most influential factors in the model are whether a customer has an unsecured loan, the amount of assets the customer holds in the bank, the duration of the customer’s relationship with the bank, the customer’s place of residence, and whether the bank is the primary institution for the customer’s total credit card bill payments over the past six months. These factors are important for formulating effective wealth management marketing strategies. The XGBoost algorithm’s classification performance in judging whether a potential customer buys mutual funds is slightly better than that of the other models.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from the case bank and are available from the author with the permission of the case bank.

Acknowledgments

The author appreciates the case bank for providing data and domain expertise for this research.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. WLIST.pdf. Available online: https://www.cbc.gov.tw/public/data/EBOOKXLS/WLIST.pdf (accessed on 29 July 2024).
  2. Olson, D.L.; Chae, B. Direct marketing decision support through predictive customer response modeling. Decis. Support Syst. 2012, 54, 443–451.
  3. Shih, J.-Y.; Chen, W.-H.; Chang, Y.-J. Developing target marketing models for personal loans. In Proceedings of the 2014 IEEE International Conference on Industrial Engineering and Engineering Management, Selangor, Malaysia, 9–12 December 2014; pp. 1347–1351.
  4. Chen, K. The effects of marketing on commercial banks’ operating businesses and profitability: Evidence from US bank holding companies. Int. J. Bank Mark. 2020, 38, 1059–1079.
  5. Sontakke, A.; Yewale, M.; Zambare, S.; Tendulkar, S.; Chaudhari, A. Credit Card Fraud Detection Using Machine Learning and Predictive Models: A Comparative Study. In Hybrid Intelligent Systems; Abraham, A., Siarry, P., Piuri, V., Gandhi, N., Casalino, G., Castillo, O., Hung, P., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 171–180.
  6. Jabeur, S.B.; Stef, N.; Carmona, P. Bankruptcy Prediction using the XGBoost Algorithm and Variable Importance Feature Engineering. Comput. Econ. 2023, 61, 715–741.
  7. Carmona, P.; Climent, F.; Momparler, A. Predicting failure in the U.S. banking industry: An extreme gradient boosting approach. Int. Rev. Econ. Financ. 2019, 61, 304–323.
  8. Wang, K.; Li, M.; Cheng, J.; Zhou, X.; Li, G. Research on personal credit risk evaluation based on XGBoost. Procedia Comput. Sci. 2022, 199, 1128–1135.
  9. Ling, C.X.; Li, C. Data Mining for Direct Marketing: Problems and Solutions. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 27–31 August 1998.
  10. Nazaritehrani, A.; Mashali, B. Development of E-banking channels and market share in developing countries. Financ. Innov. 2020, 6, 12.
  11. Notes on Parameter Tuning—XGBoost 2.1.1 Documentation. Available online: https://xgboost.readthedocs.io/en/stable/tutorials/param_tuning.html (accessed on 11 August 2024).
Figure 1. Research process.
Figure 2. Confusion matrix of XGBoost model.
Figure 3. Lift of XGBoost model.
Table 1. Summary of data set.

Description | Number of Records
Customers who do not buy mutual funds within three months (Y = 0) | 142,591
Customers who buy mutual funds within three months (Y = 1) | 14,075
1:1 undersampling (14,075 × 2) | 28,150
Training set (75% of data points) | 21,112
Test set (25% of data points) | 7038
Table 2. Definition of features (input variables).

Feature Name | Definition | Data Type
add_now | Place of residence | NOMINAL
age | Customer’s age | INTERVAL
allprice_6mon | Total amount of credit card bills in the past six months | INTERVAL
allprice_6mon_bank | Whether the total amount of credit card bills in the past six months was incurred in this bank | BINARY
asset | Customer’s assets in the bank | INTERVAL
card | Whether the customer has a debit card in the bank | BINARY
chi_num | Number of customers’ children | INTERVAL
credit | Whether the customer is a new credit card user within three months in the bank | BINARY
credit_old | Whether the customer has a credit card in the bank | BINARY
cus_mon | Number of months of transactions with this bank | INTERVAL
degree | Customer’s education level | ORDINAL
delete_insurance_creditcard | Whether the customer used a credit card to pay insurance premiums in the past year | BINARY
delete_insurance_deposit | Whether premiums are withheld from the customer’s deposits | BINARY
ebank_mon | The customer’s online banking usage period | INTERVAL
f_rate_atm | ATM usage ratio by customers in the past year—foreign currency deposits | INTERVAL
f_rate_counter | Customer usage ratio of over-the-counter counters in the past year—foreign currency deposits | INTERVAL
f_rate_ebank | Internet banking usage ratio in the past year—foreign currency deposits | INTERVAL
f_rate_phone | Rate of voice calls used by customers in the past year—foreign currency deposits | INTERVAL
foreign_currency_loan | Whether the customer has a foreign currency loan | BINARY
gender | Gender | BINARY
income | Income | INTERVAL
insurance | Whether the customer has insurance products | BINARY
job_g | Customer’s occupation | NOMINAL
liability | Customer’s liabilities within the bank | INTERVAL
marry | Customer’s marital status | BINARY
prefer_f | Preferred channel in the past year—foreign currency deposits | NOMINAL
prefer_tw | Preferred channel in the past year—Taiwan dollar deposits | NOMINAL
pro_num | Total number of items held by customers | INTERVAL
sec_1monprice_1yr | Average monthly security transaction amount in the past year | INTERVAL
sec_freq_1yr | Number of security transactions in the past year | INTERVAL
secured_loan | Whether the customer has a secured loan | BINARY
tw_rate_atm | ATM usage ratio in the past year—New Taiwan dollar deposits | INTERVAL
tw_rate_counter | Ratio of customers using over-the-counter counters in the past year—New Taiwan dollar deposits | INTERVAL
tw_rate_ebank | Internet banking usage ratio in the past year—New Taiwan dollar deposits | INTERVAL
tw_rate_phone | Rate of voice calls used by customers in the past year—New Taiwan dollar deposits | INTERVAL
unsecured_loan | Whether the customer has an unsecured loan | BINARY
Table 3. Feature importance.

Features | Feature Importance
unsecured_loan | 0.5033
asset | 0.0728
cus_mon | 0.0189
add_now_NANTOU | 0.0170
add_now_MIAOLI | 0.0142
add_now_PINGTUNG | 0.0136
add_now_KAOHSIUNG CITY | 0.0135
foreign_currency_loan | 0.0123
gender_F | 0.0122
prefer_f_3 | 0.0122
marry | 0.0117
prefer_f_2 | 0.0116
allprice_6mon_bank | 0.0114
liability | 0.0108
delete_insurance_deposit | 0.0108
Table 4. Model performance on test set.

Model | Accuracy Rate | F1 | Lift (Top 10%)
Logistic Regression | 0.6279 | 0.70 | 1.5
Decision Tree | 0.6367 | 0.71 | 1.6
XGBoost | 0.6371 | 0.71 | 1.6
Neural Networks | 0.6294 | 0.70 | 1.5
Support Vector Machines | 0.6265 | 0.70 | 1.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Shih, J.-Y. Application of XGBoost Algorithm to Develop Mutual Fund Marketing Prediction Model for Banks’ Wealth Management. Eng. Proc. 2025, 89, 3. https://doi.org/10.3390/engproc2025089003
