1. Introduction
In the context of current market conditions and intensifying enterprise competition, the differences between products and services continue to diminish, and businesses have gradually shifted their marketing strategies from a product focus to a customer focus. Reducing user churn should therefore be a primary objective of the enterprise [1]. Developing a new user incurs enormous expenses for an enterprise in terms of product price positioning, marketing strategy, service enhancement, and the promotion and publicity needed to convince potential customers. Retaining old customers and recognizing the value of existing users are therefore essential for expanding business and expanding the market, and play a significant role in strengthening a business's competitive advantage within its industry.
In response to user churn, customers' historical information is one of the most valuable assets available to businesses and managers, and it can be utilized to develop models that identify loss-prone customers [2]. Big data analysis technology can be combined with data-mining algorithms to discover the regularities contained in historical data and, through mathematical modeling and other techniques, convert the value of the data into reusable, inheritable knowledge [3]. The development of emerging technologies and data-mining techniques has enabled comprehensive research on customer churn forecasting in many industries, including the financial industry; however, big data analysis is still underused in customer churn forecasting [4]. Applying big data analysis to an enterprise's historical user transaction data, developing an effective user churn model for user behavior analysis, and achieving early warning of users at risk of churning together constitute a crucial means of matching product function to operational strategy.
Fundamentally, user churn identification and early warning form a two-class classification problem: a user either churns or is retained. This experiment mainly analyzed user churn data. For customer churn prediction, Li et al. (2018) utilized the LR, SVM, alternate, and genetic algorithms to address supervised learning and proportional label learning [5]. De et al. (2018) addressed the difficulty that decision trees and logistic regression models have in handling linear relationships and interactions and proposed using decision rules for customer categorization [6]. To improve the traditional deep neural network model for UCI public bank employee churn, Mundada et al. (2019) used the Tukey outlier preprocessing method, feature scaling, and the Adam optimization algorithm [7]. Using user data provided by the music information service KKBOX, Gregory (2018) processed the time series data with time-sensitive feature engineering and developed a weighted average model based on XGBoost and LightGBM to predict user churn [8]. According to Wang et al. (2019), the subscriber churn problem in advertising business management can be addressed by extracting static and dynamic features from subscribers' long-term data on advertising platforms and using the GBDT algorithm to predict whether subscribers will churn in the future [9]. Zhang et al. (2014) combined the adaptive Boosting algorithm with CART regression, trained and tested the model on samples, and demonstrated experimentally that the method was applicable in the community setting [10]. Ahmad et al. (2019) assisted telecom operators in predicting subscriber churn by extracting social network features and applying hybrid sampling to the raw data using a combination of the RF, DT, GBM, and XGBoost algorithms [11].
Typically, the number of churned customers is small compared to the number of retained customers, which makes churned customers difficult to identify with general classification algorithms: only the majority-class samples (retained users) are accurately recognized. As a result, the classification algorithms used in traditional business processes perform poorly overall at separating churned from retained customers [12]. To date, solutions to the classification problem for unbalanced data operate at two main levels: the processing of data and the improvement of algorithms. Data processing reduces the imbalance of the original data distribution, with data resampling techniques being the most prevalent [13,14]. Algorithm improvement focuses primarily on optimizing standard classification algorithms. In existing studies, cost-sensitive learning has been used to increase the misclassification cost of minority-class samples and to focus on the classification accuracy of churned customers. Bahnsen et al. (2015) introduced cost-sensitive learning into the random forest, logistic regression, and decision tree algorithms and proposed a measurement method that considered the cost of customer churn, which resulted in a 26.4% savings in financial costs on cable supplier data [15]. To predict customer churn in telecommunications, Luo et al. (2010) developed naive Bayes, logistic regression, and multilayer perceptron algorithms based on cost-sensitive learning theory [15]. For the high-dimensional unbalanced data of telecom customer churn, Özmen et al. (2020) proposed a multi-objective cost-sensitive ant colony optimization algorithm that minimizes the cost of misclassification while also minimizing the number of features [16]. Wong et al. (2020) introduced cost-sensitive learning into the field of deep learning, proposing a cost-sensitive deep neural network and its ensemble learning version; at the same time, random under-sampling and hierarchical feature extraction were applied to the hidden layers of the deep neural network to improve its generalization ability [17]. In an analysis of user churn prediction, Al-Madi et al. (2018) used Genetic Programming with Cost-Sensitive Learning (GP-CSL) as an optimization algorithm and concluded that GP-CSL identified churned users better when the penalty cost was high [18].
By using resampling methods and cost-sensitive learning methods that increase the misclassification cost of churned-customer samples, scholars have primarily addressed the imbalance between the numbers of churned and retained users in the user-churn-modeling process. With resampling, data labels can be balanced by generating samples of the minority class. However, resampling based only on the information contained in the current minority-class samples leaves the data lacking diversity and generates noise during the sampling process, making the different classes of samples harder to distinguish. The cost-sensitive method, although it includes a misclassification penalty in the model's training loss, assigns weights only at the level of the minority class, does not distinguish between individual samples within each class, and does not dynamically adjust the attention paid to different samples according to the training results as the model is trained. In the analysis of unbalanced data, dividing the dataset by the number of samples in each category yields minority versus majority samples, while dividing it by the classifier's difficulty yields hard versus easy samples. The imbalance both between classes and between easy and difficult samples is a significant factor in the classifier's lack of sufficient certainty when discriminating between classes, which pushes the output value close to the decision threshold [19]. In the imbalanced-data classification problem of identifying user churn, it is therefore important to account for the cost of misclassifying minority-class samples, as well as for the churned users judged to be difficult samples during training, so that the user churn model also performs deeper mining of difficult cases.
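As a concrete illustration of the data-level approach criticized above, random oversampling duplicates minority-class rows until the labels are balanced. The following is a minimal sketch with illustrative variable names (production work would typically use a dedicated library such as imbalanced-learn); note that, exactly as the text observes, it adds no new information and can only replicate the existing minority samples:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows (with replacement) until both classes
    are equally frequent. A minimal illustration of data-level rebalancing."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)          # positions of minority rows
    extra = rng.choice(idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# 90 retained users (label 0) vs. 10 churned users (label 1)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
Xb, yb = random_oversample(X, y)                 # balanced to 90 vs. 90
```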
Based on the analysis presented above, this paper incorporated into the Light Gradient Boosting Machine (LightGBM) classification model a loss function, based on the Focal Loss function, that focuses on both minority samples and difficult-to-classify samples. Adding category weights and focusing parameters to LightGBM's original cross-entropy loss function addresses positive–negative sample imbalance and simple–difficult sample imbalance, respectively, and dynamically adjusts each sample's loss contribution during training, yielding the FocalLoss_LightGBM user churn model based on difficult case mining. To this end, the article analyzed credit card transaction data provided by commercial banks and published on the Kaggle data science website. In addition, the article compared the constructed model with Support Vector Machines (SVMs), Random Forests (RFs), eXtreme Gradient Boosting (XGBoost), and the original LightGBM model.
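For reference, the Focal Loss of Lin et al. (2017), on which the modified objective is based, augments the cross-entropy loss with a category weight and a focusing parameter:

```latex
% Binary focal loss. Here p_t denotes the model's estimated probability of the
% true class (p_t = p if y = 1, and p_t = 1 - p otherwise), \alpha_t is the
% category (class) weight, and \gamma \ge 0 is the focusing parameter.
\mathrm{FL}(p_t) = -\,\alpha_t \,(1 - p_t)^{\gamma} \,\log(p_t)
```

With $\alpha_t = 1$ and $\gamma = 0$ this reduces to the standard cross-entropy loss; increasing $\gamma$ shrinks the loss contribution of well-classified (easy) samples, so training concentrates on the difficult ones.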
Taking the problem of unbalanced data into account, this experiment focused on minority samples and difficult-to-classify samples and proposed a FocalLoss_LightGBM user churn model based on difficult case mining, which can effectively identify user churn with high stability.
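The scheme can be sketched as a custom objective supplied to LightGBM's scikit-learn API, which expects a function returning the per-sample gradient and Hessian of the loss with respect to the raw scores. This is a sketch, not the authors' exact implementation: the derivatives are approximated by finite differences purely for clarity (an analytic derivation would normally be used), and the parameter values `alpha` and `gamma` are illustrative defaults:

```python
import numpy as np

def focal_loss(y, z, alpha=0.25, gamma=2.0):
    """Elementwise binary focal loss on raw scores z (before the sigmoid).
    alpha is the category weight, gamma the focusing parameter."""
    p = np.clip(1.0 / (1.0 + np.exp(-z)), 1e-9, 1 - 1e-9)
    return -(alpha * y * (1 - p) ** gamma * np.log(p)
             + (1 - alpha) * (1 - y) * p ** gamma * np.log(1 - p))

def focal_objective(y_true, y_pred):
    """Custom objective with the (grad, hess) signature accepted by
    lightgbm.LGBMClassifier(objective=focal_objective). Derivatives are
    approximated numerically here purely for illustration."""
    eps = 1e-4
    f = lambda z: focal_loss(y_true, z)
    grad = (f(y_pred + eps) - f(y_pred - eps)) / (2 * eps)
    hess = (f(y_pred + eps) - 2 * f(y_pred) + f(y_pred - eps)) / eps ** 2
    return grad, hess
```

With `gamma = 0` and `alpha = 0.5` the loss is simply a rescaled cross-entropy, which is a convenient sanity check; with `gamma > 0` the loss of confidently classified samples is sharply down-weighted, so each boosting round concentrates on the hard cases.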
4. Conclusions
The proportion of churned customers is low in studies of user churn, which makes historical user data unbalanced in nature. Because of this, it is difficult to identify potential churned users accurately using general machine learning models and a single prediction-accuracy metric. This paper optimized the original cross-entropy loss function by introducing category weights and focusing parameters to control the weights of positive versus negative samples and of simple versus difficult samples, and adjusted the misclassification cost of the samples in each training round according to the sample proportions and their classification difficulty, constructing FocalLoss_LightGBM, a user churn model based on difficult case mining. The results demonstrated that, in comparison with the support vector machine, random forest, and LightGBM models, the proposed model not only identified churned users with greater precision, but also did so with greater stability across different dataset subsets. The proposed user churn model expands the study of big data analytics for identifying potential churned users. Applied to the actual user management process, it can help businesses effectively identify customers with a propensity to churn, track user dynamics, rapidly develop marketing strategies and retention plans, and reduce the cost of user relationship management. In addition, because the model identifies churn stably across multiple datasets, future research can extend it to churn identification in the telecommunications, Internet, and new media industries, and it is well suited to classification problems with typical imbalance characteristics, such as the detection of financial fraud and default.
To filter out the factors most important to user retention from the enormous amount of user data, we will consider incorporating feature extraction from user history data in future work, in order to help businesses identify churn-prone customers and develop more targeted and differentiated strategies for improving customer service.