Hybrid Feature Engineering Based on Customer Spending Behavior for Credit Card Anomaly and Fraud Detection

Alamri, Maram; Ykhlef, Mourad

doi:10.3390/electronics13203978


by Maram Alamri * and Mourad Ykhlef

Information System Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia

* Author to whom correspondence should be addressed.

Electronics 2024, 13(20), 3978; https://doi.org/10.3390/electronics13203978

Submission received: 22 August 2024 / Revised: 29 September 2024 / Accepted: 9 October 2024 / Published: 10 October 2024

(This article belongs to the Special Issue New Challenges in Information Security and Privacy and Cyber Resilience)


Abstract:

For financial institutions, credit card fraud detection is a critical activity in which the accuracy and efficiency of detection models are important. Traditional methods often use standard feature selection techniques that may overlook subtle patterns in transaction data. This paper presents a new approach that combines feature aggregation with Exhaustive Feature Selection (EFS) to enhance the performance of credit card fraud detection models. Feature aggregation creates higher-order characteristics that capture complex relationships within the data; EFS then finds the most relevant features by systematically evaluating all possible feature subsets. Our method was tested using a public credit card fraud dataset, PaySim. Four popular learning classifiers, namely random forest (RF), decision tree (DT), logistic regression (LR), and deep neural network (DNN), are used with balanced datasets to evaluate the techniques. The findings show a large improvement in detection accuracy, F1 score, and AUPRC compared to other approaches. Specifically, our method improved the F1 score, precision, and recall, which underlines its ability to handle the nuances of fraudulent transactions more effectively than other approaches. This article provides an overall analysis of this method’s impact on model performance and offers insights for future studies on fraud detection and related fields.

Keywords:

credit card; anomaly detection; fraud detection; feature engineering; feature selection; feature aggregation; exhaustive feature selection

1. Introduction

Credit card payments are becoming increasingly popular in today’s global digital economy. Credit cards are widely utilized in e-commerce and online transactions, making them common in online banking. However, as credit card use has developed and grown, a wide range of fraud has occurred. Banks and cardholders are suffering significant losses as a result of fraudsters’ more sophisticated approaches for conducting illegal transactions. According to Nilson’s research [1], credit card fraud losses totaled approximately USD 28.65 billion in 2019, with an estimated increase to approximately USD 32.96 billion globally by 2024.

Criminals may use fake cards to simulate legal users’ actions while attempting to obtain credit card information. Criminals can now carry out fraudulent transactions with greater ease because of technological advancements. Banks and financial institutions work hard to lower fraudulent activity rates through fraud prevention and detection. They prevent payment fraud in a variety of ways, including real-time monitoring of the risk score and physical biometrics rules [2]. They can also benefit from applying machine learning and deep learning to study human behavior and develop an effective fraud detection system. Although numerous fraud detection systems exist, the subject still requires more contributions from researchers because criminals continue to create new methods of operation, which increase the losses for banks and institutions [2].

Fraudulent transactions can be discovered more simply and reliably by utilizing machine learning and deep learning models, as well as classification and clustering approaches. These algorithms are trained using historical datasets to identify patterns that show fraudulent behavior [3].

Feature engineering is an effective strategy for increasing the performance of a credit card fraud detection system since it helps identify the important features that allow the system to work more effectively and generate better results. Feature engineering is a technique for improving machine learning performance by turning data into features that more accurately represent the underlying problem [4]. Feature selection, a subset of feature engineering, is the process of determining which features from the initial set are most useful for the model’s predictions. The basic goals of feature selection are to improve model performance while also reducing training and prediction time [4]. Frequently, a credit card dataset contains a large number of features that have a negative impact on the classifier’s performance during training. Feature selection is used to overcome this issue of high feature dimensionality. The choice of feature selection approach is determined by the nature of the problem that a researcher wishes to investigate. In addition, feature aggregation is recommended for analyzing customer spending behavior and the elements that contributed to fraud in a transaction.

Our approach constructs and evaluates baseline classification models using hybrid feature engineering in the PaySim dataset, which is used for simulating credit card transactions. We use feature aggregation with exhaustive feature selection (EFS) to increase fraud detection and reduce false positives. We give an in-depth look into how these well-known feature engineering techniques and their interactions affect the performance of classification models and explain how the findings might be utilized in practice when dealing with fraud detection.

This work focuses on hybrid feature aggregation and EFS to provide efficient features for analyzing consumer spending behavior and improving the detection model’s results. The remainder of this paper is organized as follows. Section 2 provides a background on feature engineering, including a basic overview of feature extraction and selection. Section 3 addresses related research on feature engineering in credit card transaction data. Section 4 describes the dataset used in the experiments. Section 5 describes the metrics used to evaluate the experiments. Section 6 describes the proposed hybrid feature engineering in great depth. Section 7 describes the experiments conducted and the results obtained, as well as discusses the reported results. Finally, Section 8 concludes the paper with a few recommendations for future work.

2. Background

The primary goal of ML is to identify patterns in data so that it can be transformed into knowledge. Many different features may be present in popular datasets [5]. In the majority of applications, datasets are created by combining information from various sources, each of which has benefits and drawbacks. It is crucial to turn these unprocessed data sources into features that will benefit the detection model before using machine learning [5]. This vital stage, known as feature engineering, is essential to the machine learning process. The process of transforming raw data into novel features and identifying the best variables that enhance the learning classifier’s accuracy is critical for the detection of credit card fraud. The better the feature generated, the more accurate the result obtained [6].

2.1. Feature Extraction

The feature extraction method extracts new features from the original dataset, which is very useful for reducing the resources required for processing without losing relevant information [7]. It involves creating new features that depend on the original input feature set in order to reduce the high dimensionality of a feature vector. Feature extraction thus transforms the initial features into more significant features. It is less vulnerable to overfitting and performs well in classification [7].

Feature aggregation is a type of feature extraction that combines multiple raw data points into a single feature. For every transaction, additional features are added based on predetermined criteria. The value of a new feature is determined by applying an aggregation function to a specific subset of previous transactions [8]. The objective is to generate a record of a cardholder’s activity from their transaction history that measures how different the present transaction is from their past ones [8]. Feature aggregation is a powerful technique for feature extraction, and it is used in a wide variety of machine learning applications, including fraud detection, customer segmentation, and product recommendation.

2.2. Feature Selection

Feature selection is employed to minimize the impact of dimensionality on a dataset by selecting a subset of features which most effectively define the data [7]. From the input data, it selects the features that are essential to the mining task and eliminates unnecessary and unimportant ones. The main goal of feature selection is to create a subset of features that is as small as possible while still accurately capturing the essential characteristics of the input data [7]. Feature selection has many benefits, including the ability to reduce data size, reduce storage requirements, increase prediction accuracy, avoid overfitting, shorten training and execution times, and yield variables that are simple to understand [7]. Feature selection is a crucial stage in the application of machine learning techniques [9]. This is partially because the dataset used for training and testing may have a large feature space, which could negatively affect the models’ overall performance. The type of problem a researcher is attempting to address determines which feature selection method to use [9].

Feature selection can be divided into three categories: the filter approach, the wrapper technique, and hybrid or embedded approaches [10], as shown in Figure 1. Filter methods choose features according to the intrinsic attributes of the data [10]. Filter approaches can score specific features or analyze entire feature subsets. Commonly utilized approaches include information gain, correlation, chi-square, Fisher score, feature weighting, K-means, and ReliefC [11]. The wrapper technique employs a learning mechanism to determine the relevance of a specific set of features or attributes [10]. A wrapper evaluates feature subsets based on the performance of a modelling algorithm, which is used as a black box evaluator [11]. Thus, for classification tasks, a wrapper will assess subsets based on the classifier’s performance. Wrappers are considerably slower than filters at finding sufficiently appropriate subsets because they are restricted by the resource demands of the modelling algorithm [11]. EFS is a wrapper approach that examines every feature subset to determine the optimal feature subset based on the model performance evaluation. Hybrid and embedded methods involve feature selection in the learning process [10]. They execute feature selection while the modelling method is being executed. Hybrid approaches combine the most effective features of filtering and wrapping [11]. Some embedded approaches execute feature weighting using a regularization model with objective functions that minimize fitting errors while forcing the feature coefficients to be small or even zero [11].

Figure 1. Feature selection methods.

3. Related Works

Credit card datasets contain many features, some of which are irrelevant, so selecting the best discriminating features is critical. A recent study [12] proposed a bank customer classification model that was improved through the application of a feature engineering process. The proposed model had three components: feature transformation, feature selection, and machine learning classification. The main finding of the study was that feature engineering techniques can significantly improve the classification of bank customer behavior. The study [12] demonstrated that classification methods perform better in predicting customer behavior by transforming behavioral data into different data structures and selecting relevant features. This suggests that feature engineering can play a major role in enhancing customer activity for banking institutions.

Another study [13] introduced an effective data mining method for detecting credit card fraud that focused on feature selection and decision cost for accuracy improvement. The method involved selecting relevant features using an extended wrapper approach that partitioned the data before applying the feature selection algorithm and performing ensemble classification with cost-sensitive C4.5 decision trees. The main goal of this study was to identify stable features that are not influenced by the dataset size and can also be applied to other datasets. The experimental results showed that this approach improves performance compared to other classification algorithms as measured by accuracy, recall, and F1 score.

A further study [14] used a deep learning architecture and feature engineering method based on homogeneity-oriented behavior analysis (HOBA) for a credit card fraud detection system. HOBA is a feature engineering process that is employed in the proposed fraud detection system. The primary objectives of the suggested methodology were to assist credit card issuers in effectively identifying fraudulent transactions, secure the interests of consumers, and reduce regulatory expenses and fraud losses. A real-world dataset from a major Chinese commercial bank was used to assess the system, and the findings illustrated its efficacy in detecting fraudulent transactions with a manageable false positive rate.

One paper [15] reported the comparison of a simple forward selection algorithm to an exhaustive search method and backward selection for feature selection in linear regression. This demonstrated that the features identified by the forward selection method are as accurate as the exhaustive search approach. A variety of datasets were used in the empirical examination. The authors determined that the forward selection algorithm can achieve the same outcomes as the most widely used exhaustive search approach in most evaluation metrics commonly used today.

A study by Jiang et al. introduced a novel approach to credit card fraud detection [16]; cardholders were grouped according to their transactional patterns, aggregating transactions within each group, extracting behavioral patterns, training classifiers, and utilizing a feedback mechanism to identify fraudulent activity online. A feedback mechanism was implemented to solve the concept drift issue, enabling the detection procedure to adjust the cardholder’s changing transaction behaviors in time. The experimental findings demonstrated that the suggested strategy performs better at detecting fraud than alternative techniques.

Another paper [17] proposed the Neural Aggregate Generator (NAG), a neural network-based feature extraction module that learns feature aggregates end-to-end, mimicking the structure of manual aggregates. The intention was for this to replace manual feature engineering methods, which rely on costly human expertise. By using soft feature value matching and relative feature importance weighting, the NAG improved learnable aggregates over traditional ones. In terms of fraud classification performance, the NAG outperformed manual aggregates and other end-to-end approaches, such as LSTM and generic CNNs.

Further research [18] introduced an integer formulation that identifies behavioral patterns associated with past fraudulent events through the computation of functions of transaction data. This formulation enabled filtering of transactions that satisfy particular criteria and the aggregation of transactions across various features. Using real-world data obtained from a French bank, the model was evaluated, and the outcomes showed that the proposed approach was efficacious in identifying fraudulent transactions.

Finally, a novel feature engineering methodology was designed for deep learning in the context of financial fraud detection with a specific focus on autoencoder neural network models [19]. The objective of this framework was to enhance the efficacy of fraud detection algorithms by generating and choosing impactful features depending on their significance. The researchers utilized the framework to analyze a genuine transaction dataset obtained from a private bank in Europe. They conducted experiments using three distinct types of datasets: the original data, feature sets that they generated, and feature sets carefully picked for their effectiveness. The findings indicated that by using the novel framework, the deep learning models utilizing the chosen features surpassed those using the original data.

As indicated in related works for using feature engineering for credit card transaction data, researchers tend to use feature extraction, aggregation, and feature selection to analyze and select the optimal features that will improve the credit card fraud detection system. However, their approaches focus either on feature aggregation or on feature selection. In this paper, feature aggregation and EFS are used together to improve the performance of the detection system and raise the F1 score.

4. Dataset

There are few publicly available datasets on financial services, particularly in the rapidly expanding industry of mobile money transfers. Many scholars have researched the topic of fraud detection using financial datasets. Because financial transactions are inherently private, publicly available datasets that address the issue at hand are scarce. The dataset used in this study was generated using the PaySim simulator, which generates synthetic credit card transactions [20].

PaySim datasets can assist academics, financial institutions, and government agencies in testing their fraud detection algorithms or assessing the effectiveness of other strategies in comparable testing environments by providing a shared, publicly accessible synthetic dataset [20].

PaySim uses aggregated data from a private dataset to generate a synthetic dataset that mimics typical transaction behavior. Illegal activity is subsequently added into the dataset to assess how well fraud-detection systems work. Using a sample of real transactions taken from a month’s worth of financial logs from a mobile money service provider in an African nation, it simulates mobile money transactions. An international company that offers a mobile banking service in more than 14 nations worldwide gave the original records [20]. Eleven attributes and more than six million records make up the dataset.

The main justification for using a synthetic dataset is that the Kaggle datasets used by most researchers have been converted using principal component analysis (PCA), and only the time and amount attributes remain in their original form. As a result, their features are limited and cannot fully capture customer behavior; thus, additional attributes, such as those of a synthetic dataset, are required.

Table 1 describes each attribute in the PaySim dataset. The isFraud attribute marks transactions made by fraudulent agents, who intend to profit by gaining control of customers’ accounts and emptying them by transferring the funds to another account and then cashing out of the system. The isFlaggedFraud attribute indicates illegal attempts, defined here as any attempt to transfer more than USD 200,000 in a single transaction.

Dataset preparation involves cleaning the data by removing skewed data, outliers, and missing values before feeding them to the model training, as described in this study [21]. Furthermore, exploratory data analysis (EDA) is used to understand the overall distribution of the data, as well as the correlation and dependency between various input features. This publication [21] explains the full process of EDA in depth.

After preparing the dataset, we balanced it using hybrid Tomek links BCB-SMOTE, as proposed in [21]. To evaluate the proposed method, the PaySim dataset was split into a training set, a validation set, and a test set. The training set comprised 60% of the entire dataset, while the validation and test sets each comprised 20%. The hybrid undersampling and oversampling technique was applied to the training set, while the test set was used with a random forest (RF) model. The main advantage of RF in this study was its minimal training time compared to other algorithms. The F1 score is crucial in evaluating balanced datasets, and the accuracy of predicting credit card fraud is of utmost importance; RF demonstrated precise output prediction even when dealing with large datasets [22].
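The 60/20/20 partition described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_indices`, the fixed seed, and the plain random shuffle are not taken from the paper, which does not say whether the split was stratified. Any resampling step would then be applied to the training indices only.

```python
import random

def split_indices(n_rows, train=0.6, val=0.2, seed=42):
    """Shuffle row indices and cut them into train/validation/test parts.

    The remaining rows after the train and validation cuts form the test set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = int(n_rows * train)
    n_val = int(n_rows * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )

# Hypothetical row count, used only to show the proportions.
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the validation and test indices untouched by the balancing step means that both sets still reflect the original class distribution when the model is scored.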

5. Evaluation Metrics

To validate and test the credit card fraud detection model, the test dataset was processed to confirm that the model produced correct results according to the evaluation metrics. ML algorithms are generally evaluated using different metrics, such as accuracy, precision, recall, F1 score, area under the receiver operating characteristic curve (AUC-ROC), and area under the precision-recall curve (AUPRC).

The confusion matrix is used to assess the performance of a classification model [23]. It displays the numbers of true positives, true negatives, false positives, and false negatives. True positives are cases in which the model correctly predicts a positive outcome, whereas true negatives are those in which the model correctly predicts a negative outcome [23]. The number of false positives is the number of instances in which the model predicts a positive outcome but the actual outcome is negative. The number of false negatives is the number of instances in which the model predicts a negative outcome but the actual outcome is positive [23].
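As a small illustration of these four counts, the sketch below tallies them from a list of actual labels and a list of predicted labels. The function name `confusion_counts`, the encoding of fraud as 1, and the toy label lists are assumptions made for this example only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels, treating `positive` as fraud."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # fraud predicted, fraud observed
            else:
                fp += 1  # fraud predicted, transaction was genuine
        else:
            if actual == positive:
                fn += 1  # genuine predicted, transaction was fraud
            else:
                tn += 1  # genuine predicted, genuine observed
    return tp, fp, tn, fn

# Hypothetical labels: 1 = fraud, 0 = non-fraud.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # 3 1 3 1
```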

The accuracy, precision, and recall metrics are described with respect to the confusion matrix in Table 2. Accuracy is the most obvious measure of a model’s predictive ability. The numerator in this measure contains all correctly labeled positive and negative class instances (TP: fraud; TN: non-fraud) [24]:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)

Precision, also known as the positive predictive value, is the proportion of true positives to predicted positives generated by a model. A precision value of 1 indicates that all predicted positive instances are indeed positive (FP: non-fraud transactions incorrectly classified as fraud) [23].

\text{Precision} = \frac{TP}{TP + FP} \qquad (2)

Recall, also known as the true-positive rate, is the proportion of true positives to all positive instances in the sample. A recall value of 1 indicates that all positive samples were correctly identified (FN: fraud transactions incorrectly classified as non-fraud) [23].

\text{Recall} = \frac{TP}{TP + FN} \qquad (3)

For a classification task and an imbalanced dataset, the F1 score is the harmonic mean of the precision and recall values. The F1 score was calculated as follows:

F1 = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (4)
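Equations (1) to (4) translate directly into code. The following sketch computes all four metrics from confusion-matrix counts; the function name, the choice to return 0.0 when a denominator is zero, and the example counts are assumptions for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 as in Equations (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Returning 0.0 for an empty denominator is a convention assumed here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if recall + precision:
        f1 = 2 * (recall * precision) / (recall + precision)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for a heavily imbalanced test set.
metrics = classification_metrics(tp=80, fp=20, tn=9880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is 0.996 while precision, recall, and F1 are each 0.8, which shows why accuracy alone can look reassuring when the non-fraud class dominates.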

The AUPRC is the area under the precision-recall curve, which is obtained by evaluating precision and recall at several thresholds [25]. This is a plot of precision versus recall; since precision equals one minus the false discovery rate, it corresponds to the false discovery rate curve. It is simple to compare various classification models using the AUPRC, which summarizes the precision-recall curve [26]. The AUPRC value of a perfect classifier is 1. A system with high recall and high precision produces results with accurate labels [26]. The AUPRC metric examines the positive predictive value and true positive rate, making it more sensitive to improvements for the positive class (fraud class) [27].
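One common way to estimate the AUPRC is average precision, which sums the precision at each rank where a positive sample is found and divides by the number of positives. The sketch below is a minimal version of that estimator, assuming that scores are distinct (ties are not handled specially) and that fraud is encoded as 1; the paper does not state which estimator it used.

```python
def average_precision(y_true, scores, positive=1):
    """Step-wise estimate of the area under the precision-recall curve."""
    total_positives = sum(1 for label in y_true if label == positive)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive sample")
    # Rank samples from the highest to the lowest predicted fraud score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == positive:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives

# Hypothetical labels and scores.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(y_true, scores)
print(round(ap, 4))  # 0.8333
```

A ranking that places every fraud case above every genuine one yields a value of 1, matching the perfect-classifier case described above.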

6. Feature Engineering for Customer Spending Behavior Detection

Feature engineering is an essential task in data mining that involves modifying the feature space of a dataset to enhance the performance of detection modeling. Its significance lies in the fact that the majority of machine learning models cannot effectively make decisions when a fraudster alters their fraudulent strategy. This section characterizes significant features that help in detecting credit card fraud based on customer spending behavior. It first presents feature aggregation for features that specify customer spending behavior, then applies an exhaustive search to select the most relevant features as inputs for the classifiers, and finally discusses the evaluation and results of the proposed method.

6.1. Methodology

Feature engineering is the task or process of evolving the feature representation of a predictive modeling problem to fit the needs of a training algorithm [28]. Feature engineering is an important part of preparing data for machine learning. It involves the practice of creating appropriate features from given features to improve detection performance and the process of generating new features by applying transformation functions, such as mathematics and aggregate operators [29]. In the training phase of developing a model for detecting fraud, feature aggregation and feature selection are crucial elements. Feature selection aims to eliminate unnecessary or redundant features that may be excluded from the analysis, which can accelerate model training and enhance the performance of classification models [30].

At the core of credit card fraud detection is an analysis of the spending patterns of cardholders. This involves choosing the best characteristics to capture the unique nature of a credit card transaction. Both genuine and fraudulent transactions can change rapidly. Therefore, to achieve effective credit card transaction classification, an ideal feature selection strategy that significantly distinguishes between genuine and fraudulent transactions is recommended. The algorithm and features chosen to reflect the cardholder’s spending behavior affect the way credit card fraud detection systems operate.

After balancing the PaySim dataset with hybrid Tomek links BCB-SMOTE, the subsequent task is to analyze the dataset. This dataset contains numerous features that are the key characteristics observed in widely held credit card transaction datasets. Therefore, we employed a feature engineering approach that involved aggregation based on customer spending behavior and then selected the most pertinent features using an exhaustive search (Figure 2). This approach generates novel feature combinations that capture complex patterns in data and improve the model’s ability to detect fraud signals. It enables the aggregation of features that are particularly relevant to fraud detection, and aggregation helps capture the patterns that characterize fraudulent transactions. Moreover, EFS ensures that all possible feature combinations are evaluated to identify the optimal subset that enhances the model, whereas many existing methods use heuristics to select features. Our literature review (above) has demonstrated that feature aggregation and feature selection play a crucial role in enhancing the performance of credit card fraud detection models.

Figure 2. The proposed feature engineering for PaySim dataset.

Before applying feature engineering, the dataset is cleaned. Analysis of the PaySim dataset to detect customers’ spending behavior showed that some attributes, such as Step and isFlaggedFraud, were irrelevant and would not add any value to the final results. Thus, these inessential attributes were dropped from the dataset, reducing its size so that the model could be applied more quickly.

Feature engineering has two stages: feature aggregation and feature selection using an exhaustive search. Feature aggregation extracts and combines features to create new relevant features that help detect customer spending behavior. Here, the dataset was split into a training set, validation set, and test set, with 60% for the training set, 20% for the validation set, and 20% for the test set. This procedure was implemented to ensure comprehensive model evaluation and mitigate the risk of overfitting. Validation enables the monitoring of model performance throughout the training process. Concurrently, the test set provided an impartial assessment of the final model’s capacity for generalization. Finally, an exhaustive search was conducted. This was performed to review all possible feature combinations and to locate the model with the highest F1 score. The F1 score metric was chosen instead of accuracy because it better measures model performance (accuracy measures the model without considering dataset balancing).

6.2. Feature Aggregation

Feature aggregation combines information over a series of transactions. Two crucial factors need to be considered when creating derived attributes: the primary attribute selection and the aggregation duration [31]. The number of transactions over a given period, such as a day, week, month, or three months, can be the derived attribute. Additionally, a particular merchant may be included in the total of the transactions. Numerous numerical and categorical attributes are present in the credit card transaction dataset. The total number of possible combinations of primary attributes and time periods or amounts is effectively unlimited. As a result, the attributes chosen for the model become crucial to its effectiveness; this choice may vary depending on how the fraudulent behavior of the offenders changes over time [31]. The benefits of using feature aggregation are greater accuracy of the model operation, faster detection of fraudulent behavior, and better coverage of different aspects of customer behavior.

In this study, feature aggregation was used based on customer spending behavior prior to each transaction. The PaySim dataset has 11 attributes, each of which represents information. When customer spending behavior is considered, important factors have to be analyzed, in this case, the amount spent and the balance of the original account. Thus, two new derived attributes were created from the primary attributes: AverageAmount and ChangeInBalance_Orig.

AverageAmount is the average of the amounts generally transacted by the customer; it thus depicts the average transaction for each customer. This is represented by Equation (5), where a_1, a_2, a_3, \dots, a_n are the transaction amounts and n is the number of transactions:

\text{AverageAmount} = \frac{a_1 + a_2 + a_3 + \dots + a_n}{n} \qquad (5)

ChangeInBalance_Orig is the absolute difference between the NewBalanceOrig and OldBalanceOrig attributes, as represented by Equation (6), where B_2 is the new balance and B_1 is the old one:

\text{ChangeInBalance\_Orig} = | B_2 - B_1 | \qquad (6)

Table 3 shows the feature aggregation that enabled the learning algorithm to learn various patterns of customer spending behavior and thus classify fraud patterns.
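The two derived attributes can be computed as in the following sketch, which applies Equations (5) and (6) to a handful of made-up rows. The customer key `nameOrig`, the `amount` field, the helper name `add_aggregates`, and the choice to average over all of a customer's rows (rather than only earlier ones) are assumptions for illustration; the paper's own implementation is not shown in the text.

```python
from collections import defaultdict

# Hypothetical toy transactions; field names are assumed for this example.
transactions = [
    {"nameOrig": "C1", "amount": 100.0, "OldBalanceOrig": 500.0, "NewBalanceOrig": 400.0},
    {"nameOrig": "C1", "amount": 300.0, "OldBalanceOrig": 400.0, "NewBalanceOrig": 100.0},
    {"nameOrig": "C2", "amount": 50.0, "OldBalanceOrig": 50.0, "NewBalanceOrig": 0.0},
]

def add_aggregates(rows):
    """Return copies of the rows with AverageAmount and ChangeInBalance_Orig added."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["nameOrig"]] += row["amount"]
        counts[row["nameOrig"]] += 1

    enriched = []
    for row in rows:
        customer = row["nameOrig"]
        new_row = dict(row)
        # Equation (5): mean transaction amount for this customer.
        new_row["AverageAmount"] = totals[customer] / counts[customer]
        # Equation (6): absolute change in the originating account balance.
        new_row["ChangeInBalance_Orig"] = abs(
            row["NewBalanceOrig"] - row["OldBalanceOrig"]
        )
        enriched.append(new_row)
    return enriched

enriched = add_aggregates(transactions)
for row in enriched:
    print(row["nameOrig"], row["AverageAmount"], row["ChangeInBalance_Orig"])
```

On a full dataset the same per-customer mean would more typically come from a grouped aggregation in a dataframe library; the plain-Python version is used here only to keep the example dependency-free.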

Figure 3 illustrates the PaySim dataset attributes before applying aggregation and after adding two new derived attributes using aggregation. It can be seen that two attributes, Step and isFlaggedFraud, are dropped in the pre-processing phase.

Figure 3. Feature aggregation for PaySim dataset.

6.3. Exhaustive Feature Selection (EFS)

Credit card fraud detection frequently involves handling a significant number of features or variables involved in the transactions. Some features may not be important or illuminating for the purpose of fraud detection. Feature selection selects appropriate features from potential features in the dataset, which addresses two issues: efficacy and compatibility. It chooses effective features for enhancing machine learning model predictions. It also makes features simple to employ for various types of machine learning algorithms. The selection of appropriate features in machine learning is crucial for four main reasons: first, it allows the machine learning algorithm to train more efficiently; second, it helps to simplify the model and enhance its interpretability; third, selecting the proper subset of features can significantly improve the model’s performance; and fourth, it assists in reducing overfitting, which is a common problem in machine learning.

The most straightforward feature selection approach is an exhaustive search [32]. This approach evaluates every possible feature subset and selects the subset for which the model achieves the best evaluation score [32]. The EFS strategy thus uses brute force to identify the optimal feature subset [33]: the machine learning algorithm’s performance is evaluated over all possible feature combinations in the dataset, and the subset with the best performance is chosen [33]. The exhaustive search method is the most computationally demanding of the wrapper strategies, as it attempts all potential feature combinations before selecting the best (Figure 4). The process for selecting the best feature subset using EFS is as follows [33]:

All features are selected from the original dataset.
The feature selection process is started.
The machine learning algorithm’s performance is assessed against all potential feature combinations in the dataset, and the optimal feature subset is determined as that which produces the best performance.

Figure 4. Exhaustive feature selection (EFS).

The exhaustive search can begin with a set that has just one feature and proceed by adding more features to the set, or it can start with a full feature set and eliminate features one by one.

In EFS, the number of combinations is represented by Equation (7), where $n$ is the number of features:

$$\mathrm{EFS} = 2^{n} - 1$$

(7)

Additionally, EFS has an $O(2^n)$ cost, where $n$ is the number of features in the original dataset. For large feature sets, EFS is expensive and impractical. Because there are few features in this study, EFS is feasible and yields an optimal feature subset.
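
The enumeration behind Equation (7) can be sketched as follows. This is an illustrative sketch only: toy_score is a hypothetical stand-in for training a classifier on a subset and returning its F1 score, and its weights are invented for the example.

```python
from itertools import combinations

def exhaustive_feature_selection(features, score_fn):
    """Score every non-empty subset; return (best_subset, best_score, n_evaluated)."""
    best_subset, best_score, n_evaluated = None, float("-inf"), 0
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = score_fn(subset)
            n_evaluated += 1
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score, n_evaluated

def toy_score(subset):
    # Hypothetical scoring rule: three features help, any other feature hurts slightly.
    useful = {"type": 0.4, "amount": 0.3, "averageAmount": 0.2}
    return sum(useful.get(name, -0.05) for name in subset)

names = ["type", "amount", "nameDest", "averageAmount"]
best, score, evaluated = exhaustive_feature_selection(names, toy_score)
# With 4 features, 2**4 - 1 = 15 subsets are evaluated.
```

In practice each call to the scoring function is a full train-and-evaluate cycle, which is why the approach is only practical for a small number of features.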

6.4. Classification

6.4.1. Decision Tree (DT)

The decision tree (DT) is one of the most effective techniques and is frequently employed in a variety of domains, including machine learning, image processing, and pattern recognition. A DT combines a sequence of simple tests, each of which compares a numerical feature to a threshold value. DT is also a classification model that is frequently used in data mining. Every tree is made up of nodes and branches; each node represents a feature of the category to be classified, and each subset specifies a value that the node may take [34].

6.4.2. Random Forest (RF)

RF is an ensemble of DT classifiers. Compared to a single DT, RF has the advantage of correcting the tendency to overfit. To train each tree, a random subset of the training set is sampled [22]. A DT is then built, with each node split on a feature chosen from a random subset of the features. Since each tree in RF is trained independently of the others, training is fast even on datasets with many features and data instances. The RF algorithm is known to be resistant to overfitting and to offer a good estimate of the generalization error [22].

6.4.3. Logistic Regression (LR)

Logistic regression (LR) is among the classification algorithms used most frequently in machine learning [35]. LR is a linear model that estimates the probability of an event occurring from the input features, and it is a popular technique for binary classification tasks. It produces interpretable results, can handle large datasets quickly, and is an effective baseline model for many classification problems [36]. The LR model describes the relationship between the outcome and predictors that may be binary or continuous [35].

6.4.4. Deep Neural Networks (DNN)

The DNN, along with shallow neural network-like models, is another type of artificial neural network. The number of hidden layers between the input and output layers is used as the categorization criterion. Similar to other common artificial neural networks, an activation function will carry a signal obtained by multiplying the input and its matching weight from the input layer to the hidden layers [37]. Combining feature extraction and classification to speed up learning and decision-making, the DNN has recently become increasingly useful and produced positive results in a variety of applications [38].

6.5. Evaluation

This study evaluates the efficiency of feature engineering approaches using feature aggregation and feature selection with the assessment metrics F1 score, accuracy, precision, recall, and AUPRC.
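
For reference, precision, recall, and F1 score follow directly from the confusion-matrix counts in Table 2. The sketch below uses the standard definitions; the example counts are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)

# Hypothetical counts: 90 frauds caught, 10 false alarms, 30 frauds missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)

# Consistency check against Table 6: the DT precision and recall reported
# for the proposed method reproduce the reported F1 score of 90.21%.
dt_f1 = f1_from(0.8252, 0.9948)
```

Because fraud is rare, accuracy stays close to 100% for every model in Table 6, so F1 and AUPRC are the more informative measures.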

To assess the effect of feature aggregation on model performance, the results after applying feature aggregation are compared with those obtained on the original data. Then, the performance of feature selection using EFS is evaluated by comparing its use with and without feature aggregation. The baseline classifier models RF, DT, LR, and DNN were selected for testing, as these have been reported to be among the best algorithms for detecting fraud in credit card transactions [39,40].

The RF classifier was built using the parameters shown in Table 4, which were chosen to produce better performance.

The DT and LR classifiers, on the other hand, were built by using the random_state parameter and setting the value to 42.

The DNN classifier, in contrast, was built using a sequential deep learning model. This model has a layer-by-layer design, as shown in Figure 5, with each layer containing weights matching the layer above.

Figure 5. Sequential model architecture.

TensorFlow, an open-source AI package mostly used in domains such as machine learning and deep learning network design, was employed in the model architecture. The primary function of TensorFlow in deep learning is to build models using dataflow graphs. The layers are defined as input, hidden, and output [41]. Keras is a Python library [42] that defines model layers by creating a sequential model with the ‘Sequential()’ constructor, which allows the deep learning model to be defined layer by layer. The input shape needs to be defined only for the first layer, as subsequent layers infer their input shapes automatically. Here, the input shape differed between experiments as the number of features changed. Each layer’s activation function was defined, and new layers were added using the ‘add()’ method.

There were seven hidden layers; the architecture of the DNN classifier is shown in Figure 6. The activation function used in the hidden layers was the Rectified Linear Unit (ReLU). ReLU sets all negative values to zero and passes positive values through unchanged, which simplifies the computation [43]. Batch normalization was applied after each hidden layer to speed up the network’s convergence and to help reduce internal covariate shift. This also improves gradient flow throughout the backpropagation process, which enhances overall performance [44]. The SoftMax function was used in the final output layer to generate the classification output. SoftMax calculates the probability of each target class over all possible target classes [45].

Figure 6. DNN classifier architecture (the input shape was 10, the number of the features after aggregation).
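
The two activation functions named above can be illustrated in a few lines of plain Python. This is a sketch of the standard definitions, not the Keras implementation used in the study; the input values are hypothetical.

```python
import math

def relu(values):
    """Rectified Linear Unit: negative inputs become 0.0, others pass through."""
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = relu([-1.5, 0.0, 2.0])  # negative entry is clipped to zero
probs = softmax([2.0, 0.0])      # two output neurons, one per class (Table 5)
```

With two output neurons, the larger score receives the larger probability, and the predicted class is the one with the highest probability.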

After the network model was defined, it was compiled; configuring the learning process before proceeding to training is vital. TensorFlow was used for the compilation, which was carried out with the ‘compile()’ method. Three key parameters were defined to prepare the model for training: the loss function, the optimizer, and the metrics. The loss function, binary_crossentropy, measures how inaccurate the model’s predictions are. The optimizer updates the weights to minimize the loss function and improve model accuracy. To reduce training time and produce better results, Stochastic Gradient Descent (SGD) was selected as the optimizer. The metrics parameter takes a collection of metrics; here, these were accuracy, precision, recall, and F-score.
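
To make the roles of the loss function and the optimizer concrete, the sketch below implements the standard binary cross-entropy for a single predicted fraud probability per sample, together with one plain SGD update. It is a simplified illustration rather than the Keras internals; the clipping constant eps and the example values are assumptions, while the learning rate of 0.001 is taken from Table 5.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; lower values mean better predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 (assumed eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def sgd_step(weight, gradient, learning_rate=0.001):
    """One gradient-descent update: move the weight against the gradient."""
    return weight - learning_rate * gradient

confident = binary_cross_entropy([1, 0], [0.9, 0.1])  # good predictions
unsure = binary_cross_entropy([1, 0], [0.6, 0.4])     # weaker predictions
updated = sgd_step(weight=1.0, gradient=2.0)
```

The loss is smaller for the confident, correct predictions than for the unsure ones, and this is the quantity the optimizer drives down during training.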

The model was trained on the training dataset using the ‘fit()’ method. In this phase, we specified the training dataset, the number of epochs, and the batch size. An epoch is one complete pass over the training data, carried out in multiple batches; a batch is a portion of the training data, and the weights are updated after each batch. Our model was fitted for 100 training epochs with a batch size of 256, and the ‘evaluate()’ function was then used to determine the model’s metrics on the testing data. The parameters used in the DNN structure are detailed in Table 5.
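
The relationship between epochs, batches, and weight updates can be worked out with simple arithmetic. The sketch assumes mini-batch training in which one weight update is performed per batch; the training-set size is hypothetical, while the batch size and epoch count are those in Table 5.

```python
import math

def updates_per_run(n_train, batch_size=256, epochs=100):
    """Return (batches per epoch, total weight updates) for mini-batch training."""
    batches_per_epoch = math.ceil(n_train / batch_size)  # last batch may be partial
    return batches_per_epoch, batches_per_epoch * epochs

# Hypothetical training-set size of 10,000 transactions.
per_epoch, total = updates_per_run(n_train=10_000)
```

For 10,000 training samples this gives 40 batches per epoch and 4000 weight updates over the full run.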

As illustrated in Table 5, each hidden layer has a different number of neurons. This is known as a variable-width neural network. This allows the network to adapt to the complexity of the data and task under consideration.

This study additionally analyzes the results of the optimal feature subsets produced using baseline learning classifier models based on customer spending behavior. F1 score was the metric utilized to evaluate how the feature engineering performed, where the highest F1 score showed the best performance for the model. Also, AUPR was applied to appraise any overfitting and bias in the model.

7. Results and Discussion

This section reviews the results of the feature engineering tested on the PaySim dataset. To evaluate the performance of the feature aggregation and selection, the dataset was divided into training, validation, and test sets, which comprised 60%, 20%, and 20% of the original dataset, respectively. During splitting, samples from the same customer are kept within the same subset. This ensures that the model’s performance can be accurately assessed on unseen data while preventing data leakage between sets. A stratified split based on customer samples helps maintain the distribution of customer characteristics across all three sets, enhancing the generalizability of the model. In addition, this method allows for a more robust evaluation of the feature aggregation and selection techniques, as it accounts for potential customer-specific patterns or behaviors. The evaluation compares the results to assess the proposed method and analyzes the results of the EFS experiments using the baseline classifier models. The outcomes of the experiments are then reported.
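
A customer-grouped 60/20/20 split of this kind could look like the sketch below. Several points are assumptions made for illustration: nameOrig is used as the customer key, the fractions are applied to the number of customers rather than to the number of transactions, the seed is arbitrary, and no stratification is performed.

```python
import random

def split_by_customer(rows, key="nameOrig", fractions=(0.6, 0.2, 0.2), seed=42):
    """Assign whole customers to train/val/test so no customer spans two sets."""
    customers = sorted({row[key] for row in rows})
    random.Random(seed).shuffle(customers)  # reproducible shuffle of customer IDs
    n_train = round(fractions[0] * len(customers))
    n_val = round(fractions[1] * len(customers))
    groups = {
        "train": set(customers[:n_train]),
        "val": set(customers[n_train:n_train + n_val]),
        "test": set(customers[n_train + n_val:]),
    }
    return {name: [r for r in rows if r[key] in ids] for name, ids in groups.items()}

# Hypothetical data: 10 customers with 5 transactions each.
rows = [{"nameOrig": f"C{i % 10}", "amount": float(i)} for i in range(50)]
parts = split_by_customer(rows)
```

Splitting on customer IDs rather than on individual transactions is what prevents aggregated features such as averageAmount from leaking information about a customer from the training set into the test set.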

It was observed that the evaluation performance of the validation set was very close to that of the test set. Table 6 and Figure 7 (below) provide detailed information on the performance measurements of all applied methods.

Before Feature Aggregation

Before feature aggregation, the RF and DT models produced the highest F1 scores, of 85.20% and 86.27%, respectively. They also achieved high accuracy and recall, indicating strong detection capabilities. The LR model gave a significantly lower performance, particularly in precision and F1 score, highlighting its limitations in this context. The DNN, with an F1 score of 75.81% and an accuracy of 99.94%, performed reasonably well; nevertheless, it lagged in AUPR, suggesting that there is room for improvement in its identification of positive cases.

After Feature Aggregation

An improvement was observed across most classifiers. RF showed an increase in F1 score to 87.74% and in AUPR to 77.05%. DT experienced a slight decline in F1 score but maintained high recall. LR still underperformed, and its scores in Table 6 decreased slightly rather than improved, indicating that feature aggregation alone is insufficient for this model. Conversely, the DNN showed improved recall and AUPR, highlighting the benefits of feature aggregation for complex models.

EFS with Feature Aggregation

DT showed the best performance, with an F1 score of 90.21% and near-perfect recall of 99.48%, indicating an exceptional ability to detect fraud cases. RF maintained high performance, though slightly lower than DT, with balanced metrics across the board. LR showed improvements, but its overall performance remained suboptimal compared to other models. The DNN performance was stable but did not significantly improve (i.e., as compared to the aggregation alone).

EFS without Feature Aggregation

RF and DT maintained high accuracy, but their F1 scores slightly declined compared to the proposed method. The DNN performance also dropped significantly, suggesting that feature aggregation is crucial for its effectiveness. Nevertheless, LR showed an unexpected improvement in precision and F1 score (Table 6), indicating that EFS alone might help in certain scenarios.

As shown in Figure 7 (below), it was found that applying feature aggregation alone enhanced the performance of most of the models, especially for recall and AUPR, indicating its utility in capturing important patterns. By refining the feature set and improving detection capabilities, the application of both EFS and feature aggregation provided the best overall performance, particularly for DT and RF.

Figure 7. Performance evaluation using different models.

The proposed method was evaluated against a prior investigation [27] that utilized Recursive Feature Elimination (RFE) on the PaySim dataset by employing a support vector machine (SVM) classifier. The results demonstrate that our proposed approach achieves a higher F1 score and precision using a decision tree (DT) classifier, indicating improved performance, as shown in Table 7. These superior results suggest that our approach may be more effective in identifying fraudulent transactions in the global financial system. Furthermore, a higher precision indicates a reduced likelihood of false positives, which is crucial for maintaining customer trust and minimizing unnecessary investigations.

The results obtained by combining feature aggregation with EFS show that the two techniques complement each other. Together, they refined the feature set and improved the model’s overall performance, allowing it to recognize more significant features. The proposed method, EFS with feature aggregation, thus significantly raised F1 score, precision, recall, and AUPR, which are critical for identifying fraud.

Complexity

When performing feature aggregation, two new features (average amount and change in balance) are generated from the original data; thus, the time complexity is $O(N)$, where $N$ is the number of data points (transactions).

Moreover, when EFS is applied, the time complexity is $O((2^{p} - 1) \cdot T_{\mathrm{model}})$ with $p = n + 2$, where $n$ is the number of original features and 2 is the number of newly aggregated features. The total number of subsets of these $p$ features is $2^{p} - 1$, excluding the empty set. For each subset, one needs to train and evaluate a model, which takes time $T_{\mathrm{model}}$.

The complexity of hybrid feature aggregation and EFS is the sum of the time required for feature aggregation and the time required for EFS. Because $O(N)$ is linear and $O((2^{p} - 1) \cdot T_{\mathrm{model}})$ is exponential, the overall time complexity simplifies to

$$O(2^{p} \cdot T_{\mathrm{model}})$$
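
As a worked example, assume the feature counts listed in Table 5 (8 input features before aggregation and 10 after), so that n = 8 and p = 10. The number of subsets that EFS must evaluate, with and without the two aggregated features, is then:

```latex
p = n + 2 = 8 + 2 = 10, \qquad 2^{p} - 1 = 2^{10} - 1 = 1023, \qquad 2^{n} - 1 = 2^{8} - 1 = 255
```

Under this assumption, adding the two aggregated features roughly quadruples the number of models to train and evaluate, which remains tractable at this scale.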

Analyzing Customer Spending Behavior

In the customer spending behavior analysis, the best feature subset for each classifier model was identified as the one yielding the highest F1 score. As Table 8 shows, most of the features in the subsets are similar, focusing on the type and the amount of the transaction, the balance before and after the spending, and the average amount. To identify the customer spending behavior that leads to fraud, additional features are required in the subset, such as ChangeInBalance_Orig, which captures the change in the balance due to a transaction, and nameDest and newbalanceDest, which are related to the destination account and its balance, providing insight into the recipients of transactions.

The study of customer spending habits relies on several pertinent attributes that give critical information about transaction flow. The feature ‘type’ describes the nature of each financial activity, differentiating between normal and potentially fraudulent activities. ‘amount’ indicates the monetary value of transactions, which is essential in detecting abnormal spending trends (i.e., those that deviate from a customer’s usual conduct). ‘oldbalanceOrg’ and ‘newbalanceOrig’ represent the account balances before and after the transaction and are crucial for determining whether transactions are feasible and identifying anomalies (large differences between these balances might suggest suspicious behavior). ‘averageAmount’ shows an average transaction size over time, enabling one to know if this transaction is abnormal (Figure 8).

Figure 8. Average transaction amounts over time.

As illustrated in Figure 9, the correlation matrix highlights strong correlations between several features. oldbalanceOrg and newbalanceOrig have a perfect positive correlation, indicating a strong linear relationship. This is expected as the new balance is usually derived from the old balance. Moreover, changeInBalance_Orig is strongly correlated with oldbalanceOrg and newbalanceOrig, suggesting that balance changes are directly related to the initial and final balance amounts. The perfect correlation between ‘amount’ and ‘averageAmount’ demonstrates the model’s ability to detect inconsistencies in transaction data, resulting in increased predictive power. The distinct informational value of ‘type’ indicates that different transaction types have various impacts on customer spending behavior. By identifying the most informative features, the model has achieved exceptional performance in predicting customer spending patterns.

Figure 9. Correlation matrix for customer behavior-based features.
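
Each cell of a correlation matrix such as the one in Figure 9 is typically a Pearson correlation coefficient; that Pearson was the measure used here is an assumption. A minimal sketch of the computation for one pair of features, using hypothetical balance values:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical balances: the new balance is the old one minus a constant amount.
old_balance = [100.0, 200.0, 300.0, 400.0]
new_balance = [90.0, 190.0, 290.0, 390.0]
r = pearson(old_balance, new_balance)
```

When one feature is an exact linear function of another, as in this toy example, the coefficient is 1 up to floating-point rounding, which mirrors the near-perfect correlation reported between oldbalanceOrg and newbalanceOrig.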

8. Conclusions

The reliability and quality of features are essential for empowering machine learning algorithms and for gaining insight into customer spending behavior that indicates fraud. This article aimed to apply feature engineering to distinguish customer spending behavior. The feature engineering applied here combined feature aggregation and EFS. The evaluation on the PaySim dataset has shown that combining feature aggregation with EFS significantly improves classifier performance, especially for DT and RF. Feature aggregation and EFS can produce more robust and accurate credit card fraud detection models. While either approach can improve performance on its own, the best results come from their integration, which is why combining them is especially beneficial in practical fraud prevention applications. Overall, feature engineering is crucial for optimizing model accuracy and reliability.

Future research needs to investigate more extensive feature engineering techniques, such as automated processes and deep feature construction, to capture the complex patterns in transaction data. Moreover, to strengthen the robustness of detection models and to address ethical issues in fraud prevention, future work should focus on adaptive techniques that can keep pace with the changing credit card fraud landscape and on privacy-preserving approaches such as federated learning.

Author Contributions

Conceptualization, M.A.; methodology, M.A.; software, M.A.; validation, M.A.; formal analysis, M.A.; investigation, M.A.; resources, M.A.; writing—original draft preparation, M.A.; writing—review and editing, M.A.; visualization, M.A.; supervision, M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://www.kaggle.com/datasets/ealaxi/paysim1 (accessed on 11 February 2021).

Acknowledgments

This research paper was supported by a grant from the “Research Centre of the Female Scientific and Medical Colleges”, Deanship of Scientific Research (GSR), King Saud University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. The Nilson Report. Payment Card Fraud Losses Reach $27.85 Billion Annual Fraud Statistics. Available online: https://nilsonreport.com/ (accessed on 23 October 2023).
2. Mbakwe, A.; Adewale, S. Machine Learning Algorithms for Credit Card Fraud Detection. Mach. Learn. Appl. Int. J. (MLAIJ) 2022, 9, 17–26.
3. Madhurya, M.J.; Gururaj, H.L.; Soundarya, B.C.; Vidyashree, K.P.; Rajendra, A.B. Exploratory analysis of credit card fraud detection using machine learning techniques. Glob. Transit. Proc. 2022, 3, 31–37.
4. Susarla, D.; Ozdemir, S. Feature Engineering Made Easy: Identify Unique Features from Your Dataset in order to Build Powerful Machine Learning Systems; Packt Publishing Ltd.: Birmingham, UK, 2018.
5. Baesens, B.; Höppner, S.; Verdonck, T. Data engineering for fraud detection. Decis. Support Syst. 2021, 150, 113492.
6. Kumar, A.; Gopal, R.D.; Shankar, R.; Tan, K.H. Fraudulent review detection model focusing on emotional expressions and explicit aspects: Investigating the potential of feature engineering. Decis. Support Syst. 2022, 155, 113728.
7. Zebari, R.; Abdulazeez, A.; Zeebaree, D.; Zebari, D.; Saeed, J. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appl. Sci. Technol. Trends 2020, 1, 56–70.
8. Jurgovsky, J.; Granitzer, M.; Ziegler, K.; Calabretto, S.; Portier, P.-E.; He-Guelton, L.; Caelen, O. Sequence classification for credit-card fraud detection. Expert Syst. Appl. 2018, 100, 234–245.
9. Ileberi, E.; Sun, Y.; Wang, Z. A machine learning based credit card fraud detection using the GA algorithm for feature selection. J. Big Data 2022, 9, 24.
10. Ranjan, R.; Chhabra, J.K. Automatic feature selection using enhanced dynamic Crow Search Algorithm. Int. J. Inf. Technol. 2023, 15, 2777–2782.
11. Jovic, A.; Brkic, K.; Bogunovic, N. A review of feature selection methods with applications. In Proceedings of the 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 25–29 May 2015.
12. Abedin, M.Z.; Hajek, P.; Sharif, T.; Satu, M.S.; Khan, M.I. Modelling bank customer behaviour using feature engineering and classification techniques. Res. Int. Bus. Financ. 2023, 65, 101913.
13. Noghani, F.F.; Moattar, M. Ensemble classification and extended feature selection for credit card fraud detection. J. Artif. Intell. Data Min. 2017, 5, 235–243.
14. Zhang, X.; Han, Y.; Xu, W.; Wang, Q. HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning architecture. Inf. Sci. 2021, 557, 302–316.
15. Kamalov, F.; Elnaffarr, S.; Cherukuri, A.; Jonnalagadda, A. Forward feature selection: Empirical analysis. J. Intell. Syst. Internet Things 2024, 11, 44–54.
16. Jiang, C.; Song, J.; Liu, G.; Zheng, L.; Luan, W. Credit Card Fraud Detection: A Novel approach using aggregation strategy and feedback mechanism. IEEE Internet Things J. 2018, 5, 3637–3647.
17. Dastidar, K.G.; Jurgovsky, J.; Siblini, W.; Granitzer, M. NAG: Neural feature aggregation framework for credit card fraud detection. Knowl. Inf. Syst. 2022, 64, 831–858.
18. Escobar, M.; D’Ambrosio, C.; Liberti, L.; Vanier, S. (Eds.) Integer Formulation for Computing Transaction Aggregation to Detect Credit Card Fraud. In Proceedings of the CTW-Workshop on Graph Theory and Combinatorial Optimization, Online, 14–16 September 2020.
19. Ikeda, C.; Ouazzane, K.; Yu, Q.; Hubenova, S. New feature Engineering Framework for Deep learning in Financial Fraud Detection. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 10–21.
20. Lopez-Rojas, E.; Elmir, A.; Axelsson, S. Paysim: A financial mobile money simulator for fraud detection. In Proceedings of the 28th European Modeling and Simulation Symposium, Larnaca, Cyprus, 26–28 September 2016; pp. 249–255.
21. Alamri, M.; Ykhlef, M. Hybrid undersampling and oversampling for handling imbalanced credit card data. IEEE Access 2024, 12, 14050–14060.
22. Tadvi, F.; Shinde, S.; Patil, D.; Dmello, S. Real time credit card fraud detection. Int. Res. J. Eng. Technol. (IRJET) 2021, 8, 2177–2180.
23. Mondal, I.A.; Haque, M.E.; Hassan, A.-M.; Shatabda, S. Handling imbalanced data for credit card fraud detection. In Proceedings of the 2021 24th International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, 18–20 December 2021.
24. Alharbi, A.; Alshammari, M.; Okon, O.D.; Alabrah, A.; Rauf, H.T.; Alyami, H.; Meraj, T. A novel Text2IMG Mechanism of Credit Card Fraud Detection: A Deep Learning approach. Electronics 2022, 11, 756.
25. Karthik, V.S.S.; Mishra, A.; Reddy, U.S. Credit Card Fraud Detection by Modelling Behaviour Pattern Using Hybrid Ensemble Model. Arab. J. Sci. Eng. 2021, 47, 1987–1997.
26. Arora, V.; Leekha, R.S.; Lee, K.; Kataria, A. Facilitating User Authorization from Imbalanced Data Logs of Credit Cards Using Artificial Intelligence. Mob. Inf. Syst. 2020, 2020, 1–13.
27. Rtayli, N.; Enneya, N. Enhanced credit card fraud detection based on SVM-recursive feature elimination and hyper-parameters optimization. J. Inf. Secur. Appl. 2020, 55, 102596.
28. Khurana, U.; Samulowitz, H.; Turaga, D. Feature engineering for predictive modeling using reinforcement learning. In AAAI Technical Track: Machine Learning, Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Public Knowledge Project: Burnaby, BC, Canada, 2018; Volume 32, p. 32.
29. Nargesian, F.; Samulowitz, H.; Khurana, U.; Khalil, E.B.; Turaga, D. Learning Feature Engineering for Classification. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence; International Joint Conferences on Artificial Intelligence Organization: California City, CA, USA, 2017; pp. 2529–2535.
30. Wang, H.; Liang, Q.; Hancock, J.T.; Khoshgoftaar, T.M. Enhancing Credit Card Fraud Detection Through a Novel Ensemble Feature Selection Technique. In Proceedings of the 2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science (IRI), Bellevue, WA, USA, 4–6 August 2023; pp. 121–126.
31. Jha, S.; Guillen, M.; Westland, J.C. Employing transaction aggregation strategy to detect credit card fraud. Expert Syst. Appl. 2012, 39, 12650–12657.
32. Nersisyan, S.; Novosad, V.; Galatenko, A.; Sokolov, A.; Bokov, G.; Konovalov, A.; Alekseev, D.; Tonevitsky, A. ExhauFS: Exhaustive search-based feature selection for classification and survival regression. PeerJ 2022, 10, e13200.
33. Dissanayake, K.; Johar, M.G.M. Comparative study on heart disease prediction using feature selection techniques on classification algorithms. Appl. Comput. Intell. Soft Comput. 2021, 2021, 5581806.
34. Charbuty, B.; Abdulazeez, A. Classification based on decision tree algorithm for machine learning. J. Appl. Sci. Technol. Trends 2021, 2, 20–28.
35. Atchaya, P.; Somasundaram, K. Novel Logistic Regression over Naive Bayes Improves Accuracy in Credit Card Fraud Detection. J. Surv. Fish. Sci. 2023, 2, 2172–2181.
36. Yundong, W.; Zhulev, N.A.; Ahmed, N.O.G. Credit Card Fraud Identification using Logistic Regression and Random Forest. Wasit J. Comput. Math. Sci. 2023, 2, 1–8.
37. Nguyen, N.; Duong, T.; Chau, T.; Nguyen, V.-H.; Trinh, T.; Tran, D.; Ho, T. A proposed model for card fraud detection based on CatBoost and deep neural network. IEEE Access 2022, 10, 96852–96861.
38. Dang, T.K.; Tran, T.C.; Tuan, L.M.; Tiep, M.V. Machine learning based on resampling approaches and deep reinforcement learning for credit card fraud detection systems. Appl. Sci. 2021, 11, 10004.
39. Afriyie, J.K.; Tawiah, K.; Pels, W.A.; Addai-Henne, S.; Dwamena, H.A.; Owiredu, E.O.; Ayeh, S.A.; Eshun, J. A supervised machine learning algorithm for detecting and predicting fraud in credit card transactions. Decis. Anal. J. 2023, 6, 100163.
40. Alarfaj, F.K.; Malik, I.; Khan, H.U.; Almusallam, N.; Ramzan, M.; Ahmed, M. Credit card fraud detection using State-of-the-Art machine learning and deep learning algorithms. IEEE Access 2022, 10, 39700–39715.
41. TensorFlow. Available online: https://www.tensorflow.org/ (accessed on 14 July 2024).
42. Team, K. Keras: Deep Learning for Humans. Available online: https://keras.io/ (accessed on 14 July 2024).
43. Shenvi, P.; Samant, N.; Kumar, S.; Kulkarni, V. Credit Card Fraud Detection using Deep Learning. In Proceedings of the 2019 IEEE 5th International Conference for Convergence in Technology (I2CT), Bombay, India, 29–31 March 2019; pp. 1–5.
44. Cherif, A.; Ammar, H.; Kalkatawi, M.; Alshehri, S.; Imine, A. Encoder-decoder graph neural network for credit card fraud detection. J. King Saud Univ. Comput. Inf. Sci. 2024, 36, 102003.
45. Ming, R.; Abdelrahman, O.; Innab, N.; Ibrahim, M.H.K. Enhancing fraud detection in auto insurance and credit card transactions: A novel approach integrating CNNs and machine learning algorithms. PeerJ Comput. Sci. 2024, 10, e2088.

Table 1. Dataset attributes.

Attributes	Description
Step	Represents a unit of time in the real world; in this case, 1 step is 1 h. Total steps: 744 (30 days simulation)
Type	CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER
Amount	Amount of the transaction
nameOrig	The customer who started the transaction
oldBalanceOrig	Initial balance before the transaction
newBalanceOrig	The new balance after the transaction
nameDest	The customer who is the recipient of the transaction
oldBalanceDest	Initial balance recipient before the transaction
newBalanceDest	New balance recipient after the transaction
isFraud	Whether the transaction was made by a fraudulent agent
isFlaggedFraud	The business model is to control the huge transfer between accounts and flag illegal attempts

Table 2. Confusion matrix.

	Actual Class: Positive (Fraud)	Actual Class: Negative (Non-Fraud)
Predicted Class: Positive	True positive (TP)	False positive (FP)
Predicted Class: Negative	False negative (FN)	True negative (TN)

Table 3. Newly created features based on aggregation in PaySim dataset.

Aggregations	Description
AverageAmount	The average amount transacted from the original account.
ChangeInBalance_Orig	Change in the balance for original customers: (NewBalanceOrig − OldBalanceOrig).

Table 4. RF parameters.

Parameter	Value
n_estimators	100
max_depth	60
random_state	42

Table 5. Hyperparameters of deep neural network (DNN).

Hyperparameter	Value
No. of neurons in the input layer	Depending on the data (before aggregation = 8, after aggregation = 10)
No. of neurons in the output layer	2
No. of hidden layers	7
No. of neurons in each hidden layer	128, 64, 32, 16
Optimizer	Stochastic Gradient Descent (SGD)
Learning rate	0.001
Batch size	256
No. of epochs	100
Activation Function	ReLU

Table 6. Performance evaluation.

Feature Engineering	Model	F1 Score	Accuracy	Precision	Recall	AUPR
Before Feature Aggregation	RF	85.20%	99.95%	81.27%	89.53%	72.77%
Before Feature Aggregation	DT	86.27%	99.96%	80.81%	93.36%	74.87%
Before Feature Aggregation	LR	12.41%	98.43%	6.69%	86.122%	5.78%
Before Feature Aggregation	DNN	75.81%	99.94%	79.74%	72.25%	57.65%
After Feature Aggregation	RF	87.74%	99.96%	89.96%	85.63%	77.05%
After Feature Aggregation	DT	82.41%	99.95%	79.85%	85.14%	68.01%
After Feature Aggregation	LR	9.72%	97.97%	5.15%	84.29%	4.36%
After Feature Aggregation	DNN	77.89%	99.94%	76.58%	79.24%	60.71%
EFS without Feature Aggregation	RF	84.90%	99.96%	93.11%	78.03%	72.68%
EFS without Feature Aggregation	DT	86.58%	99.96%	77.94%	97.36%	75.90%
EFS without Feature Aggregation	LR	18.14%	99.04%	10.20%	81.83%	8.37%
EFS without Feature Aggregation	DNN	57.45%	99.88%	53.47%	62.07%	33.24%
EFS with Feature Aggregation (Proposed)	RF	87.43%	99.96%	85%	90.01%	76.52%
EFS with Feature Aggregation (Proposed)	DT	90.21%	99.97%	82.52%	99.48%	82.10%
EFS with Feature Aggregation (Proposed)	LR	10.14%	98.11%	5.40%	82.56%	4.48%
EFS with Feature Aggregation (Proposed)	DNN	76.06%	99.94%	78.51%	73.77%	57.95%

Table 7. Comparison with previous study.

Approach	F1 Score	Accuracy	Precision	Recall
[27]	88%	100%	78%	100%
Proposed Approach	90.21%	99.97%	82.52%	99.48%
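
Because F1 is the harmonic mean of precision and recall, the F1 values in Table 7 can be checked against the reported precision and recall. The short Python sketch below does this. The function name `f1_from_precision_recall` is hypothetical, and the inputs are the Table 7 percentages written as fractions.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from Table 7, expressed as fractions.
f1_prior = f1_from_precision_recall(0.78, 1.00)         # row [27]
f1_proposed = f1_from_precision_recall(0.8252, 0.9948)  # proposed approach
```

Rounded to the precision used in the table, both results agree with the reported F1 scores (88% and 90.21%).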

Table 8. Best feature subset using baseline classifiers.

Classification Algorithm	Best Features Subset
DT	[ ‘type’, ‘amount’, ‘oldbalanceOrg’, ‘newbalanceOrig’, ‘changeInBalance_Orig’, ‘averageAmount’ ]
RF	[ ‘type’, ‘amount’, ‘oldbalanceOrg’, ‘newbalanceOrig’, ‘nameDest’, ‘newbalanceDest’, ‘changeInBalance_Orig’, ‘averageAmount’ ]
LR	[ ‘type’, ‘amount’, ‘oldbalanceOrg’, ‘newbalanceOrig’, ‘oldbalanceDest’, ‘newbalanceDest’, ‘averageAmount’ ]
DNN	[ ‘type’, ‘amount’, ‘oldbalanceOrg’, ‘newbalanceOrig’, ‘nameDest’, ‘oldbalanceDest’, ‘newbalanceDest’, ‘changeInBalance_Orig’ ]
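
The subsets in Table 8 come from Exhaustive Feature Selection, which scores every possible feature subset and keeps the best one. The Python sketch below shows the general idea only. The function `exhaustive_feature_selection`, the `toy_score` function, and the `USEFUL` set are hypothetical stand-ins; in a real experiment the scoring function would train and validate a classifier on each subset, which is an assumption about the setup rather than a detail taken from the paper.

```python
from itertools import combinations


def exhaustive_feature_selection(features, score_fn):
    """Score every non-empty subset of `features`.

    Returns (best_subset, best_score, n_evaluated).
    """
    best_subset, best_score = None, float("-inf")
    n_evaluated = 0
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            n_evaluated += 1
            score = score_fn(subset)
            # Strict comparison: on ties, the first (smaller) subset is kept.
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score, n_evaluated


# Hypothetical scoring rule: reward "useful" features, penalize subset size.
USEFUL = {"type", "amount", "averageAmount"}


def toy_score(subset):
    return sum(1.0 for f in subset if f in USEFUL) - 0.1 * len(subset)


candidates = ["type", "amount", "nameDest", "averageAmount"]
best, score, n = exhaustive_feature_selection(candidates, toy_score)
```

The search visits 2^n - 1 subsets for n features, so the four candidates above require 15 evaluations and ten input features would require 1023. This cost is the main reason the exhaustive search is practical only for small feature sets.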

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
