Article

Enhancing Security in Airline Ticket Transactions: A Comparative Study of SVM and LightGBM

by César Gómez Arnaldo *, Raquel Delgado-Aguilera Jurado, Francisco Pérez Moreno and María Zamarreño Suárez
Department of Aerospace Systems, Air Transport and Airports, Universidad Politécnica de Madrid (UPM), 28040 Madrid, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9581; https://doi.org/10.3390/app15179581
Submission received: 7 July 2025 / Revised: 9 August 2025 / Accepted: 11 August 2025 / Published: 30 August 2025
(This article belongs to the Section Electrical, Electronics and Communications Engineering)

Abstract

Fraudulent online payment operations represent a persistent challenge in digital commerce, particularly in sectors like air travel, where credit and debit card payments dominate. This study presents a novel fraud detection framework tailored to airline ticket purchases, combining a synthetic dataset generator with a modular, customizable feature engineering process. These components feed two machine learning models—support vector machines (SVMs) and the light gradient boosting machine (LightGBM)—for real-time fraud detection. A synthetic dataset was generated, including a rich set of engineered features reflecting realistic user, transaction, and flight-related attributes. Both models were evaluated using standard classification metrics; LightGBM outperformed SVMs in overall performance, with an accuracy of 94.2% and a recall of 71.3% for fraudulent cases. The main contribution of this study is the design of a reusable, customizable feature engineering framework for fraud detection in the airline sector, along with the development of a lightweight, adaptable fraud detection system for merchants, especially small and medium-sized enterprises. These findings support the use of advanced machine learning methods to enhance security in digital airline transactions.

1. Introduction

The rapid growth of electronic commerce has transformed the way individuals and businesses engage in financial transactions. In particular, the airline industry has embraced digitalization, offering customers the ability to purchase tickets online using credit or debit cards. While this evolution has brought notable benefits—such as convenience, accessibility, and speed—it has also created new vulnerabilities that are increasingly exploited by fraudsters. The global financial losses attributed to payment fraud in e-commerce are staggering, and the airline sector is especially exposed due to the high transaction value, limited time between purchase and consumption, and the inability to verify customer identity physically.
Although financial institutions have developed sophisticated systems for detecting and preventing fraudulent activities, merchants themselves—especially small and medium-sized enterprises (SMEs)—often lack the tools and resources needed to protect their platforms. Most fraud detection solutions are tailored for large-scale processors and banks and are not designed for direct integration by merchants. This creates a critical gap in the ecosystem, where many businesses must either rely entirely on third-party protection or absorb losses from fraud. Furthermore, the limited adaptability of these solutions to domain-specific conditions (such as the characteristics of airline ticket sales) reduces their effectiveness in certain use cases.

1.1. Research Objectives

This study addresses that gap by developing a lightweight, customizable fraud detection framework tailored to the airline sector. The main objectives are: (i) to build a synthetic dataset simulating realistic online airline ticket transactions; (ii) to design a modular feature engineering process adaptable to various user and transaction profiles; and (iii) to compare the performance of two machine learning models—support vector machines (SVMs) and the light gradient boosting machine (LightGBM)—in identifying fraudulent operations. The proposed approach is designed to be merchant-oriented, especially for SMEs lacking access to scalable fraud prevention tools, and adaptable to specific operational needs.

1.2. Results Obtained

The primary result has been the development and testing of two machine learning models for predicting fraudulent transactions in online airline ticket sales using credit card payment methods.
A feature engineering process has been created for models applicable to this scenario, allowing for quick parameter adjustments and the generation of different datasets in a short time. The datasets are customizable, and in this case, the records are grouped based on the origin of the customer making the purchase.
An analysis of different fraud scenarios applicable to the specific use case has been conducted, following current regulations in Spain. The current industry situation regarding fraud issues and existing prevention and detection solutions has also been analyzed.

1.3. State of the Art

In 2022, approximately 41 billion US dollars were lost to payment fraud, considering only transactions conducted in e-commerce globally. This section provides a comprehensive review of the current research and technological advancements aimed at addressing this critical issue [1].
The rapid evolution of technology and the increasing use of electronic devices in our daily lives have significantly impacted various aspects of society. This includes the entertainment industry, automotive sector, and even social interactions, where we have transitioned from traditional means of communication to being constantly connected via social media and the Internet [2].
In the payments industry, technological advancements and digitalization have necessitated that all participants adapt their systems, strategies, and objectives to remain competitive and offer the best service to their clients. This section details the various actors involved in the payment ecosystem, their roles, and the challenges they face concerning fraud prevention and detection [3,4].

1.3.1. Payment Industry Overview

The payment industry controls and processes millions of transactions daily, enabling secure, efficient, and seamless buying and selling of goods and services globally. The ecosystem comprises several key players [5].
The payment ecosystem consists of several key actors that enable the secure and efficient processing of electronic transactions. Payment networks, such as Visa or Mastercard, are responsible for routing and settling transactions between institutions. Issuing banks provide payment cards to consumers and authorize transactions on their behalf. Acquiring banks, on the other hand, facilitate merchants in accepting card payments. Payment processors play a technical role by managing the data flow between merchants, banks, and networks. Payment gateways supply the infrastructure needed to authorize and transmit payment information. In addition, payment service providers (PSPs) bundle services such as gateways, processing, and fraud screening into a single platform. Payment information service providers (PISPs), enabled by PSD2 regulation, support account information services and initiate payments on behalf of users. Finally, regulatory bodies oversee the entire ecosystem, ensuring legal compliance, consumer protection, and system integrity.
The rapid growth of online shopping has presented significant opportunities and challenges. For consumers, the benefits include faster purchasing processes, accessibility to products from anywhere, and improved customer service [6]. However, this growth has also led to an increase in fraudulent activities targeting both consumers and merchants [7,8].

1.3.2. Technological Solutions in Fraud Detection and Prevention

To combat the growing problem of payment fraud, various technological solutions have been developed and are widely adopted in the industry [9,10].
Machine learning models play a central role by enabling predictive analysis based on historical data and the identification of subtle, non-linear patterns that may indicate fraudulent behavior. Complementing these are expert rule-based systems, which rely on predefined conditions—such as transaction amount thresholds, geographic inconsistencies, or timing anomalies—to flag potentially suspicious transactions.
Customer behavior analytics further enhance detection capabilities by analyzing individual user patterns over time and identifying deviations from typical purchasing behavior. Finally, real-time monitoring systems provide continuous surveillance of transactions as they occur, enabling immediate response to emerging threats and reducing the window of opportunity for fraud to succeed.
Prominent companies providing these solutions are LYNX, SEON, FeatureSpace, Ravelin, and Ekata.

1.3.3. Types of Payment Fraud

Understanding the different types of payment fraud is crucial for developing effective detection and prevention strategies [11]. In Spain, under the Payment Statistics Directive, fraud cases are categorized into three main types.
The first involves the issuance of a payment order by a fraudster, typically using a stolen, forged, or intercepted payment card without the authorization of the legitimate cardholder. The second category is the modification of a payment order by a fraudster, in which a valid payment initiated by a user is intercepted and altered—often by changing the destination account or the amount. The third type involves the manipulation of the payer, whereby the fraudster deceives the user into authorizing a transaction under false pretenses. This last category frequently involves social engineering techniques such as phishing (email deception), vishing (voice fraud), or smishing (SMS-based fraud), which exploit human vulnerabilities rather than system weaknesses.

1.3.4. Current Challenges and Future Directions

Despite advancements in fraud-detection technologies, the dynamic nature of fraudulent activities presents ongoing challenges. Fraudsters continuously evolve their techniques to bypass security measures, necessitating ongoing research and development in fraud-detection technologies. The introduction of new payment methods and the increasing volume of online transactions further complicate this landscape. Looking ahead, future research directions in fraud prevention emphasize several key areas [12].
These include the development of enhanced machine learning algorithms that can adapt more effectively to evolving fraud tactics. Researchers are also exploring improved integration of real-time monitoring systems with predictive models, aiming for better responsiveness to live threats. Another critical area is the advancement of customer behavior analytics, with the goal of detecting more subtle and complex anomalies in user behavior that may indicate fraudulent intent [13].

1.3.5. Context and Justification

In line with the previously discussed issues, payment fraud is a persistent problem affecting all parties involved in transaction processing. However, the weight of responsibility and the consequences are not distributed equitably or justly.
Merchants receiving payment transactions from customers are the weakest link in the chain. Compared to companies providing payment services and enabling monetary transfers for their business activities, merchants can do little to combat fraud and often must accept significant losses that affect a large part of their revenue. Small merchants, especially those with an online presence, are the favorite targets of cybercriminals who exploit vulnerabilities in these companies’ accounting systems, their limited resources for defense, or their lack of knowledge [14].
While it is true that banks have entire departments dedicated to combating this issue and allocate substantial financial and human resources to reducing its impact, merchants must trust these institutions to neutralize any fraud attempts. This creates a significant dependency on the part of the merchants [15].
If a merchant decided to implement additional measures beyond those applied by their payment processor or bank, they would find virtually no options in the market to add an extra layer of security. Companies offering fraud prevention services only provide solutions for banks and processors, which are often out of budget for small and medium-sized merchants. Even multinational retail companies would face difficulties implementing these systems in their payment gateways, as they are designed to be used by other entities.
Therefore, this research proposes developing a solution to bridge the gap between merchants and fraud detection and prevention service providers, focusing on the airline industry.
The analysis presented in the previous sections has identified a severe problem for many online merchants: the lack of an affordable and specialized fraud prevention system tailored to their specific operations. This research aims to create a product that addresses these issues.
The product must adapt to the specific operations of these merchants and set aside unrelated trends and transaction types. Despite the growing popularity of some payment methods like BNPL (Buy Now, Pay Later), the reality of the market is that credit and debit cards remain the preferred payment method among consumers.
In this case, airlines often do not offer the possibility of splitting payments for purchasing airline tickets, allowing only the use of credit and debit cards and, in rare cases, payment through wallets like PayPal or Google Pay.
This scenario narrows the scope of the research and reduces the problem’s difficulty, allowing for the development of a model focused on a single payment method. Concentrating on one payment method should enable the model to identify fraudsters’ patterns for that method more accurately.

1.3.6. Problem Statement

The rapid expansion of e-commerce has brought significant advantages to consumers, such as faster purchases, access to global markets, improved customer service, and reduced delivery times. However, this growth has been accompanied by a corresponding increase in fraudulent activities, which pose substantial risks to financial institutions, businesses, and end-users alike.
Financial institutions and businesses are constantly under threat from fraudsters who employ increasingly sophisticated techniques to exploit vulnerabilities in online payment systems. This issue is exacerbated by the high volume of transactions and the complexity of detecting fraudulent activities in real-time.
While banks and payment service providers have developed advanced fraud prevention systems, merchants—particularly small and medium-sized enterprises (SMEs)—often lack the resources or tools to effectively detect fraud on their own platforms. Many of the existing solutions are designed for large financial institutions and are inaccessible to merchants due to their cost, complexity, or lack of domain-specific customization. This asymmetry leaves SMEs especially vulnerable to financial losses and reputational damage caused by fraud.
Furthermore, the payment industry is characterized by a diverse ecosystem involving multiple stakeholders, including payment networks, issuing banks, acquiring banks, payment processors, and regulatory bodies. Each of these stakeholders faces unique challenges in preventing and detecting fraud, and the lack of a unified approach complicates the implementation of effective fraud prevention measures.
In the specific context of airline ticket sales, the challenge is even more pronounced. Transactions are typically high-value, conducted online, and processed almost exclusively through credit or debit cards. Moreover, airlines rarely offer split payment options or alternative methods, which further concentrates fraud risk. These characteristics necessitate a specialized fraud detection model that can operate in real-time, handle highly imbalanced data, and adapt to the unique structure of the airline industry.
Given the substantial financial losses associated with payment fraud and the critical need for effective solutions, this research aims to address the following key issues:
The development of a predictive model that can detect fraudulent payment transactions in real-time, specifically for airline ticket purchases.
The creation of a flexible feature engineering process that allows for quick adjustments and the generation of diverse datasets tailored to the needs of the model and the specific use case.
The analysis of current fraud-detection technologies and the identification of gaps in existing solutions to provide more accessible and effective tools for SMEs.
By addressing these issues, this research seeks to contribute to the development of more robust, affordable, and specialized fraud detection solutions that can benefit a wide range of stakeholders in the payment industry.

2. Methodology

The methodology adopted for this research focuses on developing and implementing a predictive model for detecting fraudulent payment transactions in the context of airline ticket purchases. The methodological approach is designed to ensure the creation of a robust, efficient, and accurate fraud detection system through a structured and systematic process.

2.1. Research Design

The research design encompasses several key stages, including data collection, feature engineering, model selection, training, evaluation, and validation. Each stage is critical to the overall success of the project and contributes to the development of a comprehensive fraud detection solution.

2.2. Data Collection

Given the sensitivity of data involved in fraud detection, obtaining a coherent dataset with sufficient records for training the models posed a significant challenge. Therefore, part of the research involved generating a synthetic dataset that closely resembles real-world transaction data. This dataset includes a rich set of features derived from various sources, focusing specifically on the scenario of purchasing airline tickets.

2.3. Feature Engineering

Feature engineering is a crucial step in the methodology, involving the extraction, transformation, and creation of new features from the raw data. This process ensures that the most relevant and informative attributes are included in the model, enhancing its predictive performance. The feature engineering process in this research is iterative, involving multiple rounds of refinement and optimization to achieve the best results.

2.4. Model Selection

The selection of appropriate machine learning models is based on a comprehensive analysis of existing techniques and their suitability for the fraud detection task. Support vector machines (SVMs) and light gradient boosting machine (LightGBM) were chosen due to their proven effectiveness in classification problems and their ability to handle high-dimensional data.

2.5. Model Training and Evaluation

The training phase involves feeding the prepared dataset into the selected models, tuning hyperparameters, and optimizing the models’ performance. Evaluation metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC-ROC) are used to assess the models’ effectiveness. The iterative nature of this phase ensures continuous improvement and refinement of the models.
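A minimal sketch of how the metrics named above can be computed for a fitted classifier is shown below; it assumes scikit-learn, and the toy imbalanced dataset and logistic-regression baseline are illustrative only, not the models or data used in this study.

```python
# Hedged sketch: computing accuracy, precision, recall, and AUC-ROC with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)                 # hard labels: 0 = legitimate, 1 = fraud
y_score = clf.predict_proba(X_te)[:, 1]    # fraud probability, needed for the ROC curve

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred, zero_division=0))
print("recall   :", recall_score(y_te, y_pred))
print("AUC-ROC  :", roc_auc_score(y_te, y_score))
```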

2.6. Validation and Testing

To ensure the robustness and generalizability of the models, rigorous validation and testing are conducted using separate subsets of data. This step includes cross-validation techniques to prevent overfitting and to confirm that the models perform well on unseen data.
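As a minimal illustration of this step (assuming scikit-learn and LightGBM; the toy data and scoring choice are assumptions), stratified k-fold cross-validation preserves the fraud ratio in every fold:

```python
# Hedged sketch: stratified 5-fold cross-validation on an imbalanced toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.97, 0.03], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # keeps class ratios per fold
scores = cross_val_score(LGBMClassifier(), X, y, cv=cv, scoring="roc_auc")
print("AUC-ROC per fold:", scores.round(3), "| mean:", round(scores.mean(), 3))
```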

2.7. Implementation and Simulation

Once the models are trained and validated, they are implemented in a simulated real-world environment. This simulation tests the models’ performance in a live setting, mimicking the operational conditions under which the fraud detection system would be deployed.
This methodological framework ensures a structured approach to developing a reliable and effective fraud detection system, contributing valuable insights and practical solutions to the problem of payment fraud in the airline industry.

3. Materials and Methods

Feature engineering is a critical component in the development of a predictive model, as it involves transforming raw data into meaningful features that can improve the model’s accuracy and performance. This process is especially vital in the context of fraud detection, where the ability to identify subtle patterns and anomalies can significantly impact the effectiveness of the detection system.

3.1. Feature Engineering Strategy

3.1.1. Data Sources and Types of Features

For this research, the feature engineering process began by identifying and categorizing the types of data relevant to the context of airline ticket purchases. The main categories of features include the following:
1. Card-related features:
  • Card network: the name of the company managing the card scheme (e.g., Visa, MasterCard).
  • Card number (primary account number or PAN): a unique identifier for the card, usually hashed for security.
  • Card type: whether the card is a credit or debit card.
  • Expiration date: the validity period of the card.
  • CVV code: the card verification value used for security purposes.
2. User-related features:
  • Username: the name of the user making the purchase.
  • Age: the age of the user.
  • Date of birth: the user’s birthdate.
  • Gender: the gender of the user.
  • Residential address: the user’s home address.
  • City and postal code: the city and postal code of the user’s residence.
  • Annual income: the yearly income of the user.
  • Credit debt: the amount of debt the user has in credit.
  • Account opening date: the date the user’s bank account was opened.
  • First transaction date: the date of the user’s first transaction.
  • Total transactions: the total number of transactions made by the user.
  • Email address: the user’s email used for communication.
  • Travel-related metrics: metrics such as the total duration of all flights, total loyalty points accumulated, number of trips, and membership level with the airline.
3. Transaction-related features:
  • Merchant name: the name of the merchant or airline.
  • Error code: any error codes generated during the transaction process.
  • Fraud label: a label indicating whether the transaction is fraudulent (used as the target variable).
  • Transaction date: the date the transaction was made.
  • Time on website: the amount of time the user spent on the website during the purchase.
  • Contact email: the email provided for contact purposes during the transaction.
  • Email seen count: the number of times the contact email has been seen previously.
  • Contact phone number: the phone number provided for contact purposes.
  • Payment service provider (PSP) issuer and acquirer: the PSPs involved in the transaction, including their countries.
  • SMS confirmation use: whether SMS confirmation was used for the transaction.
  • SMS resend count: the number of times the SMS confirmation was resent.
  • Time to complete SMS confirmation: the time taken to complete SMS confirmation.
  • Digital wallet use: whether a digital wallet was used in the transaction.
  • Number of items in purchase: the total number of items bought in the transaction.
  • Customer purchase history: metrics such as the number of purchases from the same merchant and the average purchase value.
  • First purchase date: the date of the user’s first purchase from the merchant.
  • IP address and VPN use: the IP address of the user and whether a VPN was used.
  • Flight details: including origin and destination cities, number of layovers, total layover time, and total travel duration.
  • Loyalty points used: the number of loyalty points used for the purchase.
  • Seat quality level: the quality level of the seats in the flight.

3.1.2. Feature Generation Process

The feature generation process involved the creation of a template that allows quick and easy modification of each parameter defining the values a field can take for each user group. Ten user groups were generated based on the country of origin: US, AU, IT, BR, PT, NL, FR, ES, DE, and GB. This template was designed to reflect the reality of users as closely as possible, considering the standard of living and characteristics of the selected countries.
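The sketch below illustrates this template idea; the field names, value ranges, and per-country fraud rates are assumptions made for illustration, not the actual template used in this work.

```python
# Hedged sketch: a per-country parameter template driving synthetic record generation.
import random

TEMPLATE = {
    "US": {"income_range": (30_000, 150_000), "fraud_rate": 0.03, "currency": "USD"},
    "ES": {"income_range": (18_000, 60_000), "fraud_rate": 0.02, "currency": "EUR"},
    "DE": {"income_range": (25_000, 90_000), "fraud_rate": 0.02, "currency": "EUR"},
    # ...analogous entries for AU, IT, BR, PT, NL, FR, and GB
}

def generate_user(country: str, rng: random.Random) -> dict:
    params = TEMPLATE[country]
    return {
        "country": country,
        "annual_income": rng.randint(*params["income_range"]),
        "currency": params["currency"],
        "is_fraud": rng.random() < params["fraud_rate"],
    }

rng = random.Random(42)
dataset = [generate_user(c, rng) for c in ("US", "ES", "DE") for _ in range(1_000)]
print(dataset[0])
```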

3.1.3. Iterative Feature Engineering

Feature engineering is an iterative process, and multiple sets of data were generated until a configuration of features, their values, and user group distribution that satisfies the project’s needs was achieved. The goal was to generate data that is as realistic as possible, reflecting the nuances and variations seen in real-world transactions.

3.1.4. Derived Features

In addition to the basic features, several derived features were created to capture more complex patterns and relationships within the data. Examples of derived features include the following:
  • Retirement age: calculated based on the user’s age and standard retirement age in their country.
  • Number of vowels in name: used as a simple textual feature.
  • Weekly purchase ratio: the average number of purchases per week.
  • Weeks since first purchase: the number of weeks since the user’s first recorded purchase.
  • Average purchase price: the average price of purchases made by the user.
  • Average number of items per purchase: the average number of items bought in each transaction.
  • First week purchase count: the number of purchases made during the user’s first week of transactions.
  • Denied transactions in first week: the number of transactions denied during the user’s first week.
  • Email vowel count: the total number of vowels in the user’s email address.
  • Email number count: the total number of digits in the user’s email address.
  • Use of disposable email domain: whether the user’s email domain is known to be disposable.
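Several of these derived features follow directly from the base fields; the sketch below (assuming pandas and illustrative column names for the synthetic schema) shows the idea.

```python
# Hedged sketch: deriving email- and purchase-based features with pandas.
import pandas as pd

df = pd.DataFrame({
    "email": ["ana.perez@mail.com", "x9z42@tempmail.io"],
    "total_purchases": [24, 3],
    "weeks_since_first_purchase": [52, 1],
    "total_spent": [3100.0, 950.0],
})

VOWELS = set("aeiou")
DISPOSABLE_DOMAINS = {"tempmail.io", "mailinator.com"}   # illustrative list

df["email_vowel_count"] = df["email"].str.lower().apply(lambda s: sum(ch in VOWELS for ch in s))
df["email_number_count"] = df["email"].str.count(r"\d")
df["disposable_email"] = df["email"].str.split("@").str[-1].isin(DISPOSABLE_DOMAINS)
df["weekly_purchase_ratio"] = df["total_purchases"] / df["weeks_since_first_purchase"]
df["average_purchase_price"] = df["total_spent"] / df["total_purchases"]
print(df.T)
```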

3.1.5. Final Feature Set

The final set of features was carefully selected and refined to ensure that it captures the most relevant and informative aspects of the transaction data. This comprehensive feature set serves as the foundation for training the predictive models, aiming to enhance their ability to detect and prevent fraudulent transactions effectively.

3.2. Model Selection and Implementation

The selection of appropriate machine learning algorithms is a critical step in developing an effective fraud detection system. For this research, support vector machines (SVMs) and the light gradient boosting machine (LightGBM) were chosen due to their proven effectiveness in classification problems and their ability to handle high-dimensional data.
Support vector machines (SVMs) were selected for this study due to several strengths that make them well-suited for fraud detection tasks. First, SVMs are known for their effectiveness in high-dimensional feature spaces—a common characteristic of fraud detection datasets—since they can handle many variables without significant degradation in performance. This property is essential when dealing with complex transactional and behavioral data. Another core advantage of SVMs lies in their underlying principle of margin maximization. By identifying the hyperplane that best separates the classes while maximizing the margin, SVMs tend to generalize better on unseen data, which is crucial in detecting fraudulent activities that do not follow repetitive patterns.
Additionally, SVMs offer high versatility using different kernel functions—such as linear, polynomial, radial basis function (RBF), and sigmoid—that allow the model to project input data into higher dimensional spaces where non-linear class separations become possible. This adaptability is particularly beneficial in modeling non-linear relationships present in real-world transactions. Furthermore, SVMs are grounded in a robust theoretical foundation derived from statistical learning theory, which provides strong generalization guarantees and helps establish the model’s reliability in critical applications. Finally, their proven track record in anomaly detection, including numerous applications in the field of fraud detection, supports their use in this study. SVMs are particularly effective at identifying rare and irregular events—such as fraudulent transactions—that deviate from normal behavioral patterns, reinforcing their relevance for the problem at hand.
On the other hand, the light gradient boosting machine (LightGBM) was selected for this study due to its efficiency, scalability, and strong performance in classification tasks involving large, complex datasets. One of LightGBM’s primary advantages is its computational efficiency—it is optimized for speed and memory usage, enabling the model to handle vast amounts of transactional data, which is essential in fraud detection where real-time performance is often required. The model also demonstrates high precision thanks to its use of advanced techniques such as leaf-wise tree growth and histogram-based algorithms. These methods help the model capture intricate patterns in the data and improve its ability to distinguish between fraudulent and legitimate transactions.
A notable strength of LightGBM is its native support for categorical features, allowing it to process heterogeneous datasets without the need for extensive preprocessing. This capability is particularly useful in fraud detection contexts, where input data often includes a mix of numerical and categorical variables. To mitigate overfitting—a common problem in fraud detection due to severe class imbalance—LightGBM integrates multiple regularization techniques, which enhance its generalization on unseen data. Additionally, the algorithm supports parallel and distributed learning, further improving its scalability and enabling deployment in production environments where high throughput and low latency are critical. LightGBM’s wide adoption across industries, along with its strong track record in both academic research and machine learning competitions, confirms its reliability and effectiveness as a choice for fraud detection tasks.
The complementary strengths of SVM and LightGBM make them an excellent combination for this research. While SVMs are effective in identifying the optimal hyperplane for classification, LightGBM excels in building complex models that capture intricate patterns in the data. By leveraging both models, the research aims to develop a robust and comprehensive fraud detection system that benefits from the strengths of each algorithm.
Recent studies have highlighted the potential of deep learning in fraud detection, particularly when applied to large-scale, sequential transactional data. Notable examples include the author of [16], who introduces a deep learning framework for modeling user behavior over time, and the author of [17], who applies a gated recurrent architecture to capture temporal and spatial dependencies within transaction sequences.
While these approaches show strong performance, especially in environments with vast amounts of labeled transactional data, we opted to use SVM and LightGBM in this study for several reasons. First, these models are well-established, interpretable, and relatively lightweight—making them suitable for deployment by small and medium-sized merchants who often lack the infrastructure to support complex deep learning systems. Second, our work is based on synthetic data designed to simulate user profiles and behaviors; this may not provide sufficient temporal depth or continuity for the effective application of sequence-based neural networks. Lastly, a major focus of this study is on the flexibility and reproducibility of fraud detection workflows for merchant-level implementation, where traditional machine learning models offer easier parameter tuning and operational integration.
Nonetheless, we acknowledge the value of deep learning in this domain and consider the exploration of hybrid or end-to-end neural models an important direction for future research—particularly when applied to real-world transactional datasets with rich temporal structure.
In summary, the choice of SVM and LightGBM for this research is driven by their proven effectiveness, scalability, and ability to handle high-dimensional data. These models provide a solid foundation for developing a reliable and efficient fraud detection system tailored to the specific needs of airline ticket purchases.

3.2.1. Support Vector Machines (SVMs)

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection. They are particularly known for their effectiveness in high-dimensional spaces and their ability to handle many features. This section provides a detailed explanation of the SVM methodology, its theoretical foundation, and its application in fraud detection [18,19].
Theoretical Foundation
  • Statistical learning theory: SVMs are based on the principles of statistical learning theory, which provides a framework for understanding the problem of acquiring knowledge, making predictions, and making decisions based on data. The core idea is to find a function that minimizes the expected error on new data by selecting a hypothesis space that accurately represents the underlying function in the target space [20] (see Figure 1).
  • Margin maximization: The main objective of SVMs is to find the hyperplane that maximizes the margin between different classes. The margin is defined as the distance between the hyperplane and the closest points from each class, known as support vectors [21]. By maximizing the margin, SVMs achieve better generalization on unseen data.
  • Mathematical formulation: In its simplest form, a linear SVM attempts to find the hyperplane defined by the equation:
    $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$
    where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the input vector, and $b$ is the bias term. The goal is to find $\mathbf{w}$ and $b$ that maximize the margin while satisfying the constraints:
    $y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad \forall i$
    where $y_i$ is the class label for the i-th training sample.
  • Dual problem and kernel trick: For non-linear problems, SVMs use a technique known as the kernel trick to map the input data into a higher dimensional space where a linear hyperplane can be used to separate the classes [22]. The kernel function $K(\mathbf{x}, \mathbf{x}')$ implicitly performs this mapping without the need to compute the coordinates in the high-dimensional space explicitly [23].
Common kernel functions include:
  • Polynomial kernel:
    $K(\mathbf{x}, \mathbf{x}') = (\langle \mathbf{x}, \mathbf{x}' \rangle + c)^d$
    where $c$ is a constant and $d$ is the degree of the polynomial.
  • Radial basis function (RBF) kernel:
    $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$
    where $\sigma$ is a parameter that controls the width of the Gaussian function.
  • Sigmoid kernel:
    $K(\mathbf{x}, \mathbf{x}') = \tanh(\kappa \langle \mathbf{x}, \mathbf{x}' \rangle + \theta)$
    where $\kappa$ and $\theta$ are parameters.
  • Soft margin and regularization: In real-world applications, data is often not perfectly separable. SVMs address this issue by introducing slack variables $\xi_i$ that allow some points to be within the margin or even misclassified. The objective then becomes:
    $\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$
    subject to the constraints:
    $y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \forall i$
    $\xi_i \geq 0, \quad \forall i$
    where $C$ is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error.
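In practice, this soft-margin, kernelized formulation corresponds to off-the-shelf implementations such as scikit-learn’s SVC; the minimal sketch below uses illustrative parameters and toy data, not the configuration used in this study.

```python
# Hedged sketch: an RBF-kernel SVC where C is the soft-margin regularization parameter.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0)

clf = make_pipeline(
    StandardScaler(),                         # SVMs are sensitive to feature scale
    SVC(kernel="rbf", C=1.0, gamma="scale",   # C trades margin width against slack penalties
        class_weight="balanced"),             # heavier penalty on the rare fraud class
)
clf.fit(X, y)
print("support vectors per class:", clf[-1].n_support_)
```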
Implementation and Application
  • Training the SVM model: The training process involves solving the quadratic optimization problem to find the optimal hyperplane. This is typically done using techniques such as sequential minimal optimization (SMO) or other gradient-based methods. Once trained, the SVM model can classify new samples by determining on which side of the hyperplane they fall [24].
  • Application in fraud detection: In the context of fraud detection, SVMs are used to distinguish between legitimate and fraudulent transactions [25]. The high-dimensional feature space and the ability to handle non-linear relationships make SVMs particularly suitable for this task. The model is trained on historical transaction data, where features related to user behavior, transaction details, and card information are used to build the classifier.
  • Advantages and challenges: SVMs offer several advantages, including robustness to overfitting, effectiveness in high-dimensional spaces, and flexibility with different kernel functions [26]. However, they also come with challenges, such as high computational cost for large datasets and sensitivity to the choice of hyperparameters and kernel.
  • Evaluation metrics: The performance of the SVM model is evaluated using metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC-ROC). These metrics provide insights into the model’s ability to correctly identify fraudulent transactions while minimizing false positives and false negatives [27].
Support vector machines provide a powerful and flexible approach to classification problems, particularly in the domain of fraud detection [28]. By leveraging the principles of margin maximization and the kernel trick, SVMs can effectively separate legitimate transactions from fraudulent ones, making them a valuable tool in the fight against payment fraud.

3.2.2. Light Gradient Boosting Machine (LightGBM)

The light gradient boosting machine (LightGBM) is a highly efficient gradient boosting framework that uses decision tree algorithms. It is designed to be distributed and efficient, making it capable of handling large-scale data with faster training speed and higher efficiency compared to other boosting algorithms [29]. This section provides an in-depth look at the LightGBM methodology, its key features, and its application in fraud detection.
Key Features of LightGBM
  • Leaf-wise tree growth: LightGBM grows trees leaf-wise rather than level-wise, which is different from traditional gradient boosting methods. In leaf-wise growth, the algorithm chooses the leaf with the maximum delta loss to split, leading to a more complex tree structure that results in less loss compared to level-wise growth [30].
  • Histogram-based algorithm: LightGBM uses a histogram-based algorithm to bucket continuous feature values into discrete bins, significantly reducing the computation cost and memory usage [31]. This approach speeds up the training process and makes the algorithm more efficient.
  • Support for categorical features: LightGBM natively supports categorical features and can handle them without needing extensive preprocessing [32]. This feature allows the model to leverage categorical data effectively, which is particularly useful in fraud detection where categorical features are common.
  • Efficient handling of large datasets: LightGBM is designed to be highly efficient with large datasets. It supports parallel and distributed learning, enabling it to scale and process massive amounts of data quickly [33].
  • Regularization techniques: LightGBM includes multiple regularization techniques to prevent overfitting. These techniques help the model generalize better to unseen data, which is crucial in fraud detection where the data is often imbalanced [34].
Mathematical Foundation
  • Gradient boosting framework: LightGBM is based on the gradient boosting framework, which builds models in a sequential manner. Each new model corrects the errors made by the previous models. The objective function of LightGBM can be formulated as:
    $L(y, F(x)) = \sum_{i=1}^{n} l\big(y_i, F_{t-1}(x_i) + f_t(x_i)\big) + \Omega(f_t)$
    where $l$ is the loss function, $F_{t-1}$ is the prediction from the previous iteration, $f_t$ is the function added at the t-th iteration, and $\Omega$ is the regularization term.
  • Leaf-wise growth strategy: The leaf-wise growth strategy aims to find the leaf with the maximum delta loss reduction and split it. This approach leads to deeper trees and better performance in terms of reducing loss:
    $\Delta L = L_{\text{left}} + L_{\text{right}} - L_{\text{parent}}$
    where $L_{\text{left}}$ and $L_{\text{right}}$ are the losses of the left and right leaves after the split, and $L_{\text{parent}}$ is the loss of the parent leaf before the split.
  • Histogram-based decision rules: LightGBM builds histograms for each feature and uses these histograms to find the optimal split points. This method reduces the computation cost and speeds up the training process. The histogram-based algorithm can be described as follows:
    • Bucket continuous values: Continuous feature values are bucketed into discrete bins.
    • Build histograms: Histograms are built for each feature based on the binned values.
    • Find optimal split: The optimal split point is determined by evaluating the histograms.
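The didactic sketch below walks through these three steps for a single feature; it is a deliberate simplification under assumed notation, not LightGBM’s actual implementation.

```python
# Hedged sketch: bin a feature, build a gradient histogram, scan bins for the best split.
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)
gradients = rng.normal(size=1000)            # stand-in for per-sample loss gradients

# 1. Bucket continuous values into discrete bins.
n_bins = 32
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
bins = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_bins - 1)

# 2. Build histograms of gradient sums and sample counts per bin.
grad_hist = np.bincount(bins, weights=gradients, minlength=n_bins)
count_hist = np.bincount(bins, minlength=n_bins)

# 3. Scan candidate splits, scoring each with a simplified squared-gradient gain.
total_grad, total_cnt = grad_hist.sum(), count_hist.sum()
best_gain, best_bin = -np.inf, None
left_grad = left_cnt = 0.0
for b in range(n_bins - 1):
    left_grad += grad_hist[b]
    left_cnt += count_hist[b]
    right_grad, right_cnt = total_grad - left_grad, total_cnt - left_cnt
    if left_cnt == 0 or right_cnt == 0:
        continue
    gain = left_grad**2 / left_cnt + right_grad**2 / right_cnt - total_grad**2 / total_cnt
    if gain > best_gain:
        best_gain, best_bin = gain, b
print("best split after bin", best_bin, "with gain", round(float(best_gain), 4))
```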
Implementation and Application
  • Training the LightGBM model: The training process of LightGBM involves constructing multiple decision trees sequentially. Each tree is trained to minimize the error of the previous trees using gradient descent. The hyperparameters, such as the number of leaves, learning rate, and regularization parameters, are tuned to optimize the model’s performance.
  • Application in fraud detection: In the context of fraud detection, LightGBM is used to classify transactions as either legitimate or fraudulent. The model is trained on historical transaction data, utilizing features related to user behavior, transaction details, and card information. LightGBM’s ability to handle large datasets and complex feature interactions makes it particularly suitable for this task.
  • Advantages and challenges: LightGBM offers several advantages, including faster training speed, higher efficiency, and better handling of large datasets compared to other gradient boosting algorithms. However, it also has challenges, such as sensitivity to hyperparameter tuning and potential overfitting if not properly regularized.
  • Evaluation metrics: The performance of the LightGBM model is evaluated using metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC-ROC). These metrics provide insights into the model’s ability to correctly identify fraudulent transactions while minimizing false positives and false negatives.
The light gradient boosting machine (LightGBM) is a powerful and efficient gradient boosting framework that excels in handling large-scale data and complex feature interactions. Its unique features, such as leaf-wise growth and histogram-based algorithms, make it a suitable choice for developing a robust fraud detection system. By leveraging LightGBM, this research aims to improve the accuracy and efficiency of detecting fraudulent transactions in the context of airline ticket purchases.
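As a concrete illustration of these points, the sketch below trains an LGBMClassifier with leaf-count, learning-rate, and regularization parameters, relying on pandas categorical dtypes for native categorical handling; the column names and parameter values are assumptions, not the configuration used in this study.

```python
# Hedged sketch: LightGBM training with native categorical features.
import pandas as pd
from lightgbm import LGBMClassifier

df = pd.DataFrame({
    "amount": [120.0, 980.0, 45.0, 2300.0] * 50,
    "card_type": pd.Categorical(["credit", "debit", "credit", "credit"] * 50),
    "country": pd.Categorical(["ES", "US", "FR", "BR"] * 50),
    "is_fraud": [0, 0, 0, 1] * 50,
})
X, y = df[["amount", "card_type", "country"]], df["is_fraud"]

model = LGBMClassifier(
    num_leaves=31,        # caps leaf-wise tree complexity
    learning_rate=0.05,
    n_estimators=200,
    reg_lambda=1.0,       # L2 regularization to curb overfitting
)
model.fit(X, y)           # 'category' dtype columns are treated as categorical by default
print(model.predict_proba(X)[:3, 1])
```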

4. Discussion

The purpose of this chapter is to delve into the analysis and interpretation of the results obtained from the research. By examining the performance of the predictive models developed, we aim to understand their strengths and weaknesses, and how well they address the problem of detecting fraudulent payment transactions in the context of airline ticket purchases. The main objective of this discussion is to critically analyze the outcomes of the models, compare their performance, and identify any limitations or areas for improvement. This analysis will help in understanding the practical implications of the models and their potential impact on real-world fraud detection systems.
The discussion is structured into three main sections:
  • Comparison of models: This section provides a detailed comparison of the two machine learning models used in the research, support vector machines (SVMs) and the light gradient boosting machine (LightGBM). The comparison focuses on various performance metrics, including accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC-ROC). Additionally, we will discuss the computational efficiency and scalability of each model.
  • Limitations and modifications: In this section, we address the limitations encountered during the research. These limitations may pertain to data quality, model performance, or implementation challenges. We will also discuss any modifications made to the models or methodologies to overcome these limitations and enhance their effectiveness.
  • Relevance to real-world applications: Understanding the performance and limitations of the models is crucial for their application in real-world fraud detection systems. By analyzing the results and identifying areas for improvement, we can provide recommendations for future research and practical implementations. This discussion aims to bridge the gap between theoretical research and practical applications, ensuring that the models developed are both robust and effective in detecting fraudulent transactions.

4.1. Comparison of Models

In the following section, the results obtained in the development of the project are described. The results of both models are compared, the limitations of the project and how they affect the performance of the models are discussed, and the adaptation of the project as development progressed is explained.
At the beginning of the research, the initial expectation was that the tree-based model would obtain better results, given that the complexity of the problem and the number of input features could make it difficult for a linear support vector classifier (SVC) to separate fraud and non-fraud events into two groups.

4.1.1. Support Vector Machines

Initially, it was intended to use an SVC model directly with an RBF (radial basis function) kernel. After reviewing the different types of kernels that could be applied to an SVM, this type of kernel was considered the best. With the polynomial model, it would be difficult to find the degree of the polynomial that best distributes the data into two groups, and the sigmoid kernel model also did not seem easy to calibrate for data segmentation. The linear model, at first, also did not seem like an option given that the feature space was too complex to be distributed as in the example case from the Scikit-Learn documentation (see Figure 2).
At the time of starting the training of the SVC model (see Figure 3), a rather serious problem was encountered. As can be seen in the Scikit-Learn documentation, the SVC model, when trained with datasets containing more than tens of thousands of records, increases its training time quadratically and becomes practically impossible to train. Therefore, when starting the training of the model, it was decided to opt for a version that could be trained within a reasonable time.
This setback modified the initial approach of comparing an RBF-kernel SVC model against a tree-based model like LightGBM. The SVC model was replaced by a linear SVC model, with the understanding from the outset that its results would not match those of the originally intended model; this was a compromise necessary to continue with the project.
The training process of the SVC model was as follows:
  • Training phase: The linear model was trained with the dataset obtained through the template. A hyperparameter search was conducted to find the best configuration for the model to suit the use case.
  • Generation of a new set of features: Given the results of the training process, a new dataset was generated to improve its quality.
These two steps were repeated iteratively several times in an attempt to improve the model’s results, which nevertheless left much to be desired. A minimal sketch of this search loop is shown below.
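The following sketch assumes scikit-learn; the grid values, scoring choice, and toy data are illustrative assumptions, not the actual search configuration used.

```python
# Hedged sketch: hyperparameter search over a linear SVC, re-run per regenerated dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10_000, n_features=30, weights=[0.97, 0.03], random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    LinearSVC(class_weight="balanced", dual=False, max_iter=5000),
)
grid = GridSearchCV(
    pipe,
    param_grid={"linearsvc__C": [0.01, 0.1, 1, 10]},
    scoring="recall",      # prioritize catching the fraud class
    cv=3,
)
grid.fit(X, y)
print("best params:", grid.best_params_, "| best recall:", round(grid.best_score_, 3))
```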
Unfortunately, despite modifying the dataset through changes in probability values in the template and modifying the hyperparameter search to focus on finding a set of values that could improve recall and precision results, no configuration was found that achieved satisfactory results.
The executions remained within the same range of values for precision and recall results in both classes, fraud and non-fraud. Below are several examples of the results obtained from training in different iterations (see Table 1):
In addition to reviewing the precision and recall values, we wanted to check the values obtained for each model by measuring the area under the Receiver Operating Characteristic (ROC) curve. The best result obtained in this metric is as follows (see Figure 4):

4.1.2. Light Gradient Boosting Machine

The choice of LightGBM over other tree-based models stems from prior professional experience: it had already been applied to other problems in the payments industry, such as predicting customer abandonment on the payment page or predicting a customer’s preferred payment method. Since no major difficulties were encountered in those training processes and the results were quite good, it was proposed as a candidate for this project.
In this case, no significant difficulties that could affect the initial approach were encountered during the training process. The model was correctly trained by performing an extensive hyperparameter search, without issues in defining all the values to be tested and without excessive training times. The process was slower than training the SVC model, but this helped in better selecting the values to be tested in the parameter search.
The results obtained were an unpleasant surprise due to the low precision and recall values for the fraud class. The non-fraudulent class was correctly identified by the model with good accuracy, but the fraudulent class scored 0 in every metric used to evaluate the model.
After conducting the first round of training and seeing that the results were not as expected, it was observed that the problem might be due to the training sample not being properly balanced, which could hinder the model’s learning. To solve this problem, different weights were assigned to each class, giving a weight of 1 to the non-fraudulent cases and higher values to the fraudulent cases.
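The sketch below illustrates this re-weighting; the weight value is an assumption made for illustration, not the one used in the experiments.

```python
# Hedged sketch: cost-sensitive LightGBM with a heavier weight on the fraud class.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20_000, n_features=25, weights=[0.98, 0.02], random_state=0)

model = LGBMClassifier(
    class_weight={0: 1, 1: 25},   # non-fraud keeps weight 1; fraud errors cost more
    num_leaves=31,
    learning_rate=0.05,
)
model.fit(X, y)
print("share of transactions flagged as fraud:", model.predict(X).mean())
```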
Although the results improved, the expected performance was still not achieved with this model. It had serious problems identifying fraudulent cases and could not obtain reasonably good results in detecting fraudulent transactions. The results for the non-fraudulent class remained good, but a model that cannot achieve satisfactory results for both classes cannot be implemented in any system.
Most iterations generated results similar to Iteration X, with the best model being the one defined in Iteration Y (see Table 2, Figure 5 and Figure 6).
This model configuration offers better results in identifying and classifying fraudulent transactions, but the cost of obtaining these results is a decrease in the precision and recall percentages for non-fraudulent records. Although the model may worsen in some aspects, it is considered that this model achieves the best results given the previously shown outcomes, where the model was barely able to identify fraudulent transactions.

4.2. Limitations and Modifications

While the research has shown promising results in developing a predictive model for detecting fraudulent payment transactions, several limitations were encountered during the study. Addressing these limitations and considering potential modifications can help improve the model’s effectiveness and applicability in real-world settings.
  • Limitations:
  • Data quality and availability: The quality and availability of data significantly impact the performance of machine learning models. In this research, obtaining a coherent dataset with sufficient records for training posed a challenge. The use of synthetic data, although necessary, may not fully capture the complexities and variations of real-world transaction data.
    Impact: The reliance on synthetic data could lead to models that perform well in controlled environments but may struggle with the unpredictability and diversity of real-world data.
  • Imbalanced data: Fraud detection datasets are typically highly imbalanced, with a small proportion of fraudulent transactions compared to legitimate ones. This imbalance can affect the model’s ability to learn effectively and may result in a higher rate of false positives or false negatives.
    Impact: The imbalance in the dataset may lead to biased models that are less sensitive to detecting fraud, potentially missing fraudulent transactions or misclassifying legitimate ones.
  • Model complexity and interpretability: While complex models like LightGBM provide high accuracy and performance, they can be challenging to interpret. Understanding how the model makes decisions is crucial for gaining trust and ensuring compliance with regulatory requirements.
    Impact: The lack of interpretability can hinder the adoption of the model by stakeholders who require clear explanations of the decision-making process.
  • Computational resources: Training and deploying machine learning models, especially for large datasets, require significant computational resources. This can be a constraint for small and medium-sized enterprises (SMEs) with limited access to high-performance computing infrastructure.
    Impact: Limited computational resources may restrict the ability of SMEs to fully leverage the benefits of advanced machine learning models for fraud detection.
  • Overfitting: Overfitting occurs when a model learns the noise in the training data rather than the underlying patterns. This can lead to poor generalization to new, unseen data.
    Impact: Overfitting reduces the model’s effectiveness in real-world applications, where it needs to generalize well across diverse and unseen transaction data.
  • Modifications:
  • Enhancing data quality: To address the limitations related to data quality and availability, future research could focus on obtaining more diverse and representative datasets. Collaborations with financial institutions and merchants can provide access to real-world transaction data, enhancing the robustness of the model.
    Modification: Incorporate real-world transaction data and conduct extensive data cleaning and preprocessing to improve data quality.
  • Addressing data imbalance: Techniques such as oversampling, undersampling, and synthetic data generation (e.g., SMOTE) can be employed to address data imbalance. Additionally, cost-sensitive learning approaches can be used to assign higher penalties to misclassified fraudulent transactions.
    Modification: Implement advanced techniques to balance the dataset and improve the model’s sensitivity to fraud detection.
  • Improving model interpretability: To enhance interpretability, techniques such as SHAP (SHapley Additive exPlanations) values and LIME (local interpretable model-agnostic explanations) can be used. These methods provide insights into how the model makes decisions, making it easier to understand and trust the model’s predictions.
    Modification: Integrate interpretability techniques into the model to provide clear explanations for decision-making processes.
  • Optimizing computational efficiency: Efforts can be made to optimize the model’s computational efficiency by exploring techniques such as model pruning, quantization, and hardware acceleration. This can reduce the computational burden and make the model more accessible to SMEs.
    Modification: Implement optimization techniques to reduce computational requirements and improve the model’s scalability.
  • Regularization and cross-validation: Regularization techniques such as L1 and L2 regularization can help mitigate overfitting. Additionally, cross-validation methods can be employed to ensure the model generalizes well across different subsets of the data.
    Modification: Apply regularization techniques and robust cross-validation strategies to enhance the model’s generalization capabilities.
By acknowledging and addressing these limitations, future research can build on the current findings to develop more robust, interpretable, and scalable fraud detection models. Continuous improvements and adaptations to the models and methodologies will enhance their applicability in real-world scenarios, ultimately contributing to more effective fraud prevention in the payment industry.
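For the interpretability modification in particular, a minimal sketch of computing SHAP values for a tree model is shown below; it assumes the shap package is installed, and the model and data are illustrative.

```python
# Hedged sketch: per-transaction SHAP attributions for a LightGBM model.
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)
model = LGBMClassifier(num_leaves=31).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])   # per-feature contribution to each prediction
# Output layout varies across shap versions (single array or per-class list of arrays).
print(type(shap_values))
```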

5. Conclusions

The research conducted aimed to develop a predictive model for detecting fraudulent payment transactions, particularly in the context of airline ticket purchases. Through the detailed exploration and implementation of machine learning techniques, significant insights and advancements were achieved. This chapter summarizes the key findings, implications, and recommendations for future research and practical applications.
  • Key findings:
  • Effectiveness of machine learning models: The research demonstrated that machine learning models, specifically support vector machines (SVMs) and the light gradient boosting machine (LightGBM), can identify fraudulent transactions. Both models achieved high overall accuracy and AUC-ROC, although detecting the minority fraud class remained the harder task; LightGBM reached 94.2% accuracy with a recall of 71.3% for fraudulent cases.
  • Superiority of LightGBM: Among the models evaluated, LightGBM outperformed SVM in several aspects, including accuracy, computational efficiency, and scalability. LightGBM’s ability to handle large datasets and complex feature interactions makes it a more practical choice for real-time fraud detection systems.
  • Importance of feature engineering: The success of the predictive models heavily relied on the comprehensive feature engineering process. By carefully selecting and transforming features, the models were able to capture significant patterns and anomalies related to fraudulent transactions.
  • Handling data imbalance: The research highlighted the challenges associated with imbalanced datasets in fraud detection. Techniques such as data balancing and cost-sensitive learning were essential in improving the model’s sensitivity to fraudulent transactions.
  • Model interpretability: While complex models like LightGBM offer high performance, ensuring their interpretability remains crucial. Techniques such as SHAP values and LIME can provide transparency into the model’s decision-making process, fostering trust and compliance with regulatory standards.
  • Implications:
  • Practical application: The findings from this research can be directly applied to develop and implement robust fraud detection systems for the airline industry and other sectors prone to payment fraud. The models and methodologies presented can enhance the security and reliability of online transactions.
  • Enhanced fraud prevention: By adopting advanced machine learning techniques, financial institutions and merchants can significantly reduce the incidence of fraudulent transactions. This not only protects revenue but also enhances customer trust and satisfaction.
  • Future research directions: The research opens several avenues for future studies, including exploring more sophisticated models, integrating real-world transaction data, and improving the scalability and interpretability of fraud detection systems.
  • Recommendations:
  • Collaboration with industry: Future research should seek collaboration with financial institutions and merchants to access real-world transaction data. This will enhance the robustness and applicability of the models developed.
  • Continuous improvement: Fraud detection systems must continuously evolve to keep pace with the ever-changing tactics of fraudsters. Regular updates and improvements to the models and feature sets are essential for maintaining their effectiveness.
  • Focus on interpretability: Ensuring the interpretability of machine learning models is critical for gaining stakeholder trust and meeting regulatory requirements. Future research should prioritize developing models that are both highly accurate and transparent.
  • Scalability and efficiency: Efforts should be made to optimize the computational efficiency and scalability of fraud detection systems. This includes exploring techniques like model pruning, hardware acceleration, and efficient data processing methods.
The research successfully developed and evaluated machine learning models for detecting fraudulent payment transactions, with LightGBM emerging as a particularly effective solution. By addressing the identified limitations and incorporating the recommendations, future work can build on these findings to create more robust, scalable, and interpretable fraud detection systems. These advancements will contribute to enhanced security and trust in online payment systems, ultimately benefiting both consumers and businesses.

Author Contributions

Conceptualization, C.G.A.; methodology, R.D.-A.J.; software, F.P.M.; validation, M.Z.S.; formal analysis, C.G.A.; investigation, C.G.A.; resources, R.D.-A.J.; data curation, F.P.M.; writing—original draft preparation, C.G.A.; writing—review and editing, M.Z.S.; visualization, F.P.M.; supervision, R.D.-A.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article. Any further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mastercard. Blog. Ecommerce Fraud Trends and Statistics Merchants Need to Know in 2024. Available online: https://b2b.mastercard.com/news-and-insights/blog/ecommerce-fraud-trends-and-statistics-merchants-need-to-know-in-2024/ (accessed on 15 July 2024).
  2. Vogels, E.A.; Rainie, L.; Anderson, J. Tech Is (Just) a Tool; Pew Research Center: Washington, DC, USA, 2020. [Google Scholar]
  3. McKinsey. The Future of the Payments Industry. Available online: https://www.mckinsey.com/capabilities/risk-and-resilience/our-insights/the-future-of-the-payments-industry-how-managing-risk-can-drive-growth (accessed on 15 July 2024).
  4. EY. How the Rise of Paytech is Reshaping the Payments Landscape. Available online: https://www.ey.com/en_gl/insights/payments/how-the-rise-of-paytech-is-reshaping-the-payments-landscape (accessed on 17 July 2024).
  5. ECB. European Central Bank Statistics. Available online: https://www.ecb.europa.eu/press/stats/paysec/html/ecb.pis2023~b28d791ed8.en.html (accessed on 1 July 2024).
  6. Swatch. SwatchPay. Available online: https://www.swatch.com/en-en/swatch-pay/how-it-works.html (accessed on 27 June 2024).
  7. ECB. Digital Euro. Available online: https://www.ecb.europa.eu/euro/digital_euro/progress/html/index.en.html (accessed on 27 June 2024).
  8. ECB. Instant Payments in Europe. Available online: https://www.europarl.europa.eu/news/en/press-room/20231031IPR08706/agreement-reached-on-more-accessible-instant-payments-in-euros (accessed on 17 July 2024).
  9. European Commission. PSD2. Available online: https://eur-lex.europa.eu/legal-content/ES/TXT/PDF/?uri=CELEX:32020R2011&from=DE (accessed on 5 July 2024).
  10. European Commission. PSD3. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52023PC0367&qid=1716633078125 (accessed on 27 June 2024).
  11. European Commission. Second Revision of Payment Services in the EU. Available online: https://www.europarl.europa.eu/RegData/etudes/BRIE/2024/753199/EPRS_BRI(2024)753199_EN.pdf (accessed on 5 July 2024).
  12. Dionach. Payment Processing Vulnerabilities. Available online: https://www.dionach.com/payment-processing-vulnerabilities/#:~:text=An%20example%20of%20this%20is,and%20tamper%20with%20the%20price (accessed on 15 July 2024).
  13. Stripe. Emisores y Redes de Tarjetas—Stripe. Available online: https://stripe.com/es/resources/more/issuing-banks#:~:text=Los%20emisores%20son%20entidades%20financieras,necesario%20para%20cubrir%20el%20pago (accessed on 1 July 2024).
  14. Cybersource. Fraud Report 2023. Available online: https://www.cybersource.com/content/dam/documents/campaign/fraud-report/global-fraud-report-2023-en.pdf (accessed on 8 July 2024).
  15. Mastercard. Mastercard Targets Friendly Fraud. Available online: https://www.mastercard.com/news/press/2023/october/mastercard-targets-friendly-fraud-to-protect-small-businesses-and-merchants/ (accessed on 5 July 2024).
  16. Xie, Y.; Liu, G.; Yan, G.; Jiang, C.; Zhou, M.; Li, M. Learning Transactional Behavioral Representations for Credit Card Fraud Detection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5735–5748. [Google Scholar] [CrossRef]
  17. Xie, Y.; Liu, G.; Zhou, M.; Wei, L.; Zhu, H.; Zhou, R. A Spatial–Temporal Gated Network for Credit Card Fraud Detection by Learning Transactional Representations. IEEE Trans. Autom. Sci. Eng. 2024, 21, 6978–6991. [Google Scholar] [CrossRef]
  18. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995; ISBN 0-387-94559-8. [Google Scholar]
  19. Bousquet, O.; Boucheron, S.; Lugosi, G. Introduction to Statistical Learning Theory. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
  20. Evgeniou, T.; Pontil, M. Statistical Learning Theory: A Primer. Int. J. Comput. Vis. 2000, 38, 9–13. [Google Scholar] [CrossRef]
  21. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
  22. Mitchell, T. Machine Learning. In Computer Science Series; McGraw-Hill: Columbus, OH, USA, 1997. [Google Scholar]
  23. Zxr.nju. What Is the Kernel Trick? Why Is it Important? Medium. 2023. Available online: https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d (accessed on 28 June 2024).
  24. Lewis, J. Tutorial on SVM; CGIT Lab, University of Southern California: Los Angeles, CA, USA, 2004. [Google Scholar]
  25. Burges, C. A tutorial on support vector machines for pattern recognition. In Data Mining and Knowledge Discovery; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1998. [Google Scholar]
  26. Osuna, E.; Freund, R.; Girosi, F. Support Vector Machines: Training and Applications; Artificial Intelligence Laboratory MIT: Cambridge, MA, USA, 1997. [Google Scholar]
  27. Veropoulos, K.; Cristianini, N.; Campbell, C. The Application of Support Vector Machines to Medical Decision Support: A Case Study. Adv. Course Artif. Intell. 1999, 1–6. [Google Scholar]
  28. Stitson, M.O.; Weston, J.A. Implementational Issues of Support Vector Machines; Technical Report CSD-TR-96-18; Computational Intelligence Group; Royal Holloway; University of London: London, UK, 1996. [Google Scholar]
  29. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 52. [Google Scholar]
  30. Nakamura, S.; Jiwoong, W.; Huiqi, D.; Iwase, M. Light-GBM based signal correction method for surface myoelectropotential measured by multi-channel band-type EMG sensor. IFAC-PapersOnLine 2023, 56, 3558–3565. [Google Scholar] [CrossRef]
  31. Barrios Arce, J.I. Light GBM vs XGBoost: ¿Cuál es Mejor Algoritmo? Health Big Data. 2022. Available online: https://www.juanbarrios.com/light-gbm-vs-xgboost-cual-es-mejor-algoritmo/ (accessed on 29 June 2024).
  32. Hanif, M.F.; Naveed, M.S.; Metwaly, M.; Si, J.; Liu, X.; Mi, J. Advancing solar energy forecasting with modified ANN and light GBM learning algorithms. AIMS Energy 2024, 12, 350–386. [Google Scholar] [CrossRef]
  33. Amin, M.; Salami, B.; Zahid, M.; Iqbal, M.; Khan, K.; Abu-Arab, A.; Alabdullah, A.; Jalal, F. Investigating the Bond Strength of FRP Laminates with Concrete Using LIGHT GBM and SHAPASH Analysis. Polymers 2022, 14, 4717. [Google Scholar] [CrossRef] [PubMed]
  34. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T. LightGBM: A Highly Efficient Gradient Boosting. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Figure 1. Linear and non-linear SVM. Red represents class A samples; green represents class B samples.
Figure 2. Illustration of the functioning of a linear support vector machine (SVC), Scikit-Learn. Classes are differentiated by color: class 0 samples are labelled purple, class 1 samples are labelled yellow.
Figure 3. Distribution of fraud (blue) and non-fraud (red) cases according to the training features.
Figure 4. Linear SVC results, area under the ROC curve (AUC), Iteration X.
Figure 5. LightGBM results, AUC, Iteration X.
Figure 6. LightGBM results, AUC, Iteration Y.
Table 1. SVM results.

                 Iteration X                       Iteration Y
                 Class 0 (non-fraud)  Class 1 (fraud)   Class 0 (non-fraud)  Class 1 (fraud)
Precision        0.78                 0.10              0.92                 0.13
Recall           0.50                 0.55              0.54                 0.60
F1 score         0.68                 0.22              0.68                 0.22
Sample size      536,026              63,974            536,026              63,974
Table 2. LightGBM results.

                 Iteration X                       Iteration Y
                 Class 0 (non-fraud)  Class 1 (fraud)   Class 0 (non-fraud)  Class 1 (fraud)
Precision        0.89                 0.20              0.94                 0.14
Recall           1.00                 0.00              0.42                 0.77
F1 score         0.94                 0.00              0.58                 0.23
Sample size      536,026              63,974            536,026              63,974
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
