4.3.1. Experimental Validation for Chinese SMEs
The data processing workflow began with FRDFS, an algorithm that identified and removed 86 redundant attributes. After streamlining the dataset, the DBSCAN clustering algorithm was applied, which revealed the structural distribution of the financial entities: 9537 core objects, 3650 boundary objects, and 3600 outlier objects. The identification of core objects is particularly important because these objects support model decisions and reflect typical financial behavior. In contrast, outlier objects are usually equated with noise, which may lead to overfitting, and were therefore removed as noise points.
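The core/border/noise partition described above can be sketched with scikit-learn's DBSCAN; the synthetic data and the eps/min_samples values below are illustrative placeholders, not the study's actual settings:

```python
# Illustrative sketch of how DBSCAN separates core, border, and noise points.
# The dataset and parameter values are placeholders, not the paper's settings.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the streamlined SME feature matrix
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask  # clustered, but not dense enough to be core

print(f"core: {core_mask.sum()}, border: {border_mask.sum()}, noise: {noise_mask.sum()}")
```

Downstream models would then be trained on `X[~noise_mask]`, i.e., with the outlier objects dropped, mirroring the noise-removal step described above.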
To visually substantiate the effectiveness of FRDFS, we utilized t-SNE for dimensionality reduction, projecting the multi-dimensional data onto a two-dimensional plane. The resulting visualization is shown in Figure 3 below:
The visualization effectively demonstrates the clustering tendencies within the dataset post t-SNE reduction, as discerned by the DBSCAN algorithm. Cores are grouped densely, denoting well-defined clusters with sufficient local density to comply with the specified eps and min_samples parameters. The border points form peripheries around these core regions, suggesting a gradation in point density, indicative of the algorithm’s sensitivity in identifying cluster edges. Outliers appear scattered and isolated, reinforcing the robustness of DBSCAN in distinguishing between in-cluster data and noise.
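The projection step behind this visualization can be sketched as follows; the digits dataset stands in for the high-dimensional SME features, and the perplexity value is an assumption rather than the study's setting:

```python
# Illustrative sketch of the t-SNE projection step used for the 2-D view.
# load_digits is a stand-in for the high-dimensional financial features.
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

X = load_digits().data[:300]  # subsample to keep the sketch fast

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(emb.shape)  # each row is a 2-D coordinate ready for scatter plotting
```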
The distribution and separation of clusters provide insight into the dataset’s intrinsic structure and the clustering algorithm’s performance. The clusters vary in size and density, with some clusters being tightly connected, implying a high degree of similarity within these clusters, while others are more dispersed, suggesting variability within the clusters. The presence of outliers reflects the natural disorder within the dataset, emphasizing the need for noise filtering for accurate data modeling. The top 20 features by attribute importance, as screened by FRDFS, are shown in the table below:
Table 3 highlights the top 20 attributes crucial for financial analysis of Chinese SMEs, selected through the 3-LRP framework. Attributes such as “Number of Corporate Enforcements” and “Number of Second-Level Corporate Cancellations” offer insights into the companies’ compliance and stability. The “Tencent Lingkun Regulatory Index” and enforcement-related metrics such as “Number of Shareholder Enforcements” emphasize the framework’s depth in capturing regulatory and governance risks. In particular, the inclusion of secondary and third level penetration indicators highlights the efficacy of the 3-LRP in profiling the financial behavior of SMEs and highlights its role in enhancing fraud detection models.
Regarding whether the accuracy of our model is general or specific to the dataset, we acknowledge that model performance is often influenced by the characteristics of the dataset used. The dataset in this study consists of records from Chinese SMEs, and the reported accuracy reflects this specific dataset. Variations in data characteristics, such as feature distributions, data quality, and noise levels, can affect model performance. For the task of enterprise risk identification on data with high noise and high dimensionality, FRDFS is proposed as a feature-engineering step that selects core features and deletes noise samples. To verify the effectiveness of FRDFS, prediction models with and without FRDFS feature engineering are compared. The specific experimental results are shown in the following table.
In Table 4, the impact of FRDFS on classification performance is evident. All models exhibit an increase in Recall 1 upon incorporating FRDFS, demonstrating its effectiveness in boosting the detection of high-risk companies (labeled as ‘1’). LR, for instance, sees an uplift in Recall 1 from 0.66 to 0.71, and a similar trend is observed with KNN and SVM, where Recall 1 increases from 0.68 to 0.82 and 0.61 to 0.82, respectively. This consistent improvement across models underscores the robustness of FRDFS in enhancing model performance, particularly in detecting high-risk companies, which is a critical aspect of financial fraud detection.
Advanced models, such as TabPFN and LGBM, also show significant improvements when augmented with FRDFS, with TabPFN + FRDFS achieving a Recall 1 of 0.93 and LGBM + FRDFS reaching 0.85. These results highlight the capability of FRDFS in leveraging the strengths of advanced models, further enhancing its ability to identify positive cases more accurately. The FDSV model, in particular, stands out with the highest Accuracy of 0.96 and a strong Recall 1 of 0.93, suggesting that FDSV is highly effective in discerning fraudulent behavior. This is central to the objective of our experiment, indicating that the integration of FRDFS substantially boosts the model’s fraud detection capabilities.
The consistent increase in performance metrics, such as Recall 1 and Accuracy, across various models when incorporating FRDFS points to its robustness and adaptability. FRDFS not only improves the detection of high-risk companies but also enhances the overall classification performance, making it a valuable component in financial fraud detection systems. The ability to maintain high performance across different models and datasets further attests to the reliability and effectiveness of our proposed methodology.
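The with/without comparison described above follows a standard pattern, sketched below with a generic univariate filter standing in for FRDFS; the data, selector, and classifier are illustrative assumptions, not the paper's components:

```python
# Hedged sketch of the with/without-feature-selection comparison.
# SelectKBest is a generic stand-in for FRDFS; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, n_features=60, n_informative=10,
                           n_redundant=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Baseline: all 60 features, including the redundant ones
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
r_base = recall_score(y_te, base.predict(X_te))

# With selection: keep the 10 strongest features (FRDFS analogue)
sel = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)
slim = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr)
r_sel = recall_score(y_te, slim.predict(sel.transform(X_te)))

print(f"Recall 1 without selection: {r_base:.2f}, with selection: {r_sel:.2f}")
```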
To further assess the impact of feature selection on model performance, we conducted a sensitivity analysis focusing on the number of features used in the models, as shown in Figure 4.
Figure 4 illustrates the effect of varying the number of features on the Accuracy, Recall, and F1-Score of the models.
As shown in Figure 4, model performance improves as the number of features increases, reaching an optimal point at around 30 features. Beyond this point, additional features result in a gradual decline in performance. This suggests that while more features can provide more information, there is a threshold beyond which the inclusion of additional features introduces noise and redundancy, negatively impacting model performance.
The sensitivity analysis demonstrates the importance of selecting an optimal number of features to achieve the best model performance. It highlights the robustness of our feature selection methodology, which effectively balances the inclusion of informative features with the exclusion of irrelevant ones. The results further validate the effectiveness of FRDFS in enhancing model performance by ensuring that the most relevant features are retained.
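A sensitivity sweep of this kind can be sketched as below; the data are synthetic and the scoring setup is a simplification (the selector is fit on the full data, which leaks slightly and would be folded into the CV loop in a careful study), so the peak will not match the paper's ~30 features:

```python
# Minimal sketch of a feature-count sensitivity sweep on synthetic data.
# Note: fitting SelectKBest on all data before CV leaks slightly; a
# rigorous version would put selection inside a Pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=50, n_informative=15,
                           n_redundant=20, random_state=1)

scores = {}
for k in (5, 15, 30, 50):
    Xk = SelectKBest(f_classif, k=k).fit_transform(X, y)
    scores[k] = cross_val_score(LogisticRegression(max_iter=1000), Xk, y,
                                cv=5, scoring="accuracy").mean()

for k, acc in scores.items():
    print(f"{k:2d} features -> accuracy {acc:.3f}")
```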
Subsequently, to better compare the effects of the different models, the ROC curves after feature selection with FRDFS are shown for all models in Figure 5 below:
As shown in Figure 5, the ROC curves quantitatively describe the classification performance of the different models in distinguishing legitimate and suspicious activities of Chinese SMEs. The AUC of the FDSV model is 0.97, which indicates that the model has a strong capability to distinguish between fraudulent and genuine cases. The AUC of the traditional classifier LR is 0.77, and the AUC of SVM is 0.85, which highlights the stronger differentiation ability of the FDSV model.
The ROC curves in Figure 5 further illustrate the superior performance of our proposed methodology. The FDSV model achieves an AUC of 0.97, significantly higher than the traditional classifiers LR and SVM, which have AUCs of 0.77 and 0.85, respectively. This high AUC value indicates that FDSV has a better capability of distinguishing between fraudulent and genuine cases, which is crucial for reliable fraud detection. The improvement in AUC values across all models when incorporating FRDFS also underscores its effectiveness in enhancing model performance. The robustness and reliability of our proposed methodology are evident from its ability to consistently deliver high performance in fraud detection tasks, making it a highly effective tool for financial fraud detection systems.
4.3.2. Experimental Validation of 3-LRP
Various attributes of financial transactions, such as the number of previous suspicious activities linked to an account or the frequency of high-value transactions, play a critical role in assessing potential fraud risks. By integrating these attributes, the FDSV model efficiently predicts financial fraud. Pursuing this methodology, a comparative experiment was carried out using the 3-LRP model. The results of this experiment are delineated in the table below.
Table 5, Table 6 and Table 7 illustrate the performance differences between single-layer and double-layer penetration methods in assessing financial risks. In the single-layer penetration, a foundational assessment approach is utilized, focusing on a singular risk dimension. Despite its simplicity, this method demonstrates considerable effectiveness, with all key metrics, i.e., accuracy, precision, recall, and F1-score, consistently registering at 0.79. This outcome underscores the value of even basic assessment methods in providing meaningful insights into financial risks.
In contrast, the double-layer penetration method broadens the risk assessment scope by incorporating additional risk dimensions. This expanded approach leads to a notable improvement in accuracy, which rises to 0.87. The precision, recall, and F1-score of the double-layer method, measured at 0.83, 0.75, and 0.86 respectively, also show marked enhancements. These figures indicate a superior ability of the double-layer penetration method in detecting and evaluating risks, compared to its single-layer counterpart. Such an increase in performance metrics highlights the benefits of a more nuanced and comprehensive risk assessment approach in financial fraud detection. The model’s ability to capture risk is further enhanced when the penetration relationship is extended to the third layer. It is worth noting that the Recall 1 of FDSV improves substantially to 0.95, indicating that 3-LRP is able to model the risks that firms may face in a more complete manner.
The results demonstrate the effectiveness of our proposed method. It is important to note that the accuracy reported in this study is specific to the dataset used, which consists of records from Chinese SMEs. The performance of the model may vary when applied to different datasets due to variations in data characteristics, such as feature distributions, data quality, and the presence of noise. To further validate the robustness and generalizability of our model, future work will involve testing the model on diverse datasets.
4.3.3. Experimental Validation of Credit Risk Assessment Dataset
In the following sections, we delve into the experimental results of different models on publicly available financial credit datasets from three different countries: the German GM credit dataset, the Australian credit dataset, and the Japanese CRX credit dataset, again as a binary classification task. The data were collected from Kaggle, the UCI Machine Learning Repository, and other sources. The performance of the FDSV model is validated on these publicly available datasets. The following are the detailed results of these experiments.
As shown in Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11, various models have been evaluated under different datasets to obtain a comprehensive view of the performance of these models in financial fraud detection. By analyzing the precision, recall and F1-scores for both the risk-free and risky categories, together with specific data points, one can draw meaningful conclusions. In the GM dataset, when considering the risk-free category, models such as LGBM and TabPFN exhibit impressive precision, indicating their ability to accurately identify risk-free instances. However, Tabnet stands out with a higher recall of 0.82, emphasizing its effectiveness in capturing true risk-free cases, albeit with a slightly lower precision of 0.76. The balance between precision and recall is evident in their similar F1-scores of 0.79, highlighting the inherent trade-offs in model performance within this category.
The risk categories shown in Figure 6 are critical in detecting real cases of fraud, and FDSV’s high precision of 0.75 is critical in minimizing false positives and avoiding the mislabeling of normal transactions as fraudulent. FDSV also has a respectable recall of 0.76, which highlights its ability to identify many real cases of fraud. FDSV’s advantage in precision and recall results in a high F1-score of 0.75, demonstrating that it is both comprehensive and effective in detecting financial fraud.
In the Australian dataset, within the Risk-Free category, LGBM achieves a precision of 0.77, while TabPFN achieves a precision of 0.79. However, Tabnet stands out with a higher recall of 0.86, highlighting its proficiency in capturing true risk-free cases. This balance between precision and recall is reflected in their similar F1-scores, emphasizing the inherent trade-offs in model performance. Transitioning to the Risk category in the Australian dataset, TabPFN continues to excel with high precision (0.91), essential for minimizing false positives in fraud detection. RF also demonstrates notable improvement in the Risk category, suggesting its effectiveness in identifying risk instances with a precision of 0.86. Once again, FDSV showcases a balanced performance in both precision and recall, maintaining high precision in this category, achieving a precision of 0.82 and an impressive recall of 0.93.
Moving to the CRX dataset, in the Risk-Free category, LGBM achieves a precision of 0.77, and TabPFN achieves a precision of 0.79. However, TabPFN stands out with a higher recall of 0.91, indicating its proficiency in capturing true risk-free cases. As observed in previous datasets, FDSV maintains a balanced performance in both precision and recall, resulting in a competitive F1-score, with a precision of 0.82 and a remarkable recall of 0.93. In the Risk category of the CRX dataset, TabPFN once again excels with high precision (0.85), crucial for reducing false positives in fraud detection. RF maintains its trend of improvement in the risk category, highlighting its suitability for identifying risk instances with a precision of 0.82. Consistently, FDSV demonstrates strength in maintaining high precision in the Risk category, achieving a precision of 0.82.
Then, a case study was conducted to illustrate the FDSV methodology, as shown in the table below.
Table 8 shows the classification probabilities of several samples through different classifiers and the FDSV model. As can be seen from the table, multiple classifiers sit near an uncertain classification boundary. For example, for sample 1, the SVM model assigns a probability of 0.45 to class 1 and 0.55 to class 0, leading to a final classification error. However, TabPFN achieves higher classification confidence on this sample. Therefore, by using the proposed FDSV model, we can ensure that classifiers with higher confidence are given greater ensemble weights, thereby correcting this tendency towards error. In other words, models with greater certainty make a larger contribution to the final prediction outcome.
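The confidence-weighted combination described above can be sketched as follows; the weighting rule (distance of each model's probability from 0.5) is our illustration of the idea, not the paper's exact FDSV formulation:

```python
# Hedged sketch of confidence-weighted soft voting: classifiers that are
# more certain about a sample contribute more to the combined probability.
# The weighting rule is illustrative, not the paper's exact FDSV scheme.
import numpy as np

def confidence_weighted_vote(probas):
    """probas: (n_models, 2) array of one sample's predicted probabilities.
    Weight each model by how far its class-1 probability is from 50/50."""
    probas = np.asarray(probas, dtype=float)
    conf = np.abs(probas[:, 1] - 0.5)  # certainty of each model
    if conf.sum() == 0:
        w = np.full(len(probas), 1.0 / len(probas))
    else:
        w = conf / conf.sum()
    return w @ probas  # weighted average class distribution

# Sample 1 from the discussion: the SVM sits near the boundary (0.55/0.45)
# on the wrong side, while a more confident model leans strongly to class 1.
combined = confidence_weighted_vote([[0.55, 0.45],   # uncertain, wrong side
                                     [0.10, 0.90]])  # confident, correct
print(combined, "->", combined.argmax())
```

With these illustrative numbers the confident model receives most of the weight, so the combined prediction lands on class 1, correcting the uncertain classifier's tendency towards error.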
The experiments conducted across different datasets demonstrate varying model performances. However, amidst this variability, FDSV consistently stands out by delivering competitive and robust results. FDSV’s unique strength lies in its exceptional ability to strike a balance between precision and recall. This crucial balance makes it an ideal choice for comprehensive financial fraud detection, as it excels in correctly identifying fraud cases while effectively minimizing false alarms. FDSV’s dependable performance underscores its potential to enhance the accuracy and reliability of fraud detection systems in the realm of financial security.