1. Introduction
Fraud can be defined as “An act of intentional deception or dishonesty perpetrated by one or more individuals, generally for financial gain” [1]. Another definition links fraud to personal gain [2]. The main characteristic of fraud is the effort to ensure it is undetected. When a human agent (e.g., an employee) interacts with the system with malicious intent, the reliability of the stored data becomes questionable. Fraud takes many shapes and forms. Examples include credit card, vendor, and payroll fraud. While there are known types of fraud, other types may exist within different organizations, contexts, and cultures, depending on the benefit of the action to the individual committing the fraud. Occupational fraud is the most challenging because employees know their internal systems [1]. They can make the data appear legitimate to ensure they are not exposed.
Research has explored various forms of fraud and fraud detection techniques. These studies aimed to uncover fraudulent actions and reveal those implicated, using the fraudulent data. However, the data resulting from fraud remain stored. This means that the quality of such data may be compromised, and the analysis results could be unreliable. While research recognizes many forms of fraud, very few studies have considered the impact of fraud activities on data quality dimensions [3].
Data Quality is an essential function of data management. Data Quality Management (DQM) involves employing processes, methods, and technologies to ensure that data quality meets specific business requirements [4]. Data preparation is one of the most important activities in data quality management. This may include data cleaning, transformation, and integration. During the lifecycle of data quality management, data quality is assessed against a set of business rules, and data preparation activities are planned accordingly.
In addition, it is necessary to identify the root cause of data quality issues to minimize repetitive data preparation activities. According to the Data Management Organization Guide [5], the root causes of data quality issues can be categorized into five classes: issues caused by lack of leadership, issues caused by data entry processes, issues caused by data processing functions, issues caused by system design, and issues caused by fixing issues. None of these categories includes quality issues caused by fraudulent activity. A more recent categorization was provided by [6] in the context of healthcare. The authors classified data quality root causes into six categories borrowed from the Ishikawa diagram: issues caused by personnel capabilities, issues caused by materials, issues caused by machines, issues caused by methods, issues caused by management, and issues related to the mission. The latter includes two sub-categories related to healthcare insurance fraud: financial incentives or disincentives, and reimbursement systems. The authors mentioned the effect of medical insurance fraud, such as upcoding, on data quality.
This study proposes a process to measure the effect of fraudulent activities on data quality dimensions. The proposed model was applied to the case of undeserved sick leaves.
2. Objectives
Propose a process to measure the effects of fraudulent data on data-quality dimensions.
Apply the proposed model to the case of undeserved sick leaves to investigate which dimensions are affected by fraudulent data.
Identify patterns in the effect of fraudulent data on data quality dimensions.
The remainder of the paper is organized as follows:
Section 3 provides the background,
Section 4 presents the literature review,
Section 5 explains the research methodology,
Sections 6 and 7 present the case of undeserved sick leaves,
Section 8 discusses the results, and
Section 9 concludes the paper.
3. Background
3.1. Data Preparation and Data Quality
Data preparation is a major phase in data analysis. It consists of four main processes: data profiling, cleaning, integration, and transformation [7]. According to [8], data profiling evaluates data quality before performing data-cleansing activities. Data cleaning, one of the main processes of data preparation, was defined in [8] as improving data quality. The authors in [9] suggested using a data quality framework developed for relational databases as a guideline for performing data cleaning on electronic health records (EHR). They applied the proposed method to two types of EHR, and the results were promising. In contrast, the effect of imputing missing data on data quality was evaluated in [10]. Data integration combines data from multiple heterogeneous sources in a unified view to satisfy user queries [11]. The quality of the query response depends heavily on the data quality of the source systems. Thus, the authors of [11] suggested a framework for data integration using data quality. The framework evaluates the data sources using data quality measures and then ranks them using top-K queries to select the highly ranked ones. Finally, data transformation is the process of converting data from one format or structure to another. Aiming to improve Fast Healthcare Interoperability Resources (FHIR) data quality, the authors in [12] developed a Java-based FHIR RDF data transformation toolkit to facilitate the use and validation of FHIR Resource Description Framework (RDF) data. The present research operates within two data preparation processes: (1) data profiling for quality measurement and (2) data cleaning for fraudulent data removal (labelling).
3.2. Data Quality Assessment Methodology
According to [13], the conventional data-quality assessment consists of five main steps.
Data analysis: analyzing schema and other information (metadata) to fully understand the data and management rules.
DQ requirement analysis: Surveying the opinions of data users and administrators to identify quality issues and set new quality targets.
Identifying the critical area: Selecting the most relevant databases and dataflows.
Process modeling: Providing a model of the processes that produce or update data.
Quality measurement: Select the quality dimensions affected by the quality issues identified in the DQ requirements analysis step and define corresponding metrics; measurement can be objective when it is based on quantitative metrics or subjective when it is based on qualitative evaluations by data administrators and users.
The research at hand is not concerned with the entire data quality assessment lifecycle, but only with the steps that contribute to the comparison between fraudulent and non-fraudulent data. These steps include identifying critical data, process modeling, and quality measurement. Details are provided in the methodology section.
4. Literature Review
The work in [14] determined inter-hospital differences in acute myocardial infarction case fatality rates (AMI-CFR), aiming to evaluate the extent to which Belgian discharge records allow the assessment of the quality of care in the field of AMI and to identify starting points for quality improvement. Sensitivity analysis was used to compare the collected datasets.
As part of a wider research work, [15,16] investigated the effect of introducing diagnosis-related group (DRG) systems in 2004 on the distribution of admission weights in very low birth weight infants. A significant decrease in the number of admitted low-weight infants was observed in both studies. In Ref. [6], the change was linked to the control imposed by the new coding system. They questioned the quality of the data before the introduction of the DRG systems. However, these studies did not measure the quality dimensions and their associated business rules, although they could have.
The authors in [17] studied the accuracy and validity of thyroid surgery administrative hospital data by comparing its measures to medical data measures. The authors observed important discrepancies that affected data quality. These discrepancies can be attributed to upcoding practices [6]. However, the authors did not link these dimensions to specific business rules (or associated metrics).
Finally, the authors in [3] discussed the effects of fraud on the data quality dimensions. It has been claimed that fraud in transactional systems may affect five dimensions: consistency, coherence, believability, timeliness, and interpretability. However, the authors did not provide objective evidence for their claims.
All the mentioned works (summarized in Table 1, which compares related studies based on their use of data quality dimensions) consider data quality from a narrow point of view; that is, the effect on quality dimensions needs to be studied in more detail using business rules and associated metrics. In addition, it has been observed that the case of undeserved sick leave, a healthcare occupational fraud, is not covered from this perspective.
5. Methodology
A six-step process is proposed to assess which quality dimensions are affected by fraudulent data. Figure 1 shows these steps, and further explanation follows.
A domain expert conducts the first step by determining the business process affected by the fraudulent activity and the data it produces or updates. These data will be evaluated. Subsequently, the organizational data quality business rules related to the business process in question are identified along with their associated DQ dimensions. Following this step, each business rule is measured on the original data, that is, the data before fraud is detected. Next, fraudulent data are detected. This may be as naïve as a subjective manual annotation. Alternatively, it can be performed using sophisticated machine learning and artificial intelligence techniques such as classification, clustering, and anomaly detection. In data cleaning, the identified data are treated either by removing or flagging fraudulent records or by correcting fraudulent values. A fraudulent value makes part of an entity (a record) illegitimate, but the entity itself still exists and is valid. In contrast, a fraudulent record results from an entirely illegitimate or non-existing entity. In the final step, the business rules are measured again on genuine data only: data that are not labeled as fraudulent, data that remain after fraudulent records are removed, or data whose fraudulent values have been corrected.
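The measurement steps of the process can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `flag` field, and the toy records are all hypothetical, and the expert-driven steps (selecting the business process and the business rules) are represented only by the `metrics` dictionary passed in.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Metric = Callable[[List[Record]], float]


def compare_quality(
    records: List[Record],
    metrics: Dict[str, Metric],
    is_fraudulent: Callable[[Record], bool],
) -> Dict[str, Dict[str, float]]:
    """Measure every business-rule metric before and after removing fraud."""
    # Measure each business rule on the original data.
    before = {name: metric(records) for name, metric in metrics.items()}
    # Detect fraudulent records and clean the data by removal.
    genuine = [r for r in records if not is_fraudulent(r)]
    # Measure the same business rules again on genuine data only.
    after = {name: metric(genuine) for name, metric in metrics.items()}
    return {
        name: {
            "before": before[name],
            "after": after[name],
            "change": after[name] - before[name],
        }
        for name in metrics
    }


def pct_populated(field: str) -> Metric:
    """Build a metric: percentage of records in which `field` is populated."""
    def metric(rows: List[Record]) -> float:
        if not rows:
            return 0.0
        return 100.0 * sum(r.get(field) is not None for r in rows) / len(rows)
    return metric


# Toy data: four records, one flagged as fraudulent (hypothetical `flag` field).
toy = [
    {"problem": "flu", "flag": False},
    {"problem": None, "flag": False},
    {"problem": "back pain", "flag": True},
    {"problem": "migraine", "flag": False},
]
report = compare_quality(
    toy, {"BR2": pct_populated("problem")}, lambda r: bool(r["flag"])
)
print(report)
```

In this toy run the removed record was complete, so the completeness-style metric drops after cleaning even though no genuine record changed.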
6. Analysis and Results
The data were obtained from a local hospital, where some records were known to have been inserted to obtain undeserved sick leaves. These are sick leave days prescribed to patients based on non-existing medical conditions. Such sick leaves result from an alleged fraudulent activity in which a physician issues a sick leave so that a patient can take days off without a medical need. This practice is known locally, and the country tries to combat it by imposing punishments and legal consequences.
7. Overview of the Case Study
Sick leaves are days off given to employees or students to recover from their medical conditions. The abuse of this service is called undeserved sick leave and is considered occupational fraud [18]. Undeserved sick leaves can be obtained in four ways:
Providing fake certificates with false stamps and signatures.
Providing fake certificates with true stamps and false signatures; such notes are generally certified by hospital staff such as nurses.
Feigning symptoms of sickness and lying to doctors to obtain a correct certificate.
Providing a fully correct document issued by a physician.
Researchers have begun to investigate the problem of undeserved sick leaves. In [19], a set of methods for analyzing individual changes in sick leave diagnoses over time was discussed. The authors of [20] described sick leave patterns in Saudi Arabia based on data. A set of machine learning models for detecting undeserved sick leaves in hospitals in Saudi Arabia was proposed in [18]. Finally, a clustering model for feature selection applied to uncover undeserved sick leave sellers on social media was proposed in [21].
The dataset consists of 20,021 records of sick leaves entered by physicians. Each record shows information about the patient, treating physician, and sick leave. The data show the number of days of sick leave, the start date, and the diagnosis. It provides the medical record number (MRN) of patients, their sex, and their age. The dataset contained 17 attributes, some of which were removed during the initial cleaning and preparation of the data. At this stage, an evaluation of data quality (DQ) dimensions is required to demonstrate the effect of fraudulent data on data quality.
Table 2 summarizes the dataset’s attributes, their types, and descriptions after standard data cleaning.
Owing to the nature of undeserved sick leaves, a domain expert is needed to assist with the identification process. The expert created two additional attributes to evaluate each record: Sp_match and Dx_match. Sp_match identifies the matching between the physician’s specialization and the problem for which the patient was granted sick leave. Dx_match examines the match between a patient’s complaint and the number of days given as sick leave. Both attributes take one of the following values: 1 if there is a match, 0 if there is a mismatch, or 2 if the expert cannot decide.
Table 3 provides a statistical description of this dataset. Following this introduction, the proposed methodology was applied.
7.1. Identify Affected Business Processes and Critical Data
Per the expert, the business process (and its produced and updated data) affected by issuing undeserved sick leaves is “issuing a sick leave”. Hence, sick leave records were the subject of this study.
7.2. Identify Relevant Business Rules and DQ Dimensions
Eleven business rules were identified concerning the process of issuing sick leaves. Some of these business rules had already been mapped to data-quality dimensions, whereas others had not. A subject area expert was consulted to support mapping the rest of the business rules to data quality dimensions. The list of dimensions studied was limited to the basic dimensions mentioned in [5] and those judged to be affected by fraudulent data [3]. The following section discusses and maps these business rules into seven dimensions.
Accuracy refers to the degree to which data represents real-life entities. A sick leave in the system is said to be accurate if it is represented by a sick leave document delivered to the patient. This is described by business rule BR1 in Table 4.
Completeness refers to whether all required data are present. Although most of the fields in the dataset should be complete, only two business rules related to issuing sick leaves were identified. BR2 and BR3 in Table 4 concern the population of the fields “Problem” and “Number of days,” respectively. The reason for not covering the rest of the fields is that their completeness is ensured through system design.
Timeliness refers to the chronological patterns of transactions commonly observed in the data and information exchanges between subsystems. This was interpreted as the total number of sick leaves obtained per period (day/week/month). A stable pattern reflects high timeliness. See BR4 in Table 4.
According to [5], uniqueness states that no entity exists more than once in a dataset. In this case, one business rule is identified in relation to uniqueness. It states that a patient can take only one sick leave at a time. See Table 4, BR5.
Validity refers to whether data are consistent with a defined domain of values. Considering the process of issuing sick leave, the field “Problem” is said to be valid if its value is among a set of problems defined by the hospital. For example, vitamin D deficiency is a problem that requires medical consultation but does not require sick leave. In Table 4, this rule is defined as BR6.
The authors in [3] define coherence as the agreement of the relationships between information streams. An existing business rule related directly to coherence is the match between the diagnosis of the patient and the specialization of the treating physician (issuing sick leave). This business rule and its measures, metrics, and status indicators are described in Table 4 as BR7.
Believability is the extent to which the data are accepted in a specific environment and in accordance with relevant rules as true or as an item that seems true, real, and credible [22]. The starting point for detecting fraud from data is the unbelievability of the data. Believability is highly subjective, yet an attempt to measure it is presented in [22]. The authors suggested using four other dimensions to measure believability: accuracy, consistency with rational and organizational rules, resource appropriateness, and consistency with previous experiments. Accuracy is calculated objectively by comparing the data to the real world. As mentioned before, once sick leave is issued by a doctor, it is printed and delivered to the patient, which means that accuracy regarding sick leave is always fully met. Consistency with rational and organizational rules is measured by an expert on a scale of 0 to 1. The role of the expert is to specify the rational and organizational rules that should be considered. The expert in the case study specified two main rules: (1) the diagnosis should match the physician’s specialization; (2) the number of days should be consistent with the diagnosis; for instance, the flu does not need more than two days. Resource appropriateness is calculated using two measures: relevance and reliability. Because sick leaves can only be issued by doctors, their relevance is fully ensured. However, their reliability remains questionable. The expert mapped the physician’s reliability to the frequency of sick leaves being issued. Finally, the expert excluded the “consistency with previous experiments” dimension, as no previous experiments were available. Consequently, the weights of accuracy and consistency with previous experiments are zero, which excludes them from the equation. The authors in [22] used all aspects discussed to calculate one value of believability. However, it is not possible to identify the weight of each related dimension, particularly when the subjectivity level is very high. Hence, in the case at hand, business rules are mapped to each subdimension separately. See BR8, BR9, and BR10 in Table 4.
Interpretability represents the ability to explain the system traces left by operators’ activities. This permits us to explain the pertinence of suspect imbalances, which appear when one or more transaction elements are disregarded or when anomalous traces are found. In the context of the case at hand, this was seen as the patient problem description (see BR11 in Table 4).
7.3. Measuring DQ Dimension on Original Data
DQ dimensions were measured based on the metrics listed in Table 4. However, further interpretation is needed for how each metric is calculated based on the data in the case study to further explain the obtained results.
Accuracy: The system does not allow sick leave to be printed unless the record is successfully saved in the database. This means that the database represents all records that have been printed, and accuracy is not suspected in this context.
Completeness: Three business rules were identified for this DQ dimension concerning the problem, number of days, and start date attributes. To calculate each business rule, the number of sick leaves issued with the “problem” attribute holding a null value was counted. Subsequently, the percentage of this count is calculated based on the total number of sick leaves issued. The process is repeated for the number of days and start date attributes.
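As a minimal sketch of the null-count calculation described above, the following Python function computes the percentage of null values for several attributes at once. The field names and the sample records are hypothetical, and treating a missing key the same as an explicit null is an assumption made here for illustration.

```python
from typing import Dict, Iterable, List, Optional


def null_percentages(
    records: List[Dict[str, Optional[object]]], fields: Iterable[str]
) -> Dict[str, float]:
    """Return, per field, the percentage of records whose value is null."""
    total = len(records)
    result = {}
    for field in fields:
        # Assumption: a missing key and an explicit None both count as null.
        nulls = sum(1 for r in records if r.get(field) is None)
        result[field] = 100.0 * nulls / total if total else 0.0
    return result


# Hypothetical sample with the three attributes named in the text.
sample = [
    {"problem": "flu", "days": 2, "start_date": None},
    {"problem": None, "days": 1, "start_date": "2020-01-05"},
    {"problem": "migraine", "days": 1, "start_date": None},
    {"problem": "back pain", "days": 3, "start_date": None},
]
rates = null_percentages(sample, ["problem", "days", "start_date"])
print(rates)
```

Completeness for a field is then 100 minus the corresponding null percentage.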
Timeliness: A pivot table combining all sick leaves issued daily was constructed. The standard deviation of the grouped number of sick leaves was then calculated.
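The daily aggregation and its standard deviation could be computed as sketched below. Two choices here are assumptions rather than details stated in the text: days without any sick leave inside the observed date range are counted as zero, and the population standard deviation (`statistics.pstdev`) is used rather than the sample version. The parameter name `issue_dates` is also hypothetical.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import pstdev
from typing import List


def daily_std(issue_dates: List[date]) -> float:
    """Standard deviation of the number of sick leaves issued per day."""
    if not issue_dates:
        return 0.0
    counts = Counter(issue_dates)
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    # Assumption: days with no sick leave inside the range count as zero.
    daily = [counts.get(first + timedelta(days=i), 0) for i in range(n_days)]
    # Assumption: population standard deviation; the text does not say which.
    return pstdev(daily)


# Toy example: three leaves on Jan 1, none on Jan 2, one on Jan 3.
dates = [date(2020, 1, 1)] * 3 + [date(2020, 1, 3)]
timeliness = daily_std(dates)
print(timeliness)
```

A lower value indicates a more stable daily pattern, which the text interprets as higher timeliness.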
Uniqueness: The number of duplicate (MRN, start date) pairs was calculated to determine how often one patient received more than one sick leave on the same day.
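A short sketch of this duplicate count follows. The field names `mrn` and `start_date` are hypothetical, and skipping records with a null start date is an assumption made here, since the text does not say how such records were handled.

```python
from collections import Counter
from typing import Dict, List, Optional


def count_duplicate_pairs(records: List[Dict[str, Optional[str]]]) -> int:
    """Count (MRN, start date) pairs that occur more than once."""
    pairs = Counter(
        (r["mrn"], r["start_date"])
        for r in records
        # Assumption: records without a start date are not compared.
        if r["start_date"] is not None
    )
    return sum(1 for n in pairs.values() if n > 1)


sample = [
    {"mrn": "A1", "start_date": "2020-01-05"},
    {"mrn": "A1", "start_date": "2020-01-05"},  # same patient, same day
    {"mrn": "A1", "start_date": "2020-02-01"},
    {"mrn": "B2", "start_date": None},
    {"mrn": "B2", "start_date": None},
]
duplicates = count_duplicate_pairs(sample)
print(duplicates)
```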
Validity: To calculate the business rule related to validity, a pivot table showing all problems inserted by physicians to justify a sick leave issue was created. Each problem was evaluated subjectively against the conditions that entailed sick leave based on hospital guidelines. An example of an invalid problem would be giving sick leave to someone complaining of a vitamin D deficiency.
Coherence: The number of sick leaves for which the specialization of the issuing physician does not match the diagnosis of the patient is calculated. This is provided by the Sp_match attribute in the dataset. An example of a non-match is when an ear, nose, and throat (ENT) specialist issues a sick leave with an abdominal pain diagnosis.
Believability is measured through consistency and source reliability.
Two metrics are used to calculate the effect on consistency with rational and organizational rules. The first is the percentage of sick leaves in which the diagnosis does not match the number of days given, relative to the total number of sick leaves. The domain expert identified these non-matches using guidelines on the maximum allowed sick leave days according to the severity of the diagnosis and the need to rest. An example of such a mismatch is giving four days off for an “acute upper respiratory infection,” also known as flu. The second metric calculates the maximum number of days given for each diagnosis and then counts the sick leaves that fall within this maximum threshold. Many diagnoses appear infrequently, so for ease of calculation only diagnoses that appeared more than 20 times were included. These sick leaves were compared with the maximum number of days for the diagnosis and then with the Dx_match attribute. Sick leaves that failed this comparison were counted as mismatches, and their percentage among the overall sick leaves was calculated.
Source reliability: This metric is the percentage of physicians whose annual number of issued sick leaves exceeds a specific threshold. For demonstration, the threshold was set to 100 sick leaves per year.
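The source reliability metric can be sketched as follows, assuming the input is one physician identifier per sick leave issued within a single year. The function name and the identifiers are hypothetical, and treating "exceeded" as strictly greater than the threshold is an assumption.

```python
from collections import Counter
from typing import Iterable


def pct_physicians_over(physician_ids: Iterable[str], threshold: int = 100) -> float:
    """Percentage of physicians who issued more than `threshold` sick leaves."""
    per_physician = Counter(physician_ids)
    if not per_physician:
        return 0.0
    # Assumption: "exceeded" means strictly more than the threshold.
    over = sum(1 for n in per_physician.values() if n > threshold)
    return 100.0 * over / len(per_physician)


# Toy year of data: physician "a" issues 101 leaves, "b" 5, "c" exactly 100.
ids = ["a"] * 101 + ["b"] * 5 + ["c"] * 100
share_over = pct_physicians_over(ids)
print(share_over)
```

Note that the denominator is the number of physicians who appear in the data, so removing every record of a physician also removes that physician from the denominator.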
The last dimension to consider is interpretability. Physicians should provide a full description of a patient’s complaints. To measure interpretability, the number of words used in the diagnosis serves as an indicator of the level of detail provided by the physician. A description was considered full if it contained at least three words describing the problem. The percentage of sick leaves with a full description was calculated based on this threshold.
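The word-count rule can be illustrated with a brief sketch. Splitting on whitespace and counting a null description as not full are assumptions, and the sample descriptions are invented for illustration.

```python
from typing import List, Optional


def pct_full_description(
    descriptions: List[Optional[str]], min_words: int = 3
) -> float:
    """Percentage of descriptions containing at least `min_words` words."""
    if not descriptions:
        return 0.0
    # Assumption: words are whitespace-separated; null counts as not full.
    full = sum(1 for d in descriptions if d and len(d.split()) >= min_words)
    return 100.0 * full / len(descriptions)


sample = ["acute upper respiratory infection", "flu", None, "low back pain"]
interpretability = pct_full_description(sample)
print(interpretability)
```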
Table 5 provides a summary of the results obtained based on each metric.
7.4. Data Cleaning
A machine learning model was applied to identify undeserved leaves. The model was developed in a previous study [18], which built and tested three classifiers: Naive Bayes (NB), K-Nearest Neighbors (KNN), and logistic regression (LR). That study also considered the class imbalance problem, as undeserved sick leaves are expected to be far fewer than authentic, non-fraudulent leaves. Four proportions of the dataset with different ratios among the classes (deserved vs. undeserved) were created. Each classification technique was evaluated under the sampled data proportions using a set of measures such as accuracy, specificity, and area under the curve (AUC).
The LR classifier showed the best performance on the original data (accuracy = 97%, specificity = 76%, and AUC = 87%), followed by NB, then KNN. However, on the sampled data, NB outperformed both LR and KNN with an accuracy of 90%, specificity of up to 94%, and AUC of up to 88%.
The Naïve Bayes model built on sampled data (34% deserved sick leaves and 66% undeserved sick leaves) has been deployed and applied to the dataset considered in this research. The dataset used in this stage is collected in a different timeframe than the dataset used for building the model, yet they both have the same structure and characteristics. The model labeled each record as undeserved or deserved. The model could identify 1075 undeserved records, representing 7% of the dataset. Although the cleaning process is an integral part of this work, it is beyond the scope of this paper to explain the details of the model used. The authors would like to stress that other detection methods can also be used. However, the model developed in another study using the same data was applied here for convenience.
The dataset was cleaned, and undeserved sick leaves were removed (not considered for further analysis) to measure the DQ dimensions without these records.
7.5. Measuring DQ Dimensions on Cleaned Data
After cleaning, the same metrics were recalculated on the cleaned data. Table 6 summarizes the results after removing undeserved sick leaves from the dataset.
8. Discussion
It has been observed that the completeness of the start date field decreased. There were 15,788 records with null values in the start date. After cleaning, this number was reduced to 15,178. Of the 1075 records cleaned, 610 had a null start date (56.67%). Although this proportion was considered significant, the percentage of completeness decreased as the total number of records decreased. It is important to note that the best completeness value that can be obtained after cleaning this dataset is 22.34%, which would occur if all the undeserved sick leave records removed had an unpopulated start date. This means that fraudulent data can negatively affect completeness, but this is not the case at hand. However, this case shows that fraudulent data creates a false picture of the quality of the dataset. That is, the injected complete data, in this case, slightly increased completeness. In other cases, it may raise completeness to an acceptable level, particularly if the metric value is close to the acceptable boundaries.
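The arithmetic behind these completeness figures can be reproduced from the counts reported above (20,021 records, 1075 removed, 15,788 null start dates before cleaning and 15,178 after). The sketch below is only a worked check; the variable and function names are illustrative.

```python
TOTAL_RECORDS = 20_021      # records in the dataset
REMOVED = 1_075             # records labeled undeserved and removed
NULL_BEFORE = 15_788        # null start dates before cleaning
NULL_AFTER = 15_178         # null start dates after cleaning


def completeness(total: int, nulls: int) -> float:
    """Percentage of records with the field populated."""
    return 100.0 * (total - nulls) / total


before = completeness(TOTAL_RECORDS, NULL_BEFORE)
after = completeness(TOTAL_RECORDS - REMOVED, NULL_AFTER)
# Best case: every removed record had an unpopulated start date.
best = completeness(TOTAL_RECORDS - REMOVED, NULL_BEFORE - REMOVED)

print(f"before={before:.2f}% after={after:.2f}% best={best:.2f}%")
```

The best-case value rounds to the 22.34% quoted in the text, and the ordering after < before < best shows why the removed records, being more complete than the rest of the data, had inflated the metric.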
As mentioned, timeliness was measured using the standard deviation of the daily aggregate of sick leaves obtained from the hospital. The value obtained before cleaning the undeserved leaves is 9.45. The daily aggregate varied between 0 and 48 (see the daily aggregate histogram in Figure 2). After cleaning, the measure decreased to 8.92, and the daily aggregate varied between 0 and 44 (see Figure 3). The best value that may be obtained after cleaning is 8.06, which can be achieved if all undeserved sick leaves are issued during days with high aggregates, reducing the maximum number of sick leaves issued daily to 26 instead of 48. In other words, the effect of fraudulent data on timeliness was significant, as the measure improved noticeably after cleaning.
Validity before cleaning was high (99.984%). Only three records with invalid data were included in the dataset. None of these records were removed during cleaning, as none were fraudulent. However, the validity after cleaning changed by 0.001 because the three invalid records remained while the total number of records was reduced. This implies that the injection of fraudulent records affects validity, or any similar dimension measured relative to the number of records.
Coherence is reflected by a match between physician specialization and the issue of sick leave. Before cleaning, coherence was 96.59%, which increased to 98.79% after cleaning. Overall, in 683 records the physician specialization did not match the submitted diagnosis, among which only 23 records remained after cleaning. This demonstrates the significant effect of fraudulent data on this dimension.
Believability is one of the most complex, subjective dimensions. This is observed in the case at hand from two perspectives.
First, consistency with rational and organizational rules: this is expressed by two business rules. BR9 limits the number of days to those specified in the internal hospital guide. After cleaning, the metric value of this business rule increased from 99.19% to 99.55%. Before cleaning, 164 records exceeding the maximum number of days were identified, and 85 of these records remained after cleaning. The second rule is BR8, which is mapped to coherence. The same effect was reported above.
Second, source reliability: physicians should issue a reasonable number of sick leaves (at most 100 per year) (BR10). This evaluation was performed on a yearly basis. Before cleaning, 3.26% of physicians exceeded the threshold, that is, 16 of 490 doctors. After cleaning, the number of doctors exceeding the threshold remained the same (16). However, the percentage increased as the number of doctors decreased (from 490 to 410). The data of 80 doctors were removed during cleaning because all their issued sick leaves were undeserved. This shows that the business rule does not control fraudulent activity but is still affected by fraudulent data.
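The denominator effect on BR10 can be checked with the reported counts (16 physicians over the threshold, 490 physicians before cleaning and 410 after). This is a small worked calculation with illustrative names, not part of the original analysis.

```python
OVER_THRESHOLD = 16        # physicians exceeding 100 sick leaves per year
PHYSICIANS_BEFORE = 490    # physicians present before cleaning
PHYSICIANS_AFTER = 410     # physicians remaining after cleaning

pct_before = 100.0 * OVER_THRESHOLD / PHYSICIANS_BEFORE
pct_after = 100.0 * OVER_THRESHOLD / PHYSICIANS_AFTER

print(f"before={pct_before:.2f}% after={pct_after:.2f}%")
```

The numerator is unchanged, so the whole shift (from roughly 3.3% to roughly 3.9%) comes from the 80 physicians whose records were all removed.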
Interpretability is reflected in the availability of a detailed description of the case in which sick leave is issued (BR11). Before cleaning, 80.99% of the records were interpreted as being associated with the full case description. This percentage improved to 87.12% after cleaning fraudulent data. This may be explained by the inability to enter details about non-genuine diagnoses. Thus, interpretability is significantly affected by fraudulent data.
9. Conclusions, Implications, and Recommendations
This study proposes a process for evaluating the effect of fraudulent data on data quality dimensions during the preparation phase. The suggested model was applied to the occupational fraud case of undeserved sick leaves. The results reveal that fraudulent data can affect many dimensions in several ways. Some business rules are directly associated with fraud activity, others are not associated with it, and others are partially associated with it.
A business rule is directly associated with fraud activity when there is a critical overlap between fraudulent data and data that do not adhere to the business rule. In this case, the metric value of the business rule should improve as the fraudulent data that negatively affect the business rule are removed. An example is BR8, which is mapped to coherence and believability (mismatch between diagnosis and physician specialization).
A business rule is not associated with fraudulent activity when there is no overlap between fraudulent data and data not adhering to the business rule. However, the business rule measure is calculated based on the overall number of records. In this case, the lower number of records affects the metric value. For instance, the validity in the studied case (BR6) is measured using the percentage of invalid values in several fields, among which is the “problem” field. Only three records were identified as invalid, and they were not part of the fraudulent data. However, the metric value decreased because the overall number of records decreased. This case may include a situation in which the metric is calculated based on some aggregates rather than on the overall size of the data, such as the source reliability in the case studied (BR10).
A business rule is partially associated with fraudulent activity when there is a partial overlap with fraudulent data and the measure is based on the overall number of records. In this case, the effect of removing fraudulent data and the effect of removing records from the dataset may be contradictory, and the metric value remains close to the old value. For example, the completeness of the “start date” decreased because the number of records was lowered. However, if all fraudulent data had an unpopulated start date, the metric value would have improved (the best possible value was significant).
As mentioned, the authors in [3] subjectively discussed the effects of fraud on timeliness, coherence, consistency, believability, and interpretability. This research objectively confirmed the effects on timeliness, coherence, believability, and interpretability. However, the data at hand did not provide sufficient evidence for consistency. This does not deny that fraudulent data might significantly affect this dimension. However, further applications of the proposed method to other datasets are required to investigate this dimension. Moreover, the effect is also observed in other dimensions, namely, completeness and validity.
A limitation of this study is that it only investigated fraudulent records. Other applications of the model are required to consider the different granularities of the data, such as fraud at the attribute value level.