**Algorithms in Decision Support Systems**

Editor

**Vicente Garc´ıa-D´ıaz**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Vicente Garc´ıa-D´ıaz University of Oviedo Spain

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Algorithms* (ISSN 1999-4893) (available at: https://www.mdpi.com/journal/algorithms/special issues/decision systems).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-0588-6 (Hbk) ISBN 978-3-0365-0589-3 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


## **About the Editor**

**Vicente Garc´ıa-D´ıaz** is an Associate Professor in the Department of Computer Science at the University of Oviedo. He is a Software Engineer, with a Ph.D. in Computer Science. He has a Master's in Occupational Risk Prevention and the qualification of University Expert in Blockchain Application Development. He is also part of the editorial and advisory board of several journals and has been an editor of several Special Issues in books and journals. He has supervised 100+ academic projects and published 100+ research papers in journals, conferences, and books. His teaching interests are primarily in the design and analysis of algorithms and the design of domain-specific languages. His current research interests include decision support systems, health informatics, and eLearning.

## **Preface to "Algorithms in Decision Support Systems"**

Decision support systems (DSSs) are increasingly important information systems that help to make decisions related to unstructured and semi-unstructured decision problems that do not have a simple solution from a human point of view. They are currently used in different areas, such as medical diagnosis, catastrophe avoidance, agriculture, sustainable development, sales projections, inventory organization, production design, etc. The architecture of a common DSS is basically composed of three main components: (1) knowledge base; (2) user interface; and (3) model to infer the decisions. Such models may be based on multiple types of algorithms, such as neural networks, logistic regression, classification trees, fuzzy logic, etc. Although there are multiple works that try to optimize the operation of DSSs, researchers are still trying to optimize their performance by refining and proposing new algorithms that normally are adapted to the set of data available for a particular domain of knowledge. Thus, the aim of this book is to enhance the state of the art in this area significantly, as well as improving the performance of DSSs in specific domains.

The book is structured in such a way that the first works (articles 1-4) focus on some relevant knowledge domains and the last works (articles 5-8) deal with more general aspects on which researchers are actively working.

Thus, to start with, Fernando Lopez-Mart ´ ´ınez et al. [1] discuss a platform that is the new pillar for the Keralty Foundation to improve population health management, value-based care, and new upcoming challenges in healthcare. The benefits of using the new data platform include better healthcare outcomes, improvement of clinical operations, reduced costs of care, and the generation of accurate medical information. Several machine learning algorithms are used with standardized datasets to improve the effectiveness of public health interventions, improving diagnosis, and clinical decision support.

In the second work, Roanes-Lozano et al. [2] develop a prototype of a rule-based expert system aimed at an amateur competition player that is not accompanied by her coach to a championship. The player must answer a set of questions about how she is serving that day and her usual serving technique. Then, the system obtains a diagnosis using logic inference about the possible reasons. A certain knowledge of the tennis terminology and technique is required from the player, but that is something known at this level. The underlying logic is Boolean, and the inference engine is algebraic.

In the third work, Guimapi et al. [3] propose a platform that can improve insect pest management in the biological control context. It is an interactive platform for fitting data derived from experiments to mathematical expressions and carrying out spatial visualization. It uses experimental data as the input for model fitting, then applies the obtained model at the landscape level via a spatial temperature grid data to yield regional and continental maps. Different modules and functionalities of the tool are presented with the case study.

In the fourth work, Josyula et al. [4] present an evaluation framework for train rescheduling algorithms, which are very important whenever disturbances occur. They also present two train rescheduling algorithms: a heuristic and a mixed-integer linear programming-based exact algorithm. Finally, they conduct an experiment to compare the two multi-objective algorithms using a proposed framework. It is found that the heuristic algorithm is suitable for solving simpler disturbance scenarios since it is quick in producing decent solutions.

In the fifth work, Pendharkar [5] illustrates that, in addition to unbiased evaluations, the ensemble dimensionality reduction research in data envelopment analysis scores results in unique rankings that have high entropy. Under restrictive assumptions, it is also shown that the ensemble scores are normally distributed. Ensemble models do not require any new modifications to existing objective functions or constraints, and when ensemble scores are normally distributed, returns-to-scale hypothesis testing can be carried out using traditional parametric statistical techniques.

In the sixth work, Feng et al. [6] propose an unknown radar emitter identification method based on semi-supervised and transfer learning. Firstly, they construct a support vector machine model based on transfer learning, which can solve the problem that the training data and the testing data do not satisfy the same-distribution hypothesis. Then, they design a semi-supervised co-training algorithm, which can solve the problem that insufficient labeled data result in inadequate training of the classifier. Simulation experiments show that the proposed combination can effectively identify an unknown radar emitter.

In the second to last work, Koukoutsis et al. [7] list the more significant of the hazards and risks related to managing, updating, modifying, and upgrading the data and program cores of very large-scale DSSs. The authors also introduce a new general methodology for designing DSSs that are robust and circumvent these risks. The core of this new approach is the introduction of a meta-database, called teleological, on the base of which the management, updating, modification, reduction, growth, and upgrading of the system may be safely and efficiently achieved.

Lastly, Thanajiranthorn and Songram [8] suggest a new associate classification algorithm to directly discover a compact number of efficient rules for classification without the pruning process. A vertical data representation technique is implemented to avoid redundant rule generation and to reduce time spent in the mining process. The experimental results show that the proposed algorithm achieves in terms of accuracy a number of generated rule items, classifier building time, and memory consumption, especially when compared to the well-known algorithms.

I hope this book is to the liking of the reader and serves to deepen the knowledge of the exciting area of algorithms applied in decision support systems.

#### **BIBLIOGRAPHY**

[1] Lopez-Mart ´ ´ınez, F.; Nu´nez-Valdez, E.R.; Garc ˜ ´ıa-D´ıaz, V.; Bursac, Z. A Case Study for a Big Data and Machine Learning Platform to Improve Medical Decision Support in Population Health Management. Algorithms 2020, 13, 102. https://doi.org/10.3390/a13040102

[2] Roanes-Lozano, E.; Casella, E.A.; Sanchez, F.; Hernando, A. Diagnosis in Tennis Serving ´ Technique. Algorithms 2020, 13, 106. https://doi.org/10.3390/a13050106

[3] Guimapi, R.A.; Mohamed, S.A.; Biber-Freudenberger, L.; Mwangi, W.; Ekesi, S.; Borgemeister, C.; Tonnang, H.E.Z. Decision Support System for Fitting and Mapping Nonlinear Functions with Application to Insect Pest Management in the Biological Control Context. Algorithms 2020, 13, 104. https://doi.org/10.3390/a13040104

[4] Josyula, S.P.; Krasemann, J.T.; Lundberg, L. An Evaluation Framework and Algorithms for Train Rescheduling. Algorithms 2020, 13, 332. https://doi.org/10.3390/a13120332

[5] Pendharkar, P.C. A Comparison of Ensemble and Dimensionality Reduction DEA Models Based on Entropy Criterion. Algorithms 2020, 13, 232. https://doi.org/10.3390/a13090232

[6] Feng, Y.; Wang, G.; Liu, Z.; Feng, R.; Chen, X.; Tai, N. An Unknown Radar Emitter Identification Method Based on Semi-Supervised and Transfer Learning. Algorithms 2019, 12, 271. https://doi.org/10.3390/a12120271

[7] Koukoutsis, E.; Papaodysseus, C.; Tsavdaridis, G.; Karadimas, N.V.; Ballis, A.; Mamatsi, E.; Mamatsis, A.R. Design Limitations, Errors and Hazards in Creating Decision Support Platforms with Large- and Very Large-Scale Data and Program Cores. Algorithms 2020, 13, 341. https://doi.org/10.3390/a13120341

[8] Thanajiranthorn, C.; Songram, P. Efficient Rule Generation for Associative Classification. Algorithms 2020, 13, 299. https://doi.org/10.3390/a13110299

> **Vicente Garc´ıa-D´ıaz** *Editor*

*Article*

## **A Case Study for a Big Data and Machine Learning Platform to Improve Medical Decision Support in Population Health Management**

**Fernando López-Martínez 1,\*,†,‡, Edward Rolando Núñez-Valdez 1,‡, Vicente García-Díaz 1,‡ and Zoran Bursac 2,‡**


Received: 28 February 2020; Accepted: 21 April 2020; Published: 23 April 2020

**Abstract:** Big data and artificial intelligence are currently two of the most important and trending pieces for innovation and predictive analytics in healthcare, leading the digital healthcare transformation. Keralty organization is already working on developing an intelligent big data analytic platform based on machine learning and data integration principles. We discuss how this platform is the new pillar for the organization to improve population health management, value-based care, and new upcoming challenges in healthcare. The benefits of using this new data platform for community and population health include better healthcare outcomes, improvement of clinical operations, reducing costs of care, and generation of accurate medical information. Several machine learning algorithms implemented by the authors can use the large standardized datasets integrated into the platform to improve the effectiveness of public health interventions, improving diagnosis, and clinical decision support. The data integrated into the platform come from Electronic Health Records (EHR), Hospital Information Systems (HIS), Radiology Information Systems (RIS), and Laboratory Information Systems (LIS), as well as data generated by public health platforms, mobile data, social media, and clinical web portals. This massive volume of data is integrated using big data techniques for storage, retrieval, processing, and transformation. This paper presents the design of a digital health platform in a healthcare organization in Colombia to integrate operational, clinical, and business data repositories with advanced analytics to improve the decision-making process for population health management.

**Keywords:** decision support systems; population health management; big data; machine learning; deep learning; personalized patient care

#### **1. Introduction**

Colombia's health system is formed by the public sector and the private sector. The general social security system has two plans, contributory and subsidized. The contributory regime covers salaried workers, pensioners, and independent workers, with the subsidized plan covering anyone who cannot pay. Enrollment coverage increased from 96.6% in 2014 to 97.6% in 2015 [1].

The National Health Authority's primary purpose in Colombia is to improve the quality of healthcare and strengthening supervision, surveillance, and control of the health system. The 2015 Statutory Health Law No. 1751 places the responsibility for guaranteeing the right to health with the health system and recognizes health as a fundamental social right and makes it the state's responsibility to pursue an approach in health promotion and disease prevention [2].

The health sector in Colombia supports all initiatives for implementing new technologies to prevent cardiovascular diseases, disabilities, and high-cost hospitalization cases [3]. There is a remarkable need to improve the prediction of the risk of conditions for the population through the integration and unification of massive volumes of data and the implementation of effective advance analytic solutions to improve the decision-making process and population health management in Colombia's population [4].

Keralty organization is formed by a group of insurance and health services companies with a global presence, which together develops an integral health model, whose purpose is to produce health and well-being to people throughout their lives. The organization is committed to keeping its users healthy and autonomous, focusing on prevention, identification, and management of health risks, control, and care of disease and dependency [5]. The organization is a leader in Colombia by providing integrated health services and is recognized for their human, scientific, technical, and ethical approach [6].

This paper presents how we can obtain value from a large volume of heterogeneous data generated by different data sources in healthcare, and the architecture implemented. The development of proper advanced data analytics methods such as machine learning and big data analytics to perform meaningful real-time analysis on the data to predict clinical complications before it happens and to support the decision-making process are challenging but much needed to handle the complexity of the data-driven problems we are currently facing.

#### *1.1. Related Work*

Several initiatives in Europe, Asia, and North America aim to develop healthcare digital platforms with collaborative access tools to allow the exchange and sharing of information and knowledge wherever and whenever needed throughout the attention process. This type of frameworks and architectures will allow maximum quality and efficiency for patient's care, and to provide appropriate attention to the patient's condition and risks.

Castilla and Leon, for example, implemented a digitalization of health services as a tool to increase the efficiency of the services and increase the security in the attention to patient [7]. A healthcare cyber-physical system assisted by cloud and big data is being developed in the department of computer science at Pace University in New York [8]. This system consists of a data integration layer, a data management layer, and a data-analytics service layer to improve the functioning of the healthcare system. In France, a group of researchers implemented a wearable knowledge as a service platform to cleverly manage heterogeneous data coming from wearable devices to assist the physicians in supervising the patient health [9]. Another interesting work was presented at the International Conference on Computational Intelligence and Data Science (ICCIDS 2018). The authors proposed a hybrid four-layer healthcare model to improve disease diagnostic [10]. In India, a centralized architecture for an end to end integration of healthcare systems deployed in the cloud environment was developed using fog computing [11].

Medical organizations are investing more and more in developing a healthcare platform that integrates data, applications, business processes, and user interfaces to gain knowledge and useful insights for clinical decisions, drug recommendation systems, and better disease diagnoses. Some other examples of big data applications in healthcare can be found in healthcare monitoring, where data captured from wearable devices can assist providers in managing symptoms of patients online and adjust their prescriptions [12]. An analytical platform called "MedAware" has been developed to detect errors in medical prescriptions and clinical errors, reducing the hospital admission and readmission in real-time [13]. In the healthcare prediction field, a healthcare system called "Gemini (Generalizable Medical Information analysis and Integration system)" was developed to collect, process, and analyze large volumes of clinical data and apply machine learning algorithms for performing predictive analytics [14]. Other platforms have been implemented for genomics data analytics to generate predictions based on DNA molecular changes and mutations [15]. Another type of healthcare platform is related to the healthcare knowledge system, defined as the combination of clinical data and physician expertise to support clinical decision-making and diagnosis [16].

#### *1.2. Why Big Data and Machine Learning?*

Big data and machine learning are redefining healthcare goals for the future. Healthcare data are impacting the way disease research is performed, and the level of complexity in population health management is increasing as the traditional fee for service approach is transformed into the value-based care model [17,18].

Population health management is basically the aggregation of patient health data from multiple data sources, and the analysis and transformation into actionable insights to generate informed decisions to improve clinical and financial outcomes [19].

Big data technologies will allow us to bring large volumes of structured and unstructured data from disparate data sources into a data repository to be examined and analyzed. Machine learning models will assist in discovering insights from complex datasets with capabilities such as finding unseen patterns, making new predictions, and analyzing trends on health data. Machine learning is being used in a variety of clinical domains with the analysis of hundreds of clinical parameters resulting in effective and efficient models to improve the outcomes and quality of medical care models [20].

The implementation of this platform shows the enormous potential in using big data to individualize medical treatment, the opportunity for improving the lives of the patients, delivering better medical care, and reduced waste at an operational level [21]. Other chances for big data in healthcare for Keralty organization are:


In addition to the big data technologies used to build the platform, another essential component is the advanced analytic module of the platform. This module contains several machine learning algorithms to support clinical diagnosis. However, the organization should feel confident in these models and how they can be applied to specific use cases. These first models will alert providers to changes in high-risk conditions such as sepsis and hypertensive patients.

The main objective of this paper is to present the developed platform and its components to allow Keralty organization to derive better and more actionable insights from their data, i.e., to derive meaningful information from all these data in a way that allows them to improve care and lower costs needed for value-based reimbursement and business objectives while providing the highest quality care for population health management [22]. The goal is to be aligned with the triple aim framework developed by the Institute for Healthcare Improvement that describes an approach to optimizing healthcare system performance.The implementation of this platform intends to resolve several problems in health services to assist patients and their families in managing their health by providing better access to healthcare services [23].

#### **2. Proposed Digital Health Platform**

Keralty organization currently have several information systems such as Health Information Systems (HIS), Lab Information Systems (LIS), Radiology Information Systems (RIS), Enterprise Resource Planning (ERP), and Customer Relationship Management (CRM), among others, in their ambulatory care centers, hospitals, and home care, which support their integrated health model. The information from these systems was not consolidated on a single platform, and its access and availability generated an operative load, which obstructs all health management processes and the support of clinical decisions for physicians. Consequently, we proposed the design and implementation of a healthcare, clinical, and business data repository with advanced analytic capabilities to consume machine learning prediction models to improve the decision-making process and population health management at the organization. The digital health platform conceptual framework is shown in Figure 1.

**Figure 1.** Conceptual Framework—Keralty Health Portal.

The implementation of the platform was an ambitious project that required integrating health information from disparate sources, building numerous technological and functional components, and the definition of IT management processes robust enough to support interoperability with other systems. The digital health information platform included patient-related data, Electronic Health Records (EHR), diagnostic reports, prescriptions, medical images, pharmacy records, research data, operational data, financial data, and human resources data.

This project was innovative and pioneered the designing and building of a comprehensive health digital platform for a healthcare organization in Colombia, with the patient being at the center of it and all of its information aggregated and summarized based on the standardized enterprise data repository. This information can be accessed quickly and intuitively when and where it is needed, hiding all technical complexity and providing longitudinal process management tools, as well as tools for decision support for professionals. The difference of this platform with other implementations was the development of a medical portal with a patient 360 view that uses data from the enterprise data repository to generate real-time early warning scores, patient surveillance, open API for hospitals integration, prediction of health risk patterns, high-risk markers, co-morbidity detection to predict critical diseases, early diagnosis of diseases, treatment comparison with medical guidelines, and measurement of efficiency of specific drugs to provide the best quality of care.

The Digital Healthcare platform architecture can ingest data from over 50 different source systems at the granular level, including claims, clinical, financial, administrative, wearables, genomics, and socioeconomic data. Few platforms today can integrate that many heterogeneous data sources successfully. The platform can consume machine learning models on-demand without the need for further development. The data logic models are on top of the raw data and can be accessed, reused, and updated through open APIs without the need for clinical and business logic changes. The platform was able to integrate successfully structured and unstructured data. It is commonly seen that this type of platforms in the market is built to either integrate structured data or unstructured but few cases successfully integrate both. Open microservices APIs were created for operations such as authorization, identity management, interoperability, and data pipeline management. These microservices enable the development of third-party applications to interoperate with the platform.

#### *2.1. Platform Architecture*

The initial approach was to build a big data processing pipeline with a Microsoft Azure lambda architecture to support real-time and batch analytics. This approach is shown in Figure 2. This architecture has different mechanisms to consume data depending on the source and timing needed to generate insights. In addition, with this approach, we can have professionals with different skills working in parallel to build the platform.

**Figure 2.** Azure Big Data and Machine Learning Lambda Architecture.

The architecture contains a batch layer, a real-time layer, and a serving layer. The batch layer is in charge of persistent storage and is able to scale horizontally. The real-time layer process streaming data and performs dynamic computation. The serving layer query data on the repositories and consume the prediction models.

From the infrastructure point of view, the platform offers the flexibility of being implemented in a hybrid environment, namely the cloud and the local data processing center, through the use of virtualization techniques, containers, and load balancing systems. The design of the infrastructure was prepared to provide a flexible set of resources that can be used on-demand and based on the specific workload requirements. The infrastructure deployment relied heavily on automation to provide fluid operations.

#### *2.2. Data Repository*

An enterprise-wide staging repository for the big data analytics platform was considered. The data lake allows capturing data of any volume, type, and ingestion speed in one single place for storing heterogeneous data. This staging area included capabilities such as security, scalability, reliability, and availability. The data can be passed, processed directly from the staging area, or can be ingested to an enterprise data warehouse for historical load, preparation, and serve for BI and machine learning needs. This data warehouse repository has a scale-out architecture and massively parallel processing (MPP) engine.

Data models were developed to cover clinical, social, and healthcare program domains. Each model performs validations and processing on the data received, decoupling the processing and administration of the data from the source. These data models can also be extended to store additional attributes specific to the implementation, allowing these models to subscribe to certain types of messages, using the mapping and filtering options provided by the data processing pipelines. Once these subscriptions are created, the model will be loaded with all relevant messages to those who are subscribed and stored in the data lake.

For data storage, the data are loaded into a data warehouse with a daily refresh. This healthcare data repository contains a highly normalized data model for fast and efficient querying and analysis. This repository is read-only.

#### *2.3. Integration and Interoperability*

The platform provides a mechanism to integrate data from heterogeneous sources, define workflows to ingest data from different data stores, and transform and process data to data stores to be consumed by BI applications. A cloud-based data integration service is used to create these data-driven workflows and orchestrate all automation, transformation, and data movement in the platform. The main tasks this integration service should perform are: creation and scheduling of data pipelines to ingest data from different data sources, processing and transformation of the data, and store data in data stores such as data lakes or data warehouses.

Azure Data Factory automates and orchestrates the entire data integration process from end to end in the platform. We built the ETL (extract, transform, and load) pipelines with this Azure component. The data are extracted from the source locations, transformed from its source format to the target Azure data lake's schema, and loaded into Azure data lake and the data warehouse, where they can be used for analytics and reporting. Azure Data Factory defines control flows that execute various tasks in the transform and load process.

We used the mechanism called mapping data flows, combining control flows and data flows to build the data transformations with an easy-to-use visual user interface. These data flows are then executed as activities within Azure Data Factory pipelines. Data Factory is certified by HIPAA (Health Insurance Portability and Accountability Act), which protects the data while they are in use with Azure. In the data flow, we created transformation streams where we define the source data and create the graph with the transformations, schema operations such as derived column, aggregate, surrogate keys and selects, and the output settings.

#### *2.4. Data Security and Privacy Model*

In terms of security, the platform guarantees authentication, access control, and encryption capabilities. The security mechanisms of the platform can provide protection, alert monitoring, and support the OAuth 2.0 protocol for authentication with REST interfaces. ACLs are enabled on folders, subfolders, and files. The platform also provides encryption mechanisms to protect the data. All these capabilities are accompanied by the implementation of enterprise security policies and regulatory compliance requirements.

#### *2.5. Stream Analytics*

The platform can handle mission-critical real-time data and offer end to end streaming pipelines with continuous integration and continuous delivery (CI-CD) services. Other capabilities such as in-memory processing, data encryption, and support of international security standards including HIPAA (Health Insurance Portability and Accountability Act), HITRUST (Health Information Trust Alliance), and GDPR (General Data Protection Regulation).

#### *2.6. Advanced Analytics*

The analytic data component consists of two areas: The first area is the BI models we develop for tactical, operational, and strategic decisions. The second area comprehends several prediction models that need to be developed. Currently, there are two prediction models developed by the authors of this paper to support population health management, specifically the diagnosis of sepsis and hypertension prediction [24,25]. These insights assist providers in the detection and tracking of chronic diseases. The machine learning component is used to build, test, consume, and deploy predictive analytic models on-demand and as requested for the organization. The platform provides self-service dashboards and visualizations that use data from the repositories to drive the decision-making process. The machine learning application layer is one of the essential layers of this platform.

Once the data are integrated, aggregated, and normalized in the system, the platform offers a tool to provide knowledge management through the business intelligence interface providing data analysis, design, and training of machine learning models, as well as development and management of results-based care indicators or population health management. The platform provides a tool where clinicians, researchers, and scientists can mine the data and get valuable information.

Machine learning models can be trained and customized in preconfigured data domains, allowing the storage of the results for future use. Data researchers and scientists can develop advanced tools to obtain information and value of the data stored in the solution, taking advantage of the model design, training, and validation component. We briefly present the predictive models implemented in the platforms.

• **Machine Learning Classification for a Hypertensive Population:** This prediction model evaluates the association between gender, race, BMI (Body Mass Index), age, smoking, kidney disease, and diabetes using logistic regression. Data were collected from NHANES datasets from 2007 to 2016 to train and test the model, a dataset of 19,709 samples with (83%) non-hypertensive individuals and (17%) hypertensive individuals. The results show a sensitivity of 77%, a specificity of 68%, precision on the positive predicted value of 32% in the test sample, and a calculated AUC of 0.73 (95% CI [0.70–0.76]). The model used to estimate the probability that a person will belong to the hypertensive or non-hypertensive class is:

> *<sup>p</sup>* <sup>=</sup> *<sup>e</sup>*(*β*0+*β*<sup>1</sup> *gender*+*β*2*age*+*β*3*race*+*β*4*bmi*+*β*5*kidney*+*β*6*smoke*+*β*7*diabetes*) 1 + *e*(*β*0+*β*<sup>1</sup> *gender*+*β*2*age*+*β*3*race*+*β*4*bmi*+*β*5*kidney*+*β*6*smoke*+*β*7*diabetes*)

We used the logistic regression classification model in this experiment to evaluate the importance of the risk factor variables and their relationship with the prevalence of hypertension among a nationally representative sample of adults ≥20 years in the United States (n = 19,759). The distribution of the samples by hypertensive patients, gender, and race is shown in Table 1.


**Table 1.** Number of samples by hypertensive class, gender, and race.

We computed chi-square test between each independent variable and the dependent variable to indicate the strength of evidence that there is some association between the variables. Chi-square was selected due to the categorical form of the data used in the model, and it is considered one of the best methods to estimate the dependency between the class and the features when the feature can take a fixed number of possible values that belong to a group or nominal category.

Table 2 shows the *p*-value for each variable; the null hypothesis is reject for any *p* ≤ 0.05, while the null hypothesis is not rejected when *p* > 0.05. *p*-values for the variables GENDER, BMIRANGE\_1, BMIRANGE\_3, and KIDNEY\_2 are not statistically significant at 0.05 alpha level; the clinical importance of these variables in the model for interpretation allows us to include them. We ran the model with and without the variables, and there were no significant changes in the accuracy score, positive predicted value rate, and true positive rate.

The training dataset was derived from a random sampling of 70% (13,831) of the extracted study population and the test sampling the remaining 30% (5928) to evaluate the model on the ground-truth that was never used for training. We ran the logistic regression model on the entire dataset to verify the accuracy score of the model.


**Table 2.** Chi2 test and *p*-value for the independent variables.

The Logistic Regression model uses the logit function to express the relationship of the risk factors as:

$$\log \text{it}(p) = \ln(\frac{p}{1-p}) = \beta\_0 + \beta\_1 X\_1 + \beta\_2 X\_2 + \dots + \beta\_i X\_i$$

The probability of success can be expressed as:

$$p = \frac{\mathfrak{e}^{(\beta\_0 + \beta\_1 \mathcal{X}\_1 + \beta\_2 \mathcal{X}\_2 + ... + \beta\_i \mathcal{X}\_i)}}{1 + \mathfrak{e}^{(\beta\_0 + \beta\_1 \mathcal{X}\_1 + \beta\_2 \mathcal{X}\_2 + ... + \beta\_i \mathcal{X}\_i)}}$$

where *p* is the predicted probability of having hypertension, *Xi* are the risk factors or independent variables, and *β<sup>i</sup>* are the coefficients that are estimated by using the method of maximum likelihood and allow us to calculate the odds that, for every unit increase in *Xi*, the odds of having hypertension changes by *eβ*.

• **A neural network approach to predict early neonatal sepsis:** We developed a non-invasive neural network classification model for early neonatal sepsis detection. The data used in this study are from Crecer's Hospital center in Cartagena-Colombia. A dataset of 555 neonates with (66%) of negative cases and (34%) of positive cases was used to train and test the model. The study results show a sensitivity of 80.32%, a specificity of 90.4%, precision on the positive predicted value of 83.1% in the test, sample and a calculated area under the curve of 0.925 (95% Confidence Interval [91.4–93.06]). The neural network architecture can be seen in Figure 3.

**Figure 3.** Multilayer Perceptron Architecture.

Table 3 shows the parameters of the architecture. Labels X1–X7 are informative only, and the input size is 27 variables.


**Table 3.** Model architecture parameters.

The model used an anonymous dataset from a private medical institution in Cartagena, Colombia, from 2016 to 2017. Demographic, laboratory data, blood pressure, and body measures data were part of the dataset. This dataset includes cases of live newborns of ages inferior to 72 h with a diagnosis of early neonatal sepsis by clinical criteria and laboratory blood cultures. Control cases were part of the dataset including all newborns healthy by clinical diagnosis and who returned healthy for a follow up at 72 h.

This retrospective study includes 186 cases and 368 controls based on a case-control relationship of 1:2 with a 95% trust factor and power of 80%. Bivariate analysis and logistic regression were performed to detect the variables associated with early sepsis, and the statistical significance was considered at the alpha level of 0.05.

This model considered nine sociodemographic, fourteen obstetric, nine neonatal, and four maternal infectious related pathology variables. Table 4 shows the quantitative sociodemographic variables, Table 5 shows the qualitative sociodemographic variables, Table 6 shows the quantitative neonatal variables, Table 7 shows the qualitative neonatal variables, Table 8 shows the quantitative obstetric variables, Table 9 shows the qualitative obstetric variables, and Table 10 shows the qualitative maternal infections of the cases and controls.

A bivariate chi-square test with correction was performed to the qualitative variables to find a statistical association between the independent variable and the possibility to develop early neonatal sepsis. For continuous variables, the Mann–Whitney U test was performed. From this statistical analysis, it is essential to show that we did not find significant statistical evidence for the variables age, start of marital status at younger than 18 years old, gender, APGAR (Appearance, Pulse, Grimace, Activity, and Respiration) value less than 7 after 1 and 5 min, the number of pregnancies, and the type of birth. Prenatal control is not associated with the case of sepsis; however, assisting to five prenatal controls are associated with the protection to avoid the appearance of early neonatal sepsis. There was no evidence with the variables IUGR (Intrauterine Growth Restriction) background and multiple pregnancies. Twenty-seven (27) variables were selected as input variables for our artificial neural network architecture.




**Table 5.** Qualitative sociodemographic variables in cases (186) and controls (369).

**Table 6.** Quantitative Neonatal variables in cases (186) and controls (369).



**Table 7.** Qualitative Neonatal variables in cases (186) and controls (369).

**Table 8.** Quantitative Obstetric variables in cases (186) and controls (369).


**Table 9.** Qualitative Obstetric variables in cases (186) and controls (369).



**Table 10.** Qualitative maternal infections variables in cases (186) and controls (369).

In terms of computational timing, It is difficult to evaluate the complexity and timing of a machine learning algorithm. However, based on the algorithmic complexity, we can measure the time performance in terms of its training time complexity using big O notation because the classification time of the models can vary depending on the stress in the computational performance and power. In terms of timing, the classification prediction with the trained models is less than 1 s. The time complexity of the logistic regression could be expressed as *O*((*f* + 1)*csE*), where f is the number of features (+1 because of bias), *c* is the number of possible outputs, *s* is the number of samples, and *E* is the number of epochs to run. For the neural network approach, *O*(*pnl*<sup>1</sup> + *nl*1*nl*<sup>2</sup> + ...), where *p* is the number of features and *nli* is the number of neurons at layer *i* in a neural network [26].

#### **3. Actual Platform Benefits**

The implementation of the platform became the digital healthcare ecosystem for the organization. The organization can populate workflow information systems with critical decision-making insights, accurate and reliable healthcare data that significantly increased the value of the healthcare outcome to patients and care providers. This platform delivers significant benefits to the organization, such as physicians having an intelligent application that can be configured to their preferences and optimized to their disciplines, patients receiving more personalized care, an improvement in healthcare workflow and patient care, and personalized care for physicians and patients.

We describe in the following subsections several use cases that effectively present the change and digital transformation of the organization with the implementation of the platform.

#### *3.1. Reduce Total Cost of Care for Care Coordination*

With a robust data analytic component, the organization was able to prioritize opportunities for improvement and to improve the way care is coordinated and delivered throughout its network of hospitals and medical facilities. The results include a considerable increase in financial results in just six months.

The organization uses the platform to generate timely, meaningful, and actionable data to drive change and improve the quality of care for patients. The organization uses the data for risk-stratification of the network's population, prioritization of the care coordination activities, and prevention activity's interventions. Risk stratification was completed for all patients, enabling care managers to identify individuals at various risk levels for unnecessary services and high-cost utilization, improving patient outcomes and experience. The analytical component also reduces unnecessary visits, facilitates access to specialty care and community-based services, and achieves healthcare outcomes. Other benefits include 3% increase in the detection of high-risk patients with primary care, 20% increase in the number of patients with ongoing care managed, and 10% percent reduction in emergency department utilization per member among care managed patients.

#### *3.2. Self-Service Analytic*

As described in this paper, the healthcare platform combines and standardizes data across different source systems to provide actionable insights in a single platform. The platform integrates data from different sources, such as claims data, cost data, financial data, clinical data, and other patient data. With self-service analytics, the organization increases the number of users accessing the analytic component, improving data visibility and providing actionable insights to improve patient outcomes.

#### *3.3. Reduce Deaths from Sepsis*

The organization improved sepsis mortality rates and improving care outcomes by using the advanced analytic component of the platform. Sepsis impacts almost 1.7 million adults in the U.S. and is responsible for nearly 270,000 annual deaths. One-third of all hospital deaths are patients with sepsis [27]. The machine learning prediction model used in the platform was developed by one of the authors of this paper, as described before. It is still too early to mention the results of the utilization of this feature. However, the goal of the organization is to reduce its sepsis mortality rate, the costs of the creation of its sepsis care transformation team, and the implementation of an evidence-based sepsis care practice.

#### *3.4. Discussion and Limitations*

The digital health platform helps Keralty organization with closing the gaps between multiple datasets, improving clinical benefits, improving patient's lives, supporting better decision-making to manage larger populations, and improving overall health outcomes. However, the need for algorithms with high accuracy in medical diagnosis is still a challenge that needs to be improved precisely and efficiently [28]. The increasing complexity of building end-to-end platforms to integrate disparate systems and to apply machine learning techniques in specific areas such as computer vision, natural language processing, reinforcement learning, and other generalized methods present many challenges when forming the interdisciplinary team needed and the set of technological components used for the implementation.

Some challenges should be considered in the design and implementation of machine learning projects for healthcare. One of the most critical challenges requires algorithms that can answer causal questions. These questions are beyond classical machine learning algorithms because they require a formal model of interventions [29]. To address this type of question from the analytical component of the platform, we need to learn from data differently and to gain knowledge in causal models to understand how machine learning algorithms need to be trained. Another challenge is to create reliable outcomes from heterogeneous data sources with the participation of SME (Subject Matter Experts) who understand the disease; the machine learning predictive accuracy and correct clinical interpretation depend on the criteria and context of the disease. Providers and machine learning engineers should work together on model interpretability and applicability. Machine learning implementation is not an easy task; the selection of predictive features and optimization of hyperparameters is another challenge that needs to be mastered to implement models that provide useful insights [30]. The success and meaningful use of these algorithms, and their integration into the platform depends on the accuracy of the models and their interpretability.

#### **4. Results of Advanced Analytics**

After training and testing the logistic regression model for predicting hypertension, we generated some evaluation metrics to evaluate the classifier. Table 11 shows the confusion matrix with the classification results, include the true positive value (730), true negative value (3407), false negative (216), and false positive value (1575). The classification report in Table 12 shows the calculated precision and sensitivity.


**Table 11.** Confusion matrix.

#### **Table 12.** Classification report.


The test sampling of 5928 contains 4982 (84%) non-hypertensive and 946 (16%) hypertensive patients. The model shows a sensitivity of 730/946 = 77% and a specificity of 3407/4982 = 68%. The precision of the model was 730/2305 = 32% and the negative predicted value 3407/3623 = 94%. The false negative rate of the model was 216/946 = 22%. The model was better at identifying individuals who will not develop hypertension than those who will develop hypertension.

For the neural network approach to predict early neonatal sepsis, Table 13 shows the confusion matrix with the classification results of actual class label vs. the predicted ones, including the true positive value (49), true negative value (95), false negative (12), and false positive value (10).


The classification report in Table 14 shows the precision and sensitivity. The sensitivity of the model is moderately acceptable due to the imbalanced testing dataset, and there is still a high number of false negatives.



A sensitivity of 80.3% and a specificity of 90.4% show that the model might be useful for detecting positive cases, and the true negative rate shows that the model is also efficient at identifying negative cases. The high precision value of 83.1% and the AUC of 0.925 confirm the adequacy of the model as a preliminary screening tool. The percentage of positive cases shows that the model works better than random guessing and the conditional probability of negative test results is considerably low. The accuracy of 86.74% shows that the model correctly identifies negative cases and positive cases based on the characteristics of the dataset and the small number of cases examined.

#### **5. Comparison with Other Platforms**

A review of several healthcare platforms shows that the architecture presented in this paper covers all the categories from integration, interoperability, security care, and advanced analytics. Generally, other implementations only focused on one specific area, as shown in Table 15 and taken from the International Conference on Computational Intelligence and Data Science (ICCIDS 2018) and a healthcare frameworks review proposed in the Journal of King Saud University [31].


**Table 15.** Comparison of healthcare big data platforms.

We designed and implemented a healthcare platform using big data technologies with actionable insights to augment human decision-making at the organization impacting the population's health, public health, and to capture social determinants of health. This platform comprehends all the features we use in the comparison. Raghupathi et al. reported a conceptual architecture to present big data analytic outlines in healthcare with no predictive analytic capabilities and no patient 360 view. Patel et al. designed a big data architecture platform to improve data aggregation in the healthcare industry and to provide a reduction in healthcare cost, predicting analytic, preventive care, and drug discovery capabilities but without patient 360 view capabilities. Chawla et al. presented a patient-centric healthcare framework—Collaborative Assessment and Recommendation Engine (CARE)—to improve patient-centric treatment and diagnosis without real-time monitoring and 360 view capabilities. Abinaya et al. implemented a fascinating e-Health service application for diagnosing heart diseases. Balladini et al. designed a real-time architecture of big data for Francisco Lopez Lima Hospital in Argentina to process physiological data. This platform did not include predictive analytic and patient 360 view. Belle et al. implemented a genomic data processing platform that provides image analytic and signal processing of psychological data. Mezghani et al. designed a big data platform for integrating heterogeneous wearable data in healthcare for real-time monitoring and diagnosis. Lastly, Chen et al. presented a real-time big data platform to improve communication and collaboration between patients and providers, increasing the quality of care that clinical teams can provide.

#### **6. Conclusions and Future Work**

This paper provides details of an optimized and secure healthcare platform that revolutionizes the healthcare industry in Colombia by providing better information to patients and care teams. The use of this technology reduces the costs associated with healthcare.

The proposed digital health platform allows us to address population health challenges, to understand better patient's health, and to find hidden patterns that traditional data analytics fail to

find. The organization can use unified patient-generated data, financial data, and socioeconomic data to detect patterns and to discover a group of patients who share similar health behavior. The analysis of clinical and non-clinical data allows predicting patient's health with better accuracy. The platform also allows better health discoveries and actions based on treatment history for individuals and groups of patients.

Keralty organization recognized that better care coordination was required for patients receiving care. The organization wanted to improve quality outcomes, provider engagement and recruitment, and its own economic health. To meet these objectives, the organization focuses on clinician engagement and organizational alignment, ensuring widespread access to meaningful, actionable data, and the use of the healthcare analytics platform to inform decisions and drive improvement. Keralty believes the use of machine learning will be one of the most important, life-saving technologies ever introduced to the organization. We believe the opportunities are virtually limitless for the platform to improve and accelerate clinical, workflow, and financial outcomes.

More future work needs to be done on the platform to continue improving all the benefits for the entire organization. Tools for performing knowledge discovery process will be added to the ecosystem. The organization is planning to start the implementation of prescriptive analytics models to assist the organization in making smarter decisions in population health management. The architecture team will look at the possibility of implementing Map/Reduce-based computations for processing data with high scalability and to execute low latency and high concurrency analytical queries on top of Hadoop clusters.

**Author Contributions:** Conceptualization, F.L.-M., V.G.-D. and E.R.N.-V.; Methodology, F.L.-M., V.G.-D. and E.R.N.-V.; Software, F.L.-M.; Validation, F.L.-M., V.G.-D., Z.B. and E.R.N.-V.; Formal Analysis, F.L.-M.; Investigation, F.L.-M., V.G.-D., Z.B. and E.R.N.-V.; Resources, F.L.-M.; Data Curation, F.L.-M. and Z.B.; Writing—Original Draft Preparation, F.L.-M., V.G.-D. and E.R.N.-V.; Writing—Review and Editing, F.L.-M., V.G.-D., Z.B. and E.R.N.-V.; Visualization, F.L.-M., Z.B. and E.R.N.-V.; Supervision, F.L.-M. and V.G.-D.; Project Administration, V.G.-D. and E.R.N.-V.; Funding Acquisition, F.L.-M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This document presents an independent study supported by the company Sanitas USA. The points of view expressed are those of the authors and not necessarily those of Sanitas USA. We thank Ivan Murcia VP of Healthcare Services at Sanitas USA and Santiago Thovar, CIO at Keralty who provided insight and expertise that greatly assisted the study.

**Conflicts of Interest:** The authors declare no conflict of interest. The founders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Diagnosis in Tennis Serving Technique**

#### **Eugenio Roanes-Lozano 1,\*, Eduardo A. Casella 2, Fernando Sánchez <sup>3</sup> and Antonio Hernando <sup>4</sup>**


Received: 18 March 2020; Accepted: 22 April 2020; Published: 25 April 2020

**Abstract:** Tennis is a sport with a very complex technique. Amateur tennis players have trainers and/or coaches, but are not usually accompanied by them to championships. Curiously, in this sport, the result of many matches can be changed by a small hint like 'hit the ball a little higher when serving'. However, the biomechanical of a tennis stroke is only clear to an expert. We, therefore, developed a prototype of a rule-based expert system (RBES) aimed at an amateur competition player that is not accompanied by his/her coach to a championship and is not serving as usual (the RBES is so far restricted to serving). The player has to answer a set of questions about how he/she is serving that day and his/her usual serving technique and the RBES obtains a diagnosis using logic inference about the possible reasons (according of the logic rules that have been previously given to the RBES). A certain knowledge of the tennis terminology and technique is required from the player, but that is something known at this level. The underlying logic is Boolean and the inference engine is algebraic (it uses Groebner bases).

**Keywords:** rule-based expert systems; tennis hitting technique; computer algebra systems; Groebner bases; Boolean logic

#### **1. Introduction**

The first and third authors of this paper are veteran tennis players, regularly taking part in both veterans and open tennis championship (they were the +55 doubles champions of Extremadura region in 2018). They have a patent ( Fernando María Sánchez Fernández, Eugenio Roanes Lozano: Dispositivo de entrenamiento para determinar la posición óptima en el tenis. Country: Spain. Request number: 201730051. Priority date: 3 de febrero de 2017. Patent holder: Universidad de Extremadura y Universidad Complutense de Madrid.) regarding the recovery position in tennis [1]. The first author has been the +45 champion of Extremadura region several times and is a tennis instructor ranked not so long ago among the top 500 players of Spain).

The second author was their tennis coach. He is an "ITF Level 2 Tennis Coach" with a long experience in teaching tennis and coaching tennis players both in Argentina and Spain.

Finally, the fourth author is an experienced expert systems developer.

Tennis is a sport with a sophisticated technique [2–4]. The big distance between the hand and the racket's sweet spot makes it difficult to play well. The high precision required is normally achieved through the learning of the technique and a hard training. Something similar happens with golf.

It is well known that there are many aspects in tennis teaching that have to be addressed (tactic, technique, physical training, etc.) and a coach is clearly needed for improving the player's level. For instance, the biomechanical of a stroke is only clear to the expert's eyes (although there are some personalized computer studies of this aspect for elite athletes like high jumpers).

Most amateur tennis players are not accompanied by a coach when playing championships, except when playing team competitions. In this kind of championship, only the captain of the team can talk to players during the matches. An experience of many tennis players is that a simple hint like "hit the ball at a higher point when you serve" can change the result of a match.

We would like to address here just this aspect of the whole process of tennis coaching: finding out what is going wrong in a certain match with respect to the player's usual technique. It applies inference to the answers of the player to a set of questions regarding both the problem and the player's usual technique. It requires a certain knowledge of the technique from the player (let us underline that it is focused to amateur competition players, not to beginners). Therefore, this is neither a system devoted to learn how to play tennis, nor a substitute for a coach. It only tries to correct the mistakes of the stroke going wrong along one match and return the situation to the usual one.

The article is illustrated with the part of the system corresponding to the serve.

While the algebraic model for rule-based expert systems (RBES) is well known (see Section 4), the application to tennis is, as far as we know, entirely novel.

The expert system we have developed tries to model a well-known international standard describing the tennis serving technique. An important issue in an expert system is the knowledge base and the formalism used to represent it. Since propositional Boolean logic is enough to represent this standard, we have adopted this logic to embody the knowledge of our expert system. Artificial Intelligence has proposed many elaborated representation formalisms to model uncertainty or fuzzy knowledge in expert systems, as well as machine learning techniques to acquire this knowledge base. However, we have not required all these techniques for designing our system because of the same purpose of our system: output the recommendations for tennis serving technique, according to a well-known international standard. For our purpose we have only needed to model this standard by means of propositional Boolean logic, resulting in a system with a high performance.

#### **2. Some Introductory Notes about Rule-Based Expert Systems Based on Boolean Propositional Logic**

In this section, we will describe some outlines of RBES based on Boolean propositional logic for representing knowledge. It may be skipped by an acquainted reader, but it is included in order to make the article self-contained, as it could have readers from different environments.

Boolean propositional logic uses a finite set of atomic propositions *X*1, ... , *Xm* for defining formulae (Definition 1) through the connectives ¬, ∨, ∧, and →.

#### **Definition 1** (Formula)**.** *A formula A is recursively defined as:*


*We define* C *as the set of formulae.*

**Definition 2** (Rule)**.** *By its importance, we define rules as the formulae with the form* (*A*<sup>1</sup> ∧ *A*<sup>2</sup> ∧ ... ∧ *Ak*) → (*Ak*<sup>+</sup><sup>1</sup> ∨ ... ∨ *An*) *where each formula A*1, ... , *An is either an atomic proposition (X) or the negation of an atomic proposition (*¬*X).*

**Remark 1.** *We have chosen the previous definition of rule because we believe it is the most intuitive one. A rule of the form:*

*B* → *C* ∨ *D*

*is equivalent to the rule:*

*B* ∧ ¬*C* → *D*

*but it is less intuitive as what is a cause and what is an effect is not on different sides of the implication symbol. Similarly, the implications could be substituted by disjunctions, what has the advantage of using fewer symbols, but would make it less readable.*

**Remark 2.** *Sometimes some rules with the same antecedent and different consequents are grouped into a single formula for the sake of brevity and clarity. For example, the information in the three rules:*


*B* → *E*

*can be summarized in the formula B* → *C* ∧ *D* ∧ *E. Although it is not formally correct (it doesn't comply with Definition 2), we shall hereinafter admit these abbreviations as rules too (in order to save space and to clarify the underlying ideas related to tennis hitting technique). The algebraic approach used has no problem to deal with this sort of formulae, and the timings obtained for globally computing which output are obtained are negligible (as will be shown in Section 6).*

**Definition 3** (IC)**.** *An* integrity constraint *(*IC*) is a piece of knowledge added by the experts and expressing that some potential facts cannot hold at the same time.*

**Definition 4** (elements of a RBES)**.** *By means of the concept of formula, we can define the elements of which a RBES is composed:*


The concept of tautological consequence of propositional logic is used for determining the knowledge which the RBES can infer. As usual, we make use of {*A*1, ..., *An*} |= *B* to denote that the formula *B* is a tautological consequence of the formulae *A*1, ..., *An* (we say that {*A*1, ..., *An*} |= *B* if and only if, whenever the formulae *A*1,...,*An* hold, then *B* also holds). In this way, a RBES with knowledge base {*R*1, ..., *Rn*} and having as input the set of formulae {*F*1, ..., *Fn*} infers the formula *B* ∈ C if and only if the following holds:

$$\{F\_{1}, \dots, F\_{n'}R\_1, \dots, R\_{n'}\} \mid = B$$

Another important issue on propositional Boolean logic is related to the concept of consistency (in an informal way, {*A*1, ..., *An*} is inconsistent if both a formula and its negation can be simultaneously inferred). Obviously, the formulae embodying the knowledge-base of a RBES must be consistent.

#### **3. Organizing the Knowledge of the Tennis Hitting Technique RBES**

The propositional variables of the RBES have been grouped in four homogeneous blocks:


The structure of the RBES is summarized in Figure 1.

**Figure 1.** Some rules directly deduce output from the facts. Other rules lead to intermediate results (guards). And another rules obtain output from the guards. Therefore, the figure shows the hierarchy among input variables (facts), intermediate conclusions (guards) and final conclusions (output).

#### *3.1. Facts—Block I (Generalities)*

In this first block consists just of three questions: the system is informed about the kind of serve the player is executing and two technical details about his/her abilities (see Table 1). We shall briefly explain these facts for those unaware of tennis technique.



Regarding the spin:


For the sake of simplicity we consider *x*1, *x*2, *x*<sup>3</sup> mutually exclusive. This is not completely accurate, as an advanced player could serve with a mixture of topspin and slice spins.

Each kind of serve normally requires the ball to be tossed at different places (see *l*1, *l*2, *l*<sup>3</sup> below) (Let us underline that the encoding chosen use redundant variables, that could be avoided if using 3-valued variables instead. Nevertheless, according to our experience, Boolean computations are much faster, even if more variables are needed.). But this way the opponent is aware of the kind of spin of the serve. It is really difficult although possible to always toss the ball at the same place (*z*0) (Observe that there is no particular reason for choosing the subscripts of the variables (0, 1, 2, 3, 4, 5, 6, 7) the way they have been chosen.). Roger Federer is a good example of this ability.

If the serve is correctly executed, the player tends to move upwards (*y*1) and forward (inside the court).

#### *3.2. Facts—Block II (Details)*

Tennis is not a precise science in the sense that there isn't a unique valid truth for everyone. There are different "schools" and ways to teach and even to execute the different strokes.

They also change with time: before Björn Borg nearly no player played two-handed backhand and used topspin. Nowadays the vast majority of professional players play two-handed backhand and use topspin in most strokes from the back of the court. Meanwhile, two-handed forehand wasn't a success and was only used by a few players like Hans Gildemeister.

There are very successful unorthodox styles, like Jim Courier's baseball-style forehand or Rafael Nadal's forehand finishing. A system like the one presented here would be useless in such a case.

There are also great tennis players with an untraditional serve, like John P. McEnroe, and players with a pure traditional service, like Roger Federer.

Therefore we have just considered what could be considered the standard serving technique [2–4], and have transcribed the causality underlying what is explained in the tennis books into logical rules to the best of our knowledge.

After many years of development of this sport, we consider that there is a consensus in the clear cause-effect relation in tennis hitting technique. For instance, a clear simplified example is the following: if you serve and the ball is too far away in front of you and too low when you hit it, the racket will be facing the floor when it hits the ball, so the ball will bounce just in front of you, never on the other side of the net. Obviously most situations are more complex, but the consensus in the standard technique and the above mentioned causality is what makes plausible (from our point of view) to use a Boolean logic approach, as we compare what the player is doing with the standard theoretical technique (the latter admitted as a truth). Something due to the authors is how they have translated that consensual knowledge into rules.

As said in Section 3.2 above, we have considered that the different spins are mutually exclusive as simplification.

Tables 2 and 3 (splitted only for the sake of space) collects the most typical items of the execution of the serve to be taken into account (in our personal opinion).

Let us underline the difference between *t*<sup>1</sup> and *a*1: the toss can be correct or even too high, but the player can let the ball fall too low anyway.

It is clear that we have the following incompatibilities:



#### **Table 2.** Facts—Block IIa.


#### *3.3. Guards*

Some rules will clearly have an immediate consequence in the serve, but others just provoke intermediate consequences that we have denoted "guards" (Table 4). We have decided to introduce these technical intermediate steps in the rule base in order to organize it more clearly, as they somehow give technical reasons for the knowledge expressed in the rules involved. They can be very useful for an expert in tennis and logic revising the RBES.



Initially, we have decided that guards were not visible to the players, as they are technical reasons for reaching an effect in the serve. Nevertheless they could be an explanation for a player that is curious for obtaining a logic explanation of what happens, as they are somehow an intermediate level of diagnosis. If it is considered interesting for the system to include them as visible output, it would be enough to compute some more normal forms (see Sections 6.1 and 6.2 below).

#### *3.4. Output*

Let us detail the main different output of performing a certain serve (Table 5).



No incompatibility should be included among the output. Different technical errors could lead to opposite outputs, what is not contradictory.

#### *3.5. Rules and Incompatibilities (Integrity Constraints)*

We shall not detail all the rules but we shall begin by giving an example of two simple concatenated rules.

Rule 5 is:

$$\mathbb{R}5: t1 \to \mathbb{g}1$$

and it means:

R5: Too low toss → Lack of balance (equilibrium of the body)

(the consequent not an output, but a guard—an intermediate technical conclusion, not visible for the end user). This rule expresses that, if the player throws the ball in to the air too low, he/she will hit the ball in a flexed position, what will provoke a lack of balance.

Rule 23 is:

$$\mathbb{R}23: \mathbb{g}1 \to c1 \land c2:$$

and it means:

R23: Lack of balance (equilibrium of the body) → Imprecision ∧ Stroke too weak (lack of power)

(the latter are conclusions—output). Therefore, if we have *t*1, we conclude *c*1 and *c*3 This is a very simple example, just to get the flavour of the process. We do not detail all rules in this way for the sake of space.

The rules regarding the facts considered are:

*R*1 : *d*1 → *c*2 *R*2 : *d*2 → *g*1 *R*3 : *v*1 → *c*2 *R*4 : ¬*w*1 → *g*1 *R*5 : *t*1 → *g*1 *R*6 : *t*3 → *c*1 *R*7 : *l*1 ∧ ¬*z*0 ∧ ¬*x*1 → *g*2 ∧ *c*5 *R*8 : *l*2 ∧ ¬*z*0 ∧ ¬*x*2 → *g*2 ∧ *c*5 *R*9 : *l*3 ∧ ¬*z*0 ∧ ¬*x*3 → *g*2 ∧ *c*5 *R*10 : *a*1 → *c*3 *xor c*4 *R*11 : *a*2 → *g*3 *R*12 : ¬*m*1 → *c*1 *R*13 : ¬*e*1 → *c*1 ∧ *c*2 *R*14 : ¬*e*2 → *g*4 *R*15 : ¬*k*1 → *c*1 ∧ *c*2 *R*16 : ¬*k*2 → *c*1 ∧ *c*2 *R*17 : ¬*k*3 → *c*1 ∧ *c*2 *R*18 : ¬*k*4 → *c*1 ∧ *c*2 *R*19 : ¬*k*5 → *c*1 ∧ *c*2 *R*20 : *k*6 → *c*2 *R*21 : ¬*k*7 ∧ *y*1 → *c*2 ∧ *c*5 *R*22 : ¬*e*1 → *c*1 ∧ *c*2

and the rules corresponding to the guards are:

*R*23 : *g*1 → *c*1 ∧ *c*2 *R*24 : *g*2 → *c*1 ∧ *c*2 *R*25 : *g*3 → *c*1 *R*26 : *g*4 → *c*1

Let us underline that we have grouped some rules for the sake of brevity. For instance

*R*7 : *l*1 ∧ ¬*z*0 ∧ ¬*x*1 → *g*2 ∧ *c*5

should be formally written

*R*7*a* : *l*1 ∧ ¬*z*0 ∧ ¬*x*1 → *g*2 *R*7*b* : *l*1 ∧ ¬*z*0 ∧ ¬*x*1 → *c*5

(this way it would be in accordance with Definition 4). Similarly, we have included *xor* (exclusive or) in *R*10:

*R*10 : *a*1 → *c*3 *xor c*4

The knowledge behind this rule could have also be written:

*R*10*n* : *a*1 → *c*3 ∨ *c*4 *IC*0 : ¬(*c*3 ∧ *c*4)

(note that in this case *R*10*n* ∧ *IC*0 → *R*10 but *R*10 → *R*10*n* ∧ *IC*0).

Next, we enumerate integrity constraints related to choose one option over three ones:

*IC*1 : ¬(*x*1 ∧ *x*2) *IC*2 : ¬(*x*2 ∧ *x*3) *IC*3 : ¬(*x*1 ∧ *x*3) *IC*4 : *x*1 ∨ *x*2 ∨ *x*3 *IC*5 : ¬(*t*1 ∧ *t*2) *IC*6 : ¬(*t*2 ∧ *t*3) *IC*7 : ¬(*t*1 ∧ *t*3) *IC*8 : *t*1 ∨ *t*2 ∨ *t*3 *IC*9 : ¬(*l*1 ∧ *l*2) *IC*10 : ¬(*l*2 ∧ *l*3) *IC*11 : ¬(*l*1 ∧ *l*3) *IC*12 : *l*1 ∨ *l*2 ∨ *l*3

and the integrity constraints regarding choosing at most one option out of two are:

*IC*13 : ¬(*a*1 ∧ *a*2) *IC*14 : ¬(*d*1 ∧ *d*2)

#### **4. An Algebraic Model for RBES**

In this section, we will see how we can implement an expert system based on Boolean propositional logic in a computer algebra system.

Let us consider an expert system based on Boolean propositional logic with propositions *X*<sup>1</sup> ... *Xm*. Each formula in C can be represented as a Boolean polynomial (see Definition 5). This representation makes use of the ideal *<sup>I</sup>* in Z2[*x*1, ..., *xm*]:

$$I = \langle \mathbf{x}\_1^2 - \mathbf{x}\_1, \dots, \mathbf{x}\_m^2 - \mathbf{x}\_m \rangle\_1$$

**Notation 1** (NF)**.** *Let NF*(*pol*, *I*) *denote the 'normal form' of polynomial pol modulo ideal I.*

**Definition 5** (function *<sup>ϕ</sup>*)**.** *We define recursively the function <sup>ϕ</sup>* : C −→ <sup>Z</sup>2[*x*1, ..., *xm*] *as follows:*


As may be seen in the previous definition, for any formula *<sup>A</sup>*, *<sup>ϕ</sup>*(*A*) is a polynomial in Z2[*x*1, ..., *xm*] whose variables are never to a power greater than 1 (continuously computing the reduction modulo ideal *<sup>I</sup>* is equivalent to work in the residue class ring Z2[*x*1, ..., *xm*]/*I*). Besides, the total degree of the polynomial may be, at most, equal to the number of variables.

Once described how Boolean formulae are represented by polynomials, we show, by Theorem 1, how the problem of determining if a formula is tautological consequence of others can be translated into an algebraic problem [5]. Previous works not providing a residue class ring as a model for the logic and not considering the extension to RBES are [6,7] (Boolean case) and [8,9] (many-valued case). A detailed survey can be found in [10]. There are recent developments in this line of research such as [11,12].

**Theorem 1.** *Let A*1, ..., *An*, *B* ∈ C*. The following holds:*


*Algorithms* **2020**, *13*, 106

Theorem 1 is important since the algebraic problems involved in this theorem may be solved making use of Groebner bases [13–15]. On the ground of this, many expert systems have been developed in different fields [16–21].

Indeed, the question of determining whether a RBES with knowledge-base {*R*1, ..., *Rn*} infers the formula *B* when its input is the set of facts {*F*1, ..., *Fn*} may be solved by checking whether the following holds:

$$NF(\mathcal{q}(\neg B), I + \langle \mathcal{q}(\neg F\_1), \dots, \mathcal{q}(\neg F\_{\mathcal{n}}), \mathcal{q}(\neg R\_1), \dots, \mathcal{q}(\neg R\_{\mathcal{n}}) \rangle) = 0$$

We can simplify the previous expression through the definition of the following ideals:

$$J = \langle \varphi(\neg F\_1), \dots, \varphi(\neg F\_n) \rangle$$

$$K = \langle \varphi(\neg R\_1), \dots, \varphi(\neg R\_{n'}) \rangle$$

In this way, the expression above results into:

$$NF(\varphi(\neg B), I+J+K) = 0$$

Similarly, the consistency of such RBES for the set of facts {*F*1, ..., *Fn*} can be decided using Groebner bases, as it is equivalent to:

$$GB\left(I + \left<\mathcal{q}(\neg F\_1), \dots, \mathcal{q}(\neg F\_n), \mathcal{q}(\neg R\_1), \dots, \mathcal{q}(\neg R\_n)\right>\right) \neq \left<1\right>$$

what can be written:

*GB*(*I* + *J* + *K*) = 1

### **5.** *Maple* **Implementation**

The implementation used has been written in the computer algebra system *Maple* (*Maple* is a trademark of Waterloo Maple Inc., Waterloo, ON, Canada) and was introduced in 2008 [22] . The complete code is included in order to show how brief the code of the algebraic approach is (just a few lines). Firstly the algebraic packages for computing Groebner bases and normal forms and for declaring the polynomial ring where calculations will take place, are loaded:

```
restart;
with(Groebner):
with(Ore_algebra):
```
Afterwards the list of polynomial variables, the polynomial ring, and the ordering (pure lexicographic with the order for variables given by list SV), are declared:

SV:=x1,x2,x3,z0,y1,d1,d2,v1,w1,t1,t2,t3,l1,l2,l3,a1,a2,m1,e1,e2, k1,k2,k3,k4,k5,k6,k7,o1,g1,g2,g3,g4,c1,c2,c3,c4,c5: A:=poly\_algebra(SV,characteristic=2): Orde:=MonomialOrder(A,'plex'(SV)):

Now the ideal (iI), generated by the square of the polynomial variables minus themselves, is defined using the auxiliary function fu:

```
fu:=var->var^2-var:
iI:=map(fu,[SV]):
```
Finally, the logical connectives are defined as functions (the binary ones have an "&" in order to be infix operators) and are algebraic expressions reduced modulo the ideal iI in ring A using the ordering OrdeiI:

```
NEG :=(m::algebraic) -> NormalForm(1+'m',iI,Orde):
'&AND' :=(m::algebraic,n::algebraic) ->
              NormalForm(expand(m*n),iI,Orde):
'&OR' :=(m::algebraic,n::algebraic) ->
              NormalForm(expand(m+n+m*n),iI,Orde):
'&IMP' :=(m::algebraic,n::algebraic) ->
              NormalForm(expand(1+m+m*n),iI,Orde):
'&XOR' :=(m::algebraic,n::algebraic) ->
              (m &OR n) &AND NEG(m &AND n):
```
The translation of the rules and integrity constraints of the RBES developed in this article in the algebraic approach detailed above is:

```
R1:= d1 &IMP c2:
R2:= d2 &IMP g1:
R3:= v1 &IMP c2:
R4:= NEG(w1) &IMP g1:
R5:= t1 &IMP g1:
R6:= t3 &IMP c1:
R7:= (l1 &AND NEG(z0) &AND NEG(x1)) &IMP (g2 &AND c5):
R8:= (l2 &AND NEG(z0) &AND NEG(x2)) &IMP (g2 &AND c5):
R9:= (l3 &AND NEG(z0) &AND NEG(x3)) &IMP (g2 &AND c5):
R10:= a1 &IMP (c3 &XOR c4):
R11:= a2 &IMP g3:
R12:= NEG(m1) &IMP c1:
R13:= NEG(e1) &IMP (c1 &AND c2):
R14:= NEG(e2) &IMP g4:
R15:= NEG(k1) &IMP (c1 &AND c2):
R16:= NEG(k2) &IMP (c1 &AND c2):
R17:= NEG(k3) &IMP (c1 &AND c2):
R18:= NEG(k4) &IMP (c1 &AND c2):
R19:= NEG(k5) &IMP (c1 &AND c2):
R20:= k6 &IMP c2:
R21:= (NEG(k7) &AND y1) &IMP (c2 &AND c5):
R22:= NEG(e1) &IMP (c1 &AND c2):
R23:= g1 &IMP (c1 &AND c2):
R24:= g2 &IMP (c1 &AND c2):
R25:= g3 &IMP c1:
R26:= g4 &IMP c1:
IC1:= NEG(x1 &AND x2):
IC2:= NEG(x2 &AND x3):
IC3:= NEG(x1 &AND x3):
IC4:= x1 &OR x2 &OR x3:
IC5:= NEG(t1 &AND t2):
IC6:= NEG(t2 &AND t3):
IC7:= NEG(t1 &AND t3):
IC8:= t1 &OR t2 &OR t3:
IC9:= NEG(l1 &AND l2):
IC10:= NEG(l2 &AND l3):
IC11:= NEG(l1 &AND l3):
IC12:= l1 &OR l2 &OR l3:
IC13:= NEG(a1 &AND a2):
```

```
IC14:= NEG(d1 &AND d2):
```
And the (polynomial) ideal of rules is, in this case:

```
J:=[ NEG(R1), NEG(R2), NEG(R3), NEG(R4), NEG(R5), NEG(R6),
     NEG(R7), NEG(R8), NEG(R9), NEG(R10),NEG(R11),NEG(R12),
     NEG(R13), NEG(R14), NEG(R15), NEG(R16),NEG(R17),NEG(R18),
     NEG(R19), NEG(R20), NEG(R21), NEG(R22),NEG(R23),NEG(R24),
     NEG(R25), NEG(R26), NEG(IC1), NEG(IC2),NEG(IC3),NEG(IC4),
     NEG(IC5), NEG(IC6), NEG(IC7), NEG(IC8),NEG(IC9),NEG(IC10),
     NEG(IC11),NEG(IC12),NEG(IC13),NEG(IC14) ]:
```
Let us observe that the generators of the ideals are given as lists of polynomials. The Groebner bases of the ideals are automatically computed by *Maple* command Basis just by introducing the list of generators and the chosen order, as shown in Section 6.

#### **6. Examples**

*6.1. A Correct Serve*

Let us introduce *x*3, ¬*z*0, *y*1, ¬*d*1, ¬*d*2, ¬*v*1, *w*1, *t*2, *l*3, ¬*a*1, ¬*a*2, *m*1, *e*1, *e*2, *k*1, *k*2, *k*3, *k*4, ¬*k*6, *k*7, *o*1 as facts. The ideal K of facts will be:

K:=[ NEG(x3),NEG(NEG(z0)),NEG(y1),NEG(NEG(d1)),NEG(NEG(d2)), NEG(NEG(v1)),NEG(w1),NEG(t2),NEG(l3),NEG(NEG(a1)), NEG(NEG(a2)),NEG(m1),NEG(e1),NEG(e2),NEG(k1),NEG(k2), NEG(k3),NEG(k4),NEG(NEG(k6)),NEG(k7),NEG(o1) ]:

and the Groebner basis of the ideal generated by the ideal *I* + *J* + *K* (note that *I* is denoted iI in the *Maple* implementation) will be:

```
time0:=time():
B:=Basis([op(iI),op(J),op(K)], Orde);
                      [c5^2+c5,...,x2,x1]
time()-time0;
```
0.296

(the complete output is omitted for he sake of space). This ideal is not 1, so there is no inconsistency (for these facts).

Let us check if any of *c*1, *c*2, *c*3, *c*4, *c*5 follow from these facts:

```
time0:=time():
NormalForm(NEG(c1),B,Orde);
                              1 + c1
NormalForm(NEG(c2),B,Orde);
                              1 + c2
NormalForm(NEG(c3),B,Orde);
                              c3 + 1
NormalForm(NEG(c4),B,Orde);
                              1 + c4
NormalForm(NEG(c5),B,Orde);
                              c5 + 1
time()-time0;
                              0.078
```
As 0 was never obtained, none of the *ci* is obtained by forward firing. It can be seen above how the answers were computed in less than one-tenth of a second (on a standard PC), after the previous computation of the Groebner basis of the ideal *I* + *J* + *K* (what took less than three-tenths of a second).

#### *6.2. An Incorrect Serve*

Let us exchange now *l*3 by *l*1. Then the ball is not correctly tossed according to the serve chosen, as the player has affirmed that he/she cannot serve in any of the three ways tossing the ball the same place (¬*z*0)

The ideal K of facts will now be:

```
K:=[ NEG(x3),NEG(NEG(z0)),NEG(y1),NEG(NEG(d1)),NEG(NEG(d2)),
     NEG(NEG(v1)),NEG(w1),NEG(t2),NEG(l1),NEG(NEG(a1)),
     NEG(NEG(a2)),NEG(m1),NEG(e1),NEG(e2),NEG(k1),NEG(k2),
     NEG(k3),NEG(k4),NEG(NEG(k6)),NEG(k7),NEG(o1) ]:
```
and the Groebner basis of the ideal generated by *I* + *J* + *K* will be:

```
time0:=time():
B:=Basis([op(iI),op(J),op(K)], Orde):
                       [c5+1,...,x2,x1]
time()-time0;
                              0.218
```
(the complete output is again omitted for he sake of space). This ideal is not 1, so there is no inconsistency (for these facts) either.

Let us check if any of *c*1, *c*2, *c*3, *c*4, *c*5 follow from these facts:

```
tiempo:=time():
NormalForm(NEG(c1),B,Orde);
                                 0
NormalForm(NEG(c2),B,Orde);
                                 0
NormalForm(NEG(c3),B,Orde);
                              c3 + 1
NormalForm(NEG(c4),B,Orde);
                              c4 + 1
NormalForm(NEG(c5),B,Orde);
                                 0
time()-time0;
```
0.078

Therefore *c*1, *c*<sup>2</sup> and *c*<sup>5</sup> are obtained by forward firing, that is, we could have:


The answers were computed in similar times.

#### *6.3. A Remark about Timings*

Although it is well-known that the complexity of Groebner bases computation is double exponential in the worst case in the general case, there are very important simplifications in the algebraic model for RBES knowledge extraction and verification:


(it is a polynomial Boolean ring, or, if the structure of the RBES is directly translated, a polynomial Boolean algebra [10]).

These features allow to treat problems of a size that would not be treatable, for instance, in real geometry. For example, the RBES of [18] consists of 313 rules (of the type specified in Remark 1) that are simplified to 182 complex rules (in the way mentioned in Remark 2). The problem treated here, with 26 rules and 14 integrity constraints, is tiny in comparison.

The specialized software PolyBoRi [23] is extremely fast due to dealing only with Boolean rings, that is, with this special kind of rings.

#### **7. Conclusions**

The main achievement is that there is no comparable RBES devoted to tennis hitting technique (as far as we know).

This is a first step in the exploration of this kind of RBES. Many more things could be done.

As future work, a comfortable GUI should be developed, either stand alone, like in [18], or accessible through Internet, like [19]. If *Maple* was the chosen tool to implement the RBES, *Maplets* could be used, although it would be even faster to use a specialized software such as MiniSat [24] or PolyBoRi [23]. Nevertheless, performance is not the key issue here, as we are just exploring the field and the system is (at least so far) simple.

Another important extension is to include backward reasoning in order to automatically determine which single propositional changes could change a certain output. This can be achieved with the same algebraic approach used, as shown in [25], but it development is not trivial and is left for a future work.

Also the possibility to use modal many-valued logics is not yet discarded, as the player can have doubts and perhaps not all variables should be considered Boolean.

Obviously the system could be detailed with more facts (like wind) and more complex possibilities (like mixing slice and topspin effects). Moreover, it could be extended to the other main strokes of tennis.

With the implementation included above, the player can study the effect of changing some of the given facts.

**Author Contributions:** Theoretical mathematical and computational details of the approach: E.R.-L., F.S., A.H.; underlying tennis concepts and ideas and subsequent logical rules: E.R.-L., E.A.C., F.S. All authors have read and agree to the published version of the manuscript.

**Funding:** This research was partially funded by the research project PGC2018-096509-B-100 (Government of Spain).

**Acknowledgments:** We would like to thank the anonymous reviewers for their mot valuable comments and suggestions, that have greatly improved the article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Decision Support System for Fitting and Mapping Nonlinear Functions with Application to Insect Pest Management in the Biological Control Context**

**Ritter A. Guimapi 1,2,\*, Samira A. Mohamed 1, Lisa Biber-Freudenberger 3, Waweru Mwangi 2, Sunday Ekesi 1, Christian Borgemeister <sup>3</sup> and Henri E. Z. Tonnang <sup>1</sup>**


Received: 4 March 2020; Accepted: 16 April 2020; Published: 24 April 2020

**Abstract:** The process of moving from experimental data to modeling and characterizing the dynamics and interactions in natural processes is a challenging task. This paper proposes an interactive platform for fitting data derived from experiments to mathematical expressions and carrying out spatial visualization. The platform is designed using a component-based software architectural approach, implemented in R and the Java programming languages. It uses experimental data as input for model fitting, then applies the obtained model at the landscape level via a spatial temperature grid data to yield regional and continental maps. Different modules and functionalities of the tool are presented with a case study, in which the tool is used to establish a temperature-dependent virulence model and map the potential zone of efficacy of a fungal-based biopesticide. The decision support system (DSS) was developed in generic form, and it can be used by anyone interested in fitting mathematical equations to experimental data collected following the described protocol and, depending on the type of investigation, it offers the possibility of projecting the model at the landscape level.

**Keywords:** Nonlinear regression; interactive platform; component-based approach; software architecture; Eclipse-RCP (Rich Client Platform); spatial prediction

#### **1. Introduction**

The ability to make reliable predictions from data through mathematical and computational concepts is fundamental in scientific research. Analytical methods expressed with mathematical modeling and simulations are often used to make predictions [1]. However, some group of scientists like biologists and entomologists are not always equipped with the necessary knowledge of calculus allowing them to perform certain types of analysis. This justifies why the various algorithms developed for fitting data to mathematical equations are not used by many. Among the techniques for fitting data to mathematical equations, nonlinear regression represents one of the most used approaches [2]. It is a very helpful process in engineering, agricultural, and natural science, and it is used to capture and understand the underling relationships among variables (dependent and independent) of interest described by mathematical expressions.

The selection of mathematical expressions for the fitting process should not be done randomly, and it has to obey a certain logic. The best way is usually guided by the theory and knowledge from the field

of interest that provided the data. This insight guides the selection of the mathematical expressions used for describing the theoretical knowledge and establishing the boundaries conditions. If the mathematical expressions of the functions are too complex, it will be difficult and maybe impossible to proceed with parameter estimation. The approach should be grounded with an analytical and realistic understanding of the phenomena, which sometimes may obey biological and physical science principles. The commonly used methods for nonlinear parameter estimation include the ordinary least squares and the maximum likelihood methods [3].

The input data for model fitting may be obtained from biological and environmental processes, and it may be useful to link the resulting model to climatic variables such as temperature for spatial visualization. This approach is necessary as it allows expanding the prediction from mathematical models to be used at the landscape level as a guide in large-scale decision-making. Indeed, maps are very useful to visualize the variability of the real relationships that exist between objects and components abstracted by models. A simple model will result in a less precise map projection, while a complex model, which incorporates several variables and parameters, will allow depicting the specifics in the map. The type of model to adopt depends on its usefulness with regard to the purpose of the study and the scale to visualize [4]. However, producing a meaningful output for the end-user represents another major challenge. Simulation outputs sometimes need a special type of representation to ease their interpretation by managers who generally do not have the data analytics skills required to benefit from the proposed solution.

Spatial data are available in raster or vector format. The intrinsic nature of raster format makes them more suitable for mathematical modeling and spatial analysis. The creation of raster outputs during the mapping process is usually achieved through spatial interpolation of model projections on point locations [5,6]. There are many available interpolation approaches, which can be classified into deterministic or statistical approaches. Deterministic approaches include techniques such as proximity interpolation and inverse distance weighted (IDW), while statistical approaches include techniques such as trend surfaces and ordinary kriging [7]. Regardless of the chosen approach, the output is a map in raster format that can support agro-ecological zoning (AEZ) landscapes organized in units with similar characteristics related to the level of suitability for species or organism breeding [8]. The strategy includes the mapping of AEZ based on the inventories of climate similarities and an evaluation of the land suitability of each zone for managerial policies. In the context of biological control strategy, the concept can be adopted to identify the suitable locations for the field application of fungal-based biopesticide or a release of parasitoids of predators to reduce the population density of insect pests.

To help the end-user exploit both the fitting and the mapping features and make a complete decision, we opted to embed these features in an interactive computer-based platform that will assist uploading experimental data and support the analysis required at different stages. Such platforms are generally viewed as decision-making software (DM software) and, in most cases, are based on multi-criteria decision-making (MCDM) [9,10]. One of the purposes of DM software is indeed to provide the user with technical details, allowing them to mainly focus on the decisional aspect supported by software outputs [10]. Although there are many MCDM tools available, none of them are applicable to the type of data fitting presented here [9,11]. We take into consideration many criteria which are embedded in sequential steps for a rigorous and robust analysis.

A decision support system (DSS) is made of four major components: (1) the data management component, (2) the model management component, (3) the knowledge management component, and (4) the user interface management component [12–14]. The steps in the analysis processes are computerized to reduce the time and human effort required for making decisions [15,16]. In general, there are two types of DSS, namely, model-based and data-based systems. The model-based system is a standalone platform, which is not connected to any other information system. It requires the incorporation of mathematical expressions and applies optimization algorithms for fitting data to the expressions; it usually operates through a user interface, which ensures easy operation [17]. In contrast, the data-based system is a platform to explore and analyze large datasets, from various

sources stored in databases and warehouses, before employing data mining and analytics to process the information, thereby yielding outputs [18]. The primary goal of such a platform is to support the user in decision-making for real and concrete problems.

This paper presents the design and implementation of an interactive computer-based tool for fitting experimental data to mathematical expressions and doing the spatial projection of the result at local, regional, and continental scales. The usefulness of the platform is illustrated for predicting the potential ecological fitness and spatial variations of the virulence of an entomopathogen fungal-based biopesticide used in an integrated pest management (IPM) context.

The sections below are organized as follows: Section 2 provides details of the methodology starting with the input datasets, followed by the modeling framework, and ending with the architecture of the DSS. Section 3 presents a case study where we demonstrate the application of the DSS. Section 4 discusses the results, and Section 5 provides the conclusion with areas of potential future investigation.

#### **2. Materials and Methods**

#### *2.1. Input Dataset*

To use the platform for fitting and mapping operations, some input data are required and should be collected and organized in specific formats, as described below.

#### 2.1.1. Experimental Data

The fitting process takes, as input, experimental data recorded from observations from the field or laboratory. In the context of biological control used as the case study here, the virulence data have to be obtained from laboratory experiments replicated over a range of different temperatures under which both the insect and the entomopathogen can interact. Data recorded from the experiments are structured and organized per replicate and temperature, then used as inputs to the platform. Temperature was selected as the key variable due to the paramount role it plays in the development, survival, reproduction, and mortality of Entomopathogenic Fungi (EPF) and insects. A time step of hour, day, or week is necessary for each replication, which allows observing changes in the dependent variable (mortality). Records used as input into the platform are structured and organized as presented in Figure A1 (Appendix A).

#### 2.1.2. Climate Data

Climate data are useful during the mapping process and the spatial projection of the model. Data such as temperature of the studied area are extracted from global climate data (http://www.worldclim. org/). These climate layers contain monthly minimum, maximum, and mean temperatures organized in raster files with "Flat" format (*.flt* and *.hdr* file), with a spatial resolution ranging from 30 arc-seconds (~1 km) to 10 arc-minutes (~20 km) [19].

#### *2.2. Model Fitting Description*

The model fitting process can be divided into three stages: pre-regression, regression, and post-regression steps [20]. The pre-regression stage mainly consists of the selection of a set of mathematical expressions to be fitted to the data. The regression stage consists of selecting the proper method for parameter estimation, while the post-regression stage consists of sets of activities necessary to evaluate the model.

To illustrate the model fitting process, the following mathematical annotation is used:

$$z = f(\mathbf{x}, \boldsymbol{\beta}) + \boldsymbol{\theta}\_r$$

where *z* is the response variable, *x* represents the input variables, β is the parameter of the model to be estimated, and θ is the error. *ƒ* is a mathematical function such as an exponential or logarithmic function (see Table 1 for more examples), whose expression depends on *x* and β. For example, if *f* is the exponential function with the mathematical expression *f*(*x*) = *b*(*x* − *xb*) 2 , you have β*<sup>f</sup>* = {*b*, *xb*}.

The ability to identify and include the right variables in a model is equally important to the type of model chosen. We consider a group of *m* observations {*zi*} on variables {*xi*} and try to estimate parameter β by minimizing θ. The minimization of θ follows the least-square curve-fitting procedure using the Levenberg–Marquardt (LM) algorithm. For a given equation *f*(*x*,β), parameters are estimated such that the sum of square of the deviations *s*(β) minimizes θ via the following expression: *m* 2 .

$$s(\beta) = \sum\_{i=1}^{m} \left[ z\_i - f(\mathbf{x}\_i | \beta) \right]^-$$
 
$$\text{Thus I M algorithm}$$

The LM algorithm combines the gradient descent and the Gauss–Newton methods. The gradient descent method reduces the sum of the squared errors by updating the parameters in the downhill direction (the direction opposite to the gradient of the equation), while the Gauss–Newton method reduces the sum of the squared errors by assuming the least-square function is locally quadratic with parameters near the optimum. The fitting procedure starts with an initial value of *x0*; then, *x* is adjusted by Δ only for downhill steps verifying (JTJ + λI) Δ = J Tr, where *J* is the Jacobian matrix of derivatives of the residuals for the parameters, λ is the damping parameter between the two steps, and *r* is the residual vector. A set of equations used in the study that can be selected by the user are presented in the Supplementary Materials.

**Table 1.** Summary of key functions used for fitting. The "Name" column gives the name of the model, the "Equation" column gives the mathematical expression of the model, the "comment" column gives the number of derived sub-models from the original main model, and the last column gives the reference for the model. *T* is the independent variable, and *m(T)* represents the virulence model. ID—identifier.


The selection of mathematical expressions for the fitting process is guided by the type of data available to fit, the knowledge of the system, and its boundaries conditions. The estimation of mathematical expression parameters is directly linked to the convergence of the fitting algorithm and the initial conditions [3].

#### *2.3. Features of the DSS*

Spatial visualization emerged as the field that combines the abilities of human perception through data picturing and computer simulation using analytics at the landscape level. Spatial visualization was successfully applied in various studies, and the current platform is inspired by these concepts.

The main features of the platform and how they are linked is described in the unified modeling language (UML) use case diagram shown in Figure 1.

**Figure 1.** Use case diagram presenting the interactions among the platform components and their functionalities. Each circle represents a feature implemented within the decision support system (DSS). The links from the user to a feature represent the direct interactions of the user with the DSS, while the arrows with labels "include" characterize the relationships and level of dependency between features.

The model fitting and mapping process are done through an interactive process that involves the user in all steps to ensure good and reliable decisions are taken, based on graphical outputs and statistical criteria available in the DSS. The process is summarized in Figure 2.

The fitting procedure begins with a preliminary visual display of input data using statistical aggregation functions to provide the average value for all replicates of the experiment. Then, among the 82 nonlinear mathematical expressions, those which can better fit the data are selected. The database of mathematical expressions is obtained from the literature, and each equation is selected based on its ability to capture the relationship that exists between the dependent and the independent variable. The functions used during this process are implemented in R computer language. The implementation of the Levenberg–Marquardt algorithm in R-package minpack.lm is used for fitting the mathematical expressions to data and estimating the parameters [21–23]. *R*-squared, *R*-squared adjusted, Akaike information criterion (AIC), root-mean-squared error (RMSE). and the sum of squared residual are proposed to help to choose the best-fitted model. Once the fitting process is completed, the information is saved and transferred to the mapping perspective.

**Figure 2.** Detailed flowchart diagram of the platform; the figure displays two processes: the fitting process in which experimental data are fitted to nonlinear mathematical expressions using the Marquardt optimization algorithm and the mapping process that begins by linking the obtained best-fitted model to the climate database for map creation. The GUI represents the graphical user interface, and the KMS is the knowledge management system that processes all the simulations.

The mapping exercise requires spatial data as input, which are organized by grids that allow applying the model at the landscape level. From the graphical user interface (GUI), a request is made to a specific studied region, and the values of temperature in the selected locations are extracted from the database in the form of monthly and annual averages. A point object is used to pick the model mathematical expression, and it is consecutively applied in each geographical coordinate of the region of interest to estimate the value of the dependent variable. This allows the value of the dependent variable at each location to be estimated and stored in an ASCII file format (.asc). Additionally, the mapping perspective includes some Udig features for map editing and viewing. It further offers the possibility to transfer the ASCII file obtained to any geographic information system (GIS) software for processing (see Figure A2, Appendix A, for a summary of the algorithm).

#### *2.4. Software Design and Architecture*

The tool is designed using the component-based software architectural approach centered on the Rich Client Platform (RCP) framework. The adopted approach for the software development is proven to simplify and facilitate the productivity and the quality of the end product by enhancing the concepts of software reuse, modularity, extensibility, customization, and reduction of development time [24]. The platform extends all the properties of the Udig framework (Figure 3). Moreover, software developed using RCP can run on a variety of operating systems (e.g., Windows, Linux, or Mac).

**Figure 3.** Unified modeling language (UML) component diagram. The users interact with the inbuilt software environment through the perspectives (model designer and mapping perspective) of the GUI. The GUI is based on the Eclipse-RCP (Rich Client Platform) and Udig-RCP components. Rserve allows the communication between R and Java by creating objects with port 6311.

The Eclipse integrated development environment (EDI), which is one of the most commonly used tools for the programming of component-based software [24], was used to develop the tool (see "Supplementary Materials" for the link to source code and setup). Eclipse has the advantage that it allows modular and object-oriented programming, which accepts simple development and extension of the tool product [24]. The platform is structured into five major components (see Sections 2.4.1–2.4.5) including the RComponent, Eclipse RCP, Rserve, Udig-SDK, and the graphical user interface (GUI) (Figure 3). A procedural approach is used in RComponent for the fitting of mathematical equations and statistical criteria, while object-oriented programming is used to interconnect all the components and implement the software GUI.

#### 2.4.1. RComponent

RComponent integrates the features to import data, carry out the analytics, and display the results. RComponent consists of R-scripts encompassing the mathematical expressions, fitting algorithms with procedures for equations and parameter estimations, statistical criteria for selection, and the mapping environment. RComponent enables selecting the equation that best fits the data, which is then applied to temperature raster files for spatial projection. Different R packages (minpack.lm, MASS for nonlinear

regression and equation fitting; sp, maptools, rgdal, maps, doRNG for spatial operations leading to the creation of the map) are used within RComponent [25–30]. The *nls.lm()* function implemented in the minpack.lm package was used to simulate the Levenberg–Marquardt algorithm. Spatial objects representing climate data are imported using the function *readGDAL ()* from the rgdal package and *writeAsciiGrid(),* while the maptools package is used to export spatial grid data into ASCII format. Commonly used statistical criteria for comparing and selecting mathematical expressions such as *R*-squared, *R*-squared adjusted, Akaike information criterion (AIC), root-mean-squared error (*RMSE*), and the sum of squared residual (SSR) are implemented into RComponent of the platform.

#### 2.4.2. Eclipse RCP

The Eclipse RCP component enables the building of fast and reliable end-user interfaces to guide users during the interactions with the software functionalities. This component consists of different low-level frameworks that offer ways to rapidly develop client-side applications. The software includes the following key features from the RCP framework: the Equinox OSGI (Open Services Gateway initiative) for describing the modular approach used in Java application under Eclipse; the core platform that contains the runtime engine responsible for the run of plug-ins; the JFace dialog preference and wizard framework; the Standard Widget Toolkit (SWT), which provides objects with subclasses of image, color, and font; the Eclipse IDE workbench. All of these features are used to provide a user-friendly and flexible solution.

#### 2.4.3. Rserve

Rserve is a software component used for the creation of objects and the evaluation of the RComponent scripts, as well as their integration into the Java application [31]. This component is used in the developed tool as an object middleware to enable communication and exchange of information between RComponent and the Eclipse RCP component. It is also called an object request broker, which allows the application to send objects and request responses via object-oriented systems. Once a request is issued, Rserve opens a connection to collect the parameters of the request and establishes another connection with RComponent to provide resources for the execution of the request. When the task is completely executed, the output produced by RComponent is sent back to the GUI.

#### 2.4.4. Udig-SDK

Udig-SDK is an open-source desktop application framework built with Eclipse RCP technology, which can be used as a plug-in in RCP applications. To handle the mapping process, the software extends Udig properties to provide a complete Java solution for desktop geographic information system (GIS) data access, editing, and visualization of maps. It is based on GeoTools, ImageIO-Ext, ImageIO, JAI (Java Advanced Imaging), and Gda; the Udig-SDK component allows the mapping of spatial data. Among the key features offered by Udig for map edition, we have layers, style, (AOI (Area of Interest), and catalog. These features allow the user of the developed tool to add or create layers and customize maps.

#### 2.4.5. Graphical User Interface (GUI)

This component is the direct link between the user and the software system. Different perspectives are used to bring key functionalities together in a screen layout that is simple and interactive. The software GUI consists of two perspectives: the model designer and the mapping perspective. The model designer displays functionalities needed for the analytics to (i) import, modify, plot, and visualize experimental data, and (ii) carry out equation fitting. The mapping perspective includes all functionalities for the mapping process. Both perspectives include views, a toolbar, a menu bar, and other graphic components needed for a user graphic interface.

#### 2.4.6. DSS Output Evaluation

The developed DSS produces two main outputs: the model that best fit the experimental data and the geographical distribution map displaying areas of suitability or application of control measures. To evaluate the prediction accuracy of the outputs, users are requested to conduct additional investigations in natural conditions and use the model (if fitted with data from the laboratory) to mimic the field behavior or vice versa. When the output of the DSS is a map, ground scouting is necessary to confirm the projections.

#### **3. Case Study: Using the DSS to Fit Time–Dose–Mortality Data to Mathematical Expressions and Mapping the Potential Zone of E**ffi**cacy of Fungal-Based Biopesticides in the Killing of Insect Pests**

The case study consisted of using the DSS features to map the potential zone of efficacy for the virulence of the biopesticide *Metarhizium anisopliae* isolate ICIPE 62 against mustard aphids.

Fungi are the most widespread entomopathogenic organisms in terrestrial ecosystems [32,33], gaining many interest for their potential use as biopesticides, due to their ability to infect and kill insects [32,34,35]. Their efficacies were demonstrated both in laboratories and in the field, leading to a worldwide increasing interest in the development of EPF-derived products for commercialization. They are, however, highly influenced by biotic (e.g., pest dynamics) and abiotic (e.g., sunlight, rainfall, temperature, humidity) factors. The platform is used to fit the time- and temperature-dependent mortality data to temperature-dependent virulence mathematical expressions. Thereafter, the virulence equation that best fitted the data is spatially projected to map the areas of the potential efficacy of the biopesticide in controlling the targeted pest.

Data on the mortality (dependent parameter) of mustard aphids caused by the biopesticide *Metarhizium anisopliae* isolate ICIPE 62 were obtained in the laboratory at five different temperatures (independent parameter): 10, 15, 20, 25, and 30 ◦C. Many isolates of *Metarhizium anisopliae* were reported to have high pathogenicity against several insect pests [36–38]. *Lipaphis pseudobrassicae*, commonly known as turnip aphid or mustard aphid, is a pest that can feed on many types of crops; it was recorded from host plants belonging to over 10 families including Brassicaceae, Cucurbitaceae, and Solanaceae [39]. This pest is globally distributed with the highest level of infestation occurring in Africa, while it is also present in Europe, Asia, and America [40]. The optimal temperature for its development is reported to be around 20 ◦C [41]. The virulence map was produced for Kenya and Cameroon after completing the fitting process. The spatial evaluation of the generated map was made by comparing the known locations of successful field application of the biopesticide in controlling targeted insect pests to the predicted level of virulence. Illustrations are done below in an interactive way through the features of the designed tool.

#### *3.1. Data Input, Visualization, and Model Fitting Features*

The fitting exercise starts with the creation of a new project using the provision of general information such as the project name and name of the EPF, the name of the author, and a brief description of the project. After the creation of the project, the default display in the environment of the tool is the model designer perspective, which is interactive and intuitive in guiding the user through different steps. The independent and dependent variables, in this case, are temperature and insect mortality due to the interaction with EPF, respectively. For example, the virulence rates of the EPF on the targeted insects at temperatures of 15 and 25 ◦C are 0.7 and 0.89, respectively (Figure 4). The next step consists of loading the experimental data file, which is displayed in Figure 5 on the left side of the user frame. On the right side of the GUI, the recorded mortality is plotted against the corresponding temperature values (Figure 4). Furthermore, the tool offers the user the possibility to include additional temperature values not included in the initial input data file. The new data could be obtained from the literature.

**Figure 4.** User interface (UI) for importing, plotting, and modifying experimental data.


**Figure 5.** Wizard for the selection list for the fitting process. This frame assists the user in the selection of equations to be fitted with experimental data.

The user can choose to fit a single or many mathematical equations among the 82 available in the software to the input data (Figure 5).

For each equation selected to be fitted to the data, the graph and statistical parameters are generated and displayed in the user interface (Figure 6). By combining the graphical display of the output with the different values of the goodness of fit (Figure 7), the fitted equation with the best performance is selected as the temperature-dependent virulence model for mapping purposes.

**Figure 6.** Display of all the fitted equations for the selection of the model.


**Figure 7.** Evaluation criteria and goodness of fit for each fitted equation. Cells in red suggest the best-performing function for each evaluation criterion.

#### *3.2. Mapping Features*

Within the mapping perspective, the selected model is linked to climatic data to generate the map (ASCII file) of the potential virulence efficacy. Layers for minimum and maximum temperatures can be imported, and the extent of the area of interest can be defined. Optionally, a filter can be applied to limit the minimum and maximum temperature values that are considered suitable for the process. After the model is linked with a climatic database, a map is produced through a spatial interpolation technique. The user has the option to use the features provided through Udig for additional editing and layouts of the efficacy map or to transfer the generated ASCII file to another GIS software such as QGIS.

The results considered in the case study indicate that the optimum temperature to apply the EPF isolate ICIPE 62 with the highest level of virulence is 22 ◦C (Topt = 21.98 ◦C ± 0.29). Based on the Lorentzian four-parameter model results, the maps of the potential zones of the efficacy of EPF isolate ICIPE 62 when applied against mustard aphid were produced for Kenya and Cameroon (Figures 8 and 9, respectively). A level of efficacy that varies between 0 and 1 characterizes the virulence level of ICIPE 62 against the targeted pest.

**Figure 8.** Kenyan map of the potential efficacy of ICIPE 62 isolate when used against mustard aphid, modeled with the software. The level of efficacy varies between 0 and 1. Locations with 0% level of efficacy are displayed in white, values between 0 and 0.5 are displayed in blue, values between 0.5 and 0.75 are displayed in green, and values between 0.75 and 1 are displayed in red. Red indicates the highest efficacy levels. The yellow circles surround areas in Kenya where the EPF isolate ICIPE 62 is successfully used, and results were used for validation of the developed model.

Locations with a virulence level of efficacy ranging from 0 to 0.5 are locations with an average environmental temperature below 15 ◦C. Locations with a virulence level of efficacy ranging from 0.5 to 0.8 are locations with an average environmental temperature varying from 15 ◦C to 17 ◦C and from 26 ◦C to 30 ◦C. Locations with a virulence level of efficacy ranging from 0.8 to 1 correspond to locations with an average environmental temperature varying from 17 ◦C to 26 ◦C.

To evaluate the prediction accuracy of the output map, field releases of the EFP were conducted at the point locations circled in Figure 8. The outcome, which was a successful control measure, is compared with the model-predicted level of virulence that is greater than or equal to 50%.

**Figure 9.** Cameroon map of the potential efficacy of ICIPE 62 isolate when used against mustard aphid, modeled with the software. The level of efficacy varies between 0 and 1. Locations with 0% level of efficacy are displayed in white, values between 0 and 0.5 are displayed in blue, values between 0.5 and 0.75 are displayed in green, and values between 0.75 and 1 are displayed in red. Red indicates the highest efficacy levels.

#### **4. Discussion**

This paper presented the design and implementation of an interactive platform for fitting experimental data to mathematical expressions and generating the spatial projection of the result at the local, regional, and continental scales. The platform was developed with the combination of R programming language, RCP architecture, and Udig, which makes it an interactive decision support tool with multiple components, as described in the literature [12–14]. The most important components of the platform are RComponent and the graphical user interface (GUI). However, the trigger of the iterative procedure is the acquisition of the experimental data used as input.

The fitting procedure implemented in the platform applied the LM algorithm. The LM algorithm combines two approaches; it operates like the gradient descent method when the equation parameters are far from their optimal values, while it performs like the Gauss–Newton method when the parameter values are close to their optimum (Nelles, 2001). The main advantage of the LM algorithm is its ability to converge to optimum values faster than the gradient descent or the Gauss–Newton method. This allows the LM algorithm to still find the optimal value of parameters when the initial values of equation parameters are distant from the optimum. This makes the LM algorithm very robust when compared to other minimization algorithms usually used in the fitting procedure (Nelles, 2001).

The GUI is considered as a key component of the tool as it is used directly by the end-user to interact and communicate with the system. The design of the software GUI satisfies the requirements for a software development interface as described in Reference [42]. The component-based approach is adopted to rely on the possibilities to extend and reuse developed components. By employing such a development concept, the maintenance of the software architecture can be secured while new functionalities can be easily integrated. Component-based UIs further enable functional and logical decomposition of the software GUI into perspectives, which helps in defining features needed by the end-users. Component-based UIs accelerate the development process, which is contrary to the iterative or agile development approach that tends to slow the process of developing software. The modularity offered by the RCP component allows easily integrating additional features in the tool.

In comparison to other fields such as education or finance, the need for such tools in agriculture and IPM research activities is lacking despite their importance. This is particularly required to assist in monitoring numerous processes in the crop production system, such as field release of biological control agents. In recent years, progress was perceptible with the proliferation of DSS [43,44], which can be web-based [43] or standalone like the proposed DSS. Despite disparities among conceptual approaches, what they all have in common is the emphases put on input data and the user interface. One advantage of a web-based DSS is the possibility it offers to directly access real-time environmental data to carry out analytics [44], which make these categories of tools highly dependent on an internet connection that is not always available in most of Africa. A standalone DSS, once installed, only requires acquiring input data to be used anywhere and offline. Moreover, the uniqueness of this tool is the flexibility it offers to easily provide spatial distribution maps using gridded values of temperature for delimiting potential areas of field application of control measures. A similar attempt was presented in References [45,46]; however, it is not directly applicable as users still need to be knowledgeable in statistical and geospatial sciences. The current platform generalizes and combines all the steps used in these studies to provide researchers, with no computational skills, the opportunity to process the fitting and select the models that can further be projected spatially at the landscape level.

**Application:** The Lorentzian 4-parameter model obtained in the application section estimates an optimum temperature for the higher virulence of ICIPE62 in killing the aphid at about 21 ◦C. When comparing the current distribution map of the *Lipaphis pseudobrassicae* [40] with the map of the potential zone of the efficacy of ICIPE 62 isolate in Kenya and Cameroon, we observed that many invaded locations fit well with a potential zone of efficacy with virulence level greater than or equal to 0.5. Although the estimates were made without full inclusion of other environmental variables that have impacts on the fungi efficacy in killing insects, the outputs are promising. However, it will be useful to explore the association of temperature with other factors such as relative humidity to improve the accuracy of the prediction, especially for mapping the virulence level of the EPF. Indeed, many studies highlighted the key role played by both temperature and relative humidity in enhancing the virulence of EPF on insect pests [47–49]. On this note, a good perspective to consider for improving the current tool will be to consider the association of at least two factors (temperature and relative humidity, for example) as independent variables in the fitting process.

#### **5. Conclusions and Future Works**

Herein, we presented an interactive generic platform for predicting agro-ecological processes through the use of experimental data that are fitted to mathematical expressions, whereby the resulting best-fitted equation can be applied at the landscape level. The platform is a useful tool for anyone interested in fitting and conducting a spatial visualization of data. The current platform could be of great help for science pathologists and IPM practitioners worldwide in their attempt to increase the use and application of control measures within an agriculture IPM context. To further improve DSS development, the economic injury level (EIL) concepts that rely on economic threshold models may be added as a guide in evaluating the cost and benefit of deploying any control measure. The added module will integrate a combination of environmental and biological factors which are linked to economic factors, yielding an environmental economic threshold module. However, to avoid complexity, it is also possible that outputs from the developed tool can be used as input in other software specifically developed to conduct econometric analysis.

**Supplementary Materials:** The source code of the software is available online at https://github.com/Atoundem/ EPFA\_RCP. The setup and installation requirements can be downloaded at https://github.com/Atoundem/EPFA\_ RCP/tree/master/Setup\_Intall\_Requirement.

**Author Contributions:** Conceptualization, S.E. and H.E.T.; Methodology, R.A.G. and H.E.Z.T.; Software, R.A.G. and H.E.Z.T.; Writing—original draft preparation, R.A.G.; Writing—review and editing, R.A.G., L.B.-F., C.B. and H.E.Z.T.; Supervision, S.A.M., W.M. and H.E.Z.T.; Funding acquisition, S.E. and H.E.Z.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Volkswagen Foundation under Grant [VW-94362].

**Acknowledgments:** The first author of this study is a PhD student working in the fellowship project (VW-89362) of Henri E.Z. Tonnang funded by the Volkswagen Foundation under the funding initiative Knowledge for Tomorrow: Cooperative Research Projects in Sub-Saharan on Resources, their Dynamics, and Sustainability—Capacity Development in Comparative and Integrated Approaches. The authors thank the Federal Ministry for Economic Cooperation and Development (BMZ), Germany, which provided the financial support through the Tuta IPM project, the German Academic Exchange Service (DAAD), and the STRIVE project funded by the German Federal Ministry of Education and Research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**


**Figure A1.** Experimental data: Var1 represents the range of the independent variable (temperature); Var2 is the number of replicates in the experiment, Var3 is the time duration of each experiment; Var4 and Var5 are records of the dependent variable (mortality) with the variation of the independent variable.

**Figure A2.** Algorithm: model fitting and mapping processes.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **An Evaluation Framework and Algorithms for Train Rescheduling**

#### **Sai Prashanth Josyula \*, Johanna Törnquist Krasemann and Lars Lundberg**

Department of Computer Science, Blekinge Institute of Technology, 37141 Karlskrona, Sweden; johanna.tornquist.krasemann@bth.se (J.T.K.); lars.lundberg@bth.se (L.L.)

**\*** Correspondence: sai.prashanth.josyula@bth.se

Received: 26 October 2020; Accepted: 7 December 2020; Published: 11 December 2020

**Abstract:** In railway traffic systems, whenever disturbances occur, it is important to effectively reschedule trains while optimizing the goals of various stakeholders. Algorithms can provide significant benefits to support the traffic controllers in train rescheduling, if well integrated into the overall traffic management process. In the railway research literature, many algorithms are proposed to tackle different versions of the train rescheduling problem. However, limited research has been performed to assess the capabilities and performance of alternative approaches, with the purpose of identifying their main strengths and weaknesses. Evaluation of train rescheduling algorithms enables practitioners and decision support systems to select a suitable algorithm based on the properties of the type of disturbance scenario in focus. It also guides researchers and algorithm designers in improving the algorithms. In this paper, we (1) propose an evaluation framework for train rescheduling algorithms, (2) present two train rescheduling algorithms: a heuristic and a MILP-based exact algorithm, and (3) conduct an experiment to compare the two multi-objective algorithms using the proposed framework (a proof-of-concept). It is found that the heuristic algorithm is suitable for solving simpler disturbance scenarios since it is quick in producing decent solutions. For complex disturbances wherein multiple trains experience a primary delay due to an infrastructure failure, the exact algorithm is found to be more appropriate.

**Keywords:** algorithm evaluation; decision support systems; parallel algorithms; multi-objective optimization; train rescheduling

#### **1. Introduction**

In railway traffic systems, whenever disturbances occur, it is important to effectively reschedule trains while optimizing the goals of various stakeholders. In real time, rescheduling of trains during a disturbance is typically carried out manually by a dispatcher [1], or a train traffic controller. In this process, deviations from the original plan and conflicts are detected by constantly supervising the status of traffic and infrastructure [1]. A railway traffic management system (TMS) constitutes remote equipment and software tools which can support the dispatchers in managing (or controlling) a network's railway traffic [2]. Today's dispatching is supported in various ways by TMSs, which typically [3]: (i) show the current status of the railway network in the dispatching area, (ii) show the positions of the trains, status of signals and switches in real time, (iii) predict train movements and detect (potential) conflicts. However, when it comes to conflict resolution, only a few of the currently existing railway TMSs are actually able to compute and suggest alternative rescheduling decisions, let alone incorporate advanced rescheduling algorithms [4]. A TMS that incorporates intelligent, flexible and (semi)autonomous train rescheduling algorithms has several benefits [5], e.g., facilitating the incorporation of important information in the decision-making process, enabling a wider and longer planning horizon, and reducing the work load of the dispatchers/traffic controllers, if incorporated

well in the traffic management process. As of 2020, several countries (e.g., Sweden and Switzerland) are preparing to deploy new TMSs in railways that integrate, unify and automate a significant part of the traffic management process. A major advantage of such integrated systems is the possibility to increase the level of automation for rescheduling the railway traffic during disturbances [5].

Several algorithms for tackling the train rescheduling problem have been proposed in railway research publications [6]. There is, however, a need to analyze and compare the effectiveness and efficiency of proposed algorithms, to assess the main strengths and limitations. The benefits of comparing the algorithms and the solutions output by them are threefold: (i) enables the practitioners to select an algorithm suitable for the occurring disturbance scenario, (ii) guides the researchers and algorithm designers in improving their algorithms, and (iii) increases future co-operation among researchers and enables exchange of innovative solutions [7]. In this paper, we propose criteria to consider when evaluating a train rescheduling algorithm or comparing it against other algorithms.

The paper is organized as follows. In the next section, we present the train rescheduling problem and the scope of this study, along with some key terminology. In Section 3, we present an overview of related research work and a brief discussion of the main research challenges addressed in this paper, along with the expected research contributions. Section 4 presents the first part of the framework that serves to classify and compare train rescheduling algorithms on a functional/conceptual level. Section 5 presents the second part of the framework that contains a selection of key aspects proposed to be used in systematic performance evaluation and benchmarking of algorithms. Section 6 presents a description of the two rescheduling algorithms that are used to demonstrate the framework's applicability. Section 7 presents the chosen dataset containing the problem instances and the experimental setup. In Section 8, we demonstrate the application of the framework for evaluating and comparing the performance of the two algorithms. In that section, we present the results from the evaluation and discussions based on our analysis. Finally, we present some conclusions and suggest future work in Section 9.

#### **2. Problem Description, Scope, and Terminology**

Railway engines and wagons are known as rolling stock. A train denotes both a composition of rolling stock as well as a timetabled service, allowing the transportation of travellers (passenger trains) or goods (freight trains) between stations [8]. Often, the operation of passenger and freight train services is based on preplanned timetables which ensure operational feasibility of the services by respecting the applicable constraints. A disturbance in a railway network is an unexpected event that renders the originally planned timetable infeasible by introducing 'conflicts'. A conflict is considered to be a situation that arises when two trains are scheduled to occupy an infrastructure resource during overlapping time periods in a way such that one or more system constraints are violated. Several actions need to be taken in real time to prevent or resolve conflicts.

Disturbances are relatively small perturbations in the railway system that can be handled by rescheduling only the railway traffic (i.e., the train timetable) [9]. Disturbances are triggered by incidents such as over-crowded platform(s) that possibly lead to unexpectedly long boarding times and minor delays, or e.g., shorter signalling system failures that may cause more significant delays for several trains. Larger incidents caused by e.g., longer signalling system failures require not only train rescheduling, but also rolling stock and crew rescheduling. Such incidents are often referred to as disruptions [9].

Railway timetables are ideally planned with appropriate time margins in order to enable delayed trains to recover from minor delays and to prevent the propagation of delays from one train to another (i.e., knock-on effects). In case of a disturbance that causes a significant delay to one or more trains, conflicts may arise in the original timetable, thus making it operationally infeasible. The identification and resolution of these conflicts, by adjusting the existing timetable, to obtain a feasible timetable in real time, is known as train rescheduling. The aim of train rescheduling is to quickly obtain a revised feasible timetable of sufficient quality [9]. Train rescheduling is also known as train dispatching and *real-time railway traffic rescheduling*.

When rescheduling trains, the two main stakeholders involved in the process are infrastructure managers and railway operators. The railway infrastructure manager (IM) owns the railway network and associated infrastructure [10]. The IM manages and coordinates all the traffic in the network, both freight and passenger, assuring the operational safety and quality of services [10,11]. The IM also maintains and innovates the rail infrastructure [11]. Examples of IMs are Trafikverket (Sweden), Jernbaneverket (Norway), and Infrabel (Belgium). An important role within the IM organization is that of a dispatcher. Typically, a dispatcher is responsible for monitoring and controlling the railway traffic (i.e., train movements) and rescheduling the traffic plan for his/her control area [12]. The dispatcher is often also responsible for ensuring the safety of scheduled maintenance activities in the railway network. A company operating passenger or freight rail services over the railway infrastructure is called a train operating company (TOC). It is also known as a railway operator or a railway undertaking. Examples are SJ, Tågkompaniet, GreenCargo (Sweden), FlyToget and CargoNet (Norway).

The rescheduling tactics used to prevent and resolve conflicts can be broadly categorized into the following two types: (1) *IM tactics* and (2) *IM + TOC tactics*. IM tactics are typically used by the dispatcher to handle disturbances. Such rescheduling tactics can generally be used without consulting the TOCs and they include (i) retiming (i.e., allocating new arrival and departures times to one or more trains), (ii) local rerouting (i.e., allocating alternative tracks to one or more trains) and (iii) reordering (i.e., prioritizing a train over another). *IM* + *TOCtactics* are typically deployed to handle disruptions and they require the dispatcher to consult with the affected TOCs. Examples of such tactics are (i) global rerouting (i.e., changing the route of trains), (ii) train cancellations (partially/fully cancelling the affected services) and (iii) short-turning of trains. Typically, the IM + TOC tactics are considered to be major decisions, compared to the IM tactics. The reason is that the effects of an IM + TOC decision spill over to other organizations and stakeholders' operational plans.

The actions performed by a typical train rescheduling algorithm can be broadly categorized into two main tasks: (i) computing alternative rescheduling solutions and (ii) selecting a solution based on the objective(s). The computation of alternative solutions primarily involves employing the different mentioned rescheduling tactics to resolve identified potential conflicts. The decision maker is the person responsible for making the decisions regarding adjustments in the disturbed train timetable that will lead to a rescheduled timetable. The algorithm might also assist the decision maker in the selection of a revised timetable, for example, by presenting an analysis of the computed alternative rescheduling solutions and ranking the solutions based on a selection of qualitative and quantitative solution quality indicators.

Table 1 gives an example of how a train rescheduling algorithm can assist the decision-making process. The framework proposed in this paper primarily focuses on the capabilities and performance of algorithms, while the interaction between rescheduling algorithms, involved human decision makers, and the traffic management system, is not described in detail.


**Table 1.** Examples of how train rescheduling can be viewed with/without an algorithm's assistance.

#### **3. Related Work**

Train rescheduling algorithms and solution approaches have been reviewed time and again, e.g., [6,9,15]. In one of the early works, Törnquist [15] presents a review of algorithms and models for railway scheduling and dispatching. The author presents a framework to classify and compare in detail the various train scheduling approaches. More recently, Cacchiani et al. [9] present an overview of algorithms and recovery models for real-time railway disturbance and disruption management. Fang et al. [6] classify and compare the problem models, solution approaches, and problem types for rescheduling in railway networks. In these works, the important characteristics of various algorithmic approaches have been discussed and classified.

Practitioners and researchers may want to simultaneously compare outputs of two or more algorithms in order to assess their relative efficiency [16]. Limited research has been performed on comparing the performance of train rescheduling algorithms [6,17]. In one of the early works, Wegele et al. [7] compare two decision support tools, developed for the Dutch and German railway networks, to assess their effectiveness in optimal train rescheduling. The two train rescheduling algorithms use reordering as the rescheduling tactic and are configured to minimize the total train delays. Based on common input railway instances from the Dutch railway network, the authors propose a comparison between the obtained rescheduling solutions. The two algorithms and their obtained solutions are compared using (i) blocking time plots, (ii) total train delays and (iii) total travel time of all trains. The authors point out that their comparison is slightly imbalanced since the two tools model the train dynamics differently.

Min et al. [18] propose a train rescheduling algorithm, which they compare with the MILP-based heuristic algorithm of Törnquist and Persson [19] and a priority-based heuristic algorithm. The authors use real-world instances from the Seoul metro rail network comprising 23 stations and mixed railway traffic. The authors primarily compare and report (i) the objective values in the obtained solutions, (ii) the distribution of the relative optimality gap of the obtained solutions. The authors consider two cases: (a) the algorithms run to completion, (b) a predefined computational time limit of 1 min. However, the focus of their work is not on a comparison framework for train rescheduling algorithms. From a performance comparison point of view, a noticeable drawback in [18] is the lack of consideration of several other important quality indicators in the rescheduled timetables.

Fan et al. [17] compare eight different approaches (brute force, tabu search, simulated annealing, etc.) to solve the train rescheduling problem. The eight algorithms are configured to minimize the delay costs. The authors use (i) a rail infrastructure bounded by two simple junctions, (ii) a timetable consisting of 12 trains with mixed railway traffic and (iii) four disturbance scenarios, to evaluate the algorithms. The metrics used for evaluating the algorithms are (i) the ordering of trains, (ii) delay cost of each train, (iii) total delay cost (in pounds) and (iv) computation time. The authors comment on the suitability of the algorithm to solve a particular type of disturbance. It is unclear how the authors' approach can be extended to a larger infrastructure.

Samà et al. [16] evaluate several alternative MILP formulations of the train rescheduling problem with different objective functions. Their study focuses on (i) identifying the MILP formulations that give inefficient solutions and modifying them with the addition of appropriate constraints and (ii) identifying relatively efficient formulations among a set of available formulations. They perform experiments on a Dutch railway network with mixed traffic and multiple delayed trains, using rescheduling time windows of 30 min and 1 h.

In recent times, researchers have surveyed and discussed the different objectives and quality indicators for railway rescheduling in various contexts, e.g., Samà et al. [16], Törnquist Krasemann [20], Corman et al. [21], Josyula et al. [22]. While solving the train rescheduling problem, there is no general agreement in the literature on the objective function(s) to be adopted [16]. Similarly, often, there is no shared meaning for many quality indicators [21]. One example is the passenger inconvenience caused due to a rescheduled timetable, for which the literature adopts a wide range of definitions [23].

Based on the review of related work, some of the observed weaknesses and challenges are addressed in this paper by presenting: (i) a framework to evaluate and compare train rescheduling algorithms while using multiple quality indicators and a (ii) a proof-of-concept of the framework by comparing two multi-objective rescheduling algorithms. The two algorithms are extended versions of existing train rescheduling algorithms. The main contributions of the research presented in this paper are: (i) an evaluation framework for train rescheduling algorithms and a demonstration of its applicability and (ii) a systematic evaluation of the rescheduling solutions resulting from the two algorithms for realistic input data.

#### **4. Framework Part I: Classification of Algorithm Capabilities and Characteristics**

This section presents the first part of the evaluation framework, which serves to describe and compare alternative train rescheduling algorithms on a functional (or a conceptual) level. Originating from the existing classifications of train rescheduling algorithms in [6,9,15], we use the algorithm characteristics presented in Table 2 for the classification and description of algorithms for train rescheduling.

Many of the characteristics mentioned in Table 2 are elaborated in detail as follows.

*Infrastructure granularity:* A railway network can be considered on three different levels of granularity: microscopic, mesoscopic or macroscopic [11,14]. A microscopic modelling approach represents every relevant element of the railway infrastructure in detail, e.g., block sections of different length separated by signals and switches, properties of individual tracks and platforms in stations. This is typically important for scheduling the interaction of many different trains in congested sub-networks, stations and junctions. A macroscopic approach typically disregards any fine-grained segmentation of the tracks [14] and each modelled infrastructure element could represent several physical resources. For example, the capacity restriction of a segment between two stations is often represented by a cumulative function that restricts the number of trains that simultaneously occupy the segment, but without allocating unique tracks and platforms. Several algorithms adopt a mixed approach by using a mesoscopic model of the infrastructure and traffic [14]. Often, individual tracks and platforms are modelled, but not the layout of stations and junctions.

*Time representation:* Time representation refers to how the time that a train is scheduled to occupy a certain infrastructure resource is modelled. The choice of time representation affects how detailed the interaction of trains can be modelled and how the problem size grows with an increased scheduling time window. In Table 3, four rescheduling approaches that adopt a continuous time representation are mentioned, while e.g., Harrod and Schlechte [24] present and compare two alternative models that adopt a discrete time representation.

*Special considerations:* While rescheduling trains, a few core constraints need to be enforced for the feasibility of the resulting timetable(s). From a macroscopic modelling perspective, the following constitute the core constraints in train rescheduling:



**Table 2.** The characteristics used to classify and describe algorithms and their capabilities.

Naturally, depending on if the problem is modelled using a microscopic, mesoscopic, or macroscopic approach, the core constraints are formulated in different ways. For example, in a mesoscopic approach, the headway and clear time constraints may be used to implicitly enforce the network capacity constraints. In such an approach, a track can be divided into several block sections and other units of physical track resources (e.g., switches).

If the algorithm accounts for other problem characteristics and constraints besides the core constraints, they are mentioned under the special considerations [15]. Examples of such constraints are (i) consideration of train length when allocating a platform for passengers transfer, (ii) considering synchronized arrival and departures of connecting trains, and (iii) considering the length and/or weight of freight trains when rescheduling unplanned stops and overtakes, which may introduce additional constraints [27].

*Applicable infrastructures, sections and tracks:* These three related aspects specify the properties that can be represented by an algorithm's model of the rail infrastructure. For example, an algorithm that is intended to be used for rescheduling trains on a single line may assume that all stations are linked in a sequence and that all trains travel between the stations in a chronological order. A line is here a "sequence of segments between two major stations with possibly several intermediate stations", while a railway network is instead comprised of "one or several junctions of lines" [15]. In a network, one station can be connected directly with more than two stations [6]. Hence, an algorithm may or may not be able to reschedule trains in a network setting. Furthermore, the segments between the stations and their capacity restrictions may be modelled differently depending on if e.g., only single track is considered or if segments with several alternative tracks can be used to reschedule delayed trains. Whether the algorithm can represent only tracks that permit traffic in only one direction, or in both directions (i.e., bi-directional traffic), is also relevant to capture. For example, in some rail networks, a double-tracked line consists of two parallel, uni-directional tracks where one is dedicated to traffic in one direction and the other to traffic in the reversed direction. Hence, overtaking is then

only possible at stations and an algorithm may base its computation on this assumption and enforce associated constraints. In Sweden, basically, all tracks allow bi-directional traffic (i.e., there are signals for trains in either direction). Allowing faster trains to overtake slower trains on the line between stations is a frequently-used measure to enable trains to catch up and to reduce delay propagation.

*Solving strategy:* When the original problem is solved as one instance, it is said to be a centralized approach [15]. A *decomposition* approach replaces the original problem with a sequence of smaller sub-problems, the solutions to which are computed and then recombined or extended to the original problem. Examples are (i) the rolling horizon approach (decomposition in time), (ii) partitioning trains into groups and sequentially solving the problem associated with each group, and (iii) approaches where entire administrative areas are considered as single entities to carry out inter-area coordination among trains [25].

*Solution space exploration:* Based on the country, IMs have specific rules for resolving conflicts during a disturbance. For example, in Sweden, the general dispatching strategy gives priority to on-time trains over the trains that deviate from the originally planned schedule. The reason behind this rule is to prevent a delay from propagating to trains that run according to schedule [28]. The dispatchers can, however, make exceptions to this rule when it is well-motivated. An algorithm that abides by such specific rules cannot fully explore the solution space for all possible rescheduling solutions.

*Control loop:* A *control loop* gives the interaction between the rescheduling tool and traffic operations [26]. In *open loop* rescheduling, the rescheduling decisions are computed and implemented only once at the beginning of a selected time window (e.g., two hours from the time when disturbance occurs). In *multiple open loop* rescheduling, the algorithm can be applied at successive times over the time window. Whenever additional information regarding traffic conditions is available, the calculations can be reconsidered [11]. However, the algorithm does not consider the actions computed and implemented during its previous runs in the selected time window. A closed loop rescheduling is defined as a multiple open loop with memory [26]. In this type of control loop, dispatching actions are immediately computed and adjusted every time updated information is available, on the basis of the current traffic state and the previously computed rescheduling decisions. In a closed loop, information updates are taken into account whenever available [11].

*Evaluation context:* The context in which the algorithm designer evaluates his/her proposed algorithm can be classified as: a station or a terminal area, a line, or a network.

*Applicable scenarios:* An algorithm could be designed such that it is applicable only to a subset of the possible disturbance scenarios. In contrast, an algorithm may be able to solve any type of disturbance scenario. An algorithm with the latter functionality could be more relevant in a practical context, where any type of disturbance scenario could arise. The applicable scenarios include the types of disturbances that an algorithm has been demonstrated to be able to solve.


**3.**Algorithm characteristics of four train rescheduling algorithms.

**Table** 



We apply the aforementioned part of the evaluation framework on four train rescheduling algorithms to compare their capabilities. The classification of the four algorithms, shown in Table 3, is based on the descriptions and demonstrations of the approaches in the mentioned references. Hence, detailed information is not available to cover all aspects to the same extent. Furthermore, other versions of those approaches may also exist and be in use in other settings.

#### **5. Framework Part II: Key Aspects for a Systematic Evaluation of Algorithm Performance**

This section presents the second part of the framework, which suggests a selection of key aspects to be used in systematic performance evaluation and benchmarking of algorithms.

During train rescheduling, the objective(s) of an algorithm refer to the aspects that are to be minimized and/or maximized in the solutions [5]. A train rescheduling algorithm may produce one or more rescheduling solutions. The objectives that drive the computation of solutions indicate the quality aspects that are important to be considered by the algorithm. However, there may also be other properties of the produced solution(s) that affect their relevance and acceptability, of which some properties may be difficult to incorporate explicitly in the computations of good rescheduling solutions. For example, an IM may want to assess the robustness of the timetables produced by the algorithm. This property is typically easier to define and compute once the solutions are generated, but less suitable to formalize as a constraint or penalty.

Having a standard set of quality indicators allows comparison of solutions computed by different algorithms, irrespective of the objectives of the specific algorithms. A quality indicator may be comprised of one metric or a set of metrics. These metrics can be used to compare algorithms and to reveal their strengths and weaknesses. These can also be used to monitor the search process of an algorithm and explicitly guide the search for improved solutions [30]. Table 4 lists the seven quality indicators and their corresponding metrics that are considered in the proposed evaluation framework.


**Table 4.** A description of the chosen quality indicators.

1. *Train punctuality:* The percentage of trains that arrive at their final destination within a given threshold of *t* minutes represents train punctuality. This metric is frequently used by railway organizations and in rail literature, with various threshold values, e.g., 0 min [16], 3 min [20,29], 5

min [21]. For this reason, we use the percentage of early, on-time and delayed trains (for different threshold values) as metrics for train punctuality.

2. *Train delays:* Tardiness of a train at a relevant point in the network, e.g., a station, is its delay in arriving at the point [16], within a chosen *t* min threshold. Total accumulated delay (TAD) is the total tardiness of all trains at their intermediary, scheduled commercial stations (a commercial station is a station where the train stops for alighting passengers). Total final delay (TFD) is the total tardiness of all trains at their final destinations.

The tardiness metrics TAD3 and TFD3 capture the delays for a threshold of 3 min. The delays incurred in the solutions output by an algorithm are expected to be as close as possible to the optimal.

Total accumulated delay is an important delay metric often used in railway operations analysis [31]. Total final delay is a frequently used metric in the objective functions of existing train rescheduling algorithms proposed by the research community. The reason for selecting a threshold of 3 min is due to its use by the Swedish railway authority to continuously monitor and log arrival and departure train delays in the associated traffic management system. Furthermore, delays larger than 3 min are more likely to cause the passengers to miss train connections, compared to smaller delays.


Note that, if the rescheduling time window does not include the freight train's complete journey, the first and last stations in the problem instance are considered to be the departure and arrival yards, respectively.

5. *Passenger delays:* Total passenger delay (TPD) captures the total delays experienced by all passengers while alighting at their destinations. This metric is frequently used in the rail literature to estimate passenger inconvenience [23]. We multiply the number of alighting passengers with the associated train delay at that particular train stop, where only a delay larger than a threshold of 3 min is counted. This metric, called TPD3, is used to estimate the inconvenience that the passengers would experience due to the rescheduled timetable.


#### **6. Train Rescheduling Algorithms Used in the Experiment**

We conduct an experiment on two alternative algorithms and apply the proposed framework to assess and compare the performance of each algorithm. In this section, we describe the two chosen algorithms.

When considering the perspectives of multiple stakeholders, solving the train rescheduling problem with separate multiple objectives may be more beneficial and natural than other approaches, e.g., the weighted sum approach [6]. An a priori method for multi-objective optimization requires the preference information regarding the objectives to be expressed prior to the solution process [32], e.g., the lexicographic method [33]. In contrast, an a posteriori method returns a solution set which is a representative of the Pareto-optimal solutions. The final solution is then chosen from the available set, either by using another method or by the decision maker. Below, we present a short description of the two train rescheduling algorithms that are evaluated and benchmarked in the experiment.

#### *6.1. Algorithm 1: An a Posteriori Multi-Objective Parallel Heuristic Algorithm*

Algorithm 1 (denoted ALG1 hereafter) is an extension of the multi-objective parallel heuristic algorithm presented in [22]. The algorithm constructs (and simultaneously navigates) a binary tree by iteratively detecting and resolving conflicts. The root node corresponds to the original timetable which turns infeasible due to the disturbance. At each node, a conflict detection operation is performed on the corresponding timetable. The detected conflicts are arranged in a chronological order and the first conflict is chosen to be resolved. Each internal node of the binary tree represents a conflict between exactly two trains. Each outgoing edge corresponds to the rescheduling decision made as a part of conflict resolution. Reordering, retiming trains, and local rerouting are the employed rescheduling tactics. Leaf nodes in the unpruned branches correspond to feasible solutions.

The search tree construction is decomposed into disjoint tasks, which are computed in parallel by multiple threads. Starting with the root node, each node is visited using a parallel depth-first search (DFS) strategy to find the best solution. A selection of evaluation metrics (e.g., TFD3, TPD3) are used for pruning. Throughout the search, all the parallel threads share and update the values of the upper bound of each selected metric. Based on these values, the branches leading to undesirable solutions are pruned. The parallel program can be launched with the required number of parallel threads. Once the specified number of threads are created, each thread runs in parallel an instance of the sequential DFS.

In any intermediate timetable state, i.e., at any internal node, first, one of the trains in the chosen conflict is prioritized over the other. Typically, each of the two outgoing edges corresponds to a prioritization alternative. Then, a child node is created by performing the following actions: (i) by locally rerouting the unprioritized train if an empty track is available throughout the train's occupancy of the conflict section, (ii) otherwise, by making the unprioritized train wait on a prior section (likely causing a reordering), and retiming it accordingly to resolve the conflict. Thus, reordering is always accompanied by retiming. Each edge in the binary tree, i.e., each rescheduling decision, corresponds

to either (i) a track reassignment, (ii) retiming, or (iii) reordering and retiming, of a train. See [22] for further details about the algorithm.

The following extensions to [22] have been done to construct ALG1:


Table 5 summarizes the main differences between ALG1 and the multi-objective algorithm in [22]. In the experiment, ALG1 is configured to consider two objectives: minimizing TFD3 and TPD3. It is run in parallel using eight threads, and equal weights are used in the TOPSIS method.


**Table 5.** Main differences between the multi-objective algorithm in [22] and ALG1.

#### *6.2. Algorithm 2: An a Priori Multi-Objective Optimization Model Solved Using a Commercial Optimization Software*

Algorithm 2 (denoted ALG2 hereafter) is a lexicographic extension of the single-objective event-based MILP model described in [35] and originally proposed by Törnquist and Persson [19]. The MILP model in [19] has more restrictions than the slimmed-down version (i.e., ALG2) used for our benchmarks in the experiment. The two algorithms, ALG1 and ALG2, use the same constraints and problem formulation.

When using the lexicographic method, preferences of the objectives are imposed by ordering the objective functions according to the practitioner's choice, rather than by assigning weights. The advantages of this method are that it does not require the objective functions to be normalized, and it always provides a Pareto-optimal solution as output [33].

The Java code corresponding to the implementation of the MILP model is extended by adding the code in Listing 1, thus making it a multi-objective algorithm. The setObjectiveN() method is used to set five objectives with different priorities. In Listing 1, a unique integer index in [0, *n* − 1] is assigned to each of the *n* objectives, as required by the employed commercial solver. The integer priority for each objective is assigned, keeping in mind that the larger the value, the higher is the priority. The solver allows lower-priority objectives to degrade those with a higher priority by the specified absolute or relative tolerance amounts (abstol or reltol, respectively). In our lexicographic approach, we restrict lower-priority objectives from degrading the values of higher-priority objectives by specifying the values of abstol and reltol as zero.

**Listing 1.** Code that corresponds to adding multiple objectives.

```
// setObjectiveN(expression, index, priority, weight, abstol, reltol, name)
// Primary objective and four other objectives
model.setObjectiveN(tfd3, 0, 4, 1, 0, 0, "TFD+3min");
model.setObjectiveN(tad3, 1, 3, 1, 0, 0, "TAD+3min");
model.setObjectiveN(tpd3, 2, 2, 1, 0, 0, "TPD+3min");
model.setObjectiveN(track_reassignments, 3, 1, 1, 0, 0, "Track reassignments");
model.setObjectiveN(event_end_deviations, 4, 0, 1, 0, 0, "Deviations");
```
After extending the implementation of the MILP model by using setObjectiveN() with the appropriate arguments, we employ the commercial solver, which uses the following algorithm (called ALG2 in this paper), to solve the extended model. ALG2 first searches for an optimal solution for the highest-priority objective, i.e., minimizing TFD3. It then searches for an optimal solution for the next objective [36], i.e., minimizing TAD3, but only from among the solutions with optimal value of TFD3. The algorithm then searches for an optimal solution that minimizes TPD3 from among the solutions with optimal TFD3 and TAD3. Similarly, the algorithm searches for an optimal solution that minimizes track reassignments. Finally, the algorithm finds an optimal solution that minimizes event end-time deviations and outputs that solution to the user. Note that the relative prioritization between the five objectives in ALG2 is not connected in any way to the proposed framework.

#### **7. Description of the Experiment**

In Section 7.1, we describe the input dataset and the disturbance scenarios used in the experiment. In Sections 7.2–7.4, we describe the experimental setup, the gathering of ideal points, and the selected statistical test, respectively. Finally, in Section 7.5, we discuss the fairness in benchmarking the algorithms.

#### *7.1. Dataset and Scenarios*

A railway network in the southern part of Sweden is chosen for the experiment. The chosen network comprises the railway stretch between Karlskrona-Malmö, via Kristianstad and Hässleholm (see Figure 1). The railway line is (i) single-track from Karlskrona to Hässleholm, (ii) double-track from Hässleholm to Malmö, (iii) with four tracks between Arlöv and Malmö. The original timetable consists of mixed traffic. It includes (i) regional passenger trains that take a travel time of 1.5 h between Karlskrona and Kristianstad, and 1 h between Kristianstad and Malmö, (ii) freight trains that run different stretches, (iii) long-distance passenger trains with a piece of their journey between Hässleholm-Malmö. Table 6 presents the characteristics of the problem dataset used in the experiment.

The 30 disturbance scenarios constituting the dataset are described in Table 7. All of them occur during peak hours: 4:00 p.m.–6:00 p.m. A rescheduling time window of 1.5 h is considered. The time window starts from the time of occurrence of the disturbance.

In the first ten disturbances, a passenger train initially experiences a primary temporary delay (of 7–25 min) at one section within the infrastructure. In each of the next ten disturbances, a passenger train has a malfunction, resulting in increased minimum running times (between 20–100%) on all sections it plans to occupy. For these scenarios, the percentage increase in the minimum train running time, e.g., 20%, is mentioned. In the final ten scenarios, the disturbance is due to an infrastructure failure causing, e.g., a speed reduction on a particular section, which results in increased minimum train running times (of 2–6 min) for all trains running through that section. Table 7 shows, for each disturbance, the total number of trains with a primary delay. For the last ten disturbance scenarios, the number of freight trains incurring initial primary delay is also mentioned.

**Figure 1.** Considered infrastructure: Karlskrona-Malmö [37].


**Table 6.** Characteristics of the problem dataset.


**7.** The dataset containing 30 disturbances.

**Table** 

#### *7.2. Experimental Setup*

The experiment is performed on a laptop equipped with a quad-core CPU (Intel Core i7-8550U(Santa Clara, CA, USA) ) and 16 GB RAM. The underlying operating system is Windows 10 Education. (Redmond, WA, USA). The compiler used to compile the C++ code corresponding to ALG1 is Microsoft C/C++ Optimizing Compiler Version 19.14 for x64. The Gurobi Optimizer version 8.0.0 was used to solve the MILP model, with the default number of parallel threads (i.e., eight threads). ALG1 is also configured to run using the same number of threads.

#### *7.3. Gathering the Ideal Point for Each Scenario*

In order to assess the train and passenger delays in the best solutions computed by ALG1 and ALG2, we need to have a common reference. The performance of a solution in terms of optimality can be quantified by computing its closeness to the ideal point. We compute the ideal point for each disturbance scenario by using the optimization solver to optimize each objective individually. For example, in this case, by solving the rescheduling problem three times using the single-objective MILP model with the following objectives: (i) minimizing TFD3, (ii) minimizing TAD3, (iii) minimizing TPD3. Typically, the ideal point is hypothetical, i.e., it often does not exist in the solution space [33]. Table 8 shows the computation of ideal point for disturbance scenario 1.


**Table 8.** Computing ideal point for disturbance scenario 1.

#### *7.4. Statistical Analysis*

Given the results obtained for the input problem dataset, for a performance indicator, we want to confirm or reject statistically that there is a significant difference in the performance of the algorithms. To this end, the results of the experiment are analyzed using a statistical test. A general assumption of many statistical tests, called parametric tests, is that the data are normally distributed [38]. Non-parametric statistical tests, unlike their parametric counterparts, make no assumption about the data distribution. Hence, we use a non-parametric test called Wilcoxon signed-rank to test our hypotheses.

Using the Wilcoxon signed-rank test, we test if there is a difference between the performance of the two algorithms. The null hypothesis is that the median of the differences between pairs of observations is zero [39]. The *p*-value is interpreted as the probability that the difference in the medians of the observations (corresponding to the two algorithms) can be attributed to chance alone [40]. We apply the two-tailed Wilcoxon signed-rank test with a significance level *α* = 0.05. We reject the null hypothesis if the obtained *p*-value is less than 0.05. Rejecting the null hypothesis shows that the difference between the performance of the two algorithms was unlikely to occur by chance. We use the scipy.stats.wilcoxon() function from the open-source software SciPy to perform the Wilcoxon signed-rank test. In the next section, wherever relevant, we mention the *p*-value obtained from the aforementioned test.

#### *7.5. Fairness in Benchmarking the Algorithms*

Beiranvand et al. [41] provide key recommendations for a fair benchmarking of optimization algorithms. Based on their recommendations, we described the algorithms, their parameters, the problem dataset, the computational environment, and the employed statistical techniques with an acceptable level of detail.

If the goal of algorithm comparison is to determine the best algorithm to use for a particular real-world application, using a real-world dataset is typically the best option [41]. The problem dataset used in our experiment is based on real-world data, in contrast to an artificial dataset. When benchmarking optimization algorithms, measuring wall-clock time is very relevant in real-world settings. The other alternative is to measure CPU time, which has its pros and cons [41]. In order to maximize the reliability of the collected data, we ensure that the background operations of the computer are kept to a minimum.

Many studies that compare optimization algorithms use basic statistics (e.g., average execution time) to report the experimental results [41]. Though it is reasonable to report those, a disadvantage is that such statistics provide little information about the overall performance of the compared algorithms. Numerical tables allow comprehensive reporting of benchmarking results and are recommended to be reported for the sake of completeness [41]. We report detailed numerical tables and analyze the results in a transparent and fair manner.

#### **8. Results and Discussion**

The application of the first part of the evaluation framework was demonstrated in Table 3 of Section 4. The table comprised four algorithms, of which the first two algorithms are the bases of ALG1 and ALG2, respectively. For most of the characteristics, both ALG1 and ALG2 have the same values as their base version algorithms (shown in Table 3). The remaining five characteristics are shown in Table 9.


**Table 9.** Algorithm characteristics of the two algorithms (abridged).

In this section, we demonstrate the application of the second part of the evaluation framework proposed in Section 5. The results from the experimental benchmark of ALG1 and ALG2, based on the evaluation framework, are presented and analyzed. The detailed results are shown in Figures 2–5 and Tables 10–14.

For each scenario, the table cells corresponding to the algorithm with a comparatively large value are highlighted in grey. The average values of the recorded metrics are shown using Tables 15–19.

In the solutions obtained by the two algorithms, train punctuality is shown in Figure 2 and Table 10. Train delays are recorded in Table 12 and compared to the optimal values. Delay propagation is shown in Table 14. This table records the number of trains experiencing secondary delays anywhere in their itinerary, in the obtained rescheduled timetables. Freight train performance is shown in Figures 3 and 4. Track reassignments and passenger delays are shown in Tables 11 and 13, respectively. Computation times of the two algorithms are shown in Figure 5.

When the delay in the obtained solution: (i) is within 1% of the optimal value, the corresponding cell is not highlighted, (ii) is within 20% of the optimal value, the table cell is highlighted in light grey and (iii) is greater than 20% of the optimal value, the cell is highlighted in grey.


**Table 10.** Train punctuality.

**Table 11.** Track reassignments.



**Table 12.** Train delays.




**Table 15.** An overview of train punctuality: average values over all scenarios.

**Table 16.** An overview of train delays: average values over all scenarios.


**Table 17.** An overview of delay propagation: average values over all scenarios.


**Table 18.** An overview of freight train performance: average values over all scenarios.


**Table 19.** An overview of track reassignments: average values over all scenarios.


#### *8.1. Train Punctuality*

A general observation is that, in the solutions obtained by ALG2, the percentage of trains exactly on time is typically higher compared to the solutions obtained by ALG1 (see Figure 2b). Furthermore, in the solutions obtained using ALG1, trains often reach their final destination earlier than initially planned in the original timetable, while, in the solutions generated by ALG2, trains rarely arrive at their final station earlier than the originally planned arrival time (see Figure 2a). This makes sense as there is no penalty for early train arrivals in ALG1, whereas one of the objectives of ALG2 is to minimize the end time deviations in train events, albeit the objective with the least priority.

What can also be observed from Figure 2b is that the share of affected trains is significantly larger in scenarios 21–30, which primarily is an effect from the initial source of delay, i.e., a temporary infrastructure failure that immediately affects multiple trains.

ALG1 typically provides solutions with a higher percentage of delayed trains (see Table 10). In the solutions to scenarios 1–10 obtained using ALG1, no train experiences a delay ≥ 30 min at their final stations. In the solutions obtained by the algorithms for scenarios 21–30, trains are always punctual at their final stations within 10 min (see Table 10). However, in the solutions obtained by ALG1, more trains experience smaller delays. See the comparatively higher percentages of delayed trains for ALG1 for scenarios 21–30 in Table 10.

#### *8.2. Train Delays*

According to the average values in Table 16, ALG2 outperforms ALG1 in obtaining solutions with smaller train delays. The statistical significance of the difference in the performance of the two algorithms could be confirmed only for TFD3 (*p* = 0.002). For TAD3, statistical significance could not be confirmed from the obtained results (*p* = 0.29). This means that the difference in the values of TAD3 for the two algorithms is more likely to have occurred by chance.

For several of the scenarios 1–10, the TFD3 in ALG1's solutions is often rather far from the optimal value (see Table 12). In contrast, ALG2 always obtains a solution with optimal TFD3, even if it means causing a large delay to a single train. For example, in the solution obtained by ALG2 for scenario 6, although the solution's TFD3 is minimized (54.9 min), a train experiences a delay ≥ 30 min at its final station (see Table 10). ALG1's solution for scenario 6 has a significantly larger TFD3 (78.1 min). However, in that solution, no train experiences a final delay ≥ 30 min. This is a good example of the trade-off between reducing individual train delays and reducing total train delays.

For a majority of the scenarios 11–30, both the algorithms found either ideal solutions or solutions that are very close to ideal (see Table 12). This is an interesting result, as the ideal point is generally expected to be unattainable [33]. A trade-off was expected between minimizing final and accumulated delays; we did not expect to obtain a solution with optimal TFD3 as well as optimal TAD3. It is surprising that such a trade-off between TFD3–TAD3 does not occur more frequently in Table 12.

An interesting observation can be made from the results obtained for scenarios 3, 6, 14, and 18. For these four scenarios, ALG1 produces solutions that have a smaller TAD3, compared to ALG2 (see Table 12). Note that ALG1 does not try to minimize TAD3. However, in the obtained solutions, the delays accumulated by trains at commercial stops are close to optimal. On the other hand, ALG2 has minimizing TAD3 as its second objective. A reason for this anomaly is that ALG1, while minimizing the passenger delays, indirectly reduces the delays experienced by trains at commercial stops. On the other hand, once ALG2 optimizes the final delays (i.e., the value of TFD3) and uses it as a bound, it cannot reduce the accumulated delays beyond a certain point.

#### *8.3. Delay Propagation*

According to the average delay propagation values in Table 17, ALG2 outperforms ALG1 in obtaining solutions with less delay propagation. We could confirm the difference in the performance of the two algorithms as statistically significant only for secondary delays ≤ 3 min (*p* = 0.03). The differences in the obtained secondary delays > 3 min are not statistically significant (*p* = 0.09). The latter shows that the performance difference is likely to have occurred by chance.

In the solutions of ALG1, many trains often incur secondary delays (see Table 14). In comparison, the solutions obtained using ALG2 typically have fewer trains with secondary delays. In the solutions obtained by ALG1 for disturbances 21–30, the delay caused by the disturbance is almost always propagated to other trains. In comparison, in the solutions obtained by ALG2, there is less propagation of delays caused by the disturbance (see Table 14). When using ALG1, secondary delays > 3 min also appear more frequently, compared to ALG2.

#### *8.4. Freight Train Performance*

According to the average values of the metrics used for freight train performance (Table 18), the rescheduling strategy used by ALG1 is problematic from a freight train perspective. Note that none of the algorithms explicitly optimize any metric related to freight trains. In addition, note that, in this experiment, there is no additional time associated with enforcing an unplanned train stop, unlike in, e.g., the model adopted in [19]. Hence, no correlation between the increase of unplanned stops and travel times is expected.

The solutions obtained by the algorithms for scenario 7 are interesting; the freight trains have eight unplanned stops (see Figure 3). For this scenario, ALG1 obtained a solution (i) with larger deviation in freight train departure times (*d* = 16 min) compared to ALG2 (*d* = 8 min), (ii) with minimal increase in travel times (*i* = 2 min) compared to ALG2 (*i* > 10 min). Hence, with respect to freight train travel times, the solution of ALG1 may be seen as a good alternative to ALG2's solution.

Disturbance scenarios 11–20 are those where a passenger train runs slower throughout its route. For a majority of these scenarios, ALG2 obtains solutions in which the values of *d*, *i*, and *u* are negligible (see Figure 3). ALG2 shows a similar performance for disturbance scenarios 21–30, wherein it obtains solutions with small values of the considered metrics. In the solutions obtained by ALG1 for the last ten scenarios, the freight trains incur comparatively (i) large departure deviations, (ii) larger number of unplanned stops, and (iii) higher increase in travel times (see Figure 3). The rescheduling performed by ALG1 often caused many freight trains to arrive early (see Figure 4), even when the train initially affected by the disturbance is a passenger train.

#### *8.5. Passenger Delays*

The average passenger delays for ALG1 and ALG2, across all the scenarios, are 837 min and 856 min, respectively. This difference in the performance of the two algorithms concerning passenger delays could not be confirmed as statistically significant (*p* = 0.68). This means that the difference in the values of TPD3 for the two algorithms is more likely to have occurred by chance.

ALG2 often obtained solutions with TPD3 within 1% of the optimal TPD3 (see Table 13). For scenarios 3, 6, 14, and 18, ALG1 produces solutions that have a significantly smaller TPD3, compared to ALG2 (see Table 13). This shows a strength of its approach which simultaneously considers minimizing TPD3 and TFD3 with equal priority. On the other hand, ALG2 has minimizing TPD3 as its third objective. For the aforementioned scenarios, after ALG2 optimizes the train delays and uses them as bounds, it cannot reduce the passenger delays beyond a certain point.

#### *8.6. Track Reassignments*

The average track reassignments in the solutions obtained by ALG1 and ALG2, rounded to the nearest integer, are 5 and 1, respectively (see Table 19). This difference in algorithm performance concerning total track reassignments is statistically significant (*p* = 0.01).

In the configuration of ALG2, minimizing the number of track reassignments is an objective that has little priority. Irrespective of that, the algorithm produces solutions with minimal track reassignments. It is reasonable to assume that the optimal number of track reassignments in a rescheduled timetable is zero. In other words, for the input dataset, a rescheduled timetable can be obtained without making any track reassignments. With that in mind, one can say that, over the entire dataset, ALG2 achieved a very good trade-off between minimizing the delays and the number of track reassignments. Thus, ALG2 produces solutions with minimal track reassignments while giving close-to-optimal train and passenger delays.

For scenarios 1–20, in the solutions output by both the algorithms, only passenger trains incur track reassignments. In case of ALG1, the reason for this is as follows. ALG1 does not reallocate the tracks of trains that are not directly affected by the disturbance. In scenarios 1–20, since the initially disturbed train is a passenger train, only the track allocation of that train is modified during rescheduling. A consequence of this rescheduling strategy is as follows. In each of these scenarios, all the track reassignments in the solutions obtained by ALG1 belong to one train, since ALG1 confines the track reassignments to the initially disturbed passenger train. In contrast, ALG2 changes the tracks of various passenger trains at stations. Only one passenger train incurring a reasonable number of track reassignments could be perceived as an advantage of using ALG1, from a passenger perspective as well as from a dispatching perspective. The reason for the latter is that fewer rescheduled trains make it easier for the dispatcher to supervise during a disturbance.

For scenarios 26–30, the solutions obtained using ALG1 involve many track reassignments (see Table 11). The reason is that, when ALG1 encounters a conflict involving a train with primary delay, it first tries to resolve the conflict by reallocating the train's track. Thus, for the trains with a primary delay, the algorithm prefers track reassignment over retiming and reordering. In each of the disturbances 26–30, more than 24 trains incur primary delays (as shown in Table 7). Due to the rescheduling strategy employed by ALG1, the solutions obtained for these scenarios have many track reassignments (see Table 11). In contrast, for these scenarios, the solutions produced by ALG2 do not involve any track reassignments. Both the algorithms obtained almost-ideal train delays and passenger delays in the solutions for disturbances 26–30. Since ALG2 obtained these solutions without performing any track reassignments, it shows that, for these disturbance scenarios, there is no trade-off between minimizing the number of track reassignments and minimizing the values of train and passenger delays.

#### *8.7. Computation Times*

On average, ALG1 is five times faster than ALG2. It takes around 9 seconds to reach completion, compared to the latter algorithm's average computation time of 47 seconds (see Figure 5). This difference in performance of the two algorithms with respect to computation times is statistically significant (*p* < 0.001).

ALG1 solves any disturbance scenario in the dataset to completion in about 1 min. ALG2 can take up to 6 min to solve specific scenarios to completion in the dataset. Interesting disturbances occur in scenarios 6 and 9, where ALG2 takes more than 5 min to find the solution and prove its optimality. These two are the scenarios for which the Pareto-optimal solution obtained by ALG2 has a non-optimal TAD3 and TPD3 (see Tables 12 and 13). This means that, for these scenarios, a trade-off needs to be made between minimizing e.g., TPD3 and the primary objective of ALG2, which is TFD3. Thus, compared to the output Pareto-optimal solution, a solution with a lower value of TPD3 cannot be obtained without increasing the value of e.g., TFD3. The longer computation times for these scenarios could be due to ALG2 trying to prove the Pareto-optimality of the obtained solution before reaching completion.

Similar to the case of scenarios 6 and 9, ALG2 takes a longer amount of time to reach completion for scenario 18. This is also a disturbance scenario for which the obtained Pareto-optimal solution has a non-ideal TAD3 and TPD3. Figure 6 shows the progress of ALG2 while solving scenario 18. Notice that, for this scenario, most of the time is spent in finding solutions rather than proving optimality of the found solutions. While minimizing TPD3, around 9 sec is taken by ALG2 to realize that the TPD3 of the obtained solution cannot be improved further, without increasing the values of the primary and the secondary objectives (TFD3 and TAD3, respectively). The gaps in Figure 6 correspond to the time taken for presolving and root relaxation.

**Figure 6.** The progress of ALG2 for scenario 18.

From Figure 5, Tables 12 and 13, the following can be observed. For a particular disturbance scenario, whenever there exist solutions in the search space such that there is no significant tradeoff between the first three objectives, ALG2 takes close to the average computation time to find the solution. It is noticed that ALG2 takes longer computation times only when there is a trade-off among the minimization objectives. No such trend could be observed for ALG1, other than the fact that it takes longer than its average time for a few disturbance scenarios.

#### **9. Conclusions and Future Work**

The main purpose of the paper was: (i) to present a framework for classification, evaluation and comparison of alternative algorithms and (ii) to demonstrate its application and relevance. The presented framework can be extended and adapted to fit the purpose of other similar evaluation studies. It should be seen as a module-based framework wherein the user can add or exclude certain indicators, e.g., when no freight train traffic is analyzed, the related metrics can be excluded. A user can also include additional metrics of interest [21] in a particular indicator. For example, the maximum secondary delay [42] can be included in the delay propagation indicator. The reason is that it can be useful to log and compare the train that experiences the largest secondary delay as well as the magnitude of that delay.

When using the framework, if a user is satisfied with a subset of the presented indicators and metrics, he/she can choose to stop collecting further measures. Then, as the next step, the user can make a qualitative assessment of particular solutions that each algorithm produces, based on the requirements. An example of such an assessment is shown in [20], where the computed rescheduling solutions are analyzed and scrutinized with other qualitative properties in mind, in addition to the quantitative metrics.

Two noteworthy extensions to the framework are as follows: (i) In addition to measuring the time taken by algorithms to reach completion, one can decide a time limit of e.g., 15 seconds and compare the best solutions obtained by the algorithms within that limit, (ii) For each disturbance scenario, one can collect the solutions obtained by the algorithms during rescheduling as time progresses. The collected solutions can then be analyzed, based on a selected metric, using progress over time plots, such as in [35]. The two extensions are difficult to implement for parallel algorithms, since the order in which solutions are explored/obtained by such algorithms is typically non-deterministic. With large problem datasets, it is not practically viable to implement the second extension into the evaluation framework, even for sequential algorithms.

ALG1 is a heuristic algorithm that considers two minimization objectives with equal priority: minimizing TFD3 and TPD3. ALG2 is an exact algorithm that has five minimization objectives, with minimizing TFD3 as the primary objective. A threshold of 3 min is considered for all the delays appearing in the minimization objectives. Based on the carried out evaluation, we analyze the overall strengths and shortcomings of the two train rescheduling algorithms and their application when solving the 30 disturbance scenarios. A strength of ALG1 is that it is good at quickly finding solutions with small passenger delays. Weaknesses of ALG1 are apparent when it is solving disturbances due to an infrastructure failure. The solutions obtained for these scenarios have significant delay propagation, unsatisfactory freight train performance and many track reassignments.

The strength of ALG2 is its ability to reschedule during infrastructure failures. In the studied scenarios, these failures are of rather modest size. When solving these disturbances, ALG2 is certainly the better choice, since it obtained significantly better solutions and is always within 30 seconds. The main weakness of ALG2 is its speed, particularly while solving disturbances 1–20. Typically, ALG2 obtained good rescheduling solutions for all the considered disturbances. However, compared to ALG1, ALG2 is slow in obtaining solutions. ALG2 took as long as 6 min for a disturbance that is solved in 6 sec using ALG1 (see scenario 6).

When solving disturbances caused by a delayed or a malfunctioned train, a dispatcher can use ALG1 to quickly obtain a decent solution. If the comparatively slower ALG2 is to be used to solve these disturbances, the following are a few suggestions to improve its speed: (i) reduce the number of objectives by merging two or more of the lower-priority objectives into a single objective and (ii) increase the number of parallel threads simultaneously exploring the solution space. A suggestion to improve the practicability of ALG1 is to limit the number of track reassignments while solving disturbances where multiple trains have primary delays. This can result in more practical rescheduling solutions that are easier to implement.

For several of the disturbances considered in the dataset, ideal rescheduling solutions were obtained (with respect to TFD3, TAD3 and TPD3). For most other disturbance scenarios, the Pareto-optimal solution obtained by ALG2 is very close to the hypothetical ideal solution. The frequent existence of a feasible solution in the solution space that simultaneously minimizes TFD3, TAD3, and TPD3 is surprising. Future work could investigate the conditions under which a multi-objective train rescheduling problem contains an ideal solution in its solution space, particularly with respect to the train and passenger delays.

**Author Contributions:** Conceptualization, S.P.J. and J.T.K.; Methodology, S.P.J. and J.T.K.; Software, S.P.J. and J.T.K.; Validation, S.P.J; Formal analysis, S.P.J.; Investigation, S.P.J.; Resources, S.P.J. and J.T.K.; data curation, J.T.K.; Writing—original draft preparation, S.P.J.; Writing—review and editing, S.P.J., J.T.K. and L.L.; Visualization, S.P.J.; Supervision, J.T.K. and L.L.; Project administration, J.T.K.; Funding acquisition, J.T.K. All authors have read and agreed to the submitted version of the manuscript.

**Funding:** This research was funded by the Shift2Rail project: FR8Rail II (Grant No: 826206), and the research project BLIXTEN (Grant No: TRV 2018/108023) funded by Trafikverket via KAJT.

**Acknowledgments:** Many thanks to the anonymous reviewers for their constructive comments, which greatly improved the paper. The authors would also like to thank Tomas Lidén and Martin Joborn for providing valuable comments on several sections of a preliminary version of this paper.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


TOPSIS Technique for Order of Preference by Similarity to Ideal Solution

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Comparison of Ensemble and Dimensionality Reduction DEA Models Based on Entropy Criterion**

#### **Parag C. Pendharkar**

Information Systems School of Business Administration, Pennsylvania State University, Harrisburg 777 West Harrisburg Pike, Middletown, PA 17057, USA; pxp19@psu.edu; Tel.: +1-717-948-6028

Received: 28 July 2020; Accepted: 14 September 2020; Published: 16 September 2020

**Abstract:** Dimensionality reduction research in data envelopment analysis (DEA) has focused on subjective approaches to reduce dimensionality. Such approaches are less useful or attractive in practice because a subjective selection of variables introduces bias. A competing unbiased approach would be to use ensemble DEA scores. This paper illustrates that in addition to unbiased evaluations, the ensemble DEA scores result in unique rankings that have high entropy. Under restrictive assumptions, it is also shown that the ensemble DEA scores are normally distributed. Ensemble models do not require any new modifications to existing DEA objective functions or constraints, and when ensemble scores are normally distributed, returns-to-scale hypothesis testing can be carried out using traditional parametric statistical techniques.

**Keywords:** data envelopment analysis; dimensionality reduction; ensembles; exhaustive state space search; entropy

#### **1. Introduction**

Data envelopment analysis (DEA) is a prominent technique for the non-parametric relative efficiency analysis of a set of decision-making units (DMUs) drawn from a similar production process [1]. DEA models are used in both operation research and data mining literature [2]. Some of the traditional properties of production functions, such as the monotonicity and convexity of the inputs and outputs, that are fundamental in DEA models are often found to be attractive in some data mining models where datasets are noisy and model resistance to learning noise is necessary [3]. An important aspect of DEA models is the reliability of DMU efficiency scores. It is generally accepted that the DEA efficiency estimates are reliable when the sample size is large [4]. Since the reliability of the DEA scores is dependent on the sample size, Cooper et al. [5] have suggested the following rule for the minimum number (*n*) of DMUs for reliable DEA analysis (each DMU has *m* inputs and *s* outputs):

$$m \ge \max\{3(m+s), \ m \times s\}\tag{1a}$$

For small-size datasets, where violations of the minimum number of DMUs specified by Equation (1a) frequently occur, dimensionality reduction (also known as variable reduction or variable selection) approaches are frequently used to select a subset of variables to satisfy Equation (1a). A variety of variable selection approaches are available in the literature. Among these variable selection approaches are statistical [6], regression [7], efficiency contribution measure [8], bootstrapping [9], hypothesis testing [10], variable aggregation [11] and statistical experiment designs [12]. Variable selection approaches are criticized extensively for applying parametric procedures and linear relationship assumptions for selecting variables to determine an unknown non-linear and non-parametric efficiency frontier. Nataraja and Johnson [13] provide a good description of some of these procedures and their pros and cons.

Pendharkar [14] proposed a competing approach to the dimensionality reduction/variable selection problem called the ensemble DEA. In his approach, traditional DEA analysis is conducted for all possible input and output combinations, and the efficiency scores of each DEA model for each DMU are averaged as an ensemble efficiency score for a DMU. Drawing from machine learning literature, Pendharkar [14] showed that the ensemble efficiency score is a reliable estimate of the "true" efficiency of a DMU. Even for small datasets, certain combinations of inputs will satisfy the criterion set by Equation (1a), while others will violate it, but the average ensemble score will be closer to the true efficiency of the DMU and will be reliable. Pendharkar [14] also proposed an exhaustive search procedure to generate all possible input and output combinations, and proposed a formula to compute the number of unique DEA models that need to be run to compute an average ensemble score. This number *N* of unique DEA models may be computed using the following formula:

$$N = \left(\sum\_{i=1}^{m} \binom{m}{i}\right) \times \left(\sum\_{i=1}^{s} \binom{s}{i}\right) = (2^m - 1) \times (2^s - 1). \tag{1b}$$

Using Banker et al.'s [15] variable-returns-to-scale (VRS) DEA BCC model, and data and models obtained from a few studies in the literature, Pendharkar [14] showed that the ensemble DEA model provides a better ranking of DMUs than the models proposed in a few studies from the literature.

This research investigates the additional properties and statistical distribution of the ensemble DEA model scores. It is shown that there are added benefits of ensemble efficiency scores. In particular, the ensemble efficiency scores maximize entropy, meaning that the DMU ranking distribution generated by the ensemble efficiency scores has a lower bias when compared to some competing radial and non-radial variable selection models recently reported in the literature, and second, the ensemble efficiency scores may be normally distributed under certain restrictive assumptions. The normal distribution of the efficiency score feature is particularly attractive because returns-to-scale hypothesis testing may be conducted by using traditional difference-in-means parametric statistical procedures. Both of these features are tested using data and models reported in a published study [16]. The rest of the paper is organized as follows: In Section 2, the basic DEA radial and non-radial models, ensemble DEA model and Entropy criterion for comparing different DEA models are described. In Section 3, using Iranian gas company data, the results of ensemble DEA models are compared with the results of variable selection models used in Toloo and Babaee's [16] study. Additionally, in Section 3, the properties of the ensemble DEA scores are investigated in terms of the entropy criterion and their statistical distributions. In Section 4, the paper concludes with a summary and directions for future research.

#### **2. DEA Preliminaries, Ensemble DEA Model, Entropy Criterion for DEA Model Comparisons and Statistical Distribution of Ensemble Scores**

The basic DEA model assumes *n* DMUs, with each DMU consisting of *m* different inputs that produce *s* different outputs. The input and output vectors are semi-positive, and for DMU*<sup>j</sup>* (*j* = 1, ... , *<sup>n</sup>*), the space for the input and output vectors (*xj*,*yj*) R*m*+*<sup>s</sup>* <sup>+</sup> . For a DMUo, its relative efficiency may be computed by using the linear programming model under the constant returns-to-scale assumption. This efficiency is computed by solving the following model:

$$\max \sum\_{r=1}^{s} u\_r y\_{r\nu\_r} \tag{2a}$$

subject to:

$$\sum\_{i=1}^{m} v\_i \mathbf{x}\_{i\boldsymbol{\nu}} = 1 \tag{2b}$$

$$\sum\_{r=1}^{s} u\_r y\_{rj} - \sum\_{i=1}^{m} v\_i x\_{ij} \le 0 \quad \text{for all } j = 1, \dots, n \tag{2c}$$

*Algorithms* **2020**, *13*, 232

$$v\_{i\prime}u\_r \ge \varepsilon \text{ for all } i = 1, \dots, m \text{ and } r = 1, \dots, s \tag{2d}$$

where *vi* and *ur* are the weights associated with the *i*th input and *j*th output, respectively. The constant ε > 0 is infinitesimally non-Archimedean. The model (2a)–(2d) is often called the primary CCR model [1], and its dual is written as follows:

$$\text{minimize } \theta - \varepsilon \left( \sum\_{i=1}^{m} s\_i^- + \sum\_{r=1}^{s} s\_r^+ \right), \dots, \dots \tag{2e}$$

subject to:

$$\sum\_{j=1}^{n} \lambda\_j \mathbf{x}\_{ij} + s\_i^- = \theta \mathbf{x}\_{i\nu} \quad i = 1, \dots, m \tag{2f}$$

$$\sum\_{j=1}^{n} \lambda\_j y\_{rj} - s\_r^+ = y\_{n\ast} \quad r = 1, \dots, s\_{\prime} \text{ and} \tag{2g}$$

$$
\lambda\_j, s\_i^-, s\_r^+ \ge 0 \text{ for all } i = 1, \dots, m; j = 1, \dots, n; r = 1, \dots, s \tag{2h}
$$

The VRS BCC model augments the system (2e)–(2h) by adding the following constraint:

$$\sum\_{j=1}^{n} \lambda\_j = 1$$

The aforementioned models are radial DEA models that are criticized for not providing input or output projections (for inefficient DMUs) that satisfy Pareto optimality conditions [17]. Fare and Lovell [18] independently proposed radial DEA models that allow for input or output reductions at variable rates. The radial version of the CCR model is mathematically represented in the following dual form:

$$\text{minimize } \frac{1}{m} \sum\_{i=1}^{m} \theta\_i$$

subject to:

$$\sum\_{j=1}^{n} \lambda\_j \mathbf{x}\_{ij} \le \theta\_i \mathbf{x}\_{i\alpha} \quad i = 1, \dots, m$$

$$\sum\_{j=1}^{n} \lambda\_j y\_{rj} \ge y\_{m\prime} \quad r = 1, \dots, s$$

$$\theta\_i \le 1, \quad i = 1, \dots, m \quad \dots,$$

$$\lambda\_i \ge 0, \quad j = 1, \dots, n$$

Pendharkar [14] proposed an ensemble DEA model based on the popularity of ensemble models in machine learning literature. The ensemble DEA model requires an exhaustive search procedure using a binary vector *z* whose components indicate whether an input or output is considered in performing DEA analysis. The dimension of this binary vector is (*m* + *s*). Figure 1 illustrates the *z* vector and exhaustive search tree for two-input-and-one-output datasets. The exhaustive tree is pruned (dotted edges) for models that have either no inputs or no outputs. DEA analysis is then conducted on the remaining models, and the efficiency results of each model for each DMU are averaged and used as ensemble DEA scores. To illustrate the ensemble DEA approach on a two-input-and-one-output dataset, a CCR DEA analysis using partial Cobb–Douglas production function data on US economic growth between 1899 and 1910 [19] is conducted. Table 1 illustrates the results of our DEA analysis and resulting ensemble scores. The two inputs were labor in person-hours worked per year and the amount of capital invested. The output was the total annual production. The results of the analysis show that the traditional DEA with *z* = [111] does not provide unique rankings (for the years 1901 and 1902 receive the same efficiency score), but the ensemble DEA model provides unique DMU rankings. Pendharkar's [14] study provides a theoretical basis for the reliability of ensemble DEA scores.

**Figure 1.** Exhaustive Search Tree for possible unique combinations of two-input-one-output datasets. **Table 1.** Ensemble data envelopment analysis (DEA) scores for 1899–1910 US economic growth data.


The maximum entropy (ME) principle has been applied to DEA DMU ranking distribution [20] and model comparisons [21]. The ME principle measures the DMU ranking bias by using a more general family of distributions [22]. Several statistical distributions can be characterized as ME densities [23]. The ME distributions are the least biased distributions obtained by imposing moment constraints that are inherent in the data [21]. To obtain the ME for a given set of DMUs and their efficiencies, normalized ranks are first obtained by computing <sup>θ</sup><sup>∗</sup> *<sup>i</sup> <sup>n</sup> <sup>i</sup>*=<sup>1</sup> θ<sup>∗</sup> *i* , for each DMU, and then computing the ME for a certain model *z* as follows:

$$\text{ME}^z = -\sum\_{i=1}^n \left( \frac{\partial\_i^\*}{\sum\_{i=1}^n \partial\_i^\*} \right) \ln \left( \frac{\partial\_i^\*}{\sum\_{i=1}^n \partial\_i^\*} \right)$$

The ME for the DEA models in Table 1 are ME<sup>111</sup> = 2.4768, ME101 = 2.4775 and ME<sup>011</sup> = 2.4757. The model with labor as an input and production as an output (*z* = [101]) has the highest entropy and has the least bias, with a maximum difference between DMU efficiencies for closely ranked DMUs for the years 1901 and 1902. The ensemble entropy is 2.4769, and since it is an average of all *z*-vector combinations, the comparison benchmark for ensemble entropy is the model with *z* = [111]. The ensemble entropy is higher than the benchmark. The highest possible entropy value or upper bound (UB) for a model is given by the following expression:

$$\text{ME}^{\text{UB}} = -n \times \left( \left( \frac{1}{n} \right) \ln \left( \frac{1}{n} \right) \right) \tag{2i}$$

The MEUB for the data in Table 1 is 2.485, and the ensemble entropy is very close to the maximum value. It is important to note that obtaining the maximum value is not always desirable, but it provides a theoretical benchmark estimate for a completely unbiased normalized DMU score distribution.

To compute ensemble efficiency scores, an *n* × *m* matrix *E* of DEA efficiency scores is necessary. The rows of such a matrix are the numbers of DMUs, and the columns are the numbers of models given by the numbers of eligible models considered in computing ensemble efficiency scores. This number of eligible models will have an upper bound given by *N*, computed using Equation (1b). The elements of this matrix will be efficiency scores for each DMU computed for a given model identified by column number. Figure 2 illustrates a five-DMU-and-five-model matrix. The ensemble efficiency score (θ*<sup>E</sup> <sup>i</sup>* ) for each DMU is computed using the following formula:

$$
\theta\_i^E = \frac{\sum\_{j=1}^m \theta\_{ij}^\*}{m} \tag{2j}
$$

**Figure 2.** An illustration of 5 × 5 ensemble efficiency score matrix.

A few observations can be made about any row *i* {1, ... , *n*} of the ensemble efficiency score matrix. First, all the elements of a given row are an independent computation of efficiency scores by the same DMU under a different model number with its unique set of input(s) and output(s). Second, in all the elements of a given row, the DMU is maximizing its efficiency given its model constraints. Thus, each row represents independent evaluations by a DMU under the maximum decisional efficiency (MDE) principle [24]. The MDE principle was introduced by Troutt [25] to develop a function to the aggregate the performance of multiple decision-makers. The underlying assumption of the MDE principle is that all decision-makers seek to maximize their decisional efficiencies. Troutt [26] later used the MDE approach to rank DMUs and showed that DMUs deemed efficient under MDE are also efficient when ranked using the DEA. For a linear aggregator function, such as the one used in Equation (2j), Troutt [26] illustrated that the decisional efficiencies θ can be described by the following probability density function (pdf):

$$\mathcal{R}g(\theta) = c\_{\theta} \epsilon^{\alpha \theta}, \; a > 0 \; \text{and} \; \theta \; \in \; [0, 1] \tag{2k}$$

The pdf in (2k) is monotone, increasing on its interval with a mode at θ = 1 (see Figure 5 for illustration). Using the laws of probability, the value of *c*<sup>α</sup> = <sup>α</sup> (*e*<sup>α</sup> <sup>−</sup> 1)−1. Since each element in a given row of the ensemble efficiency score matrix is an independent evaluation by a decision-maker (i.e., a DMU in an ensemble model) trying to maximize its decisional efficiency θ∗ *ij* for *j* = {1, ... , *m*}, the probability density function for each row (DMU) can be written as:

$$\lg(\theta\_i) = \mathfrak{c}\_{\alpha\_i} \varepsilon^{\alpha\_i \partial\_i}, \ a\_i > 0 \text{ and } \theta\_i \in [0, 1] \tag{2l}$$

The central limit theorem mentions that the cumulative distribution functions (cdfs) of the sums of independently identically distributed random variables asymptotically converge to a Gaussian cdf. The ensemble efficiency scores are normalized sums of independent efficiency assessments that will be distributed with a pdf given by (2l). These sums can be considered independent and identically distributed if α<sup>1</sup> = α<sup>2</sup> = ... = α*n*. Under the restrictive assumption that α<sup>1</sup> = α<sup>2</sup> = ... = α*n*, the ensemble efficiency scores are guaranteed to asymptotically converge to a normal distribution by the central limit theorem. In practice, however, the ensemble efficiency scores are not entirely random or perfectly identically distributed (due to the slight likely variation of Equation (2l)'s α*<sup>i</sup>* parameters for each row), and each ensemble model does introduce a degree of mild randomization. For mild differences in the row pdf parameters α*i*, where α<sup>1</sup> ≈ α<sup>2</sup> ≈ ... ≈ α*n*, the ensemble efficiency scores are likely to be normally distributed. A reader may note that under ideal conditions, where α<sup>1</sup> = α<sup>2</sup> = ... = α*<sup>n</sup>* and individual DMU scores follow Equation (2l)'s distribution, the entropy of the ensemble scores will be highest and close to the highest upper bound mentioned in Equation (2i) because the distribution in Equation (2i) has a mode of 1 (see Figure 5). Thus, it may be argued that the likelihood of normality of the ensemble scores increases when the entropy of the ensemble scores is closer to its upper bound given by Equation (2i). It is important to note that an entropy equal to the exact value of the upper bound given by Equation (2i) is undesirable because at that value, the distribution is a uniform distribution where all the DMUs are fully efficient for all the models. The entropy of the pdf in Equation (2k) is maximized on the interval [0, 1] when the mean of the distribution is greater than 0.5 [27]. Additionally, another important aspect of the distribution of the ensemble efficiency scores is that both the rows and columns of ensemble efficiency scores (Figure 2) play a role in the pdf of the ensemble efficiency scores because the rows represent sampling from the MDE distributions and the columns represent sampling from the distribution of the sums of independent variables. Larger sample sizes increase the statistical reliability and robustness of the results.

#### **3. Comparing Variable Selection Models and Ensemble Model Using Gas Company Data and Entropy Criterion**

For small datasets, many input or output variables are aggregated so that the selected variables satisfy the heuristic given in Equation (1a). There are two problems with all the variable selection approaches. First, they use an artificial criterion to select variables for a non-linear and non-parametric approach. Any artificial/subjective criterion will make some assumptions that are harder to justify. Second, these techniques have several selection parameters and thresholds that often lead to inconsistencies in applying these techniques. For example, Toloo and Babaee [16] illustrate three problems with a variable selection approach and suggested an improved approach. By contrast, the ensemble DEA approach does not make any assumptions, and for small datasets, trying out different input and output combinations and aggregating efficiency scores provide more reliable efficiency estimates than variable selection models. Part of the reason for the stability of ensemble DEA efficiency scores is that, even for small datasets, some DEA models in an ensemble will always satisfy the heuristic given in Equation (1a), which will increase the reliability of the ensemble efficiency scores due to model averaging. This stability of ensemble efficiency scores is illustrated by comparing ensemble scores with the results of models from Toloo and Babaee's [16] study and using the entropy criterion.

To compare the results, the dataset from Toloo and Babaee's [16] study is used. The dataset consists of three inputs and four outputs from an Iranian gas company. The inputs are budget (*x*1), staff (*x*2) and cost (*x*3). The outputs are customers (*y*1), the length of the gas network (*y*2), the volume delivered (*y*3) and gas sales (*y*4). Table 2 lists these data. Table 3 lists the efficiency scores of the ensemble DEA with the CCR and BCC models and models used by Toloo and Babaee [16]. Using formula (1b), a total of 105 unique DEA models were used to compute the DEA ensemble efficiency score.


**Table 2.** The Iranian gas company dataset.

**Table 3.** The results of experiments.


& Results taken from Toloo and Babaee's [16] study.

The entropies of the Ensemble CCR, Ensemble BCC, Non-Radial and Radial models were 2.616, 2.621, 2.615 and 2.599, respectively. The MEUB from Equation (2i) is 2.639. Comparing the Ensemble CCR with the Non-Radial and Radial CCR models shows that the Ensemble CCR model has a higher entropy. Only the VRS Ensemble BCC model has a higher entropy than the Ensemble CCR model. The standard deviations of the Ensemble BCC model are mostly lower than the CCR model's as well. More importantly, the Ensemble CCR model generates unique rankings for the DMUs, whereas the Non-Radial and Radial models generate a tie for three DMUs. The Ensemble BCC model also generates unique rankings, but the differences occur at the third decimal place. The Ensemble BCC efficiency scores for DMU 10, 12 and 13 were 0.960, 0.959 and 0.962, respectively.

Figures 3 and 4 illustrate the numbers of models (out of 105 total models) where a DMU was fully efficient. These figures are useful for understanding to what extent the assumption α*<sup>1</sup>* ≈ α*<sup>2</sup>* ≈ ... ≈ α*<sup>n</sup>* was satisfied for the theoretical normal distribution of the ensemble efficiency scores. For these parameters to be similar, the expectation is that a similar number of fully efficient DMUs should exist across all models. Clearly, some DMUs are never fully efficient under any of 105 models and the assumption of identical distributions is violated. While the assumption is violated, Figure 4 illustrates that some DMUs, e.g., 1, 8, 10, 12 and 13, have a somewhat similar number of fully efficient DMUs to others. These ensemble scores of these DMUs may be considered as normalized random sums generated from identical distributions (such as Distribution 1). All of these DMUs have ensemble efficiency scores greater than 0.95. Similarly, DMUs 5, 6 and 11, in Figure 4, have no fully efficient scores, and these may also be considered as random normalized sums generated from identically distributed pdfs (such as Distribution 2).

**Figure 3.** Number of times a DMU is fully efficient in Ensemble CCR models.

**Figure 4.** Number of times a DMU is fully efficient in Ensemble BCC models.

The ensemble scores for this dataset appear to be random normalized sums from two or more pdfs of the forms given in Equation (2k). Given that these are independent random normalized sums, it can be easily shown that the product of two or more independent MDE pdfs is also an MDE pdf. Figure 5 illustrates two sample MDE pdfs for two different values of alpha. The entropy of an MDE pdf is maximized when the mean of a distribution is greater than 0.5 [27]. For the ensemble BCC model, from Table 3, this criterion is satisfied. The lowest value of the ensemble BCC score is 0.57, which is greater than the mean of 0.5 required to maximize entropy and higher than the lowest values for the efficiency scores for the radial, non-radial and ensemble CCR models. As a result, the ensemble BCC model appears to maximize its entropy slightly better than the ensemble CCR model.

**Figure 5.** The maximum decisional efficiency (MDE) probability density function (pdf) for α = 5 and α = 10, respectively.

While ensemble scores have a minor violation of an identical distribution for some DMUs, a formal test of the normality of the distribution of the ensemble efficiency scores was conducted. Table 4 illustrates the results of these tests. The Shapiro–Wilk statistic for the Ensemble CCR model is 0.944, and that for the Ensemble BCC model is 0.876, which, at 14 degrees of freedom, are non-significant, consistent the null hypothesis that the efficiency score distribution is normally distributed at the 95% level of statistical significance.



A paired sample *t*-test for the difference in mean efficiency scores between the Ensemble CCR and the Ensemble BCC models gives a |*t*|-value of 3.524, which is significant at the 99% level of statistical significance (*df* = 13), indicating that a variable returns-to-scale relationship exists between inputs and outputs. The normality of the ensemble efficiency score distributions increases the power of parametric statistical tests.

#### **4. Summary, Conclusions and Directions for Future Work**

A significant amount of research in the DEA literature has focused on dimensionality reduction/variable selection techniques for small datasets. These techniques are often criticized and have their limitations, with no clear way of selecting which technique is the best. A better approach would be to use an ensemble DEA score that does not make any additional assumptions and provides models that have high entropy values and normally distributed scores under restrictive assumptions. Pendharkar [14], in his study, has already provided a theoretical foundation for the reliability of ensemble DEA scores. The added benefit of ensemble DEA scores is that they provide unique DMU rankings.

The normality of ensemble DEA scores is not guaranteed unless the ensemble DEA scores are normalized sums generated from independent identically distributed MDE pdfs. This assumption may not be strictly satisfied in most real-world datasets, but the current study shows that minor deviation from this assumption may be tolerated because the entropy of all MDE pdfs is maximized when normalized sums have a value greater than 0.5. This means that, typically, the differences in means between the underlying pdfs (Equation (2l)) for ensemble entropy scores will be less than 0.5, and, while these pdfs may not be identically distributed, the means of these distributions will be close, resulting in the likely normal distribution of ensemble scores in most real-world cases. The normality of ensemble DEA scores allows for the application of traditional statistical tests for return-of-scales hypothesis tests. Traditional DEA hypothesis-testing methods are not perfect and are known to be slightly biased [28]. Future research may focus on comparing ensemble DEA-based hypothesis testing with traditional DEA hypothesis testing to identify which method provides reliable results.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **An Unknown Radar Emitter Identification Method Based on Semi-Supervised and Transfer Learning**

#### **Yuntian Feng \*, Guoliang Wang, Zhipeng Liu, Runming Feng, Xiang Chen and Ning Tai**

State Key Lab. of Complex Electromagnetic Environment Effects on Electronics and Information System; Luoyang 471003, China; guoliangwangcemee@163.com (G.W.); zhipengliucemee@163.com (Z.L.); runmingfengcemee@163.com (R.F.); xiangchencemee@163.com (X.C.); ningtaicemee@163.com (N.T.) **\*** Correspondence: fengyuntian2009@live.cn; Tel.: +86-1873-906-7817

Received: 13 November 2019; Accepted: 10 December 2019; Published: 16 December 2019

**Abstract:** Aiming at the current problem that it is difficult to deal with an unknown radar emitter in the radar emitter identification process, we propose an unknown radar emitter identification method based on semi-supervised and transfer learning. Firstly, we construct the support vector machine (SVM) model based on transfer learning, using the information of labeled samples in the source domain to train in the target domain, which can solve the problem that the training data and the testing data do not satisfy the same-distribution hypothesis. Then, we design a semi-supervised co-training algorithm using the information of unlabeled samples to enhance the training effect, which can solve the problem that insufficient labeled data results in inadequate training of the classifier. Finally, we combine the transfer learning method with the semi-supervised learning method for the unknown radar emitter identification task. Simulation experiments show that the proposed method can effectively identify an unknown radar emitter and still maintain high identification accuracy within a certain measurement error range.

**Keywords:** semi-supervised learning; transfer learning; radar emitter

#### **1. Introduction**

Radar emitter identification is the key link in radar reconnaissance. It extracts the characteristic parameters and working parameters on the basis of radar signal sorting. Based on these parameters, we can obtain the information such as the system, use, type and platform of the target radar, and further deduce the battlefield situation, threat level, activity rule, tactical intention, etc., and provide important intelligence support for one's own decision-making [1]. The most commonly used radar emitter identification method is the pulse described word-based method. As new radar systems are born, and the radar is becoming more complex, the method is difficult to cope with the complex electromagnetic environment of modern battlefields. In order to obtain better identification results, researchers began to extract a variety of new features in the time domain [2], frequency domain [3] and time-frequency domain [4] for the identification of radar emitters.

With the rise of deep learning techniques, more and more researchers have applied CNN and DBN in the radar emitter identification task, which achieves good performance. Zhou Z et al. [5] developed a novel deep architecture for automatic waveform recognition, which outperformed the existing shallow algorithms and other hand-crafted, feature-based methods. Cain L et al. [6] investigated an application of convolutional neural networks (CNN) for rapid and accurate classification of electronic warfare emitters. Sun J et al. [7] proposed a deep learning model named as unidimensional convolutional neural network (U-CNN) to classify the encoded high-dimension sequences with big data.

Kong M et al. [8] used the CNN deep learning algorithm to identify the radar radiation sources, which could extract more detailed features of the radar and improve the recognition rate. To cope with the complex electromagnetic environment and varied signal styles, Wang X et al. [9] proposed

a novel method based on the energy cumulant of short time Fourier transform and reinforced deep belief network to gain a higher correct recognition rate for radar emitter intra-pulse signals at a low signal-to-noise ratio.

In the past battlefields, the types of radar emitters are single and limited, and the above methods can solve the problem of radar emitter identification well. However, with the increasing number and variety of radar emitters, many unknown emitters will appear in the future battlefield. As time goes by and the location changes, the current identification methods will face two problems. First, the training data and testing data no longer satisfy the same-distribution hypothesis, resulting in a decrease in the classification performance of the machine learning model. Second, the number of available labeled samples for unknown emitters is seriously insufficient, which may lead to over-fitting of the machine learning model.

In recent years, the transfer learning methods [10] and the semi-supervised learning methods [11] have gained more and more attention. Transfer learning does not require that the training data and testing data meet the conditions of the same distribution in the model training process, and utilizes the knowledge in a large number of known samples for training, which is good for cross-domain learning. However, the transferring of a large amount of irrelevant information will also cause negative transfer, which reduces the effect of identification. Semi-supervised learning can use the information in a small number of labeled samples and find patterns from a large number of unlabeled samples, and then perform classification, avoiding the use of only a small number of labeled samples for training, which may result in over-fitting. However, as information continues to increase, the training data and testing data will also not satisfy the same-distribution hypothesis.

In view of the different characteristics of transfer learning and semi-supervised learning, this paper combines the two methods to propose an unknown radar emitter identification method based on semi-supervised and transfer learning. Firstly, we construct the support vector machine model based on transfer learning, using the information of labeled samples in the source domain to train in the target domain, which can solve the problem that the training data and the testing data do not satisfy the same-distribution hypothesis. Then we design a semi-supervised, co-training algorithm, using the information of unlabeled samples to enhance the training effect, which can solve the problem that insufficient labeled data results in inadequate training of the classifier. Finally, we combine the transfer learning method with the semi-supervised learning method for the unknown radar emitter identification task.

Our major contributions are summarized as follows: (1) Focusing on the actual application scenarios to study radar emitter identification, and simultaneously solving the problem that training data and testing data do not satisfy the same-distribution hypothesis and the problem of insufficient labeled data, which provides a good thinking way for future research in this area; (2) proposing a method combining support vector machine based on transfer learning with semi-supervised co-training algorithm; (3) verifying the interaction between the transfer learning method and the semi-supervised learning method for unknown radar emitter identification task.

#### **2. Relevant Research**

#### *2.1. Transfer Learning*

Transfer learning refers to learning the knowledge in the source domain *Ds*, and using in the target domain *Dt* that is not the same distribution with *Ds* but is related to *Ds*, which makes good the problem of insufficient training data. Unlike traditional machine learning methods, transfer learning [12] does not require training data and testing data to satisfy the same-distribution hypothesis. It can discover and extract knowledge in the source domain *Ds* that matches the distribution of the target domain *Dt* and is useful for identification in the target domain *Dt*.

Then it establishes classification models in the target domain *Dt*, which can make efficient use of existing labeled samples to avoid re-labeling in the target domain *Dt*.

From the perspective of transfer methods, transfer learning includes four basic methods: sample-based transfer [13], feature-based transfer [14], model-based transfer [15] and relationship-based transfer [16]. The sample-based transfer method refers to producing rules according to certain weights, and reusing data samples for transfer learning. The feature-based transfer method refers to mutual transfer by feature transformation, which reduces the gap between the source domain and the target domain, or transforms the data features of the source domain and the target domain into a unified feature space, and then utilizes the traditional machine learning methods for identification. The model-based transfer method refers to finding the parameter information shared between the source domain and the target domain to implement transferring. The relationship-based transfer method has a completely different approach from the above three methods, focusing on the similarity between the source domain samples and target domain samples.

#### *2.2. Semi-Supervised Learning*

The commonly used machine learning methods can be divided into three categories: supervised learning, unsupervised learning and semi-supervised learning. Supervised learning refers to only using labeled samples for training, and may not obtain a model with high generalization ability in the case of fewer labeled samples. Unsupervised learning refers to only using unlabeled samples for training, regardless of labeled samples, which results in a waste of samples. Semi-supervised learning can process a small number of labeled samples and a large number of unlabeled samples at the same time, combining the advantages of supervised learning and unsupervised learning.

The four most commonly used algorithms for semi-supervised learning are Self-Training [17], Co-Training [18], Generative Model [19] and Graph-Based Semi-supervised [20]. The Self-Training algorithm refers to the use of a self-classifier to continuously generate high-confidence samples for improving the final classification performance. The Co-Training algorithm refers to separately training the classifier on two views, which is representative of multi-view learning. The Generative Model-based method means that the data of different categories meets different distributions, and if its conditional probability distribution is known, the parameters of the model can be solved. The Graph-Based, Semi-supervised method refers to passing the label information of labeled samples to unlabeled samples according to the adjacency relationship in the graph, thereby realizing the classification of the unlabeled samples.

#### **3. Unknown Radar Emitter Identification Based on Semi-Supervised and Transfer Learning**

In this section, we use the support vector machine model as the base classifier. Firstly, we construct the support vector machine based on transfer learning, and define the calculation index to measure the transfer ability. Then we study the training effect enhancement method based on the semi-supervised co-training algorithm. Finally, we combine the transfer learning method with the semi-supervised learning method for the unknown radar emitter identification task.

#### *3.1. Support Vector Machine Based on Transfer Learning*

The support vector machine (SVM) model has the characteristics of simple structure and global optimization, and is good at solving small sample and nonlinear problems. Therefore, this section chooses the support vector machine model as the base classifier to perform radar emitter identification.

In the process of constructing the SVM model based on transfer learning, it is necessary to utilize the data in two domains at the same time, namely source domain *Ds* and target domain *Dt*. The data in source domain *Ds* refers to the known radar emitters that are detected during non-wartime, and the data in target domain *Dt* refers to the emerging radar emitters in wartime.

When the amount of data in source domain *Ds* is large, noise in source domain *Ds* affects the use of the data in target domain *Dt*.

In order to better optimize the target equation, this section filters the data with high similarity in the source domain *Ds* in the process of transferring the SVM model, and uses the Euclidean distance to define the distance function σ(*V<sup>i</sup> <sup>s</sup>*, *<sup>D</sup><sup>j</sup> t* ), which can measure the similarity between the source domain data and the target domain data. Its formula is as follows:

$$\sigma(V\_{s\prime}^i, D\_t^j) = -\frac{1}{k} \sum\_{(\boldsymbol{x}\_{\boldsymbol{\beta}}, \boldsymbol{y}\_f) \in D\_l} \exp\{-\beta \parallel V\_s^i - \boldsymbol{x}\_{\boldsymbol{\beta}} \parallel\_2^2\} \tag{1}$$

where *V<sup>i</sup> <sup>s</sup>* is the support vector for source domain, β is the importance degree of *Vs*, (*xj*, *yj*) is the sample in target domain *Dt* and its real category, *V<sup>i</sup> <sup>s</sup>* <sup>−</sup> *xj* <sup>2</sup> <sup>2</sup> is the Euclidean distance between *Vs* and *Dt*, k is the number of samples in target domain *Dt*.

The specific steps of the support vector machine based on transfer learning are shown in Algorithm 1.

**Algorithm 1.** Support vector machine based on transfer learning


$$\min\_{\boldsymbol{w}} \text{0.5 } \|\boldsymbol{w}\|\_{2}^{2} + \text{C} \sum\_{j=1}^{k} \varepsilon\_{j} + \sum\_{l=1}^{m} \sigma(V\_{s\nu}^{l} \boldsymbol{D}\_{l}^{j}) \overline{\boldsymbol{\varepsilon}}\_{l}$$

Where m is the number of samples in source domain *Ds*, C is the penalty term, w is the weights of classification hyperplane in the SVM model.

3. Generate new training set *D* in target domain *Dt*, and retrain the SVM model. The optimization problem of the objective function is described with the Lagrangian coefficient as follows:

$$\text{max}L(\boldsymbol{\alpha}) = \sum\_{l=1}^{m+k} \alpha\_l - 0.5 \sum\_{l=1}^{m+k} \sum\_{j=1}^{m+k} \alpha\_l \alpha\_j y\_l y\_j (\boldsymbol{\alpha}\_l \* \boldsymbol{\alpha}\_j)$$

Where *xi* ∗ *xj* is the inner product of the vector *xi* and the vector *xj*, *yi* is the real category label of *xi*, *yj* is the real category label of *xj*, α*<sup>i</sup>* and α*<sup>j</sup>* are the Lagrangian multipliers, α = (α1, α2, ··· , α*m*+*k*) *<sup>T</sup>* is the Lagrangian multiplier vector.

4. Solve the above optimization problem and get the optimal solution α∗ , which means getting the final SVM model. Its form is as follows:

$$f(\boldsymbol{\chi}) = \operatorname{sign}\left[\sum\_{j=1}^{m+k} y\_l \alpha^\* \left(\boldsymbol{\chi}\_l \star \boldsymbol{\chi}\_j\right) + y\_l - \varepsilon\_l - \left(\sum\_{l=1}^{m+k} \alpha^\* \boldsymbol{\chi}\_l y\_l\right)^T \boldsymbol{\chi}\_l\right]$$

#### *3.2. Transfer Ability*

The transfer ability can reflect the influencing ability of the samples in source domain *Ds* on the target domain *Dt*. The calculation process involves two important indices: the similarity between the sample in source domain *Ds* and the sample in target domain *Dt*; the consistency between the prediction result of the sample *xi* in the classifier f and its real category. Therefore, the calculation formula of transfer ability is as follows:

$$\alpha\_{i} = \sigma(V\_{s'}^{i}D\_{t}^{j}) \* f(\mathbf{x}\_{i}) = = y\_{i} \tag{2}$$

where σ(*V<sup>i</sup> <sup>s</sup>*, *<sup>D</sup><sup>j</sup> t* ) is the similarity distance function, f is the SVM classifier trained by the above transfer learning method, *f*(*xi*) is the predicted value of *xi* in source domain *Ds* by the classifier f, *yi* is the real category label of *xi*. By calculating the transfer ability, it is helpful to select the samples in source domain *Ds* which are related to target domain *Dt*.

#### *3.3. Training E*ff*ect Enhancement Based on Semi-Supervised Co-Training Algorithm*

The transfer learning-based support vector machine can select the appropriate samples from source domain *Ds* for the training on target domain *Dt*, which can improve the final identification performance. Unlike the above, semi-supervised learning can use the unlabeled samples in target domain *Dt* to enhance the final training effect. This section constructs the semi-supervised co-training algorithm based on the base classifier SVM model. The specific steps are shown in Algorithm 2.

**Algorithm 2.** Semi-supervised co-training algorithm


Perform identification on the unlabeled samples in target domain *Dt* by using the classifiers *f*<sup>1</sup> and *f*2, respectively, and obtain the posterior probabilities of the samples belonging to each emitter category, and select p samples with the highest confidence for each category;

Add the selected samples to the training set and retrain the classifiers *f*<sup>1</sup> and *f*<sup>2</sup> on the training set.

End

The two feature sets **x**<sup>1</sup> and **x**<sup>2</sup> in the co-training algorithm refer to two views and need to satisfy sufficient redundancy and conditional independence. Through continuous iterative training, unlabeled samples in the target domain *Dt* are available for labeling, which helps to enhance the training effect.

#### *3.4. Combination of Transfer Learning Method and Semi-Supervised Learning Method*

This section combines the transfer learning method with the semi-supervised learning method, while taking advantage of the two methods, which can use useful information in source domain *Ds* for cross-domain learning, and can enhance the training effect with unlabeled samples in target domain *Dt*. The specific process is shown in Figure 1.

**Figure 1.** Specific process of the semi-supervised and transfer learning algorithm.

The basic idea of the semi-supervised transfer learning algorithm is to first use the small number of labeled samples in the target domain as training data to train two different classification models, namely the SVM model based on transfer learning and the semi-supervised co-training model; then we select some samples from source domain, use the SVM model based on transfer learning to evaluate the transfer ability of each sample, delete the samples that are not related to the target domain and obtain candidate sample set. After this we select some unlabeled samples from the target domain, use the semi-supervised co-training model to evaluate the confidence of each sample, delete the samples with lower confidence and add the remaining samples to the candidate sample set.

In the process of selection training samples, not only must we consider the transfer ability, but we must assess the confidence of the sample's category. Then we add the samples satisfying the conditions to the training set. The above sample selection method is based on the basic assumptions of transfer learning and semi-supervised learning. By repeating the process, the number of labeled samples in target domain *Dt* can be continuously increased.

#### **4. Experiments**

#### *4.1. Experiment Settings*

#### 4.1.1. Experiment Environment

We build the simulation experiment development environment of Windows7 + Matlab2017b + Libsvm3.22, where Libsvm3.22 is used to implement the SVM model as the base classifier. Its kernel function is based on the radial basis function exp(−|*x*−*xi*<sup>|</sup> 2 <sup>σ</sup><sup>2</sup> ). On this basis we use Matlab to realize the transfer learning and semi-supervised learning method in this paper.

#### 4.1.2. Experiment Data

We use the characteristic parameters such as pulse amplitude(PA), carrier frequency (CF), pulse width (PW), pulse repetition interval (PRI) and angle of arrival (AOA) to simulate generating the emitter data of six system-like radars. For the signal parameters, they are set at the same intermediate frequency: 10 MHz, and the sampling frequency is 100 MHz. 1000 signal samples are generated using the above five pulse description words for radar 1, radar 2 and radar 3, respectively, and a total of 3000 signal samples are as known radar emitter data corresponding to the source domain data above.

In addition, 1000 signal samples are generated for radar 4, radar 5 and radar 6, respectively, and a total of 3000 signal samples are as unknown radar emitter data corresponding to the target domain data above. The mean values and standard deviations after normalization of the known radar emitter data and the unknown radar emitter data are significantly different, so they no longer satisfy the assumption of the same distribution, which can be used to verify the transfer learning and semi-supervised learning method. The details of the experiment data are shown in Tables 1 and 2.


**Table 1.** Known radar emitter data.


**Table 2.** Unknown radar emitter data.

A radar signal sample is written as *Wi* = [*PAi*,*CFi*, *PWi*, *PRIi*, *AOAi*] *<sup>T</sup>*. The distribution of the specific parameters of the radar is shown in Figure 2, (a) describes the entire data set from the perspective of parameters PA and AOA, and (b) describes the entire data set from the perspective of parameters of CF, PW and PRI.

**Figure 2.** Distribution and change characteristics of the five parameters of pulse amplitude (PA), carrier frequency (CF), pulse width (PW), pulse repetition interval (PRI) and angle of arrival (AOA).

#### *4.2. Interaction between Transfer Learning Method and Semi-Supervised Learning Method*

The experiment uses the known radar emitter data as labeled samples for auxiliary training, and the unknown radar emitter data as unlabeled samples to be identified. The number of unlabeled samples and labeled samples can be adjusted to verify the interaction between the transfer learning method and the semi-supervised learning method.

First, we keep the number of labeled samples unchanged, and adjust the number of unlabeled samples to verify the impact of the semi-supervised learning method on the transfer learning method. The results are shown in Figure 3. It can be seen from the experiment results that when the number of unlabeled samples is zero, that is, we only carry out transfer learning without semi-supervised learning, the identification accuracy is 11.3% lower than the optimal identification accuracy.

When the number of unlabeled samples is slowly increasing, the identification accuracy will also continue to rise, indicating that the unlabeled samples help to make the transfer learning method select high-similarity samples from the unknown radar emitter data; that is, the semi-supervised learning method is positively correlated with the transfer learning method, and has not weakened it. As the number of unlabeled samples increases further, the identification accuracy will gradually stabilize, indicating that the high-similarity samples in the unknown radar emitter data have been completely screened out, and the optimal recognition rate can reach 93.6%.

**Figure 3.** Impact of the semi-supervised learning method on the transfer learning method.

Secondly, we keep the number of unlabeled samples unchanged, and adjust the number of labeled samples to verify the impact of the transfer learning method on the semi-supervised learning method. The results are shown in Figure 4. It can be seen from the experiment results that when the number of labeled samples is zero, that is, we only carry out semi-supervised learning without transfer learning, the difference between the maximum identification accuracy and the minimum identification accuracy in the classification identification results reaches 17.9%, indicating that the use of semi-supervised learning alone makes the model less stable. When the number of labeled samples is slowly increased, the difference between the maximum identification accuracy and the minimum identification accuracy in the classification identification results will continue to drop to 4.3%, indicating that the transfer learning method helps to make self-correction of the semi-supervised learning method.

**Figure 4.** Impact of the transfer learning method on the semi-supervised learning method.

It can be seen from the above experiment results that the semi-supervised and transfer learning method proposed in this paper can comprehensively utilize the information of unlabeled samples and labeled samples. When the number of unlabeled samples is greater than 1000, and the number of labeled samples is greater than 1500, the performance of the model will tend to be stable and achieve the highest identification accuracy. Therefore, in the following we use 1500 known radar emitter samples and 1000 unknown radar emitter samples to train the model for contrast experiments.

#### *4.3. Contrast Experiments*

In order to further verify the effectiveness of the proposed method, we train the base classifier SVM model, the SVM model based on transfer learning, the SVM model based on semi-supervised learning and the SVM model based on semi-supervised and transfer learning respectively to identify the unknown radar emitter samples. In addition, in order to verify the adaptability of the model to the measurement error, we introduce an error deviation level test algorithm [21]. The specific experiment results are shown in Figure 5.

From the contrast experiment results, it can be known that when only using the base classifier SVM model for identification, the identification accuracy obtained is less than 55%. The main reason is that the known radar emitter data and the unknown radar emitter data do not satisfy the same-distribution hypothesis, resulting in an inability to obtain a valid classifier. When using the SVM model based on semi-supervised and transfer learning for identification, the optimal identification accuracy can be achieved within a certain measurement error range. Identification accuracy can reach more than 90% in the measurement error range of 15%, indicating that the method has good noise adaptability, and is obviously superior to the SVM model based on transfer learning, and the SVM model based on semi-supervised learning. The main reason is that the semi-supervised and transfer learning can make full use of sample information to achieve good performance without a lot of iteration.

The identification accuracy obtained by the SVM model based on transfer learning is slightly better than that obtained by the SVM model based on semi-supervised learning. The main reason is that there are not many available training samples in the target domain, which leads to the fact that only using the semi-supervised learning method cannot enhance the training effect. When the measurement error is greater than 10%, the identification accuracy of the transfer learning method and the semi-supervised learning method will be significantly reduced, indicating that their noise adaptability is not good.

**Figure 5.** Contrast experiment results.

#### *4.4. Results Discussion*

For the radar emitter identification task, deep learning models can often achieve the best results. Therefore, in this section, we construct the CNN model [6] and the U-CNN model [7] to compare with our method proposed in this paper. In the two deep learning models, radar pulse description words are used to represent radar signals, and as input to the model, which is the same as the processing of our method, so it is appropriate to compare CNN, U-CNN and our method together. The specific experiment results of different models are shown in Table 3. In the traditional identification scenario, that is, where we only use the labeled samples in source domain to train the models and then test on the source domain data, U-CNN can achieve the best performance. Its identification accuracy is up to 98.5%, while the identification accuracy of our method is only 95.3%. In the unknown identification scenario, that is, wherein we use the labeled samples in source domain and the unlabeled samples in target domain to train the models and then test on the unknown radar emitters in target domain, the identification accuracy of CNN and U-CNN decrease sharply. However, our method can still reach an identification accuracy of 91.6%. The experiment results show that compared with the currently most popular deep learning models, although our method still has disadvantages in the traditional identification scenario, it can achieve the best performance when facing unknown radar emitters.

**Table 3.** Identification accuracy of different models.


#### **5. Conclusions**

In the radar emitter identification task, the traditional methods are often difficult to identify unknown radar emitters. Aiming at the problem, this paper proposes an unknown radar emitter

identification method based on semi-supervised and transfer learning. The transfer learning method can solve the problem that the training data and testing data do not satisfy the same-distribution hypothesis, and the semi-supervised learning method can utilize the information of unlabeled samples to enhance the final training effect. Simulation experiments show that the proposed method can achieve an identification accuracy of 91.6% in the measurement error range of 15%, which is 15.3% higher than the deep learning model in the unknown identification scenario. The next step is to continue to optimize the model and lighten it for automatic compression, in order to minimize the running time of our method.

**Author Contributions:** Formal analysis, N.T.; Funding acquisition, R.F.; Investigation, X.C.; Supervision, G.W.; Visualization, Z.L.; Writing—review & editing, Y.F.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **Design Limitations, Errors and Hazards in Creating Decision Support Platforms with Large- and Very Large-Scale Data and Program Cores**

**Elias Koukoutsis 1,\*, Constantin Papaodysseus 1, George Tsavdaridis 2, Nikolaos V. Karadimas 3, Athanasios Ballis 4, Eirini Mamatsi <sup>1</sup> and Athanasios Rafail Mamatsis <sup>1</sup>**


Received: 26 October 2020; Accepted: 10 December 2020; Published: 14 December 2020

**Abstract:** Recently, very large-scale decision support systems (DSSs) have been developed, which tackle very complex problems, associated with very extensive and polymorphic information, which probably is geographically highly dispersed. The management, updating, modification and upgrading of the data and program core of such an information system is, as a rule, a very difficult task, which encompasses many hazards and risks. The purpose of the present work was (a) to list the more significant of these hazards and risks and (b) to introduce a new general methodology for designing decision support (DS) systems that are robust and circumvent these risks. The core of this new approach was the introduction of a meta-database, called teleological, on the base of which management, updating, modification, reduction, growth and upgrading of the system may be safely and efficiently achieved. The very same teleological meta-database can be used for the construction of a sound decision support system, incorporating elements of a previous one at a future stage.

**Keywords:** very large-scale decision support systems; very large-scale data and program cores of information systems; meta-database; teleological meta-database; thematic list; indicators list; computational methods list; geographically dispersed systems; external sources

#### **1. Introduction**

Nowadays, there is a considerable number of public and private operators, who make extensive use of decision support systems (DSSs or DS systems) in a surprising large number of operational procedures and businesses. These operations include:



These examples of operations constitute only a small number of those encountered in practice. In addition, in most cases, each associated DSS employs sets of very complex computational models and algorithms. In essence, a DSS is a special kind of a very complex information system (IS), where an information system is an integrated set of components for collecting, storing and processing data and for providing information, knowledge and digital products [28].

The present manuscript is organized as follows:

In Section 2, the authors deal with the recent tendency/necessity to build very large-scale decision support systems. In fact, it is argued that the advances in information and communication technology have given the opportunity to engineers to try to develop DSSs that tackle very big and/or particularly complicated practical problems.

In Section 3, a number of serious difficulties and problems are stated, in connection to specific properties/characteristics of a very large-scale decision support system. These properties concern the following:


In Section 4, crucial characteristics of contemporary, mainstream methodologies for creating large-scale decision support systems are reported. It is argued that any ad hoc development of a DS system is bound to suffer from serious intrinsic problems, which only a consistent fully methodological approach can resolve.

In Section 5, the authors introduce a methodology that circumvents all the aforementioned problems.

In Section 6, the authors state the reasons for which the introduced methodology resolves these problems associated with very large-scale DSS development.

Finally, in Section 7, the conclusions of the present work are stated.

#### **2. The Tendency to Increase the Volume and Complexity of Data and Program Collection of the DSS**

A considerable number of contemporary DSS applications eventually ask for very large data and program collections. The tendency to increase the size and complexity of these collections is due to a variety of factors, such as:

(a) Advances in information technology and communications (ITC) have created a market for DSSs with very large-scale data and program collections.

In fact, progress in software design and implementation makes possible the almost instant handling of huge data sets. For example, the conceptual development of balanced, N-ary trees allows for accessing a specific entry among trillions of entries in four or five very fast steps only. In addition, contemporary hardware allows for creating and managing/handling "storages", that is, devices that may contain thousands of disks (hard disks or DSSs).

	- The effort for developing the European Transport Information System for Policy Support (ETIS).

Maintaining an efficient transport infrastructure and formulating a common transport policy are critical elements for the economic and social development of the European Union (EU). In 1994, the EU Commission adopted a comprehensive proposal for the "Trans-European Networks" (TEN-T) guidelines [29], which are plans for improving the performance and further development of the complex EU infrastructure. The first version of these guidelines covered 70,000 km of rail tracks (including 22,000 km of new and upgraded tracks for high-speed trains), 58,000 km of roads (including 15,000 km of new roads), corridors and terminals for combined transport, 267 airports of common interest, networks of inland waterways and many seaports. It is evident that these numbers have greatly increased since the adoption of TEN-T guidelines.

The effort for performance improvement of the Trans-European Networks and the application of a successful transport policy requires significant information concerning the current status of the networks and the corresponding inhabited regions. More specifically, this required information includes passenger and freight transport volumes, traffic congestion, environmental impacts of transport, etc. It is also necessary to consider reliable forecasts for the time evolution of the aforementioned factors. The relevant information exists in a large number (more than 500) of dispersed, heterogeneous and autonomous databases throughout the entire Europe. In addition, more relevant information is acquired from the results of an ensemble of pertinent scientific studies. The European Transport Information System for Policy Support has been envisaged as a tool for the support of decision making on transport policies and policies related to the Trans-European Transport Networks. ETIS must accommodate policy-related information in a repository that has to be kept up to date by experts. In ETIS, policy questions are linked to the relevant data sets of the sources via a hierarchical structure of policy criteria, policy indicators and data variables [20].

• A second example is the information system for the port of Rotterdam. This system includes a huge amount of geographical information. It is noteworthy that the port of Rotterdam incorporates more than 5000 docs, an airport, a considerably large storage area for alumina and abundant apron spaces for temporary storage of goods and containers.

The IS of the port of Rotterdam, called Port Community System (PCS), provides services that focus on all port sectors, such as containers, break bulk, dry bulk and liquid bulk. Anyone in the logistics chain can easily and efficiently exchange information through these services. Moreover, the PCS offers a package of properly tailored services to each of the target groups of clients and operators.

From the entire previous discussion, it is evident that the information involved in the PCS is huge and very complex [30].

(c) The inclusion and use of data, which are directly or indirectly georeferenced (map-related data), in a DSS. In the first case, the direct one, the exact co-ordinates of an object are explicitly provided. On the contrary, in the second case, the indirect one, there is a link to an object, which determines the data co-ordinates. For example, if an accident takes place at a specific point of a certain road and then, if the exact co-ordinates of the accident location are registered in an IS, this event is directly georeferenced. On the other hand, if the registered information is something like the following: "the accident happened at the 23rd kilometer of the road from city A to city B", then the event is indirectly georeferenced.

Usually, the main bulk of the information is indirectly georeferenced. A surprisingly large part of the information that should be included in such a DSS is georeferenced, even though this georeferencing is well hidden at a first glance. For example, in most nations, such as the USA, the laws and directives may drastically change when crossing state boarders. Even in Greece, where the legislation is much more uniform, when one travels from an island to the mainland, one faces various changes in authority jurisdiction and legislation in the interior of the island, the mainland, at ports and on the sea (near-port area, national waters and international waters).

It is evident that the requirement that a large-scale decision support system handles and uses map-related information may greatly increase the volume of its data and program core.

	- Distributed and/or geographically dispersed systems, where properly selected duplication of part of the information and programs must be implemented, in order to increase the efficiency of the local systems. For example, a very frequently accessed part of information by all sub-systems must be preferably kept in every local sub-system.
	- Umbrella-type systems that receive and process information in real time from a large number of heterogeneous, autonomous and, possibly, legacy systems. Sometimes, these systems may cover a large part or the whole of a continent. Such is the case of a European system for maritime surveillance, the complexity and size of which is evident.

#### **3. Severe Di**ffi**culties Appearing in the Actual Deployment of a Very Large-Scale Decision Support System**

We use the general term "large data collection" in order to describe the extended data core of a large-scale decision support system. The data collection may include dozens or even hundreds of groups of loosely related data sets or databases, the content of which may be thematically divergent.

In this work, we will often use examples of large-scale decision support systems for policy support and decision making in the area of transport. We will do this for two main reasons:


We would like to mention the following, widely used terminologies in transport DS systems: The data and program collections are often called "repositories", "observatories" or "transport observatories". Frequently, when the transport observatories cover exclusively the needs of a single country, they are called "national models" [31–33].

#### *3.1. Early Design Errors and Serious Di*ffi*culties Due to the Large Data and Program Collection Characteristics*

In this Subsection, we present some serious difficulties/problems, inherent to large data and program collection characteristics. We will describe below some of the most severe difficulties of this type:

	- This way of data acquisition does not guarantee a precise/consistent, methodological collection of data. On the contrary, the random, heuristic and ad hoc data collection, as a rule,

results in a core with data and programs that are non-interoperable, internally inconsistent, practically impossible to organize or update properly and impossible to upgrade.

• Due to the aforementioned problems concerning the selected data, the core of the corresponding DSS practically collapses under the weight of the non-organizable data collections.

Consequently, this design methodology has been rather quickly abandoned.

	- (i) As a rule, in a very large DS system, the data are multi-thematic; hence, it is logical to expect a great variability of definitions, meanings and contents associated with the data.
	- (ii) The aforementioned variability is frequently accompanied by a great diversity of forms and formats of the elements of the data collection. This diversity practically always asks for a different handling of each subset of similar data.
	- (iii) (Complexity is another serious problem associated with large-scale data collections and the corresponding DS systems. The term "complexity" is used in order to describe data with a particularly great number of interconnections, inter-relations and interdependencies.
	- (iv) Data polymorphism can easily destroy an arbitrary DS system. For example, the most common computer object in transport, the "transport link", may literally have tenths or even hundreds of different definitions, meanings and data contents. Thus, the link between two cities A and B may refer to:
		- Various road connections, where each road has its own characteristics (e.g., a highway with a given number of lanes with or without tolls, a national road, a secondary street, etc.);
		- Various train connections, each one frequently with its own characteristics, such as high-velocity trains (TGV), intercity trains, local trains, commercial trains, etc.;
		- Airplane connections, with airports having different connections with the city center;
		- Seaways, where each way frequently includes ferries, container carriers, bulk carriers, tankers, cruisers, etc.;
		- Inland navigation, etc.

In addition, the information concerning a single object in the DSS drastically depends on the purpose of the use of this object. This fact makes large-scale decision support system designers employ ad hoc solutions, which usually make upgrades and thematical shifts much more difficult or even impossible. An efficient solution to this problem will be presented later in Sections 5 and 6 of the present work.


than 500 external sources [20], which are indeed heterogeneous and autonomous. Handling this number of different sources is a very difficult task and, in most cases, tackling this task is by no means automatic. Indeed, usually, a large initial effort by experts is required, in order to create a first version of an interoperable, internally consistent data and program core [34].

(e) In a large data collection, a considerable number of differences in the meaning and/or definition of common variables, coming from different sources, may appear. Thus, the task of defining the exact meaning and form of the variables that are common to many sources of the DS system may prove to be extremely difficult.

For example, in the first stages of the ETIS design, it has been proved that the difficulty of defining the term/variable "long distance trips" [35] was almost insurmountable. In fact, for the Scandinavians, this term referred to a distance greater than 100 km, while such an option practically excluded trips in the Netherlands entirely. In addition, other countries expressed a strong desire to associate with this term the age of the passengers and/or to incorporate in it whether the relevant trip is transit or not. Today, one may think that the term "long distance trips" must also include as a sub-variable the information whether the vehicle (automobile, motorcycle or train) is electric or conventional.

Thus, it not a surprise that a new, complete definition of the term "long distance trips" was given by the Transport General Directory and that Eurostat issued a directive to all EU Members to conform with this new definition and to always provide all relevant information.

(f) The next difficulty concerns the "level of detail" (abbreviated as LoD) or "granularity". We shall try to clarify the content and importance of the term "level of detail" by means of the following example, concerning Google Maps. In fact, this application starts from offering a global earth projection. Then, the user may gradually increase the level of detail at an arbitrary point of the projection, according to his/her desire. At each such selection of the user, a more detailed map of the associated area appears on the screen. At certain levels, new geographical objects appear on the screen. For example, one may select a specific country and see its map. Subsequently, one may choose an area of this country and see the corresponding map in more detail. After a sequence of successive selections, where each one of them offers a greater level of detail, one may suddenly see some cities as points and some sketches of roads; hence, new objects appear. Further increase of the detail may offer more and more dense maps of the desired area. If one selects the option "satellite", more and more detailed images of the selected area appear on the screen. In the higher detail level, the user may see individual public buildings and constructions, houses and details of them.

In the first steps of large DS system implementation, many designers considered that it is sufficient to include the higher level of detail only in the information system. However, the two aforementioned examples, but also many other DS systems that do not necessarily include geographical information, demonstrate that the approach "include the highest level of detail only" is completely erroneous. On the contrary, as it will be analytically presented in Sections 5 and 6 of the present work, a sound and systematic method for handling the levels of detail in a large DS system is to include all necessary information, associated with any level of detail, according to each issue and problem that the DSS has to tackle. In other words, the levels of detail in each application of the DSS strongly depend on the issue and problem in hand; therefore, the corresponding information system must be designed by keeping this issue dependence in mind.

*3.2. Extreme Problems in Deploying, Managing, Maintaining, Updating and Upgrading Very Large-Scale Decision Support Systems*

The deployment of a (very) large-scale decision support system meets with severe difficulties, due to the reasons described immediately below, which are in strong accordance with the content and the observations of [36,37]:


#### **4. Contemporary, Mainstream Methodologies for Creating Large-Scale Decision Support Systems**

First of all, a complete/thorough analysis and study, concerning the set of the required indicators, usually precedes the system design. This part of the DSS project is absolutely necessary; otherwise, corrections and additions at a later stage may be difficult or even impossible. The initial study must clearly and comprehensively offer information about the following, which will also prove very useful in the analysis of Sections 5 and 6:

	- "Noise";
	- "Congestion";
	- "Pollutant emissions".

For example, a complete associated path could be the following: transport policy issues → environmental impacts of transport → harmful environmental impacts of transport → noise → noise produced at the transport network → elementary/non-separable issue: noise produced by road traffic outside of the urban areas → indicator(s): noise measurements for all links and nodes of the road network outside of the urban areas.

On the other hand, the subcategory "beneficial impacts" must, among others, include:


It must be emphasized that the form of an indicator can greatly vary, from a simple set of numbers (e.g., the former Kyoto indicators for environmental pollution) [42–45] to a complete database (e.g., a map indicating the road congestion of an area, with seasonal and daily indications).


The related information is, as a rule, a result of tedious efforts of experts, usually belonging to numerous scientific disciplines. Evidently, the work of the experts is supported by information technology and communication (ITS) engineers.

As a next step, ITS experts, using the aforementioned abstract results, implement the following:


Contemporary solutions may include even more advanced software tools, such as "web services" and "containers" [46]. Moreover, in recent designs, several servers remain totally "in-memory", in order to achieve highly increased performance; however, this solution is particularly expensive.

Next, suppose that a team of engineers implements a DS system ad hoc, without incorporating a rich set of metadata in it. In fact, without the complete knowledge of the appropriate metadata, the programs and data of an information system are practically entangled black boxes, to which

corrections, maintenance, improvements, expansion and updating are difficult, if not impossible to be realized. In addition, a considerable number of users themselves are experts and/or they are supported by teams of experts; consequently, they will definitely need to know the details of acquisition and evaluation of the indicators. Thus, we would like to emphasize that the limiting factor in the design of large-scale decision support systems is not the hardware or software capabilities, but the very organization of the huge information itself.

#### **5. A Novel, Systematic Methodology for Designing, Maintaining and Upgrading a Very Large-Scale Decision Support System**

It has been previously stated that the design and implementation of a very-large scale decision support system is realized by means of the following two general stages:

Stage 1: A group of experts, after an extensive analysis of the problems/requirements in hand, ends up with an abstract structure, which determines all due operations of the DS system.

Stage 2: Based on this abstract structure, a group of ICT engineers produces an ad hoc implementation of the sought-for information system. However, as it has been more extensively described in Section 4 of the present work, any ad hoc implementation intrinsically suffers from serious flaws and problems of maintenance and upgrading of the developed system.

Hence, in the present section, a novel, systematic approach that circumvents these difficulties is described. This new approach includes the following steps:


We would like to emphasize that we render this teleological meta-database the "heart" of the entire DS system and its operations.

#### *5.1. The Part of the Template that Includes the Compete Thematic Decomposition*

As it has been already pointed out in Section 4, the first part of the template concerns the complete thematic list of the topics or issues for which the DSS must provide information/answers. An exhaustive thematic decomposition is applied to this list, expressed via a tree-like graph structure, down to further-non-separable objects/entities. We noticed that in the case of polymorphic data, the final leaves of the thematic decomposition usually highly depend on the sub-problem in hand. In other words, a specific object must be further decomposed, when the DSS tackles a certain problem, say A, while it may be a non-separable entity, when the system offers information concerning another problem, say B.

An abstract visualization of this tree-like, graph structure that performs such a thematic decomposition is depicted in Figure 1. In fact, concerning Figure 1, we would like to point out the following:

• The entire thematic list of the DSS incorporates issues I1, I2, ... , IN.


/LQNVWRWKH6WUXFWXUHIRUWKH'66,QGLFDWRUVDQGWKHLUDFTXLVLWLRQRUHYDOXDWLRQPHWKRGV

**Figure 1.** The structure of a thematic decomposition, an analytic explanation of which is presented in Section 5.1, where "MD" stands for "metadata".

We repeat that, often, the length of the path that leads to NS1,1 may be different from the length that leads to NS1,2, which in turn may be different from the length of the path that leads to NS1,3, and so on. These differences in length are indicated in Figure 1 by the fact that the non-separable objects are placed at different levels.

Evidently, issue I2 ends up to the non-separable objects NS2,1, NS2,2, ... , NS2,F2 and so forth. We stress that, in many instances, issue IJ may end up to an arbitrary number of non-separable objects NSJ,R following various different paths. In this case, we choose as length of the path that links IJ with NSJ,R, the length of the path that connects these two entities, which starts at the higher-most level, namely, the one closer to IJ, and so on.


#### *5.2. The Template Part Covering the DSS Indicators and*/*or Data and Their Acquisition or Evaluation Method(s)*

Each issue and/or each sub-issue, described in Section 5.1 and Figure 1, may be linked (and is indeed very frequently linked) to an ensemble of indicators, namely, certain quantitative characteristics of the issues and/or sub-issues in hand. Indicators help the user(s) of the DS system to make decisions as objectively as possible. As we have already pointed out, a simple, but characteristic, example of an important indicator is the noise generated by vehicles at various specific points of a road network [48]. Another crucial indicator is the volume of pollutants emitted by vehicles at these points [49].

We will use the term "the complete list of indicators" to express the entire set of indicators, incorporated in a DS information system.

Each indicator must be associated with (a) the data necessary for its computation and/or (b) the exact method(s) for acquiring or evaluating it.

As a rule, there are three ways of obtaining indicators, namely:


An abstract representation of all three methods for acquiring or evaluating indicators is shown in Figure 2.

As far as this part of the template is concerned, one more important issue must be emphasized. A very consistent approach to describe a method that computes one or more indicators, so that the DS system remains maintainable and upgradable, is to decompose each such computational method into lesser "functional blocks" and describe all the necessary internal communication of these blocks (i.e., for each block, its input intermediate variables, the origins of these input variables, the output intermediate variables and the destination of these output variables must be described).

This can be better understood my means of a simple, but fictitious, practical example, which is schematically shown in Figure 3. In this example, a certain computational method is decomposed into four (4) well-defined sub-blocks, namely, "model 1", "model 2", "sub-method 1" and "sub-method 2".

**Figure 2.** The structure of the meta-database template concerning the indicators, local data and computational methodologies.

**Figure 3.** An example of a first step of decomposition of a computational method. Any indicator and/or local datum may be compound, i.e., it may comprise a structure of simpler data.

Then, each sub-method (namely, model 1, ... , sub-method 2) may be further decomposed in an analogous manner and so on. The decomposition stops when the experts that have initially developed the system feel that they have described the entire computational method adequately, so that another expert (or group of experts) can fully understand it at a later stage.

In Figure 4, another example is shown, which is actually encountered in the transport application area. In fact, the example refers to a specific method that computes the noise levels at a number of certain road network traffic links.

**Figure 4.** Computation of the levels of noise at certain links of a road network.

First, a traffic forecast model uses traffic counts on the links of the network and traffic growth factors (local data) to forecast traffic volumes for a future planning period. At the next step, a sub-computational method takes these forecasted traffic volumes and road capacity figures and computes the average vehicle velocities. Finally, a noise prediction model takes the computed average vehicle velocities, forecasted traffic volumes at the links of the network, as well as other road network characteristics and estimates the noise levels at these links of the network for the future planning period. The description of the computational methods starts at this level of abstraction. The sub-methods, shown in Figure 4, should be further decomposed. Once more, the decomposition stops when the experts that developed the system feel that the entire description is sufficient.

#### *5.3. Including the External Sources and the Way to Access Them in the Template*

We use the term "external sources" to describe the set of geographically dispersed databases linked to the main decision support system; these data bases are, as a rule, autonomous and most probably heterogeneous. Each such database may offer data and/or indicators to the main DS system. In turn, the main system may use these external data or indicators intact or employ them, in order to compute and/or generate other data or indicators. The part of the system that implements the communication of each external source with the main system and, possibly, computes all necessary quantities, is called "mediator"; the corresponding process itself is called "mediation".

As mentioned before, each external source provides a part of the data of the main system and/or a part of the indicators. Consequently, the proposed meta-database must include:

	- The quality of the indicators;
	- Any possible special characteristics these indicators possess;

• Any limitations of the use of the group of specific indicators.

#### *5.4. Including a Complete, Structured Description of Experts' Knowledge in the Teleological Meta-Database*

Figures 1 and 2 depict a fundamental characteristic of the meta-database, expressed by the fact that a "metadata box" is attached to every entity in the template. Each such box should incorporate the relevant knowledge of the experts who built the system and it should explain every choice of them (such as the thematic list, all issues, sub-issues, indicators, data, computational methods, external sources, etc.). The content of the "metadata box" should be written in such a way, so that at a later stage, experts, belonging to the same scientific discipline, can understand any special characteristic and any limitation of the use of all entities in the decision support system. The level of description must be detailed enough, so that these other experts can take over the maintenance and upgrade of the DS system. We must emphasize that confidentiality and/or intellectual property issues can be resolved by letting:


Preferably, all this information must be further organized by building a specifically oriented sub-meta-database, which will provide all possible ways for accessing this information (e.g., accredited keywords, characteristic terms and methodology names). Additionally, a knowledge base [50,51] may be included in the DS system, with references to other related, important information.

#### *5.5. Incorporating the Proper Documentation and Navigation Information for the Users in the Teleological Meta-Database*

As it has been many times emphasized before, a large-scale decision support system is very complex. Thus, even a user expert in the related scientific field cannot use the system properly unless he/she is suitably advised. Therefore, it is rather imperative that specific documentation and navigation information must be included in the meta-database, so that a user, who is not familiar with the DS system, can benefit from it with minimal difficulty. There are two aspects of the aforementioned kind of information:


Overall, the documentation template of the meta-database can be built in such a manner, so as to assist all user experts and their collaborators to find out themselves how to use the DS system as easily as possible.

#### **6. Substantial Advantages of a Decision Support System Developed on the Basis of the Proposed New Teleological Meta-Database**

As it has been already discussed, the meta-database must have the following characteristics and properties:


(d) The aforementioned software programs of the DSS must be very well documented, and this documentation must be properly placed in the template of the meta-database in a clear-cut manner.

Consequently, the proposed metadata structure can be considered as an exact and thorough content description and functional map of the DS system in hand.

One main novelty of the methodology described here is that during the entire life cycle of the DS system, a group of experts and engineers, even radically different than the one that developed the specific information system, may achieve the following, by employing the proposed teleological meta-database:


We would like to emphasize that any change imposed on the system must be accurately reflected/incorporated in the meta-database template. In this way, the DSS will maintain all the very good properties mentioned above; in the opposite case, all aforementioned actions concerning the DS system will not be feasible.

We would like to point out that any part of the information of a DS system based on the meta-database may be georeferenced/geographical. Actually, the use of the meta-database permits a very advanced architecture of the georeferenced/geographical information, as it will be described in another work.

#### **7. Conclusions**

Recent advances in information technology and communications allow for the development of very large-scale decision support systems. However, the complexity, the polymorphism and the size of the data and/or program core of such a very large DS system generate a number of very serious problems, which are outlined here. These problems have to do with:


Moreover, it is argued that any ad hoc/non-strictly methodological development of a very large-scale DS system will definitely lead to very serious intrinsic problems.

In the present manuscript, it was shown that only a consistent, fully methodological approach can resolve all the aforementioned problems and it can drastically minimize the associated risks and hazards. Such a methodological approach was explicitly introduced here. This methodology was associated with the development of a very detailed, consistent and rigorous meta-database, which fully describes the DS system and it includes:


**Author Contributions:** Conceptualization, E.K., C.P. and A.B.; Investigation, G.T., N.V.K. and A.B.; Project administration, E.K.; Resources, G.T., N.V.K. and E.M.; Supervision, E.K. and C.P.; Validation, E.K., C.P., G.T., N.V.K., A.B. and E.M.; Writing—original draft, E.K., C.P. and G.T.; Writing—review & editing, G.T., N.V.K., A.B., E.M. and A.R.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Efficient Rule Generation for Associative Classification**

#### **Chartwut Thanajiranthorn \* and Panida Songram**

Department of Computer Science, Faculty of Informatics, Mahasarakham University, Mahasarakham 44150, Thailand; panida.s@msu.ac.th

**\*** Correspondence: chartwut@bru.ac.th; Tel.: +66-619-395-455

Received: 2 September 2020; Accepted: 14 November 2020; Published: 17 November 2020

**Abstract:** Associative classification (AC) is a mining technique that integrates classification and association rule mining to perform classification on unseen data instances. AC is one of the effective classification techniques that applies the generated rules to perform classification. In particular, the number of frequent ruleitems generated by AC is inherently designated by the degree of certain minimum supports. A low minimum support can potentially generate a large set of ruleitems. This can be one of the major drawbacks of AC when some of the ruleitems are not used in the classification stage, and thus (to reduce the rule-mapping time), they are required to be removed from the set. This pruning process can be a computational burden and massively consumes memory resources. In this paper, a new AC algorithm is proposed to directly discover a compact number of efficient rules for classification without the pruning process. A vertical data representation technique is implemented to avoid redundant rule generation and to reduce time used in the mining process. The experimental results show that the proposed algorithm archives in terms of accuracy a number of generated ruleitems, classifier building time, and memory consumption, especially when compared to the well-known algorithms, Classification-based Association (CBA), Classification based on Multiple Association Rules (CMAR), and Fast Associative Classification Algorithm (FACA).

**Keywords:** associative classification; class association rule; vertical data representation; classification

#### **1. Introduction**

Nowadays, there a number of classification techniques that have been applied to various real-world applications, i.e., graph convolutional networks for text classification [1], automated classification of epileptic electroencephalogram (EEG) signals [2], iris cognition [3], and anomaly detection [4]. Associative classification (AC) is a well-known classification technique that was first introduced by Lui et al. [5]. It is a combination of two data-mining techniques, association rule mining, and classification. Association rule mining discovers the relationship between items in a dataset. Meanwhile, classification aims to predict the class label of any given instance from learning-labeled dataset. AC focuses on finding Class Association Rules (CARs) that satisfy certain minimum support and confidence thresholds in the form *x* → *c*, where *x* is a set of attribute values and *c* is a class label. AC has been reported in the literature to outperform other traditional classifiers [6–13], In addition, a CAR is an if–then rule that can be easily understood by general users. Therefore, AC is applied in many fields, i.e., phishing website detection [6,7,11], heart disease prediction [8,9], groundwater detection [12], and detection of low-quality information in social networks [10].

In traditional AC algorithms, minimum support threshold is a significant key parameter that is used to select frequent ruleitems and to then eliminate frequent ruleitems in which confidence values do not satisfy minimum confidence. This manner leads to a large number of frequent ruleitems. Nguyen and Nguyen [14] demonstrated that the number of 4 million frequent ruleitems can be

generated when the minimum support threshold is set to 1%. Moreover, a number of AC-based techniques, i.e., Classification-based Association (CBA) [5], Fast Associative Classification Algorithm (FACA) [11], CAR-Miner-diff [14], Predictability-Based Class Collative Class Association Rules (PCAR) [15], Weighted Classification Based on Association Rules (WCBA) [16], and Fast Classification Based on Association Rules (FCBA) [17], create all possible CARs in order to determine a set of valid CARs that can be used in the classification process. Recently, Active Pruning Rules (APR) [13] has been proposed as a novel evaluation method. APR can be used to avoid generating all CARs. However, the exhaustive search for finding rules in classifiers may cause an issue in large datasets or low minimum support. Creating candidate CARs consumes intensive computational times and memory. The minimal process of candidate generation is still challenging because it is quite affected in terms of training time, input/output (I/O) overheads, and memory usage [18].

In this paper, a new algorithm is proposed to directly generate a small number of efficient CARs for classification. A vertical data format [19] is used to represent ruleitems associated with their transaction IDs. The intersection technique is used to easily calculate support and confidence values from the format. The ruleitems with 100% of confidence will be added to the classifier as a CAR. Whenever a CAR with 100% confidence is found, the transaction associated with the CAR will be removed by using a set difference to avoid generating redundant CARs. Finally, a compact classifier is built for classification. In conclusion, the contribution of this paper is as follows.


This paper is structured as follows. In Section 2, related works of AC are described. The basic definitions are delineated in Section 3. The proposed algorithm is introduced in Section 4. The discussion on the experimental results is in Section 5. Lastly, the conclusion of the study is stated in Section 6.

#### **2. Related Work**

In the past, AC-based algorithms have been proposed and studied. The study's objective is to understand some drawbacks and to increase the effectiveness of the algorithms. Lui et al. [5] introduced the CBA algorithm which integrated association rule mining and classification. The process of the CBA algorithm is divided into two steps. First, CARs are generated based on the famous search method the Apriori algorithm [20]. Second, CARs are sorted and then pruned to select efficient CARs in a classifier. The CBA algorithm was proven to produce a lower error rate than C4.5 [21]. Unfortunately, the CBA algorithm encounters a large number of candidate generation problems due to Apriori inheritance which finds all possible frequent rules at each level.

Li et al. [22] presented the Classification based on Multiple Association Rules (CMAR) algorithm. Unlike CBA, CMAR adopts a Frequent pattern tree (FP-tree) and a Cosine R-tree (CR-tree) for rule generation and classification phases. It divides the subset in FP-tree to search frequent ruleitems and then adds the frequent ruleitems to CR-tree according to their frequencies. Hence, CMAR only needs to scan the database once. The CMAR algorithm uses multiple rules to predict unseen instances based on chi-square method. In the experiment, CMAR was compared with CBA and C4.5 in terms of accuracy. The experimental result shows that CMAR performs better than the others.

Abdelhamid [6] proposed an Enhanced Multi-label Classifier-based Associative Classification (eMCAC) for phishing website detection. It generates rules with multiple class labels from a single dataset without recursive learning. The eMCAC algorithm applies a vertical data format to represent datasets. The support and confidence values for a multi-label rule are calculated based on the average support and confidence values of all classes. The class is assigned to the test instance if attribute values are fully matched to the rule's antecedent. The experimental results show that the eMCAC algorithm outperforms CBA, PART, C4.5, jRiP, and MCAR [23] on the real-world phishing data in terms of accuracy.

Hadi et al. [11] proposed the FACA algorithm for phishing website detection. It applies a Diffset [24] in the rule-generation process to increase the speed of classifier building time. First, the FACA algorithm discovers k-ruleitems by extending frequent (k − 1)-ruleitems. Then, ruleitems are ranked according to the number of attribute values, confidence, support, and occurrence. To predict unseen data, the FACA algorithm utilizes the All Exact Match Prediction Method. The method matches unseen data with all CARs in the classifiers. Next, unseen data are assigned to the class label with the highest count. From the experimental result, the FACA algorithm outperforms CBA, CMAR, MCAR, and ECAR [25] in terms of accuracy.

Song and Lee [15] introduced Predictability-Based Collective Class Association Rule algorithm (PCAR) to enhance rule evaluation. The PCAR algorithm uses inner cross-validation between the test dataset and train dataset to calculate a predictability value of CARs. Then, CARs are ranked according to rule predictive values, rule confidence, rule support, rule antecedent length, and rule occurrences. Finally, the full-matching method is applied to assign a class label for unseen data. To evaluate the performance of PCAR, PCAR was compared with C4.5, RIPPER, CBA, and MCAR on the accuracy, and PCAR was shown to outperform the others.

Alwidian et al. [16] proposed the WCBA algorithm to enhance the accuracy of a classifier based on the weighting technique. WCBA assumes that the importance of attributes is not equal. For example, in medicine, some attributes are more important than other attributes for prediction. Consequently, weights of all attributes are assigned by experts in the domain. Then, the weighted method is used to select useful CARs and a statistical measure is used for the pruning process. In addition, CARs are priors sorted by using the harmonic mean, which is an average value between support and confidence. The WCBA algorithm is more significantly accurate than CBA, CMAR, MCAR, FACA, and ECBA. However, the WCBA algorithm generates CARs based on the Apriori technique that scans the database many times.

Rajab proposed [13] the Active Pruning Rule (APR) algorithm. The new pruning process was introduced in APR. CARs are ranked by confidence, support, and rule length. Each training instance is matched over a set of CARs. The first rule that matches an instance is added to the classifier. Then, instances containing the first rule are removed. The support and confidence of remaining rules are recalculated, and all CARs are re-ranked. The APR algorithm was proven to reduce the size of the classifier and to maintain predictive accuracy performance. However, the APR algorithm still has to face a massive number of candidates from a rule-generation process. From previous works, the advantages and disadvantages are shown in Table 1.

The previous algorithms on AC generally result in high predictability of rules. However, most of them produce k-ruleitems from (k − 1)-ruleitems. They have to calculate supports when a new ruleitems is recovered. To calculate support and confidence values, they have to search all transactions in databases multiple times. Moreover, a huge number of candidate CARs are generated and pruned later to reduce unnecessary CARs. To reduce the problems, the proposed algorithm will directly generate efficient CARs for classification so that the pruning and sorting processes are not necessary. The efficient CARs in our works are rules with 100% confidence which are generated based on the idea that some attribute values can immediately indicate the class label if all attribute values belong to a class label. To easily check attribute values belonging to any class label, vertical data representation is used in the proposed algorithm. Furthermore, simple set theories, intersection, and set difference are adapted to easily calculate support and confidence values without scanning a database multiple times.



#### **3. Basic Definitions**

Let *A* = {*a*1, *a*2,..., *am*} be a finite set of all attributes in dataset. *C* = {*c*1, *c*2,..., *cn*} is a set of classes, *g*(*x*) is a set of transactions containing itemset *x*, and |*g*(*x*)| is the number of transactions containing *x*.

**Definition 1.** *An item can be described as an attribute ai containing a value vj, denoted as* (*ai*, *vj*)*.*

**Definition 2.** *An itemset is the set of items, denoted as* (*ai*1, *vi*1),(*ai*2, *vi*2), ...,(*aik*, *vik*)*.*

**Definition 3.** *A ruleitem is of the form itemset*, *cj , which represents an association between itemsets and class in a dataset; basically, it is represented in the form itemset* → *cj.*

**Definition 4.** *The length of a ruleitem is the number of items, denoted as k* − *ruleitem.*

**Definition 5.** *The absolute support of ruleitem r is the number of transactions containing r, denoted as sup*(*r*)*. The support of r can be found from (1).*

$$\sup(r) = |\mathbf{g}(r)|\tag{1}$$

**Definition 6.** *The confidence of ruleitem itemset*, *cj is the ratio of the number of transactions that contains the itemset in class in cj and the number of transactions containing the itemset, as in (2).*

$$\text{conf}(\langle \text{itemset}, c\_{\rangle} \rangle) = \frac{|\lg(\langle \text{itemset}, c\_{\rangle} \rangle)|}{|\lg(\text{itemset})|} \times 100\tag{2}$$

**Definition 7.** *Frequent ruleitem is a ruleitem in which support is not less than the minimum support threshold (minsup).*

**Definition 8.** *Class Association Rule (CAR) is a frequent ruleitem in which confidence is not less than the minimum confidence threshold (mincon f ).*

#### **4. The Proposed Algorithm**

In this section, a new algorithm, called the Efficient Class Association Rule Generation (ECARG) algorithm, is presented. The pseudo code of the proposed algorithm is shown in Algorithm 1.


First, 1-frequent ruleitems are generated (line 1). To quickly find 1-frequent ruleitems, the proposed algorithm takes the advantage of a vertical data format to calculate the support of

the ruleitems. The support of the ruleitems can be obtained from |*g*(*itemset*) ∩ *g*(*ck*)|. If any 1-ruleitem does not meet the minimum support threshold, it will not be extended with the other ruleitems. Moreover, the confidence of the frequent ruleitems can be calculated from Equation (2) by using the vertical data format. If the confidence of the ruleitem is 100%, the ruleitems will be added to the classifier directly (line 7); otherwise, it will be considered extended with the others (line 5).

After discovering the most effective CAR with 100% confidence, the transaction IDs associated with the CAR will be removed to avoid redundant CARs (line 8). To remove the transaction IDs, a set difference plays an important role in our algorithm. Let *ri* be a CAR with 100% confidence and *T* be a set of ruleitems in the same class of *ri*. For all *rj* ∈ *T*, the new transaction IDs of *rj* is *g*(*rj*) = *g*(*rj*) − *g*(*ri*). Then, the new transaction IDs, support, and confidence values of all rules are updated (line 9).

In each iteration, if there is no CAR with 100% confidence, the ruleitem *r* with the highest confidence will be first to be considered extended in a breadth-first search manner. It will be combined with other ruleitems in the same class until the new CAR has 100% confidence (line 5). If *ri* is extended with *rj* to be *rnew* and *g*(*rj*) ⊆ *g*(*ri*), then *conf*(*rnew*) = 100%. After the extended CAR is added to the classifier, the transaction IDs associated with the CAR will be removed. Finally, if no ruleitem satisfies the minimum support threshold, the CAR generation will be stopped.

The proposed algorithm continues to find a default class in order to insert it to the classifier. The class with the most remaining transaction IDs is selected as the default class (line 12).

To demonstrate the examples, the dataset in Table 2 is used as example data. The minimum support and confidence thresholds are set to 2 and 50%, respectively.


**Table 2.** A sample dataset.

The vertical data format represents associated transaction IDs of 1-ruleitem, as shown in Table 3. The last 2 columns of Table 3 show the support and confidence of ruleitems that are calculated. From Table 2, the *a*<sup>2</sup> value in *atr*1 occurs in transaction IDs 5, 6, 7, and 9, denoted as *g*(*atr*1, *a*<sup>2</sup>) = {5, 6, 7, 9}. Class *A* is in transaction IDs 1, 2, 3, 4, 8, and 9, denoted as *g*(*A*) = {1, 2, 3, 4, 8, 9}, while class *B* is in transaction IDs 5, 6, and 7, denoted as *g*(*B*) = {5, 6, 7}. The transaction IDs containing *atr*1, *a*2 → *A* are *g*(*atr*1, *a*<sup>2</sup>) ∩ *g*(*A*) = {5, 6, 7, 9}∩{1, 2, 3, 4, 8, 9} = {9}, so the supports of *atr*1, *a*2 → *A* are 1. The rule *atr*1, *a*2 → *A* will not be extended because its support is less than the minimum support threshold. Transaction IDs containing *atr*1, *a*2 → *B* are *g*(*atr*1, *a*<sup>2</sup>) ∩ *g*(*B*) = {5, 6, 7, 9}∩{5, 6, 7} = {5, 6, 7}, so the supports of *atr*1, *a*2 → *B* are 3. Hence, this rule is a frequent ruleitem.

The confidence of *atr*1, *<sup>a</sup>*2<sup>→</sup> *<sup>B</sup>* can be obtained from <sup>|</sup>*g*(5,6,7)<sup>|</sup> <sup>|</sup>*g*(5,6,7,9)<sup>|</sup> <sup>×</sup> <sup>100</sup> <sup>=</sup> <sup>3</sup> <sup>4</sup> × 100 = 75%. The confidence of *atr*1, *a*2 → *B* is not 100% so it will be extended, whereas the confidence of *atr*1, *<sup>a</sup>*1<sup>→</sup> *<sup>A</sup>* is <sup>|</sup>*g*(1,2,3,4)<sup>|</sup> <sup>|</sup>*g*(1,2,3,4)<sup>|</sup> <sup>×</sup> <sup>100</sup> <sup>=</sup> <sup>4</sup> <sup>4</sup> × 100 = 100%, so it is the first CAR added to the classifier.


**Table 3.** The rules that meet minimum support threshold (white background cell).

After discovering the first CAR, the transaction IDs associated with the CAR will be removed. From Table 3, if *atr*1, *a*<sup>1</sup> is found, the class will absolutely be *A*. Hence, *atr*1, *a*1 → *A* does not need to be extended with the other attribute values and transaction IDs 1, 2, 3, and 4 should be removed. The ECARG algorithm adopts a set difference, which can help to remove transaction IDs more conveniently.

For example, *g*(*atr*1, *a*1 → *A*) = {1, 2, 3, 4} and *g*(*atr*3, *c*1 → *A*) = {1, 3, 4, 9}. The new transaction IDs of *g*(*atr*3, *c*1 → *A*) = *g*(*atr*3, *c*1 → *A*) − *g*(*atr*1, *a*1 → *A*) = {1, 3, 4, 9} − {1, 2, 3, 4} = {9}. Then, the new transaction IDs, support, and confidence values of all rules are updated as shown in Table 4.

From Table 4, there is no CAR with 100% confidence. *atr*1, *a*2 → *B* has the maximum confidence, and *atr*3, *c*2 → *B* = {5, 6} is a subset of *g*(*atr*1, *a*2 → *B*) = {5, 6, 7}. Hence, the new rule (*atr*1, *a*2),(*atr*3, *c*2) → *B* is found with 100% confidence. Then the extension of (*atr*1, *a*2),(*atr*3, *c*2) → *B* is stopped. For 2-ruleitem extended from *atr*1, *a*2 → *B*, there is only one rule with 100% confidence and it is added to the classifier as the second CAR.

**Table 4.** The remained transaction IDs after generating the first Class Association Rule (CAR).


After the second CAR is added to classifiers, the transaction IDs associated with CAR are removed. The remaining transaction IDs are shown in Table 5. There is only one ruleitem that satisfies the minimum support threshold: the ruleitem *atr*2, *b*3 → *A* which does not meet 100% of confidence. No ruleitem passes the minimum support threshold to be extended with the ruleitem *atr*2, *b*3 → *A* so CAR generation is stopped.


**Table 5.** Transaction IDs after generating the second CAR.

With the remaining transaction IDs in Table 5, the ECARG algorithm continues to find a default class and to add it to the classifier. In this step, the class with the most relevant transaction IDs is selected as the default class. In Table 5, class *A* remains in transaction IDs 8 and 9 while class *B* remains in transaction ID 7. The remaining transaction IDs are relevant to class *A* the most, so the default class is *A*. In case the number of associated remaining transaction IDs with each class is not changed, the majority class in the classifier is the default class. Finally, all CARs in the classifier are shown in Table 6.

**Table 6.** All CARs from ECARG.


To observe the effect of 100% confidence ruleitems, we tested another version of ECARG, ECARG2. The difference in ECARG2 is ruleitem extension. If a ruleitem with 100% confidence cannot be found from the extension, the ruleitem with the highest confidence will be selected as a CAR and added to classifiers. For example, in Table 5, ruleitem *atr*2, *b*3 → *A* is the only ruleitem that satisfies the minimum support and minimum confidence. Hence, ECARG2 selects the ruleitem as the third CAR. The associated transaction IDs are removed, and the remaining transaction ID is shown in Table 7. There is only one transaction ID with class *B*. Consequently, the default class is *B*. Finally, all CARs from ECARG2 are shown in Table 8.

**Table 7.** Transaction IDs after ECARG2 generated the third CAR.



**Table 8.** All CARs from ECARG2.

#### **5. Experimental Setting and Result**

The experiments were implemented and tested on a system with the following environment: Intel Core i3-6100u 2.3 GHz processor with 8 GB DDR4 main memory, running Microsoft Windows 10 64-bit version. Our algorithm is compared with the well-known algorithms CBA, CMAR, and FACA. All algorithms were implemented in java. The implementing java version of the CBA algorithm using CR-tree is from WEKA [26]. The implementation of CMAR in JAVA is from [27]. Four algorithms are tested on 14 datasets from the UCI Machine Learning Repository. The characteristics of the datasets are shown in Table 9. Ten-fold cross-validation is used to divide testing instances and training instances based on previous works [12,17,23,26,27]. Accuracy rates, the number of CARs, classifier building times, and memory consumption are used to measure the performance of the four algorithms.


**Table 9.** Characteristics of the experiment datasets.

To study the sensitivity of thresholds on the ECARG algorithm, we set different minimum support thresholds and different minimum confidence thresholds in the experiment. First, we set the minimum support thresholds from 1% to 4% and analyze different minimum confidence thresholds between 60%, 70%, 80%, and 90%. Figure 1 shows the accuracy rates of all datasets. The results show that, when the minimum support thresholds are increased, the accuracy rates are decreased. If the minimum confidence thresholds are increased, the accuracy rates are slightly down.

The highest accuracy rates are given in most datasets, Diabetes, Iris, Labor, Lymph, Mushroom, Post-operative, Tic-tac-toe, Vote, Wine, and Zoo, when minimum support and minimum confidence are set to 2% and 60%, respectively. Therefore, the minimum support is set to 2%, and minimum confidence is set to 60% in the next experiments.

*Algorithms* **2020**, *13*, 299

**Figure 1.** Accuracy rates in various *minsup* and *minco f* on all datasets.

Table 10 reports the accuracy rates of the CBA, CMAR, FACA, ECARG, and ECARG2 algorithms on the UCI datasets. The results show that both of our algorithms outperform the others on average. This gain resulting from the methodology found the most efficient rule in each iteration and eliminated redundant rules simultaneously. To be more precise, we further analyzed the win-lost-tie records. Based on Table 10, the win-lost-tie records of the ECARG2 algorithm against CBA, CMAR, FACA, and ECARG in terms of accuracy are 11-3-0, 11-3-0, and 9-4-1, 8-6-0, respectively. We can observe that ECARG gives an accuracy slightly less than ECARG2. However, the ECARG algorithm results in the highest accuracy in 6 of 14 datasets.


**Table 10.** Accuracies of CBA, CMAR, FACA, ECARG, and ECARG2.

Table 11 shows the average number of CARs generated from CBA, CMAR, FACA, ECARG, and ECARG2 algorithms. The result shows that the CMAR algorithm generates the highest number of rules, while the ECARG algorithm generates the lowest. In particular, the ECARG algorithm generates 8 CARs on average against 14 datasets whereas the CBA, CMAR, FACA, and ECARG2 algorithms derive 19, 240, 13, and 18 CARs on average, respectively. The accomplishment of the proposed algorithm is the discovery of the most efficient CAR in each iteration and the elimination of unnecessary transaction IDs that leads to redundant CARs.


**Table 11.** The average number of generated rules on the UCI datasets.

Table 12 shows the average classifier building time of the proposed algorithm against CBA, CMAR, and FACA. The experimental result clearly shows that our algorithm is the fastest among all algorithms in the 14 datasets. ECARG takes fewer seconds to construct the classifier than CBA, CMAR, FACA, and ECARG2 by 2.134, 0.307, 2.883, and 0.0162, respectively. This can be explained by the fact that CBA and FACA uses an Apriori-style approach to generate candidates. When the value for minimum support is low on large datasets, it is costly to handle a large number of candidate ruleitems. The CMAR algorithm based on FP-growth is better than CBA and FACA in some cases, but it takes more classifier-generating time than ECARG and ECARG2.


**Table 12.** The classifier building time in seconds.

Table 13 reveals the memory consumption in the classifier building process of all 5 algorithms. The results show that ECARG consumes less memory than CBA, CMAR, FACA, and ECARG2 by 22.62 MB, 73.15 MB, 36.57 MB, and 0.98 MB, respectively. The memory consumption of ECARG is the best since it eliminates unnecessary data in each iteration. From the result in Table 14, our proposed algorithm gives a higher F-measure on average than the other algorithms. In particular, the ECARG2 outperformed CBA, CMAR, FACA, and ECARG by 3.82%, 25.38%, 25.38%, 12.74%, and 1.84%, respectively.

**Table 13.** The classifier building memory consumption in megabytes.


**Table 14.** F-measure of Classification-based Association (CBA), Classification based on Multiple Association Rules (CMAR), Fast Associative Classification Algorithm (FACA), ECARG, and ECARG2.


Table 15 shows standard deviations of accuracy rate, the number of generated rules, building times, memory consumption, and F-measure of ECARG. The standard deviation values of building time and memory consumption are low and show that the building time and memory consumption in each fold is approximately marginal. The standard deviation values of the number of generated rules are relevant.

The standard deviation values of accuracy rates and F-measure show that the values of accuracy rates and F-measure in each fold are marginally different on almost all datasets. However, when evaluating the small datasets, Contact-lenses, Labor, Lymph, and Post-operative, the standard deviation values are high because 10-fold cross-validation splits a very small testing set that can potentially affect the efficiency of the classifier. For example, the Contact-lenses dataset composes only 2 or 3 transactions in each testing set. Consequently, only one false classification occurs in the testing set and then reduces the accuracy rate dramatically.


**Table 15.** Standard deviations of ECARG.

From the experimental results, the ECARG algorithm outperforms CBA, CMAR, and FACA in terms of accuracy rate and the number of generated rules. A key achievement of the ECARG algorithm is that the technique generates valid rules with 100% confidence to build classifiers. The high confidence demonstrates the high possibility of class occurrences occurring in an itemset. Therefore, the ECARG algorithm produces a small classifier but gives high accuracy. While the CBA, CMAR, and FACA algorithms build classifiers from CARs that meet the minimum confidence threshold, some of the CARs have low confidences so they may predict incorrect classese and then the accuracies of CBA, CMAR, and FACA are lower than the proposed algorithm in the most dataset.

Moreover, ECARG outperforms the others in terms of building time and memory consumption. This key achievement applies simple set theories, i.e., intersection and set difference, processing on vertical data, which can potentially reduce time and memory consumption. Furthermore, the search space can be reduced as unnecessary transactions are eliminated in each stage and, therefore, the classifier building time is minimized.

#### **6. Conclusions**

This paper proposes algorithms to enhanced associative classification. Unlike the traditional algorithms, the proposed algorithms do not need a sorting and pruning process. Candidate generation is carried out by attempting to select a first general rule with the highest accuracy. Moreover, a search space is reduced early by cutting down items with low statistical significance. Furthermore, a vertical data format, intersection, and set difference methods are applied to calculate support and confidence and to remove unnecessary transaction IDs, decreasing computation time and memory consumption.

The experiments were conducted on 14 UCI datasets. The experimental results show that the ECARG algorithm outperforms the CBA, CMAR, and FACA algorithms in terms of accuracy by 4.78%, 17.79%, and 1.35%, respectively. Furthermore, ECARG generates smaller rules than the other algorithms in almost all datasets. In addition, ECARG results in the most optimal classifier-generating time and memory usage on average. We can conclude that the proposed algorithm gives a compact classifier with a high accuracy rate, improves computation time, and reduces memory usage.

However, the ECARG algorithm does not well perform on imbalanced datasets, such as Breast, Car, Diabetes, and Post-operative. This is because the ECARG algorithm tends to find 100% confidence CARs and to eliminate unnecessary transactions. Therefore, ruleitems belonging to minority classes will not meet the minimum support threshold or 100% confidence and they are eliminated accordingly. Consequently, the classifier cannot classify the minority class correctly.

**Author Contributions:** Methodology, C.T.; supervision, P.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was financially supported by Mahasarakham University (Grant year 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Algorithms* Editorial Office E-mail: algorithms@mdpi.com www.mdpi.com/journal/algorithms

MDPI St. Alban-Anlage 66 4052 Basel Switzerland

Tel: +41 61 683 77 34 Fax: +41 61 302 89 18