# **Knowledge Engineering and Data Mining**

Edited by Agnieszka Konys and Agnieszka Nowak-Brzezińska Printed Edition of the Special Issue Published in *Electronics*

www.mdpi.com/journal/electronics

## **Knowledge Engineering and Data Mining**


Editors

**Agnieszka Konys, Agnieszka Nowak-Brzezińska**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors*

Agnieszka Konys, West Pomeranian University of Technology Szczecin, Szczecin, Poland

Agnieszka Nowak-Brzezińska, University of Silesia, Sosnowiec, Poland

*Editorial Office* MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Electronics* (ISSN 2079-9292) (available at: https://www.mdpi.com/journal/electronics/special_issues/Knowledge_Data_Mining).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-6788-4 (Hbk) ISBN 978-3-0365-6789-1 (PDF)**

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.


## **About the Editors**

#### **Agnieszka Konys**

Agnieszka Konys is an Assistant Professor in the Faculty of Computer Science and Information Technology at the West Pomeranian University of Technology in Szczecin, Poland (since 2013). Her research studies the topics of ontology, knowledge representation methods, semantic web technologies, knowledge management and reasoning, and sustainability assessment.

#### **Agnieszka Nowak-Brzezińska**

Agnieszka Nowak-Brzezińska is an Associate Professor in the Faculty of Science and Technology at the University of Silesia in Katowice, Poland (since 2002). In 2019, she was awarded the academic degree of Habilitated Doctor of Philosophy in Engineering and Technology Science in the field of information and communication technology by the Polish Academy of Sciences. Her research studies the topics of outlier detection algorithms, clustering algorithms for complex data structures, knowledge engineering, and information retrieval systems.

### *Editorial* **Knowledge Engineering and Data Mining**

**Agnieszka Konys <sup>1</sup> and Agnieszka Nowak-Brzezińska <sup>2,</sup>\***


Knowledge engineering and data mining are the two biggest pillars of modern intelligent systems. Knowledge induction from data is often based on a wide range of machine learning algorithms and feature selection or extraction algorithms. When we collect various data types, we need solutions that will allow us to supervise these data correctly. Recently, machine-learning-based methods have been increasingly employed to solve such problems; however, selecting an appropriate feature selection technique, sampling mechanism, and/or classifier for building decision support systems is very challenging. To address this task, article [1] examines the effectiveness of various data science techniques for credit decision support. In particular, a processing pipeline was designed that consists of methods for data resampling, feature discretization, feature selection, and binary classification.

The capability of machine learning to discover hidden patterns in large datasets encourages researchers to create data with high-dimensional features. However, not all features are needed by machine learning, and, in many cases, high-dimensional features decrease its performance. The research presented in paper [2] investigates and proposes methods to determine the best feature selection method in the domain of psychosocial education.

Recommendation systems are powerful tools that are integral parts of a great many websites. Most often, recommendations are presented in the form of a list that is generated by using various recommendation methods. Typically, however, these methods do not generate identical recommendations, and their effectiveness varies between users. In order to solve this problem, the application of aggregation techniques was suggested in article [3], the aim of which is to combine several lists into one, which, in theory, should improve the overall quality of generated recommendations.
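The idea of combining several recommendation lists into one can be illustrated with a minimal sketch. The text does not say which aggregation techniques article [3] evaluates, so the Borda count used here is an assumption chosen for simplicity, and the function name `borda_aggregate` and the item labels are hypothetical.

```python
from collections import defaultdict


def borda_aggregate(ranked_lists):
    """Merge several ranked lists into one using a Borda count.

    In a list of length n, the item at position pos (0-based) earns n - pos
    points. Items missing from a list earn nothing from that list.
    """
    scores = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three hypothetical recommenders ranking items for the same user.
lists = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
merged = borda_aggregate(lists)
```

In this toy case, item `b` is ranked first by two of the three lists and therefore leads the merged list.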

Ontologies, and especially formal ones, have traditionally been investigated as a means with which to formalize an application domain, so as to carry out automated reasoning on it. The union of the terminological part of an ontology and the corresponding assertional part is known as a knowledge graph. On the other hand, database technology has often focused on the optimal organization of data, so as to boost efficiency in their storage, management, and retrieval. Graph databases are a recent technology that specifically focuses on element-driven data browsing rather than on batch processing.

Paper [4] proposes an intermediate format that can be easily mapped onto a formal ontology on the one hand, so as to allow complex reasoning, and onto a graph database on the other, so as to benefit from efficient data handling. Selecting the right supplier is a critical decision in sustainable supply chain management. Paper [5] proposes and implements an ontology-based approach for knowledge acquisition from the text for a sustainable supplier selection domain. This approach is dedicated to acquiring complex relationships from texts and coding these in the form of rules.

Whenever we need to analyze big data we need to do it effectively, with the shortest possible time and the highest possible accuracy. If we deal with multidimensional data that are computing-intensive, applications should be parallelized and run on modern multicore machines to reduce the execution time. In paper [6] the authors demonstrate how to apply an affine transformation framework and generate parallel 2D tiled code computing GLREs (general linear recurrence equations).

**Citation:** Konys, A.; Nowak-Brzezińska, A. Knowledge Engineering and Data Mining. *Electronics* **2023**, *12*, 927. https://doi.org/10.3390/electronics12040927

Received: 6 February 2023 Accepted: 8 February 2023 Published: 13 February 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The most popular classification techniques are decision trees, k-nearest neighbor classifiers, naive Bayes classifiers, and neural networks. A very interesting approach is presented in paper [7], where the study developed an autocorrect system for UAV smoke tracing. An AI model was used to calculate smoke tube angle corrections, such that smoke tube angles could be immediately corrected when smoke is sprayed.

Another interesting approach was presented in [8]. The exploration of oil and gas in offshore regions is increasing due to global energy demand. The weather in offshore areas is truly unpredictable due to the sparsity and unreliability of metocean data. Using metocean data, offshore wave height and period are predicted from the wind speed by three state-of-the-art machine learning algorithms (an artificial neural network, a support vector machine, and random forest).

Another interesting study is presented in [9], where the authors introduce an original concept for classifying types of project tasks, which will allow for the more beneficial use of collected data in management support systems in the IT industry. The classification algorithms presented in the article are based on the manual recognition of task types. Rules based on keywords are then created, which allow task types to be recognized automatically at subsequent occurrences. This enables the fully automated, real-time operation of a task and subtask classification algorithm and, finally, comprehensive support for the management of the development process.

Identifying possible drug–drug interactions (DDIs) has always been a crucial research topic in the field of clinical pharmacology. Paper [10] describes a knowledge-mining method in which the authors propose a novel graph-convolutional-network-based approach for mining knowledge of interactions between drugs from the extensive literature.

A convolutional neural network is also used by the authors of paper [11], in which a neural network helps to discern the morphological information hidden in Chinese characters and a pretrained model obtains vectors with medical features. The different vectors are stitched together to form a multi-feature vector. Deep learning requires a large amount of annotated data to train the model, as does the proposed model, but large-scale annotated data in the Chinese electronic medical record domain require medical experts for annotation, which can be time-consuming.

The healthcare sector is one of the most sensitive sectors in our society, and it is believed that the application of specific and detailed database creation and design techniques can improve the quality of patient care. In this sense, the better management of emergency resources should be achieved. Paper [12] presents an optimized database designed for emergency care. The general objective of the project was to create a database that was as complete as possible and with a great diversity of information, which would represent, in detail, all possible aspects of emergency health activity. A multi-model database allowed for the exploitation of information with predictive models.

Knowledge delivery has recently been explored extensively. The reason for this is the post-COVID-19 era in university education, in which instructors around the world have been at the forefront of implementing hybrid learning spaces for knowledge delivery. The study presented in paper [13] aims not only to repurpose a YouTube channel as a tool for supporting asynchronous teaching, but also to provide feedback to instructors and suggest steps and actions to implement in their teaching modules to ensure students' access to new knowledge while promoting their engagement and satisfaction, regardless of the learning environment, i.e., face-to-face, distance, or hybrid. By analyzing and interpreting data taken directly from YouTube channel reports, six variables were identified and tested to quantify the lack of statistically significant changes in learners' viewing habits.

In facial aesthetics, soft-tissue landmark recognition and linear as well as angular measurements play critical roles in treatment planning. Visual identification and judgment by hand are time-consuming and prone to errors. As a result, user-friendly software solutions are required to assist healthcare practitioners in improving treatment planning. Paper [14] presents "A Computational Tool for Detection of Soft Tissue Landmarks and Cephalometric Analysis". The goal of the authors is to create a computational tool that may be used to identify and save critical landmarks from patient X-ray pictures. The second goal is to create automated software that can assess the soft-tissue facial profiles of patients in both linear and angular directions by using the landmarks that have been identified.

Using the variety of techniques available to support decisions requires deep knowledge of their advantages and disadvantages, especially when we need to deal with multicriteria tasks. Multicriteria methods have gained traction in academia and industry practices for effective decision making. Paper [15] provides a complete overview of multicriteria methods through a bibliometric study, enabling scholars to comprehend the current state and future development patterns of research on multicriteria decision-making methods.

We believe that this Special Issue covers the entire knowledge engineering pipeline: from data acquisition and data mining to knowledge extraction and exploitation. For this reason, we tried to gather the many researchers operating in the field to contribute to a collective effort in understanding the trends and future questions in the fields of knowledge engineering and data mining.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Article* **Credit Decision Support Based on Real Set of Cash Loans Using Integrated Machine Learning Algorithms**

**Paweł Ziemba <sup>1,</sup>\*, Jarosław Becker <sup>2,</sup>\*, Aneta Becker <sup>3</sup>, Aleksandra Radomska-Zalas <sup>2</sup>, Mateusz Pawluk <sup>4</sup> and Dariusz Wierzba <sup>5</sup>**


**Abstract:** One of the important research problems in the context of financial institutions is the assessment of credit risk and the decision whether to grant or refuse a loan. Recently, machine-learning-based methods have been increasingly employed to solve such problems. However, the selection of an appropriate feature selection technique, sampling mechanism, and/or classifier for credit decision support is very challenging and can affect the quality of the loan recommendations. To address this challenging task, this article examines the effectiveness of various data science techniques in credit decision support. In particular, a processing pipeline was designed, which consists of methods for data resampling, feature discretization, feature selection, and binary classification. We suggest building appropriate decision models leveraging pertinent methods for binary classification, feature selection, as well as data resampling and feature discretization. The feasibility of the selected models was analyzed through rigorous experiments on real data describing the client's ability to repay a loan. During the experiments, we analyzed the impact of feature selection on the results of binary classification, and the impact of data resampling with feature discretization on the results of feature selection and binary classification. After experimental evaluation, we found that the correlation-based feature selection technique and the random forest classifier yield superior performance in solving the underlying problem.

**Keywords:** credit scoring; cash loans; machine learning; decision model; classification; feature selection; resampling; discretization

#### **1. Introduction**

Nowadays, banks and financial institutions carefully analyze the credit risk of their clients [1]. The current world situation, i.e., the COVID-19 pandemic, not only affects people's lives but also has a negative impact on economic factors, especially those related to the payment of liabilities by potential borrowers [2]. For this reason, such organizations need credit scoring systems [1] in order to select the most promising clients to work with and to offer them well-tailored services. These models are particularly suited to financial institutions because they can assign a numerical score to each individual customer, which determines the probability of loan repayment [3]. On this basis, the final decision is made as to whether granting a loan is justified. Most often, credit risk is assessed on the basis of historical data, using mainly statistical or machine learning methods [4], among them, e.g., rough sets [5], usually combined with probability theory [6], fuzzy sets [7], decision trees [8], neural networks and support vector machines [9], or genetic algorithms [10].

**Citation:** Ziemba, P.; Becker, J.; Becker, A.; Radomska-Zalas, A.; Pawluk, M.; Wierzba, D. Credit Decision Support Based on Real Set of Cash Loans Using Integrated Machine Learning Algorithms. *Electronics* **2021**, *10*, 2099. https://doi.org/10.3390/electronics10172099

Academic Editor: Jaime Lloret

Received: 5 August 2021 Accepted: 26 August 2021 Published: 30 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Of particular importance in credit scoring problems are classification models that play the role of decision models [11], usually supported by feature selection, data resampling, and feature discretization methods [12]. Numerous publications describe applications of these techniques [1–4,13–18]. Selecting a relevant feature subset can reduce the computational burden and significantly improve model efficiency and understandability [19]. Moreover, credit scoring models may be sensitive to dataset imbalance, i.e., a situation in which the numbers of positive and negative cases are not equal; in that situation, their overall performance may be improved by data resampling [20]. The use of discretization may also have a positive impact on credit scoring models by increasing the efficiency of certain classification algorithms [21]. Unfortunately, the literature on credit scoring lacks research in which all of the indicated techniques (feature selection, resampling, discretization, classification) are used together in a single process of dataset processing and classification model building. Given this research gap, the question arises whether the combined use of the indicated methods and techniques in dataset processing will increase the effectiveness of classification models.
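Two of the preprocessing steps mentioned above, resampling an imbalanced dataset and discretizing a numeric feature, can be sketched as follows. This is a minimal illustration only: random oversampling and equal-width binning are assumed here as simple representatives, and nothing in this passage indicates which resampling or discretization methods the authors actually used. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy data: four repaid loans (0) and one default (1).
rows = [[1], [2], [3], [4], [5]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)

# Hypothetical loan amounts discretized into three bins.
amount_bins = equal_width_bins([1000, 2500, 4000, 10000], n_bins=3)
```

Oversampling only duplicates existing minority cases, so it should be applied to the training portion of the data alone; otherwise copies of the same case can appear in both training and test sets.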

The aim of this article is to analyze the effectiveness of various classification models in supporting credit decisions. Contribution includes:


It is important to note that the presented research is a significant extension of earlier works in which we examined only selected classifiers and feature selection methods [22] as well as the rough set approach [23].

Section 2 discusses the problem of credit risk assessment and reviews the literature on the subject. Section 3 presents a review of the methods for classification, feature selection, data resampling, and feature discretization incorporated in the study, as well as proven measures for the assessment of classification models. Section 4 contains a description and explanation of the adopted test procedure. The general results of the research are included in Section 5, while the more detailed results are included in Appendices A–G. The paper is summarized with conclusions and proposals for further research presented in Section 6.

#### **2. Literature Review**

Authors dealing with financial issues are often interested in credit risk, generally defined as the risk that a business partner does not fully meet its obligations on time or avoids them altogether [24]. Credit risk can also be understood as the risk of changes in the value of a company's equity as a result of changes in the creditworthiness of its debtors. In recent years, a lot of attention has been paid to methods and algorithms for assessing financial credit risk. This was due, among other things, to the occurrence of global financial crises, but also to the need for a thorough assessment of such threats and for forecasting business failures. It should be added that the above-mentioned factors have an impact on the functioning of the economy and on the financial decisions made by societies [25].

Because financial credit risk is a risk related to financing, its assessment is aimed at solving two categories of problems: credit rating or scoring, and predicting the bankruptcy or forecasting a financial crisis of enterprises. Historically, research on financial credit risk assessment was initiated in the 1930s [26] and continued over the years with considerable success in the 1960s [27]. Nowadays, apart from taking into account the achievements obtained with the use of traditional statistical methods, the research focuses primarily on the use of advanced machine learning methods. This approach, without the need to follow strict assumptions, results in an improvement in the accuracy of the results obtained in a conventional manner. At the same time, it is impossible to indicate a single effective method that is superior to the others. The most recently used intelligent techniques include: artificial neural networks (ANNs), fuzzy set theory (FST), decision trees (DTRs), case-based reasoning (CBR), support vector machines (SVMs), rough set theory (RST), genetic programming (GP), hybrid learning, and ensemble computing [25].

The traditional approach to credit risk assessment focuses on obtaining the optimal linear combination of the input explanatory variables. It is expected that these variables will make it possible to model, analyze, and predict the risk of corporate insolvency. Such models are widely used because of their popularity, but it is noted, for example, that they do not take into account complex relationships between variables. Statistical models used to assess credit risk include, among others, linear discriminant analysis (LDA), logistic regression (LR), multivariate discriminant analysis (MDA), quadratic discriminant analysis (QDA), factor analysis (FA), risk index models, and conditional probability models [25]. Works pointing to the dominance of statistical methods over other approaches include [28,29].

The group of methods that combine the traditional and intelligent approaches are semiparametric methods, which are characterized by greater flexibility of the model structure, a clear interpretation of the modelled process, and greater accuracy. More information on this can be found in [30,31]. In the literature on the subject, there are many interesting combinations of parametric, non-parametric, and semi-parametric models, for example, the Klein and Spady model [32], the Logit model, and the CART model [33]. Another proposal is the integration of a parametric binary logistic regression model (BLRM) and non-parametric models (e.g., SVM, DTR) [34].

Many publications report good results obtained with the use of artificial neural networks [35–37]. The feature of networks that makes them useful for the assessment of credit risk is the ability to process non-linear data and approximate most functions. In this way, internal patterns can be found in complex financial data [38]. There are also some limitations to their use, such as the difficulty of explaining a black-box algorithm, time-consuming learning, no guarantee of optimal solutions, and overfitting to the training data.

Another proposal for credit risk assessment is the SVM, which maps input vectors that are not linearly separable into a higher-dimensional feature space. This is done by means of kernel functions, so that the data can be separated in that space by a linear model. The interest in SVMs is due to their good performance and the possibility of generalizing from a small set of high-value data [39]. Their effectiveness is noticeable when the input data are non-linear and non-stationary, which results in obtaining models supporting credit decisions [40].
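The role of the kernel function can be illustrated with a small worked example. The sketch below assumes a homogeneous degree-2 polynomial kernel on two-dimensional inputs, chosen only because its feature map is easy to write out explicitly; the cited works may use other kernels. It checks that the kernel value equals an inner product in a three-dimensional feature space and that XOR-like points, which no straight line separates in the input plane, become linearly separable there.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


# The kernel computes the feature-space inner product without building phi.
same_value = math.isclose(poly2_kernel((1, 2), (3, 4)), dot(phi((1, 2)), phi((3, 4))))

# XOR-like points: no straight line in the input plane separates the classes.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [+1, +1, -1, -1]

# In feature space the middle coordinate (sqrt(2) * x1 * x2) separates them.
separable = all((phi(p)[1] > 0) == (y > 0) for p, y in zip(points, labels))
```

An SVM never needs to compute `phi` explicitly; it only evaluates the kernel on pairs of training points, which is what makes high-dimensional feature spaces affordable.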

The classical classification approach is represented by decision trees. In the case of credit risk, their usefulness results from: easy interpretation of the obtained results, non-linear estimation, non-parametric form, accuracy, possibility of application in the case of continuous and categorical variables, as well as the indication of significant variables. In the discussed field, for example, ID3, C4.5, CART, CHAID, MARS, ADTree [33] can be used.

The literature on the subject [25] also notes the use of CBR for credit risk. This approach makes it possible to solve a problem by recalling similar past cases. It is based on the principle of k-nearest neighbors (kNN), which in the case of classification assigns the object being classified to the class to which most of its k nearest neighbors belong. CBR is suggested for small data sets, although it is less precise than other methods used for this type of problem, and improvements to it have been proposed [41].
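The majority-vote rule described above can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and unweighted votes; the feature values and class names are hypothetical toy data, not taken from the study.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority class among the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two well-separated groups of hypothetical past cases.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["repaid", "repaid", "repaid", "default", "default", "default"]

near_first_group = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
near_second_group = knn_predict(train_x, train_y, (4.9, 5.0), k=3)
```

Choosing an odd `k` for a two-class problem avoids tied votes; in practice the features would also be scaled so that no single one dominates the distance.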

There have been many interesting publications on credit risk assessment recently. In their work, Wang et al. (2020) [42] presented the results of a study on the assessment of credit risk in the supply chain of commercial banks online. The authors used the literature induction method, the non-linear LS-SVM model and compared the obtained results with the results of the logistic regression model. They found that the LS-SVM evaluation model had a higher classification accuracy than the logistic regression model. In addition, they found that it has a strong generalization capacity and can comprehensively identify credit risk and provide sound, scientific analysis, and is an effective tool supporting the credit risk assessment of small and medium-sized enterprises.

The article by Arora and Kaur (2020) [43], which confirmed the usefulness of modern data mining and machine learning techniques, is also worth mentioning. According to the authors, these methods show precision in predicting credit risk and support taking appropriate decisions. Bolasso (Bootstrap-Lasso) was used in the research. In order to test the predictive accuracy, the functions obtained by Bolasso were applied to the following classification algorithms: Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB), and kNN. The authors concluded that the Random Forest algorithm (BS-RF) with Bolasso enabled provides the best credit risk assessment results.

Other conclusions were reached by Froelich and Hajek (2019) [44], who proposed in their previous studies to automate credit risk assessment by using systems based on machine learning methods. The authors concluded that the obtained results are difficult to interpret and do not fully take into account the expert knowledge. In the next step, they applied multi-criteria group decision making methods (MCGDM) to simulate the assessment process performed by a team of credit risk experts. According to the authors, standard MCGDM methods do not take into account high uncertainty and are not effective in the case of a significant impact of the assessed credit risk criteria. Therefore, they proposed an MCGDM model that combines fuzzy sets and fuzzy cognitive maps with the traditional TOPSIS approach. In turn, Heidary Dahooie et al. (2021) [45] proposed a combination of Data Envelopment Analysis (DEA) with the dynamic multi-attribute decision-making method (DMADM), considering it an innovative dynamic decision-making method for assessing loan applications. The credit performance criteria were distinguished on the basis of a literature review and expert opinion. In contrast, the criteria weights were calculated using the dynamic approach to the common set of DEA weights. Then, candidates were prioritized using five Gray MADM methods (including SAW-G, VIKOR-G, TOPSIS-G, ARAS-G and COPRAS-G). In the final study, a new method called the correlation coefficient and standard deviation (CCSD) was used to determine the aggregate rank.

In the summary of the review of credit risk assessment methods, it should be added that in recent years, in line with the observations of Bellacos (2018) [46], efforts to improve the traditional approach to credit scoring have not always been successful. Compared to traditional credit models, the data used in the new credit models is much more precise, comprehensive and holistic. These data, combined with modern machine learning (ML) algorithms and artificial intelligence (AI), provide much better calibrated risk assessment models. On the other hand, when comparing ML and AI methods with expert credit risk assessment, it should be noted that modern methods take into account many more decision-making factors than a human can do. The expert has knowledge based on his previous experience, but classification models have much more knowledge. The knowledge of classifiers is also based on previous experiences, in this case written as a set of training cases, but their ability to process information is much greater than that of an expert who has limited perception. Moreover, ML methods, unlike humans, do not get tired, do not get sick, etc. Additionally, in the literature, the advantage of machine learning and data mining methods over expert assessment in complex problems requiring the processing of many data is noticed [47]. On the other hand, there are still areas where the expert outweighs ML and AI methods [48].

The banking sector already has some characteristics such as: advanced computerization (available computing power, modern analytical tools), large amounts of transaction data, financial history of customers, which make it the preferred field for implementing credit risk assessment models based on machine learning and artificial intelligence. The content of the Digital Banking report (2021) [49] presenting current trends and priorities in retail banking shows that most banking institutions know what is needed, and many of them even know how to face the current challenges. The problem, however, is that current banking standards keep organizations from doing this. In the area of credit decisions, this applies to solutions with a very complicated, difficult or even impossible explanation mechanism. An example is neural networks seen as black boxes. What is happening inside such a network cannot be fully explained. Banks in Poland refuse to use such tools, as it is difficult to justify a specific credit decision made on their basis before the Polish Financial Supervision Authority (PFSA). PFSA is sympathetic to traditional scoring and other methods whose results are intuitive, easily interpreted, and easy to argue and explain.

#### **3. Materials and Methods**

#### *3.1. Classification Methods*

Machine learning can be used for various tasks, including classification problems, which consist in predicting the membership of an object in a certain class on the basis of well-defined characteristics of this object. Usually, an object is classified on the basis of earlier training of the classifier, during which the classification algorithm attempts to "learn" the real classes of the training objects and the features that determine whether objects belong to specific classes [47,50]. Classification methods include, e.g., the C4.5 decision tree (C4.5), random forest (RF), decision table (DT), naive Bayes (NB) classifier, logistic regression (LR), and the k-nearest neighbors (kNN) algorithm. The characteristics of selected classification methods are presented in Table 1.


**Table 1.** Characteristics of selected classification methods.



#### *3.2. Feature Selection Methods*

One of the basic issues in the classification task is the multidimensionality of the objects to be assigned to a specific class. This is a serious obstacle that decreases the accuracy of classification algorithms, known as the "curse of dimensionality" [66]. Reducing the dimensionality of the feature space lowers the computational and data collection costs, which eventually improves predictions [67]. The tools used for this task are called feature selection methods.

The feature selection process focuses on identifying the relevant features in a dataset and rejecting the redundant ones [68]. For this purpose, various algorithms are used to assess the importance of particular features in the classification task. Feature selection methods are divided into three categories: filters, wrappers, and embedded methods [69]. Filters and wrappers usually consist of four elements (steps): generation of a feature subset, evaluation of the subset, a stopping criterion, and result validation [70]. Describing the individual elements of the feature selection methods makes it possible to point out significant differences between these groups of methods.

Filters are based on independent evaluation of features using general data characteristics. For example, Pearson correlation coefficients between each input and the selected output can be used. The feature subset is determined before training the machine learning algorithm, by defining either a threshold for the minimum correlation value or a particular number of features to be selected [71].
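The correlation-based filter described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function names, the feature names, the toy values, and the threshold of 0.5 are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def filter_by_correlation(features, target, threshold=0.5):
    """Keep the names of features whose |r| with the target reaches the threshold.

    `features` maps a (hypothetical) feature name to its column of values.
    """
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]

# Toy data: names and values are illustrative only.
features = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, 1.0, 2.0, 1.0, 2.0],
    "debt": [5.0, 4.0, 3.0, 2.0, 1.0],
}
target = [0, 0, 1, 1, 1]
selected = filter_by_correlation(features, target, threshold=0.5)
```

Because the filter looks only at the data, the selection is made once, before any classifier is trained.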

Wrappers evaluate individual feature subsets using the machine learning algorithm that will eventually be used in the classification or regression task. In this case, the training algorithm is included in the feature selection procedure; therefore, cross-validation based on the set of training cases is usually used to estimate the accuracy of the classifier using a specific feature subset [72].
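A wrapper can be illustrated with a greedy forward search whose subset evaluation is the cross-validated accuracy of the classifier itself. The sketch below is only an assumption-laden example: it uses a simple nearest-centroid classifier as a stand-in for the models considered in the study, and all function names and the toy data are hypothetical.

```python
def nearest_centroid_accuracy(train, test, cols):
    """Fit a nearest-centroid classifier on `train`; return its accuracy on `test`.

    Each row is (feature_vector, label); only the columns in `cols` are used.
    """
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(cols))
        for i, c in enumerate(cols):
            acc[i] += x[c]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        return min(centroids, key=lambda y: sum(
            (x[c] - m) ** 2 for c, m in zip(cols, centroids[y])))

    return sum(predict(x) == y for x, y in test) / len(test)

def cv_accuracy(rows, cols, k=3):
    """Mean accuracy over k folds (fold i holds out every k-th row from i)."""
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        scores.append(nearest_centroid_accuracy(train, test, cols))
    return sum(scores) / k

def forward_selection(rows, n_features, k=3):
    """Greedily add the feature that most improves cross-validated accuracy."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [c for c in range(n_features) if c not in selected]
        score, col = max((cv_accuracy(rows, selected + [c], k), c)
                         for c in candidates)
        if score <= best:
            break  # stopping criterion: no candidate improves the estimate
        selected.append(col)
        best = score
    return selected, best

# Toy data: column 0 separates the classes, column 1 is noise.
rows = [([0.1, 5], 0), ([0.2, 1], 0), ([0.0, 3], 0),
        ([0.9, 4], 1), ([1.0, 2], 1), ([0.8, 6], 1)]
selected, best = forward_selection(rows, n_features=2)
```

The four elements named earlier are all visible here: subset generation (the candidate loop), subset evaluation (`cv_accuracy`), a stopping criterion, and validation on held-out folds.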

Embedded methods are similar to wrappers in that they use classification to perform the task of feature selection. The main difference between wrappers and embedded methods is the "embedding" of the selection procedure into the selected classifier. In other words, the dimensionality of the training objects subject to classification is reduced while the classifier model is being built [73]. For instance, in decision trees, unnecessary features are eliminated by pruning and by defining the minimum number of objects in a node.

Wrappers differ only in the applied machine learning algorithms, so, as in the case of embedded methods, the results obtained with them depend solely on the quality of the machine learning algorithm and on how well the algorithm fits a specific classification task. Wrappers and embedded methods analyze the features of the objects in the training set only in terms of obtaining the maximum number of correct classifications, omitting other characteristics of the features. Meanwhile, the general characteristics of the features seem important enough that they should affect the selection of the individual features that describe the training and test cases. Therefore, filtering procedures that determine the significance of individual attributes using measures other than the classifier's accuracy seem more interesting. Filter methods use various measures to assess the relevance of each feature, e.g., distance functions and different correlation measures.

A popular filter technique that uses a distance function is ReliefF [74]. The most numerous group of filters, however, consists of correlation procedures, among which the most promising are Symmetrical Uncertainty (SU) [75], Correlation-based Feature Selection (CFS) [76], Fast Correlation-Based Filter (FCBF) [77], and Significance Attribute (SA) [78]. The basic characteristics of each method are presented in Table 2.


**Table 2.** Characteristics of selected feature selection methods.



#### *3.3. Resampling Methods*

In binary classification, when the classes in the training set are unbalanced, i.e., the class distribution is strongly skewed, conventional classifiers that maximize accuracy usually build models that tend to classify all objects as belonging to the majority class. This results in low accuracy for the minority class, whose objects are underrepresented in the training set, even though this class is often of the utmost importance [84]. To overcome this issue, **resampling** methods are commonly applied to the training set. The two most popular techniques in machine learning, which are also very simple, are random undersampling and random oversampling [20]. Another interesting approach is the Synthetic Minority Over-sampling Technique (**SMOTE**) [85]. Table 3 lists the main advantages and disadvantages of each of these approaches.

**Table 3.** Characteristics of selected resampling methods.
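The two random resampling techniques described in this subsection can be sketched in a few lines. The helper names, the fixed seed, and the toy dataset (eight majority rows, two minority rows) are assumptions made for illustration; SMOTE, which synthesizes new minority points rather than copying existing ones, is not shown.

```python
import random
from collections import Counter

def split_by_class(rows):
    """Group (features, label) rows by their label."""
    groups = {}
    for features, label in rows:
        groups.setdefault(label, []).append((features, label))
    return groups

def random_undersample(rows, seed=0):
    """Randomly drop rows of larger classes until all match the smallest class."""
    rng = random.Random(seed)
    groups = split_by_class(rows)
    target = min(len(g) for g in groups.values())
    return [row for g in groups.values() for row in rng.sample(g, target)]

def random_oversample(rows, seed=0):
    """Randomly duplicate rows of smaller classes until all match the largest class."""
    rng = random.Random(seed)
    groups = split_by_class(rows)
    target = max(len(g) for g in groups.values())
    return [row for g in groups.values()
            for row in g + [rng.choice(g) for _ in range(target - len(g))]]

# Toy imbalanced training set: 8 rows of class 0, 2 rows of class 1.
rows = [([i], 0) for i in range(8)] + [([i], 1) for i in range(2)]
under = Counter(label for _, label in random_undersample(rows))
over = Counter(label for _, label in random_oversample(rows))
```

Undersampling discards majority rows and so depends on which rows are drawn, whereas oversampling keeps every original row but repeats minority rows.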


#### *3.4. Discretization Methods*

Some classification algorithms improve their performance when feature discretization is used. Moreover, certain classifiers cannot work without data discretization. Discretization methods bin continuous features, dividing them into ranges or intervals, which converts numerical data to nominal data. The main issue with feature discretization is the appropriate choice of cutpoints, because continuous data can be discretized in an infinite number of ways. A perfect discretization method should find a relatively small number of cutpoints that divide the data into relevant bins. Discretization techniques are either supervised or unsupervised. Supervised methods generally give better results than unsupervised ones, because they use the class to which each object belongs as additional information. A great number of methods perform discretization based on class entropy, which is a measure of uncertainty over a finite set of classes. Entropy is calculated for different splits and compared to the entropy of the dataset without splits. The procedure is run recursively until the stopping criterion is met [86]. For instance, the heuristic Minimal Description Length Principle (MDLP) method can be used. This technique determines whether or not to accept the current cutpoint candidate, stopping the recursion if the specified condition is not met [87]. Entropy-based discretization with the MDLP stopping criterion is considered to be one of the best supervised discretization methods [71]. It measures the information gain of a possible cutpoint by comparing entropy values: for each considered cutpoint, the entropy of the input interval is compared to the weighted sum of the entropies of the two output intervals. There are several different criteria for the MDLP stopping condition, including the Fayyad criterion [88] and the Kononenko criterion [89].
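One step of entropy-based discretization can be sketched as below: find the cutpoint with the highest information gain, then apply an MDLP acceptance test. The test shown is the commonly cited Fayyad and Irani form, written from the general literature rather than taken from this article, and the Kononenko variant is not shown. Function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cutpoint(values, labels):
    """Return (cut, gain, left_labels, right_labels) for the best binary split."""
    pairs = sorted(zip(values, labels))
    all_labels = [y for _, y in pairs]
    base, n = entropy(all_labels), len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cutpoint can be placed between equal values
        left, right = all_labels[:i], all_labels[i:]
        # Weighted sum of the entropies of the two output intervals.
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = base - weighted
        if best is None or gain > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, gain, left, right)
    return best

def mdlp_accepts(labels, left, right, gain):
    """MDLP test (Fayyad-Irani form): accept the cut only if gain exceeds the cost."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

# Toy feature whose low values belong to class 0 and high values to class 1.
values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
cut, gain, left, right = best_cutpoint(values, labels)
accepted = mdlp_accepts(labels, left, right, gain)
```

In a full discretizer, an accepted cut would be followed by the same search on each of the two resulting intervals, and a rejected cut would end the recursion for that interval.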

#### *3.5. Classification Evaluation Metrics*

The quality of classification can be evaluated using, e.g., the Receiver Operating Characteristic (ROC) curve, the Area Under the Receiver Operating Characteristic curve (AUROC), and the Gini coefficient (GC). Another interesting measure is the Precision-Recall Curve (PRC).

ROC is a graphical representation of the effectiveness of a predictive model, obtained by plotting the quantitative characteristics of the binary classifiers derived from the model at a variety of cut-off points. It shows the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR). *TPR* is calculated using Equation (1) [85]:

$$TPR = \frac{TP}{TP + FN} \tag{1}$$

where *TP* indicates the number of true positives, i.e., the model predicts the positive class correctly, and *FN* indicates the number of false negatives, i.e., the model predicts the negative class incorrectly. In turn, *FPR* is defined by Equation (2) [85]:

$$FPR = \frac{FP}{FP + TN} \tag{2}$$

where *FP* indicates the number of false positives, i.e., the model predicts the positive class incorrectly, and *TN* indicates the number of true negatives, i.e., the model predicts the negative class correctly.
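As a small worked example of Equations (1) and (2), the snippet below counts the four outcomes for a toy set of predictions and derives both rates. The function name and the label vectors are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fn, fp, tn

# Toy predictions at one fixed cut-off point.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
tpr = tp / (tp + fn)  # Equation (1): 2 / (2 + 1)
fpr = fp / (fp + tn)  # Equation (2): 1 / (1 + 4)
```

Repeating this calculation for every cut-off point yields the (FPR, TPR) pairs that trace out the ROC.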

AUROC summarizes the classifier's performance over all probability thresholds used to decide whether a considered object belongs to the negative or the positive class. Geometrically, it is the area under the ROC. The higher the value of AUROC, the better the classification results of the model, where AUROC < 0.5 means an invalid classifier, i.e., worse than random, AUROC = 0.5 means a random classifier, and AUROC = 1 means an ideal classifier [85].

GC is a measure of the model's quality, interpreted as the degree to which the classifier approaches the ideal one. GC is calculated using Equation (3):

$$GC = 2 \cdot AUROC - 1 \tag{3}$$

The higher the value of GC, the better the classifier, where GC = 0 means a random classifier and GC = 1 means an ideal classifier [90].
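The relationship between AUROC and GC can be checked numerically. The sketch below computes AUROC through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive object receives the higher score, with ties counted as one half), which is equivalent to the area under the ROC, and then applies Equation (3). Function names and scores are illustrative assumptions.

```python
def auroc(y_true, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(auroc_value):
    """Gini coefficient from AUROC, following Equation (3)."""
    return 2 * auroc_value - 1

# Toy scores: three of the four positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auroc(y_true, scores)
gc = gini(auc)
```

Since GC is a linear rescaling of AUROC, the two measures always rank classifiers in the same order.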

PRC shows the dependence between precision (Positive Predictive Value, PPV) and recall (TPR) for the classifier, where the former is calculated using Equation (4) [91]:

$$PPV = \frac{TP}{TP + FP} \tag{4}$$

A large area under the PRC (AUPRC) represents both high precision and high recall, where high precision corresponds to a low frequency of false positives and high recall corresponds to a low frequency of false negatives. High scores for both precision and recall indicate that the classifier returns accurate results and also finds most of the positive cases [91]. PRCs are often zigzag curves with oscillations. As a result, they tend to cross each other much more often than ROCs, which makes comparison difficult for the researcher. It is recommended to use PRCs in addition to ROCs to obtain a complete overview when evaluating and comparing classifier models [92].
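The zigzag shape of a PRC is easy to reproduce on a toy example. The sketch below evaluates Equation (4) and recall at every distinct score used as a cut-off point; the function name and the data are assumptions chosen for illustration.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score as cut-off."""
    total_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        pred = [s >= thr for s in scores]
        tp = sum(p and t == 1 for p, t in zip(pred, y_true))
        fp = sum(p and t == 0 for p, t in zip(pred, y_true))
        # tp + fp >= 1 because at least the object scoring `thr` is predicted positive.
        points.append((thr, tp / (tp + fp), tp / total_pos))
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = precision_recall_points(y_true, scores)
```

As the threshold is lowered, recall never decreases, but precision moves from 1 to 1/2, up to 2/3, and back to 1/2, which is the oscillation mentioned above.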

#### **4. Research Procedure**

The dataset on which the experiment was conducted contains anonymized data about loan repayment and borrowers. It consists of 91,759 records described by 272 conditional attributes (features) and the decision attribute. It was divided in a 70/30% proportion into a training set (64,230 records) and a testing set (27,529 records) [93].

The final research was preceded by a series of preliminary tests, during which the following were selected:


During the preliminary tests, it was noticed that random forest could be one of the models with outstanding classification results; therefore, it was examined in more detail to select its optimal parameters, i.e., number of iterations = 239 and maximum tree depth = 13 [22].

In this study, it was assumed that various combinations would be tested, consisting of filter methods (SU, FCBF, CFS, SA, ReliefF), classifier models (C4.5, DT, kNN, LR, NB, RF, optimized random forest (ORF)), resampling methods (no resampling, random undersampling, SMOTE), and feature discretization (no discretization, Fayyad criterion, Kononenko criterion). Taking into account the number of methodological approaches considered in each group (5 × 7 × 3 × 3), this gives 315 different scenarios and the same number of classification models supporting credit decisions. In practice, the number of scenarios conducted was smaller, because selected resampling and discretization algorithms were omitted. The following heuristic was used: if a specific preprocessing method, i.e., resampling or discretization, does not give satisfactory results, there is no reason to include it in subsequent scenarios. Moreover, due to its high computational complexity, some scenarios did not use ReliefF. It should be noted that, for the large training dataset, this method generally required time-consuming calculations without yielding acceptable results. Therefore, all scenarios included at least four filter methods (SU, FCBF, CFS, SA) and all seven classifiers. Additionally, in the case of random undersampling, each scenario was repeated three times, building three different classification models and then averaging the results. This approach was followed in order to minimize the impact of the random selection of training cases on the classification results. The study was divided into four general scenarios in which the following combinations of methods were applied:


Furthermore, classification was first performed without using filter methods, i.e., scenario 0. The results of this scenario served as a reference for the subsequent scenarios, in which filter methods were used. With this approach, the research scenarios made it possible to define:


Figure 1 depicts the research study that was carried out. It shows that the processing techniques, including feature discretization and feature selection, were fitted on the training set and the results were then applied to the testing set. This step was necessary to ensure full consistency between the training set and the testing set. For instance, the training data were binned and then the same bins were applied to the testing data. Likewise, the relevant features were selected based on the training set, and the redundant features were removed from the testing set. The only processing method applied to the training cases but not to the testing cases was data resampling.
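The train/test consistency rule described above can be illustrated with a short sketch in which the cutpoints are derived from the training values only and then reused unchanged on the testing values. Equal-frequency binning is used here purely as a simple stand-in (the study itself used supervised, entropy-based discretization), and the function names and data are assumptions.

```python
from bisect import bisect_right

def fit_cutpoints(train_values, n_bins=3):
    """Derive cutpoints from the training data only (equal-frequency stand-in)."""
    ordered = sorted(train_values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]

def apply_cutpoints(values, cutpoints):
    """Map each value to the index of the bin defined by the fitted cutpoints."""
    return [bisect_right(cutpoints, v) for v in values]

train = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test = [0, 4.5, 100]

cutpoints = fit_cutpoints(train)               # learned from the training set
train_bins = apply_cutpoints(train, cutpoints)
test_bins = apply_cutpoints(test, cutpoints)   # same bins reused, never refitted
```

Testing values outside the training range (0 and 100 here) simply fall into the first or last bin, so the testing set never influences where the cutpoints lie.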

**Figure 1.** Scenario-based research study. Abbreviations: RU—Random undersampling, SMOTE—Synthetic Minority Over-sampling Technique, FC—Fayyad criterion-based discretization, KC—Kononenko criterion-based discretization, CFS— Correlation-based Feature Selection, SA—Significance Attribute, SU—Symmetrical Uncertainty, FCBF—Fast Correlation-Based Filter, DT—Decision table, LR—Logistic regression, NB—Naïve Bayes, RF—Random forest, C4.5—C4.5 decision tree, kNN—k-nearest neighbors, ORF—Optimized random forest.
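The size of the full scenario grid follows directly from the four groups of methods listed in this section. The snippet below enumerates the Cartesian product using the abbreviations from the Figure 1 caption; the label "none" for the "without resampling" and "without discretization" options is an assumption made for the example.

```python
from itertools import product

filters = ["SU", "FCBF", "CFS", "SA", "ReliefF"]
classifiers = ["C4.5", "DT", "kNN", "LR", "NB", "RF", "ORF"]
resampling = ["none", "RU", "SMOTE"]
discretization = ["none", "FC", "KC"]

# Every combination of one filter, one classifier, one resampling option,
# and one discretization option.
scenarios = list(product(filters, classifiers, resampling, discretization))
n_scenarios = len(scenarios)  # 5 * 7 * 3 * 3

# Dropping ReliefF leaves the four filters used in every scenario.
without_relieff = [s for s in scenarios if s[0] != "ReliefF"]
```

This is the upper bound only; as stated in the text, the number of scenarios actually run was smaller because some resampling and discretization options were omitted.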

#### **5. Results and Discussion**

Full results of the research study are presented in Appendices A–G, while this section shows only the best results from each considered scenario. Table 4 presents the four top classification results from each scenario. Table 4 shows that the best classification results are obtained by the RF model, possibly with optimization, and that the feature selection method yielding the top classification results is mainly CFS. It should also be noted that the overall best result was achieved by RF on the full dataset of 272 features. Nevertheless, dimensionality reduction of such data is necessary, because otherwise the classification cannot be explained and a great amount of information has to be collected in order to classify a new case. When feature selection is performed without resampling or discretization, the best classification results are obtained by ORF. However, if both feature selection and classification accuracy are important, then the RF model should be supported by data resampling, which balances the class distribution. Moreover, in the case of RF, as well as LR and DT, undersampling provides better classification results than discretization (cf. Appendices C, E and F). The opposite holds for NB, kNN and C4.5. Furthermore, RF and LR with undersampling alone yield better results than with the combination of undersampling and discretization. On the other hand, this combination improves the quality of classification for NB. Additionally, in order to obtain acceptable results using LR or NB, it is necessary to employ the methods mentioned above, while for the RF model they can be omitted entirely. Moreover, the randomness in the applied undersampling algorithm also plays a vital role. It has a serious impact on the obtained feature sets and thus on the classification results. Nevertheless, the conclusions drawn here hold for each research case performed during the study. It should be noted that, in order to maximize classification accuracy, it is recommended to carry out several draws and select the set of training cases that gives the best results for the classification of testing cases.


**Table 4.** The best classification results from each research scenario.

**Classifier:** NB—Naive Bayes, RF—Random Forest, DT—Decision Table, LR—Logistic Regression, ORF—Optimized Random Forest; **Resampling:** RU—Random Undersampling; **Discretization:** KC—Kononenko Criterion, FC—Fayyad Criterion; **Feature selection:** CFS— Correlation-based Feature Selection, SU—Symmetrical Uncertainty.

On the other hand, if selecting the smallest possible feature set is of great importance, then FCBF should be used. Table 5 presents the four top classification results from each scenario in which the feature sets were obtained by this method. Table 5 shows that feature sets consisting of five or six features do not provide acceptable classification results. Bearing in mind that both the minimum number of features and the maximum accuracy are essential, the results of RF in scenario 2 and NB in scenario 4 are worth noting. DT also achieves relatively good classification results compared to the other models. The main reason is its built-in feature selection, i.e., DT automatically reduces the feature space. If the input feature set is relatively large, this can cause a deterioration of classification compared to other models, but with a low number of features no additional reduction is performed, so there is no negative impact on the final results.


**Table 5.** The best classification results from each research scenario using FCBF.

**Classifier:** NB—Naive Bayes, RF—Random Forest, DT—Decision Table, LR—Logistic Regression, ORF—Optimized Random Forest; **Resampling:** RU—Random Undersampling; **Discretization:** KC—Kononenko Criterion, FC—Fayyad Criterion; **Feature selection:** FCBF— Fast Correlation-Based Filter.

#### **6. Conclusions**

The article deals with the problem of credit decisions based on machine learning methods. In particular, the effects of applying classifiers together with other machine learning methods for processing the credit dataset were verified. Summarizing the results of the study, it is possible to indicate premises for the use of the individual methods, i.e., feature selection, binary classification, data resampling, and feature discretization:


Of course, the above heuristics do not exhaust the topic of choosing an appropriate approach to the credit scoring problem. In some business cases, apart from the classification result and the size of the feature set, the ability to explain the classification may also be important, which gives certain approaches an advantage. Moreover, when considering classification accuracy alone, it is not possible to clearly determine whether it is better to use AUROC, AUPRC or GC. Basically, the selection of a classification model consists in seeking a trade-off between the inherent features of classifiers. Therefore, further research will target the selection of a specific classifier-based approach to credit decisions that supports stakeholders (e.g., banks) depending on their individual needs (i.e., actual requirements and preferences). The assessment of various approaches is a multi-criteria decision problem; thus, multi-criteria decision analysis [94] will be involved.

**Author Contributions:** Conceptualization, P.Z. and A.B.; methodology, P.Z.; validation, J.B.; formal analysis, A.B.; investigation, P.Z.; resources, A.R.-Z.; data curation, M.P.; writing—original draft preparation, P.Z.; writing—review and editing, J.B. and A.B.; supervision, P.Z. and J.B.; project administration, A.R.-Z.; funding acquisition, D.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research is partially financed through the National Centre for Research and Development, Poland (grant no. POIR.01.01.01-00-0322/18-00).

**Data Availability Statement:** Data available on request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Results of Scenario 0**



#### **Appendix B. Results of Scenario 1**

**Table A2.** Classification results for feature subset selected by CFS (13 features).


**Table A3.** Classification results for feature subset selected by FCBF (six features).


**Table A4.** Classification results for feature subset selected by SU (13 features).



**Table A5.** Classification results for feature subset selected by SA (13 features).

#### **Appendix C. Results of Scenario 2—Random Undersampling**

**Table A6.** Classification results for feature subset selected by CFS (35/27/37 features).


**Table A7.** Classification results for feature subset selected by FCBF (12/14/11 features).


**Table A8.** Classification results for feature subset selected by SU (35/27/37 features).


**Table A9.** Classification results for feature subset selected by SA (35/27/37 features).



**Table A10.** Classification results for feature subset selected by ReliefF (35/27/37 features).

#### **Appendix D. Results of Scenario 2—SMOTE**

**Table A11.** Classification results for feature subset selected by CFS (42 features).


**Table A12.** Classification results for feature subset selected by FCBF (28 features).


**Table A13.** Classification results for feature subset selected by SU (42 features).


**Table A14.** Classification results for feature subset selected by SA (42 features).


#### **Appendix E. Results of Scenario 3—Fayyad Criterion**


**Table A15.** Classification results for feature subset selected by CFS (14 features).

**Table A16.** Classification results for feature subset selected by FCBF (five features).


**Table A17.** Classification results for feature subset selected by SU (14 features).


**Table A18.** Classification results for feature subset selected by SA (14 features).


#### **Appendix F. Results of Scenario 3—Kononenko Criterion**

**Table A19.** Classification results for feature subset selected by CFS (13 features).



**Table A20.** Classification results for feature subset selected by FCBF (five features).

**Table A21.** Classification results for feature subset selected by SU (13 features).


**Table A22.** Classification results for feature subset selected by SA (13 features).


#### **Appendix G. Results of Scenario 4—Random Undersampling, Kononenko Criterion**

**Table A23.** Classification results for feature subset selected by CFS (35 features).


**Table A24.** Classification results for feature subset selected by FCBF (10 features).



**Table A25.** Classification results for feature subset selected by SU (35 features).

**Table A26.** Classification results for feature subset selected by SA (35 features).


**Table A27.** Classification results for feature subset selected by ReliefF (35 features).


#### **References**


## *Article* **Evaluation of Feature Selection Methods on Psychosocial Education Data Using Additive Ratio Assessment**

**Fitriani Muttakin <sup>1</sup>, Jui-Tang Wang <sup>2,</sup>\*, Mulyanto Mulyanto <sup>2</sup> and Jenq-Shiou Leu <sup>2</sup>**


**\*** Correspondence: rtwang@mail.ntust.edu.tw

**Abstract:** Artificial intelligence, particularly machine learning, is the fastest-growing research trend in educational fields. Machine learning shows an impressive performance in many prediction models, including psychosocial education. The capability of machine learning to discover hidden patterns in large datasets encourages researchers to collect data with high-dimensional features. However, not all features are needed by machine learning, and in many cases, high-dimensional features decrease the performance of machine learning. Feature selection is one of the appropriate approaches to reducing the features to ensure machine learning works efficiently. Various selection methods have been proposed, but research to determine the essential feature subset in psychosocial education has not been established thus far. This research investigated and proposed methods to determine the best feature selection method in the domain of psychosocial education. We used a multi-criteria decision-making (MCDM) approach with Additive Ratio Assessment (ARAS) to rank seven feature selection methods. The proposed model evaluated the best feature selection method using nine criteria from the performance metrics provided by machine learning. The experimental results showed that the ARAS is promising for evaluating and recommending the best feature selection method for psychosocial education data using the teacher's psychosocial risk levels dataset.

**Keywords:** evaluation feature selection; evaluation model; decision model; psychosocial education

#### **1. Introduction**

Psychosocial education is multidisciplinary and covers a vast field of study. Therefore, it is not surprising that research in psychosocial education encompasses an abundance of environments and features that are logically expected to be linked to the problem-solving of educational quality improvements. Research from various perspectives, such as personal environment [1], family [2], nutrition [3], and physical activities [4], has been conducted to get an overview of the various psychosocial relationships in education. Accordingly, research linked to psychosocial education is categorized as one of the most active in education. Indeed, the search using the keyword "psychosocial education" in Google Scholar shows 212,000 results of research published between 2017 and 2021.

On the other hand, the success of artificial intelligence and big data influences decision-making perspectives, particularly those based on predictive problems. Big data technologies can effectively handle larger amounts, more complex varieties, and higher dimensions of data [5]. Meanwhile, artificial intelligence, especially machine learning, significantly improves the quality of decision models [6,7]. These two factors encourage researchers to collect more data with massive numbers of features.

Theoretically, the more data that are collected, the more information that is obtained, and the more information obtained, the better the resulting prediction should be. However, the increase in the number of variables and the volume of data impacts data sparsity, especially if the data quality is poor. The increase in sparsity makes it much more difficult to find data representative of the population. Furthermore, it makes it challenging for machine learning to generalize to the domain problem. Poor generalization will cause machine learning to lose its ability to adapt to new problems [8,9].

**Citation:** Muttakin, F.; Wang, J.-T.; Mulyanto, M.; Leu, J.-S. Evaluation of Feature Selection Methods on Psychosocial Education Data Using Additive Ratio Assessment. *Electronics* **2022**, *11*, 114. https://doi.org/10.3390/electronics11010114

Academic Editors: Agnieszka Konys and Agnieszka Nowak-Brzezińska

Received: 14 November 2021 Accepted: 28 December 2021 Published: 30 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Instead of feeding all features into machine learning, optimizing the input features is often more efficient and effective. Feature selection can eliminate features that are irrelevant to the prediction target. Various feature selection methods have been proposed and proven to impact machine learning performance. With so many feature selection methodologies and different approaches within each method, the question naturally arises as to which method gives the optimum and most effective results in machine learning, especially regarding the psychosocial education problem.

Hence, this paper proposed a methodology to evaluate the best feature selection method in the domain of psychosocial education. The evaluation was performed using a decision model approach that utilized multi-criteria decision making (MCDM). Furthermore, additive ratio assessment (ARAS) was adopted to evaluate and rank the best feature selection method. The evaluation and ranking used the metrics from the machine learning classification performance on the teacher's psychosocial risk level dataset.

#### **2. Related Work**

Feature selection is one of the critical stages in machine learning modeling, and selecting relevant features has implications for better stability, robustness, and generalization of machine learning [10]. Feature selection methods can be divided into three approaches [11–13]: filter, wrapper, and embedded methods.

Moorthy and Gandhi [14] previously conducted research using the filtering method. They optimized medical data using feature selection techniques for classification problems. They combined analysis of variance (ANOVA) and whale optimization (WO) to give a better result for SVM and k-NN classifiers than the one without ANOVA-WO. Ding and Li [15] also conducted a similar study identifying mitochondrial proteins in malaria by combining ANOVA and incremental feature selection (IFS) methods to find the most optimal feature. The proposed model achieved 97.1% accuracy compared to 92.0% on the comparison model. Next, Utama [16] performed feature selection using the mutual information (MI) model to predict the airline's tweet sentiment analysis. The feature selection made contributions to the classifier improvement.

Similarly, the wrapper method also gives promising results. Richhariya et al. [17] proposed a Universum support vector machine based on the recursive feature elimination (USVM-RFE) method to diagnose Alzheimer's. Feature selection was performed on the MRI data of brain tissue, and the classification using USVM-RFE showed better results than the one using SVR-RFE. The implementation of RFE was also done in the study [18], where RFE-SVM was used to determine the best feature among the various heart rate variability (HRV) data. The study showed that RFE-SVM could identify the HRV feature and detect the stress level better.

The approach using the embedded method has also been widely used. Liu et al. [19] implemented feature selection using the embedded method on cyberattack data from the Internet of Things (IoT). The accuracy of the proposed method was comparable to that of the comparison model; however, its training was 1000 times faster than that of the model using all features. The embedded method was also applied as the feature selection technique by Loscalzo et al. [20], who used it to remove unneeded inputs in robotic sensors. The paper showed that the embedded method significantly reduces the number of unimportant sensors. Lastly, Liu et al. [21] compared the embedded method to others, such as Chi-Square, F-Statistic, and Gini Index. The experiment showed that the weighted Gini Index (WGI) method was better than the other methods on data with limited features.

Given the importance of choosing a feature selection method that suits the data characteristics of the domain problem, selecting the best feature selection method is quite challenging. There are various techniques for choosing among feature selection methods, one of which is the decision system model approach. Kou et al. [22] conducted a study to select the best feature subset for a text classification case. The study compared several MCDM models, such as TOPSIS, GRA, WSM, VIKOR, and PROMETHEE. The results showed that PROMETHEE was better than the other models as an evaluation model in the text-based classification case. Hasemi et al. [23] proposed the EFS-MCDM method to determine the best features on a computer network dataset. The feature ranking in EFS-MCDM delivered more optimal and efficient results in terms of accuracy, F-score, and algorithm run-time compared to other methods. Similarly, Singh [24] implemented TOPSIS to select features in a network traffic dataset. The research concluded that the classification model with the TOPSIS-based feature subset had the same accuracy yet a much lower computation time.

Despite all the studies conducted on selecting existing feature selection methods so far, to the best of the authors' knowledge, there has been no study comparing and evaluating the best feature selection method to be implemented in psychosocial education. Previous studies on psychosocial education only implemented machine learning without extensive analysis of the used features.

#### *Research Contribution*

Based on the knowledge gaps derived from the previous studies, this paper would advance the body of knowledge about the feature selection method in two primary contributions:


#### **3. Methodology**

#### *3.1. Theoretical Overview*

#### 3.1.1. Artificial Intelligence Research on Psychosocial Education

Nowadays, research in the education field focuses not only on academic aspects, such as academic achievement, graduation level, academic grading, and teaching methods, but also on non-academic aspects, such as community relationships [25] and psychosocial factors. These non-academic aspects also influence the quality of education [26–28].

On the other hand, the flourishing of research in artificial intelligence has made an impressive contribution to the psychosocial education field. Numerous artificial intelligence-based studies have successfully revealed psychosocial phenomena that influence education development. In a study conducted by Navarro [29], artificial intelligence was successfully used to predict the link between the condition of the environment and educators' stress levels. In addition, the research collected, through interviews, 4890 data points with 118 features used in predicting the level of stress of the educators. The extensive amount of data and the high-dimensional features in the study indicate that psychosocial research is essential and exciting to carry out.

#### 3.1.2. Feature Selection Methods

Real-world problems are often represented by extensive data collection and high-dimensional features. Occasionally, existing features may not directly relate to the target problems that need to be solved [30,31]. Under such circumstances, the selection of features becomes critical. Selecting the right features makes it possible to improve model performance and efficiency in the computation process [32,33].

Three approaches are available to select features. The first approach, the filter method, selects a subset of features based on the characteristics of the features themselves. The best features are obtained from a statistical analysis of each feature against the other features or the target data. Next, the wrapper method uses machine learning to select the best feature subsets for analysis: it constructs candidate feature subsets and tests them using statistical modeling. The third approach, the embedded method, uses the same principle as the wrapper method; however, it evaluates the feature subset by analyzing the performance of the machine learning model during training.

Next, this section briefly describes the seven feature selection methods evaluated in this paper. There are three filter methods: analysis of variance (ANOVA), mutual information, and chi-square; the exhaustive search feature is a wrapper method; embedding random forest, Lasso, and recursive feature elimination are embedded methods. These methods are compared with a machine learning model that uses the baseline features.

#### 3.1.3. ANOVA

ANOVA is a statistical analysis used to calculate the distance of difference (variance) between two clusters [34]. ANOVA uses the *f-ratio* to calculate the magnitude of every feature and target class. Magnitude values above the *f-ratio* will be retained, and others will be discarded. In an ANOVA with *k* classes, the variance among classes is defined as follows [35]:

$$
\sigma_{v-all}^2 = \frac{\sum (\overline{x}_i - \overline{x})^2 n_i}{(k-1)} \tag{1}
$$

where *n<sub>i</sub>* is the number of values found in the *i*-th class, *x̄<sub>i</sub>* is the mean of the *i*-th class, and *x̄* is the mean of all classes; the class variance is defined as follows:

$$\sigma_{v-class}^2 = \frac{\left(\sum \left(\overline{x}_{ij} - \overline{x}\right)^2\right) - \left(\sum \left(\overline{x}_i - \overline{x}\right)^2 n_i\right)}{\left(R - k\right)}\tag{2}$$

Then, the *f-ratio* is calculated based on the degree of the two variances:

$$f\text{-}ratio = \frac{\sigma_{v-all}^2}{\sigma_{v-class}^2} \tag{3}$$
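As an illustration of Equations (1)–(3), the following minimal sketch computes the *f-ratio* of a single feature whose values are grouped by target class. It assumes that the first sum in Equation (2) runs over the individual observations and that *R* is the total number of observations; neither is stated explicitly in the text. The function name `anova_f_ratio` and the toy data are hypothetical.

```python
from statistics import mean

def anova_f_ratio(groups):
    """f-ratio of one feature; `groups` holds the feature values of each class."""
    k = len(groups)                                  # number of classes
    all_values = [v for g in groups for v in g]
    total = len(all_values)                          # R (assumed: total observations)
    grand_mean = mean(all_values)

    # Eq. (1): variance among classes
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    var_all = ss_between / (k - 1)

    # Eq. (2): class variance = (total sum of squares - between-class part) / (R - k)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    var_class = (ss_total - ss_between) / (total - k)

    # Eq. (3): ratio of the two variances
    return var_all / var_class

# Toy feature with three classes (illustrative values only)
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = anova_f_ratio(groups)
print(round(f_value, 4))
```

A larger value indicates that the class means of the feature are well separated relative to the spread inside each class, which is why such a feature would be retained.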

#### 3.1.4. Chi-Square

Chi-square is a statistical method that is widely used for measuring the association between two variables [36–38]. As a method for selecting feature subsets in machine learning, Chi-square can be applied by calculating the dependency level of each feature on the target data [39,40]. If *n* is the observed frequency and *μ* is the expected frequency, then the Chi-square (*X*<sup>2</sup>) for a feature with *f* values and *C* classes is defined as follows:

$$X^2 = \sum_{i=1}^f \sum_{j=1}^C \frac{\left(n_{ij} - \mu_{ij}\right)^2}{\mu_{ij}} \tag{4}$$
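Equation (4) can be sketched for a single categorical feature as follows. The sketch assumes that the expected frequency *μ<sub>ij</sub>* is the usual independence estimate (row total times column total divided by the grand total), which the text does not spell out; the function name and the contingency table are illustrative.

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table.

    Rows are the values of the feature, columns are the target classes.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, n_ij in enumerate(row):
            # Expected frequency under independence (assumption)
            mu_ij = row_totals[i] * col_totals[j] / grand_total
            stat += (n_ij - mu_ij) ** 2 / mu_ij
    return stat

# Feature with two values, target with two classes (illustrative counts)
table = [[30, 10], [10, 30]]
score = chi_square(table)
print(score)
```

Features would then be ranked by this score, with higher scores indicating a stronger dependency on the target.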

#### 3.1.5. Mutual Information (MI)

MI is used to calculate the distance of random vectors between clusters [41,42]. Mutual information compares the joint probability distribution *P*(*X*,*Y*) with the product of the marginal distributions *P*(*X*)*P*(*Y*) [43]. Mutual information between two random vectors *X* and *Y* is defined as follows:

$$MI(X,Y) = \sum_{x \in X} \sum_{y \in Y} P(X = x, \ Y = y) \ln \frac{P(X = x, \ Y = y)}{P(X = x)P(Y = y)} \tag{5}$$

In feature selection problems, mutual information is used to quantify how much a feature contributes to the prediction of the target class [44–46]. Mutual information for feature set *S* with *m* features, which have a large dependence on the target class *C*, is defined as follows:

$$MI(\mathcal{S}_{m}, \mathcal{C}_{i}) = \log \frac{P(\mathcal{S}_{m}, \mathcal{C}_{i})}{P(\mathcal{S}_{m}) \, P(\mathcal{C}_{i})} \tag{6}$$
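A minimal sketch of Equation (5) for two discrete variables is given below. It takes the joint probability table as input and treats terms with zero joint probability as contributing zero, a common convention that is assumed here. The function name and the two example tables are hypothetical.

```python
from math import log

def mutual_information(joint):
    """Mutual information (in nats) from a joint probability table P(X, Y)."""
    p_x = [sum(row) for row in joint]           # marginal P(X)
    p_y = [sum(col) for col in zip(*joint)]     # marginal P(Y)

    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:                        # 0 * ln(0) is taken as 0
                mi += p_xy * log(p_xy / (p_x[i] * p_y[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]   # X tells nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]         # X determines Y
print(mutual_information(independent), mutual_information(dependent))
```

The independent table yields zero, while the fully dependent one yields ln 2, so a feature with higher mutual information carries more information about the target class.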

#### 3.1.6. Exhaustive Search Feature (EFS)

In the EFS method, the algorithm performance is obtained by evaluating the existing features in all possible combinations. The feature subset with the highest performance will be selected [47,48]. EFS works by finding the value of validity (*P*, *S*), assessing every candidate feature subset *S* as a solution to a problem *P*. The result is obtained from Output (*P*, *S*), in which the values of *S* are suitable for the problem *P*. EFS uses a brute-force approach to find the best possible feature subset. Due to its exhaustive nature, EFS usually requires large amounts of resources.
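The brute-force character of EFS can be sketched as a loop over every non-empty feature subset. In the sketch below, `toy_score` is a hypothetical stand-in for the performance of a classifier trained on the subset; it is not the scoring used in this paper.

```python
from itertools import combinations

def exhaustive_feature_search(features, score):
    """Evaluate every non-empty subset and return the best one with its score."""
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            value = score(subset)
            if value > best_score:
                best_subset, best_score = subset, value
    return best_subset, best_score

def toy_score(subset):
    """Hypothetical score: rewards two useful features, penalizes subset size."""
    useful = {"f1", "f3"}
    return len(useful & set(subset)) - 0.1 * len(subset)

best, value = exhaustive_feature_search(["f1", "f2", "f3", "f4"], toy_score)
print(best, value)
```

With *n* features the loop evaluates 2<sup>*n*</sup> − 1 subsets, which illustrates why the method becomes resource-intensive as the number of features grows.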

#### 3.1.7. Embedding Random Forest (ERF)

ERF is an ensemble method to reconstruct the average output of an individual tree [49]. The recursive approach is needed to find the best value from the feature subset during the elimination process, especially for highly correlated features [50]. Evaluating the high correlation can be done using the mean decreasing impurity approach. The Gini Index is one of the most popular measures of mean decreasing impurities, and it is defined as follows:

$$Gini = 1 - \sum_{i=1}^{n} \left(P_i\right)^2\tag{7}$$
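Equation (7) and the impurity decrease of a single split can be sketched as follows. The helper `impurity_decrease` is an illustrative assumption about how one split would be scored; in a random forest, such decreases would be accumulated per feature and averaged over the trees.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index of a set of class labels, Eq. (7)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Gini decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

parent = ["a", "a", "b", "b"]
print(gini_impurity(parent))                              # mixed node
print(impurity_decrease(parent, ["a", "a"], ["b", "b"]))  # perfect split
```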

#### 3.1.8. Lasso

The least absolute shrinkage and selection operator (Lasso) is one of the shrinkage techniques. Lasso selects the variables by minimizing the sum of squared errors with a penalty (regularization) term [49,51]. The regression coefficients are shrunk towards zero as the value of the lambda (*λ*) parameter, which controls the amount of shrinkage, increases [52]. Lasso is defined as follows:

$$L_{\text{lasso}}\left(\hat{\beta}\right) = \sum_{i=1}^{n} \left(y_i - x_i^{\prime} \hat{\beta}\right)^2 + \lambda \sum_{j=1}^{m} \left|\hat{\beta}_j\right|\tag{8}$$
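The Lasso objective in Equation (8) can be evaluated directly for a candidate coefficient vector. The sketch below only computes the loss and does not perform the minimization; the function name and the toy data are assumptions for illustration.

```python
def lasso_loss(X, y, beta, lam):
    """Sum of squared errors plus the L1 penalty, Eq. (8)."""
    squared_errors = sum(
        (y_i - sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))) ** 2
        for x_i, y_i in zip(X, y)
    )
    penalty = lam * sum(abs(b_j) for b_j in beta)
    return squared_errors + penalty

# Toy data generated by y = 1 * x1 + 2 * x2 (illustrative values only)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]

full_model = lasso_loss(X, y, beta=[1.0, 2.0], lam=0.5)    # perfect fit, larger penalty
sparse_model = lasso_loss(X, y, beta=[1.0, 0.0], lam=0.5)  # second feature dropped
print(full_model, sparse_model)
```

A coefficient that is driven exactly to zero removes the corresponding feature, which is how Lasso acts as a feature selection method.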

#### 3.1.9. Recursive Feature Elimination (RFE)

RFE is a feature selection method that works iteratively to rank the importance of features [50]. To reduce the computational cost, some approaches eliminate a subset of features in each iteration instead of one feature at a time [53]. In each iteration, the feature subsets with low relevance values are analyzed and eliminated. The two components of RFE are the number of features and the algorithm used to analyze the performance of the feature subsets. Generally, the iteration procedure of RFE is performed as follows [54]: (1) train a classifier on each feature subset; (2) calculate the ranking of each feature subset; (3) remove the feature subset that has low significance.
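The three-step iteration above can be sketched as a generic loop. Here `importance_fn` is a hypothetical callback that stands in for training a classifier on the remaining features and returning an importance value per feature; the `step` parameter models the variant that removes several features per iteration.

```python
def recursive_feature_elimination(features, importance_fn, n_keep, step=1):
    """Iteratively drop the least important features until `n_keep` remain."""
    remaining = list(features)
    while len(remaining) > n_keep:
        importances = importance_fn(remaining)                    # (1) train and score
        ranked = sorted(remaining, key=lambda f: importances[f])  # (2) rank, weakest first
        n_drop = min(step, len(remaining) - n_keep)
        dropped = set(ranked[:n_drop])                            # (3) remove the weakest
        remaining = [f for f in remaining if f not in dropped]
    return remaining

# Hypothetical fixed importances standing in for a trained model
toy_importances = {"f1": 0.9, "f2": 0.1, "f3": 0.5, "f4": 0.3}

def toy_importance_fn(current_features):
    return {f: toy_importances[f] for f in current_features}

selected = recursive_feature_elimination(["f1", "f2", "f3", "f4"], toy_importance_fn, n_keep=2)
print(selected)
```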

#### 3.1.10. ARAS: Decision System Approach for the Feature Evaluation Method

Additive ratio assessment (ARAS) is one of the MCDM modeling techniques. ARAS is a method that relies on the intuitive principle that the best solution must have the largest ratio. Ranking using the ARAS method is performed by comparing the value of each criterion on each alternative by looking at its weight to obtain the ideal alternative [55,56].

The ARAS method utilizes a function value that determines the complexity of feasible alternatives. This value is directly proportional to the values and weights of the main criteria considered in determining the best alternative. ARAS is based on the argument that complex problems can be understood simply by using relative comparisons. In ARAS, the ratio of the sum of normalized and weighted criteria values describes the possible alternatives to obtaining the optimal alternative rank. The ARAS method compares the utility functions of alternatives with optimal utility function values [57].

Like the classical MCDM approach, ARAS focuses on the ranking of criteria. Ranking with ARAS is done in several stages [55]. The first stage is forming a decision-making matrix. The matrix consists of alternatives (rows *i* = 0, ..., *m*) and criteria (columns *j* = 1, ..., *n*), where *i* indexes the alternatives and *j* indexes the criteria. The decision-making matrix is denoted as follows:

$$X = \begin{bmatrix} x_{01} & \dots & x_{0j} & \dots & x_{0n} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{i1} & \dots & x_{ij} & \dots & x_{in} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{m1} & \dots & x_{mj} & \dots & x_{mn} \end{bmatrix}; \quad i = \overline{0, m}; \; j = \overline{1, n} \tag{9}$$

The optimal value *x*<sub>0*j*</sub> of a criterion is the best value that can be used to represent the performance on criterion *j*. In this paper, the optimal criterion value *x*<sub>0*j*</sub> is defined as follows:

$$\begin{aligned} x_{0j} &= \max_{i} x_{ij}, \text{ if criterion } j \text{ is a benefit}; \\ x_{0j} &= \min_{i} x_{ij}^{*}, \text{ if criterion } j \text{ is a cost}; \end{aligned} \tag{10}$$

The next stage is normalizing all the criteria values *x<sub>ij</sub>* of the matrix *X*. The normalized decision-making matrix *X̄* is defined as follows:

$$
\overline{X} = \begin{bmatrix}
\overline{x}_{01} & \dots & \overline{x}_{0j} & \dots & \overline{x}_{0n} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\overline{x}_{i1} & \dots & \overline{x}_{ij} & \dots & \overline{x}_{in} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\overline{x}_{m1} & \dots & \overline{x}_{mj} & \dots & \overline{x}_{mn}
\end{bmatrix}; \quad i = \overline{0, m}; \; j = \overline{1, n} \tag{11}
$$

Normalization of benefit criteria can be done using the following formula:

$$\overline{x}_{ij} = \frac{x_{ij}}{\sum_{i=0}^{m} x_{ij}} \tag{12}$$

Meanwhile, normalization of cost criteria can be done using a two-stage procedure, in which the values are first inverted and then normalized:

$$x_{ij} = \frac{1}{x_{ij}^{*}} ; \quad \overline{x}_{ij} = \frac{x_{ij}}{\sum_{i=0}^{m} x_{ij}} \tag{13}$$

The next step is defining the normalized-weighted matrix, starting with determining the weights *w<sub>j</sub>*. The sum of the weights of all the criteria is 1, so the weights *w<sub>j</sub>* are constrained as follows:

$$\sum_{j=1}^{n} w_j = 1\tag{14}$$

After that, the normalized-weighted matrix is calculated using the following formula:

$$
\hat{X} = \begin{bmatrix}
\hat{x}_{01} & \dots & \hat{x}_{0j} & \dots & \hat{x}_{0n} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\hat{x}_{i1} & \dots & \hat{x}_{ij} & \dots & \hat{x}_{in} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\hat{x}_{m1} & \dots & \hat{x}_{mj} & \dots & \hat{x}_{mn}
\end{bmatrix}; \quad i = \overline{0, m}; \; j = \overline{1, n} \tag{15}
$$

Normalization using weight *wj* for all criteria can be calculated using the following formula:

$$
\hat{x}_{ij} = \overline{x}_{ij} \, w_j; \quad i = \overline{0, m} \tag{16}
$$

where *w<sub>j</sub>* is the weight of criterion *j*, and *x̂<sub>ij</sub>* is the weighted normalized value of criterion *j* for alternative *i*. The next step is to calculate the values of the optimality function using the following formula:

$$S_i = \sum_{j=1}^{n} \hat{x}_{ij}; \quad i = \overline{0, m} \tag{17}$$

The final step in the ARAS model is to determine the ranking of the alternatives. If *S<sub>i</sub>* and *S*<sub>0</sub> are optimality criterion values, then the ranking value *K<sub>i</sub>* for alternative *i* follows the definition:

$$K_i = \frac{S_i}{S_0}; \quad i = \overline{0, m} \tag{18}$$
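The ARAS stages in Equations (9)–(18) can be summarized in a short sketch. It is a minimal illustration under stated assumptions rather than the implementation used in this paper: the function name `aras_rank`, the argument layout, and the toy decision matrix (one benefit criterion and one cost criterion) are hypothetical.

```python
def aras_rank(matrix, weights, is_benefit):
    """Return the ARAS value K_i of every alternative.

    matrix:     rows are alternatives, columns are raw criterion values
    weights:    criterion weights w_j, summing to 1 (Eq. 14)
    is_benefit: True for benefit criteria, False for cost criteria
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    n = len(weights)

    # Eq. (10): optimal alternative A0 (max for benefit, min for cost)
    columns = list(zip(*matrix))
    optimal = [max(c) if b else min(c) for c, b in zip(columns, is_benefit)]
    rows = [optimal] + [list(r) for r in matrix]           # Eq. (9): row 0 is A0

    # Eq. (13), first stage: invert cost criteria
    rows = [[v if is_benefit[j] else 1.0 / v for j, v in enumerate(r)] for r in rows]

    # Eqs. (12)-(13): normalize each column by its sum over i = 0..m
    col_sums = [sum(r[j] for r in rows) for j in range(n)]

    # Eqs. (16)-(17): weighted normalized values summed per alternative
    s = [sum(weights[j] * r[j] / col_sums[j] for j in range(n)) for r in rows]

    # Eq. (18): ratio to the optimal alternative
    return [s_i / s[0] for s_i in s[1:]]

# Toy example: criterion 1 = accuracy (benefit), criterion 2 = train time (cost)
matrix = [[0.90, 10.0], [0.80, 5.0], [0.90, 5.0]]
k = aras_rank(matrix, weights=[0.6, 0.4], is_benefit=[True, False])
ranking = sorted(range(len(k)), key=lambda i: k[i], reverse=True)
print(k, ranking)
```

In this toy case the third alternative matches the optimal value on both criteria, so its *K* value equals 1 and it is ranked first; all other alternatives obtain values below 1.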

#### *3.2. Experimental Design*

In this section, the stages of the proposed methodology are discussed. The proposed method comprises three steps: preprocessing, machine learning, and the decision system. The first step is preprocessing the dataset, which aims to improve the quality of the data. Preprocessing also makes the dataset more readable and is considered to improve the performance of the machine learning algorithm [58,59]. Data preprocessing consists of cleaning data, transforming categorical data to numerical form, and normalizing data.

After the preprocessing, the next step is the machine learning phase. This phase involves feature selection methods, classification, and performance evaluation. The feature selection method determines the best subset of features from the dataset. In the classification stage, a decision tree classifier is employed to generate the performance of models such as accuracy, precision, recall, f1-score, weighted precision, weighted recall, weighted f1-score, train time, and inference time using the selected feature from the previous stage.

The next stage is the decision system phase. In this step, the performance metrics are compared to determine the rank of the feature selection methods. ARAS uses the performance metrics as the ranking criteria, and this method is essential for identifying the best feature selection methods. The final ARAS result presents the ranking of the feature selection methods. The stages of the proposed methodology are depicted in Figure 1.

**Figure 1.** The proposed method of decision model to evaluate feature selection methods.

The proposed model evaluates the feature selection method using two metrics, i.e., model performance and computation performance. Model performance is a measurement of machine learning performance using selected features, and in contrast, computational performance refers to computational capabilities during the training and inference process. Experiments and evaluations are carried out on seven methods and one baseline model, which is a model that uses all features. The schematic detail of the criteria selection of the feature selection method is portrayed in Figure 2.

**Figure 2.** Schematic diagram of decision model to evaluate feature selection methods.

#### *3.3. Dataset Description*

The psychosocial education dataset used to test the proposed method comes from the research in [29,60]. It is a public dataset obtained from a psychosocial assessment to identify the stress levels of teachers in Colombia. The dataset consists of 4890 instances and 118 features divided into six domains. The complete specification of the dataset can be seen in Table 1.

**Table 1.** Detailed specification of dataset.


#### *3.4. Dataset Preprocessing*

In a machine learning problem, the dataset is used to demonstrate the effectiveness of the proposed method. Therefore, a high-quality dataset is required to evaluate the proposed model against the existing model. Data preprocessing is a well-known technique to improve dataset quality.

The teachers' psychosocial risk level dataset is valuable and pristine, and it provides the basis for research on the degree of psychosocial distress among teachers in Colombia. Several studies have been conducted using the same dataset [29,60]. Although the dataset had already been preprocessed appropriately, several further preprocessing steps were still necessary to prepare a suitable dataset for the proposed methods.

The first step involves performing common preprocessing steps, such as removing improper data and handling missing values. Then, we divided the data into two subsets by following the Pareto principle [61]: 80% of the data was used for training, and 20% was used for testing. The split was made at random to ensure a fair data distribution.

The next step is to apply standardization to rescale the distribution of each dataset subset. By performing a standardization transformation of the dataset, each feature dataset will have a mean value of 0 with a standard deviation value of 1. Hopefully, a preprocessed dataset will lead machine learning to the optimal model.
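The split and standardization steps can be sketched as follows. The sketch follows the description above literally, rescaling each subset to zero mean and unit standard deviation on its own; the function names, the fixed random seed, and the toy data are assumptions for illustration.

```python
import random
from statistics import mean, pstdev

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Randomly split rows into a training subset and a testing subset."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def standardize(rows):
    """Rescale every feature column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]   # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# Toy dataset with 10 instances and 2 features (illustrative values only)
rows = [[float(i), float(2 * i + 1)] for i in range(10)]
train, test = train_test_split(rows)
train_std = standardize(train)
print(len(train), len(test))
```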

#### *3.5. Evaluation of Performance Metrics for Feature Selection Methods*

Evaluation is done to measure machine learning performance. Generally, machine learning performance is measured by using a confusion matrix. A confusion matrix combines the actual values and the values predicted by the classifier. The confusion matrix consists of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts. The metrics accuracy, precision, recall, and f1-score are obtained from the following calculations:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{19}$$

$$Precision = \frac{TP}{TP + FP} \tag{20}$$

$$Recall = \frac{TP}{TP + FN} \tag{21}$$

$$F1 = 2\frac{Precision \times Recall}{Precision + Recall} \tag{22}$$
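Equations (19)–(22) translate directly into code. The confusion-matrix counts in the sketch below are made-up numbers for illustration only and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (19)
    precision = tp / (tp + fp)                            # Eq. (20)
    recall = tp / (tp + fn)                               # Eq. (21)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (22)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(accuracy, precision, round(recall, 4), round(f1, 4))
```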

The fundamental concept for calculating the confusion matrix is binary classification [62]. A single comparison is made between two classes in binary classification, while this single comparison becomes irrelevant in multi-class classification [63]. Each class's precision, recall, and f1-score are therefore calculated using the one vs. all method [64] and combined as micro-averaged and macro-averaged scores. For example, the micro-averaged and macro-averaged precision (PRE) scores for *k* classes are defined as follows [65]:

$$PRE_{micro} = \frac{TP_1 + \dots + TP_k}{TP_1 + \dots + TP_k + FP_1 + \dots + FP_k} \tag{23}$$

$$PRE_{macro} = \frac{PRE_1 + \dots + PRE_k}{k} \tag{24}$$
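The difference between Equations (23) and (24) becomes visible on imbalanced classes. The following sketch uses hypothetical per-class counts obtained in a one vs. all fashion.

```python
def micro_macro_precision(tp, fp):
    """Micro- and macro-averaged precision from per-class TP and FP counts."""
    micro = sum(tp) / (sum(tp) + sum(fp))                     # Eq. (23)
    per_class = [t / (t + f) for t, f in zip(tp, fp)]
    macro = sum(per_class) / len(per_class)                   # Eq. (24)
    return micro, macro

# Hypothetical counts: a large class predicted well, a small class predicted poorly
tp = [90, 5]
fp = [10, 5]
micro, macro = micro_macro_precision(tp, fp)
print(round(micro, 4), round(macro, 4))
```

The micro average is dominated by the large class, whereas the macro average weights both classes equally and therefore drops more sharply when the small class is predicted poorly.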

#### **4. Results and Discussion**

This section reviews the performance evaluation of the proposed method. The discussion is organized in two parts: the performance of each feature selection method on the psychosocial education dataset and the implementation of ARAS in selecting the best feature selection method. Analysis and evaluation are also conducted by comparing the ARAS ranking against a ranking based on a single criterion, for which accuracy is used.

#### *4.1. Performance Analysis of the Feature Selection Method*

This section discusses the performance measures of the feature selection methods. Feature selection reduces the dimension by eliminating the least important features and retaining the important ones. It is expected that, by reducing the dimensions, the model and computational performance will increase. While the baseline uses all 118 features, the other methods use only the feature subset selected by the respective algorithm. Table 2 shows the selected features for each method.


**Table 2.** Selected features in each method.

The performance of each feature selection method is measured to obtain the performance parameters that are used as the ARAS criteria in the next stage. The model performance measurements consist of accuracy, precision, recall, f1-score, weighted precision, weighted recall, and weighted f1-score. The computation performance consists of train time and inference time. Interestingly, across the experiments conducted, the baseline model requires the longest training time (34.3910 s) compared to the feature selection methods. This is expected because the baseline model uses all the features in the psychosocial education dataset. However, on the accuracy metric, the baseline model produced lower results than other models that use far fewer features. Details of the performance metrics for each feature selection method can be seen in Table 3.

**Table 3.** The performance metrics of feature selection methods.


#### *4.2. Evaluation Feature Selection Method Using ARAS*

At this stage, the best feature selection method is chosen. ARAS determines the ranking using the performance metrics of each feature selection method. The first step is to initialize the decision-making matrix, with each feature selection method as an alternative and the performance metrics, i.e., accuracy (A), precision (P), recall (R), f1-score (FS), weighted precision (WP), weighted recall (WR), weighted f1-score (WFS), train time (TT), and inference time (IT), as criteria *x<sub>n</sub>*. Based on the analysis, criteria *x*<sub>1</sub>–*x*<sub>7</sub> are benefit criteria, while *x*<sub>8</sub> and *x*<sub>9</sub> are cost criteria. In addition, the weight (*w*) of criterion *x*<sub>1</sub> is set to 0.2 and the weight of each of criteria *x*<sub>2</sub>–*x*<sub>9</sub> is set to 0.1, so that the weights sum to 1. Criterion *x*<sub>1</sub> gets a higher weight because, in real problems, accuracy is one of the most important performance metrics and is widely used as a benchmark for machine learning [66,67]. The complete initial decision-making matrix, with the weight and optimization direction of each criterion, is shown in Table 4.

After the initial decision matrix is completed, the next step is to normalize the decision matrix. First, the optimal value *A*<sub>0</sub> is determined. The max operator is used for benefit criteria, and the min operator is used for cost criteria, following Equation (10):

$$x_{A_0 j} = \left\{ \begin{array}{l}
\max(0.9725, 0.9734, 0.9752, 0.9757, 0.9265, 0.9770, 0.9770, 0.9706), \\
\max(0.9581, 0.9585, 0.9614, 0.9647, 0.9324, 0.9719, 0.9684, 0.9537), \\
\max(0.9427, 0.9453, 0.9484, 0.9479, 0.9257, 0.9478, 0.9513, 0.9400), \\
\max(0.9501, 0.9517, 0.9547, 0.9569, 0.9267, 0.9591, 0.9597, 0.9400), \\
\max(0.9721, 0.9730, 0.9749, 0.9754, 0.9344, 0.9769, 0.9768, 0.9703), \\
\max(0.9725, 0.9734, 0.9752, 0.9757, 0.9265, 0.9770, 0.9770, 0.9706), \\
\max(0.9722, 0.9731, 0.9750, 0.9753, 0.9273, 0.9766, 0.9768, 0.9704), \\
\min(35.942, 3.9885, 3.9892, 3.9893, 8.4677, 2.0112, 2.9948, 3.9878), \\
\min(0.9668, 0.9933, 0.9950, 0.9636, 0.9770, 0.9823, 0.9646, 0.9646)
\end{array} \right\}$$

$$x_{A_0 j} = \{0.9770, 0.9719, 0.9513, 0.9597, 0.9770, 0.9770, 0.9766, 2.0112, 0.9646\}$$

**Table 4.** Initial decision-making matrix *X*.

After obtaining the values *A*<sub>0*j*</sub>, all of the criteria in the matrix are normalized. The decision matrix is normalized using Equation (12) for benefit criteria and Equation (13) for cost criteria. The normalized decision matrix *X̄* is shown in detail in Table 5. For example, the values of *x̄*<sub>1</sub>(*A*<sub>0</sub>) and *x̄*<sub>1</sub>(*Baseline*) are calculated as follows:

$$\overline{x}_1(A_0) = \frac{0.9770}{0.9770 + 0.9725 + 0.9734 + 0.9752 + 0.9757 + 0.9265 + 0.9770 + 0.9770 + 0.9706} = 0.1120$$

$$\overline{x}_1(Baseline) = \frac{0.9725}{0.9770 + 0.9725 + 0.9734 + 0.9752 + 0.9757 + 0.9265 + 0.9770 + 0.9770 + 0.9706} = 0.1115$$

After the normalized matrix *X̄* is obtained, the next step is to perform the weighted normalization by multiplying each normalized value by its criterion weight according to Equation (16). The results of the weighted normalization are presented in detail in Table 6.


**Table 5.** Normalized decision-making matrix *X* on each criterion.

**Table 6.** Normalization-weighted decision-making matrix *X*ˆ of each criterion.


Next, the optimality value *S<sub>i</sub>* of each alternative *i* is calculated using Equation (17). After that, the value *K<sub>i</sub>* is calculated by dividing *S<sub>i</sub>* by *S*<sub>0</sub> using Equation (18), and the alternatives are ranked by *K<sub>i</sub>*. For example, the values *S*<sub>0</sub>, *K*(*Baseline*), and *K*(*ANOVA*) are computed as follows:

$$S_0 = 0.0224 + 0.0113 + 0.0112 + 0.0112 + 0.0112 + 0.0112 + 0.0112 + 0.0201 + 0.0112 = 0.1210$$

$$K(Baseline) = \frac{0.1013}{0.1210} = 0.8372$$

$$K(ANOVA) = \frac{0.1101}{0.1210} = 0.9099$$
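The arithmetic of this worked example can be checked with a few lines of code. The weighted normalized values of *A*<sub>0</sub> and the optimality values 0.1013 (Baseline) and 0.1101 (ANOVA) are taken from the calculation above; the variable names are illustrative.

```python
# Weighted normalized values of the optimal alternative A0, as listed above
s0_terms = [0.0224, 0.0113, 0.0112, 0.0112, 0.0112, 0.0112, 0.0112, 0.0201, 0.0112]

s0 = round(sum(s0_terms), 4)             # Eq. (17) for i = 0
k_baseline = round(0.1013 / s0, 4)       # Eq. (18) with S_Baseline = 0.1013
k_anova = round(0.1101 / s0, 4)          # Eq. (18) with S_ANOVA = 0.1101

print(s0, k_baseline, k_anova)
```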

In detail, the calculation of the optimal value *K* is presented in Table 7. Then, based on the results of the *Ki*, the final results of the rankings are shown in Table 8.

**Table 7.** The result of optimality value on the feature selection methods.



**Table 8.** Final results of the ARAS rank for feature selection methods.

The decision results using ARAS show that the ERF method is top-ranking, and the baseline is at the lowest rank. ERF with 11 features gives better results than the baseline, which uses 118 features. It shows that selecting the best subset features is still relevant to machine learning problems.

We also compare the ARAS rank with a single machine learning measurement, namely accuracy. The results obtained tend to be the same. With ARAS, the order is ERF > Lasso > MI > Chi-square > ANOVA > RFE > EFS > Baseline, while with accuracy alone, the order is ERF > Lasso > MI > Chi-square > ANOVA > Baseline > RFE > EFS. This happens because the overall performance produced by the feature selection methods is mostly stable, so there are no models with cross-dominating criteria. To illustrate the dominating performance results, Figure 3 shows the comparative performance of every model.

**Figure 3.** Performance comparison of feature selection methods.

The experiment shows that the machine learning phase accomplished the model's performance analysis. By selecting specific metrics, the aim of the performance of machine learning can be defined. For example, the accuracy metric can be used as a benchmark metric to find the best accuracy model. Nevertheless, a decision model to measure and evaluate the overall performance metrics of feature selection methods is still necessary.

Finally, the goal of the proposed method is to show that the proposed model can resolve the problem formulation. Theoretically, this methodology is relevant and should be proposed. ARAS can perform a fair mapping in ranking the feature selection methods in the psychosocial education domain, especially for identifying the stress levels of teachers in Colombia. However, this methodology has not fully demonstrated the significance of performance evaluation in the current dataset case, where several dominant criteria ultimately dictate the ranking results. More experiments are necessary to provide a robust comparison and conclusion, and further experiments on similar datasets might provide better results.

#### **5. Conclusions**

ARAS has proven effective and can be implemented as an evaluation model to determine the best feature selection method for the psychosocial education dataset. The evaluation used performance metrics to rank the feature selection methods. The evaluation shows that the determination of weights and optimization values plays an essential role in the ARAS model. Giving subjective weights affects the overall ARAS ranking.

Regarding future research directions, we recommend further investigation of the proposed method on different datasets in which the criteria conflict with one another and none predominates over the others. Imbalanced datasets that show uneven and contradictory performance metrics can be challenging. Such cases are expected to measure the extent of ARAS's ability to provide an optimal ranking.

**Author Contributions:** Conceptualization, F.M. and J.-T.W.; Methodology, J.-T.W. and F.M.; Software, M.M.; Visualization, F.M. and M.M.; Project administration, J.-S.L. and J.-T.W.; Supervision, J.-S.L. and J.-T.W.; Writing—original draft, F.M. and M.M.; Writing—review and editing, J.-T.W. and J.-S.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Ministry of Science and Technology, Taiwan, under Grant MOST-110-2637-E-011-003-.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors gratefully acknowledge the support extended by the Ministry of Science and Technology, Taiwan, under Grant MOST-110-2637-E-011-003-.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Aggregation of Rankings Using Metaheuristics in Recommendation Systems**

**Michał Bałchanowski \*,† and Urszula Boryczka †**

Institute of Computer Science, Faculty of Science and Technology, University of Silesia in Katowice, Będzińska 39, 41-200 Sosnowiec, Poland; urszula.boryczka@us.edu.pl

**\*** Correspondence: michal.balchanowski@us.edu.pl

† These authors contributed equally to this work.

**Abstract:** Recommendation systems are a powerful tool that is an integral part of a great many websites. Most often, recommendations are presented in the form of a list that is generated by using various recommendation methods. Typically, however, these methods do not generate identical recommendations, and their effectiveness varies between users. In order to solve this problem, the application of aggregation techniques was suggested, the aim of which is to combine several lists into one, which, in theory, should improve the overall quality of the generated recommendations. For this reason, we suggest using the Differential Evolution algorithm, the aim of which will be to aggregate individual lists generated by the recommendation algorithms and to create a single list that will be fine-tuned to the user's preferences. Additionally, based on our previous research, we present suggestions to speed up this process.

**Keywords:** recommendation systems; rank aggregation; differential evolution; supervised learning; matrix factorization; metaheuristic

#### **1. Introduction**

In today's world, where the amount of information available is overwhelming for a common user, the use of systems designed to support the user in making decisions is becoming more apparent. This role is taken on by recommendation systems, which are increasingly used in various areas of our life, from buying items on auction sites through selecting a movie to adding new friends on social networks. The growing popularity of this type of website means that there is a real demand for recommendation systems that work efficiently and not only increase the quality of the generated recommendations but also ensure their novelty and diversity [1].

Within the recommendation systems, we can distinguish two main approaches to creating a recommendation. They can be based on an attempt to predict what rating (e.g., on a scale from 1 to 5) the user would give to an item in the system. They can also attempt to predict a certain set of items, most often presented in the form of a list that would be recommended to the user [2] (this problem is also called the top-N recommendations problem). Additionally, we can rely on data entered directly by the user or we can infer their preferences by observing how they use the system.

This article will also discuss the problem of rank aggregation, which has been described thoroughly in the literature, especially in the context of information retrieval systems [3–5], and has been proven to be NP-hard [6] even for small collections of rankings (e.g., four or more). However, according to some researchers [7], this topic has not yet been sufficiently studied in the context of recommendation systems. Depending on the dataset used, individual recommendation algorithms can generate different recommendations, and choosing one particular algorithm over others can decrease the quality of recommendations for some of the users.

**Citation:** Bałchanowski, M.; Boryczka, U. Aggregation of Rankings Using Metaheuristics in Recommendation Systems. *Electronics* **2022**, *11*, 369. https://doi.org/10.3390/electronics11030369

Academic Editor: Stefano Ferilli

Received: 30 November 2021 Accepted: 22 January 2022 Published: 26 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Therefore, the use of aggregation techniques has also been proposed in this context, where the aim is to combine the individual lists generated by different recommendation techniques in order to create one "super" list.

Additionally, because we optimize the average precision (AP) measure, we use the Differential Evolution (DE) algorithm, a metaheuristic that makes the direct optimization of this measure possible [8]. Our method is universal, and thus any metaheuristic algorithm for real-valued optimization can be used here (e.g., PSO [9]). We chose DE for our research because it is well suited to this type of optimization [10–12]. DE is arguably one of the most versatile and stable population-based search algorithms and is robust across many different optimization problems [13]. Additionally, it is relatively simple to implement and has a small number of control parameters, which makes it easy to tune.

The main contribution of this paper is to present how the DE algorithm can be applied to the problem of rank aggregation in recommendation systems, which will be supported by tests performed on the MovieLens 100k data set [14]. We will also present, based on our previous work [15], how to accelerate this algorithm while generating ranking lists of items using a dedicated fitness function. This function can also be successfully used in other metaheuristics that use real-valued representations of individuals in a population. In addition, we will present research that will show that the use of metaheuristic algorithms in the context of the problem of rank aggregation can be additionally justified due to the resistance of these techniques to algorithms that generate low-quality recommendations.

The article is divided into six sections. Section 2 reviews the current literature. Section 3 presents a formal definition of a recommendation system, an explanation of the rank aggregation problem and the Differential Evolution algorithm. Section 4 describes our algorithm along with the system architecture and a figure showing a simple example of how the matrix fitness function is calculated. Section 5 discusses how the test environment was prepared for conducting the experiments and presents the results with commentary. Finally, Section 6 discusses our conclusions and research proposals for the future.

#### **2. Literature Overview**

The problem of recommendations can be presented as the problem of predicting how a user would rate a given item (e.g., on a scale from 1 to 5) [16], or as the problem of creating a list of suggested items, which is referred to as the Top-N recommendation problem [17]. In fact, the latter is closer to the real-life scenario when working with recommendation systems [18], where the recommendations are most often presented in the form of a list of suggested items in which the elements at the beginning are more important than the ones at the end.

There have been many works describing this approach in the context of recommendation systems [2,17]. In order to evaluate the quality of such recommended lists, measures that take into account the order in which the items appear on the list are used, e.g., Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). Due to the fact that these measures are usually difficult to directly optimize, metaheuristic algorithms can be applied here [8,19]. A good review of evolutionary algorithms in recommendation systems is the paper [20], in which the authors presented an overview of the current research in this area and suggestions for research in the future.

In this article, we also pay attention to the problem of rank aggregation. A great deal of work has been done on this subject, especially in the context of information retrieval systems [21]. The algorithms used for rank aggregation are generally divided into two categories: permutation-based and score-based. Many techniques have been suggested in the literature, for example: Borda Count [6], COMB\* [22] (e.g., COMBSUM and COMBMNZ), or OutRank [23]. Within the context of recommendation systems, there have also been several works addressing this problem. In [24], a system was suggested that creates recommendations for an entire group of users rather than, as is usual, for a single user.

In the work [25], the authors suggested creating a multi-criteria recommendation system, which, in addition to the quality of the generated recommendations, also took into account measures, such as novelty and diversity. In [26], the authors used genetic programming to create a recommendation system that generated recommendations by optimizing the MAP measure. It is also worth paying attention to [7], in which the researchers asked themselves whether the problem of rank aggregation in the context of recommendation systems is worth looking into. They performed extensive experiments and suggested the direction in which future work in this area should go.

#### **3. Background of the Research**

This section explains the basic information and the definitions used in this article. First, the definition of the recommendation system, methods of obtaining feedback from users and the problem of matrix factorization will be discussed. Then, we will present the problem of rank aggregation in the context of recommendation systems. Finally, we will present the metaheuristic algorithm used in our research.

#### *3.1. Recommender System*

In a recommendation system, we distinguish a certain set of users $U = \{u_1, \dots, u_{|U|}\}$ and a certain set of items $I = \{i_1, \dots, i_{|I|}\}$. Each of the users $u \in U$ has interacted with some of the items $i \in I$. The task of the recommendation system for the Top-N recommendation problem is to predict, on the basis of the historical data collected in the system, the user's next choices and to create a list of items that are likely to interest the user. High-quality recommendations contribute to user satisfaction, which can translate into an overall good impression when using the platform. Depending on what kind of feedback is obtained from the user, recommendation techniques can be based on data from:


It should also be noted that recommendation systems often do not have good quality features for users and items. For this reason, various methods of obtaining them have been proposed, and one of the most popular techniques is to factorize the user–item matrix. With this, we can obtain features that are also called latent features. More on the subject can be found in [33].

#### *3.2. Rank Aggregation Problem*

This section describes the problem of rank aggregation in the context of recommendation systems. We define a ranking as an ordered list of items $\tau = [i_j \geq i_h \geq \dots \geq i_z]$, where the items at the beginning of the list (first position) are more significant than those at the end (last position). We denote the position of item $i_j$ in ranking $\tau$ as $\tau(i_j)$. Two items $i_j \in \tau$ and $i_h \in \tau$ can be compared by checking their positions in the list $\tau$. If the item $i_j$ is ranked higher in $\tau$ than the item $i_h$, this is written as $\tau(i_j) > \tau(i_h)$.

In recommendation systems, the rankings to be aggregated are generated by various algorithms, where a single algorithm will be denoted as $a_h$, and a set of $n$ recommendation algorithms will be denoted as $A = \{a_1, a_2, \dots, a_n\}$. Each of the algorithms $a_h \in A$ generates a ranking $\tau$, and the set of all $n$ created rankings is defined as $T = \{\tau_1, \tau_2, \dots, \tau_n\}$. In addition, all algorithms that generate recommendations take, as input, the matrix $M_{m \times n}$. Each row in this matrix represents a user $u_i \in U$, and each column represents an item $i_j \in I$. The value $M_{i,j}$ corresponds to the rating given by the user $u_i$ to the item $i_j$. Note that users rate only a small fraction of the items appearing in such a matrix; therefore, the matrix is very sparse.

The problem of rank aggregation can be defined as the problem of finding, for each user $u_i \in U$, a combination of the rankings in $T$ generated by the set of recommendation algorithms $A$ that creates a single list ("super-list") optimizing a given criterion (in our case, the average precision) to the greatest extent. Such a list should, in theory, be "better" than the individual lists.

#### *3.3. Differential Evolution*

In order to optimize the AP measure, the Differential Evolution algorithm was used, which is a metaheuristic developed by K. Price and R. Storn [10]. It is based on individuals, which are represented as vectors of real numbers. For this reason, it is primarily suitable for the optimization of continuous functions, although there are papers that have suggested modifications to the algorithm and its adaptation to the optimization of discrete problems [30].

There is a population $P$ of individuals, where each individual is a solution to an optimization problem, often represented as a $d$-dimensional vector of real-valued numbers. The initial population $P$ can be initialized randomly and should cover the entire search space. In the classic version of the algorithm, the initial individuals are assumed to be drawn from a uniform probability distribution. In order to determine how good a given individual in the population is, it is necessary to define the fitness function, which assigns a certain value to each individual in the population.

This value is later used in the selection process, which is the process of choosing which individuals should go to the next generation. With each iteration, the algorithm attempts to improve the population of individuals until the stopping criterion is reached (e.g., a certain number of iterations). Owing to the use of crossover and mutation operators [34], the population of individuals changes and the algorithm attempts to find a better solution. Mutation creates a new individual by combining three randomly selected individuals and can be expressed with the following formula:

$$
\vec{v}_i = \vec{x}_{r_1} + F(\vec{x}_{r_2} - \vec{x}_{r_3}),
\tag{1}
$$

where $r_1$, $r_2$ and $r_3$ are the indices of randomly selected, mutually distinct individuals ($r_1 \neq r_2 \neq r_3$). The $F$ parameter is the amplification factor applied to the difference vector and usually takes a value in the range [0, 1]. After creating a new individual $\vec{v}_i$ using the mutation operator, we use the crossover operator according to Formula (2). The $CR$ parameter determines the crossover probability. Additionally, there is a $rand$ function that generates a random number in the range [0, 1].

$$u_{i,j} = \begin{cases} v_{i,j} & \text{if } \left(rand(j) \le CR \text{ or } i = i_{rand}\right) \\ x_{i,j} & \text{otherwise.} \end{cases} \tag{2}$$
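The mutation, crossover and selection steps described above can be sketched in a few lines of Python. This is a minimal illustration of the classic DE scheme and not the implementation used in the experiments. The function names, the toy objective `sphere_fitness`, the population size and the seed are assumptions made for the example. The sketch also assumes that the three donors differ from the current individual and follows the common binomial-crossover convention in which one randomly chosen dimension is always taken from the mutant vector.

```python
import random


def de_generation(population, fitness, F=0.5, CR=0.9, rng=random):
    """One generation of classic DE, maximizing `fitness` (illustrative sketch)."""
    NP = len(population)
    d = len(population[0])
    next_population = []
    for i, x in enumerate(population):
        # Mutation (Formula 1): three mutually distinct donors, here also different from i.
        r1, r2, r3 = rng.sample([k for k in range(NP) if k != i], 3)
        v = [population[r1][j] + F * (population[r2][j] - population[r3][j])
             for j in range(d)]
        # Crossover (Formula 2): one random dimension is always inherited from the mutant.
        j_rand = rng.randrange(d)
        u = [v[j] if (rng.random() <= CR or j == j_rand) else x[j]
             for j in range(d)]
        # Selection: the trial vector replaces the parent only if it is at least as fit.
        next_population.append(u if fitness(u) >= fitness(x) else x)
    return next_population


def sphere_fitness(x):
    # Toy objective (assumption): the maximum value 0 is reached at the origin.
    return -sum(value * value for value in x)


rng = random.Random(0)
population = [[rng.uniform(-5, 5) for _ in range(4)] for _ in range(20)]
initial_best = max(sphere_fitness(x) for x in population)
for _ in range(100):
    population = de_generation(population, sphere_fitness, rng=rng)
final_best = max(sphere_fitness(x) for x in population)
```

Because selection is greedy, the best fitness in the population can never decrease from one generation to the next, which is why `final_best` is at least `initial_best` regardless of the seed.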

#### **4. Suggested AggRankDE Method**

Our AggRankDE method uses the scores that the individual recommendation algorithms assign to each item $i$ in the set of all items $I$ to find a weight vector $W$ that achieves the largest AP value on the training set $TS$. It should be noted that this vector is created separately for each user $u_i \in U$, since each user has their own individual recommendation preferences. Additionally, based on our previous research, we suggest a matrix representation for the scores given by the individual algorithms and for the population of individuals of the DE algorithm.

Details of this representation can be found in our previous work [15], and a simple example is presented in Figure 1. As a result, it is easier to parallelize the process of learning user preferences and, thus, to reduce the computation time needed to find a particular preference vector $W$.

**Figure 1.** Toy example of the multiplication of two matrices. Matrix $A$ represents the scores assigned by the recommendation algorithms to each item $i \in I$, and matrix $B$ represents some population $P$ (real-valued vectors) of the metaheuristic algorithm. The matrix product $C$ represents new scores for each item $i \in I$, which, after sorting, create new rankings $\tau_n$, where $n \in \{1, 2, \dots, NP\}$.
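The idea behind Figure 1 can be illustrated with a small plain-Python sketch. The orientation of the matrices (items in the rows of $A$, one individual per column of $B$), the helper names and all numeric values are assumptions made for the example; the original implementation is described in [15]. In practice, the hand-written product would be replaced by an optimized matrix routine, which is where the speed-up comes from.

```python
def matmul(A, B):
    """Plain-Python matrix product, used here only to keep the sketch self-contained."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def rankings_from_scores(C):
    """Sort item indices by descending score, producing one ranking per column of C."""
    n_items = len(C)
    return [sorted(range(n_items), key=lambda item: -C[item][n])
            for n in range(len(C[0]))]


# Rows: 3 items; columns: 2 recommendation algorithms (made-up, already normalized scores).
A = [[0.9, 0.1],
     [0.2, 0.8],
     [0.5, 0.5]]
# Rows: 2 algorithms; columns: NP = 3 individuals (weight vectors) of the population.
B = [[1.0, 0.0, 0.25],
     [0.0, 1.0, 0.75]]

C = matmul(A, B)                    # 3 items x 3 individuals: new scores
rankings = rankings_from_scores(C)  # one ranking of item indices per individual
```

A single product scores every item for every individual at once, so the fitness of the whole population can be evaluated from `C` without looping over individuals and algorithms separately.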

The hybridization technique was taken from [25] and is based on assigning weights $W = \{w_{a_1}, w_{a_2}, \dots, w_{a_n}\}$, one for each algorithm $a_h$ from the set of algorithms $A = \{a_1, a_2, \dots, a_n\}$. The aggregated value for each item is calculated according to the formula:

$$\hat{p}(i_j|u_i) = \sum_{h=1}^{n} \hat{p}_{a_h}(i_j|u_i) \, w_{a_h} \tag{3}$$

where $w_{a_h}$ is the weight assigned to the algorithm $a_h \in A$, with each algorithm assigning a value of $\hat{p}_{a_h}(i_j|u_i)$ to each item $i_j$, which determines the degree of potential interest of user $u_i$ in this item. The scores must also be normalized so that all the algorithms in $A$ operate on the same scale.
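Formula (3) and the normalization step can be sketched as follows. The text does not state which normalization technique was used, so min-max scaling is an assumption here, as are the function names, the item identifiers, the raw scores and the weights.

```python
def min_max_normalize(scores):
    """Rescale one algorithm's scores to [0, 1] (min-max scaling is an assumption)."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {item: 0.0 for item in scores}
    return {item: (s - low) / (high - low) for item, s in scores.items()}


def aggregate(scores_by_algorithm, weights):
    """Formula (3): weighted sum of the normalized per-algorithm scores of each item."""
    normalized = {a: min_max_normalize(s) for a, s in scores_by_algorithm.items()}
    items = set().union(*(s.keys() for s in normalized.values()))
    return {item: sum(weights[a] * normalized[a].get(item, 0.0) for a in normalized)
            for item in items}


# Made-up raw scores on different scales for two of the algorithms named in the paper.
scores_by_algorithm = {
    "SVD": {"m1": 4.0, "m2": 2.0, "m3": 3.0},
    "BPR": {"m1": -1.0, "m2": 1.0, "m3": 0.0},
}
weights = {"SVD": 0.25, "BPR": 0.75}  # hypothetical weight vector W

aggregated = aggregate(scores_by_algorithm, weights)
ranking = sorted(aggregated, key=aggregated.get, reverse=True)
```

With these toy values, the larger weight on BPR pulls its favourite item `m2` to the top of the aggregated list, even though SVD ranks it last.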

Using an evolutionary metaheuristic requires defining a fitness function so that, in subsequent iterations, the algorithm can reward individuals who are better adapted, i.e., those with a greater value of the fitness function. In our case, this will be the average precision ($AP$) measure calculated for the active user $u_A$ as follows:

$$Fitness = AP@k(R, S) \tag{4}$$

where $S$ is the set of items recommended by the system and $R$ is the set of items that user $u_A$ rated in $TS$. According to our experiments, the value of $k$ in $AP$ during the learning process should be set to the number of items that the user $u_A$ rated in their $TS$. In our opinion, such a value is most appropriate because it does not cause the algorithm to overfit. The details of how to calculate $AP$, especially in the context of recommendation systems, can be found in our paper [35]. The architecture of our system is presented in Figure 2.

**Figure 2.** System architecture. The recommendation process is divided into two phases. In the first phase, recommendation algorithms generate recommendations in the form of lists, and active user *uA* is selected with all his *N* items from the training set. In the second phase, a metaheuristic algorithm works (in our case DE) with the dedicated fitness function, which allows for faster calculation of item scores, on the basis of which, new rankings will be created.
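The fitness in Formula (4) can be illustrated with a short AP@k sketch. The exact definition used by the authors is given in [35]; the version below is one common formulation, and the normalization by min(k, |R|), the function names and the toy data are assumptions.

```python
def average_precision_at_k(relevant, recommended, k):
    """AP@k: sum of precision@i over the ranks i <= k holding a relevant item, normalized."""
    relevant = set(relevant)
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalizing by min(k, |R|) is one common convention (assumption).
    return precision_sum / min(k, len(relevant))


def mean_average_precision(cases, k):
    """MAP@k: AP@k averaged over users; `cases` holds (relevant, recommended) pairs."""
    return sum(average_precision_at_k(r, s, k) for r, s in cases) / len(cases)


R = {"m1", "m3"}                # items the active user rated (toy data)
S = ["m1", "m2", "m3", "m4"]    # list produced by the system (toy data)
fitness = average_precision_at_k(R, S, k=len(R))
```

Here `k` is set to the number of items the user rated, as suggested in the text. Only the first two positions of `S` are inspected, and a single hit at rank 1 yields a fitness of 0.5.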

#### **5. Experimental Evaluation**

Due to the fact that recommendations are most often presented to users in the form of a list, in our experiments, we used the average precision measure (AP) and the mean average precision measure (MAP). The *AP* measure is used in the context of a specific (one) user, and, in our research, it was used to compare the list of items recommended to the user with the list of items available in the test set for a given user. This allowed us to calculate the quality of the generated recommendations.

In addition, it should be noted that this measure also takes into account where the relevant items are located on the list. If the relevant items are higher (closer to the first position), then the *AP* value is also higher. Due to the fact that metaheuristics are computationally expensive, we chose only a certain subset of users for the experiments. We randomly selected 50 users who rated at least 150 movies in the dataset. The experiments carried out as part of this paper were performed using the popular MovieLens 100k dataset. The AggRankDE algorithm adopts four algorithms as the input: SVD, WMF, BPR and WARP. All of them are based on matrix factorization, and thus features are generated for each item and for each user on the basis of the user–item matrix.

These features are called latent features due to the fact that their meaning cannot be explained. In addition, these algorithms are considered to be the current state-of-the-art and are often used to compare research results in recommendation systems for the Top-N recommendation problem. The research environment was implemented in Python and C#, and the research was carried out on a computer with an Intel Core i5-7600 (3.50 GHz) with 16 GB RAM.

#### *5.1. Parameters Tuning*

Before creating an aggregation, the parameters of the algorithms that are included must be tuned. To this end, experiments were conducted to tune their values so that they could achieve the best possible MAP measure on the set of users used for the experiments. This is an important step, due to the fact that improper tuning of the parameters can result in the generation of poor quality recommendations. Table 1, presented below, shows the parameter values used during the tuning process.

This process consisted of first setting all parameters to the default values and then changing only the one parameter that was selected for tuning. After the process was completed, the best values were saved (see the "*Best values*" column in Table 1). The detailed MAP@10 values obtained during this process for the various parameters are presented in Table 2 (learning rate), Table 3 (regularization) and Table 4 (latent features).

The process of tuning the *CR* and *F* parameters for the DE algorithm was also performed, and the results of these experiments are presented in Tables 5 and 6. In addition, in article [10], the authors indicated that a good value for the parameter *NP* is a value between 5 · *d* and 10 · *d*, where *d* is the number of dimensions. The authors also point out that the parameter *F*, equal to 0.5, is usually a good initial value and this parameter typically takes a value in the range [0.4, 1]. The final values of the Differential Evolution algorithm that were used during the experiments are presented in Table 7.



**Table 2.** Learning rate parameter tuning. This table presents MAP@10 for different parameter values. The remaining parameters are set to the default values according to Table 1.



**Table 3.** Regularization parameter tuning. This table presents MAP@10 for different parameter values. The remaining parameters are set to the default values according to Table 1.

**Table 4.** Latent features (dimensions) parameter tuning. This table presents MAP@10 for different parameter values. The remaining parameters are set to the default values according to Table 1.


**Table 5.** F parameter tuning. This table presents MAP@10 for different parameter values. The remaining parameters are set to the default values according to Table 1.


**Table 6.** CR parameter tuning. This table presents MAP@10 for different parameter values. The remaining parameters are set to the default values according to Table 1.



**Table 7.** The differential evolution parameters used in the experiments.

#### *5.2. Experimental Setup*

In order to prepare the environment for testing, the data first had to be prepared appropriately. User ratings were sorted by the time at which each rating was issued and then divided into two sets: training (80%) and test (20%). Owing to this approach, our algorithm attempts to predict the user's future preferences based on the user's previous activity. The task is not trivial because of the large number of items from which the recommendations presented to the user must be chosen.
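The time-based split described above can be sketched as follows. Whether the split was applied per user or over the whole dataset is not stated in the text; the sketch assumes a per-user split. The function name, the tuple layout and the toy ratings are likewise assumptions.

```python
def chronological_split(ratings, train_fraction=0.8):
    """Sort one user's ratings by timestamp and cut them into training and test parts."""
    ordered = sorted(ratings, key=lambda rating: rating[2])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy ratings for one user as (item_id, rating, timestamp); all values are made up.
ratings = [
    ("m3", 4, 300), ("m1", 5, 100), ("m5", 2, 500),
    ("m2", 3, 200), ("m4", 1, 400),
]
train, test = chronological_split(ratings)
```

The oldest 80% of the ratings end up in `train` and the most recent 20% in `test`, so the model is always evaluated on interactions that happened after those it learned from.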

Fifty users were randomly selected for the study, where a recommendation was generated for each user, and then the results of the suggested recommendations were compared with the test sets of each user. The AP measure was used to calculate the quality of the generated recommendations, and then its value was averaged for all users selected for testing; thus, the tables show the results given using the MAP measure. In order to show that our algorithm gives good results, we compared it with other algorithms used for the rank aggregation problem, such as the Borda Count, Majority Judgement, Pairwise Method (Copeland's) and Score Voting (mean).
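As an illustration of the baseline techniques, the classic Borda Count can be sketched in a few lines. The exact variant used in the comparison is not specified in the text, so the scoring rule below (an item at zero-based position p in a list of length m earns m − p points), the tie-breaking rule and the toy rankings are assumptions.

```python
def borda_count(rankings):
    """Aggregate rankings: an item at position p in a list of length m earns m - p points."""
    points = {}
    for ranking in rankings:
        m = len(ranking)
        for position, item in enumerate(ranking):
            points[item] = points.get(item, 0) + (m - position)
    # Higher totals first; ties are broken by item identifier to keep the output deterministic.
    return sorted(points, key=lambda item: (-points[item], item))


# Three made-up rankings of the same items, e.g., produced by three algorithms.
rankings = [
    ["m1", "m2", "m3"],
    ["m2", "m1", "m3"],
    ["m2", "m3", "m1"],
]
aggregated = borda_count(rankings)
```

Unlike the weighted approach of AggRankDE, this scheme treats every input ranking equally, so a single poor-quality list influences the result as much as a good one.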

In the research, we additionally took into account the quality of the recommendations achieved by the individual algorithms that took part in the aggregation. These included the Bayesian Personalized Ranking (BPR) and Weighted Approximate-Rank Pairwise (WARP) algorithms, implementations of which are available in the LightFM library [36]. In addition, the usual SVD algorithm, marked in the results as "SVD", and a weighted matrix factorization (WMF) algorithm were implemented.

#### *5.3. Results*

In Section 5.1, we presented the process of tuning the parameters for the various algorithms used to create aggregations. This is an important step, due to the fact that the quality of the generated recommendations by the different recommendation techniques can largely depend on the parameters that are set. For example, by analyzing Table 4, it can be seen that the MAP value obtained was highly dependent on the number of latent features. Additionally, the research presented in Table 2 showed that the parameter "Learning rate", which is characteristic for the BPR and WARP techniques, also required tuning as opposed to the parameter "Regularization" (Table 3) where the default value (0) generated the best quality of the recommendations.

While analyzing the results presented in Table 8, it can be seen that the AggRankDE algorithm aggregated the recommendation algorithms and improved the overall quality of the generated recommendations even compared to other aggregation techniques. This is an important observation because it shows that one "super" list can be created from several lists to improve the quality of recommendations, which is consistent with the experimental results by [7].

Looking at the quality of the recommendations generated by the different recommendation algorithms, we can see that, depending on *MAP*@, the quality of the recommendations varies. In general, as the number of items based on which the MAP@ measure is calculated increases, it can be seen that the quality of the recommendations decreases, although the AggRankDE algorithm improved the quality of the generated recommendations in all cases.

Additionally, after the introduction of the "Random" method (Table 9), which purposefully generated poor-quality recommendations, the quality of the aggregation produced by AggRankDE did not degrade significantly, in contrast with, for example, the Borda Count method. This indicates that AggRankDE has some resistance to weak algorithms used in the aggregation.

Table 10 presents the improvement in the speed (in seconds) of the generated recommendations after implementing the matrix fitness function. Time is measured for a single user in the system and depends on the number of iterations. Looking at this table, it can be seen that the improvement in speed is significant, and this is due to the fact that the operation on entire matrices can be easily parallelized. This is particularly important in the context of metaheuristic algorithms due to the fact that computing the fitness function is the most costly step in this type of algorithm.

**Table 8.** The quality of the generated recommendations (MAP) for different *MAP*@ values for the best parameters presented in Table 1.


**Table 9.** The quality of the generated recommendations (MAP) for different *MAP*@ values for the best parameters presented in Table 1 with the additional *RANDOM* algorithm.


**Table 10.** The average time (in seconds) depending on the number of iterations. The remaining parameters are according to Table 7.


The experimental results show that applying the DE algorithm with the hybridization technique presented in [25] produces good results. However, in our paper, we suggested how to improve it by using a dedicated fitness function to directly optimize the average precision measure and to speed up its calculation. By assigning different weights to the different algorithms included in the aggregation, the DE algorithm optimizes the average precision measure using a weighted hybridization technique in order to obtain the highest possible value of this measure on the training set.

During the testing phase, this translated into an increase in the quality of the generated recommendations. However, this process is computationally very expensive; therefore, we suggested using the matrix representation in the fitness function, which significantly accelerated the process of calculating the values for each item by the hybridization technique on the basis of which the ranking was created.

#### **6. Conclusions**

In this article, we presented how the Differential Evolution algorithm can be applied to the problem of rank aggregation in recommendation systems. The experiments were conducted on the MovieLens 100k dataset, and they showed that our algorithm improved the quality of the recommendations expressed by the MAP measure by 5% compared to other algorithms used for this purpose. Our research showed that, even using simple aggregation techniques, we could improve the quality of the generated recommendations.

In addition, in analyzing the research results, it can be seen that the AggRankDE algorithm is resistant to algorithms that generate poor-quality recommendations. We believe that this is due to the fact that, through the presence of a training phase in which the DE algorithm optimizes the AP measure, it is able to detect algorithms that generate low-quality recommendations and assign them correspondingly low weights, which results in them participating least in the creation of the list of recommended items.

Based on our previous work, we also suggested the use of matrix representation for the population of the DE algorithm and the values of coefficients calculated by individual aggregation algorithms for each item in the system. Such a representation makes it much easier to parallelize the process of calculating the values for individual items in the training phase on the basis of which new rankings (recommendations) are created. The calculation of the fitness function is the most expensive operation in the metaheuristic algorithms. In the context of the recommendation systems, this is particularly important, due to the relatively large data sets that are processed.

In subsequent papers, we will increase the number of algorithms that are part of the aggregation, add more aggregation techniques and increase the number of data sets on which the research is carried out. We will also conduct a more detailed analysis of the effectiveness of our algorithm, taking into account a larger number of users, and of how the parameters of the individual algorithms included in the aggregation and of the model itself affect the quality of the generated recommendations.

Another interesting direction of research would be to take a closer look at the quality of the generated recommendations by particular algorithms in relation to individual users. Although the AggRankDE algorithm is more robust to algorithms that generate poor recommendations, the decrease in the quality is noticeable. Presumably, eliminating the weaker quality algorithms would generally improve the quality of the aggregation produced. We believe that the problem of rank aggregation within the context of the recommendation systems has not yet been sufficiently studied, and this will likely be the direction of our future work.

**Author Contributions:** Conceptualization, U.B. and M.B.; Formal analysis, U.B. and M.B.; Investigation, M.B.; Methodology, U.B. and M.B.; Project administration, M.B.; Software, M.B.; Validation, U.B. and M.B.; Visualization, M.B.; Writing—original draft, U.B. and M.B.; Writing—review and editing, U.B. and M.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is available at https://grouplens.org/datasets/movielens/100k/ (accessed on 29 November 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


## *Article* **Integration Strategy and Tool between Formal Ontology and Graph Database Technology**

**Stefano Ferilli †**

Department of Computer Science, University of Bari, 70125 Bari, Italy; stefano.ferilli@uniba.it; Tel.: +39-080-5442293

† Current address: Via E. Orabona 4, 70125 Bari, Italy.

**Abstract:** Ontologies, and especially formal ones, have traditionally been investigated as a means to formalize an application domain so as to carry out automated reasoning on it. The union of the terminological part of an ontology and the corresponding assertional part is known as a Knowledge Graph. On the other hand, database technology has often focused on the optimal organization of data so as to boost efficiency in their storage, management and retrieval. Graph databases are a recent technology specifically focusing on element-driven data browsing rather than on batch processing. While the complementarity and connections between these technologies are patent and intuitive, little exists to bring them to full integration and cooperation. This paper aims at bridging this gap, by proposing an intermediate format that can be easily mapped onto the formal ontology on one hand, so as to allow complex reasoning, and onto the graph database on the other, so as to benefit from efficient data handling.

**Keywords:** knowledge representation; formal ontologies; graph databases

**Citation:** Ferilli, S. Integration Strategy and Tool between Formal Ontology and Graph Database Technology. *Electronics* **2021**, *10*, 2616. https://doi.org/10.3390/electronics10212616

Academic Editor: Agnieszka Konys

Received: 24 September 2021 Accepted: 24 October 2021 Published: 26 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Two main perspectives, very different from each other, have been adopted in Computer Science for information storage and handling. The 'Knowledge Base' (KB) perspective is interested in high-level reasoning on the available information, so as to infer implicit information or check the consistency of the information with respect to the reference domain. It is pursued by the Knowledge Representation (KR) branch of Artificial Intelligence (AI) and includes the research field of formal ontologies. The 'Data Base' (DB) perspective is a traditional branch of research in Computer Science interested in developing optimal data organizations aimed at efficient storage, management and retrieval. While clearly complementary, these two perspectives have traditionally been investigated separately. However, due to the increasingly pervasive use of AI solutions in many applications, it would be extremely relevant to take advantage of both.

A new opportunity for cooperation comes from the recent development of *Graph Databases*, a kind of NoSQL DB aimed at optimizing element-driven data browsing rather than batch processing as in traditional relational DBs. Another difference between graph and relational DBs is that the former do not have a pre-defined schema to describe and organize the data, which obviously affects the interpretability and accessibility of the data by the applications and their interoperability. A *graph* is a data structure consisting of nodes (usually representing things) and arcs connecting these nodes (usually representing relationships between things). The arcs may be directed, if they have a direction, and may have attributes or labels qualifying or quantifying the relationship. Interestingly, when the terminological part of an ontology (Tbox, reporting definitions and axioms) is considered in conjunction with the assertional part (Abox, specifying individuals or instances) the result is a so-called *Knowledge Graph* (KG, a kind of KB) [1]. Whilst the literature on ontologies often defines them as encompassing both parts, the relevant literature adopts this very definition for KGs, equating the ontology to the data model only:

• "A knowledge graph is created when you apply an ontology (the data model) to a dataset of individual data points (the [...] data). In other words:

ontology + data = knowledge graph" [2].


It is clear that graph representation can be the missing link to join the two perspectives/technologies and take the best from each. Unfortunately, formal ontologies and graph DBs refer to different graph models which cannot straightforwardly be combined together. This paper proposes a technology, called *GraphBRAIN*, aimed at bridging the gap between them through the following contributions:


It would allow graph DB developers to carry out high-level reasoning on their data. Indeed, formal, automated reasoning is much more powerful than the DB's query language, e.g., using ontological reasoning one may check consistency, correctness or completeness of the data. Using rule-based reasoning one may infer information that is not explicitly expressed in the data, possibly defined by complex patterns (as expressible in Logic Programming). Even more, multiple inference strategies (e.g., abduction, argumentation, etc.), not just deduction, can be carried out.

We already developed prototypes of the library, of a tool for building and maintaining the schema and of a tool for handling and consulting the DB based on the schema. This preliminary implementation of GraphBRAIN [6] is currently in use as part of a larger ongoing project [7], aimed at building an integrated system for AI-supported tourism, providing advanced support to end-users, entrepreneurs and institutions involved in touristic activities. It currently includes schemas describing the inter-related domains of 'tourism' (concerning history, cultural heritage items, points of interest, logistics and services, etc.), 'food' (concerning typical dishes and beverages from specific regions), 'computing' (concerning computing devices and their history) [8] and 'lam' (concerning libraries, archives and museums) [9].

Original contributions of this paper are:


This paper is organized as follows. After discussing in Section 2 the basic concepts and related works about formal ontologies and graph databases, Section 3 describes our proposed formalism for interfacing the two technologies. Then, Section 4 shows how schemas expressed in our formalism can be mapped onto graph DBs on one hand and onto a standard ontological format on the other. Finally, Section 5 concludes the paper.

#### **2. Basics and Related Work**

According to one of its many definitions in Computer Science, an *ontology* is "a formal, explicit specification of a shared conceptualization" [10]. Therefore, building an ontology requires a conceptualization step, by which: (1) the relevant entities, relationships and their attributes in a domain of interest are identified; (2) names are defined for them; (3) possibly (in the case of *formal* ontologies) axioms are stated expressing what is mandatory, permitted or prohibited in that domain. Explicit or implicit ontology building is pervasive in Computer Science (e.g., when designing E-R diagrams in DBs, or class diagrams in Object-Oriented systems, or predicates, functions and constants in KBs), to determine what can be represented in a (family of) application(s) and to define the rules driving their operation. Indeed, ontologies are key to improving communication among agents, fostering system interoperability and supporting reuse. Formal Ontologies specifically focus on automated reasoning aimed at making inferences on the available knowledge (concerning both the concepts and their instances) expressed according to the ontology. The main reasoning tasks include KB satisfiability, axiom entailment, concept satisfiability, instance retrieval, classification and query answering [11].

A standard formalism for expressing ontologies and KGs is the Web Ontology Language (OWL) [12]. In fact, a number of reasoners based on OWL are available [13] that provide implementations for all or part of the inferences. OWL is based on the Resource Description Framework (RDF) [14], originally developed for describing resources on the Web but amenable to knowledge representation in general. RDF graphs are based on a directed graph data model in which nodes are Uniform Resource Identifiers (URIs). A Named Graph is an RDF graph named by a graph URI. An RDF Graph is a collection of RDF Triples, representing arcs, i.e., units of RDF Data of the form:

#### (Subject, Predicate, Object)

where the Subject and Predicate are URIs and the Object may be a URI or a literal value. Triplestores (or 'Semantic Graph Databases') are DB Management Systems (DBMSs) specifically focusing on RDF Data. Sometimes they need to extend Triples to store extra information, thus actually becoming Column Stores. A common extension is the Quad, useful to add context or provenance to triples. Another NoSQL semantic graph database is GraphDB, which may work schema-free or exploiting an RDF ontological schema. Triplestores are specialized for RDF knowledge graphs and thus not optimized for generic data handling, like standard DBMSs. Since data representation constrained to using URIs does not necessarily make sense outside Automated Reasoning applications (e.g., the Semantic Web), we aim at working with 'normal' DBs but still adopting the graph approach and still being able to carry out formal reasoning on their contents.
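
As a minimal illustration of the triple and quad structures just described, the following Python sketch models them as plain tuples. It does not use any RDF library; the class names, the `http://example.org/` namespace and the sample statements are assumptions made only for illustration.

```python
from typing import NamedTuple, Optional, Union


class Literal(NamedTuple):
    """A literal value, allowed only in the Object position."""
    value: Union[str, int, float, bool]


class Quad(NamedTuple):
    """An RDF triple plus an optional graph URI carrying context or provenance."""
    subject: str                  # URI
    predicate: str                # URI
    obj: Union[str, Literal]      # URI or literal value
    graph: Optional[str] = None   # None means a plain triple


# Hypothetical namespace and data, for illustration only
EX = "http://example.org/"
statements = [
    Quad(EX + "Alice", EX + "knows", EX + "Bob"),
    Quad(EX + "Alice", EX + "age", Literal(42), graph=EX + "registry"),
]


def arcs_from(subject, data):
    """Return the (predicate, object) pairs of the arcs leaving a node."""
    return [(s.predicate, s.obj) for s in data if s.subject == subject]
```

Note that the only way to say something about an arc here is the single `graph` slot of the quad, which is the limitation that motivates richer models.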

A more general structure than Triplestores is provided by graph DBs, based on the Labeled Property Graphs (LPGs) model [15]. In LPGs, both nodes and arcs may have names (called *labels* for nodes and *types* for arcs) and can store *properties* represented as key/value maps. Many arcs, possibly labeled with the same type, may exist between the same pair of nodes. Operationally, nodes and arcs are associated with unique identifiers. The most relevant differences between RDF graphs and LPGs are [16]:


original Subject and Object and to the additional properties) worsens readability; another partial solution is via annotations;


Whilst not directly related to data storage and management, and seemingly irrelevant, readability may be important for exploitation purposes when a portion of the graph is to be graphically displayed for humans—one of the main strengths of graphs. For the reader's reference, Table 1 provides a comparison of the different terms used to denote the same concepts in the DB, KR and LPG communities. In the following we will use them interchangeably, depending on the needs and context.


**Table 1.** Alignment between DB, Ontology and LPG terminology.

The relevance of the graph-based approach to DB technology nowadays is witnessed by many big players in the industry developing their own solutions: just consider Google's 'Knowledge Graph', Facebook's 'Social Graph' and Twitter's 'Interest Graph'. All these solutions are proprietary and specifically intended for use in the products of such companies. As a more general-purpose solution we may mention Microsoft Research's 'Graph Engine' (previously known as 'Trinity') [17], a project started in 2010 and released as open source in 2017; however, no recent news about it is available, nor has any particular success been reported. In the following we will refer to Neo4j [18], the most popular graph DB according to DB-Engines, a platform that ranks DBMSs according to their popularity [19]. It is currently ranked #17, gaining 4 places in the past year [20]. It has been adopted by many big companies and governmental organizations for several different and relevant use cases, including Recommendation, Biology, Artificial Intelligence and Data Analytics, Social Networks, Data Science and Knowledge Graphs [21].

In Neo4j labels usually represent classes, nodes represent class instances, types represent relationships and arcs represent relationship instances. Each node may be associated with many labels, while each arc may have at most one type. Neo4j comes with a powerful query language (Cypher) and extensive libraries for advanced data manipulation (APOC). However, Neo4j (as most graph DBs) is schema-free: the user may apply any label/type or property to each single node or arc. Only simple 'constraints' may be defined to bias the DB content; while ensuring great flexibility, this causes the lack of a clear semantics for the graph contents. This motivated this work, aimed at proposing a schema formalism for graph DBs. In particular we believe the schema must be in the form of an ontology, so as to enable high-level reasoning on the available knowledge and still benefit from the advantages provided by graph DBs and LPGs. Specifically, we may leverage the advantages of DBMSs (scalability, storage optimization, efficient handling, mining and browsing of the data, etc.) and LPGs (flexibility, expressive power) for handling individuals, and exploit the high-level functionalities of ontological reasoners (allowing formal reasoning on, and consistency or correctness checks of, the data) on the ontological part.
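
The LPG organization described above can be sketched in a few lines of Python. This is an in-memory illustration only, not Neo4j's API; the `Node` and `Arc` classes and the sample data are assumptions introduced here.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # unique identifiers shared by nodes and arcs


@dataclass
class Node:
    labels: set                                     # a node may carry many labels
    properties: dict = field(default_factory=dict)  # key/value map
    id: int = field(default_factory=lambda: next(_ids))


@dataclass
class Arc:
    source: Node
    target: Node
    type: str                                       # an arc has exactly one type
    properties: dict = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_ids))


# Hypothetical data: two arcs of the same type between the same pair of nodes
person = Node({"Person"}, {"name": "Ada"})
place = Node({"Place", "City"}, {"name": "Turin"})
visit1 = Arc(person, place, "wasIn", {"reason": "conference"})
visit2 = Arc(person, place, "wasIn", {"reason": "holiday"})

# Schema-free: nothing stops a misspelled label or property from being stored
oops = Node({"Persn"}, {"nmae": "Bob"})
```

The last line shows the problem that motivates a schema: without one, the store accepts the misspelled label and property silently.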

On the methodological side, a few theoretical works analyze the possibilities of cooperation between ontologies and graph DBs, e.g., ref. [22] recognizes the need, but limited adoption, of logic-based KR for the development of KGs and summarizes some attempts to tackle this issue. Ref. [23] uses Neo4j to show how ontological schemas can be applied to Multilayer graphs (graphs whose labeled edges belong to a number of predetermined classes) and their algebraic counterpart, ontological tensors, also elaborating on complexity.

Other approaches are more practical, aimed at mapping ontologies or KGs to graph DBs. Ref. [24] stores the Freebase KG in Neo4j. As opposed to our proposal, it is not interested in developing ontologies as schemas for the graph DB; actually, it focuses on simple 'querying', not on 'reasoning', and the power of the proposed queries is incomparable to what can be obtained using automated reasoning techniques from AI. Most other works specifically focus on the mapping between OWL and LPGs. G2GML [25] maps OWL (RDF graphs) to PGs to overcome the limitations of SPARQL in implementing traversal or analytics algorithms. It proposed an exchangeable serialization format to support different graph DBMSs and their interoperability, but redefined the PG model. OWL2LPG [26] maps OWL 2 ontologies to an LPG representation, and vice versa, identifying specific kinds of queries that in Neo4j should be both easily expressible and more performant than in WebProtégé 4.0. Since the queries concern the ontology axioms and their revisions, it translates the ontology, not the data. In our approach the ontology stays apart from the DB, where only the data are stored and queried. SciGraph [27] aims at representing OWL ontologies and data as Neo4j graphs. It is strictly 'OWL-centric' and implementation dependent: it reads only formats available to the OWLAPI [28] (an API for OWL which is fully compliant with the official OWL specifications by W3C) and ignores the rest. It is clearly stated that creating ontologies based on the graph and supporting reasoning are not goals of this work. Therefore, it is exactly opposite to our work. VirtualFlyBrain [29] aims at translating only "a well defined subset" of OWL 2 EL ontologies into Neo4j and back in such a way that entailments and annotations (not the syntactic structure) are preserved after the round-trip.
Differences from other mappings, such as SciGraph, are quite technical, e.g., having to do with the treatment of blank nodes or with the use of 'safe labels' for typing relations (a safe label is basically the URI with all non-alphanumeric characters being replaced by underscores). The authors point out some 'idiosyncrasies' of the approach, again very technical. Like us they only support datatypes that are supported by both Neo4j and OWL. As opposed to us, they label individuals with their most direct class, while we label them with their top-level class. All these approaches adopted a perspective biased towards ontologies and on their mapping on the graph DB. Since LPGs are more structured than RDF graphs, this direction seems quite obvious, at least syntactically. Since we believe that the DB technology is more mature and widely exploited than the ontology one, we take the opposite perspective and aim at preserving the DB structure and organization, superimposing the ontology on it only so far as it can be easily done.

OWLStar [30] exports Neo4j to OWL, but it specifies the ontological semantics to be converted to OWL (e.g., OWL-DL interpretations) in edge properties, so the driving perspective is again OWL-centric. It uses RDF\* (and its query language SPARQL\*, which extends SPARQL) in an attempt to bring PGs into RDF by adding syntax to attach properties to edges. Ref. [31] proposes a formal mapping between LPGs and RDF that can be leveraged to keep the data in the DB and render them in RDF. However, RDF\* is an extension of RDF and thus not compliant with standard reasoners, which prevents immediate reuse of the many reasoners available in the literature for performing ontological reasoning that involves instances. To overcome this limitation we developed a mapping of LPGs onto standard RDF. This required reconciling the differences between the two models and notably the inability of RDF to express datatype properties on relationships.

Some discussions and practical proposals can be found in the Neo4j community blog. The mainstream approach [32] proposes solutions for interoperability of Neo4j data and automated reasoning on them. The former is obtained by exporting Neo4j instances to RDF, e.g., upon request of an ontological reasoner. One way to do this is exporting Neo4j data in JSON using Cypher and the APOC libraries [33] and then further translating the result into other ontological formats (e.g., using libraries such as [34]). The latter is obtained by importing an RDF ontology into Neo4j, e.g., using the tool provided by the 'official' Neo4j library [35]. The RDF triples specifying the ontology are just transposed into nodes and arcs in the graph, so that the graph DB includes the schema, almost like schemas are stored in relational DBs as tables within the DB itself. On this representation, some (simple) kinds of ontological reasoning (e.g., navigation of the subclass hierarchy) are translated into DB queries using Cypher. This solution has several drawbacks. First, the graph would include two disjoint parts, the ontology and the data, to be handled in totally different ways albeit coexisting in the same graph (in relational DBs they would be stored in different schemas, while in graph DBs there is a single overall graph). Second, no formal discussion is provided about what kinds of reasoning can be mapped onto graph DB queries. We expect them to be quite limited if compared to the power of state-of-the-art ontological reasoners. Furthermore, implementing these reasoning facilities is still the responsibility of the applications accessing the DB. Finally, it does not prevent data that are not compliant with the intended ontology from being inserted into the DB.

Instead, we propose an API, to be exploited by all applications accessing the DB, that wraps the DB and enforces compliance of the data with the intended schemas in both building and consulting the DB. In our vision KB designers must provide pre-specified data schemas, expressed in the form of ontologies for LPGs, that this API will interpret and use to drive all subsequent accesses to the DB. By referring to a schema, the applications will commit to be compliant with it, as in traditional databases. Just like in Triplestores and RDF\* this will ensure a tight integration between the data and the schema. As opposed to Triplestores, RDF\* and most of the cited works, where the ontology is ingested in the graph, the data/instances (stored in the graph DB) are kept apart from the schema/ontology (specified in a file external to the DB, using an ontological representation format). As discussed in Section 4, we leverage this separation between the data repository and the data schema to obtain the additional opportunity of applying different (but compatible) schemas to the same DB. Indeed, each schema may represent a different, partial view on the same data, allowing to limit or expand the possible interactions depending on specific needs and adding flexibility to our solution. Again, this is not even thinkable in Triplestores.
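
The idea of a wrapper that enforces a schema kept outside the DB, and of several schemas acting as views on the same data, can be illustrated with the following Python sketch. It is not the GraphBRAIN API: the `GuardedStore` class, the dictionary-based schema and the list standing in for the graph DB are all assumptions made for illustration.

```python
class SchemaViolation(Exception):
    """Raised when data do not comply with the schema in use."""


class GuardedStore:
    """Hypothetical wrapper: every access goes through an external schema."""

    def __init__(self, backend, schema):
        self.backend = backend  # shared list standing in for the graph DB
        self.schema = schema    # {entity name: set of allowed attribute names}

    def add_node(self, entity, properties):
        allowed = self.schema.get(entity)
        if allowed is None:
            raise SchemaViolation(f"unknown entity {entity!r}")
        extra = set(properties) - allowed
        if extra:
            raise SchemaViolation(f"attributes not in schema: {sorted(extra)}")
        self.backend.append((entity, dict(properties)))

    def nodes(self):
        """Only the instances and attributes visible through this schema."""
        return [
            (entity, {k: v for k, v in props.items() if k in self.schema[entity]})
            for entity, props in self.backend
            if entity in self.schema
        ]


# Two hypothetical schemas applied to the same data store
db = []
tourism = GuardedStore(db, {"Place": {"name", "address"}})
food = GuardedStore(db, {"Place": {"name"}, "Dish": {"name"}})

tourism.add_node("Place", {"name": "Old Town", "address": "Main Square 1"})
food.add_node("Dish", {"name": "Focaccia"})
```

Here `tourism.nodes()` sees only the place, with both attributes, while `food.nodes()` sees both instances but only the name of the place, which is one simple reading of a schema as a partial view on the data.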

Proposing an ontological format brings the need for tools to comfortably build, browse and edit the ontologies expressed in this format. Several tools have been proposed, in the literature and practice, for the current standard ontology representations (notably OWL). Each pursues different objectives as regards the construction, editing, annotation and merging of ontologies [36]. Protégé [37,38], based on the OWLAPI, is the most popular and mature. Different versions, extensions and plugins for Protégé have been proposed (e.g., [39,40]), including an online version. Since sometimes they are not completely compatible with the original tool, we will take the OWLAPI as the standard reference in the rest of this paper. Since the ontological format for LPGs we propose in this paper has different features than those available for the RDF graph model, we also developed a corresponding tool for ontology definition and handling. In particular, it allows the ontology designer to specify attributes also for relationships and to specify labels for nodes and types for arcs, which is not allowed by extant ontological standards and tools. Therefore, our starting point was the need to define a schema for the graph DB, and the tool was developed so as to allow the users to comfortably define a schema to be used for building the KB. Then, in order to enable OWL reasoning capabilities, the translation in standard ontology format was a consequential objective. The various approaches proposed in the literature to assess the quality of tools for the construction of ontologies [41] can provide useful hints for improving and extending our tool with advanced features.

#### **3. GraphBRAIN Graph Database Schema Format**

The *GraphBRAIN Schema* (GBS) format we propose to define graph DB schemas consists of an XML file whose tags allow us to exploit the representational features provided for by the LPG model (we developed a DTD for automated syntax checking of GBS files). In the following, when specifying the GBS file structure, we will adopt the usual notation of square brackets [... ] to denote optional elements, curly brackets {... } to denote repeated elements and pipes in parentheses (... | ...) to denote choices. Furthermore, we write XML tag names in boldface, XML tag attribute names in italics and entity or relationship names in smallcaps. Text in plain typeface reports comments useful to understand the various elements and their behavior.

The main structure of the XML with the tags and their nesting is reported in Table 2, where the universal entity ENTITY and the universal relationship RELATIONSHIP, acting resp. as the roots of the entity and relationship hierarchies, are implicitly assumed (remember that in ontological terminology entities correspond to classes and relationships correspond to object properties). Therefore, entities and relationships are to be specified only starting from the first level of specialization, which we will call *top-level*. Since each node (resp., arc) in the graph must be associated with one top-level entity (resp., relationship), the top-level entities (resp., relationships) are to be considered as disjoint. They may be the roots of specialization hierarchies of sub-entities (resp., sub-relationships). The set of direct specializations of a (sub-)entity or (sub-)relationship are in turn disjoint and are not to be intended as a partition: instances that do not fit any of the specializations of a parent (sub-)entity or (sub-)relationship may be directly associated with the parent. Therefore, also the root and intermediate levels of each hierarchy admit instances in the knowledge base. This design choice prevents multiple inheritance (associating an instance to many classes belonging to different branches in the hierarchy). We partially recover this at the level of instances: when two instances of different (sub-)entities represent the same object, we link them using an ALIASOF relationship. The single reference object represented by all these instances takes the union of their attributes.
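
The following Python sketch illustrates how the single reference object could take the union of the attributes of instances linked by ALIASOF. It assumes that ALIASOF is treated as symmetric and transitive, and that on conflicting values the first one met is kept; neither choice is stated in the text, and the instance identifiers and data are hypothetical.

```python
def reference_object(instances, alias_links, start):
    """Union of the attributes of every instance connected to `start`."""
    neighbours = {}
    for a, b in alias_links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    merged, seen, todo = {}, set(), [start]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        for key, value in instances[current].items():
            merged.setdefault(key, value)  # first value met wins (assumption)
        todo.extend(neighbours.get(current, ()))
    return merged


# Hypothetical instances of different (sub-)entities describing the same object
instances = {
    "n1": {"name": "Acme", "vatNumber": "X1"},
    "n2": {"name": "Acme", "slogan": "S"},
    "n3": {"name": "Other"},
}
alias_links = [("n1", "n2")]
merged = reference_object(instances, alias_links, "n1")
```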

**Table 2.** Main structure of GBS files.


Entities and relationships are specified using the structure shown in Table 3. **Reference** is used only in relationships to specify their possible domain-range pairs, **taxonomy** is optional (used only if the entity or relationship has sub-entities or sub-relationships) and allows us to conveniently represent the specialization-type assertions; all other object properties are to be specified in the **relationships** section. **Attributes** is mandatory for entities (an entity instance must be described by some attribute) and optional for relationships (a relationship may carry information in its very linking two instances). **Specialization** is a recursive tag, allowing to define hierarchies of sub-entities or sub-relationships. In addition to its own attributes each specialization inherits all the attributes of the (sub-)entities (resp., (sub-)relationships) on the hierarchy path from its specific **specialization** section up to the corresponding top-level entity (resp., relationship).
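
The inheritance rule for attributes can be made concrete with a short Python sketch that walks the nested **taxonomy**/**specialization** tags. The XML below is a cut-down fragment in the spirit of the format, and the function names are assumptions introduced for illustration.

```python
import xml.etree.ElementTree as ET

GBS = """
<entity name="Component">
  <attributes>
    <attribute name="name" mandatory="true" datatype="string"/>
  </attributes>
  <taxonomy>
    <specialization name="Chip">
      <taxonomy>
        <specialization name="MicroProcessor">
          <attributes>
            <attribute name="speed" mandatory="false" datatype="string"/>
            <attribute name="bits" mandatory="false" datatype="integer"/>
          </attributes>
        </specialization>
      </taxonomy>
    </specialization>
  </taxonomy>
</entity>
"""


def own_attributes(element):
    """Names of the attributes declared directly on this (sub-)entity."""
    return [a.get("name") for a in element.findall("attributes/attribute")]


def all_attributes(element, target):
    """Attributes of `target`, inherited ones first, or None if not found."""
    attrs = own_attributes(element)
    if element.get("name") == target:
        return attrs
    for child in element.findall("taxonomy/specialization"):
        found = all_attributes(child, target)
        if found is not None:
            return attrs + found
    return None


root = ET.fromstring(GBS)
microprocessor_attributes = all_attributes(root, "MicroProcessor")
```

In this fragment 'Chip' declares no attributes of its own, so it contributes nothing beyond what it inherits from 'Component'.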


**Table 3.** Structure for describing entity and relationship hierarchies in GBS files.

Some tags have XML attributes that specify the details of the item they represent in the schema:

• **domain** tag:

*name* the unique identifier for the domain being described

*author* the author of the schema

*version* the version of the schema

• **entity** tag:

*name* the unique identifier for the entity

• **relationship** tag:

*name* the unique identifier for the relationship

*inverse* the unique identifier for the inverse relationship of *name*

• **reference** tag:

*subject* the identifier of the entity that is the domain of the (sub-)relationship

*object* the identifier of the entity that is the range of the (sub-)relationship

• **specialization** tag:

*name* the unique identifier for the specialization (sub-entity or sub-relationship)

[*inverse*] the unique identifier for the inverse sub-relationship of *name* (not used for sub-entities)

• **attribute** tag:

*name* an identifier for the attribute

*mandatory* = ( **true** | **false** )

whether the attribute must take a value in each instance

*distinguishing* = ( **true** | **false** )

whether the attribute may contribute to distinguishing instances that have the same values for mandatory attributes

*display* = ( **true** | **false** )

whether the attribute represents interesting additional information with respect to mandatory and distinguishing attributes, to be possibly displayed

*datatype* = ( **integer** | **real** | **boolean** | **string** | **text** | **select** | **tree** | **date** | **entity** )

[*length*] the maximum allowed number of characters (used only when datatype = string)

[*target*] an entity name (used only when datatype = entity)

Therefore, the union of mandatory and distinguishing attributes of an entity or relationship can be used to specify a key for uniquely identifying its instances. The union of mandatory, distinguishing and display attributes of an entity or relationship can be used to build and display a summary reporting the most relevant information about the instances.
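
A minimal sketch of how such a key and summary could be derived from the attribute flags is given below. The dictionary representation of attribute declarations, the function names and the 'Person' example are assumptions for illustration, not part of the GBS specification.

```python
def instance_key(instance, attribute_specs):
    """(name, value) pairs of the mandatory and distinguishing attributes."""
    return tuple(
        (spec["name"], instance.get(spec["name"]))
        for spec in attribute_specs
        if spec.get("mandatory") or spec.get("distinguishing")
    )


def instance_summary(instance, attribute_specs):
    """Mandatory, distinguishing and display attributes that have a value."""
    return {
        spec["name"]: instance[spec["name"]]
        for spec in attribute_specs
        if (spec.get("mandatory") or spec.get("distinguishing") or spec.get("display"))
        and spec["name"] in instance
    }


# Hypothetical attribute declarations for an entity 'Person'
PERSON = [
    {"name": "name", "mandatory": True},
    {"name": "birthYear", "distinguishing": True},
    {"name": "nickname", "display": True},
    {"name": "notes"},
]

a = {"name": "Mario Rossi", "birthYear": 1950, "notes": "first entry"}
b = {"name": "Mario Rossi", "birthYear": 1982, "nickname": "Mar"}
```

The two instances share the value of the mandatory attribute, yet their keys differ thanks to the distinguishing attribute; 'notes' never appears in a summary because it carries none of the three flags.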

Regarding datatypes, attributes of type *integer*, *real*, *boolean*, *string*, *text* take an atomic value of the corresponding type, where *text* is intended for free text of any length, differently from *string* which has a limited maximum length that can be specified in the 'length' attribute. Attributes of type *date* take values in one of the following forms:


where year is any integer, month ∈ {01, ... , 12} and day ∈ {01, ... , 31}. Attributes of type *select* denote a choice in an enumeration of values, described using the substructure reported in Table 4; attributes of type *tree* denote a choice in a tree of values, described using the recursive substructure shown in Table 5. Attributes of type *entity* denote 1:1 relationships between an instance of the current entity and an instance of another entity (specified in the 'target' attribute of the tag), e.g., the birthplace of an entity Person would be modeled as an attribute of type *entity* with target='Place':

```
<entity name="Person">
   <attributes>
      <attribute name="birthplace" datatype="entity" target="Place"/>
   </attributes>
</entity>
```
**Table 4.** Structure for describing enumerative attribute values in GBS files.

```
attribute ... datatype="select" tag
    values
         {value}
```
**Table 5.** Structure for describing tree attribute values in GBS files.

```
(**) (attribute ... datatype="tree" | values) tag
     values
         {value} // see (**) (recursive)
```
As a conventional notation we propose identifiers made up of uppercase letters, lowercase letters or decimal digits only. They should start with an uppercase letter for entity names and enumeration or tree values, or with a lowercase letter for domain, relationship and attribute names. Multi-word names are built by juxtaposing their constituent words, using an uppercase letter for the first letter of each word (except for the first one, as prescribed above). When writing documentation, a relationship 'rel' between an entity 'Subj' and an entity 'Obj' can be represented using the dot notation

#### Subj.rel.Obj

which is not ambiguous since dots are not allowed in our entity and relationship names.
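
The naming convention and the dot notation lend themselves to a simple check, sketched below in Python. The sketch assumes that 'letters' means ASCII letters, which the text does not state, and the function names are introduced here for illustration.

```python
import re

# Identifiers: letters and digits only, with a prescribed case for the initial
UPPER_INITIAL = re.compile(r"[A-Z][A-Za-z0-9]*")  # entities, enumeration/tree values
LOWER_INITIAL = re.compile(r"[a-z][A-Za-z0-9]*")  # domains, relationships, attributes


def is_entity_name(text):
    return UPPER_INITIAL.fullmatch(text) is not None


def is_relationship_name(text):
    return LOWER_INITIAL.fullmatch(text) is not None


def parse_dot_notation(text):
    """Split 'Subj.rel.Obj' into its parts, checking the naming convention."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected Subj.rel.Obj, got {text!r}")
    subj, rel, obj = parts
    if not (is_entity_name(subj) and is_relationship_name(rel) and is_entity_name(obj)):
        raise ValueError(f"names in {text!r} do not follow the convention")
    return subj, rel, obj
```

For instance, `parse_dot_notation("Person.wasIn.Place")` yields the three names, while a string such as `"person.wasIn.Place"` is rejected because the subject does not start with an uppercase letter.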

Tables 6 and 7 show a fragment of a GBS file concerning the domain of computing. We see entity 'Component', representing an electronic component and including a taxonomy of sub-classes, some of which have specific attributes of various types, e.g., sub-class 'Memory' has attributes 'capacity' and 'speed' in addition to those inherited from 'Component' ('name', 'description', 'originalPrice' and 'announcementDate'). In the relationships section we see that relationship 'wasIn' may be established between a 'Component' and an 'Event' (to signify that the component was on show at the event), or between a 'Person' and a 'Place' (meaning that the person was in that place), etc.

**Table 6.** Sample fragment of ontology in GBS format (part 1).

```
<!-- <!DOCTYPE domain SYSTEM "graphbrain.dtd"> -->
<domain name="retrocomputing" author="stefano" version="1">
   <entities>
      <entity name="Component">
         <attributes>
            <attribute name="name" mandatory="true" datatype="string"/>
            <attribute name="description" mandatory="false" datatype="text"/>
            <attribute name="originalPrice" mandatory="false" datatype="real"/>
            <attribute name="announcementDate" mandatory="false" datatype="date"/>
         </attributes>
         <taxonomy>
            <specialization name="Chip">
               <taxonomy>
                  <specialization name="Logic">
                      <taxonomy>
                         <specialization name="FlipFlop">
                            <attributes>
                               <attribute name="type"
                                     mandatory="false" datatype="select">
                                  <values>
                                     <value name="D"/>
                                     <value name="FK"/>
                                     <value name="JK"/>
                                     <value name="T"/>
                                  </values>
                               </attribute>
                            </attributes>
                         </specialization>
                         <specialization name="Memory">
                            <attributes>
                               <attribute name="capacity"
                                     mandatory="false" datatype="string"/>
                               <attribute name="speed"
                                     mandatory="false" datatype="string"/>
                            </attributes>
                            <taxonomy>
                               <specialization name="EPROM"/>
                               <specialization name="PROM"/>
                               <specialization name="RAM"/>
                               <specialization name="ROM">
                                  <attributes>
                                     <attribute name="content"
                                           mandatory="false" datatype="string"/>
                                  </attributes>
                               </specialization>
                            </taxonomy>
                         </specialization>
                      </taxonomy>
                  </specialization>
                  <specialization name="MicroProcessor">
                      <attributes>
                         <attribute name="speed" mandatory="false" datatype="string"/>
                         <attribute name="bits" mandatory="false" datatype="integer"/>
                      </attributes>
                  </specialization>
                  <specialization name="PLA"/>
                  <specialization name="RRIOT"/>
               </taxonomy>
            </specialization>
            [...]
         </taxonomy>
      </entity>
      [...]
   </entities>
```
**Table 7.** Sample fragment of ontology in GBS format (part 2).

```
<relationships>
      <relationship name="wasIn" inverse="hosted">
         <references>
            <reference subject="Company" object="Event"/>
            <reference subject="Company" object="Place"/>
            <reference subject="Component" object="Event"/>
            <reference subject="Event" object="Place"/>
            <reference subject="Person" object="Company"/>
            <reference subject="Person" object="Event"/>
            <reference subject="Person" object="Place"/>
            [...]
         </references>
         <attributes>
            <attribute name="reason" mandatory="false" datatype="string"/>
            <attribute name="position" mandatory="false" datatype="string"/>
         </attributes>
      </relationship>
      [...]
   </relationships>
</domain>
```
Each GBS schema is intended to describe one domain. However, wider domains sometimes involve ontological elements that are already described in more 'basic' schemas (e.g., the schemas for Cultural Heritage, Food and Transportation might be exploited in an ontology aimed at supporting a touristic application), and it might be useful to reuse such schemas, both for standardization of the definitions and for building on existing knowledge. Actually, the combination of many schemas is a more powerful representation than the simple juxtaposition of their elements. Indeed, their shared entities act as bridges that, through the relationships available in those domains, connect proprietary entities of each domain that would not otherwise have a chance to be related to each other. In the GBS framework, classes and relationships in different ontologies are considered the same (and thus are shared) if they have the same name. They may, however, have different attributes, reflecting the different perspectives associated with the different domains. If an attribute is present in different domains, it must have the same type in all of them. Moreover, additional cross-schema relationships (and entities) may be defined in the overall ontology, building on the existing ones. GBS schemas support this by providing an optional section in which existing schemas can be imported. The structure of this section (delimited by tag **imports** and placed at the beginning of the schema, before the entities and relationships) is shown in Table 8. The tag attributes are:

	- *elementname*: the name of the element to be deleted

**Table 8.** Structure for describing imported schemas in GBS files.


Schemas are imported in the same order as specified by the sequence of **import** tags. Definitions of top-level elements (entities or relationships) in an imported schema having the same name as elements defined in previously imported schemas override the previous definitions. Finally, elements defined in the **entities** or **relationships** sections of the importing schema override elements with the same name in all imported schemas. Since it may happen that some elements of the imported schemas are not needed in the current domain, **delete** tags allow removing them from the overall ontology.
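The override rules just described can be illustrated with a minimal Python sketch. The function name `resolve_schema`, the dictionary-based representation of schemas and the sample element names are all hypothetical, introduced here only for illustration. The sketch also assumes that deletions are applied to the imported elements before the importing schema's own definitions are added, which the text does not state explicitly.

```python
def resolve_schema(imported, own, deletes):
    """Combine element definitions following the override rules in the text.

    imported: list of dicts, one per imported schema, in import order;
              each maps an element name to its definition
    own:      dict of elements defined in the importing schema itself
    deletes:  names of imported elements that are not needed
    """
    merged = {}
    for schema in imported:
        merged.update(schema)        # later imports override earlier ones
    for name in deletes:
        merged.pop(name, None)       # assumption: deletes act on imported elements
    merged.update(own)               # the importing schema overrides all imports
    return merged


# Hypothetical schemas: definitions are plain strings to keep the example short.
food = {"Dish": "food:Dish", "Place": "food:Place"}
heritage = {"Place": "heritage:Place", "Artwork": "heritage:Artwork"}
tourism = {"Place": "tourism:Place", "Tour": "tourism:Tour"}

ontology = resolve_schema([food, heritage], tourism, deletes=["Dish"])
print(ontology)
```

In this example 'Place' ends up with the definition given by the importing schema, 'Artwork' is inherited from the second import, and 'Dish' is removed.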

In addition to the API for GBS-based handling of Neo4j, we developed tools for GBS schema/ontology editing and for data management. They were implemented as Web Applications based on the JavaServer Faces technology and the PrimeFaces library. JavaScript was used for handling interactive browsing of the graph. A connection to Prolog allows carrying out rule-based reasoning on selected portions of the data. Neo4j was used to store the knowledge graph, while Postgres was used to store user and usage data (roles, access rights, change log, etc.). A demo of the tools can be found at http://193.204.187.73:8088/GraphBRAIN/ in the form of a general-purpose system for the collaborative development, management and (personalized) use of a KB, in the same spirit as Freebase [42]. After logging into the system, the user may choose a domain and all subsequent interaction is driven by the corresponding GBS schema. Screenshots of the current online prototypes are shown in Figures 1–3.


**Figure 1.** Online editor for GBS schemas/ontologies.

Figure 1 shows the interface for building, editing and browsing GBS schemas/ontologies. In the left-hand-side section the entity hierarchy, with entity attributes and attribute types and values, can be handled. In the center section the same can be done for relationships, also including inverse relationships and references. In the right-hand-side section imports can be handled and existing schemas can be loaded. At the bottom several save and export buttons are available. Figure 2 shows the interactive interface to feed and consult information in the knowledge base by direct interaction. It consists of two form-based tabs, one for entities (Figure 2a) and one for relationships (Figure 2b), allowing the user to insert, update, remove or query instances. The forms are automatically generated by the system from the GBS specification of a schema and interact with the graph DB using our API to enforce consistency with the selected schema. Let us first describe the entity tab. In the left-hand-side section (sub-)entities and corresponding instances can be selected. In the center section a form with the attributes of the selected (sub-)entity is shown, possibly filled with the values from the selected instance. Regarding the relationships tab, the center section allows choosing a relationship, for which subject and object (sub-)entities and corresponding instances can be selected in the left- and right-hand-side sections, respectively. When a triple (subject, relationship, object) is selected, the center section also shows a form with the attributes of the selected (sub-)relationship. If subject and object instances are also selected, a drop-down menu allows selecting a specific relationship instance, in which case the attribute form is filled with the corresponding values. More functions are available (e.g., handling of attachments to the selected instances, or search and collaborative evaluation facilities) but their description is beyond the scope of this paper.


**Figure 2.** Online interfaces for managing and consulting GBS knowledge bases: (**a**) entities, (**b**) relationships.

Figure 3 shows the tab in which users can display and manually browse the graph. Since the whole KB would be too large to be readable, only a portion thereof is shown in this tab. The portion is dynamically generated so as to focus on the part of the graph of interest to the user based on their profile, optionally starting from selected nodes specified by them. In the figure, the graph was generated for user 'stefano' starting from nodes representing Chuck Peddle (a pioneer in microprocessor design) and the 6502 (one of the earliest and most successful microprocessors on the market), identified by a thicker node border. Different colors of nodes denote different classes (e.g., light blue for Person, yellow for Component, etc.). At a glance, it is possible to see clusters of nodes that represent possibly relevant aggregates of information to be investigated or explored. Note that the nodes and arcs in this view may belong to different schemas, not only to the schema selected for the form-based interaction. Therefore, here the user may discover connections that are beyond the starting domain. The user may pan and zoom on the graph, drag nodes, dynamically follow links, read attributes of nodes and/or arcs, further expand the graph around nodes of interest and run analytics and mining algorithms from menus on the right-hand-side and contextual menus that appear by clicking on the graph. The information on a node or arc in this view is the complete set of properties for that node or arc, gathered from all domains in which it is involved.

**Figure 3.** Online interface for browsing GBS knowledge bases.

#### **4. Mapping onto DB and Ontology**

Since graph DBs are naturally suited to express knowledge graphs, i.e., knowledge bases underlying given ontologies, a fundamental requirement of our approach is that our schemas can be mapped both onto the DB and onto an OWL representation that can then be processed by a reasoner. In this section, we report in detail how these two mappings work in practice.

#### *4.1. Use as a Graph DB Schema*

As said, part of the main motivation for defining GBS schemas is to endow LPG-based graph DBs with a schema that ensures a clear semantics to the information pieces they contain and provides directions for their management and interpretation. According to this perspective the DB users will be required to work according to pre-specified data schemas expressed in the form of ontologies. Operationally, the DB will be wrapped into a layer, e.g., in the form of an API (see the previous section), that takes as input a GBS schema specifying the desired domain ontology and controls all interactions, allowing the external applications to manipulate and consult only information items that are compliant with the ontology.

In our approach we also provide an additional opportunity. Specifically, we allow a single graph DB to underlie several domains (schemas), provided that their elements (entities and relationships) are compatible. By *compatible* we mean that for elements having the same name in the different schemas, attributes having the same name must have the same datatype, too. The other attributes, or non-shared elements, can be freely defined. Therefore, using any of such schemas on the DB would provide a partial view of its contents, perhaps representing a different perspective or aimed at limiting access to the DB contents for some users or applications.
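As a minimal sketch of the compatibility condition stated above, the following Python function compares two schemas and reports the attributes that violate it. The function name `incompatibilities`, the nested-dictionary representation (element name, then attribute name, then datatype) and the sample schemas are assumptions made only for this illustration; they are not part of the GraphBRAIN API.

```python
def incompatibilities(schema_a, schema_b):
    """Return (element, attribute, type_a, type_b) tuples violating compatibility.

    Each schema maps an element name to a dict {attribute name: datatype}.
    Only elements and attributes whose names occur in both schemas are checked;
    everything else can be freely defined.
    """
    conflicts = []
    for element in schema_a.keys() & schema_b.keys():
        attrs_a, attrs_b = schema_a[element], schema_b[element]
        for attr in attrs_a.keys() & attrs_b.keys():
            if attrs_a[attr] != attrs_b[attr]:
                conflicts.append((element, attr, attrs_a[attr], attrs_b[attr]))
    return sorted(conflicts)


# Hypothetical schemas sharing the entity 'Person'.
computing = {
    "Person": {"name": "string", "birthYear": "integer"},
    "Device": {"model": "string"},
}
economy = {
    "Person": {"name": "string", "birthYear": "string", "netWorth": "real"},
}

conflicts = incompatibilities(computing, economy)
print(conflicts)  # [('Person', 'birthYear', 'integer', 'string')]
```

Two schemas could share a DB, in the sense described above, only when this list is empty.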

Let us now show how the GBS elements are implemented using LPG features. For easy reference, Table 9 summarizes the mapping.


**Table 9.** Correspondence between GBS elements and LPG features.

#### 4.1.1. Entities and Relationships

Leveraging the possibility of using many labels for nodes, each node is labeled with the top-level entity it belongs to and with all the domains for which it is relevant (e.g., 'Herbert Simon' would be labeled with 'Person' for the entity and with 'economy' and 'computing' for the domains). When the same DB underlies several domains, this allows selecting only the instances actually involved in a domain of interest. On the other hand, since each arc may take at most one type, we use it for specifying the relationship it expresses. The domains for which a relationship instance is relevant may be inferred from the domain labels of the nodes it connects by considering all the domain labels that are present in both its subject and its object.
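The inference of the domains of a relationship instance amounts to a set intersection, as the following Python sketch shows. The function name `relationship_domains` and the label sets are hypothetical. The sketch assumes that the set of domain names is known, so that entity labels shared by subject and object (e.g., 'Person' in a Person-Person arc) are not mistaken for domains.

```python
def relationship_domains(subject_labels, object_labels, domain_names):
    """Domains of an arc: domain labels present on both of its end nodes."""
    shared = set(subject_labels) & set(object_labels)
    return shared & set(domain_names)   # drop shared entity labels


DOMAINS = {"computing", "economy"}

simon = {"Person", "economy", "computing"}   # labels of the 'Herbert Simon' node
award = {"Award", "computing"}               # hypothetical object node
colleague = {"Person", "economy"}            # hypothetical object node

print(relationship_domains(simon, award, DOMAINS))      # {'computing'}
print(relationship_domains(simon, colleague, DOMAINS))  # {'economy'}
```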

#### 4.1.2. Attributes

Concerning attributes, we propose to reserve an attribute name ('*specialization*') to store the specific sub-entity (resp., sub-relationship) to which the entity (resp., relationship) instance belongs. Given the top class (resp., relationship) specified in the labels (resp., types) and the specific sub-entity specified in the 'specialization' property, the path of specializations between these two may be easily recovered bottom-up, starting from the latter and climbing the specialization hierarchy in the ontology up to the former (since nodes admit many labels, one might specify all the sub-entities in such a specialization path as labels; for the sake of uniformity with arcs, where this is not possible, we propose the above solution). We also propose to implicitly assume another reserved attribute, '*notes*', for both nodes and arcs, which allows adding information not covered by the other, domain-specific attributes.
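The bottom-up recovery of the specialization path can be sketched in Python as follows. The function name `specialization_path` and the `parent_of` map (each sub-entity mapped to its immediate super-entity, as it could be derived from the taxonomy in a schema) are assumptions of this illustration.

```python
def specialization_path(top_entity, specialization, parent_of):
    """Climb from the stored 'specialization' value up to the top-level entity.

    parent_of maps each sub-entity to its immediate super-entity.
    Returns the path from the top-level entity down to the specialization.
    """
    path = [specialization]
    while path[-1] != top_entity:
        # A KeyError here means the specialization is not under top_entity.
        path.append(parent_of[path[-1]])
    return list(reversed(path))


# Parent map for a small taxonomy: Document > Printable > {Book, Letter}.
parent_of = {"Printable": "Document", "Book": "Printable", "Letter": "Printable"}

# A node labeled 'Document' whose 'specialization' property is 'Book':
print(specialization_path("Document", "Book", parent_of))
# ['Document', 'Printable', 'Book']
```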

#### 4.1.3. Attribute Types and Values

Attribute values of types *integer*, *real*, *boolean*, *string* and *text* are stored as literal values for the corresponding DB types, e.g., Neo4j provides the following types matching GBS types: Integer and Float (both subtypes of an abstract type Number), Boolean, and String.

For types *select* and *tree* the string corresponding to the selected value in the list or tree is stored.

An attribute of type *entity* actually corresponds to a relationship between the current instance and an instance of the target entity and thus it is stored in the DB as an arc, connecting the nodes corresponding to these two instances and having the attribute name as type. Note that in our proposed naming policy attribute names start with a lowercase letter, just like relationship names.

Finally, although Neo4j provides temporal types, including 'Date', following [18] we propose to model attributes of type *date* as relationships as well. We assume the ontology implicitly defines four entities, as shown in Table 10:

**DAY** representing a specific day of a specific year, with integer attributes *day*, *month*, *year*;

**MONTH** representing a specific month of a specific year, with integer attributes *month*, *year*;

**YEAR** representing a year, with a single integer attribute *year*;

**TIMELINE** representing the overall timeline.

This allows specifying dates at different granularities, unlike the Date type available in Neo4j. Neo4j provides functions for Date truncation to Month or Year, but such truncations actually correspond to the first day of the month or year, and thus there is no way to distinguish whether a date like 2020/01/01 actually refers to the specific day or is a truncation for the month (2020/01) or year (2020). A single TIMELINE node is automatically added to the DB. DAY, MONTH or YEAR nodes are automatically added to the DB for each year/month/day, year/month or year value, respectively, in date attributes of instances. The DB will also automatically link, using arcs of type BELONGSTO, each DAY node with the corresponding MONTH node, each MONTH node with the corresponding YEAR node and finally all YEAR nodes with the TIMELINE node. This will allow collecting all instances referring to the same date at different levels of granularity. Furthermore, arcs of type FOLLOWS may be added and maintained between adjacent days, months or years in the DB. This will make it easy to extract time intervals and associated information from the DB.
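To make the granularity argument concrete, the following Python sketch computes which node a date attribute would point to and which BELONGSTO arcs are needed. The function name `date_chain` and the tuple encoding of nodes are hypothetical choices made for this illustration only; no DB access is performed.

```python
def date_chain(year, month=None, day=None):
    """Return (target_node, belongs_to_arcs) for a date of any granularity.

    Nodes are encoded as tuples: ("Timeline",), ("Year", y),
    ("Month", y, m) and ("Day", y, m, d).
    """
    if day is not None and month is None:
        raise ValueError("a day can only be specified together with a month")
    chain = [("Timeline",), ("Year", year)]
    if month is not None:
        chain.append(("Month", year, month))
    if day is not None:
        chain.append(("Day", year, month, day))
    # Each node is linked to the next coarser one, from finest to coarsest.
    arcs = [(chain[i], "belongsTo", chain[i - 1])
            for i in range(len(chain) - 1, 0, -1)]
    return chain[-1], arcs


day_node, day_arcs = date_chain(2020, 1, 1)
month_node, month_arcs = date_chain(2020, 1)

print(day_node)    # ('Day', 2020, 1, 1)
print(month_node)  # ('Month', 2020, 1)
```

The day 2020/01/01 and the month 2020/01 are attached to different nodes, so they remain distinguishable, while both are reachable from the same YEAR node through BELONGSTO arcs.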

**Table 10.** Implicit entities and relationships for time handling.

```
<entities>
   <entity name="Timeline"/>
   <entity name="Year">
      <attributes>
         <attribute name="year" mandatory="true" datatype="integer"/>
      </attributes>
   </entity>
   <entity name="Month">
      <attributes>
         <attribute name="month" mandatory="true" datatype="integer"/>
         <attribute name="year" mandatory="true" datatype="integer"/>
      </attributes>
   </entity>
   <entity name="Day">
      <attributes>
         <attribute name="day" mandatory="true" datatype="integer"/>
         <attribute name="month" mandatory="true" datatype="integer"/>
         <attribute name="year" mandatory="true" datatype="integer"/>
      </attributes>
   </entity>
</entities>
<relationships>
   <relationship name="belongsTo" inverse="includes">
      <references>
         <reference subject="Day" object="Month"/>
         <reference subject="Month" object="Year"/>
         <reference subject="Year" object="Timeline"/>
      </references>
   </relationship>
   <relationship name="follows" inverse="precedes">
      <references>
         <reference subject="Day" object="Day"/>
         <reference subject="Month" object="Month"/>
         <reference subject="Year" object="Year"/>
      </references>
   </relationship>
</relationships>
```
#### *4.2. Mapping to OWL Format*

The other part of our motivation for this work was using the ontology level not only as a DB schema, but also to carry out formal reasoning and consistency or correctness checks on the individuals. As noted in Section 2, a widespread standard for representing ontologies is OWL, based on a different model than LPGs, on which GraphBRAIN ontologies are based. While of course new reasoners may be purposely developed for GBS ontologies, it would be desirable to translate GBS ontologies into OWL, so as to allow immediate reuse of the many existing tools for OWL ontologies. This section provides a strategy for this translation, aimed at overcoming and reconciling the differences in concepts, perspectives and expressive power between the two ontological models. For compliance with existing tools and reasoners, our implementation of GraphBRAIN adopted the same OWL-API as Protégé for its ontology export functionality, so that the generated ontologies are fully compliant with the standard and may be edited using Protégé. So, in the following, we will use the OWL-RDF syntax accepted by Protégé.

When serializing GBS ontologies to OWL format we propose to use prefix **gbs** in the namespaces, so that they can be easily recognized.

Note that here we just provide the translation for the basic GBS format, expressing the DB schema. Additional tags/features can be added to this basic format to express information intended for use by the ontological level (e.g., transitivity of relationships, etc.), but this is a wide path of investigation and will be developed in future work.

As a reference for the subsequent discussion, we provide in Figures 4–6 some screenshots of a sample GBS ontology (concerning the domain of 'computing') exported in OWL using our API and opened with Protégé.

#### 4.2.1. Entities

Entities in GBSs correspond to Classes in OWL. Each (sub-)entity is declared in OWL using the **owl:Class** statement. Specializations are associated with their immediate superclass using the **rdfs:subClassOf** statement. The implicit universal entity ENTITY, generalizing all (sub-)entities defined in the schema, corresponds to the 'Thing' class in OWL. Since classes are to be considered as disjoint (see Section 3), the axioms for classes in the top level and the specializations of each (sub-)class also include (many) **owl:disjointWith** statements to all of their sibling (sub-)classes, e.g., the following fragment of taxonomy for entity DOCUMENT:

```
<entity name="Document">
   <taxonomy>
      <specialization name="Printable">
         <taxonomy>
             <specialization name="Book"/>
             <specialization name="Letter"/>
         </taxonomy>
      </specialization>
   </taxonomy>
</entity>
```
translates into the following OWL fragment:

```
<owl:Class rdf:about="http://owl.api.ontology#Document">
   <owl:disjointWith rdf:resource="http://owl.api.ontology#Component"/>
   <owl:disjointWith rdf:resource="http://owl.api.ontology#Device"/>
   <owl:disjointWith rdf:resource="http://owl.api.ontology#Person"/>
   <owl:disjointWith rdf:resource="http://owl.api.ontology#Place"/>
</owl:Class>
<owl:Class rdf:about="http://owl.api.ontology#Printable">
   <rdfs:subClassOf rdf:resource="http://owl.api.ontology#Document"/>
</owl:Class>
<owl:Class rdf:about="http://owl.api.ontology#Book">
   <rdfs:subClassOf rdf:resource="http://owl.api.ontology#Printable"/>
   <owl:disjointWith rdf:resource="http://owl.api.ontology#Letter"/>
</owl:Class>
<owl:Class rdf:about="http://owl.api.ontology#Letter">
   <rdfs:subClassOf rdf:resource="http://owl.api.ontology#Printable"/>
   <owl:disjointWith rdf:resource="http://owl.api.ontology#Book"/>
</owl:Class>
```
In the OWL translation, each entity instance is associated with the sub-class specified by its 'specialization' attribute of the top-level class specified in its labels.
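The class axioms of the translation above (sub-class links plus pairwise disjointness among siblings) can be derived mechanically from a taxonomy. The following Python sketch does so on a nested-dictionary representation; the function name `class_axioms` and the triple-based output are assumptions made for illustration and do not reproduce the OWL-API calls used in our implementation.

```python
def class_axioms(taxonomy, parent=None):
    """Derive (class, axiom, other_class) triples from a nested taxonomy.

    taxonomy maps each class name to the dict of its direct specializations.
    Siblings at every level are declared pairwise disjoint.
    """
    axioms = []
    siblings = sorted(taxonomy)
    for name in siblings:
        if parent is not None:
            axioms.append((name, "subClassOf", parent))
        for other in siblings:
            if other != name:
                axioms.append((name, "disjointWith", other))
        axioms.extend(class_axioms(taxonomy[name], parent=name))
    return axioms


taxonomy = {"Document": {"Printable": {"Book": {}, "Letter": {}}}}

for axiom in class_axioms(taxonomy):
    print(axiom)
```

With more top-level entities in the dictionary (e.g., 'Component', 'Device', 'Person', 'Place'), the same function would also yield the disjointness axioms shown above for 'Document'.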

In Figure 4, in the left-hand-side area of the window we see the class hierarchy, in which class 'Computer' (a sub-class of 'Device') has been selected and corresponding details are shown in the right-hand-side area. We may notice that Computer has in turn several sub-classes.

**Figure 4.** OWL translation of a sample GBS ontology loaded in Protégé: classes.

#### 4.2.2. Relationships

Relationships in GBSs correspond to Object Properties in OWL. Each (sub-)relationship is declared in OWL using the **owl:ObjectProperty** construct. Specializations are associated with their immediate super-relationship using the **rdfs:subPropertyOf** construct. The implicit universal relationship RELATIONSHIP, generalizing all (sub-)relationships defined in the schema, corresponds to the 'topObjectProperty' object property in OWL. Subject and Object entities acting as references of a relationship in GBSs correspond to Domain and Range of the Object Property in OWL, expressed by constructs **rdfs:domain** and **rdfs:range**, respectively. The name for the inverse of a relationship in GBS is translated into OWL using the **owl:inverseOf** construct.

GBSs may use the same relationship name applied to possibly many Subject–Object pairs as references. This cannot be expressed directly in OWL. Adding all the Subject (resp., Object) entities as domain (resp., range) classes to the corresponding OWL object property would be interpreted in OWL as the intersection of the Subject (resp., Object) classes as the domain (resp., range) of the OWL object property.

When the subject (resp., object) of all references in a relationship is the same, the logical disjunction (OR) operator of the classes in the object (resp., subject) would solve the problem, e.g., the following relationship:

```
<relationship name="produced" inverse="producedBy">
   <references>
      <reference subject="Company" object="Device"/>
      <reference subject="Company" object="Software"/>
   </references>
</relationship>
```
meaning that companies may produce devices or software (but a specific company might produce both, or either, or none of them), might be represented as a single object property

```
Company.produced.(Device OR Software)
```
and the following relationship:

```
<relationship name="belongsTo" inverse="includes">
   <references>
      <reference subject="Device" object="Collection"/>
      <reference subject="Document" object="Collection"/>
   </references>
</relationship>
```
meaning that devices or documents may belong to collections, might be represented as a single object property

```
(Device OR Document).belongsTo.Collection
```
However, in general, when the subjects and objects both involve many classes, adding the logical disjunction (OR) of the Subject entities as the domain and of the Object entities as the range would be a wrong translation, because it would not prevent OWL from accepting instances from incompatible Subject–Object pairs, e.g., if relationship WASIN can be applied to reference pairs COMPANY-EVENT and PERSON-PLACE:

```
<relationship name="wasIn" inverse="hosted">
   <references>
      <reference subject="Company" object="Event"/>
      <reference subject="Person" object="Place"/>
   </references>
</relationship>
```
using '(Company OR Person)' as the domain and '(Event OR Place)' as the range:

(Company OR Person).wasIn.(Event OR Place)

would admit relating an instance of Company to an instance of Place, which was not intended by the GBS ontology. We reconcile this by introducing in OWL one object property for each GBS relationship, using the same name and the disjunction (OR) of the Subject entities as the domain and the disjunction (OR) of the Object entities as the range. Then, for each Subject–Object reference pair for a relationship 'rel' in GBS, in OWL we define a new relationship 'rel\_Subject\_Object' with domain Subject and range Object, as a subObjectProperty (OWL feature **rdfs:subPropertyOf**) of 'rel' (not ambiguous since underscores are not allowed in GBS entity and relationship names).

The OWL translation of the previous example would be:

```
<owl:ObjectProperty rdf:about="http://owl.api.ontology#hosted"/>
<owl:ObjectProperty rdf:about="http://owl.api.ontology#wasIn">
   <owl:inverseOf rdf:resource="http://owl.api.ontology#hosted"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:about="http://owl.api.ontology#wasIn_Company_Event">
   <rdfs:subPropertyOf rdf:resource="http://owl.api.ontology#wasIn"/>
   <rdfs:domain rdf:resource="http://owl.api.ontology#Company"/>
   <rdfs:range rdf:resource="http://owl.api.ontology#Event"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:about="http://owl.api.ontology#wasIn_Person_Place">
   <rdfs:subPropertyOf rdf:resource="http://owl.api.ontology#wasIn"/>
   <rdfs:domain rdf:resource="http://owl.api.ontology#Person"/>
   <rdfs:range rdf:resource="http://owl.api.ontology#Place"/>
</owl:ObjectProperty>
```
In principle, we should add some constraint telling OWL that 'rel' is an 'abstract' relationship, i.e., it does not admit direct instances (any instance must belong to a subObjectProperty of 'rel'), but unfortunately this cannot be expressed in OWL [43]. However, since the OWL functionality will be applied only to the instances in the DB, which are controlled by the GBS ontology, in practice this constraint will be implicitly enforced for explicit instances; only reasoning might infer individuals belonging directly to 'rel'. Another option would be defining only the subObjectProperties, but semantically we would lose the information that they express the same concept specialized for different references, and operationally we would lose the opportunity of defining in 'rel' a core set of properties that apply to all of its sub-relationships. On the other hand, defining attributes (Datatype Properties) on Object Properties is forbidden by OWL and must be handled appropriately in the translation, as we will see in the next sections.
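The reference-splitting strategy described in this section can be sketched in Python using only the standard `xml.etree.ElementTree` module. The function name `object_properties` and the dictionary-based description of each property are hypothetical; the sketch only computes names, domains and ranges, and does not emit OWL.

```python
import xml.etree.ElementTree as ET

GBS_RELATIONSHIP = """
<relationship name="wasIn" inverse="hosted">
   <references>
      <reference subject="Company" object="Event"/>
      <reference subject="Person" object="Place"/>
   </references>
</relationship>
"""


def object_properties(relationship_xml):
    """List the OWL object properties needed for one GBS relationship."""
    rel = ET.fromstring(relationship_xml)
    name = rel.get("name")
    refs = [(r.get("subject"), r.get("object")) for r in rel.iter("reference")]
    # General property: domain and range are the disjunctions (OR) of the
    # Subject and Object entities, respectively.
    props = [{"name": name,
              "domain": sorted({s for s, _ in refs}),
              "range": sorted({o for _, o in refs}),
              "subPropertyOf": None}]
    # One sub-property per Subject-Object reference pair.
    for subject, obj in refs:
        props.append({"name": f"{name}_{subject}_{obj}",
                      "domain": [subject],
                      "range": [obj],
                      "subPropertyOf": name})
    return props


props = object_properties(GBS_RELATIONSHIP)
print([p["name"] for p in props])
# ['wasIn', 'wasIn_Company_Event', 'wasIn_Person_Place']
```

Since no sub-property has domain 'Company' and range 'Place', that pair is not licensed by any of the reference-specific properties.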

When the *name* of a relationship and its *inverse* in GBS are the same, instead of adding the inverse object property, the object property is labeled as symmetric, using the **owl:SymmetricProperty** construct, e.g., ALIASOF:

```
<relationship name="aliasOf" inverse="aliasOf">
   <references>
      <reference subject="Company" object="Company"/>
      <reference subject="Person" object="Person"/>
   </references>
</relationship>
```
is translated as:

```
<owl:ObjectProperty rdf:about="http://owl.api.ontology#aliasOf">
   <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#SymmetricProperty"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:about="http://owl.api.ontology#aliasOf_Company_Company">
   <rdfs:subPropertyOf rdf:resource="http://owl.api.ontology#aliasOf"/>
   <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#SymmetricProperty"/>
   <rdfs:domain rdf:resource="http://owl.api.ontology#Company"/>
   <rdfs:range rdf:resource="http://owl.api.ontology#Company"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:about="http://owl.api.ontology#aliasOf_Person_Person">
   <rdfs:subPropertyOf rdf:resource="http://owl.api.ontology#aliasOf"/>
   <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#SymmetricProperty"/>
   <rdfs:domain rdf:resource="http://owl.api.ontology#Person"/>
   <rdfs:range rdf:resource="http://owl.api.ontology#Person"/>
</owl:ObjectProperty>
```
In Figure 5, the left-hand-side area reports the hierarchy of object properties corresponding to GBS relationships, all descending from the universal property 'topObjectProperty'. Object properties 'aliasOf' and 'belongsTo' have been expanded, showing the sub-properties generated by the corresponding references. 'belongsTo\_Award\_Collection' is selected, and its details are reported in the right-hand-side area. Specifically, we see that its domain is class 'Award', its range is class 'Collection', and it is a sub-property of 'belongsTo'.

**Figure 5.** OWL translation of a sample GBS ontology loaded in Protégé: object properties.

#### 4.2.3. Data Types

Attributes of data types *integer*, *real*, *boolean*, *string* and *text* are translated into OWL using the corresponding datatypes **xsd:integer**, **xsd:decimal**, **xsd:boolean**, **xsd:string** (for both string and text). Note that OWL provides several versions of some datatypes.

For types *select* and *tree*, we define in OWL an Enumerated datatype specifying the values in the list or tree. We do not need to store the tree structure, so we can flatten the tree values into a list, because (a) in GBS the tree is just a conceptual aid to the users, in order to build the interfaces to the DB; and (b) there are no duplicate values in the tree, e.g., the values for attribute 'gender' of entity 'Person' in this GBS fragment:

```
<entity name="Person">
   <attributes>
      <attribute datatype="select" mandatory="false" name="gender">
         <values>
            <value name="M"/>
            <value name="F"/>
         </values>
      </attribute>
   </attributes>
</entity>
```
would be specified as the range of the datatype property 'gender\_Person', made up of the list of string values {M, F}:

```
<owl:DatatypeProperty rdf:ID="gender_Person">
  <rdfs:range>
    <owl:DataRange>
      <owl:oneOf>
        <rdf:List>
           <rdf:first rdf:datatype="&xsd;string">M</rdf:first>
           <rdf:rest>
              <rdf:List>
                  <rdf:first rdf:datatype="&xsd;string">F</rdf:first>
                  <rdf:rest rdf:resource="&rdf;nil" />
             </rdf:List>
          </rdf:rest>
        </rdf:List>
      </owl:oneOf>
    </owl:DataRange>
  </rdfs:range>
</owl:DatatypeProperty>
```
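The flattening of *select* and *tree* values into the list used for the Enumerated datatype can be sketched in Python as follows. The function name `flatten_values` is hypothetical, and so is the 'material' attribute; the sketch further assumes that tree values are written as nested `value` elements, which is not shown in the fragments reported here.

```python
import xml.etree.ElementTree as ET

SELECT_ATTRIBUTE = """
<attribute datatype="select" mandatory="false" name="gender">
   <values>
      <value name="M"/>
      <value name="F"/>
   </values>
</attribute>
"""

# Hypothetical tree-valued attribute; the nesting syntax is an assumption.
TREE_ATTRIBUTE = """
<attribute datatype="tree" mandatory="false" name="material">
   <values>
      <value name="Metal">
         <value name="Iron"/>
         <value name="Copper"/>
      </value>
      <value name="Plastic"/>
   </values>
</attribute>
"""


def flatten_values(attribute_xml):
    """Collect all value names in document order, discarding the tree structure."""
    attribute = ET.fromstring(attribute_xml)
    return [value.get("name") for value in attribute.iter("value")]


print(flatten_values(SELECT_ATTRIBUTE))  # ['M', 'F']
print(flatten_values(TREE_ATTRIBUTE))    # ['Metal', 'Iron', 'Copper', 'Plastic']
```

Because values in a tree are not duplicated, the flat list loses no information that is needed for the enumerated range.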
Attributes of type *entity* actually correspond to a relationship between the current instance and an instance of the *target* entity and thus they have as values the individuals of the corresponding *target* class.

Finally, OWL provides several datatypes for expressing the GBS *date* type (e.g., **xsd:date**). While for some purposes they may be enough for representing and handling this type, having an ontological description of time may allow more powerful reasoning. Recently, a specific OWL ontology of temporal concepts, OWL-Time [44], has been proposed for describing and handling temporal properties. This might be another solution, in the same spirit as our proposal but more complex and powerful. We reproduce the strategy discussed in Section 3. This option involves adding to the OWL ontology classes 'Day', 'Month', 'Year' and 'Timeline', and object properties 'belongsTo\_Day\_Month', 'belongsTo\_Month\_Year' and 'belongsTo\_Year\_Timeline', as specializations of a general 'belongsTo' relationship, to suitably connect these classes.

#### 4.2.4. Entity Attributes

As usual in databases, attributes in different entities might have the same name but different meaning. Since in OWL each name must identify one element, we disambiguate by merging the attribute name with the entity it belongs to. Therefore, attribute 'attr' of entity 'Ent' will be stored as 'attr\_Ent' in the OWL version of the ontology (not ambiguous since underscores are not allowed in entity names).

Attributes of data types *integer*, *real*, *boolean*, *string* and *text* are translated into OWL as datatype properties having the attribute class as the domain and the corresponding primitive OWL datatype as the range (as specified in the previous section).

As shown in the previous section, attributes of types *select* and *tree* are translated into a datatype property having the attribute class as the domain and an Enumerated Type as the range.

In Figure 6, on the left-hand-side, the data properties are shown, all descending from the 'topDataProperty' root. Some correspond to entity attributes. 'buttons\_Mouse' is selected, showing its domain class ('Mouse') and the associated datatype ('integer').


**Figure 6.** OWL translation of a sample GBS ontology loaded in Protégé: data properties.

Attributes of type *entity*, actually corresponding to a relationship between the instances of the attribute entity and those of the target entity, are translated as object properties having the attribute class as domain and the target class as range. This is compliant with our proposed naming policy, since attribute names start with a lowercase letter just like object property names. Specifically, since the target class individual associated with each domain class instance is unique, we also set this object property in OWL as functional (**owl:FunctionalProperty**).

Finally, according to what was reported in the previous section, attributes of type *date* can be modeled as datatype properties or as object properties.

In Figure 5, some object properties correspond to entity attributes of type 'entity' or 'date', e.g., 'announcementDate\_Component\_Day' represents the object property expressing the 'announcementDate' attribute (of type 'Date') of entity 'Component' (which is the domain of this object property), linking it to entity 'Day' (acting as the range of this object property).

#### 4.2.5. Relationship Attributes

As previously noted, OWL does not allow expressing attributes (datatype properties) on relationships (object properties). In the ontological practice this is solved by a process of *reification*, by which the object property becomes a class, to which the attributes can be associated, and considering it as the subject of two object properties, linking it respectively to its domain and range. We adopt the same strategy in our translation. After turning the relationship into a class, its attributes are handled as reported in the previous section, e.g., considering again relationship WASIN:

```
<relationship name="wasIn" inverse="hosted">
   <attributes>
      <attribute datatype="string" mandatory="false" name="reason"/>
      <attribute datatype="date" mandatory="false" name="startDate"/>
   </attributes>
</relationship>
```
the OWL classes, datatype properties (for attribute 'reason') and object properties (for attribute 'startDate') generated after reification would be:

```
<owl:Class rdf:about="http://owl.api.ontology#wasIn">
   <rdfs:subClassOf rdf:resource="http://owl.api.ontology#Relationship"/>
</owl:Class>
<owl:DatatypeProperty rdf:about="http://owl.api.ontology#reason_wasIn">
   <rdfs:domain rdf:resource="http://owl.api.ontology#wasIn"/>
   <rdfs:range rdf:resource="http://owl.api.ontology#string"/>
</owl:DatatypeProperty>
<owl:ObjectProperty rdf:about="http://owl.api.ontology#startDate_wasIn_Day">
   <rdfs:subPropertyOf rdf:resource="http://owl.api.ontology#RelationshipProperty"/>
   <rdfs:domain rdf:resource="http://owl.api.ontology#wasIn"/>
   <rdfs:range rdf:resource="http://owl.api.ontology#Day"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:about="http://owl.api.ontology#startDate_wasIn_Month">
   <rdfs:subPropertyOf rdf:resource="http://owl.api.ontology#RelationshipProperty"/>
   <rdfs:domain rdf:resource="http://owl.api.ontology#wasIn"/>
   <rdfs:range rdf:resource="http://owl.api.ontology#Month"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:about="http://owl.api.ontology#startDate_wasIn_Year">
   <rdfs:subPropertyOf rdf:resource="http://owl.api.ontology#RelationshipProperty"/>
   <rdfs:domain rdf:resource="http://owl.api.ontology#wasIn"/>
   <rdfs:range rdf:resource="http://owl.api.ontology#Year"/>
</owl:ObjectProperty>
```

While this transformation is required only for relationships having attributes, it may not be appropriate to have some relationships translated as object properties (those with no attributes) and others translated as classes. Therefore, we translate all relationships both as classes (possibly with attributes), reproducing their hierarchy under the RELATIONSHIP top-level class, and as object properties.
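The reification step can be sketched as a small script that reads the GBS relationship shown above and lists the OWL declarations to generate. This is a minimal sketch, not the actual translator: the function `reify`, the tuple layout and the `DATE_CLASSES` constant are assumptions made for illustration, and only primitive and *date* attributes are handled.

```python
import xml.etree.ElementTree as ET

GBS_RELATIONSHIP = """
<relationship name="wasIn" inverse="hosted">
   <attributes>
      <attribute datatype="string" mandatory="false" name="reason"/>
      <attribute datatype="date" mandatory="false" name="startDate"/>
   </attributes>
</relationship>
"""

# Target classes for date attributes, mirroring the OWL listing above.
DATE_CLASSES = ("Day", "Month", "Year")


def reify(relationship_xml: str) -> list[tuple[str, str, str]]:
    """Return (kind, name, target) declarations for a reified relationship.

    For a class, 'target' is its superclass; for a property, it is the range.
    """
    rel = ET.fromstring(relationship_xml)
    rel_name = rel.get("name")
    # The relationship itself becomes a class under the Relationship top class.
    declarations = [("Class", rel_name, "Relationship")]
    for attr in rel.iter("attribute"):
        attr_name, datatype = attr.get("name"), attr.get("datatype")
        if datatype == "date":
            # One object property per date class (hypothetical layout).
            for target in DATE_CLASSES:
                declarations.append(
                    ("ObjectProperty", f"{attr_name}_{rel_name}_{target}", target)
                )
        else:
            declarations.append(
                ("DatatypeProperty", f"{attr_name}_{rel_name}", datatype)
            )
    return declarations


for declaration in reify(GBS_RELATIONSHIP):
    print(declaration)
```

Each generated property takes the reified class as its domain, so the domain is implied by the relationship name embedded in the property name.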

#### *4.3. Logical Architecture and Workflow*

Figure 7 provides a high-level graphical description of the involved components and the flow of information in GraphBRAIN. The GraphBRAIN system is shown as a grey box, including the graph DB that stores the data, the GBS schemas and the API. Shapes denote kinds of information: the schemas (empty shapes) define the allowed information patterns and information (filled shapes) is stored in the DB based on these patterns (the shape of the information blocks is the same as that of the schema they refer to). Some information may belong to different schemas (shown as overlapping shapes in the DB). Note that the schemas are kept apart from the data, that several schemas may be used on the same DB and that the API is independent of the schemas (the same API may be used on all DBs, since the schema to be used is provided as an input during the operations).

All interactions between external entities and the system pass through the API. Applications (e.g., the Web Application described in Section 3) may ask the API to provide information about the patterns in one of the available schemas and use them to inform their data handling requests. When they request to store (insert/update) or retrieve (read) information based on a schema, the API checks that the structure of the request is consistent with the patterns defined in the specified schema, in which case the request is fulfilled. Requests for information patterns not defined in the schema (the triangle in the figure) are blocked. Given an existing KG, its ontological part can be imported into a schema; if required, its instances can also be imported into the DB based on the imported schema. Conversely, a schema can be exported to an ontology for a KG and possibly the corresponding data in the DB can be exported as instances to the KG, as well.

**Figure 7.** Interplay among components and roles.
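The gatekeeping role of the API in this workflow can be illustrated with a toy wrapper. Everything in this sketch is hypothetical (the classes `Schema`, `GraphApi` and `SchemaViolation`, the in-memory list standing in for the graph DB, and the sample 'hardware' schema); it only shows the idea that the schema is passed as an input and that requests not matching its patterns are blocked.

```python
from dataclasses import dataclass, field


class SchemaViolation(Exception):
    """Raised when a request does not match any pattern in the schema."""


@dataclass
class Schema:
    # Maps each entity name to the attribute names it allows.
    entities: dict[str, set[str]] = field(default_factory=dict)


class GraphApi:
    """Hypothetical wrapper: every write is checked against a named schema."""

    def __init__(self, schemas: dict[str, Schema]) -> None:
        self.schemas = schemas
        # Plain list used as a stand-in for the graph DB.
        self.db: list[tuple[str, dict[str, object]]] = []

    def insert(self, schema_name: str, entity: str, values: dict[str, object]) -> None:
        schema = self.schemas[schema_name]  # the schema is an input of the operation
        allowed = schema.entities.get(entity)
        if allowed is None:
            raise SchemaViolation(f"entity {entity!r} is not defined in {schema_name!r}")
        unknown = set(values) - allowed
        if unknown:
            raise SchemaViolation(
                f"attributes {sorted(unknown)} are not defined for {entity!r}"
            )
        self.db.append((entity, dict(values)))


api = GraphApi({"hardware": Schema({"Mouse": {"buttons", "name"}})})
api.insert("hardware", "Mouse", {"buttons": 3})  # consistent with the schema
```

A request for an entity that the schema does not define (the triangle in Figure 7) would raise `SchemaViolation` and leave the stored data untouched.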

#### **5. Conclusions**

Formal ontologies, described as RDF graphs, have traditionally been investigated as a means to formalize an application domain so as to carry out automated reasoning on it. The union of the terminological and assertional parts of an ontology is known as a Knowledge Graph. On the other hand, database technology has always focused on the optimal organization of data so as to boost efficiency in their storage, management and retrieval. Graph databases, based on the Labeled Property Graph (LPG) model, are a recent technology specifically focusing on element-driven data browsing rather than on batch processing. Furthermore, graph databases are typically schema-less, preventing uniform interpretation of the data by, and interoperability of, the applications. In spite of the patent and intuitive complementarity and connections between these technologies, the underlying graph models are partially incompatible and little exists to bring them to full integration and cooperation.

Whilst most efforts in the literature are OWL-centric and aimed at mapping RDF ontologies to LPGs, we place more emphasis on the database, so as to benefit from efficient data handling, and aim at enriching it with reasoning capabilities that exploit as much as possible the flexibility of the LPG model. To the best of our knowledge, this is a completely novel perspective in the literature.

For this purpose, we proposed to express database schemas in the form of ontologies, so as to clearly describe the database content and to allow users to carry out complex reasoning on it, beyond the queries allowed by the database query language. Specifically, we defined an intermediate format (GBS) that can be easily mapped onto formal ontology standards on one hand and onto the graph database structure on the other. A peculiarity of our approach is that many schemas/ontologies can be applied to the same graph to express different domains or perspectives on its content. These ontologies may share classes and relationships, allowing cross-fertilization of the knowledge from the corresponding domains. The use of ontologies enables multistrategy formal, automated reasoning on the data, that goes much beyond what simple queries can do.

In this paper, for the first time, we provided the full specification for GBS and discussed how its components can be mapped onto a well-known graph DB (Neo4j) and onto a standard formal ontology language (OWL). Operationally, this framework is supported by an API that is meant to act as a wrapper for the DB, ensuring that its content is compliant with a GBS schema, and that can connect the instances in the DB with an ontological reasoner using the same schema as an ontology. Based on this API, many different applications may exploit this powerful combination of databases and ontologies in their functions. Among these applications, we developed a tool to build, browse and edit GBS schemas, and a tool to add, edit and consult the DB content according to a pre-specified schema. Such a tool is also described in this paper.

The API and tools are continuously under development to be extended and refined, and research is ongoing to further improve the mapping between the GBS and OWL formalisms, so as to fully exploit their respective advantages in both the instance (database) and the schema (ontology) part of the knowledge graph. In particular, we are working on extending the schema format with additional tags/features to express information that may improve the effectiveness of reasoning at the ontological level.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The author would like to thank Domenico Redavid for the useful discussions on the methodology, and Davide Di Pierro for their contribution in the implementation of the schema management section. Grateful thanks go to Artificial Brain S.r.l. for implementing most of the Web Application.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


## *Article* **An Ontology-Based Approach for Knowledge Acquisition: An Example of Sustainable Supplier Selection Domain Corpus**

**Agnieszka Konys**

Faculty of Computer Science and Information Technology, West-Pomeranian University of Technology in Szczecin, Żołnierska 49 Street, 71-210 Szczecin, Poland; akonys@zut.edu.pl

**Abstract:** Selecting the right supplier is a critical decision in sustainable supply chain management. Sustainable supplier selection plays an important role in achieving a balance between the three pillars of a sustainable supply chain: economic, environmental, and social. One of the most crucial aspects of running a business in this regard is sustainable supplier selection, and, to this end, an accurate and reliable approach is required. Therefore, the main contribution of this paper is to propose and implement an ontology-based approach for knowledge acquisition from text for the sustainable supplier selection domain. This approach is dedicated to acquiring complex relationships from texts and coding these in the form of rules. The expected outcome is to enrich the existing domain ontology with these rules to obtain higher relational expressiveness, support reasoning, and produce new knowledge.

**Keywords:** ontology; knowledge base; sustainable supplier selection; ontology population; information extraction; knowledge acquisition from text

#### **1. Introduction**

The concept of sustainable development is based on the intersection of three dimensions: economic, environmental, and social. Each of them deals with different aspects, but together they focus on promoting sustainable development. Globalization forces global manufacturers to attach much importance to partnerships between suppliers. In general, a supply chain is a concept that links upstream, midstream, and downstream. The manufacturers' aim is to reduce costs in this process. Moreover, supply chain management (SCM) receives the applicable information from downstream to improve the quality of the goods provided upstream and downstream [1]. Growing customer, non-governmental organization (NGO), and law enforcement concerns about environmental, social, and corporate responsibility have drawn industry academics and practitioners to the concept of sustainable supply chain management [2].

The assessment of sustainable development is an issue of growing importance among scientists and decision-makers. Sustainability assessment offers a large number of opportunities to measure and evaluate the level of its accomplishment. The search for effective methods of assessing and monitoring sustainable development is now becoming one of the key factors determining the development of a sustainable society. The problem of assessing sustainable development applies to almost all areas. International environmental policy, governments, and the public have stimulated enterprises to strictly adopt sustainable concepts in supply chain networking in order to obtain reactive, regulatory, proactive strategic, and competitive merit and to mitigate the non-sustainable challenges and factors that threaten the world's environment [3]. Due to globalization, sustainable supply chains are becoming more and more important. Hence, it is worth paying attention to ensuring sustainable supplier selection in this process. Sustainable supplier selection is a combined multi-dimensional problem that includes considering both qualitative and quantitative factors. The sustainability paradigm has been considered a comprehensive term in supplier selection, which includes a vital presence of three aspects (economic, environmental, and social) [4].

**Citation:** Konys, A. An Ontology-Based Approach for Knowledge Acquisition: An Example of Sustainable Supplier Selection Domain Corpus. *Electronics* **2022**, *11*, 4012. https://doi.org/10.3390/electronics11234012

Academic Editor: Stefano Ferilli

Received: 20 October 2022 Accepted: 2 December 2022 Published: 3 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Managing sustainable supply chain complexity is one of the most difficult problems in today's global supply chains and is regarded as the key impediment to business performance. It has a significant influence on competitiveness, costs, customer satisfaction, product innovation, and market share. Therefore, decision-makers must know the criteria that drive sustainable supply chain efficiency. Proper identification and prioritization of sustainable supplier criteria are required for effective monitoring and control of supply chain management [5]. Moreover, the timeliness of these criteria is also of great importance. The selection of a sustainable supplier depends on many factors. Thus, the crucial question is how to strike a reasonable balance between comprehensiveness and a manageable multi-dimensional knowledge base while keeping information exchange up to date.

This paper presents an ontology-based approach to knowledge acquisition from text. This approach is dedicated to acquiring complex relationships from texts and coding these in the form of rules. The approach begins with elaborating data using VOSviewer to plot knowledge domain maps. Next, existing domain knowledge is implemented as an OWL ontology, and NLP tools and text-matching techniques are applied to deduce different atoms, such as classes, properties, and literals, to capture deductive knowledge in the form of new rules. The expected outcome is to enrich the existing domain ontology with these rules to obtain higher relational expressiveness, support reasoning, and produce new facts.

Several research gaps are identified through an in-depth review of the literature. Firstly, there is no comprehensive knowledge base about criteria: sets of criteria are reported in various literature studies, but they cannot effectively estimate sustainable supplier selection criteria [1,6,7]. Secondly, in most cases, the performance of sustainable supplier selection is evaluated subjectively [3,8,9].

Moreover, there is a lack of a systematic framework to handle knowledge about sustainable supplier selection criteria [1,6,7,9]. There is also a lack of a complex approach to both selecting and filtering linguistic information about criteria determining sustainable supplier selection and its categorization in the form of a knowledge base [3,5,8,10,11].

These research gaps are transformed into the author's contribution as follows:


Based on this, it is possible to define the following highlights:


The presented approach begins with creating domain knowledge represented as an OWL ontology and applies NLP tools and text-matching techniques to deduce different atoms, such as classes, properties, and literals, to capture new knowledge. This research increases the body of knowledge on the ontology for the sustainable supplier domain by providing a systematic keywords map of the subject and grasping the main criteria in the research field. The results demonstrate that the proposed approach can (1) successfully handle the knowledge domain, (2) reduce the time for searching for relevant information, (3) improve the accuracy of search results that suit users' specific needs, and (4) provide quick updates with new knowledge.

The remainder of this paper is organized as follows. Section 2 presents the related works, in particular, taking into account such topics as sustainable supplier selection, information extraction, NLP, and ontologies. In Section 3, Materials and Methods, a new ontology-based approach for extracting knowledge in the form of rules from texts is described in detail. Section 4 presents the working example of the elaborated approach. Section 5 provides the conclusions and directions for further research.

#### **2. Background and Related Works**

#### *2.1. Sustainable Supplier Selection*

The growing emphasis on supply chain management among manufacturing companies has made the suppliers' role in the value-addition processes strategically significant [8]. The problem of assessing sustainable development applies to almost all areas. Supplier selection is a combined multi-dimensional problem that includes considering both qualitative and quantitative factors [9]. Due to globalization, sustainable supply chains are becoming more and more important. The fast globalization of business affects business competition, changing the model from "company versus company" to "supply chain versus supply chain" [11]. Therefore, choosing a good combination of suppliers to work with is critical to the success of a business [1]. Over the years, the importance of selecting suppliers has been appreciated and emphasized. Adding sustainability aspects to the supplier selection process highlights existing trends in environmental, economic, and social issues related to management and business processes. Moreover, the advancement of sustainable development allows the integration of environmental, economic, and social thinking with conventional supplier selection [12].

From a systematic point of view, the study of the problem of sustainable supplier selection can be divided into two parts: criteria and methods [13]. The analysis of the literature provides a set of various methods exploiting different aspects and using single or mixed approaches, as well as examples of selection criteria [11,12,14]. Most of the studies on sustainable supplier selection use MCDM or fuzzy MCDM techniques with complex calculations [1]. A wide range of methods has been applied to solve the problem of sustainable supplier selection. Literature reviews [12] point out that the main single and combined approaches used to solve this problem are mathematical methods and artificial intelligence approaches, especially including analytic hierarchy process [10,15], linear programming [10], multi-objective programming [16,17], goal programming [6], data envelopment analysis [13], heuristics [18], statistical methods [19], cluster analysis [7], multiple regression [20], discriminant analysis [21], neural networks [22], software agents [20], case-based reasoning [23], expert systems [21], and fuzzy set theory [14], as well as combinations of selected pairs.

As it is a multi-dimensional concept, the selection of sustainable suppliers is not based on a single criterion but on a set of criteria, which are mostly focused on economic, social, and environmental issues. In general, most companies need to focus on their supply chains to enhance sustainability to meet customer demands and comply with environmental legislation. In order to achieve these goals, companies must focus on criteria that include carbon footprint and toxic emissions, energy use and efficiency, waste generation, and worker health and safety [24]. Therefore, to analyze interrelationships among sustainability criteria, it is necessary to identify the most important ones for a given decision problem and then evaluate suppliers according to these criteria. Since the knowledge about criteria is scattered, hybrid information aggregation is required to provide practical evaluation and link this set of information to the proposed knowledge base. The literature analysis provides many multi-criteria methods to support a balanced selection of suppliers and multiple subsets of criteria, often suited for a given area (e.g., food, industry, and others). There are many comparable approaches; Table 1 shows a small sample of them. However, little attention has been paid to building a complex solution that allows gathering the selection criteria for sustainable suppliers, and there is almost no systemic and structured knowledge-based approach that could be used to evaluate the sustainability of suppliers.


**Table 1.** Examples of multi-criteria methods to support a selection of sustainable suppliers.

#### *2.2. Information Extraction*

The information extraction (IE) process is based on the automatic extraction of certain types of information from natural language text. IE is the process of extracting information from unstructured text sources to enable entities to be searched, classified, and stored in a knowledge base [34]. The general aim is to parse text in natural language and look for instances of a certain class of objects or events and the instances of relationships between them. Another definition describes information extraction as a form of natural language processing in which certain types of information must be recognized and extracted from a text. Extracting information uses various algorithms and methods for finding information [35]. IE deals with the collection of texts in order to transform them into information that can be easily understood and analyzed [36]. Semantically enhanced information extraction (also known as semantic annotation) links the extracted units to their semantic descriptions and connections from the knowledge graph. Because so much information is available on the Internet these days, and the amount is constantly growing, the result is information overload. However, the real problem is not the sheer amount of information but the inability to filter it properly [34,37]. IE helps in the automatic detection of new, previously unknown information by automatically extracting information from various unstructured resources [38]. Therefore, the key element is linking the extracted information together to formulate new facts or new knowledge. In other words, in IE, the goal is to discover previously unknown information. Figure 1 displays an illustrative example of how information extraction works in practice.

**Figure 1.** An example of information extraction.

Natural Language Processing (NLP)

NLP aims to analyze, identify and solve problems related to the automatic generation and understanding of human language. It also aims to process, decode and understand unstructured information [39]. NLP allows for the following:

• Sorting the data to remove the rubbish from the interesting parts;


Overall, the combination of NLP and information extraction extracts new knowledge from the raw data. Finally, unknown information is obtained by automatically extracting information from various unstructured resources.

#### *2.3. Ontology and Ontology Population*

#### 2.3.1. Ontology

Recently, ontologies and the Semantic Web have become popular and prominent research areas in computer science. Ontology is a standard recommended by the World Wide Web Consortium for representing knowledge in the Semantic Web, and it has become a fundamental and critical component for developing applications in different real-world scenarios [40]. Ontologies have become an important tool in domain modeling over the years and have been used successfully in several fields. In the artificial intelligence field [41–44], ontologies can also be used to build knowledge bases that will be used in various systems, using the obtained information to perform different tasks [41]. As a result, they help in carrying out real-world representations, establishing axioms, and obtaining conclusions from them [41,45,46].

Ontologies are defined as a set of concepts and relations between them [47]. Concepts can be divided into classes, subclasses, attributes, relationships, and instances. From a technical point of view, ontologies are a formal source of domain-specific knowledge, which is proven to be efficient for search results diversification [48]. In fact, they make it possible to express the semantics of a domain in a language that computers can understand, allowing automatic processing of the meaning of the information provided [49]. Ontologies provide a controlled vocabulary of concepts whose semantics are explicitly defined and machine understandable [47]. Ontologies also offer a common understanding of the topics of communication between systems and users and enable the processing of web-based knowledge as well as its sharing and reuse among applications [48]. The most popular definition of ontology was proposed by Gruber, who stated that ontology could be defined as an explicit, formal specification of a shared conceptualization [47]. An ontology comprises the following components: individuals, concepts, relations, and axioms. It can be formulated as follows:

$$\mathbf{O} = \{\mathbf{I}; \mathbf{C}; \mathbf{R}; \mathbf{A}\} \tag{1}$$

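where $\mathbf{I}$ is the set of individuals, $\mathbf{C}$ refers to the set of concepts, $\mathbf{R}$ represents the set of relations and the interactions between domain individuals, with $\mathbf{R} \subseteq \mathbf{C}_1 \times \mathbf{C}_2 \times \cdots \times \mathbf{C}_n$, and $\mathbf{A}$ is the set of axioms.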

The concepts (classes) correspond to the relevant abstractions of a segment of reality (the domain of the problem). The relations (properties) link the individuals or concepts between them. The individual is defined as a resource that has been placed into the class, but individuals are not classes themselves. The axioms are statements that are asserted to be true in the domain being described [50].
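The four components of Equation (1) can be mirrored by a small data structure. The following is a minimal sketch under stated assumptions: the class `Ontology`, its field layout, and the sample supplier-related names are hypothetical and serve only to make the roles of I, C, R and A concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ontology:
    individuals: dict[str, str]            # I: individual -> concept it is placed in
    concepts: frozenset[str]               # C: the set of concepts (classes)
    relations: dict[str, tuple[str, ...]]  # R: relation name -> (C1, ..., Cn) signature
    axioms: tuple[str, ...]                # A: statements asserted to be true

    def is_well_formed(self) -> bool:
        # Every relation must range over declared concepts (R within C1 x ... x Cn).
        relations_ok = all(
            concept in self.concepts
            for signature in self.relations.values()
            for concept in signature
        )
        # Individuals belong to a declared concept and are not concepts themselves.
        individuals_ok = all(
            concept in self.concepts and name not in self.concepts
            for name, concept in self.individuals.items()
        )
        return relations_ok and individuals_ok


# Hypothetical example values, not taken from the ontology built in this paper.
example = Ontology(
    individuals={"supplier_1": "Supplier"},
    concepts=frozenset({"Supplier", "Criterion"}),
    relations={"evaluatedBy": ("Supplier", "Criterion")},
    axioms=("Supplier and Criterion are disjoint",),
)
print(example.is_well_formed())
```

Axioms are kept as plain strings here; in an OWL ontology they would be logical statements that a reasoner can evaluate.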

The OWL 2 standard is currently used as a formal language for representing ontologies. The inference process takes place using various ontological reasoners. The main functions of reasoners are ontology consistency checking, class taxonomy building, and ontology querying. Ontology reasoning aims to ensure that the ontology is consistent with its logical semantics. Reasoning is also required to infer new knowledge from the ontology. The reasoners enable validation of the ontology and, ultimately, make it possible to obtain inferred knowledge in response to the user's description logic (DL) queries.

#### 2.3.2. Ontology Population

Ontology population is a process for inserting concept and relation instances into an existing ontology [51,52]. The ontology population process comprises several tasks. One is the extraction of relation instances, together with the identification of values from any information source and their assignment to instances. Another involves extracting instances, or more precisely, identifying values from any information source and assigning them to an instance [51,52]. There are many approaches in the literature related to ontology learning and ontology population. Ontology learning has benefited from the adoption of established techniques such as machine learning, data mining, natural language processing, information retrieval, and knowledge representation [53]. In the classification proposed by Alexander Maedche and Steffen Staab [54], ontology learning approaches are distinguished by the type of input data used for learning. Thus, a common classification covers ontology learning from text, dictionaries, knowledge bases, semi-structured schemata, and relational schemata [53]. Each of them requires multiple research efforts to achieve a common domain conceptualization [55,56].

An automated ontology population is intended to identify concept and relation instances by using a computational tool [52,55,57,58]. Ontology learning techniques apply more complex NLP techniques to the text. Rather than simply extracting terms, they analyze the grammatical structure of sentences to determine how the terms are used. Then they deduce possible IS-A relationships between terms, which will be used to build classification hierarchies.
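To give a flavor of how IS-A candidates can be proposed from text, the sketch below uses a single lexical pattern ("X such as A, B, and C"). This is a deliberately simplified stand-in: the techniques discussed above analyze the grammatical structure of sentences, whereas this example relies only on a regular expression. The function name `extract_is_a` and the sample sentence are hypothetical.

```python
import re


def extract_is_a(sentence: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) candidates found via the 'such as' pattern."""
    match = re.search(r"(\w+) such as ([^.;]+)", sentence)
    if match is None:
        return []
    hypernym, tail = match.group(1), match.group(2)
    # Split the enumeration on commas and on the final 'and'.
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", tail)
    return [(part.strip(), hypernym) for part in parts if part.strip()]


# Hypothetical sentence, used only to exercise the pattern.
pairs = extract_is_a("Suppliers are assessed on criteria such as cost, quality, and delivery.")
for hyponym, hypernym in pairs:
    print(f"{hyponym} IS-A {hypernym}")
```

Candidates produced in this way would still need to be validated, for example by a knowledge engineer, before being added to a classification hierarchy.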

#### **3. Materials and Methods**

This section describes a new ontology-based approach for extracting knowledge in the form of rules from texts. This approach is dedicated to acquiring complex relationships from texts and coding these in the form of rules. The proposed approach is based on different works in the areas of knowledge acquisition, rule-based reasoning, and ontology population. A semi-automated supervised solution has been proposed for extending ontology classes in terms of learning concept attributes, data types, and value ranges. This approach requires two inputs: existing knowledge and free texts. The existing knowledge is an OWL ontology. Free texts represent the domain knowledge in unstructured natural language, in this case, English. The selected domain covers sustainable supplier selection criteria.

#### *3.1. Data Preparation and Search Strategy*

In this study, we used the following tools: (1) the Scopus database for managing bibliographic references [59] and (2) VOSviewer for bibliographic analysis and developing a keywords map [60]. The search strategy consisted of using the Scopus database to retrieve documents related to sustainable supplier selection criteria. In order to support the document filtration process, a formal PRISMA approach [61] was used (Figure 2). However, not all steps from the PRISMA flow diagram were used because the main goal was to search for criteria, and filtering only on the abstract and keywords was insufficient. The list of papers contains 1652 elements. The selected documents were published between 2003 and 2021. The analysis started in June 2021; hence, not all publications from 2021 are included. Conference reviews, errata, and reviews were excluded from the final set of documents. The query was as follows: TITLE-ABS-KEY ("sustainable supplier selection"). The extracted documents were exported to Excel spreadsheets as a \*.csv file. The results can be revised by the author's name, affiliation, document type, source title, or subject area.

**Figure 2.** Modified PRISMA procedure [61].
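The automatic part of this filtering (publication years and excluded document types) can be sketched as follows. This is an illustration only: the column names `Year` and `Document Type`, the function `filter_records`, and the sample rows are assumptions, and the actual export may use a different layout.

```python
import csv
import io

# Document types excluded from the final set, as stated in the text.
EXCLUDED_TYPES = {"Conference Review", "Erratum", "Review"}

# Hypothetical sample standing in for the exported *.csv file.
SAMPLE_EXPORT = """Title,Year,Document Type
Paper A,2019,Article
Paper B,2020,Review
Paper C,2001,Article
Paper D,2021,Conference Paper
"""


def filter_records(csv_text: str, first_year: int = 2003, last_year: int = 2021) -> list[dict[str, str]]:
    """Keep records within the year range whose type is not excluded."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        if row["Document Type"] in EXCLUDED_TYPES:
            continue
        if not first_year <= int(row["Year"]) <= last_year:
            continue
        kept.append(row)
    return kept


for record in filter_records(SAMPLE_EXPORT):
    print(record["Title"], record["Year"], record["Document Type"])
```

The identification of criteria and sub-criteria within the retained papers is a manual step and is not covered by this sketch.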

Then the set of papers was manually filtered. The process itself was highly time-consuming but allowed for the identification of an initial set of criteria and sub-criteria. It contained 8261 items. The data set prepared in this way was then subjected to further work using a dedicated tool for plotting knowledge domain maps.

#### *3.2. Plotting Knowledge Domain Maps*

The developed data set was used to prepare and plot knowledge domain maps. Previously collected data were processed using the VOSviewer software [62]. VOSviewer enables the user to generate networks from given bibliometric data. It also allows the user to group criteria and sub-criteria and display the results. The size of a given item reflects the density of occurrence of a given criterion (Figure 3).

**Figure 3.** Collected and elaborated data using the VOSviewer software [62].

For the analysis, it was also necessary to clean up the data, so a VOSviewer thesaurus file was created to combine similar criteria names. Due to the relatively large number of criteria, it is not possible to present all changes in this study. Selected limitation rules are defined and shown, for example:


The thesaurus file contains 68 extra items. Ultimately, the set included 8261 criteria as input from 1652 papers. The total number of main clusters is 126. Each cluster contains a set of sub-criteria. The keyword occurrence map was also created. The most common keywords are green, cost, and quality. Table 2 shows the 10 most popular keywords.

**Table 2.** The top 10 keywords.


#### *3.3. Ontology Representation*

The knowledge domain maps plotted in the previous step provide a pre-elaborated set of criteria and sub-criteria ready to be implemented in an OWL ontology. The ontology contains all the identified elements, which form the backbone of the taxonomy (class hierarchy). This process requires the participation of a knowledge engineer. The input domain ontology was therefore developed from scratch based on the data set provided. The Protégé OWL-API [63] was selected to work with the ontology and to manipulate its constituents (classes, object properties, datatype properties, and individuals). The ontology aims to structure knowledge, organize it, and, above all, reason about it. The main stages of the development process are inspired by the ontology development methodology of Noy and McGuinness, as shown in Figure 4.

The first step defines the domain and scope of the ontology; in this case, the domain of sustainable suppliers was considered. Since no similar solutions were found, the second step was to create the ontology from scratch. Steps 3 through 7 relate directly to ontology construction. In the third step, the most important terms in the ontology are indicated and then detailed. These terms are the basis for building the class hierarchy in step 4. The class hierarchy represents an "is-a" relation: class X is a subclass of Y if every instance of X is also an instance of Y. It is worth noting that the whole set contains 126 main classes and 8261 sub-classes. Thus, there are 8378 classes in total. The final set of clusters is attached in the supplementary materials (the set of criteria: Sustainable\_Supplier\_Criteria.xls). Table 3 displays a fragment of the class hierarchy.
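The "is-a" relation described above is transitive, which is what makes a subsumption check possible. The following is a minimal sketch in C, assuming a toy hierarchy with a single direct superclass per class; the class names, the `OntClass` type, and the functions `find_class` and `is_a` are hypothetical and are not taken from the published ontology, in which a sub-criterion may belong to several classes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One node of a toy class hierarchy: a class name and the index of its
 * direct superclass (-1 for a root class). All names are hypothetical. */
typedef struct {
    const char *name;
    int parent;
} OntClass;

/* Illustrative hierarchy: Criterion is the root of Quality and Delivery. */
static const OntClass HIERARCHY[] = {
    {"Criterion", -1},
    {"Quality", 0},
    {"Delivery", 0},
    {"QualityOfProduct", 1},
    {"DeliveryOnTime", 2},
};
#define HIERARCHY_SIZE ((int)(sizeof HIERARCHY / sizeof HIERARCHY[0]))

/* Return the index of the class called `name`, or -1 if it is unknown. */
int find_class(const char *name) {
    assert(name != NULL);
    for (int i = 0; i < HIERARCHY_SIZE; i++) {
        if (strcmp(HIERARCHY[i].name, name) == 0) return i;
    }
    return -1;
}

/* True if `sub` is the same class as `super` or a transitive subclass of it.
 * The loop walks from `sub` up to the root, following parent links. */
bool is_a(const char *sub, const char *super) {
    int target = find_class(super);
    if (target < 0) return false;
    for (int c = find_class(sub); c >= 0; c = HIERARCHY[c].parent) {
        if (c == target) return true;
    }
    return false;
}
```

In this sketch, `is_a("QualityOfProduct", "Criterion")` holds because the walk passes through `Quality` before reaching the root, whereas `is_a("QualityOfProduct", "Delivery")` does not hold.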

**Figure 4.** Ontology construction steps.

**Table 3.** Examples of classes.


In the fifth step, the relations (slots) are defined. In Protégé, slots are also called object properties. Object properties describe the relations between classes or individuals. Another group comprises datatype properties, which describe the relations between individuals and data values. Table 4 shows selected object properties and datatype properties with assigned domains and ranges.


**Table 4.** Examples of object properties and datatype properties.

In the sixth step, the facets of the slots are defined: the value types, cardinality, ranges, and other features are determined. The seventh step creates instances of the classes in the hierarchy. Defining an individual instance of a class requires (1) selecting the class, (2) creating an individual instance of that class, and (3) filling in the slot values [64].

The resulting knowledge base contains 8261 ontological entities such as classes, relations, datatype properties, and individuals. This ontology contains a considerable amount of information representing the sustainable supplier criteria. Moreover, this ontology can be fed with new data from external sources. The ontology is available at: https://webprotege.stanford.edu/#projects/d819c911-a0dc-4208-86a5-3be0df042caa/edit/Classes (accessed on 1 April 2022).

#### *3.4. Ontology Population—Information Extraction and Discovering Specific Concepts from the Text and Semantic Annotation*

Ontologies can provide an alternative way of storing knowledge at the concept and instance levels. Ontology enrichment, that is, adding the names of concepts, their relationships, and instances to populate the ontology, is usually performed by domain experts. However, this process is time-consuming and requires relevant knowledge from domain experts as well as manual work. Therefore, ontology population is needed: it obtains useful information from texts and enriches an existing ontology, used as input, with class and relationship instances [52].

The elaborated approach aims to provide an ontology-based knowledge extraction system for texts that helps to acquire and formalize this knowledge automatically, limiting the need for expert intervention as much as possible. The proposed approach is based on natural language processing (NLP) and information extraction (IE) techniques. In this work, the applied information extraction techniques are named-entity recognition and coreference resolution. Discovering specific concepts from text requires a dedicated tool. The approach was developed using the GATE tool and a pipeline-shaped architecture, i.e., each process must finish before the next one starts. GATE is an architecture, framework, and development environment for language engineering (LE). It follows a component-based model that allows for easy coupling and decoupling of processing resources. GATE includes a core library and a set of reusable LE modules. The framework implements the architecture and provides facilities for processing and visualizing resources, including representation, import, and export of data. The provided reusable modules can perform basic language processing tasks such as part-of-speech (POS) and semantic tagging [65]. The process is shown in Figure 5.

**Figure 5.** Semi-automatic information extraction and knowledge base model constructions.

The input data are provided by the user in the form of unstructured text or web resources, from which a corpus of documents is created. The corpus consists of various documents related to sustainable suppliers. Apart from scientific papers, reports and statistics written by specialists can also be used as input. GATE enables a pipeline to be constructed from various processing resources. The pipeline includes, in particular, the following steps:


This semantic annotation, together with the ontology population and the definition of classes, relies on ANNIE (A Nearly-New Information Extraction system). ANNIE is a component of GATE that provides a complete processing chain dedicated to information extraction. It is based on the Java Annotation Patterns Engine (JAPE) and includes annotation modules for a range of extraction tasks. In selected cases, additional processing resources can be used. Figure 6 shows a simplified procedure related to information extraction and feeding the ontology with new knowledge.

**Figure 6.** Information extraction procedure and feeding the ontology with new knowledge. Source: Personal elaboration on base of GATE documentation.

#### *3.5. Rule-Based Reasoning*

The GATE resource OntoRoot Gazetteer creates annotations over textual documents. It requires an ontology as input and works in combination with other generic GATE resources. Another processing resource, the JAPE transducer, applies JAPE rules to transform annotations into property assertions. It allows rules to be defined and regular expressions over document annotations to be recognized. A single JAPE rule is composed of two parts: the left-hand side (LHS) and the right-hand side (RHS). The LHS contains the patterns to match, whereas the RHS details the annotations to be created. JAPE rules combine to form a phase, and the phases combine to form a grammar. The rules are designed to tag classes, instances, and attribute values. The priority of rules is based on pattern length, rule status, and rule order. JAPE rules are used to locate terms in the text that potentially relate to markers; these terms are later used to create new annotations using the JAPE formalism and to identify the body and the head of the produced rules.

Table 5 presents the code of a sample JAPE rule named "Quality1". To match a string of text, the rule uses the "Token" annotation and its "string" feature to match the text "quality". The pattern is enclosed in parentheses and followed by a colon and a label. The sign "->" separates the LHS from the RHS. The RHS manipulates the annotations matched by the LHS pattern, and the label on the RHS must match a label on the LHS. When the LHS matches, the RHS is executed [65]. When a rule matches a text sequence, the entire sequence is bound to the label. The transducer is informed that the temporary label (quality) will be renamed to "Quality" and that the rule that achieves this is "Quality1". Naming a rule is important for debugging: when the rule fires, its name becomes part of the annotation properties visible in the GATE GUI. In this example, a sample criterion will be annotated with {rule = Quality1}.
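The pattern/action split of such a rule can be mimicked outside GATE. The sketch below is a plain C analogue, not JAPE and not the rule from Table 5: it assumes that the pattern is a single token whose string is exactly "quality", and the types `Token` and `Annotation` and the function `apply_quality1` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A token as a tokenizer might produce it: the text plus character offsets. */
typedef struct {
    const char *string;
    int start;
    int end;
} Token;

/* An annotation created by a rule: its type, its span, and the name of the
 * rule that fired (mirroring the {rule = Quality1} feature). */
typedef struct {
    const char *type;
    const char *rule;
    int start;
    int end;
} Annotation;

/* "LHS": a single token whose string is exactly "quality" (an assumption).
 * "RHS": for every match, emit a Quality annotation over the matched span.
 * Returns the number of annotations written to `out` (at most `max_out`). */
size_t apply_quality1(const Token *tokens, size_t n_tokens,
                      Annotation *out, size_t max_out) {
    assert(tokens != NULL && out != NULL);
    size_t n_out = 0;
    for (size_t i = 0; i < n_tokens && n_out < max_out; i++) {
        if (strcmp(tokens[i].string, "quality") == 0) {   /* pattern part */
            out[n_out].type = "Quality";                   /* action part */
            out[n_out].rule = "Quality1";
            out[n_out].start = tokens[i].start;
            out[n_out].end = tokens[i].end;
            n_out++;
        }
    }
    return n_out;
}
```

Keeping the rule name inside each annotation serves the same debugging purpose as in GATE: every annotation records which rule produced it.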

**Table 5.** An implemented code of the sample JAPE rule.


The set of syntactic rules was created manually. The categories of the developed rules refer to the previously elaborated set of criteria implemented in the OWL ontology. The rules aim to extract attribute values from any corpus of documents and assign them to a given class. They are implemented in the JAPE language. GATE offers OWLIM as an ontology editor that allows results to be added directly to the ontology. In addition, all extracted information can be saved in an XML file. Subsequently, an ontology can be automatically created with all information about classes, attributes, and instances. The XML file may also be used by the Protégé environment as an input file and may be processed and saved in OWL/XML format.

#### **4. Case Study**

#### *4.1. Domain Knowledge Acquisition and Cluster Construction*

Data were collected from the Scopus database [59]. The data pre-processing and selection process is described in Section 3.1. Manual screening of the selected works allows the data to be divided into criteria and sub-criteria and enables the initial classification of criteria. The main set of criteria represents keywords specific to a given class. For example, if the criterion "Quality" is analyzed, then the sub-criteria containing this word in their description will belong to that class. Moreover, in many cases, a sub-criterion may belong to more than one class (e.g., the quality of delivery will belong to both the quality and delivery classes).

Subsequently, a bibliometric analysis of the selected articles takes place in order to obtain and condense a large amount of bibliographic information. The assumptions of this process are described in Section 3.2. The output is a plotted knowledge map containing the criteria of a sustainable supplier. This process yields a set of clusters with assigned criteria. The input file was modified on the basis of the pre-elaborated set of criteria and sub-criteria. As the main purpose is to extract and classify criteria and sub-criteria, other information such as author, publication date, and title is omitted. Moreover, the analysis of the keywords alone is insufficient, as it does not contain information about the criteria that are crucial for the construction of the knowledge map. Its further elaboration helps in taxonomy construction. Therefore, VOSviewer is fed with data about the items in the network and the links between the items. This process allows for building a map and obtaining a classification of clusters of related items. The map was computed using the association strength method, which normalizes the strength of the links between items.
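As an illustration of what such a normalization does, the sketch below uses the commonly cited form of association strength, $s_{ij} = c_{ij} / (t_i t_j)$, where $c_{ij}$ is the number of co-occurrences of items $i$ and $j$ and $t_i$ is the total number of co-occurrences of item $i$. This is an assumption about the formula: VOSviewer's internal implementation may apply an additional constant scaling factor, and the function names and the flat row-major matrix layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Total number of co-occurrences of item i with all other items
 * (row sum of the n-by-n row-major matrix c, ignoring the diagonal). */
double total_cooccurrences(const double *c, size_t n, size_t i) {
    double total = 0.0;
    for (size_t j = 0; j < n; j++) {
        if (j != i) total += c[i * n + j];
    }
    return total;
}

/* Fill s with s[i][j] = c[i][j] / (t[i] * t[j]) for i != j and 0 otherwise.
 * Row totals are recomputed on demand to keep the sketch free of allocation;
 * a real implementation would cache them. */
void association_strength(const double *c, double *s, size_t n) {
    assert(c != NULL && s != NULL);
    for (size_t i = 0; i < n; i++) {
        double ti = total_cooccurrences(c, n, i);
        for (size_t j = 0; j < n; j++) {
            double denom = ti * total_cooccurrences(c, n, j);
            s[i * n + j] = (i != j && denom > 0.0) ? c[i * n + j] / denom : 0.0;
        }
    }
}
```

The division by both totals means that two frequently occurring items are not reported as strongly linked merely because each of them co-occurs with many other items.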

Figure 7 depicts the items, each indicated by a label and, by default, also by a circle. The size of a label and its circle reflects the importance of the item. Overall, 126 clusters were defined. The items grouped in a cluster represent the criteria that specify the sustainable supplier's selection. Items containing sub-items are arranged in the same cluster and are related to the main criterion. The colors represent the groups of related items. The distance between items indicates how closely related they are. The size of a circle reflects the total number of co-occurrences of the item and thus its contribution.

Figure 8 presents the density map, where each point has a color (ranging from blue to green to yellow) that depends on the density of keywords at that point. The color of a point is closer to yellow when there are more items in the neighborhood of the point and the weights of these items are higher.

**Figure 7.** The network visualization of selected items. Source: Personal elaboration using VOSviewer software [62].

In turn, the color of a point is closer to blue when there are fewer items in the neighborhood of the point and the weights of the neighboring items are lower.

As a result, the taxonomy elaborated on the basis of the cluster construction can be implemented in the OWL language. The final set of criteria represents the identified items and covers 8261 elements.

#### *4.2. Ontology Construction and Validation*

The knowledge acquisition process is described in Section 3.3. The considered domain is sustainable supplier criteria. An in-depth analysis of the selected articles, combined with bibliometric analysis, supports the process of acquiring knowledge and plotting a map of the knowledge domain. This is followed by specification and conceptualization of knowledge, formalization, integration, and implementation in the OWL language. In this way, the knowledge derived from unstructured data was represented in a structured form. The ontology construction process requires the specification of individuals (concepts), classes, and relations, as well as restrictions, rules, and axioms. Exemplary classes, object properties, and datatype properties are presented in Tables 3 and 4. Figure 9 shows a small fragment of the class hierarchy. Each class contains sub-classes. The exemplary class technology is shown in Figure 10 with its assigned sub-classes. The ontology also provides information about suppliers' profiles (Figures 11 and 12).

**Figure 9.** Selected criteria of the constructed ontology. Source: Personal elaboration using Protégé software [63].

**Figure 10.** Selected criterion technology with sub-criteria. Source: Personal elaboration using Protégé software [63].

The implementation uses Protégé-OWL API [63] to work with the OWL ontologies and DL query mechanism to manipulate the different constituents of the ontology. The formal description was performed using the description logic (DL) standard. The formal description of the developed knowledge representation using DL allows for machine processing, sharing, reusing, and, finally, populating new knowledge. The evaluation process of the elaborated ontology was performed using the competency questions and implemented using the description logic query mechanism. This process aims to check the coherence and correctness of the constructed ontology using reasoning mechanisms. For a consistent ontology, the output is a result set.

**Figure 11.** An example of a sustainable supplier profile. Source: Personal elaboration using Protégé software [63].

**Figure 12.** Description of sustainable supplier profile. Source: Personal elaboration using Protégé software [63].

The first example shows how to ask about sustainable supplier criteria in terms of flexibility, quality, responsiveness, and delivery. A rule-based query is created to find results that meet a defined set of criteria. Query 1 is executed by the code, as shown in Table 6.

**Table 6.** The working example of the 1st query.



The second exemplary query aims to demonstrate how to find sustainable supplier criteria in the context of quality, reputation, and delivery. The sub-criteria were predefined, including quality of product, quality ISO 9000, delivery and service, delivery on time, and reputation of the supplier. The query was executed using a reasoner. The code is shown in Table 7.

**Table 7.** The working example of the 2nd query.


These queries represent only some of the possibilities of using the knowledge base to extract information. The examples are attached in the supplementary materials (see: JAPE examples: JAPE examples.zip). Given the large number of criteria included in the knowledge base, many different combinations of queries can be built. As a result, the user will also be able to indicate the profile of the preferred supplier and to identify the source of the criteria. Combining the knowledge base with additional modules or knowledge bases containing information on, for example, indicators offers the prospect of a comprehensive source of knowledge in the field of sustainable supplies and suppliers.

#### *4.3. Semantic Annotation and Ontology Population*

The test corpus consists of a set of sustainable supplier reports, papers, and other data gathered from web resources. The use of ANNIE, together with selected processing resources (PRs) dedicated to information extraction, enabled various extraction tasks to be performed (described in detail in Section 3.4). Running these PRs starts the processing of the corpus of documents. The corpus may contain various text documents such as scientific articles, report sheets, and plain text, as well as links to websites. As a result, a set of basic annotations is provided. In order to extend the built-in set of annotations, custom annotations with specific constraints and rules were created. The created annotations depend on what the user wants to search for and how to classify it. Figure 13 displays exemplary annotations that aim to find the criteria related to technology, transport, and strategic features. The criteria found in the document body are highlighted in the color assigned to them. It is also possible to add additional features.

**Figure 13.** Displaying the exemplary annotations from the text (web resource). Source: Personal elaboration using GATE software [65].

The implementation of the presented approach using semantic annotation and ontology population requires the use of tools included in this environment and, thus, the installation of new plugins for working with ontologies. OWLIM Ontology plugin and GATE Ontology Editor were used to work with ontology (Figure 14). The ontology was created in the Protégé environment [63]; however, to work with GATE and enable semantic annotation and ontology population, available GATE plugins were used in this part of the experiments.

Within the ontology population, it is possible to create specific rules designed to find and classify selected concepts. Hence, the next step is to use the JAPE Transducer, which defines the rules and recognizes regular expressions in annotations of documents. Figure 15 displays the partial results of these phases. The working example of the rule named Quality1 demonstrates the applicability of JAPE rules. Many such rules were created to carry out the tests, and the rules can be extended with additional elements. Figure 16 displays the partial results of the applied rule Quality1. The execution of the JAPE rule for extracting attribute values for rule Quality1 is shown in Figure 17.

**Figure 14.** Displaying the ontology using the OWLIM Ontology plugin and GATE Ontology Editor. Source: Personal elaboration using GATE software [65].

**Figure 15.** The exemplary JAPE rule "Quality1". Source: Personal elaboration using GATE software [65].

The presented approach offers a semi-automatic, supervised ontology population. By using semantic annotation, it is possible to annotate a relevant phrase, for example, "Quality of supply", as a criterion related to sustainable suppliers and link it to an ontology instance. As a consequence, new knowledge is added to the ontology. The reasoning mechanism then allows the selected phrase to be classified as a quality criterion. It can therefore be inferred from the ontology that "Quality of supply" is a criterion associated with a given supplier profile. For an implemented ontology, the class feature can be used on the LHS of a JAPE rule. When the class value is matched, the ontology is checked for subsumption: the pattern {Lookup.class == Quality} matches a Lookup annotation whose class feature has the value Quality or any subclass of it (Figure 18).


**Figure 16.** Populated ontology after applying the created rules. Source: Personal elaboration using Protégé software [63].


**Figure 17.** The execution of the JAPE rule for extracting attribute values for rule Quality1. Source: Personal elaboration using GATE software [65].

**Figure 18.** The execution of the LHS JAPE rule for extracting attribute values for rule QualityLookup. Source: Personal elaboration using GATE software [65].

Ontologies are useful for encoding the information found. Applying the created rules to a given corpus of documents makes it possible to extract knowledge and assign it to classes and instances in the ontology (Figures 16 and 19). The richer NE tagging and the application of JAPE rules aim to disambiguate the instances. The modified ontology is then loaded using Protégé software [63]. In this way, the user has control over the development of the ontology, its population, and the updating of data. In order to further develop the ontology, rules can be created automatically from a single pattern, with one rule for each object property to be populated.

**Figure 19.** Graphical visualization of the part of populated ontology after applying the created rules. Source: Personal elaboration using Protégé software [63].

#### *4.4. Validation and Evaluation*

In order to evaluate and validate the obtained ontology, a reasoning mechanism was applied. Two reasoners were used: HermiT 1.4.3.456 and Pellet. Neither of them detected any inconsistency in the loaded ontology (Figure 20).


**Figure 20.** The log results after using HermiT and Pellet reasoners. Source: Personal elaboration using Protégé software [63].

Other ontology assessment and validation methods require the use of a master ontology, so these measures cannot be used in this case. For example, an ontology can be evaluated using metric-based evaluation, including relationship richness, attribute richness, and class richness. However, to evaluate quality using these metrics, a similar base ontology is needed. Apart from that, it is possible to evaluate the ontology using the dedicated balanced distance metric (BDM), but a reference ontology, a test set, and a training set are also necessary.

#### **5. Conclusions**

This paper proposed an ontology-based approach for knowledge acquisition from text for the sustainable supplier selection domain. The presented solution showed the process of acquiring complex relationships from texts and encoding them in the form of rules. As a result, the existing domain ontology was successfully enriched with new knowledge, reaching higher relational expressiveness and supporting reasoning and the production of new facts.

This process required the use of various techniques and tools, such as VOSviewer for plotting knowledge domain maps, the Protégé environment for implementing and managing the OWL ontology, GATE software with NLP tools, text matching techniques, and plugins for deducing different atoms, and JAPE rules for capturing deductive knowledge in the form of new rules. The evaluation process was performed using the reasoners HermiT 1.4.3.456 and Pellet.

The essential contribution of the work covers the following:

- Developing an ontology-based framework to deal with distributed knowledge representation;
- Developing a domain ontology that stores various information about sustainable suppliers, which supports various knowledge management aspects, associating dynamic data delivered from external sources with predefined information gathered in the ontology;
- Constructing a knowledge base with rules and queries using JAPE;
- Checking the consistency and testing the use of the ontology in different scenarios in the domain of sustainable supplier selection and applying rule-based reasoning.

The presented ontology provides independent knowledge about criteria for sustainable supplier selection, which is supported by an analysis of the scientific literature. The new knowledge can be incorporated into any database, knowledge base, or information system. This form of storing knowledge offers machine-readable access and semantic data handling. Additionally, the proposed approach made it possible to:

- Increase the body of knowledge on the ontology for the sustainable supplier domain by providing a systematic keywords map of the subject and grasping the main criteria in the research field;
- Handle knowledge domain;
- Reduce time for searching for relevant information;
- Improve the accuracy of search results that suit the user's specific needs;
- Provide quick updates with new knowledge.

However, there are still some limitations that need to be addressed in future research. Further refinements to the presented approach include increasing the level of automation of phases that currently require manual work. In particular, a way to automate JAPE rule definitions and prepare patterns is currently under development. The use of the reasoning abilities provided by the ontology to generate new JAPE rules, starting with patterns of manually specified JAPE rules, is also a promising direction and an extension of this work.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics11234012/s1; the set of criteria: Sustainable\_Supplier\_Criteria.xls; JAPE examples: JAPE examples.zip.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Parallel Tiled Code for Computing General Linear Recurrence Equations**

**Włodzimierz Bielecki †,‡ and Piotr Błaszyński \*,‡**

Faculty of Computer Science and Information Systems, West Pomeranian University of Technology in Szczecin, 70-322 Szczecin, Poland; wbielecki@zut.edu.pl


‡ These authors contributed equally to this work.

**Abstract:** In this article, we present a technique that allows us to generate parallel tiled code to calculate general linear recurrence equations (GLREs). That code deals with multidimensional data and is computing-intensive. We demonstrate that the data dependencies in the original code computing GLREs do not allow us to generate any parallel code because there is only one solution to the time partition constraints built for that program. We show how to transform the original code into another one that exposes dependencies such that there are two linearly independent solutions to the time partition constraints derived from these dependencies. This allows us to generate parallel 2D tiled code computing GLREs. The wavefront technique is used to achieve parallelism, and the generated code conforms to the OpenMP C/C++ standard. The experiments that we conducted with the resulting parallel 2D tiled code show that this code is much more efficient than the original serial code computing GLREs. The performance improvement is achieved through parallelism and better locality of the target code.

**Keywords:** computing-intensive data; dynamic programming; loop nest tiling; parallel code; OpenMP C/C++

#### **1. Introduction**

The purpose of this document is to show a way to produce a parallel program for computing general linear recurrence equations (GLREs). This code can also be tiled.

A recurrence equation expresses each element of a sequence or multidimensional array of values as a function of the preceding ones. Recurrence equations have a broad spectrum of applications, for example: population dynamics, spatial ecology, analysis of algorithms, binary search, digital signal processing, time series analysis, and theoretical and empirical economics. Such applications deal with multidimensional data and are computing-intensive. To reduce their execution time, those applications should be parallelized and run on modern multicore machines.

Many sequential GLRE solutions have been implemented in a variety of development environments. The main problems of these programs are the long time spent in loops and the poor cache performance for large input data sizes, which may make them inapplicable.

Loop parallelization and tiling (blocking) can be used to enhance the efficiency of a sequential program. Blocking is a commonly used technique to improve code performance. It allows us to generate a parallel program with greater granularity of code and data locality that will be executed in a multithreaded environment with both distributed and shared memory.

A classic way to automatically parallelize and tile a loop nest is based on affine transformations and includes the following steps: extracting dependencies available in that nest, forming time partition constraints using the obtained dependencies, finding the maximum number of linearly independent solutions to those constraints, and finally generating target code [1,2]. If the number of linearly independent solutions to those constraints is two or more, then tiled parallel code can be generated [1,2].

**Citation:** Bielecki, W.; Błaszyński, P. Parallel Tiled Code for Computing General Linear Recurrence Equations. *Electronics* **2021**, *10*, 2050. https://doi.org/10.3390/electronics10172050

Academic Editors: Juan M. Corchado and Xianzhi Wang

Received: 22 June 2021 Accepted: 21 August 2021 Published: 25 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In this paper, we examine the loop nest in the C language presented in Listing 1, which computes GLREs. That code is taken from https://www.netlib.org/benchmark/livermorec (accessed on 24 August 2021).

**Listing 1.** Original loop nest computing GLREs.

```
for (l = 1; l <= loop; l++) {
    for (i = 1; i < n; i++) {
        for (k = 0; k < i; k++) {
            w[i] += b[k][i] * w[(i - k) - 1];
        }
    }
}
```
In Section 2, we demonstrate that there exists a single solution to the time partition constraints for the program in Listing 1; hence, direct parallelization and tiling of the program is not feasible using affine transformations.
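To make the loop nest in Listing 1 concrete, the following is a minimal self-contained sketch of the same serial computation. It assumes `double` data and stores `b` as a flat row-major array; the function name `glre_serial` and this layout are illustrative assumptions, not part of the original benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Serial GLRE kernel of Listing 1 on flat arrays.
 * w has n elements; b is an n-by-n matrix stored row by row,
 * so b[k][i] in the listing corresponds to b[k * n + i] here. */
void glre_serial(double *w, const double *b, size_t n, size_t loop) {
    assert(w != NULL && b != NULL);
    for (size_t l = 1; l <= loop; l++) {
        for (size_t i = 1; i < n; i++) {
            for (size_t k = 0; k < i; k++) {
                /* w[i] reads w[i-1], ..., w[0], which were written by
                 * earlier iterations of the i loop: the iterations of
                 * the i loop cannot simply be run in parallel. */
                w[i] += b[k * n + i] * w[(i - k) - 1];
            }
        }
    }
}
```

For example, with `n = 4`, `loop = 1`, all coefficients in `b` equal to 1, and `w` initialized to {1, 0, 0, 0}, each `w[i]` becomes the sum of all preceding elements, so the result is {1, 1, 2, 4}.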

Our proposal is to transform the code in Listing 1 into another one whose dependencies allow us to form time partition constraints for which there exist two linearly independent solutions, which in turn allows us to generate parallel 2D tiled code.

The main contributions of the article are as follows.


The rest of this article is organized as follows. In Section 2, we provide background on dependency analysis and parallel code generation. In Section 3, we show how to generate a parallel 2D tiled program computing GLREs. In Section 4, we analyze related work on parallel code generation for comparable cases. In Section 5, we discuss the results of the experiments performed. Conclusions of the work are presented in Section 6.

#### **2. Background**

Generating serial tiled code significantly improves data locality, which results in improved program performance. Generating parallel tiled code increases performance further because such code can be run by multiple threads on many cores.

To the best of our knowledge, no existing technique can generate tiled code for the loop nest examined in this paper and presented in Listing 1. All known techniques based on affine transformations and/or the transitive closure of dependence graphs [1,2] are unable to generate any tiled code (serial and/or parallel) for it. Those techniques are within the class of reordering transformations. They do not introduce any additional computations to generated target code in comparison with an original code; they only reorder iterations of the original code, allowing for tiling and/or parallelism.

The other known techniques, for example, [3,4], that are not within reordering transformations allow us to parallelize and tile code similar to some extent to the code in Listing 1 but not exactly the same. All of those techniques introduce additional computations to generated target code in comparison with those in an original one. That prevents achieving the maximal target code performance.

Thus, there is the need to develop techniques belonging to the class of reordering transformations to tile and/or parallelize the code presented in Listing 1, allowing for generating code that does not include any additional computations in comparison with those of an original one. In this paper, we present an approach to resolving this challenge, allowing for generation of parallel tiled code.

In the loop nest, we can find dependencies between instruction instances in the iteration space of that loop nest—the collection of all statements executed inside that loop nest. A dependence is a situation when two instruction instances access the same location in memory and at least one of those accesses is a write. Each dependence is represented with its source and destination but only if the source is executed before the destination. Most commonly, dependencies are expressed by relations that map dependency sources to dependency destinations. Dependence sources and destinations are represented with iteration vectors. The notation of such a relation is the following.

*R* := [*PARAMS*] → {[*input tuple*] → [*output tuple*] | *constraints*}, where [*PARAMS*] is the list of relation parameters, [*input tuple*] represents dependence sources, [*output tuple*] represents dependence destinations, and *constraints* are the constraints—a system of affine equalities and inequalities on parameters and tuple variables.

In the case of dependencies when the dimensions of the left and right tuples of the relation are the same, the distance vector is the difference between the iteration vector of a dependence target and the iteration vector of the corresponding dependence source.

A dependence vector is uniform if all its elements are constants.

To extract the dependencies present in the loop nest, we use the polyhedral model returned by PET [5] and the iscc calculator [6], which performs calculations on polyhedral sets and relations [7]. The iscc calculator is an interactive interface to the PET and barvinok libraries; barvinok counts points in polytopes and is available online at https://repo.or.cz/barvinok.git (accessed on 24 August 2021). We also used the iscc calculator to calculate distance vectors and generate target code.

To parallelize and tile the loop nest, a time partitioning constraint should be created [1] that states that if iteration *I* of statement *S*1 depends on iteration *J* of statement *S*2, then *I* must be assigned to a time partition that is executed no earlier than the partition containing *J*, i.e., schedule(*J*) ≤ schedule(*I*), where schedule(*I*) and schedule(*J*) denote the discrete execution time of iterations *I* and *J*, respectively.

Linearly independent solutions to the time partition constraints are needed to create schedules for each occurrence of a single instruction of the loop nest, allowing for parallelization and tiling of code. The schedule defines a strict partial order, i.e., an irreflexive and transitive relation on the statement instances that determines the order in which they are or should be executed. Details of the use of linearly independent schedules for generating parallel tiled code can be found in a number of articles, for example, in article [2].

We should extract as many linearly independent solutions to the time partition constraints as possible. The degree of parallelism of the target code and the dimension of the tile are higher when more independent solutions are extracted [1]. When there is a single solution to the time partition constraints, parallelization and tiling of the corresponding loop nest using affine transformations is not possible [1].
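For an affine schedule, the time partition constraint reduces to a sign condition on distance vectors. The short derivation below is only a sketch: it assumes a one-dimensional schedule of the form schedule(*I*) = *h*<sup>*T*</sup>*I*, and the symbols *I*<sub>src</sub>, *I*<sub>dst</sub>, and *d* are introduced here for illustration (they denote a dependence source, the corresponding destination, and their distance vector).

$$\begin{aligned}
\text{schedule}(I_{\text{dst}}) - \text{schedule}(I_{\text{src}}) &= h^T I_{\text{dst}} - h^T I_{\text{src}} \\
&= h^T (I_{\text{dst}} - I_{\text{src}}) = h^T d \ge 0 .
\end{aligned}$$

Under this assumption, every distance vector *d* contributes one inequality on the unknown coefficients of *h*. If one component of *d* can grow without bound in both the positive and the negative direction while the other components stay fixed, the inequality can hold only if the matching coefficient of *h* is 0.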

#### **3. Methods. Parallel Tiled Code Generation**

Using PET and the iscc calculator, we extract dependencies available in the code in Listing 1; they are presented with the following relation.

$$\begin{aligned}
R := {} & (\mathit{loop}, n) \to \{ (l, i, k) \to (l', i, k') \mid l > 0 \land i < n \land 0 \le k < i \land l < l' \le \mathit{loop} \land 0 \le k' < i \} \; \cup \\
& (\mathit{loop}, n) \to \{ (l, i, k) \to (l', -1 + i - k, k') \mid l > 0 \land i < n \land k \ge 0 \land l < l' \le \mathit{loop} \land 0 \le k' \le -2 + i - k \} \; \cup \\
& (\mathit{loop}, n) \to \{ (l, i, k) \to (l', i', -1 - i + i') \mid l > 0 \land 0 \le k < i \land l \le l' \le \mathit{loop} \land i < i' < n \} \; \cup \\
& (\mathit{loop}, n) \to \{ (l, i, k) \to (l, i, k') \mid 0 < l \le \mathit{loop} \land i < n \land k \ge 0 \land k < k' < i \},
\end{aligned}$$

where *R* is the relation name; *loop* and *n* are parameters; ∪ is the union operation of sets (relation *R* is composed as the set union of simpler relations); the tuple before the sign → of each simpler relation is the left tuple of this relation, for example, for the first simpler relation, the left tuple is presented with variables (*l*, *i*, *k*); the tuple after the sign → of each simpler relation is the right tuple of this relation, for example, for the first simpler relation, the right tuple is presented with variables (*l*′, *i*, *k*′); the expressions after the sign | are the constraints of each simpler relation, each constraint is represented with the conjunction of inequalities built on tuple variables and parameters; and ∧ is the logical AND operator.

The left tuple of each simpler relation represents dependence sources, whereas the right represents dependence destinations.

Applying the *deltas* operator of the iscc calculator to relation *R*, we obtain the three distance vectors presented with set *D* below.

$$\begin{aligned}
D := {} & (\mathit{loop}, n) \to \{ (l, i, k) \mid 0 \le l < \mathit{loop} \land ((l > 0 \land i < 0 \land i < k < n + 2i) \lor (i > 0 \land -n + 2i < k < i)) \} \; \cup \\
& (\mathit{loop}, n) \to \{ (0, 0, k) \mid \mathit{loop} > 0 \land 0 < k \le -2 + n \} \; \cup \\
& (\mathit{loop}, n) \to \{ (l, 0, k) \mid 0 < l < \mathit{loop} \land 2 - n \le k \le -2 + n \},
\end{aligned}$$

where the notation is the same as for relation *R* above, except that each set is represented with a single tuple.

Each conjunct in the set above represents a particular distance vector.

Taking into account the constraints of those distance vectors, we simplify them to the following form.

$$\begin{aligned}
D := \{ & (a_1, a_2, a_3) \mid a_1 \ge 0 \land -\infty \le a_2 \le \infty \land -\infty \le a_3 \le \infty; \\
& (0, 0, b_3) \mid b_3 > 0; \\
& (c_1, 0, c_3) \mid c_1 > 0 \land -\infty \le c_3 \le \infty \}.
\end{aligned}$$

The time partition constraints created from the resulting distance vectors according to article [1] are as follows.

$$h_1 \ast a_1 + h_2 \ast a_2 + h_3 \ast a_3 \ge 0,\tag{1}$$

$$h_3 \ast b_3 \ge 0,\tag{2}$$

$$h_1 \ast c_1 + h_3 \ast c_3 \ge 0,\tag{3}$$

where *h*1, *h*2, *h*<sup>3</sup> are the unknowns.

Taking into consideration that −∞ ≤ *a*<sup>2</sup> ≤ ∞, −∞ ≤ *a*<sup>3</sup> ≤ ∞, −∞ ≤ *c*<sup>3</sup> ≤ ∞, we can suppose that to satisfy all the above constraints, *h*<sup>2</sup> and *h*<sup>3</sup> should be 0, i.e., *h*<sup>2</sup> = *h*<sup>3</sup> = 0. Thus, the above constraints can be rewritten as follows.

$$h_1 \ast a_1 \ge 0,\tag{4}$$

$$h_1 \ast c_1 \ge 0.\tag{5}$$

Hence, we may conclude that there exists a single solution to constraints (1), (2), and (3), namely (1, 0, 0)<sup>*T*</sup>. This means that the loop nest formed by all three loops in Listing 1 cannot be parallelized and tiled by means of affine transformations.
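The conclusion above can be sanity-checked by brute force for small fixed parameters. The sketch below is not the PET/iscc workflow used in the paper: it assumes a simple memory-based dependence test (two iterations are dependent if they touch the same element of `w` and at least one access is a write), enumerates all ordered pairs of iterations of Listing 1, and tests candidate vectors whose coefficients are restricted to {−1, 0, 1}. The function names are hypothetical, and a check for one value of `loop` and `n` is an illustration, not a proof of the parametric result.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if h1*dl + h2*di + h3*dk >= 0 for every distance vector
   (dl, di, dk) between dependent iterations of Listing 1, where the
   source iteration (l, i, k) is executed before the target (l2, i2, k2). */
bool is_valid_hyperplane(int loop, int n, int h1, int h2, int h3) {
    assert(loop >= 1 && n >= 2);
    for (int l = 1; l <= loop; l++)
      for (int i = 1; i < n; i++)
        for (int k = 0; k < i; k++)
          for (int l2 = l; l2 <= loop; l2++)
            for (int i2 = 1; i2 < n; i2++)
              for (int k2 = 0; k2 < i2; k2++) {
                  /* The target must come later in lexicographic order. */
                  bool later = l2 > l ||
                               (l2 == l && (i2 > i || (i2 == i && k2 > k)));
                  /* (l, i, k) writes and reads w[i] and reads w[i - k - 1];
                     array b is read-only and causes no dependence. */
                  bool dependent = i == i2 || i == i2 - k2 - 1 || i - k - 1 == i2;
                  if (!later || !dependent) continue;
                  if (h1 * (l2 - l) + h2 * (i2 - i) + h3 * (k2 - k) < 0)
                      return false;
              }
    return true;
}

/* Counts the valid non-zero vectors (h1, h2, h3) with entries in {-1, 0, 1}. */
int count_valid_hyperplanes(int loop, int n) {
    int valid = 0;
    for (int h1 = -1; h1 <= 1; h1++)
        for (int h2 = -1; h2 <= 1; h2++)
            for (int h3 = -1; h3 <= 1; h3++)
                if ((h1 || h2 || h3) && is_valid_hyperplane(loop, n, h1, h2, h3))
                    valid++;
    return valid;
}
```

For `loop = 2` and `n = 6`, the only candidate that survives is (1, 0, 0), which agrees with the single solution derived above.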

Next, we try to parallelize and tile only the two inner loops *i* and *k* in the loop nest in Listing 1. For this purpose, we keep the outermost loop *l* serial and extract the dependencies for inner loops *i* and *k*, described with the relation below.

$$\begin{aligned}
R := {} & (n) \to \{ (i, k) \to (i, k') \mid i < n \land k \ge 0 \land k < k' < i \} \; \cup \\
& (n) \to \{ (i, k) \to (i', -1 - i + i') \mid 0 \le k < i \land i < i' < n \}.
\end{aligned}$$

Applying the *deltas* operator of the iscc calculator to relation *R*, we obtain the two distance vectors presented with set *D* below.

$$\begin{aligned}
D := {} & (n) \to \{ (i, k) \mid i > 0 \land -n + 2i < k < i \} \; \cup \\
& (n) \to \{ (0, k) \mid 0 < k \le -2 + n \}.
\end{aligned}$$

Next, we simplify the representation of the distance vectors above to the following form.

$$\begin{aligned}
D := \{ & (a_1, a_2) \mid a_1 > 0 \land -\infty \le a_2 \le \infty; \\
& (0, b_2) \mid b_2 > 0 \}.
\end{aligned}$$

The time partition constraints formed on the basis of the distance vectors above are the following.

$$h_1 \ast a_1 + h_2 \ast a_2 \ge 0,\tag{6}$$

$$h_2 \ast b_2 \ge 0,\tag{7}$$

where *h*1, *h*<sup>2</sup> are the unknowns.

Taking into consideration that −∞ ≤ *a*<sup>2</sup> ≤ ∞, we can deduce that to satisfy constraints (6) and (7), *h*<sup>2</sup> should be 0, i.e., *h*<sup>2</sup> = 0.

So, there exists a single solution to constraints (6) and (7), namely (1, 0)<sup>*T*</sup>, and we conclude that, provided the outermost loop *l* is serial, the two inner loops *i* and *k* cannot be parallelized and tiled by means of affine transformations.

To cope with that problem, we transform the code in Listing 1 to improve dependence properties. With this goal, we apply the following schedule to each iteration of the code in Listing 1:

$$(l, i, k)^T \to (l,\ t = i - k)^T.$$

This schedule implies that each iteration of the code in Listing 1, represented with iteration vector (*l*, *i*, *k*)<sup>*T*</sup>, is mapped to the two-dimensional time (*l*, *t* = *i* − *k*)<sup>*T*</sup>. It means that iterations of loop *l* should be executed serially, while for a given value of iterator *l*, iteration (*i*, *k*)<sup>*T*</sup> should be executed at time *t* = *i* − *k*. This time guarantees that each iteration (*i*, *k*)<sup>*T*</sup> is executed when all its operands are ready. To justify this, note that for iteration (*i*, *k*)<sup>*T*</sup>, operand *b*[*k*][*i*] is input data; hence, its value is ready at time 0, and operand *w*[(*i* − *k*) − 1] is ready at time (*i* − *k*) − 1, when the actual value of this operand has already been calculated and written to memory. Thus, iteration (*i*, *k*)<sup>*T*</sup> can be executed at time *t* = *i* − *k*, i.e., one time unit after operand *w*[(*i* − *k*) − 1] becomes ready.

In other words, the schedule above is based on the data flow software paradigm [8].
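The readiness argument for operand *w*[(*i* − *k*) − 1] can be checked mechanically for one value of *l*. The following sketch is illustrative only; the names `schedule_time`, `operands_ready`, and `MAX_N` are hypothetical. It records the latest time at which each element of `w` is updated under the schedule *t* = *i* − *k* and verifies that every iteration reads its operand strictly after that time.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_N 256  /* assumed upper bound on n for this sketch */

/* Time assigned to iteration (i, k) by the schedule t = i - k. */
int schedule_time(int i, int k) { return i - k; }

/* Returns true if, for every iteration (i, k) with 1 <= i < n and
   0 <= k < i, the last update of w[(i - k) - 1] happens strictly
   before the time at which the iteration itself is executed. */
bool operands_ready(int n) {
    assert(n >= 1 && n <= MAX_N);
    int last_write[MAX_N] = {0};  /* w[0] is never written: ready at time 0 */

    /* Iteration (i, k) updates w[i]; remember the latest such time. */
    for (int i = 1; i < n; i++)
        for (int k = 0; k < i; k++)
            if (schedule_time(i, k) > last_write[i])
                last_write[i] = schedule_time(i, k);

    /* Iteration (i, k) reads w[(i - k) - 1]; it must already be final. */
    for (int i = 1; i < n; i++)
        for (int k = 0; k < i; k++)
            if (last_write[(i - k) - 1] >= schedule_time(i, k))
                return false;
    return true;
}
```

In this sketch, element `w[j]` receives its last update at time *j*, and the iterations that read it run at time *j* + 1, which matches the reasoning above.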

To generate target serial code, we form the following relation, which maps each statement instance within the iteration space of the code in Listing 1 to the two-dimensional schedule below.

$$\mathit{CODE} := (\mathit{loop}, n) \to \{ (l, i, k) \to (l,\ t = i - k) \mid \mathit{loop} > 0 \land 0 < l \le \mathit{loop} \land 0 < i < n \land 0 \le k < i \},$$

where the constraints

$$\mathit{loop} > 0 \land 0 < l \le \mathit{loop} \land 0 < i < n \land 0 \le k < i$$

define the iteration space of the code in Listing 1. Applying the *iscc* codegen operator to the relation above, we get the pseudocode shown in Listing 2.

Listing 2: Target serial pseudocode.

```
for (int counter = 1; counter <= loop; counter += 1)
  for (int var1 = 1; var1 < n; var1 += 1)
    for (int var2 = var1; var2 < n; var2 += 1)
      (counter, var2, -var1 + var2); // pseudostatement
```
We transform the pseudocode in Listing 2 to C code, taking into account that in that pseudocode, variables *counter*, *var*1, and *var*2 correspond to variables *l*, *t*, and *i*, respectively, in the tuple of set *CODE*; the second variable *var*2 in the pseudostatement relates to variable *i*, while the third expression −*var*1 + *var*2 corresponds to variable *k* in the tuple of set *CODE*. Thus, we replace the pseudostatement in the code in Listing 2 with the statement

*w*[*i*] += *b*[*k*][*i*] ∗ *w*[(*i* − *k*) − 1];

from Listing 1, replacing variables *i* and *k* with variable *var*2 and the expression −*var*1 + *var*2, respectively. As a result, we obtain the compilable program fragment presented in Listing 3.

Listing 3: Target sequential compilable program fragment.

```
for (int counter = 1; counter <= loop; counter += 1)
  for (int var1 = 1; var1 < n; var1 += 1)
    for (int var2 = var1; var2 < n; var2 += 1)
      w[var2] += b[-var1 + var2][var2] * w[(var2 - (-var1 + var2)) - 1];
```
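The variable substitution used to obtain Listing 3 can be checked with a small sketch that maps every pair (*var*1, *var*2) back to (*i*, *k*) and counts the visits. The function name `same_iteration_set` and the bound `MAX_N` are hypothetical; the check covers one value of *counter*, since the outermost loop is unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_N 64  /* assumed upper bound on n for this sketch */

/* Returns true if the two inner loops of Listing 3 visit every iteration
   (i, k) of Listing 1 with 1 <= i < n and 0 <= k < i exactly once. */
bool same_iteration_set(int n) {
    assert(n >= 1 && n <= MAX_N);
    static int visits[MAX_N][MAX_N];
    memset(visits, 0, sizeof visits);

    for (int var1 = 1; var1 < n; var1++)
        for (int var2 = var1; var2 < n; var2++) {
            int i = var2;          /* second element of the pseudostatement */
            int k = -var1 + var2;  /* third element of the pseudostatement */
            if (i < 1 || i >= n || k < 0 || k >= i)
                return false;      /* outside the iteration space of Listing 1 */
            visits[i][k]++;
        }

    for (int i = 1; i < n; i++)
        for (int k = 0; k < i; k++)
            if (visits[i][k] != 1)
                return false;
    return true;
}
```

A result of `true` means that the transformed loops enumerate the same set of statement instances as the original ones, only in a different order.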
The target serial code in Listing 3 is within the scope of reordering transformations. It performs the same computations as the initial code in Listing 1 but in a different order. It is well known that a reordering transformation of a code is correct if (1) it executes the same computations as the initial code and (2) it respects all the dependencies that appear in that code [1]. The transformed code is correct because it satisfies condition (1) and, as explained below, condition (2).

There exist three kinds of dependencies in the code presented in Listing 3: data flow dependencies (some statement instance first generates a result, and that result is then used by another statement instance; those instances belong to different time units represented with the value of iterator counter), antidependencies (some statement instance first reads a result, and that result is then updated by another statement instance), and output dependencies (two statement instances write their results to the same memory location).

Data flow dependencies are respected due to the fact that in the target code, the execution of a statement instance being the target of each data dependence starts only when all the arguments (data) of this operation are prepared, i.e., the processing of all the instruction instances generating these arguments has already finished, and the operand values are stored in the shared part of memory. This is guaranteed because the source of each data dependence is executed at a time unit defined with the value of iterator counter that is less than the one when the corresponding target is executed.

Anti- and output dependencies are honored due to the lexicographical order of the execution of dependent statement instances within each time partition represented with the value of iterator var1.

We also experimentally confirmed that both loop nests presented in Listing 1 and Listing 3 generate correct results. The experiments used input data prepared both deterministically and randomly.
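A comparison of this kind could be set up as in the following sketch. It is not the authors' test harness: the names `glre_original`, `glre_transformed`, `same_result`, `elem_t`, and `MAX_N` are hypothetical, and the inputs come from a simple linear congruential generator. Unsigned integer data are assumed so that the comparison is exact; with floating-point data, the two nests add the terms of each `w[i]` in a different order, so the results could differ by rounding.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_N 64               /* assumed upper bound on n for this sketch */
typedef unsigned int elem_t;   /* unsigned: wrap-around is well defined */

/* Loop nest of Listing 1 (original order). */
void glre_original(int loop, int n, elem_t *w, elem_t b[][MAX_N]) {
    for (int l = 1; l <= loop; l++)
        for (int i = 1; i < n; i++)
            for (int k = 0; k < i; k++)
                w[i] += b[k][i] * w[(i - k) - 1];
}

/* Loop nest of Listing 3; (var2 - (-var1 + var2)) - 1 simplifies to var1 - 1. */
void glre_transformed(int loop, int n, elem_t *w, elem_t b[][MAX_N]) {
    for (int counter = 1; counter <= loop; counter++)
        for (int var1 = 1; var1 < n; var1++)
            for (int var2 = var1; var2 < n; var2++)
                w[var2] += b[-var1 + var2][var2] * w[var1 - 1];
}

/* Fills w and b from a seed, runs both versions, and compares the results. */
bool same_result(int loop, int n, unsigned seed) {
    assert(n >= 1 && n <= MAX_N);
    static elem_t b[MAX_N][MAX_N];
    elem_t w1[MAX_N], w2[MAX_N];
    unsigned x = seed;
    for (int i = 0; i < n; i++) {
        x = x * 1664525u + 1013904223u;  /* linear congruential step */
        w1[i] = w2[i] = x >> 16;
        for (int k = 0; k < n; k++) {
            x = x * 1664525u + 1013904223u;
            b[k][i] = x >> 16;
        }
    }
    glre_original(loop, n, w1, b);
    glre_transformed(loop, n, w2, b);
    return memcmp(w1, w2, (size_t)n * sizeof(elem_t)) == 0;
}
```

Calling `same_result` with several seeds and problem sizes gives a quick regression check whenever the transformed loop nest is edited.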

Dependencies available in the code in Listing 3 are represented with the following relation.

$$\begin{aligned}
R := {} & (\mathit{loop}, n) \to \{ (\mathit{counter}, \mathit{var1}, \mathit{var2}) \to (\mathit{counter}', 1 + \mathit{var2}, \mathit{var2}') \mid \mathit{counter} > 0 \land \mathit{var1} > 0 \land \mathit{var2} \ge \mathit{var1} \\
& \qquad \land\ \mathit{counter} \le \mathit{counter}' \le \mathit{loop} \land \mathit{var2} < \mathit{var2}' < n \} \; \cup \\
& (\mathit{loop}, n) \to \{ (\mathit{counter}, \mathit{var1}, \mathit{var2}) \to (\mathit{counter}', \mathit{var1}', \mathit{var2}) \mid \mathit{counter} > 0 \land \mathit{var1} > 0 \land \mathit{var1} \le \mathit{var2} < n \\
& \qquad \land\ \mathit{counter} < \mathit{counter}' \le \mathit{loop} \land 0 < \mathit{var1}' \le \mathit{var2} \} \; \cup \\
& (\mathit{loop}, n) \to \{ (\mathit{counter}, \mathit{var1}, \mathit{var2}) \to (\mathit{counter}', \mathit{var1}', -1 + \mathit{var1}) \mid \mathit{counter} > 0 \land \mathit{var1} \le \mathit{var2} < n \\
& \qquad \land\ \mathit{counter} < \mathit{counter}' \le \mathit{loop} \land 0 < \mathit{var1}' < \mathit{var1} \} \; \cup \\
& (\mathit{loop}, n) \to \{ (\mathit{counter}, \mathit{var1}, \mathit{var2}) \to (\mathit{counter}, \mathit{var1}', \mathit{var2}) \mid 0 < \mathit{counter} \le \mathit{loop} \land \mathit{var1} > 0 \land \mathit{var2} < n \\
& \qquad \land\ \mathit{var1} < \mathit{var1}' \le \mathit{var2} \},
\end{aligned}$$

where *loop* and *n* are parameters.

Applying the *deltas* operator of the iscc calculator to relation *R*, we obtain the three distance vectors presented below.

$$\begin{aligned}
D := {} & (\mathit{loop}, n) \to \{ (\mathit{counter}, \mathit{var1}, \mathit{var2}) \mid 0 \le \mathit{counter} < \mathit{loop} \land ((\mathit{var1} > 0 \land 0 < \mathit{var2} < n - \mathit{var1}) \\
& \qquad \lor\ (\mathit{counter} > 0 \land \mathit{var1} < 0 \land -n - \mathit{var1} < \mathit{var2} < 0)) \} \; \cup \\
& (\mathit{loop}, n) \to \{ (0, \mathit{var1}, 0) \mid \mathit{loop} > 0 \land 0 < \mathit{var1} \le -2 + n \} \; \cup \\
& (\mathit{loop}, n) \to \{ (\mathit{counter}, \mathit{var1}, 0) \mid 0 < \mathit{counter} < \mathit{loop} \land 2 - n \le \mathit{var1} \le -2 + n \}.
\end{aligned}$$

Taking into account the constraints of those distance vectors, we simplify their representation to the following form.

$$\begin{aligned}
D := \{ & (a_1, a_2, a_3) \mid a_1 \ge 0 \land -\infty \le a_2 \le \infty \land -\infty \le a_3 \le \infty; \\
& (0, b_2, 0) \mid b_2 > 0; \\
& (c_1, c_2, 0) \mid c_1 > 0 \land -\infty \le c_2 \le \infty \}.
\end{aligned}$$

The time partition constraints constructed according to article [1] are as follows.

$$h_1 \ast a_1 + h_2 \ast a_2 + h_3 \ast a_3 \ge 0,\tag{8}$$

$$h_2 \ast b_2 \ge 0,\tag{9}$$

$$h_1 \ast c_1 + h_2 \ast c_2 \ge 0,\tag{10}$$

where *h*1, *h*2, *h*<sup>3</sup> are the unknowns.

Taking into account that −∞ ≤ *a*<sup>2</sup> ≤ ∞ , −∞ ≤ *a*<sup>3</sup> ≤ ∞, −∞ ≤ *c*<sup>2</sup> ≤ ∞, we can deduce that *h*<sup>2</sup> and *h*<sup>3</sup> should be 0, i.e., *h*<sup>2</sup> = *h*<sup>3</sup> = 0 for the constraints (8), (9), and (10) to be compatible. Therefore, these constraints can be written with the following formulas.

$$h_1 \ast a_1 \ge 0,\tag{11}$$

$$h_1 \ast c_1 \ge 0.\tag{12}$$

Thus, we may conclude that there exists a single solution to constraints (8), (9), and (10), namely (1, 0, 0)<sup>*T*</sup>. This means that the loop nest formed by all three loops in Listing 3 cannot be parallelized and tiled by means of affine transformations.

Next, we try to parallelize and tile only two inner loops *var*1 and *var*2 in the loop nest presented in Listing 3. For this purpose, we make the outermost loop *counter* to be serial and extract dependencies for inner loops *var*1 and *var*2. They are expressed with the relation below.

$$\begin{aligned}
R := {} & (n) \to \{ (\mathit{var1}, \mathit{var2}) \to (1 + \mathit{var2}, \mathit{var2}') \mid \mathit{var1} > 0 \land \mathit{var2} \ge \mathit{var1} \land \mathit{var2} < \mathit{var2}' < n \} \; \cup \\
& (n) \to \{ (\mathit{var1}, \mathit{var2}) \to (\mathit{var1}', \mathit{var2}) \mid \mathit{var1} > 0 \land \mathit{var2} < n \land \mathit{var1} < \mathit{var1}' \le \mathit{var2} \}.
\end{aligned}$$

Applying the *deltas* operator of the iscc calculator to relation *R*, we obtain the two distance vectors presented below.

$$\begin{aligned}
D := {} & (n) \to \{ (\mathit{var1}, \mathit{var2}) \mid \mathit{var1} > 0 \land 0 < \mathit{var2} < n - \mathit{var1} \} \; \cup \\
& (n) \to \{ (\mathit{var1}, 0) \mid 0 < \mathit{var1} \le -2 + n \}.
\end{aligned}$$

After simplifying the representation of the distance vectors above, we obtain the following vectors.

$$\begin{aligned}
D := \{ & (a_1, a_2) \mid a_1 > 0 \land a_2 > 0; \\
& (b_1, 0) \mid b_1 > 0 \}.
\end{aligned}$$

The time partition constraints created from the distance vectors above are the following:

$$h_1 \ast a_1 + h_2 \ast a_2 \ge 0,\tag{13}$$

$$h_1 \ast b_1 \ge 0,\tag{14}$$

where *h*<sub>1</sub>, *h*<sub>2</sub> are the unknowns. There are two linearly independent solutions to the constraints above: (1, 0)<sup>*T*</sup> and (0, 1)<sup>*T*</sup>. Applying those solutions, we are able to parallelize and tile the two inner loops of the code in Listing 3 using the technique presented in paper [2].
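The two solutions can be confirmed by enumeration for a small fixed *n*. The sketch below is a brute-force illustration, not the iscc-based analysis of the paper: it assumes a memory-based dependence test on array `w` for the two inner loops of Listing 3 with *counter* fixed, and the names `depends`, `is_valid_hyperplane`, and `MAX_N` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_N 64  /* assumed upper bound on n for this sketch */

/* Iteration (var1, var2) of Listing 3 writes and reads w[var2] and reads
   w[var1 - 1]. Source (s1, s2) and target (t1, t2) are dependent if they
   touch the same element of w and at least one access is a write. */
static bool depends(int s1, int s2, int t1, int t2) {
    return s2 == t2 || s2 == t1 - 1 || s1 - 1 == t2;
}

/* Returns true if h1*d1 + h2*d2 >= 0 for every distance vector (d1, d2)
   between dependent iterations in which the source is executed first. */
bool is_valid_hyperplane(int n, int h1, int h2) {
    assert(n >= 1 && n <= MAX_N);
    for (int s1 = 1; s1 < n; s1++)
        for (int s2 = s1; s2 < n; s2++)
            for (int t1 = s1; t1 < n; t1++)
                for (int t2 = t1; t2 < n; t2++) {
                    bool later = t1 > s1 || (t1 == s1 && t2 > s2);
                    if (!later || !depends(s1, s2, t1, t2)) continue;
                    if (h1 * (t1 - s1) + h2 * (t2 - s2) < 0) return false;
                }
    return true;
}
```

For `n = 8`, both `is_valid_hyperplane(8, 1, 0)` and `is_valid_hyperplane(8, 0, 1)` hold, because every enumerated distance vector has non-negative components; this is also the property that makes rectangular tiles in the (*var*1, *var*2) space legal.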

The target parallel tiled code presented by means of the OpenMP C/C++ API is shown in Listing 4. It is generated for the best tile size equal to 24 × 54; choosing the best tile size is explained in Section 5.

Listing 4: Transformed parallel loop nest.

```
#include <math.h> /* ceil, used by ceild */

#define min(lhs, rhs) ((lhs) < (rhs) ? (lhs) : (rhs))
#define max(lhs, rhs) ((lhs) > (rhs) ? (lhs) : (rhs))
#define floord(val, d) (((val) < 0) ? -((-(val) + (d) - 1) / (d)) : (val) / (d))
#define ceild(val, d) ceil(((double)(val)) / ((double)(d)))

for (int i0 = 1; i0 <= loop; i0 += 1) {
  for (int w0 = 0; w0 <= floord(26 * n - 26, 675); w0 += 1) {
    #pragma omp parallel for
    for (int h0 = max(0, w0 - (n + 49) / 50 + 1); h0 <= min((n - 1) / 54, (25 * w0 + 24) / 52); h0 += 1) {
      for (int i1 = max(1, 54 * h0); i1 <= min(min(n - 1, 50 * w0 - 50 * h0 + 49), 54 * h0 + 53); i1 += 1) {
        for (int i2 = max(50 * w0 - 50 * h0, i1); i2 <= min(n - 1, 50 * w0 - 50 * h0 + 49); i2 += 1) {
          w[i2] += (b[-i1 + i2][i2] * w[i1 - 1]);
        }
      }
    }
  }
}
```
In that code, the outermost loop *i*0 is serial and nontiled, loops *w*0 and *h*0 enumerate tile identifiers, and loops *i*1 and *i*2 enumerate iterations within each tile. Parallelism is extracted with the wavefront technique [2] and expressed with the OpenMP directive `#pragma omp parallel for` inserted before loop *h*0, which marks this loop as parallel.
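To make the wavefront idea easier to follow, the sketch below tiles the (*var*1, *var*2) space of Listing 3 with arbitrary tile sizes and runs all tiles of one wavefront in parallel. It is a hand-written illustration, not the code produced by the technique of paper [2]: the tile sizes `t1` and `t2`, the helper names, `elem_t`, and `MAX_N` are assumptions, and the tile bounds are computed directly rather than derived with isl. Unsigned integer data are assumed so that the result can be compared exactly with Listing 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_N 128              /* assumed upper bound on n for this sketch */
typedef unsigned int elem_t;   /* unsigned: wrap-around is well defined */

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Reference: loop nest of Listing 1. */
void glre_reference(int loop, int n, elem_t *w, elem_t b[][MAX_N]) {
    for (int l = 1; l <= loop; l++)
        for (int i = 1; i < n; i++)
            for (int k = 0; k < i; k++)
                w[i] += b[k][i] * w[(i - k) - 1];
}

/* Wavefront execution of t1 x t2 tiles over the (var1, var2) space. */
void glre_wavefront(int loop, int n, int t1, int t2, elem_t *w, elem_t b[][MAX_N]) {
    assert(n >= 1 && t1 > 0 && t2 > 0);
    int tiles1 = (n - 1) / t1 + 1;  /* number of tiles along var1 */
    int tiles2 = (n - 1) / t2 + 1;  /* number of tiles along var2 */
    for (int counter = 1; counter <= loop; counter++)
        for (int wave = 0; wave <= tiles1 + tiles2 - 2; wave++) {
            int h_lo = imax(0, wave - (tiles2 - 1));
            int h_hi = imin(tiles1 - 1, wave);
            /* Tiles (h, wave - h) of one wavefront do not depend on each other. */
            #pragma omp parallel for
            for (int h = h_lo; h <= h_hi; h++) {
                int v = wave - h;
                for (int var1 = imax(1, t1 * h); var1 <= imin(n - 1, t1 * h + t1 - 1); var1++)
                    for (int var2 = imax(var1, t2 * v); var2 <= imin(n - 1, t2 * v + t2 - 1); var2++)
                        w[var2] += b[var2 - var1][var2] * w[var1 - 1];
            }
        }
}

/* Runs both versions on the same deterministic input and compares them. */
bool wavefront_matches(int loop, int n, int t1, int t2) {
    assert(n >= 1 && n <= MAX_N);
    static elem_t b[MAX_N][MAX_N];
    elem_t w_ref[MAX_N], w_par[MAX_N];
    for (int i = 0; i < n; i++) {
        w_ref[i] = w_par[i] = (elem_t)(3 * i + 1);
        for (int k = 0; k < n; k++)
            b[k][i] = (elem_t)(7 * k + 5 * i + 2);
    }
    glre_reference(loop, n, w_ref, b);
    glre_wavefront(loop, n, t1, t2, w_par, b);
    return memcmp(w_ref, w_par, (size_t)n * sizeof(elem_t)) == 0;
}
```

If the file is compiled without OpenMP support, the pragma is ignored and the tiles of each wavefront simply run one after another, which yields the same result.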

#### **4. Related Work**

Related techniques can be divided into the following two classes: the class of reordering transformations and the one of nonreordering transformations. There are numerous publications concerned with both of the classes. Approaches based on affine transformations [1,2,9–11] and those based on the transitive closure of dependence graphs [12–16] belong to reordering transformations. Reordering techniques are code-independent and are used in optimizing compilers, for example [17–19], which automatically generate optimized target code for source code.

Nonreordering transformations are code-dependent, i.e., for a given code, a transformation is carried out manually. The following publications within nonreordering transformations address problems similar, but not identical, to the one implemented by the code in Listing 1 [3,4,8,20–26].

Both classes allow for generating the target program that is semantically identical to the original one. However, there are the following differences in target code generated using techniques of those classes.

Reordering transformations do not introduce any additional computations to generated target code in comparison with those of original code; they only reorder loop nest iterations of the original code, allowing for tiling and/or parallelism. They are code-independent and are aimed at automatic code generation.

Nonreordering transformations allow us to parallelize and tile code similar to some extent to the code in Listing 1 but not exactly the same. All of those techniques introduce additional computations to generated target code in comparison with those in the original code. That increases the computational complexity of the algorithm and prevents achieving the maximal target code performance, and it is the main drawback in comparison with reordering transformations.

Each technique is manually created for the code that should be optimized. Adapting such a technique even to a slightly different problem can require additional work that can be time-consuming and not always possible.

After an extensive analysis of the nonreordering techniques mentioned above, we did not find one that exactly implements the problem solved by the code in Listing 1. Without extensive research, it is not clear how any of those techniques can be adapted to implement exactly the same problem as the code in Listing 1.

In the class of reordering transformations, we examined the PLUTO [17] and TRACO compilers [15]. PLUTO is based on affine transformations and automatically generates tiled and/or parallel code. TRACO uses the transitive closure of dependence graphs to tile and/or parallelize input code. For the code in Listing 1, both PLUTO and TRACO are unable to generate any tiled and/or parallel code. In Section 3, we presented the reason why affine transformations fail to generate any parallel and/or tiled code for the serial code in Listing 1.

Below, we discuss some nonreordering transformations that allow for generation of code implementing algorithms similar, but not identical, to the one realized by the code in Listing 1. Without extensive research, it is not clear how to adapt any of those techniques to generate target code performing the same calculations as the code in Listing 1.

Karp et al. [20] discussed parallelism in recurrence equations. They proposed a decomposition algorithm that decides if a system of uniform recurrence equations (SURE) is computable or not. If so, multidimensional schedules can be derived and applied to extract parallelism.

Papers [3,4] introduced a recursive doubling strategy to compute recurrence equations in parallel. Recursive doubling envisages the splitting of the computation of a function into two subfunctions whose evaluation can be performed simultaneously in two separate processors. Successive splitting of each of these subfunctions allows for the computation over more processors.

Maleki and Burtscher [21] introduced a two-phase approach to compute recurrence equations. The first phase iteratively merges pairs of adjacent chunks by correcting the values in the second chunk of each pair. The second phase produces the resulting chunks in a pipelined mode to compute the final solution.

Sung et al. [22,23] proposed the idea to divide the input into blocks and decompose the computation over each block. Interblock parallelism is exploited to enhance code performance.

Nehab et al. [24] also suggested splitting the input data into blocks that are processed in parallel by modern GPU architectures and overlapped the causal, anticausal, row, and column filter processing.

Marongiu and Palazzari [25] addressed the parallelization of a class of iterative algorithms described as a system of affine recurrence equations (SARE). They introduced an affine timing function and an affine allocation function that perform a space-time transformation of the loop nest iteration space. Their approach considers algorithms dealing with only uniform dependence vectors, while the approach presented in this paper deals with nonuniform vectors.

Ben-Asher and Haber [26] defined recurrence equations called "simple indexed recurrences" (SIR). In this type of equation, ordinary recurrences are generalized to *X*[*g*(*i*)] = *op<sub>i</sub>*(*X*[*f*(*i*)], *X*[*g*(*i*)]), where *f* and *g* are affine functions and *op<sub>i</sub>*(*x*, *y*) is a binary associative operator. In that paper, the authors proposed a parallel solution to the SIR problem. This class of recurrences is simpler than that considered in our paper, and tiled code is not considered.

In summary, within the class of reordering transformations, no existing technique allows for parallelizing and/or tiling the code in Listing 1. In the class of nonordering transformations, to the best of our knowledge, no technique has been published to generate parallel tiled code implementing the problem addressed in this paper.

The main contribution of our paper is a novel technique that, for the first time, allows us to parallelize and tile the examined loop nest, which computes general linear recurrence equations, by means of reordering transformations. The novelty consists in adding a phase to classical reordering transformations: we first apply to the source code a reordering schedule that respects all data dependencies; then, we apply classical affine transformations to the serial code obtained in the first phase. This increases target code generation time but does not introduce any additional computations to the source code. The generated target code is still within the class of reordering transformations.

#### **5. Results**

The primary reason for writing a parallel program is speed. We want the parallel program to complete in a shorter time than the serial one. To quantify the benefit of tiling and parallelism, we compute the parallel program speedup.

The speedup of a parallel program over a corresponding sequential program is the ratio of the compute time for the sequential program to the time for the parallel program. The value of speedup shows how efficient a parallel program is.

Perfect linear speedup occurs when the value of speedup is the same as the number of threads used for running a parallel program. In practice, perfect linear speedup seldom occurs because of parallel program overhead and the fact that all computations of an original program cannot be parallelized.

According to Amdahl's law, the parallel code speedup, *S*, is limited to *S* <= 1/*s*, where *s* is the serial fraction of code, i.e., the fraction of code that cannot be parallelized. For example, if *s* = 0.2, the maximal speedup is 5 regardless of the number of threads used for running a parallel program.
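The bound above is the limit of the general form of Amdahl's law, *S*(*p*) = 1/(*s* + (1 − *s*)/*p*), as the number of threads *p* grows. The following Python sketch illustrates both the bound and the measured speedup defined earlier; the function names `amdahl_speedup` and `measured_speedup` are illustrative and do not come from the paper's code.

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Upper bound on speedup for a serial fraction s and p threads (Amdahl's law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


def measured_speedup(serial_time: float, parallel_time: float) -> float:
    """Ratio of serial execution time to parallel execution time."""
    if parallel_time <= 0.0:
        raise ValueError("parallel_time must be positive")
    return serial_time / parallel_time


if __name__ == "__main__":
    # With s = 0.2 the bound approaches, but never reaches, 1/s = 5.
    for p in (1, 2, 4, 8, 16, 32):
        print(p, round(amdahl_speedup(0.2, p), 3))
```

With *s* = 0.2, four threads give a bound of 2.5, and no thread count yields a speedup of 5 or more.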

To evaluate the performance of the parallel tiled code presented in Listing 4, we carried out experiments that measured the execution time of the original program and of the parallel one (for different numbers of threads) and then calculated the speedup of the parallel program.

Below, we present the results of experiments carried out with the codes shown in Listing 1 (serial code) and Listing 4 (parallel code). As mentioned in the previous section, we could not find any related parallel code that performs exactly the same computations as those executed by the code in Listing 1. Thus, we limited our experiments to the codes mentioned above.

To carry out the experiments, we used an Intel Xeon X5570 processor, 2.93 GHz, 2 physical units, 8 (2 × 4) cores, 16 hyper-threads, and an 8 MB cache. The executable parallel tiled code was generated by means of the g++ compiler with the -O3 optimization flag.

Experiments were carried out for ten different lengths of the problem defined with parameter *N* from 1000 to 5000 for the codes presented in Listing 1 (serial code) and Listing 4 (parallel code).

All of the source code to perform the experiments and the program to run the tested codes can be found at https://github.com/piotrbla/livc (accessed on 24 August 2021).

We carried out experiments to choose the optimal size of a tile. The size of a tile is optimal if (i) all data associated with that tile can be held in cache, (ii) those data occupy almost the entire capacity of cache, and (iii) tiled code execution time is minimal provided that conditions (i) and (ii) above are satisfied.

For this purpose, we performed three trials whose results are shown in Figure 1. The curve "trial1" represents how the time of tiled program execution depends on the block size along axis *h*0 when the block size along axis *w*0 is fixed at 16. After trial 1, we chose the best size along the *h*0 axis, equal to 32.

**Figure 1.** Time for different tile sizes. All phases.

The curve "trial2" demonstrates how the time of tiled program execution depends on the block size along axis *w*0 when the block size along axis *h*0 is equal to 32 (the result of trial 1). After trial 2, we chose the best size along axis *w*0 equal to 24.

The curve "trial3" shows how the time of tiled program execution depends on the block size along axis *h*0 when the block size along axis *w*0 is equal to 24 (the result of trial 2). After trial 3, we chose the best tile size along axis *h*0, equal to 54. Finally, as the best size of a 2D tile in the parallel tiled code, we chose 24 × 54.
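The three trials amount to a coordinate-wise search: one tile dimension is held fixed while the other is varied, and the best value found is carried into the next trial. The sketch below restates that procedure in Python under stated assumptions: `measure(w, h)` is a hypothetical callable returning the execution time for a tile of size *w*0 × *h*0, and the candidate size lists are placeholders, not the values tested in the paper.

```python
from typing import Callable, Sequence, Tuple


def three_trial_search(
    measure: Callable[[int, int], float],
    h_sizes: Sequence[int],
    w_sizes: Sequence[int],
    w_start: int = 16,
) -> Tuple[int, int]:
    """Coordinate-wise tile size search following the three trials in the text."""
    # Trial 1: fix w0 at its starting value and vary h0.
    h_best = min(h_sizes, key=lambda h: measure(w_start, h))
    # Trial 2: fix h0 at the result of trial 1 and vary w0.
    w_best = min(w_sizes, key=lambda w: measure(w, h_best))
    # Trial 3: fix w0 at the result of trial 2 and vary h0 again.
    h_best = min(h_sizes, key=lambda h: measure(w_best, h))
    return w_best, h_best


if __name__ == "__main__":
    # Synthetic stand-in for a timing run; NOT measured data.
    def fake_time(w: int, h: int) -> float:
        return (w - 24) ** 2 + (h - 54) ** 2

    print(three_trial_search(fake_time, range(8, 65, 2), range(8, 41, 8)))
```

Such a search evaluates far fewer configurations than an exhaustive scan of all (*w*0, *h*0) pairs, at the cost of possibly missing a better pair when the two dimensions interact strongly.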

For the best tile size, Table 1 presents execution times and speedup of the serial program in Listing 1 and the parallel tiled one presented in Listing 4 for 32 OpenMP threads used. Figure 2 depicts the data presented in Table 1 in a graphical way. As presented, the execution time of parallel tiled program grows practically in a linear manner exposing considerable speedup (the ratio of the serial program execution time to that of the corresponding parallel one) presented in Figure 3.


**Table 1.** Time in seconds and speedup for Intel Xeon X5570 and 32 OpenMP threads.

**Figure 2.** Time for Intel Xeon X5570 and different problem sizes.

**Figure 3.** Speedup for Intel Xeon X5570 and 32 OpenMP threads.

Figure 4 presents how the parallel tiled code speedup depends on the number of threads for the maximal problem size used in the experiments, i.e., for *N* = 5000. The speedup grows practically linearly for 1 to 12 threads. For 12 or more threads, linear speedup is prevented by the serial fraction of the code (Amdahl's law): parallel loop initialization performed by a single thread and serial input–output operations. Speedup is also limited by parallel program overhead, namely thread synchronization: after each wavefront, a barrier is inserted because the following wavefront can be executed only after the calculations of the previous wavefront are complete.

**Figure 4.** Speedup for Intel Xeon X5570 for different numbers of threads.

We may conclude that the generated parallel tiled code presented in Listing 4, which implements computing-intensive general linear recurrence equations, can be successfully run on modern multicore machines with a large number of cores.

#### **6. Conclusions**

We presented an approach to generate parallel tiled code for computing general linear recurrence equations (GLREs) presented in Listing 1. That code is computing-intensive and must be run on modern multicore computers to reduce execution time. We demonstrated how to transform that code to obtain the modified code shown in Listing 3, which exposes dependencies such that there exist two linearly independent solutions to the time partition constraints formed on the basis of those dependencies. This allows us to apply the affine transformation framework and generate the parallel 2D tiled code computing GLREs presented in Listing 4. The parallelism is achieved using the wavefront technique and is expressed in code that conforms to the OpenMP standard. To the best of our knowledge, the target parallel tiled code presented in Listing 4 is the first to allow for enumerating 2D tiles and the first that does not require any additional computations in comparison with those of the original serial code. This code is derived by means of tiling the loop nest iteration space. Our experiments show that the resulting parallel tiled code significantly outperforms the original serial code computing GLREs. The performance improvement is due to the parallelism and better locality of the target code.

**Author Contributions:** Conceptualization and methodology, W.B. and P.B.; software, P.B.; validation, W.B. and P.B.; data curation, P.B.; original draft preparation, W.B.; writing—review and editing, W.B. and P.B.; visualization, P.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Source code to reproduce all the results described in this paper can be found at: https://github.com/piotrbla/livc (accessed on 24 August 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

GLRE General Linear Recurrence Equations

TRACO Compiler based on the TRAnsitive ClOsure of dependence graphs

#### **References**


## *Article* **Design of Automatic Correction System for UAV's Smoke Trajectory Angle Based on KNN Algorithm**

**Pao-Yuan Chao \*, Wei-Chih Hsu and Wei-You Chen**

Department of Computer and Communication Engineering, National Kaohsiung University of Science and Technology (NKUST), Kaohsiung 807618, Taiwan

**\*** Correspondence: i107109103@nkust.edu.tw

**Abstract:** Unmanned aerial vehicles (UAVs) have evolved with the progress of science and technology in recent years. They combine advanced technologies such as information and communications technology, mechanical power, remote control, and electric power storage. In the past, drones could be flown only via remote control, and the mounted cameras captured images from the air. Now, UAVs in Taiwan integrate new technologies such as 5G, AI, and IoT. They have great application value in high-altitude data acquisition, entertainment performances (such as night light shows and UAV shows with smoke), agriculture, and 3D modeling. UAVs are susceptible to the natural wind when spraying smoke into the air, which leads to a smoke track offset. This study developed an autocorrect system for UAV smoke tracing. An AI model was used to calculate smoke tube angle corrections so that smoke tube angles could be corrected immediately while smoke is sprayed. This led to smoke tracks being consistent with flight tracks.

**Keywords:** unmanned aerial vehicle; machine learning; UAV smoke show; mobile networks; artificial intelligence

#### **1. Introduction**

Flexible, safe, stable, high-speed, and low-cost UAVs have been developed in the past few years. This was achieved through the continuous development of their modules, including process materials, electric power storage, and sensors [1–12]. UAVs are now widely used by civilian, commercial, and government units. Industrial UAVs can be applied in environmental monitoring, infrastructure inspection, disaster or accident rescue, agriculture, forestry, fishery, animal husbandry management, spatial information measurement, land and guard patrol, media communication, telecommunications services, home delivery logistics, and the military. Most application fields can be further subdivided. For example, environmental monitoring can be divided into the monitoring and investigation of air pollution, oil pollution, nuclear pollution, marine pollution, and river pollution, and even includes the study of weather changes. Many types of infrastructure are subject to inspection, including roads, railways, transmission towers, and oil fields. In rescue work, drones can be used for video recording, real-time image transmission, and material delivery in environments such as waters, mountainous areas, or buildings. Agriculture, forestry, fishery, and animal husbandry management includes pesticide or fertilizer spraying and the observation of crops, trees, pastures, and fish farms. Work in spatial information includes aerial mapping, terrain attribute classification and survey, national land survey, urban planning, land survey and development, water control and flood control planning, and 3D real scene modeling. Guard patrol includes coastal patrol, criminal chasing, and general security work. In media communication, in addition to providing real-time news about disaster areas and war zones, UAVs can be applied to business and tourism marketing.
There are diverse applications in the military, where UAVs can be used as reconnaissance aircraft, target aircraft, and bombers. Therefore, industrial UAVs have unlimited business opportunities. In Ghana, there have been about 275,000 UAVs flying and delivering medical kits containing vaccines [13]. Edge computing technology is used in UAVs for power transmission line inspection [14].

**Citation:** Chao, P.-Y.; Hsu, W.-C.; Chen, W.-Y. Design of Automatic Correction System for UAV's Smoke Trajectory Angle Based on KNN Algorithm. *Electronics* **2022**, *11*, 3587. https://doi.org/10.3390/electronics11213587

Academic Editor: Juan M. Corchado

Received: 23 September 2022 Accepted: 31 October 2022 Published: 3 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

On many major holidays and celebrations in Taiwan, people can see colorful smoke in the sky from fighters flown by the military. In addition to the purchase of fighters, such activities are costly (military aircraft maintenance, personnel training, and aviation gasoline), consume gasoline, and impose a burden on the environment. Therefore, the use of UAVs carrying smoke tubes for aerial smoke spraying has grown in recent years. Today, most UAVs are powered by electricity and do not emit exhaust gas like gasoline engines. Moreover, UAVs cost less than traditional fighters in smoke spraying. For example, in Taiwan, according to the Regulations of Drone of the Civil Aeronautics Administration [15], if pilots hold G2 licenses (flying more than 400 feet above the ground or water, operating beyond the range of visibility, dropping, or spraying objects), and UAVs are registered in the UAV system of the Civil Aeronautics Administration and airspace is applied for in advance, regulatory restrictions can be exempted and UAV shows with smoke can be performed.

UAVs cannot carry large smoke tubes and smoke tubes installed at different positions are susceptible to the natural wind and lead to a smoke track offset. In this study, detectors and smoke tube correctors were installed above the UAV. An AI model was used for training to correct the offset tracks. In this way, small low-altitude UAVs could achieve the same visual effects as traditional high-altitude fighters in UAV shows with smoke, in a way which is cheap and environmentally friendly.

#### **2. Materials and Methods**

#### *2.1. Unmanned Aerial Vehicle*

The UAV used in this study has the following basic flying parts: a flight controller, motor, electronic transmission, frame, and GPS positioning module. In addition, a servo motor, smoke tube, Raspberry Pi, 4G communication module, and lithium battery are installed. The overall weight is about 2.5 kg. Considering the motor load in the flight process and flight time, a UAV with a wheelbase of 450 mm and an EDU450 carbon fiber frame was selected [16], as shown in Figure 1.

**Figure 1.** EDU450 carbon fiber frame.

A Pixhawk2 CUBE consisting of two processors was used as the flight controller of the UAV. In the Pixhawk2 CUBE, the main processor was an STM32F427 V3, and the coprocessor was an STM32F1. The built-in sensors included a tri-axis gyroscope (L3GD20), an accelerometer and magnetometer (LSM303D), an inertial measurement unit (MPU9250), and a barometer (MS5611). Weighing 73 g, this light and efficient flight controller supports the open-source flight control software PX4 and ArduPilot, multi-rotor models with three to eight axes, and multiple interfaces, including the MAVLink interface, I2C interface, and PWM signal output. The Raspberry Pi 3B+ has a USB interface, which can be used for Internet access through a 4G network card, in addition to the built-in WiFi card. If the 4G network card fails, the system can switch to the built-in WiFi [17] as a backup network, as shown in Figure 2.

**Figure 2.** System configuration.

#### *2.2. Smoke Tube Position*

In this study, a four-axis multi-rotor UAV was used for the experiments. The smoke from the smoke tube is vulnerable to the natural wind and to the tube position, leading to a smoke track offset that prevents the audience from enjoying the performance. As shown in Figure 3, when the smoke tube was placed directly below or above the UAV, the smoke was affected by the downdraft generated by the propellers. Regardless of the rotation of the servo motor and the adjustment of the spraying direction, the smoke was caught by the airflow and sprayed downward. Therefore, self-made 3D-printed parts were attached to the UAV, and an additional extension area was built to hold the servo motor and the smoke tube for adjusting the spraying angle. This keeps the smoke away from the propeller downdraft as much as possible and makes the smoke spray backward, as shown in Figure 4a. To avoid the pendulum effect and excessive output from the rear motor, the length of the extension area was set to 29 cm after multiple outdoor flight tests. The aerial testing shows that the smoke track is significantly improved, as shown in Figure 4b. With the smoke emission direction improved, the track angle correction is discussed below, so that the audience can enjoy the best effect of the smoke track.

**Figure 3.** Spraying effects at different smoke tube positions: (**a**) below the UAV; (**b**) above the UAV.

**Figure 4.** Spraying effects at the redesigned smoke tube position: (**a**) the smoke tube extension frame is at the rear; (**b**) effect after modification.

#### *2.3. AI Model Selection and Design*

K-nearest neighbors (KNN) is one of the most popular machine learning algorithms. It has been widely used in HPC applications, such as image/video retrieval, big data analysis, machine learning, and computer vision [18,19]. It is a nonparametric statistical method for regression and classification. The input consists of the k nearest training samples in the feature space [20,21], and the value of k determines how many neighbors are consulted to decide which class a data point is nearest to. The class is decided by majority voting, and the Euclidean distance is used to measure the distance between samples, as shown in Equation (1).

$$\mathbf{P} = \sqrt{\left(\mathbf{x}\_2 - \mathbf{x}\_1\right)^2 + \left(\mathbf{y}\_2 - \mathbf{y}\_1\right)^2} \tag{1}$$

In the KNN classifier, the output is a class membership: the category of the input object is determined by a majority vote of its neighbors. KNN adopts the vector space model for classification, in which objects of the same category are highly similar. Similarity to cases of known category is computed to evaluate the possible categories of the input object. The training samples are vectors in a multi-dimensional feature space, each with a classification label. At the training stage, the algorithm only stores the feature vectors and labels of the training samples.

The KNN classifier can be viewed as assigning a weight of 1/k to each of the k nearest neighbors and zero to all other neighbors. This generalizes to the weighted nearest neighbor classifier, in which the i-th nearest neighbor is given a weight w<sub>ni</sub>, with the weights satisfying (2). A similar result holds for the strong consistency of the weighted nearest neighbor classifier [21].

$$\sum\_{i=1}^{n} \mathbf{w}\_{\text{ni}} = 1 \tag{2}$$

Let C<sub>n</sub><sup>wnn</sup> denote the weighted nearest neighbor classifier with weights {w<sub>ni</sub>}, i = 1, ..., n. Under regularity conditions on the class distributions, the excess risk has the asymptotic expansion (3) for constants B<sub>1</sub> and B<sub>2</sub>, where s<sub>n</sub><sup>2</sup> and t<sub>n</sub> are given by (4) and (5).

$$\mathbf{R}\_{\rm R}(\mathbf{C}\_{\rm n}^{\rm wnn}) - \mathbf{R}\_{\rm R}\left(\mathbf{C}^{\rm Bayes}\right) = \left(\mathbf{B}\_{1}\mathbf{s}\_{\rm n}^{2} + \mathbf{B}\_{2}\mathbf{t}\_{\rm n}^{2}\right)\left\{1 + o(1)\right\}\tag{3}$$

$$\mathbf{s}\_{\mathbf{n}}^{2} = \sum\_{i=1}^{n} \mathbf{w}\_{\mathbf{n}i}^{2} \tag{4}$$

$$\mathbf{t}\_{\mathbf{n}} = \mathbf{n}^{-\frac{2}{d}} \sum\_{i=1}^{n} \mathbf{w}\_{\mathbf{n}i} \left\{ \mathbf{i}^{1 + \frac{2}{d}} - (\mathbf{i} - \mathbf{1})^{1 + \frac{2}{d}} \right\} \tag{5}$$

The optimal weighting scheme {w<sup>∗</sup><sub>ni</sub>}, i = 1, ..., n, balances the two terms above. Let k<sup>∗</sup> = Bn<sup>4/(d+4)</sup>; then w<sup>∗</sup><sub>ni</sub> is given by (6) for i = 1, 2, ..., k<sup>∗</sup>, and w<sup>∗</sup><sub>ni</sub> = 0 for i = k<sup>∗</sup> + 1, ..., n. With the optimal weights, the dominant term in the asymptotic expansion of the excess risk is O(n<sup>−4/(d+4)</sup>).

$$\mathbf{w}\_{\rm ni}^{\*} = \frac{1}{\mathbf{k}^{\*}} \left[ 1 + \frac{\mathbf{d}}{2} - \frac{\mathbf{d}}{2\mathbf{k}^{\*2/\mathbf{d}}} \left\{ \mathbf{i}^{1+2/\mathbf{d}} - (\mathbf{i}-1)^{1+2/\mathbf{d}} \right\} \right] \tag{6}$$
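As a numerical check of Equations (2) and (4)–(6), the following Python sketch evaluates the weights of Equation (6) and the quantities s<sub>n</sub><sup>2</sup> and t<sub>n</sub>. It assumes that k<sup>∗</sup> is supplied directly as an integer and that d is the feature dimension; the function names are illustrative only.

```python
from typing import List


def optimal_weights(n: int, k_star: int, d: int) -> List[float]:
    """Weights of Equation (6): nonzero for i = 1..k*, zero for i = k*+1..n."""
    if not 1 <= k_star <= n:
        raise ValueError("k_star must satisfy 1 <= k_star <= n")
    p = 1 + 2 / d
    weights = []
    for i in range(1, n + 1):
        if i <= k_star:
            bracket = i ** p - (i - 1) ** p
            w = (1 + d / 2 - d / (2 * k_star ** (2 / d)) * bracket) / k_star
        else:
            w = 0.0
        weights.append(w)
    return weights


def s_squared(weights: List[float]) -> float:
    """Equation (4): sum of squared weights."""
    return sum(w * w for w in weights)


def t_n(weights: List[float], d: int) -> float:
    """Equation (5)."""
    n = len(weights)
    p = 1 + 2 / d
    total = sum(w * (i ** p - (i - 1) ** p)
                for i, w in enumerate(weights, start=1))
    return n ** (-2 / d) * total


if __name__ == "__main__":
    w = optimal_weights(n=100, k_star=10, d=2)
    print(sum(w), s_squared(w), t_n(w, d=2))
```

Because the bracketed differences in Equation (6) telescope when summed over i = 1, ..., k<sup>∗</sup>, the weights add up to 1, as Equation (2) requires.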

At the classification stage, k is a user-defined constant. A vector without a category label (query or test point) will be classified into the most frequently used category among the k sample points nearest to the point.
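The classification stage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy data in the usage example are invented for illustration, and ties in the vote are resolved in favor of the label encountered first among the nearest neighbors.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; reduces to Equation (1) for 2D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(
    train_points: Sequence[Sequence[float]],
    train_labels: Sequence[Hashable],
    query: Sequence[float],
    k: int,
) -> Hashable:
    """Return the majority label among the k training points nearest to query."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training indices by distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], query))
    # Count the labels of the k nearest neighbors.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ["A", "A", "A", "B", "B", "B"]
    print(knn_predict(points, labels, (0.5, 0.5), k=3))
```

No model fitting takes place: all work is deferred to query time, which is why the training stage only needs to store the feature vectors and their labels.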

When flying in the air, the UAV is affected by the crosswind w<sub>wind</sub> and deviates from its route. To correct the route, the flight controller issues a roll command that adjusts the roll angle θ<sub>1</sub> of the UAV, as shown in Figure 5. When the airframe is corrected, an angle adjustment θ<sub>2</sub> is applied to the smoke tube so that it turns to the windward side, facing the direction the wind comes from, as shown in Figure 6. From the angle adjustment θ<sub>1</sub> of the flight controller and the angle θ<sub>2</sub> of the smoke tube, the direction and magnitude of the wind the UAV is exposed to in the air can be inferred, as shown in Equation (7).

$$\mathbf{w}\_{\text{wind}} \rightarrow \; \theta\_1 + \; \theta\_2 \tag{7}$$

**Figure 5.** Roll angle correction θ<sub>1</sub>.

**Figure 6.** Smoke tube angle correction θ<sub>2</sub>.

Based on the above conclusion, the direction and magnitude of the wind are related to the value of θ<sub>1</sub>. The operator can adjust θ<sub>2</sub> according to the value of θ<sub>1</sub> when flying the UAV. θ<sub>1</sub> and θ<sub>2</sub> will be trained by the machine learning-based KNN classification method, and their relationship is shown in Equation (8).

$$\mathbf{w} \propto \theta\_1 \to \theta\_2 \tag{8}$$

#### **3. Experimental Results and System Validation**

During its flight, the UAV is affected by the natural wind and deviates from its course. At this point, the flight controller corrects the pitch angle in real time to return the UAV to its course. Its flight direction is mainly changed by correcting the pitch, yaw, and roll parameters. In addition to the airspeed meter sensor installed above the UAV, the pitch, yaw, and roll values of the UAV can be used to learn the changes in the wind field in the air.

Five angles, namely −60°, −30°, 0°, 30°, and 60°, were designed in this study. These angles are the output y to be estimated by the KNN model. The input features include the pitch, yaw, roll, and airspeed meter values read from the flight controller. A KNN model was built for training.

Finally, the trained model was saved with the Joblib package and then deployed to the Raspberry Pi on the UAV. The designed system then loaded the model so that the real-time wind speed data could be fed directly into the AI model on the Raspberry Pi for calculation. The results were transmitted to the flight controller via the system to adjust the servo motor that controls the smoke tube direction, setting the smoke tube to the optimal angle. The correction flow chart is shown in Figure 7.

**Figure 7.** Autocorrection flow chart for smoke trailing.

As the selected UAV could not be fitted with a larger smoke tube, each spraying lasted only about 30 to 40 s. The operator could therefore collect only about five data items per flight. During the training, the pitch, yaw, roll, and airspeed meter data were put into the KNN model, and the accuracy was 50%. Improving the accuracy with all of these features would require more data. Therefore, the pitch and yaw were discarded, and the roll value was kept. The UAV was designed to fly back and forth in a straight line automatically. In the case of a deviation caused by a crosswind, the UAV could return to its route mainly by correcting the roll value. A roll value of 0 indicates that the UAV flew horizontally without any roll. A positive roll value indicates that the wind blew from the left side of the UAV, and a higher value represents a higher wind speed. Conversely, a negative roll value indicates that the wind blew from the right side of the UAV, and a more negative value reflects a higher wind speed. In this study, the roll value was collected to determine the speed and direction of the wind. This compensates for the limitation of the airspeed meter in a light breeze, since the roll value is sensitive enough to capture small changes. In this study, 64 data items from the database were put into the AI training model, and the accuracy was 71%, as shown in Figure 8.

**Figure 8.** KNN training accuracy.

The roll data read by the UAV was recorded and put into the KNN model for testing. The test data and predicted angles are shown in Table 1. The table shows that the rolls and angle corrections were as expected. The smoke tube should have been shifted to the right when the roll was positive and left when the roll was negative.

**Table 1.** KNN model test data and results.


Images taken from behind the smoke tube show that the smoke tube could automatically change its direction and angle according to the direction and speed of the wind. Figure 9 shows that the UAV deviated to the left due to the wind from the right side. The roll value of the correction to the right given by the flight controller was 0.061363. According to the calculation by the AI model, the smoke tube should be adjusted 30° to the right. Figure 10 shows that the UAV deviated to the right due to the wind from the left side. According to the calculation by the AI model, the smoke tube should be adjusted 30° to the left.

**Figure 9.** The smoke tube shifted to the right.

**Figure 10.** The smoke tube shifted to the left.

The smoke from the smoke tube would be adjusted to the direction of the windward face. Based on the wind speed, the smoke tube would be adjusted to 30° or 60°. In this case, when the windward face of the UAV was in the front and rear directions, the smoke tube angle would not be adjusted.

Based on the observation of the actual flight, an accuracy of 71% indicated a significant improvement. Figure 11 illustrates the case without correction by the AI model. From the audience's angle, it could be clearly seen that the track (yellow arrow) of the smoke from the smoke tube was offset due to the natural wind. Figure 12 shows the case with a correction by the AI model. From the audience's angle, the smoke from the smoke tube could be observed (yellow arrow). Despite the influence of the wind field in the air, the smoke track could be manipulated to be almost the same as the flight route.

**Figure 11.** Smoke track correction without the AI model.

**Figure 12.** Smoke track correction with the AI model.

#### **4. Conclusions and Future Work**

In this study, a Raspberry Pi was used as the microcomputer for communication with the server. The flight data were received to control the smoke tube and sensor of a quad-axis UAV. To avoid the influence of the airflow, 3D-printed parts were used to refit the UAV. Its frame was extended to install a smoke tube and an electronic igniter so that the smoke tube could be lit for shows with smoke. As the smoke from the smoke tube was susceptible to the natural wind and became offset, a servo motor was installed to adjust the direction of the smoke. A manned aircraft flew to record the angle adjustment, wind direction, and wind speed. The KNN algorithm was used to train a modified AI model. After the AI model was deployed to the Raspberry Pi, the UAV emitted smoke in the air. The Raspberry Pi and the flight controller could directly read the wind field data, and the angle could be calculated immediately and sent back to the flight controller if a correction was needed. In this way, the spraying angle of the smoke tube could be adjusted immediately to keep the smoke track as close to the flight route as possible. According to the results, the correction accuracy was 71%, which is sufficient to show a visible difference between the tracks before and after the correction.

In the past, fighters sprayed smoke in the air to celebrate major festivals, which was expensive and polluted the environment. Our design is expected to make UAV shows with smoke possible on small occasions so that such events can be enjoyed on many occasions other than major festivals. The UAV designed by us is powered by electricity. Compared with a fuel-powered aircraft, it causes less environmental pollution and is cheaper.

The architecture in this study was designed for a single UAV. If the information of multiple UAVs can be displayed simultaneously on the web page and group control of the UAVs can be carried out through function buttons on the web page, multiple UAVs can spray smoke simultaneously in one show. As for the collection of the wind speed data, this study collected various data, such as the roll, pitch, yaw, and anemometer values. Due to insufficient sample data, only the roll data were used for the machine learning. If more data can be collected in the future and all collected data can be used for machine learning, the AI model is expected to become more effective, and the overall correction effect should improve further.

**Author Contributions:** Conceptualization, P.-Y.C. and W.-Y.C.; methodology, W.-C.H. and W.-Y.C.; resources, P.-Y.C. and W.-C.H.; writing—original draft preparation, P.-Y.C. and W.-Y.C.; writing review and editing, W.-C.H. and W.-Y.C.; visualization, W.-Y.C.; supervision, P.-Y.C. and W.-C.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Prediction of Offshore Wave at East Coast of Malaysia—A Comparative Study**

**Mohammad Azad 1,\* and Md. Alhaz Uddin <sup>2</sup>**


**Abstract:** Exploration of oil and gas in the offshore regions is increasing due to global energy demand. The weather in offshore areas is truly unpredictable due to the sparsity and unreliability of metocean data. Offshore structures may be affected by critical marine environments (severe storms, cyclones, etc.) during oil and gas exploration. In the interest of public safety, fast decisions must be made about whether to proceed or cancel oil and gas exploration, based on offshore wave estimates and anticipated wind speed provided by the Meteorological Department. In this paper, using the metocean data, the offshore wave height and period are predicted from the wind speed by three state-of-the-art machine learning algorithms (Artificial Neural Network, Support Vector Machine, and Random Forest). Such data has been acquired from satellite altimetry and calibrated and corrected by Fugro OCEANOR. The performance of the considered algorithms is compared by various metrics such as mean squared error, root mean squared error, mean absolute error, and coefficient of determination. The experimental results show that the Random Forest algorithm performs best for the prediction of wave period and the Artificial Neural Network algorithm performs best for the prediction of wave height.

**Keywords:** prediction; artificial neural network; support vector machine; random forest; regression; offshore wave; wind speed

#### **1. Introduction**

Human activities in the offshore region are continually increasing due to oil and gas exploration. Adverse conditions can often occur due to environmental disasters during offshore operations. Predicting wave characteristics is a crucial prerequisite for offshore oil and gas development (Figure 1), considering the safety of lives and the avoidance of economic damage. Wave height and period are typically increased significantly by the wind associated with storms passing across the ocean's surface. Weather forecasting departments usually predict wind forces rather than wave periods and heights. There are several empirical approaches for estimating wave height and wave period from wind force [1–5]. Estimating wave height and period is inherently inaccurate and random, making it difficult to simulate using deterministic equations [6]. The numerical approach for calculating wave height and period is a difficult and complex procedure that, despite substantial breakthroughs in computational tools, produces solutions that are neither dependable nor consistently applicable. Machine learning methods are well suited to modeling inputs with corresponding outputs since they do not necessitate an understanding of the underlying physical mechanism [7].

Several studies have used artificial neural networks (ANN) to estimate significant wave heights and the mean zero-up-crossing wave period history for different locations in the seas. These parameters were predicted three, six, twelve, and twenty-four hours in advance using two different neural network methods [8–10]. The time series of

**Citation:** Azad, M.; Uddin, M.A. Prediction of Offshore Wave at East Coast of Malaysia—A Comparative Study. *Electronics* **2022**, *11*, 2527. https://doi.org/10.3390/electronics11162527

Academic Editor: Giuseppe Ciaburro

Received: 25 April 2022 Accepted: 23 July 2022 Published: 12 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

these wave parameters has been investigated using simulations in Portugal's western coast area. Time series of wave height have been decomposed into multi-resolution time series using wavelet transformation hybridized with ANN [11,12]. The multi-resolution time series has been used as the input of the ANN to predict the significant wave height at different multi-step lead times near Mangalore, on India's west coast. Mandal et al. [13] predicted wave heights from observed ocean waves off Marmugao, on the west coast of India. To predict wave height, recurrent neural networks with a resilient propagation (Rprop) update algorithm have been implemented.

**Figure 1.** Offshore oil and gas development

The time series of wave height and mean zero-up-crossing wave period were simulated using data from the local wind. Various types of machine learning models [14–17] have been applied to improve the accuracy of the prediction. The authors of [18] used ANN to estimate wave height from wind force at 10 chosen locations in the Baltic Sea. The WAM4 wave model was used to produce the hindcast wave time series. Two different techniques, Feed-forward Back-propagation (FFBP) and Radial Basis Function (RBF), were used to develop a machine learning system for predicting wave heights at a particular coastal point in deeper offshore areas [19–21]. Data on wave height, average wave period, and wind speed were obtained from remotely sensed satellite data on India's west coast. Tsai et al. [22] employed the ANN with a back-propagation algorithm to estimate wave height and period from wind input based on the wind-wave relationship [23]. Time records of waves at either station may be predicted based on data from the neighboring station. Several deterministic neural network models have been developed by Deo et al. [24] to predict wave height and wave period from generated wind speed. However, the model offers adequate results in deep water and open areas, and when the prediction periods are long.

This research uses wind force (metocean) data to predict wave height and wave period. Three machine learning algorithms, ANN (Artificial Neural Network), SVM (Support Vector Machine), and RF (Random Forest), are compared critically to identify which gives the most accurate predictions of the wave parameters.

The main contributions of the paper are (i) performing the experiments on the data sets obtained on the east coast of Malaysia and (ii) understanding the best predictive models for future prediction.

We arrange the remaining parts of the manuscript as follows. First, we explain the data collection procedures, the three regressor models, and performance metrics used for the comparison among the models in Section 2. Then, we show the experimental results and discuss the findings in Section 3. Finally, Section 4 contains a concise conclusion.

#### **2. Materials and Methods**

In this section, first, the data collection procedure is described, followed by the regression methods, and finally the evaluation metrics are described.

#### *2.1. Data Collection*

On a 2° × 2° grid in the South China Sea, environmental data are collected along Malaysia's east coast, including the basins of Sabah (longitude 114.39° E, latitude 5.83° N) and Sarawak (longitude 111.82° E, latitude 5.15° N) (Figure 2). The Malaysian Meteorological Service and Fugro OCEANOR provided the data, which were acquired from satellite altimetry [25,26] using oceanographic SEAWATCH meteorological (metocean) buoys and sensors, and calibrated and corrected by Fugro OCEANOR [27].

**Figure 2.** Data collection location of east coast of Malaysia [26]. Reprinted from Renewable Energy, 88, Omar Yaakob, Farah Ellyza Hashim, Kamaludin Mohd Omar, Ami Hassan Md Din, Kho King Koh, Satellite-based wave data and wave energy resource assessment for South China Sea, 359–371, 2016, with permission from Elsevier.

The summary of common statistical values of the collected data is given in Table 1.


**Table 1.** Summary of basic statistics of the collected data.

There are 1460 data samples and three features: wind speed (m/s), wave height (m), and wave period (s).

#### *2.2. Methods*

Many algorithms [28–33] can be used for regression analysis. In this study, three well-known and commonly used regression algorithms (ANN, SVM, and RF) were chosen to predict wave height and period from wind force. The wind force is employed as the input for training each model, while the wave height and period are used as outputs. Before discussing the details of each method, Table 2 shows the advantages and disadvantages of each method [34]:



The ANN algorithm is the most frequently utilized algorithm in such problems. It is therefore compared with two other commonly used algorithms, SVM and RF. Note that ANN is a parametric regression algorithm while SVM and RF are nonparametric regression algorithms. However, because the data set is small, it is not possible to employ advanced deep learning techniques (such as convolutional neural networks or recurrent neural networks).

#### 2.2.1. Artificial Neural Network (ANN)

Inspired by biological neurons, researchers created artificial neurons [35]. It is possible to create a network of artificial neurons to predict the desired functions (i.e., the target variables).

The basic idea behind a single artificial neuron is that it takes an input function *I*, which is a product of weight vectors with the sample input vector, and feeds it into an activation function *g* to produce an output (Figure 3). Generally, the usual practice is to create a network of neurons based on many layers: input layers, hidden layers, and output layers. The input layers basically consist of one or more neurons based on the supplied inputs (for the present study, only one input is used), then there can be one or more hidden layers (only one hidden layer is used), and finally, there can be one output layer consisting of the expected outputs (Figure 4). The general trend is to use a back-propagation algorithm to update the weights involved in each connection between neurons so that it is possible to minimize the errors of regression between true output and the predicted output.

**Figure 3.** The architecture of a single artificial neuron.

**Figure 4.** The architecture of ANN.

The hyperbolic tangent function,

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},$$

is used as the activation function. The steps of the Python implementation are described in Algorithm 1.

#### **Algorithm 1** The Python implementation of the ANN algorithm

**Input:** Data set containing features and target

**Output:** Prediction of ANN for the given data set


7. Predict using the obtained regressor and calculate the performance metrics on the test data set;
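The forward pass of the small network described in this section (one input, one hidden layer with tanh activation, one output) can be sketched in plain Python. This is an illustrative sketch only: the weight and bias values are hypothetical placeholders, the bias terms and the linear (identity) output neuron are assumptions not stated in the text, and the back-propagation or "lbfgs" training step is omitted.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """Single artificial neuron: weighted sum of the inputs, then activation g."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

def forward(wind_speed, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a 1-input, 1-hidden-layer, 1-output regression network."""
    # Each hidden neuron sees the single input feature (wind speed).
    hidden = [neuron([wind_speed], [w], b)
              for w, b in zip(hidden_weights, hidden_biases)]
    # Assumption: the output neuron is linear, as is usual for regression.
    return neuron(hidden, output_weights, output_bias, activation=lambda z: z)

# Hypothetical parameters for 4 hidden neurons; real values come from training.
hidden_weights = [0.5, -0.3, 0.8, 0.1]
hidden_biases = [0.0, 0.1, -0.2, 0.05]
output_weights = [0.4, -0.6, 0.3, 0.9]
output_bias = 0.2

prediction = forward(1.5, hidden_weights, hidden_biases, output_weights, output_bias)
print(prediction)
```

Training would adjust the four sets of parameters so that `forward` minimizes the regression error on the training data.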

#### 2.2.2. Support Vector Machine (SVM)

The basic idea behind regression using SVM (also known as Support Vector Regression, SVR) is that, given input examples {(*i*<sub>1</sub>, *o*<sub>1</sub>), ..., (*i<sub>n</sub>*, *o<sub>n</sub>*)} ⊂ *X* × *R*, where *X* denotes the example space (e.g., *X* = *R* for our problem), a function *f*(*x*) is obtained that deviates by at most *ε* from the actual outputs *o<sub>i</sub>* for all the input examples (Figure 5) [36].

Generally, it is easiest to describe the case of linear functions *f* [37]:

$$f(x) = \langle w, x \rangle + b, \quad \text{with } w \in X,\ b \in R$$

where ⟨·, ·⟩ denotes the dot product in *X*. Furthermore, it is possible to rewrite this problem as a convex optimization problem:

$$\text{minimize } \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to}$$

$$y_i - \langle w, x_i \rangle - b \le \epsilon$$

$$\langle w, x_i \rangle + b - y_i \le \epsilon$$

Here (*x<sub>i</sub>*, *y<sub>i</sub>*) denotes the *i*-th input-output example, written (*i<sub>i</sub>*, *o<sub>i</sub>*) above.
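The two constraints say that every example must lie inside a tube of half-width ε around the fitted function. A minimal sketch for the linear, scalar-input case follows. The data values are hypothetical, and the ε-insensitive loss shown is the standard way of penalizing points outside the tube in SVR rather than something taken from this paper. The study itself uses an RBF kernel, which this sketch does not cover.

```python
def linear_svr_predict(w, b, x):
    """Linear model f(x) = <w, x> + b for a scalar input x."""
    return w * x + b

def epsilon_insensitive_loss(y, y_pred, epsilon):
    """Zero inside the epsilon tube, growing linearly outside it."""
    return max(0.0, abs(y - y_pred) - epsilon)

def within_tube(w, b, xs, ys, epsilon):
    """True if both constraints hold for every example (x_i, y_i)."""
    return all(
        y - linear_svr_predict(w, b, x) <= epsilon
        and linear_svr_predict(w, b, x) - y <= epsilon
        for x, y in zip(xs, ys)
    )

# Hypothetical examples that lie close to the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

feasible = within_tube(2.0, 0.0, xs, ys, epsilon=0.25)       # all deviations <= 0.25
too_tight = within_tube(2.0, 0.0, xs, ys, epsilon=0.15)      # third point deviates by 0.2
print(feasible, too_tight)
```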

The steps of the Python implementation are described in Algorithm 2.

#### **Algorithm 2** The Python implementation of the SVR algorithm


**Input:** Data set containing features and target

**Output:** Prediction of SVR for the given data set

1. Read the data set using the Python read\_excel function;

2. Extract features and target variables;

3. Scale the features and target using the StandardScaler function;


6. Derive the SVR using the above results;

7. Predict using the obtained regressor and calculate the performance metrics on the test data set;
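Step 3 of Algorithm 2 scales the features and target. scikit-learn's StandardScaler standardizes a column to zero mean and unit variance using the population standard deviation. The plain-Python sketch below mirrors that computation for a single column. The function names and the wind-speed values are hypothetical.

```python
import statistics

def fit_standardizer(values):
    """Return (mean, std) of a column, using the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        std = 1.0  # constant column: avoid division by zero
    return mean, std

def transform(values, mean, std):
    """Standardize: z = (x - mean) / std."""
    return [(v - mean) / std for v in values]

def inverse_transform(scaled, mean, std):
    """Map standardized values back to the original units."""
    return [z * std + mean for z in scaled]

# Hypothetical wind speeds in m/s.
wind_speeds = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_standardizer(wind_speeds)
scaled = transform(wind_speeds, mean, std)
restored = inverse_transform(scaled, mean, std)
print(mean, std, scaled[0])
```

When the target is scaled as well, predictions have to be passed through `inverse_transform` before the metrics are computed in the original units.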

#### 2.2.3. Random Forest (RF)

Decision trees have been widely used since the beginning of machine learning research to find the best hypothesis as a classifier or regressor. A decision tree is a tree-like structure where a sequence of actions is taken from the root to the leaf nodes based on the values of each decision node in the concerned path. A sample decision tree is depicted in Figure 6, where the course of action of playing outside is taken based on whether it is raining outside or not.

**Figure 6.** A simple decision tree model.

There are a number of different variants of decision tree algorithms, from ID3 [38], C4.5 [39], and CART [40] to ensembles of decision trees (e.g., Random Forest [28]), that have been proposed to improve the accuracy of the classifiers and regressors. For a single decision tree construction, the most widely used splitting criteria, "Gini index" or "Entropy", can be mathematically stated as follows [32,38–41]:

$$p_t = \frac{N_t(T)}{N(T)}$$

$$\text{Entropy: } \operatorname{ent}(T) = -\sum_{t \in D(T)} p_t \log_2(p_t)$$

$$\text{Gini index: } \operatorname{gini}(T) = 1 - \sum_{t \in D(T)} p_t^2$$

where *T* is the data set and *D*(*T*) is the set of labels in the data set *T*. In addition, *N*(*T*) represents the number of samples and *N<sub>t</sub>*(*T*) represents the number of samples with the label *t*.
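The two criteria can be computed directly from the label counts. The sketch below is a minimal illustration with hypothetical labels, and the function names are not taken from the paper.

```python
import math
from collections import Counter

def label_proportions(labels):
    """p_t = N_t(T) / N(T) for every label t that occurs in the data set."""
    counts = Counter(labels)
    n = len(labels)
    return [count / n for count in counts.values()]

def entropy(labels):
    """ent(T) = -sum_t p_t * log2(p_t)."""
    return -sum(p * math.log2(p) for p in label_proportions(labels))

def gini(labels):
    """gini(T) = 1 - sum_t p_t^2."""
    return 1.0 - sum(p * p for p in label_proportions(labels))

# Hypothetical labels: a perfectly mixed node and a pure node.
mixed = ["play", "play", "stay", "stay"]
pure = ["play", "play", "play", "play"]

print(entropy(mixed), gini(mixed))  # both criteria are at their two-class maximum
print(entropy(pure), gini(pure))    # both criteria are zero for a pure node
```

A split is preferred when it lowers the weighted impurity of the child nodes relative to the parent node.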

A Random Forest (RF) is an ensemble of decision tree models in which each tree is built on a bootstrap sample of the training data set. Furthermore, during the construction of the decision trees, the random forest considers only a subset of the features at each split point. As a result, the constructed trees differ more from each other, which reduces the correlation among them and improves the prediction. In contrast to the single decision tree (CART), random forests do not use any pruning of the tree.

As with other averaging ensemble methods, the random forest averages the predictions of all decision tree regressors to produce the final prediction (Figure 7).

**Figure 7.** The average prediction of *n* decision trees in a random forest prediction model
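The bootstrap-and-average idea can be sketched in plain Python. This is a simplified illustration, not the implementation used in the study: each "tree" is a depth-1 regression stump, there is a single feature (so per-split feature subsampling has nothing to choose from), and the data values and function names are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Depth-1 regression tree: one threshold on x, one mean prediction per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_forest(xs, ys, n_trees, sample_size, seed=0):
    """Fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(sample_size)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def forest_predict(trees, x):
    """Final prediction: the average of the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical data: wave height grows with wind speed.
xs = [float(v) for v in range(1, 11)]
ys = [0.2 * x for x in xs]
trees = fit_forest(xs, ys, n_trees=25, sample_size=len(xs))
print(forest_predict(trees, 1.0), forest_predict(trees, 9.0))
```

Because every tree sees a different bootstrap sample, the thresholds differ between trees, and the averaged prediction becomes a staircase with many small steps rather than one large one.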

The steps of the Python implementation are described in Algorithm 3.

#### **Algorithm 3** The Python implementation of the RF algorithm

**Input:** Data set containing features and target

**Output:** Prediction of RF for the given data set


7. Predict using the obtained regressor and calculate the performance metrics on the test data set;

#### *2.3. Performance Metrics*

Four performance metrics [42] have been used for the comparison of the results.


For each sample, the error (*e<sub>i</sub>*) is the difference between the actual output (*o<sub>i</sub>*) and the predicted output (*ô<sub>i</sub>*). The mean squared error (*MSE*) is the average of the squared errors, obtained by dividing the sum of the squares of all the errors by the total number of samples (*n*).

$$MSE = \frac{\sum_{i=1}^{n} (o_i - \hat{o}_i)^2}{n} \tag{1}$$

Root mean squared error (*RMSE*) is the square root of the average squared error between the actual output (*o<sub>i</sub>*) and the predicted output (*ô<sub>i</sub>*), so it is expressed in the same units as the output. It is the most commonly used metric.

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (o_i - \hat{o}_i)^2}{n}} \tag{2}$$

Mean absolute error (*MAE*) is the average absolute difference between the actual output (*o<sub>i</sub>*) and the predicted output (*ô<sub>i</sub>*). It is also commonly used. The advantage of *MAE* is that it is less sensitive to outliers: it has no squared term, so outlying points are penalized less heavily than in *MSE* and *RMSE*.

$$MAE = \frac{\sum_{i=1}^{n} |o_i - \hat{o}_i|}{n} \tag{3}$$

The coefficient of determination, *R*<sup>2</sup>, measures the degree to which the independent variables explain the variation of the output variable. It is a measure of how well new samples can be predicted by the model through the proportion of explained variance. If *o<sub>i</sub>* is the true value of the *i*-th sample and *ô<sub>i</sub>* is the corresponding predicted value, the estimated *R*<sup>2</sup> can be calculated as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (o_i - \hat{o}_i)^2}{\sum_{i=1}^{n} (o_i - \bar{o})^2} \tag{4}$$

where

$$\bar{o} = \frac{\sum_{i=1}^{n} o_i}{n}.$$
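Equations (1)–(4) translate directly into code. The sketch below uses plain Python and a small hypothetical data set in which one prediction is far off, to show how the squared-error metrics react more strongly to an outlier than *MAE* does.

```python
import math

def mse(actual, predicted):
    """Equation (1): mean of the squared errors."""
    return sum((o - p) ** 2 for o, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Equation (2): square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Equation (3): mean of the absolute errors."""
    return sum(abs(o - p) for o, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    """Equation (4): 1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((o - p) ** 2 for o, p in zip(actual, predicted))
    ss_tot = sum((o - mean) ** 2 for o in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical wave heights (m); the last prediction is an outlier.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

print(mse(actual, predicted))   # 1.0
print(rmse(actual, predicted))  # 1.0
print(mae(actual, predicted))   # 0.5
print(r2(actual, predicted))    # approximately 0.2
```

Here a single error of 2 m yields an *RMSE* of 1.0 but an *MAE* of only 0.5, which illustrates the heavier penalty imposed by the squared term.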

#### **3. Results and Discussion**

The experiments have been executed based on the available field data from the east coast of Malaysia, which consists of 1460 samples, one input variable, wind speed (m/s), and one target variable, either wave height (m) or wave period (s). The regression results are compared among the three regressors (ANN, SVM, and RF). The Python programming environment (version 3.6) with scikit-learn (version 0.24.1) is used for the implementation.

Initially, the ANN regressor is trained using the training data and then the accuracy is validated using the testing data. The network has one hidden layer of 4 neurons, and each neuron uses the hyperbolic tangent function. The weights of the ANN are optimized using "lbfgs", a solver in the family of quasi-Newton methods. The SVM regressor is then trained using the same training data and validated using the same testing data. For SVM, the standard radial basis function (RBF) is used as the kernel function. Finally, the RF regressor is trained and validated in a similar fashion. For RF, 400 trees are used, the maximum depth of each tree is 3, and 1460 × 0.05 = 73 samples are used to train each decision tree in the ensemble. The regressors are compared using the performance metrics described in Section 2.3.

#### *3.1. Cross Validation Results*

It is common practice to keep a part of the available data for testing and the remaining part for training. However, the model evaluation then depends on the particular (training, testing) split, and the result can hide overfitting. To overcome this problem, cross-validation [43] is used to test the model's ability to predict new data that was not used to train the model and to assess how well it generalizes to unknown data.

There are many variants of cross-validation. The standard method is *k*-fold cross-validation, which splits the data into *k* subsets and runs *k* rounds. In a single round, the model is trained on *k* − 1 subsets (the training set) and its performance is validated on the remaining subset (the validation set). In the next round, a different subset is held out for validation and the remaining subsets are used for training. To reduce variability, these steps are repeated *k* times so that each subset is used exactly once for validation. Finally, the average of the *k* validation results is reported as the model's predictive performance.

In addition, it is common practice to repeat the *k*-fold cross-validation process multiple times and report the average performance among all folds and all repeats. This approach is called repeated *k*-fold cross-validation.
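Repeated *k*-fold splitting can be sketched with index lists alone. The sketch below follows the standard convention of training on *k* − 1 folds and validating on the held-out fold. The function names and the seed are hypothetical, and the fold sizes are shown for the 1460 samples, 10 folds, and 3 repeats used in this study.

```python
import random

def k_fold_indices(n_samples, k, rng):
    """Yield (training, validation) index lists for one shuffled k-fold pass."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def repeated_k_fold(n_samples, k, n_repeats, seed=0):
    """Repeat k-fold splitting with a fresh shuffle in every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        yield from k_fold_indices(n_samples, k, rng)

splits = list(repeated_k_fold(n_samples=1460, k=10, n_repeats=3))
print(len(splits))                         # 30 rounds in total
print(len(splits[0][0]), len(splits[0][1]))  # 1314 training, 146 validation
```

A model would be fitted on each training list and scored on the matching validation list, and the mean and standard deviation of the 30 scores would then be reported.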

The average and standard deviation of the regression performance for wave height, obtained using 10-fold cross-validation for the three methods, are presented in Table 3. The four performance metrics are reported as defined in the preceding section.


**Table 3.** Regression results for 10-fold cross validation with three times repeated for wave height (m).

ANN has the lowest average mean absolute error, mean squared error, and root mean squared error. It also has the highest average coefficient of determination. As a result, this approach performs best for predicting wave height.

Similar findings for wave period, obtained using 10-fold cross-validation for the same three methods, are presented in Table 4.

**Table 4.** Regression results for 10-fold cross validation with three times repeated for wave period (s).


The RF approach has the lowest mean absolute error, mean squared error, and root mean squared error. Furthermore, it has the highest coefficient of determination. As a result, this approach is the most accurate for predicting wave period.

A value of *R*<sup>2</sup> close to 1 indicates that the regression predictions fit the data almost perfectly. However, in our case, the experimental results show lower values of *R*<sup>2</sup> due to the inconsistencies in the collected data set.

#### *3.2. Graphical Representation of a Sample Training and Testing Results*

To illustrate the models' performance graphically, one (training, testing) split is chosen randomly, where the training set is 80% and the testing set is 20% of the data. In Table 5, the results for both training and testing data are shown for the wave height regression problem. The mean squared error, mean absolute error, and root mean squared error are the smallest for RF in the training data. However, these metrics are the smallest for ANN in the testing data. Furthermore, the coefficient of determination (*R*<sup>2</sup>) is largest for RF in the training data, but it is largest for ANN in the testing data. As a result, the ANN approach is superior at predicting wave height for future (unseen) data.


**Table 5.** A sample regression results for wave height (m).
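A random 80%/20% split of the kind used for this sample can be sketched as follows. The function name and the seed are hypothetical, and integer sample indices stand in for the actual (wind speed, wave height) records.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(1460))  # stand-ins for the 1460 data records
train, test = train_test_split(samples)
print(len(train), len(test))  # 1168 292
```

Comparing training and testing metrics on such a split, as in Table 5, helps reveal a model that fits the training data closely but generalizes less well.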

The regression results of ANN, SVM, and RF for wave height are depicted in Figure 8. Figure 8a shows the ANN regression findings for wave height. The black dots are actual testing data, and the green dots are the predicted values. It is clear that the predicted values follow the pattern of the actual testing data. The regression results for the SVM regressor are shown in Figure 8b. The black dots are actual testing data, and the blue dots are the predicted values. Again, the predicted values follow the pattern of the actual testing data.


**Figure 8.** The regression result for wave height (m); (**a**) shows results using ANN, (**b**) shows results using SVR, and (**c**) shows results using RF. The predicted values follow the pattern of the actual testing data in all cases. For RF, the pattern is not as smooth as in ANN or SVM, because RF is not a single decision tree method; rather it is an ensemble method that works by taking the average votes. ANN prediction is smoother than SVR.

The regression results of wave height based on the RF regressor are presented in Figure 8c. The black dots are actual testing data, and the orange dots are the predicted values. It is clear that the predicted values follow the pattern of the actual testing data. The pattern is not as smooth as SVM or ANN because RF is not a single decision tree method; rather it is an ensemble method that works by taking the average of different decision trees' predictions.

In addition, in the case of wave period, Table 6 shows the findings for both training and testing data. The mean squared error, mean absolute error, and root mean squared error are the smallest for RF in both training and testing data. Moreover, the coefficient of determination (*R*<sup>2</sup>) is the largest for RF in both training and testing data. Therefore, for future prediction of wave period, the RF method is best.


**Table 6.** A sample regression results for wave period (s).

The regression results of ANN, SVM, and RF for wave period are depicted in Figure 9. For ANN, in Figure 9a, the actual testing data are shown as black dots and the predicted values as green dots. It is clear that the predicted values follow the pattern of the actual testing data for ANN. For SVM, in Figure 9b, the actual testing data are represented by black dots, while the predicted values are represented by blue dots. It is clear that the predicted values follow the pattern of the actual testing data for SVM.

**Figure 9.** The regression result for wave period (s); (**a**) shows results using ANN, (**b**) shows results using SVR, and (**c**) shows results using RF. Unlike wave height in Figure 8, the pattern does not trend upward but is nearly horizontal, which indicates lower values of *R*<sup>2</sup> compared to wave height.

Finally, for RF, the actual testing data are represented by black dots, whereas the predicted values are shown as orange dots in Figure 9c. It is clear that the predicted values follow the pattern of the actual testing data for RF. The pattern is not as smooth as SVM or ANN because RF is not a single decision tree method; rather it is an ensemble method that works by taking the average of different decision trees' predictions.

#### *3.3. Comparison with Standard Non-Parametric Kernel Regression*

The non-parametric method of kernel regression (KR) in statistics is used to estimate the conditional expectation of a random variable. The goal is to find a non-linear relationship between two random variables, *I* and *O* [44]. The problem under consideration can be modeled as a standard non-parametric regression problem. As an example, *I* can be wind speed and *O* can be wave height. In Table 7, the previous results are compared with the results of KR for the same sample.



It is clear that KR results are not far from those of the SVM, ANN, or RF. In fact, ANN produces the best results in the case of wave height prediction.

#### *3.4. Overall Discussion*

The goal of this study is to understand the best predictive model among ANN, SVM, and RF for the prediction of wave height and period from the wind speed. Detailed experiments are performed in terms of cross-validation and sample training and testing results. A summary of the results is given in Table 8.



It is evident from Table 8 that the Random Forest (RF) method performs well across both prediction problems. It is the second best for the prediction of wave height and the best for predicting wave period. In addition, it has the advantage of being a non-parametric method. Moreover, the underlying tree structure has the advantages of interpretability and ease of use. However, to be specific, RF performs best for the prediction of wave period while ANN performs best for the prediction of wave height.

#### **4. Conclusions**

This study applied three well-known machine learning algorithms to the prediction of offshore waves. These approaches are used to predict the wave height and wave period from the given wind forces. A range of accuracies is obtained in terms of the mean absolute error, mean squared error, root mean squared error, and coefficient of determination. Overall, these performance measures indicate moderate accuracy. However, it is possible to compare the three employed methods and analyze the results. Random forest regression performs best for the prediction of wave period, and the artificial neural network performs best for the prediction of wave height. Furthermore, the results were compared with standard non-parametric kernel regression, which gave similar results.

In situations where a traditional analysis would be challenging, these machine learning techniques can produce very quick and reasonable predictions. These studies can benefit the community as a measure of safety and precaution in the critical marine environment.

In this regard, there are numerous potential future research directions. One disadvantage of using such metocean data is the existence of discrepancies. Future work should incorporate advanced algorithms in addition to the aforementioned models to address such inconsistencies. Additionally, more machine learning techniques should be used to find the ideal answer for this specific prediction problem. The time dependence of the metocean data, which can be examined using time series analysis tools, is another interesting subject of research.

**Author Contributions:** Conceptualization, all authors; methodology, all authors; software, M.A.; validation, M.A.; formal analysis, all authors; investigation, all authors; resources, all authors; data curation, M.A.U.; writing, all authors; visualization, all authors; supervision, all authors; project administration, M.A.U.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data are available upon justifiable and reasonable request.

**Acknowledgments:** The authors are grateful to Jouf University for their assistance with this research. Moreover, the authors would like to express their gratitude to all of the volunteers and anonymous reviewers for their suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Classification of Task Types in Software Development Projects**

**Włodzimierz Wysocki 1,\*, Ireneusz Miciuła <sup>2</sup> and Marcin Mastalerz <sup>3</sup>**


**Abstract:** Managing software development processes is still a serious challenge and offers the possibility of introducing improvements that will reduce the resources needed to successfully complete projects. The article presents the original concept of classification of types of project tasks, which will allow for more beneficial use of the collected data in management support systems in the IT industry. The currently used agile management methods—described in the article—and the fact that changes during the course of projects are inevitable, were the inspiration for creating sets of tasks that occur in software development. Thanks to statistics for generating tasks and aggregating results in an iterative and incremental way, the analysis is more accurate and allows planning the further course of work in the project, selecting the optimal number of employees in task teams, and identifying bottlenecks that may decide on faster completion of the project with success. The use of data from actual software projects in the IT industry made it possible to classify the types of tasks and the necessary values for further work planning, depending on the nature of the planned software development project.

**Keywords:** software; knowledge management; reasoning; information extraction; rule mining; knowledge acquisition; engineering

#### **1. Introduction**

The contemporary intensity of changes in the surrounding reality due to technological progress makes management by economic units extremely demanding and difficult. Therefore, it is necessary to react quickly and effectively to changes that create new conditions for business activity. In the era of knowledge-based economy, information, information systems co-created by it, and related information technologies are extremely valuable and inextricably linked with knowledge [1]. The usefulness of IT systems is enormous and directly influences the increase of the possibilities of management units by reflecting their innovativeness and technological potential, which is of great importance when making strategic business decisions.

Globalization, the era of e-economy, and computerization characterize modern economic activity [2]. That is why the production of IT products is so fundamental. Managing the production of IT products is a new scientific and technological discipline that has emerged at the interface between computer science and management engineering. All economic entities that base their activities on IT solutions are interested in the practical results of research in this discipline. In addition, any activity that uses IT software benefits indirectly from improvements in software development and management. The digitization of the economy results in the need for reliability and security of emerging applications, IT systems, or transaction services. A lack of reliability of IT software can cause enormous losses. That is why it is so important to develop IT software that is as reliable as possible. Therefore,

**Citation:** Wysocki, W.; Miciuła, I.; Mastalerz, M. Classification of Task Types in Software Development Projects. *Electronics* **2022**, *11*, 3827. https://doi.org/10.3390/electronics11223827

Academic Editor: Maria Liz Crespo

Received: 3 November 2022 Accepted: 19 November 2022 Published: 21 November 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

the work efficiency of teams producing IT software—characterized by a cyclical nature—is so important. Moreover, the literature highlights the dependence of the implementation effectiveness of subsequent software development stages on current project stages and the maturity level of the IT product.

The software development process is constantly evolving: both technologies and the ways of organizing the work of project teams change. In the knowledge economy era, most economic activities require systems and all kinds of IT products [3]. This forces software development to accelerate while maintaining the required quality. That is why it is so important to have the right team to implement often complex and innovative IT solutions. On the other hand, large project teams make effective communication within the group difficult and create the need to optimize work organization. To ensure this optimization, tools and IT products designed for managing software development processes are often used.

Traditional approaches to IT project management have become insufficient due to the high variability of requirements and the need to change the work organization of project teams [4]. This is particularly noticeable in innovative software development projects for the e-economy, where flexibility in the production process and originality of solutions are required. Therefore, despite the support of software development processes with increasingly better tools and systems, the optimal use of project resources, including human teams, is still a considerable challenge. Since the beginning of the 21st century, the agile approach has been increasingly used in the practice of software development. The literature on the subject reports a significant increase in the number of successful projects and better use of the resources needed in software development through the use of continuously developed agile methods. This is particularly important due to the constant changes taking place in the requirements for the software being developed [5]. Traditional methodologies cannot cope with the numerous changes that are natural in innovation projects because they are extremely static and, apart from the first phase, consist of a fixed number of tasks. The next stage of methodology development is based on iterations, which assume that new requirements are learned during the development process: requirements change and the existing ones are detailed. However, in current agile methodologies, the requirements are incomplete and their details are unknown; this is also due to the nature of innovative projects [6]. The starting point is a set of requirements in the form of epics and user stories. Then, in the course of the software development process, new requirements give rise to new tasks.

Another source of new implementation tasks is the testing of current versions, which detects defects or reveals new implementation possibilities. Automating this process with a task generator in agile methodologies allows for the creation of appropriate increments of tasks, the number and nature of which depend on the size and type of the project. On the other hand, the iterative measurement of tasks allows for planning subsequent iterations, under the assumption that certain aspects remain invariant so that estimates are as good as possible. Therefore, planning methods based on summary work are nowadays of great practical importance for the management of manufacturing processes.

Project management systems for software development collect a lot of data about tasks, together with the information needed to perform the work and to exchange information within the project team [7]. However, there is still room for improvement when it comes to providing information that will optimize IT project management. Because the data and information collected in the systems are semantically poor, it is necessary to analyze the work performed during the implementation of tasks in order to classify them correctly, which will also allow the estimation of their size, thus increasing the quality of planning. The isolation and classification of the characteristic types of tasks, and the work performed within them, will allow for the automatic execution of analyses and estimations to create insightful statistics and reports necessary for sound forecasting decisions. In addition, taking into account the knowledge about the current technological and business components will give an appropriate insight into the current state of work in the project and will allow for better planning of the software development process.

This work aims to identify the impact of the type of tasks on efficiency and to indicate those factors that positively affect the effectiveness of software development teams. The model of task types was built based on the data analysis of real software projects from the financial sector, managed by the Jira system. Connecting the model with the Jira system enables easy data acquisition for analysis and increases its commercial potential. A separate abstract layer of the model, in combination with a dedicated database, supports the possibility of creating interfaces to other IT project management systems. The article attempts to classify project tasks, thus determining the necessary expenditure on tasks of various types. In conjunction with the results of research on tasks created during the production process and cyclical works, it allows for planning projects. Linking task types with project team roles allows the simulation of project work and supports project management by identifying bottlenecks in the manufacturing process and avoiding over-employment.

The article discusses the basic concepts of the context in which software development projects are implemented and the development of methodology for the optimal management of these processes. At the same time, the principles and practices, as well as the lifecycle of software development projects, are characterized, which is important for the possibility of introducing improvements in this process. The concept of the programming process model is inspired by the agile approach to managing software as it is developed. Extending the original approach to the roles of team members and task types allows more precise work planning in the project and management of the project team, including detecting bottlenecks or unused resources. The article discusses the programming process model that is oriented toward the automatic recognition of task types, thus enabling effective support for planning tasks carried out in innovative IT projects.

#### **2. Literature Review**

Software development in IT companies is mainly carried out through appropriate organization of the work of project teams. Unfortunately, in projects involving the search and creation of new solutions where the variability and unusual nature of ideas and requirements are natural, there is a constant need to improve the management of this process. Due to the innovative nature of IT projects, project teams are burdened with high risk. This forces the search for optimal project management methods and techniques that will allow for better control and use of the company's resources. Nowadays, analytical methods and tools embedded in IT systems are a great help in supporting project management.

Traditional software development project management methodologies are currently being replaced by, or supplemented with, agile methodologies. The development of project management methods in the IT industry has resulted from dissatisfaction with the small number of successful projects. A major problem was discovered in the management of innovative projects, where rigid and strongly formalized traditional software development methodologies did not work and even hindered the introduction of changes and the proper functioning of projects [8]. Agile methods are used primarily due to the desire to reduce the number of errors, to shorten the time needed to create finished products, or to reduce the total production costs [9]. However, as noted by J. Shore and S. Warden, when making decisions regarding the implementation of agile practices, it is difficult to state unequivocally that they lead to greater success in the implementation of IT projects [10]. This is due to the variety of interpretations of the concept of "IT project success" and the fact that the use of the same methodologies in different companies from the same industry produces different outcomes [11]. There is a common view in the literature that the key to success in project management is learning about appropriate techniques and tools [12]. However, the instrumental layer of an appropriate project management methodology alone will not ensure success if it is not properly applied by the team members, because the most important part of software development projects is the people [13]. Therefore, an appropriate balance should be found between the hard (budget, schedule, and implementation time) and soft (communication, changes, motivation, and competencies) elements of the project. 
Therefore, regardless of the methodology used, project management is closely related to people management, and practice shows that most of the problems affecting project success result from the omission of the "purely human aspects" of team management [14]. Accordingly, this article aims to identify the main problems related to people management in the success of an IT project by enabling effective task planning carried out in innovative, and thus variable, software development processes.

Since the beginning of the 21st century, we have witnessed a dynamic development in agile methodologies [15]. This is due to the adaptation of the management process to projects with a high degree of innovation. In these cases, hard methodologies such as PRINCE2 and PMI PMBoK do not work due to the detailed and long-term planning stage that is not able to take into account future changes [16,17]. For projects with these characteristics, an excessive level of standardization of activities does not work, because attempts to establish the project lifecycle and detailed requirements already at the project initiation stage are often unsuccessful. Even though this approach gives a lot of comfort to the implementers, in the case of innovative projects, it does not meet the real needs that will be known only at further stages of implementation. The progress of work on an innovative project reveals the necessity to repeatedly verify many assumptions, actions, and plans, because knowledge of the developed solutions, and an actual picture of them in the shape of the final product, is acquired only as the implementation progresses. Additionally, certain paths for reaching specific results turn out to be ineffective only after the verification and testing phase. External changes should also be mentioned, i.e., changes that are beyond the control of project teams, e.g., changes in regulations and legal norms or the current market situation. Often, consistent plan implementation leads to the creation of functionalities that will not be adapted to reality or that will be useful only after costly modifications. Therefore, this article attempts to classify the characteristic types of tasks in software development projects that will allow for more effective planning and adaptation of the required work to dynamically changing reality.

All factors that make it difficult, or even impossible, to precisely define and describe the results of the project at the stage of its definition result from the following elements that occur when creating software (especially innovative software) [18,19]:


In addition, the implementation of innovative IT projects is associated with the risk of other irregularities. Kruchten (2004) lists several main problems that can affect any project, which are [20]:


The above factors and elements of project implementation management, and the inability to solve them in accordance with the traditional approach, contributed directly to the development of methods for making traditional methodologies more flexible and, as a result, prepared the ground for the Manifesto for Agile Software Development that announced in 2001 the principles for agile software development methodologies [21,22]:


The principles of an adaptive approach to software development project management indicate the direction for project teams, while specific practices are necessary for the actual implementation of work [23]. The process structure and specific practices create a minimal flexible framework for self-organizing teams. IT tools are essential to accelerate software development and to reduce costs. Contracts are crucial for the development of the customer–supplier relationship. Documentation supports communication [24]. However, the key issue is to provide the project team with feedback to answer the question of where the team currently is in the software development process [25]. This is possible thanks to iterative and incremental software lifecycles. The essence of the iterative process is the frequent delivery of working pieces of software (successive increments) that implement selected sets of functions that together make up the usefulness of the final product. The iterative software development cycle leads to a management style where long-term plans are fluid, while a stable plan can be created for a short period of time. Iterative and incremental software development leads to completely new relationships with the business client and different principles of the project team's functioning.

The concept of the iterative and incremental programming development cycle as a remedy for dilemmas of the e-economy era has resulted in the creation of new methodologies for IT project management [26]. These methodologies, called agile, do not cut off completely from document-oriented formalized traditional methodologies, but have specific features (adapted to the requirements of modern software development projects) [27,28]:


Organizations are increasingly implementing digital transformation plans, due to the threat of disruptions, in order to keep up with the accelerating pace of business. Agile software development plays a huge role in this process. Many of the digital workflows in use today are based on agile principles. Thanks to a flexible, scalable IT infrastructure, cloud computing is evolving in line with the needs of agile software development. The DevOps concept removes the traditional distinction between software development and operations. Software is used as a tool in SRE–DevOps implementation, systems management, and the automation of operational duties [29]. CI/CD methodologies acknowledge that software will change frequently and provide tools that help developers deliver new code faster [30]. Agile methodologies come in a variety of forms to meet the needs of any project [31]. Even though agile approaches differ, all of them are based on the key ideas contained in the agile approach. For this reason, any framework or behavior that complies with these principles is termed agile. Regardless of the specific agile approaches that a team decides to apply, the benefits of an agile methodology can only be fully realized through the collaboration of all parties involved [32]. In recent years, a large number of agile software development project management methodologies have emerged. The most popular are [33–36]:


time delivery can never be compromised and that design modifications are always to be expected. This belief is based on eight principles and a business-based methodology.

• Lean development, i.e., "lean" software development. The basic idea of the approach is the elimination of losses, understood as elements that do not add any value to the product. The aim of such action is to deliver the finished product to the customer as soon as possible. Lean software development is based on the values and principles of adaptive project management. The term "lean software development" is attributed to Mary Poppendieck and Tom Poppendieck, who, in their book *Lean Software Development: An Agile Toolkit,* presented, among others, the seven main principles of lean management and a set of 22 techniques supporting the approach.

Agile management methodologies are a group of methodologies that are characterized by an adaptive and variable approach to managing software development projects. Additionally, agile methodologies were developed much earlier than the agile manifesto itself. The first works on adaptive project management methods date back to the 1980s (an example is the rapid application development methodology) and the concept of agile methodologies was introduced in the mid-1990s [37]. The idea of a time frame is a well-defined process dedicated to software development control at the lowest level in an iterative cycle with several review points. Reviews help ensure the quality and efficiency of software development. By delivering the software on time at the lowest level, the timely production at the highest level (i.e., the project level) is ensured [38]. The basic principle of the project plan is to prepare a schedule of planned increments and, within them, the planned time frames, which will create a complete project schedule capable of changes with emerging new requirements. The use of the time frame technique, together with the MoSCoW prioritization technique, ensures no delays in project implementation and the delivery of ready-made software that will meet business goals within a given time [39].
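The interplay between a time frame and MoSCoW prioritization can be illustrated with a minimal sketch. It assumes a simple greedy rule (highest priority first, skip what no longer fits); the item names, estimates, and the function `plan_timebox` are hypothetical and are not taken from the cited methodologies.

```python
# Lower number = higher priority. "Won't have" items are never planned.
MOSCOW_ORDER = {"must": 0, "should": 1, "could": 2}


def plan_timebox(items, capacity_hours):
    """Greedily fill one time frame.

    items: list of (name, priority, estimate_hours).
    Returns the names of the items that fit, in planning order.
    """
    candidates = [item for item in items if item[1] in MOSCOW_ORDER]
    planned, remaining = [], capacity_hours
    # sorted() is stable, so items of equal priority keep their input order.
    for name, priority, estimate in sorted(candidates, key=lambda i: MOSCOW_ORDER[i[1]]):
        if estimate <= remaining:
            planned.append(name)
            remaining -= estimate
    return planned


backlog = [
    ("export", "could", 8),
    ("login", "must", 16),
    ("audit log", "should", 12),
    ("theming", "wont", 20),
]
print(plan_timebox(backlog, capacity_hours=30))
```

With a fixed capacity, the lower-priority items are the ones that drop out, which is how the deadline of the time frame is protected in this simplified reading.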

Projects with a high degree of innovation are very difficult to include in a complete schedule and scope of work. Therefore, adaptive methodologies describe functionalities (i.e., independent elements of the subsystem), which in subsequent releases can be quickly changed and handed over for implementation. Agile methodologies, as opposed to traditional methodologies, rule out the validity of long-term planning [40,41]. Therefore, the plans are speculative and not deterministic. This allows you to adapt to all types of changes that appear during the implementation of the project. Additionally, the distinguishing factor of agile methodology is a strong emphasis on the cooperation and integration of the project team, because only in this case is the smooth flow of information and effective communication ensured. The general scheme of the project lifecycle in the case of agile methodologies is based on five phases, indicated by J. Highsmith [18,42]:


Agile project management methodologies require an appropriate level of project maturity from organizations and project teams [43]. In addition, agile methods continue to be improved in order to respond most effectively to the needs of managing innovative software development projects. Hence, there are many methods that are still searching for new, better solutions, although they already seem to be better adapted than the classic methods to dynamically changing project environments. Implementation based on iterations makes it possible to adapt effectively to signals that come from both inside the project and from the external environment.

Classifying specific types of tasks in software development projects will make it possible to determine the requirements, the time, and the expenditure needed to implement them. This will allow for more effective work on further adaptive project planning methods. Linking task types with the roles of the project team will make it possible to simulate project work, which will support project management by identifying bottlenecks in the software development process. In addition, it will help to avoid over-employment and will support quick "what if" simulations. At the same time, the introduction of the task classification in terms of business and technological components, in conjunction with the employee competency model, will allow for automatically selecting the composition of the project team and optimally managing the team; this has the highest priority in agile methodologies. This is of great importance for the optimization of software development process management and supports the development of better methods in response to the dynamism of real-world changes. The resulting model will have replicative and predictive capabilities to plan the project, to simulate it during changes, and to detect bottlenecks during the entire process. In this article, experiments were carried out on many real IT projects to classify task types and to attempt to find the relationship between the nature of the planned project and the specificity and number of these tasks.

#### **3. Materials and Methods**

The model of task types was built on the basis of the data analysis of real software projects from the financial sector, managed by the Jira system. Table 1 contains summary data of these projects. In four projects, the teams worked in accordance with traditional methodologies, to which more and more elements of the agile approach were introduced. In two projects, the teams worked in accordance with the scrum approach. It is quite a large data set that allows for the analysis of phenomena related to modern manufacturing processes. The total number of project issues exceeds 30,000, which makes it possible to use artificial intelligence methods. The combined duration of the projects is approximately 22 years; this does not mean that the records are historical, because project teams worked on most projects in parallel.


**Table 1.** Projects, the data of which were used for research on task types.

In Jira, teams use issues to describe and track specific tasks to be done. Issues are the basic building blocks of projects. The issue can be of a specific type. The most common type of issue is task, which in practice is used for many purposes. The second most frequent type is bug, which describes the defects found in the developed software and the work needed for fixing them. The full list of types used in the projects with their frequency is shown in Figure 1.

**Figure 1.** The occurrence frequencies of issue types in the researched projects.

The most commonly used issue types are task, bug, and subtask. The summary determines the idea of the actions to perform. The description field contains a detailed description of the work to do. A significant feature of the issue is its state. Tracking issue state changes allows for the recognition of the performed work. The workflow describes the values of the state field and the allowed transitions between them. Figure 2 shows a typical simplified workflow used in the analyzed projects.

**Figure 2.** A typical simplified issue workflow used in projects.
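A workflow of this kind can be represented as a mapping from each state to the states it may move to. The sketch below is illustrative only: the state names and transitions are assumptions, not the exact workflow of the analyzed projects, and `is_valid_history` is a hypothetical helper.

```python
# Hypothetical states and transitions; the workflow in Figure 2 may differ.
ALLOWED_TRANSITIONS = {
    "Open": {"In Progress"},
    "In Progress": {"In Review", "Open"},
    "In Review": {"In Testing", "In Progress"},
    "In Testing": {"Closed", "In Progress"},
    "Closed": {"Open"},  # reopening a closed issue
}


def is_valid_history(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(
        nxt in ALLOWED_TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


happy_path = ["Open", "In Progress", "In Review", "In Testing", "Closed"]
print(is_valid_history(happy_path))
print(is_valid_history(["Open", "Closed"]))
```

Keeping the workflow as data rather than code makes it straightforward to swap in the workflow of another project.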

As shown in Figure 2, transitions between issue states correspond to performing basic actions in implementing new system functions or fixing bugs. For example, the DevOps (development and operations) methodology helps to establish cooperation between developers and operators to automate the continuous delivery of new software, which is expected to contribute to shortening the development cycle and to creating high-quality software [44,45]. Another development of DevOps is the concept of development, security, and operations (DevSecOps), which at the same time is designed to integrate security methods with the software development process, where security measures are built in to ensure the integrity and availability of the application [46].

A valuable feature of the Jira system is storing the history of changes in the value of issue elements, including the state. Storing history supports the tracking of issue execution. Not all issues use the state to track work. The Jira system allows employees to register working hours devoted to work on an issue. For some types of work, there is no need to keep track of their state, as they are repetitive works performed as needed. In this case, the possibility of registering working hours is sufficient.
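The two sources of information described above, the change history and the registered working hours, can be modeled with two small record types. This is a minimal sketch under stated assumptions: the classes `Change` and `WorkLog` and their fields are hypothetical and do not mirror the Jira data model or API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    """One entry of an issue's change history (hypothetical structure)."""
    field_name: str   # which issue element changed, e.g. "state" or "assignee"
    old: str
    new: str
    at: datetime


@dataclass(frozen=True)
class WorkLog:
    """One registration of working hours on an issue (hypothetical structure)."""
    author: str
    hours: float
    at: datetime


def state_changes(history):
    """Keep only the state transitions, ordered by time."""
    return sorted((c for c in history if c.field_name == "state"), key=lambda c: c.at)


def logged_hours(worklogs):
    """Total hours registered on the issue."""
    return sum(w.hours for w in worklogs)
```

An issue with an empty list of state changes but a non-empty work log corresponds to the repetitive work mentioned above, for which registering hours is sufficient.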

The presented approach is based on a precise division of the development process into activities and roles in line with the RUP methodology, adapted to hybrid and agile processes [47,48]. The previous series of articles described the agent–object model of the manufacturing process (AGOMO), which was first used to assess the maturity of RUP processes, then to plan hybrid water–scrum–fall processes [49,50]; it became an inspiration for the research presented here.

As we showed in the previous section, the types of Jira issues are insufficient to clearly define the work's purpose and type. The model was created in order to fill the gap that prevents linking the work carried out with the composition and competencies of the project team. The combination of these two areas will allow for a more precise quantitative analysis of the projects' works and the detection of the causes of the observed phenomena. It is a way to optimize the efficiency of development processes and to increase the level of maturity of project teams and organizations [51].

Software projects consist of tasks that a project team performs to build an IT system [52,53]. The task represents the Jira issue. Each of the tasks performed has a clearly defined goal. This goal can be, for example, implementing a requirement, testing a component or the entire module, administering environments, or managing a project; it determines its type. Members of the project team perform the tasks. Roles define the competencies and responsibilities of those carrying out the tasks. One person can have many roles; one role can have many people.

The task of implementing system functions requires performing some essential subtasks. They include implementation, i.e., the creation of component source code, code review, creation of unit tests, testing and verification of functions, and acceptance tests. Such tasks correspond to the issue, the state of which changes according to the workflow shown in Figure 2. The implementation tasks that consist of subtasks are named stateful tasks in the model.

Each task and subtask records the number of hours worked by project team members. Task and subtask types make it possible to associate effort with the roles of project team members. This helps to recognize bottlenecks or over-employment in completed IT projects and to avoid them during project planning.
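As an illustration of how effort linked to roles can expose bottlenecks or over-employment, the sketch below sums subtask hours per role and compares them with the available capacity. The mapping `SUBTASK_ROLE`, the capacity figures, and both function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical mapping from subtask type to the role that performs it.
SUBTASK_ROLE = {
    "development": "programmer",
    "code review": "programmer",
    "testing": "tester",
}


def effort_by_role(subtasks):
    """subtasks: iterable of (subtask_type, hours). Returns total hours per role."""
    totals = defaultdict(float)
    for subtask_type, hours in subtasks:
        totals[SUBTASK_ROLE[subtask_type]] += hours
    return dict(totals)


def utilization(effort, capacity):
    """Ratio of required effort to available hours for each role in `capacity`."""
    return {role: effort.get(role, 0.0) / capacity[role] for role in capacity}


effort = effort_by_role([("development", 30.0), ("code review", 10.0), ("testing", 50.0)])
print(utilization(effort, {"programmer": 80.0, "tester": 40.0}))
```

In this reading, a ratio above 1 points to a possible bottleneck for that role, while a ratio well below 1 points to possible over-employment.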

Issues whose states do not change during the project are represented in the model by recurring tasks. They have been called recurring because a single task of this type can be performed as needed for the project's duration. For example, this might be a project development management task performed by an administrator when defects arise or when new software components need to be configured. For recurring tasks, you can calculate the average labor intensity in a period and, on this basis, assume what the fixed costs of the project are [54]. Genuinely recurring tasks are, for example, meetings such as daily stand-up meetings, workshops with clients, or steering group meetings.
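The average labor intensity of a recurring task can be computed from its registered hours. The sketch below assumes a calendar month as the period and averages only over months in which work was logged; both choices, and the function name, are assumptions of this example.

```python
from collections import defaultdict
from datetime import date


def average_monthly_effort(worklogs):
    """worklogs: iterable of (day, hours) for one recurring task.

    Returns the mean number of hours per month, counting only months
    that contain at least one work log entry.
    """
    per_month = defaultdict(float)
    for day, hours in worklogs:
        per_month[(day.year, day.month)] += hours
    if not per_month:
        return 0.0
    return sum(per_month.values()) / len(per_month)


stand_ups = [
    (date(2022, 1, 3), 2.0),
    (date(2022, 1, 17), 4.0),
    (date(2022, 2, 1), 6.0),
]
print(average_monthly_effort(stand_ups))
```

Multiplying this average by the planned number of months gives a rough estimate of the fixed effort of the task.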

The authors analyzed the tasks of the studied projects in terms of their type. The results, in the form of graphs showing the percentage number and the sizes of stateful and recurring tasks, are shown in Figure 3. The chart on the left shows the ratio of the number of stateful tasks to the recurring tasks in the projects. The chart on the right shows the ratio of the effort of stateful tasks to recurring tasks.

**Figure 3.** The number and effort of stateful and recurring tasks in the researched projects.

The case of the P2 project is notable: stateful tasks account for over 90% of all tasks, while their effort is only slightly over 40% of the total. This is very interesting, because stateful tasks are responsible for implementing the functions of the built system and for fixing the defects detected in it. From the developer's point of view, this workload should probably be the greatest. From the point of view of a researcher of software development processes, this phenomenon arouses great curiosity. In order to satisfy it, it is first necessary to precisely define the types of tasks in the projects, which this article does on the basis of real projects carried out in practice.
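The difference between the share by number and the share by effort can be reproduced with a small calculation. The data below are invented purely to show how the two ratios can diverge; they are not the P2 data, and `stateful_shares` is a hypothetical helper.

```python
def stateful_shares(tasks):
    """tasks: non-empty list of (kind, hours), kind is "stateful" or "recurring".

    Returns (share of stateful tasks by number, share of stateful tasks by effort).
    """
    stateful_count = sum(1 for kind, _ in tasks if kind == "stateful")
    stateful_hours = sum(hours for kind, hours in tasks if kind == "stateful")
    total_hours = sum(hours for _, hours in tasks)
    return stateful_count / len(tasks), stateful_hours / total_hours


# Invented example: nine small stateful tasks and one large recurring task.
example = [("stateful", 5.0)] * 9 + [("recurring", 55.0)]
print(stateful_shares(example))
```

A few long-running recurring tasks, such as regular meetings, are enough to dominate the effort even when almost all issues are stateful.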

#### **4. Results**

The classification of task types is based on the division of tasks into stateful and recurring tasks. By definition, stateful tasks represent work related to the implementation of system components and the fixing of defects. Stateful tasks are processed by a subtasking algorithm. Recurring tasks are responsible for the remaining works. Recurring tasks can be classified on the basis of summary and extended-text descriptions included in the issue. Initially, these tasks were classified manually. Subsequently, simple classification algorithms were created based on keyword searches. Recurring tasks are also broken down into subtasks that correspond to the registered work.

Figure 4 shows an activity diagram of the classification algorithm. The algorithm first checks how many times the issue has changed its state; if it has done so at least three times (more than open and close), the workflow is the basis for dividing the task into subtasks. The split algorithm tries to create subtasks (e.g., development, testing, code review). For this, it uses the value of the state before and after the change, the time of the change, the role of the person, and the registered work, and then assigns the completed work to the created subtask. The result is a stateful task.
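The idea of the split can be sketched as follows, in a deliberately simplified form: each state the issue enters is mapped to a subtask type, and hours registered while the issue stayed in that state are assigned to that subtask. The mapping `STATE_TO_SUBTASK` is an assumption, and the role of the person, which the described algorithm also uses, is ignored here.

```python
from datetime import datetime

# Hypothetical mapping from the state an issue enters to the work being done.
STATE_TO_SUBTASK = {
    "In Progress": "development",
    "In Review": "code review",
    "In Testing": "testing",
}


def split_into_subtasks(transitions, worklogs):
    """transitions: time-ordered list of (timestamp, old_state, new_state).
    worklogs: list of (timestamp, hours).
    Returns a dict of hours per subtask type.
    """
    effort = {}
    for i, (start, _old, new) in enumerate(transitions):
        subtask = STATE_TO_SUBTASK.get(new)
        if subtask is None:
            continue  # states such as "Closed" do not represent work
        # The state lasts until the next transition, or indefinitely if it is the last.
        end = transitions[i + 1][0] if i + 1 < len(transitions) else datetime.max
        hours = sum(h for t, h in worklogs if start <= t < end)
        effort[subtask] = effort.get(subtask, 0.0) + hours
    return effort


transitions = [
    (datetime(2022, 1, 1), "Open", "In Progress"),
    (datetime(2022, 1, 5), "In Progress", "In Testing"),
    (datetime(2022, 1, 8), "In Testing", "Closed"),
]
worklogs = [(datetime(2022, 1, 2), 6.0), (datetime(2022, 1, 3), 4.0), (datetime(2022, 1, 6), 3.0)]
print(split_into_subtasks(transitions, worklogs))
```

Because a state can be entered more than once (for example, when testing sends an issue back to development), hours for repeated visits are accumulated under the same subtask type.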

If the task did not change state or if the algorithm did not detect any subtasks, the task type is determined by searching for keywords in the text description. When this fails, the tasks go to a spreadsheet, where they are manually classified. Then, the algorithm checks who was working on the task. If many people worked on a task on one day, it is a recurring group task (e.g., stand-up and other meetings). If one person worked on a task on one day, it is a recurring individual task. The way of dividing the recurring task into subtasks depends on the individual/group classification. If the programmer and the tester alternately execute a task, it is stateful; the algorithm divides it into subtasks according to the roles of the team members.

**Figure 4.** Algorithm of task classification.
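The decision steps described above can be sketched as a single function. This is an interpretation rather than the authors' implementation: the keyword lists, the order in which the group and alternation checks are applied, and the name `classify` are assumptions. A task type of `None` stands for "send to manual classification".

```python
from collections import defaultdict

# Hypothetical keyword lists for two recurring task types.
KEYWORDS = {
    "meeting": ("stand-up", "workshop", "meeting"),
    "administration": ("deploy", "configure", "environment"),
}


def classify(state_change_count, summary, worklog):
    """worklog: time-ordered list of (day, person, role) entries.

    Returns (task kind, task type or None).
    """
    if state_change_count >= 3:  # more than open and close
        return "stateful", None

    text = summary.lower()
    task_type = next(
        (name for name, words in KEYWORDS.items() if any(w in text for w in words)),
        None,
    )

    people_per_day = defaultdict(set)
    for day, person, _role in worklog:
        people_per_day[day].add(person)
    if any(len(people) > 1 for people in people_per_day.values()):
        return "recurring group", task_type

    roles = [role for _day, _person, role in worklog]
    if any({a, b} == {"programmer", "tester"} for a, b in zip(roles, roles[1:])):
        return "stateful", None  # programmer and tester worked alternately

    return "recurring individual", task_type


print(classify(0, "Daily stand-up", [("2022-01-03", "ann", "programmer"),
                                     ("2022-01-03", "bob", "tester")]))
```

Substring matching is used here only for brevity; it shares the accuracy limits of keyword search discussed in the text.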

The algorithm for breaking stateful tasks into subtasks is very complex due to the many ways in which users use Jira to work on a project issue. Searching for keywords is not very accurate, and supporting it with manual classification makes it impossible to use the classification in practice. Therefore, it is planned to replace these solutions with NLP (natural language processing) classification. However, manual classification has some advantages. During the work on the algorithm, the manual analysis of recurring tasks detected more than 50 types of tasks that were aggregated into three groups containing 14 main types of tasks.

Stateful tasks consist of tasks for implementing new features and for fixing bugs reported by customers and testers. Recurring tasks are divided into three main groups: implementation, meetings, and organizational tasks. A complete list of the main task types is provided in Table 2. The task types listed in the table form a hierarchical structure divided into types and groups.

The task classification algorithm, by classifying tasks performed alternately by programmers and testers as stateful tasks, increased the number and effort of stateful tasks. An updated version of the graphs in Figure 3 is shown in Figure 5. The changes are significant. For example, for project P3, the percentage of stateful tasks increased from 58% to 66% and the percentage of stateful task efforts increased from 24% to 43%.

The development of task types and an algorithm enabling automatic classification of the researched projects allowed for a detailed analysis of the tasks of the researched projects. Below, we present effort charts (Figure 6) for six groups of task types in the researched projects.

The P1 project was implemented in a hybrid methodology. It is the longest of the researched projects, with a duration of more than 8 years. The project went through many phases of implementation, delivery, and maintenance. Therefore, the values and the ratio of effort in the project were averaged. They can serve as a benchmark when compared with other hybrid and traditional projects.

The P2 project was carried out in the hybrid methodology. It is characterized by a large number of hours used for various types of meetings. On the other hand, the very small scope of work in the management field suggests that the project may have been managed collectively. The very little work involved in fixing defects found by clients indicates that time spent in meetings was well spent.


**Table 2.** Types of stateful and recurring tasks used by the model.

**Figure 5.** Updated numbers and effort of stateful and recurring tasks.

The P3 project was carried out in a hybrid methodology. It has high administrative and management costs. The ratio of repairing defects found by customers to repairing defects found by testers is interesting. There is no such tendency in the other projects, except perhaps for the P4 project. This may indicate inaccurately defined requirements or difficult contact with the customer.

**Figure 6.** Effort of the six main groups of task types in projects.

The P4 project was implemented in the hybrid methodology. The distribution of the workload of tasks in the P4 project differs from others in terms of a very large amount of work to fix the defects detected by customers. The cost of this work is one-and-a-half times greater than that for implementation tasks and five times greater than the cost of repairing defects detected by the development team. The situation is similar to the P3 project, only the management and development costs are much lower. It is possible that the P4 project is in the maintenance phase of a project, with a large number of defects.

The P5 project was implemented using the scrum methodology. At the high level of abstraction given by the task type group analysis, no difference can be seen between this project and the hybrid projects. What is important is the lack of resources to rectify defects reported by customers. It is very possible that the project did not enter the customer implementation phase and was not put into production.

The P6 project was also implemented using the scrum methodology. More than half of the work was devoted to product implementation. The product has likely been delivered to the customer, as indicated by 8% of the efforts to fix defects reported by customers. The meetings have a significant share in the work on the project, which is consistent with the scrum methodology. The outlays for the maintenance and management of the project are small.

Graphs of total efforts by groups of task types give a very synthetic view of the projects. We can get a deeper look at the differences between projects by focusing attention on the detailed effort of cyclic task types. Figure 7 shows radar charts of recurring task effort, shown as a percentage of total recurring effort. The details of the R-DEV group from Figure 6 are shown here. The left side of the graph shows the labor intensity of meetings, cooperation with the client, administrative work, and management. The right side of the chart shows the expenditure on recurring development tasks. The charts differ from each other to reflect the characteristics of the projects. The charts show the "fingerprints" of the projects, making it possible to identify their detailed characteristics and to compare them with each other.

In most of the charts, the left organizational and management side dominates over the right developer side. The P1, P2, P3, and P6 projects follow a similar pattern in the recurring works chart, which indicates greater expenditure on meetings and management than on development work. Analysis and design play a large role in the P3, P5, and P6 projects.

**Figure 7.** Structure of expenditure on recurring tasks in the researched projects.

Comparing the details of recurring work across projects makes it possible to determine the minimum, average, and maximum values of recurring work in each area of expenditure, which will enable the use of linguistic variables in project planning. On the basis of static summary graphs, the nature of a project can be determined and projects can be compared with each other. The introduced task types can also help to answer the question of how tasks are created and performed during the development process.

The previous chapters introduced the division of tasks performed in the software process into stateful tasks related to the implementation of new tasks and recurring ones, with works performed periodically. This division may lead to the assumption that stateful tasks related to new features are mainly created at the beginning of the project. As for bugs, it would be prudent to assume that they arise after implementing certain requirements or features. The very name of recurring tasks (e.g., daily stand-ups) suggests that they are performed in equal intensity throughout the manufacturing process. Creating new tasks is very important when planning the development process, because the implementation of tasks cannot be started when they have not yet been created. This fact limits the size of the planned project team.

The model of task types and the research carried out on actual projects show what this looks like in reality. Figure 8 shows when the stateful tasks in the researched projects were created. The charts do not show the number of created tasks, but rather their effort, which better reflects the total size of the tasks created in a month. DEV denotes new functions, BUG-DEV denotes defects detected by testers, and BUG-CLI denotes defects detected by the customer (see Table 2 for details).

Most of the charts show that new features are developed throughout the life of the project. This phenomenon may come as a big surprise, especially since the P1–P4 projects were carried out according to a hybrid approach in which traditional practices had a large share. The situation in the long-term P1 project is understandable, because it consists of many phases and, in each of them, new functions were created to be implemented. The P2 project is an exception among the examined projects, because new functions are created during the first 8 months of the development process and then, for 24 months, they are implemented and repairs of defects detected by the development team are created.

**Figure 8.** Stateful tasks created in the development process of the researched projects.

The graphs of the P3–P6 projects show, however, that new functions are created until the very end of the manufacturing process, although to a lesser extent. The reasons for this are interesting. Does it result from a long-term process of acquiring new requirements parallel to the development process? Or, maybe the reason is getting to know the details of the requirements obtained earlier? Unfortunately, the data placed in the Jira system do not answer these questions directly, because they do not take into account the requirements engineering processes. An interesting phenomenon is also the periodic increase and decrease in both the work on new functions and the repair of detected defects.

Work on the implementation of stateful tasks proceeds in a different rhythm than the creation of new stateful tasks. The number of man-hours used per month for stateful assignments depends on the number of people on the project team and their assigned roles. It should be taken into account that the team also performs recurring tasks. Figure 9 shows the monthly expenditures on the execution of stateful works in the researched projects.

**Figure 9.** Work on stateful tasks in the researched projects.

The charts show that, in most projects, fixes of bugs detected by the development team are delayed in relation to the implementation of the functions of the developed software. Even more delayed is the repair of errors detected by the client, because they are the last detected. This phenomenon is best seen in the graphs of the P3 and P6 projects. Comparing the work charts with the charts of the created tasks (see Figure 8) gives better insight into the project. For example, in the P2 project, after creating many thousands of new features, there is a break for several months, and then defects detected by the project team are created. The P2 project work graph shows that there was no break in the project; the work was less intensive, but at that time defect fixes and then the implementation of new features were ongoing.

The answer to the question of what the effort of recurring work in the software development process looks like can be found in Figure 10. The graphs show a similar periodicity as the graphs of the effort of stateful tasks. The source of these periodic disturbances is very interesting. Probably, to some extent, the outlays for stateful and recurring work are complementary, i.e., increases in the first graph correspond to decreases in the second.

**Figure 10.** Work on recurring tasks in the researched projects.

The model of task types, in conjunction with the roles of the project team members, allows for the analysis of the work carried out in the project. On this basis, it is possible to trace the implementation of tasks and to recreate the composition of the project team. The next step in the development will be the possibility of simulating the work in the software development process or of planning the composition of the project team based on task plans.

The project team is described by its roles and the number of positions in each role. The analysis of actual project data is the source of the model, hence the lack of certain roles in the team, e.g., the role of an analyst. The current set of roles is defined as follows: ADM (administrator), PM (team leader), PRG (developer, who also deals with design and collection of requirements), and TST (tester).

The adopted set of roles is not consistent with the agile approach represented by the scrum methodology. The scrum team chiefly consists of three roles: the scrum master, the product owner, and the development team. Developers are everyone belonging to the development team who are involved in software development. However, in practice [21], it is worth distinguishing the role of a tester (whose main activities are software testing and quality assurance), a programmer (who creates production code), and an administrator (who manages development and production environments and tools supporting the development and implementation of emerging software).

Figure 11 shows the concept of the relationship between task types and the roles of project team members. It consists of task types, roles, and two kinds of connections: simple and proportional. A simple link between a task type and a role determines that tasks of a specific type are performed by members of the project team with that role. For example, DEV-PRG tasks are performed by people with the PRG role and BUG-CLI-TST tasks are performed by people with the TST role.

**Figure 11.** Relationship between the main types of tasks and the roles of project team members performing them.

A proportional link exists between meeting tasks and all roles of the project team. The idea of the proportional connection is to divide the man-hours allocated to meetings among people from the project team in proportion to the number of people performing a given role. For example, if 100 person-hours of meetings are recorded in a month and programmers account for 72% of the work in that month, then 72 meeting person-hours are added to the workload of the developers' tasks for that month.
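The proportional link amounts to a weighted split of the meeting hours. The following Python sketch illustrates the arithmetic; the function name `allocate_meeting_hours` and all weights other than the 72% share of programmers are hypothetical values chosen so that the example is complete.

```python
def allocate_meeting_hours(meeting_hours, role_weights):
    """Split meeting person-hours among roles in proportion to their weights.

    role_weights: mapping from role code to a non-negative weight, e.g., the
    share of work or the head count of that role (assumed input).
    """
    total = sum(role_weights.values())
    if total == 0:
        raise ValueError("role_weights must not sum to zero")
    return {
        role: meeting_hours * weight / total
        for role, weight in role_weights.items()
    }


# Only the PRG share (72%) comes from the text; the rest is illustrative.
weights = {"PRG": 72, "TST": 18, "PM": 6, "ADM": 4}
shares = allocate_meeting_hours(100, weights)
print(shares)
```

With these weights, programmers receive 72 of the 100 meeting person-hours, as in the example above, and the shares of all roles add up to the recorded total.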

The monthly workload of tasks performed by each role is divided by the adopted average number of hours worked by a person per month. The result is the number of positions for that role, quantized to 1/4 and rounded up. The adopted average of 165 h of work per month takes into account only non-working days; it does not account for holidays or absences due to illness. This number can be changed freely.
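The conversion from workload to positions can be written down directly. The sketch below is one possible reading of the rule (divide by 165 h, then round up to the next quarter position); the function and parameter names are hypothetical.

```python
import math

HOURS_PER_MONTH = 165  # average adopted in the text; can be changed freely


def positions_for_role(monthly_hours, hours_per_month=HOURS_PER_MONTH, step=0.25):
    """Number of positions for a role, rounded up to a multiple of `step`."""
    if monthly_hours < 0:
        raise ValueError("monthly_hours must be non-negative")
    return math.ceil(monthly_hours / hours_per_month / step) * step


print(positions_for_role(1000))  # 1000 / 165 = 6.06..., rounded up to 6.25
print(positions_for_role(165))   # exactly one position
```

Rounding up rather than to the nearest quarter means the reconstructed team never has less capacity than the recorded workload requires.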

Linking task types to the roles of project team members enables the approximation of the team composition needed to complete the project. Table 3 presents the monthly work of the P6 project, broken down by the main types of tasks and the composition of the project team reconstructed on their basis.


**Table 3.** Monthly work of the P6 project and reconstructed composition of the project team.

The number of people on the project team goes up and down in line with monthly stateful and recurring tasks. The team has been growing since the beginning of the project. In July and October 2020, it was at its highest; the number of positions was 10.25. After that, the team shrank and the project ended in December 2020. There was a maximum of 6.5 programmer and 1.75 tester positions in the project. It is interesting that, in April and October 2020, there were two positions for the project manager in the team. Often, projects employ people to perform organizational and support work, for example, managing issues in the Jira system, maintaining Kanban boards, or creating reports.

#### **5. Model Verification**

The classification algorithm, according to Figure 4, consists of a method of dividing into subtasks and assigning task types based on a set of keywords built from the manual classification of P3 project tasks. Since the total number of issues in all projects was large, it was difficult to classify them manually. The research is intended to serve as a proof of concept, so the same set of keywords was used to classify the tasks of the other projects.
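Keyword-based type assignment can be illustrated with a few lines of Python. This is a sketch only: the keyword set and the type labels below are hypothetical, since the keywords actually derived from the P3 project are not listed here, and the substring matching is deliberately naive.

```python
# Hypothetical keyword set; the keywords built from the P3 project are not given.
KEYWORDS = {
    "MEETING": ("stand-up", "meeting", "retrospective"),
    "BUG": ("bug", "defect", "fix"),
    "DEV": ("implement", "feature"),
}


def classify_by_keywords(description):
    """Return the first task type whose keyword occurs in the description.

    Returns None when no keyword matches, in which case the task would be
    passed on to manual classification.
    """
    text = description.lower()
    for task_type, words in KEYWORDS.items():
        if any(word in text for word in words):
            return task_type
    return None


print(classify_by_keywords("Daily stand-up"))
print(classify_by_keywords("Fix login defect"))
print(classify_by_keywords("Codzienne spotkanie"))  # non-English text: no match
```

The last call uses a non-English phrase chosen purely for illustration; an English-only keyword set cannot match it, so the task falls through to manual classification.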

The method of dividing tasks into subtasks is closely related to the classification of tasks, because the types of subtasks determine, for example, whether implementation or testing has been performed. If a subtask is not correctly identified, labor intensity will not be assigned to it and it will not be included in the accuracy indicator $EA_p$. Subtask division is complex, because people who record work in the Jira system do it in many different ways. The method of subtask division was first developed for the P3 project and then adapted to the other projects. The primary indicator of the effectiveness of project task classification, $EA_p$, is the ratio of the labor intensity of the recognized subtasks ($E_p^R$) to the total labor intensity of the project ($E_p$):

$$EA_p = \frac{E_p^R}{E_p} \cdot 100\%$$

where $p$ is the project; $EA_p$ is the index of effectiveness of classification of labor intensity of the tasks of project $p$; $E_p^R$ is the labor intensity of the correctly identified subtasks of project $p$; $E_p$ is the total labor intensity of project $p$.

To further verify the accuracy of task classification, a manual check of a sample of 100 randomly selected tasks was conducted for each project. This yields the accuracy rate $NA_p^M$, defined as the ratio of the number of correctly classified tasks in the sample ($N_p^{MC}$) to the number of tasks in the sample ($N_p^M$):

$$NA_p^M = \frac{N_p^{MC}}{N_p^M} \cdot 100\%$$

where $p$ is the project; $NA_p^M$ is the indicator of the effectiveness of the classification of the number of tasks of project $p$ in the sample; $N_p^{MC}$ is the number of correctly identified tasks of project $p$ in the sample; $N_p^M$ is the number of tasks of project $p$ in the sample, $N_p^M = 100$.

The labor-intensity accuracy rate $EA_p^M$ is defined analogously as the ratio of the labor intensity of correctly classified tasks in the sample ($E_p^{MC}$) to the total labor intensity of the sample ($E_p^M$):

$$EA_p^M = \frac{E_p^{MC}}{E_p^M} \cdot 100\%$$

where $p$ is the project; $EA_p^M$ is the index of effectiveness of classification of labor intensity of the tasks of project $p$ in the sample; $E_p^{MC}$ is the labor intensity of the correctly identified tasks of project $p$ in the sample; $E_p^M$ is the total labor intensity of the tasks of project $p$ in the sample.
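The two sample-based indicators can be computed side by side, which also shows why they may differ. The Python sketch below assumes that each manually checked task is represented as a pair of a correctness flag and its recorded hours; this representation, the function name, and the sample data are hypothetical.

```python
def accuracy_indicators(sample):
    """Return (NA, EA) in percent for a manually verified sample.

    sample: list of (is_correct, hours) pairs, one per checked task
    (assumed representation).
    """
    if not sample:
        raise ValueError("sample must not be empty")
    n_total = len(sample)
    n_correct = sum(1 for is_correct, _hours in sample if is_correct)
    e_total = sum(hours for _is_correct, hours in sample)
    e_correct = sum(hours for is_correct, hours in sample if is_correct)

    na = 100.0 * n_correct / n_total
    # If no hours are recorded at all, the effort-based indicator is set to 0.
    ea = 100.0 * e_correct / e_total if e_total else 0.0
    return na, ea


# Illustrative data: one correctly classified task has no recorded hours.
sample = [(True, 8.0), (True, 0.0), (True, 2.0), (False, 10.0)]
na, ea = accuracy_indicators(sample)
print(na, ea)
```

In this made-up sample, three of four tasks are classified correctly (75%), but they carry only 10 of the 20 recorded hours (50%), because one correct task has no hours and the misclassified task is large.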

Table 4 shows the results of the verification of the effectiveness of project task and subtask classification and the manual verification of the samples of project tasks.


**Table 4.** Results of verification of the effectiveness of the classification of tasks and subtasks of projects.

The accuracy rates of the P3 project can be considered exemplary, since the classification algorithm was developed for this project. The differences between the count-based and labor-intensity accuracy indicators determined during the manual verification of small samples (from 1% to 4% of the number of tasks in the project) are due to the fact that not all correctly classified tasks have labor hours recorded. The low indicator values for projects P2, P4, and P6 are due to the frequent use of a language other than English in task descriptions. The set of keywords developed for project P3 consists of English words, hence the poor transferability of the classification algorithm to some projects. In addition, the manual verification of the P4 project detected a lack of keywords in the descriptions and a large number of tasks with no change in status, resulting in a lack of subtasks and in tasks with no registered work. The results in Table 4 indicate the need to translate task descriptions from the Jira system into English before starting classification based on keywords or using NLP models.

#### **6. Conclusions**

The data of actual projects are the basis of the presented research and model. On the one hand, they increase the possibility of practical applications, on the other hand, they limit the model to the types of tasks and roles present in the researched projects. With this in mind, we tried to make the model flexible and open to modifications.

Connecting the model with the Jira system enables easy data acquisition for analysis and increases its commercial potential. A separate abstract layer of the model, in combination with a dedicated database, supports the possibility of creating interfaces for other IT project management systems.

The classification algorithms presented in the article are based on the manual recognition of task types. With manual recognition, rules based on keywords are created, which allows automatic recognition of task types at subsequent occurrences. As is known from the literature and practice, such algorithms are not very flexible and not very accurate (45% accuracy) [55]. However, with a growing base of manually recognized tasks, it will be easy to change to NLP models such as BERT [56]. This will allow fully automated, real-time operation of the task and subtask classification algorithm. It will allow the analysis of the data collected in Jira, the production of reports and charts that provide insight into the manufacturing process, and support for the project manager in decision making.

The division into stateful and recurring tasks shows that stateful tasks are created during software development and that their number and labor intensity depend on the size of the project. The rate of growth and completion of stateful tasks depends on the composition of the project team. The project's execution time depends on this rate. The number and labor intensity of recurring tasks, on the other hand, depend on the duration of the project. Thus, the classification of tasks becomes the basis for constructing a generator of stateful and recurring tasks to create software development plans. In turn, the creation of a development plan and the composition of the project team will allow the construction of a simplified simulation of the work in the project.

The ability to create a project plan and to select the appropriate composition of the project team, and, then, thanks to the simulation, to check how the work will proceed, will allow for comprehensive support of the management of the development process. Thanks to simulation, it will be possible to estimate whether the composition of the team is suitable for the specifics of the project. Simulation can show that, for example, programmers have implemented the requirements and that the team is waiting for testers to perform tests. In this way, the project manager can recognize the risk of a bottleneck in the project and prevent it in advance.

Project planning is useful not only before starting; real-time automatic task classification will allow analysis and will use the calculated task statistics to plan the next sprints or stages of the development process with increasing accuracy.

The introduction of an additional classification of tasks in terms of business and technological components, in conjunction with the employee competency model, would automatically collect information about employees' experiences in business and technology areas. This would allow the assessment of the level of employees' competences, selecting the composition of the project team and perhaps managing the team so that the competences are duplicated and dispersed among team members.

**Author Contributions:** Conceptualization, W.W. and I.M.; methodology, W.W.; software, W.W., I.M. and M.M.; validation, W.W., I.M. and M.M.; formal analysis, W.W., I.M. and M.M.; investigation, W.W. and I.M.; resources, W.W. and I.M.; data curation, W.W. and I.M.; writing—original draft preparation, W.W. and I.M.; writing—review and editing, W.W., I.M. and M.M.; visualization, W.W. and I.M.; supervision, W.W., I.M. and M.M.; project administration, W.W., I.M. and M.M.; funding acquisition, W.W. and I.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** The project was financed within the framework of the program of the Minister of Science and Higher Education in Poland under the name "Regional Excellence Initiative" in the years 2019–2022, project number 001/RID/2018/19, the amount of financing PLN 10,684,000.00.

**Data Availability Statement:** Data are contained within the article.

**Acknowledgments:** Many thanks to Marcin Korzeń, Jakub Swacha, and Leon Dorozik for scientific support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Knowledge Mining of Interactions between Drugs from the Extensive Literature with a Novel Graph-Convolutional-Network-Based Method**

**Xingjian Xu \*, Fanjun Meng and Lijun Sun**

College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China **\*** Correspondence: xingjian@imnu.edu.cn

**Abstract:** Interactions between drugs can occur when two or more drugs are used for the same patient. This may result in changes in the drugs' pharmacological activity, some of which are beneficial and some of which are harmful. Thus, identifying possible drug–drug interactions (DDIs) has always been a crucial research topic in the field of clinical pharmacology. As clinical trials are time-consuming and expensive, current approaches for predicting DDIs are mainly based on knowledge mining from the literature using computational methods. However, since the literature contains a large amount of unrelated information, the task of identifying drug interactions with high confidence has become challenging. Thus, here, we present a novel graph-convolutional-network-based method called DDINN to detect potential DDIs. Combining cBiLSTM, graph convolutional networks and a weight-rebalanced dependency matrix, DDINN is able to extract both contextual and syntactic information efficiently from the extensive biomedical literature. Finally, we compare DDINN with other state-of-the-art models, and the results show that our method is more effective. In addition, the ablation experiments demonstrate the advantages of DDINN's optimization techniques.

**Keywords:** knowledge mining; drug–drug interaction; graph convolutional network; self-attention; deep learning

#### **1. Introduction**

When treating patients with drugs, doctors often use multiple drugs at the same time because the effectiveness of one drug is limited. Particularly in the case of severe and chronic diseases, many different drugs have to be used at the same time to treat lesions, relieve pain, prevent complications, or for other medical reasons. As drugs are taken together, complex biochemical reactions may take place in vivo, resulting in unpredictable results, which are called drug–drug interactions (DDIs) [1]. In terms of their effects, DDIs can be basically divided into two types: beneficial and adverse [2]. A beneficial drug interaction can improve patient outcomes, whereas adverse drug interactions can pose serious threats to patients' health, reducing the effectiveness of drugs, prolonging the course of disease, and even putting patients' lives at risk. Therefore, the identification of possible DDIs has always been a crucial research topic in clinical pharmacology [3]. Researchers have constructed a number of databases to document the DDIs found, such as DrugBank [4], DDInter [5], TwoSides [6] and SFINX [7].

The traditional method of obtaining DDIs involves clinical trials, which are time-consuming, expensive, and often raise serious ethical concerns [8]. Although in vivo trials remain the most accurate method for identifying DDIs, these disadvantages severely limit the pace at which DDIs can be identified. In recent years, biomedical research papers have been published at a high rate, which has led researchers to study how meaningful information can be extracted from them. Clearly, manual curation is not feasible, so machine learning or other knowledge-mining-based methods must be employed [9]. The two examples in Figure 1 illustrate DDI extraction from drug-related text sentences, for example, from the published literature or drug descriptions. For sentence S1, the DDI type of Fluoxetine and Phenelzine is "Advice" (see Section 3.1 for a description of the specific DDI types). For sentence S2, the DDI type of PGF2alpha and Oxytocin is "Effect". Although these automated prediction methods may output false-positive and false-negative DDI predictions, they have nevertheless become a mainstream approach for the DDI prediction task due to their efficacy. If necessary, researchers may then clinically validate the high-confidence DDIs produced by automated DDI prediction methods [10].

**Citation:** Xu, X.; Meng, F.; Sun, L. Knowledge Mining of Interactions between Drugs from the Extensive Literature with a Novel Graph-Convolutional-Network-Based Method. *Electronics* **2023**, *12*, 311. https://doi.org/10.3390/electronics12020311

Academic Editors: Agnieszka Konys and Agnieszka Nowak-Brzezińska

Received: 29 November 2022 Revised: 4 January 2023 Accepted: 6 January 2023 Published: 7 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Figure 1.** Two examples illustrating DDI extraction from drug-related text sentences.

Initially, there were mainly two kinds of traditional machine learning methods for automatically extracting DDIs: pattern-based and feature-based ones. In pattern-based methods, experts with extensive domain knowledge are required to propose recognizable patterns based on their own experience [10]. Later, a number of feature-based methods were proposed, among which the best-performing ones are based on support vector machines (SVMs), for example, FBK-irst [11] and NIL\_UCM [12]. In general, machine learning methods that are based on features have experienced great success and are more portable than those that rely on patterns [13]. There is, however, an inherent disadvantage to these methods: they rely heavily on tedious feature engineering and redundant feature selection, and defining the feature set in a supervised manner also limits the identification of other valuable patterns. Moreover, as these methods are based on traditional machine learning models and are not capable of extracting deep features from input data, they become much less effective when dealing with large data sets [14].

Deep learning can solve the above problems well, and it has been applied widely and successfully in a variety of other fields, such as computer vision, natural language processing (NLP) and speech recognition [15,16]. Deep learning methods based on graph structure have been proposed and successfully applied to the DDI prediction task [17,18]. The first wave of popular deep-learning-based DDI detection methods relies primarily on sequence-based networks, for example, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [19]. In most cases, these methods can achieve better results than methods based on traditional machine learning models. However, their main drawback is that they cannot handle long or complex sentences in the literature's text or other information sources, mainly because of the inherent characteristics of CNNs and RNNs. Researchers then proposed dependency-based methods, which can be applied to corpora that contain many long and complex sentences and which incorporate structural information into a neural architecture for DDI prediction. As many DDI extraction corpora contain a large number of long sentences (≥150 words) [20], dependency-based methods obviously have advantages over sequence-based ones. For all these methods, there are still some challenges to overcome: (1) these methods only use the literature's text as input data and lack relevance to other information extraction sources; (2) due to the difficulty of parallelizing existing dependency-based methods, such as tree-LSTM, they are often inefficient and have a disappointing runtime performance; (3) as their network is essentially linear, most of these methods are only capable of predicting the interaction of one pair of drugs at a time, which severely limits their practical usage.

To resolve the issues outlined above, we propose DDINN (DDI Neural Network) for the DDI prediction task, a novel graph-convolutional-network-based method that uses a self-attention mechanism for pruning. Our method uses contextual features of sentences as vertices and syntactic features as edges to construct a graph, which is fed to the GCN layers sequentially. By stacking convolution layers, DDINN can capture the neighborhood information of the graph more effectively. In particular, we rebalance the weight of each edge via a self-attention mechanism. Thus, DDINN is able to exploit both the context and structure of the input sentence to the maximum extent possible. Finally, we trained and evaluated the DDINN model on the dominant DDI extraction dataset, the DDIExtraction 2013 dataset from SemEval-2013 Task 9 [21]. Validation experiments and an ablation study show the effectiveness of DDINN and its superiority over similar methods. Performance assessments are also conducted on the components of the DDINN model to show the improvement over traditional methods.
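The general idea of a graph convolution whose edge weights are rebalanced by attention can be illustrated with a small, dependency-free Python sketch. It is a generic textbook-style construction and not the DDINN implementation: treating tokens as nodes, using scaled dot-product scores, normalizing them with a row-wise softmax over existing edges, and all function names are assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_weights(h, adj):
    """Row-wise softmax of scaled dot-product scores, restricted to edges in adj."""
    d = len(h[0])
    weights = []
    for i, row in enumerate(adj):
        scores = [
            math.exp(sum(x * y for x, y in zip(h[i], h[j])) / math.sqrt(d))
            if edge else 0.0
            for j, edge in enumerate(row)
        ]
        total = sum(scores)
        weights.append([s / total if total else 0.0 for s in scores])
    return weights


def gcn_layer(h, adj, w):
    """One graph convolution step: ReLU(A_att @ H @ W)."""
    a_att = attention_weights(h, adj)
    out = matmul(matmul(a_att, h), w)
    return [[max(0.0, v) for v in row] for row in out]


# Three nodes with 2-dimensional features; adjacency includes self-loops.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for readability only

h1 = gcn_layer(h, adj, w)
h2 = gcn_layer(h1, adj, w)  # stacking a second layer widens the neighborhood
print(h2)
```

After one layer, each node mixes information from its direct neighbors only; after the second layer, node 0 also receives information that originated at node 2, which illustrates why stacking layers captures a larger neighborhood.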

To summarize, we can state the following as our main contribution:


The remainder of this paper is organized as follows. In Section 2, we review the characteristics of existing DDI extraction approaches and briefly summarize the improvements made in the DDINN method proposed here to overcome their shortcomings. Section 3 describes the implementation specifics of DDINN in detail. Then, the experiments and the analysis of their results are presented in Sections 4 and 5. Finally, in Section 6, our conclusions regarding the entire work on DDINN are presented.

#### **2. Related Works**

Currently, there are three main types of DDI extraction methods: feature-based, kernel-based, and deep learning neural-network-based methods. The representative methods below will serve as the baseline for further experimental validation.

#### *2.1. Feature-Based Methods*

Feature-based methods aim to represent the distinguishing characteristics of the data through feature representation techniques, a process known as feature engineering. This process transforms the original data into feature vectors that better express the essence of the problem. Classifiers are then trained on various linguistic features extracted from the data. For example, UTurku [22] uses dependency graph features to mine entity associations, and it achieved an F-value of 59.4% in the DDIExtraction 2013 competition. WBI-DDI [23] proposes a two-stage method that first classifies instances separately with multiple methods, including APG (all path graph), Moara, SL (shallow linguistic), and TEES (Turku event extraction system), and then votes on these classification results to obtain the final classification, achieving an F-value of 60.9%. FBKirst [11] first builds a combined kernel classifier from a feature kernel, a shallow linguistic kernel and a closure tree kernel for binary classification, removes the negative examples, and then builds a second combined kernel classifier for multi-class classification; it achieved an F-value of 65.1% in the DDIExtraction 2013 competition.

#### *2.2. Kernel-Based Methods*

The purpose of kernel-based methods is to find and learn the mutual relationships in a set of data. Widely used kernel methods include support vector machines, Gaussian processes, etc. Kernel-based methods are an effective way to solve nonlinear pattern analysis problems. The core idea is as follows: First, the original data are embedded into a suitable high-dimensional feature space by some nonlinear mapping; then, the patterns are analyzed and processed in this new space using a generic linear learner. Feature- and kernel-based DDI extraction can achieve better results than rule-based extraction, and these methods were the mainstream approach to DDI extraction for a long period of time. Their disadvantage is that performing multiple complex feature extractions is time-consuming and laborious, so the extraction performance is bottlenecked and cannot be improved significantly. In 2015, Kim et al. [13] constructed kernel functions from a set of lexical and syntactic features and obtained an F-value of 67% on DDIExtraction 2013. In 2016, Zheng et al. [24] constructed a graph kernel and obtained an F-value of 68.4%, which made it the best-performing model among the feature- and kernel-based methods. It is similar to our approach in that semantic and syntactic information is integrated. However, the performance of previous studies has not been satisfactory since they have only looked at the shortest dependency path (SDP).

#### *2.3. Neural-Network-Based Methods*

Neural networks have an extremely strong feature representation capability. Thus, deep learning methods have a significant advantage over other machine learning methods in terms of accuracy and do not require complex pre-processing. In classification tasks, neural networks can be treated as classifiers capable of automatically extracting features. With the rapid development of deep learning, many neural-network-based DDI extraction methods have emerged in recent years, and they outperform traditional feature- or kernel-based methods on the DDI extraction task. The relationship between drug entities can be extracted using neural networks in two basic ways: sequence-based and dependency-based methods.

Different neural architectures, including CNNs and RNNs, are used in sequence-based models. Quan et al. [25] proposed a multichannel convolutional neural network (MCCNN) for automated biomedical relation extraction. On the DDIExtraction 2013 challenge dataset, MCCNN was reported to achieve an overall F-score of 70.2%, compared with 67.0% for the standard linear SVM-based system. Sun et al. [26] proposed a recurrent hybrid convolutional neural network (RHCNN) for DDI extraction from the biomedical literature in which semantic embeddings and position embeddings are both used to represent the texts mentioning two drug entities. RHCNN is reported to achieve automatic DDI extraction with a micro F-score of 75.48%. In addition to CNN-based models, RNN-based ones have also been adopted for extracting DDI effectively. For example, in GGNN [27], textual drug pairs are encoded with convolutional neural networks, while molecule pairs are encoded with graph convolutional networks. DDI relations are then extracted by concatenating the outputs of these two networks. Sahu et al. [28] present three long short-term memory (LSTM) network models for mining DDI relations from biomedical text, namely B-LSTM, AB-LSTM and Joint AB-LSTM. The experimental results on the DDIExtraction2013 dataset show that the Joint AB-LSTM model produces reasonable performance with an F-score of 69.39%.

Dependency-based neural network architectures are constructed using structural information of a given sentence. It is common for the DDI extraction corpus (literature text or drug description, etc.) to contain multiple long and complex sentences, and the longest sentence may contain over 150 words, so using only sequence-based networks for extraction is extremely challenging. It is therefore very helpful to introduce structural knowledge (such as dependency trees) into the DDI extraction task. For example, Zhao et al. [29] present a syntax convolutional neural network (SCNN) for DDI extraction. In SCNN, a new syntax word embedding method is proposed that incorporates syntactic sentence information.

#### *2.4. Improvements Made by DDINN*

In order to address the shortcomings of the approaches discussed above, we made considerable improvements with respect to DDINN for the DDI extraction task:


#### **3. Materials and Methods**

#### *3.1. Problem Definition*

Words in the literature's text can be denoted as $\mathbf{X} = [\mathbf{x}\_1, \mathbf{x}\_2, \cdots, \mathbf{x}\_i, \cdots, \mathbf{x}\_n] \in \mathbb{R}^{d \times n}$, where $n$ denotes the total number of words and $\mathbf{x}\_i \in \mathbb{R}^{d}$ denotes the $d$-dimensional $i$-th embedded token. Drugs described in this text can be denoted as $\mathcal{D} = \{D\_k \mid k \in [1, n]\}$. The mapping relationship between words and drugs is already known, and it can be represented as $R\_{xd}(\mathbf{x}\_i, D\_k) \in \{0, 1\}$. If $R\_{xd}(\mathbf{x}\_i, D\_k) = 0$, there is no relationship between $\mathbf{x}\_i$ and $D\_k$; otherwise, there is a positive relationship.

All drug entities can be annotated with the following five drug–drug interaction relationship types [21]:


Thus, in the problem of DDI relation extraction, $\mathcal{C}$ represents the overall prediction classes as follows.

$$\mathcal{C} = \{ \text{Advice}, \text{Mechanism}, \text{Effect}, \text{Int}, \text{Negative} \} \tag{1}$$

Now, the problem of DDI prediction can be defined as follows. Given $\mathbf{X}$ and $R\_{xd}$, our DDINN method will predict the drug relation set $\mathcal{R}\_D$.

$$\mathcal{R}\_D = \{ R\_{dd}(D\_a, D\_b) \mid a \in [1, n], b \in [1, n], a \neq b \},\tag{2}$$

$$R\_{dd}(D\_{a}, D\_{b}) \in \mathcal{C} \tag{3}$$

#### *3.2. Overview of Architecture*

The overall architecture of our proposed DDINN model is illustrated in Figure 2. Firstly, each word in the input literature text is transformed into a token vector that consists of the embeddings of the word itself, its dependency, its part of speech, and its distance in the sentence. These embedding vectors are concurrently sent to cBiLSTM and the weight-rebalanced dependency parser to extract the contextual and syntactic features, respectively. Then, DDINN constructs a graph, which is fed to the GCN layers, by converting contextual features into graph vertices and syntactic features into graph edges. Subsequently, the representations of the drug pair and of the sentence formed by the remaining words are obtained by masking the output of the GCN layers. At the last step, the DDI prediction classifier, which is the final output of DDINN, is generated by concatenating the representations above sequentially and passing them to the linear and softmax layers. Below, we provide a detailed description of the process for building the DDINN model.

**Figure 2.** Architecture overview of our proposed DDINN method.

#### *3.3. Contextual Feature Representations*

In our work, the contextual and syntactic representations of sentences are used to analyze the literature's text. The bag-of-words model is often used in traditional sentiment analysis, where a document is viewed as a collection of terms or short compound words regardless of grammar and word order. As a result, word vectors are often used when processing sentences. Word embeddings are commonly obtained by pre-training, and a word representation obtained in this way is often independent of the sentence's context. However, due to polysemy, a word can have different meanings in different contexts. Therefore, the word vector alone cannot accurately describe the meaning of a word in a particular context. Context-sensitive vectors can enhance the representations of semantic relations between sentences [30].

Our solution to these issues involves the use of contextual bidirectional long short-term memory recurrent neural networks (cBiLSTM). In cBiLSTM, the contextual information extraction problem is viewed as a sequence classification problem, and a type of pooling is performed to obtain sentence-level polarity after using RNNs as discriminative binary classifiers. There are two separate LSTM layers in cBiLSTM. For each word token $\mathbf{x}\_i$, these two layers capture the forward and reverse contextual information, respectively. By estimating the probability of a word based on its complete left and right contexts, the networks process the context on both sides of the word's position in the sentence. Therefore, cBiLSTM is able to understand the contextual meaning of words more effectively than traditional network models.

#### 3.3.1. Word Embedding

The first step is the vectorization of words to obtain $\mathbf{X}$. A word $T\_i$ does not necessarily have a corresponding vector in $\mathbf{X}$; in this case, its vector is randomly initialized from a uniform distribution on the interval $[-0.5, 0.5]$. Let $\mathbf{x}(T\_i)$ denote the vector of word $T\_i$; this representation rule is described as follows:

$$\mathbf{x}(T\_i) = \begin{cases} \mathbf{x}\_i, & T\_i \in \mathbf{X}, \\ \text{Uniform}([-0.5, 0.5])^d, & T\_i \notin \mathbf{X}. \end{cases} \tag{4}$$
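The lookup rule in Equation (4) can be illustrated with a minimal Python sketch. The function name `embed_token`, the toy three-dimensional vocabulary and the fixed random seed are assumptions made for illustration only; they are not taken from the DDINN implementation.

```python
import random


def embed_token(token, pretrained, dim, rng):
    """Return the stored vector for `token`, or a U(-0.5, 0.5) vector if it is missing."""
    if token in pretrained:
        return list(pretrained[token])
    # Out-of-vocabulary word: draw each component uniformly from [-0.5, 0.5].
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]


# Hypothetical 3-dimensional embedding table, for illustration only.
pretrained = {"aspirin": [0.1, -0.2, 0.3], "warfarin": [0.0, 0.4, -0.1]}
rng = random.Random(0)
sentence = ["aspirin", "potentiates", "warfarin"]
X = [embed_token(t, pretrained, dim=3, rng=rng) for t in sentence]
```

Here "potentiates" is absent from the toy table, so it receives a random vector, while the two drug names keep their stored vectors.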

#### 3.3.2. Construct cBiLSTM

Later, each word vector $\mathbf{x}\_i$ will be processed by cBiLSTM, which produces the forward hidden state $\overrightarrow{h\_i}$ and the backward hidden state $\overleftarrow{h\_i}$ for that word vector.

$$
\overrightarrow{h\_i} = LSTM(\mathbf{x}\_i, \overrightarrow{h\_{i-1}}) \tag{5}
$$

$$
\overleftarrow{h\_i} = LSTM(\mathbf{x}\_i, \overleftarrow{h\_{i+1}}) \tag{6}
$$

Then, we can calculate the contextual feature, $h\_i$, of word vector $\mathbf{x}\_i$ by concatenating $\overrightarrow{h\_i}$ and $\overleftarrow{h\_i}$ as follows.

$$h\_i = [\overrightarrow{h\_i}; \overleftarrow{h\_i}] \in \mathbb{R}^d \tag{7}$$

At the final step of this section, all contextual information (denoted as *H*) of the sentences will be fed to the later networks for parsing.

$$H = (h\_1, h\_2, \dots, h\_n) \in \mathbb{R}^{n \times d} \tag{8}$$
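The bidirectional pass of Equations (5)–(8) can be sketched in plain Python as follows. To keep the example short, `toy_cell` is a simple tanh recurrence used as a stand-in for an LSTM cell; it is not an LSTM, and all names here are hypothetical. In this sketch each direction keeps the input dimension, so the concatenated feature has twice that size; obtaining a $d$-dimensional output as in Equation (7) would presumably require a hidden size of $d/2$ per direction.

```python
import math


def toy_cell(x, h_prev):
    """Stand-in for an LSTM cell: tanh of the element-wise sum (NOT a real LSTM)."""
    return [math.tanh(a + b) for a, b in zip(x, h_prev)]


def bidirectional_features(X, cell):
    """Run `cell` left-to-right and right-to-left, then concatenate per position."""
    n, d = len(X), len(X[0])
    forward, h = [], [0.0] * d
    for i in range(n):                # left-to-right pass, Equation (5)
        h = cell(X[i], h)
        forward.append(h)
    backward, h = [None] * n, [0.0] * d
    for i in reversed(range(n)):      # right-to-left pass, Equation (6)
        h = cell(X[i], h)
        backward[i] = h
    # Equation (7): concatenate both directions; the list of rows is H in Equation (8).
    return [f + b for f, b in zip(forward, backward)]


H = bidirectional_features([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], toy_cell)
```

The first half of each row depends only on the words to the left of that position, and the second half only on the words to the right.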

#### *3.4. Syntactic Feature Representations*

Dependency parsing aims to parse the text into a dependency tree by obtaining the dependencies and association paths between words. Extracting text features based on sentence structure in this way gives the model a better understanding of natural language. In addition to contextual information, syntactic information is also important; in fact, contextual and syntactic features complement each other. Here, we adopt the graph convolutional network (GCN) [31,32] to extract syntactic information. The syntactic structure of text resembles graph data. For such non-Euclidean data, traditional deep learning models do not effectively exploit its intrinsic information and may even corrupt it. GCN extends convolution to graph-structured data, which allows it to model the graph data commonly found in practice and to explore the complex relationships in it. In this paper, we use the full dependency tree as the input of the graph convolutional network and introduce the attention mechanism during the training process so as to selectively focus on the dependency substructure.

#### 3.4.1. Construct Dependency Matrix

Based on the dependency structure, we first generate the corresponding adjacency matrix $A \in \mathbb{R}^{n \times n}$. Most traditional dependency-tree-based networks do not employ full dependency trees to convey syntactic information from sentences. These methods often use 1 or 0 to encode syntactic dependencies between words, so the elements in the adjacency matrix $A$ take values of only 1 or 0. However, this approach ignores the impact of different dependencies on the target task and introduces other redundant features. As a result of such strategies, which are normally determined by rule-based preprocessing, crucial information may also be lost [33,34].

To address the problems above, we introduce two more steps: a dependency-aware embedding representation method based on dependency relations in the layers and self-attention-based pruning. The dependency-aware embedding representation not only focuses on the dependency correlations between words but also considers the dependency tag types and the semantics of the words associated with the tags. The following paragraphs provide the implementation details of the dependency-aware embedding representation method.

For $\mathbf{X}$, if there is a dependency relationship between words $i$ and $j$ and the dependency type is $\varphi$, the corresponding dependency-type embedding vector is $\aleph\_{\varphi} \in \mathbb{R}^{d\_{\varphi} \times 1}$, and the dependency relationship between these two words can be represented as follows:

$$a\_{ij} = \text{Sigmoid}(\text{Avg}[\mathbf{x}\_i, \mathbf{x}\_j] \times \omega\_{\varphi} \times \aleph\_{\varphi} + b\_{\varphi}) \tag{9}$$

where $\omega\_{\varphi}$ and $b\_{\varphi}$ are trainable parameters, $\text{Avg}$ denotes the average value function, and $\text{Sigmoid}$ denotes the activation function. $\aleph\_{\varphi}$ is initialized before the model's training and is updated during the training process. Thus, if words $i$ and $j$ have a syntactic dependency, the elements in matrix $A$ can be represented as $A\_{ij} = a\_{ij}$; otherwise, $A\_{ij} = 0$.
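A minimal sketch of how the weighted dependency matrix of Equation (9) could be assembled is given below. Several details are assumptions rather than facts from the paper: the average is taken element-wise over the two word vectors, the projection weight for a dependency type is a $d \times d\_{\varphi}$ matrix, each edge is stored symmetrically, and the names `edge_weight` and `dependency_matrix` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def edge_weight(x_i, x_j, omega, aleph, b):
    """Scalar weight for one dependency arc, following the shape of Equation (9)."""
    avg = [(u + v) / 2.0 for u, v in zip(x_i, x_j)]               # 1 x d (assumed element-wise mean)
    proj = [sum(avg[k] * omega[k][m] for k in range(len(avg)))    # 1 x d_phi
            for m in range(len(aleph))]
    return sigmoid(sum(p * a for p, a in zip(proj, aleph)) + b)


def dependency_matrix(X, arcs, params):
    """arcs: (i, j, dep_type) triples; params[dep_type] = (omega, aleph, b)."""
    n = len(X)
    A = [[0.0] * n for _ in range(n)]
    for i, j, dep in arcs:
        omega, aleph, b = params[dep]
        # Symmetric storage is an assumption of this sketch.
        A[i][j] = A[j][i] = edge_weight(X[i], X[j], omega, aleph, b)
    return A


# Toy example: three 2-dimensional word vectors and one "nsubj" arc.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
params = {"nsubj": ([[0.5], [0.5]], [1.0], 0.0)}
A = dependency_matrix(X, [(0, 1, "nsubj")], params)
```

Word pairs without a dependency arc keep the value 0, so only syntactically related pairs receive a learned weight between 0 and 1.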

#### 3.4.2. Self-Attention-Based Pruning

Then, in order to exploit syntactic dependencies more fully, self-attention-based pruning is employed to assign weights to all edges in the dependency graph. By incorporating the self-attention mechanism, we transform $A$ into a soft adjacency matrix $\hat{A}$. Self-attention has the advantage of capturing the relationship between different positions in a single sequence. Thus, the edge weights of all node pairs in the graph are reassigned regardless of whether they are directly or indirectly connected. This is why we call the output $\hat{A}$ a *soft* adjacency matrix.

In the specific calculation process, we use query and key pairs of $\mathbf{x}\_i$ as self-attention function parameters. By employing multi-head attention [35,36], we are able to capture different contexts from multiple perspectives. In particular, the soft adjacency matrix, $\hat{A}$, can be calculated as follows:

$$\hat{A} = Softmax\left(\frac{Q\mathbf{W}\_h{}^Q \times (K\mathbf{W}\_h{}^K)^T}{\sqrt{d}}\right) \tag{10}$$

where $\text{Softmax}$ is the activation function and $Q$ and $K$ are the features of the previous convolutional layer, $h^{(l-1)}$. $\mathbf{W}\_h{}^Q \in \mathbb{R}^{d \times d}$ and $\mathbf{W}\_h{}^K \in \mathbb{R}^{d \times d}$ are the projection parameters, where $h$ denotes the $h$-th head in $H$, which is defined in Equation (8).
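For a single attention head, Equation (10) can be sketched in plain Python as below. The helper names are hypothetical, and how the heads are combined is not specified in this section, so the sketch stops at one head.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)                              # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def soft_adjacency(H, W_q, W_k):
    """Single-head version of Equation (10): Softmax(Q W^Q (K W^K)^T / sqrt(d))."""
    d = len(H[0])
    Q = matmul(H, W_q)
    K = matmul(H, W_k)
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])


# Toy node features with identity projections, for illustration only.
H_prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
A_hat = soft_adjacency(H_prev, identity, identity)
```

Every row of the result is a probability distribution over all nodes, which is why every node pair receives a non-zero weight, whether or not the pair is connected in the original dependency tree.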

#### 3.4.3. Construct GCN

Then, contextual information $H$, which is the output of Equation (8), and adjacency matrix $\hat{A}$ will be fed into the $l$-level GCN:

$$\boldsymbol{H}^{(l)} = \operatorname{Relu}(\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{A}\_{\mathbf{D}}\hat{\mathbf{D}}^{-\frac{1}{2}}\boldsymbol{H}^{(l-1)}\boldsymbol{W}^{(l-1)} + b\_{l})\tag{11}$$

where $\operatorname{Relu}$ is the activation function, $\hat{A}\_{\mathbf{D}}$ is the edge matrix of $\hat{A}$, $\hat{\mathbf{D}}$ denotes the degree matrix of $\hat{A}\_{\mathbf{D}}$, $\boldsymbol{H}^{(l-1)}$ denotes the node features of the $(l-1)$-th level GCN (when $l = 1$, $\boldsymbol{H}^{(l-1)} = H$) and $\boldsymbol{W}^{(l-1)}$ denotes the weight matrix of the $(l-1)$-th level GCN.
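One propagation step of Equation (11) can be sketched as follows. The sketch assumes that the degree of a node is the row sum of the edge matrix and that the bias is a vector added to every row; the function names are hypothetical.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gcn_layer(A, H, W, b):
    """Relu(D^{-1/2} A D^{-1/2} H W + b) for one GCN level."""
    n = len(A)
    deg = [sum(row) for row in A]                 # degree = row sum (assumption)
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    A_norm = [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]
    out = matmul(matmul(A_norm, H), W)
    return [[max(0.0, v + b_k) for v, b_k in zip(row, b)] for row in out]


# Toy example: two fully connected nodes, identity weights, and a negative bias
# on the second feature so that the Relu clips it to zero.
A = [[1.0, 1.0], [1.0, 1.0]]
H0 = [[1.0, 2.0], [3.0, 4.0]]
W0 = [[1.0, 0.0], [0.0, 1.0]]
H1 = gcn_layer(A, H0, W0, [0.0, -10.0])
```

Stacking several such calls, each with its own weight matrix, lets information travel over paths longer than one edge.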

Finally, in order to further enhance the generalization capability of the model, the output of the GCN layers above will be processed by a pooling layer, dropout layer, and Relu layer:

$$H^\* = \omega \times \operatorname{Relu}(\operatorname{Dropout}(\operatorname{Pooling}(\mathbf{H}^{(l)}))) + b \tag{12}$$

where *H*∗ is the final output of GCN, which holds the contextual and syntactic feature information of text **X** at the same time.

#### *3.5. Extract DDI*

#### 3.5.1. Extract Masked Representations

After completing the above steps, we have hidden representations of each word in the input literature text, which can be simply denoted as $\mathbf{w}\_i$ for word $i$. The problem in this step can be defined as follows: Within the input word representations $[\mathbf{w}\_1, \cdots, \mathbf{w}\_n]$, drug A is mapped to $\mathbf{w}\_a$, and drug B is mapped to $\mathbf{w}\_b$; we want to extract the relationship between drugs A and B. In order to achieve this, we first calculate the masked representations of drug A, drug B, and the sentence formed by the other words (i.e., all words except $\mathbf{w}\_a$ and $\mathbf{w}\_b$), which are denoted as $H\_A^{M\*}$, $H\_B^{M\*}$, and $H\_S^{M\*}$, respectively. The calculation process is as follows:

$$H\_S^{M\*} = \text{MaxPooling}(\text{Mask}\_S(H^\*))\tag{13}$$

$$H\_A^{M\*} = \text{MaxPooling}(\text{Mask}\_A(H^\*))\tag{14}$$

$$H\_B^{M\*} = \text{MaxPooling}(\text{Mask}\_B(H^\*))\tag{15}$$

where $H^\*$ is the output of Equation (12) and $\text{MaxPooling}$ denotes a function that reduces $n$ output vectors to a single vector, i.e., $\text{MaxPooling}: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$. $\text{Mask}\_A$, $\text{Mask}\_B$ and $\text{Mask}\_S$ denote functions that select only the representations for drug A, drug B and the sentence formed by the remaining words, respectively.
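The masking and pooling of Equations (13)–(15) can be illustrated with a short sketch. Here a mask is modelled simply as a list of token positions to keep, which is an assumption of this example; the function name and the toy values are hypothetical.

```python
def masked_max_pool(H, positions):
    """Max over the rows of H listed in `positions`, taken per feature dimension."""
    rows = [H[i] for i in positions]
    return [max(col) for col in zip(*rows)]


# Toy GCN output for a four-token sentence with 2-dimensional features.
H_star = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3], [0.5, 0.8]]
drug_a, drug_b = [0], [2]                      # token positions of drug A and drug B
rest = [i for i in range(len(H_star)) if i not in drug_a + drug_b]

h_a = masked_max_pool(H_star, drug_a)          # [0.1, 0.9]
h_b = masked_max_pool(H_star, drug_b)          # [0.7, 0.3]
h_s = masked_max_pool(H_star, rest)            # [0.5, 0.8]
```

A drug mention that spans several tokens would simply contribute several positions to its mask, and the pooling still returns one vector per mask.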

#### 3.5.2. Construct DDI Classifier

Finally, we can predict the DDI by using a classifier. Firstly, we concatenate the masked representations above and then feed them to a fully connected layer [37]. The final result of this classifier is denoted as *HFinal*, which is calculated as follows:

$$H\_{\text{Final}} = \text{FC}(\text{Concat}(H\_A^{M\*}, H\_B^{M\*}, H\_S^{M\*})) \tag{16}$$

where *FC* is the fully connected layer, and *Concat* is the function that concatenates all its parameters. *HFinal* will then be fed into a linear layer and a softmax layer to output the probability distribution for the DDI relationship between these drugs [38,39]:

$$P = Softmax(Linear(H\_{Final}))\tag{17}$$
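Equations (16) and (17) amount to a concatenation followed by two affine maps and a softmax, as in the sketch below. The absence of a nonlinearity inside the fully connected layer, the layer sizes and all names are assumptions of this example rather than details of the DDINN implementation.

```python
import math

LABELS = ["Advice", "Mechanism", "Effect", "Int", "Negative"]


def affine(x, W, b):
    """y = W x + b for a weight matrix stored row-wise."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_a, h_b, h_s, fc, linear):
    """fc and linear are (W, b) pairs; returns one probability per label."""
    h_final = affine(h_a + h_b + h_s, *fc)      # Equation (16): concatenate, then FC
    return softmax(affine(h_final, *linear))    # Equation (17)


# Untrained (all-zero) toy parameters: 6 -> 4 -> 5 dimensions.
fc = ([[0.0] * 6 for _ in range(4)], [0.0] * 4)
linear = ([[0.0] * 4 for _ in range(5)], [0.0] * 5)
P = classify([0.1, 0.9], [0.7, 0.3], [0.5, 0.8], fc, linear)
prediction = LABELS[P.index(max(P))]
```

With all-zero parameters every class receives the same probability, which makes the example easy to check by hand; trained weights would of course produce a peaked distribution.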

#### **4. Experiments**

#### *4.1. Dataset*

In this paper, we evaluate DDINN on the DDIExtraction2013 dataset [20], which is the most widely used dataset for comparing the performance of different DDI extraction algorithms. Prior to 2011, there were relatively few studies related to the DDIExtraction task due to the lack of standard datasets, and almost all of those studies were rule-based. These rules have to be formulated by professionals, and the DDI extraction is achieved by matching the DDI expressions in the sentences with the formulated rules. This approach works reasonably well for simple sentences. However, for long and complex sentences, especially those with many subordinate clauses, it is much less effective. In 2011, the SemEval 2011 competition established the DDIExtraction subtask and provided the standard DDIExtraction dataset for the first time. Subsequently, in 2013, the SemEval 2013 competition supplemented and improved the dataset, which is referred to as DDIExtraction2013.

The text corpus of this dataset has two sources: (1) literature abstracts in the discipline of drug interactions downloaded from the MedLine (https://medline.com/, accessed on 20 February 2022) medical literature retrieval system and (2) articles studying drug interactions downloaded from the DrugBank (https://drugbank.com/, accessed on 23 February 2022) online database. A total of 18,491 pharmacological substances and 4999 drug–drug interactions were manually annotated in this DDI corpus, which consists of 1017 documents (784 paragraphs from DrugBank and 233 abstracts from MedLine). All documents contain 5806 sentences and 127,653 tokens. The details of the DDIExtraction2013 dataset are listed in Table 1.

**Table 1.** The statistics information of DDIExtraction 2013 dataset.


#### *4.2. Training*

In the training process, the cross-entropy cost function and $L\_2$ regularization are used as the optimization objective. The cross entropy is defined as follows:

$$l\_i = -\ln\left(Y\_i^T P\_i\right) \tag{18}$$

where $Y\_i$ denotes the one-hot representation of the $i$-th instance label, and $P\_i$ is the model output, which is defined in Equation (17). For a mini batch $\mathcal{M} = [\mathbf{X}\_1, \mathbf{X}\_2, \cdots, \mathbf{X}\_M]$, we define the optimization objective as follows:

$$\mathcal{J}(\theta) = \frac{1}{|\mathcal{M}|} \sum\_{i=1}^{|\mathcal{M}|} l\_i + \lambda \|\theta\|\_2^2 \tag{19}$$

where $\theta$ includes all the parameters in our model. At the final step, parameter $\theta$ in the objective function, $\mathcal{J}(\theta)$, is optimized with Nadam [40], an algorithm for first-order gradient-based optimization of stochastic objective functions.
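The objective of Equations (18) and (19) can be written out directly, as in the following sketch. The function names and the toy batch are hypothetical, and the parameters are flattened into a single list purely for illustration.

```python
import math


def cross_entropy(y_onehot, p):
    """Equation (18): negative log of the probability assigned to the true class."""
    return -math.log(sum(y * q for y, q in zip(y_onehot, p)))


def objective(batch, theta, lam):
    """Equation (19): mean cross entropy over the mini batch plus an L2 penalty."""
    data_term = sum(cross_entropy(y, p) for y, p in batch) / len(batch)
    return data_term + lam * sum(t * t for t in theta)


# Toy mini batch of (one-hot label, predicted distribution) pairs.
batch = [([1, 0], [0.5, 0.5]), ([0, 1], [0.25, 0.75])]
J = objective(batch, theta=[1.0, 2.0], lam=0.1)
```

Because the label is one-hot, the inner product in Equation (18) simply picks out the predicted probability of the correct class.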

The models are randomly initialized at the beginning, so if a higher learning rate is selected at this point, the model may become unstable or oscillate, while a lower learning rate will result in a slower convergence speed. The learning rate scheduler with exponential decay [41] is used to control the dynamic change of the learning rate during the training process (see Figure 3). It can slow down overfitting in the initial stages and maintain the stability of the deep layer. Upon the completion of training, the model that can predict interactions between two drugs is obtained.
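An exponentially decaying schedule multiplies the learning rate by a constant factor after every epoch. The sketch below shows the rule itself; the initial rate of 0.001 and the decay factor of 0.9 are illustrative values, not the settings used for DDINN.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch


# Illustrative values only.
schedule = [exponential_decay(1e-3, 0.9, e) for e in range(5)]
```

In PyTorch, the same behaviour is available through `torch.optim.lr_scheduler.ExponentialLR`, whose `gamma` argument plays the role of `decay_rate` here.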

#### *4.3. Experiment Setup*

The DDINN is implemented with PyTorch (https://pytorch.org/, accessed on 10 January 2022) and open-sourced at GitHub (https://github.com/xingjianxu/DDINN, accessed on 10 January 2022). We use pre-trained word embeddings from GloVe [42] combined with PMCVec [43,44], which is based on unlabeled biomedical texts from PubMed (https://pubmed.ncbi.nlm.nih.gov/, accessed on 10 January 2022) and PubMed Central (https://www.ncbi.nlm.nih.gov/pmc/, accessed on 10 January 2022). In order to obtain the dependency tree, dependency label, and POS tag of each word, we use the Stanford Parser (https://nlp.stanford.edu/software/lex-parser.shtml, accessed on 10 January 2022). All experiments are conducted with two RTX 3090 GPUs. The detailed parameters are listed in Table 2.

**Figure 3.** Learning rate exponential decay.

**Table 2.** The main hyperparameter settings used in DDINN implementations and evaluation experiments.


#### *4.4. Assessment Metrics*

In order to evaluate the quality of prediction results, micro-precision, micro-recall, and micro-F score are employed as assessment metrics, which are denoted as $P\_{micro}$, $R\_{micro}$, and $F\_{micro}$, respectively. Following the DDI types listed in Table 1, we define the set of prediction classes $\mathcal{D}$ as

$$
\mathcal{D} = \{ \text{Advice}, \text{Mechanism}, \text{Effect}, \text{Int}, \text{Negative} \}\tag{20}
$$

and these metrics above can be calculated as follows:

$$P\_{micro} = \frac{\overline{TP}}{\overline{TP} + \overline{FP}} = \frac{\sum\_{i=1}^{n} TP\_i}{\sum\_{i=1}^{n} TP\_i + \sum\_{i=1}^{n} FP\_i} \tag{21}$$

$$R\_{micro} = \frac{\overline{TP}}{\overline{TP} + \overline{FN}} = \frac{\sum\_{i=1}^{n} TP\_i}{\sum\_{i=1}^{n} TP\_i + \sum\_{i=1}^{n} FN\_i} \tag{22}$$

$$F\_{micro} = \frac{2 \times P\_{micro} \times R\_{micro}}{P\_{micro} + R\_{micro}} \tag{23}$$

where $TP\_i$ denotes the true positives in the prediction class $i \in \mathcal{D}$, $FP\_i$ denotes the false positives, $FN\_i$ denotes the false negatives, and $n$ here is the number of prediction classes.
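The micro-averaged metrics of Equations (21)–(23) pool the per-class counts before dividing, as the following sketch shows. The function name and the toy labels are hypothetical. The set of classes is passed explicitly because the result depends on it: scoring all five classes of Equation (20) and scoring only the four positive types give different values.

```python
def micro_scores(gold, pred, classes):
    """Return (precision, recall, F) micro-averaged over `classes`."""
    tp = fp = fn = 0
    for c in classes:
        for g, p in zip(gold, pred):
            if p == c and g == c:
                tp += 1
            elif p == c:
                fp += 1
            elif g == c:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score


ALL_CLASSES = ["Advice", "Mechanism", "Effect", "Int", "Negative"]
gold = ["Effect", "Advice", "Negative", "Int", "Effect"]
pred = ["Effect", "Negative", "Negative", "Advice", "Mechanism"]

p_all, r_all, f_all = micro_scores(gold, pred, ALL_CLASSES)
p_pos, r_pos, f_pos = micro_scores(gold, pred, ALL_CLASSES[:4])  # positive types only
```

When every instance carries exactly one label and all classes are scored, micro-precision, micro-recall and micro-F coincide with plain accuracy.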

#### *4.5. Baselines*

The following two kinds of methods are selected as the baseline for evaluating the performance of DDINN in this paper:


#### **5. Results and Discussion**

#### *5.1. Performance Comparison*

As shown in Table 3, we compare the performance of our DDINN method with that of eight baseline methods. For each method, the *Fmicro* score for the four DDI types and the overall precision, recall and *Fmicro* score are listed. The performance statistics are obtained by conducting test experiments on the DDIExtraction2013 dataset, except for UTurku, GGNN and GCNN, whose results are cited directly from their original papers. This is because we could not find available code or runnable binaries for these methods, and they all reported performance on the DDIExtraction2013 dataset. The highest values in each test are marked in bold, and the second-best ones are underlined.

In comparison with all baseline methods, DDINN exhibited the highest performance scores, except for the Int DDI type. The main reason for this is that DDINN requires a relatively large amount of training data, and training data with the Int DDI type appear only rarely (1.68% of the total training data) in the DDIExtraction2013 dataset (see Table 1). The experimental results indicate that the series of optimizations used in DDINN improved the quality of the results of the DDI prediction task.

The training process of this model on the DDIExtraction2013 dataset is shown in Figure 4, which shows the changes in the precision, recall, and the *Fmicro* score values over the epoch. From the figure, it can be seen that all these values improve faster in the early stage of the training, and then they fluctuate continuously to find the local optimal value; finally, they gradually converge to smooth values.

**Figure 4.** Precision, recall, and *Fmicro* value on entire test dataset in the training process.

#### *5.2. Error Analysis*

Figure 5 shows the confusion matrix of the model in this paper. Each column of the matrix represents an instance prediction of a class, while each row represents an actual instance of the class. The darker color in the figure indicates a larger proportion of error. To clearly highlight the misclassification of the DDI predicted by our model, the values in the confusion matrix are normalized.
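A row-normalized confusion matrix of this kind can be computed as in the sketch below, where each row (actual class) is divided by its L1 norm so that its entries sum to one. Normalizing by rows is an assumption based on the description above, and the names and toy labels are hypothetical.

```python
def confusion_matrix(gold, pred, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: k for k, c in enumerate(classes)}
    M = [[0] * len(classes) for _ in classes]
    for g, p in zip(gold, pred):
        M[index[g]][index[p]] += 1
    return M


def l1_normalize_rows(M):
    """Divide every row by the sum of its absolute values (rows of zeros stay zero)."""
    out = []
    for row in M:
        total = sum(abs(v) for v in row)
        out.append([v / total if total else 0.0 for v in row])
    return out


classes = ["Int", "Advice", "Negative"]
gold = ["Int", "Int", "Advice", "Negative"]
pred = ["Advice", "Int", "Advice", "Negative"]
counts = confusion_matrix(gold, pred, classes)
normalized = l1_normalize_rows(counts)
```

After normalization, an off-diagonal entry reads as the fraction of instances of the row's class that were assigned to the column's class, which makes rare classes such as Int comparable with frequent ones.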


**Table 3.** Performance comparisons with other DDI prediction methods.

<sup>1</sup> The second best value of the column is marked by underline style. <sup>2</sup> The best value of the column is marked by bold style.

**Figure 5.** Confusion matrix with L1 normalization.

From Figure 5, we can see that there are two main types of classification errors for the model: (1) the class of relations with the Int type is often incorrectly classified as the Advice type; (2) the four positive classes of relations (Advice, Mechanism, Effect and Int) are often incorrectly classified in the negative class.

For the first type of error, which is already briefly discussed in Section 5.1, the reason is that the number of Int instances is too small, with only 96 instances in the training set. We also observed that instances of the Int and Effect DDI types in the dataset have similar semantics, which makes it difficult for the model to separate these two categories. The second type of error is also mainly caused by the dataset: the number of negative instances is 28,509, while the number of positive instances is only 4999, which inevitably causes some positive DDI instances to be misclassified as negative.

#### *5.3. Ablation Study*

Additional ablation experiments are conducted in order to evaluate the influence of different modules or optimizations on DDI prediction. Firstly, the impact of contextual representation methods has been investigated. The corresponding results are shown in Table 4, in which method "GCN only" refers to the model without any contextual representation engagement, and the others are models using GRU, LSTM and cBiLSTM to extract contextual representations, respectively. From Table 4, we can see that cBiLSTM improves the F-score of the GCN-only model by 6.1%, and the cBiLSTM model is indeed more suitable for DDI prediction tasks than some other RNN models.


**Table 4.** Ablation study on different contextual representation methods.

We also investigate the influence of the self-attention pruning strategy used in the construction of the weight-rebalanced dependency matrix, and the results are listed in Table 5. "Full tree" means the method without any pruning strategy. "LCA (*k* = *n*)" means using the LCA strategy [46] to conduct the tree pruning, where the subtree only includes tokens within a range of *n* words. From Table 5, we can see that the self-attention-based pruning strategy improved the F-score by 5.4% compared with the full tree strategy. Self-attention adds some complexity to the model, but the improvement justifies it.

**Table 5.** Ablation study on different syntactic dependency extraction methods.


#### **6. Conclusions**

In this paper, we proposed DDINN, a novel graph-convolutional-network-based method for mining knowledge about interactions between drugs from the extensive literature. Our method uses cBiLSTM to capture the contextual information of input sentences and target drug entities. Additionally, the self-attention mechanism is used to maximize the acquisition of syntactic information related to the DDI extraction task and to discard irrelevant information. Finally, the output of cBiLSTM and the weight-rebalanced dependency matrix are fed into GCN layers to obtain the DDI type classifier.

The evaluation experiments show that the DDINN model achieved higher performance than other state-of-the-art DDI prediction methods on the DDIExtraction2013 dataset. In future work, we will consider data augmentation and other schemes to improve the performance of DDINN on the imbalanced dataset. Additionally, we hope to improve the interpretability [47,48] of deep learning networks in DDINN, which will enhance its utility in the medical field.

**Author Contributions:** Conceptualization, X.X.; funding acquisition, X.X. and F.M.; project administration, X.X.; software, X.X. and L.S.; validation, L.S. and F.M.; writing—original draft, X.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by grants from the Fundamental Research Funds for Inner Mongolia Normal University (2022JBQN105), Inner Mongolia JMRH Project (JMRKX202201) and Fundamental Research Funds for Inner Mongolia Normal University (2022JBQN109).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The datasets and codes used in this paper to produce the experimental results are publicly available at GitHub (https://github.com/xingjainxu/DDINN, accessed on 29 December 2022). The project code of biolitNER is also open sourced and accessible at GitHub under the GPLv3 license.

**Acknowledgments:** We thank Sun for their help in setting up the experiment's server node.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Article* **An RG-FLAT-CRF Model for Named Entity Recognition of Chinese Electronic Clinical Records**

**Jiakang Li <sup>1,2</sup>, Ruixia Liu <sup>1</sup>, Changfang Chen <sup>1</sup>, Shuwang Zhou <sup>1,3</sup>, Xiaoyi Shang <sup>1</sup> and Yinglong Wang <sup>1,\*</sup>**


**Abstract:** The goal of Clinical Named Entity Recognition (CNER) is to identify clinical terms from medical records, which is of great importance for subsequent clinical research. Most current Chinese CNER models use a single set of features that does not consider the linguistic characteristics of the Chinese language; e.g., they do not use both word and character features, and they lack morphological information and specialized lexical information on Chinese characters in the medical field. We propose a RoBerta Glyce-Flat Lattice Transformer-CRF (RG-FLAT-CRF) model to address this problem. The model uses a convolutional neural network to discern the morphological information hidden in Chinese characters, and a pre-trained model to obtain vectors with medical features. The different vectors are stitched together to form a multi-feature vector. To use lexical information while avoiding word segmentation errors, the model uses a lattice structure to add the lexical information associated with each word. The RG-FLAT-CRF model achieved F1 scores of 95.61%, 85.17%, and 91.2% on the CCKS 2017, 2019, and 2020 datasets, respectively. We used statistical tests to compare it with other models. The results show that most *p*-values are less than 0.05, indicating statistically significant differences.

**Keywords:** clinical named entity recognition; Chinese medical text; pre-trained model

#### **1. Introduction**

Informatization has penetrated all aspects of social life. In the medical field, more and more hospitals are building information systems to improve their service level and core competitiveness, effectively use limited medical resources, and provide patients with high-quality treatment. These information systems can not only improve doctors' efficiency but also enhance internal management, making information communication among departments more efficient and simplifying and standardizing the medical treatment process. Medical staff can be released from tedious and repetitive work, with extra time and energy being used to provide better patient services.

Existing medical systems have generated vast amounts of medical data, and if the data cannot be used effectively, professional knowledge is wasted. As a medical record, the Electronic Medical Record (EMR) has received great attention in scientific research [1] because it contains complete and detailed clinical information generated by patients during each visit. EMR refers to the digital information, such as words, symbols, charts, graphics, data, and images, generated by medical personnel using the information system of medical institutions in medical activities. EMR contains various kinds of information, such as text and medical images. Medical images are mainly the results of laboratory tests of patients, such as CT and B-ultrasound. These medical images can currently be analyzed using pattern recognition and machine learning methods, but EMR also contains much textual data. To make use of the text data, Natural Language Processing (NLP) technology is essential. Electronic medical records cover all patient information from admission to discharge, including admission time, symptoms, body parts, examination methods, medication, and other physical information [2]. Medical services may consider providing patients with the facility to submit inquiries in the form of comments [3].

**Citation:** Li, J.; Liu, R.; Chen, C.; Zhou, S.; Shang, X.; Wang, Y. An RG-FLAT-CRF Model for Named Entity Recognition of Chinese Electronic Clinical Records. *Electronics* **2022**, *11*, 1282. https://doi.org/10.3390/electronics11081282

Academic Editors: Agnieszka Konys and Agnieszka Nowak-Brzezińska

Received: 9 March 2022 Accepted: 15 April 2022 Published: 18 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

EMR information extraction aims to identify various medical entities from texts and establish relationships among them. Information extraction from EMR was first carried out on English medical records, where much progress has been made, while domestic research on Chinese EMR is still in its infancy. Therefore, it is our top priority.

Named Entity Recognition (NER) is the foundation of text data mining and information processing. For entity recognition in the medical field, it refers to identifying entities such as symptoms, body parts, examinations, etc. Identifying this information and analyzing the relationship among different entity information plays an indispensable role in establishing a knowledge map in the medical field, building an auxiliary diagnosis model, and providing data support for clinical decision-making.

Early NER systems were mainly rule-based. These approaches extract the target entity through preset rule templates and have achieved certain results. For some uncommon fields, experts need to write the rules, which is demanding, time-consuming, and limited in coverage. Nevertheless, rule-based approaches are not outdated and remain an important complement to other approaches.

Feature-based Supervised Learning Approaches transform NER tasks into classification tasks or sequence labeling tasks. Conditional Random Fields (CRF) and Hidden Markov Models (HMM) [4] are two common algorithms.

With the rapid expansion of deep learning, the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) have been applied to CNER tasks [4]. Alam et al. [5] proposed a new framework based on association rule mining for prognostic factor identification in malignant mesothelioma. At present, the integration of LSTM and CRF is a common method. However, it has limitations, notably in modeling long-distance dependencies. The Transformer [6] introduced self-attention, which captures the long-distance dependencies that LSTM networks handle poorly. Transformers have gradually replaced LSTM as the mainstream feature extractor in NLP.

Unsupervised pre-trained models are suitable for general domains but not appropriate for the medical domain.

In addition, Chinese NER is related to word segmentation. Since Chinese entities are generally composed of words, word segmentation errors will lead to errors in Chinese NER. The character-based Chinese NER model cannot fully utilize the information of words. The Lattice-LSTM proposed by Zhang et al. [7] improves the accuracy of this task by adding dictionary information to the model. However, due to the complexity of the lattice structure, it does not support parallel computing. Li et al. [8] proposed the Flat Lattice Transformer (FLAT), which uses a flattened lattice structure and a Transformer to realize parallel processing. At the same time, FLAT uses the relative position calculation of the Transformer-XL model [9]. By adding extra position information to the Transformer structure, it can model long text and capture ultra-long-distance dependencies.

In addition, unlike many other languages, Chinese is written with pictographic characters, which contain rich semantic information. Many words with similar meanings are similar in the composition and structure of their characters, which is especially obvious in the medical field. The glyph information of Chinese characters is therefore of significant reference value. Glyce, proposed by Meng [10], can extract the glyph vectors of Chinese characters. It attempts to extract the semantics of Chinese characters from various ancient and modern scripts and various writing styles, and it improves performance.

To solve these problems, we propose a RoBerta Glyce-Flat Lattice Transformer-CRF (RG-FLAT-CRF) model suitable for Chinese CNER tasks. First, the glyph vector is obtained by Glyce, the character vector and word vector are obtained by Word2vec [11], and the character vector obtained by RoBerta is spliced with the glyph vector and the word vector obtained by Word2vec. At the same time, the Flat-lattice structure is used: word information is added, the head position code and tail position code are constructed for each character and word, and the relative position code is calculated. The concatenated vectors and the corresponding position encodings are sent to a Transformer to extract the context information of every Chinese character. Finally, we jointly decode the labels of the entire sentence using CRF. Our main contributions are as follows:

#### *Contribution*

In our contributions, we have:


The rest of this article is organized as follows. Section 2 provides a brief review of related work of NER. The proposed model is presented in Section 3. The relevant content of the experiment is described in detail in Section 4. Finally, Section 5 gives the conclusions.

#### **2. Related Work**

We reviewed studies on the following topics: (1) how to enhance the semantic representation of Chinese word vectors; (2) feature extraction networks more applicable to the Chinese language; (3) the characteristics and difficulties of named entity recognition in Chinese electronic medical records; and (4) related evaluation metrics [12]. We used multiple search strings, such as "Chinese electronic medical record named entity recognition", "Chinese named entity recognition", and "medical named entity recognition", to retrieve peer-reviewed articles from multiple databases, including Scopus, ACM Digital Library, IEEE Xplore, ScienceDirect, SpringerLink, and Google Scholar [13].

This section primarily provides a brief introduction to rule-based and dictionary-based methods, machine learning-based methods, and deep learning-based methods. Then, the representation method of the word vector is introduced.

#### *2.1. Rule-and-Dictionary-Based Clinical Named Entity Recognition*

Nowadays, Rule-and-Dictionary-Based CNER is commonly used, and these methods benefit from the development of professional medical dictionaries. Researchers complete the NER task by pattern matching against the entity lists in the dictionary. Friedman et al. [14] developed a clinical document processor that recognized medical information in the medical record and mapped this information into a structured representation containing medical terms. Fukuda et al. [15] proposed a method to identify the names of substances such as proteins from biological papers, using the characteristics of proper noun descriptions in the professional field, which eliminates the need to prepare a professional term dictionary in advance. Names can be extracted with precision, whether they are known or newly defined, and whether they are single or compound words.

The completeness and accuracy of the dictionary and the accuracy of the matching algorithm determine the accuracy of such methods. Therefore, dictionary-based methods are more suitable for fields where proper nouns are fixed and updated infrequently. In the biomedical field, there are problems such as the fast updating of proper nouns and different expressions of the same entity name. Experts need to spend much time and effort writing rules, and the cost is high. In addition, different rules are needed for different systems, so the rules have poor portability and are hard to reuse quickly.

#### *2.2. Clinical Named Entity Recognition Based on Machine Learning*

In the past, CNER based on traditional machine learning has been widely used, including HMM, CRF, Support Vector Machine (SVM) [16], Naive Bayesian Model (NBM) [17], etc. Settles [18] used combined feature sets with CRF in biomedical NER tasks. Tang [19] developed an SVM-based NER system for medical entities in the medical record. Roberts et al. [20] utilized SVM with a manually constructed dictionary for classification. Liu [21] evaluated the contribution of different features in the CRF-based CNER task.

Compared with the methods analyzed in Section 2.1, the methods in Section 2.2 do not require the experimenter to master much linguistic knowledge, thus saving time and effort. However, this type of method requires considerable effort to design features, and the effect of the model depends on the designed features. Deep learning modeling can address the feature extraction problem of traditional machine learning.

#### *2.3. Deep-Learning-Based Clinical Named Entity Recognition*

Recently, we have witnessed the great success of deep learning in the field of NLP, for example in NER and event extraction tasks. Commonly used network models include Convolutional Neural Networks (CNN) [22], Recurrent Neural Networks (RNN) [23], and LSTM. Ma et al. [24] proposed the Bi-directional LSTM-CNNs-CRF model, in which character-level representations are extracted using CNN and a Bi-directional LSTM (BiLSTM) models the contextual information of each word. Xu et al. [25] combined bidirectional LSTM and CRF; their BiLSTM-CRF model can learn the information features of a given dataset and achieved a score of 0.8022 on NCBI, outperforming many widely used baseline methods. Yin et al. [26] used convolutional neural networks for Chinese character radical feature extraction and captured the correlation between characters using self-attention. Kong et al. [27] proposed a Chinese medical named entity recognition method based on a multi-layer CNN and an attention mechanism, constructing a multi-layer CNN to extract short-term and long-term memories and using the attention mechanism to capture global information. However, the above deep neural network-based CNER methods cannot model the ambiguity of Chinese.

The BERT-BiLSTM-CRF model was proposed by Jiang et al. [28] for CNER. The semantic representation of words was enhanced with a BERT pre-trained language model, and the BiLSTM was used to learn contextual information. Qin et al. [29] proposed a BERT-BiGRU-CRF model in the field of Chinese electronic medical records, which uses BERT to convert the electronic medical record text into low-dimensional vectors and BiGRU to obtain contextual features. Wu et al. [30] used RoBerta to learn medical features and a bi-directional LSTM model to learn the partial head information of medical entities. Wang et al. [31] used information from medical encyclopedias as additional information to enhance the recognition of Chinese electronic medical record entities. However, these models do not fully consider the characteristics of medical domain data and are not very effective in medical entity extraction.

#### *2.4. Research Status of Word Vector Representation Methods*

To represent a word in a text so that mathematical calculations can be performed on it, word embedding is required. The bag-of-words model simply represents words without any semantic features, and its dimension grows with the number of words. To solve this problem, researchers proposed using a pre-trained language model for word representation. Pre-training refers to obtaining a model independent of subsequent tasks from a large-scale corpus using self-supervised learning. The model can be transferred to other tasks, thereby reducing the training burden of subsequent tasks. The Word2Vec model was proposed by Mikolov et al. to obtain vectors. The GloVe algorithm was proposed by Pennington et al. [32]. In recent years, pre-trained models have received increasing attention. Since Word2Vec and GloVe produce context-independent word vectors trained by static pre-training technology, they cannot accurately model the polysemy of a word. Therefore, Peters et al. [33] proposed the ELMo algorithm, which uses a bidirectional LSTM network structure for context encoding and can effectively capture context information.

#### 2.4.1. Models for BERT and Its Variants

Devlin et al. [34] proposed Bidirectional Encoder Representations from Transformers (BERT). The emergence of BERT opened a new era of research in the field of NLP. Several improved pre-training models based on BERT followed, mainly including ERNIE [35], BERT-WWM [36], RoBerta [37], and XLNet [38]. The ERNIE model is pre-trained using massive corpora in multiple fields, including encyclopedias, news, forums, etc. BERT-WWM's improvement over BERT is to mask a complete word rather than a subword. The RoBerta model uses a dynamic mask mechanism for pre-training, removes the next sentence prediction (NSP) task, and enlarges the batch size. As an auto-regressive model, XLNet extends the language model to predict words in both directions, using the preceding context to predict the next word and the following context to predict the previous words.

#### 2.4.2. Research on Chinese Characters

The structure of Chinese characters is different from that of English words. Chinese characters are pictographs, and their glyphs also contain rich meanings. Therefore, many scholars have studied how to represent the glyph features of Chinese characters. Sun [39] proposed to learn the radical features of Chinese. Wang et al. [40] proposed a Chinese character root and stroke-enhanced embedding method for learning Chinese character roots from the internal information of semantics and form. Wei [41] proposed a visual embedding method for semantic association among visual words, which segments the glyph, splices the average embedding vectors corresponding to each sub-region, and converts the result into a fixed-length vector for keyword detection. Su [42] used convolutional autoencoders to learn glyph features from images of traditional Chinese characters and introduced glyph features during training using the corpus. Meng [10] proposed the Glyce model. It tried to extract the semantics of Chinese characters from various ancient and modern scripts and various writing styles, and the performance was improved.

Exploiting these characteristics of Chinese can improve CNER performance. However, the current mainstream CNER methods cannot integrate the pre-trained model with the Chinese glyph information.

#### **3. Proposed Method**

In the NER task, the character sequence of the input text is represented by *X* = (*x*1, *x*2,..., *xn*). The labels of the input text are represented by *Y* = (*y*1, *y*2,..., *yn*). The goal of an NER system is to predict the correct label sequence *Y* given the known character sequence *X* of the text. The RG-FLAT-CRF model proposed in this section consists of three parts: the embedding layer, the encoding layer, and the decoding layer. The overall structure is shown in Figure 1.

The model first matches the latent words related to each character in the input text and splices the character information and word information into the embedding layer. The embedding layer consists of three parts, and the character vector is spliced after processing by RoBerta, Glyce, and Word2vec. The word vector is obtained using Word2vec, head and tail position encodings are constructed for each character and word, and the relative position encoding is calculated. The concatenated vectors and the corresponding position encodings are input into the encoding layer, which consists of a Transformer neural network that captures deep features and encodes the input sequence. Finally, the output of the encoding layer is input to the decoding layer, which predicts the final label sequence.

**Figure 1.** Model structure diagram of RG-FLAT-CRF.

This study uses NER to perform entity recognition on Chinese EMR. Specific steps are as follows:


#### *3.1. Embedding Layer*

The embedding layer consists of three parts: RoBerta layer, Glyce layer, and Word2vec layer:


The character vectors processed by RoBerta, Glyce, and Word2vec are spliced to obtain multi-feature word vectors, and then the character vectors and word vectors processed by Word2vec are spliced together.

#### 3.1.1. RoBerta

Pretrained language models are often used in NER tasks to generate richer semantic representations. BERT and its variant RoBerta are widely used in research. We use RoBerta for text encoding instead of BERT. Compared with BERT, the model structure of RoBerta has not changed. Both are composed of 12 stacked Transformer layers, each with a hidden state of 768 dimensions and a 12-head self-attention mechanism. The only thing that has changed is the pre-training method: RoBerta adopts dynamic masks and a different text encoding, removes the NSP task, and uses more data to train the model.

The character vector is obtained through RoBerta, whose structure is shown in Figure 2. The input text is *Z* = {*Z*1, *Z*2,..., *Zx*}. First, the sequence is vectorized. This part consists of token embedding, clause embedding, and position embedding. These three embedding layers are essentially equivalent to static embedding layers, and the table lookup is performed by the embedding matrix. For the *x*-th token in the processed token sequence, the vector calculation is as follows:

$$
\sigma_x = W_t(E_{t_x}) + W_s(E_{s_x}) + W_p(E_x) \tag{1}
$$

where $W_t$, $W_s$, and $W_p$ are the token embedding matrix, the clause embedding matrix, and the position embedding matrix, respectively.
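The table lookup and sum in Equation (1) can be sketched as follows. This is a minimal illustration, not the RoBerta implementation: the table sizes, the random values, and the helper names `make_table` and `embed` are assumptions made for the example.

```python
import random

random.seed(0)
DIM = 4  # toy hidden size (the model described here uses 768)

def make_table(rows, dim):
    """Create a toy embedding table with random values (illustrative only)."""
    return [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]

token_table = make_table(10, DIM)    # hypothetical vocabulary of 10 tokens
segment_table = make_table(2, DIM)   # two clauses (segments)
position_table = make_table(8, DIM)  # hypothetical maximum length of 8

def embed(token_ids, segment_ids):
    """Sum the token, clause, and position embeddings at each input position."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])
        ])
    return vectors

inputs = embed(token_ids=[1, 5, 7, 2], segment_ids=[0, 0, 0, 0])
```

Each row of `inputs` has the same dimension as a single embedding, because the three lookups are added rather than concatenated.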

**Figure 2.** Structure diagram of RoBerta.

Token Embeddings represent the Embedding vector of each word. Segment Embeddings are used to distinguish different sentences before and after punctuation marks. Position Embeddings represent the embeddings of a word's position. The input feature of RoBerta is the sum of the above 3 embeddings. "[CLS]" is used as the starting symbol of the input, indicating that the feature can be used in the classification model. "[SEP]" indicates the clause symbol, which is used to cut off the clauses in the sentence.

The obtained vector is input into the stacked Transformer to extract features. The final output is the encoding of the input sentence text, that is, a sentence representation vector carrying the dependency information among the words in the sentence. The calculation is as follows:

$$H = \mathrm{Mul}_{trans}(E) \tag{2}$$

where $\mathrm{Mul}_{trans}(\cdot)$ represents the stacked Transformer, which outputs the text encoding of the entire sentence through the last layer, $H = (h_0, h_1, \ldots, h_x)$. Here, $h_x$ is the text representation vector of the *x*-th token.

#### 3.1.2. Glyce

Chinese characters are pictographs, and most of them have evolved from pictures. Chinese characters contain rich semantic information, especially in the medical field, where most of the characters for diseases share the same components. Therefore, we believe that adding glyph information to word vectors can enhance the representation of characters.

Glyce uses different historical versions of the script, as well as different writing styles, to enhance the representation of the characters.

Glyce differs from a traditional CNN. There are about 100,000 Chinese characters, but only a few thousand are commonly used. Compared with classification on the ImageNet dataset, there are few training examples for Chinese characters. Compared with ImageNet images, Chinese character images are also usually smaller, with a size of 12 × 12. Thus, following Chinese writing habits, a 2 × 2 Tianzi lattice structure is used. As shown in Figure 3, this structure can reflect the glyph information of Chinese, including components such as radicals, which makes it suitable for the extraction of glyph information.

**Figure 3.** Schematic diagram of the Tianzi lattice.

The structure of the Glyce Tianzi lattice-CNN is shown in Figure 4, and the processing flow is shown in Figure 5. To capture lower-level graphic features, the input image first passes through a convolutional layer with kernel size 5, which also increases the number of feature channels to 1024. Then we apply a max-pooling layer with a pooling kernel of 4 × 4 to perform feature downsampling, which reduces the resolution from 8 × 8 to 2 × 2. This 2 × 2 Tianzi lattice structure shows the glyph features of Chinese characters. Finally, we apply the group convolution operation to map the Tianzi lattice to the final output.
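The resolutions quoted above can be checked with the standard output-size formulas for convolution and pooling. The sketch below assumes stride 1 and no padding for the convolution, and a pooling stride equal to the pooling kernel; these settings are not stated in the text and are chosen because they reproduce the 12 → 8 → 2 sequence.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride=None):
    """Spatial size after max-pooling; the stride defaults to the kernel size."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1

image = 12                            # 12 x 12 character image
after_conv = conv_out(image, 5)       # kernel size 5, assumed stride 1, no padding
after_pool = pool_out(after_conv, 4)  # 4 x 4 max-pooling, assumed stride 4

print(image, after_conv, after_pool)
```

Under these assumptions, a 12 × 12 image becomes an 8 × 8 feature map after the convolution and a 2 × 2 Tianzi lattice after pooling.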

**Figure 4.** CNN structure diagram in Glyce.


**Figure 5.** The Tianzi lattice—CNN structure.

For the input text *Z* = {*Z*1, *Z*2,..., *Zx*}, the glyph vector obtained by Glyce is *EG* = (*eG*0,*eG*1,...,*eGx*) as shown in Figure 6.

**Figure 6.** Glyce character embedding.

#### 3.1.3. Word2vec

We use Word2vec, a typical representative of distributed representation, to obtain word vectors. Compared with one-hot encoding, Word2vec takes into account the relationships among words. In addition, Word2vec also optimizes the training efficiency of the model, so it is used more frequently.

#### *3.2. Position Encoder*

Chinese NER tasks are often considered sequence labeling tasks. The probability of each character corresponding to each entity type label is calculated, and the label with the highest probability is used as the final identification result. There are usually two ways to vectorize Chinese text for the model calculation: methods based on word vectors and methods based on character vectors.

The first task of a word vector-based model is to segment the text into words. Word vectors significantly improve entity recognition because a word contains more semantic information, but if the segmentation is wrong, it will affect the results of NER.

For instance, in Figure 7, the sentence can be divided into '济南人 (Jinan People)', '和 (and)', '山庄 (Mountain Villa)', and can also be divided into '济南 (Jinan)', '人和山庄 (Renhe Mountain Villa)'. These two word segmentation results have a great impact on recognition.

**Figure 7.** Structure diagram of Lattice.

Using character vector-based models avoids word segmentation errors but loses lexical information. For example, when the word '感冒 (cold)' is separated, the characters '感 (feel)' and '冒 (emit)' represent different semantic information: '感 (feel)' means feeling, and '冒 (emit)' means to penetrate outward or rise upward. It is difficult to express the meaning of the word '感冒 (cold)' after '感 (feel)' and '冒 (emit)' are separated, which is especially obvious in the medical field.

To address the above problems, we adopted the FLAT-lattice structure, shown in Figure 8. This structure uses both character vectors and word vectors. Based on character vectors, the latent vocabulary of each character is matched, and the word vectors are added to the model. This method utilizes the semantic relationship of words and avoids the phenomenon of word segmentation errors.

**Figure 8.** Structure diagram of Flat-lattice.

After using the dictionary to obtain lattice information from the string, it is flattened, and the structure is shown in Figure 8.

These flat lattices can also be defined as spans. A span comprises a token, a head, and a tail. A token is a word or character, and the head represents the starting position of the token in the original sequence, and the tail represents the ending position of the token in the original sequence. For characters, the head and tail are the same. For the matched words, head indicates the start position of the word in the sequence, and tail indicates the end position of the word in the sequence. The flat lattice can preserve the original structure of the lattice and, at the same time, preserve the word order information of the original sentence.
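The span construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: positions are 0-indexed, the lexicon is a small hypothetical set built from the segmentation example of Figure 7, and `build_flat_lattice` is an illustrative helper rather than the authors' code.

```python
def build_flat_lattice(sentence, lexicon):
    """Return (token, head, tail) spans: all characters first, then matched words."""
    # A character is a span whose head and tail are the same position.
    spans = [(ch, i, i) for i, ch in enumerate(sentence)]
    max_len = max((len(w) for w in lexicon), default=0)
    for start in range(len(sentence)):
        for length in range(2, max_len + 1):
            end = start + length
            if end > len(sentence):
                break
            word = sentence[start:end]
            if word in lexicon:
                # A matched word keeps its start and end positions in the sentence.
                spans.append((word, start, end - 1))
    return spans

lexicon = {"济南", "济南人", "山庄", "人和山庄"}  # hypothetical dictionary
spans = build_flat_lattice("济南人和山庄", lexicon)
for token, head, tail in spans:
    print(token, head, tail)
```

Because every matched word is appended with its own head and tail, overlapping candidates such as '济南人' and '人和山庄' coexist in the flat sequence, and no single segmentation has to be chosen in advance.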

According to the Flat-lattice structure, there are three interrelationships between spans: intersection, involvement, and separation. We use relative position encoding to encode the positional relationship between spans. Relative position encoding does not directly model these three relationships but obtains a dense vector by computing a set of head and tail differences. In this way, not only can the interrelationships among spans be represented, but more detailed sequence relationships can also be shown, such as the distance between words and characters. Let $head_x$ and $tail_x$, $head_y$ and $tail_y$ denote the head and tail positions of $s_x$ and $s_y$, respectively. Four kinds of relative distances can be used to represent the relative relationship between $s_x$ and $s_y$. Their calculation formulas are as follows:

$$r_{xy}^{hh} = head_x - head_y \tag{3}$$

$$r_{xy}^{ht} = head_x - tail_y \tag{4}$$

$$r_{xy}^{th} = tail_x - head_y \tag{5}$$

$$r_{xy}^{tt} = tail_x - tail_y \tag{6}$$

where $r_{xy}^{hh}$ stands for the distance from the head of $s_x$ to the head of $s_y$, $r_{xy}^{ht}$ is the distance from the head of $s_x$ to the tail of $s_y$, $r_{xy}^{th}$ represents the distance from the tail of $s_x$ to the head of $s_y$, and $r_{xy}^{tt}$ is the distance from the tail of $s_x$ to the tail of $s_y$. The final relative position encoding is a nonlinear transformation of the four distances, which can be calculated as:

$$L_{xy} = \mathrm{ReLU}\left( W_l \left( P_{r_{xy}^{hh}} \oplus P_{r_{xy}^{ht}} \oplus P_{r_{xy}^{th}} \oplus P_{r_{xy}^{tt}} \right) \right) \tag{7}$$

where $W_l$ is a learnable parameter, ⊕ represents the concatenation operator, and $P_r$ is calculated in the same way as the position encoding of the Transformer:

$$P_r^{2k} = \sin\frac{r}{10000^{\frac{2k}{d_{model}}}} \tag{8}$$

$$P_r^{2k+1} = \cos\frac{r}{10000^{\frac{2k}{d_{model}}}} \tag{9}$$
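Equations (3) to (9) can be traced step by step in a short sketch. The code below assumes the base 10000 of the original Transformer position encoding, concatenates the four encodings in the order hh, ht, th, tt, and uses a small hand-filled matrix `W_l` in place of the learned parameter; all function names are illustrative.

```python
import math

def sinusoid(r, d_model):
    """Sinusoidal encoding P_r of a (possibly negative) relative distance r."""
    vec = []
    for k in range(d_model // 2):
        angle = r / (10000 ** (2 * k / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])  # indices 2k and 2k+1
    return vec

def relative_distances(span_x, span_y):
    """Four distances of Equations (3)-(6); each span is a (head, tail) pair."""
    (hx, tx), (hy, ty) = span_x, span_y
    return hx - hy, hx - ty, tx - hy, tx - ty

def relative_encoding(span_x, span_y, weight, d_model):
    """ReLU of W_l applied to the concatenated encodings, as in Equation (7)."""
    concat = []
    for r in relative_distances(span_x, span_y):
        concat.extend(sinusoid(r, d_model))
    return [max(0.0, sum(w * c for w, c in zip(row, concat))) for row in weight]

D_MODEL = 4
# Hypothetical W_l of shape (D_MODEL, 4 * D_MODEL); a real model learns it.
W_l = [[0.1 * ((i + j) % 3 - 1) for j in range(4 * D_MODEL)] for i in range(D_MODEL)]

# Span x covers positions 0..2 and span y covers positions 2..5.
L_xy = relative_encoding((0, 2), (2, 5), W_l, D_MODEL)
```

The four distances are signed, so the encoding distinguishes a span that precedes another from one that follows it.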

#### *3.3. Encoder*

The encoding layer consists of Transformers, which aim to extract semantic and temporal features from the context automatically.

Before the transformer appeared, most NER used BiLSTM as the model's encoder. However, BiLSTM has some problems: (1) The sequential nature of the recurrent neural network represented by LSTM hinders the parallelization of training samples; (2) The problem of long-term dependence cannot be completely solved.

The Transformer avoids the recurrent model structure and uses the attention mechanism for modeling. The structure is shown in Figure 9. We used its encoding part, which consists of two sub-layers, a multi-head self-attention layer and a feedforward network, both of which have a residual connection. Multi-head self-attention combines several self-attention heads, and each sub-layer is accompanied by a "layer normalization" step.

**Figure 9.** Structure diagram of Transformer.

When the encoder encodes a word, the self-attention mechanism can take the other words in the sentence into consideration.

First, we send the vector output of the embedding layer and the corresponding relative position encoding to the encoding layer, which is the encoding layer of the Transformer. The self-attention mechanism creates a Query vector, a Key vector, and a Value vector for each word. They are obtained by multiplying the input by three trained matrices, as follows:

$$Q = Linear(X) = XW\_q\tag{10}$$

$$K = Linear(X) = XW\_k\tag{11}$$

$$V = Linear(X) = XW\_v\tag{12}$$

The second step is to calculate the attention score, which is divided by √*d<sub>head</sub>* to make the gradient more stable. The traditional Transformer model captures contextual semantics by adding position information to the input, but it produces errors when the input text is split into segments. Therefore, the Transformer-XL model adds extra position information to the Transformer structure and converts the absolute position vector into a relative one, which makes it possible to model long text and capture very long-range dependencies. The attention score between input vectors is calculated by the formula:

$$A\_{x,y}^{\*} = \frac{W\_q^\mathsf{T} E\_{s\_x}^\mathsf{T} E\_{s\_y} W\_{k,E} + W\_q^\mathsf{T} E\_{s\_x}^\mathsf{T} L\_{xy} W\_{k,R} + u^\mathsf{T} E\_{s\_y} W\_{k,E} + v^\mathsf{T} L\_{xy} W\_{k,R}}{\sqrt{d\_{head}}} \tag{13}$$

where *W<sub>q</sub>*, *W<sub>k,E</sub>*, *W<sub>k,R</sub>*, *u*, and *v* are learnable parameters, and *E<sub>s<sub>x</sub></sub>* and *E<sub>s<sub>y</sub></sub>* are the embedded representations of *s<sub>x</sub>* and *s<sub>y</sub>*.

The scores are then passed through softmax, which normalizes them over all words. The softmax weights are used to form a weighted sum of the value vectors, which is the output of the self-attention layer at that position:

$$Attention(A, V) = softmax(A)V\tag{14}$$
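To make the four terms of the relative attention score concrete, the following pure-Python sketch computes the scores and the softmax-weighted output for a single head. It is an illustration under stated assumptions, not the authors' code: all names are hypothetical, vectors are treated as rows (so `E[x] @ W_q` replaces the column-vector notation of Equation (13)), and the global content bias `u` is applied to the key-side embedding, as in Transformer-XL.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(vec, mat):
    """Row vector (length d) times matrix (d x h) -> vector of length h."""
    return [dot(vec, col) for col in zip(*mat)]

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def relative_attention(E, L, V, W_q, W_kE, W_kR, u, v):
    """Single-head attention with relative position encodings.

    E: n x d embeddings; L[x][y]: length-d relative encoding of the pair (x, y);
    V: n x d_v value vectors. Returns (attention weights, outputs).
    """
    n, d_head = len(E), len(u)
    q = [matvec(e, W_q) for e in E]    # queries
    k = [matvec(e, W_kE) for e in E]   # content keys
    weights = []
    for x in range(n):
        scores = []
        for y in range(n):
            r = matvec(L[x][y], W_kR)  # position key for the pair (x, y)
            a = dot(q[x], k[y]) + dot(q[x], r) + dot(u, k[y]) + dot(v, r)
            scores.append(a / math.sqrt(d_head))
        weights.append(softmax(scores))
    outputs = [[sum(w[y] * V[y][j] for y in range(n)) for j in range(len(V[0]))]
               for w in weights]
    return weights, outputs

# Toy inputs (made up for illustration): two tokens, d = d_head = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros_L = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
weights, outputs = relative_attention(
    identity, zeros_L, identity, identity, identity, identity, [0.0, 0.0], [0.0, 0.0]
)
```

With the relative encodings set to zero and the biases `u` and `v` at zero, the score reduces to ordinary scaled dot-product attention, which is a convenient sanity check.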

The multi-head attention mechanism consists of multiple self-attention heads. Multiple groups of *Q*, *K*, and *V* are defined so that each group can focus on a different context. The process of calculating *Q*, *K*, and *V* is unchanged, except that the linear transformation uses multiple sets of matrices (*W<sub>Q</sub>*, *W<sub>K</sub>*, *W<sub>V</sub>*) instead of a single set.

For the input matrix *X*, each group of *Q*, *K*, and *V* produces an output matrix *Z*. The resulting matrices are concatenated and multiplied by an additional matrix *W<sub>o</sub>*.

The multi-head attention mechanism enhances the attention layer's performance in two aspects:


$$MH\_{att}(A, V) = \operatorname{Concat}(head\_1, \dots, head\_h) W\_o \tag{15}$$

The resulting output passes through a residual connection followed by layer normalization:

$$X\_{MH\_{att}} = X\_{MH\_{att}} + X \tag{16}$$

$$X\_{MH\_{att}} = LayerNorm(X\_{MH\_{att}}) \tag{17}$$

The result then passes through the feedforward network, again followed by a residual connection and layer normalization:

$$X\_{hidden} = Linear(ReLU(Linear(X\_{attention})))\tag{18}$$

$$X\_{hidden} = X\_{attention} + X\_{hidden} \tag{19}$$

$$X\_{hidden} = LayerNorm(X\_{hidden})\tag{20}$$
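The residual and normalization pattern of Equations (16)–(20) can be sketched for a single position vector as follows. This is a simplified illustration with hypothetical names: the layer normalization here has no learnable gain or bias, and `eps` is an assumed small constant for numerical stability.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization (Equations (16)-(17), (19)-(20))."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def feed_forward(x, W1, b1, W2, b2):
    """Linear -> ReLU -> Linear, as in Equation (18). W1 is d x h, W2 is h x d."""
    hidden = [max(0.0, sum(a * w for a, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(a * w for a, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def encoder_position(x, attention_out, W1, b1, W2, b2):
    """Apply both sub-layer wrappers to one position, given its attention output."""
    x_att = add_and_norm(x, attention_out)
    return add_and_norm(x_att, feed_forward(x_att, W1, b1, W2, b2))
```

In a full encoder the same operations are applied to every position of the sequence, usually as batched matrix operations.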

#### *3.4. Decoder*

The decoding layer consists of CRFs, whose purpose is to resolve the correlation between the output labels to obtain the globally optimal annotation sequence for the text.

For the input sequence *X* = (*x*<sub>1</sub>, *x*<sub>2</sub>, ..., *x<sub>n</sub>*), the predicted label sequence is *Y* = (*y*<sub>1</sub>, *y*<sub>2</sub>, ..., *y<sub>n</sub>*). The score matrix *P* output by the encoding layer is *n* × *k* in size, where *n* is the length of the input sequence and *k* is the number of label types. *P<sub>i,y<sub>i</sub></sub>* is the score of the *i*-th character in the sentence for label *y<sub>i</sub>*. The state transition matrix *A* holds the transition scores between labels: *A<sub>y<sub>i</sub>,y<sub>i+1</sub></sub>* is the transition score from label *y<sub>i</sub>* to label *y<sub>i+1</sub>*. *y*<sub>0</sub> and *y<sub>n+1</sub>* are the start tag and the end tag, respectively. Given the input sequence, the score *S*(*X*, *y*) of a label sequence is:

$$S(X, y) = \sum\_{i=0}^{n} A\_{y\_i, y\_{i+1}} + \sum\_{i=1}^{n} P\_{i, y\_i} \tag{21}$$

The predicted probability is *P*(*y*|*X*). The calculation formula is shown in (22):

$$P(y|X) = \frac{e^{S(X,y)}}{\sum\_{y' \in Y\_X} e^{S(X,y')}}\tag{22}$$

The loss function is the negative log-likelihood:

$$-\log\left(P(y|X)\right) = \log\sum\_{y' \in Y\_X} e^{S(X,y')} - S(X,y)\tag{23}$$

Finally, we adopt the Viterbi algorithm to find the optimal path, that is, the highest-scoring label sequence for the input sequence, as in Equation (24):

$$y^\* = \underset{y' \in Y\_X}{\arg\max}\; S(X, y') \tag{24}$$
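Equations (21)–(24) can be illustrated with a small pure-Python linear-chain CRF. This is a sketch under assumptions, not the authors' implementation: the function names are hypothetical, and the transitions from the start tag and to the end tag are passed as separate vectors `start` and `end` rather than as extra rows and columns of *A*. The scores in the toy example are made up.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(P, A, start, end, y):
    """S(X, y) of Equation (21): transition scores plus emission scores."""
    score = start[y[0]] + end[y[-1]]
    score += sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    score += sum(P[i][t] for i, t in enumerate(y))
    return score

def log_partition(P, A, start, end):
    """Log of the denominator of Equation (22), computed with the forward algorithm."""
    k = len(P[0])
    alpha = [start[t] + P[0][t] for t in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[s] + A[s][t] for s in range(k)]) + P[i][t]
                 for t in range(k)]
    return logsumexp([alpha[t] + end[t] for t in range(k)])

def negative_log_likelihood(P, A, start, end, y):
    """Loss of Equation (23)."""
    return log_partition(P, A, start, end) - sequence_score(P, A, start, end, y)

def viterbi(P, A, start, end):
    """Highest-scoring label sequence, Equation (24)."""
    k = len(P[0])
    best = [start[t] + P[0][t] for t in range(k)]
    backpointers = []
    for i in range(1, len(P)):
        pointers, new_best = [], []
        for t in range(k):
            s_best = max(range(k), key=lambda s: best[s] + A[s][t])
            pointers.append(s_best)
            new_best.append(best[s_best] + A[s_best][t] + P[i][t])
        backpointers.append(pointers)
        best = new_best
    last = max(range(k), key=lambda t: best[t] + end[t])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy example: 2 characters, 2 labels.
P = [[1.0, 0.0], [0.0, 2.0]]
A = [[0.5, 0.0], [0.0, 0.5]]
start, end = [0.0, 0.0], [0.0, 0.0]
best_path = viterbi(P, A, start, end)
```

The forward algorithm avoids enumerating all *k<sup>n</sup>* label sequences when computing the denominator of Equation (22); for a toy input of this size the result can be checked against a brute-force sum.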

#### *3.5. Time Complexity Analysis*

The time complexity of the model is:

$$O\left(n^2 \cdot d + n \cdot d^2 + \sum\_{l=1}^{N} \left(M\_l^2 \cdot K\_l^2 \cdot C\_{l-1} \cdot C\_l + n \cdot k^2\right)\right)$$

where *n* is the sequence length and *d* is the embedding dimension; *N* is the number of convolutional layers of the neural network; *l* indexes the *l*-th convolutional layer; *C<sub>l</sub>* is the number of output channels of the *l*-th convolutional layer, so its number of input channels, *C<sub>l−1</sub>*, is the number of output channels of the (*l*−1)-th layer; and *k* is the number of labels.

#### **4. Experiment Design**

This section presents the following aspects: the dataset used for the experiments, the labeling rules, the evaluation metrics, and an introduction to the comparative experimental model.

#### *4.1. Dataset*

Our proposed RG-FLAT-CRF model is validated with real datasets of three clinical NER tasks.

All three datasets come from the CCKS competitions. They are introduced below.

The CCKS-2017 data are used in the experiments. Since we did not participate in the competition, we could only obtain the part of the data that is open source. The CCKS-CNER2017 dataset provides 300 electronic clinical record texts with 29,865 annotated instances (7816 sentences). It is annotated with five entity types: symptoms and signs, diseases and diagnosis, body parts, examinations and tests, and treatment. Table 1 lists its detailed statistics. The proportion of each part of the data is shown in Figure 11.

**Table 1.** Entity statistics of the three datasets.


CCKS-2019 contains 23,384 annotated instances (10,179 sentences). It is annotated with six entity types: diseases and diagnosis, examinations, tests, surgery, drugs, and anatomical parts. The detailed statistics are shown in Table 1. The proportion of each part of the data is shown in Figure 10.

**Figure 10.** The proportion of medical entities on CCKS2019.

**Figure 11.** The proportion of medical entities on CCKS2017.

CCKS-2020 contains 24,341 annotated instances (13,308 sentences) with six entity types: diseases and diagnosis, examinations, tests, surgery, drugs, and anatomical parts. Table 1 shows the specific statistics. The proportion of each part of the data is shown in Figure 12.

**Figure 12.** The proportion of medical entities on CCKS2020.

#### *4.2. Labeling Rules*

We adopt the BIO labeling scheme, where B marks the beginning of an entity, I marks its interior, and O marks all other characters.

The five entity categories in CCKS2017 are annotated as follows: SS for symptoms and signs, DD for disease and diagnosis, AP for body parts, EE for inspection and examination, and TM for treatment.

The six entity types in CCKS2019 and CCKS2020 are annotated as follows: DD for disease and diagnosis, GEXA for examination, AP for the anatomical site, SU for surgery, EEXA for the test, and DR for the drug.
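As an illustration of the labeling rule, the sketch below converts character-level entity spans into B/I/O tags. The function name, the `B-`/`I-` tag format, and the example sentence are assumptions made for illustration; they are not taken from the datasets.

```python
def bio_tags(text, entities):
    """Return one tag per character.

    entities: list of (start, end, entity_type) with character offsets, end exclusive.
    """
    tags = ["O"] * len(text)
    for start, end, entity_type in entities:
        tags[start] = f"B-{entity_type}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"          # remaining characters of the entity
    return tags

# Made-up example: "无咳嗽" (no cough), with "咳嗽" labeled as a symptom (SS).
example_tags = bio_tags("无咳嗽", [(1, 3, "SS")])
```

Each character receives exactly one tag, so the tag sequence has the same length as the input text.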

#### *4.3. Evaluation Indicators*

This paper uses the most common evaluation metrics in the NER field, Precision, Recall, and F1-score, to evaluate the performance of the model comprehensively. TP is the number of positive samples predicted as positive samples, FN is the number of positive samples predicted as negative samples, and FP is the number of negative samples predicted as positive samples. These metrics are widely used to evaluate classification and sequence annotation tasks [43].

Precision: The ratio of the number of correctly recognized entities to the number of all recognized entities, abbreviated as P. The calculation formula is Equation (25).

Recall: The percentage of correctly identified entities out of the number of entities in the sample. The calculation formula is Equation (26).

Both take values between 0 and 1, and the closer the value is to 1, the higher the precision or recall. Precision and recall sometimes conflict, so a measure that combines the two is needed; the F1-score is their harmonic mean. The higher the F1-score, the more robust the classification model is. The calculation formula is Equation (27).

$$Precision = \frac{TP}{TP + FP} \tag{25}$$

$$Recall = \frac{TP}{TP + FN} \tag{26}$$

$$F\_1\text{-score} = \frac{2 \times Precision \times Recall}{Recall + Precision} \tag{27}$$
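Equations (25)–(27) translate directly into code. The sketch below assumes entity-level evaluation with exact matching of (start, end, type) triples, which is a common convention but is an assumption here; the function names and the example spans are hypothetical.

```python
def entity_counts(gold, predicted):
    """Count TP, FP, FN by exact match of (start, end, type) triples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Equations (25)-(27); returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up example: one of two gold entities is found, plus two spurious predictions.
gold = [(0, 2, "DD"), (5, 7, "AP")]
predicted = [(0, 2, "DD"), (5, 6, "AP"), (8, 9, "SS")]
p, r, f1 = precision_recall_f1(*entity_counts(gold, predicted))
```

Here a prediction with the right type but the wrong boundary, such as (5, 6, "AP"), counts as both a false positive and a missed gold entity.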

#### *4.4. Experimental Parameters*

The parameters of RG-FLAT-CRF were optimized with Adam, and a hierarchical learning-rate scheme was introduced: a learning rate of 3 × 10<sup>−5</sup> is used for the pre-trained RoBerta model, and a learning rate of 2 × 10<sup>−4</sup> is used for the other parts. The batch size for the RG-FLAT-CRF model is 12. Details are shown in Table 2.

**Table 2.** Parameter settings.
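One common way to realise two learning rates is to split the parameters into groups by name. The sketch below builds such groups in plain Python; the function name and the `roberta.` name prefix are hypothetical, and the resulting list follows the per-parameter-group format accepted by optimizers such as `torch.optim.Adam`.

```python
def build_param_groups(named_params, pretrained_prefix="roberta.",
                       pretrained_lr=3e-5, other_lr=2e-4):
    """Split (name, parameter) pairs into two groups with different learning rates."""
    pretrained, other = [], []
    for name, param in named_params:
        if name.startswith(pretrained_prefix):
            pretrained.append(param)   # pre-trained encoder: small learning rate
        else:
            other.append(param)        # remaining layers: larger learning rate
    return [
        {"params": pretrained, "lr": pretrained_lr},
        {"params": other, "lr": other_lr},
    ]

# With PyTorch, the groups could be used as follows (not executed here):
#   optimizer = torch.optim.Adam(build_param_groups(model.named_parameters()))
```

The smaller rate protects the pre-trained weights from large updates, while the newly initialised layers can learn faster.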


#### **5. Results and Analysis**

This section has two parts: a performance comparison with existing models, and an ablation study.

#### *5.1. Performance Comparison with Existing Models*

To verify the effect of the RG-FLAT-CRF model, the RGT-CRF model is compared with existing state-of-the-art models on the CCKS2017, CCKS2019, and CCKS2020 datasets. The comparison models are as follows:

(1) RoBerta: Liu et al. [37] improved the BERT model and proposed the RoBerta model, which performs better than BERT on downstream NLP tasks. Here, RoBerta is used to enhance the semantic representation and complete the NER task.


Tables 3–5 show the precision, recall, and F1 results for each type of medical entity and for all medical entities. As the comparison in Table 6 shows, the proposed RGT-CRF model achieves the best results on the three datasets: the improvement is about 2~5% on CCKS2017, about 0.3~8% on CCKS2019, and about 3~9% on CCKS2020.


**Table 3.** Results of different models on CCKS2017.


**Table 4.** Results of different models on CCKS2019.

**Table 5.** Results of different models on CCKS2020.


The performance of ACNN is unstable on CCKS2017 and CCKS2019. Unlike the other models, ACNN does not use BERT or an improved BERT-based model to enhance semantic representation, but its multi-layer CNN and attention mechanisms play a certain positive role. Across the three datasets, most of the models use BERT or an improved BERT-based pre-training model to enhance semantic representation and achieve good experimental results. RoBerta-BiLSTM-CRF performs better than RoBerta-BiGRU-CRF on the three datasets. Although BiGRU has a simpler structure than BiLSTM, these results suggest that BiLSTM is more suitable for Chinese electronic medical record NER. At the same time, these two models perform only moderately well on the three datasets, as their feature extraction networks are variants of recurrent neural networks and cannot solve the long-range dependency problem. AR-CCNER and Ra-RC performed better on the CCKS2017 and CCKS2019 datasets overall. Although AR-CCNER did not use a BERT-based pre-training model to enhance semantic representation, both AR-CCNER and Ra-RC exploit the characteristics of Chinese: they use BiLSTM and CNN, respectively, to extract radical features, which utilizes the glyph information of Chinese characters to a certain extent. However, they do not learn the overall glyph structure of Chinese characters, and the models also lack medical vocabulary information. BE-Bi-CRF-JN also achieved good results, showing that using an external corpus is effective for Chinese electronic medical record NER. The above analysis shows that the RGT-CRF model is more suitable for Chinese electronic medical record named entity recognition, mainly because the model adds glyph information while introducing word-based lexical information.


**Table 6.** Comparison of the results of different F1 of each model on different datasets.

From the perspective of entity type, the recognition performance on the different medical entities is compared across models. Figures 13–15 show that, on CCKS2017, the different models recognize disease and diagnosis entities poorly, because the two affected entity types in the CCKS2017 dataset contain many long entities such as '右股骨颈骨折髋关节股骨头表面置换术 (Right femoral neck fracture hip femoral head resurfacing)', whose boundaries cannot be clearly identified. On CCKS2019 and CCKS2020, the models also recognize disease and diagnosis entities poorly. The two affected entity types in these datasets contain entities such as 'CA125' and 'CEA', and many entities mix English letters and numbers, such as 'CA199', which also causes the models to fail to identify entity boundaries.

**Figure 13.** F1 values of different entities on CCKS2017 for different models.

**Figure 14.** F1 values of different entities on CCKS2019 for different models.

To make the comparative results more convincing, a further hypothesis test was performed by calculating p-values using the t-test method, and *p*-values smaller than the significance level (usually 0.05) were considered statistically significant. Table 7 shows the statistical comparison of the proposed method with other methods. Most of the results are significant.


**Table 7.** Comparison results with different models on different datasets.

#### *5.2. Ablation Research*

We design a set of ablation experiments to verify the contribution of each part of the model. RGT-CRF-NG denotes the model without glyph information, and RGT-CRF-NF denotes the model without lexical information and its corresponding positional encoding. Both variants are compared with RoBerta-BiLSTM-CRF and RGT-CRF on the three datasets, and the results are shown in Table 8.


**Table 8.** Performance of different variants on three datasets.

The experimental results of RGT-CRF-NF and RGT-CRF-NG are better than those of the RoBerta-BiLSTM-CRF model on the three datasets, indicating that both the glyph information and the lexical information added through the lattice structure are effective for Chinese electronic medical record named entity recognition. The result of RGT-CRF-NG is slightly worse than that of RGT-CRF-NF, indicating that adding medical glyph information to the Chinese electronic medical record NER task is more effective than adding word information. The same pattern appears in the experiments above that use glyph information: the final model with radical information is better than the model without it. This is because many Chinese characters in medical entities share the same glyph structure, so their meanings are also similar.

Examples include '疼 (pain)', '痛 (pain)', '病 (sick)', '腹 (belly)', '腰 (waist)', '肝 (liver)', '脾 (spleen)', '呕 (vomit)', '吐 (threw up)', '咳 (cough)', '嗽 (cough)', '胰 (pancreatic)', '肠 (intestinal)', '肿 (swell)', and '胀 (swell)'. Such characters are very common in medical entities.

#### **6. Conclusions**

In this paper, an RG-FLAT-CRF model is proposed for Chinese CNER. It learns the glyph features of medical characters, introduces word information to enhance word boundaries, and achieves good performance on three datasets. The RG-FLAT-CRF model obtains character vectors through RoBerta, Glyce, and word2vec, and word vectors through word2vec. The word information is fused using the Flat-lattice structure and then encoded by the Transformer network. Based on the output of the encoding layer, the CRF layer predicts the label of each input character. By using the characteristics of Chinese medical characters and a multi-feature fusion vector, the model addresses problems such as word segmentation errors and the lack of lexical information. The final experimental results demonstrate that our proposed model outperforms the baseline models.

Several issues require further research. At this stage, deep learning requires a large amount of annotated data to train the model, as does our proposed model, but large-scale annotated data in the Chinese electronic medical record domain must be annotated by medical experts, which can be time-consuming. Therefore, our future research will investigate how to perform named entity recognition on medical record texts with sparse data.

**Author Contributions:** J.L.: Conceptualization, Methodology, Software, Writing—original draft. Y.W.: Supervision, Project administration. R.L., C.C., and X.S.: Investigation, Writing—review & editing. S.Z.: Data curation, Resources. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Key R&D Program (Demonstration of R&D and Application of Integrated Science and Technology Service Platform for Central Plains Urban Agglomeration), grant number 2018YFB1404500.

**Data Availability Statement:** We used the CCKS open-source Chinese electronic medical record named entity recognition dataset and cite it in the article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Optimising Health Emergency Resource Management from Multi-Model Databases**

**Juan C. Arias 1, Juan J. Cubillas 2,\* and Maria I. Ramos <sup>3</sup>**


**\*** Correspondence: juanjose.cubillas@unir.net

**Abstract:** The health care sector is one of the most sensitive sectors in our society, and it is believed that the application of specific and detailed database creation and design techniques can improve the quality of patient care. In this sense, better management of emergency resources should be achieved. The development of a methodology to manage and integrate a set of data from multiple sources into a centralised database, which ensures a high-quality emergency health service, is a challenge. The high level of interrelation between all of the variables related to patient care will allow one to analyse and make the right strategic decisions about the type of care that will be needed in the future, efficiently managing the resources involved in such care. An optimised database was designed that integrated and related all aspects that directly and indirectly affected the emergency care provided in the province of Jaén (city of Jaén, Andalusia, Spain) over the last eight years. Health, social, economic, environmental, and geographical information related to each of these emergency services was stored and related. Linear and nonlinear regression algorithms were used: support vector machine (SVM) with linear kernel and generalised linear model (GLM), and the nonlinear SVM with Gaussian kernel. Predictive models of emergency demand were generated with a success rate of over 90%.

**Keywords:** healthcare; database design; geospatial data

#### **1. Introduction**

The health sector is one of the most sensitive sectors in our society, and proof of this is the resources and efforts that are invested worldwide in trying to improve the management of the health system, especially in the optimisation of health resources [1]. Although there has been a huge change in the way diseases are diagnosed and treated, there has been little change in the way health services are managed in the 21st century. A number of academic studies have emerged in the field of service design, but not much of this research is available, especially in the field of health services [2]. An overview of the current state-of-the-art in this area shows that the vast majority of it is aimed at achieving greater economic efficiency in some aspects of the sector. There is scientific work in which different techniques have been tested in order to improve the management of health care resources. In this sense, Cubillas et al. [3] used data mining tools to improve appointment scheduling in primary health care centres. The results show that it is possible to predict, with a very acceptable level of precision, the number of patients who will attend the health centre each day. For this purpose, a series of historical assistance data were used. In this type of work, the quantity and quality of available data are the keys to generating an adequate predictive model. Similarly, other research has used spatial analysis to improve the effectiveness of these predictive algorithms, and confirms that the use of spatial data extends the scope of predictive models [4]. Additionally, the use of statistical methods to anticipate patient arrival rates in health care organisations allows one to schedule the internal staff in order to meet the demand for service driven by the patient arrival rate [5]. Nevertheless, this research was limited by the use of only a few months of data to draw inferences about patient arrivals. This generates insights that are less reliable and more subject to short-term idiosyncrasies in the data. In short, all types of research based on models that provide advance information on the behaviour of a phenomenon such as the demand for health resources are highly dependent on a large volume of available data with a high spatio-temporal quality.

**Citation:** Arias, J.C.; Cubillas, J.J.; Ramos, M.I. Optimising Health Emergency Resource Management from Multi-Model Databases. *Electronics* **2022**, *11*, 3602. https://doi.org/10.3390/electronics11213602

Academic Editors: Agnieszka Konys and Andrei Kelarev

Received: 27 September 2022 Accepted: 3 November 2022 Published: 4 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There is no relevant history of the implementation of systems that provide a daily and sufficiently early forecast of the demand for resources in emergency health services (i.e., that provide direct location and management of the resources that attend to patients on a daily basis) [6]. There are some studies that have highlighted an important increase in the need to optimise the structure of databases in order to face the demand for new necessary data in health care management [7,8]. An example of this is the implementation of telemedicine systems and the adaptation to the new information laws, thanks to the new broadband communications that are beginning to become generalised [9].

In the second decade of the 21st century, there has been an increase in publications aimed at improving the structure of databases adapted to daily management as a way to obtain more detailed health information. New channels of communication between the patient and health care services are proliferating by means of portable devices such as tablets and smartphones, or using a PC [10,11]. Subsequently, research in this sector has maintained this line of work and new challenges appear on the horizon. Moreover, the global pandemic initiated in 2020 by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes coronavirus disease 2019 (COVID-19), has led to drastic changes in health priorities. Biomedical priorities have come to dominate the agenda, highlighting the multisectoral knowledge gaps and the challenges to be addressed for health management of the pandemic. In contrast, information management and decision support systems took a back seat [12]. However, once the process of immunisation of the population has begun, it is time to take stock and analyse how health resources have been managed and whether, in some way, correct decision making could have prevented the saturation of health services. There are many and varied data to be assessed in order to carry out an adequate management of health care resources. Despite the current pandemic, the demand for health care continues to be motivated by different pathologies and for different reasons.

In short, nowadays, more and more data are available, all from different sources, with different formats and different temporality and resolution. It is therefore necessary to properly manage and integrate a variety of data from new input devices used in health into centralised databases. It is also important that these databases are able to integrate any new variables resulting from the progress of health research. In this way, the database, in addition to being an effective resource management tool that adds quality to the service, would also have an important role in disease monitoring [13,14]. This scenario requires the development of specific tools and methodologies aimed at achieving these health management goals. For example, Hamami et al. (2019) [15] highlighted that achieving the best model is a complex task due to the interaction of many components and the variability of parameter values that lead to radically different dynamics. It has therefore been pointed out that the modelling process can be improved through the use of data mining techniques [16]. Another study on the use of data mining techniques for decision making in health care management concluded that they can influence the costs, revenue, and operational efficiency while maintaining a high level of patient care [17].

There are, therefore, many aspects to consider and, above all, the large amount of data that is generated every day around a health service must be managed. Thus, in addition to data mining tools, it is an important area of application for big data [18], which is known as medical big data [19]. Medical big data comes from a variety of sources such as administrative records, clinical records, biometric data, data from patient reports, etc. They are also large in scale, extremely fast in update, polymorphic, incomplete, and time sensitive [20]. In addition, whether or not the data are used appropriately remains an open question. The data warehouse (DW) is the answer to data processing, but the applications of traditional DW methods in the health care domain require considerable attention due to the unique business nature of this industry [21]. Muji et al. (2010) [22] proposed a data-driven approach to the development of health information systems, which involved a database-centric system where different applications share the same integrated data source. The database design provides the necessary scalability to cover other specialised applications without the need for structural changes at the database level. The achievement of this objective requires databases with administrative and health care information data from several consecutive years [6,23,24] as well as an efficient model for storing and retrieving big health data to achieve valid estimates for optimal and quality management [25]. In short, in terms of offering an improvement in the quality of health care, it is essential to adapt database systems for use in DW and big data technologies and in their exploitation techniques.

In the field of health emergencies, we can cite the work of Graham et al. (2018) [26], in which a predictive study was carried out on the flow of patients from the emergency department into hospital by using records from two large hospitals in Northern Ireland. This work achieved a reliability of more than 80%.

Other more recent studies such as those by Gurazada et al. (2022) [27] have conducted predictive work on the length of stay of patients in the emergency department. Sixteen potentially relevant factors impacting on waiting times were identified through a literature review.

All of this work contributes to improving patient care by providing health care resource managers with advance information. These studies handle a large volume of patient data. However, an emergency department records a large amount of data: not only patient data, but also data about the service provided, the resources used, and external factors at the time of the emergency. The correct organisation and storage of this heterogeneous information in a database multiplies the possibilities of extracting hidden knowledge as well as predictive capabilities from the data.

This work focused on the management aspect of emergency health resources. It presents the design of a database that is complex enough, that is, with multiple variables extracted from each health emergency demand, to integrate all types of health information to allow for advanced knowledge and better management of these resources, thus providing quality patient care. The aim of this work was the design and implementation of a multidisciplinary database containing all of the information of the complete process of management and resolution of emergencies in the city of Jaén, in Andalusia, southern Spain. This database will serve as a source to apply and analyse regression algorithms of data mining in order to predict the demand for emergency resources that will occur in the coming days.

#### **2. Methodology/Methods**

#### *2.1. Methodology of Work in Emergencies in Andalusia*

An emergency is defined as a situation in which a person's life is in danger, otherwise, it is identified as an urgency. Currently, there is a free emergency telephone number available to citizens in Andalusia (061), Spain, and an urgency number (902505061). This service is provided by EPES (Public Company of Sanitary Emergencies) [28], which has eight provincial services in Andalusia, one per province. The provincial services are the headquarters from which all of the urgencies and emergencies of each province are managed. The most important nucleus of each provincial service is the coordination room, where all calls made by citizens to the emergency number of each province are received. Each coordination room is formed by one or several coordinating doctors and by a set of telephone managers that manage the citizen's health demand. These telephone managers attend to the citizen by gathering as much information as possible about the current request, the coordinating doctor participates in this management and finally decides, depending on the seriousness of the patient, which resource is mobilised to resolve the request.

All of the necessary health resources for all of the urgencies and emergencies of the province are mobilised from the coordination room. EPES has its own resources such as the terrestrial emergency teams (mobile UVI) and the air emergency teams (Sanitary Helicopters) as well as coordinating and mobilising all of the emergency resources of the Andalusian Health Service (SAS) and all ambulances from the province's urgent transport network (RTU). There are approximately 60 units in the Provincial Service of Jaen.

Once the resources have been activated from the coordination room, they are directed to the place of assistance. All movements of these units are recorded in the computer system, so that the geolocation of each unit and the exact moment at which the assistance is resolved are known in real time. All urgency and emergency units have electronic devices (tablets) on which they record the patient's medical history of the care provided (HCDM, Digital Clinical History in Mobility), which is computerised at the time of resolution. The most relevant information of this history is sent to the coordinating centre for storage, along with the demand created by the telephone manager at the beginning of the process. This closes the demand and allows the cycle to start again (Figure 1).

**Figure 1.** Complete cycle of urgencies and emergencies management.

The previously indicated units are located in pre-set and static locations. A significant improvement in this system would be to change the position of the available resources dynamically, according to the type of assistance they provide, the number of them, and the time of year they are available. For this purpose, the database presented in this work can be an important step forward, as the design, structure, and level of detail of the information stored allow for this objective to be achieved.

#### *2.2. Dataset*

The first phase of the work carried out consisted of information gathering, which involved data collection from various agencies involved in the health emergency service. The idea was to identify the idiosyncrasies of the emergency event at each point in time, together with the factors that may have influenced its occurrence, thus enriching the information currently stored in the system. This also involves downloading data from different websites with official statistical data, meteorological, social, and economic data. In the case of this research, data from the last eight years (2013–2020) of health activity in urgencies and emergencies were used. The main characteristic of most of the information integrated in the system is its spatial component. The geolocated data and their descriptions are as follows:

- In order to enrich the data stored in the system, the following data were included:
  - Economic level of the patients in each area: This was estimated, on the one hand, from the cadastral value of real estate, extracted from the Directorate General for Cadastre of Spain website [30], and, on the other hand, from an analysis of the current price of housing in each district of the city obtained from real estate web portals.
  - Level of unemployment in patients and their family units: This variable was obtained from data provided by the Spanish National Statistics Institute [31]. This public institution provided us with the type of population in each census tract. In Spain, a census tract is a small region of a city with between 1000 and 2500 residents.
  - Level of study, age of citizens, and members of the family unit: These data were obtained from the website of the Institute of Statistics and Cartography of Andalusia [32].

#### *2.3. Data Mining Algorithms*

As indicated in the introduction, in this work, data mining techniques were used, based on the attributes stored in the designed database. Prior to the development of the models, the following techniques were reviewed:

The predictive study carried out in this work was based on the use of regression algorithms. These included linear regression, logistic regression, the generalised regression model, one-class support vector machine (SVM), etc. In this study, the nature of the variables was unknown a priori; in fact, as discussed above, they are heterogeneous. Linear and nonlinear regression algorithms were used as follows:


$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon \tag{1}$$

The first algorithm selected in this study was the generalised linear model (GLM) algorithm, which mathematically links the weighted sum of the features to the mean value of the assumed distribution through a link function g, which can be chosen flexibly depending on the type of result.

$$\log\left(E_Y(y \mid x)\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \tag{2}$$

In other words, this algorithm is an extension of linear models that allows non-normal distributions and non-constant variances to be modelled. Linear models make a set of restrictive assumptions, chiefly that the target is normally distributed conditional on the value of the predictors, with a constant variance regardless of the value of the predicted response. GLM relaxes these restrictions; for example, for a binary response, the response is a probability in the range [0, 1] [33,34].
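As a minimal sketch of Equations (1) and (2), the following Python snippet computes a prediction from the weighted sum of the predictors, first with an identity link (ordinary linear regression) and then with a log link. The function names, the coefficient values, and the two-predictor observation are illustrative assumptions and are not values fitted in this study.

```python
import math

def linear_predictor(beta0, betas, x):
    """Weighted sum of the features: beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

def predict_identity(beta0, betas, x):
    """Linear regression, Equation (1) without the error term."""
    return linear_predictor(beta0, betas, x)

def predict_log_link(beta0, betas, x):
    """Log link, Equation (2): log(E[y|x]) = eta, hence E[y|x] = exp(eta)."""
    return math.exp(linear_predictor(beta0, betas, x))

# Hypothetical coefficients and one hypothetical observation (two predictors).
beta0, betas = 1.0, [0.5, -0.25]
x = [2.0, 4.0]

eta = linear_predictor(beta0, betas, x)         # 1.0 + 1.0 - 1.0
mean_count = predict_log_link(beta0, betas, x)  # exp(eta), always positive
```

With a log link the predicted mean can never be negative, which is one reason such a link is a natural candidate when the target is a count, such as a number of daily activations.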

Another algorithm selected was SVM, which has the advantage that it can be used with different kernels. A kernel implicitly transforms the data according to a function before a hyperplane is fitted, which helps the algorithm adapt to the nature of the data and allows for an unlimited range of transformations.

In this study, we worked with the SVM with linear kernel. When the linear kernel is used, the following transformation is performed:

$$\mathbf{K}(\mathbf{x}, \mathbf{x}') = \mathbf{x} \cdot \mathbf{x}' \tag{3}$$

This algorithm has the advantage that it fits very well if the nature of the data is linear, and if there are many predictor variables (as in the case study). Note that in this algorithm, there is no upper limit on the number of predictor attributes, and the only limitations are those imposed by the hardware.


$$\mathbf{K}(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right) \tag{4}$$

The value of γ controls the behaviour of the kernel. When it is very small, the final model is equivalent to that obtained with a linear kernel. As its value increases, the kernel takes the shape of an increasingly narrow Gaussian bell, which fits very well when the data do not have a linear distribution.
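The two kernels in Equations (3) and (4) can be written in a few lines of Python. This is only an illustrative sketch: the function names and the sample points are assumptions, and the code evaluates the kernels themselves rather than training an SVM.

```python
import math

def linear_kernel(x, z):
    """Equation (3): dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, z))

def gaussian_kernel(x, z, gamma):
    """Equation (4): exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two illustrative points; their squared distance is 8.
x, z = [1.0, 2.0], [3.0, 4.0]

k_lin = linear_kernel(x, z)                  # 1*3 + 2*4
k_small = gaussian_kernel(x, z, gamma=0.01)  # similarity decays slowly
k_large = gaussian_kernel(x, z, gamma=1.0)   # similarity decays quickly
```

For a fixed pair of points, a larger γ yields a smaller kernel value, so each training observation influences a narrower neighbourhood.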

In summary, these are the advantages of these three algorithms in this study, starting from the hypothesis of a priori ignorance of the relationship between the variables with the target attribute and also considering that the training data we had were limited and the predictor variables were numerous. Moreover, the complexity of these algorithms means that the relationship between the attributes used cannot be described by a specific equation. In short, the following algorithms were applied in this study:


#### **3. Results and Discussion**

#### *3.1. Database*

The workflow of the database design was divided into three phases. The first consisted of data collection from all sources described above. In the second, a data cleaning process was developed to facilitate data management and analysis. In this phase, a process of cleaning, normalisation, and grouping was carried out. We started with the original table, demands, which had 84 attributes, in which all the details of the assistance demands made by the emergency teams can be found. In order to prepare the structure of the database for different types of exploitation, this information was restructured into specific blocks that include several tables. These blocks are as follows:

• Resources mobilised in emergency assistance

This is the most important block in the database, as its tables contain the health data. These tables contain all the information corresponding to the assistance provided by the emergency teams during the last eight years in the province of Jaén.

• Patient personal data

This block includes the patient information fields that contain the personal information of each patient. The most significant fields are the age, sex, date of birth, and address (Table 1).

**Table 1.** Personal patient information.


• Patient health information fields

These contain the health information of all patients attended. The most significant fields are shown in Table 2.


• Chronological information fields of the assistance

In relation to the information on the ambulance mobilised for each assistance, the start and end time of each assistance interval is recorded. This information includes the time at which the mobile resource is activated by the coordinating centre, the time at which it arrives at the site of medical assistance, the time elapsed during the action on the patient, the time of transport of the patient to the hospital, and the time at which the mobile resource is available again for the next assistance (Table 3).


**Table 3.** Chronological information of the assistance.
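The chronological fields described above lend themselves to deriving interval durations. The following Python sketch shows one way to do so. The key names and timestamps are hypothetical and do not reproduce the actual column names of the database.

```python
from datetime import datetime

# Hypothetical timestamps for one assistance; key names are illustrative only.
assistance = {
    "activated":        datetime(2020, 3, 14, 10, 0),
    "arrived_on_site":  datetime(2020, 3, 14, 10, 12),
    "left_site":        datetime(2020, 3, 14, 10, 40),
    "arrived_hospital": datetime(2020, 3, 14, 10, 55),
    "available_again":  datetime(2020, 3, 14, 11, 10),
}

def minutes_between(start, end):
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60.0

def assistance_intervals(a):
    """Derive the duration of each stage of the assistance."""
    return {
        "response_min":   minutes_between(a["activated"], a["arrived_on_site"]),
        "on_scene_min":   minutes_between(a["arrived_on_site"], a["left_site"]),
        "transport_min":  minutes_between(a["left_site"], a["arrived_hospital"]),
        "total_busy_min": minutes_between(a["activated"], a["available_again"]),
    }

intervals = assistance_intervals(assistance)
```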

• Weather information

Numerous studies indicate that environmental and weather factors directly influence the onset of certain conditions such as allergies or certain chronic illnesses. Thus, it is important to bear in mind that Jaén is the largest olive oil producer in the world and, therefore, the flowering of the olive tree in spring, when temperatures are high, means that many people allergic to pollen demand emergency services. Adverse weather conditions also lead to a proliferation of accidents requiring emergency services. Therefore, for better management of emergency resources, these meteorological factors have to be considered, as they can cause a sudden increase in the demand for emergency assistance. In this study, some external factors that can influence the number of assistances required have been considered, such as meteorological factors (e.g., minimum, maximum, and average temperature, precipitation, humidity, and daily air quality data). The data downloaded from the website of the Environmental Information Network of Andalusia (REDIAM) [29] have a field ESTACION\_ID, indicating the number of the meteorological station, with a total of 20 meteorological stations monitored throughout the province of Jaén. All atmospheric data collected in the meteorological stations of the urban core of the city of Jaén corresponded to the same period of assistance considered in this research (Figure 2).

**Figure 2.** Definitive table composition of meteorological information.

• Sociological information

The urban core is divided into nine districts called postcodes. Because the location of the assistance provided is given by postcode, as much information as possible was collected for each postcode. The source of information used was the Institute of Statistics and Cartography of Andalusia [32], distributed in 100 fields with these generic areas: (1) total number of inhabitants, by sex and age group; (2) marital status of the population, by age group; (3) level of studies, by age group; (4) types of housing, use, regime, size; and (5) households, and number of people that compose it.

• Geolocated quadrants

One of the factors that enriches the database is the incorporation of geolocated information. This allows the exploitation of the database to take into account the variability of the information depending on its location. The minimum geographical unit considered is a geolocated quadrant of 250 m, which is the one used by the Institute of Statistics and Cartography of Andalusia [32]. This unit was used to divide the urban core map of Jaén into 128 quadrants with cells of 250 m (Figure 3). The information stored for each quadrant was:


**Figure 3.** Geolocated quadrants that divide the town of Jaén.
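Assigning an assistance to a 250 m quadrant amounts to integer division of its projected coordinates by the cell size. The sketch below assumes coordinates in metres and a hypothetical grid origin. The function name, the origin, and the sample point are illustrative and are not taken from the database.

```python
CELL_SIZE_M = 250.0  # side of each quadrant, as stated in the text

def quadrant_of(easting, northing, origin_e, origin_n, cell=CELL_SIZE_M):
    """Return (column, row) of the grid cell containing a projected point (metres)."""
    col = int((easting - origin_e) // cell)
    row = int((northing - origin_n) // cell)
    return col, row

# Hypothetical lower-left corner of the grid and one hypothetical location.
origin_e, origin_n = 430000.0, 4180000.0
col, row = quadrant_of(430620.0, 4180255.0, origin_e, origin_n)
```

Storing the resulting cell index with each demand makes it possible to group assistances and population data by quadrant.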

Finally, in the third phase, the database was designed and created. The structure of the entity relationship diagram (ERD) is presented in Figure 4.

**Figure 4.** Entity relationship diagram (ERD).

The entity relationship model conceptually represents the organisation and relationship of the data in the designed database. In this case, it is a simplified representation as the database has multiple tables (more than 50 tables). The purpose of Figure 4 is to show the type of information stored and its relationship. Each entity was grouped as a block of information, and the data blocks represented were as follows: socio-economic data of the users that can potentially be attended, meteorological and environmental information, clinical data of the users, geographical information of the users, registers of, and finally, the emergency resources mobilised.

The database management system used was the Oracle Database [37], which is a system of object-relational type (ORD). The development environment used was Oracle SQL Developer [38], an integrated development environment that allows working with SQL in Oracle databases. This environment allowed us to create and execute SQL queries and procedures for the integration of different types of information.

• Debugging tables and preparing data

The structure explained above has many applications. One of the most important is to study the number of attendances expected by demarcation, taking into account the different variables that affect the result. Building on this, new tables can be generated in which the information is stored grouped by days, zones, and resources. From these tables, a forecast of the number and type of expected attendances can be obtained, as well as valuable information on the mobilisation of resources expected at different times of the year, also taking into account variables such as atmospheric data, day of the week, whether the day is a holiday or a working day, economic values of the area, and all the variables that surround an assistance (Table 4).


**Table 4.** New generated table for resource management optimisation.
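The grouping by days and zones described above can be sketched in Python as follows. The records, the quadrant identifiers, and the output field names are hypothetical. In the actual system, this aggregation would be carried out over the database tables.

```python
from collections import Counter
from datetime import date

# Hypothetical demand records: (day, quadrant id, resource type).
demands = [
    (date(2020, 1, 1), 17, "mobile UVI"),
    (date(2020, 1, 1), 17, "RTU ambulance"),
    (date(2020, 1, 1), 42, "RTU ambulance"),
    (date(2020, 1, 2), 17, "mobile UVI"),
]

def daily_counts(records):
    """Count assistances per (day, quadrant)."""
    return Counter((day, quadrant) for day, quadrant, _ in records)

def add_calendar_features(counts):
    """Turn the counts into rows enriched with a day-of-week attribute."""
    rows = []
    for (day, quadrant), n in sorted(counts.items()):
        rows.append({
            "day": day,
            "quadrant": quadrant,
            "weekday": day.weekday(),  # 0 = Monday
            "assistances": n,
        })
    return rows

counts = daily_counts(demands)
rows = add_calendar_features(counts)
```

Further attributes, such as weather readings or a holiday flag, could be joined to each row by day in the same way.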

As indicated above, the urban core of Jaén was divided into 128 quadrants, which allowed us to exploit the information in a georeferenced manner. Each of the quadrants was stored using the coordinates that defined it, which allowed us to study the information of the demands and the population in a detailed and geographical manner.

The way the database was designed and the inclusion of geographic information and other external factors such as meteorological factors allowed the data to be exploited for predictive analysis. The growth of the database and the increase in the volume of information stored means that valuable historical data are now available. Future exploitation and analysis of these data such as detecting patterns of behaviour and relationships between variables will allow advance planning of the emergency resources available at each time of the year and in each part of the city.

#### *3.2. Predictive Model*

The results of the first phase of the study corresponded to the application of the minimum description length (MDL) algorithm to identify which attributes had the most influence on the target attribute, in this case, the number of resources mobilised. The MDL algorithm returned a value between 0 and 1 for each attribute in the database, with 0 indicating that the attribute has no relationship with the target attribute and 1 indicating the maximum relationship. Table 5 shows the attributes that were related to the target attribute and were therefore used to train the model. The model was trained with real demand data from 2011 to 2019, and 2020 was held out so that the output of the predictive algorithms could be compared with real demand data from 2020.

**Table 5.** Attributes used in the model.
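The selection and hold-out steps described above can be sketched as follows. The relevance scores, attribute names, and the threshold of 0 are illustrative assumptions. The sketch does not implement the MDL algorithm itself and only shows how scores in the range 0 to 1 might be used to filter attributes and how the data can be split by year.

```python
def select_attributes(scores, threshold=0.0):
    """Keep attributes whose relevance score exceeds the threshold, highest first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score > threshold]

def temporal_split(rows, last_train_year):
    """Train on years up to last_train_year; hold out the later years."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test

# Hypothetical relevance scores (0 = no relationship with the target).
scores = {"max_temperature": 0.31, "weekday": 0.22, "humidity": 0.05, "record_id": 0.0}
selected = select_attributes(scores)

# One hypothetical row per year, 2011 to 2020.
rows = [{"year": y, "activations": 30} for y in range(2011, 2021)]
train, test = temporal_split(rows, last_train_year=2019)
```

Splitting by year rather than at random keeps the evaluation close to the intended use, where past data are used to forecast a future period.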


In this work, as mentioned in Section 2.3, two types of regression algorithms were used: linear and nonlinear. For the linear algorithms, support vector machine (SVM) with linear kernel and the generalised linear model (GLM) were used; for the nonlinear models, SVM with Gaussian kernel was used. Each model has its advantages and disadvantages. SVM performs well when little training data are available. GLM, in turn, is an extension of the linear regression model that is very useful when the conditional distribution of the target attribute is not normal, because it introduces a link function g (Equation (2)). In practice, it is fitted using the maximum likelihood method, and the model is based on calculating the weighted sum of the predictors. These models were formulated by John Nelder and Robert Wedderburn as a way of unifying statistical models such as linear regression, logistic regression, and Poisson regression.

The prediction focuses on determining the number of emergency resource activations that will be required to meet the demand for emergency health care. The three regression models were tested for this purpose. To measure their efficiency, each model was generated with the data for the years 2011–2019, the prediction was made for the year 2020, and the absolute error was calculated by comparing the prediction with the real data for 2020. The results obtained were as follows: GLM had an error of 9%, SVM with linear kernel 16%, and SVM with Gaussian kernel 21%. The efficiency of the model can be seen graphically in Figure 5, which shows the prediction of the GLM model for each day of 2020 (red line) together with the actual number of resources mobilised each day in that year (blue line). An absolute error of 9% for the GLM regression algorithm is a very good value, since the number of resources mobilised ranged from 50 on the day when the most were mobilised to nine on the day when the fewest were mobilised (i.e., the variation was higher than 555%).

**Figure 5.** Prediction on the emergency resource activation. Comparison of the actual data and predictions for the year 2020.
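The text does not give the exact formula behind the percentage error, so the following sketch assumes a mean absolute percentage error as one plausible reading. The daily values are invented for illustration. Only the figures 50 and 9 come from the text.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / actual, expressed as a percentage."""
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical daily activations and predictions (not data from the study).
actual = [20, 40, 25, 50]
predicted = [22, 36, 25, 45]
mape = mean_absolute_percentage_error(actual, predicted)  # about 7.5

# Day-to-day spread quoted in the text: 50 activations against 9.
max_over_min_pct = 100.0 * 50 / 9  # about 555.6
```

Under this assumption, an error near 9% is small compared with a daily spread in which the busiest day has more than five times the activations of the quietest day.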

Considering that the number of activations varies greatly, ranging from 20 to 45 per day, it is very important to have a forecast in advance, as each ambulance is staffed by a doctor, a nurse, and a driver and therefore involves a significant expenditure of health care resources.

#### **4. Conclusions**

The general objective of the project was to create a database as complete as possible and with a great diversity of information, which would represent in detail all possible aspects of the emergency health activity. We did not just want to store data, but to obtain the maximum details of the entire process of attending to an emergency, that is, from the moment the call is received in the coordination room until the end of the assistance received by the patient, thus closing the health claim that said patient originated.

An additional objective that we addressed was to study and store all the non-health aspects that surround an emergency and that may affect that emergency. As previously mentioned, the economic, social, environmental, and geographical aspects of each of the emergencies have been studied. The next step was to analyse all of this information and study the percentage of relationship that each variable had with the appearance or alteration of said emergencies. In this sense, it has been concluded that there is a direct relationship between the environmental factors and the activation of emergency services in Jaén. This relationship was statistically quantified with the MDL algorithm, which quantifies the relationship of each attribute with the target attribute.

Another important achievement is that a model was designed using the multi-model database where not only clinical data, but also other very basic environmental and air quality factors are stored, these attributes being precisely some of the input system data for the prediction. These data are available on several websites with up to a 10-day forecast.

The main conclusion of this work is that we managed to develop models that are able to predict the number of activations of the emergency services with an absolute error of 6%, despite the large variation in the number of activations from one day to another, with variations of more than 110%. Other predictive studies in the health sector have achieved a reliability of around 80% [26]; this study achieved better accuracy. It is also important to note that this study worked with health data captured at the time of care by the doctor or nurse. These data are stored in the optimised database, which allows them to form part of the training data of the predictive model, so that the predictions are recalculated and the model is readjusted each day as the database grows. This differs from other work [39], where public or non-clinical data sources are used.

There are predictive works that use machine learning to address the evolution of patients in the emergency department, more specifically, the level of mortality [40], and others have focused on predicting the population groups that are more likely to use health services [41]. In this sense, what is innovative about the study presented here is that it focused on accounting for the resources that will be mobilised each day (i.e., being able to know in advance the emergency health demand that will be received on a given day). It is therefore a prediction that makes it possible to anticipate the resources needed, improving the quality of patient care. This advance information is an indicator that can be very important for emergency resource managers, and it is a more useful tool than a naïve model based on the average of historic values. The use of this tool can also help to improve several aspects of health care management. The first is economic planning, which becomes possible if the demand is known well in advance. Another important aspect is that the application of the model will increase efficiency, as we will be able to anticipate the demand for resources, a key aspect in health emergencies.

Finally, it can be concluded that this multi-model database allowed us to exploit the information with predictive models. Furthermore, it is a first step toward further work in the future to analyse the type of resources requested in the demands and the main pathologies of the activations, or even determine or predict the location where the emergency activation will take place.

**Author Contributions:** All authors whose names appear on the submission made substantial contributions to the conception or design of the work, nevertheless, here are the concrete contributions of each author: Writing—original draft: J.C.A.; Methodology and formal analysis: J.C.A. and J.J.C.; Writing—Reviewing and Editing, M.I.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Informed Consent Statement:** All methods were carried out in accordance with the relevant guidelines and regulations. All experimental protocols were approved by EPES (Empresa Pública de Emergencias Sanitarias) Consejería de Salud y Familias. Junta de Andalucía (Spain). No clinical data were used in this research. Informed consent was obtained from all subjects and/or their legal guardian(s).

**Data Availability Statement:** Data included in the database cannot be shared due to the data protection law in Spain. The design of the database is available and can be shared openly. To request the data, please contact the first author. In this work, clinical data were not used.

**Acknowledgments:** This work would not have been possible without the support of EPES (Public Company for Health Emergencies), a company belonging to the Andalusian Health System. This work was also partially supported by the Graphics and Geomatics Group of Jaén (TIC-144) and the research project PREDIC\_I-GOPO-JA-20-0006, which is co-financed by the European Agricultural Fund for Rural Development and Junta de Andalucía funds.

**Conflicts of Interest:** We confirm that there are no potential competing interests and all authors have approved the manuscript for submission.

#### **References**


## *Article* **Acquiring, Analyzing and Interpreting Knowledge Data for Sustainable Engineering Education: An Experimental Study Using YouTube**

**Zoe Kanetaki <sup>1</sup>, Constantinos Stergiou <sup>1</sup>, Georgios Bekas <sup>1</sup>, Sébastien Jacques <sup>2,</sup>\*, Christos Troussas <sup>3</sup>, Cleo Sgouropoulou <sup>3</sup> and Abdeldjalil Ouahabi <sup>4</sup>**


**Abstract:** With the immersion of a plethora of technological tools in the early post-COVID-19 era in university education, instructors around the world have been at the forefront of implementing hybrid learning spaces for knowledge delivery. The purpose of this experimental study is not only to divert the primary use of a YouTube channel into a tool to support asynchronous teaching; it also aims to provide feedback to instructors and suggest steps and actions to implement in their teaching modules to ensure students' access to new knowledge while promoting their engagement and satisfaction, regardless of the learning environment, i.e., face-to-face, distance and hybrid. Learners' viewing habits were analyzed in depth from the channel's 37 instructional videos, all of which were related to the completion of a computer-aided mechanical design course. By analyzing and interpreting data directly from YouTube channel reports, six variables were identified and tested to quantify the lack of statistically significant changes in learners' viewing habits. Two time periods were specifically studied: 2020–2021, when instruction was delivered exclusively via distance education, and 2021–2022, in a hybrid learning mode. The results of both parametric and non-parametric statistical tests showed that "Number of views" and "Number of unique viewers" are the two variables that behave the same regardless of the two time periods studied, demonstrating the relevance of the proposed concept for asynchronous instructional support regardless of the learning environment. Finally, a forthcoming instructor's manual for learning CAD has been developed, integrating the proposed methodology into a sustainable academic educational process.

**Keywords:** computer-aided design (CAD); educational data mining; engineering education; online and hybrid learning environments; social media analytics

#### **1. Introduction**

Today, many higher education institutions are integrating learning activity data analytics into their operations [1]. Universities are recognizing the benefits of information solutions that not only better support students at all stages of their education, even the most challenging, but also implement ever more effective educational resources to enhance the learning experience for students and their instructors [2]. Teaching module organizers and instructors focus on analytics results and the use of algorithms to improve their content and flexibility, to identify students at risk of academic failure as early as possible, and then to provide them with more targeted learning solutions [3].

**Citation:** Kanetaki, Z.; Stergiou, C.; Bekas, G.; Jacques, S.; Troussas, C.; Sgouropoulou, C.; Ouahabi, A. Acquiring, Analyzing and Interpreting Knowledge Data for Sustainable Engineering Education: An Experimental Study Using YouTube. *Electronics* **2022**, *11*, 2210. https://doi.org/10.3390/electronics11142210

Academic Editors: Agnieszka Konys and Agnieszka Nowak-Brzezińska

Received: 24 May 2022 Accepted: 13 July 2022 Published: 14 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Long before the global health crisis due to the SARS-CoV-2 virus, nearly a decade of student education data could be used to understand key aspects of learner characteristics that would differentiate those who are capable of graduating from those at risk of dropping out. With all educational procedures going online, the acquisition of electronic data from a variety of online sources has led to an increase in all analytical procedures, and their management is a concern for the academic community [4].

It has been more than two years since the learning experience in universities was totally disrupted by the consequences of the COVID-19 pandemic, resulting in a significant decline in student retention and academic performance. Sustainable measures were then gradually put in place to ensure students' virtual presence and support them in their asynchronous tasks. Today, after almost a year and a half of exclusive and mandatory distance learning imposed by the pandemic, the learning processes of higher education have been tested in terms of applicability, feasibility and long-term sustainability. Although the health situation is not yet fully stabilized, the educational models deployed during the most critical period must be evaluated [5–8].

The first semester of the 2021–2022 academic year is undoubtedly a defining moment in the educational community. Although the post-COVID-19 period has not yet arrived, it is now appropriate to speak of the beginning of the "meta-COVID-19 period," certainly defining a period of disruption, but offering many exciting opportunities in the academic world. Specifically, in the era of digital transformation of the educational procedure [9], the authors of [10] applied the Greek term "meta", meaning beyond, to describe a promising future for academia following a global health phenomenon. With the increasing implementation of digital learning systems during the health crisis, it is important to ensure that the system design can motivate and support active student engagement to achieve the required educational goals. Therefore, socially oriented technology tools can be incorporated into the design of online learning systems to increase student engagement and improve student performance during the learning process [11]. By adjusting learning tactics and introducing technological features applied during the epidemic into meta-COVID-19 instruction, academic institutions will be able to progress and thrive in sustainable educational models.

The work presented here consists of an experimental study that focuses on instructor monitoring of learner behaviors and engagement throughout the teaching module to assess the effectiveness and sustainability of the applied learning tactic. The methodology applied is based on video analysis and, more precisely, on the processing and interpretation of data coming from the consultation by students of online digital content directly from the reports of a YouTube channel made available as part of a computer-aided mechanical design (CAD) module [12]. The analysis of the experimental data collected, analyzed and interpreted should make it possible to evaluate the benefit of this YouTube channel dedicated to asynchronous pedagogical support in online or hybrid teaching environments. To convince the university community of the sustainable integration of the YouTube channel into the educational process, learners' viewing habits were analyzed in depth from the channel's 37 instructional videos, all in conjunction with the execution of the CAD course. To develop the results of this work, two distinct learning periods are considered: the first refers to exclusively distance learning spaces (i.e., during the most critical period of the health crisis), and the second reflects both face-to-face and mixed learning environments (i.e., mixing face-to-face and distance).

The study proposed here covers a wide range of skills: from the creation of a complete asynchronous educational and social environment based on a YouTube channel where digital observations can be acquired, to the extraction and exploitation of visualization data. The novelty of this study lies in the fact that the use of social media channels in online and hybrid learning spaces has not yet been analyzed in depth, for its sustainable integration in the academic educational procedure.

The remainder of this work is organized as follows. Section 2 reviews the literature related to the objectives of this work. Section 3 presents the methodological and organizational aspects, from the creation of the educational YouTube channel and the sources of data exploration to the statistical analysis of the acquired data. Section 4 presents the main results, which are then discussed in Section 5. Finally, Section 6 presents the conclusions and research perspectives.

#### **2. Related Work**

Long before the emergence of distance learning imposed by the COVID-19 pandemic, institutions around the world incorporated learning management systems (LMS) into their instructional schemes, whether in online or blended learning environments [13]. During the health crisis, several learning tools were used, individually or in combination. At the University of West Attica (Greece), as well as at the University of Tours (France), the Microsoft (MS) Teams learning platform was, for example, widely used for synchronous transmissions, in combination with the E-Class (or Moodle) LMS for asynchronous support. In [14], researchers proved that using a single learning platform (MS Teams), supported by a social media channel such as YouTube for asynchronous support, limited the dropout rate of learners, compared to MS Teams supported by the LMS Moodle.

Just as considering customer preferences is critical to the development of a business, in education, analyzing students' preferences and taking steps to provide them with learning materials in innovative learning spaces could be the key to improving their academic performance [15,16].

Educational data mining (EDM) seems to have a major effect in the field of education [17]. EDM and data analytics promise a better understanding of student learning, as well as new insights into the hidden aspects that influence learner performance [18]. Learning analytics (LA) is an emerging area of learning management systems that tracks and records student activities in online and virtual learning environments [19]. The role of LA is to use the data generated by students as they interact with new technology features to improve the teaching–learning process (TLP) and enable instructors to make better decisions in terms of structuring teaching modules [20]. It is generally accepted that the more data are collected, the more information is obtained, and that more information in turn yields more accurate predictions, forecasts and estimates. Data quality is nevertheless very important: it is affected by the number of variables and the amount of data acquired, which can lead to information sparsity, especially where the quality of the data appears to be poor [21]. In addition, process analysis allows for the observation of unusual activities and behaviors, which can lead to the detection of "outliers", alarm objects, and calls for intervention [22]. Given the power of the method, LA can therefore be a major feedback tool for educators and instructional designers to improve the learning experience [23].

Researchers in [24] developed a technology acceptance model (TAM) to examine which factors of social networking sites such as YouTube and TikTok can support and facilitate online knowledge acquisition. In this study, data collection was conducted using an online questionnaire on four external factors: content richness, innovativeness, satisfaction, and enjoyment. The results showed that both social networking sites contribute to knowledge sharing and acquisition. Although reports from YouTube channels were not retrieved and considered in this research, the authors concluded that to increase acceptance, the focus should be on uploaded video content.

Although the authors of [25] conducted a systematic review of the literature describing the sources and use of educational data, data analyses from YouTube channels were unfortunately not considered. Additionally, in [26], the researchers examined the influence of instructor-generated video content on student engagement and participation in a course using the number of posts per week and the number of characters per post as parameters. They conducted an independent-samples *t*-test to compare student evaluation of the course by dividing the population into two groups: those who had been exposed to instructor-generated videos and those who had not. The test showed a statistically significant difference between the two groups.

In [27], the authors discussed the use of YouTube analytics both to assess student attendance during lectures and to measure the impact of lectures on the student learning experience. Going further, the authors in [28] planned campaigns by processing analytics data from the tools offered by social media channels.

The remarkable popularity of social media applications can be attributed to the encryption technology used, which ensures user privacy and limits access to personal information [28]. Social media is thought to offer benefits such as enhancing human interaction through the use of electronic media, increasing creativity, creating a sense of affiliation and acceptance, encouraging engagement and cooperative learning, reducing restrictions in terms of space and social or economic position within a community, increasing interaction and communication among members, and improving users' technological expertise [29]. With technological tools now widely available to people under the age of 20, it is possible to access social media sites with one hand, thanks to mobile technology and the use of inexpensive electronic devices such as tablets and smartphones [30,31].

Sharing videos or their URL links has become easier for instructors with the adoption of virtual learning environments (VLE). Content in teaching modules, discussion forums, and targeted sequences in online courses can be shared via YouTube video links embedded in the assessment features of learning platforms [32]. The analytics of YouTube channels can provide valuable information about learners' viewing habits, as well as measures of how students engage in the learning process with videos [33]. With YouTube analytics, instructors can thus track video viewing behaviors on supportive tasks to better understand their usefulness [34]. Previous studies in this area have shown that students do not follow a video in its entirety. They play, stop, rewind, and replay the educational content in order to review segments of the video recordings. This specific learner behavior is intended to recall the part of the newly acquired knowledge that was not clearly defined [33,35]. In [35], the authors investigated the use of pausing and searching in videos of course recordings provided by the channel interface, and how these two features relate to students' learning tactics and performance in a specific curriculum. Before the COVID-19 pandemic imposed restrictions on academia, researchers studied how learners in traditional learning environments and flipped classrooms interacted with the videos, as well as the nature of those interactions through analysis of processing data [32]. Instructors recorded short or long videos of portions of their lectures and uploaded them to VLE [32].

In transforming the traditional learning space into a virtual one, one of the most significant problems has been the loss of contact with the engineering environment itself, which is a key aspect of engineering education [36,37]. In addition, the authors of [10] associated educational data mining, processing, and analysis with the term "sustainability", which is present and promoted in most aspects of everyday life, from business operations to manufacturing and the environment [38]. Data mining, the processing of data and eventual interpretation of the results, is a fundamental process that allows researchers to establish the relationship between raw digital data and the assessment of real world conditions [39]. With digital data now available by tracking activities, analysts can get lost in unnecessary information. To facilitate the process, researchers should be able to set their boundaries, creating controlled environments for data production, i.e., targeted to the goals of their research area.

In this study, the data collected and analyzed provides insight into students' video content viewing behaviors from a dedicated YouTube channel. These behaviors were analyzed over two distinct time periods: the first in 2020–2021, i.e., during the most critical period of restrictions due to the COVID-19 pandemic, and the second in 2021–2022, during the early hours of the meta-COVID-19 period. The objective is to study the similarities between the delivery of engineering training modules in exclusively online mode and in mixed mode. This objective had already been set well before the implementation of the new pedagogical environments, whether online or hybrid; the social communication channel was then used as an asynchronous pedagogical support allowing the collection of information directly related to students' behaviors, while avoiding the "noise" due to irrelevant details.

Given the results already available and the gaps identified by the literature review, this work is guided by the following five research questions:


The answers to the above research questions should help demonstrate the relevance of educational data from social media channels such as YouTube to the academic community, and help institutions implement their strategies for sustainable digital transformation in higher education.

The research objective of this study is not limited to evaluating a specific YouTube channel as a tool to support asynchronous teaching. It also aims to provide feedback to instructors and suggest steps and actions to implement in their teaching modules to ensure students' access to new knowledge while promoting their engagement and satisfaction, regardless of the learning environment, i.e., face-to-face, fully remote, and hybrid.

#### **3. Methodological and Organizational Aspects**

#### *3.1. Foreword*

The methodology described in this section was deployed in a 12-week "computer-aided mechanical design (CAD I)" module, a teaching module within the Department of Mechanical Engineering of the University of West Attica (Greece). Prior to the health crisis, this module was divided between traditional mechanical design in a room equipped with drawing boards and computer-aided design with Autodesk Inventor software in a computer lab.

During the most critical hours of the COVID-19 pandemic, videos (directly downloadable into the MS Teams environment available to students) were integrated into flipped classrooms to provide asynchronous instructional support to complement the online courses delivered synchronously via the MS Teams platform. All of this work was done, including the integration of the following three tasks:


During the period when pandemic restrictions were relaxed, the 12-week CAD module was divided into two main stages. For the first 4 weeks, all classes were conducted face-to-face in the classroom equipped with drawing boards. The objective of this phase was to teach students to represent views of a three-dimensional object by freehand drawings (sketches). In the second stage, conducted in the computer lab, students enrolled in activities to create three-dimensional mechanical objects in different views (top, side, and cross-section) using CAD tools. To complete these activities, learners had the option of participating face-to-face or online (the class was streamed live by the instructor via MS Teams). In both stages, asynchronous support videos were attached to the students' assigned task. The YouTube channel "MCAD I UNIWA" was created to provide learners with all the video support needed to complete their learning independently. This YouTube channel, and all of its video content (see Appendix A), is managed by a CAD instructor with over twenty years of CAD experience. The administrator of the YouTube channel was also responsible for coordinating all activities of the various instructors in the teaching module. The "MCAD I UNIWA" YouTube channel is public. In order to target specific tasks and associate them with their asynchronous support video links, the tasks were posted as "assignments" on the MS Teams communication platform. Each task was announced in the students' MS Teams dashboard, where instructions were provided in text form [36]. The YouTube channel links for each task were uploaded as reference material targeting the specific task, to avoid confusion when searching the 37 videos for the one relevant to the task, as shown in Figure 1. The instructor was motivated to attach the URLs of the videos to each task after considering that most viewers of today's social media channels do not easily subscribe to the channels. In this way, we were able to reach both non-subscribing and subscribing students, with the latter being immediately notified of new videos.

**Figure 1.** Flow chart illustrating the data mining and processing methodology applied.

What we seek to highlight in this manuscript is the presence of a significant relationship between learners' viewing habits and their behaviors, particularly in the acquisition of the knowledge and skills necessary for graduation. To do so, we draw on feedback from the CAD I module, looking in depth at data collected from the "MCAD I UNIWA" YouTube channel. To achieve this objective, we implemented the methodology described in Figure 1. The first step was the creation of a YouTube channel in which most of the instructional video content was created from screen recordings and audio recordings intended exclusively for student use. We did not use the raw recording of the full laboratory lecture because we wanted to target specific tasks, i.e., those fundamental to the mechanical engineering profession, to help students perform them outside of class [41].

The uploaded videos were divided into four categories, based on the learning objectives to be achieved at each stage of the teaching process. Video categories 1 and 3 were devoid of audio, primarily to allow the instructor to explain the content presented at their own pace and in their own style. Specifically, the sketch videos showed the instructor's sketchbook, complete with pencil, as he or she drew freehand views of the object. The videos showing the model of the object being studied in three dimensions were intended to help students design the geometric shapes in all views around the object. The second and fourth categories were generated by a combination of screen recordings and audio recordings from the software modeling environment.

The YouTube channel administrator examined viewing patterns since the first video was uploaded to progressively interpret learners' needs and determine if these screen and audio recordings met the demand for asynchronous support. During learning phases conducted exclusively online, visualization patterns could be identified by the end of the first semester. When the health regulations were relaxed, i.e., when university educational spaces were allowed to transform into hybrid learning environments (i.e., combining online and face-to-face learning), a challenge arose: to analyze the visualization patterns of learners in hybrid educational spaces and correlate the results with those of spaces conducted exclusively at a distance. Once all the data was collected, the method then consisted of filtering the data from the social media channel reports corresponding to the two distinct time periods: the first corresponding to the first semester of the 2020–2021 academic year in the exclusively distance learning environments and the second, the first semester of the 2021–2022 year in the face-to-face, virtual, and mixed learning environments. Thus, two years were necessary to collect sufficient data and to ensure the results that will be discussed in the rest of the manuscript.

The final step in the methodology was to perform statistical tests on the defined variables. First, normality tests were performed on each distribution. Where the distribution was Gaussian, parametric tests were chosen; otherwise, non-parametric tests were performed [42]. To determine whether the variances for the two academic years were equal or unequal, a series of *F*-tests was performed. These tests can then be complemented by *t*-tests, chosen according to the equality or inequality of the variances indicated by the *F*-tests.
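The decision logic of this final step can be summarized as a small sketch. It only encodes the branching described above and takes already computed *p*-values as input. The function name, the 5% default, and the choice of a Welch *t*-test for unequal variances are illustrative assumptions, not details confirmed by the study.

```python
def choose_test(p_normal_a, p_normal_b, p_equal_var=None, alpha=0.05):
    """Name the two-sample test suggested by the workflow (hypothetical helper).

    p_normal_a, p_normal_b: normality-test p-values of the two samples.
    p_equal_var: F-test p-value, only needed when both samples look normal.
    """
    # If normality is rejected for either sample, use the non-parametric path.
    if p_normal_a <= alpha or p_normal_b <= alpha:
        return "Mann-Whitney-Wilcoxon test"
    if p_equal_var is None:
        raise ValueError("an F-test p-value is required for normal samples")
    # F-test not rejected: variances treated as equal, pooled t-test.
    if p_equal_var > alpha:
        return "Student t-test (equal variances)"
    # Assumption: Welch's variant is used when variances differ.
    return "Welch t-test (unequal variances)"


print(choose_test(0.20, 0.35, p_equal_var=0.40))
print(choose_test(0.01, 0.35))
```

Keeping the branching separate from the numerical tests makes it easy to audit which test was applied to each variable.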

#### *3.2. Participant Demographics*

Although not such an easy task, analysis of YouTube channel reports provides detailed information about the demographics of the participants and gives a "typical user profile" while revealing repetitive viewing behaviors. In our case, the term "user" refers to engineering students attending the computer-aided mechanical design module. Since one of the research questions focuses on developing a manual for future instructors, it is essential that module coordinators observe student activities outside the classroom and ultimately develop a profile of the typical student, which each instructor must consider, before taking steps to improve module delivery. For this reason, information about the status of the YouTube channel subscription, the type of viewing device used, as well as the preferred operating system, is collected, in addition to the standard demographic data. Note that in the YouTube studio, this type of data is available in separate tabs for each of the variables, after filtering for the two observation periods (i.e., 2020–2021 and 2021–2022). The goal of this strategy is to provide a clear picture with enough numerical data to allow for the most accurate comparisons and conclusions possible.

Accumulating demographic characteristics directly from the source, as opposed to self-reported responses in questionnaires, can increase the validity of the acquired data. It should be noted that many studies, including [43], have relied exclusively on demographic attributes to assess and even predict students' academic performance with high accuracy.

Table 1 provides a summary of the key demographic characteristics of the mechanical engineering students who participated in the study during the two time periods noted above. This summary was compiled from data directly extracted from YouTube channel reports, content available in YouTube Studio's advanced analysis mode, and by specifying a custom time period. Each metric (gender, age, geography, etc.) was exported from the YouTube Studio view to Google Sheets and converted to an MS Excel file [44].
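Once a report has been exported, separating the rows of the two observation periods is a simple filtering step. The sketch below assumes a daily export with hypothetical `Date` and `Views` columns and hypothetical semester boundaries. Neither the column names nor the exact dates are given in the source.

```python
import csv
import io
from datetime import date

# Assumed semester boundaries; the study does not state exact dates.
PERIODS = {
    "2020-2021": (date(2020, 10, 1), date(2021, 1, 31)),
    "2021-2022": (date(2021, 10, 1), date(2022, 1, 31)),
}


def split_by_period(csv_text, periods=PERIODS):
    """Group the rows of a daily export by the period containing their date."""
    groups = {name: [] for name in periods}
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = date.fromisoformat(row["Date"])  # hypothetical column name
        for name, (start, end) in periods.items():
            if start <= day <= end:
                groups[name].append(row)
    return groups


# Made-up rows for illustration only, not the study's data.
sample = "Date,Views\n2020-11-02,40\n2021-06-15,3\n2021-11-08,25\n"
groups = split_by_period(sample)
for name, rows in groups.items():
    print(name, [r["Views"] for r in rows])
```

Rows outside both periods (here the June entry) are simply dropped, which mirrors filtering by a custom time period.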

The number of students who took the online module was 212 in the winter semester of the 2020–2021 year. In 2021–2022, Table 1 shows a slightly higher number of students (i.e., 230) who took the module in a blended learning mode. In both time periods considered, 90% of the learners were male. This percentage reflects the true majority of male students enrolled in the Department of Mechanical Engineering at the University of West Attica, as verified by the student registry. In the hybrid learning environments, 97.6% of the students were between the ages of 18 and 24, and all participants were located in Greece. Considering that, in the exclusive online learning environments, 60.9% of the students were between the ages of 18 and 24 and 37.7% were between the ages of 25 and 34, it can be inferred that the online learning spaces offered a unique opportunity for older students to attend their courses without their physical presence. In addition, the exclusive online learning spaces allowed a small number of international resident students to take the module from another country.


**Table 1.** Demographic information of mechanical engineering students who participated in the study.

The subscription status of viewers is already an interesting factor: only 24.1% of viewers are subscribed to the educational channel, which indicates that most students watch the videos repeatedly without finding a reason to subscribe to the YouTube channel. This suggests that students approach the educational channel much as they probably approach other social media sites.

Although personal computers were widely used in both periods studied, they are losing ground to mobile devices each year. Tablet use increased in the second period, while television as a viewing device decreased in 2021–2022. Finally, the most used operating system was Microsoft Windows, but there is a 10.9% decline between 2020–2021 and 2021–2022, which is explained by the increasing use of tablets and the doubling of iOS and Android devices.

#### *3.3. Reports on Students' Views, Comparative Analysis and Discussion*

A total of thirty-seven videos were uploaded to the YouTube channel. Since there was no preparation time available before the universities closed, the timing of the posting of each video was scheduled in parallel with the running of the laboratory module. The name of each video corresponds to the title of the assigned task. All videos can be distinguished by content and learning objective into the following four categories, as shown in Table 2:



**Table 2.** Analysis of data from the "MCAD I UNIWA" YouTube channel.

In order to understand the actual size of the learner audience, the unique viewers metric was filtered from the reports provided by the YouTube channel. This specific metric provides a clearer picture of the estimated number of views during the two different time periods (i.e., during the pandemic period and during the start-up period of meta-COVID-19). The metrics for specific aspects related to viewing by video categories are presented in Figures 2 and 3. Specifically, the fourth category, relating to the methodology of carrying out computer-aided design tasks, is the most viewed and, as expected, has the longest viewing time. The videos with the highest average viewing percentage also belong to this category. The third category of videos, showing an overview of filmed objects, comes next in the students' preferences. It should be noted that this type of video has the shortest duration, limited to ten to fifty seconds. Furthermore, three of the nine videos of this type were generated in the second period, i.e., in hybrid learning environments, after taking into account the students' requests. The second category contains the smallest number of videos because most of the software support tools were integrated in the fourth category. This type of video has a considerable number of views in relation to the number of videos. This suggests that these videos are aimed at a specific audience, who will persist in watching the content to better understand the use of the software. Finally, the first category, drawing without sound, has a considerable number of views, duration of viewing and average percentage of viewing that mainly reflect the first period, when the educational process was carried out exclusively online.

**Figure 2.** Average viewing time and number of views by video category.

**Figure 3.** Average percentage of views and number of videos by video category.

#### **4. Main Results**

#### *4.1. The Number of Views and Unique Viewers: Two Major Variables in the Foreground*

Table 3, whose data is best depicted in Figure 4, summarizes the number of views per week for the two time periods studied (in 2020–2021, when distance learning was exclusive, and in 2021–2022, when hybrid learning was the norm) based on the sequence of course modules. The interest of Table 3 and Figure 4 is also to examine the variations between the two periods studied. It is important to note that the number of views decreased in 2021–2022, which may be due to the fact that students were attending classes in face-to-face mode and asynchronous support was not as necessary as in distance learning environments.


**Table 3.** Variation of the number of views per week for two time periods per module step.

Separate colors were applied in the two periods according to Figure 4.

**Figure 4.** Chart of unique viewers per week. Separate colors were applied in the two periods according to Table 3.

Figure 4 shows two trend curves for the number of unique viewers as a function of CAD module length (in number of weeks) for the two time periods studied (2020–2021 and 2021–2022). Both curves show similar trend dynamics, but only from the fourth laboratory lecture onward. From the beginning of the teaching until the fourth laboratory lecture, the trends are slightly different, which can be explained by the fact that the module was delivered exclusively online in 2020–2021 and face-to-face in the first semester of 2021–2022. Higher values of unique viewers can be noticed in week seven in both time periods. At this point in the educational process, students progress from the second stage of the module workflow to the third stage of the CAD module. The seventh laboratory lecture delivers the most difficult part of the new knowledge, namely the sectional views. It should be noted that at this stage, students who have not yet assimilated the layout of the plans have difficulties in completing their weekly tasks. Finally, new knowledge was accumulated in the first seven lessons, combining the use of software, the rules of mechanical drawing and the perception of object views; asynchronous task support was therefore necessary in this specific period. From the 8th to the 10th week, the Christmas vacations took place and a clear downward trend can be observed in both lines of the time series. Upward trends are again observed from week 10 onwards, when students resume module attendance after the Christmas break. High values of unique viewers can also be distinguished in week 13, which can be explained by the fact that the laboratory lectures have reached their final point and the exams are only one week away.

These first graphically illustrated results must now be rigorously demonstrated by statistical data, which will be analyzed in the following sections.

#### *4.2. Selected Statistical Variables*

The statistical study proposed in this section is conducted on the following six variables: "Impressions", "Unique viewers", "Viewing time", "Impressions Click-Through Rate (CTR)", "Number of views" and "Average viewing time". In particular, we will use YouTube Analytics terminology to describe these variables. The term "Impressions" is used to express the number of times one of the video thumbnails from the YouTube channel appears on the participant's screen. Therefore, the "Impressions Click-Through Rate" (CTR) indicates how many impressions were converted into views. Its calculation (expressed as a percentage) is based on the ratio between the number of clicks and the number of impressions. This metric aims to reveal how many people saw the thumbnail, found it interesting and clicked on it. The "Views" metric indicates the number of times a video has been viewed. The term "viewers" refers to the number of individuals who watch a certain piece of content. A viewer may click more than once on a specific piece of content. This is why the "Unique viewers" metric is used, which is a more accurate and relevant variable since it only counts one person, even if they click on the same video content multiple times or use multiple devices or browsers. "Viewing time" is expressed in YouTube terminology as "watch time". This audience retention metric, expressed here as an average value, counts the hours that a specific video is watched by individual viewers. For each of the six variables defined above, two pairs of datasets were compared:
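As a small illustration of the metric definitions in this subsection, the impressions click-through rate is the ratio of clicks to impressions, expressed as a percentage. The helper name and the numbers below are hypothetical and are not taken from the channel reports.

```python
def impressions_ctr(clicks: int, impressions: int) -> float:
    """Impressions click-through rate, in percent: 100 * clicks / impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 100.0 * clicks / impressions


# Made-up example: 45 clicks out of 900 thumbnail impressions.
ctr = impressions_ctr(45, 900)
print(f"CTR = {ctr:.1f}%")
```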


#### *4.3. Normality Test Results*

First, for each variable, it was necessary to verify the normality of the data distribution. This normality test is very important insofar as it determines the choice to carry out a test, either parametric or non-parametric. In the remainder of this subsection, we will only discuss the main results of the procedure. The basis and steps of the Shapiro–Wilk test, along with detailed results, are presented in Appendix B.

The results in Appendix B show that, in both time periods studied (i.e., 2020–2021 and 2021–2022), only the two variables "Impressions" and "Number of views" have *p*-values greater than the 5% threshold. The null hypothesis of normality therefore cannot be rejected, and each of these samples can be considered normally distributed. For these variables, an *F*-test was performed to evaluate whether the variances of the variables (whose distributions can be considered as normal according to the Shapiro–Wilk test) are equal or not. The final objective is to reject or not the null hypothesis that there is no statistically significant difference between the academic years 2020–2021 and 2021–2022 [36]. These *F*-tests will be complemented by Student's *t*-tests to compare the means of the two data distributions considered.

For the other four variables (i.e., "Unique viewers"; "Watch time"; "Impressions CTR"; and "Average viewing time") with non-normal distributions, a Mann–Whitney–Wilcoxon test (which is a non-parametric method) was performed to assess whether there is a statistically significant difference between the two periods of 2020–2021 and 2021–2022.
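The normality check used in this subsection can be sketched with SciPy's `scipy.stats.shapiro`. The weekly values below are made up for illustration and are not the study's data. The helper name and the 5% default are assumptions.

```python
from scipy import stats


def shapiro_normality(sample, alpha=0.05):
    """Run a Shapiro-Wilk test.

    Returns (not_rejected, W, p), where not_rejected is True when the
    null hypothesis of normality is NOT rejected at level alpha.
    """
    statistic, p_value = stats.shapiro(sample)
    return bool(p_value > alpha), float(statistic), float(p_value)


# Hypothetical weekly counts for one variable over one semester.
weekly_values = [120, 135, 150, 142, 160, 155, 171, 90, 85, 110, 148, 166]
ok, w, p = shapiro_normality(weekly_values)
print(f"W = {w:.3f}, p = {p:.3f}, normality not rejected: {ok}")
```

A *p*-value above the threshold only means that normality cannot be rejected. With roughly a dozen weekly observations per period, the test has limited power, so the outcome should be read with caution.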

#### *4.4. Hypothesis Testing Results*

For each of the two normally distributed variables mentioned above (i.e., "Impressions" and "Number of views"), the Fisher–Snedecor test or *F*-test was performed. As with the results of the normality tests presented earlier, here we analyze only the results obtained (see Table 4); the foundations and main steps of the method are recalled in Appendix C. The results in Table 4 show that for the "Impressions" variable, there is no equality of variances between the 2020–2021 and 2021–2022 data, since the *p*-value is below the 5% threshold. As for the variable "Number of views", there is equality of variances, since the *p*-value is greater than the 5% threshold. For this variable in particular, this allows us to conclude that for the two periods considered, since the variances are equal, learners' need for assistance in completing their tasks from viewing asynchronous educational content does not depend on the learning environment.
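A two-sided *F*-test for the equality of two variances can be sketched as follows. SciPy does not provide a dedicated two-sample variance *F*-test function, so the *p*-value is obtained here from the *F* distribution (`scipy.stats.f`). The function name and the sample values are illustrative assumptions, not the study's data.

```python
from statistics import variance

from scipy import stats


def f_test_equal_variances(sample_a, sample_b):
    """Two-sided F-test of equal variances (assumes normally distributed data)."""
    var_a = variance(sample_a)  # unbiased sample variance
    var_b = variance(sample_b)
    f_stat = var_a / var_b
    df_a, df_b = len(sample_a) - 1, len(sample_b) - 1
    # Two-sided p-value: twice the smaller tail of the F distribution.
    tail = min(stats.f.cdf(f_stat, df_a, df_b), stats.f.sf(f_stat, df_a, df_b))
    return f_stat, min(1.0, 2.0 * float(tail))


# Made-up weekly values for two periods.
period_1 = [210, 250, 190, 300, 275, 320, 150, 140, 260, 280]
period_2 = [180, 220, 200, 240, 230, 260, 170, 160, 210, 250]
f_stat, p_value = f_test_equal_variances(period_1, period_2)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

A *p*-value above 5% means that equality of variances is not rejected, which in turn guides the choice of *t*-test variant.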

**Table 4.** Results of the *F*-test for the two normally distributed variables "Impressions" and "Number of views".


The *F*-tests were supplemented with *t*-tests, as shown in Table 5. The results of these tests reflect the comparison of the means of the data sets of the variables "Number of Views" and "Impressions". As the *p*-values in Table 5 are above the 5% threshold, the null hypothesis cannot be rejected. Therefore, no statistically significant difference was observed between 2020–2021 and 2021–2022.
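The comparison of means can be sketched with `scipy.stats.ttest_ind`, whose `equal_var` argument selects the pooled (Student) or Welch variant. Passing the outcome of the variance test through this flag is an assumption about how the two steps are linked, and the data below are made up.

```python
from scipy import stats


def compare_means(sample_a, sample_b, equal_var, alpha=0.05):
    """Two-sample t-test; returns (t, p, significant).

    equal_var=True gives the pooled Student t-test,
    equal_var=False gives Welch's t-test.
    """
    t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=equal_var)
    return float(t_stat), float(p_value), bool(p_value < alpha)


# Hypothetical weekly values for the two periods.
period_1 = [210, 250, 190, 300, 275, 320, 150, 140, 260, 280]
period_2 = [180, 220, 200, 240, 230, 260, 170, 160, 210, 250]
t_stat, p_value, significant = compare_means(period_1, period_2, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant: {significant}")
```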

**Table 5.** Results of the *t*-test for the two normally distributed variables "Impressions" and "Number of views".


For each of the four remaining variables (i.e., "Unique viewers"; "Viewing time or Watch time"; "Impressions CTR"; and "Average viewing time") whose distributions are not Gaussian for the two periods studied (i.e., 2020–2021 and 2021–2022), a non-parametric Mann–Whitney–Wilcoxon test was performed.

The results in Table 6 show that there is no statistically significant difference between the two periods studied for the variables "Viewing time" and "Unique viewers" because their respective *p*-values are above the defined risk of 5%. However, this is not the case for the other variables (i.e., "Impressions CTR" and "Average viewing time"). Indeed, their respective *p*-values are below the 5% risk, confirming the statistically significant difference between the 2020–2021 and 2021–2022 data.

**Table 6.** Mann–Whitney–Wilcoxon test results for the four variables "Unique viewers"; "Watch time"; "Impressions CTR"; and "Average viewing time".


All the statistical results described above, whether parametric or non-parametric, confirm the preliminary results established in Section 4.1, in that the two variables "Number of views" and "Unique viewers" do not depend on the learning environment (i.e., exclusively remote or hybrid environment).

#### **5. Discussion**

Beyond the analysis of the statistical tests proposed above, Figure 4 allows us to draw some major conclusions that we will review and discuss in this section.

The total number of unique viewers in the first semester of the 2020–2021 academic year was 719, and in the same period of the 2021–2022 year was 570. By calculating the percentage of unique viewers for each of the two periods compared to the total number, a percentage difference of 11.55% was calculated. The two curves in Figure 4 representing the data series for both time periods show similar trends in student viewing habits, leading us to conclude that learners' need for assistance in completing their tasks is not dependent on the learning environment. This observation is made when comparing online and hybrid modes of instruction. When comparing the viewing behavior of students during the first four laboratory lectures taught in the face-to-face mode in 2021–2022 with the same period in 2020–2021 in the online spaces, the curves do not show similar trends. This specific observation can be explained by the fact that when teaching the online module, all twelve lectures were delivered exclusively at a distance, whereas in the hybrid learning spaces, the first four lectures were delivered exclusively in face-to-face mode. In addition, the theme of the first four lectures was "sketching", which involves freehand drawings. For the non-computer tasks, videos were only used for object representation, and their contribution was limited to categories 1 and 3, with tasks related to the first category being repeated face-to-face.
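The arithmetic behind the percentage difference can be sketched as follows. This is only one possible reading of the calculation, assumed here for illustration: each period's share of the combined total is computed, and the two shares are subtracted. The function name `share_difference` is hypothetical.

```python
def share_difference(count_a: int, count_b: int) -> float:
    """Difference between the two shares of the combined total, in percent.

    Assumption: each count is expressed as a percentage of (count_a + count_b),
    and the result is the absolute difference between the two percentages.
    """
    total = count_a + count_b
    share_a = count_a / total * 100.0
    share_b = count_b / total * 100.0
    return abs(share_a - share_b)


# Unique viewers reported in the text for the two first semesters.
viewers_2020_2021 = 719
viewers_2021_2022 = 570

difference = share_difference(viewers_2020_2021, viewers_2021_2022)
print(f"{difference:.2f}%")
```

Under this assumption the result is roughly 11.56%, which agrees with the reported 11.55% up to rounding.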

The variable "Impressions CTR" was one of the viewing measures that showed statistically significant differences between the two time periods in the non-parametric tests conducted. By comparing the proportion of reduction between the "Number of views" (11.86%) and "Impressions CTR" (1.76%), we can conclude that even though the "Number of views" decreased in the hybrid learning spaces, the "Impressions CTR" variable showed a very low percentage of reduction, which suggests a positive attitude among students toward clicking on the thumbnails of the videos to watch them.

Although the results of this study revealed viewing patterns for both time periods examined, there are still some limitations, based on the circumstances in which the module was delivered during each time period. The learning experience at universities was affected by the consequences of the pandemic, resulting in a significant decline in student retention and academic performance. Sustainable measures, such as those implemented in this study, had to be taken to first ensure students' virtual presence, as well as support for asynchronous tasks.

In the exclusive online instruction, the lack of physical contact between learners and their educators allowed the former to engage in the asynchronous support channel, which allowed instructors to track their viewing activities and analyze their viewing behaviors through data analysis by retrieving high-precision information. In the hybrid learning modes, specifically in the face-to-face delivered modules, learners had the opportunity to physically communicate with their instructors and get support for their tasks.

In synthesis of the above, the YouTube channel created and used in this work as an asynchronous tutoring tool has been integrated into the educational process. It provides quality asynchronous support when needed and is part of a long-term viability and sustainability approach. These new tools for supporting individual tasks outside of the classroom can benefit pedagogical practices and ultimately the learning process in very implicit ways and primarily by being "masked" by popular social media sites like YouTube. The channel analysis provided valuable information about learners' visualization habits that can serve as guidelines for future instructors and developers of instructional module structure. The log of visualization measures revealed viewing patterns indicating that students' visualization behaviors follow the flow of the module, and especially regardless of their mode of attendance and teaching space. The ability to follow the content of a module through educational videos at one's own pace and preference contributes to the development of senses of quality and equity [19]. The EDM and its statistical analysis showed that the foundation of the YouTube channel met the needs of students regardless of the learning environment on which this tool was applied.

Moving forward, the methodology applied in this study provides direct feedback for future CAD instructors and instructional module developers. Our proposed recommendations and action plan are as follows:


#### **6. Conclusions and Future Work**

In 2020–2021, during the most critical hours of the COVID-19 pandemic, higher education instructors, specifically those at the University of West Attica (Greece), created a social media channel (in this study, using YouTube), as part of a mechanical engineering CAD module, to provide students with asynchronous support for their teaching tasks. To provide learners with the most direct access, links to the 37 videos were attached to each assessed task. One year later, at the beginning of the meta-COVID-19 period, the same asynchronous task support technique was applied, but this time in face-to-face and blended learning spaces.

The experimental analysis proposed in this manuscript is based on processing and interpreting data extracted directly from the reports of a YouTube channel created and used by an instructor and containing educational videos of different categories based on the learning objectives of a CAD module. The main challenge here was to investigate the potential of the data as acquired from the YouTube channel, and whether the raw material downloaded in the form of previews could be processed to reveal valuable information about how students use this form of asynchronous digital educational material, with YouTube here being repurposed from its primary function, i.e., entertainment.

The shift from exclusively online to hybrid learning environments first showed that the use of asynchronous task support decreased. YouTube analytics were the most appropriate tool in terms of efficiency and accuracy for expressing student retention beyond physical, online, or hybrid engineering labs, as they not only recorded the number of times a video was viewed, but also differentiated between users who watched the same video multiple times. Specifically, we defined and used variables and measures that are very common on social media sites, expressing the level of audience acceptance, to examine the contribution of video to the learning process. The following six statistical variables were selected as primary measures expressing audience retention in YouTube channels: "Impressions", "Unique viewers", "Viewing time or watch time", "Impressions Click-Through Rate (CTR)", "Number of views" and "Average viewing time".

The comparative analysis of the YouTube reports showed similar trends in student viewing habits over the two time periods studied (i.e., at the most critical time of the global health crisis and at the beginning of the meta-COVID-19 period), with a slight decrease of nearly 12% in viewing due to a return to traditional learning environments, where most students solved their tasks in class and did not require asynchronous support. In contrast to face-to-face and 100% distance learning, the analysis showed that trends in learner viewing habits are similar during online and hybrid learning spaces. In particular, reports from the social media channel showed that educational videos followed the weekly stream trend, resulting in an increase in active viewers, which was directly related to the increase in workload and workload accumulation.

After testing the normality of each of the above six variables, a series of hypothesis tests (i.e., parametric and non-parametric, depending on the normality tests) were performed to accept or reject the null hypothesis of this study, which concerns the absence of statistically significant changes in learners' viewing habits over the two periods studied (i.e., 2020–2021 and 2021–2022). Of the six variables analyzed, only two ("Impressions CTR" and "Average viewing time") show statistically significant differences between the two periods studied. Of these two variables, it is appropriate to focus on "Impressions CTR", which expresses how often viewers click to watch a video after seeing its thumbnail. Although there was a decrease in both views and impressions CTR between the two time periods, the percentage reduction was not the same for the two variables.

Although we have answered the five research questions listed in Section 2, the current study has some limitations. Although many universities have begun to use data and analytics, there is still a long way to go before these tools can fully prove their potential in terms of improving the learning experience. This is especially true today, due to the unstable health conditions caused by the COVID-19 outbreak, although overall they seem to be gradually normalizing.

Future work will involve processing analyses of a second semester CAD module Computer Aided Mechanical Design (CAD II) and performing similar statistical tests to improve the reliability of the results. Due to the increase in the number of students and institutions participating in online learning and using digital tools over the past two years, there is now a plethora of data available that may not have been available before. Institutions of higher education may want to start using this data with an eye toward serving students ever better in the years to come.

**Author Contributions:** Conceptualization, Z.K. and C.S. (Constantinos Stergiou); methodology, Z.K., C.S. (Constantinos Stergiou), G.B. and S.J.; software, G.B.; validation, G.B.; formal analysis, Z.K. and G.B.; investigation, Z.K. and C.T.; resources, Z.K. and C.S. (Constantinos Stergiou); data curation: Z.K.; writing—original draft preparation: Z.K.; writing—review and editing: Z.K., C.S. (Constantinos Stergiou), G.B., S.J. and A.O.; visualization: Z.K., C.S. (Constantinos Stergiou), G.B. and S.J.; supervision, C.S. (Constantinos Stergiou) and C.S. (Cleo Sgouropoulou); project administration, C.S (Constantinos Stergiou) and S.J.; funding acquisition, S.J. and A.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** This study involves the analysis of data sets obtained from reports on a YouTube channel created and administered by a member of the authors' team. Therefore, all rights to the reports are reserved to the authors.

**Informed Consent Statement:** Informed consent was obtained from all study participants at the time of initial data collection.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** These research activities are currently supported by the University of West Attica and more particularly by its Department of Mechanical Engineering, as well as by the University of Tours. The authors of this manuscript would like to thank their colleagues at the following institutions, as well as the students, who contributed greatly to the success of this work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this paper:


#### **Appendix A. URL of Each Educational Video Content of the YouTube Channel "MCAD I UNIWA"**


#### **Appendix B. Summary of the Basics and Main Steps of the Shapiro–Wilk Normality Test**

The Shapiro–Wilk normality test is designed to detect all deviations from normality. In particular, this test rejects the normality hypothesis when the *p*-value is less than or equal to a threshold value (usually 5%). The null hypothesis is that there is no difference between the distribution studied and a normal distribution. The alternative hypothesis is that there is a difference. If the *p*-value at the end of the test is less than the set threshold (usually 5%), then the null hypothesis can be rejected and the data are not considered normal. In this case, a series of non-parametric tests can be applied. Conversely, if the null hypothesis cannot be rejected, the data are considered normal and parametric tests can be implemented [46,47]. Note that if only one of the data sets does not meet the set threshold and the other data set does, the variable of interest is considered a non-parametrically valued variable.
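The decision rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `choose_test_family` and the example *p*-values are hypothetical, and the *p*-values are assumed to come from a Shapiro–Wilk test run separately on each of the two datasets.

```python
def choose_test_family(p_value_period_1: float,
                       p_value_period_2: float,
                       alpha: float = 0.05) -> str:
    """Pick the family of tests for one variable from two normality p-values.

    Normality is rejected for a dataset when its p-value is less than or
    equal to alpha. The variable is treated as parametric only if normality
    is not rejected for BOTH datasets.
    """
    normal_1 = p_value_period_1 > alpha
    normal_2 = p_value_period_2 > alpha
    return "parametric" if (normal_1 and normal_2) else "non-parametric"


# Hypothetical p-values, for illustration only.
print(choose_test_family(0.20, 0.31))  # both datasets pass the normality check
print(choose_test_family(0.20, 0.01))  # one dataset fails the normality check
```

A variable classified as "parametric" would then go through the *F*-test and *t*-test, and any other variable through the Mann–Whitney–Wilcoxon test.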

In addition to the *p*-value, and in order to decide on the normality of each distribution, we focus on two metrics in particular: the skewness and kurtosis of each of the variables examined for the two datasets (i.e., variables referring to a sample of 37 YouTube videos, with Dataset 1 containing observations from the 2020–2021 academic year and Dataset 2 from the 2021–2022 academic year). While skewness describes the asymmetry of the overall shape, kurtosis describes the shape of the tails. The normal distribution is characterized by a zero skewness coefficient and a zero (excess) kurtosis coefficient. Concerning the skewness, a positive coefficient indicates a distribution whose mass is concentrated on the left, with a longer right tail, while a negative coefficient indicates the opposite, with a longer left tail. With respect to kurtosis, a negative value indicates that the distribution is "platykurtic", i.e., more flattened than a normal density. A positive kurtosis coefficient indicates that the distribution is "leptokurtic", i.e., less flattened.
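For illustration, the two shape metrics can be computed from central moments as sketched below. This assumes the simple moment-based (population) definitions, with kurtosis reported as excess kurtosis so that a normal distribution scores zero; statistical packages often apply small-sample corrections, so their values may differ slightly. The sample values are hypothetical.

```python
from statistics import fmean


def central_moment(values, order):
    """Return the central moment of the given order (population form)."""
    mean = fmean(values)
    return fmean([(v - mean) ** order for v in values])


def skewness(values):
    """Moment-based skewness: zero for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3: zero for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical samples, for illustration only.
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 1, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(long_right_tail))
```

The symmetric sample has zero skewness and a negative excess kurtosis (flatter than a normal density), while the second sample, with its single large value, has a positive skewness.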

The following tables summarize the *p*-values, skewness and kurtosis of the six variables defined in this study for the two periods studied (i.e., 2020–2021 and 2021–2022).


#### **Appendix C. Summary of the Basics and Main Steps of the** *F***- and** *t***-Tests for Parametric Variables, and the Mann–Whitney–Wilcoxon Test for Non-Parametric Variables**

As stated in [48,49], the hypothesis is an interpretation of the reasons for a certain phenomenon. Two types of hypotheses can be defined in a scientific approach:


Therefore, the null hypothesis must be tested. In our case, it concerns the lack of significant difference between the variances of the six variables examined, when comparing the samples of variables from the two academic semesters mentioned above [50].

Parametric statistical tests are based on the hypothesis that the sample under consideration is drawn from a population following a distribution belonging to a given family (here, the normal distribution). When their use is well justified, they generally have greater statistical power than non-parametric (i.e., distribution-free) tests; that is, they are more likely to detect a significant effect when it actually exists. Normality is tested here by the Shapiro–Wilk test mainly because, for a given significance level, its probability of rejecting the null hypothesis (i.e., that the sample is drawn from a normally distributed population) when it is false is higher than for other tests of normality [42].

#### *Appendix C.1. Fisher-Snedecor or* F*-Test for Parametric Variables*

For the two variables examined (i.e., "Impressions" and "Number of views"), a sample size of 37 was set, referring to the number of videos uploaded to the YouTube channel. The degrees of freedom are determined by subtracting one from the sample size. In statistical calculations, degrees of freedom measure the mathematical complexity of a calculated parameter. As mentioned earlier, the probability that the tested parameters have statistical significance is tested by setting a threshold for their statistical significance.

The Fisher–Snedecor test or *F*-test consists in comparing the value of the test statistic with the critical value (*F*critical) of the Fisher–Snedecor distribution for the chosen risk (here 5%); this critical value is determined from a table. If the test statistic exceeds the critical value or, equivalently, if the *p*-value is below the 5% threshold, then the null hypothesis (i.e., the two variances are equal) is rejected. Otherwise, it is not rejected [51]. Note that the more dissimilar the variances are, the more the *p*-value tends to zero.
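As a hedged illustration of this comparison, the test statistic can be written as the ratio of the two sample variances. The notation below (sample variances, sample sizes, and degrees of freedom indexed by the period *k*) is introduced here for illustration and does not appear in the original; placing the larger variance in the numerator is a common convention that is assumed rather than confirmed by the source.

$$
F = \frac{s_1^2}{s_2^2}, \qquad s_k^2 = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} \left( x_{k,i} - \bar{x}_k \right)^2, \qquad \nu_k = n_k - 1 = 36 \quad (k = 1, 2)
$$

With 37 videos in each period, both the numerator and the denominator have 36 degrees of freedom, and the critical value is read from the Fisher–Snedecor table for these degrees of freedom at the 5% risk.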

#### *Appendix C.2. Student's t-Test for Parametric Variables*

The Student's *t*-test, or *t*-test, is a popular statistical test used to measure the differences between the means of two groups, or between a group and a standard value. It is based on a probability distribution called Student's *t* distribution. This test is used to determine whether the differences are statistically significant.

The 2-sample *t*-test, which is the most standard and classical analysis technique, aims at comparing the means of two independent populations to identify a significant difference. To run a Student's *t*-test, the following must be available: the difference between the mean values of the data sets; the variance of each sample; the number of data in each group; and the acceptable error threshold (usually 5%). In this study, for a variable under consideration, the null hypothesis is that the two populations (of 2020–2021 and 2021–2022) are identical and that there is no significant difference between them. At the end of the test, if the *p*-value is lower than the set threshold (usually 5%), the null hypothesis can be rejected. Otherwise, it cannot be rejected.
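The ingredients listed above combine into the test statistic as sketched below. This is a minimal sketch of the classical pooled-variance form, which assumes equal variances; the source does not state whether this form or the unequal-variance (Welch) variant was used, and the function name and sample values are hypothetical. Converting the statistic into a *p*-value additionally requires the Student's *t* distribution, which is left to a statistics package.

```python
from math import sqrt
from statistics import fmean, variance


def pooled_t_statistic(sample_1, sample_2):
    """Return (t, degrees_of_freedom) for a two-sample pooled t-test.

    Uses the difference of the means, the variance of each sample,
    and the number of observations in each group.
    """
    n1, n2 = len(sample_1), len(sample_2)
    mean_difference = fmean(sample_1) - fmean(sample_2)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_variance = ((n1 - 1) * variance(sample_1)
                       + (n2 - 1) * variance(sample_2)) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return mean_difference / standard_error, n1 + n2 - 2


# Hypothetical samples, for illustration only.
t_value, dof = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_value, dof)
```

With two samples of 37 videos each, the same function would return 72 degrees of freedom.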

#### *Appendix C.3. Mann–Whitney–Wilcoxon Test for Non-Parametric Variables*

The Mann–Whitney–Wilcoxon test is used to test the hypothesis that the distributions of the two groups of data are close. Like any statistical test, it relies on a test statistic whose probability distribution under the null hypothesis is known (at least in its asymptotic form) and compares it with what is observed. If the observed value is unlikely according to this distribution, i.e., if the *p*-value is small, the null hypothesis should be rejected. More precisely, if the *p*-value is greater than the fixed risk (here 5%), then the null hypothesis cannot be rejected. Otherwise, it can be rejected.
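A minimal sketch of this procedure is given below, using the asymptotic (normal) form of the distribution mentioned above. It is an illustration under simplifying assumptions, not the authors' implementation: tied values receive their average rank, but no tie or continuity correction is applied to the variance, and the function name and sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(sample_1, sample_2):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(sample_1), len(sample_2)
    # Pool the observations, remembering which group each one came from.
    pooled = sorted((value, group)
                    for group, sample in enumerate((sample_1, sample_2))
                    for value in sample)
    # Assign ranks 1..N, giving tied values the average of their ranks.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_1 = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = rank_sum_1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Asymptotic distribution of U under the null hypothesis (no tie correction).
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = min(1.0, 2 * NormalDist().cdf(z))
    return u, p_value


# Hypothetical samples, for illustration only.
u_value, p_value = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u_value, p_value)
```

The null hypothesis is rejected when the returned *p*-value falls below the 5% risk. For small samples an exact distribution of *U* is generally preferable to this approximation.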

#### **References**


## *Article* **A Computational Tool for Detection of Soft Tissue Landmarks and Cephalometric Analysis**

**Mohammad Azad <sup>1,</sup>\*, Said Elaiwat <sup>1</sup> and Mohammad Khursheed Alam <sup>2</sup>**


**Abstract:** In facial aesthetics, soft tissue landmark recognition and linear and angular measurement play a critical role in treatment planning. Visual identification and judgment by hand are time-consuming and prone to errors. As a result, user-friendly software solutions are required to assist healthcare practitioners in improving treatment planning. Our first goal in this paper is to create a computational tool that may be used to identify and save critical landmarks from patient X-ray pictures. The second goal is to create automated software that can assess the soft tissue facial profiles of patients in both linear and angular directions using the landmarks that have been identified. To boost the contrast, we employ gamma correction and a client-server web-based model to display the input images. Furthermore, we use the client-side to record landmarks in pictures and save the annotated landmarks to the database. The linear and angular measurements from the recorded landmarks are then calculated computationally and displayed to the user. Annotation and validation of 13 soft tissue landmarks were completed. The results reveal that our software accurately locates landmarks with a maximum deviation of 1.5 mm to 5 mm for the majority of landmarks. Furthermore, the linear and angular measurement variances across users are not large, indicating that the procedure is reliable.

**Keywords:** soft tissue; gamma correction; landmark detection; X-ray images; facial profile

#### **1. Introduction**

A pleasant-looking facial aesthetic is one of the purposes of healthcare treatment. Nowadays, many young males and females are looking for orthognathic surgery for a better and more attractive appearance in society. Therefore, it is necessary to study the facial skeleton and corresponding soft tissue. Bones and teeth are examples of hard tissue. Ligaments, tendons, and muscles are examples of soft tissue that connects and supports the body's surrounding structures and organs [1]. For successful orthognathic surgery, hard and soft tissue facial profile analysis should be included [2–4]. The hard tissue analysis for orthognathic surgery has been discussed by Burstone et al. [5] and the soft tissue analysis by Legan and Burstone [4]. However, several researchers found that the soft tissue (covering the teeth and face) can behave differently from patient to patient because of the thickness.

Structure inconsistencies have historically been considered the main treatment restrictions by orthodontists. In actuality, the therapeutic modifiability is more closely related to the soft tissues. As a result, the crucial step in orthodontic decision-making is the study of soft tissue. The extent to which the orthodontist can change the size of dental arches and the positioning of the mandible is determined by these soft tissues. Therefore, the cephalometric analysis of the soft tissue should be taken into account in a successful surgical treatment plan [6,7]. Furthermore, it is necessary to find the standard soft tissue profile analysis based on age, sex, ethnic group, etc. [8–11]. Similarly, Sahar [11] reported that an increasing number of Saudis are looking for orthognathic surgery. Therefore, such a study should be carried out to establish the characteristics specific to the Saudi population.

**Citation:** Azad, M.; Elaiwat, S.; Alam, M.K. A Computational Tool for Detection of Soft Tissue Landmarks and Cephalometric Analysis. *Electronics* **2022**, *11*, 2408. https://doi.org/10.3390/electronics11152408

Academic Editors: Agnieszka Konys, Agnieszka Nowak-Brzezińska and Hyunjin Park

Received: 31 May 2022 Accepted: 17 July 2022 Published: 2 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There are two ways to analyze the soft tissue profile. The first is by hand, without any computer-aided solution: a transparent acetate sheet is superimposed on the printed X-ray image, and the linear distances and angles are then calculated manually. The second is the computer-aided solution. Many tools are available for research and teaching, but they may be costly, and they may not be customizable to suit the local community's needs. Moreover, most of the research on soft tissue profile analysis did not consider X-ray images, which are indispensable for the Saudi population since only X-ray images are acceptable in all Saudi hospitals, clinics, dental colleges, etc. Therefore, it is important to consider an inexpensive approach for the accurate analysis of the soft tissue profile from X-ray images.

Furthermore, other approaches are not created to study the linear and angular measurements automatically and need some type of manual intervention to obtain an accurate result. The novelty of our approach is that we implemented this process automatically without any type of manual intervention from the user, i.e., after successful annotations, the user can obtain the linear and angular measurements automatically.

The contribution of our study is:


We performed experiments with X-ray images of 14 male and 14 female subjects. We preprocessed the images to improve the contrast by applying gamma correction. All these images were then annotated by four examiners, giving a total of 112 annotated X-ray images. Finally, we calculated the variations among all annotators and found them to be negligible; hence, our approach is reliable. We also calculated the linear and angular measurements and their variations, and found that the variation is very small and our results are reliable.

To be included in the annotations, an image must be a grayscale X-ray image with a minimum size of 1024 × 1024. Since X-ray images are the main type of images for clinical practices in Saudi Arabia, we chose this type of image, and the mentioned size is the smallest size that we consider adequate for good X-ray images. In addition, the input images must include the areas of the forehead, nasal, labial, and chin. The reason is that these areas contain all our soft tissue landmarks. Furthermore, the intensity of the edge line, which contains such landmarks, and the background of the image should be distinguishable for the success of the detection of the landmarks. If these conditions are not met for any image sample, then it will not be processed further.

The rest of the paper is organized as follows. In Section 2, we consider the previous studies related to our paper. In Section 3, we explain the preprocessing steps before proceeding with our approach. Section 4 contains the detailed methods of the software architecture as well as methods related to the capturing of the annotations and measurements. Section 5 contains the results of the computer experiments and a discussion of the findings. At the end, Section 6 contains short conclusions and future work.

#### **2. Previous Studies**

A number of studies [2,12–20] have been carried out on the different aspects of soft tissue profile analysis. Some research focuses on linear measurement of the soft tissue profile [2] and some other research focuses on angular measurement [4,21]. The aim of this section is to review only the important and relevant studies related to the present research of soft tissue profile analysis.

One of the initial significant research attempts on the linear measurement of the soft tissue profile analysis was conducted by Paulo et al. [2], who considered 15 landmark points from the four major regions (facial, labial, chin, and nasal) and then calculated linear measurements based on the vertical, horizontal, and Canut's lines. Unfortunately, the work is based only on photographic records (no X-ray images) and covers only the European white population. They also did a similar analysis for angular measurement [21]. Sahar et al. [11,16] and Nasser [18] focus on the soft tissue profile analysis for the Saudi population in the Riyadh region. Their solution uses a market tool that is very expensive to buy and does not have the flexibility for custom usages. They did not consider major landmarks as in [2] and did not consider all the linear and angular measurements under consideration. Moreover, they did not consider any preprocessing steps such as gamma correction to improve the contrast and visibility of the X-ray images.

Furthermore, other researchers consider this problem based on the population of different regions in the world, e.g., Alcaledi et al. [19] consider the Japanese population, Hamdan et al. [20] consider the Jordanian population, Al-Azemi et al. [17] consider the Kuwaiti population, Filipović et al. [15] consider the Serbian population, Celebi et al. [14] consider the Turkish population, Akter et al. [13] consider the Bangladeshi population, Pandian et al. [12] consider the Indian population, etc.

For many years, manual cephalometric tracing was the only method where a transparent acetate sheet was superimposed above the printed X-ray images. The researchers used a pencil to locate the major landmarks and draw the lines and angles for the soft and hard tissue analysis [22].

Ricketts [23] illustrated computerized cephalometric tracing in 1969. Since then, many tools have become available on the market. Unfortunately, such tools are very expensive and non-customizable for use in cephalometric analysis. Therefore, it is necessary to look for a solution that can be used for the study of the soft tissue profile analysis, especially for the Saudi population. Another disadvantage of other tools is that they do not automatically provide linear and angular measurements; rather, a ruler must be used to obtain measurements for each subject of study, which is time-consuming and tedious work. In our work, our method automatically calculates the distance once the dental practitioners finalize the landmarks on the X-ray images. Furthermore, we use gamma correction to improve the contrast of the image, which was not performed by other tools.

#### **3. Data Preprocessing**

The borders of the soft tissue in the initial X-ray images are not clearly visible due to poor contrast, and hence we need to apply some preprocessing steps, i.e., to use the gamma correction method to sharpen and increase the contrast of the borders of the soft tissue.

According to [24], gamma correction, or gamma, is a nonlinear process which is performed for encoding and decoding luminance values in still images or video systems. In the simplest instances, gamma correction is specified by the power-law expression:

$$
Y = aX^{\gamma} \tag{1}
$$

where *γ* is a user-defined parameter. The user can change this parameter, and the brightness of the image changes accordingly. The output value *Y* is obtained by raising the non-negative real input value *X* to the power *γ* and multiplying it by the constant *a*. In the case of *a* = 1, inputs and outputs are usually in the range of 0–1.
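The power-law expression in Equation (1) can be sketched on a small grayscale image as follows. This is a minimal illustration, not the tool's implementation: it assumes 8-bit pixel values (0–255) stored as nested lists, and the function name `gamma_correct` and the sample pixel values are hypothetical.

```python
def gamma_correct(pixels, gamma, a=1.0, max_value=255):
    """Apply Y = a * X**gamma to a grayscale image given as rows of integers.

    Assumption: pixel values are 8-bit (0..max_value). Each value is scaled
    to the range 0-1 before the power law and scaled back afterwards.
    """
    corrected = []
    for row in pixels:
        new_row = []
        for value in row:
            x = value / max_value            # normalize input to 0-1
            y = a * x ** gamma               # Equation (1)
            y = min(max(y, 0.0), 1.0)        # keep the output inside 0-1
            new_row.append(round(y * max_value))
        corrected.append(new_row)
    return corrected


# Hypothetical 1 x 3 image: black, dark gray, white.
image = [[0, 64, 255]]
print(gamma_correct(image, gamma=0.5))  # gamma below 1 brightens mid-tones
print(gamma_correct(image, gamma=1.0))  # gamma equal to 1 leaves the image unchanged
```

With *a* = 1, black and white pixels are unchanged for any *γ*; only the intermediate gray levels move, which is what changes the perceived contrast.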

In Figure 1, we show a sample X-ray image before and after applying the gamma correction. It is clear that after applying the gamma correction, the soft tissue borders are now visible for further usage. In our tool, we first preprocess all X-ray images using the gamma correction before performing actual annotations.

**Figure 1.** Effect of gamma correction on a sample X-ray image: (**a**) before; (**b**) after.

#### **4. Method**

The steps to developing the aforementioned tool for dental doctors' practices will be discussed in this section.

#### *4.1. Architecture*

We chose to use the client-server communication model paradigm. The reason behind this is that we have many users or clients (located separately) who will use this tool to annotate images. Such a distributive nature of clients necessitates the use of a client-server communication model rather than working as a single standalone program.

We have shown the client-server architecture in Figure 2. For the client-side, we used JavaScript (jQuery) along with CSS/HTML, and for the server-side, we used PHP (Laravel) with Apache server and MySQL database.

**Figure 2.** Software architecture.

The "admin" user controls the main functionality of this tool and can add other users to use this tool for annotations. The tool is divided into three main components. The first component is to add new users who will perform the job of annotation. The second component is to add the landmarks (tags in the tool) dynamically, i.e., we do not fix the number of landmarks; the admin can dynamically add new landmarks under consideration. Right now, the tool is using only thirteen soft tissue landmarks, as shown in Table 1.

The third component is to add new images for annotations, edit previously added images, and search for images by keywords, gender, etc. When an expert annotator first logs in, he sees the list of images that he has already uploaded and/or annotated, along with a button to add new images (see Figure 3).

At this stage, the annotator can add new images by clicking the button "Add Image". Furthermore, he can edit already added images by clicking the hyperlink "Edit" and can annotate the considered image by clicking on "Annotations".




**Figure 3.** The first landing page after login to show the list of images.

#### *4.2. Capturing Annotations*

In this study, we use 13 landmarks from the soft tissue area as shown in Table 1.

The user should click "Annotations" to start annotating the above landmarks, or if he has already done so, he can update those annotations easily. For illustration purposes, we show a sample X-ray image with the expert annotations in Figure 4.

When an expert annotator uses this tool to annotate landmarks, the tool captures the landmark positions (2D coordinates) and stores them in the system. The X-ray images can be zoomed for better viewing and for locating the landmarks. The tool records each position as a percentage, i.e., the *x*- and *y*-axes are each taken as 100% of the image's actual width and height, respectively. It then uses the captured percentages to calculate the actual position from the actual width and height of that image. Therefore, zooming does not affect the landmark's actual position. In addition, the position is stored as a decimal fraction with up to 30 decimal places to make it more accurate.
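The percentage-based scheme can be sketched as follows. This is an illustration only: the helper names `click_to_percent` and `percent_to_pixel` and the image sizes are hypothetical, and ordinary floating-point numbers are used rather than the 30-decimal-place values mentioned above.

```python
def click_to_percent(click_x, click_y, displayed_width, displayed_height):
    """Express a click on the (possibly zoomed) image as percentages of its displayed size."""
    return (click_x / displayed_width * 100.0,
            click_y / displayed_height * 100.0)


def percent_to_pixel(x_percent, y_percent, actual_width, actual_height):
    """Map stored percentages back to pixel coordinates of the original image."""
    return (x_percent / 100.0 * actual_width,
            y_percent / 100.0 * actual_height)


if __name__ == "__main__":
    # The same anatomical point clicked at 1x zoom and at 2x zoom (hypothetical sizes).
    at_1x = click_to_percent(300, 150, 600, 400)
    at_2x = click_to_percent(600, 300, 1200, 800)
    print(at_1x, at_2x)                               # identical percentages
    print(percent_to_pixel(*at_1x, 2400, 1600))       # position in the full-size image
```

Because the click is divided by the displayed size at the moment of clicking, the stored percentages are the same at any zoom level.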

It is always possible to change the landmarks manually and correct them. The "Edit" button is used for editing any images that have been previously annotated.

**Figure 4.** A sample X-ray image with annotation.

#### *4.3. Linear and Angular Measurements*

The tool automatically calculates the linear and angular measurements without any intervention from the user. Once the annotation is finished, the user simply clicks the "View Distances" link, and both measurements are shown immediately (see Figure 5). In addition, the user can save the measurements in an Excel file.

#### 4.3.1. Linear Measurements

The tool provides the horizontal (*x*-axis) and vertical (*y*-axis) distances for a fixed pair of landmarks. Both are simple to compute: each is the difference between the two landmarks' coordinates along the corresponding axis. The distances are given in pixels, which can be converted to mm by Equation (4). The Euclidean distance (*c*) then follows from the horizontal and vertical distances by the following formula:

$$
c = \sqrt{a^2 + b^2} \tag{2}
$$

where,

*c* = the Euclidean distance

*a* = the horizontal distance (Distance-X)

*b* = the vertical distance (Distance-Y)
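Equation (2) can be checked with a short Python sketch. The function name `linear_measurements` and the sample coordinates are hypothetical; landmarks are assumed to be given as (*x*, *y*) pixel pairs.

```python
import math


def linear_measurements(landmark_1, landmark_2):
    """Return (a, b, c): horizontal, vertical, and Euclidean distance in pixels."""
    a = abs(landmark_2[0] - landmark_1[0])  # Distance-X
    b = abs(landmark_2[1] - landmark_1[1])  # Distance-Y
    c = math.hypot(a, b)                    # sqrt(a**2 + b**2), Equation (2)
    return a, b, c


if __name__ == "__main__":
    print(linear_measurements((10, 20), (13, 24)))
```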

#### 4.3.2. Angular Measurements

The tool provides the angle for a given set of three landmarks in a degree unit. For example, G-N-Prn is a set of three landmarks where the middle landmark, N, is the vertex of the angle, and G-N and N-Prn are the sides of the angle. The angle has been calculated by the following formula:

$$\theta = \cos^{-1} \frac{b}{c} \tag{3}$$

where,

*b* = the distance G-N

*c* = the distance N-Prn
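Equation (3) is stated in terms of two side lengths only. As an independent cross-check, the angle at the middle landmark can also be obtained directly from the three 2D coordinates with the standard dot-product identity. The sketch below uses that identity; it is a general geometric calculation and is not claimed to be the computation the tool performs. The function name `angle_at_vertex` is hypothetical.

```python
import math


def angle_at_vertex(first, vertex, last):
    """Angle in degrees at `vertex` between the rays vertex->first and vertex->last."""
    v1 = (first[0] - vertex[0], first[1] - vertex[1])
    v2 = (last[0] - vertex[0], last[1] - vertex[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("landmarks must be distinct from the vertex")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Guard against rounding slightly outside [-1, 1] before calling acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    # Hypothetical coordinates standing in for G, N, and Prn.
    print(angle_at_vertex((0, 1), (0, 0), (1, 0)))
```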


**Figure 5.** A sample window of linear and angular measurement: (**a**) linear measurement; (**b**) angular measurement.

#### **5. Results & Discussion**

For our experiments, we obtained 28 sample X-ray images (14 male and 14 female) from health centers in Saudi Arabia. Four examiners independently performed the annotation and validation of the concerned soft tissue landmarks. As a result, we have a total of 112 annotated X-ray images.

Several methods for evaluating landmark-identification systems for acceptance in clinical practice are reported in the literature. The manual method of human visual judgement is prone to intrajudge and interjudge variations [25]. The second approach mixes manual and computer-based recognition and is also susceptible to human error [25]. The third approach examines whether the computer system's output lies within a radius of 2 mm [25]. With our tool, the variation obtained for some landmarks is even smaller than 2 mm.

#### *5.1. Validation of Locating the Landmarks*

We calculate the variation of landmarks by the computer system. For each landmark, we find the minimum variation that is the minimum distance between any two identified landmarks. Similarly, we calculate the average and maximum distance between any two landmarks. Note that, for each variation (minimum, average, and maximum), we take the average among the 28 samples and show them as our results. Table 2 displays the pixel and corresponding distance (mm) variation, with the first column displaying the landmark name, the second column displaying the minimum variation, the third column displaying the average variation and the fourth column displaying the maximum variation. We calculate the length in 'mm' from the pixel by the following formula (using 300 dpi):

$$
\text{Length [mm]} = \frac{\text{pixel} \times 25.4\ \text{mm}}{\text{dpi}} \tag{4}
$$
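As a quick numerical illustration of Equation (4), the following sketch converts a pixel length to millimetres. The function name `pixels_to_mm` is hypothetical; the default of 300 dpi is the resolution stated above.

```python
MM_PER_INCH = 25.4


def pixels_to_mm(pixels, dpi=300):
    """Convert a length in pixels to millimetres for a given scan resolution."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels * MM_PER_INCH / dpi


if __name__ == "__main__":
    print(pixels_to_mm(300))  # one inch worth of pixels at 300 dpi
    print(pixels_to_mm(24))   # a 24-pixel distance at 300 dpi
```

At 300 dpi, one pixel therefore corresponds to roughly 0.085 mm.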

We can observe that the minimum variation is below 1 mm and the maximum variation for most landmarks is in the range of 1.5 mm to 5 mm, except for the landmarks G and Pg. This is because these two landmarks are the most difficult to identify. Similarly, the average variation for most landmarks is in the range of 1 mm to 2 mm, except for the above-mentioned two landmarks. Figure 6 shows the variation in a sample X-ray image, and Figure 7 shows the variation of each landmark in the same X-ray image (the red circle shows the variation area).
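The minimum, average, and maximum variation for one landmark on one image can be sketched as the statistics of all pairwise distances between the examiners' annotations. The function name `landmark_variation` and the coordinates are hypothetical; the sketch assumes each examiner's annotation is an (*x*, *y*) pixel pair.

```python
import math
from itertools import combinations


def landmark_variation(annotations):
    """Return (min, average, max) of pairwise distances between annotations of one landmark."""
    if len(annotations) < 2:
        raise ValueError("at least two annotations are required")
    distances = [math.dist(p, q) for p, q in combinations(annotations, 2)]
    return min(distances), sum(distances) / len(distances), max(distances)


if __name__ == "__main__":
    # Four examiners marking the same landmark (made-up pixel coordinates).
    four_examiners = [(0, 0), (3, 0), (0, 4), (3, 4)]
    print(landmark_variation(four_examiners))
```

Averaging each of the three statistics over all images would then give per-landmark values of the kind reported in Table 2.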


**Table 2.** Variation of different landmarks.

**Figure 6.** Graphical representation of variations of one sample image showing all the landmarks.

**Figure 7.** Graphical representation of variations of each landmark.

#### *5.2. Validation of Linear and Angular Measurements*

We show the minimum (the column 'Min'), average (the column 'Avg'), and maximum (the column 'Max') values of the linear measurements (Euclidean distance) in pixel and mm units in Table 3. We also compared the measurements statistically using Student's paired *t*-test and did not find any significant difference in variation, which shows that the results obtained with this tool are stable.


**Table 3.** Variation of linear measurement.

Similarly, we show the minimum (the column 'Min'), average (the column 'Avg'), and maximum (the column 'Max') values of angular measurements (in degree units) in Table 4. Furthermore, we examined statistically using the Student's paired *t*-test and found no significant difference, indicating that the results produced by our tool are consistent.

**Table 4.** Variation of angular measurement (degree).


#### *5.3. Limitations and Special Cases*

Even though this developed tool can detect soft tissue landmarks pretty accurately, it is not free from limitations. The success of landmark detection depends mainly on the quality, size, and resolution of images as well as the intensity of soft tissue edge lines compared to the background.

The most prevalent conditions affecting the facial region are cleft deformities [26] and craniofacial defects [27]. Soft tissue landmarks are difficult to identify in bilateral and unilateral complete cleft lip and palate cases because the alveolus and lip are not fused well. In such cases, a clear image and a zoom-in facility might help. In addition, experienced orthodontists can rely on their anatomical knowledge when a landmark is difficult to locate.

#### **6. Conclusions and Future Work**

In this paper, we describe our tool to capture the soft tissue landmark positions. This tool is based on the client-server communication model. Any orthodontist can use this tool for his clinical practice, and it records the landmark positions with up to 30 decimal places. It can also extract information on linear and angular measurements for orthodontic treatment, allowing for a more personalized healthcare experience. We conducted experiments on 28 human samples, which resulted in robust and accurate measurement of soft tissue landmarks within a 5 mm radius.

One of the limitations of this study is that this tool only annotates soft tissue landmarks. In the future, hard tissue landmarks will be explored.

In today's world, the smartphone is the most user-friendly technology available in the healthcare field. As a result, we will strive to integrate the proposed approach onto smartphones in the future, so that physicians may quickly recognize landmarks and complete the cephalometric analysis.

**Author Contributions:** Conceptualization, M.A., S.E. and M.K.A.; methodology, M.A., S.E. and M.K.A.; software, M.A.; validation, M.A.; formal analysis, M.A., S.E. and M.K.A.; investigation, M.A., S.E. and M.K.A.; resources, M.A., S.E. and M.K.A.; data curation, M.A., S.E. and M.K.A.; writing, M.A., S.E. and M.K.A.; visualization, M.A., S.E. and M.K.A.; supervision, M.A., S.E. and M.K.A.; project administration, M.A.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors extend their appreciation to the Deanship of Scientific Research at Jouf University for funding this work through research grant no. (DSR2020-04-2582).

**Data Availability Statement:** Available upon reasonable request.

**Acknowledgments:** The authors would like to express their gratitude to all of the volunteers who helped with this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Review* **A Systematic Review of the Applications of Multi-Criteria Decision Aid Methods (1977–2022)**

**Marcio Pereira Basílio <sup>1,2,</sup>\*, Valdecy Pereira <sup>2</sup>, Helder Gomes Costa <sup>2</sup>, Marcos Santos <sup>3</sup> and Amartya Ghosh <sup>4</sup>**


**Abstract:** Multicriteria methods have gained traction in academia and industry practices for effective decision-making. This systematic review investigates and presents an overview of multi-criteria approaches research conducted over forty-four years. The Web of Science (WoS) and Scopus databases were searched for papers on multi-criteria methods with titles, abstracts, keywords, and articles from January 1977 to 29 April 2022. Using the R Bibliometrix tool, the bibliographic data was evaluated. According to this bibliometric analysis, in 131 countries over the past forty-four years, 33,201 authors have written 23,494 documents on multi-criteria methods. This area's scientific output increases by 14.18 percent every year. China has the highest percentage of publications at 18.50 percent, followed by India at 10.62 percent and Iran at 7.75 percent. Islamic Azad University has the most publications with 504, followed by Vilnius Gediminas Technical University with 456 and the National Institute of Technology with 336. *Expert Systems with Applications*, *Sustainability*, and the *Journal of Cleaner Production* are the top journals, accounting for over 4.67 percent of all indexed works. In addition, E. Zavadskas and J. Wang have the most papers in the multi-criteria approaches sector. AHP, followed by TOPSIS, VIKOR, PROMETHEE, and ANP, is the most popular multi-criteria decision-making method among the ten nations with the most publications in this field. The bibliometric literature review method enables researchers to investigate the multi-criteria research area in greater depth than the conventional literature review method. It allows a vast dataset of bibliographic records to be statistically and systematically evaluated, producing useful insights. This bibliometric study is helpful because it provides an overview of the issue of multi-criteria techniques from the past forty-four years, allowing other academics to use this research as a starting point for their studies.

**Keywords:** systematic review; multicriteria; MCDA; MCDM; MADM; MODM; AHP; TOPSIS; VIKOR; PROMETHEE; ANP

#### **1. Introduction**

As the transmission of scientific knowledge in its most diverse fields of study expands, literature evaluation becomes a demanding work for the researcher [1]. The challenge is reflected in the volume of research published each month by thousands of academic publication outlets. According to [2]'s theory of limited rationality, a researcher's rationality is constrained by the knowledge available, the cognitive limitations of the individual mind, and the decision-making time availability.

Human activities require decision-making. All such decisions are based on an evaluation of individual decision options, typically based on the decision maker's preferences, experience, and other data [3]. Some decisions are simple, while others are complex [4].

**Citation:** Basílio, M.P.; Pereira, V.; Costa, H.G.; Santos, M.; Ghosh, A. A Systematic Review of the Applications of Multi-Criteria Decision Aid Methods (1977–2022). *Electronics* **2022**, *11*, 1720. https://doi.org/10.3390/electronics11111720

Academic Editors: Agnieszka Konys and Agnieszka Nowak-Brzezińska

Received: 29 April 2022 Accepted: 25 May 2022 Published: 28 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

According to Kahraman et al. [5] and Govindan and Jepsen [6], some decisions are relatively simple, especially if the consequences of making the wrong decision are minor, whereas others are highly complex and have significant effects. In most cases, real-life problem-solving involves several competing points of view that must be considered to reach a reasonable decision [7]. A decision can be defined formally as a choice made based on available information or a method of action aimed at solving a specific decision problem [8]. In practice, multiple-criteria decision analysis (MCDA) evaluates possible courses of action or options by selecting a preferred option or sorting the options from best to worst [9–12]. In everyday practice, the use of MCDA is critical in signaling the best rational alternative to the decision-maker so that he can allocate finite resources between competing and alternative interests. Whether in an organizational or domestic setting, the decision-maker is constantly confronted with multiple paths and limited resources. Researchers refer to multiple criteria methods in various ways. Some authors prefer the term multiple-criteria decision aid or aiding (MCDA) or multiple-criteria decision analysis, while others prefer to use the term multicriteria decision-making or multiple-criteria decision-making (MCDM), multi-objective decision-making (MODM), or multi-attribute decision-making (MADM) [13].

The most often used MCDA approaches, as opined by [3,14], are divided into two "schools": American and European. The American School of decision-support methods is based on a functional approach, namely the utilization of value or usability. These strategies typically do not account for data inconsistency, ambiguity, or decision-maker preferences. This collection of techniques is closely related to the operational approach based on a single synthesized criterion. MAUT, AHP, ANP, SMART, UTA, MACBETH, and TOPSIS are the critical methods used in the American School. The European School's techniques are based on a relational concept. As a result, they employ a synthesis of criteria based on outranking relations. Transgression between pairs of decision alternatives characterizes this relationship. Among the European School of decision support methods, the ELECTRE and PROMETHEE groups are the most prominent. NAIADE, ORESTE, REGIME, ARGUS, TACTIC, MELCHIOR, and PAMSSEM are other methodologies from the European MCDA sector. Many multi-criteria decision-making strategies integrate ideas from the American and European decision-making schools. EVAMIX, QUALIFLEX, PCCA, MAPPAC, PRAGMA, PACMAN, IDRA, COMET, and DRSA are a few examples.

Furthermore, as stated by [6,14–16], MCDA methods are used to solve decision-making problems in several areas, including information and communication technology; business intelligence; environmental risk analysis; environmental impact assessment and environmental sciences; water-resource management; solid-waste management; remote sensing; flood-risk management; health-technology assessment; healthcare; transport; nanotechnology research; climate change; energy; international law and policy; human resources; financial management; performance and benchmarking; supplier selection; e-commerce and m-commerce; agriculture and horticulture; chemical and biochemical engineering; software evaluation; network selection; education and social policy; heating, ventilation, and air conditioning and small-scale energy management systems; and public security.

According to Sałabun et al. [3], despite the numerous MCDA approaches available, it is essential to note that no method is ideal and can be deemed acceptable for use in every decision-making context or for solving every choice problem [17]. As a result, different multi-criteria techniques may yield various choice suggestions [18]. However, if multiple multi-criteria methods produce inconsistent findings, the accuracy of each option is called into doubt [19]. In such a case, selecting a decision-support technique relevant to the given problem [20] becomes essential because only an appropriately chosen approach allows one to acquire the correct answer that reflects the decision maker's preferences [21].

Humans make decisions regularly, and decision-making is an inherent element of people's character. Some decisions are simple and have little impact on people's lives; others, on the other hand, directly impact people's lives, cities, and nations. In this regard, and given the importance of multi-criteria decision-making methods in assisting decision-makers in a variety of fields, the current study aims to answer the following research questions (RQ) and develop a reference framework on academic productivity regarding multi-criteria decision-making methods:

RQ1: Who are the most influential authors and researchers in their scientific productivity in multi-criteria decision-making methods?

RQ2: What is the annual scientific publication growth in multi-criteria decision-making methods?

RQ3: Which countries have the most significant production of articles on the multi-criteria methods of decision support?

RQ4: Which journals have the highest number of publications?

RQ5: What are the most used methods, and in which research areas?

RQ6: What are the conceptual structures of the multi-criteria decision-support methods?

Three hundred forty-two systematic literature studies on multi-criteria methods were discovered during the literature survey. The ten largest categories classified by Web of Science using multi-criteria methods were green sustainable science technology [22], energy fuels [23], environmental sciences [24], operations research and management science [25,26], computer science and artificial intelligence [27], management [28], economics [29], engineering environmental [30], computer science and interdisciplinary applications [31], and civil engineering [32].

This article is structured as follows: Section 2 briefly describes the methods and materials. Section 3 presents the preliminary bibliometric results and visualizes the collaborative relationships between countries and authors using R and the VOSviewer software. Keyword co-occurrences are analyzed, and strategic diagrams are constructed in the same section to reveal thematic trends on the multi-criteria decision support theme. The main discussions are summarized in Section 4.

#### **2. Materials and Methods**

This section presents the fundamental concepts that guided this study. The intention is not to cover all the subjects but rather to provide essential supporting information for understanding the research, the context, and the results.

The volume of academic publications is increasing at an accelerating rate. In this way, keeping up-to-date and knowing a given topic's state of the art is becoming increasingly difficult. As stated by Aria and Cuccurullo [33], the emphasis on empirical contributions has resulted in voluminous and fragmented research flows, which contributes to the heavy work of the researcher to keep up to date. Researchers affirm that literature reviews are prevalent in the state-of-the-art synthesis of various themes [33,34].

The structured literature review is a traditional way to analyze and review scientific literature. This type of review provides an in-depth analysis according to the content of the literature [35–39]. However, this method suffers from several limitations. For instance, it is very time-consuming, and the number of analyzed papers is limited. It is almost impossible to analyze hundreds of documents through the structured literature-review process. Although the authors carefully select the documents according to several criteria, it is challenging to eliminate subjective factors, and some essential studies may be omitted. With the digitization of scientific journals, the volume of published papers has increased dramatically. A bibliometric analysis effectively handles hundreds, even thousands, of documents and reviews the related literature from a macro perspective [37].

The term bibliometric refers to the quantitative study of bibliographic materials [40,41]. It can characterize the development in a research field or capture the changes in a specific journal. Various techniques have been developed to conduct bibliometric analysis, and the most-used methods are social network analysis and co-word analysis [37].

Social network analysis is based on the premise that the relationships between units can be interpreted as a graph [42]. It is an effective method to evaluate the importance of nodes and reveal the network structure. In the bibliometric networks, different types of networks, such as coauthorship networks [43,44], bibliographic coupling networks [45], and co-citation networks [46], are constructed by bibliometrics [47].

Co-word analysis is a content-analysis technique proposed by [48,49]. It is applied to map the strength of associations between information items in textual data [50]. It involves a co-occurrence analysis of keywords in a selected body of literature. Co-occurrence analysis, a central task of association analysis in data mining, is used to group keywords with high relevance in clusters [51]. Typically, each set corresponds to a search theme. Researchers use co-occurrence analysis to identify established and emerging research themes or tracking patterns [52–54].
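The counting step behind a keyword co-occurrence analysis can be sketched in a few lines of Python. This is a simplified illustration, not the procedure implemented in VOSviewer or Bibliometrix; the function name `keyword_cooccurrence` and the sample keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(documents):
    """Count how many documents contain each unordered pair of keywords."""
    pair_counts = Counter()
    for keywords in documents:
        # Normalize case and drop repeats so each pair counts once per document.
        unique = sorted({k.strip().lower() for k in keywords})
        pair_counts.update(combinations(unique, 2))
    return pair_counts


if __name__ == "__main__":
    docs = [
        ["AHP", "TOPSIS"],
        ["ahp", "topsis", "VIKOR"],
        ["VIKOR", "AHP"],
    ]
    for pair, count in keyword_cooccurrence(docs).most_common():
        print(pair, count)
```

Pairs with high counts are candidates for the same cluster, which is the input a clustering or mapping tool would then work from.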

Numerous software tools support bibliometric analysis; however, many do not assist scholars in a complete recommended workflow. The most relevant tools are CitNetExplorer [55], VOSviewer [56], SciMAT [50], BibExcel [57], Science of Science (Sci2) Tool [58], CiteSpace [59], HistCite, Pajek, Gephi, Bibliometrix [33], and VantagePoint (www.thevantagepoint.com (accessed on 24 April 2022)). In this study, VOSviewer and Bibliometrix were used to conduct a co-citation analysis.

In this study, a topical query on 29 April 2022, was conducted in the Web of Science (WoS) and Scopus database, using the following search query: (("multi-attribute decision making" or "madm" or "mcda" or "modm" or "mcdm" or "multi-criteria" or "multi-criteria" or "multiplecriteria") and ("ahp" or "todim" or "topsis" or "promethee" or "electre" or "vikor" or "maut" or "fitradeoff" or "dematel" or "copras" or "multimoora" or "swara" or "analytical network process" or "anp" or "simple multi-attribute rating technique" or "smart" or "goal programming" or "thor" or "cbr" or "saw" or "condorcet" or "drsa" or "macbeth" or "paprika" or "wpm" or "wsm" or "utadis" or "waspas")). The search was only restricted to titles, abstracts, keywords, and articles published between 1977 and 2022. Additionally, the search in the WoS database was limited to the Core Collection. The search query yielded 35,643 entries from the WoS and Scopus databases. Following the download of the records, the RStudio bibliometrix package version 1.2.1335 was installed on a Win64 operating system. Bibliometric analysis was performed using the Bibliometrix R package. The Bibliometrix tool was used to build the descriptive and co-citation networks. The function convert2df embedded in the Bibliometrix package was used to extract and create a data frame corresponding to the unit of analysis within the exported files from WoS and Scopus databases. After making the data frames from the WoS and Scopus files, the mergeDbSources function merged the WoS and Scopus data frames and excluded duplicate records from both files. Twelve thousand one hundred forty-nine duplicate records were removed, resulting in a data frame with 23,494 records for the bibliometric analysis. The process of obtaining the bibliographic records file can be seen in Figure 1.

**Figure 1.** Search strategy and extraction of data. Source: Prepared by the authors based on Basilio et al. [60] and Ghosh and Prasad [61].
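The merge-and-deduplicate step described above can be pictured with the following Python sketch. It is not the Bibliometrix implementation: matching records on a normalized title is an assumption made here for illustration, and the function names and sample records are hypothetical.

```python
import re


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_sources(wos_records, scopus_records):
    """Concatenate two record lists, keeping the first copy of each title.

    Returns the merged list and the number of duplicates dropped.
    """
    merged, seen, duplicates = [], set(), 0
    for record in list(wos_records) + list(scopus_records):
        key = normalise_title(record["title"])
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        merged.append(record)
    return merged, duplicates


if __name__ == "__main__":
    wos = [{"title": "A Fuzzy AHP Study"}, {"title": "TOPSIS in Energy"}]
    scopus = [{"title": "a fuzzy AHP study."}, {"title": "VIKOR Review"}]
    merged, dropped = merge_sources(wos, scopus)
    print(len(merged), dropped)
```

The counts reported above are consistent with such a step: 35,643 downloaded entries minus 12,149 duplicates leaves 23,494 records.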

#### **3. Results**

The results from the bibliometric analysis show that 33,201 authors produced 23,494 documents in the period from 1 January 1977, to 29 April 2022. The types of documents identified in the sample, despite the limitations, are described in the methods and data section and further illustrated in Figure 2.

Regarding academic production, studies on multi-criteria decision-support methods had their genesis in 1977. Figure 3 depicts the publishing trajectory until April 2022. The graph shows that the upward trend began in 1986 with a modest inclination. During this time, the average number of publications each year was 7.3. From 1987 to 1996, the average number of papers per year climbed to 28.3 documents. This average increased to 123.2 records per year during the next ten years and finally reached 1265.73 from 2007 to 2021, indicating a strong level of interest in the topic among researchers. Taking the entire period into account, publications on multi-criteria decision-support methods grew at an annual percentage rate of 14.18. Figures 4 and 5 show the average total citations per year (16.06) and the average years from publication (6.36), respectively.
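The text reports an annual growth rate of 14.18 percent without stating the formula. One common definition, assumed here, is the compound annual growth rate between the first and last yearly publication counts. The sketch below uses made-up counts purely to show the arithmetic; the function name `annual_growth_rate` is hypothetical.

```python
def annual_growth_rate(first_count, last_count, n_intervals):
    """Compound annual growth rate in percent over `n_intervals` yearly steps."""
    if first_count <= 0 or n_intervals <= 0:
        raise ValueError("first_count and n_intervals must be positive")
    return ((last_count / first_count) ** (1.0 / n_intervals) - 1.0) * 100.0


if __name__ == "__main__":
    # Made-up example: output grows from 100 to 121 documents over two yearly steps.
    print(round(annual_growth_rate(100, 121, 2), 2))
```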

Five peaks are depicted in the graph shown in Figure 4. In 1983, the earliest and most important studies were conducted. In that year, six documents were published. The article by Van Laarhoven and Pedrycz [62], with a total citation count of 2158, had the most impact on citations in 1983. The authors presented a fuzzy variant of Saaty's pairwise comparison method for deciding between many options when there are competing choice criteria. Eleven publications were included in the sample in 1986. The article by Brans et al. [63] had a significant impact that year, raising the yearly average with its 1609 citations. Brans et al. [63] introduced the PROMETHEE approach in this study. Chen et al. [64] had the most-cited paper in 1994, with 967 citations. Chinese researchers provided novel methods for dealing with fuzzy multi-criteria decision-making based on the theory of fuzzy sets. There were 2454 citations to Chen's paper [64] in 2000, which affected the average of the 63 articles published that year. Chen [64] extended the TOPSIS model to the fuzzy environment. Furthermore, in 2004, two publications significantly impacted the average number of citations among the 128 papers published: Opricovic and Tzeng [65] had 2590 citations, while Pohekar and Ramachandran [66] had 1270 citations. The VIKOR and TOPSIS approaches were compared by Opricovic and Tzeng [65]. Pohekar and Ramachandran [66] conducted a systematic review of multi-criteria techniques for sustainable energy management. Table 1 provides a summary of the sample's most cited articles.

**Figure 2.** Graphical representation of the documents contained in the sample.

**Figure 3.** Graphical representation of the annual scientific production. Note: The data for 2022 corresponds to partial values quantified up to 29 April 2022.

**Figure 4.** Graphical representation of the average total citations per year.




#### **Table 1.** *Cont.*

The year 2022 is shown as an outlier in Figure 5. The average number of papers cited every year was calculated using only the year of publication, which skews the results by overestimating this value. However, there are no distinguishing traits in this year's sample compared to earlier times. The volume of publications resulted in a total of 472,345 references.

#### *3.1. Monitoring of Scientific Production around the World*

Figure 6 shows that at least 120 countries or regions contributed to the research on multicriteria methods. China (n = 4327) is the largest contributor to multicriteria methods research, followed by India (n = 2485), Iran (n = 1812), Turkey (n = 1788), Taiwan (n = 1192), United States (n = 794), Brazil (n = 752), Spain (n = 608), Italy (n = 555), and Malaysia (n = 493). Regarding citations, Table 2 offers a slightly different order, but China continues to lead scientific production in terms of both knowledge generation and references to the scientific community: China (n = 82,615), Taiwan (n = 32,535), Turkey (n = 28,739), India (n = 23,643), Iran (n = 23,613), United States (n = 20,217), Lithuania (n = 12,292), United Kingdom (n = 10,917), Spain (n = 10,071), and Italy (n = 8601). As shown in Table 1, the top 10 research universities include Islamic Azad University (n = 504), Vilnius Gediminas Technical University (n = 456), National Institute of Technology (n = 336), University of Tehran (n = 334), Indian Institute of Technology (n = 265), and Istanbul Technical University (n = 243).

Figure 7 illustrates the relationships between organizations through the coauthorship analysis, using universities as the unit of analysis. The research was based on the following criteria: (1) the minimum number of documents per organization (n ≥ 50); (2) the minimum number of citations per organization (n ≥ 50). With the established criteria, 50 organizations out of the 7619 analyzed were separated. The nodes represent the universities. The diameter of the nodes represents the number of citations, and the thickness of the connecting lines between the nodes represents the level of cooperation between the institutions. As a result, Islamic Azad University and Vilnius Gediminas Technical University stand out in this analysis.

**Figure 7.** The network map of institutions involved in multi-criteria methods of decision-support research. Note: The colors of the circles are used to identify the clusters resulting from the analysis of the relations provided by the VOSviewer software.



This section provides a quick summary of the bibliometric findings. However, we chose to go beyond a typical bibliometric analysis by stratifying the investigation and providing the reader with specific information about the countries ranked in Figure 2. Table 3 lists the major research topics, universities, research funding organizations, notable authors, and the most relevant papers.

**Table 3.** Analytic picture of scientific production in the ten best-ranked countries.




#### *3.2. Overview of the Leading Journals and Papers That Disseminate Research on Multi-Criteria Methods*

Six thousand one hundred and five journals have published research on multi-criteria methods over the past forty-four years. As seen in Table 4, the top ten journals published 2180 of the total 20,861 studies on multi-criteria techniques (10.40%). *Expert Systems with Applications*, *Sustainability*, and *Journal of Cleaner Production* are the top three journals, accounting for over 4.67 percent of all indexed material. The journal with the highest impact factor (IF) is the *Journal of Cleaner Production* (7.246), followed by *Applied Soft Computing* (5.472) and *Expert Systems with Applications* (5.452). Five journals are classified as Q1 by the JCR 2019 standards, two as Q2, and three as Q3. The eighth column of Table 4 displays the number of citations for each journal.


**Table 4.** Top 10 most-active journals that published research articles on multicriteria methods (sorted by count).

Figure 8 depicts the inter-relationships between the journals, built from researchers' preferences in referencing publications from sources with a high impact factor. The diameter of the circles is directly related to the number of citations, while the colors represent the identified clusters. The eleventh column of Table 4 lists the five countries that published the most in each source. China contributes the largest number of articles, ranking first in eight of the ten journals. The analysis of the highly cited papers shows that *Renewable and Sustainable Energy Reviews*, *Expert Systems with Applications*, and the *International Journal of Production Economics* have a considerable scientific impact and have published articles with more than 800 citations (Table 1).

#### *3.3. Analysis of the Most Influential Authors Who Discuss the Topic of the Multi-Criteria Methods*

Zavadskas E, Wang J, Tzeng G, Wang Y, and Kahraman C are among the top ten authors out of 29,050 who have published the most articles on this topic (Table 5). Edmundas Kazimieras Zavadskas is the first vice-rector of Vilnius Gediminas Technical University (VGTU). In addition, he is a member of the VGTU Senate, a professor, and the head of the Department of Construction Technology and Management. He has co-written over fifty books in Lithuanian, Russian, German, and English. Corporations and academic institutions have commissioned over forty of his research papers. The professor's primary research interests include building life cycles, decision-support systems, and multi-criteria optimization methods in construction technology and management.

**Figure 8.** The network map of co-cited journals. Note: The colors of the circles are used to identify the clusters resulting from the analysis of the relations produced by the VOSviewer software.


**Table 5.** Ranking of authors with the highest scientific production on multicriteria methods.

| Rank | Authors | Country | University | H\_Index | G\_Index | Article Counts | Total Number of Citations | Average Number of Citations | First Author Counts | First Author Citations Counts | Average First Author Citations Counts |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | TURSKIS Z | Lithuania | Vilnius Gediminas Technical University | 34 | 63 | 93 | 4264 | 45.85 | 10 | 273 | 27.3 |
| Total | | | | | | 1458 | 50,372 | 34.54 | 353 | 12,890 | 36.51 |

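Table 5 reports an h-index and a g-index for each author. As a reading aid, the following minimal sketch computes both indices from a list of per-paper citation counts. It assumes the standard textbook definitions (with g limited to the number of papers), which may differ in detail from the way the bibliometric software used in this study derives them. The citation counts are hypothetical.

```python
from typing import Sequence


def h_index(citations: Sequence[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations: Sequence[int]) -> int:
    """Largest g such that the g most-cited papers together have
    at least g**2 citations (g is capped at the number of papers)."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts for one author's five papers.
example_citations = [10, 8, 5, 4, 3]
example_h = h_index(example_citations)
example_g = g_index(example_citations)
```

Because the g-index accumulates citations over the most-cited papers, it is never smaller than the h-index, which is consistent with the values shown in Table 5 (for example, 34 and 63).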

Figure 9 depicts a group of 160 authors grouped into six clusters based on two essential criteria about the authors' academic output: the minimum number of citations (n ≥ 500) and the minimum number of documents (n ≥ 10). Each cluster, identified by a distinct color, indicates the interactions between authors and co-authors. The number of links and the total link strength (TLS) are employed to determine the strength of the relationships. Each cluster's featured author is the author with the most links and the highest TLS. The clusters are as follows: Cluster 1 (red) contains 37.5% of the sample, with an emphasis on authors Wang Y (Links = 112, TLS = 540) and Cheng Y (Links = 103, TLS = 394); Cluster 2 (green) contains 26.9% of the sample, with an emphasis on authors Wang J (Links = 140, TLS = 315), Xu Z (Links = 141, TLS = 2048), Zhang H (Links = 144, TLS = 1935), and Wang X (Links = 121, TLS = 658); Cluster 3 (blue) contains 10.6% of the sample, with an emphasis on author Kahraman C (Links = 143, TLS = 2548); Cluster 4 (yellow) contains 10% of the sample, with an emphasis on authors Zavadskas E (Links = 153, TLS = 9165) and Turskis Z (Links = 138, TLS = 4074); Cluster 5 (purple) contains 7.5% of the sample, in which author Liu H stands out (Links = 122, TLS = 1395); Cluster 6 (light blue) contains 7.5% of the sample, highlighting author Tzeng G (Links = 139, TLS = 2167).
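To make the two measures concrete, the sketch below derives the number of links and the total link strength for each author from a weighted, undirected co-authorship edge list. It is an illustration only: the edge list is hypothetical, each author pair is assumed to appear once, and the tie-breaking rule used to pick a featured author (links first, then TLS) is an assumption rather than the procedure applied by VOSviewer.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def link_stats(edges: Iterable[Tuple[str, str, int]]) -> Dict[str, Tuple[int, int]]:
    """Return {author: (links, total_link_strength)}.

    links: number of distinct co-authors connected to the author.
    total_link_strength: sum of the weights of those connections.
    """
    neighbours = defaultdict(set)
    strength = defaultdict(int)
    for a, b, weight in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
        strength[a] += weight
        strength[b] += weight
    return {name: (len(neighbours[name]), strength[name]) for name in neighbours}


def featured_author(stats: Dict[str, Tuple[int, int]], members: Iterable[str]) -> str:
    """Pick the member with the most links, using TLS to break ties (assumed rule)."""
    return max(members, key=lambda name: stats[name])


# Hypothetical edge list: (author, co-author, number of joint papers).
edges = [
    ("author_a", "author_b", 3),
    ("author_a", "author_c", 1),
    ("author_b", "author_c", 2),
    ("author_c", "author_d", 1),
]
stats = link_stats(edges)
top = featured_author(stats, stats.keys())
```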

**Figure 9.** The network map of productive authors. Note: The colors of the circles are used to identify the clusters resulting from the analysis of the relations produced by the VOSviewer software.

#### *3.4. Main Research Areas for the Application of Multi-Criteria Methods*

The distribution of scientific production by research area is shown in Table 6. Academics' preferred research fields have shifted over the past four decades. Table 7 displays the top five research areas by period. The top five areas did not change across the first two periods. From 1982 to 2002, research and applications of multi-criteria methods focused mainly on the following areas: operations research (1st), business economics (2nd), computer science (3rd), engineering (4th), and mathematics (5th). With the increase in the volume of works published in the third decade under study, as shown in Figure 2, the research areas also changed. From 2003 to 2012, the mathematics field was surpassed by environmental sciences ecology, which ranked fifth with 288 papers. Operations research, which held the number-one spot for two decades, was ranked third. Business economics lost its second place to computer science and fell to fourth place, while engineering rose from fourth to first place. The most recent period analyzed was marked by a substantial increase in the number of published works. Researchers showed a clear preference for engineering (1st) and computer science (2nd), and the traditional area of operations research gave way to environmental sciences ecology (3rd). In fourth position is science technology, which has attracted greater interest from researchers owing to recent advances. Fifth place was occupied by business economics, a field in which scholars' interest has diminished over the past four decades.

**Table 6.** Distribution of scientific production by research areas.


Note: The value "26,376" in the third column is the total number of articles in the sample associated with the research areas. Each article can be related to more than one research area.



Note: Only data up to the fifth position in each period were recorded.

In Section 3.1, a global overview of the scientific output on multi-criteria methods is provided, highlighting the significant countries and classifying each production. However, as seen in the case of research domains, the hegemony of the scientific output has also evolved differently between nations. The shift in emphasis in specific scientific fields and the consolidation of others directly impact the hegemony of nations. If we analyze Table 2, we can see the consolidation of engineering and computer science as prominent areas in the production of the ten countries explored and the emergence of interest in science and technology.

#### *3.5. Most-Used Methods*

Table 8 lists the 26 methods examined throughout the sample period. Column 3 records the period over which each investigated method has been published in WoS/Scopus. The chronology of the evolution of multi-criteria approaches, shown in Figure 10, was produced from the starting period of each method's scientific output. It covers both techniques that are well established in the literature and continue to evolve, such as AHP, TOPSIS, PROMETHEE, and ELECTRE, and others, such as SWARA, WASPAS, and FITRADEOFF, that have been published for up to ten years but are not yet well known in academia. The number of publications for each studied technique is given in column 4. The AHP, TOPSIS, and VIKOR approaches have the most publications in the four decades studied. They are also the methods most commonly employed by professionals in solving multi-criteria problems. Column 5 indicates the research areas in which specialists used each method the most. Computer science stands out because 47% of the researched methods address issues related to this area, with the TOPSIS method being used the most. Engineering follows, with 35% of the methods, with the AHP method being the second most-used method. Business economics accounts for 11% and operations research for 8%. Column 7 extends the study by showing, based on the data acquired, a trend toward solutions that combine one or more methods and toward the creation of hybrid models. This section concludes by emphasizing that, despite the small number of applications, the data show multi-criteria methods being integrated with some machine learning techniques, which could be the beginning of a new trend in the coming years (see column 8).




**Table 8.** *Cont.*

#### *3.6. Mapping the Evolution of Themes*

Following Cobo et al. [170], let $T^{t}$ be the set of themes detected in subperiod $t$, with $U \in T^{t}$ representing each theme detected in that subperiod, and let $V \in T^{t+1}$ represent each theme detected in the subsequent subperiod $t + 1$. There is a thematic evolution from theme $U$ to theme $V$ if the two associated thematic networks share keywords. In that case, $V$ can be considered a development of $U$, and each shared keyword $k \in U \cap V$ is regarded as a "thematic nexus" or "conceptual nexus".
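The definition above can be illustrated with a short sketch that treats each theme as a set of keywords. The nexus is simply the intersection of the two sets. The inclusion index shown here, the size of the intersection divided by the size of the smaller theme, is a measure commonly used to weight such links; the text above does not state it, so it is included as an assumption. The keyword sets are hypothetical.

```python
from typing import Set


def thematic_nexus(theme_u: Set[str], theme_v: Set[str]) -> Set[str]:
    """Keywords shared by a theme U (subperiod t) and a theme V (subperiod t + 1)."""
    return theme_u & theme_v


def has_thematic_evolution(theme_u: Set[str], theme_v: Set[str]) -> bool:
    """U evolves into V if the two themes share at least one keyword."""
    return bool(theme_u & theme_v)


def inclusion_index(theme_u: Set[str], theme_v: Set[str]) -> float:
    """|U ∩ V| / min(|U|, |V|); assumed weighting, not stated in the text."""
    if not theme_u or not theme_v:
        return 0.0
    return len(theme_u & theme_v) / min(len(theme_u), len(theme_v))


# Hypothetical keyword sets for two consecutive subperiods.
theme_u = {"ahp", "analytic hierarchy process", "supplier selection"}
theme_v = {"ahp", "fuzzy ahp", "supplier selection", "topsis"}

nexus = thematic_nexus(theme_u, theme_v)
evolves = has_thematic_evolution(theme_u, theme_v)
weight = inclusion_index(theme_u, theme_v)
```

An index of 1 means that the smaller theme is entirely contained in the other, while 0 means that the two themes share no keywords and no evolution link is drawn.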

Figure 11 was created using the "thematicEvolution" function of the Bibliometrix R package. It depicts the evolution of themes associated with multi-criteria methods across the five time periods. In the first period (1977–1986), three themes are recorded. As the rectangles occupy the same area in this period, it may be deduced that the topics were disseminated in a balanced way. In the second period (1987–1995), there are twelve themes, of which eight had no foundation in the first period, such as "AHP", "TOPSIS", and "fuzzy set theory". These methods have their earliest publication record in 1990/1991 (Table 8). Even so, researchers already favor them, as in the case of TOPSIS, which has the same rectangular area as "GOAL PROGRAMMING", one of the three primary themes of the first period. During the third period (1996–2004), we recorded fourteen themes that originated in or branched from the preceding period. In this third period, the focus is on the AHP method, which is the most influential subject, as indicated by a distinct set of four keywords (among them "ahp", "analytic hierarchy process", and "analytical hierarchy process (ahp)"). It is important to note that the "GOAL PROGRAMMING" theme has become less popular and that the PROMETHEE and ELECTRE methods have become more popular. Although first published in 1989/1991, they did not emerge as themes until the third period. The number of themes decreased from fourteen to nine in the fourth period (2005–2013). Two AHP-related themes remain the most important. In addition to the PROMETHEE method, the TOPSIS method, which did not appear in the third period, re-emerged distinctly. The final period evaluated (2014–2022) continues the reduction, from nine to six themes, which are presented in a balanced way and reflect the preference for topics associated with the AHP and TOPSIS methods. The theme-evolution map thus allowed us to confirm graphically that, throughout the study period, specialists solving multi-criteria problems preferred the original AHP and TOPSIS methods.

**Figure 11.** The evolution of themes built with the authors' keywords.

#### **4. Discussion**

This research article presents a bibliometric analysis of multi-criteria methods from 1977 to 29 April 2022. The bibliographic data were obtained from the Scopus and Web of Science (WoS) databases. The bibliometric analysis was conducted using the Bibliometrix R tool and the VOSviewer software to investigate the essential characteristics of the studies done so far, including publications; citations and citation structure; influential authors; co-citation contributors and burst detection analysis; author keywords; co-occurrence analyses; and timeline-view analysis. The ability to make judgments is a distinguishing human characteristic. People make spontaneous and intuitive decisions based on the brain's information-processing skills. These decisions range from choosing the color of a tie for a business meeting to deciding whether to invest millions of dollars in a specific project. We therefore face two distinct types of decisions: simple and complex. We can make straightforward decisions with few variables and little trouble. However, when the problem involves an n × m matrix of variables, we require methodologies and computing capabilities to systematize, arrange, and rank the options to aid decision-making. Accordingly, the objective of this study was to comprehend the global evolution of research on the creation and use of multi-criteria decision methods.

With scientific production growing at 14.18% per year, it is clear that the academic community is interested in researching and publishing work on multi-criteria decision-making approaches. Moreover, 60.93% of all publications were concentrated in only ten nations, with China leading the way with 18.50%, India coming in second with 10.62%, and Iran coming in third with 7.75%. The countries behind the remaining 39% of publications each have an average production share of less than 1%, suggesting that disseminating multi-criteria research in these nations could enhance their academic output. The top 10 countries in terms of citations follow a consistent pattern, accounting for 62.48% of all citations made during the research period. Among the top 10 countries in terms of multi-country collaboration in publications (MCP), Turkey has the lowest MCP ratio (0.0487), indicating limited collaboration with researchers from other nations, followed by India (0.0592) and Brazil (0.0861). Malaysia leads in multi-country collaboration, with an MCP ratio of 0.2331, followed by the United States (0.2234) and Spain (0.2169).
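The two indicators in this paragraph can be sketched as follows. The paper does not spell out the formulas, so the sketch assumes that the annual growth rate is a compound rate between the first and last years and that the MCP ratio is the share of a country's publications with co-authors from at least one other country (with SCP denoting single-country publications). All input numbers are hypothetical and are not taken from the study's data.

```python
def annual_growth_rate(first_year_count: float, last_year_count: float, n_years: int) -> float:
    """Compound annual growth rate in percent over n_years intervals (assumed formula)."""
    if first_year_count <= 0 or n_years <= 0:
        raise ValueError("first_year_count and n_years must be positive")
    return ((last_year_count / first_year_count) ** (1.0 / n_years) - 1.0) * 100.0


def mcp_ratio(scp: int, mcp: int) -> float:
    """Share of a country's publications that involve multiple countries (assumed formula)."""
    total = scp + mcp
    if total == 0:
        return 0.0
    return mcp / total


# Hypothetical inputs: output grows from 10 to 40 papers over 2 yearly intervals,
# and a country has 15 single-country and 5 multi-country publications.
growth = annual_growth_rate(10, 40, 2)
ratio = mcp_ratio(scp=15, mcp=5)
```

Under these assumptions, a ratio close to 0 indicates that almost all of a country's publications are produced without international co-authors, which is how the low values for Turkey, India, and Brazil are read above.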

Regarding the sources that publish articles on multi-criteria techniques, the study reveals the top ten journals, which have published approximately 10.4% of the subject's total publications. China, India, Iran, and Turkey, the four nations with the most publications on multi-criteria techniques, account for around 80% of the university-based publications on multi-criteria methods. These universities account for 11.79% of academic output, with the Islamic Azad University of Iran contributing 2.14% and Vilnius Gediminas Technical University of Lithuania accounting for 2.18%. Surprisingly, Lithuania is not among the top ten nations regarding scientific output. Nevertheless, among the authors in this survey, Prof. Edmundas Kazimieras Zavadskas of Lithuania ranks first with 240 articles on multi-criteria approaches.

The journal *Expert Systems with Applications* has published 1.70% of all articles to date, followed by *Sustainability* with 1.68% and the *Journal of Cleaner Production* with 1.30%. The leading journals in terms of citations are *Expert Systems with Applications*, with an average of 7.88 citations per paper, followed by the *European Journal of Operational Research*, with 6.61 citations per article. Regarding the origin of publications, eight of the top ten countries publish most of their articles in the ten highest-ranked journals. In contrast, the *European Journal of Operational Research* ratio is 2 out of 10.

Regarding the most influential authors in this field, approximately 0.034% of 33,201 authors are responsible for 6.98% of publications over the past forty-four years, with ZAVADSKAS E having the most publications, with 240, followed by WANG J with 211 articles and TZENG G with 191 articles. This bibliometric analysis reveals that six of the top ten authors are Chinese, with the Central South University author affiliation standing out.

In addition to identifying writers with higher academic production, this study includes a comprehensive summary of the countries, funding sources, and the five multi-criteria approaches, i.e., AHP, TOPSIS, VIKOR, PROMETHEE, and ANP, most frequently utilized by the authors in their respective studies. Engineering and computer science are the most prominent subjects in terms of research fields. One trend identified was the expansion of multi-criteria technique integration and the formation of hybrid models.

This paper gives a complete overview of multi-criteria methods through a bibliometric study, enabling scholars to comprehend the current state and future development patterns of multi-criteria decision-making methods research. As an indication for prospective research, we can emphasize the need to understand the emergence and regionalization of specific techniques and their variations, expand research within the identified countries to gain a deeper understanding of their scientific production on the issue investigated, apply topic modeling to find latent themes in the researched database, and systematize method variants and their interfaces with other research areas, such as machine learning.

**Author Contributions:** Conceptualization, M.P.B.; data curation, V.P.; formal analysis, M.S.; investigation, V.P.; methodology, M.P.B.; project administration, M.S.; supervision, H.G.C.; validation, V.P.; writing—original draft, M.P.B.; writing—review and editing, M.P.B. and A.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** https://github.com/marciobasilio/bibliometric\_multicriteria (accessed on 24 April 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Electronics* Editorial Office E-mail: electronics@mdpi.com www.mdpi.com/journal/electronics
