#### *2.1. Data Set Description*

Anonymous patient data were extracted from the San Rafael Hospital database. Records range from January 2000 to January 2020. The data set consisted of 996 records and 40 variables. A total of 47.19% of patients suffered a relapse in less than six months, whilst 52.81% did not relapse in that period.
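As a rough check on the class balance, the reported percentages can be converted back to approximate record counts. The sketch below is illustrative only: the counts are inferred from the rounded percentages, not taken from the database, and the variable names are hypothetical. It also computes the accuracy of a trivial classifier that always predicts the majority class, which is a useful reference point for a nearly balanced data set.

```python
# Reported figures from the text above.
n_records = 996
relapse_share = 47.19  # percent of patients relapsing in under six months

# Inferred (not reported) counts, recovered by rounding.
n_relapse = round(n_records * relapse_share / 100)
n_no_relapse = n_records - n_relapse

# Accuracy of always predicting the majority class (no relapse).
majority_baseline = max(n_relapse, n_no_relapse) / n_records

print(n_relapse, n_no_relapse, round(majority_baseline, 4))
```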

**Citation:** Rodríguez, A.M.; Tort, C.G.; Ulloa, V.S.; Gestal, J.M.L.; Pereira, J.; Pulido, V.A. Training of Machine Learning Models for Recurrence Prediction in Patients with Respiratory Pathologies. *Eng. Proc.* **2021**, *7*, 20. https://doi.org/10.3390/engproc2021007020


Academic Editors: Joaquim de Moura, Marco A. González, Javier Pereira and Manuel G. Penedo

Published: 13 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

#### *2.2. Machine Learning Algorithms*

#### 2.2.1. Linear Discriminant Analysis

Linear discriminant analysis (LDA) is generally used to classify patterns between two classes [6]. LDA models the differences among samples assigned to certain groups in order to maximize the ratio of the between-group variance to the within-group variance.
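For the two-class case, the variance ratio described above is commonly written as Fisher's criterion. The notation below is the standard textbook form and is an assumption here, since the text does not define symbols: $\boldsymbol{\mu}_0$ and $\boldsymbol{\mu}_1$ are the class means, $S_B$ is the between-group scatter matrix and $S_W$ is the within-group scatter matrix.

$$
J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B \,\mathbf{w}}{\mathbf{w}^{\top} S_W \,\mathbf{w}}, \qquad
S_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^{\top}, \qquad
S_W = \sum_{c \in \{0,1\}} \sum_{\mathbf{x}_i \in c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{\top}
$$

When $S_W$ is invertible, the projection direction that maximizes $J$ is $\mathbf{w} \propto S_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$.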

#### 2.2.2. Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) is used when it is known that individual classes show distinct covariances. In this method, an individual covariance matrix is estimated for every class of observations.
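Under the usual Gaussian assumption, the per-class covariance matrices enter the decision rule through a quadratic discriminant function. The symbols below follow the standard textbook form and are assumptions, not notation from this study: $\boldsymbol{\mu}_k$, $\Sigma_k$ and $\pi_k$ are the mean, covariance matrix and prior probability of class $k$.

$$
\delta_k(\mathbf{x}) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert \;-\; \tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^{\top}\Sigma_k^{-1}(\mathbf{x} - \boldsymbol{\mu}_k) \;+\; \log \pi_k
$$

An observation $\mathbf{x}$ is assigned to the class with the largest $\delta_k(\mathbf{x})$. If all classes share one covariance matrix, the quadratic term in $\mathbf{x}$ cancels between classes and the rule reduces to the linear boundary of LDA.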

#### 2.2.3. K-Nearest Neighbors

The k-nearest neighbor classifier (k-NNC) assumes that samples with similar features lie close together in feature space, so that each class forms its own cluster of data points. The classifier takes the k nearest neighbors of a test point to find similarities between the test data and the features of each class.
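The neighbor vote can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are hypothetical and do not come from the study.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training sample, paired with its label.
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    # Keep the labels of the k closest samples and return the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up toy data: two features per record, label 1 = relapse, 0 = no relapse.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
pred_high = knn_predict(train_X, train_y, (4.0, 4.0), k=3)
print(pred_low, pred_high)
```

In practice, features measured on different scales would normally be standardized first, since the distance is otherwise dominated by the largest-valued variable.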

#### 2.2.4. Decision Trees

Decision trees (DTs) are used for classification and regression. The DT predicts the value of a target variable by learning simple decision rules inferred from the data features.
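To illustrate what a "simple decision rule" looks like, the sketch below learns a single split (a one-level tree) by minimizing the weighted Gini impurity. Gini impurity is one common splitting criterion and is assumed here; the text does not state which criterion was used. The helper names and the toy data (age, number of prior admissions) are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Return (impurity, feature, threshold) for the best rule 'x[feature] <= threshold'."""
    best = None
    n = len(y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Made-up toy data: (age, prior admissions); label 1 = relapse, 0 = no relapse.
X = [(45, 0), (52, 1), (60, 0), (48, 3), (70, 4), (55, 5)]
y = [0, 0, 0, 1, 1, 1]

impurity, feature, threshold = best_split(X, y)
print(impurity, feature, threshold)
```

A full decision tree applies the same search recursively to the left and right subsets until a stopping condition, such as a maximum depth, is reached.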

#### **3. Results and Discussion**

Figure 1 shows the results obtained for the four models. Accuracy is the ratio of correctly predicted observations to total observations; sensitivity is the ratio of true positives to actual positives; and specificity is the ratio of true negatives to actual negatives in the data.
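The three metrics defined above can be computed directly from the confusion-matrix counts. The following is a minimal sketch with made-up labels, where 1 denotes a relapse (the positive class); the function name `classification_metrics` is hypothetical and the numbers are not results from this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        # Correct predictions over all observations.
        "accuracy": (tp + tn) / len(pairs),
        # True positives over actual positives (undefined if there are none).
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # True negatives over actual negatives (undefined if there are none).
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Made-up example: 4 actual relapses and 6 actual non-relapses.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example the specificity is higher than the sensitivity, which is the pattern of a model that identifies non-relapses more reliably than relapses.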


**Figure 1.** Results obtained for the four models.

The overall accuracy for the four models is 60%; however, the accuracy value must be greater than 80% to be considered good.

The differences between sensitivity and specificity indicate that these models perform better at predicting non-relapses than relapses. As expected, the accuracies obtained in this study were lower than desired. The data set we used did not have input and output parameters for a specific disease diagnosis. Clinical records from San Rafael included information about diagnosis, procedures or health system, but they did not include parameters to diagnose a respiratory disease. To make better predictions, data sets need to include more useful information, such as whether the patient is a smoker, air quality or physical activity. The use of machine learning for health predictions is growing in popularity, although some challenges lie ahead.

**Author Contributions:** Conceptualization, A.M.R., C.G.T., J.P.; methodology, A.M.R.; writing, A.M.R.; supervision, V.S.U., J.M.L.G., V.A.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** Centro de Investigación de Galicia CITIC and Campus Innova (agreement I+D+ 2019-20) is funded by Consellería de Educación, Universidade e Formación Profesional from Xunta de Galicia and European Union (European Regional Development Fund - FEDER Galicia 2014-2020 Program) by grant ED431G 2019/01 and Universidade da Coruña. Partially supported by the Spanish Ministry of Science (Challenges of Society 2019) PID2019-104323RB-C33.

**Conflicts of Interest:** The authors declare no conflict of interest.

