**Preface to "Modelling and Machine Learning Methods for Bioinformatics and Data Science Applications"**

With the enormous amount of data flowing from a variety of real-world problems, Artificial Intelligence (AI), and in particular Machine Learning (ML) and Deep Learning (DL) techniques, have powered new achievements in many complex applications that were out of reach for deterministic approaches. These advances, which rest on a multidisciplinary research framework involving computer science, numerical analysis and statistics, come from research efforts in both industry and academia and are particularly well suited to complex problems in data science and, more specifically, in biotechnological and medical applications. While these methods have shown astounding performance, they still suffer from a degree of opacity: their results, though correct and quickly obtained, are difficult to interpret or explain, a fundamental drawback especially for problems where people's lives are at stake. A deeper understanding of the fundamental principles of machine and deep learning methods is therefore needed to expose both their advantages and their limitations. A variety of approaches to explainability exists, ranging from methods that try to identify the features on which an AI model bases its prediction to the construction of ad hoc architectures from which some logical knowledge can be extracted. We think that a viable alternative is a *contamination* between mathematical modeling and machine learning, in the belief that inserting equations derived from the physical world into data-driven models can greatly enrich the information content of the sampled data, allowing us to simulate very complex phenomena with drastically reduced computation times and interpretable solutions.
The application of such hybrid techniques to structured data, such as time series or graphs, however, opens up further interesting challenges: determining whether these techniques generalize to different problems, how much the data structure (for instance, sampling from a subspace or manifold) affects the method, and how to choose appropriate (hyper-)parameters to ensure a good fit while still avoiding overfitting.

This Special Issue brings together researchers from different disciplinary fields who focus on building theoretical foundations for deep learning and on presenting its cutting-edge applications. We hope this issue will prompt researchers in a wide variety of fields, including pure and applied mathematics, statistics, computer science and engineering, to reflect on fundamental issues and to join forces in integrating different approaches for solving complex problems quickly, reliably and in a way that human experts can understand.

The contributions collected can be divided into two main categories, relating, respectively, to the application of mathematical models/DL techniques to the study of biological macrosystems and to the automatic analysis/prediction of medical data for the prognosis of human diseases.

For the first category, the paper titled "Machine Learning Techniques Applied to Predict Tropospheric Ozone in a Semi-Arid Climate Region" describes a comparative evaluation of a large class of statistical modeling methods for classifying high or low ozone concentration levels. Indeed, ground-level ozone exposure has led to a significant increase in environmental risks, since it adversely affects not only human health but also some delicate plants and vegetation.

Additionally, the paper "Interactions Obtained from Basic Mechanistic Principles: Prey Herds and Predators" presents four different predator–prey–herd models. They are derived assuming that the prey gathers in herds and that the predator can be either a specialist (it feeds on only one species) or a generalist (it feeds on multiple resources), and they consider two functional responses, the herd-linear and the herd-Holling type II. The paper aims to derive their mathematical formulation from the individual-level state transitions and to compare the models' dynamics in terms of equilibria, stability and bifurcation diagrams. The predator–prey–herd antagonistic behavior has been widely observed in population ecology, especially in aquatic species and insects, and has been shown to deeply affect niche expansion and speciation.
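
As a rough illustration of what a herd-linear and a herd-Holling type II functional response can look like, the sketch below assumes the square-root form that is common in the herd-behavior literature, where only the prey on the boundary of the herd are exposed to predators. The function names, the parameters `a` (attack rate) and `h` (handling time), and the square-root exponent are assumptions made for this example; the exact expressions derived in the paper may differ.

```python
import math


def herd_linear(prey: float, a: float) -> float:
    """Per-predator intake rate, assumed form a * sqrt(prey).

    The square root stands for the prey on the herd boundary (assumption).
    """
    return a * math.sqrt(prey)


def herd_holling_ii(prey: float, a: float, h: float) -> float:
    """Saturating variant, assumed form a*sqrt(prey) / (1 + a*h*sqrt(prey)).

    With handling time h > 0 the intake rate approaches 1/h for large herds.
    """
    boundary = math.sqrt(prey)
    return a * boundary / (1.0 + a * h * boundary)


if __name__ == "__main__":
    for n in (1.0, 4.0, 100.0):
        print(n, herd_linear(n, 0.5), herd_holling_ii(n, 0.5, 1.0))
```

Under these assumptions the herd-linear response grows without bound as the herd grows, whereas the Holling type II variant saturates, which is the qualitative difference the two responses are meant to capture.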

To the second category belongs the manuscript "Alzheimer Identification through DNA Methylation and Artificial Intelligence Techniques", which presents a nonlinear approach for identifying combinations of CpG DNA methylation data as biomarkers for Alzheimer's disease (AD). Techniques that can determine early whether an individual has AD are becoming increasingly important, especially after the FDA approval of the first drug for AD treatment (earlier drugs targeted some of the effects of the illness, but not the illness itself). Such an early diagnosis will soon be possible thanks to non-invasive medical tests that capture methylation data from a simple blood sample.

Two contributions, namely "Visual Sequential Search Test Analysis: An Algorithmic Approach" and "A Mixed Statistical and Machine Learning Approach for the Analysis of Multimodal Trail Making Test Data", are devoted to the automatic analysis of Trail Making Test (TMT) data. The TMT is a popular neuropsychological test, commonly used in clinical settings as a diagnostic tool for the evaluation of some frontal functions, which provides qualitative information on higher-order mental activities, including speed of processing, mental flexibility, visuospatial orientation, working memory and executive functions. Such data are preprocessed either in the form of sequences, treated with an algorithmic approach based on the episode matching method, or in the form of scan-path images, processed via DL and clustering methods, in order to distinguish patients affected by extrapyramidal syndrome and by chronic pain from healthy subjects. A statistical analysis based on blinking rate and pupil size is also carried out to help classify different pathologies.
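
Episode matching is usually stated as the problem of finding the shortest window of a sequence that contains a given pattern as a subsequence. The sketch below is a brute-force illustration of that problem statement only; it is not the algorithm used in the paper, and the function name and the toy symbols are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def shortest_episode(text: Sequence, pattern: Sequence) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices, inclusive, of the shortest window of
    `text` that contains `pattern` as a subsequence, or None if none exists.

    Brute force: try every position matching pattern[0] as a window start
    and scan forward greedily, which yields the earliest end for that start.
    """
    if not pattern:
        return None
    best = None
    for start, symbol in enumerate(text):
        if symbol != pattern[0]:
            continue
        matched = 0
        for end in range(start, len(text)):
            if text[end] == pattern[matched]:
                matched += 1
                if matched == len(pattern):
                    if best is None or end - start < best[1] - best[0]:
                        best = (start, end)
                    break
    return best


if __name__ == "__main__":
    # Toy sequence of visited targets; "x" marks a fixation outside any target.
    print(shortest_episode("1xA1A2xB", "1A2"))
```

In this toy run the expected path `1A2` occurs twice as a subsequence, and the tighter occurrence (the one without the stray fixation) is returned.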

Finally, the paper "A Multi-Stage GAN for Multi-Organ Chest X-ray Image Generation and Segmentation" proposes a deep learning approach to generating realistic synthetic images that can be used to train a segmentation network, which is particularly useful in medical applications, where the scarcity of data often prevents the use of DL architectures. Segmentation is, in fact, a preliminary step for automatic image analysis and classification, and has proven fundamental, for instance, in diagnosing COVID-19 based on lung damage.

> **Monica Bianchini, Maria Lucia Sampoli** *Editors*

#### *Article* **Machine Learning Techniques Applied to Predict Tropospheric Ozone in a Semi-Arid Climate Region**

**Md Al Masum Bhuiyan 1,\*, Ramanjit K. Sahi 1, Md Romyull Islam 1 and Suhail Mahmud 2**


**Abstract:** In the last decade, ground-level ozone exposure has led to a significant increase in environmental and health risks. Thus, it is essential to measure and monitor atmospheric ozone concentration levels. Recent improvements in machine learning (ML) processes, based on statistical modeling, have provided a better approach to addressing these risks. In this study, we compare the Naive Bayes, K-Nearest Neighbors, Decision Tree, Stochastic Gradient Descent, and Extreme Gradient Boosting (XGBoost) algorithms and their ensemble to classify ground-level ozone concentration in the El Paso-Juarez area. As El Paso-Juarez is a non-attainment city, the concentrations of several air pollutants and meteorological parameters were analyzed. We found that the ensemble (soft voting classifier) of the algorithms used in this paper provides high classification accuracy (94.55%) for the ozone dataset. Furthermore, we identified the variables most responsible for high ozone concentration, such as nitrogen oxides (NOx), wind speed and gust, and solar radiation.

**Keywords:** tropospheric ozone; machine learning; El Paso-Juarez; semi-arid climate
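
The soft voting ensemble mentioned in the abstract combines classifiers by averaging their predicted class probabilities and choosing the class with the highest average. A minimal sketch of that rule in plain Python follows; the function name, the optional weights and the toy probabilities are assumptions made for illustration and are not taken from the study.

```python
from typing import List, Optional, Sequence, Tuple


def soft_vote(
    probabilities: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[int, List[float]]:
    """Combine per-classifier class probabilities for ONE sample.

    probabilities[m][c] is the probability that model m assigns to class c.
    Returns (predicted class index, weighted-average probabilities).
    """
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    if weights is None:
        weights = [1.0] * n_models  # unweighted average by default
    total = sum(weights)
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged


if __name__ == "__main__":
    # Toy outputs of three hypothetical classifiers; class 0 = low, 1 = high ozone.
    per_model = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
    print(soft_vote(per_model))
```

In the toy input two of the three models favor class 1, so a majority (hard) vote would pick class 1, while the averaged probabilities favor class 0 because the first model is far more confident. In library form, scikit-learn's `VotingClassifier` with `voting="soft"` applies the same averaging to fitted estimators that expose `predict_proba`.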
