**About the Editors**

**Giovanna Castellano** is an Associate Professor at the Department of Computer Science, University of Bari Aldo Moro, Italy, where she is the coordinator of the Computational Intelligence Lab. She is a member of the IEEE Computational Intelligence Society, the EUSFLAT society, and the INDAM-GNCS society. Her research interests are in the area of Computational Intelligence and Computer Vision. She has published more than 200 papers in international journals and conferences. She is an Associate Editor of several international journals. She was Co-organizer of the 4th EUSFLAT European Summer School on Fuzzy Logic and Applications (SFLA2018) and General Chair of the IEEE Conference on Evolving and Adaptive Intelligent Systems (IEEE-EAIS2020).

**Gabriella Casalino** is currently an Assistant Professor at the Computational Intelligence Lab of the Department of Informatics, University of Bari. Her research activity is focused on Computational Intelligence, with a particular interest in data analysis. She is currently working on three main themes: intelligent data analysis, computational intelligence for eHealth, and data stream mining. She is active in the computer science community as a reviewer for international journals and conferences, and she is involved in the organizing committees of international conferences. She is also an Associate Editor of the *Journal of Intelligent and Fuzzy Systems* and Guest Co-Editor of several Special Issues in international journals.

## *Editorial* **Special Issue on Computational Intelligence for Healthcare**

**Gabriella Casalino \* and Giovanna Castellano \***

Computer Science Department, University of Bari Aldo Moro, 70125 Bari, Italy **\*** Correspondence: gabriella.casalino@uniba.it (G.C.); giovanna.castellano@uniba.it (G.C.)

The volume of patient health data was estimated to reach 2314 exabytes by 2020. Traditional data analysis techniques are unsuitable for extracting useful information from such a vast quantity of data. Thus, intelligent data analysis methods combining human expertise and computational models for accurate and in-depth data analysis are necessary. The technological revolution and medical advances made by combining vast quantities of available data, cloud computing services, and AI-based solutions can provide expert insight and analysis on a mass scale and at a relatively low cost. Computational intelligence (CI) methods such as fuzzy models, artificial neural networks, evolutionary algorithms, and probabilistic methods have recently emerged as promising tools for the development and application of intelligent systems in healthcare practice. CI-based systems can learn from data and evolve according to changes in the environment by taking into account the uncertainty characterizing health data, including omics data, clinical data, sensor data, and imaging data. The use of CI in healthcare can improve the processing of such data to develop intelligent solutions for prevention, diagnosis, treatment, and follow-up, as well as for the analysis of administrative processes.

The present Special Issue on Computational Intelligence for Healthcare is intended to show the potential and the practical impact of CI techniques in challenging healthcare applications. The Special Issue received several submissions, all of which went through a rigorous peer-review process. After the review process, twelve papers were selected on the basis of the review ratings and comments. The selected papers cover the main applications of CI in healthcare.

A special case of medical data is the data generated by omics technologies, which enable DNA decoding and genome sequencing. Such data represent the expression of genes or portions thereof in experimental subjects or in cell lines produced in the laboratory. The study of the complex interactions between genes makes it possible to understand their role in the course of a particular disease. In particular, the analysis of the different biological levels allows us to better understand pathogenetic mechanisms at the molecular level, enabling the identification of biomarkers that can improve diagnosis or even support the planning of personalized therapies. However, the complexity of the relationships and the uncertainty present in the data collected and in the phenomena studied make it necessary to use specific methods for the treatment of information with these characteristics. The first paper, entitled "Explaining Ovarian Cancer Gene Expression Profiles with Fuzzy Rules and Genetic Algorithms" [1], addresses the analysis of gene expression data for approaching the study of expression profiles in ovarian cancer compared to other ovarian diseases. The work combines a feature selection among genes, guided by a genetic algorithm, with the creation of fuzzy if–then rules that explain how classes can be distinguished by observing changes in the expression of selected genes. After testing several parameters, a final model was obtained consisting of 10 genes involved in the molecular pathways of cancer and 10 rules that correctly classify all samples. Omics data of a different kind, namely sgRNA sequences, have been analyzed in the paper "CRISPRLearner: a deep learning-based system to predict CRISPR/Cas9 sgRNA on-target cleavage efficiency" [2]. In particular, ten datasets were considered. After a pre-processing step, including data standardization and augmentation, a convolutional neural network was used to predict sgRNA cleavage efficiency. Experiments on benchmark datasets showed the effectiveness of the proposed method in correctly identifying the disease. Moreover, a comparison with state-of-the-art methods shows the superiority of the proposed deep-learning-based model.

**Citation:** Casalino, G.; Castellano, G. Special Issue on Computational Intelligence for Healthcare. *Electronics* **2021**, *10*, 1841. https://doi.org/10.3390/electronics10151841

Received: 21 July 2021 Accepted: 27 July 2021 Published: 31 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In addition, the detection and analysis of physiological data acquired from sensors is an essential process in smart healthcare applications. Indeed, with the advent of low-cost sensors and fast networks, a new discipline called Internet of Medical Things (IoMT) has emerged. It allows continuous monitoring of patients through intelligent objects. Physiological data analysis can be performed in fog computing to reduce the excess latency introduced by cloud computing. However, the latency in handling emergency health status and overloading in the fog environment become key challenges for smart healthcare. The paper "Intelligent Fog-Enabled Smart Healthcare System for Wearable Physiological Parameter Detection" [3] presents a novel healthcare architecture for physiological parameter detection that addresses these problems. The overall system is built upon three layers. In the first layer, data from the wearable devices of patients are subjected to fault detection in a personal data assistant (PDA) via a rapid kernel principal component analysis (RK-PCA) algorithm. Then, in the second layer, the faultless data are processed to remove redundancy via a new fuzzy assisted objective optimization by ratio analysis (FaMOORA) algorithm. A two-level health-hidden Markov model (2L-2HMM) finds the user's health status from temporal variations in data collected from wearable devices. Finally, the user's health status is detected in the third layer through a hybrid machine learning algorithm called SpikQ-Net, and according to the user's health status, immediate action is taken. The proposed tri-fog health model is validated by a thorough simulation showing improvements in latency, execution time, detection accuracy, and system stability.

When different sensors are used for data acquisition, synchronization is a critical factor. In the article "A Synchronized Multi-Unit Wireless Platform for Long-Term Activity Monitoring", a time-synchronized multi-unit, multi-sensor, and multi-rate acquisition system for kinematic and static analysis is proposed [4]. It is a wearable multi-board acquisition system for offline activity monitoring. A master–slave architecture was used to synchronize measures acquired from different sensors. Moreover, a mobile Android application was developed in order to manage the data acquisition. The high modularity of the proposed platform makes it general-purpose. Indeed, experiments on different scenarios have been carried out to validate its performance. In particular, a case study of surface electromyography (sEMG) was used for monitoring muscle activities during walking. sEMG signals are also used in the article "A Deep Learning Approach to EMG-Based Classification of Gait Phases During Level Ground Walking" for gait phase classification [5]. Specifically, an Artificial Neural Network (ANN) was proposed to classify gait events and to predict foot–floor contact. The use of an ANN has allowed the automatic selection of relevant features in data, thus avoiding the feature engineering phase, which is necessary when using other machine learning algorithms. In vivo experiments have been conducted at the Movement Analysis Laboratory of Università Politecnica delle Marche to acquire gait signals. Raw signals were pre-processed to obtain the final labeled segments of the walking phases. Four different architectures of the multilayer perceptron (MLP) were proposed by modifying the model structure, and the results have been compared. The aim of the analysis was to detect the transitions between gait phases. Furthermore, a comparison with a feature-based (FB) method has shown that the best MLP model is more accurate in detecting phase changes.

A further essential and crucial task in healthcare is image analysis to support medical diagnosis. In particular, recent advances in neuroimaging techniques, such as diffusion tensor imaging (DTI), represent a crucial resource for brain image analysis in order to detect alterations related to severe neurodegenerative disorders, such as Alzheimer's disease (AD). The paper "An Ensemble Learning Approach Based on Diffusion Tensor Imaging Measures for Alzheimer's Disease Classification" [6] presents an ensemble learning approach for the automatic discrimination between healthy controls and AD patients, using DTI measures as predicting features and a soft-voting ensemble approach for the classification. The proposed approach, efficiently combining single classifiers trained on specific groups of features, improves classification performance with respect to the comprehensive approach of concatenating global features, while at the same time reducing the dimensionality of the feature space and, in turn, the computational effort. A different image task is addressed in the article "Leukemia Image Segmentation by Using a Hybrid Histogram-Based Soft Covering Rough k-Means Clustering Algorithm" [7]. Image segmentation is the task of partitioning a given image into non-overlapping areas to detect regions of interest. In particular, the authors propose a leukemia diagnosis support system based on nucleus segmentation. A soft set together with a rough set were used to represent the uncertainty in nucleus images. A four-step pipeline is proposed. Images are first pre-processed, and then a clustering algorithm is applied. A histogram-based method (HSCRKM) is proposed to identify the optimal number of clusters. Then, different features are extracted from the images, and the resulting data are used to predict the areas in the image as belonging to the leukemia tumor class or the healthy class. Several clustering and classification methods have been compared to identify the optimal pair. Results show that the proposed HSCRKM outperforms the compared clustering methods. Moreover, all the classification models increased their performance when trained on groups coming from HSCRKM. Among all the considered prediction methods, logistic regression and neural network provided the best performance (average accuracy higher than 90%). Diabetic Retinopathy (DR) images are analyzed in the paper "Blended Multi-Modal Deep Convnet Features for Diabetic Retinopathy severity Prediction" for an early recognition of the disease [8]. Both uni-modal and multi-modal approaches, the latter combining data coming from different sources, were used to predict the severity level of the disease (healthy, mild, moderate, severe, and proliferative). To this aim, Deep Neural Networks (DNN) have been proposed. In particular, in the uni-modal approach, a single pre-trained ConvNet is used to extract the final feature representation. In the multi-modal approach, the final feature representation is obtained by blending deep features extracted from multiple ConvNets.

One main factor that hampers the effectiveness of CI methods in the medical domain is the imbalanced nature of medical data due to the non-uniform distribution of the number of instances per class. The paper entitled "Integrating Enhanced Sparse Autoencoder-Based Artificial Neural Network Technique and Softmax Regression for Medical Diagnosis" [9] addresses the problem of imbalance in medical datasets to create robust models for the prediction of different diseases. The authors propose an approach that integrates an enhanced sparse autoencoder (SAE) for effective feature learning and an optimized Softmax regression for robust classification. When employed for the prediction of three different diseases, namely chronic kidney disease, cervical cancer, and heart disease, the proposed method provides higher test accuracies compared to other machine learning algorithms. In addition, the paper "Two-Stage Monitoring of Patients in Intensive Care Unit for Sepsis Prediction Using Non-Overfitted Machine Learning Models" [10] addresses the problem of unbalanced clinical data. In this case, the data concern patients in the Intensive Care Unit (ICU) and were collected within the PhysioNet/Computing in Cardiology Challenge 2019 to face the problem of early detection of sepsis. Only 2% of the records in the labeled clinical dataset carry the sepsis label, leading to a highly unbalanced dataset. To address these issues, the authors propose a method using two separate ensemble models to take into account the amount of time the patients spent in the ICU. A total of 44 different methods, based on decision trees, naive Gaussian Bayes, SVM, and ensemble learners, are compared. Results show the effectiveness of the proposed method. Moreover, the considered machine learning models return comparable utility score values when the number of features is reduced, suggesting that feature engineering is necessary.

Long-term electrocardiogram (ECG) is used to detect Premature Ventricular Contraction (PVC) in the paper "Searching for Premature Ventricular Contraction from Electrocardiogram by Using One-Dimensional Convolutional Neural Network" [11]. A one-dimensional convolutional neural network (CNN) has been used for the prediction tasks. It avoids the pre-processing phases on data that are required by common machine learning algorithms, and it is able to directly analyze raw data while extracting the most relevant features from it. The diagnostic system achieves high performance (accuracy of 99.64%).

Finally, a multistage support vector machine model has been proposed for early recognition of Unipolar Depression (UD) in the article "Realizing an integrated multistage support vector machine model for augmented recognition of unipolar depression" [12]. A pre-processing phase for feature ranking is implemented in order to reduce the data dimensionality. Comparison with other machine learning algorithms has shown that the proposed approach is effective in correctly identifying the disease and outperforms them. Moreover, the recursive feature selection method has proved to be able to improve the accuracy of the classifiers.

**Acknowledgments:** The guest editors Gabriella Casalino and Giovanna Castellano would like to thank the authors for their contributions, the reviewers for their effort in reviewing the manuscripts, and the editorial staff of the MDPI journal *Electronics* for their support in producing this Special Issue. Gabriella Casalino acknowledges funding from the Italian Ministry of Education, University and Research through the European PON project AIM (Attraction and International Mobility), nr. 1852414, activity 2, line 1. This work was partially supported by INdAM GNCS within the research project "Computational Intelligence methods for Digital Health". The guest editors are with the CITEL—Centro Interdipartimentale di Telemedicina, University of Bari Aldo Moro.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Explaining Ovarian Cancer Gene Expression Profiles with Fuzzy Rules and Genetic Algorithms**

**Arianna Consiglio <sup>1,</sup>\*, Gabriella Casalino <sup>2</sup>, Giovanna Castellano <sup>2</sup>, Giorgio Grillo <sup>1</sup>, Elda Perlino <sup>1</sup>, Gennaro Vessio <sup>2</sup> and Flavio Licciulli <sup>1</sup>**


**\*** Correspondence: arianna.consiglio@ba.itb.cnr.it

**Abstract:** The analysis of gene expression data is a complex task, and many tools and pipelines are available to handle big sequencing datasets for case-control (bivariate) studies. In some cases, such as pilot or exploratory studies, the researcher needs to compare more than two groups of samples consisting of a few replicates. Both standard statistical bioinformatic pipelines and innovative deep learning models are unsuitable for extracting interpretable patterns and information from such datasets. In this work, we apply a combination of fuzzy rule systems and genetic algorithms to analyze a dataset composed of 21 samples and 6 classes, useful for approaching the study of expression profiles in ovarian cancer, compared to other ovarian diseases. The proposed method is capable of performing a feature selection among genes that is guided by the genetic algorithm, and of building a set of *if-then* rules that explain how classes can be distinguished by observing changes in the expression of selected genes. After testing several parameters, the final model consists of 10 genes involved in the molecular pathways of cancer and 10 rules that correctly classify all samples.

**Keywords:** computational intelligence; classification; fuzzy inference systems; genetic algorithms; next-generation sequencing; ovarian cancer; interpretable models

### **1. Introduction**

Among the most common cancers in women, ovarian cancer is the most lethal, due to its late symptoms and diagnosis, and its onset can be a primary tumor or secondary tumor of the fallopian tube or endometrium [1]. Based on histopathology and molecular genetic alterations, ovarian carcinomas are divided into five main types that can be considered as different diseases: high-grade serous, endometrioid, clear cell, mucinous, and low-grade serous carcinomas [2]. There is currently no reliable test to diagnose asymptomatic ovarian cancer, and any study of the molecular processes that are active in its proliferating cells can contribute to the identification of new molecular biomarkers for efficient diagnosis, prognosis, and therapy.

Next-Generation Sequencing (NGS) technologies provide researchers with experimental datasets that describe the molecular profile of cancerous cells by allowing them to estimate the expression of genes in a tissue sample, which is the number of copies of a gene that are present as Ribonucleic Acid (RNA) fragments and decoded by the sequencer. Standard bioinformatic pipelines are used to compute gene expressions and to compare samples for significant expression differences, with differential expression analysis [3].

However, NGS experiments are quite expensive and require further laboratory validation of the most significant results, as they can present noise in the data that stems from the inherent complexity of the technology. This is why many researchers use NGS with a limited number of samples to extract the most evident molecular activities and validate those results only on a larger number of samples. Moreover, NGS results are highly dependent on the laboratory experimental settings used, and the datasets produced with different technical conditions (sequencer type, tissue type, tissue conservation, etc.) are not directly comparable. This is why NGS data are mainly exploited for case-control studies with only two conditions.

**Citation:** Consiglio, A.; Casalino, G.; Castellano, G.; Grillo, G.; Perlino, E.; Vessio, G.; Licciulli, F. Explaining Ovarian Cancer Gene Expression Profiles with Fuzzy Rules and Genetic Algorithms. *Electronics* **2021**, *10*, 375. https://doi.org/10.3390/electronics10040375

Academic Editor: Jun Liu Received: 31 December 2020 Accepted: 30 January 2021 Published: 4 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Due to the digitalization process, the biomedical domain represents a source of valuable data. A growing amount of this data is generated every day, ranging from vital parameters to omics data and output from imaging devices. Therefore, machine learning techniques have been used extensively in the medical domain, as they can automatically derive useful models for making predictions, and detecting patterns that reveal hidden relationships in the data [4]. Automatic systems have been proposed to support medical experts while avoiding repetitive tasks. Moreover, thanks to the availability of this huge amount of data and the high computational capabilities of modern systems, novel insights, which could not have been discovered through manual analyses, have been returned by automatic techniques.

Machine learning algorithms have been applied to biological data of the most varied diseases such as neurodegenerative diseases [5,6] and cancer [7,8], just to name a few. Computational Intelligence is a research branch dealing with nature-inspired algorithms, such as fuzzy logic, neural networks, and evolutionary algorithms, which can process numerical data to address complex problems that may be difficult to solve with traditional machine learning algorithms [9]. Neural networks have gained a lot of attention in recent years and their "deep" variants have led to Deep Learning (DL), which has redefined the state-of-the-art performance in several domains, including the medical one [10]. In particular, DL algorithms have been successfully applied to omics data for early disease prediction or the extraction of meaningful biomarkers [11,12]. However, DL techniques have two main drawbacks: they are not interpretable, even though research is moving in this direction [13], and they need a huge amount of data to learn a model.

In contrast, fuzzy logic has been widely used in the medical field due to its ability to represent the uncertainty and vagueness inherent in medical concepts and in the clinician's way of reasoning [14]. It differs from classical Boolean logic in that each object can partially belong to a given set. A membership matrix is used to represent the degree to which each object belongs to each set [9]. Moreover, a Fuzzy Inference System (FIS) is a fuzzy logic-based reasoning system that uses linguistic variables and linguistic terms to represent vague and uncertain concepts that are involved in the reasoning, thus leading to natural language-based explanations. In fact, the knowledge base of these systems is composed of fuzzy variables whose values are represented through fuzzy sets and *if-then* rules that represent the reasoning [14]. On the other hand, Genetic Algorithms (GAs) are heuristic methods inspired by natural evolution in which the fittest individuals are selected for the reproduction of the next generation of the population [9]. They are commonly used to solve complex problems that cannot be handled with procedural methods due to the high complexity of the task. GAs are typically used in Bioinformatics to select a subset of more informative genes; in fact, omics data usually produce thousands of variables for each single sample in an experimental investigation. This curse of dimensionality affects automatic techniques, so dimensionality reduction techniques are often used to extract the most significant subset of genes for the specific task [15]. Thanks to their ability to gradually refine solutions through natural selection, GAs are not biased by human knowledge of the problem and are effectively used for feature selection [16,17].

In this study, we describe the results of our analyses performed on a set of data presented in previous work [18]. This dataset contains the sequencing of 21 human ovarian tissue samples from 12 cancer and 9 non-cancer samples, grouped into 6 diagnostic classes. Due to the large number of classes and the low number of replicates for each class, this dataset is quite difficult to analyze with standard bioinformatic tools. In this paper, we aim to extract useful information from this dataset. The goal of the research was to characterize ovarian cancer tissues by comparing them with other ovarian and uterine tissues and to find a panel of genes capable of discriminating classes and providing information on the pathologic conditions. The method proposed to analyze this dataset is based on genetic algorithms for the selection of features and fuzzy rule-based systems for the classification task, i.e., the diagnosis of the 6 classes of samples. The proposed method aims to provide experts with an interpretable model that can help them, in further laboratory studies, to clarify still unknown mechanisms behind the pathology.

To the best of our knowledge, this is the first time fuzzy logic and genetic algorithms have been combined for ovarian cancer classification. Furthermore, this is the first time this dataset has been analyzed using automatic techniques. Therefore, both biological analyses and computational intelligence techniques have been applied in this paper to verify the effectiveness of the derived results.

The rest of this paper is organized as follows. Section 2 describes the dataset that has been analyzed through the bioinformatic pipeline, and the computational intelligence techniques employed. Section 3 reports the results obtained with the proposed methodologies. Finally, conclusions are summarized in Section 4.

### **2. Materials and Methods**

In this section, we will present the dataset employed in this work and the techniques used to analyze it and evaluate the results obtained.

#### *2.1. Dataset Description and Bioinformatic Preprocessing*

The dataset used in this work was presented in a previous paper [18]. It was produced with the Illumina HiSeq2500 sequencer and consists of approximately 30 million paired-end reads (RNA fragments) per sample.

The sequenced samples were selected from 21 Formalin-Fixed Paraffin-Embedded tissues, belonging to 6 classes that are the target of our investigation:


The last three groups are non-cancerous samples. The dataset is represented by raw FASTQ files (text files containing the RNA fragments detected by the sequencer), and the gene expressions (RNA counts) were estimated with the bioinformatic tool STAR [19], combined with RSEM [20] and MultiDEA [21].

After gene expression estimation, the final dataset has 21 samples and over 45 thousand genes (features), but many of them are filtered out for low intensity, as low expressions are not reliable for evaluating significant changes in gene values. By applying the standard filter of gene expressions > 50, the feature space of this dataset is reduced to about 9 thousand genes. The main goal of expression profiling is to identify all the genes that are expressed in the samples under study and to extract the genes that show changes in expression that may be correlated with the experimental conditions. The gene functions, activities, and interactions are collected in molecular pathways and stored in pathway databases, such as KEGG [22] or BioCarta [23].

#### *2.2. Differential Expression Analysis*

Differential expression analysis aims to verify whether an observed change in RNA counts (gene expressions) between two experimental conditions is statistically significant. Changes in expression are correlated to the activation of a series of actions among molecules in the cell (a pathway) that change the state of the cell in response to a stimulus.

Following a standard bioinformatic workflow, differential gene expression analysis was performed with DESeq2 [24]. Significant changes are called overexpressions if the expression of the gene increases and underexpressions if it decreases, and the magnitude of the change is evaluated by fold change computation, which is the logarithm of the ratio of expression between two conditions. When expression values are estimated from RNA counts, they are proportional to the length of the gene that produced the fragments detected. The fold change metric is independent of gene length, but the significance of its result must be statistically tested. Only genes with mean expression > 50 were considered in the analysis, and a change of at least a halving or doubling of expression with a *p*-value < 0.05, after multiple testing adjustment by False Discovery Rate [25], was considered statistically significant.
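The selection criteria described above can be illustrated with a minimal Python sketch. It is not the DESeq2 procedure: the *p*-values are taken as given inputs, the fold change is computed on plain group means, and the function names (`benjamini_hochberg`, `significant_genes`) are hypothetical. It only shows how the three conditions (mean expression > 50, at least a halving or doubling, adjusted *p*-value < 0.05) combine.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (False Discovery Rate)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def significant_genes(cond_a, cond_b, p_values, min_mean=50.0, alpha=0.05):
    """Rows are genes, columns are replicates of one condition.

    Returns (log2 fold change, adjusted p-values, boolean significance mask).
    """
    cond_a = np.asarray(cond_a, dtype=float)
    cond_b = np.asarray(cond_b, dtype=float)
    mean_a = cond_a.mean(axis=1)
    mean_b = cond_b.mean(axis=1)
    expressed = np.hstack([cond_a, cond_b]).mean(axis=1) > min_mean
    # log2 fold change, computed only where both means are positive.
    lfc = np.zeros(mean_a.size)
    valid = expressed & (mean_a > 0) & (mean_b > 0)
    lfc[valid] = np.log2(mean_b[valid] / mean_a[valid])
    padj = benjamini_hochberg(p_values)
    # |log2 FC| >= 1 corresponds to a halving or doubling of expression.
    mask = expressed & (np.abs(lfc) >= 1.0) & (padj < alpha)
    return lfc, padj, mask
```

On the logarithmic scale a doubling and a halving are symmetric (+1 and -1), which is why a single threshold on the absolute value covers both directions.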

#### *2.3. Fuzzy Rule-Based System*

The classification task was performed on subsets of genes (selected by the genetic algorithm, as will be described in Section 2.4) with a fuzzy rule-based system. A Fuzzy Inference System (FIS) is a popular rule-based method for modeling uncertain and imprecise information. In the medical domain, linguistic terms are used to represent patients' symptoms and suggestions are derived through fuzzy inference mechanisms. The domain knowledge is expressed in the knowledge base in the form of *if-then* fuzzy rules. The strength of these systems is their "interpretability", that is the ability to easily express the reasoning behind the rules in a way that is understandable by humans [26]. This is a critical aspect in medical applications as experts need to understand how certain results are obtained to trust the technology.

The classifier was implemented with the "frbs" R package [27]. As the aim of the work was to analyze the gene expression variations, the input variables are the genes selected through the GA. As variations are usually considered to be high (overexpression) or low (underexpression), we have defined the number of fuzzy terms for each gene domain as three (low, medium, and high expression). The medium expression fuzzy set is centered on the mean expression of the gene. The fuzzy sets are equidistant Gaussian functions, and the extreme sets have their center defined by the most extreme values of their gene domain. As domain experts are interested in observing the fold change rate, to linearly represent the increase and decrease of expression (for example, a halving or doubling of expression), we have defined the fuzzy sets on a logarithmic transformation of the estimated expressions, as shown in Figure 1.

**Figure 1.** Three fuzzy sets cover the domain of gene expression, thus describing underexpression for low values, medium for the mean expression of the gene, overexpression for high values.
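As an illustration of the three fuzzy sets in Figure 1, the following Python sketch builds equidistant Gaussian membership functions on log2-transformed expression values. It is not the "frbs" implementation. Two choices are assumptions made only for this example: the medium set is placed at the midpoint of the log range (which coincides with the mean only for symmetric data), and the width is chosen so that neighbouring sets cross at membership 0.5. Expression values are assumed to be strictly positive.

```python
import numpy as np

def gaussian_membership(x, center, sigma):
    """Gaussian membership degree of x in a fuzzy set centered at `center`."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def build_fuzzy_sets(expression, n_terms=3):
    """Return (centers, sigma) of equidistant sets on log2 expression."""
    log_expr = np.log2(np.asarray(expression, dtype=float))
    # Extreme sets are centered on the extreme values of the gene domain.
    centers = np.linspace(log_expr.min(), log_expr.max(), n_terms)
    step = centers[1] - centers[0]
    # Assumption: neighbouring sets intersect at membership 0.5.
    sigma = step / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return centers, sigma

def fuzzify(value, centers, sigma):
    """Membership of one expression value in the low/medium/high sets."""
    degrees = gaussian_membership(np.log2(value), centers, sigma)
    return dict(zip(("low", "medium", "high"), degrees))
```

Because the sets live on the log scale, moving from one center to the next corresponds to a constant multiplicative change in expression, which matches the fold change view used by domain experts.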

The output of the model is a set of *if-then* rules in which the input fuzzy variables and their values (fuzzy terms) are concatenated in the premise. The consequent contains the output variable and its value, which in our case is discrete and corresponds to the 6 diagnoses of the samples (KE, KS, KSB, CS, EN, N). Table 1 shows an example of a fuzzy rule where the selected genes assume Medium/Overexpressed/Underexpressed values, and the target class is KE: endometrioid carcinoma.


**Table 1.** Example of a fuzzy rule for the classification of samples based on gene expression.
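To make the rule structure concrete, the following Python sketch evaluates *if-then* rules on already fuzzified inputs. It assumes that the premise terms are combined with the minimum operator and that the class of the strongest rule wins; the inference settings actually used with "frbs" may differ. The gene names `GENE_A` and `GENE_B` and the membership degrees are placeholders, not values from the study.

```python
def rule_strength(memberships, premise):
    """Firing strength of one rule: minimum over its premise terms."""
    return min(memberships[gene][term] for gene, term in premise.items())

def classify(memberships, rules):
    """Return the class of the rule with the highest firing strength."""
    best = max(rules, key=lambda rule: rule_strength(memberships, rule["if"]))
    return best["then"]

# Hypothetical membership degrees for one sample.
sample = {
    "GENE_A": {"low": 0.1, "medium": 0.3, "high": 0.9},
    "GENE_B": {"low": 0.8, "medium": 0.4, "high": 0.05},
}
# Hypothetical rules; only the class labels KE and N come from the text.
rules = [
    {"if": {"GENE_A": "high", "GENE_B": "low"}, "then": "KE"},
    {"if": {"GENE_A": "medium", "GENE_B": "medium"}, "then": "N"},
]
predicted = classify(sample, rules)
```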

Due to the low number of samples, the leave-one-out cross-validation method was used to assess the accuracy of the fuzzy classifier.
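Leave-one-out cross-validation can be sketched in Python as follows. The nearest-centroid classifier is only a stand-in so that the example runs on its own; in the study the model being validated is the fuzzy rule-based classifier.

```python
def leave_one_out_accuracy(samples, labels, fit, predict):
    """Train on all samples but one, test on the held-out one, and repeat."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def fit_centroids(xs, ys):
    """Stand-in classifier: compute the mean vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_centroid(model, x):
    """Assign x to the class with the closest centroid."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: squared_distance(model[y]))
```

With N samples the model is trained N times, which is affordable here precisely because the dataset is small.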

#### *2.4. Genetic Algorithm*

To preserve the interpretability of the fuzzy rule output, only a small number of genes should be included in the rules. The selection of these genes has been implemented with a genetic algorithm.

The evaluation of the most important and influential genes is a complex task because this feature selection task should take into account two important characteristics of NGS data: (1) gene expressions and their magnitude depend on gene length; (2) genes influence each other. These factors undermine the use of feature selection methods based on statistical assumptions such as variance evaluation. Our genetic algorithm can select the features considering multiple factors, suitably tuned by the fitness function.
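As an illustration of how a genetic algorithm can search over gene subsets, a minimal Python sketch follows. It is not the authors' R script: the encoding (a fixed-size tuple of gene indices), the union-based crossover, the elite fraction, and the mutation rate are all assumptions chosen to keep the example short.

```python
import random

def random_individual(n_genes, k, rng):
    """An individual is a sorted tuple of k distinct gene indices."""
    return tuple(sorted(rng.sample(range(n_genes), k)))

def mutate(individual, n_genes, rng):
    """Swap one selected gene for a gene not currently selected."""
    genes = list(individual)
    unused = [g for g in range(n_genes) if g not in individual]
    genes[rng.randrange(len(genes))] = rng.choice(unused)
    return tuple(sorted(genes))

def crossover(a, b, rng):
    """The child draws its genes from the union of both parents."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, len(a))))

def evolve(fitness, n_genes, k, pop_size, generations, seed=0):
    """Evolve gene subsets and return the fittest individual found."""
    rng = random.Random(seed)
    population = [random_individual(n_genes, k, rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: max(1, pop_size // 4)]  # preserved unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            child = crossover(rng.choice(elite), rng.choice(elite), rng)
            if rng.random() < 0.2:  # hypothetical mutation rate
                child = mutate(child, n_genes, rng)
            children.append(child)
        population = elite + children
    return max(population, key=fitness)
```

Because the elite is copied unchanged into the next generation, the best fitness in the population never decreases from one generation to the next.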

These are the main parameters of the genetic algorithm:


$$\text{Fitness} = \text{Accur} \times 0.5 + \text{Simpl} \times 0.3 + \text{Inter} \times 0.2 \tag{1}$$

where Accur is the accuracy of the model (number of correctly classified samples/total samples); Simpl is a value in [0,1] inversely proportional to the number of rules generated by the model (1 if the number of rules is equal to the number of classes), so that individuals with fewer rules are preferred; and Inter is a value in [0,1] that evaluates how many selected genes are relevant for the biomedical task under analysis: if a gene is already known to be involved in cancer molecular pathways (as defined by KEGG [22,28]), the model is rewarded with additional fitness. Initially, only the accuracy (Accur) of the model was considered, but the final individuals showed a large number of fuzzy rules; in fact, the number of fuzzy rules is strictly dependent on the selection of variables returned by the GA. We then introduced a factor that increases as the number of rules decreases (Simpl), which helped us to select final individuals with a minimum number of fuzzy rules. However, repeatedly running the genetic algorithm with different initial random seeds produced very different final individuals (only a few genes were present in all results), so we decided to inject biological information into the model. This was performed by selecting the genes involved in cancer (by extracting KEGG's cancer pathway from GSEA) and by adding another factor to the fitness function that increases when the individual contains those genes (Inter). The three terms are weighted and summed to obtain a total fitness in [0,1] and to give different (decreasing) importance to each element of the sum. We tested multiple weights and chose the final three shown in the formula to give slightly more importance to the classification accuracy and decreasing importance to the last two addends. This fitness function has been proposed to suit the classification task at hand.
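Equation (1) can be sketched in Python as below. The weights come from the formula, but the exact forms of Simpl and Inter are not given in the text: here Simpl is assumed to be the ratio between the number of classes and the number of rules (capped at 1), and Inter the fraction of selected genes that belong to the cancer-related set. Both are illustrative choices, and the function names are hypothetical.

```python
def simplicity(n_rules, n_classes):
    """Assumed form: 1 when rules == classes, decreasing as rules grow."""
    return min(1.0, n_classes / n_rules)

def interpretability(selected_genes, cancer_genes):
    """Assumed form: fraction of selected genes known to be cancer-related."""
    selected = set(selected_genes)
    return len(selected & set(cancer_genes)) / len(selected)

def fitness(accuracy, n_rules, n_classes, selected_genes, cancer_genes,
            weights=(0.5, 0.3, 0.2)):
    """Weighted sum of accuracy, simplicity and interpretability, in [0, 1]."""
    w_acc, w_simpl, w_inter = weights
    return (accuracy * w_acc
            + simplicity(n_rules, n_classes) * w_simpl
            + interpretability(selected_genes, cancer_genes) * w_inter)
```

Since each term lies in [0,1] and the weights sum to 1, the total fitness also lies in [0,1].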


The fitness evaluation is the most time-consuming operation as it must be performed on all individuals of each generation. As its processing is independent for each individual, parallel computing can be used to reduce the execution time of each generation. Indeed, we compared the execution times required to compute 100 generations of 400 individuals by using both serial and parallel processing (with a 64-core architecture). While the serial run took more than 4 h to complete, the parallel run ended after about 10 min, a speed-up of more than 20 times. The genetic algorithm was implemented with an R script and the R "parallel" package was adopted for parallel computing.
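As a rough check on these figures, taking the serial time as about 4 h (240 min) and the parallel time as about 10 min, the speed-up $S$ and the parallel efficiency $E$ on 64 cores would be approximately:

$$S \approx \frac{240\ \text{min}}{10\ \text{min}} = 24, \qquad E = \frac{S}{64} \approx 0.38$$

Both values are approximate because the reported times are rounded. An efficiency below 1 is expected, since only the fitness evaluation is parallelized and the remaining steps of each generation still run serially.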

### **3. Results**

In this section, we present the results of the elaboration performed on the ovarian cancer dataset. The data were analyzed with both a standard pipeline used by bioinformaticians and the model proposed in this paper. The analysis aims to extract information on changes in gene expression that can be useful for discriminating between different tissues, and thus to study the molecular mechanisms that differ in the samples.

As the dataset consists of only 3 samples for each class (6 samples in one case), the main objective is to highlight only the most important expression changes in an interpretable system that also takes into account the interactions among genes. The results obtained will also be discussed from a biomedical point of view.

#### *3.1. Differential Expression Analysis*

To give an idea of how complex and difficult it is to interpret an expression analysis with more than 2 classes, here we report some results of a standard differential expression analysis workflow we have applied (described in Section 2). This type of analysis allows one to highlight those changes of expression that show statistical significance in the comparison between two conditions. We have performed this analysis in two steps.

In the first step, we have compared each group with the complete set of samples not belonging to the selected group, to search for those expression variations that are typical of the selected group. This analysis describes how specific a tissue class is, and is useful for the researcher who needs to study the singular events that occur in one tissue class and not in any of the other classes analyzed, but it hides the events that occur in two or more classes and not in the others. The results are summarized in Table 2. The "Specific genes" column contains the number of genes that are differentially expressed only in that specific group.

**Table 2.** Results of the differential expression analysis performed on each group against all other data, considered together.


In the second step, we have compared each possible pair of groups to each other, to compute the differences of each tissue relative to another (Table 3). This analysis is more useful for the researcher who needs to select a set of biomarkers, i.e., a minimal set of genes that allows one to distinguish all the tissues of a study.

**Table 3.** Results of the differential expression analysis performed on each group versus each other data group, considered separately. Each cell contains the total number of genes that are significantly differentially expressed (and the partial counts of overexpressed and underexpressed genes).
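The two comparison schemes can be enumerated with a short Python sketch. The class labels are those of the dataset; the function names are hypothetical.

```python
from itertools import combinations

CLASSES = ["KE", "KS", "KSB", "CS", "EN", "N"]

def one_vs_rest(classes):
    """First step: each class against all remaining classes pooled together."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def pairwise(classes):
    """Second step: every unordered pair of classes."""
    return list(combinations(classes, 2))

first_step = one_vs_rest(CLASSES)
second_step = pairwise(CLASSES)
```

With 6 classes the first step yields 6 comparisons and the second 15, which helps explain why the combined output becomes hard to read.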


From this analysis, we can extract the information in Table 4. As we can see, these results are quite difficult to interpret and do not take into account the interactions among genes. Usually, at this stage, researchers analyze the molecular pathways of the differentially expressed genes and select a subset of genes for further study and validation; however, in this multiclass case this step is very complex. In Section 3.2, we will present the results obtained with our proposed model based on fuzzy rules and genetic algorithms.

**Table 4.** Number of differentially expressed genes present in grouped comparisons (1 = only one comparison, 2 = gene DE in 2 comparisons, etc.).
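A tally of this kind can be sketched in Python as follows, assuming the differentially expressed (DE) genes of each comparison are available as lists. The gene identifiers and the three comparisons in the example are placeholders, not results from the study.

```python
from collections import Counter

def comparisons_per_gene(de_results):
    """Count, for each gene, the number of comparisons in which it is DE."""
    counts = Counter()
    for genes in de_results.values():
        counts.update(set(genes))  # a gene counts once per comparison
    return counts

def histogram(counts):
    """Map 'number of comparisons' to 'number of genes with that count'."""
    return dict(sorted(Counter(counts.values()).items()))

# Hypothetical input: comparison -> list of DE gene identifiers.
example = {
    ("KE", "N"): ["g1", "g2"],
    ("KS", "N"): ["g2", "g3"],
    ("CS", "N"): ["g2"],
}
per_gene = comparisons_per_gene(example)
summary = histogram(per_gene)
```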


#### *3.2. Fuzzy Rule-Based System & Genetic Algorithm*

In this section, we describe the results obtained with the combination of genetic algorithms and fuzzy rules on the same dataset.

Table 5 summarizes the parameters tested for the execution of the genetic algorithm. Several values have been tested to speed up the execution of each generation, to avoid local minima, and to obtain final individuals with the highest fitness. In particular, the number of total individuals was increased to speed up the best individual's selection (because the number of preserved and brand new individuals also increased), and the mutation was inserted to avoid local minima. The number of generations, initially set at 1000, was increased to 2000, because only a minority of executions stopped for a small elite population (see stop criteria in Section 2.4). We also analyzed the composition of the population and observed that each feature appears at least once in the population after about 50 generations.

The number of features to be selected was based on the trade-off between the choice of a set of features capable of discriminating the 6 sample classes and the need to keep the cardinality of the set rather low, to preserve the interpretability of the fuzzy rules and to define a small number of genes to be selected for further biological study and laboratory validation. In addition, the domain experts wished to obtain a panel of around 10–15 genes capable of distinguishing the samples.


**Table 5.** All parameters tested for the Genetic Algorithm. The final parameters are presented in bold.

Several experiments were performed for the fitness function, as already detailed in Section 2.4, near Equation (1). Different fitness functions were compared and, based on the empirical analyses made, the one including accuracy, the number of rules obtained, and the involvement of cancer-associated genes was found to be the most suitable for our genetic algorithm. Moreover, a weighting mechanism has been used to give each addend a different importance. Indeed, we give slightly more importance to the classification accuracy and decreasing importance to the last two addends. The final parameters are shown in bold in Table 5.

The final individuals were selected based on accuracy only (100%), computed with leave-one-out cross-validation, then sorted by fitness. After repeating the genetic algorithm with different random seeds, we selected 72 best individuals. The final individuals are similar to each other for 78% of the selected features and differ on the remaining genes, and each individual is a subset of 10 out of the same 14 genes, listed in Table 6. The parameter that encouraged the model definition with respect to genes already known to be strongly involved in cancer pathways (as collected in KEGG) influenced the selection of 6 cancer-related genes in each individual. The remaining four genes (the first 4 in the table) are the most important in the classification task; in fact, they are present in each of the 72 individuals. The number of fuzzy rules automatically extracted for each best individual is always equal to 10.

**Table 6.** The genes selected by the genetic algorithm, sorted by frequency of occurrence in the final 72 individuals with the best accuracy and fitness. The genes known to be correlated to cancer are marked with (\*).


Table 7 lists the molecular pathways collected in the KEGG database and the genes involved. Moreover, MAPK9, MAPK1, KRAS, CBL, and EGFR are also involved in other molecular mechanisms active in cancer, such as choline metabolism, proteoglycan, and central carbon metabolism.

**Table 7.** The genes known to be involved in cancer, from the KEGG database of molecular pathways.


From a literature search, XPNPEP1, GATA4, DTX3L, and NPIPB12 also show some correlation with cancer. In particular: XPNPEP1 was found overexpressed in clear cell renal cell carcinoma [29]; multiple studies have shown that GATA4 is closely associated with tumorigenesis [30]; DTX3L is involved in cell proliferation, differentiation, and survival [31]; NPIPB12 has also been correlated to cancer [32].

Figure 2 shows an example of a set of rules defined by one of the final 72 individuals. As mentioned above, the final individuals all contain XPNPEP1, GATA4, DTX3L, and NPIPB12 and a different combination of the other genes. Moreover, all the final individuals exhibit a similar structure to the final rules. In particular:


Figure 3 shows two examples of fuzzy sets defined on MAPK9 and DTX3L, for KS data only. The MAPK9 gene (known to be strongly involved in cancer pathways) shows a tendency to overexpression, while the DTX3L gene shows an evident overexpression in KS data. This trend is correctly described by the fuzzy sets defined over the expression domain.

**Figure 2.** A set of fuzzy rules with accuracy = 100%, able to classify and describe the samples correctly.

**Figure 3.** The fuzzy sets defined over the gene expression of MAPK9 and DTX3L for the KS samples.

As can be seen, fuzzy rules are easily understood by users who are not technicians. Fuzzy systems can describe complex behaviors with a transparent description in terms of linguistic knowledge that is interpretable, i.e., easy to read and understand by human users [26]. If we observe the rules generated by the FIS, they clearly explain which genes and expression levels are involved in the activation of each rule. They are written by using terms coming from natural language, such as the names of the genes and the terms medium, under, and overexpression, that are commonly used by the domain experts, and the derived classes refer to different diseases, as classified by experts. This is a very desirable result as biologists have to analyze these outcomes. Indeed, all the results and comments that we were able to extract with this model based on the combination of fuzzy rule-based systems and genetic algorithms will be subject to further examinations and assessments by biologists and clinicians. Further laboratory validation of the expression of the 14 selected genes on a larger cohort of patients will allow the selection of the final set of genes useful for the definition of a final panel of biomarkers for ovarian cancer characterization.

### **4. Discussion and Conclusions**

Ovarian cancer is a complex multifactorial disease characterized by complex gene interactions. Different types of ovarian cancer are essentially distinct diseases, as indicated by differences in epidemiological and genetic risk factors, precursor lesions, patterns of spread, and molecular events during oncogenesis, response to chemotherapy, and prognosis. A previous study attempted to address this disease by producing NGS datasets of 6 different classes of samples from surgical ovarian tissues, but classical bioinformatic workflows are unable to extract easily interpretable information for studying the expression profiles of the genes involved in the disease. The low number of replicates for each group does not allow the application of algorithms for automatic pattern extraction such as Artificial Neural Networks, and their limitations in result interpretation do not make them suitable for studying the genes involved in the disease mechanisms.

In this paper, we have tried to extract a set of genes that can be used to distinguish the 6 classes of samples and also to provide an explanation of how their expression changes in the data. We have compared the results of the most used bioinformatic pipeline with our model, based on the extraction of fuzzy rules on a set of genes selected by a genetic algorithm. The bioinformatic pipeline is designed for binary classes of case-control studies, and it allows the selection of statistically significant differentially expressed genes, but the results obtained on our 6 groups are difficult to interpret and to use for the extraction of biological markers. Moreover, it does not take into account the correlation and interactions among genes. Our proposal extracts a set of fuzzy rules that are indeed easier to interpret and selects genes both considering their ability to distinguish samples and their known involvement in cancer pathways. We have chosen to exploit fuzzy sets for our model because they represent well the concept of overexpression and underexpression, and we have applied genetic algorithms for gene selection because they allow us to select the features through a random search in the feature space, guided by some factors that are not based on variance evaluation and statistical testing. The perfect accuracy achieved by our classification model can be justified considering the very small size of the dataset we have adopted, which limits the generalizability of our results. Unfortunately, collecting a large sample of data in this particular domain is an extremely difficult task. However, we believe that the results obtained on our experimental data are still very promising and pave the way for a working system capable of supporting domain experts in ovarian cancer evaluation.

The result of our work is that with our method it is possible to select a small subset of genes able to distinguish the 6 classes of samples and to define an interpretable set of rules that can be used by domain experts to further study the selected genes, their involvement in cancer and the possibility of using them as early biomarkers for ovarian cancer diagnosis. Another important achievement of our proposal is that it allows us to elaborate meaningful results even with a reduced number of replicates for each class. As an extension of this work, in the near future, we will apply our model to other NGS datasets and define a more flexible function for pathway information in the fitness function.

**Author Contributions:** Conceptualization, A.C., E.P. and F.L.; data curation, A.C. and E.P.; formal analysis, A.C., G.C. (Gabriella Casalino), G.G. and G.V.; software, A.C.; validation, G.C. (Gabriella Casalino), G.C. (Giovanna Castellano), G.G. and G.V.; supervision, F.L.; writing—original draft, A.C.; writing—review and editing, A.C., G.C. (Gabriella Casalino), G.C. (Giovanna Castellano), G.G., E.P., G.V. and F.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by INdAM GNCS within the research project "Computational Intelligence methods for Digital Health". G.C. (Gabriella Casalino), G.C. (Giovanna Castellano), A.C., and G.V. are members of the INdAM GNCS research group.

**Data Availability Statement:** The FASTQ files used in this work are available upon request from the authors of the paper that first described the dataset [18].

**Acknowledgments:** The authors thank E. Maiorano and L. Resta (Department of Emergency and Organ Transplantation, Operating Unit of Pathological Anatomy, University of Bari Aldo Moro, Bari, Italy) for their essential contribution of tissue samples without which no study would have been possible. G.C. (Gabriella Casalino), G.C. (Giovanna Castellano), and G.V. are members of the "CITEL-Centro Interdipartimentale di Telemedicina" research center of the University of Bari Aldo Moro.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

