**Algorithm 1** Fast non-dominated sorting

**Require:** Evaluated population *P* = {*P*1, ..., *PN*}, number of solutions in the population *N*.

1: **for** each *Pi*, *i* ∈ {1, ..., *N*} **do**

2: **for** each *Pj*, *j* ∈ {1, ..., *N*} **do**

3: **if** *Pi* dominates *Pj* (*Pi* ≺ *Pj*) **then**

4: increase the set of solutions which the current solution dominates: *SPi* ← *SPi* ∪ {*Pj*}

5: **else**

6: **if** *Pj* dominates *Pi* (*Pj* ≺ *Pi*) **then**

7: increase the number of solutions which dominate the current solution: *nPi* ← *nPi* + 1

8: **end if**

9: **end if**

10: **end for**

11: **if** no solution dominates *Pi* (*nPi* = 0) **then**

12: *Pi* is a member of the first front: F1 ← F1 ∪ {*Pi*}

13: **end if**

14: **end for**

15: *t* ← 1

16: **while** F*t* ≠ ∅ **do**

17: H ← ∅

18: **for** each *Pi* ∈ F*t* **do**

19: **for** each *Pj* ∈ *SPi* **do**

20: *nPj* ← *nPj* − 1

21: **if** *nPj* = 0 **then**

22: H ← H ∪ {*Pj*}

23: **end if**

24: **end for**

25: **end for**

26: *t* ← *t* + 1

27: F*t* ← H

28: **end while**

29: **return** the list of non-dominated fronts F
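For concreteness, a minimal Python sketch of this sorting procedure follows; the function and variable names are illustrative, and `dominates(a, b)` is assumed to test Pareto dominance between two solutions:

```python
def fast_non_dominated_sort(population, dominates):
    """Split `population` into non-dominated fronts (Algorithm 1).

    `dominates(a, b)` must return True iff solution `a` dominates `b`.
    Returns a list of fronts; each front is a list of solutions.
    """
    S = [[] for _ in population]   # S[i]: indices of solutions dominated by i
    n = [0] * len(population)      # n[i]: number of solutions dominating i
    fronts = [[]]

    for i, p in enumerate(population):
        for j, q in enumerate(population):
            if dominates(p, q):
                S[i].append(j)
            elif dominates(q, p):
                n[i] += 1
        if n[i] == 0:              # nothing dominates p: member of the first front
            fronts[0].append(i)

    t = 0
    while fronts[t]:
        next_front = []
        for i in fronts[t]:
            for j in S[i]:
                n[j] -= 1
                if n[j] == 0:      # j is dominated only by earlier fronts
                    next_front.append(j)
        fronts.append(next_front)
        t += 1

    return [[population[i] for i in front] for front in fronts[:-1]]
```

For the two-criteria pattern model considered here, `dominates` would compare patterns componentwise by own-class coverage (to be maximized) and opposite-class coverage (to be minimized).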

NSGA-II is one of the most popular multiobjective optimization algorithms. It is usually assumed that each individual in the population represents a separate solution to the problem; the population is thus a set of non-dominated solutions, from which one or more solutions can then be selected, depending on which criteria are given the highest priority. A distinctive feature of the proposed algorithm is, among other things, that the solution to the problem is the entire population, in which each member represents an individual pattern.

#### **Algorithm 2** An evolutionary algorithm for pattern generation

**Require:** The set of baseline observations *X* and values of control variables *Y*

1: Create a random parent population *P*0 of size *N* from *X* and *Y*

2: F ← *Fast-nondominated-sort*(*P*0)

3: Create a child population of size *N* using selection, crossover and mutation procedures: *Q*0 ← *Child*(*P*0)

⋮

$$\text{9:} \qquad I \leftarrow \sqrt{\mathrm{Cov}^{+}(P_{t+1})} - \sqrt{\mathrm{Cov}^{-}(P_{t+1})}$$

$$\text{10:} \qquad P_{t+1} \leftarrow P_{t+1} \cup \mathcal{F}_t$$

11: **end while**

⋮

17: **return** *Pt*+1
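Step 9 relies on the informativeness measure *I*. A minimal sketch of this criterion for a single pattern, assuming Cov⁺ and Cov⁻ are the counts of covered observations of the pattern's own class and of the opposite class:

```python
from math import sqrt

def informativeness(cov_own: int, cov_opp: int) -> float:
    """I = sqrt(Cov+) - sqrt(Cov-): rewards own-class coverage
    and penalizes coverage of the opposite class."""
    return sqrt(cov_own) - sqrt(cov_opp)

# Example: a pattern covering 25 own-class and 4 opposite-class observations
print(informativeness(25, 4))  # 5.0 - 2.0 = 3.0
```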

#### **3. Application of the Proposed Method to Problems in Healthcare**

The proposed approach belongs to interpretable machine learning methods. Therefore, experimental studies were carried out on problems in which the interpretability of the recognition model and the possibility of a clear explanation of the solution proposed by the classifier are of great importance. The results are shown on two datasets: breast cancer diagnosis (a dataset from the UCI repository [66]) and predicting complications of myocardial infarction (regional data [67]).

#### *3.1. Breast Cancer Diagnosis*

The problem of diagnosing breast cancer is considered on a sample collected in Wisconsin (Breast Cancer Wisconsin, BCW) [66]. Since the attributes in the data take numeric (integer) values, their binarization is necessary, that is, a transition to new binary features. Threshold-based binarization is used: based on the original value *x*, a new binary variable *xt* is constructed as follows:

$$x_t = \begin{cases} 1, & \text{if } x \ge t, \\ 0, & \text{if } x < t, \end{cases} \tag{7}$$

where *t* is a cut point (threshold value).
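A sketch of this transformation is shown below. The placement of the cut points is an illustrative assumption; with 8 cut points per attribute, the 9 original BCW attributes yield the 72 binary features reported in the next paragraph.

```python
import numpy as np

def binarize_thresholds(X: np.ndarray, n_cuts: int = 8) -> np.ndarray:
    """Threshold binarization per Equation (7).

    For each original column, n_cuts cut points t are placed strictly
    inside the column's value range (evenly spaced here, as an
    illustrative choice); each yields a binary feature x_t = 1 iff x >= t.
    """
    binary_cols = []
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        cuts = np.linspace(lo, hi, n_cuts + 2)[1:-1]  # drop range endpoints
        for t in cuts:
            binary_cols.append((X[:, j] >= t).astype(int))
    return np.column_stack(binary_cols)

# 9 integer attributes, 8 cut points each -> 72 binary features
X = np.random.randint(1, 11, size=(478, 9))
print(binarize_thresholds(X).shape)  # (478, 72)
```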

As a result of the binarization procedure, 72 binary attributes were obtained from the 9 initial attributes. The original dataset is randomly divided into training and test samples in a ratio of 70% to 30%, which results in 478 and 205 observations, respectively. To search for logical patterns, the objects of the training sample (of both the positive and negative classes) successively act as the baseline observation. For each baseline observation, the NSGA-II algorithm is run independently with the following parameters:


The final set of patterns comprises the first-front patterns from the last population of solutions; the first front can contain one or several patterns. The final set is thus assembled so as to cover the observations of the training sample: each observation is covered by at least one pattern from this set.
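This assembly step can be sketched as follows; `nsga2_first_front` is a hypothetical wrapper that runs the NSGA-II search for one baseline observation and returns its first front:

```python
def build_pattern_set(train_observations, nsga2_first_front):
    """Pool the first-front patterns over all baseline observations.

    Every training observation ends up covered by at least one pattern,
    since each pattern by construction covers its own baseline.
    """
    final_set = []
    for baseline in train_observations:   # both classes act as baselines
        front = nsga2_first_front(baseline, train_observations)
        final_set.extend(front)           # one or several patterns per front
    return final_set
```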

A series of experiments was carried out with different probabilities of drawing a one when initializing the control variables in the first population of the genetic algorithm (a sketch of this initialization follows Figure 5). The following probabilities were used: *p* ∈ {0.5, 0.3, 0.1}. A complete set of patterns was independently constructed for each probability. This set included the first-front patterns obtained by the NSGA-II algorithm with the training-sample observations successively taken as the baseline observation. The resulting complete set of patterns is shown in Figure 5 in the coordinates of the number of covered observations of a pattern's own class and of the opposite class.

**Figure 5.** Aggregate scatter diagram of the detected patterns with *p* = 0.5.
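The initialization itself can be sketched as independent Bernoulli draws for each control variable; the population size and the seed below are illustrative rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility of the example

def init_control_variables(pop_size: int, n_vars: int, p: float) -> np.ndarray:
    """Each control variable is set to one with probability p."""
    return (rng.random((pop_size, n_vars)) < p).astype(int)

for p in (0.5, 0.3, 0.1):
    pop = init_control_variables(pop_size=50, n_vars=72, p=p)
    print(p, round(pop.mean(), 3))  # fraction of ones is approximately p
```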

Figure 6 shows the resulting full set of patterns for the value *p* = 0.3.

**Figure 6.** Aggregate scatter diagram of the detected patterns with *p* = 0.3.

Figure 7 shows the resulting full set of patterns for the value *p* = 0.1.

**Figure 7.** Aggregate scatter diagram of the detected patterns with *p* = 0.1.

Table 1 shows the distribution of patterns by power, where power is understood as the number of covered training-sample observations of the pattern's own class. Data on the aggregate set of patterns are given without regard to class affiliation.


**Table 1.** Distribution of patterns in terms of power.

In the case of equiprobable drawing of zero and one during initialization, a large number of trivial patterns are found, that is, patterns that cover only the baseline observation and no other observations. Such patterns are overly selective. As the probability of drawing a one during initialization decreases, the selectivity of the obtained patterns decreases, thereby increasing their coverage. However, the increased coverage affects the accuracy of the obtained patterns, i.e., the number of covered observations from the opposite class grows.

The results of classifying the test observations compared to the actual values are shown in Table 2. The symbols "+" and "−" denote the class labels introduced above. The symbol "?" denotes the impossibility of classification because none of the patterns covers the observation.


**Table 2.** Confusion matrix for BCW problem.

Thus, the probability of drawing a one during the initialization of control variables in the first population of the multi-criteria genetic algorithm is established to have a significant influence: this parameter controls the selectivity of the detected patterns. The equiprobable drawing of zero and one leaves a significant proportion of baseline observations for which only patterns covering the baseline observation and no other training observations are found.

#### *3.2. Predicting the Complication of Myocardial Infarction*

The problem of predicting the development of complications of myocardial infarction is considered [67]. Such datasets are often significantly asymmetric: one of the classes greatly outnumbers the other in the number of objects. In our case, the sample contains data on 1700 cases with an uneven division into classes: 170 cases with complications and 1530 cases without. Each case in the initial sample is described by 127 attributes containing information about the history of each patient, the clinical picture of the myocardial infarction, electrocardiographic and laboratory parameters, drug therapy, and the course of the disease in the first days of myocardial infarction. The data include the following types: textual data, integer values, rank-scale values with a known range, and real and binary values. A more detailed description of the attributes is given in Table 3.

The problem of predicting atrial fibrillation (AF) is solved; it is described by variable 116 (the target variable: 1 means the occurrence of atrial fibrillation, 0 means its absence). Owing to the nature of the attributes, variables 94, 95, 97, 98, 104, 105, 107, and 108 were excluded, since their values could be obtained after the occurrence of complications [67]. The exclusion of these attributes significantly complicates the forecasting task. Thus, the number of predictors in the initial data was 107 variables.

The properties of the data used are as follows:


Two approaches to data processing were implemented when constructing the classifier, as described below.


**Table 3.** Description of sample attributes for myocardial infarction problem.

3.2.1. First Approach to Data Preparation (Complete Data and Handling of Missing Values)

The data contained a significant number of missing values. Columns with more than 100 missing values were excluded (47 variables). The remaining variables contained rank-scale and binary values. Missing values in columns containing 100 or fewer gaps were filled with the mode value. The prepared dataset thus contained 60 variables (not including the target).
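A sketch of this preparation step, assuming the sample is loaded into a pandas DataFrame (the target name is illustrative):

```python
import pandas as pd

def prepare_complete_data(df: pd.DataFrame, target: str,
                          max_missing: int = 100) -> pd.DataFrame:
    """Drop feature columns with more than max_missing missing values,
    then fill the remaining gaps with each column's mode."""
    features = df.drop(columns=[target])
    keep = features.columns[features.isna().sum() <= max_missing]
    filled = features[keep].apply(lambda col: col.fillna(col.mode().iloc[0]))
    return filled.join(df[target])
```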

Binarization of the values presented on a rank scale was performed according to the following rule. From the original variable *x*, a new binary variable *xr* was constructed as follows:

$$x_r = \begin{cases} 1, & \text{if } x = r, \\ 0, & \text{if } x \neq r, \end{cases} \tag{8}$$

where *r* is a value of the original variable *x*, *r* ∈ *R*, and *R* is the known set of all possible values of the variable *x*. As a result of the binarization procedure, the total number of binary variables was 119.
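This rule produces one indicator column per admissible rank value, i.e., ordinary one-hot encoding. A sketch with pandas (the variable name is illustrative; note that `get_dummies` uses the observed values rather than the full known set *R*):

```python
import pandas as pd

def binarize_rank(col: pd.Series) -> pd.DataFrame:
    """One binary variable x_r per rank value r, per Equation (8)."""
    return pd.get_dummies(col, prefix=col.name, dtype=int)

# Example: a rank variable with values {0, 1, 2} yields three binary columns
print(binarize_rank(pd.Series([0, 2, 1, 1, 2], name="rank_var")))
```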

The original dataset is divided into training and test samples in a ratio of 70% to 30%, that is, 1190 and 510 observations, respectively. To search for patterns, the objects of the training sample (of both the positive and negative classes) successively act as the baseline observation. For each baseline observation, the NSGA-II algorithm is run independently with the following parameters:


More resources are allocated to the search for positive-class patterns. The sizes of the parent and descendant populations are 50, and the number of generations is 100. As a result, a complete set of patterns was obtained, containing 1961 positive-class patterns and 6148 negative-class patterns. A classifier is built that decides on a new observation by simple voting. The results of classifying the test observations compared to the actual values are shown in Table 4. The symbols "+" and "−" denote the class labels introduced above.


**Table 4.** Confusion matrix for AF problem (first approach).

Thus, the classifier built on the data in which missing values were processed did not show an acceptable result in classifying positive-class objects, which may be caused by the deletion of essential data during missing-value processing. Since logical patterns can classify data with missing values, a different approach to data preparation was used, which is described below.

#### 3.2.2. Second Approach to Data Preparation (Reduced Sampling without Processing Missing Values)

The second approach to data preparation keeps the missing attribute values as they are in the original sample. In addition, the class imbalance problem is solved by randomly selecting a subset of objects of the prevailing class equal in cardinality to the set of objects of the other class.

As a result, the original sample was reduced to 338 cases with an equal number of instances of each class. The observations are described by 106 variables (not including the target). In the reduced sample, 8 variables take either a single value or a single value plus gaps, so these variables were excluded from consideration. Among the remaining variables, 16 are rank variables, 12 are real, and the rest are binary. Each rank variable is transformed into several new binary variables, whose number is determined by the number of admissible ranks. From each real variable, four new binary variables are built, with threshold values that divide the range of the original variable into equal parts. As a result of the binarization procedure, the total number of variables was 200 (excluding the target variable).
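A sketch of the class balancing and the binarization of real variables follows; the exact placement of the four thresholds is an assumption, since the text only states that they split each variable's range into equal parts:

```python
import numpy as np
import pandas as pd

def undersample(df: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Randomly reduce the prevailing class to the size of the minority class."""
    counts = df[target].value_counts()
    minority = counts.idxmin()
    majority = df[df[target] != minority].sample(n=counts.min(), random_state=seed)
    return pd.concat([df[df[target] == minority], majority])

def binarize_real(col: pd.Series, n_thresholds: int = 4) -> pd.DataFrame:
    """Four binary features per real variable, with evenly spaced
    interior thresholds (assumed placement); gaps stay missing."""
    lo, hi = col.min(), col.max()
    cuts = np.linspace(lo, hi, n_thresholds + 2)[1:-1]
    out = {}
    for k, t in enumerate(cuts, start=1):
        ind = (col >= t).astype(float)
        ind[col.isna()] = np.nan      # do not impute missing values
        out[f"{col.name}_ge{k}"] = ind
    return pd.DataFrame(out)
```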

The data were divided into training and test samples in a ratio of 80% to 20%, giving 270 and 68 observations, respectively, with equal numbers of positive- and negative-class observations in each. For each training-set observation, the NSGA-II algorithm is run independently with the following parameters:


As a result, the cumulative set of first-front patterns includes 2259 rules for the positive class and 2403 for the negative class. Classifiers are built on reduced pattern sets selected according to the informativeness (*I*) of the revealed patterns. Simple voting of the patterns classifies a new observation. If an attribute value fixed in a pattern is unknown in the new observation, we assume that this pattern does not cover the observation. The classification results for different informativeness thresholds are shown in Table 5.
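A sketch of this decision rule; the pattern representation (a dict of fixed binary attribute values plus a class label and an informativeness score) is illustrative:

```python
def covers(pattern: dict, x: dict) -> bool:
    """A pattern covers x iff every fixed attribute value matches;
    a missing (None) attribute value counts as no coverage."""
    return all(x.get(attr) == val for attr, val in pattern["conditions"].items())

def classify(x: dict, patterns: list, i_threshold: float) -> str:
    """Simple voting among patterns whose informativeness passes the threshold."""
    votes = {"+": 0, "-": 0}
    for p in patterns:
        if p["I"] >= i_threshold and covers(p, x):
            votes[p["label"]] += 1
    if votes["+"] == votes["-"]:
        return "?"                    # no decision can be made
    return "+" if votes["+"] > votes["-"] else "-"

# Illustrative usage: x3 is missing, so the negative pattern does not fire
patterns = [
    {"conditions": {"x1": 1, "x7": 0}, "label": "+", "I": 4.2},
    {"conditions": {"x3": 1}, "label": "-", "I": 5.1},
]
print(classify({"x1": 1, "x3": None, "x7": 0}, patterns, i_threshold=4))  # "+"
```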


**Table 5.** Confusion matrix for AF problem (second approach).

Thus, the greatest accuracy is achieved when selecting patterns with informativeness greater than 2 or greater than 4. Since correctly identifying positive-class objects (true positive rate, or sensitivity) is more important than identifying negative ones (true negative rate, or specificity), the best option is to select patterns with informativeness greater than 4, for which the accuracy on the positive class is higher. The obtained classification accuracy exceeds the accuracy reported in [44].

Let us explore some characteristics of the patterns selected with *I* ≥ 4. Table 6 shows the number of such patterns as well as their distribution by degree, that is, by the number of binary variables fixed in a pattern.

**Table 6.** Distribution of the patterns by degree.


Figure 8 shows the selected sets of patterns in the coordinates of the number of covered observations of a pattern's own class and of the opposite class, for the training sample.

**Figure 8.** Cumulative scatter diagram of patterns on the training set.

Figure 9 shows the same sets of patterns in the same coordinates for the test sample.

**Figure 9.** Cumulative scatter diagram of patterns on the test sample.

To evaluate the efficiency of the proposed approach, we carried out a comparative study with several widely used machine learning classification algorithms. We compared the proposed multi-criteria genetic algorithm (MGA-LAD) with the Support Vector Machine (SVM), C4.5 decision trees (J48), Random Forest (RF), Multilayer Perceptron (MP), and Simple Logistic Regression (LR) methods. The tests were performed on the following datasets from the UCI Machine Learning Repository [66]: Wisconsin breast cancer (BCW), Hepatitis, Pima Indian diabetes (Pima), and Congressional voting (Voting). Results using 10-fold cross-validation are presented in Table 7. Here, the "ML" column contains the result of the algorithm that achieves the highest accuracy among the commonly used machine learning algorithms; the best algorithm is given in parentheses. Another approach used for comparison is logical analysis of data (LAD-WEKA in the WEKA package [68]) at various fixed values of the fuzziness φ (an upper bound on the number of points from the other class covered by a pattern, as a percentage of the total number of points covered by the pattern).


**Table 7.** Classification accuracy of the compared algorithms.

The value of the fuzziness parameter φ affects the size, coverage, and informativeness of the resulting patterns and ultimately the classification accuracy. In LAD, it is necessary to find many *a*-patterns based on different baseline observations, and for each such pattern the most appropriate fuzziness value may differ. Using the two-criteria model and the corresponding optimization algorithm allows us to find a set of Pareto-optimal patterns without fixing the fuzziness value, which expands the possibilities of LAD and can improve the classification accuracy.
