**1. Introduction**

Logical analysis of data (LAD) is a methodology for processing a set of observations, or objects, some of which belong to a specific subset (positive observations), while the rest do not belong to it (negative observations) [1]. These observations are described by features, generally numerical, nominal, or binary. Logical analysis of data is performed by detecting patterns: logical expressions that are true for positive (or negative) observations and false for negative (or, respectively, positive) observations [2]. Thus, regions in the feature space containing observations of the corresponding classes can be approximated using a set of positive and negative patterns. To identify such patterns, we use models and methods of combinatorial optimization [3–5].

Classification problems are one of the application fields of LAD [6,7]. From the point of view of solving classification problems, applying LAD can be considered as the construction
of a rule-based classifier. Like other rule-based classification approaches, such as decision trees and lists of rules, this approach has the advantage that it constructs a "transparent" classifier. Thus, it belongs to interpretable machine learning methods.

Two types of patterns can be distinguished [8]. The first type is homogeneous (pure, clear) patterns. A homogeneous pattern covers part of the observations of a particular class (for example, positive) and does not cover any observation of the other class (negative). However, when a classifier is constructed from real data, pure patterns often do not give good results. Due to noise in the data and the presence of inaccuracies, errors, and outliers, pure patterns may have too low a generalizing ability, and their use leads to overfitting. In such cases, the best results are shown by fuzzy (partial) patterns, in which the homogeneity constraint is weakened. Such weakening (relaxation) leads to the formation of more generalized patterns [4,9]. The pattern search problem is then treated as an optimization problem whose objective function is the number of covered observations of a given class, subject to a relaxed constraint on the coverage of observations of the opposite class.

Modern literature offers various approaches to the formation of patterns [8]. To generate patterns, enumeration-based algorithms [6,10], algorithms based on integer programming [2], or a mixed approach based on both integer and linear programming principles [11,12] are used. In [5], a genetic algorithm for generating patterns is described. In [13], an approach based on metaheuristics is presented, in which the key idea is the construction of a pool of patterns for each given observation of the training set. Logical analysis of data is used in many application areas, such as cancer diagnosis and coronary risk prediction [2,10,11,14], credit risk rating [11,15–17], assessment of the potential of an economy to attract foreign direct investment [18], predicting the number of airline passengers [19], fault prognosis and anomaly detection [20–23], and others.

Thus, in traditional approaches to the logical analysis of data with homogeneous patterns, each pattern covers part of the observations of the target class and no observations of the opposite class. Alternatively, the homogeneity constraint is relaxed into a non-coverage constraint bounding the number or ratio of covered observations of the opposite class. Such an optimization model is not symmetric in the sense that the problem focuses on the number of covered observations of the target class, while the coverage of the opposite class is treated as a constraint set at a certain level. Even with fuzzy patterns, such an approach remains within the domain of single-criterion optimization.

Recently, fuzzy logic theory has been widely developed in research. As mentioned in the literature [13], a certain degree of fuzziness seems to improve the robustness of a classification algorithm. In a fuzzy classification system, an object can be classified by applying a set of fuzzy rules based on its attributes. In building a fuzzy classification system, the most difficult task is to find a set of fuzzy rules pertaining to the specific classification problem [24]. Several studies proposed neural networks for extracting fuzzy rules [25–27], while the decision tree induction method was used in [28–30]. In [30], a fuzzy decision tree approach was proposed, which can overcome the overfitting problem without pruning and can construct soft decision trees from large datasets. However, these methods were found to be suboptimal for certain types of problems [24]. In [13,24], genetic algorithms for generating fuzzy rules were described; they were noted to be very robust owing to their global search.

Fuzzy classification has practical applications in various fields. For instance, in [31], a fuzzy rule-based system was used for the classification of diabetes. The authors of [32,33] applied fuzzy theory to energy management for electric vehicles. In [34], problems of a product processing plant related to the discovery of intrusions in a computer network were solved using a fuzzy classifier. In our study, fuzziness enters through the concept of partial patterns.

Both traditional and fuzzy approaches involve finding patterns that cover the target class as well as patterns that cover the opposite class. The methods for finding such patterns do not differ; in this sense, the approaches are symmetrical: the composition of patterns does not change when the target class is swapped with the opposite one. At the same time, there is a significant difference between the requirement for maximum coverage and the requirement for purity of patterns.

When analyzing real data, pure patterns may be ineffective, and the concept of a pattern is extended to fuzzy patterns that cover some of the negative objects. This extension is obtained by relaxing the "empty intersection with negative objects" constraint. Thus, the aim of our study is to construct a classification model based on LAD principles that imposes neither a strict nor a relaxed constraint on a pattern's coverage of opposite-class observations. Our model converts such a restriction (the purity restriction) into an additional criterion. We formulate the pattern search problem as a two-criteria optimization problem: maximizing the number of covered observations of a given class while minimizing the number of covered observations of the opposite class. Thus, our model has two competing criteria on the same scale, and solving the problem comes down to finding a balance between maximum coverage and purity of patterns. For this purpose, in this paper, we study the use of a multi-criteria genetic algorithm to search for Pareto-optimal fuzzy patterns. Our comparative results on medical test problems are not inferior, in terms of accuracy, to the results of commonly used machine learning algorithms.

The rest of the paper is organized as follows. In Section 2, we describe the known and new methods implemented in our research. We provide the basic concepts of logical analysis of data (Section 2.1), an approach to the formation of logical patterns (Section 2.2), a two-criteria optimization model for the pattern search problem (Section 2.3), and the concepts of the evolutionary algorithm NSGA-II (Non-dominated Sorting Genetic Algorithm II), developed to solve the multi-criteria optimization problem (Section 2.4). In Section 2.5, we examine the ability of evolutionary algorithms to solve the problem of generating logical patterns and describe our modifications of NSGA-II. In Section 3, we present the results of solving two applied classification problems. In Sections 4 and 5, we discuss and briefly summarize the work.

#### **2. An Evolutionary Algorithm for Pattern Generation**

Several approaches which resemble, in certain respects, the general classification methodology of LAD can be distinguished [35]. For instance, in [36], a DNF learning technique was presented that captures certain aspects of LAD. Some machine learning approaches are based on production or implication rules derived from decision trees, such as C4.5 rules [37], or on Rough Set theory, such as the Rough Set Exploration System [38]. The authors of [39] proposed the concept of emerging patterns, in which the only admissible patterns are monotonically non-decreasing. The subgroup discovery algorithm described in [40] maximizes a measure of the coverage of patterns, which is discounted by their coverage of the opposite class. The algorithm presented in [35] maximizes the coverage of patterns while limiting their coverage of the opposite class. In [35], the authors introduced the concept of fuzzy patterns considered in our paper.

#### *2.1. Main Stages of Logical Analysis of Data*

LAD is a data analysis methodology that integrates ideas and concepts from fields such as optimization, combinatorics, and Boolean function theory [10,41]. The primary purpose of logical analysis of data is to identify functional logical patterns hidden in the data, which involves the following stages [2,41].

Stage 1 (Binarization). Since LAD relies on the apparatus of Boolean functions [41], a restriction is imposed on the analyzed data, namely, the features of objects must take binary values. Naturally, in most real-life situations the input data are not necessarily binary [2,12] and, in general, may not even be numerical. It should be noted that, in most cases, even effective binarization [41] leads to the loss of some information [2,6].
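
As an illustration, the sketch below implements one common binarization scheme, in which cut-points are placed midway between adjacent values of a numerical feature that separate observations of different classes; the function names and the midpoint placement are our own illustrative assumptions, not the specific procedure of [41].

```python
# A minimal sketch of cut-point binarization for one numerical feature.
# Cut-points are placed midway between consecutive feature values that
# belong to observations of different classes; each cut-point t yields
# a binary feature "x >= t". This is an illustrative scheme, not the
# exact procedure of the cited works.

def candidate_cutpoints(values, labels):
    """Return cut-points between adjacent values with differing labels."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if v1 != v2 and y1 != y2:
            cuts.append((v1 + v2) / 2.0)
    return cuts

def binarize(value, cuts):
    """Map a numerical value to a tuple of binary features, one per cut."""
    return tuple(int(value >= t) for t in cuts)

# Usage: a feature with negative (0) and positive (1) observations.
vals = [1.0, 2.0, 3.5, 5.0]
labs = [0, 0, 1, 1]
cuts = candidate_cutpoints(vals, labs)    # [2.75]
print([binarize(v, cuts) for v in vals])  # [(0,), (0,), (1,), (1,)]
```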

Stage 2 (Feature extraction). The feature description of objects may contain redundant features, experimental noise, and artifacts generated by or associated with the binarization procedure [42,43]. Therefore, it is necessary to choose some reference set of features for further consideration.

Stage 3 (Pattern generation). Pattern generation is the central procedure of logical analysis of data [2]. At this stage, it is necessary to generate various patterns covering different areas of the feature space. These patterns should, however, be of sufficient quality, expressed as requirements on the pattern parameters (for example, complexity or degree). To implement this stage, a specific criterion is selected, as well as an optimization algorithm for constructing patterns that is relevant to the data under consideration [44].

Stage 4 (Constructing the classifier). Once the patterns are formed, a new observation is classified in the following way. An observation that satisfies the requirements of at least one positive pattern and does not satisfy the conditions of any negative pattern is classified as positive. Belonging to the negative class is defined similarly. In addition, it must be determined how a decision will be made for controversial objects, for example, by voting over the generated patterns. The set of patterns formed at the previous stage usually turns out to be too large and redundant for constructing a classifier, which leads to the problem of choosing a representative limited subset of patterns [2] that provides a level of classification accuracy comparable to that of the complete set of rules. In addition, a decrease in the number of patterns increases the interpretability of the resulting classifier in the subject area [45].
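
To make the decision rule concrete, the following sketch classifies a new observation by simple unweighted voting over positive and negative patterns; representing each pattern as a predicate and returning 0 for controversial objects are our illustrative assumptions, not the specific scheme adopted later in the paper.

```python
# A minimal sketch of classification by voting over patterns.
# Each pattern is a predicate over a binary feature vector; a pattern
# "fires" on an observation when it covers it. The unweighted vote and
# the tie handling are illustrative assumptions.

def classify(x, positive_patterns, negative_patterns):
    """Return +1, -1, or 0 (undecided) for a binary observation x."""
    pos_votes = sum(1 for p in positive_patterns if p(x))
    neg_votes = sum(1 for n in negative_patterns if n(x))
    if pos_votes > neg_votes:
        return +1
    if neg_votes > pos_votes:
        return -1
    return 0  # controversial object: no majority among fired patterns

# Usage: patterns as conjunctions of fixed literals, e.g. x1 = 1 AND x3 = 0.
p1 = lambda x: x[0] == 1 and x[2] == 0   # a positive pattern
n1 = lambda x: x[1] == 1                 # a negative pattern
print(classify((1, 0, 0), [p1], [n1]))   # +1
```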

Stage 5 (Validation). The last stage of LAD is not specific to it and is inherent in other data mining methods as well. The degree of conformity of the model to the initial data should be assessed, and its practical value should be confirmed. In applied problems, the initial data reflect the complexity and variety of real processes and phenomena.

Stage 4 works well on "ideal" data, that is, a reasonable amount of homogeneous data with no errors, no outliers (isolated observations lying very far from all others), and no gaps or inconsistencies. When processing real data, we must take several issues into account [46]. Features may be heterogeneous (of different types and measured on different scales).

Data can be presented in a more complex form than a standard object-feature matrix, for example, as images, texts, or audio. Various data preprocessing methods are used to extract features. Another type of object description consists of pairwise comparisons of objects instead of isolating and describing their features (featureless recognition [47]).

The number of objects may be significantly smaller than the number of features (data insufficiency). In systems with automatic collection and accumulation of data, the opposite problem arises (data redundancy): there is so much data that conventional methods process it exceptionally slowly. In addition, large amounts of data pose the problem of efficient storage.

The values of the features and of the target variable (the class label in the training sample) can be missing (gaps in the data) or measured with errors (inaccuracy in the data). Gross errors lead to the appearance of rare but large deviations (outliers). Data may also be inconsistent, meaning that, as a result of inaccuracy, objects with the same feature description belong to different classes.

Inconsistency and inaccuracy in the data dramatically reduce the effectiveness of approaches based on homogeneous patterns. We have to apply fuzzy patterns, in which a certain number of covered observations of the opposite class is allowed. At the same time, it is difficult to set this threshold, since the level of noise in the data is usually unknown.

#### *2.2. Formation of Logical Patterns*

We restrict ourselves to considering the case of two classes, *K*<sup>+</sup> and *K*<sup>−</sup>. Objects of class *K*<sup>+</sup> will be called positive sampling points, and objects of class *K*<sup>−</sup> negative ones. In addition, we assume that the objects *X* ∈ *K*<sup>+</sup> ∪ *K*<sup>−</sup> are described by *k* binary features, that is, *x*<sub>*i*</sub><sup>(*j*)</sup> ∈ {0, 1} for all *i*, *j*, where *j* is the index of the object and *i* = 1, ... , *k* is the index of the feature.

LAD uses terms, that is, conjunctions of literals (binary features *x*<sub>*i*</sub> or their negations 1 − *x*<sub>*i*</sub>). We say that a term *C* covers an object *X* if *C*(*X*) = 1. A logical positive pattern (or simply a pattern) is a term that covers positive objects and does not cover negative objects (or covers a limited number of negative objects). The concept of a negative pattern is introduced in a similar way.

Choose an object *a* ∈ *K*<sup>+</sup>, and let *a* = (*a*<sub>1</sub>, ... , *a*<sub>*k*</sub>) be the vector of feature values of this object. Denote by *Pa* a pattern covering the point *a*. The pattern is a set of feature values that are fixed and equal for all the objects covered by the pattern. To distinguish fixed and unfixed features in the pattern *Pa*, we introduce binary variables *Y*<sup>(*a*)</sup> = (*y*<sub>1</sub><sup>(*a*)</sup>, ... , *y*<sub>*k*</sub><sup>(*a*)</sup>) [48,49] as follows:

$$y\_j^{(a)} = \begin{cases} 1, & \text{if the } j\text{th feature is fixed in } P\_a, \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

A point *b* ∈ *K*<sup>+</sup> will be covered by the pattern *Pa* only if *y*<sub>*j*</sub><sup>(*a*)</sup> = 0 for all *j* such that *b*<sub>*j*</sub> ≠ *a*<sub>*j*</sub>. On the other hand, a point *c* ∈ *K*<sup>+</sup> will not be covered by the pattern *Pa* if *y*<sub>*j*</sub><sup>(*a*)</sup> = 1 for at least one *j* ∈ {1, ... , *k*} for which *c*<sub>*j*</sub> ≠ *a*<sub>*j*</sub>.

It should be noted that any point *Y*<sup>(*a*)</sup> = (*y*<sub>1</sub><sup>(*a*)</sup>, ... , *y*<sub>*k*</sub><sup>(*a*)</sup>) corresponds to a subcube in the space of binary features *X* = (*x*<sub>1</sub>, ... , *x*<sub>*k*</sub>) that includes the object *a*.
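
In code, the pair (*a*, *Y*<sup>(*a*)</sup>) fully specifies this subcube: an object *b* belongs to it exactly when it agrees with *a* on every fixed feature. A minimal sketch (the variable names are our own):

```python
# A minimal sketch of the subcube-membership test defined above:
# object b is covered by pattern P_a iff b_j == a_j for every
# feature j that is fixed (y_j == 1) in the mask Y.

def covers(a, y, b):
    """True iff b agrees with the reference point a on all fixed features."""
    return all(bj == aj for aj, yj, bj in zip(a, y, b) if yj == 1)

a = (1, 0, 1, 1)   # reference positive object
y = (1, 0, 0, 1)   # features 1 and 4 are fixed
print(covers(a, y, (1, 1, 0, 1)))  # True: agrees with a on features 1 and 4
print(covers(a, y, (0, 1, 0, 1)))  # False: differs from a on fixed feature 1
```

Note that the degree of the pattern, discussed in Section 2.3, is simply the number of ones in the mask, `sum(y)`.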

It is natural to assume that the pattern covers only a part of the observations from *K*<sup>+</sup>. The more observations of the positive class the pattern covers in comparison with observations of the other class, the more informative it is [50]. Coverage of negative observations constitutes a pattern error.

Regard the pattern *Pa* as a binary function of an object *b*: *Pa*(*b*) = 1 if the object *b* is covered by the pattern *Pa*, and 0 otherwise.

Let us introduce the following notation: *Cov*<sup>+</sup>(*Pa*) is the number of observations from *K*<sup>+</sup> for which the condition *Pa*(*b*) = 1, *b* ∈ *K*<sup>+</sup>, is satisfied; *Cov*<sup>−</sup>(*Pa*) is the number of observations from *K*<sup>−</sup> for which the condition *Pa*(*c*) = 1, *c* ∈ *K*<sup>−</sup>, is satisfied.

The pattern *Pa* is called "pure" if *Cov*<sup>−</sup>(*Pa*) = 0. If *Cov*<sup>−</sup>(*Pa*) > 0, then the pattern *Pa* is called "fuzzy" [35]. Obviously, among the pure patterns, the most valuable are those with a large number of covered positive observations *Cov*<sup>+</sup>(*Pa*).

The number of covered positive observations *Cov*<sup>+</sup>(*Pa*) can be expressed as follows [2]:

$$Cov^{+}(P\_a) = \sum\_{b \in K^{+}} \ \prod\_{j=\overline{1,k},\ b\_j \neq a\_j} \left(1 - y\_j^{(a)}\right). \tag{2}$$

The condition that the positive pattern *Pa* should not cover any point of the negative class (that is, the search for a pure pattern) requires that, for each observation *c* ∈ *K*<sup>−</sup>, the variable *y*<sub>*j*</sub><sup>(*a*)</sup> take the value 1 for at least one *j* for which *c*<sub>*j*</sub> ≠ *a*<sub>*j*</sub>. Thus, a pure pattern is a solution to the following problem of conditional Boolean optimization [2]:

$$\begin{aligned} \sum\_{b \in K^{+}} \ \prod\_{j=\overline{1,k},\ b\_j \neq a\_j} \left(1 - y\_j^{(a)}\right) &\to \max, \\ \sum\_{j=\overline{1,k},\ c\_j \neq a\_j} y\_j^{(a)} &\ge 1 \quad \forall c \in K^{-}. \end{aligned} \tag{3}$$
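
The objective and constraints of problem (3) translate directly into coverage counts over the training sample. The sketch below, an illustration with our own function names, evaluates a candidate mask against both:

```python
# A minimal sketch evaluating a candidate mask Y against problem (3):
# the objective Cov+ counts covered positive objects (Eq. (2)), and
# the pure-pattern constraint requires every negative object c to
# differ from a on at least one fixed feature.

def cov_positive(a, y, K_pos):
    """Objective of (3): number of positive objects inside the subcube."""
    return sum(1 for b in K_pos
               if all(yj == 0 for aj, yj, bj in zip(a, y, b) if bj != aj))

def is_pure(a, y, K_neg):
    """Constraints of (3): no negative object is covered."""
    return all(any(yj == 1 for aj, yj, cj in zip(a, y, c) if cj != aj)
               for c in K_neg)

a = (1, 0, 1)
y = (1, 0, 1)                    # fix features 1 and 3
K_pos = [(1, 0, 1), (1, 1, 1)]
K_neg = [(0, 0, 1), (1, 1, 0)]
print(cov_positive(a, y, K_pos)) # 2: both positives agree on the fixed features
print(is_pure(a, y, K_neg))      # True: each negative differs on a fixed feature
```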

These constraints establish the minimum allowable clearance between the two classes. To improve the reliability of the classification, namely, its robustness to errors on class boundaries, the constraints can be strengthened by increasing the value on the right-hand side of the inequality.

Since the properties of positive and negative patterns are completely symmetric, the procedure for finding negative patterns is similar.

#### *2.3. Proposed Optimization Model*

From the point of view of classification accuracy, pure patterns are preferable [10]. However, in the case of incomplete or inaccurate data, such patterns will have small coverage, which, for many applications, means abandoning the search for pure patterns in favor of partial ones.

For the case of partial patterns, the constraint *Cov*<sup>−</sup>(*Pa*) = 0 of the optimization problem transforms into a second objective function, which leads to an optimization problem with two simultaneous criteria:

$$Cov^{+}(P\_a) \to \max \quad \text{and} \quad Cov^{-}(P\_a) \to \min. \tag{4}$$

The least suitable are those patterns that either cover too few observations or cover positive and negative observations in approximately the same proportion. The contradictions between these conflicting criteria can be resolved by transferring the second objective function to the category of constraints through establishing a certain admissible number of covered negative observations. In addition, multi-criteria optimization methods [51] can be applied, which construct an approximation of the Pareto front [52].
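
Under criteria (4), one pattern dominates another when it covers at least as many positive observations and no more negative ones, with at least one strict improvement. The sketch below is a plain non-dominated filter over (*Cov*<sup>+</sup>, *Cov*<sup>−</sup>) scores, an illustration only, not the NSGA-II sorting described in Section 2.4:

```python
# A minimal sketch of Pareto filtering for criteria (4): each candidate
# pattern is scored by (cov_plus, cov_minus); we keep the candidates
# not dominated by any other (higher-or-equal cov_plus AND
# lower-or-equal cov_minus, differing in at least one component).

def dominates(p, q):
    """p = (cov_plus, cov_minus); p dominates q under the max/min criteria."""
    return p[0] >= q[0] and p[1] <= q[1] and p != q

def pareto_front(scored):
    """Keep candidates whose scores no other candidate dominates."""
    return [s for s in scored if not any(dominates(t, s) for t in scored)]

# Usage: (Cov+, Cov-) pairs for five candidate patterns.
scores = [(40, 0), (55, 2), (55, 5), (60, 9), (30, 1)]
print(pareto_front(scores))  # [(40, 0), (55, 2), (60, 9)]
```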

In addition, when searching for patterns, it is worth considering the degree of the pattern, that is, the number of its fixed features. It is easy to establish that there is an inverse dependence between the pattern degree and the number of covered observations (both positive and negative), so the pattern degree should not be too large [53].

The search for simpler patterns has well-founded prerequisites. Firstly, such patterns are better interpreted and understood during decision-making. Secondly, it is often believed that simpler patterns have better generalization ability, and their use leads to better recognition accuracy [53]. The use of simple, short patterns reduces the number of uncovered positive observations, but at the same time, shorter patterns can increase the number of covered negative observations. A natural way to reduce the number of false positives is to form more selective patterns. This is achieved by reducing the size of the subcube determined by the pattern [48,54].
