*Article* **New Classification Method for Independent Data Sources Using Pawlak Conflict Model and Decision Trees**

**Małgorzata Przybyła-Kasperek \* and Katarzyna Kusztal**

Institute of Computer Science, University of Silesia in Katowice, Będzińska 39, 41-200 Sosnowiec, Poland **\*** Correspondence: malgorzata.przybyla-kasperek@us.edu.pl; Tel.: +48-32-269-17-56

**Abstract:** The research concerns data collected in independent sets—more specifically, in local decision tables. A possible approach to managing these data is to build local classifiers based on each table individually. In the literature, many approaches toward combining the final prediction results of independent classifiers can be found, but insufficient efforts have been made on the study of tables' cooperation and coalitions' formation. The importance of such an approach was expected on two levels. First, the impact on the quality of classification—the ability to build combined classifiers for coalitions of tables should allow for the learning of more generalized concepts. In turn, this should have an impact on the quality of classification of new objects. Second, combining tables into coalitions will result in reduced computational complexity—a reduced number of classifiers will be built. The paper proposes a new method for creating coalitions of local tables and generating an aggregated classifier for each coalition. Coalitions are generated by determining certain characteristics of attribute values occurring in local tables and applying the Pawlak conflict analysis model. In the study, the classification and regression trees with Gini index are built based on the aggregated table for one coalition. The system bears a hierarchical structure, as in the next stage the decisions generated by the classifiers for coalitions are aggregated using majority voting. The classification quality of the proposed system was compared with an approach that does not use local data cooperation and coalition creation. The structure of the system is parallel and decision trees are built independently for local tables. In the paper, it was shown that the proposed approach provides a significant improvement in classification quality and execution time. 
The Wilcoxon test confirmed that differences in accuracy rate of the results obtained for the proposed method and results obtained without coalitions are significant, with a *p* level = 0.005. The average accuracy rate values obtained for the proposed approach and the approach without coalitions are, respectively: 0.847 and 0.812; so the difference is quite large. Moreover, the algorithm implementing the proposed approach performed up to 21-times faster than the algorithm implementing the approach without using coalitions.

**Keywords:** Pawlak conflict analysis model; independent data sources; coalitions; decision trees; dispersed data

#### **1. Introduction**

In today's world, data are often collected in a decentralized and dispersed manner. There are many examples that illustrate this process: hospitals that separately collect data on the same issue/disease; banks that store data on their clients; applications on mobile devices that collect various data. These data are collected independently and in separate data storage.

It is crucial to use these data sets simultaneously to construct a classification of new objects. Of course, a very significant consideration is to guarantee high efficiency in the classification process based on dispersed data.

**Citation:** Przybyła-Kasperek, M.; Kusztal, K. New Classification Method for Independent Data Sources Using Pawlak Conflict Model and Decision Trees. *Entropy* **2022**, *24*, 1604. https://doi.org/10.3390/e24111604

Academic Editors: Przemysław Juszczuk and Jan Kozak

Received: 24 October 2022 Accepted: 1 November 2022 Published: 4 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The issues of dispersed data are mainly considered in distributed learning approaches [1,2]. The distributed models process all or part of the data at different nodes [3,4]. A solution in which all the data are simultaneously aggregated and stored in a single set is
both inefficient and often impossible to apply [5]. Therefore, most research papers have proposed a collaborative solution without data aggregation. In federated learning [6,7], nodes perform multiple rounds with local data and send the local model to the central server for aggregation into new global models. The main idea here is to guarantee data protection and privacy. Moreover, models are much shorter than raw data, so the exchange of data is faster and less complex. In the distributed learning approach, methods can be found in which local models are built independently, and the final decision is simply generated by applying fusion methods. Various models have been proposed, both parallel [8] and hierarchical [9,10]. The concept of agent collaboration is also key here [11]; however, we do not build aggregated tables as a result of this collaboration. In the literature, examples of classifier ensembles in which feature subsets are considered can be found [12–14]. There are also ensembles of classifiers built based on subsets of objects [15,16]. In the paper [17], an approach that considers missing values in the context of ensembles is considered. A crucial matter that affects the quality of classification is diversity among the base classifiers [18,19]. The method for generating the final decision also has a significant impact on the efficiency of ensembles [20,21]. Approaches recognizing relations between local data are considered in the literature. In the paper [22], a hierarchical federated learning approach was proposed. On the other hand, the paper [23] proposed a hierarchical approach in classifier ensembles. 
Mainly in the literature, distributed learning is considered in terms of the following issues [2,24]: data division—horizontal or vertical fragmentation; type of base classifiers can be homogeneous or heterogeneous; type and cost of communication—data or models may be shared; privacy and data security—whether raw data exchange is allowed; fusion methods—if local models are built (global model is not created) then fusion of predictions is necessary to generate global decisions; data consistency—it can be assumed that objects are shared between local tables and are consistent, or data can be independently created and inconsistent. However, proposed approaches do not analyze the contents of local tables and the relationships between them. In addition, the aggregation of local tables is seldom considered in the literature.

Therefore, in this paper we fill this gap and propose a solution that performs a complex analysis of tables' content. The proposed approach aims to identify conflicts of local tables. The term conflict used here refers to significant differences in the values of conditional attributes occurring in local tables. We analyze relations and create coalitions of local tables containing similar data. Based on the aggregated tables, a model is built. It is expected that in this way we achieve better classification accuracy because models created via this approach have a better ability to generalize concepts compared to approaches that use a single model created based on a single table.

In the literature, conflict analysis is widely considered and various models are proposed. Group decision-making represents an approach that solves the situation in which each individual has their own private perspective [24]. In [25], a model is proposed for distributed group-decision support system that is suitable for use over the Internet. The theory of negotiation and coalition formation presents an important issue regarding social interaction and is also studied in computer science in the context of distributed systems [26,27]. Pawlak's conflict analysis model [28,29] is yet another approach to conflict recognition that provides excellent solutions in a variety of applications [30,31]. Pawlak conflict analysis model was also considered in the context of dispersed data in the papers [32–34]. This application shows that the Pawlak model provides excellent results for dispersed data when tables are aggregated within coalitions. However, the approach discussed in their study is completely different from the one proposed in this paper. Here, the compatibility of tables is examined in terms of the information stored in them—the values on the attributes. In contrast, the papers [32–34] consider compatibility in terms of predictions generated by the base models created based on the tables. Another difference is that in this paper we assume that in local tables the same attributes are present, while in the papers [32–34] there was no such assumption. Furthermore, in this paper, the system is static, whereas previously it was dynamic. However, the success of the previous model provides the inspiration for proposing a new approach in this paper. The main differences between these approaches are listed in Table 1.


**Table 1.** Comparison of the new approach with the approach proposed in the papers [32–34].

This paper proposes the use of the Pawlak conflict analysis method to generate coalitions of decision tables, in which there are similar values on a set of conditional attributes. The goal is to achieve a better quality of classification by ensuring that similar units work together. Formally, this approach requires that data are collected in a set of decision tables (that were collected independently) in which the names of the conditional attributes are identical (but the values on the objects may differ). Thus, coalitions of tables containing similar values will be created. The tables in one coalition are then aggregated and a common model is determined based on the aggregated table. This approach seems natural, since in everyday life we also notice that similar entities join forces to form better decisions or to guarantee better management. This paper describes the process of using characteristics of attribute values stored in decision tables in the Pawlak conflict analysis model. The paper proposes a static and hierarchical classification model. The model is static because coalitions—the model's structure—are determined only once. Hierarchy of the model results from the fact that tables in coalitions are aggregated and then models are built based on them and these models perform classification. In this paper, decision trees are used as base models. Specifically, classification and regression trees with Gini index (CART) [35] are applied. The final classification of new objects is determined using majority voting based on the predictions generated by the decision trees.

The paper also considers a parallel approach in which conflict analysis is not considered. In this approach, the CART trees are also employed as base models, but the cooperation of tables is not implemented, and the final decisions are made by majority voting of decision trees generated independently based on tables.

The main objective in this study is to analyze how building coalitions of tables using the Pawlak conflict analysis model affects the quality of classification and the running time of the model. The two research hypotheses are verified in the paper. The first is that applying the proposed model with Pawlak analysis and coalitions provides better classification quality than an approach in which coalitions are not used (in both models the same base classifiers are used—the CART trees). The second research hypothesis is that the algorithm implementing the proposed model has a lower time complexity than the algorithm implementing the approach in which decision trees are built based on each local table separately.

Herein, it is shown that combining local tables into aggregated tables significantly improves classification quality. In addition, it reduces the number of generated trees and thus reduces the time complexity of the method.

The main contributions of the paper are:


The structure of the paper is organized as follows. Section 2 presents the proposed model. The method of defining the coalitions and steps in building the model are described there. Section 3 is dedicated to presenting the experimental results. The data, the measures used and the methodology of the experiments are described in this section, and the results obtained are also provided in tables. Section 4 contains the discussion and comparisons of the obtained results. Section 5 gives conclusions and future research plans.

#### **2. Materials and Methods**

This section describes a new proposed hierarchical system for classification based on dispersed data. In this research, we assume that the sets of attributes appearing in local tables are equal. Stages of system construction are described in the following subsections. The first step involves creating the system's structure—generating coalitions of local tables. This stage is implemented only once. Our goal here is the cooperation of tables that store similar conditional attribute values. This concept detailing the cooperation of units that share similar views with each other—have compatible values in this case—represents a natural behavior that we can observe in everyday life and nature. For this purpose, characteristics of conditional attributes' values are calculated. In the next step, coalitions are created based on these characteristics using the Pawlak conflict analysis model. The final step is the aggregation of tables from one coalition. Based on such aggregated coalition's data, a classifier is built. In this study, we use a decision tree model. The final classification model is a set of such decision trees generated for coalitions. The classification of an object is conducted by the majority voting of these trees. Figure 1 illustrates the workflow of the proposed model.

#### *2.1. Basic Concepts and Method of Defining Characteristics of Conditional Attributes*

We assume that a set of decision tables is given. The tables were collected independently by separate units, but it is required that the same attributes are stored in all tables. We do not impose any restrictions on the objects contained within the tables. We assume that we do not know which objects are shared between local tables.

Formally, we assume that a set of decision tables *Di* = (*Ui*, *A*, *d*), *i* ∈ {1, ... , *n*} from one discipline is available, where *Ui* is the universe, a set of objects; *A* is a set of conditional attributes; *d* is a decision attribute. As can be seen, the sets of objects differ between local tables. The names of the attributes that occur in local tables, both conditional and decision, are the same. Therefore, the conditional attributes *A* and the decision attribute *d* are denoted in the same way in all local tables. Clearly, from a formal point of view, the attribute *a* ∈ *A* in the decision table *Di* is a function *a* : *Ui* → *V<sub>a</sub>*, where *V<sub>a</sub>* is the set of values of the attribute *a*. Thus, the domains of these functions differ between local tables. However, for the sake of simplicity, the same designations for attributes are adopted in all local tables, and the domain of the function follows directly from the decision table to which the attribute belongs. Aggregating all of these tables is a difficult process and can generate inconsistencies. Another aspect that should be taken into account is data protection and privacy. In addition, the process of aggregating all local tables is highly complex. Thus, the literature instead proposes methods for partial aggregation of tables, or even for building separate models based on each local table and then aggregating these models or the predictions they generate [7,21,36].

**Figure 1.** The overall workflow of the proposed model.

In this paper, a new approach is proposed in which we aggregate tables that contain similar values on conditional attributes. For this purpose, for each local table and for each attribute, some characteristics of the attribute's values occurring in the table are generated. Suppose that in each local table we have *m* attributes, i.e., *card*{*A*} = *m* (*card* denotes the number of elements in the set). Let us assume that we have *m*<sub>1</sub> quantitative attributes and *m*<sub>2</sub> qualitative attributes, so *m*<sub>1</sub> + *m*<sub>2</sub> = *m*.

For each quantitative attribute *aquan* ∈ *A*, we determine the average of all the attribute's values present in local table *Di*, for each *i* ∈ {1, ... , *n*}. Let us denote this value as $\overline{Val}^{i}_{a_{quan}}$. We also calculate the global average and the global standard deviation, denoted as $\overline{Val}_{a_{quan}}$ and $SD_{a_{quan}}$. These values are determined from the averages calculated for the local decision tables according to the following formulas:

$$\overline{Val}_{a_{quan}} = \frac{1}{n} \sum_{i=1}^{n} \overline{Val}_{a_{quan}}^{i} \tag{1}$$

$$SD_{a_{quan}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\overline{Val}_{a_{quan}} - \overline{Val}_{a_{quan}}^{i}\right)^2} \tag{2}$$

These characteristics for quantitative attributes will be used in the coalitions generation process.

For each qualitative attribute *aqual* ∈ *A*, we determine a vector over the values of that attribute. Suppose attribute *aqual* has *c* values *val*1, ... , *valc*. The vector $Val^{i}_{a_{qual}} = (n^{i}_{1}, \ldots, n^{i}_{c})$ represents the number of occurrences of each of these values in the decision table *Di*. More precisely, the coordinate $n^{i}_{j}$ is the number of objects in table *Di* that have value *valj* on attribute *aqual*. This vector is then normalized, so that in further analysis the percentage of occurrences of a given value in the table matters rather than the number of objects in the table.
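The characteristics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`local_mean`, `global_mean_and_sd`, `normalized_counts`) and the toy data are hypothetical, and normalization is assumed to mean dividing each count by the number of objects in the table, which matches the remark that percentages rather than table sizes should matter.

```python
from collections import Counter
from math import sqrt

def local_mean(values):
    """Average of one quantitative attribute within a single local table."""
    return sum(values) / len(values)

def global_mean_and_sd(local_means):
    """Equations (1) and (2): mean and standard deviation of the local averages."""
    n = len(local_means)
    g_mean = sum(local_means) / n
    g_sd = sqrt(sum((g_mean - m) ** 2 for m in local_means) / n)
    return g_mean, g_sd

def normalized_counts(values, domain):
    """Share of each value of a qualitative attribute in one local table
    (assumed normalization: counts divided by the number of objects)."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]

# Hypothetical toy data: one quantitative attribute observed in three local tables.
tables = [[1, 2, 3], [2, 2, 2, 2], [4, 6]]
means = [local_mean(t) for t in tables]
g_mean, g_sd = global_mean_and_sd(means)

# Hypothetical qualitative attribute with domain ("a", "b", "c") in one table.
shares = normalized_counts(["a", "b", "a", "a"], ("a", "b", "c"))
```

Note that the global characteristics are computed from the local averages only, so no raw objects need to be exchanged between tables at this stage.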

The Pawlak conflict analysis model is employed to determine coalitions of local tables that store similar attribute values. The next section presents the method to create an information system with a description of the conflict situation and how coalitions are generated with the use of the Pawlak model.

#### *2.2. Pawlak Conflict Analysis Model and Creation of Coalitions*

The Pawlak conflict analysis model is a very simple yet effective approach for recognizing coalitions of units involved in a conflicting situation [28,29]. In this model, an information system is defined in which the views of agents—units involved in a conflict situation—on the issues that are the matter of the conflict are stored. In the considered approach, the agents are local tables while the issues are conditional attributes stored in these tables. Formally, an information system is defined *S* = (*U*, *A*), where *U* is a set of local decision tables *U* = {*D*1, ... , *Dn*} and *A* is a set of conditional attributes (qualitative and quantitative) occurring in local tables, which was defined in the previous section. In the Pawlak model, opinions of agents on issues are expressed by using three values. Value 1 means an agent is in favor of an issue, value 0 means an agent is neutral to an issue, while value −1 means an agent is against an issue. The original interpretation differs from that used herein. In this paper, the values refer rather to the differences in values of a given attribute appearing in the local decision table. Depending on the type of attribute (qualitative or quantitative), a different method of determining these values is used.

For the quantitative attribute *aquan* ∈ *A* a function *aquan* : *U* → {−1, 0, 1} is defined

$$a_{quan}(D_i) = \begin{cases} 1 & \text{if } \overline{Val}_{a_{quan}} + SD_{a_{quan}} < \overline{Val}_{a_{quan}}^i \\ 0 & \text{if } \overline{Val}_{a_{quan}} - SD_{a_{quan}} \le \overline{Val}_{a_{quan}}^i \le \overline{Val}_{a_{quan}} + SD_{a_{quan}} \\ -1 & \text{if } \overline{Val}_{a_{quan}}^i < \overline{Val}_{a_{quan}} - SD_{a_{quan}} \end{cases} \tag{3}$$

The motivation for this function originates from the method of estimating typical values of the normal distribution. It is known that about 68% of values drawn from a normal distribution fall within the range average ± standard deviation. Thus, we assign the value 0 on attribute *aquan* to a decision table *Di* when the average of the attribute's values occurring in the table falls in the $SD_{a_{quan}}$-neighborhood of the global average $\overline{Val}_{a_{quan}}$. This means that the values of the attribute occurring in the decision table are typical.

In contrast, the value 1 means that the average of the conditional attribute's values in the decision table exceeds the global average by more than $SD_{a_{quan}}$, i.e., by more than one standard deviation. Similarly, the value −1 indicates an atypically low average of the conditional attribute in the decision table, lying more than one standard deviation below the global average.
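As a rough illustration of this three-valued coding, the sketch below maps a list of local averages to the values 1, 0 and −1. The function name `encode_quantitative` and the four input averages are hypothetical; averages lying exactly on a boundary are treated here as typical (value 0), which is an assumption about the boundary case.

```python
from math import sqrt

def encode_quantitative(local_means):
    """Code each local table as 1, 0 or -1 depending on where its average
    lies relative to (global mean - SD, global mean + SD)."""
    n = len(local_means)
    g_mean = sum(local_means) / n
    g_sd = sqrt(sum((g_mean - m) ** 2 for m in local_means) / n)
    codes = []
    for m in local_means:
        if m > g_mean + g_sd:
            codes.append(1)      # atypically high average
        elif m < g_mean - g_sd:
            codes.append(-1)     # atypically low average
        else:
            codes.append(0)      # typical average (boundaries included)
    return codes

# Hypothetical local averages of one attribute in four local tables.
codes = encode_quantitative([1.2, 1.4, 1.2, 1.2])
```

With these made-up averages the global mean is 1.25 and the standard deviation is about 0.087, so only the second table is coded as 1 and the remaining tables are coded as 0.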

As mentioned above, the vectors that determine the distribution of values occurring in the decision tables are generated for qualitative attributes. For an attribute *aqual* ∈ *A* we have the vectors $Val^{i}_{a_{qual}} = (n^{i}_{1}, \ldots, n^{i}_{c})$, *i* ∈ {1, ... , *n*}. In order to define three groups of decision tables with a similar distribution of the values of attribute *aqual*, we group these vectors with the *k*-means clustering algorithm, using a fixed number of groups *k* = 3 and the Euclidean distance. We then sort the centroids of the obtained groups in descending order of their first coordinate. Let us denote the groups of decision tables obtained from the *k*-means algorithm, indexed according to this order of centroids, as *G*1, *G*2, *G*3. For the qualitative attribute *aqual* ∈ *A* a function *aqual* : *U* → {−1, 0, 1} is defined

$$a_{qual}(D_i) = \begin{cases} 1 & \text{if } D_i \in G_1 \\ 0 & \text{if } D_i \in G_2 \\ -1 & \text{if } D_i \in G_3 \end{cases} \tag{4}$$

The function above assigns values on a qualitative attribute to local tables that reflect the consistency of the characteristics of this attribute appearing in the table. Thus, decision tables that contain similar distribution of values of the qualitative attribute will have the same value assigned in the information system *S*.
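The mapping from clusters to the values 1, 0 and −1 can be sketched as follows. The sketch assumes that the clustering has already been done by some *k*-means implementation (for instance, the `labels_` and `cluster_centers_` attributes of scikit-learn's `KMeans` would provide suitable inputs); the function name `encode_qualitative` and the example labels and centroids are hypothetical.

```python
def encode_qualitative(labels, centroids):
    """labels[i] is the cluster index of table D_i, centroids[g] is the centroid
    of cluster g. Clusters are ranked by the first centroid coordinate in
    descending order and mapped to the values 1, 0, -1 (groups G1, G2, G3)."""
    order = sorted(range(len(centroids)),
                   key=lambda g: centroids[g][0],
                   reverse=True)
    value_of_cluster = dict(zip(order, (1, 0, -1)))
    return [value_of_cluster[g] for g in labels]

# Hypothetical clustering result for five local tables and a two-valued attribute.
labels = [2, 0, 0, 1, 2]
centroids = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
codes = encode_qualitative(labels, centroids)
```

Here cluster 2 has the largest first centroid coordinate and therefore plays the role of *G*1, cluster 1 plays the role of *G*2, and cluster 0 the role of *G*3.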

In this way, the information system *S* is defined that stores information about the compatibility of values of conditional attributes occurring in local tables. Based on this system, we calculate the general similarity of values of all attributes for each pair of tables. For this purpose, a conflict function is used that was proposed by Pawlak in their conflict analysis model [28]. The conflict function *ρ* : *U* × *U* → [0, 1] is defined as follows

$$\rho(D_i, D_j) = \frac{\operatorname{card}\{a \in A : a(D_i) \neq a(D_j)\}}{\operatorname{card}\{A\}}. \tag{5}$$
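The conflict function is simply the fraction of attributes on which two tables receive different codes. A minimal sketch is given below; the function name `conflict` and the information system are hypothetical and serve only as an illustration.

```python
def conflict(info_system, di, dj):
    """Equation (5): fraction of attributes on which two tables differ."""
    row_i, row_j = info_system[di], info_system[dj]
    differing = sum(1 for a, b in zip(row_i, row_j) if a != b)
    return differing / len(row_i)

# Hypothetical information system: four local tables, five attributes,
# each coded with a value from {-1, 0, 1}.
info_system = {
    "D1": [0, 1, 0, -1, 0],
    "D2": [1, -1, 0, 1, 1],
    "D3": [0, 1, 0, 0, 0],
    "D4": [0, 0, 0, -1, 0],
}
rho_12 = conflict(info_system, "D1", "D2")  # four of five attributes differ
rho_13 = conflict(info_system, "D1", "D3")  # one of five attributes differs
```

The function is symmetric and returns 0 for a table compared with itself, so only the pairs above the diagonal need to be computed in practice.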

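A pair of decision tables *Di*, *Dj* ∈ *U* is said to be [28]: allied if *ρ*(*Di*, *Dj*) < 0.5; in conflict if *ρ*(*Di*, *Dj*) > 0.5; neutral if *ρ*(*Di*, *Dj*) = 0.5.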


A set *X* ⊆ *U* is a coalition if every pair of decision tables *Di*, *Dj* ∈ *X* is allied, i.e., *ρ*(*Di*, *Dj*) < 0.5. By applying the Pawlak conflict analysis model, we obtain coalitions of local tables that share similar values of conditional attributes. It should be noted that coalitions do not have to be disjoint: one local table can be included in several coalitions. In fact, this is quite a common case, as will be shown in the experimental section.

The pseudo-code of the algorithm that generates the coalitions of local tables is given in Algorithm 1.


**Input:** A set of local decision tables *Di* = (*Ui*, *A*, *d*), *i* ∈ {1, . . . , *n*}.

**Output:** A set of coalitions of local tables *X*1,..., *Xk*.

*Construction of an information system S* = (*U*, *A*)*, where U* = {*D*1, ... , *Dn*} *and A is a set of conditional attributes*

for each *a* ∈ *A*:

if *a* is a quantitative attribute then

Use Equation (3) to define the function *a*

else

Use Equation (4) to define the function *a*

*Conflict function values*

for each pair *Di*, *Dj* ∈ *U*:

Use Equation (5) to calculate the value *ρ*(*Di*, *Dj*)

*Creation of coalitions*

*X*<sub>1</sub> = *U*, *i* = 1, *j* = 1

while *i* ≤ *j*:

while there exists a pair of tables *Dl*, *Dk* ∈ *Xi* such that *ρ*(*Dl*, *Dk*) ≥ 0.5:

$$\begin{aligned} j &= j+1 \\ X_{j} &= X_{i} \setminus \{D_{l}\}, \quad X_{i} = X_{i} \setminus \{D_{k}\} \end{aligned}$$

*i* = *i* + 1

Return only the largest sets, due to the inclusion relation, from the sets *Xi*, *i* = 1, . . . , *j*
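One possible reading of the coalition-creation step can be sketched in Python as follows. This is an illustrative interpretation of the pseudo-code, not the authors' implementation: the function name `find_coalitions`, the helper `rho`, and the conflict values are hypothetical. A candidate set containing a non-allied pair is split into two sets, each missing one table of the pair, and at the end only the sets that are maximal with respect to inclusion are kept.

```python
from itertools import combinations

def find_coalitions(tables, rho, threshold=0.5):
    """Split candidate sets until no set contains a pair with rho >= threshold,
    then keep only the sets that are maximal with respect to inclusion."""
    candidates = [set(tables)]                    # X_1 = U
    i = 0
    while i < len(candidates):                    # corresponds to "while i <= j"
        while True:
            current = candidates[i]
            pair = next(((d_l, d_k)
                         for d_l, d_k in combinations(sorted(current), 2)
                         if rho(d_l, d_k) >= threshold), None)
            if pair is None:                      # every pair in X_i is allied
                break
            d_l, d_k = pair
            candidates.append(current - {d_l})    # X_j = X_i \ {D_l}
            candidates[i] = current - {d_k}       # X_i = X_i \ {D_k}
        i += 1
    unique = {frozenset(c) for c in candidates}
    maximal = [c for c in unique if not any(c < other for other in unique)]
    return sorted(sorted(c) for c in maximal)

# Hypothetical conflict function values for four local tables.
conflicts = {
    ("D1", "D2"): 0.8, ("D1", "D3"): 0.2, ("D1", "D4"): 0.2,
    ("D2", "D3"): 0.8, ("D2", "D4"): 0.8, ("D3", "D4"): 0.4,
}

def rho(a, b):
    return conflicts[tuple(sorted((a, b)))]

coalitions = find_coalitions(["D1", "D2", "D3", "D4"], rho)
```

Because every split keeps both alternatives, each maximal set of mutually allied tables survives in some candidate, at the price of a number of candidates that can grow exponentially with the number of tables.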

The computational complexity of the algorithm is exponential in the number of local tables. The worst case occurs when no pair of local tables is similar enough to be allied; then all subsets of the set of local tables are eventually checked. However, in most applications the number of local tables is not large. In the experimental section, the proposed model is applied to dispersed data containing up to eleven local tables. The running times obtained in the worst cases are measured in minutes.

#### *2.3. Aggregation of Tables from Coalitions and Final Classification*

An aggregated decision table is defined for each coalition of local tables generated in the previous step. Suppose we have coalitions of tables *X*1, ... , *Xk*. The aggregated decision table for the coalition *Xj* is denoted as $D^{aggr}_{j} = (U^{aggr}_{j}, A, d)$, where $U^{aggr}_{j} = \bigcup_{D_i \in X_j} U_i$ and the names of attributes in the aggregated table are the same as those in local tables. The attribute *a* from the aggregated table is a function defined on $U^{aggr}_{j}$ that takes values in *V<sub>a</sub>*. On an object *x* ∈ *Ui*, the attribute *a* from the aggregated table has the same value as the corresponding attribute *a* from the local table *Di* on that object. Thus, an aggregated table is defined by combining the objects from the local tables in the coalition without recognizing whether the local tables have objects in common (under our assumptions, this cannot be determined). In the aggregated table, the values assigned to objects on the attributes are taken from local tables.
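In practical terms, aggregation amounts to stacking the rows of the local tables that belong to one coalition. The sketch below illustrates this with plain Python lists; the function name `aggregate`, the row format (a tuple of conditional attribute values paired with a decision), and the data are hypothetical.

```python
def aggregate(coalition, local_tables):
    """Stack the rows of all local tables in a coalition. Possible shared
    objects are not detected, so identical rows may appear more than once."""
    rows = []
    for name in coalition:
        rows.extend(local_tables[name])
    return rows

# Hypothetical local tables: each row is (conditional attribute values, decision).
local_tables = {
    "D1": [((0, 1), "d1"), ((2, 1), "d2")],
    "D3": [((0, 1), "d1")],
    "D4": [((1, 1), "d2")],
}
aggregated = aggregate(["D1", "D3", "D4"], local_tables)
```

Since the tables share the same attribute names, no schema alignment is needed before stacking.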

Based on aggregated tables, models are generated. In this paper, the classification and regression tree algorithm with the Gini index is used [35]. It should be noted that neither pre-pruning nor post-pruning was applied to the trees. An implementation available in the Python language was used for this purpose [37]. Specifically, the *DecisionTreeClassifier(criterion = "gini")* class was used. A tree is built independently for each aggregated table; thus, we obtain *k* models *M*1,..., *Mk*.

The classification of a new object *x* is realized by each model separately. The final decision, i.e., the global decision, which we denote as $\hat{d}(x)$, is made by majority voting. A tie may occur, and we do not resolve it in any way. Thus, $\hat{d}(x)$ is the set of decisions most frequently indicated by the models *M*1, ... , *Mk*. In the experimental part, measures for evaluating the quality of classification that take the possibility of ties into account are used.
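Majority voting with unresolved ties can be illustrated with a short sketch. The function name `majority_vote` is hypothetical; the input is assumed to be the list of decisions returned by the individual models for one object.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the set of decisions that received the most votes.
    The set has more than one element when there is a tie."""
    counts = Counter(predictions)
    top = max(counts.values())
    return {decision for decision, votes in counts.items() if votes == top}

# Hypothetical predictions of three models and of two models for one object.
clear_winner = majority_vote(["d1", "d2", "d1"])
tie = majority_vote(["d1", "d2"])
```

Returning a set rather than a single label keeps ties visible, which is why the evaluation measures have to account for answers containing several decisions.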

In Section 2.5, an illustrative example of the proposed approach is provided for clarification.

#### *2.4. Baseline Model without the Use of Coalitions*

The results obtained using the proposed method are compared with the results generated by an approach without any conflict analysis. In the baseline approach, a model is built based on each local table. In order to perform a fair comparison of the impact of the proposed novelty on the results obtained, the same classification model was used—for each local table the CART tree is used. Classification of a new object is realized by applying the majority voting method to the classification results obtained using these decision trees. Ties can occur, but as stated before, we do not resolve them in any way. The adequate measures were used in the experimental part.

#### *2.5. Example of Use of the Proposed Approach*

Let us consider an example that uses the proposed approach. Suppose we have a set of four local tables *Di* = (*Ui*, *A*, *d*), *i* ∈ {1, ... , 4}. Each of them contains a set of five conditional attributes *A* = {*a*1, ... , *a*5} and a decision attribute *d*. We assume that $V_{a_i} = \{0, 1, 2\}$, *i* ∈ {1, ... , 5}, and $V_d = \{d_1, d_2\}$ for each of the tables. For the purposes of this example, the conditional attributes in the tables are quantitative. The local tables defined above are given in Table 2.


**Table 2.** Local tables used in the example.

Based on the attribute values in the local tables (Table 2), the information system is generated as described in Section 2.2. In the first step, the average of the attribute's values occurring in the local table is calculated for each attribute and each table. These values are denoted as $\overline{Val}^{i}_{a_j}$, *i* ∈ {1, ... , 4}, *j* ∈ {1, ... , 5} and are given in Table 3. Furthermore, the global average and the global standard deviation are calculated for each attribute; these values are also shown in Table 3.

**Table 3.** Averages $\overline{Val}^{i}_{a_j}$, *i* ∈ {1, . . . , 4}, *j* ∈ {1, . . . , 5}.


Thus, according to Equation (3), the values in the information system for attribute *a*<sub>1</sub> are assigned as follows

$$a_1(D_i) = \begin{cases} 1 & \text{if } 1.337 < \overline{Val}_{a_1}^i \\ 0 & \text{if } 1.163 \le \overline{Val}_{a_1}^i \le 1.337 \\ -1 & \text{if } \overline{Val}_{a_1}^i < 1.163 \end{cases} \tag{6}$$

which means that *a*1(*D*1) = 0, *a*1(*D*2) = 1, *a*1(*D*3) = 0, *a*1(*D*4) = 0. For other attributes, the values in the information system are determined similarly. The obtained information system is shown in Table 4.

**Table 4.** Information system.


In the next step, the values of conflict function for the local tables are determined according to Equation (5). For example, for the pair (*D*1, *D*2) of local tables, the value is calculated as follows

$$\rho(D_1, D_2) = \frac{\operatorname{card}\{a \in A : a(D_1) \neq a(D_2)\}}{\operatorname{card}\{A\}} = \frac{4}{5}. \tag{7}$$

The values of the conflict function for the above information system are presented in Table 5.

**Table 5.** Function values.


Figure 2 shows a graphical representation of the conflict situation. When agents (local tables) are allied (*ρ*(*Di*, *Dj*) < 0.5), the circles representing the agents are linked. In order to find coalitions, all maximal cliques should be identified in the graph. In this example, there are two coalitions: {*D*1, *D*3, *D*4} and {*D*2}.

**Figure 2.** A graphical representation of the conflict situation example.

An aggregated decision table is generated for each coalition. The aggregated tables are presented in Table 6.

Now, a decision tree is built for each aggregated table. This is done using the function implemented in the Scikit-learn library *tree.DecisionTreeClassifier(criterion = "gini")*. The built decision trees are presented in Figure 3. Test objects are classified based on these models using the simple voting method.

**Table 6.** Aggregated local tables.


**Figure 3.** Decision trees created for aggregated decision tables. (**a**) The aggregated table *D*<sub>1</sub><sup>*aggr*</sup>. (**b**) The aggregated table *D*<sub>2</sub><sup>*aggr*</sup>.

Since local table *D*<sub>2</sub> forms a coalition containing only one element, the second aggregated table is the same as the local table *D*<sub>2</sub>, and the trees generated from them are therefore also the same. We should thus focus mainly on the tree generated from the first aggregated table and the three trees generated from local tables *D*<sub>1</sub>, *D*<sub>3</sub> and *D*<sub>4</sub>. As we can see, they are quite different. For example, the tree generated from the aggregated table has the condition *a*<sub>2</sub> ≤ 1.5 at the root, which does not correspond to any condition occurring in the trees in Figure 4a,c,d. In addition, in the aggregated tree, the attribute *a*<sub>5</sub> appears in two internal nodes and the attribute *a*<sub>4</sub> in one internal node. These attributes do not appear at all in the trees generated from local tables *D*<sub>1</sub>, *D*<sub>3</sub> and *D*<sub>4</sub>.

Since tables are combined into coalitions according to the similarity of their conditional attribute values, trees generated from aggregated tables should not differ drastically from trees generated from local tables. In general, trees generated from a larger number of training objects are expected to be more accurate and to give better classification quality.

For comparison, let us also consider the baseline model, in which coalitions are not generated. In this case, the decision trees are generated directly based on local tables. Thus, we obtain four decision trees generated from the tables given in Table 2, which are presented in Figure 4.

**Figure 4.** Decision trees created for local decision tables, (**a**) for the local table *D*1, (**b**) for the local table *D*2, (**c**) for the local table *D*3, (**d**) for the local table *D*4.

#### **3. Results**

The experiments were carried out using data available from the UC Irvine Machine Learning Repository [38]. Three data sets were selected for the analysis: the Vehicle Silhouettes, the Landsat Satellite and the Soybean (Large) data sets. For the Landsat Satellite and Soybean data sets, separate training and test sets are provided in the repository. The Vehicle data set was randomly split into two disjoint subsets, the training set (70% of objects) and the test set (30% of objects). Data characteristics are given in Table 7.


**Table 7.** Data set characteristics.

The training sets of the above data sets were dispersed. For each data set, five dispersed versions with 3, 5, 7, 9 and 11 local tables were prepared in order to examine different degrees of dispersion. The dispersion was carried out in stratified mode. Each local table contained the full set of attributes and a subset of the set of objects.
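One way such a stratified dispersion could be carried out is sketched below. It is an assumption about the procedure, not the authors' code: objects are grouped by decision class, shuffled, and dealt round-robin into the local tables, which keeps both the class proportions and the table sizes balanced. The function name `disperse` and the toy class labels are hypothetical.

```python
import random
from collections import defaultdict


def disperse(labels, n_tables, seed=0):
    """Split object indices into n_tables local tables with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    tables = [[] for _ in range(n_tables)]
    position = 0  # dealing continues across class boundaries so table sizes stay balanced
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for index in indices:
            tables[position % n_tables].append(index)
            position += 1
    return tables


labels = ["bus"] * 6 + ["van"] * 9  # toy decision column, made up
local_tables = disperse(labels, n_tables=3)
for table in local_tables:
    print(sorted(table), [labels[i] for i in sorted(table)])
```

Each local table then holds a subset of the objects with all attributes, as described above.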

The quality of classification was evaluated based on the test set. The following measures were used:

• the classification accuracy

$$acc = \frac{1}{\operatorname{card}\{U_{test}\}} \sum_{x \in U_{test}} I(d(x) \in \hat{d}(x)),$$

where $I(d(x) \in \hat{d}(x)) = 1$ when $d(x) \in \hat{d}(x)$, and $I(d(x) \in \hat{d}(x)) = 0$ when $d(x) \notin \hat{d}(x)$; $\hat{d}(x)$ is the set of global decisions generated by the system for the test object $x$ from the test set $U_{test}$

• the classification ambiguity accuracy

$$acc_{ONE} = \frac{1}{\operatorname{card}\{U_{test}\}} \sum_{x \in U_{test}} I(d(x) = \hat{d}(x)),$$

where $I(d(x) = \hat{d}(x)) = 1$ when $\{d(x)\} = \hat{d}(x)$, and $I(d(x) = \hat{d}(x)) = 0$ when $\{d(x)\} \neq \hat{d}(x)$

• the average size of the global decision sets

$$\bar{d} = \frac{1}{\operatorname{card}\{U_{test}\}} \sum_{x \in U_{test}} \operatorname{card}\{\hat{d}(x)\}.$$

The classification accuracy is the ratio of correctly classified objects in the test set to the total number of objects in this set. An object is considered to be correctly classified when its correct decision class is contained in the generated decision set. The classification ambiguity accuracy is defined in the same way, except that an object is considered to be correctly classified only when the generated decision set contains exactly one class and this class is the correct one. The third measure allows us to assess how often draws occur and how many classes they involve.
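The three measures can be computed directly from their definitions. The sketch below assumes that the system returns a Python set of decisions for each test object. The function name `evaluate` and the toy labels are illustrative only.

```python
def evaluate(true_labels, decision_sets):
    """Compute acc, acc_ONE and the average decision-set size for a test set."""
    n = len(true_labels)
    # acc: the correct class is contained in the generated decision set.
    acc = sum(d in s for d, s in zip(true_labels, decision_sets)) / n
    # acc_ONE: the decision set consists of exactly the correct class.
    acc_one = sum(s == {d} for d, s in zip(true_labels, decision_sets)) / n
    # Average number of classes per decision set (1.0 means no draws).
    d_bar = sum(len(s) for s in decision_sets) / n
    return acc, acc_one, d_bar


true_labels = ["a", "b", "a", "c"]
decision_sets = [{"a"}, {"a", "b"}, {"b"}, {"c"}]  # toy output with one draw
print(evaluate(true_labels, decision_sets))  # (0.75, 0.5, 1.25)
```

The second object illustrates the difference between the first two measures: the draw {a, b} counts as correct for *acc* but not for *acc<sub>ONE</sub>*.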

The experiments were conducted according to the following scheme:


• Analysis of the baseline approach. Generating decision trees based on the local tables (without any conflict analysis or coalitions). The final decision is made by simple voting. Evaluating the baseline approach using a test set.

As mentioned above, Table 8 shows the coalitions generated during construction of the proposed model. As can be seen, in two cases no coalitions were generated—for the Satellite and Soybean data sets with three local tables. In most cases, coalitions were created and, as can be seen, they are not disjoint sets. This means that some local tables were involved in the creation of several aggregated tables. The reason for this is that a given local table is partially similar to different sets of local tables and provides additional knowledge to the construction of trees representing different concepts.

**Table 8.** Coalitions generated using the Pawlak conflict analysis model for dispersed data. LT denotes local table.


Table 9 presents the classification accuracy *acc*, the classification ambiguity accuracy *acc<sub>ONE</sub>* and the average size of the generated decision sets $\bar{d}$ obtained for all dispersed data sets. The table shows the results for both the proposed approach and the baseline approach. For each data set, the better result is indicated in bold. As can be seen, in the vast majority of cases better results are produced by the proposed model, which creates coalitions and recognizes the similarity of data stored in local tables.

To better visualize the differences in the results generated by the models, Figure 5 was prepared with the classification accuracy marked for each data set. As can be seen, the largest improvement in classification quality using the proposed approach was observed for the Soybean data set, where it is around 0.1. For the Vehicle Silhouettes data set, the improvement in most cases is around 0.03 (even greater in certain scenarios). For the Landsat Satellite data set, an improvement was also observed, but it is smaller, at around 0.015. Overall, for all data sets, the proposed approach gives a noticeable improvement over the baseline approach.


**Table 9.** Classification accuracy *acc*, classification ambiguity accuracy *acc<sub>ONE</sub>* and the average size of the generated decision sets $\bar{d}$ for all dispersed data sets.

In order to investigate whether the differences in accuracy between the proposed model and the baseline approach are significant, the results from Table 9 were used. Two dependent samples were created: one containing the results for the proposed model and one containing the results for the baseline approach. Each sample consisted of 13 observations, that is, results obtained for different data sets and numbers of local tables. The Wilcoxon test confirmed that the differences in accuracy between these two groups are significant, with *p* = 0.005.
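A paired comparison of this kind can be run with `scipy.stats.wilcoxon`, as in the sketch below. The two lists are made-up placeholder accuracies for 13 paired cases. They are not the values of Table 9, so the printed *p*-value is not expected to match the one reported above.

```python
from scipy.stats import wilcoxon

# Made-up paired accuracies for 13 (data set, dispersion) cases; NOT the values of Table 9.
proposed = [0.74, 0.72, 0.75, 0.71, 0.73, 0.88, 0.87, 0.89, 0.86, 0.85, 0.90, 0.83, 0.81]
baseline = [0.73, 0.70, 0.72, 0.67, 0.68, 0.82, 0.80, 0.81, 0.77, 0.75, 0.79, 0.71, 0.68]

# Two-sided Wilcoxon signed-rank test on the paired differences.
statistic, p_value = wilcoxon(proposed, baseline)
print(statistic, p_value)
```

The signed-rank test is a natural choice here because the two samples are dependent (same data set and dispersion in each pair) and no normality assumption is needed.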

Additionally, a comparative box plot of the accuracy values was created (Figure 6). We can observe an increase in accuracy when the proposed model is used. Both the position of the box and the median are clearly higher for the proposed model.

**Figure 6.** Box plot (median, first quartile Q1, third quartile Q3) of the accuracy *acc* for the proposed model and the baseline approach.

Furthermore, we also analyzed the time needed to generate decision trees in both approaches. In the baseline method, the time needed to generate trees directly from local tables was investigated, and in the proposed approach the time required to generate trees from aggregated tables was considered. Table 10 shows the execution times of the decision tree generation algorithms in the baseline approach and with coalitions.


**Table 10.** Execution times of the decision tree generation algorithms in the baseline approach and with coalitions.

The differences in execution times are substantial. The proposed model has a significantly lower time complexity, because creating coalitions means that fewer trees are built than when a decision tree is generated for each local table separately. This significantly reduces the time needed to make a final decision based on dispersed data.

Figure 7 illustrates the ratio of the execution time of the baseline approach to that of the proposed approach. As can be seen for the Satellite data set, in some cases the execution time of the proposed approach is more than 20 times shorter than that of the baseline approach. In general, the speed-up is greatest for the largest data set (Satellite).

In addition, for a smaller degree of dispersion (a smaller number of local tables), the reduction in execution time using the proposed approach is greater than for data with a larger degree of dispersion (a greater number of local tables). This is because, for a larger degree of dispersion, a greater number of coalitions is generated using the Pawlak conflict analysis model (as can be seen in Table 8).

**Figure 7.** Ratio of execution times of the algorithms implementing the baseline approach and the approach with coalitions.

All experiments were performed on a portable computer with the following technical specifications:


The code used for the analyzed approaches has been implemented in Python and all data-related calculations have been saved in a text document. Decision trees were built using the function implemented in the Scikit-learn library *tree.DecisionTreeClassifier(criterion = "gini")*. In all cases, the Gini index was used. The postpruning and prepruning methods were intentionally not applied, since the main goal of this study focused on analyzing how building coalitions of tables using the Pawlak conflict analysis model affects classification quality and model running time. Combining local tables into aggregated tables was shown to significantly improve classification quality. In addition, it also reduces the number of generated trees and thus reduces the time complexity of the method.

#### **4. Discussion**

The paper proposes a new method for classification based on dispersed data. This method is applicable when the same set of conditional attributes occurs in all local tables. The conditional attributes can be of different types, both qualitative and quantitative. The sets of objects in local tables can differ, and we do not examine whether identical objects occur in different local tables. The main idea behind the method is to aggregate tables that store similar values of conditional attributes. In order to determine which tables should be aggregated, a new method for generating characteristics of values stored in tables and a new method for using the Pawlak conflict analysis model are proposed. Next, a method for defining aggregated tables and a method for final decision-making are defined. It was shown that the proposed method significantly improves the quality of classification obtained from dispersed data compared with the approach in which aggregation of tables and formation of coalitions are not considered.

The main advantages of the proposed approach are:


The main limitations of the proposed approach are:


There are practically no parameters in the proposed model, since the Pawlak model has no parameters and the decision trees were built without prepruning or postpruning (pruning is planned for the next stage of future work). The only parameter we can consider is the degree of data dispersion. The decision tables were dispersed to varying degrees, into 3, 5, 7, 9 and 11 local tables. The dispersion was performed with respect to the objects, in stratified mode, while keeping the number of objects in the local tables equal. Figure 8 shows the classification accuracy as a function of the number of local tables.

**Figure 8.** Classification accuracy values in relation to the number of local tables: (**a**) for the baseline approach, (**b**) for the approach with coalitions.

In the case of the baseline method, for both the Soybean and the Vehicle data sets, an increase in the degree of data dispersion results in a deterioration of classification accuracy. For the Landsat Satellite data set, this relation is not observed. For the proposed approach, only for the Vehicle data set can it be stated that an increase in the degree of dispersion leads to a deterioration of classification accuracy. For the Soybean data set, the proposed method eliminates the negative effect of high dispersion on classification accuracy. Thus, it can be concluded that the proposed approach improves the quality of classification, especially in the case of high dispersion, where many local tables occur. In other words, the proposed model generally improves the quality of classification, but is particularly useful for data dispersed over a large number of local tables.

#### **5. Conclusions**

A new classification approach based on dispersed data was proposed in this paper. The main innovation lies in the proposal of a method that combines local decision tables into an aggregated table. For this purpose, a method based on the Pawlak conflict analysis model was proposed. The new approach was shown to improve both the quality of classification and the running time.

In future work, we plan to:


**Author Contributions:** Conceptualization, M.P.-K.; methodology, M.P.-K., K.K.; software, K.K.; validation, M.P.-K., K.K.; formal analysis, M.P.-K., K.K.; investigation, M.P.-K., K.K.; resources, M.P.-K.; writing—original draft preparation, M.P.-K.; writing—review and editing, M.P.-K., K.K.; visualization, M.P.-K., K.K.; supervision, M.P.-K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Data Availability Statement:** Publicly available data sets were analyzed in this study. These data can be found here: [38]. One data set has been artificially generated and a description of the process behind the artificial generation is presented in the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

