Sustainability
  • Article
  • Open Access

20 January 2023

Novel Features and Neighborhood Complexity Measures for Multiclass Classification of Hybrid Data

1
Centro de Investigación en Computación del Instituto Politécnico Nacional, Juan de Dios Bátiz s/n, Gustavo A. Madero, Mexico City 07738, Mexico
2
Centro de Innovación y Desarrollo Tecnológico en Cómputo del Instituto Politécnico Nacional, Juan de Dios Bátiz s/n, Gustavo A. Madero, Mexico City 07700, Mexico
3
School of Business, Deree—The American College of Greece, 6 Gravias Street, GR-153 42 Aghia Paraskevi, Athens, Greece
4
College of Engineering, Effat University, Jeddah 21478, Saudi Arabia
This article belongs to the Special Issue Knowledge Management in Healthcare

Abstract

The present capabilities for collecting and storing all kinds of data exceed the collective ability to analyze, summarize, and extract knowledge from them. Knowledge management aims to automatically organize a systematic process of learning. Most meta-learning strategies are based on determining data characteristics, usually by computing data complexity measures. Such measures describe data characteristics related to size, shape, density, and other factors. However, most data complexity measures in the literature assume the classification problem is binary (just two decision classes) and that the data is numeric and has no missing values. The main contribution of this paper is extending four data complexity measures to overcome these drawbacks and characterize multiclass, hybrid, and incomplete supervised data. We change the formulation of the Feature-based measures while maintaining the essence of the original measures, and we use a maximum similarity graph-based approach for designing the Neighborhood measures. We also use ordered weighted averaging (OWA) operators to avoid biases in the proposed measures. We included the proposed measures in the EPIC software for computational availability, and we computed them for publicly available multiclass, hybrid, and incomplete datasets. In addition, we analyzed the performance of the proposed measures, and we can confirm that they solve some of the biases of previous ones and natively handle mixed, incomplete, and multiclass data without any preprocessing.

1. Introduction

Several disciplines, such as Pattern Recognition, Machine Learning, Computational Intelligence, and Artificial Intelligence share an interest in automatic knowledge management strategies [1]. The latter is becoming an active research area useful for several daily-life aspects such as environmental concerns (i.e., clothing sustainability [2], fault detection in the subsea [3], and pollution analysis [4]), education (teacher training [5] and other educational applications [6]), and health (disease prediction [7], secure Internet of Things in healthcare [8] and other healthcare challenges [9]), among others.
There are numerous supervised classification algorithms, such as Neighborhood-based classifiers [10], Decision trees [11], Neural networks [12], Support vector machines [13], Associative classifiers [14], and Logical-combinatorial classifiers [15]. However, because of the No Free Lunch theorems [16], no algorithm will outperform all others for all problems and performance measures. That is why meta-learning, as a knowledge management technique, is the focus of several research efforts [17,18,19].
Most meta-learning strategies are based on determining data characteristics, usually by computing data complexity measures [20,21]. Such measures aim at describing data characteristics [22,23,24] related to size, shape, density, and others. However, the majority of the data complexity measures in the literature assume the classification problem is binary (just two decision classes), and that the data is numeric and has no missing values.
Unfortunately, in many real-life applications, such assumptions are not fulfilled. Real data is often hybrid (described by both numeric and categorical attributes) and can present an absence of information. In addition, several problems have multiple possible outcomes or decisions to make; they have multiple decision classes.
As stated before, data complexity measures [20] are usually defined for numeric, complete, and binary classification problems. Therefore, our aim is to extend the measures for characterizing multiclass, hybrid, and incomplete supervised data. The main contributions of this paper are the following:
  • We extend four data complexity measures for the multiclass classification scenario and for dealing with hybrid and incomplete data.
  • We include the proposed four measures in the EPIC software [25,26] for computational availability.
  • We compute the proposed measures for publicly available multiclass hybrid and incomplete datasets.
This paper is organized as follows: Section 2 reviews related work on data complexity measures. Section 3 introduces the extended data complexity measures, and Section 4 shows some properties of the proposed measures, as well as their computation over publicly available datasets. Finally, we present conclusions and future work.

3. Proposed Approach and Results

Our hypothesis is that data complexity measures can be extended to the multiclass scenario with hybrid and incomplete data (Figure 3).
Figure 3. Proposed approach for developing novel data complexity measures.
In the following, we describe the proposed extended measures. It is important to mention that all proposed measures are able to deal with multiclass, hybrid, and incomplete data.

3.1. Extended Feature-Based Measures

3.1.1. Extended Maximum Fisher’s Discriminant Ratio (F1_ext)

We wanted to maintain the idea behind F1 as a way of assessing the discriminant power of individual features. It is given by:
$$F1\_ext = \frac{1}{1 + \max_{i=1}^{m} r_i}$$

$$r_{i\_num} = \frac{\sum_{j=1}^{c} \left|\{x \in U \mid class(x)=j\}\right| \, \left(\mu_j^i - \mu^i\right)^2}{\sum_{j=1}^{c} \sum_{x \in U,\; class(x)=j} \left(x[i] - \mu_j^i\right)^2}$$

$$r_{i\_cat} = \frac{|overlap\_cat(A_i)|}{|range\_cat(A_i)|}$$

$$overlap\_cat(A_i) = \{\, v : \exists\, x, y \in U,\ v = x[i] = y[i] \neq\ ?,\ class(x) \neq class(y) \,\}$$

$$range\_cat(A_i) = \{\, v : \exists\, x \in U,\ x[i] = v,\ v \neq\ ? \,\}$$
For numerical features, μ^i is the mean of feature A_i, and μ_j^i is the mean of feature A_i considering only the instances in {x ∈ U | class(x) = j}. Both means are computed disregarding instances with missing values (?).
Our definition of overlap for categorical feature values extends [28] by enumerating the values appearing in different classes and disregarding missing values, and our definition of range considers all possible values of feature A_i. Using an efficient implementation, the computational complexity of our proposal is bounded by O(mn).
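The computation above can be sketched in Python. The data layout (instances as lists, `None` marking a missing value, and a set of categorical attribute indices) is an illustrative assumption, not the authors' EPIC implementation:

```python
def f1_ext(X, y, categorical):
    """Sketch of the extended maximum Fisher's discriminant ratio (F1_ext).

    X: list of instances (lists of attribute values; None = missing).
    y: class labels. categorical: set of categorical attribute indices.
    """
    m, classes = len(X[0]), sorted(set(y))
    ratios = []
    for i in range(m):
        if i in categorical:
            # overlap_cat: non-missing values appearing in more than one class
            by_class = {c: {x[i] for x, cl in zip(X, y)
                            if cl == c and x[i] is not None}
                        for c in classes}
            all_vals = set().union(*by_class.values())   # range_cat
            overlap = {v for v in all_vals
                       if sum(v in s for s in by_class.values()) > 1}
            ratios.append(len(overlap) / len(all_vals) if all_vals else 0.0)
        else:
            # between-class scatter over within-class scatter, skipping missing values
            vals = [(x[i], cl) for x, cl in zip(X, y) if x[i] is not None]
            mu = sum(v for v, _ in vals) / len(vals)
            num = den = 0.0
            for c in classes:
                cv = [v for v, cl in vals if cl == c]
                mu_c = sum(cv) / len(cv)
                num += len(cv) * (mu_c - mu) ** 2
                den += sum((v - mu_c) ** 2 for v in cv)
            ratios.append(num / den if den > 0 else float("inf"))
    return 1.0 / (1.0 + max(ratios))
```

For example, a well-separated numeric feature gives a large ratio and hence a small F1_ext, while a categorical feature whose values mostly repeat across classes pushes F1_ext toward one.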
The proposed measure deals with multiclass, hybrid, and incomplete data, representing an advance for data complexity analysis. However, like its predecessor F1, for numerical data our measure assumes that the linear boundary is unique and perpendicular to one of the feature axes. If a feature is separable only by more than one line, the measure does not capture such information (Figure 4).
Figure 4. Example of instances linearly separable (a) by one line and (b) by three lines. Note the values of F1_ext are 0.14 and 0.24, respectively.

3.1.2. Extended Volume of the Overlapping Region (F2_ext)

We extended the original F2 measure by using different overlapping ranges for numeric and categorical data. For numeric data, our formulation is close to the original but with extensions.
$$F2\_ext = \prod_{i=1}^{m} r_i$$

where:

$$r_i = \begin{cases} r_{i\_cat} & \text{if feature } i \text{ is categorical} \\ r_{i\_num} & \text{if feature } i \text{ is numeric} \end{cases}$$

$$r_{i\_num} = \frac{|overlap\_num(A_i)|}{|range\_num(A_i)|} = \frac{\max\{0,\ minmax(A_i) - maxmin(A_i)\}}{maxmax(A_i) - minmin(A_i)}$$

$$minmax(A_i) = \min_{j=1..c}\; \max\{\, x[i] : x \in U,\ x[i] \neq\ ?,\ class(x)=j \,\}$$

$$maxmin(A_i) = \max_{j=1..c}\; \min\{\, x[i] : x \in U,\ x[i] \neq\ ?,\ class(x)=j \,\}$$

$$maxmax(A_i) = \max_{j=1..c}\; \max\{\, x[i] : x \in U,\ x[i] \neq\ ?,\ class(x)=j \,\}$$

$$minmin(A_i) = \min_{j=1..c}\; \min\{\, x[i] : x \in U,\ x[i] \neq\ ?,\ class(x)=j \,\}$$

$$r_{i\_cat} = \frac{|overlap\_cat(A_i)|}{|range\_cat(A_i)|}$$
Both minimum and maximum values are computed, disregarding the instances with missing values (?). Our formulation solves the problem of dealing with multiple classes, as well as with missing and hybrid data. Using an efficient implementation, the computational complexity of our proposal is bounded by O ( m n ) .
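A minimal sketch of this computation follows, under the same illustrative data layout (instances as lists, `None` marking a missing value); it is an assumption-laden reading of the formulation, not the EPIC implementation:

```python
def f2_ext(X, y, categorical):
    """Sketch of the extended volume of the overlapping region (F2_ext)."""
    classes = sorted(set(y))
    product = 1.0
    for i in range(len(X[0])):
        if i in categorical:
            # fraction of non-missing values shared by more than one class
            by_class = {c: {x[i] for x, cl in zip(X, y)
                            if cl == c and x[i] is not None}
                        for c in classes}
            all_vals = set().union(*by_class.values())
            overlap = {v for v in all_vals
                       if sum(v in s for s in by_class.values()) > 1}
            r = len(overlap) / len(all_vals) if all_vals else 0.0
        else:
            # per-class extrema, skipping missing values
            maxs = [max(x[i] for x, cl in zip(X, y)
                        if cl == c and x[i] is not None) for c in classes]
            mins = [min(x[i] for x, cl in zip(X, y)
                        if cl == c and x[i] is not None) for c in classes]
            span = max(maxs) - min(mins)             # maxmax - minmin
            over = max(0.0, min(maxs) - max(mins))   # max{0, minmax - maxmin}
            r = over / span if span > 0 else 0.0
        product *= r
    return product
```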
As pointed out by Lorena et al. [27], the F2 value can become very small depending on the number of operands in Equation (17); that is, it is highly dependent on the number of features a dataset has. Our extension does not avoid this situation, and it shares the limitations of F1_ext (Figure 5).
Figure 5. Example of the curse of dimensionality for the F2 measure. (a) Dataset with two overlapping instances; assume all attributes have the same values. (b) Results for one attribute. (c) Results for two attributes. (d) Results for ten attributes. Note how the F2_ext values rapidly decrease as the number of attributes increases, even though there are only two overlapping instances in all cases.

3.1.3. Extended Maximum Individual Feature Efficiency (F3_ext)

We draw inspiration from the extension of the F3 measure in [28], while maintaining the idea of [27] that the measure should provide lower values for simpler problems. Our proposal is as follows:
$$F3\_ext = \min_{i=1}^{m} \frac{no(A_i)}{n}$$

where no(A_i) is the number of overlapping instances according to A_i:

$$no(A_i) = \begin{cases} |overlap\_cat(A_i)| & \text{if feature } i \text{ is categorical} \\ intersect(A_i) & \text{if feature } i \text{ is numeric} \end{cases}$$

$$intersect(A_i) = \begin{cases} n & \text{if } minmax(A_i) = maxmin(A_i) \\ \sum_{x \in U} I\big(maxmin(A_i) < x[i] < minmax(A_i)\big) & \text{otherwise} \end{cases}$$

where I is the indicator function.
For this measure, our formulation solves the problem of dealing with multiple classes, as well as with missing and hybrid data. In addition, it solves the F3 drawback of not penalizing attributes having the same (or very similar) values for all instances (Figure 6).
Figure 6. The drawback of F3 is solved by F3_ext. We use a two-dimensional dataset, having zero for all instances in the A 2 attribute, with two overlapping instances. Note the values of F3 and F3_ext are 0.00 and 0.50, respectively.
Using an efficient implementation, the computational complexity of our proposal is bounded by O ( m n ) .
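A sketch of F3_ext under the same illustrative data layout follows. Two points are our reading of the formulation and should be treated as assumptions: for categorical features we count the instances whose value lies in the overlap set, and the degenerate numeric case (the overlap bounds coincide, e.g. a constant attribute as in Figure 6) counts all n instances as overlapping:

```python
def f3_ext(X, y, categorical):
    """Sketch of the extended maximum individual feature efficiency (F3_ext)."""
    n, classes = len(X), sorted(set(y))
    best = 1.0
    for i in range(len(X[0])):
        if i in categorical:
            by_class = {c: {x[i] for x, cl in zip(X, y)
                            if cl == c and x[i] is not None}
                        for c in classes}
            all_vals = set().union(*by_class.values())
            overlap = {v for v in all_vals
                       if sum(v in s for s in by_class.values()) > 1}
            no = sum(1 for x in X if x[i] in overlap)
        else:
            maxs = [max(x[i] for x, cl in zip(X, y)
                        if cl == c and x[i] is not None) for c in classes]
            mins = [min(x[i] for x, cl in zip(X, y)
                        if cl == c and x[i] is not None) for c in classes]
            minmax, maxmin = min(maxs), max(mins)
            if minmax == maxmin:
                # degenerate region (e.g. constant attribute): fully overlapping
                no = n
            else:
                no = sum(1 for x in X
                         if x[i] is not None and maxmin < x[i] < minmax)
        best = min(best, no / n)
    return best
```

Note how a constant second attribute contributes a ratio of one instead of the misleading zero that plain F3 would report.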

3.2. Linearity Measures

Linearity measures are based on the idea of designing a hyperplane able to separate the decision classes. Because we are working with categorical and incomplete data, there is no direct way of using the notion of planes with such data. In future work, we will explore other topological ideas that resemble linearity and can deal with hybrid and incomplete data.

3.3. Neighborhood Measures

To extend Neighborhood measures, we propose using a dissimilarity function able to deal with mixed and incomplete data, such as HEOM [32]. We also propose using the normalized version (NHEOM) to guarantee the distance function to return values in [0, 1]. Let m a x i and m i n i be the maximum and minimum values of the numeric attribute A i . The NHEOM is as follows:
$$NHEOM(x, y) = \sqrt{\frac{\sum_{i=1}^{m} diss_i(x[i], y[i])^2}{m}}$$

$$diss_i(x[i], y[i]) = \begin{cases} 1 & \text{if } x[i] =\ ? \ \lor\ y[i] =\ ? \\ overlap(x[i], y[i]) & \text{if } A_i \text{ is categorical} \\ rd\_diss(x[i], y[i]) & \text{if } A_i \text{ is numeric} \end{cases}$$

$$overlap(x[i], y[i]) = \begin{cases} 0 & \text{if } x[i] = y[i] \\ 1 & \text{otherwise} \end{cases}$$

$$rd\_diss(x[i], y[i]) = \frac{|x[i] - y[i]|}{max_i - min_i}$$
As shown in Equation (21), the NHEOM function operates attribute by attribute, using one of three cases for each attribute: for missing values, it returns one; for complete categorical values, the overlap function considers two values similar only if they are equal; and for numerical values, the rd_diss function compares them by their difference relative to the maximum difference between values. The normalization (dividing by the square root of the number of attributes) guarantees NHEOM lies in [0, 1].
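The three cases can be sketched directly from the definition; the parameter names and the `ranges` dictionary (numeric attribute index to its (min, max)) are illustrative assumptions:

```python
import math

def nheom(a, b, categorical, ranges):
    """Sketch of the normalized HEOM dissimilarity [32].

    a, b: instances (lists; None marks a missing value).
    categorical: set of categorical attribute indices.
    ranges: dict mapping each numeric attribute index to (min_i, max_i).
    """
    total = 0.0
    for i in range(len(a)):
        if a[i] is None or b[i] is None:
            d = 1.0                                   # missing value case
        elif i in categorical:
            d = 0.0 if a[i] == b[i] else 1.0          # overlap
        else:
            lo, hi = ranges[i]
            d = abs(a[i] - b[i]) / (hi - lo) if hi > lo else 0.0  # rd_diss
        total += d * d
    return math.sqrt(total / len(a))                  # keeps the value in [0, 1]
```

Identical instances yield zero, and an instance pair differing maximally on every attribute yields one.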
We maintain the original formulation of the N1, N2, and LSC measures, simply using a dissimilarity function able to deal with hybrid and incomplete data. For the N3 measure, to avoid bias toward the majority class, we change the formulation to consider the average per-class error.
$$N3\_ext = \frac{1}{c} \sum_{i=1}^{c} \frac{\sum_{x \in U \mid class(x) = d_i} I\big(class(NN(x)) \neq class(x)\big)}{\left|\{x \in U \mid class(x) = d_i\}\right|}$$
This formulation solves the bias by considering the errors for each decision class (Figure 7). It maintains the ability to handle multiclass, hybrid, and incomplete data.
Figure 7. Solving the N3 bias towards the majority class with the new formulation. (a) Balanced dataset with ten instances, two of them misclassified. (b) Imbalanced dataset with ten instances, two of them misclassified. Note how the proposed measure considers there is a full class misclassified in (b).
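The per-class averaging can be sketched as follows. We assume a leave-one-out nearest neighbor, as is usual for N3, and an arbitrary dissimilarity function (e.g. NHEOM for hybrid, incomplete data); everything else about the layout is illustrative:

```python
def n3_ext(X, y, diss):
    """Sketch of N3_ext: leave-one-out 1-NN error averaged per class, then
    averaged over classes, so a fully misclassified minority class weighs
    as much as a majority class."""
    classes = sorted(set(y))
    per_class_err = []
    for c in classes:
        idx = [k for k, cl in enumerate(y) if cl == c]
        errors = 0
        for k in idx:
            # nearest neighbor of instance k, excluding itself
            nn = min((j for j in range(len(X)) if j != k),
                     key=lambda j: diss(X[k], X[j]))
            errors += int(y[nn] != y[k])
        per_class_err.append(errors / len(idx))
    return sum(per_class_err) / len(classes)
```

With one minority instance misclassified out of four total instances, plain N3 would report 0.25, while the per-class average reports 0.5, reflecting that an entire class is misclassified.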
Due to the difficulties of interpolation in hybrid and incomplete data, we chose not to extend the N4 measure. Similarly, the T1 measure was not considered because of the impossibility of computing hyperspheres with categorical data.

4. Discussion

For discussion, we first analyze the behavior of the proposed measures over synthetic data (Section 4.1), and then compute the measures over publicly available datasets (Section 4.2). All experiments were executed on a Lenovo ThinkPad X1 laptop running Windows 10, with an Intel(R) Core(TM) i7-8550U CPU at 1.80 GHz and 16 GB of RAM. The laptop was not exclusively dedicated to the experiments (all were executed at low priority).

4.1. Synthetic Data

We first supply three explanatory examples for the computation of the proposed measures. The first two use two-dimensional datasets with no missing values (Figure 8 and Figure 9), so that the data distribution can be visualized, and the third consists of a synthetic hybrid and incomplete dataset adapted from the well-known play tennis dataset (Figure 10). The results of the data complexity measures over the example datasets are shown in Table 2.
Figure 8. Iris2D dataset. Note that classes are compact, balanced, and fully separated.
Figure 9. Clover dataset. Classes are separated but imbalanced and with disjoints. The dataset resembles a clover flower.
Figure 10. Tennis dataset. This is a modified version of the well-known tennis dataset by Quinlan. Note that it has hybrid numeric and categorical data with missing values.
Table 2. Results of the data complexity measures for synthetic datasets.

4.2. Real Data

In this section, we compute the measures over publicly available multiclass, hybrid, and incomplete datasets. We added the proposed measures to the EPIC software [25,26]. Table 3 summarizes the measures used in the experiments, clarifying their type, proportion, boundaries, computational complexity, and whether or not each is a newly proposed measure.
Table 3. Description of the proposed data complexity measures and each measure’s ability to deal with hybrid and incomplete data.
We selected 15 datasets publicly available in the KEEL repository [33]. All datasets correspond to real-life hybrid and incomplete problems (Table 4); all of them are partitioned using stratified five-fold cross-validation.
Table 4. Description of the real datasets used.
We also provide the non-error rate (NER) results for the Nearest Neighbor classifier for each dataset. NER is computed as [34]:
$$NER = \frac{\sum_{g=1}^{G} Sn_g}{G}$$

where

$$Sn_g = \frac{c_{gg}}{n_g}$$

The NER measure assumes a confusion matrix of G classes (Figure 11), where c_gg is the number of correctly classified instances of class g and n_g is the total number of instances of that class; the measure is robust to imbalanced data.
Figure 11. Confusion matrix of G classes.
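As a one-line sketch, NER is the mean of the per-class sensitivities read off the diagonal of the confusion matrix (rows taken as true classes, columns as predicted; the list-of-lists layout is an illustrative assumption):

```python
def ner(conf):
    """Sketch of the non-error rate [34] over a G x G confusion matrix.

    conf[g][g] counts correct classifications of class g; each row sums
    to the number of instances of that true class.
    """
    G = len(conf)
    # average of per-class sensitivities Sn_g = c_gg / n_g
    return sum(conf[g][g] / sum(conf[g]) for g in range(G)) / G
```

Because each class contributes equally regardless of its size, a classifier that ignores a small class is penalized as heavily as one that ignores a large class.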
Table 5 presents the results of the data complexity measures' computation for the Feature-based and Neighborhood measures, and Table 6 shows the execution times (in milliseconds). The most complex dataset according to each measure is highlighted in bold.
Table 5. Results for Feature-based and Neighborhood and Dimensionality measures.
Table 6. Execution time (in milliseconds) to compute the data complexity measures.
As shown, the hardest dataset is marketing; this complexity is reflected in its low non-error rate in Table 4. The F2_ext measure offers little information, being close to zero for 14 of the studied datasets. The F3_ext measure indicates that, for 12 of the analyzed datasets, there is at least one attribute with low overlap.
The datasets with no clear separation are horse-colic, mammographic, and wisconsin. The N1, N2, and N3_ext measures correlate well with the results of the Nearest Neighbor classifier (Table 4), while LSC shows values near one for all datasets.
The Feature-based measures are very fast, in contrast with the Neighborhood-based measures, which depend on computing the dissimilarity matrix between instances. For the marketing and mushroom datasets, the Neighborhood measures took up to two minutes. However, we used a sequential implementation of the dissimilarity computation, and this time can be significantly reduced with parallel computation.
In practical terms, when we want to assess the complexity of a given dataset, we can compute all the measures in less than three minutes, even for the biggest datasets. We consider this timeframe suitable for real-world data and the proposed measures computationally feasible, even with a sequential implementation.
The limitations of the proposed measures are as follows:
  • Like its predecessor F1, the F1_ext measure assumes for numerical data that the linear boundary is unique and perpendicular to one of the feature axes. If a feature is separable only by more than one line, it does not capture such information.
  • The F2_ext measure can become very small depending on the number of operands in Equation (17); that is, it is highly dependent on the number of features a dataset has.
  • Neighborhood measures are bounded by O ( m n 2 ) . For datasets with a huge number of instances, they can be computationally expensive.

5. Conclusions

Our hypothesis, that data complexity measures can be extended to the multiclass scenario with hybrid and incomplete data, has been verified. We have introduced four data complexity measures for multiclass classification problems, all able to deal with hybrid (numeric and categorical) and missing data. This allows assessing the complexity of a dataset in advance, before using it to train a classifier. We included the proposed measures in the EPIC software [25,26], and we computed them for several publicly available datasets with satisfactory results.
In the experiments with real datasets, we found that the Feature-based measures are very fast, in contrast with the Neighborhood-based measures, which depend on computing the dissimilarity matrix between instances. Specifically, for the marketing and mushroom datasets, the Neighborhood measures took up to two minutes, while only fractions of a second were required for the others.
In future work, we intend to design new Linearity-based measures, built on topological ideas resembling linearity, that can deal with hybrid and incomplete data. We also want to mitigate the F2_ext measure's susceptibility to the curse of dimensionality. In addition, for the Neighborhood-based measures, we will use a parallel implementation to reduce the time required to compute the dissimilarity matrix between instances.
A very relevant line of future work is a deeper analysis of all the complexity measures available in the current state of the art. We will then apply formal mathematical methods to specify, build, and verify software and hardware systems, focusing on their application to machine learning solutions. To do this, we will rely on research that clearly explains the phases of machine learning and the formal methods available to verify each phase [30].

Author Contributions

Conceptualization, F.J.C.-U. and Y.V.-R.; methodology, Y.V.-R.; software, Y.V.-R.; formal analysis, C.Y.-M. and M.L.; investigation, F.J.C.-U.; data curation, F.J.C.-U.; writing—original draft preparation, F.J.C.-U. and Y.V.-R.; writing—review and editing, C.Y.-M. and M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All used real datasets are available at https://www.mdpi.com/ethics (accessed on 1 December 2022).

Acknowledgments

The authors would like to thank the Instituto Politécnico Nacional (Secretaría Académica, Comisión de Operación y Fomento de Actividades Académicas, Secretaría de Investigación y Posgrado, Centro de Investigación en Computación, and Centro de Innovación y Desarrollo Tecnológico en Cómputo), the Consejo Nacional de Ciencia y Tecnología, and Sistema Nacional de Investigadores for their economic support to developing this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shetty, S.H.; Shetty, S.; Singh, C.; Rao, A. Supervised Machine Learning: Algorithms and Applications. In Fundamentals and Methods of Machine and Deep Learning: Algorithms, Tools and Applications; Singh, P., Ed.; Wiley: Hoboken, NJ, USA, 2022; pp. 1–16. [Google Scholar]
  2. Satinet, C.; Fouss, F. A Supervised Machine Learning Classification Framework for Clothing Products’ Sustainability. Sustainability 2022, 14, 1334. [Google Scholar] [CrossRef]
  3. Eastvedt, D.; Naterer, G.; Duan, X. Detection of faults in subsea pipelines by flow monitoring with regression supervised machine learning. Process Saf. Environ. Prot. 2022, 161, 409–420. [Google Scholar] [CrossRef]
  4. Liu, X.; Lu, D.; Zhang, A.; Liu, Q.; Jiang, G. Data-Driven Machine Learning in Environmental Pollution: Gains and Problems. Environ. Sci. Technol. 2022, 56, 2124–2133. [Google Scholar] [CrossRef] [PubMed]
  5. Voulgari, I.; Stouraitis, E.; Camilleri, V.; Karpouzis, K. Artificial Intelligence and Machine Learning Education and Literacy: Teacher Training for Primary and Secondary Education Teachers. In Handbook of Research on Integrating ICTs in STEAM Education; IGI Global: Hershey, PA, USA, 2022; pp. 1–21. [Google Scholar]
  6. Aksoğan, M.; Atici, B. Machine Learning applications in education: A literature review. In Education & Science 2022; EFE Academy: Jaipur, India, 2022; p. 27. [Google Scholar]
  7. Rezapour, M.; Hansen, L. A machine learning analysis of COVID-19 mental health data. Sci. Rep. 2022, 12, 14965. [Google Scholar] [CrossRef]
  8. Aitzaouiat, C.E.; Latif, A.; Benslimane, A.; Chin, H.-H. Machine Learning Based Prediction and Modeling in Healthcare Secured Internet of Things. Mob. Netw. Appl. 2022, 27, 84–95. [Google Scholar] [CrossRef]
  9. Alanazi, A. Using machine learning for healthcare challenges and opportunities. Inform. Med. Unlocked 2022, 30, 100924. [Google Scholar] [CrossRef]
  10. Hu, Q.; Yu, D.; Xie, Z. Neighborhood classifiers. Expert Syst. Appl. 2008, 34, 866–876. [Google Scholar] [CrossRef]
  11. Kotsiantis, S.B. Decision trees: A recent overview. Artif. Intell. Rev. 2013, 39, 261–283. [Google Scholar] [CrossRef]
  12. Abiodun, O.I.; Jantan, A.; Omolara, A.E.; Dada, K.V.; Umar, A.M.; Linus, O.U.; Arshad, H.; Kazaure, A.A.; Gana, U.; Kiru, M.U. Comprehensive review of artificial neural network applications to pattern recognition. IEEE Access 2019, 7, 158820–158846. [Google Scholar] [CrossRef]
  13. Cervantes, J.; Garcia-Lamont, F.; Rodríguez-Mazahua, L.; Lopez, A. A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing 2020, 408, 189–215. [Google Scholar] [CrossRef]
  14. Yáñez-Márquez, C.; López-Yáñez, I.; Aldape-Pérez, M.; Camacho-Nieto, O.; Argüelles-Cruz, A.J.; Villuendas-Rey, Y. Theoretical foundations for the alpha-beta associative memories: 10 years of derived extensions, models, and applications. Neural Process. Lett. 2018, 48, 811–847. [Google Scholar] [CrossRef]
  15. Martínez-Trinidad, J.F.; Guzmán-Arenas, A. The logical combinatorial approach to pattern recognition, an overview through selected works. Pattern Recognit. 2001, 34, 741–751. [Google Scholar] [CrossRef]
  16. Wolpert, D.H. The supervised learning no-free-lunch theorems. In Soft Computing and Industry; Springer: London, UK, 2002; pp. 25–42. [Google Scholar]
  17. Luengo, J.; Herrera, F. An automatic extraction method of the domains of competence for learning classifiers using data complexity measures. Knowl. Inf. Syst. 2015, 42, 147–180. [Google Scholar] [CrossRef]
  18. Ma, Y.; Zhao, S.; Wang, W.; Li, Y.; King, I. Multimodality in meta-learning: A comprehensive survey. Knowl.-Based Syst. 2022, 250, 108976. [Google Scholar] [CrossRef]
  19. Huisman, M.; Van Rijn, J.N.; Plaat, A. A survey of deep meta-learning. Artif. Intell. Rev. 2021, 54, 4483–4541. [Google Scholar] [CrossRef]
  20. Camacho-Urriolagoitia, F.J.; Villuendas-Rey, Y.; López-Yáñez, I.; Camacho-Nieto, O.; Yáñez-Márquez, C. Correlation Assessment of the Performance of Associative Classifiers on Credit Datasets Based on Data Complexity Measures. Mathematics 2022, 10, 1460. [Google Scholar] [CrossRef]
  21. Cano, J.-R. Analysis of data complexity measures for classification. Expert Syst. Appl. 2013, 40, 4820–4831. [Google Scholar] [CrossRef]
  22. Barella, V.H.; Garcia, L.P.; de Souto, M.C.; Lorena, A.C.; de Carvalho, A.C. Assessing the data complexity of imbalanced datasets. Inf. Sci. 2021, 553, 83–109. [Google Scholar] [CrossRef]
  23. Ho, T.K.; Basu, M. Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 289–300. [Google Scholar]
  24. Bello, M.; Nápoles, G.; Vanhoof, K.; Bello, R. Data quality measures based on granular computing for multi-label classification. Inf. Sci. 2021, 560, 51–67. [Google Scholar] [CrossRef]
  25. Hernández-Castaño, J.A.; Villuendas-Rey, Y.; Camacho-Nieto, O.; Yáñez-Márquez, C. Experimental platform for intelligent computing (EPIC). Comput. Y Sist. 2018, 22, 245–253. [Google Scholar] [CrossRef]
  26. Hernández-Castaño, J.A.; Villuendas-Rey, Y.; Nieto, O.C.; Rey-Benguría, C.F. A New Experimentation Module for the EPIC Software. Res. Comput. Sci. 2018, 147, 243–252. [Google Scholar] [CrossRef]
  27. Lorena, A.C.; Garcia, L.P.; Lehmann, J.; Souto, M.C.; Ho, T.K. How Complex is your classification problem? A survey on measuring classification complexity. ACM Comput. Surv. (CSUR) 2019, 52, 1–34. [Google Scholar] [CrossRef]
  28. Cummins, L. Combining and Choosing Case Base Maintenance Algorithms; University College Cork: Cork, Ireland, 2013. [Google Scholar]
  29. Seshia, S.A.; Sadigh, D.; Sastry, S.S. Toward verified artificial intelligence. Commun. ACM 2022, 65, 46–55. [Google Scholar] [CrossRef]
  30. Krichen, M.; Mihoub, A.; Alzahrani, M.Y.; Adoni, W.Y.H.; Nahhal, T. Are Formal Methods Applicable To Machine Learning And Artificial Intelligence? In Proceedings of the 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 9–11 May 2022; pp. 48–53. [Google Scholar]
  31. Cios, K.J.; Swiniarski, R.W.; Pedrycz, W.; Kurgan, L.A. The knowledge discovery process. In Data Mining; Springer: Boston, MA, USA, 2007; pp. 9–24. [Google Scholar]
  32. Wilson, D.R.; Martinez, T.R. Improved heterogeneous distance functions. JAIR 1997, 6, 1–34. [Google Scholar] [CrossRef]
  33. Alcalá-Fdez, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
  34. Ballabio, D.; Grisoni, F.; Todeschini, R. Multivariate comparison of classification performance measures. Chemom. Intell. Lab. Syst. 2018, 174, 33–44. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
