Article

Could You Understand Me? The Relationship among Method Complexity, Preprocessing Complexity, Interpretability, and Accuracy

Lívia Kelebercová, Michal Munk and František Forgáč
Department of Informatics, Faculty of Natural Science and Informatics, Constantine the Philosopher University in Nitra, Trieda Andreja Hlinku 1, 949 74 Nitra, Slovakia
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(13), 2922; https://doi.org/10.3390/math11132922
Submission received: 6 May 2023 / Revised: 7 June 2023 / Accepted: 27 June 2023 / Published: 29 June 2023
(This article belongs to the Section Mathematics and Computer Science)

Abstract

The need to train experts who are able to apply machine learning methods for knowledge discovery is increasing. Building an effective machine learning model requires understanding how the individual methods work and what they demand in terms of data preparation, and it is equally important to be able to interpret the acquired knowledge. This article presents an experiment comparing the opinions of 42 students of a course called Introduction to Machine Learning on the method complexity, preprocessing complexity and interpretability of symbolic, subsymbolic and statistical methods with the accuracy of the individual methods on a classification task. The methodology of the experiment consists of applying various techniques to search for optimal models, whose accuracy is subsequently compared with the results of a knowledge test on machine learning methods and with the students' opinions on their complexity. Based on the performed non-parametric and parametric statistical tests, we reject the null hypothesis, which claims that there is no statistically significant difference in the evaluation of the individual methods in terms of their complexity/demandingness, the complexity of data preprocessing, the comprehensibility of the acquired knowledge and the correctness of the classification.

1. Introduction

In recent years, the amount of biomedical data has grown significantly. One possible explanation is the spread of new technologies, such as electronic health records, genomics and wearable devices, which collect more data than ever before. These technologies allow the capture of large amounts of data that were previously unavailable, such as patient health histories, genomic data and real-time physiological measurements. Another driver is personalized medicine, an approach to healthcare that aims to tailor medical treatment and interventions to the individual characteristics of each patient at the right time.
The massive increase in biomedical data [1] has resulted in the need to process these data quickly, efficiently and accurately. Manual data analysis performed by a medical expert might be more accurate but is too slow for the growing volume of data; this gave rise to a new application area of artificial intelligence called computer medicine.
This subfield deals with the application of various areas of computer science, such as machine learning, data mining, artificial intelligence and computational biology, to advance pharmaceutical research, diagnosis, treatment and patient care. In this way, it is possible, for example, to predict the reaction to a drug at the level of individual patients, perform clinical diagnostics or predict a pandemic.
With the development of technologies that can collect medical data, the need to educate experts who will be able to work in the field of computer medicine is also growing. Based on the CRISP-DM methodology [2], the tasks of computer medicine fall under classification and prediction [3,4]. Thus, the ability to solve tasks such as the classification of cancer diseases is closely related to how well the student understands the models that can be used to achieve the goal. Although implementing classification and regression models is nowadays a relatively simple task that can be accomplished with a few lines of code using Python libraries, the resulting accuracy of the model on test data depends on whether the student understands the given method and therefore knows what steps to take in data preparation, predictor selection, the training strategy and the optimization of hyperparameters for the given model.
In this context, we decided to conduct an experiment with students who had completed a course on machine learning fundamentals. In this course, 42 students learned the basics of symbolic, subsymbolic and statistical machine learning methods. After completing the course, the students could participate in our experiment, the purpose of which was to show them how to apply, in practice on medical data, the theoretical knowledge gained in the previous course. After the additional lectures, we obtained feedback from the students in the form of a test of theoretical knowledge of the individual methods, to find out how well they understand them, and of their opinions expressed on a rating scale concerning data preparation and the interpretation of the acquired knowledge. Based on these data, we examined the relationship between the accuracy of the individual methods, their comprehensibility and their complexity in terms of data preparation and interpretation of the acquired knowledge.
In Section 2, we mention interesting studies that addressed a similar issue. In Section 3, we describe the methodology we used during our research, and the results are described in Section 4. The last two sections are a discussion and conclusion of our results.

2. Related Work

“Interpretability is the degree to which a human can understand the cause of a decision” [5]. In other words, the higher the degree of interpretability of a model, the easier it is to understand the model’s results, while accuracy is the ability of the model to make correct predictions [6]. Another reason why interpretability is important is that understanding a model greatly increases its deployability [7]. The importance of interpretability lies in the fact that in most cases, for example, in clinical diagnostics, we do not want to know just the exact predicted class but also the reason behind the prediction.
The interpretability of the results produced by machine learning models has been of interest since the 1980s, when Michie [8] stated that the comprehensibility of knowledge gained from machine learning algorithms is not measurable. In the following years, Cunningham et al. [9,10] wrote two papers in which they proposed visualization techniques to help non-specialist users understand production rules. In agreement with Bratko [11], who studied the accuracy and interpretability of models on medical data, the formalization of the interpretability of the obtained knowledge represented a new challenge in the field of machine learning.
It did not take long for the topic of the interpretability of machine learning results to attract other researchers. Some years after the formalization of interpretability was posed as a challenge, a relatively comprehensive study appeared: Zurada [12] compared the classification performance and interpretability of logistic regression, neural networks, support vector machines, three decision trees and case-based reasoning. This study showed that decision trees are a good option for achieving a balance between accuracy and interpretability.
One year later, Johansson et al. [6] observed the relationship between interpretability and accuracy on sixteen biopharmaceutical classification tasks. The experimental results showed that tree-based models achieve high interpretability and that, in general, using decision trees, especially in ensemble techniques, does not significantly reduce predictive performance.
From the point of view of ElShawi et al. [13,14], although the previous studies proved that tree-based models perform well in terms of interpretability thanks to their graph-based structure, which is quite easy to understand, it might be difficult to trust these models in the context of biomedical data due to the lack of explanation of their predictions. An interesting pedagogically oriented study was carried out by Sulmont et al. [15,16]. They proved that it is possible to teach machine learning to people with a limited mathematical or computer science background. However, these students still have problems with interpreting results, which inspired us to take a closer look at whether symbolic, subsymbolic or statistical methods are more complicated in terms of interpretability.
The explanation of data is one of the most important steps in the CRISP-DM methodology, and the ability to explain the data depends on the data themselves. Gonda et al. [17] conducted an interesting experiment, which proved that working with real data allows students to create a link between theory and practice. We assume that working with real-world data can also improve the interpretability of the acquired knowledge. For this reason, we used a real-world dataset instead of creating synthetic data. Since we have categorized machine learning methods as symbolic, subsymbolic and statistical, we can assume that subsymbolic and statistical methods will be more difficult for students to understand, as they require an understanding of the mathematical principles on which these methods are based. The ability to understand algorithms based on mathematics was studied by Gonda et al. [18], who conducted research on algorithmic graph theory students in order to monitor the change in students’ motivation to learn algorithms. The authors demonstrated that the development of students’ computational thinking is already possible in the teaching of mathematical subjects.
A recent study by Gao and Guan claims that interpretability is an important research direction, one that is worth further investment [19]. This is also confirmed by a study from last year on the interpretability of the results of machine learning methods in malware detection, which claims that the best interpretable model is Gaussian Naïve Bayes [20]. A notable study by Upadhyaya et al. took a closer look at interpretability in the health domain; they obtained remarkable results with the random method [21]. Beisbart and Räz dealt with interpretability from a more philosophical point of view. They proposed a systematization in terms of four tasks for philosophers: clarify the notion of interpretability, explain its value, provide frameworks for thinking about it and explore its important features to adjust our expectations about it [22]. Another notable recent study dealing with the pedagogical aspects of machine learning was written by Hazzan and Mike [23]. They discussed challenges in the teaching of machine learning and provided a framework for teaching these concepts.
Although many researchers have evaluated machine learning methods according to various criteria, we have not found an experiment that examines students’ knowledge in order to determine the degree of complexity of the individual methods in the context of a classification task on biomedical data, while also obtaining exact feedback from the students in the form of their opinions on the complexity of the individual algorithms in terms of data preparation and the interpretation of the acquired knowledge.

3. Methodology

This paper examines how difficult individual classification algorithms are for students in terms of comprehensibility, data preparation and interpretation of the acquired knowledge. To answer this research question, we conducted an experiment with students who had already completed a course on machine learning fundamentals, in which they learned the basics of symbolic (Decision Tree—DT, K-Nearest Neighbors—KNN), subsymbolic (Support Vector Machine—SVM) and statistical machine learning methods (Logistic Regression—LR, Naïve Bayes—NB).
The first part of our experiment was to show them how to apply, in practice on medical data, the theoretical knowledge gained in the previous course. For this, we used the breast cancer dataset [24], which consists of records computed from a digitized fine needle aspiration (FNA) image of a breast mass, together with information on whether the case is classified as benign or malignant. After completing the additional lectures, the students were tested on the theoretical knowledge of the individual methods to find out how well they understood them, and we asked them to rate the difficulty of each algorithm in terms of comprehensibility, data preparation and interpretability of the acquired knowledge. We then compared these ratings with the accuracy, F1-score and AUC obtained by each model with its best parameters, which we found by applying grid search with 10-fold cross-validation on the previously mentioned dataset. In the following subsections, we describe each step of our methodology in more detail.
During the additional lectures, we taught students to construct and evaluate machine learning models using CRISP-DM methodology. The CRISP-DM methodology consists of six phases:
  • Business understanding;
  • Data understanding;
  • Data preparation;
  • Modelling;
  • Evaluation;
  • Deployment.
The first phase of the CRISP-DM methodology is aimed at understanding the objectives of the problem formulated from the perspective of the client and at determining the knowledge discovery task, i.e., the type of problem from the point of view of data modelling [25]. Regarding the breast cancer dataset [24], the knowledge discovery task was classification, and the goal was to determine whether a new case should be classified as benign or malignant.
Our dataset consists of 569 records that are calculated from a digitized fine needle aspiration (FNA) image of the breast mass, along with information on whether the case is classified as benign or malignant. The dataset contains these 12 main attributes:
  • ID—the numeric identifier of the case;
  • Diagnosis—our target binary variable representing whether the case is considered benign or malignant;
  • Radius—mean of the distances from the centre to points on the perimeter;
  • Texture—standard deviation of grey scaled values;
  • Perimeter;
  • Area;
  • Smoothness in radius;
  • Compactness—perimeter²/area − 1.0;
  • Concavity—the severity of concave portions of the contour;
  • Concave points—number of concave portions of the contours;
  • Symmetry;
  • Fractal dimension.
The remaining attributes represent the mean, standard error and the worst (mean of the three largest values) of attributes 3–12. Our target dependent variable Y is the diagnosis column, which we encoded into a binary variable where 0 represents benign cases and 1 represents malignant cases. The rest of the attributes, excluding the ID, which has no meaning for modelling, are our potential features. Later, we consider feature selection for each machine learning method.
Figure 1 shows that the dataset is imbalanced; it contains 357 cases classified as benign (0) and 212 cases classified as malignant (1). The dataset does not contain missing values, and all the features are continuous.
According to Munk et al., data preparation can strongly influence the final interpretability of the results, but it is the most time-consuming step [26,27]. First, we dropped the ID column from our dataset. The second step, already visible in Figure 1, was the encoding of categorical variables. In our case, the only categorical column was the diagnosis, which consisted of two values, B standing for benign and M standing for malignant, and we encoded them into binary values. As soon as this column was encoded, we could start to construct the Decision Tree and Random Forest, because they do not require any other preprocessing steps. To achieve better results, we can process the data further.
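As a minimal illustration of these first steps (the file name and exact column names of the Kaggle export are assumptions), the loading and encoding can be sketched as follows:

```python
import pandas as pd

# Load the Kaggle export of the Wisconsin breast cancer data
# (file name assumed; the export may also contain an empty trailing column).
df = pd.read_csv("breast_cancer_wisconsin.csv")
df = df.dropna(axis=1, how="all")

# Drop the identifier, which carries no information for modelling.
df = df.drop(columns=["id"])

# Encode the target: B (benign) -> 0, M (malignant) -> 1.
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})

X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]

print(y.value_counts())  # expected: 357 benign (0) and 212 malignant (1)
```

The X and y defined here are reused in the sketches that follow.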
The next preprocessing technique we focused on was feature scaling. This technique normalizes the range of the features, which is useful especially for algorithms that compute distances, such as KNN or SVM. In terms of feature scaling, we have two basic options: normalization, also known as min–max scaling, and standardization. In our case, we achieved better results with standardization.
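Both scaling options are available in Scikit-learn; a minimal sketch, assuming the feature matrix X from the previous snippet:

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: rescales each feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max scaling): rescales each feature into the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)
```

Inside cross-validation, the scaler should be fitted on the training folds only, which is what the pipelines in the following snippets do.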
To show our students the importance of feature scaling, we showed them how accurate the models would be before and after preprocessing. We trained default Scikit-learn models on the non-processed data and then evaluated them using 10-fold cross-validation. The results in Table 1 and Table 2 show that standardization has a positive effect on the accuracy, F1 measure and Area Under the Receiver Operating Characteristic Curve (AUC) score for most models.
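The comparison can be sketched as follows (default Scikit-learn models, X and y as defined above); this is only an illustrative outline, so the exact figures may differ slightly from Tables 1 and 2:

```python
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Default models, evaluated on raw and on standardized features.
models = {
    "LR": LogisticRegression(max_iter=10000),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}
scoring = ["roc_auc", "accuracy", "f1"]

for name, model in models.items():
    raw = cross_validate(model, X, y, cv=10, scoring=scoring)
    std = cross_validate(make_pipeline(StandardScaler(), model), X, y, cv=10, scoring=scoring)
    for metric in scoring:
        print(name, metric,
              round(raw[f"test_{metric}"].mean(), 3),   # without preprocessing
              round(std[f"test_{metric}"].mean(), 3))   # with standardization
```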
The next step was the feature selection process. All our data are numerical; hence, we used the ANOVA F-value to select the k best features. Table 3 shows that the optimal number of features may differ for each model and metric.
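An illustrative sketch of such a search over k, shown here only for Logistic Regression and accuracy (the exact search loop is an assumption):

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# For every candidate k, keep the k features with the highest ANOVA F-value
# and record the mean 10-fold accuracy; the same loop can be repeated for
# each model and each metric to fill a table like Table 3.
best_k, best_score = None, 0.0
for k in range(1, X.shape[1] + 1):
    pipe = make_pipeline(StandardScaler(),
                         SelectKBest(f_classif, k=k),
                         LogisticRegression(max_iter=10000))
    score = cross_val_score(pipe, X, y, cv=10, scoring="accuracy").mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```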
In the next step, we found the optimal hyperparameters of each model using the grid search method with 10-fold cross-validation. Table 4 shows the best parameters for each model.
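An illustrative sketch of the grid search for the SVM; only the winning values appear in Table 4, so the candidate ranges below are assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Candidate values are illustrative; only the best combination is reported in Table 4.
param_grid = {
    "svm__C": [0.01, 0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1, 1],
    "svm__kernel": ["linear", "rbf"],
}
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)          # e.g. C = 1.0, gamma = 0.01, kernel = 'rbf'
print(round(search.best_score_, 3))
```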
Logistic Regression achieved the best results with regularization strength C = 1. The penalty hyperparameter defines the type of regularization; the most suitable was ridge (l2) regularization. The last hyperparameter for Logistic Regression is the solver, for which the best results were achieved using Newton's method with conjugate gradient (newton-cg). In 10-fold cross-validation, this model achieved 98% mean test accuracy, 99% precision, 95% recall, 97% F1 score and 100% AUC.
Naïve Bayes has only one hyperparameter, var_smoothing, used to regularize the variance of the input features and improve the stability of the model. The best results were achieved with var_smoothing set to 3 × 10−10. In 10-fold cross-validation, Naïve Bayes achieved 93% mean test accuracy, 93% precision, 89% recall, 91% F1 score and 98% AUC.
Similar to Naïve Bayes, the KNN model also has only one hyperparameter, n_neighbors, defining the size of the neighbourhood. The best KNN model was found with five neighbours. In 10-fold cross-validation, KNN achieved 97% mean test accuracy, 98% precision, 93% recall, 95% F1 score and 99% AUC.
The first hyperparameter for SVM in Table 4 is the regularization parameter C, which controls the trade-off between maximizing the margin and minimizing the classification error. In our experiment, higher values of C worked better; our classifier performed best with C = 1, which represents a narrower margin with fewer misclassifications allowed. The method of transforming the features into a higher-dimensional space is defined by the kernel hyperparameter; in our case, the best kernel was the Radial Basis Function (RBF) kernel. In 10-fold cross-validation, this model achieved 98% mean test accuracy, 100% precision, 94% recall, 97% F1 score and 99% AUC.
Hyperparameter tuning prevents the Decision Tree from overfitting. The best criterion for splitting the Decision Tree is defined by the criterion hyperparameter, and in our case, it is the Gini index. The hyperparameter max_depth handles the complexity of the model by defining the maximum depth of the tree; a tree with max_depth = 10 provided the optimal classification results. The minimum number of samples required at a leaf node is defined by min_samples_leaf, and its optimal value is 4. The minimum number of samples required to split a node is defined by min_samples_split; the optimal value is 20. In 10-fold cross-validation, this model achieved 92% mean test accuracy, 89% precision, 89% recall, 89% F1 score and 96% AUC.
Random Forest is a bagging ensemble technique that uses Decision Trees, so its hyperparameters are almost the same; the differences lie in their optimal values. For the Random Forest, the best results were achieved with criterion = gini, max_depth = 10, min_samples_leaf = 3 and min_samples_split = 10. Random Forest has one additional hyperparameter, n_estimators, defining the number of Decision Trees used to predict the final class; the optimal Random Forest used 100 Decision Trees. In 10-fold cross-validation, Random Forest achieved 96% mean test accuracy, 96% precision, 94% recall, 95% F1 score and 99% AUC.

Obtaining Feedback from Participants

After evaluating the accuracy of the individual methods, we needed to compare these values with the students’ opinions on the complexity of the methods and their difficulty in terms of data preprocessing and interpretability. We collected the opinions anonymously in the form of a test that, in addition to knowledge questions (to find out how difficult the individual methods are for students to understand), also included questions in which the participants expressed their opinions on a rating scale. Our test covered six topics: Logistic Regression, Naïve Bayes, K-Nearest Neighbors, Support Vector Machine, Decision Tree and ensemble methods.
Each topic contained five questions aimed at complexity in terms of understanding. These questions were composed in such a way that the student had to choose one or more correct options. A student could get a maximum of one point for each question. If the student chose only one correct option and had to choose two, they received only half a point.
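A minimal sketch of this partial-credit scoring rule (the treatment of incorrectly selected options is an assumption, as only the missing-option case is specified above):

```python
def question_score(selected, correct):
    """Partial-credit scoring: at most one point per question, proportional
    credit for choosing only some of the correct options. We assume that
    selecting an incorrect option yields no credit."""
    selected, correct = set(selected), set(correct)
    if not selected or not selected.issubset(correct):
        return 0.0
    return len(selected) / len(correct)

# A student who picks only one of the two correct options receives 0.5 points.
print(question_score({"a"}, {"a", "c"}))
```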
In addition to the questions focused on complexity in terms of understanding, each topic contained two other questions, the aim of which was not to test the student’s knowledge but to collect their opinion. For each method, we asked the students for their opinion on the complexity of the given method from the point of view of data preparation and from the point of view of the interpretation of the acquired knowledge. They expressed their opinion on a scale from 1 to 5, where 1 meant the lowest complexity and 5 the highest complexity.
In the end, the students had three more questions in the test. In the first, they had to rank the individual methods in terms of how difficult they were for them in terms of understanding; in the second, they had to rank the methods in terms of the complexity of data preparation; and in the third, they had to rank the methods in terms of the complexity of interpreting the acquired knowledge.

4. Machine Learning Metrics for Evaluation

Accuracy simply measures how often the classifier predicts correctly. It can be defined by Formula (1), where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (1)
To explain the F1 measure, precision and recall must be defined first. Precision expresses how many of the cases predicted as positive are actually positive and is defined as the number of true positives divided by the number of predicted positives (Formula (2)).
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (2)
Recall expresses how many of the actual positives our model was able to predict correctly and is defined as the number of true positives divided by the total number of actual positives (Formula (3)).
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (3)
And finally, the F1 measure is the harmonic mean of precision and recall, which is useful because it penalizes extreme values (Formula (4)).
$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (4)
To understand AUC, it is important to explain the Receiver Operating Characteristic (ROC) curve. It is a curve plotted on a 2D graph where one axis represents the true positive rate (TPR, Formula (5)) and the other the false positive rate (FPR, Formula (6)). The area under this curve measures the classifier’s ability to distinguish between classes.
$TPR = \frac{TP}{TP + FN}$ (5)
$FPR = \frac{FP}{FP + TN}$ (6)
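For illustration, the same metrics can be computed with Scikit-learn; this sketch uses the Scikit-learn copy of the Wisconsin data and a single train/test split as a stand-in for the 10-fold procedure described above (note that in this copy the positive class 1 denotes benign cases, unlike our encoding):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Single held-out split on the Scikit-learn copy of the Wisconsin data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))    # Formula (1)
print("Precision:", precision_score(y_test, y_pred))   # Formula (2)
print("Recall   :", recall_score(y_test, y_pred))      # Formula (3)
print("F1       :", f1_score(y_test, y_pred))          # Formula (4)
print("AUC      :", roc_auc_score(y_test, y_score))    # area under the ROC curve
```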

5. Experimentation

In this section, we describe the course of our experiment. We conducted the experiment on two groups of participants: students and experts in the field of data science.

5.1. Experimentation with Students

We conducted our experiment with 42 students who had completed a course on the basics of machine learning. During this course, they gained theoretical knowledge of basic methods such as Logistic Regression, Naïve Bayes, K-Nearest Neighbors, Support Vector Machines and Decision Trees. In one lecture, the students also became acquainted with the principles of ensemble machine learning methods, of which they tried out the Random Forest method. The course was mainly theoretically oriented. After completing this course, students attended lectures by experts, where they learned to implement the mentioned methods on medical data. The students tried tasks such as the classification of cancer in the Python 3.10 programming language. After completing these lectures, the students participated in our experiment. The first step was to verify their actual knowledge of the mentioned methods using a knowledge test. For each method, students had five single-choice or multiple-choice questions, such as the following: Why is the Naïve Bayes classifier called naïve? What is true in a logistic model? Are support vectors the data points that lie closest to the decision surface?
Another interesting factor for us was the students’ opinion on method complexity, preprocessing complexity and interpretability of results. For this reason, the participants of the experiment filled out a questionnaire in which they indicated their opinion on each method, for each of these three criteria, on a scale from 1 to 5, with 1 representing the lowest complexity and 5 representing the highest complexity. The questionnaire also included a task in which the individual methods were to be ranked according to their degree of difficulty for all three criteria, from the simplest to the most complex.

5.2. Experimentation with Experts

A small survey (interview) that we conducted among experts working in the domain of data science speaks for itself. We approached five academic staff teaching subjects in the domain of data science (Python, Databases, KD, NLP, ML), of whom one had a mathematical background at the level of the third degree (Ph.D.), two at the level of the second degree (Master), one at the first-degree level (B.Sc.) and one had no mathematical background.
Furthermore, we approached six data analysts working in global corporations, one in the position of data scientist, four in the position of data analyst and one in the position of IT auditor; only the one in the position of data scientist had a mathematical background (Ph.D.), three had a computer science background (B.Sc.), one an economics background (Master) and one did not graduate.
At the beginning of the interview, we introduced them to a project for identifying faulty segments and to the prediction task that logistic regression would be used to solve. All were Python users, and all had experience using logistic regression for classification. They were then asked a series of questions: From what range of values does the probability come? If a segment is defective with a probability of 0.25, what is the chance (odds) that it will be defective? What is the relationship between chance and probability? What range of values does the chance come from? Is chance a symmetric measure? What is the relationship between the logit and the chance? Is the logit a symmetric measure? What is the difference between relative risk and odds ratio?
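As a worked illustration of the expected answers (our own example, not part of the interview protocol): for a defect probability of $p = 0.25$, the chance (odds) is $\frac{p}{1-p} = \frac{0.25}{0.75} = \frac{1}{3} \approx 0.33$ and the logit is $\ln\frac{p}{1-p} = \ln\frac{1}{3} \approx -1.10$. The probability lies in $[0, 1]$, the chance in $(0, \infty)$ and the logit in $(-\infty, \infty)$; the chance is not a symmetric measure (swapping the outcome replaces it with its reciprocal), whereas the logit is (it only changes sign).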

6. Results

We used parametric and non-parametric methods for comparing multiple dependent samples to test for differences in the evaluation of the methods in terms of their complexity, the complexity of data preprocessing, the interpretability of the acquired knowledge and the correctness of classification. We used parametric procedures as well, given that we identified only small deviations from normality.
If the sphericity assumption of the variance–covariance matrix was not violated, we used an unadjusted univariate test for repeated measures (Preprocessing Complexity: F = 23.792, p < 0.001); otherwise, we used an adjusted univariate test for repeated measures (Method Complexity: G-G Epsilon = 0.787, p < 0.001; Interpretability: G-G Epsilon = 0.744, p < 0.001; Accuracy: G-G Epsilon = 0.390, p < 0.001).
In order not to reduce the power of the statistical tests, given that we worked mainly with ordinal variables, we also used non-parametric procedures based on order (Method Complexity: Chi-Square = 66.992, p < 0.001; Preprocessing Complexity: Chi-Square = 71.313, p < 0.001; Interpretability: Chi-Square = 73.632, p < 0.001; Accuracy: Chi-Square = 25.849, p < 0.001).
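For illustration, analogous tests can be run in Python with statsmodels and SciPy on a long-format table of ratings (the file and column names below are placeholders, not the actual study data):

```python
import pandas as pd
from scipy.stats import friedmanchisquare
from statsmodels.stats.anova import AnovaRM

# Long-format table with one row per (student, method) pair and a numeric
# rating column, e.g. columns: student, method, value (placeholder file).
ratings = pd.read_csv("method_ratings_long.csv")

# Parametric: univariate repeated-measures ANOVA (AnovaRM reports the
# uncorrected test; sphericity corrections such as Greenhouse-Geisser
# have to be applied separately).
res = AnovaRM(ratings, depvar="value", subject="student", within=["method"]).fit()
print(res.anova_table)

# Non-parametric: Friedman test on the same data, one sample per method
# (rows must be ordered consistently by student within each method).
samples = [grp["value"].to_numpy() for _, grp in ratings.groupby("method")]
chi2, p = friedmanchisquare(*samples)
print(chi2, p)
```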
The results of the parametric and non-parametric procedures agree, so we can consider them robust. Based on the mentioned tests, we reject the global null hypotheses, which claim that there is no statistically significant difference in the evaluation of the individual methods (DT, KNN, NB, LR, SVM, ensemble methods) in terms of their method complexity, the complexity of data preprocessing, the comprehensibility of the acquired knowledge and the accuracy of classification (method complexity, preprocessing complexity, knowledge comprehensibility and accuracy performance measure).
After rejecting the global null hypotheses, we were interested in which methods have statistically significant differences in terms of the difficulty of the methods, the complexity of the data preprocessing, the comprehensibility of the acquired knowledge as well as the correctness of the classification (regarding the given dataset to which users applied statistical methods and machine learning methods to solve the task classification).
From Table 5, we can see that the rank mean corresponds to the mean, which is in line with the consistent results obtained by the parametric and non-parametric procedures. From the point of view of variability (standard deviation), the differences are minimal. The most heterogeneous results were achieved for the Decision Tree; on the contrary, the most homogeneous, i.e., the most stable, results were achieved for the Support Vector Machine classifier.
In terms of algorithm performance (Table 5), Logistic Regression achieved the highest level of classification accuracy, followed by Support Vector Machine, K-Nearest Neighbors and Random Forest, while together they form a homogeneous group (p > 0.05). On the contrary, the lowest rate of classification accuracy (Table 5) was achieved by the Decision Tree and the Naïve Bayes classifier, while together, they form a homogeneous group (p > 0.05). Statistically significant differences in classification accuracy were demonstrated between the two groups (p < 0.05).
From Table 6, we can see a high agreement between the estimates of the mean value, i.e., mean and rank-mean, which corresponds to identical results obtained by parametric and non-parametric procedures.
From the point of view of variability (standard deviation), the differences are minimal (Table 6), whether it is an assessment of the complexity of the method or from the point of view of data preprocessing or comprehensibility of acquired knowledge. However, it is worth noting (Table 6) that in all three evaluation areas, the most heterogeneous results were achieved for Logistic Regression; on the contrary, the most homogeneous results, i.e., the most stable, were achieved for the Decision Tree.
If we look at the results in terms of the complexity of the methods (Table 6a), i.e., how the difficulty of these methods is perceived by users, we must state that the results are not surprising. In line with our expectations, symbolic machine learning methods (Decision Tree, KNN) were marked by users as the least demanding, followed by statistical (Naïve Bayes, Logistic Regression) and subsymbolic (SVM) methods. The only surprise is the ensemble machine learning methods, which we expected to be considered difficult due to the principles of statistical inference and probability involved. As an example of such a method, we chose Random Forest, i.e., a set of Decision Trees, which probably explains why users found this method less demanding. We assume that if another ensemble method had been chosen, consisting, e.g., of a set of Logistic Regressions or combining different methods, the assessment of the difficulty of ensemble machine learning would be exactly the opposite.
Decision trees (Table 6a) were marked by users as statistically significantly the least demanding/complex method (p < 0.05); on average, they ranked 1st to 2nd in terms of demandingness/complexity of methods. They were followed (Table 6a) by the Random Forest, KNN, Naïve Bayes classifier and Logistic Regression, which together form a homogeneous group in terms of the assessed difficulty/complexity of the methods (p > 0.05) and ranked 3rd to 4th on average. Users identified SVM as the statistically significantly most demanding method (p < 0.05) (Table 6a); on average, it was ranked 5th to 6th in terms of demandingness/complexity of methods.
In terms of the complexity of data preparation, the results are similar (Table 6b). Users rated the Decision Tree as the least demanding in terms of preprocessing (p < 0.05), ranked 1st to 2nd on average. Similarly, Random Forest (p < 0.05) ranked 2nd to 3rd in terms of preprocessing complexity (Table 6b). They were followed by Logistic Regression, KNN and Naïve Bayes classifier, which together form a homogeneous group in terms of the evaluated complexity of data preparation (p > 0.05), on average they ranked 3rd to 4th, and users consider them to be statistically significantly more demanding methods in terms of data preparation than Random Forest and Decision Tree. The users identified SVM as the statistically significantly most demanding method in terms of the complexity of data preparation (p < 0.05) (Table 6b); on average, it ranked 5th to 6th.
If we look at the results from the point of view of interpretability, they are even more interesting (Table 6c). As expected, users marked the Decision Trees as statistically significantly the most comprehensible (the least complex in terms of interpretation) with regard to the evaluation of the acquired knowledge (p < 0.05); on average, they ranked 1st to 2nd. Knowledge, in this case, takes the form of comprehensible structures (trees) that can be converted into rules, i.e., the acquired knowledge can be expressed in natural language. This is followed by Random Forests and Logistic Regression (Table 6c), which on average ranked third and together form a homogeneous group (p > 0.05) in terms of interpretability. In the case of Random Forest, this was to be expected, since it is a set of trees aggregated by the mode (voting). Logistic Regression may be a surprise, but its result is a linear model, and with the simplified idea that it directly models the probability rather than the logarithm of the odds, its results/knowledge can be partially understandable for users. Another homogeneous group, together with Logistic Regression (Table 6c), consists of the Naïve Bayes classifier and KNN (p > 0.05) in terms of the comprehensibility of the acquired knowledge; on average, they ranked fourth. In the case of the Naïve Bayes classifier and KNN, the knowledge gained is less tangible/visible. SVM was marked by users as statistically significantly the least comprehensible in terms of the evaluation of the acquired knowledge (the most complex in terms of interpretation) (p < 0.05); on average, it ranked 5th to 6th (Table 6c). The comprehensibility of knowledge given by hyperplanes is partially imaginable for users in two-dimensional and three-dimensional space; when additional dimensions are added, which is common practice, the comprehensibility of the acquired knowledge is lost, and the hyperplane becomes unimaginable.

7. Discussion

The participants who were involved in the evaluation of the methods in terms of their difficulty, the complexity of data preprocessing and interpretability have a fundamental knowledge of machine learning methods and practical skills in using the selected methods exclusively in the Python language. This results in several bad habits stemming from the use of this language. On the training dataset, users mastered data preprocessing, modelling (Logistic Regression, Naïve Bayes, K-Nearest Neighbors, Decision Tree, SVM, Random Forest) and evaluation of results (correctness of classification) in Python. Subsequently, they received a test focused on the essence of the individual methods and on the evaluation of the individual methods in terms of the complexity of data preprocessing and the comprehensibility of the acquired knowledge. Finally, they were asked to order the methods in terms of their difficulty, the complexity of data preprocessing and the comprehensibility of the acquired knowledge so that we could validate their answers.
The results corresponded in the assessment of the methods in terms of the complexity of data preparation: a statistically significant positive medium to very large degree of association was identified between the rating and the ordering (Goodman and Kruskal’s Gamma: 0.36–0.87, p < 0.05). Similarly, in the evaluation of the methods from the point of view of the comprehensibility of the acquired knowledge, a statistically significant positive medium to large degree of association was identified between the rating and the ordering (Goodman and Kruskal’s Gamma: 0.30–0.69, p < 0.05).
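For illustration, the gamma statistic itself (without the significance test) can be computed from two paired ordinal variables as follows; the data in the example are hypothetical:

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Goodman and Kruskal's gamma for two paired ordinal variables:
    (concordant - discordant) / (concordant + discordant); tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical example: one student's 1-5 ratings of six methods versus the
# rank positions the same student assigned in the ordering task.
ratings = [1, 2, 3, 3, 4, 5]
ordering = [1, 2, 4, 3, 5, 6]
print(round(goodman_kruskal_gamma(ratings, ordering), 2))
```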
On the contrary, in the case of the resulting test score focused on the essence of the methods used and the ordering of the methods in terms of their complexity/difficulty, we identified a negative association (as the difficulty of the method increased, the test success score decreased), but only a small and statistically insignificant one (Goodman and Kruskal’s Gamma > −0.28, p > 0.05), or independence. It follows that knowledge does not correspond with skills. Although the participants can use methods in Python and replicate Python code, their ability to develop a valid model/knowledge is questionable, to say the least. The disadvantages of the Python language include the following:
  • Template-like nature (rarely can the same thing be solved in the same way; for less knowledgeable users, however, this can, on the contrary, represent a certain advantage);
  • Shallowness (limited parameterization caused by the partial automation of methods; for less knowledgeable users, however, this can, on the contrary, represent a certain advantage).
On the contrary, among the advantages we can include the following:
  • Replicability (if not ideally tuned, this can be a disadvantage);
  • Availability (the work of a worse analyst may be limited only to finding the code and not to solving the problem; however, this may, on the contrary, represent a disadvantage).
Python is essential for all data analysts and, moreover, fully sufficient for more basic positions. Greater automation is expected in the field of machine learning, which will lead to the fact that most users will approach models as a black box. However, practice already shows that automated tools are sufficient for most subtasks, for which knowledge of Python and a basic overview of analytical methods, data preprocessing techniques and evaluation of results are sufficient. One expert in the team with in-depth knowledge of methods and the ability to use the widest possible portfolio of data mining tools is sufficient for expert work, which corresponds to common practice when solving data mining projects.
Of the 11 experts, only 2 could satisfactorily answer the questions posed (one academic and one employee of a global corporation), both of whom had a third-degree (Ph.D.) mathematical background. There was little hesitation on their part, except for the last question, namely relative risk, which has a specific use. The rest of the respondents managed only the first, trivial question. On the one hand, the results are very surprising and unambiguous, and there is no point in tabulating them. On the other hand, they show that although the respondents know logistic regression and know how to use it for classification, they do not know what logistic regression actually models, nor that its results can be used to predict the chance or the probability. Of course, the results cannot be generalized, given the small sample of experts and the selection based on availability.
In our opinion, this is the price of the mechanical use of methods implemented in Python and of mechanical code copying (restricting the work to finding code rather than solving the problem, which in most cases is unique), or of using other automated tools. In global corporations, not all analysts are employed as data scientists; many perform partial tasks such as selecting variables, evaluating results, etc., while using automated tools and repetitive procedures. One person who looks deeply into things is usually enough in a team. Global corporations have dealt with this phenomenon by granulating the tasks in the knowledge acquisition process.

8. Conclusions

From the results, we can see the connections between the evaluated areas of the used methods. There is a positive medium to a large degree of association between the correctness of the classification and the complexity of the methods in terms of the difficulty of the methods themselves (Goodman and Kruskal’s Gamma: 0.60) and partly also in terms of data preprocessing (Goodman and Kruskal’s Gamma: 0.33) (Table 5 and Table 6a,b).
Similarly, between the correctness of the classification and the comprehensibility of the acquired knowledge (Goodman and Kruskal’s Gamma: 0.47), we can observe a moderate degree of positive association (Table 5 and Table 6c), as well as between the comprehensibility of the acquired knowledge and the complexity of the methods (Goodman and Kruskal’s Gamma: 0.60) and the necessary preprocessing data (Goodman and Kruskal’s Gamma: 0.87), where a positive large to very large association was achieved (Table 6a–c).
Although more complex methods bring us more accurate knowledge, they are less comprehensible. In cases where we are not interested only in the result, in the sense of assignment to classes, it is very important to find the right balance between the comprehensibility of the acquired knowledge and its accuracy.
The results of our study confirm that there are differences between statistical, symbolic and subsymbolic methods in complexity in terms of the method itself, data preparation and interpretability of results. Moreover, the studies carried out so far did not consider knowledge and opinion factors when evaluating the individual criteria: interpretability was perceived mechanically and at the level of models rather than groups of methods, with criteria such as the number of parameters, the size of the model, the significance of features, etc. Although the results of our study are consistent with some results of the related studies discussed above (for example, Bratko [11] states that decision trees are easy to interpret, which is also confirmed by our results obtained from the knowledge and opinion questions answered by students and experts), we see the benefit of our work in verifying the claims of the authors of related studies using our methodology. We also found that, in general, symbolic methods are the easiest for students to understand, even in the case of an ensemble version of symbolic methods. Conversely, subsymbolic and statistical methods are more difficult for students to imagine, and it is necessary to devote more attention and time to them during teaching.

Author Contributions

Conceptualization, M.M.; formal analysis, F.F.; methodology, L.K.; supervision, M.M.; resources, F.F.; validation, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Commission under the ERASMUS+ Programme 2021, KA2, grant number: 2021-1-SK01-KA220-HED-000032095 “Future IT Professionals EDucation in Artificial Intelligence”.

Data Availability Statement

The data presented in this study are openly available at https://cutt.ly/rwySV5Hc (accessed on 26 June 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cremin, C.J.; Dash, S.; Huang, X. Big Data: Historic Advances and Emerging Trends in Biomedical Research. Curr. Res. Biotechnol. 2022, 4, 138–151.
  2. Chapman, P.; Clinton, J.; Khabaza, T.; Kerber, R.; Reinartz, T.; Shearer, T.; Wirth, R. CRISP-DM 1.0: Step-by-Step Data Mining Guide, 2000. Available online: https://www.kde.cs.uni-kassel.de/wp-content/uploads/lehre/ws2012-13/kdd/files/CRISPWP-0800.pdf (accessed on 20 March 2023).
  3. Hajek, P.; Barushka, A.; Munk, M. Neural Networks with Emotion Associations, Topic Modeling and Supervised Term Weighting for Sentiment Analysis. Int. J. Neur. Syst. 2021, 31, 2150013.
  4. Hajek, P.; Barushka, A.; Munk, M. Fake Consumer Review Detection Using Deep Neural Networks Integrating Word Embeddings and Emotion Mining. Neural Comput. Appl. 2020, 32, 17259–17274.
  5. Miller, T. Explanation in Artificial Intelligence: Insights from the Social Sciences. Artif. Intell. 2019, 267, 1–38.
  6. Johansson, U.; Sönströd, C.; Norinder, U.; Boström, H. Trade-off between Accuracy and Interpretability for Predictive in Silico Modeling. Future Med. Chem. 2011, 3, 647–663.
  7. Tsang, W.K.; Benoit, D.F. Interpretability and Explainability in Machine Learning. In Living Beyond Data; Ohsawa, Y., Ed.; Intelligent Systems Reference Library; Springer International Publishing: Cham, Switzerland, 2023; Volume 230, pp. 89–100. ISBN 978-3-031-11592-9.
  8. Michie, D. Machine Learning in the Next Five Years. In Proceedings of the Third European Working Session on Learning, Glasgow, UK, 3 October 1988; Volume 1, pp. 47–80.
  9. Cunningham, S.J.; Humphrey, M.C.; Witten, I.H. Understanding What Machine Learning Produces—Part I: Representations and Their Comprehensibility; Department of Computer Science, University of Waikato: Hamilton, New Zealand, 1996.
  10. Cunningham, S.J.; Humphrey, M.C.; Witten, I.H. Understanding What Machine Learning Produces—Part II: Knowledge Visualization Techniques; Department of Computer Science, University of Waikato: Hamilton, New Zealand, 1996.
  11. Bratko, I. Machine Learning: Between Accuracy and Interpretability. In Learning, Networks and Statistics; Riccia, G., Lenz, H.-J., Kruse, R., Eds.; Springer: Vienna, Austria, 1997; pp. 163–177. ISBN 978-3-211-82910-3.
  12. Zurada, J. Could Decision Trees Improve the Classification Accuracy and Interpretability of Loan Granting Decisions? In Proceedings of the 2010 43rd Hawaii International Conference on System Sciences, Honolulu, HI, USA, 5–8 January 2010; pp. 1–9.
  13. El Shawi, R.; Sherif, Y.; Al-Mallah, M.; Sakr, S. Interpretability in Healthcare: A Comparative Study of Local Machine Learning Interpretability Techniques. In Proceedings of the 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), Cordoba, Spain, 5–7 June 2019; pp. 275–280.
  14. ElShawi, R.; Sherif, Y.; Al-Mallah, M.; Sakr, S. Interpretability in Healthcare: A Comparative Study of Local Machine Learning Interpretability Techniques. Comput. Intell. 2021, 37, 1633–1650.
  15. Sulmont, E.; Patitsas, E.; Cooperstock, J.R. Can You Teach Me to Machine Learn? In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, Minneapolis, MN, USA, 22 February 2019; pp. 948–954.
  16. Sulmont, E.; Patitsas, E.; Cooperstock, J.R. What Is Hard about Teaching Machine Learning to Non-Majors? Insights from Classifying Instructors’ Learning Goals. ACM Trans. Comput. Educ. 2019, 19, 1–16.
  17. Gonda, D.; Pavlovičová, G.; Ďuriš, V.; Tirpáková, A. Implementation of Pedagogical Research into Statistical Courses to Develop Students’ Statistical Literacy. Mathematics 2022, 10, 1793.
  18. Gonda, D.; Ďuriš, V.; Tirpáková, A.; Pavlovičová, G. Teaching Algorithms to Develop the Algorithmic Thinking of Informatics Students. Mathematics 2022, 10, 3857.
  19. Gao, L.; Guan, L. Interpretability of Machine Learning: Recent Advances and Future Prospects. IEEE MultiMedia 2023, 1–12.
  20. Dolejš, J.; Jureček, M. Interpretability of Machine Learning-Based Results of Malware Detection Using a Set of Rules. In Artificial Intelligence for Cybersecurity; Stamp, M., Aaron Visaggio, C., Mercaldo, F., Di Troia, F., Eds.; Advances in Information Security; Springer International Publishing: Cham, Switzerland, 2022; Volume 54, pp. 107–136. ISBN 978-3-030-97086-4.
  21. Upadhyaya, D.P.; Tarabichi, Y.; Prantzalos, K.; Ayub, S.; Kaelber, D.C.; Sahoo, S.S. Characterizing the Importance of Hematologic Biomarkers in Screening for Severe Sepsis Using Machine Learning Interpretability Methods. medRxiv 2023.
  22. Beisbart, C.; Räz, T. Philosophy of Science at Sea: Clarifying the Interpretability of Machine Learning. Philos. Compass 2022, 17.
  23. Hazzan, O.; Mike, K. The Pedagogical Challenge of Machine Learning Education. In Guide to Teaching Data Science; Springer International Publishing: Cham, Switzerland, 2023; pp. 199–208. ISBN 978-3-031-24757-6.
  24. Wolberg, W.; Street, W.N.; Mangasarian, O. Breast Cancer Wisconsin (Diagnostic) Data Set, 1995. Available online: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data (accessed on 27 February 2023).
  25. Munk, M.; Kapusta, J. Web Usage Mining; Prírodovedec; Univerzita Konštantína Filozofa v Nitre: Nitra, Slovakia, 2014; ISBN 978-80-558-0692-1.
  26. Drlik, M.; Munk, M.; Skalka, J. Identification of Changes in VLE Stakeholders’ Behavior over Time Using Frequent Patterns Mining. IEEE Access 2021, 9, 23795–23813.
  27. Munk, M.; Drlik, M.; Benko, L.; Reichel, J. Quantitative and Qualitative Evaluation of Sequence Patterns Found by Application of Different Educational Data Preprocessing Techniques. IEEE Access 2017, 5, 8989–9004.
Figure 1. Class distribution.
Table 1. Results without preprocessing.

Scoring (Mean)   LR      NB      DT      RF      KNN     SVM
AUC              0.996   0.985   0.916   0.991   0.989   0.996
Accuracy         0.981   0.937   0.919   0.970   0.967   0.977
F1               0.973   0.912   0.892   0.942   0.953   0.969
Table 2. Results with preprocessing.

Scoring (Mean)   LR      NB      DT      RF      KNN     SVM
AUC              0.992   0.988   0.906   0.990   0.961   0.976
Accuracy         0.954   0.932   0.918   0.968   0.930   0.914
F1               0.937   0.908   0.891   0.940   0.904   0.873
Table 3. Number of the best features.

Model   Scoring    Score   Features
LR      AUC        0.997   24
        Accuracy   0.981   24
        F1         0.973   15
NB      AUC        0.987   19
        Accuracy   0.946   6
        F1         0.927   6
DT      AUC        0.932   20
        Accuracy   0.940   16
        F1         0.913   19
RF      AUC        0.992   20
        Accuracy   0.968   19
        F1         0.956   23
KNN     AUC        0.991   20
        Accuracy   0.972   17
        F1         0.962   17
SVM     AUC        0.996   24
        Accuracy   0.974   19
        F1         0.964   19
Table 4. The best parameters for each model.

Model   Best Parameters
LR      ‘C’: 1, ‘penalty’: ‘l2’, ‘solver’: ‘newton-cg’
NB      ‘var_smoothing’: 3 × 10−10
KNN     ‘n_neighbors’: 5
SVM     ‘C’: 1.0, ‘gamma’: 0.01, ‘kernel’: ‘rbf’
DT      ‘criterion’: ‘gini’, ‘max_depth’: 10, ‘min_samples_leaf’: 4, ‘min_samples_split’: 20
RF      ‘criterion’: ‘gini’, ‘max_depth’: 10, ‘min_samples_leaf’: 3, ‘min_samples_split’: 10, ‘n_estimators’: 100
Table 5. Multiple comparisons for accuracy performance measure.

ACC        Average Rank   Mean     Std. Dev.   1       2
DT         1.85           0.9210   0.0504      ****
NB         2.00           0.9385   0.0289      ****
Ensemble   3.90           0.9666   0.0193              ****
KNN        4.00           0.9684   0.0296              ****
SVM        4.35           0.9754   0.0189              ****
LR         4.90           0.9807   0.0226              ****
Note: ****—Homogeneous groups, p > 0.05.
Table 6. Multiple comparisons for (a) method complexity, (b) preprocessing complexity and (c) interpretability.

(a)
           Avg. Rank   Mean   Std. Dev.   1       2       3
DT         1.53        1.53   1.5         ****
Ensemble   3.21        3.21   1.46                ****
KNN        3.60        3.60   1.44                ****
NB         3.74        3.74   1.24                ****
LR         3.76        3.76   1.54                ****
SVM        5.16        5.16   1.27                        ****

(b)
           Avg. Rank   Mean   Std. Dev.   1       2       3       4
DT         1.76        1.76   1.13        ****
Ensemble   2.56        2.56   1.37                ****
LR         3.56        3.56   1.50                        ****
KNN        3.79        3.79   1.45                        ****
NB         4.3         4.3    1.24                        ****
SVM        5.24        5.24   1.16                                ****

(c)
           Avg. Rank   Mean   Std. Dev.   1       2       3       4
DT         1.49        1.47   0.83        ****
Ensemble   3.00        3.00   1.33                ****
LR         3.35        3.35   1.52                ****    ****
NB         3.94        3.91   1.40                        ****
KNN        4.7         4.6    1.50                        ****
SVM        5.15        5.15   1.16                                ****

Note: ****—Homogeneous groups, p > 0.05.