Article

Probabilistic Confusion Matrix: A Novel Method for Machine Learning Algorithm Generalized Performance Analysis

by Ioannis Markoulidakis 1,* and Georgios Markoulidakis 2
1 Department of Digital Transformation, Public Power Company, 10432 Athens, Greece
2 Department of Electrical and Computer Engineering, National Technical University of Athens, 15773 Athens, Greece
* Author to whom correspondence should be addressed.
Technologies 2024, 12(7), 113; https://doi.org/10.3390/technologies12070113
Submission received: 4 June 2024 / Revised: 29 June 2024 / Accepted: 11 July 2024 / Published: 13 July 2024

Abstract

The paper addresses the assessment of classification machine learning algorithm performance based on a novel probabilistic confusion matrix concept. The paper develops a theoretical framework which associates the proposed confusion matrix, and the performance metrics derived from it, with the regular confusion matrix. The theoretical results are verified on a wide variety of real-world classification problems and state-of-the-art machine learning algorithms. Based on the properties of the probabilistic confusion matrix, the paper then highlights the benefits of using the proposed concept both during the training phase and during the application phase of a classification machine learning algorithm.

1. Introduction

The problem of data classification in the field of supervised machine learning refers to the capability of an algorithm to predict the class label of a sample consisting of a set of parameters called features or explanatory variables. Data classification is an instance of pattern recognition [1], and its solution includes two phases: (a) a training phase, where the selected machine learning algorithm is trained on a set of training data, i.e., data for which both the features and the corresponding class label are known, and (b) an application phase, where the trained machine learning algorithm is applied to datasets with known feature values and unknown class labels, leading to a prediction of the class labels.
Depending on the number of class labels, a classification problem can be characterized as binary classification (the class label can take only two values), multiclass classification (the class label can take more than two different values), or multi-label classification (each observation is associated with multiple classes). Different machine learning algorithms are employed to address classification problems [1], such as Linear Regression, logistic regression (LogReg), Naïve Bayes (NB), Support Vector Machines (SVMs), decision trees (DTs), random forest (RF), k-nearest neighbors, Convolutional Neural Networks (CNNs), Artificial Neural Networks (ANNs), etc. [2,3]. The performance of classification algorithms is expressed via specific metrics like accuracy, precision, recall, and F1-score [4]. Such metrics are calculated based on the confusion matrix (CM) [5,6], which represents the way that predicted labels are confused with the actual class labels.
During the training phase, an algorithm is optimized based on the available training dataset (fitting process). To assess the generalized performance [7] of the trained algorithm, some cross-validation method is typically applied (e.g., exhaustive, non-exhaustive, or hold-out cross-validation) [8,9]. In all of these methods, the trained algorithm is applied to one or more datasets with known class labels, called test or validation sets, thus providing the opportunity to assess its performance on samples not seen during the training process.
Probabilistic versions of the confusion matrix have been proposed in the literature [10,11,12,13,14,15,16,17]. In [10], a probabilistic confusion matrix was developed as a method for cross-entropy/loss minimization, and in [11], the concept of a fuzzy confusion matrix was proposed, aiming at improving multi-label classification performance. In [12], the authors consider the true and the empirical version of the confusion matrix and use a norm of the confusion matrix as a loss function in order to optimize the classification algorithm performance. In [13,14], the authors define a probabilistic confusion matrix version which relies on the actual class labels and the class probabilities and provide an analysis of the relevant performance metrics. In [15], multiple classification algorithms are exploited for document classification, and a probabilistic confusion matrix is proposed which, for each class, sums up the class probabilities produced by each of the applied algorithms in order to select the class with the maximum sum of class probabilities. In [16], the authors use the concept of the entropy confusion matrix developed in [10] in order to assess probabilistic classification models. Finally, in [17], the authors exploit a probabilistic confusion matrix for a binary classification problem analysis (presence–absence problem). The probabilistic confusion matrix used is based on the same definition as the one used in [13,14], i.e., on the actual class labels and the class probabilities.
The current paper develops a novel concept based on a probabilistic definition of the confusion matrix using the predicted class and the associated class probabilities estimated by the applied algorithm. The proposed concept is called the Actual Label Probabilistic confusion matrix (ALP CM), since the actual labels corresponding to a set of input samples are approximated by the relevant class probabilities. This is a different version compared to [13,14,17], where the probabilistic confusion matrix is based on the actual class labels and the class probabilities. Moreover, the ALP CM is based on a different definition compared to [10], as becomes evident from the equations provided in Section 3.
The theoretical analysis indicates that, under certain convergence conditions, the properties of the ALP confusion matrix lead to a good estimation of the algorithm performance. Further theoretical results provide useful insights relating the class probabilities to the algorithm performance metrics. The theoretical analysis is then tested on a set of real-world classification problems and a set of state-of-the-art classification algorithms. The results confirm the theoretical analysis and provide additional insights into the patterns observed in the presence of overfitting and generalization error. The derived conclusions are then used to develop a method for exploiting the proposed confusion matrix concept to improve the training phase of a classification problem, as well as to predict the trained algorithm's performance when applied to a real-world set of input samples.
The paper is organized as follows: Section 2 provides modelling of state-of-the-art machine learning classification algorithms. Section 3 develops the probabilistic confusion matrix concept and provides the associated theoretical framework of the algorithm performance. Section 4 provides the analysis of real-world classification problems so as to verify the theoretical analysis. Section 5 provides the exploitation use cases of the probabilistic confusion matrix, and Section 6 summarizes the paper conclusions and discusses potential next research steps.

2. Classification Machine Learning Modelling Framework

As mentioned above, there is a wide variety of machine learning algorithms which can be applied to solve a classification problem [2]. In the following, we provide the basic modelling and performance assessment framework of classification machine learning algorithms, which forms the basis for the development of the novel concept proposed in this paper.

2.1. Classification Problem Modelling

We consider a classification problem which is linked to a space expressed by the pair $(S_X, S_Y)$, where $S_X$ corresponds to the set of features of the problem and $S_Y$ to the set of the associated class labels. Let $(X, Y)$ be a pair of sets, with $X$ a subset of $S_X$ and $Y$ a subset of $S_Y$, consisting of a finite number $N$ of samples. Assuming that the classification problem is based on $K$ features, the set of input samples $X$ is defined as an $N \times K$ array:
$$ X = [x_{ij}], \quad i = 1, 2, \ldots, N, \quad j = 1, 2, \ldots, K \tag{1} $$
where $x_{ij}$ is the value of feature $j$ for input sample instance $i$. Depending on the classification problem, each feature $x_{ij}$ can be a discrete/categorical or a continuous parameter.
The corresponding set of class labels $Y$ can be defined as an $N \times 1$ array (single-label, multiclass problem):
$$ Y = [y_i], \quad i = 1, 2, \ldots, N, \quad y_i \in L = \{L_1, L_2, \ldots, L_M\} \tag{2} $$
where $y_i$ is the class label corresponding to sample instance $x_{ij}$, and $L = \{L_1, L_2, \ldots, L_M\}$ is the set of the possible class labels with $M$ different values (i.e., $y_i$ is a discrete parameter).
A trained machine learning algorithm $f$ receives as input the sample instances of the set of features $X$ and provides as output the set of predicted class labels $\hat{Y}$, which is an $N \times 1$ array:
$$ \hat{Y} = [\hat{y}_i], \quad i = 1, 2, \ldots, N, \quad \hat{y}_i \in L \tag{3} $$
where $\hat{y}_i$ is the class label predicted by the trained algorithm $f$ when applied to sample instance $x_{ij}$:
$$ \hat{y}_i = f[x_{ij}], \quad i = 1, 2, \ldots, N, \quad j = 1, 2, \ldots, K, \quad \hat{y}_i \in L \tag{4} $$
Since the actual and predicted class labels are discrete parameters, we can express them in categorical (one-hot) form as $N \times M$ arrays:
$$ Y_c = [b_{ij}] \quad \text{and} \quad \hat{Y}_c = [\hat{b}_{ij}], \quad i = 1, 2, \ldots, N, \quad j = 1, 2, \ldots, M \tag{5} $$
where
$$ b_{ij} = \begin{cases} 1 & \text{if } y_i = L_j \\ 0 & \text{if } y_i \neq L_j \end{cases}, \qquad \hat{b}_{ij} = \begin{cases} 1 & \text{if } \hat{y}_i = L_j \\ 0 & \text{if } \hat{y}_i \neq L_j \end{cases} \tag{6} $$
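For concreteness, the categorical representation of Equation (6) can be computed as in the following minimal Python/NumPy sketch (the helper name and the example labels are ours, purely illustrative):

```python
import numpy as np

def to_categorical(y, labels):
    """One-hot encode a label vector y according to Equation (6).

    y      : array-like of N class labels
    labels : the M possible class labels L_1, ..., L_M
    Returns an N x M array with b_ij = 1 iff y_i == L_j.
    """
    y = np.asarray(y)
    return (y[:, None] == np.asarray(labels)[None, :]).astype(int)

# Example with N = 4 samples and M = 3 classes:
Y_c = to_categorical(["A", "C", "B", "A"], labels=["A", "B", "C"])
# Y_c = [[1 0 0], [0 0 1], [0 1 0], [1 0 0]]
```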

2.2. The Confusion Matrix

The confusion matrix (CM) is defined as the matrix whose elements are non-negative integers representing the number of instances of each possible pair of predicted and actual class labels. The CM can be calculated from the pair of predicted and actual class sets, $(\hat{Y}, Y)$. For a classification problem with $M$ different class labels, the resulting confusion matrix has dimension $M \times M$ and is defined as follows:
$$ CM = [c_{km}], \quad k, m = 1, 2, \ldots, M \tag{7} $$
where $CM$ is the confusion matrix resulting from the application of the trained algorithm $f$ over an input set of samples $X$ with known class labels $Y$, and $c_{km}$ is an integer with the property $0 \le c_{km} \le N$, corresponding to the number of instances for which the classification algorithm $f$ predicts class label $L_k$ and the actual class label is $L_m$:
$$ c_{km} = \sum_{i=1}^{N} \hat{b}_{ik} \cdot b_{im} \tag{8} $$
Note that the last part of Equation (8) follows from the fact that the term $\hat{b}_{ik} \cdot b_{im}$ has the following property:
$$ \hat{b}_{ik} \cdot b_{im} = \begin{cases} 1 & \text{if } \hat{y}_i = L_k \text{ and } y_i = L_m \\ 0 & \text{otherwise} \end{cases} \tag{9} $$
Equation (8) allows for the CM calculation based on the categorical representation of the predicted and the actual class labels:
$$ CM = \hat{Y}_c^{T} \cdot Y_c \tag{10} $$
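Equation (10) translates directly into code: with the one-hot arrays produced by the to_categorical helper of the previous sketch, the CM is a single matrix product (again a sketch, not the paper's implementation):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Regular CM via Equation (10): rows index the predicted class L_k and
    columns the actual class L_m, so cm[k, m] equals c_km of Equation (8)."""
    Y_c = to_categorical(y_true, labels)      # N x M actual labels
    Y_hat_c = to_categorical(y_pred, labels)  # N x M predicted labels
    return Y_hat_c.T @ Y_c                    # M x M confusion matrix

cm = confusion_matrix(["A", "C", "B", "A"], ["A", "B", "B", "C"],
                      labels=["A", "B", "C"])
```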
The CM is the basis of the definition of a wide set of metrics which represent the performance of the classification algorithm. Such metrics include accuracy, precision, recall, and F1-score, which are defined for both binary and multiclass classification problems [6]. For binary classification problems, additional performance metrics are available like Specificity, Negative Predictive Value, Miss Rate, False Discovery Rate, etc.

2.3. The Class Probabilities of a Classification Algorithm

For a set of classification machine learning algorithms, the identification of the predicted class label relies on the estimation of the so-called class probabilities [18,19]. In this context, we may assume that for a given input sample instance $x_{ij}$ ($j = 1, 2, \ldots, K$), the trained algorithm $f$ can provide the class probabilities as an estimation of the distribution of the probabilities of the actual class labels:
$$ \hat{p}_{im} = \Pr\!\left( y_i = L_m \mid f, x_{ij} \right), \quad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, K, \; m = 1, 2, \ldots, M \tag{11} $$
where $\hat{p}_{im}$ is the class probability produced by the trained algorithm $f$ and represents the estimated probability that the actual label $y_i$ of sample instance $x_{ij}$ is $L_m$. From this definition, the following property applies:
$$ \sum_{m=1}^{M} \hat{p}_{im} = 1 \tag{12} $$
We can then define the array of the class probabilities of the trained classification algorithm $f$ for a specific input set of samples $X$ of size $N$ as the following $N \times M$ array:
$$ \hat{Y}_{prc} = [\hat{p}_{im}], \quad i = 1, 2, \ldots, N, \; m = 1, 2, \ldots, M \tag{13} $$
Based on Equation (12), it is easy to show that:
$$ \sum_{i=1}^{N} \sum_{m=1}^{M} \hat{p}_{im} = N \tag{14} $$
and for a substantially large sample size, it is easy to show that:
$$ \frac{1}{N} \sum_{i=1}^{N} \hat{p}_{im} \approx \Pr\!\left( y = L_m \mid X \right) \tag{15} $$

$$ \sum_{i=1}^{N} \hat{p}_{im} \approx n(y = L_m) \tag{16} $$
where $\Pr(y = L_m \mid X)$ is the probability that the actual label is $L_m$ for the entire input set of samples $X$, and $n(y = L_m)$ is the number of sample instances in set $X$ for which the actual label is $L_m$.
A machine learning algorithm that provides a prediction of the class probabilities typically selects the predicted class label based on the maximum likelihood method [19,20] aiming at algorithm performance optimization:
$$ \hat{y}_i = L_k, \quad L_k \in L, \quad k = \arg\max_{m} \, \hat{p}_{im}, \quad m = 1, 2, \ldots, M, \; i = 1, 2, \ldots, N \tag{17} $$
The set of estimated class labels can be expressed in its categorical representation $\hat{Y}_c$ as follows:
$$ \hat{Y}_c = [\hat{b}_{ik}], \quad i = 1, \ldots, N, \; k = 1, \ldots, M, \qquad \hat{b}_{ik} = \begin{cases} 1 & \text{if } \arg\max_{m} \hat{p}_{im} = k \\ 0 & \text{if } \arg\max_{m} \hat{p}_{im} \neq k \end{cases} \tag{18} $$
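The decision rule of Equation (17) amounts to a row-wise argmax over the class-probability array, e.g., the output of scikit-learn's predict_proba. A minimal sketch (function name ours):

```python
import numpy as np

def predict_from_probabilities(P_hat, labels):
    """Apply the decision rule of Equation (17): select the class with the
    maximum estimated probability. P_hat is the N x M array of Equation (13)."""
    k = np.argmax(P_hat, axis=1)   # index of the maximum probability per sample
    return np.asarray(labels)[k]   # predicted labels y_hat_i

P_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
y_hat = predict_from_probabilities(P_hat, labels=["A", "B", "C"])  # ['A', 'C']
```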

3. The Actual Label Probabilistic Confusion Matrix Concept

The current paper considers the definition of a different version of the confusion matrix, which is estimated based on the class probabilities produced by the applied classification algorithm. In particular, the confusion matrix defined in this paper is called Actual Label Probabilistic (ALP) confusion matrix and is developed based on the following assumptions:
(a)
The selected classification algorithm $f$ is trained on a training dataset (i.e., a set of input samples $X_{train}$ with known actual class labels $Y_{train}$). The pair $(X_{train}, Y_{train})$ is assumed to be a set of samples from the problem space $(S_X, S_Y)$.
(b)
The trained classification algorithm is then applied to a set of input samples $X$. It is assumed that the pair of input samples and the corresponding actual labels $(X, Y)$ is also a set of samples from the problem space $(S_X, S_Y)$. Note that, depending on the use case, the set of actual labels $Y$ can be either known (e.g., a test set used to validate the trained algorithm performance) or unknown (e.g., an application set where the trained algorithm is exploited to predict the class labels).
(c)
When the trained classification algorithm $f$ is applied to the input set $X$, it produces the predicted class probabilities (see Equation (13)), and it uses Equation (17) as the decision rule for selecting the predicted class labels based on the estimated class probabilities. From this assumption, it is clear that the proposed concept applies to classification algorithms like logistic regression, decision trees, etc. On the other hand, the proposed concept is not applicable to algorithms like k-nearest neighbors (KNN), where the prediction of the label of an input sample is based on a plurality vote of the k nearest neighbors [21].
(d)
When the trained classification algorithm $f$ is applied to the input set $X$, the Actual Label Probabilistic confusion matrix (ALP CM) can be calculated based on: (i) the predicted class labels (as in the case of the regular confusion matrix) and (ii) a fractional approximation of the actual class labels, estimated from the predicted class probabilities of each input sample instance. The concept behind this approach is that the classification algorithm has been trained to approximate, through the produced class probabilities, the distribution of actual class labels corresponding to the set of input samples $X$.
Applying the process described in point (d) to Equation (10), i.e., replacing the actual labels array $Y_c$ with the estimated class probabilities array $\hat{Y}_{prc}$, we obtain the following equations for the ALP CM:
$$ CM_{ALP} = [c_{ALP}(k, m)] = \hat{Y}_c^{T} \cdot \hat{Y}_{prc}, \quad k, m = 1, 2, \ldots, M \tag{19} $$

$$ c_{ALP}(k, m) = \sum_{i=1}^{N} \hat{b}_{ik} \cdot \hat{p}_{im}, \quad k, m = 1, 2, \ldots, M \tag{20} $$
where $c_{ALP}(k, m)$ is a real number with the property $0 \le c_{ALP}(k, m) \le N$. The ALP CM of a trained algorithm can be calculated even for datasets for which the actual class labels $Y$ are unknown, as in the case of a real-world application dataset. The ALP CM has the following properties:
$$ \sum_{k=1}^{M} \sum_{m=1}^{M} c_{ALP}(k, m) = \sum_{k=1}^{M} \sum_{m=1}^{M} \sum_{i=1}^{N} \hat{b}_{ik} \cdot \hat{p}_{im} = \sum_{m=1}^{M} \sum_{i=1}^{N} \hat{p}_{im} = \sum_{i=1}^{N} 1 = N \tag{21} $$

$$ \sum_{k=1}^{M} c_{ALP}(k, m) = \sum_{k=1}^{M} \sum_{i=1}^{N} \hat{b}_{ik} \cdot \hat{p}_{im} = \sum_{i=1}^{N} \hat{p}_{im} \approx n(y = L_m) \tag{22} $$

$$ \sum_{m=1}^{M} c_{ALP}(k, m) = \sum_{m=1}^{M} \sum_{i=1}^{N} \hat{b}_{ik} \cdot \hat{p}_{im} = \sum_{i=1}^{N} \hat{b}_{ik} = n(\hat{y} = L_k) \tag{23} $$
where $n(y = L_m)$ is the number of instances for which the actual class label is $L_m$, and $n(\hat{y} = L_k)$ is the number of input sample instances for which the predicted class label is $L_k$. It should be noted that the property expressed in Equation (22) stems from Equation (16). Moreover, it is interesting to note that the above properties also apply to the regular CM.
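The ALP CM of Equations (19) and (20) is equally direct to compute; notably, it needs only the model's predicted probabilities, not the actual labels. A sketch under the above assumptions (function name ours):

```python
import numpy as np

def alp_confusion_matrix(P_hat):
    """ALP CM via Equation (19): the one-hot actual labels of Equation (10)
    are replaced by the predicted class probabilities P_hat (N x M)."""
    N, M = P_hat.shape
    Y_hat_c = np.zeros((N, M))
    Y_hat_c[np.arange(N), np.argmax(P_hat, axis=1)] = 1  # Equation (18)
    return Y_hat_c.T @ P_hat                             # M x M, real-valued

P_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.4, 0.1]])
cm_alp = alp_confusion_matrix(P_hat)
# cm_alp.sum() == N (Equation (21)); cm_alp.sum(axis=1) gives n(y_hat = L_k)
# (Equation (23)); cm_alp.sum(axis=0) approximates n(y = L_m) (Equation (22)).
```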

3.1. Actual Label Probabilistic vs. the Regular Confusion Matrix

In this section we prove the following theorem which provides the relation between the ALP CM and the regular CM:
Theorem 1.
Let us consider an algorithm $f$ that properly fits the training dataset (i.e., there is no presence of overfitting or underfitting) and produces estimated class probabilities with a good level of accuracy. Let us also consider an input dataset $(X, Y)$ with a substantially large size $N$ that follows the same distribution of samples as that of the training set. Then, the elements of the ALP CM approximate the values of the elements of the regular CM corresponding to the same input dataset:
$$ c_{ALP}(k, m) \approx c_{km}, \quad k, m = 1, 2, \ldots, M \tag{24} $$
Proof of Theorem 1.
It is evident that an algorithm suffering from overfitting or underfitting will also suffer from generalization error. Therefore, for such an algorithm, it is not possible to derive performance metrics that form a reliable prediction of the algorithm's behavior when applied to an arbitrary input set of samples $X$.
It is also evident that if the applied algorithm produces predicted class probabilities of limited accuracy, then the estimation of the ALP CM elements will also be of limited accuracy. For example, it is known from the literature [19] that certain types of algorithms, like Naïve Bayes or Boosted Trees, exhibit bias in the estimated class probabilities, while algorithms like neural networks behave much better in this respect.
Assuming that the applied algorithm does not suffer from overfitting or underfitting and the produced estimated class probabilities are of good accuracy, we have (from Equation (8)):
$$ c_{km} = \sum_{i=1}^{N} \hat{b}_{ik} \, b_{im} = \sum_{i=a_1}^{a_n} b_{im} \tag{25} $$
where, from the input set of samples $X$, we select the sub-set of sample instances $x_{ij}$ with indexes $i = a_1, a_2, \ldots, a_n$ for which the predicted label is $L_k$ and therefore $\hat{b}_{ik} = 1$. The number of sample instances in this sub-set is $n$ and is equal to the number of samples in $X$ for which the predicted label is $L_k$:
$$ n = n(\hat{y} = L_k) \tag{26} $$
We can now use $c_{km}$ to express the probability of the actual label $y$ being $L_m$ given that the predicted label $\hat{y}$ is $L_k$:
$$ \frac{c_{km}}{n} = \frac{1}{n} \sum_{i=a_1}^{a_n} b_{im} \approx \Pr\!\left( y = L_m \mid \hat{y} = L_k \right) \tag{27} $$
Accordingly, the ALP CM elements can be expressed as follows:
$$ c_{ALP}(k, m) = \sum_{i=1}^{N} \hat{b}_{ik} \cdot \hat{p}_{im} = \sum_{i=a_1}^{a_n} \hat{p}_{im} \tag{28} $$
Taking into account that $\hat{p}_{im}$ expresses the predicted class probability for class $L_m$ and that the above sum aggregates the class probabilities over all input sample instances for which the predicted class label is $L_k$, we obtain, for a substantially large sample size:
$$ \frac{c_{ALP}(k, m)}{n} = \frac{1}{n(\hat{y} = L_k)} \sum_{i=a_1}^{a_n} \hat{p}_{im} \approx \Pr\!\left( y = L_m \mid \hat{y} = L_k \right) \tag{29} $$
Combining Equations (26), (27) and (29), we obtain Equation (24), which proves Theorem 1. □
It is now evident that the following property applies: under the conditions for which the above theorem holds, the performance metrics of the CM approximate the performance metrics of the ALP CM for any input dataset $(X, Y)$ that follows the distribution of the training dataset:
$$ PM(CM) \approx PM(CM_{ALP}) \tag{30} $$
Therefore, the above property applies for all types of input datasets (training, test/validation, and real application sets).

3.2. The Algorithm Performance Metrics vs. the Maximum Estimated Class Probabilities

In this section, we provide an analysis that reveals the relation between the maximum estimated class probabilities and the algorithm performance metrics. We start from the estimation of the diagonal elements $c_{ALP}(k, k)$ of the ALP CM as a function of the maximum class probability. Assuming that Equation (17) is applied for the selection of the predicted class, and taking into account Equation (20), we obtain:
$$ c_{ALP}(k, k) = \sum_{i=1}^{N} \hat{b}_{ik} \cdot \hat{p}_{ik}, \quad k = 1, 2, \ldots, M \tag{31} $$
We can split the input set of samples $X$ into the following two sub-sets:
(a)
Sub-set $X_A$ with samples for which $\hat{p}_{ik} = \hat{p}_{max}(x_{ij}) = \max\{\hat{p}_{i1}, \hat{p}_{i2}, \ldots, \hat{p}_{iM}\}$, i.e., $\hat{p}_{ik}$ is the maximum estimated class probability;
(b)
Sub-set $X_B$ with samples for which $\hat{p}_{ik} \neq \hat{p}_{max}(x_{ij})$, i.e., $\hat{p}_{ik}$ is not the maximum estimated class probability.
By construction, the size of sub-set $X_A$ is $n(\hat{y} = L_k)$, since according to Equation (17), if the estimated class probability $\hat{p}_{ik}$ is the maximum one for sample $x_{ij}$, then the predicted class for the specific sample is $L_k$.
Based on the above split of set $X$, the estimation of $c_{ALP}(k, k)$ becomes:
$$ c_{ALP}(k, k) = \sum_{x_{ij} \in X_A} 1 \cdot \hat{p}_{max}(x_{ij}) + \sum_{x_{ij} \in X_B} 0 \cdot \hat{p}_{ik} \tag{32} $$
Let us define $\bar{p}_{max}(k)$ as the average value of the predicted class probability for class $k$ over the sub-set $X_A$ (i.e., when $\hat{p}_{ik}$ is the maximum class probability for sample $x_{ij}$):
$$ \bar{p}_{max}(k) = \frac{\sum_{x_{ij} \in X_A} \hat{p}_{max}(x_{ij})}{n(\hat{y} = L_k)} \tag{33} $$
Equation (32) can now be expressed as follows:
$$ c_{ALP}(k, k) = \bar{p}_{max}(k) \cdot n(\hat{y} = L_k) \tag{34} $$
Based on Equation (34), it is possible to estimate basic performance metrics like accuracy, precision, recall, and F1-score as functions of the maximum class probabilities produced by an algorithm $f$ when applied over a set $X$ of $N$ samples:
$$ Acc(CM_{ALP}) = \frac{1}{N} \sum_{k=1}^{M} c_{ALP}(k, k) = \frac{\sum_{k=1}^{M} \bar{p}_{max}(k) \cdot n(\hat{y} = L_k)}{N} = \bar{P}_{max} \tag{35} $$
where $\bar{P}_{max}$ is the average value of the maximum estimated class probability over the entire set of results produced by the algorithm $f$ when applied over the input set of samples $X$.
The precision score can be defined as a function of the maximum estimated class probability as follows (considering the generic multiclass classification and macro average definition of precision):
$$ Prec(CM_{ALP}) = \frac{1}{M} \sum_{k=1}^{M} \frac{c_{ALP}(k, k)}{n(\hat{y} = L_k)} = \frac{1}{M} \sum_{k=1}^{M} \frac{\bar{p}_{max}(k) \cdot n(\hat{y} = L_k)}{n(\hat{y} = L_k)} = \frac{1}{M} \sum_{k=1}^{M} \bar{p}_{max}(k) \tag{36} $$
The precision score for a binary classification problem (assuming two classes with P for Positive and N for Negative) is:
$$ Prec(CM_{ALP}) = \frac{TP}{TP + FP} = \frac{c_{ALP}(P, P)}{n(\hat{y} = P)} = \bar{p}_{max}(P) \tag{37} $$
Note that in binary classification, $n(\hat{y} = P) = TP + FP$ (where TP: True Positives, FP: False Positives).
The recall score can similarly be defined as a function of the maximum estimated class probability as follows (considering the generic multiclass classification case and macro average definition of recall):
$$ Rec(CM_{ALP}) = \frac{1}{M} \sum_{k=1}^{M} \frac{c_{ALP}(k, k)}{n(y = L_k)} = \frac{1}{M} \sum_{k=1}^{M} \frac{n(\hat{y} = L_k)}{n(y = L_k)} \cdot \bar{p}_{max}(k) \tag{38} $$
The recall metric for a binary classification problem is:
$$ Rec(CM_{ALP}) = \frac{TP}{TP + FN} = \frac{c_{ALP}(P, P)}{n(y = P)} = \frac{n(\hat{y} = P)}{n(y = P)} \, \bar{p}_{max}(P) \tag{39} $$
Note that in binary classification, $n(y = P) = TP + FN$ (where FN: False Negatives).
Finally, the F1-score can be estimated as follows assuming a multiclass classification problem and the macro average definition of the F1-score:
$$ F1\text{-}score(CM_{ALP}) = \frac{1}{M} \sum_{k=1}^{M} \frac{2 \cdot \frac{c_{ALP}(k,k)}{n(y = L_k)} \cdot \frac{c_{ALP}(k,k)}{n(\hat{y} = L_k)}}{\frac{c_{ALP}(k,k)}{n(y = L_k)} + \frac{c_{ALP}(k,k)}{n(\hat{y} = L_k)}} = \frac{1}{M} \sum_{k=1}^{M} \frac{2 \, c_{ALP}(k, k)}{n(y = L_k) + n(\hat{y} = L_k)} \tag{40} $$
Taking into account Equation (34), we obtain:
$$ F1\text{-}score(CM_{ALP}) = \frac{1}{M} \sum_{k=1}^{M} \frac{2 \cdot n(\hat{y} = L_k)}{n(y = L_k) + n(\hat{y} = L_k)} \cdot \bar{p}_{max}(k) \tag{41} $$
and for a binary classification problem, we obtain:
$$ F1\text{-}score(CM_{ALP}) = \frac{2 \cdot Prec(CM_{ALP}) \cdot Rec(CM_{ALP})}{Prec(CM_{ALP}) + Rec(CM_{ALP})} = \frac{2 \cdot n(\hat{y} = P)}{n(y = P) + n(\hat{y} = P)} \cdot \bar{p}_{max}(P) \tag{42} $$
From the above analysis, we may obtain some insights regarding the performance of classification algorithms:
  • Based on Theorem 1 and Equation (30), we may conclude that if the conditions of Theorem 1 apply, then the equations provided for the performance metrics of the ALP CM approximate the regular CM performance metrics.
  • The above set of equations indicates that the performance metrics of a classification algorithm are related to the maximum class probabilities produced by the algorithm when applied over a set of input samples. This is an expected result, since the maximum class probability provides an indication of the confidence of the prediction made by the classification algorithm.
  • It is feasible to use the above equations to estimate the ALP CM performance metrics without the need to first calculate the ALP CM (see the sketch after this list). Note that, in the case of input sets with unknown actual class labels (e.g., real application sets), the number of actual class labels $n(y = L_k)$ appearing in some of the equations can be estimated from the produced class probabilities according to Equation (22).
  • Let us assume that two different trained algorithms $f_1$, $f_2$, when applied over the same set of input samples $X$, lead to the same number of predicted class labels, i.e., common $n(\hat{y} = L_k)$ for all classes $k = 1, 2, \ldots, M$. We also assume that the conditions of Theorem 1 apply for both algorithms. Then, if algorithm $f_1$ produces higher average class probabilities $\bar{p}_{max}(k)$ than algorithm $f_2$ for all classes $k = 1, 2, \ldots, M$, algorithm $f_1$ outperforms algorithm $f_2$ in the accuracy, precision, recall, and F1-score metrics.
  • Let us assume that a trained algorithm $f$, when applied over an input set of samples $X$, produces the same average class probabilities $\bar{p}_{max}(k) = P_o$ for all classes $k = 1, 2, \ldots, M$. Then, the ALP CM-based accuracy and precision scores are equal: $Acc(CM_{ALP}) = Prec(CM_{ALP}) = P_o$. If the conditions of Theorem 1 apply, the same property applies to the regular CM accuracy and precision metrics.
  • Let us assume that a trained algorithm $f$, when applied over an input set of samples $X$, leads to the same number of predicted classes as the actual number of classes, i.e., $n(\hat{y} = L_k) = n(y = L_k)$ for $k = 1, 2, \ldots, M$. Then, the ALP CM-based precision score is equal to the relevant recall score and to the F1-score: $Prec(CM_{ALP}) = Rec(CM_{ALP}) = F1\text{-}score(CM_{ALP}) = \frac{1}{M} \sum_{k=1}^{M} \bar{p}_{max}(k)$. If the conditions of Theorem 1 apply, the same property applies to the metrics of the regular CM.
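As noted in the list above, the following sketch computes the ALP CM accuracy, macro precision, and macro recall directly from the probability array, per Equations (35), (36), and (38), estimating $n(y = L_k)$ via Equation (22) so that no actual labels are required (names ours):

```python
import numpy as np

def alp_metrics(P_hat):
    """ALP CM accuracy, macro precision, and macro recall computed straight
    from the class probabilities P_hat (N x M), without building the ALP CM."""
    N, M = P_hat.shape
    pred = np.argmax(P_hat, axis=1)
    p_max = P_hat[np.arange(N), pred]        # maximum probability per sample
    n_pred = np.bincount(pred, minlength=M)  # n(y_hat = L_k)
    n_actual = P_hat.sum(axis=0)             # Equation (22): approx. n(y = L_k)
    # p_bar_max(k): average maximum probability within each predicted class
    p_bar_max = np.array([p_max[pred == k].mean() if n_pred[k] > 0 else 0.0
                          for k in range(M)])
    accuracy = p_max.mean()                           # Equation (35)
    precision = p_bar_max.mean()                      # Equation (36), macro
    recall = np.mean(n_pred / n_actual * p_bar_max)   # Equation (38), macro
    return accuracy, precision, recall
```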

4. Actual Label Probabilistic Confusion Matrix in Real Classification Problems

In order to verify the theoretical framework provided in the previous section for the ALP CM, we considered a set of known classification problems from the UCI machine learning repository [22], listed in Table 1. Moreover, the following set of state-of-the-art machine learning algorithms was exploited in this study: logistic regression (LogReg), Support Vector Machines (SVMs), decision trees (DTs), random forest (RF), Naïve Bayes (NB), eXtreme Gradient Boosting (XGB), Convolutional Neural Networks (CNNs), and Artificial Neural Networks (ANNs). Appendix A provides the necessary details on the applied machine learning algorithms.
Figure 1 provides examples of a regular CM and an ALP CM for Problems 1 (students' dropout rate) and 8 (cover type) for the LogReg and ANN algorithms, respectively. As can be observed, the values of the ALP CM elements are relatively close to their respective CM ones, as expected from the theoretical analysis.

4.1. Algorithm Performance: ALP CM vs. Regular CM Metrics

Figure 2 provides an overview of the performance metrics (accuracy, precision, recall, and F1-score) resulting from the CM and ALP CM analysis for all applied algorithms on the Problem 1 (students' dropout rate) dataset. The metrics refer to the training and the test set, which were defined based on a random 80–20% split of the problem dataset. As can be observed, there was good agreement between the metrics estimated from the CM and the ALP CM for LogReg, SVM, DT, and ANN. This fact indicates that the conditions of Theorem 1 were probably satisfied for these algorithms, as the performance metrics converged. For NB, the ALP CM metrics were significantly higher than the CM metrics. For RF and XGB, the CM metrics were higher than the ALP CM metrics in the training set and closer in the test set. Probably, for these algorithms, the conditions of Theorem 1 were not fully satisfied, leading to a discrepancy between the relevant ALP CM and CM performance metrics.
The ALP CM metrics provided in Figure 2 were calculated with two different methods: (a) calculating the ALP CM and then estimating the relevant performance metrics (accuracy, precision, recall, F1-score), and (b) using Equations (35)–(39), (41), and (42). The same methodology was applied to the rest of the problems listed in Table 1. In all cases, the two methods produced identical ALP CM metrics, as expected, thus verifying the provided theoretical analysis.
Table 2 provides an overview of the accuracy metric per problem and per applied algorithm, as resulting from the CM and the ALP CM analysis. The training and test sets were produced based on a random 80–20% split of the overall dataset of each problem. The results presented in Table 2 confirm that the accuracy defined based on the ALP CM was quite close to the one defined based on the CM for most of the applied algorithms. As observed in Figure 2, the results of Table 2 confirm that for the NB algorithm, there was a lack of agreement between the ALP CM and CM accuracy. This finding is possibly related to the known fact that the estimation of class probabilities may be problematic for algorithms like NB. There were also some isolated differences between the CM and ALP CM accuracy metrics for the CNN and XGB algorithms, indicating the need for further investigation. These discrepancies probably indicate cases where the conditions of Theorem 1 were not satisfied.

4.2. Algorithm Learning Curve Analysis

Figure 3, Figure 4 and Figure 5 provide examples of the algorithm learning curves, based on accuracy vs. training set size, for Problem 4 (red wine quality) and all the applied algorithms. The training set was selected randomly from the problem input dataset, with a size expressed in the figures as a percentage of the overall input dataset, ranging from 3% to 78%. After selecting the training set, a test set was also selected randomly from the remaining input dataset, with a size equal to 20% of the total dataset.
As shown in Figure 3, the LogReg, SVM, DT, ANN, and CNN algorithms were characterized by a similar pattern in the CM accuracy curves. For a small training set size, we observed an overfitting pattern with the training set CM accuracy being higher than the test set CM accuracy (generalization error). As the training set size increased, the CM accuracy converged to a rather stable level similar for both training and test sets. This was a clear pattern of overfitting which faded out as the training set size increased.
From the same figure, we may observe that the ALP CM accuracy for the training set was clearly lower than the corresponding CM accuracy in the small training set size region and effectively converged to the CM accuracy as the training set size increased. Moreover, the ALP CM accuracy values for the training and the test sets were very close for these algorithms especially in the range of a large training set size. The DT algorithm was an exception, where it appeared that the ALP CM accuracy of both training and test sets almost coincided with the CM accuracy of the training set. Based on these results, we may assume that for Problem 4, the set of algorithms presented in Figure 3 satisfied both conditions of Theorem 1 for a substantially large size of the training set, thus leading to a converged performance of the ALP CM and regular CM.
As shown in Figure 4, for the RF and XGB algorithms, across the entire range of training set sizes, the performance on the training set was very high, while there was clear evidence of generalization error, i.e., the test set CM performance was substantially lower. Moreover, the ALP CM accuracy was very close to the actual CM accuracy of the test set. In this case, we had a pattern of overfitting which was not related to the training set size but to the configuration of the applied algorithm. Therefore, for these algorithms, the conditions of Theorem 1 were not satisfied, leading to a discrepancy between the ALP CM and the CM performance metrics (accuracy in this case).
There are methods to mitigate this type of overfitting for both the RF and XGB algorithms. In this study, we set the maximum tree depth limit of the RF algorithm to two, and the learning rate parameter of the XGB algorithm to 0.01. The results are presented in Figure 4, with the adjusted algorithms marked as RF' and XGB', respectively. The adjusted RF algorithm (RF') led to a converged level of both training and test set CM accuracy, even for small training set sizes. The corresponding ALP CM accuracy was common to the training and test sets and close to the CM values. For the adjusted XGB algorithm (XGB'), we observed an overfitting pattern in the range of small training set sizes which faded out as the sample size increased. However, even in the area of convergence, the ALP CM accuracy values for both training and test sets were substantially lower than the CM ones. This was an indication of a potential issue with the quality of the estimated class probabilities. To further investigate this issue, we applied probability calibration (based on Platt's method) [19,30,31]. The results shown in Figure 4 for the algorithm named XGB'' indicate that probability calibration indeed resolved the discrepancy between the ALP CM and CM accuracy. The results presented in Figure 4 indicate that for algorithms for which a discrepancy between the ALP CM and CM metrics is observed, the conditions of Theorem 1 should be investigated, i.e., overfitting and class probability estimation quality. As soon as these issues were resolved (based on state-of-the-art methods), Theorem 1's prediction applied (i.e., the ALP CM and CM metrics converged).
Observing the learning curves of the NB algorithm in Figure 5, it appears that the discrepancy between the CM and ALP CM metrics observed in Table 2 is evident here too. The ALP CM accuracy is very high compared to the CM metrics for both the training and test sets. As already mentioned, this discrepancy is related to the quality of the class probabilities produced by the NB algorithm. As in the case of the XGB algorithm, we applied probability calibration to further investigate this observation. The resulting performance is presented in Figure 5 in the chart for the NB' algorithm. As can be seen, the probability calibration process improved the convergence of the algorithm performance (accuracy in this case) for both the training and the test sets. Moreover, the ALP CM accuracy was in full agreement with the CM accuracy for both training and test sets.
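For reference, Platt-style (sigmoid) probability calibration is available off the shelf in scikit-learn; the sketch below calibrates an NB classifier on synthetic data (the dataset and parameter choices are illustrative, not the paper's exact setup):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Illustrative data; the paper uses the UCI problems of Table 1 instead.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# method="sigmoid" corresponds to Platt's method [31].
nb_calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
nb_calibrated.fit(X_train, y_train)
P_hat = nb_calibrated.predict_proba(X_test)  # calibrated class probabilities
# P_hat can now feed the ALP CM of Equation (19) and the metric sketches above.
```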
The results presented in this section provide solid evidence for the predictions of Theorem 1, both in terms of the required conditions (appropriate algorithm fitting and good quality of class probabilities) and in terms of the observed convergence of the ALP CM and regular CM performance metrics. Moreover, the presented results highlight the fact that, in order to exploit the ALP CM concept, it may be necessary to apply state-of-the-art methods, such as techniques for the mitigation of overfitting or class probability calibration.

4.3. Distribution of Algorithm Performance Metrics

The performance of a trained algorithm over an application input set of samples (i.e., a set for which the class labels are unknown) is considered to be in line with the performance of the algorithm observed during the training phase (i.e., the algorithm performance on test/validation sets). This is valid under the following two basic assumptions: (a) the application set of samples $X$ originates from the same space of solutions $(S_X, S_Y)$ as the training and the test/validation sets, and (b) the distribution of samples in the application set $X$ is similar to that of the training and test/validation sets.
Let us now consider the distribution of the algorithm performance over a set of randomly selected validation sets. We considered the Problem 2 dataset and applied a split of 60% for training, 20% for testing, and 20% for the validation analysis. The LogReg algorithm was then trained using the training and test sets, leading to a prediction of the algorithm performance. Then, through a randomized process, we produced 400 validation sets, each of size equal to 10% of the overall input dataset, selected from the 20% of samples reserved for validation purposes. Figure 6 presents the results of this analysis. As can be seen, the accuracy metric was distributed quite closely around the performance predicted from the training analysis. On the other hand, the precision and recall metrics had a wider distribution around the predicted performance.
As already mentioned, the ALP CM has the advantage that it can be calculated based on the knowledge of the predicted labels and the class probabilities produced by the applied machine learning algorithm. This allows us to calculate the ALP CM for application datasets for which the actual labels are unknown. Based on this property, we investigated the option of using the ALP CM as a means to estimate the algorithm performance when applied to a specific application set.
Using the analysis performed to produce Figure 6 (Problem 2, LogReg algorithm, 400 randomly generated validation sets with size equal to 10% of the overall input dataset), we assessed the capability of the ALP CM to predict the algorithm performance. As shown in Figure 7, the ALP CM metrics were quite close to the actual CM metrics of the randomly selected validation sets, although there appeared to be low correlation between the ALP CM and CM metrics.
Table 3 provides the Mean Squared Error (MSE) of the predicted metrics (ALP CM based) vs. the actual ones (CM based) for Problem 2 and all of the applied algorithms based on the observed performance on validation sets generated according to the method described above. Combining the results presented in Figure 7 and Table 3, we may conclude that the ALP CM metrics can be used to predict the performance of a trained algorithm over a real-world application sample, at a reasonable level of approximation.
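The experiment behind Table 3 can be sketched as follows: draw random validation sets, predict each set's accuracy from the ALP CM via Equation (35), and accumulate the squared error against the actual CM accuracy (the trained model and the validation pool are assumed already available; names ours):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# `model` is a trained classifier exposing predict/predict_proba, and
# (X_val_pool, y_val_pool) is the 20% of samples reserved for validation.
rng = np.random.default_rng(0)
sq_errors = []
for _ in range(400):
    idx = rng.choice(len(y_val_pool), size=len(y_val_pool) // 2, replace=False)
    P_hat = model.predict_proba(X_val_pool[idx])
    alp_acc = P_hat.max(axis=1).mean()        # ALP CM accuracy, Equation (35)
    cm_acc = accuracy_score(y_val_pool[idx], model.predict(X_val_pool[idx]))
    sq_errors.append((alp_acc - cm_acc) ** 2)
print("accuracy MSE:", np.mean(sq_errors))
```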

5. ALP CM Exploitation Methodology

Taking into account the results presented in the previous section, it is feasible to derive a methodology for exploiting the ALP CM and its properties for handling classification problems. Figure 8 presents the proposed methodology with special highlights on the steps related to ALP CM capabilities. The provided steps are explained later in this section.
During the algorithm training phase, the originally available known dataset is split into training and test sets, e.g., an 80–20% sample split. The typical state-of-the-art cross-validation methods eventually lead to algorithm training based on only a portion of the available dataset (e.g., 80% of the available samples with a known solution).
Taking advantage of the observed learning curve patterns presented in the previous section, it is feasible to use the ALP CM as a validation method and proceed with an additional step of algorithm training and validation. In this additional step, the algorithm is trained on the entire available input dataset, and the ALP CM analysis is exploited to support the validation analysis. It should be noted that, as shown in Section 4, before proceeding to the full exploitation of the ALP CM analysis, state-of-the-art techniques may have to be applied to handle possible algorithm issues related to overfitting, as well as class probability calibration techniques to handle issues related to the quality of the produced class probabilities of the selected algorithm.
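A minimal sketch of this additional training step, assuming a model constructor and the full labelled dataset are available (make_model, X, and y are placeholders, not the paper's code):

```python
# Train on 100% of the labelled data and use the ALP CM accuracy of
# Equation (35) as the validation signal in place of a held-out test set.
# `make_model`, `X`, and `y` are placeholders for the chosen classifier
# and the full labelled dataset.
model = make_model()
model.fit(X, y)                                # fit on the full input dataset
P_hat = model.predict_proba(X)
alp_accuracy = P_hat.max(axis=1).mean()        # ALP CM accuracy estimate
cm_accuracy = (model.predict(X) == y).mean()   # training-set CM accuracy
# Convergence of alp_accuracy toward cm_accuracy indicates proper fitting;
# a persistent gap flags overfitting (cf. the learning curves of Section 4.2).
```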
To demonstrate the potential of this method, we used Problem 1 (students' dropout rate). Our study was based on an input dataset with known solutions of size N = 1194 samples, generated by a random selection of samples from the overall problem dataset (i.e., selected from a total of 4424 samples). The input set was randomly split into 80% of samples forming the training set and 20% forming a test set. Based on the method described above, we considered two phases of algorithm training: (a) algorithm training using the training set and verification of its performance based on the test set, and (b) based on the proposed method, the training and test sets were united into a single training set and, based on the ALP CM analysis, we continued the algorithm training with 100% of the input dataset. Figure 9 shows the resulting learning curve for this case study. The figure presents the above two training phases:
(a)
In the range of training set size equal to 15–80% of the input dataset, we obtained a pattern of overfitting which was confirmed by the presence of a generalization error (i.e., the test set accuracy was substantially lower than the training set accuracy) and the lower ALP CM accuracy.
(b)
In the range of training set size of 80–100% of the input set, we observed that the ALP CM accuracy converged to the training set CM accuracy. This was an indication that training the algorithm with 100% of the input dataset led to better algorithm performance.
To simulate the performance of the algorithm on a real-world application set of input samples, we randomly selected a validation set of 300 samples from the remaining pool of problem samples (i.e., 4424 − 1194 = 3230 samples). The validation set was then analyzed using the algorithm trained in each of the two phases described above (i.e., on 80% and on 100% of the input dataset). Figure 9 shows the achieved accuracy performance, and Table 4 summarizes the achieved performance metrics. This example illustrates that the ALP CM can improve the training process of an algorithm.
It should be noted that this example was designed to demonstrate the value of the ALP CM method. Although the proposed method does not lead to an improvement under all conditions, the presented case study highlights the fact that, in practice, depending on the nature of the problem and the availability of known input data, the proposed method may prove valuable.
To further assess the potential of using the ALP CM to estimate the algorithm performance on real-world application sets, we applied an approach similar to the one followed in the previous section. A total of 200 validation sets, each with a size of 300 samples, were generated by a random selection of samples from the pool of Problem 1 samples which were not used in the training and testing phase (i.e., 4424 − 1194 = 3230 samples). The comparison of the ALP CM metrics with the CM metrics led to the following MSE levels: accuracy MSE: 5.48 × 10−3, precision MSE: 4.62 × 10−2, recall MSE: 4.83 × 10−3. The results confirm that the ALP CM metrics provide a good approximation of the application set performance metrics.

6. Conclusions and Next Research Steps

The current paper addressed the topic of classification problem analysis based on supervised machine learning and introduced the novel concept of the Actual Label Probabilistic confusion matrix, which is calculated from the predicted class labels and the relevant class probabilities. The theoretical analysis suggested that, for algorithms not suffering from overfitting or underfitting and for a substantially large training dataset size, the performance metrics of the ALP CM converge to those of the regular CM. In addition, the theoretical framework developed in this paper indicated that the performance metrics of an algorithm can be estimated from the ALP CM metrics as a function of the average of the maximum class probability.
The theoretical predictions were tested using a wide variety of real-world problems and machine learning algorithms. In this process, it was observed that the theoretical analysis results held for all problems and for almost all of the applied algorithms, the exceptions being algorithms like random forest, Naïve Bayes, and XGBoost. The adoption of overfitting mitigation methods and class probability calibration eventually led to agreement with the theoretical analysis for all algorithms.
Based on the above results, this paper provided two specific use cases where the ALP CM could be exploited to improve the analysis of a classification problem:
(a)
During the training phase of a classification algorithm, the ALP CM provides the ability to use 100% of the input dataset for algorithm training. In this case, the ALP CM provides the means to validate the algorithm performance (i.e., to assess the expected generalization error), as opposed to state-of-the-art methods where a portion of the available dataset is reserved for cross-validation purposes. This method may prove essential in cases where the available input dataset size is limited compared to the sample size required for achieving convergence of the algorithm performance.
(b)
During the implementation of the trained algorithm over a real-world application set, the ALP CM can still be calculated despite the fact that the actual class labels are unknown. In this case, the ALP CM metrics provide a good approximation of the actual algorithm performance.
The next research steps in the area of probabilistic confusion matrix indicatively include the following:
  • A potential link of the probabilistic confusion matrix with the conditions which ensure that a machine learning algorithm generalizes [32].
  • The capability to employ nested methods based on the probabilistic confusion matrix so as to apply hyperparameter optimization [33].
  • The ability to combine the probabilistic confusion matrix use cases with the reduced confusion matrix methodology developed in [34].

Author Contributions

Conceptualization and methodology, I.M.; software, G.M. and I.M.; validation, I.M. and G.M.; investigation, I.M. and G.M.; writing—original draft preparation, I.M.; visualization, G.M.; supervision, I.M.; funding acquisition, I.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been partially financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH—CREATE—INNOVATE (project code: T1EDK-05063).

Data Availability Statement

The original contributions presented in the study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. The Applied Machine Learning Algorithms

Appendix A.1. Decision Trees

Decision tree learning is a predictive modelling approach used in statistics, data mining, and machine learning [35,36]. Table A1 contains the decision tree parameters used in this paper.
Table A1. Parameters of the decision tree classifiers.
Parameter | Values
Function measuring the quality of a split | Entropy
Maximum depth of tree | 3
Weights associated with classes | 1

Appendix A.2. Support Vector Machines (SVMs)

SVMs are supervised learning models with associated learning algorithms [37,38,39] providing a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear margin that is as wide as possible. The parameters of the SVM method are presented in Table A2.
Table A2. Parameters of the Support Vector Machine classifiers.
Parameter | Values
Kernel type | Linear
Degree of polynomial kernel function | 3
Weights associated with classes | 1

Appendix A.3. Random Forest (RF)

The random forest classifier consists of a combination of tree classifiers where each of them is generated using a random vector sampled independently from the input vector, and each tree casts a unit vote for the most popular class to classify an input vector [40,41]. Table A3 contains the parameters of the RF classifier.
Table A3. Parameters of the random forest classifiers.
Parameter | Values
Number of trees | 100
Maximum tree depth limit | Default: None / Adjusted: 2
Measure of the quality of a split | Gini index

Appendix A.4. Artificial Neural Networks (ANNs)

An ANN typically consists of an input layer, one or more hidden layers of neurons that process these data in a non-linear way, and one output layer that yields the classification outcome [42,43,44]. In this paper, an ANN with one input, one hidden, and one output layer was employed (Table A4). The ReLU function was adopted as the activation function for the input and hidden layers, while the softmax function was used for the output layer.
Table A4. Parameters of the Artificial Neural Network.
Parameter | Values
Number of hidden neurons | 6
Activation function applied to the input and hidden layers | ReLU
Activation function applied to the output layer | Softmax
Optimizer network function | Adam
Calculated loss | Sparse categorical cross-entropy
Epochs used | 100
Batch size | 10

Appendix A.5. Convolutional Neural Networks (CNNs)

CNNs extract a set of features from the raw data by applying convolutions on the input signals, propagating them into deep layers, while at the last layer, a classification is carried out to assign the input data to classes based on the use of the deep features identified by the convolutional layers [45,46]. Table A5 provides the parameters of the CNN algorithm.
Table A5. Parameters of the Convolutional Neural Network.
Parameter | Values
Model | Sequential (array of Keras layers)
Kernel size | 3
Pool size | 4
Activation function applied | ReLU
Calculated loss | Categorical cross-entropy
Epochs used | 100
Batch size | 128

Appendix A.6. Naïve Bayes

Naïve Bayes classifiers are a family of probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions between the features [47]. Maximum-likelihood training can be performed by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used in many other types of classifiers.

Appendix A.7. Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable [48]. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter. Table A6 provides the main configuration parameters of the algorithm.
Table A6. Parameters of the logistic regression.
Parameter | Values
Maximum number of iterations | 5000
Algorithm used in optimization | L-BFGS
Weights associated with classes | 1

Appendix A.8. Extreme Gradient Boosting (XGB)

Extreme Gradient Boosting is a scalable, distributed, gradient-boosted decision tree machine learning library. It provides parallel tree boosting and is the leading machine learning library for regression, classification, and ranking problems [49]. Table A7 provides the main configuration parameters of the algorithm.
Table A7. Parameters of the eXtreme Gradient Boosting (XGB) algorithm.
Parameter | Values
Learning rate (eta) | Default: 0.3 / Adjusted: 0.01
Max depth | 3
Subsample | 1

References

  1. Alpaydin, E. Introduction to Machine Learning; The MIT Press: Cambridge, MA, USA, 2010; ISBN 978-0-262-01243-0. [Google Scholar]
  2. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160. [Google Scholar] [CrossRef] [PubMed]
  3. Garg, A.; Roth, D. Understanding probabilistic classifiers. In Proceedings of the ECML 2001 12th European Conference on Machine Learning, LNAI 2167, Freiburg, Germany, 5–7 September 2001; pp. 179–191. [Google Scholar]
  4. Uddin, S.; Khan, A.; Hossain, M.; Moni, M.A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 2019, 19, 281. [Google Scholar] [CrossRef] [PubMed]
  5. Stehman, S.V. Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 1997, 62, 77–89. [Google Scholar] [CrossRef]
  6. Ting, K.M. Confusion Matrix. In Encyclopedia of Machine Learning and Data Mining; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  7. Olivier, B.; Luxburg, U.; Rätsch, G. (Eds.) Advanced Lectures on Machine Learning. Lecture Notes in Computer Science; Springer: Dordrecht, The Netherlands, 2004; Volume 3176, pp. 169–207. [Google Scholar] [CrossRef]
  8. Refaeilzadeh, P.; Tang, L.; Liu, H. Cross-Validation. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: Boston, MA, USA, 2009. [Google Scholar] [CrossRef]
  9. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995; Morgan Kaufmann: San Mateo, CA, USA, 1995; pp. 1137–1143. [Google Scholar]
  10. Wang, X.-N.; Wei, J.-M.; Jin, H.; Yu, G.; Zhang, H.-W. Probabilistic Confusion Entropy for Evaluating Classifiers. Entropy 2013, 15, 4969–4992. [Google Scholar] [CrossRef]
  11. Trajdos, P.; Kurzynski, M. Weighting scheme for a pairwise multi-label classifier based on the fuzzy confusion matrix. Pattern Recognit. Lett. 2018, 103, 60–67. [Google Scholar] [CrossRef]
  12. Koço, S.; Capponi, C. On multi-class classification through the minimization of the confusion matrix norm. Proc. Mach. Learn. Res. 2013, 29, 277–292. Available online: https://proceedings.mlr.press/v29/Koco13.html (accessed on 25 May 2024).
  13. Yacouby, R.; Axman, D. Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 20 November 2020; pp. 79–91. [Google Scholar]
  14. Han, D.; Moniz, N.; Chawla, N.V. AnyLoss: Transforming Classification Metrics into Loss Functions. In Proceedings of the ACM Conference (Conference’17), Washington, DC, USA, 25–27 July 2017. 12p. [Google Scholar] [CrossRef]
  15. Simske, S.J.; Wright, D.W.; Sturgill, M. Meta-algorithmic systems for document classification. In Proceedings of the 2006 ACM Symposium on Document Engineering (DocEng ‘06), Amsterdam, The Netherlands, 10–13 October 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 98–106. [Google Scholar] [CrossRef]
  16. Tornetta, G.N. Entropy Methods for the Confidence Assessment of Probabilistic Classification Models. Statistica 2021, 81, 383–398. [Google Scholar] [CrossRef]
  17. Lawson, C.R.; Hodgson, J.A.; Wilson, R.J.; Richards, S.A. Prevalence, thresholds and the performance of presence–absence models. Methods Ecol. Evol. 2014, 5, 54–64. [Google Scholar] [CrossRef]
  18. Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756. [Google Scholar] [CrossRef]
  19. Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML ‘05), Bonn, Germany, 7–11 August 2005; Association for Computing Machinery: New York, NY, USA, 2005; pp. 625–632. [Google Scholar] [CrossRef]
20. Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. Data Mining: Practical Machine Learning Tools and Techniques, 4th ed.; Elsevier: Amsterdam, The Netherlands, 2017. [Google Scholar] [CrossRef]
  21. Bhatia, N. Survey of nearest neighbor techniques. arXiv 2010, arXiv:1007.0085. [Google Scholar]
  22. Kelly, M.; Longjohn, R.; Nottingham, K. The UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu (accessed on 20 March 2024).
23. Realinho, V.; Vieira Martins, M.; Machado, J.; Baptista, L. Predict Students’ Dropout and Academic Success. UCI Mach. Learn. Repos. 2021; dataset DOI: 10.24432/C5MC89. [Google Scholar] [CrossRef]
  24. Moro, S.; Cortez, P.; Rita, P. A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst. 2014, 62, 22–31. [Google Scholar] [CrossRef]
  25. Hofmann, H. Statlog (German Credit Data). UCI Mach. Learn. Repos. 1994, 53. [Google Scholar] [CrossRef]
  26. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 2009, 47, 547–553. [Google Scholar] [CrossRef]
  27. Becker, B.; Kohavi, R. Adult. UCI Mach. Learn. Repos. 1996. [Google Scholar] [CrossRef]
  28. Slate, D. Letter Recognition. UCI Mach. Learn. Repos. 1991. [Google Scholar] [CrossRef]
  29. Blackard, J. Covertype. UCI Mach. Learn. Repos. 1998. [Google Scholar] [CrossRef]
30. Zadrozny, B.; Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), Edmonton, AB, Canada, 23–26 July 2002; Association for Computing Machinery: New York, NY, USA, 2002; pp. 694–699. [Google Scholar] [CrossRef]
  31. Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv. Large Margin Classif. 1999, 10, 61–74. [Google Scholar]
  32. Mukherjee, S.; Niyogi, P.; Poggio, T.; Rifkin, R.M. Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math. 2006, 25, 161–193. [Google Scholar] [CrossRef]
  33. Soper, D.S. Greed Is Good: Rapid Hyperparameter Optimization and Model Selection Using Greedy k-Fold Cross Validation. Electronics 2021, 10, 1973. [Google Scholar] [CrossRef]
  34. Markoulidakis, I.; Rallis, I.; Georgoulas, I.; Kopsiaftis, G.; Doulamis, A.; Doulamis, N. Multiclass Confusion Matrix Reduction Method and Its Application on Net Promoter Score Classification Problem. Technologies 2021, 9, 81. [Google Scholar] [CrossRef]
  35. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
  36. Rokach, L.; Maimon, O.Z. Data Mining with Decision Trees: Theory and Applications; World Scientific: Singapore, 2008; Volume 69. [Google Scholar]
37. Basak, D.; Pal, S.; Patranabis, D.C. Support Vector Regression. Neural Inf. Process. Lett. Rev. 2007, 11, 203–224. [Google Scholar]
  38. Abe, S. Support Vector Machines for Pattern Classification, 2nd ed.; Advances in Computer Vision and Pattern Recognition; Springer: London, UK, 2010. [Google Scholar] [CrossRef]
  39. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  40. Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222. [Google Scholar] [CrossRef]
  41. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  42. Doulamis, A.; Doulamis, N.; Kollias, S. On-line retrainable neural networks: Improving the performance of neural networks in image analysis problems. IEEE Trans. Neural Netw. 2000, 11, 137–155. [Google Scholar] [CrossRef] [PubMed]
43. Haykin, S. Neural Networks: A Comprehensive Foundation; Prentice-Hall Inc.: Upper Saddle River, NJ, USA, 2007. [Google Scholar]
  44. Hecht-Nielsen, R. Kolmogorov’s mapping neural network existence theorem. In Proceedings of the International Conference on Neural Networks, San Diego, CA, USA, 21–24 June 1987; IEEE Press: New York, NY, USA, 1987; Volume 3, pp. 11–14. [Google Scholar]
  45. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep Learning for Computer Vision: A Brief Review. Comput. Intell. Neurosci. 2018. [Google Scholar] [CrossRef] [PubMed]
  46. Doulamis, A.; Doulamis, N.; Protopapadakis, E.; Voulodimos, A. Combined Convolutional Neural Networks and Fuzzy Spectral Clustering for Real Time Crack Detection in Tunnels. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4153–4157. [Google Scholar] [CrossRef]
  47. Haouari, B.; Amor, N.B.; Elouedi, Z.; Mellouli, K. Naïve possibilistic network classifiers. Fuzzy Sets Syst. 2009, 160, 3224–3238. [Google Scholar] [CrossRef]
48. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 398. [Google Scholar]
49. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
Figure 1. CM and ALP CM examples corresponding to test sets (20% of the total dataset for Problems 1 and 8, respectively).
Figure 2. CM and ALP CM performance metrics for Problem 1 for the training and the test set.
Figure 3. Learning curves for LogReg, SVM, DT, ANN, and CNN algorithms applied to Problem 4.
Figure 4. Learning curves for RF and XGB algorithms applied to Problem 4 before (RF, XGB) and after parameter tuning (RF′, XGB′), with an additional probability-calibration step for XGB (XGB″).
Figure 5. Learning curve for Problem 4 for the plain NB algorithm (NB) and the NB algorithm with probability calibration (NB′).
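Figures 4 and 5 refer to a probability-calibration step (XGB″, NB′). The snippet below is a minimal sketch of such a step using scikit-learn's CalibratedClassifierCV; the choice of sigmoid (Platt) scaling [31], the five calibration folds, and the synthetic stand-in data are assumptions for illustration rather than the paper's exact setup.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (the paper uses the UCI problems listed in Table 1).
X, y = make_classification(n_samples=2000, n_features=11, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Plain NB vs. calibrated NB (NB' in Figure 5). Sigmoid (Platt) scaling and
# cv=5 are assumptions, not the paper's documented configuration.
nb_plain = GaussianNB().fit(X_train, y_train)
nb_cal = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
nb_cal.fit(X_train, y_train)

proba_plain = nb_plain.predict_proba(X_test)  # typically over-confident
proba_cal = nb_cal.predict_proba(X_test)      # better-calibrated scores
```

Isotonic-regression calibration [30] is obtained by passing method="isotonic" instead; it is more flexible but generally needs more calibration data.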
Figure 6. Example of the distribution of the performance metrics over randomly generated validation sets (Problem 2, LogReg).
Figure 7. Investigation of the potential of ALP CM metrics to predict the actual (CM-based) metrics for randomly generated validation sets (Problem 2, LogReg).
Figure 8. ALP CM exploitation methodology.
Figure 9. Training and validation analysis using the ALP CM.
Table 1. The list of classification problems considered in the analysis.

| Problem Number | Problem Title | Number of Features | Number of Classes | Sample Size |
|---|---|---|---|---|
| Problem 1 | Students' dropout rate [23] | 36 | 3 | 4424 |
| Problem 2 | Bank analysis [24] | 20 | 2 | 41,188 |
| Problem 3 | German data [25] | 24 | 2 | 1000 |
| Problem 4 | Red wine quality [26] | 11 | 10 | 1599 |
| Problem 5 | White wine quality [26] | 11 | 10 | 4998 |
| Problem 6 | Adult income [27] | 14 | 2 | 32,561 |
| Problem 7 | Letter recognition [28] | 16 | 26 | 20,000 |
| Problem 8 | Cover type [29] | 12 | 7 | 581,012 |
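The datasets of Table 1 are hosted in the UCI Machine Learning Repository [22]. As a convenience, the sketch below loads the Problem 1 dataset [23] with the ucimlrepo package; the dataset ID used here is an assumption and should be checked against the repository listing.

```python
from ucimlrepo import fetch_ucirepo  # pip install ucimlrepo

# Fetch the Problem 1 dataset by its UCI repository ID. The ID below
# (697, "Predict Students' Dropout and Academic Success") is an assumption;
# verify it at https://archive.ics.uci.edu before relying on it.
dropout = fetch_ucirepo(id=697)
X = dropout.data.features   # expected: 36 explanatory variables
y = dropout.data.targets    # expected: 3 class labels, 4424 samples
print(X.shape, y.value_counts())
```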
Table 2. Accuracy based on CM and ALP CM for all problems and all applied algorithms. Bold fonts indicate the clear mismatch of ALP CM vs. CM for the NB algorithm.

| Problem | Set | LogReg CM | LogReg ALP | SVM CM | SVM ALP | DT CM | DT ALP | RF CM | RF ALP | NB CM | NB ALP | XGB CM | XGB ALP | CNN CM | CNN ALP | ANN CM | ANN ALP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Problem 1 | Training set | 0.774 | 0.764 | 0.762 | 0.773 | 0.733 | 0.733 | 1.000 | 0.873 | **0.700** | **0.935** | 0.859 | 0.698 | 0.734 | 0.944 | 0.805 | 0.802 |
| Problem 1 | Test set | 0.762 | 0.762 | 0.758 | 0.773 | 0.708 | 0.731 | 0.772 | 0.724 | **0.695** | **0.937** | 0.775 | 0.684 | 0.739 | 0.936 | 0.755 | 0.802 |
| Problem 2 | Training set | 0.904 | 0.914 | 0.909 | 0.927 | 0.900 | 0.900 | 1.000 | 0.958 | **0.885** | **0.958** | 0.966 | 0.874 | 0.899 | 0.956 | 0.913 | 0.913 |
| Problem 2 | Test set | 0.906 | 0.915 | 0.894 | 0.924 | 0.905 | 0.902 | 0.917 | 0.918 | **0.886** | **0.958** | 0.904 | 0.869 | 0.901 | 0.958 | 0.911 | 0.914 |
| Problem 3 | Training set | 0.790 | 0.782 | 0.798 | 0.770 | 0.760 | 0.760 | 1.000 | 0.874 | **0.756** | **0.878** | 0.923 | 0.752 | 0.742 | 0.798 | 0.824 | 0.809 |
| Problem 3 | Test set | 0.715 | 0.767 | 0.690 | 0.751 | 0.655 | 0.751 | 0.760 | 0.722 | **0.675** | **0.872** | 0.700 | 0.740 | 0.761 | 0.792 | 0.710 | 0.797 |
| Problem 4 | Training set | 0.589 | 0.597 | 0.580 | 0.594 | 0.601 | 0.601 | 1.000 | 0.835 | **0.302** | **0.785** | 0.882 | 0.550 | 0.556 | 0.565 | 0.645 | 0.624 |
| Problem 4 | Test set | 0.575 | 0.588 | 0.575 | 0.584 | 0.559 | 0.588 | 0.666 | 0.653 | **0.256** | **0.806** | 0.600 | 0.521 | 0.553 | 0.560 | 0.616 | 0.628 |
| Problem 5 | Training set | 0.529 | 0.518 | 0.523 | 0.517 | 0.515 | 0.515 | 1.000 | 0.826 | **0.080** | **0.737** | 0.748 | 0.486 | 0.501 | 0.480 | 0.565 | 0.564 |
| Problem 5 | Test set | 0.537 | 0.518 | 0.522 | 0.516 | 0.488 | 0.513 | 0.635 | 0.626 | **0.071** | **0.737** | 0.584 | 0.477 | 0.504 | 0.481 | 0.528 | 0.565 |
| Problem 6 | Training set | 0.797 | 0.787 | 0.806 | 0.820 | 0.832 | 0.832 | 1.000 | 0.929 | **0.796** | **0.981** | 0.887 | 0.798 | 0.847 | 0.928 | 0.850 | 0.855 |
| Problem 6 | Test set | 0.794 | 0.787 | 0.814 | 0.827 | 0.824 | 0.834 | 0.848 | 0.863 | **0.794** | **0.982** | 0.848 | 0.782 | 0.848 | 0.926 | 0.840 | 0.855 |
| Problem 7 | Training set | 0.778 | 0.725 | 0.871 | 0.752 | 0.235 | 0.235 | 1.000 | 0.925 | **0.640** | **0.762** | 0.936 | 0.594 | 0.942 | 0.931 | 0.865 | 0.752 |
| Problem 7 | Test set | 0.776 | 0.726 | 0.848 | 0.738 | 0.241 | 0.236 | 0.957 | 0.805 | **0.654** | **0.767** | 0.886 | 0.571 | 0.885 | 0.919 | 0.835 | 0.750 |
| Problem 8 | Training set | 0.706 | 0.696 | 0.705 | 0.682 | 0.680 | 0.680 | 1.000 | 0.955 | **0.563** | **0.787** | 0.866 | 0.600 | 0.687 | 0.736 | 0.719 | 0.717 |
| Problem 8 | Test set | 0.700 | 0.696 | 0.725 | 0.683 | 0.678 | 0.680 | 0.958 | 0.888 | **0.561** | **0.785** | 0.721 | 0.585 | 0.684 | 0.739 | 0.704 | 0.720 |
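To make the comparison in Table 2 concrete, the following sketch shows one way the two matrices can be assembled from a classifier's predicted probabilities. It is a minimal illustration under the assumption that each row of the ALP CM accumulates the full predicted probability vectors of the samples belonging to that actual class (the precise construction is defined in the main text); the helper names are ours.

```python
import numpy as np

def regular_cm(y_true, proba):
    """Regular CM: count hard (argmax) predictions against actual classes.
    y_true is assumed to hold integer class indices aligned with proba's
    columns; proba has shape (n_samples, n_classes)."""
    n = proba.shape[1]
    cm = np.zeros((n, n))
    for t, p in zip(y_true, proba.argmax(axis=1)):
        cm[t, p] += 1.0
    return cm

def alp_cm(y_true, proba):
    """Probabilistic CM sketch: the row of each actual class accumulates
    the full predicted probability vector instead of a single hard count."""
    n = proba.shape[1]
    cm = np.zeros((n, n))
    for t, p in zip(y_true, proba):
        cm[t] += p
    return cm

def cm_accuracy(cm):
    """Accuracy as diagonal mass over total mass of the matrix."""
    return np.trace(cm) / cm.sum()
```

Both matrices sum to the number of samples, so their diagonal-over-total accuracies are directly comparable; for well-calibrated probabilities the two stay close, which is the pattern visible for most algorithms in Table 2, while the bold NB columns show the gap that poorly calibrated probabilities produce.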
Table 3. The MSE of the prediction of CM metrics (accuracy, precision, recall) based on the ALP CM.

| Algorithm (Problem 2) | MSE Accuracy | MSE Precision | MSE Recall |
|---|---|---|---|
| LogReg | 1.04 × 10⁻⁶ | 1.70 × 10⁻³ | 9.24 × 10⁻⁵ |
| SVM | 9.06 × 10⁻⁴ | 7.82 × 10⁻³ | 3.64 × 10⁻³ |
| DT | 2.05 × 10⁻⁵ | 2.60 × 10⁻⁴ | 1.34 × 10⁻⁴ |
| RF | 2.09 × 10⁻⁴ | 2.06 × 10⁻² | 8.73 × 10⁻⁴ |
| NB | 5.64 × 10⁻³ | 4.95 × 10⁻² | 1.87 × 10⁻² |
| XGB | 2.11 × 10⁻² | 6.91 × 10⁻³ | 2.34 × 10⁻² |
| CNN | 1.66 × 10⁻⁴ | 2.31 × 10⁻⁴ | 5.30 × 10⁻⁴ |
| ANN | 7.95 × 10⁻⁵ | 3.68 × 10⁻⁴ | 3.69 × 10⁻⁴ |
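The MSE figures in Table 3 can be estimated along the lines of the sketch below, which reuses the regular_cm, alp_cm, and cm_accuracy helpers from the previous listing. The number and size of the randomly drawn validation sets are assumptions; the paper's exact sampling scheme is described in the main text.

```python
import numpy as np

def alp_prediction_mse(y_true, proba, n_sets=100, frac=0.3, seed=0):
    """Draw random validation subsets and, on each, compare the ALP-CM-based
    accuracy with the CM-based accuracy; return the MSE of the differences.
    y_true must be a NumPy integer array aligned with the rows of proba."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_sets):
        idx = rng.choice(len(y_true), size=int(frac * len(y_true)),
                         replace=False)
        acc_cm = cm_accuracy(regular_cm(y_true[idx], proba[idx]))
        acc_alp = cm_accuracy(alp_cm(y_true[idx], proba[idx]))
        gaps.append(acc_alp - acc_cm)
    return float(np.mean(np.square(gaps)))
```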
Table 4. The performance metrics of the Problem 1 case study (input set of 1194 samples, validation set of 300 samples).

| Logistic Regression Metrics | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Training set 80% | 0.802 | 0.772 | 0.742 | 0.753 |
| Test set 20% | 0.732 | 0.668 | 0.646 | 0.654 |
| Training set 100% | 0.769 | 0.723 | 0.682 | 0.691 |
| Validation set (a): algorithm trained on 80% set | 0.735 | 0.673 | 0.651 | 0.659 |
| Validation set (b): algorithm trained on 100% set | 0.759 | 0.717 | 0.678 | 0.697 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
