Article

Invariance Properties and Evaluation Metrics Derived from the Confusion Matrix in Multiclass Classification

Independent Researcher, Plano, TX 75024, USA
Mathematics 2025, 13(16), 2609; https://doi.org/10.3390/math13162609
Submission received: 28 June 2025 / Revised: 28 July 2025 / Accepted: 12 August 2025 / Published: 14 August 2025

Abstract

Multiclass evaluation metrics derived from the confusion matrix—such as accuracy, precision, and recall—are widely used yet rarely formalized with respect to their structural invariance. This paper introduces a closed-form analytical framework demonstrating that core metrics remain stable under confusion matrix transposition and class label permutation. A minimal sufficient set of metrics is identified—Overall accuracy, Macro-averaged precision, Macro-averaged recall, and Weighted precision—from which all other commonly reported metrics can be derived. These invariance properties are established algebraically and validated numerically using both the Wine dataset and CIFAR-10 with a convolutional neural network. The results confirm robustness across models and datasets and clarify metric behavior under structural transformations. The proposed framework reinforces reproducibility and interpretability in multiclass evaluation, with implications for diagnostic screening, fraud detection, and classification audits.

1. Introduction

The confusion matrix is a foundational tool for evaluating classification models in machine learning, offering a structured summary of how model predictions compare to actual classes. Its conceptual origins date back to early developments in signal detection theory, where researchers in the 1940s and 1950s [1] introduced decision-based outcomes such as hit, miss, false alarm, and correct rejection. These outcomes map directly into modern notions of true positive, false negative, false positive, and true negative in binary classification.
Around the same time, the field of information retrieval advanced foundational evaluation ideas. Kent, Berry, Luhn, and Perry (1955) [2] formalized the distinction between relevant and retrieved documents, leading to precision and recall as core metrics. Duda and Hart (1973) [3], while not explicitly using the term “confusion matrix”, structured classification results in tabular form, which served the same purpose.
The terminology and widespread adoption of the confusion matrix emerged in the 1990s and early 2000s, aided by influential textbooks, such as Mitchell (1997) [4] and Manning, Raghavan, and Schütze (2008) [5]. Powers (2011) [6] later expanded the interpretation of evaluation metrics, introducing extensions like Informedness and Markedness. A more recent analytical treatment appears in Zeng (2020) [7], who explored confusion matrix variants in binary classification and their connection to tools such as ROC curves and the Kolmogorov–Smirnov statistic.
Despite these advances, confusion persists in applying the confusion matrix to multiclass classification settings. Common but critical questions remain underexplored:
  • Should actual classes be represented by rows or columns?
  • Does orientation affect evaluation metrics like precision, recall, or accuracy?
  • What happens when the order of class labels is permuted?
These seemingly technical questions are critical for reproducibility and interpretation of evaluation metrics, particularly in tasks involving imbalanced classes or hierarchical relationships.
Sokolova and Lapalme (2009) [8] provided a systematic survey of evaluation metrics across binary, multiclass, multilabel, and hierarchical tasks. While their classification of metrics by properties, like sensitivity and averaging strategy, is valuable, their treatment leaves structural ambiguities unresolved. Notably, they do not specify whether actual classes are represented by rows or columns. Nor do they address whether evaluation metrics change when the matrix is transposed or class labels are reordered. Grandini, Bagli, and Visani (2020) [9] improved upon this by clearly stating that in their paper, rows represent actual classes and columns represent predictions. While this adds clarity, they do not explore the consequences of changing orientation or label order.
Recent contributions have highlighted the need for analytical clarification. Farhadpour et al. [10], Takahashi et al. [11], and Opitz and Burst [12] examine macro averaging, class imbalance, and structural bias. Zhou [13] and Tharwat [14] address weak supervision and hierarchical classification. Görtler et al. [15] introduce a probabilistic algebra for confusion matrices and visualization tools for hierarchical and multioutput labels. Song [16] emphasizes visual diagnostics and matrix-based interpretation in complex classification regimes. Together, these studies reinforce the importance of structural invariance but leave foundational ambiguities unresolved.
This paper aims to characterize structural invariance in multiclass evaluation metrics, with respect to confusion matrix transposition and label permutation, through algebraic analysis and empirical validation. Its novel contributions are as follows:
  • Matrix transposition invariance: Demonstrates that twelve widely used metrics, including precision, recall, macro/micro/weighted F1 scores, overall and average accuracy, and error rate, remain unchanged under confusion matrix transposition. This invariance has not been formally documented across all metrics in the prior literature.
  • Label permutation robustness: Proves algebraically that these metrics also remain invariant under arbitrary reordering of class labels, generalizing and formalizing observations that have previously lacked theoretical grounding.
  • Interpretation rule: Introduces a simple rule, “Precision is for Predicted, Recall is for Real”, which simplifies interpretation and reduces misinterpretation.
  • Linear relationship between Accuracy metrics: Identifies and proves a linear relationship between Average accuracy and Overall accuracy.
  • Metric redundancy and reconstruction: Identifies a minimal subset of four core metrics (Overall accuracy, Macro precision, Macro recall, and Weighted precision) from which all twelve standard metrics can be reconstructed. This reveals previously unrecognized dependencies and has implications for reproducibility and reporting efficiency.
The framework presented in this paper addresses a common but underappreciated issue in machine learning: how confusion matrix orientation affects the interpretation of model performance metrics. Many tools, papers, and visualizations present confusion matrices inconsistently—sometimes placing actual classes on rows, other times on columns, or omitting axis labels altogether. When users apply standard formulas without confirming axis orientation, this can result in mislabeling or incorrect calculation of key metrics, particularly Precision and Recall. Harrell [17] highlights this issue in healthcare, where such confusion has led to models being evaluated using the wrong metric for real-world decision-making. This type of structural ambiguity can distort model evaluation and undermine trust in results. By proving structural invariance, the proposed framework improves clarity, reliability, and consistency across datasets and modeling pipelines.
The analytical results are validated through numerical analysis on the Wine dataset using a Random Forest classifier and on CIFAR-10 using a convolutional neural network. These findings have practical implications: they simplify metric computation, eliminate structural ambiguity, and enable consistent and reproducible evaluation across diverse modeling settings.
The remainder of this paper is organized as follows. Section 2 revisits the binary confusion matrix and historical origins of evaluation metrics. Section 3 analyzes the multiclass confusion matrix under the standard row-based orientation. Section 4 examines the effect of matrix transposition. Section 5 studies the invariance under class label permutations. Section 6 presents numerical validation using benchmark datasets. Section 7 concludes with a summary of analytical insights and their implications for practice.
Throughout this paper, the symbols $P$, $R$, and $F_1$ are used to denote precision, recall, and F1-score, respectively. The analysis considers a standard single-label multiclass classification setting, in which each instance (or record) is assigned exactly one class. In this setting, the number of predictions made by the model is equal to the number of instances, denoted by $N$, which also corresponds to the number of rows in the classification target variable $y \in \{C_1, C_2, \ldots, C_r\}^N$, where $r \geq 2$. Each row in $y$ contains a single class from the finite set $\{C_1, C_2, \ldots, C_r\}$, assigned to one instance.

2. Confusion Matrix in Binary Classification

The confusion matrix is a foundational tool in evaluating the performance of classification models. It captures not only the correct predictions but also the nature of misclassifications, providing the basis for a wide range of performance metrics such as Precision, Recall, F1-score, and Accuracy. While the confusion matrix is now a standard component of machine learning pipelines, its conceptual roots trace back to early work in information retrieval, where related ideas were first formalized using contingency tables.

2.1. Historical Foundations and Metric Definitions

The analysis begins by revisiting the original definitions of precision and recall as introduced in early information retrieval and the classification literature [18,19,20]. These metrics were initially formulated using a contingency table, which serves as a conceptual precursor to the modern confusion matrix.
Information Retrieval (IR) is the study of systems designed to assist users in locating relevant information within large collections of unstructured data, such as documents, web pages, or databases. Unlike traditional databases that return exact matches, IR systems retrieve documents that are likely to be relevant based on a user’s query, often expressed in natural language. As defined by Manning et al. [5], a document is retrieved if it is returned by the system in response to a query, and it is considered relevant if it satisfies the user’s information need, typically assessed through human judgment or gold-standard benchmarks.
The following definitions are well-established in the literature [5,6] and are presented here to ensure clarity and consistency of terminology.
Definition 1.
Precision is the proportion of retrieved documents that are relevant to the query.
$$P = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}}.$$
Definition 2.
Recall is the proportion of relevant documents that are successfully retrieved.
$$R = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents}}.$$
Definition 3.
F1-Score is the harmonic mean of Precision and Recall:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
Unless otherwise specified, the F1-score will be omitted from subsequent analysis, as it is a derived metric, i.e., the harmonic mean of the corresponding Precision metric and Recall metric.
Manning et al. [5] reformulated these definitions using the following contingency table, introducing the notions of true positives, false positives, false negatives, and true negatives (Table 1):
$$P = \frac{tp}{tp + fp}.$$
$$R = \frac{tp}{tp + fn}.$$
Building on this framework, it is now time to reinterpret these concepts using the structure of the binary confusion matrix, which is widely adopted in machine learning to evaluate classification models. In this formulation, retrieval status is treated as the system’s prediction and relevance as the ground truth or actual class. Specifically,
  • A document that is retrieved corresponds to being predicted as positive.
  • A document that is not retrieved corresponds to being predicted as negative.
  • A document that is relevant represents the actual positive class.
  • A document that is nonrelevant represents the actual negative class.
This mapping results in the binary confusion matrix in Table 2:
Based on this formulation and Table 2, Definitions 1 and 2 can be restated in terms commonly used in machine learning, leading to the following definitions of Precision and Recall. The F1-score remains defined as the harmonic mean of Precision and Recall.
Definition 4.
Precision is defined as the proportion of true positives among all predicted positives, while Recall is the proportion of true positives among all actual positives.
To further clarify, this analysis formally defines the four components of the binary confusion matrix. Throughout, it adopts the standard uppercase abbreviations TP, FP, FN, and TN, and uses the singular forms “Positive” and “Negative”, consistent with modern conventions.
Definition 5.
In a binary classification setting, with one class designated as positive and the other as negative, each instance is evaluated by comparing its predicted class to its actual class, resulting in one of the following four outcomes:
  • True Positive (TP): Predicted Positive, and the actual is also Positive.
  • False Positive (FP): Predicted Positive, but the actual is Negative.
  • False Negative (FN): Predicted Negative, but actual is Positive.
  • True Negative (TN): Predicted Negative, and actual is also Negative.
Here, terms true and false indicate whether the prediction matches the actual class, while positive and negative refer to the predicted class. Thus, TP and TN represent correct predictions, whereas FP and FN represent incorrect predictions.
$$P = \frac{TP}{TP + FP}.$$
$$R = \frac{TP}{TP + FN}.$$
In addition to Precision, Recall, and F1-score, another key performance metric in classification tasks is Overall accuracy. It is defined as the proportion of correctly classified instances (i.e., the sum of the diagonal elements in the confusion matrix) out of the total number of instances:
$$\text{Overall accuracy} = \frac{TP + TN}{TP + FN + FP + TN}.$$

2.2. Matrix Layout Conventions

In Table 2, the actual classes are placed in columns and predicted classes in rows, a layout commonly used in information retrieval and the evaluation literature. However, in machine learning, the more typical convention is the reverse: actual classes in rows and predicted classes in columns, as used by popular libraries, such as scikit-learn and R caret, and in most machine learning textbooks. Under this convention, the confusion matrix appears as shown in Table 3, which is the transpose of the original confusion matrix in Table 2.
Since the definitions of precision, recall, and the components TP, FP, FN, and TN are independent of matrix orientation, Definitions 4 and 5 apply to this layout too. Thus, the corresponding formulas remain:
$$P = \frac{TP}{TP + FP}.$$
$$R = \frac{TP}{TP + FN}.$$
This confirms that, in binary classification, the computed values for Precision and Recall remain invariant under transposition of the confusion matrix—i.e., whether actual classes are placed in rows or columns. It is also straightforward to verify that the same invariance applies to Overall accuracy.
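This invariance is easy to check programmatically. The following minimal Python sketch (NumPy; the 2×2 counts are illustrative, not taken from any dataset in this paper) computes Precision, Recall, and Overall accuracy from both layouts and confirms that they match.

```python
import numpy as np

# Illustrative binary confusion matrix with actual classes in rows
# (rows: actual [positive, negative]; columns: predicted [positive, negative]).
cm_rows = np.array([[40, 10],   # actual positive: TP=40, FN=10
                    [ 5, 45]])  # actual negative: FP=5,  TN=45

def binary_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Layout 1: actual in rows -> TP=cm[0,0], FN=cm[0,1], FP=cm[1,0], TN=cm[1,1]
m1 = binary_metrics(cm_rows[0, 0], cm_rows[1, 0], cm_rows[0, 1], cm_rows[1, 1])

# Layout 2: actual in columns (the transpose) -> FP and FN change position,
# but the four extracted quantities are identical.
cm_cols = cm_rows.T
m2 = binary_metrics(cm_cols[0, 0], cm_cols[0, 1], cm_cols[1, 0], cm_cols[1, 1])

assert np.allclose(m1, m2)   # precision, recall, accuracy unchanged
print(m1)
```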

2.3. On the Choice of the Positive Class

In binary classification, either class can be designated as the positive class, depending on the application. The terms “positive” and “negative” are not inherent to the data, but rather are analytical conventions chosen by the practitioner.
For example, in credit scoring, it is common to define:
  • Class 1 (e.g., default) as the positive class, representing an adverse or minority outcome;
  • Class 0 (e.g., nondefault) as the negative class, representing the favorable or majority outcome.
This choice directly impacts how metrics, such as Precision, Recall, and F1-score, are computed and interpreted, as these metrics are defined with respect to the positive class. Therefore, it is essential to explicitly state which class is considered “positive” in any classification analysis to ensure clarity, reproducibility, and interpretability.
In binary classification, all instances are divided into two groups based on the actual class and the predicted class:
  • One class is designated as positive (typically the minority or class of interest);
  • The other is designated as negative.
TP, FP, FN, and TN are defined relative to the class designated as the “positive” class. Precision focuses on the predicted positives (the prefix “Pre” emphasizes its dependence on the classifier’s predictions), while Recall focuses on the Real (i.e., actual) positives, as captured by the mnemonic “Precision is for Predicted, Recall is for Real”.

3. Multiclass Confusion Matrix with Actual Classes in Rows

Let the classification problem involve $r \geq 2$ discrete (categorical) classes $C_1, C_2, \ldots, C_r$. The confusion matrix is defined as a two-dimensional array (matrix or table) with the following convention, consistent with common implementations such as scikit-learn:
  • Actual class placed in rows;
  • Predicted classes placed in columns.
Each entry $n_{ij}$, for $1 \leq i, j \leq r$, represents the number of instances where the actual or true class is $C_i$ and the predicted class is $C_j$. This format aligns with standard evaluation tools (e.g., confusion_matrix() in scikit-learn), which assume this structure for computing metrics. Table 4 below adheres to this layout. Here, $k$ denotes an individual class index:
Let $N_{i\cdot}$ and $N_{\cdot j}$ denote the sum of the entries in row $i$ and column $j$, respectively, for $1 \leq i, j \leq r$. Then,
$$N_{i\cdot} = \sum_{j=1}^{r} n_{ij}, \quad 1 \leq i \leq r.$$
$$N_{\cdot j} = \sum_{i=1}^{r} n_{ij}, \quad 1 \leq j \leq r.$$
Clearly,
$$\sum_{i=1}^{r} N_{i\cdot} = \sum_{j=1}^{r} N_{\cdot j} = \sum_{i=1}^{r}\sum_{j=1}^{r} n_{ij} = N.$$
That is, the sum of all row totals and the sum of all column totals are both equal to the total sum of all entries in the confusion matrix, which corresponds to $N$, the total number of instances, or equivalently, the total number of predictions in the single-label classification setting.
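These totals can be verified in a few lines of NumPy; the sketch below uses an illustrative 3×3 matrix (the same counts reappear in Section 6.1).

```python
import numpy as np

# Illustrative 3x3 confusion matrix (actual classes in rows).
cm = np.array([[8, 3, 8],
               [1, 17, 3],
               [5, 2, 7]])

row_totals = cm.sum(axis=1)   # N_i. : one total per actual class
col_totals = cm.sum(axis=0)   # N_.j : one total per predicted class
N = cm.sum()                  # total number of instances / predictions

assert row_totals.sum() == col_totals.sum() == N
print(row_totals, col_totals, N)   # [19 21 14] [14 22 18] 54
```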

3.1. Basic Metrics

Table A1 in Appendix A summarizes the notation, symbols, definitions, and formulas for multiclass metrics based on a confusion matrix where actual classes are represented in rows.
To help explain the derivations of the first four entries in Table A1, consider applying the one-vs.-all approach [9] to construct the binary confusion matrix for a particular class $C_k$. The multiclass confusion matrix in Table 4 is conceptually partitioned into four mutually exclusive and collectively exhaustive regions as follows:
  • (I) True Positive: Instances correctly predicted as class $C_k$. This corresponds to the diagonal element $n_{kk}$, as confirmed in the formula on row 2 of Table A1.
  • (II) False Positive: Instances predicted as $C_k$ but actually belonging to other classes. This is the sum of column $k$, excluding the diagonal element $n_{kk}$:
$$FP_k = \sum_{i=1}^{r} n_{ik} - n_{kk},$$
which aligns with the second item under $FP_k$ in Table A1.
  • (III) False Negative: Instances of actual class $C_k$ that are predicted as other classes. This is the sum of row $k$, excluding the diagonal element $n_{kk}$:
$$FN_k = \sum_{j=1}^{r} n_{kj} - n_{kk},$$
confirming the formula listed in the third item under $FN_k$ in Table A1.
  • (IV) True Negative: All remaining instances that neither belong to class $C_k$ nor are predicted as class $C_k$. These make up the rest of the matrix:
$$TN_k = \sum_{i \neq k}\sum_{j \neq k} n_{ij},$$
or equivalently,
$$TN_k = N - TP_k - FP_k - FN_k,$$
in accordance with the fourth item under $TN_k$ in Table A1.
This partitioning is visually represented in Table 5. Structurally, Table 5 mirrors the canonical layout of the binary confusion matrix presented in Table 3, thereby unifying the derivation logic with the one-vs.-all evaluation framework.
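A compact NumPy sketch of this one-vs.-all partition, again on the illustrative 3×3 matrix, extracts all four quantities for every class at once (the variable names are illustrative, not from the paper's code):

```python
import numpy as np

cm = np.array([[8, 3, 8],
               [1, 17, 3],
               [5, 2, 7]])          # actual classes in rows
N = cm.sum()

TP = np.diag(cm)                    # region (I): diagonal elements n_kk
FP = cm.sum(axis=0) - TP            # region (II): column sums minus diagonal
FN = cm.sum(axis=1) - TP            # region (III): row sums minus diagonal
TN = N - TP - FP - FN               # region (IV): everything else

for k in range(cm.shape[0]):
    assert TP[k] + FP[k] + FN[k] + TN[k] == N   # the four regions cover the matrix
print(TP, FP, FN, TN)
```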
Remark 1.
It is important to note that this approach assumes class independence and does not explicitly account for inter-class correlations. The resulting metrics for each class (e.g., Precision, Recall, F1) represent marginal, class-conditional performance. That is, they reflect how well the model distinguishes class $C_k$ from the rest, but do not capture joint misclassification patterns between other classes (e.g., systematic confusion between classes $C_i$ and $C_j$ where $i, j \neq k$). As such, while one-vs.-all metrics are useful for diagnostic evaluation and model comparison, they should be interpreted with care in domains where class relationships are known to be structured or correlated.
In addition, item 19 in Table A1 defines Overall accuracy as the proportion of correctly classified instances across the entire dataset. Although easy to compute, this metric should be interpreted with caution in imbalanced settings. Because it weighs all instances equally, Overall accuracy can be dominated by performance on the majority class. For example, if one class accounts for 90% of the data and is predicted correctly, Overall accuracy may appear high—even if predictions for the smaller classes are poor. This can give a misleading impression of model effectiveness. To provide a more balanced view across all classes, especially those underrepresented, alternative metrics, such as macro-averaged F1, are recommended.

3.2. Analytical Properties of Metrics

A natural starting point is the relationship between confusion matrix structure and the definitions of precision and recall. The one-vs.-all approach described in [9] treats each class $C_k$ as the “positive” label in turn, aggregating the remaining classes into a composite “negative” group. This decomposition yields intuitive matrix-based expressions: precision reflects column-wise proportions, whereas recall reflects row-wise proportions.
Using Formulas (9) and (10) for precision and recall (listed as items 6 and 7 in Table A1), the following relationships hold:
$$P_k = \frac{TP_k}{TP_k + FP_k}.$$
$$R_k = \frac{TP_k}{TP_k + FN_k}.$$
Substituting the values of TP, FP, FN, and TN from Table 5 yields the following:
$$P_k = \frac{n_{kk}}{n_{kk} + \left(\sum_{i=1}^{r} n_{ik} - n_{kk}\right)} = \frac{n_{kk}}{N_{\cdot k}},$$
$$R_k = \frac{n_{kk}}{n_{kk} + \left(\sum_{j=1}^{r} n_{kj} - n_{kk}\right)} = \frac{n_{kk}}{N_{k\cdot}},$$
where $N_{\cdot k}$ and $N_{k\cdot}$, defined in Equations (11) and (12), denote the sums of the $k$-th column and $k$-th row, respectively. The diagonal element $n_{kk}$ lies at the intersection of the $k$-th row and $k$-th column, motivating the following property:
Property 1.
In a multiclass confusion matrix where actual classes are represented by rows, the precision for class $C_k$ is the ratio of the diagonal element in the $k$-th column to the sum of all elements in that column. Likewise, the recall for class $C_k$ is the ratio of the diagonal element in the $k$-th row to the sum of all elements in that row.
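Property 1 can be checked directly: with actual classes in rows, per-class precision is the diagonal divided by the column sums, and per-class recall is the diagonal divided by the row sums. A minimal check on the same illustrative matrix:

```python
import numpy as np

cm = np.array([[8, 3, 8],
               [1, 17, 3],
               [5, 2, 7]])

precision_per_class = np.diag(cm) / cm.sum(axis=0)   # n_kk / N_.k
recall_per_class = np.diag(cm) / cm.sum(axis=1)      # n_kk / N_k.

# Equivalent one-vs.-all forms TP/(TP+FP) and TP/(TP+FN)
TP = np.diag(cm)
FP = cm.sum(axis=0) - TP
FN = cm.sum(axis=1) - TP
assert np.allclose(precision_per_class, TP / (TP + FP))
assert np.allclose(recall_per_class, TP / (TP + FN))
```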
Average accuracy and error rate are commonly used to summarize overall classifier performance. The following property follows directly from their formulas in Table A1.
Property 2.
The Average accuracy and Error rate are complementary measures whose values always sum to 1.
The next property addresses the relationship between Average accuracy and Overall accuracy, highlighting how class-level correctness relates to global predictive performance. A formal proof of their relationship is provided in Appendix B.1.
Property 3.
The Average accuracy and Overall accuracy satisfy the following relationship:
$$\text{Average accuracy} = 1 - \frac{2}{r} + \frac{2}{r}\,\text{Overall accuracy}.$$
Furthermore, for any number of classes $r \geq 2$, the Average accuracy is always greater than or equal to the Overall accuracy. In particular, when $r = 2$, the Average accuracy is equal to the Overall accuracy.
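A quick numerical check of Property 3 (illustrative matrix; the identity should hold exactly, up to floating-point error):

```python
import numpy as np

cm = np.array([[8, 3, 8],
               [1, 17, 3],
               [5, 2, 7]])
r, N = cm.shape[0], cm.sum()

TP = np.diag(cm)
FP = cm.sum(axis=0) - TP
FN = cm.sum(axis=1) - TP
TN = N - TP - FP - FN

overall_acc = TP.sum() / N
avg_acc_direct = np.mean((TP + TN) / N)                 # item 20 in Table A1
avg_acc_linear = 1 - 2 / r + (2 / r) * overall_acc      # Property 3

assert np.isclose(avg_acc_direct, avg_acc_linear)
assert avg_acc_direct >= overall_acc                    # second part of Property 3
```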
Some performance metrics in multiclass classification turn out to be the same, even though they may have different names. This happens when using Micro-averaging or Weighting by class prevalence, and it can help clarify which metrics are truly distinct.
Property 4.
In multiclass classification, the Micro-averaged precision, Micro-averaged recall, Micro-averaged F1-Score and Weighted recall are all equal, and they are also equal to the Overall accuracy.
Grandini et al. [9] proved that Micro-averaged precision, recall, and F1-score are all equal to Overall accuracy. Farhadpour et al. [10] extended this equivalence to Weighted recall using conceptual analysis and empirical evidence. For completeness, a formal proof is provided in Appendix B.2.
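Property 4 can also be confirmed with scikit-learn on any labeled sample; the sketch below uses randomly generated labels purely for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Illustrative labels for a 3-class problem.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)
y_pred = rng.integers(0, 3, size=200)

p_micro, r_micro, f1_micro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro")
_, r_weighted, _, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
overall_acc = accuracy_score(y_true, y_pred)

# Micro P = Micro R = Micro F1 = Weighted recall = Overall accuracy
assert np.allclose([p_micro, r_micro, f1_micro, r_weighted], overall_acc)
```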

4. Multiclass Confusion Matrix with Actual Classes in Columns

In Section 3, the confusion matrix adopts a row-based layout in which actual classes are indexed by rows and predicted classes by columns. Consider now an alternative orientation where actual classes are indexed by columns and predicted classes by rows, corresponding to a transposed configuration. For the same classification problem, the entry in row $i$ and column $j$ of this new matrix counts the instances whose predicted class is $C_i$ and whose actual class is $C_j$; it therefore equals $n_{ji}$, the $(j,i)$-th entry of the original confusion matrix in Table 4. This relationship confirms that the new confusion matrix is the transpose of the original matrix, as illustrated in Table 6.
In this section, the same symbols and definitions from notation Table A1 are reused, now assuming that actual classes are indexed by columns rather than rows. To distinguish these symbols from their original orientation, a superscript t is added to indicate transposition. The corresponding formulas will be shown to match those in Table A1, confirming their structural equivalence under this layout.

4.1. Invariant Metrics Under Transposition

Precision and recall for each class $C_k$, where $1 \leq k \leq r$, are defined using the same one-vs.-all approach as when actual classes are placed in rows.
$TP^t_k$, $FP^t_k$, $FN^t_k$, and $TN^t_k$ are derived directly from the structure of the confusion matrix shown in Table 6 as follows:
$$TP^t_k = n_{kk}.$$
$$FP^t_k = \sum_{i=1}^{r} n_{ik} - n_{kk}.$$
$$FN^t_k = \sum_{j=1}^{r} n_{kj} - n_{kk}.$$
$$TN^t_k = \sum_{i=1}^{r}\sum_{j=1}^{r} n_{ij} - TP^t_k - FP^t_k - FN^t_k = N - TP^t_k - FP^t_k - FN^t_k.$$
By comparing Equations (17)–(20) with the first four rows (excluding the header) in Table A1, it can be seen that $TP^t_k = TP_k$, $FP^t_k = FP_k$, $FN^t_k = FN_k$, and $TN^t_k = TN_k$, which confirms that these four quantities remain unchanged under the transposed matrix orientation.
The support $Support^t_k$ for class $C_k$, defined as the total number of true instances of that class, corresponds to the sum of entries in column $k$ in Table 6. That is, $\sum_{j=1}^{r} n_{kj} = N_{k\cdot}$, which matches the original definition, so $Support^t_k = Support_k$. Thus, the first five metrics in Table A1 are unaffected by matrix transposition.
The next theorem shows that the remaining metrics in Table A1 also remain unchanged under the transposed matrix orientation. Its formal proof is provided in Appendix C.
Theorem 1.
(Metric Invariance under Transposition).
All classification metrics introduced in Section 3 remain unchanged when the confusion matrix is transposed (i.e., when actual classes are placed in columns instead of rows). These invariant metrics include the following:
(1) 
Per-class precision, recall, and F1-score;
(2) 
Macro-averaged precision, recall, and F1-score;
(3) 
Micro-averaged precision, recall, and F1-score;
(4) 
Weighted precision, recall, and F1-score;
(5) 
Overall accuracy, Average accuracy, and Error rate.

4.2. Preserved Analytical Properties Under Transposition

Analogous to Property 1, the following property follows from Equations (17)–(20).
Property 5.
In a multiclass confusion matrix where actual classes are represented by columns, the precision for class $C_k$ is the ratio of the diagonal element in the $k$-th row to the sum of all elements in that row. Likewise, the recall for class $C_k$ is the ratio of the diagonal element in the $k$-th column to the sum of all elements in that column.
Since all metrics remain invariant under transposition as established in Theorem 1, the results presented in Properties 2 through 4 continue to hold. These findings affirm that both the definitions of evaluation metrics and the relationships among them are structurally consistent and unaffected by the orientation of the confusion matrix.
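Theorem 1 can also be exercised numerically by computing the metrics once from a row-oriented matrix and once from its transpose, interpreting rows and columns accordingly. A sketch with an illustrative helper (the name metrics_from_cm is not from the paper's code):

```python
import numpy as np

def metrics_from_cm(cm, actual_in_rows=True):
    """Aggregate metrics from a confusion matrix under a stated orientation."""
    if not actual_in_rows:          # actual classes in columns: transpose back
        cm = cm.T
    N = cm.sum()
    TP = np.diag(cm)
    precision = TP / cm.sum(axis=0)
    recall = TP / cm.sum(axis=1)
    support = cm.sum(axis=1)
    return {
        "macro_precision": precision.mean(),
        "macro_recall": recall.mean(),
        "weighted_precision": (support * precision).sum() / N,
        "overall_accuracy": TP.sum() / N,
    }

cm = np.array([[8, 3, 8],
               [1, 17, 3],
               [5, 2, 7]])          # actual classes in rows

m_rows = metrics_from_cm(cm, actual_in_rows=True)
m_cols = metrics_from_cm(cm.T, actual_in_rows=False)   # transposed layout
assert all(np.isclose(m_rows[k], m_cols[k]) for k in m_rows)
```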

5. Permutation: Order of Classes

Let $\pi = (\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_r)$ be one of the $(r! - 1)$ nonidentity permutations of the set $\{1, \ldots, k, \ldots, r\}$. The classes $C_1, C_2, \ldots, C_k, \ldots, C_r$ are reordered as $C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_k}, \ldots, C_{\pi_r}$, so that $C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_k}, \ldots, C_{\pi_r}$ become the first, second, …, $k$-th, …, $r$-th classes, respectively. The corresponding confusion matrix, updated to reflect this new class order, is transformed from Table 4 (based on the original order $C_1, C_2, \ldots, C_k, \ldots, C_r$) to Table 7 shown below. As before, the classes appear in the same order across both rows and columns of the confusion matrix. Here, the convention of placing actual classes along the rows is adopted. This choice is without loss of generality, as Section 4 has already demonstrated that all relevant metrics remain unchanged when actual and predicted classes are interchanged.
To illustrate Table 7, a confusion matrix with three classes is shown in Table 8a below, which is extracted from Section 6.1. Under the original class order (1, 2, 3), each entry $n_{ij}$ indicates the number of instances whose true class is $i$ and predicted class is $j$. In accordance with the notation of Table 4, the count values are as follows:
  • $n_{11} = 8$, $n_{12} = 3$, $n_{13} = 8$;
  • $n_{21} = 1$, $n_{22} = 17$, $n_{23} = 3$;
  • $n_{31} = 5$, $n_{32} = 2$, $n_{33} = 7$.
A label permutation (1, 2, 3) → (3, 1, 2) is then applied to both the actual and predicted class indices. The confusion matrix is reorganized accordingly: first, actual classes are placed in row order as 3, 1, 2, and predicted classes in column order as 3, 1, 2. Under this relabeling, each entry $m_{ij}$ in Table 8b corresponds to the count of instances from actual class $i$ (after permutation) predicted as class $j$ (also permuted). For instance, the entry at position (1, 2) corresponds to actual class 3 and predicted class 1, so it contains $n_{31} = 5$.
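Programmatically, Table 8b can be obtained by reindexing the rows and columns of the original matrix with the same permutation, as in the following NumPy sketch:

```python
import numpy as np

cm = np.array([[8, 3, 8],     # original order (1, 2, 3); actual classes in rows
               [1, 17, 3],
               [5, 2, 7]])

perm = [2, 0, 1]              # new order (3, 1, 2) in zero-based indices
cm_perm = cm[np.ix_(perm, perm)]   # permute rows and columns together

print(cm_perm)
# [[ 7  5  2]
#  [ 8  8  3]
#  [ 3  1 17]]
# e.g., the entry at (1, 2) in 1-based indexing is 5 = n_31
# (actual class 3, predicted class 1), matching Table 8b.
```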

5.1. Invariant Metrics Under Permutation

Using Table A1, for classes ordered as $C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_k}, \ldots, C_{\pi_r}$, the following identities corresponding to the first four entries are obtained immediately for any $k$ such that $1 \leq k \leq r$:
$$TP^\pi_k = n_{\pi_k \pi_k}.$$
$$FP^\pi_k = \sum_{i=1}^{r} n_{\pi_i \pi_k} - n_{\pi_k \pi_k}.$$
$$FN^\pi_k = \sum_{j=1}^{r} n_{\pi_k \pi_j} - n_{\pi_k \pi_k}.$$
$$TN^\pi_k = N - TP^\pi_k - FP^\pi_k - FN^\pi_k.$$
In the permuted confusion matrix (Table 7), the $k$-th class $C_{\pi_k}$ corresponds to the $\pi_k$-th class in the original confusion matrix (Table 4). Accordingly, the following invariance identities hold for all $k = 1, 2, \ldots, r$: $TP^\pi_k = TP_{\pi_k}$, $FP^\pi_k = FP_{\pi_k}$, $FN^\pi_k = FN_{\pi_k}$, $TN^\pi_k = TN_{\pi_k}$.
As for the support $Support^\pi_k$ of the $k$-th class $C_{\pi_k}$ in the permuted confusion matrix, it is equal to $Support_{\pi_k}$, the support of class $C_{\pi_k}$ in the original confusion matrix.
The next theorem shows that the remaining metrics in Table A1 also remain unchanged under permutation. Its formal proof can be found in Appendix D.
Theorem 2.
(Metric Invariance under permutation).
Let Table 7 be a multiclass confusion matrix with class labels reordered as $C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_k}, \ldots, C_{\pi_r}$, where $\pi$ is a permutation of the original class order $C_1, C_2, \ldots, C_k, \ldots, C_r$ in Table 4. Then, all the following classification metrics computed from Table 7 are identical to those computed from Table 4 with the original class order $C_1, C_2, \ldots, C_k, \ldots, C_r$.
(1) 
Per-class precision, recall, and F1-score;
(2) 
Macro-averaged precision, recall, and F1-score;
(3) 
Micro-averaged precision, recall, and F1-score;
(4) 
Weighted precision, recall, and F1-score;
(5) 
Overall accuracy, Average accuracy, and Error rate.

5.2. Preserved Analytical Properties Under Permutation

Analogous to Property 1, the following property follows from Equations (21)–(24).
Property 6.
Let Table 7 be a multiclass confusion matrix with class labels reordered as $C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_k}, \ldots, C_{\pi_r}$, where $\pi$ is a permutation of the original class order $C_1, C_2, \ldots, C_k, \ldots, C_r$ in Table 4. Then the precision for class $C_{\pi_k}$ is the ratio of the diagonal element in the $k$-th column to the sum of all elements in that column. Likewise, the recall for class $C_{\pi_k}$ is the ratio of the diagonal element in the $k$-th row to the sum of all elements in that row.
Average accuracy and error rate are commonly used to summarize overall classifier performance. The following property follows directly from their formulas in Table A1.
Property 7.
Let Table 7 be a multiclass confusion matrix with class labels reordered as $C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_k}, \ldots, C_{\pi_r}$, where $\pi$ is a permutation of the original class order $C_1, C_2, \ldots, C_k, \ldots, C_r$ in Table 4. Then Average accuracy and Error rate are complementary metrics whose values always sum to 1.
From Part (5) of Theorem 2 and Property 3, the following property holds:
Property 8.
Let Table 7 be a multiclass confusion matrix with class labels reordered as $C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_k}, \ldots, C_{\pi_r}$, where $\pi$ is a permutation of the original class order $C_1, C_2, \ldots, C_k, \ldots, C_r$ in Table 4. Then, the Average accuracy and Overall accuracy satisfy the following relationship:
$$\text{Average accuracy}^\pi = 1 - \frac{2}{r} + \frac{2}{r}\,\text{Overall accuracy}^\pi.$$
Moreover, for any number of classes $r \geq 2$, the Average accuracy is always greater than or equal to the Overall accuracy. In particular, when $r = 2$, the Average accuracy is equal to the Overall accuracy.
Applying Theorem 2 and Property 4, the following holds:
Property 9.
Let Table 7 be a multiclass confusion matrix with class labels reordered as $C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_k}, \ldots, C_{\pi_r}$, where $\pi$ is a permutation of the original class order $C_1, C_2, \ldots, C_k, \ldots, C_r$ in Table 4. Then the Micro-averaged precision, Micro-averaged recall, Micro-averaged F1-score, and Weighted recall are all equal, and they are also equal to the Overall accuracy.
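The permutation invariance asserted by Theorem 2 and Properties 6–9 can be checked exhaustively for a small matrix; the sketch below iterates over all 3! orderings (the helper name core_metrics is illustrative):

```python
import numpy as np
from itertools import permutations

cm = np.array([[8, 3, 8],
               [1, 17, 3],
               [5, 2, 7]])          # actual classes in rows

def core_metrics(cm):
    """Overall accuracy, macro precision, macro recall, weighted precision."""
    N, TP = cm.sum(), np.diag(cm)
    precision = TP / cm.sum(axis=0)
    recall = TP / cm.sum(axis=1)
    support = cm.sum(axis=1)
    return np.array([TP.sum() / N, precision.mean(), recall.mean(),
                     (support * precision).sum() / N])

base = core_metrics(cm)
for perm in permutations(range(3)):           # all 3! orderings, including identity
    cm_perm = cm[np.ix_(perm, perm)]          # relabel rows and columns together
    assert np.allclose(core_metrics(cm_perm), base)
```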

6. Numerical Validation

To assess the analytical properties of multiclass evaluation metrics—specifically their invariance under confusion matrix transposition and class label permutation—a numerical experiment using a real-world dataset is performed. For implementation clarity and practical adoption, refer to pseudocode in Appendix E.

6.1. Basic Metric Validation on the UCI Wine Dataset

The Wine dataset from the UCI Machine Learning Repository is used, which is accessible via sklearn.datasets.load_wine(). This dataset contains 178 samples of wine, each described by 13 physicochemical features, and categorized into three classes: Class 1, Class 2, and Class 3. In the dataset, these are encoded as the numeric labels 0, 1, and 2, respectively, which serve as the target variable for multiclass classification. This structure makes the dataset particularly suitable for validating the analytical properties of multiclass evaluation metrics, including their invariance under confusion matrix transposition and class label permutation.
For this study, only one feature is selected—‘Alcohol’—to construct a minimal yet valid multiclass classification problem. This ensures that our focus remains on the validation of evaluation metric behavior rather than on achieving high classification accuracy.
A Random Forest classifier using RandomForestClassifier(random_state = 42) is trained to ensure reproducibility. The dataset is partitioned into training and testing subsets using a 70–30% split via train_test_split, also with random_state = 42, guaranteeing consistent and replicable results. All performance metrics, including the confusion matrix, are computed on the held-out test set, which constitutes 30% of the data, to enable an unbiased evaluation of the model’s predictive capability.
For the original confusion matrix, shown in Table 9, classification performance is assessed using a comprehensive set of evaluation metrics. These include the precision, recall, and F1-score, calculated via the precision_recall_fscore_support function from sklearn.metrics, under four distinct averaging strategies:
  • Per-class metrics (average = None): Computes per-class precision, recall, and F1. The results are in Table 10. To conserve space, all numerical values in the tables are rounded and displayed to five decimal places.
  • Macro-average (average = ‘macro’): Computes the unweighted mean of the per-class precision, recall, and F1 scores. This corresponds to the Macro F1 as defined in item 12, denoted $F1_{MA}$. In addition, the harmonic mean of the Macro precision and Macro recall is computed, i.e., $F1_{MH}$ from item 11 in Table A1, and reported as Macro F1 (MH) in the final row of Table 11.
  • Micro-average (average = ‘micro’): Computes precision, recall, and F1 by globally aggregating the counts of true positives, false positives, and false negatives across all classes.
  • Weighted-average (average = ‘weighted’): Computes the average of per-class metrics weighted by the class support (i.e., number of true instances per class).
In addition to the above, the following metrics are computed:
  • Overall accuracy: The proportion of correct predictions over the total number of test instances.
  • Average accuracy: Defined as the mean of the per-class accuracies (each class’s accuracy includes both true positives and true negatives). Since no built-in function is available for this measure, a custom function is implemented based on the formulation by Sokolova & Lapalme [8].
  • Error rate: Computed as 1 − Overall accuracy.
To verify the theoretical invariance of metrics under permutation, a fixed permutation of class labels (0, 1, 2) → (2, 0, 1) is applied and all calculations on the permuted labels are repeated. The per-class metrics are shown in Table 10, and the resulting aggregate metrics are shown in the Permuted column of Table 11.
Regarding transposition, where the confusion matrix is transposed (i.e., actual classes are represented as columns and predicted classes as rows), all metrics are computed via a dedicated Python function (Python 3.13.6) that correctly interprets precision, recall, and class support under the new orientation.
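For reference, the experimental setup described above can be reproduced with a short script along the following lines. This is a sketch under the stated settings (single ‘Alcohol’ feature, 70/30 split, random_state = 42), not the exact script used for the reported tables, and results may vary slightly across library versions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, precision_recall_fscore_support,
                             accuracy_score)

data = load_wine()
X = data.data[:, [0]]                 # 'Alcohol' is the first feature
y = data.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

cm = confusion_matrix(y_te, y_pred)   # actual classes in rows (scikit-learn convention)
print(cm)

# Aggregate metrics under each averaging strategy
for avg in (None, "macro", "micro", "weighted"):
    print(avg, precision_recall_fscore_support(y_te, y_pred, average=avg))
print("overall accuracy:", accuracy_score(y_te, y_pred))

# One reading of the label permutation (0, 1, 2) -> (2, 0, 1); any consistent
# relabeling of both vectors leaves the metrics unchanged.
mapping = {0: 2, 1: 0, 2: 1}
y_te_p = np.vectorize(mapping.get)(y_te)
y_pred_p = np.vectorize(mapping.get)(y_pred)
print("permuted accuracy:", accuracy_score(y_te_p, y_pred_p))
```

The transposed layout of Section 4 corresponds simply to cm.T, with per-class precision and recall read from rows and columns, respectively.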

6.1.1. Invariant Metrics

As shown in Table 10 and Table 11, the per-class and aggregate metrics for the Original, Permuted, and Transposed confusion matrices are numerically identical across all metrics, thereby empirically validating the analytical invariance properties established in previous sections.

6.1.2. Analytical Properties

For all Original, Permuted, and Transposed cases, the per-class precision and recall were independently computed directly from the confusion matrix by dividing the main diagonal elements by the corresponding column and row sums, respectively. These manual calculations yielded results identical to those produced by the precision_recall_fscore_support function and are consistent with the values reported in Table 10, further validating their correctness.
In addition, Table 11 illustrates that Overall accuracy and Error rate are complementary; specifically, their sum equals 1 in all cases. The following analytical identity was also validated:
$$\text{Average accuracy} = 1 - \frac{2}{r} + \frac{2}{r}\,\text{Overall accuracy},$$
as established in earlier theoretical analysis. This identity held exactly in all three settings—Original, Permuted, and Transposed—thus empirically supporting the derived formula and confirming the robustness of these metrics under structural transformations of the confusion matrix.
The numerical validation presented in Section 6.1.1 and Section 6.1.2 confirms that the proposed multiclass evaluation metrics maintain both analytical and structural invariance under matrix transposition and label permutation. This indicates that the metrics remain stable and reliable, even when the data is reorganized. These findings support the theoretical framework outlined in Section 4.1 and Section 5.1 and demonstrate that the metrics are well-suited for classification settings where label order may vary or structural ambiguity may occur.

6.2. Robustness Validation Using CIFAR-10

To extend the scope of numerical validation, robustness experiments were conducted using the widely studied CIFAR-10 benchmark dataset. CIFAR-10 comprises 60,000 color images of size 32 × 32, evenly distributed across 10 mutually exclusive classes. The dataset is split into 50,000 training images and 10,000 test images, with each class containing exactly 1000 test instances—yielding a perfectly balanced multiclass structure ideal for evaluating the stability of performance metrics under structural transformations.
The model was trained using a Convolutional Neural Network (CNN) architecture comprising three convolutional layers with 3 × 3 filters and ReLU activation functions, interleaved with 2 × 2 max-pooling layers to reduce spatial dimensionality and enhance feature abstraction. This feature extraction pipeline is followed by a fully connected layer with 64 hidden units and a final dense layer with softmax activation over 10 output classes. The model was trained for 10 epochs using the Adam optimizer and sparse categorical cross-entropy loss.
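A Keras sketch consistent with this description is shown below; the filter counts (32, 64, 64) and the 255-scaling are assumptions, since the text specifies only the 3 × 3 kernels, ReLU activations, 2 × 2 pooling, the 64-unit dense layer, the 10-way softmax, and the training configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Filter counts and input normalization are illustrative choices.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```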
To assess the robustness of multiclass metrics, four key evaluation measures were computed directly from the confusion matrix:
  • Overall accuracy;
  • Macro-averaged precision;
  • Macro-averaged recall;
  • Weighted precision.
We then applied two representative permutations of class labels:
  • Permutation A: [2, 0, 1, 3, 4, 5, 6, 7, 8, 9].
  • Permutation B: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0].
For each configuration—original and permuted—the four metrics listed above were computed. All metric values remained numerically unchanged across permutations:
  • Overall accuracy: 0.7080.
  • Macro precision: 0.7186.
  • Macro recall: 0.7080.
  • Weighted precision: 0.7186.
These results confirm the empirical validity of our analytical claims: confusion-matrix-derived metrics are invariant under class label permutation, even when applied to real-world classification models. Moreover, by focusing on a minimal and theoretically sufficient set of metrics, the framework reinforces clarity, robustness, and reproducibility.

7. Conclusions

This paper establishes that multiclass evaluation metrics derived from the confusion matrix are structurally invariant under both matrix transposition and class label permutation. These findings are supported by algebraic proofs and empirical validation.
A key insight from this work is that not all aggregate metrics need to be computed separately. Specifically, all twelve aggregate metrics can be determined from just four core metrics: Overall accuracy, Macro-averaged precision, Macro-averaged recall, and Weighted precision. The remaining eight can be calculated directly from these four (see Appendix F).
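As a sketch of that reconstruction (Appendix F is not reproduced here; the relationships below follow from Properties 2–4 and the F1 definitions in Table A1, with Macro F1 referring to the $F1_{MH}$ variant of item 11):

```python
def reconstruct_metrics(overall_acc, macro_p, macro_r, weighted_p, r):
    """Derive the eight remaining aggregate metrics from the four core ones.

    r is the number of classes. The relationships follow Properties 2-4 and
    the F1 formulas of Table A1 (items 11 and 18); Appendix F gives the
    formal derivations.
    """
    def hm(a, b):                       # harmonic mean
        return 2 * a * b / (a + b)

    avg_acc = 1 - 2 / r + (2 / r) * overall_acc          # Property 3
    return {
        "micro_precision": overall_acc,                   # Property 4
        "micro_recall": overall_acc,                      # Property 4
        "micro_f1": overall_acc,                          # Property 4
        "weighted_recall": overall_acc,                   # Property 4
        "macro_f1": hm(macro_p, macro_r),                 # item 11 (F1_MH)
        "weighted_f1": hm(weighted_p, overall_acc),       # item 18 with R_W = Overall accuracy
        "average_accuracy": avg_acc,
        "error_rate": 1 - avg_acc,                        # Property 2
    }
```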
This unified framework helps streamline evaluation, reduce redundancy, and enhance reproducibility in multiclass classification tasks by establishing a small set of core metrics from which others can be systematically derived.
A limitation of the current framework is its focus on metric invariance under matrix transposition and class label permutation within single-label multiclass classification. However, real-world classification tasks often involve additional challenges.
In class-imbalanced settings, metrics, such as macro precision and recall, may disproportionately reflect performance on majority classes, while minority classes may be underrepresented. In multilabel classification, label co-occurrence and partial relevance complicate both the structure of the confusion matrix and the interpretation of aggregate metrics. Label noise—particularly when correlated across classes—can distort precision and recall estimates and may require correction strategies to maintain reliability. Hierarchical classification further increases complexity, as evaluation must account for semantic relationships and tree-based label dependencies.
Extending the current invariance results to these domains would require redefinition of the confusion matrix structure and averaging schemes. The present framework offers a formal foundation that may serve as a basis for generalizing to more complex classification frameworks.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study is the Wine dataset from the UCI Machine Learning Repository, accessible via sklearn.datasets.load_wine().

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Notation Table

The notation in Table A1 applies specifically to Section 3, where actual classes are represented by rows and predicted classes by columns.
Table A1. Notation for multiclass classification (actual classes in rows).
No | Symbol | Definition | Formula
1 | $TP_k$ | True positives for class $k$ | $n_{kk}$
2 | $FP_k$ | False positives for class $k$ | $\sum_{i=1}^{r} n_{ik} - n_{kk}$
3 | $FN_k$ | False negatives for class $k$ | $\sum_{j=1}^{r} n_{kj} - n_{kk}$
4 | $TN_k$ | True negatives for class $k$ | $N - TP_k - FP_k - FN_k$
5 | $Support_k$ | Total actual instances of class $k$ | $N_{k\cdot}$
6 | $P_k$ | Precision for class $k$ | $\frac{TP_k}{TP_k + FP_k}$
7 | $R_k$ | Recall for class $k$ | $\frac{TP_k}{TP_k + FN_k}$
8 | $F1_k$ | Harmonic mean of precision and recall for class $k$ | $\frac{2 P_k R_k}{P_k + R_k}$
9 | $P_M$ | Macro-averaged precision (unweighted average of per-class precision) [8,9] | $\frac{1}{r}\sum_{k=1}^{r} P_k$
10 | $R_M$ | Macro-averaged recall (unweighted average of per-class recall) [8,9] | $\frac{1}{r}\sum_{k=1}^{r} R_k$
11 | $F1_{MH}$ | Version 1 of Macro-averaged F1: F1 of averages (harmonic mean of the arithmetic means) [8,9] | $\frac{2 P_M R_M}{P_M + R_M}$
12 | $F1_{MA}$ | Version 2 of Macro-averaged F1: averaged F1 (arithmetic mean of per-class harmonic means) [11,12,21] | $\frac{1}{r}\sum_{k=1}^{r} \frac{2 P_k R_k}{P_k + R_k}$
13 | $P_\mu$ | Micro-averaged precision [8,9] | $\frac{\sum_{k=1}^{r} TP_k}{\sum_{k=1}^{r} \left(TP_k + FP_k\right)}$
14 | $R_\mu$ | Micro-averaged recall [8,9] | $\frac{\sum_{k=1}^{r} TP_k}{\sum_{k=1}^{r} \left(TP_k + FN_k\right)}$
15 | $F1_\mu$ | Micro-averaged F1 [8,9] | $\frac{2 P_\mu R_\mu}{P_\mu + R_\mu}$
16 | $P_W$ | Weighted precision [10] | $\frac{\sum_{i=1}^{r} N_{i\cdot} P_i}{\sum_{i=1}^{r} N_{i\cdot}} = \frac{\sum_{i=1}^{r} N_{i\cdot} P_i}{N}$
17 | $R_W$ | Weighted recall [10] | $\frac{\sum_{i=1}^{r} N_{i\cdot} R_i}{\sum_{i=1}^{r} N_{i\cdot}} = \frac{\sum_{i=1}^{r} N_{i\cdot} R_i}{N}$
18 | $F1_W$ | Weighted F1 [10] | $\frac{2 P_W R_W}{P_W + R_W}$
19 | Overall accuracy | Overall accuracy | $\frac{\sum_{k=1}^{r} n_{kk}}{\sum_{i=1}^{r}\sum_{j=1}^{r} n_{ij}} = \frac{\sum_{k=1}^{r} n_{kk}}{N}$
20 | Average accuracy | Average accuracy [8] | $\frac{1}{r}\sum_{k=1}^{r} \frac{TP_k + TN_k}{TP_k + FP_k + TN_k + FN_k}$
21 | Error rate | Error rate [8] | $\frac{1}{r}\sum_{k=1}^{r} \frac{FP_k + FN_k}{TP_k + FP_k + TN_k + FN_k}$
To facilitate interpretation of Items 11 and 12 in Table A1, Table A2 presents a comparative summary of the two versions of Macro F1 score.
Table A2. Comparison of 2 versions of Macro-averaged F1.
Metric | Aggregation Method | Formula | Interpretation
$F1_{MH}$ | Harmonic mean of Macro precision and recall | $\frac{2 P_M R_M}{P_M + R_M}$ | Computes a global F1 based on Macro-aggregated precision and recall
$F1_{MA}$ | Arithmetic mean of per-class F1 scores | $\frac{1}{r}\sum_{k=1}^{r} \frac{2 P_k R_k}{P_k + R_k}$ | Averages individual class-level F1 scores across all classes

Appendix B. Proofs of Section 3

This appendix presents formal proofs of the properties and theorems introduced in Section 3, including relationships among precision, recall, accuracy, and derived metrics. All derivations are based on the notation and definitions provided in Table A1, under the assumption that actual classes are represented by rows in the confusion matrix.

Appendix B.1. Proof of Property 3

Proof. 
Since $TP_k + FP_k + TN_k + FN_k = N$ for each $k$, it follows from the first four formulas in Table A1 (rows 1–4, excluding the header) and the formula for Average accuracy (item 20) that
$$\begin{aligned}
\text{Average accuracy} &= \frac{1}{r}\sum_{k=1}^{r} \frac{TP_k + TN_k}{TP_k + FP_k + TN_k + FN_k} = \frac{1}{Nr}\sum_{k=1}^{r}\left(TP_k + TN_k\right) = \frac{1}{Nr}\sum_{k=1}^{r}\left(N - FP_k - FN_k\right) \\
&= \frac{1}{Nr}\left(Nr - \sum_{k=1}^{r} FP_k - \sum_{k=1}^{r} FN_k\right) \\
&= \frac{1}{Nr}\left[Nr - \sum_{k=1}^{r}\left(\sum_{i=1}^{r} n_{ik} - n_{kk}\right) - \sum_{k=1}^{r}\left(\sum_{j=1}^{r} n_{kj} - n_{kk}\right)\right] \\
&= \frac{1}{Nr}\left(Nr - \sum_{k=1}^{r} N_{\cdot k} + \sum_{k=1}^{r} n_{kk} - \sum_{k=1}^{r} N_{k\cdot} + \sum_{k=1}^{r} n_{kk}\right) \\
&= \frac{1}{Nr}\left(Nr - 2N + 2\sum_{k=1}^{r} n_{kk}\right) = 1 - \frac{2}{r} + \frac{2}{r}\,\text{Overall accuracy}.
\end{aligned}$$
Furthermore, since $\text{Overall accuracy} \leq 1$ and $1 - \frac{2}{r} \geq 0$ for $r \geq 2$,
$$\text{Average accuracy} = 1 - \frac{2}{r} + \frac{2}{r}\,\text{Overall accuracy} \geq \left(1 - \frac{2}{r}\right)\text{Overall accuracy} + \frac{2}{r}\,\text{Overall accuracy} = \text{Overall accuracy}.$$
In particular, if $r = 2$, then Average accuracy = Overall accuracy. □

Appendix B.2. Proof of Property 4

Proof. 
It follows from Table A1 that $TP_k = n_{kk}$, $TP_k + FP_k = \sum_{i=1}^{r} n_{ik}$, and $TP_k + FN_k = \sum_{i=1}^{r} n_{ki}$. Substitute these into the formulas for $P_\mu$ and $R_\mu$ (items 13 and 14 in Table A1) to obtain the following:
$$P_\mu = \frac{\sum_{k=1}^{r} TP_k}{\sum_{k=1}^{r}\left(TP_k + FP_k\right)} = \frac{\sum_{k=1}^{r} n_{kk}}{\sum_{k=1}^{r}\sum_{i=1}^{r} n_{ik}} = \text{Overall accuracy}.$$
$$R_\mu = \frac{\sum_{k=1}^{r} TP_k}{\sum_{k=1}^{r}\left(TP_k + FN_k\right)} = \frac{\sum_{k=1}^{r} n_{kk}}{\sum_{k=1}^{r}\sum_{i=1}^{r} n_{ki}} = \text{Overall accuracy}.$$
Since $P_\mu = R_\mu = \text{Overall accuracy}$, we have Micro-averaged F1-score $= \text{Overall accuracy}$.
Based on the definition of Weighted recall provided in Table A1 (item 17) and by applying Property 1, it follows that
$$R_W = \frac{\sum_{i=1}^{r} N_{i\cdot} R_i}{\sum_{i=1}^{r} N_{i\cdot}} = \frac{\sum_{i=1}^{r} N_{i\cdot}\,\frac{n_{ii}}{N_{i\cdot}}}{\sum_{i=1}^{r} N_{i\cdot}} = \frac{\sum_{i=1}^{r} n_{ii}}{N} = \text{Overall accuracy}.$$
Thus, $P_\mu = R_\mu = \text{Micro-averaged F1-score} = R_W = \text{Overall accuracy}$. □

Appendix C. Proof of Theorem 1

Proof. 
(1) Recall that it has already been established that $TP^t_k = TP_k$, $FP^t_k = FP_k$, $FN^t_k = FN_k$, and $TN^t_k = TN_k$ for any class $C_k$. Applying Equations (6) and (7), this yields:
$$P^t_k = \frac{TP^t_k}{TP^t_k + FP^t_k} = \frac{TP_k}{TP_k + FP_k}.$$
$$R^t_k = \frac{TP^t_k}{TP^t_k + FN^t_k} = \frac{TP_k}{TP_k + FN_k}.$$
Comparing Equations (A1) and (A2) with the sixth and seventh items in Table A1 establishes that $P^t_k = P_k$ and $R^t_k = R_k$. Since the F1-score for class $C_k$ is the harmonic mean of $P^t_k$ and $R^t_k$, it remains unchanged under transposition.
(2) Macro-averaged precision and recall under transposition are defined as the unweighted averages of the per-class Precision and Recall:
$$P^t_M = \frac{1}{r}\sum_{k=1}^{r} P^t_k.$$
$$R^t_M = \frac{1}{r}\sum_{k=1}^{r} R^t_k.$$
Since each $P^t_k$ and $R^t_k$ is equal to its corresponding $P_k$ and $R_k$ by Part (1) of this theorem, it follows by comparison with items 9 and 10 in Table A1 that $P^t_M = P_M$ and $R^t_M = R_M$. As a result, the version 1 Macro-averaged F1-score $F1_{MH}$, being the harmonic mean of the Macro-averaged precision and recall, is also invariant under transposition:
$$F1^t_{MH} = \frac{2 P^t_M R^t_M}{P^t_M + R^t_M} = \frac{2 P_M R_M}{P_M + R_M} = F1_{MH}.$$
Furthermore, by Part (1) of Theorem 1, each per-class F1-score $F1_k$ remains unchanged under transposition. Hence, the version 2 Macro-averaged F1-score $F1_{MA}$, defined as the arithmetic average of per-class F1-scores, is also invariant:
$$F1^t_{MA} = \frac{1}{r}\sum_{k=1}^{r} F1^t_k = \frac{1}{r}\sum_{k=1}^{r} F1_k = F1_{MA}.$$
(3) As in items 13 and 14 in Table A1, the Micro-averaged precision and recall under transposition are defined by the following:
$$P^t_\mu = \frac{\sum_{k=1}^{r} TP^t_k}{\sum_{k=1}^{r}\left(TP^t_k + FP^t_k\right)}.$$
$$R^t_\mu = \frac{\sum_{k=1}^{r} TP^t_k}{\sum_{k=1}^{r}\left(TP^t_k + FN^t_k\right)}.$$
Recall that it has already been established that $TP^t_k = TP_k$, $FP^t_k = FP_k$, $FN^t_k = FN_k$, and $TN^t_k = TN_k$ for any $C_k$. It follows that $P^t_\mu = P_\mu$ and $R^t_\mu = R_\mu$. Therefore, the Micro-averaged F1-score, which is the harmonic mean of these two quantities, also remains unchanged under transposition.
(4) Weighted precision and recall under transposition are computed using fixed class weights applied to per-class Precision and Recall. For each class $C_k$, the weight is the proportion of total instances belonging to that class, given by $\frac{\sum_{j=1}^{r} n_{kj}}{N} = \frac{N_{k\cdot}}{N}$. So, the Weighted precision and recall are given by the following:
$$P^t_W = \frac{\sum_{i=1}^{r} N_{i\cdot} P^t_i}{N}.$$
$$R^t_W = \frac{\sum_{i=1}^{r} N_{i\cdot} R^t_i}{N}.$$
Since $P^t_i = P_i$ and $R^t_i = R_i$ for each $1 \leq i \leq r$ by Part (1) of this theorem, and the class weights are unchanged under transposition, it follows by comparison with items 16 and 17 in Table A1 that $P^t_W = P_W$ and $R^t_W = R_W$. Since the Weighted F1-score is the harmonic mean of the Weighted precision and recall, it is also unaffected.
(5) Average accuracy and error rate under transposition are derived from class-specific contributions and rely on row or column sums that mirror each other in the transposed matrix. Hence, both remain unchanged. Finally, overall accuracy is computed as the total number of correct predictions divided by the total number of predictions. This global measure is entirely unaffected by the arrangement of rows and columns in the confusion matrix. □

Appendix D. Proof of Theorem 2

Proof. 
(1) Using Property 1, the Precision and Recall for class $C_{\pi_k}$ in the confusion matrix in Table 7 are given by the following:
$$P^\pi_k = \frac{n_{\pi_k \pi_k}}{\sum_{i=1}^{r} n_{\pi_i \pi_k}}.$$
$$R^\pi_k = \frac{n_{\pi_k \pi_k}}{\sum_{j=1}^{r} n_{\pi_k \pi_j}}.$$
Since $(\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_r)$ is a permutation of $(1, \ldots, k, \ldots, r)$, the sums in Equations (A9) and (A10) can be rewritten as follows:
$$\sum_{i=1}^{r} n_{\pi_i \pi_k} = \sum_{i=1}^{r} n_{i \pi_k} = N_{\cdot \pi_k}.$$
$$\sum_{j=1}^{r} n_{\pi_k \pi_j} = \sum_{j=1}^{r} n_{\pi_k j} = N_{\pi_k \cdot}.$$
Substituting Equation (A11) into Equation (A9) and Equation (A12) into Equation (A10) yields the following:
$$P^\pi_k = \frac{n_{\pi_k \pi_k}}{N_{\cdot \pi_k}}.$$
$$R^\pi_k = \frac{n_{\pi_k \pi_k}}{N_{\pi_k \cdot}}.$$
By comparing Equations (A13) and (A14) with Equations (14) and (15), it follows that $P^\pi_k = P_{\pi_k}$ and $R^\pi_k = R_{\pi_k}$. Thus, the Precision and Recall of the $k$-th class in the permuted matrix (Table 7) correspond exactly to those of the $\pi_k$-th class in the original confusion matrix (Table 4).
(2) By items 9 and 10 in Table A1, the Macro-averaged precision and recall for Table 7 are given by the following:
$$P^\pi_M = \frac{1}{r}\sum_{k=1}^{r} P^\pi_k.$$
$$R^\pi_M = \frac{1}{r}\sum_{k=1}^{r} R^\pi_k.$$
By Part (1) of this theorem, it has been established that
$$P^\pi_M = \frac{1}{r}\sum_{k=1}^{r} P_{\pi_k}.$$
$$R^\pi_M = \frac{1}{r}\sum_{k=1}^{r} R_{\pi_k}.$$
Since $(\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_r)$ is a permutation of $(1, \ldots, k, \ldots, r)$, the sums in Equations (A17) and (A18) are equivalent to $\sum_{k=1}^{r} P_k$ and $\sum_{k=1}^{r} R_k$, respectively. Hence, $P^\pi_M = P_M$ and $R^\pi_M = R_M$.
As the version 1 Macro-averaged F1-score is defined as the harmonic mean of $P^\pi_M$ and $R^\pi_M$, and both components are unchanged, the Macro-averaged F1-score is also invariant under class permutation:
$$F1^\pi_{MH} = \frac{2 P^\pi_M R^\pi_M}{P^\pi_M + R^\pi_M} = \frac{2 P_M R_M}{P_M + R_M}.$$
As for the version 2 Macro-averaged F1-score, applying the definition in item 12 of Table A1 together with Part (1) of this theorem yields the following:
$$F1^\pi_{MA} = \frac{1}{r}\sum_{k=1}^{r} \frac{2 P^\pi_k R^\pi_k}{P^\pi_k + R^\pi_k} = \frac{1}{r}\sum_{k=1}^{r} \frac{2 P_{\pi_k} R_{\pi_k}}{P_{\pi_k} + R_{\pi_k}}.$$
Again, since $(\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_r)$ is simply a permutation of $(1, \ldots, k, \ldots, r)$, the sum over the permuted indices is equal to the original sum:
$$F1^\pi_{MA} = \frac{1}{r}\sum_{k=1}^{r} \frac{2 P_k R_k}{P_k + R_k} = F1_{MA}.$$
(3) By the definitions in items 13 and 14 in Table A1, the Micro-averaged precision and recall of Table 7 are given by the following:
$$P^\pi_\mu = \frac{\sum_{k=1}^{r} TP^\pi_k}{\sum_{k=1}^{r}\left(TP^\pi_k + FP^\pi_k\right)} = \frac{\sum_{k=1}^{r} TP^\pi_k}{N}.$$
$$R^\pi_\mu = \frac{\sum_{k=1}^{r} TP^\pi_k}{\sum_{k=1}^{r}\left(TP^\pi_k + FN^\pi_k\right)} = \frac{\sum_{k=1}^{r} TP^\pi_k}{N}.$$
Since $TP^\pi_k = TP_{\pi_k}$ for all $k = 1, 2, \ldots, r$, the sums in (A19) and (A20) are equivalent to $\sum_{k=1}^{r} TP^\pi_k = \sum_{k=1}^{r} TP_{\pi_k}$. Since $(\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_r)$ is a permutation of $(1, \ldots, k, \ldots, r)$, the sum $\sum_{k=1}^{r} TP_{\pi_k}$ equals $\sum_{k=1}^{r} TP_k$. Therefore, $P^\pi_\mu = P_\mu$ and $R^\pi_\mu = R_\mu$.
(4) From item 16 in Table A1, the Weighted precision for Table 7 is given by the following:
$$P^\pi_W = \frac{\sum_{k=1}^{r}\left(\sum_{j=1}^{r} n_{\pi_k \pi_j}\right) P^\pi_k}{N}.$$
Applying (A13), $P^\pi_W$ can be rewritten as follows:
$$P^\pi_W = \frac{\sum_{k=1}^{r}\left(\sum_{j=1}^{r} n_{\pi_k \pi_j}\right)\frac{n_{\pi_k \pi_k}}{N_{\cdot \pi_k}}}{N} = \frac{1}{N}\sum_{k=1}^{r} N_{\pi_k \cdot} \times \frac{n_{\pi_k \pi_k}}{N_{\cdot \pi_k}}.$$
Since each $\pi_k$ simply reindexes the classes, this is equivalent to $P^\pi_W = P_W$.
Similarly, $R^\pi_W = R_W$.
(5) Since the total number of instances in Table 7 remains $N$, and the sum of the diagonal elements equals $\sum_{k=1}^{r} n_{kk}$ because $(\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_r)$ is a permutation of $(1, 2, \ldots, r)$, the Overall accuracy is the same as in Table 4.
Since it has already been established that $TP^\pi_k = TP_{\pi_k}$, $FP^\pi_k = FP_{\pi_k}$, $FN^\pi_k = FN_{\pi_k}$, and $TN^\pi_k = TN_{\pi_k}$ for all $1 \leq k \leq r$, the Average accuracy and Error rate computed from Table 7 are equal to those in Table 4. This conclusion is supported by the formulas provided in items 20 and 21 of Table A1. □

Appendix E. Pseudo Code

E1. Pseudocode for actual classes on rows

Input:
- y_true: List of actual class labels
- y_pred: List of predicted class labels
- r: Number of unique classes
Procedure:
1. Initialize confusion matrix CM of size r × r
 For each pair (yt, yp) in (y_true, y_pred):
    CM[yt][yp] += 1
 // Rows: actual classes (yt), Columns: predicted classes (yp)
2. For each class i from 1 to r:
   a. True Positives (TP_i) = CM[i][i]
   b. False Negatives (FN_i) = sum over j ≠ i of CM[i][j]
   c. False Positives (FP_i) = sum over j ≠ i of CM[j][i]
   d. True Negatives (TN_i) = sum(CM) - (TP_i + FN_i + FP_i)
   e. Precision_i = TP_i / (TP_i + FP_i) if denominator > 0 else 0
   f. Recall_i = TP_i / (TP_i + FN_i) if denominator > 0 else 0
3. Aggregate metrics:
   a. Macro precision = average of all Precision_i
   b. Macro recall = average of all Recall_i
   c. Macro F1 = harmonic mean of Macro precision and recall
   d. Micro TP = sum of all TP_i
    Micro FP = sum of all FP_i
    Micro FN = sum of all FN_i
    Micro precision = Micro TP / (Micro TP + Micro FP)
    Micro recall = Micro TP / (Micro TP + Micro FN)
    Micro F1 = harmonic mean of Micro precision and recall
   e. Weighted precision = support-weighted average of Precision_i
    Weighted recall = support-weighted average of Recall_i
    Weighted F1 = support-weighted average of F1_i
   f. Overall accuracy = sum of all TP_i / total number of samples
   g. Error rate = 1 - Overall accuracy
4. Average accuracy (Sokolova & Lapalme):
     For each class i:
        TP_i = CM[i][i]
        TN_i = sum of all entries of CM excluding row i and column i
        Partial Acc_i = (TP_i + TN_i) / total number of samples
     Average accuracy = mean of Partial Acc_i across all classes
5. Class permutation (optional):
   - Define permutation π: [0,…,r−1] → [π(0),…,π(r−1)]
   - Replace each label y in y_true and y_pred with π(y)
   - Recompute CM and repeat steps 1–4
Output:
- Confusion matrices: Original and Permuted
- Per-class metrics: Precision_i, Recall_i
- Aggregated metrics: Macro, Micro, Weighted, Overall accuracy, Average accuracy, Error rate
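A possible Python rendering of the pseudocode above is sketched below. It uses only NumPy; the function name multiclass_metrics and the dictionary keys are illustrative choices, not definitions from the paper.

# Sketch of the actuals-on-rows procedure (Appendix E1) in plain NumPy.
import numpy as np

def multiclass_metrics(y_true, y_pred, r):
    # Step 1: confusion matrix with rows = actual classes, columns = predicted classes.
    cm = np.zeros((r, r), dtype=int)
    for yt, yp in zip(y_true, y_pred):
        cm[yt, yp] += 1

    n = cm.sum()
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp              # row sums minus the diagonal
    fp = cm.sum(axis=0) - tp              # column sums minus the diagonal
    tn = n - tp - fn - fp
    support = cm.sum(axis=1)              # actual instances per class

    # Step 2: per-class precision, recall, and F1 (0 when a denominator is 0).
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(tp + fp > 0, tp / (tp + fp), 0.0)
        recall = np.where(tp + fn > 0, tp / (tp + fn), 0.0)
        f1 = np.where(precision + recall > 0,
                      2 * precision * recall / (precision + recall), 0.0)

    # Steps 3-4: aggregate metrics.
    macro_p, macro_r = precision.mean(), recall.mean()
    overall_accuracy = tp.sum() / n
    metrics = {
        "macro_precision": macro_p,
        "macro_recall": macro_r,
        "macro_f1_harmonic": 2 * macro_p * macro_r / (macro_p + macro_r),
        "micro_precision": overall_accuracy,   # micro precision = micro recall here
        "micro_recall": overall_accuracy,
        "weighted_precision": np.average(precision, weights=support),
        "weighted_recall": np.average(recall, weights=support),
        "weighted_f1": np.average(f1, weights=support),
        "overall_accuracy": overall_accuracy,
        "error_rate": 1.0 - overall_accuracy,
        "average_accuracy": np.mean((tp + tn) / n),
    }
    return cm, metrics

Applying a class permutation (step 5) amounts to relabeling y_true and y_pred before calling this function; no other change is needed.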
E2. Pseudocode for actual classes on columns
Input:
- y_true: List of true class labels
- y_pred: List of predicted class labels
- r: Number of unique classes
Procedure:
1. Construct confusion matrix CM:
   CM[i][j] = number of instances with predicted class i and true class j
   // i → row index (prediction), j → column index (actual)
   // CM must be transposed if created using a library that assumes actuals on rows
2. Transpose matrix: CM_T = transpose(CM)
   // Now CM_T has actual classes on rows, predicted on columns
3. Compute Micro-averaged Metrics:
   TP_micro = trace(CM_T)
   Micro precision = TP_micro / sum(CM_T)
   Micro recall = TP_micro / sum(CM_T)
   Micro F1 = harmonic mean of Micro precision and recall
4. Compute Macro-averaged Metrics:
    For each class k from 1 to r:
        Precision_k = CM_T[k][k] / sum of column k in CM_T
        Recall_k = CM_T[k][k] / sum of row k in CM_T
     Macro precision = average of all Precision_k
     Macro recall = average of all Recall_k
     Macro F1 = harmonic mean of Macro precision and recall
5. Compute Weighted metrics:
    For each class k:
      Support_k = sum of row k in CM_T
      F1_k = 2 * Precision_k * Recall_k / (Precision_k + Recall_k)
    Weighted precision = Weighted average of Precision_k by Support_k
    Weighted recall = Weighted average of Recall_k by Support_k
    Weighted F1 = Weighted average of F1_k by Support_k
6. Compute Overall accuracy:
   Overall accuracy = trace(CM_T) / total number of instances
7. Compute Average accuracy (Sokolova & Lapalme):
   For each class k:
     TP_k = CM_T[k][k]
      FN_k = sum of row k in CM_T - TP_k
      FP_k = sum of column k in CM_T - TP_k
     TN_k = total sum of CM_T - TP_k - FN_k - FP_k
     Acc_k = (TP_k + TN_k) / total number of instances
   Average Accuracy = average of Acc_k over all classes
Output:
- Confusion matrix (with actuals in columns)
- Macro, Micro, and Weighted metrics
- Overall accuracy and Average accuracy (Sokolova)
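Under the same assumptions as the previous sketch (NumPy available, illustrative names), the actuals-on-columns case reduces to a single transposition. The sketch below also assumes every class is both predicted and present at least once, so that no denominator vanishes.

# Sketch of the actuals-on-columns procedure (Appendix E2).
import numpy as np

def metrics_actuals_on_columns(cm):
    # cm[i][j] = number of instances with predicted class i and actual class j.
    cm_t = cm.T                            # rows = actual, columns = predicted
    n = cm_t.sum()
    tp = np.diag(cm_t).astype(float)
    precision = tp / cm_t.sum(axis=0)      # denominators: predicted counts (columns of cm_t)
    recall = tp / cm_t.sum(axis=1)         # denominators: actual counts (rows of cm_t)
    support = cm_t.sum(axis=1)
    f1 = 2 * precision * recall / (precision + recall)
    tn = n - tp - (cm_t.sum(axis=1) - tp) - (cm_t.sum(axis=0) - tp)
    return {
        "overall_accuracy": tp.sum() / n,
        "micro_precision": tp.sum() / n,   # equals micro recall for single-label data
        "macro_precision": precision.mean(),
        "macro_recall": recall.mean(),
        "weighted_precision": np.average(precision, weights=support),
        "weighted_recall": np.average(recall, weights=support),
        "weighted_f1": np.average(f1, weights=support),
        "average_accuracy": np.mean((tp + tn) / n),
    }

The only difference from the actuals-on-rows sketch is the initial orientation of the matrix; after the transpose, the same formulas apply, which is the content of the transposition-invariance result.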

Appendix F. Derivability of Aggregate Metrics from Core Metrics

Proposition A1.
(Sufficiency of Four Core Metrics):
Given a confusion matrix with $r$ classes, regardless of whether actual classes are placed in rows or columns, the complete set of twelve aggregate metrics can be uniquely determined from the following four core metrics: Overall accuracy, Macro-averaged precision $P_M$, Macro-averaged recall $R_M$, and Weighted precision $P_W$.
Proof. 
It is sufficient to consider the configuration where actual classes are assigned to rows, since all of the twelve aggregate metrics under consideration are invariant under matrix transposition (cf. Section 4). The remaining eight metrics follow deterministically from the core four as follows:
  • Error rate: By Property 2,
    Error rate = 1 − Overall accuracy.
  • Average accuracy: By Property 3,
    $\text{Average accuracy} = 1 - \frac{2}{r} + \frac{2}{r}\,\text{Overall accuracy}$.
  • Micro-averaged precision $P_{\mu}$, Recall $R_{\mu}$, F1 Score $F1_{\mu}$, and Weighted recall $R_W$: All equal to Overall accuracy by Property 4.
  • Macro-averaged F1 score (Version 1): Computed as the harmonic mean of $P_M$ and $R_M$:
    $F1_{MH} = \frac{2 P_M R_M}{P_M + R_M}$.
Note: Version 2—defined as the arithmetic mean of per-class F1 scores—requires access to individual class-level precision and recall values and thus cannot be calculated directly from the four aggregate quantities alone.
  • Weighted F1 score: Computed as the harmonic mean of $P_W$ and $R_W$:
    $F1_W = \frac{2 P_W R_W}{P_W + R_W}$.
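Proposition A1 can be exercised numerically. The sketch below follows the formulas listed in this appendix; the variable names are illustrative, and the four input values are the Wine test-set figures reported in Table 11.

# Recovering the derived metrics from the four core metrics (Proposition A1).
r = 3                                 # number of classes
overall_accuracy = 0.59259259
macro_precision = 0.57768158          # P_M
macro_recall = 0.57685881             # R_M
weighted_precision = 0.60238630       # P_W

error_rate = 1 - overall_accuracy                                   # Property 2
average_accuracy = 1 - 2 / r + (2 / r) * overall_accuracy           # Property 3
micro_precision = micro_recall = micro_f1 = overall_accuracy        # Property 4
weighted_recall = overall_accuracy                                  # Property 4
macro_f1_v1 = (2 * macro_precision * macro_recall
               / (macro_precision + macro_recall))                  # harmonic mean of P_M, R_M
weighted_f1 = (2 * weighted_precision * weighted_recall
               / (weighted_precision + weighted_recall))            # harmonic mean of P_W, R_W

print(f"Error rate       = {error_rate:.8f}")        # 0.40740741
print(f"Average accuracy = {average_accuracy:.8f}")  # 0.72839506
print(f"Macro F1 (MH)    = {macro_f1_v1:.8f}")       # ≈ 0.57726990 (cf. Table 11)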

References

  1. Green, D.M.; Swets, J.A. Signal Detection Theory and Psychophysics; Wiley: New York, NY, USA, 1966.
  2. Kent, A.; Berry, M.M.; Luehrs, F.U.; Perry, J.W. Machine Literature Searching VIII: Operational Criteria for Designing Information Retrieval Systems. Am. Doc. 1955, 6, 93–101.
  3. Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; Wiley: New York, NY, USA, 1973.
  4. Mitchell, T.M. Machine Learning; McGraw-Hill: New York, NY, USA, 1997.
  5. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
  6. Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
  7. Zeng, G. On the Confusion Matrix in Credit Scoring and Its Analytical Properties. Commun. Stat. Theory Methods 2020, 49, 2080–2093.
  8. Sokolova, M.; Lapalme, G. A Systematic Analysis of Performance Measures for Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437.
  9. Grandini, M.; Bagli, E.; Visani, G. Metrics for Multiclass Classification: An Overview. arXiv 2020, arXiv:2008.05756.
  10. Farhadpour, S.; Warner, T.A.; Maxwell, A.E. Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices. Remote Sens. 2024, 16, 533.
  11. Takahashi, K.; Yamamoto, K.; Kuchiba, A.; Koyama, T. Confidence Interval for Micro-Averaged F1 and Macro-Averaged F1 Scores. Appl. Intell. 2022, 52, 4961–4972.
  12. Opitz, J.; Burst, S. Macro F1 and Macro F1. arXiv 2019, arXiv:1911.03347.
  13. Zhou, Z.-H. A Brief Introduction to Weakly Supervised Learning. Natl. Sci. Rev. 2021, 8, nwab045.
  14. Tharwat, A. Classification Assessment Methods. Appl. Comput. Inform. 2021, 17, 168–192.
  15. Görtler, J.; Hohman, F.; Moritz, D.; Wongsuphasawat, K.; Ren, D.; Nair, R.; Kirchner, M.; Patel, K. Neo: Generalizing Confusion Matrix Visualization to Hierarchical and Multi-Output Labels. Proc. CHI Conf. Hum. Factors Comput. Syst. 2022, 1, 1–13.
  16. Song, S.; Kamei, S.; Li, C.; Hou, S.; Morimoto, Y. A Visual Interpretation-Based Self-Improved Classification System Using Virtual Adversarial Training. arXiv 2023, arXiv:2309.01196.
  17. Levy, D.G. In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion; GoodScience, Inc.: Selangor, Malaysia, 2018; Available online: https://www.fharrell.com/post/mlconfusion (accessed on 18 July 2025).
  18. Cleverdon, C.W.; Mills, J.; Keen, M. Factors Determining the Performance of Indexing Systems. Volume 1: Design; Aslib Cranfield Research Project: Cranfield, UK, 1966.
  19. Salton, G. Automatic Information Organization and Retrieval; McGraw-Hill: New York, NY, USA, 1968.
  20. van Rijsbergen, C.J. Information Retrieval, 2nd ed.; Butterworths: London, UK, 1979.
  21. Scikit-Learn. Classification Metrics. Available online: https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics (accessed on 25 July 2025).
Table 1. Contingency table in Information Retrieval. Reproduced with permission from [5]. © 2025 Cambridge University Press.

                   Relevant                 Nonrelevant
Retrieved          true positives (tp)      false positives (fp)
Not retrieved      false negatives (fn)     true negatives (tn)
Table 2. Binary confusion matrix.

                       Actual: Positive         Actual: Negative
Predicted: Positive    true positives (tp)      false positives (fp)
Predicted: Negative    false negatives (fn)     true negatives (tn)
Table 3. Alternative binary confusion matrix.

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN
Table 4. Multiclass confusion matrix with actual class in rows (default scikit-learn orientation).

                               Predicted Class
                      C_1       C_2       ⋯    C_k       ⋯    C_r
Actual Class   C_1    n_{11}    n_{12}    ⋯    n_{1k}    ⋯    n_{1r}
               C_2    n_{21}    n_{22}    ⋯    n_{2k}    ⋯    n_{2r}
               ⋮      ⋮         ⋮              ⋮              ⋮
               C_k    n_{k1}    n_{k2}    ⋯    n_{kk}    ⋯    n_{kr}
               ⋮      ⋮         ⋮              ⋮              ⋮
               C_r    n_{r1}    n_{r2}    ⋯    n_{rk}    ⋯    n_{rr}
Table 5. One-vs.-all binary partition of the confusion matrix for class C_k.

                     Predicted: C_k              Predicted: Not C_k
Actual: C_k          TP: n_{kk}                  FN: Σ_{j≠k} n_{kj}
Actual: Not C_k      FP: Σ_{i≠k} n_{ik}          TN: Σ_{i≠k} Σ_{j≠k} n_{ij}
Table 6. Multiclass confusion matrix with actual class in columns (transposed form).

                                  Actual Class
                         C_1       C_2       ⋯    C_k       ⋯    C_r
Predicted Class   C_1    n_{11}    n_{21}    ⋯    n_{k1}    ⋯    n_{r1}
                  C_2    n_{12}    n_{22}    ⋯    n_{k2}    ⋯    n_{r2}
                  ⋮      ⋮         ⋮              ⋮              ⋮
                  C_k    n_{1k}    n_{2k}    ⋯    n_{kk}    ⋯    n_{rk}
                  ⋮      ⋮         ⋮              ⋮              ⋮
                  C_r    n_{1r}    n_{2r}    ⋯    n_{kr}    ⋯    n_{rr}
Table 7. Confusion matrix with new ordering of classes.

                                        Predicted Class
                            C_{π_1}          C_{π_2}          ⋯    C_{π_k}          ⋯    C_{π_r}
Actual Class   C_{π_1}      n_{π_1 π_1}      n_{π_1 π_2}      ⋯    n_{π_1 π_k}      ⋯    n_{π_1 π_r}
               C_{π_2}      n_{π_2 π_1}      n_{π_2 π_2}      ⋯    n_{π_2 π_k}      ⋯    n_{π_2 π_r}
               ⋮            ⋮                ⋮                     ⋮                     ⋮
               C_{π_k}      n_{π_k π_1}      n_{π_k π_2}      ⋯    n_{π_k π_k}      ⋯    n_{π_k π_r}
               ⋮            ⋮                ⋮                     ⋮                     ⋮
               C_{π_r}      n_{π_r π_1}      n_{π_r π_2}      ⋯    n_{π_r π_k}      ⋯    n_{π_r π_r}
Table 8. Confusion matrix before and after permutation.

(a) Original confusion matrix of the test data.

                     Predicted: Class 1    Predicted: Class 2    Predicted: Class 3
Actual: Class 1              8                     3                     8
Actual: Class 2              1                    17                     3
Actual: Class 3              5                     2                     7

(b) Confusion matrix after permutation.

                     Predicted: Class 3    Predicted: Class 1    Predicted: Class 2
Actual: Class 3           n_{33}                n_{31}                n_{32}
Actual: Class 1           n_{13}                n_{11}                n_{12}
Actual: Class 2           n_{23}                n_{21}                n_{22}
Table 9. Original confusion matrix of the test data.

                       Predicted 0 (Class 1)    Predicted 1 (Class 2)    Predicted 2 (Class 3)
Actual 0 (Class 1)              8                        3                        8
Actual 1 (Class 2)              1                       17                        3
Actual 2 (Class 3)              5                        2                        7
Table 10. Per-class metrics for original and permuted.

                  Original     Original    Permuted     Permuted    Transposed    Transposed
                  Precision    Recall      Precision    Recall      Precision     Recall
Class 1 (y = 0)   0.571429     0.421053    0.388889     0.500000    0.388889      0.500000
Class 2 (y = 1)   0.772727     0.809524    0.571429     0.421053    0.571429      0.421053
Class 3 (y = 2)   0.388889     0.500000    0.772727     0.809524    0.772727      0.809524
Table 11. Comparisons of metrics.

                      Original       Permuted       Transposed
Macro precision       0.57768158     0.57768158     0.57768158
Macro recall          0.57685881     0.57685881     0.57685881
Macro F1              0.57101539     0.57101539     0.57101539
Micro precision       0.59259259     0.59259259     0.59259259
Micro recall          0.59259259     0.59259259     0.59259259
Micro F1              0.59259259     0.59259259     0.59259259
Weighted precision    0.60238630     0.60238630     0.60238630
Weighted recall       0.59259259     0.59259259     0.59259259
Weighted F1           0.59151430     0.59151430     0.59151430
Overall accuracy      0.59259259     0.59259259     0.59259259
Error rate            0.40740741     0.40740741     0.40740741
Average accuracy      0.72839506     0.72839506     0.72839506
Macro F1 (MH)         0.57726990     0.57726990     0.57726990