1. Introduction
The confusion matrix is a foundational tool for evaluating classification models in machine learning, offering a structured summary of how model predictions compare to actual classes. Its conceptual origins date back to early developments in signal detection theory, where researchers in the 1940s and 1950s [1] introduced decision-based outcomes such as hit, miss, false alarm, and correct rejection. These outcomes map directly onto the modern notions of true positive, false negative, false positive, and true negative in binary classification.
Around the same time, the field of information retrieval advanced foundational evaluation ideas. Kent, Berry, Luhn, and Perry (1955) [2] formalized the distinction between relevant and retrieved documents, leading to precision and recall as core metrics. Duda and Hart (1973) [3], while not explicitly using the term “confusion matrix”, structured classification results in tabular form, which served the same purpose.
The terminology and widespread adoption of the confusion matrix emerged in the 1990s and early 2000s, aided by influential textbooks such as Mitchell (1997) [4] and Manning, Raghavan, and Schütze (2008) [5]. Powers (2011) [6] later expanded the interpretation of evaluation metrics, introducing extensions such as Informedness and Markedness. A more recent analytical treatment appears in Zeng (2020) [7], who explored confusion matrix variants in binary classification and their connection to tools such as ROC curves and the Kolmogorov–Smirnov statistic.
Despite these advances, confusion persists in applying the confusion matrix to multiclass classification settings. Common but critical questions remain underexplored:
Should actual classes be represented by rows or columns?
Does orientation affect evaluation metrics like precision, recall, or accuracy?
What happens when the order of class labels is permuted?
These seemingly technical questions are critical for reproducibility and interpretation of evaluation metrics, particularly in tasks involving imbalanced classes or hierarchical relationships.
Sokolova and Lapalme (2009) [8] provided a systematic survey of evaluation metrics across binary, multiclass, multilabel, and hierarchical tasks. While their classification of metrics by properties, like sensitivity and averaging strategy, is valuable, their treatment leaves structural ambiguities unresolved. Notably, they do not specify whether actual classes are represented by rows or columns. Nor do they address whether evaluation metrics change when the matrix is transposed or class labels are reordered. Grandini, Bagli, and Visani (2020) [9] improved upon this by clearly stating that in their paper, rows represent actual classes and columns represent predictions. While this adds clarity, they do not explore the consequences of changing orientation or label order.
Recent contributions have highlighted the need for analytical clarification. Farhadpour et al. [10], Takahashi et al. [11], and Opitz and Burst [12] examine macro averaging, class imbalance, and structural bias. Zhou [13] and Tharwat [14] address weak supervision and hierarchical classification. Görtler et al. [15] introduce a probabilistic algebra for confusion matrices and visualization tools for hierarchical and multioutput labels. Song [16] emphasizes visual diagnostics and matrix-based interpretation in complex classification regimes. Together, these studies reinforce the importance of structural invariance but leave foundational ambiguities unresolved.
This paper aims to characterize structural invariance in multiclass evaluation metrics, with respect to confusion matrix transposition and label permutation, through algebraic analysis and empirical validation. Its novel contributions are as follows:
Matrix transposition invariance: Demonstrates that twelve widely used metrics, including precision, recall, macro/micro/weighted F1 scores, overall and average accuracy, and error rate, remain unchanged under confusion matrix transposition. This invariance has not been formally documented across all of these metrics in the prior literature.
Label permutation robustness: Proves algebraically that these metrics also remain invariant under arbitrary reordering of class labels, generalizing and formalizing observations that have previously lacked theoretical grounding.
Interpretation rule: Introduces the simple rule “Precision is for Predicted, Recall is for Real”, which simplifies interpretation and reduces the risk of misinterpretation.
Linear relationship between Accuracy metrics: Identifies and proves a linear relationship between Average accuracy and Overall accuracy.
Metric redundancy and reconstruction: Identifies a minimal subset of four core metrics (Overall accuracy, Macro precision, Macro recall, and Weighted precision) from which all twelve standard metrics can be reconstructed. This reveals previously unrecognized dependencies and has implications for reproducibility and reporting efficiency.
The framework presented in this paper addresses a common but underappreciated issue in machine learning: how confusion matrix orientation affects the interpretation of model performance metrics. Many tools, papers, and visualizations present confusion matrices inconsistently—sometimes placing actual classes on rows, other times on columns, or omitting axis labels altogether. When users apply standard formulas without confirming axis orientation, this can result in mislabeling or incorrect calculation of key metrics, particularly Precision and Recall. Harrell [17] highlights this issue in healthcare, where such confusion has led to models being evaluated using the wrong metric for real-world decision-making. This type of structural ambiguity can distort model evaluation and undermine trust in results. By proving structural invariance, the proposed framework improves clarity, reliability, and consistency across datasets and modeling pipelines.
The analytical results are validated through numerical analysis on the Wine dataset using a Random Forest classifier and on CIFAR-10 using a convolutional neural network. These findings have practical implications: they simplify metric computation, eliminate structural ambiguity, and enable consistent and reproducible evaluation across diverse modeling settings.
The remainder of this paper is organized as follows.
Section 2 revisits the binary confusion matrix and historical origins of evaluation metrics.
Section 3 analyzes the multiclass confusion matrix under the standard row-based orientation.
Section 4 examines the effect of matrix transposition.
Section 5 studies the invariance under class label permutations.
Section 6 presents numerical validation using benchmark datasets.
Section 7 concludes with a summary of analytical insights and their implications for practice.
Throughout this paper, the symbols $P$, $R$, and $F_1$ are used to denote precision, recall, and F1-score, respectively. The analysis considers a standard single-label multiclass classification setting, in which each instance (or record) is assigned exactly one class. In this setting, the number of predictions made by the model is equal to the number of instances, denoted by $n$, which also corresponds to the number of rows in the classification target variable $y$. Each row of $y$ contains a single class from the finite set of classes $\{1, 2, \ldots, r\}$, assigned to one instance.
2. Confusion Matrix in Binary Classification
The confusion matrix is a foundational tool in evaluating the performance of classification models. It captures not only the correct predictions but also the nature of misclassifications, providing the basis for a wide range of performance metrics such as Precision, Recall, F1-score, and Accuracy. While the confusion matrix is now a standard component of machine learning pipelines, its conceptual roots trace back to early work in information retrieval, where related ideas were first formalized using contingency tables.
2.1. Historical Foundations and Metric Definitions
The analysis begins by revisiting the original definitions of precision and recall as introduced in early information retrieval and the classification literature [18,19,20]. These metrics were initially formulated using a contingency table, which serves as a conceptual precursor to the modern confusion matrix.
Information Retrieval (IR) is the study of systems designed to assist users in locating relevant information within large collections of unstructured data, such as documents, web pages, or databases. Unlike traditional databases that return exact matches, IR systems retrieve documents that are likely to be relevant based on a user’s query, often expressed in natural language. As defined by Manning et al. [5], a document is retrieved if it is returned by the system in response to a query, and it is considered relevant if it satisfies the user’s information need, typically assessed through human judgment or gold-standard benchmarks.
The following definitions are well-established in the literature [5,6] and are presented here to ensure clarity and consistency of terminology.
Definition 1. Precision is the proportion of retrieved documents that are relevant to the query.
Definition 2. Recall is the proportion of relevant documents that are successfully retrieved.
Definition 3. F1-Score is the harmonic mean of Precision and Recall:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
Unless otherwise specified, the F1-score will be omitted from subsequent analysis, as it is a derived metric, i.e., the harmonic mean of the corresponding Precision metric and Recall metric.
Manning et al. [5] reformulated these definitions using the following contingency table, introducing the notions of true positives, false positives, false negatives, and true negatives (Table 1):
Building on this framework, these concepts can now be reinterpreted using the structure of the binary confusion matrix, which is widely adopted in machine learning to evaluate classification models. In this formulation, retrieval status is treated as the system’s prediction and relevance as the ground truth or actual class. Specifically,
A document that is retrieved corresponds to being predicted as positive.
A document that is not retrieved corresponds to being predicted as negative.
A document that is relevant represents the actual positive class.
A document that is nonrelevant represents the actual negative class.
This mapping results in the binary confusion matrix in Table 2:
Based on this formulation and Table 2, Definitions 1 and 2 can be restated in terms commonly used in machine learning, leading to the following definitions of Precision and Recall. The F1-score remains defined as the harmonic mean of Precision and Recall.
Definition 4. Precision is defined as the proportion of true positives among all predicted positives, while Recall is the proportion of true positives among all actual positives.
To further clarify, this analysis formally defines the four components of the binary confusion matrix. Throughout, it adopts the standard uppercase abbreviations TP, FP, FN, and TN, and uses the singular forms “Positive” and “Negative”, consistent with modern conventions.
Definition 5. In a binary classification setting, with one class designated as positive and the other as negative, each instance is evaluated by comparing its predicted class to its actual class, resulting in one of the following four outcomes:
True Positive (TP): Predicted Positive, and the actual class is also Positive.
False Positive (FP): Predicted Positive, but the actual class is Negative.
False Negative (FN): Predicted Negative, but the actual class is Positive.
True Negative (TN): Predicted Negative, and the actual class is also Negative.
Here, the terms true and false indicate whether the prediction matches the actual class, while positive and negative refer to the predicted class. Thus, TP and TN represent correct predictions, whereas FP and FN represent incorrect predictions.
In addition to Precision, Recall, and F1-score, another key performance metric in classification tasks is Overall accuracy. It is defined as the proportion of correctly classified instances (i.e., the sum of the diagonal elements in the confusion matrix) out of the total number of instances:
$$\text{Overall accuracy} = \frac{TP + TN}{TP + FP + FN + TN}.$$
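To make these definitions concrete, the following minimal Python sketch, using illustrative counts only, computes Precision, Recall, F1-score, and Overall accuracy from the four outcomes of Definition 5.

```python
# Illustrative outcome counts for a binary classifier (hypothetical values).
tp, fp, fn, tn = 50, 5, 10, 35

precision = tp / (tp + fp)                             # among predicted positives
recall = tp / (tp + fn)                                # among actual positives
f1 = 2 * precision * recall / (precision + recall)     # harmonic mean
overall_accuracy = (tp + tn) / (tp + fp + fn + tn)     # correct / total

print(f"P={precision:.3f}, R={recall:.3f}, F1={f1:.3f}, Acc={overall_accuracy:.3f}")
```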
2.2. Matrix Layout Conventions
In Table 2, the actual classes are placed in columns and predicted classes in rows, a layout commonly used in information retrieval and the evaluation literature. However, in machine learning, the more typical convention is the reverse: actual classes in rows and predicted classes in columns, as used by popular libraries such as scikit-learn and R’s caret, and in most machine learning textbooks. Under this convention, the confusion matrix appears as shown in Table 3, which is the transpose of the original confusion matrix in Table 2.
Since the definitions of precision, recall, and the components TP, FP, FN, and TN are independent of matrix orientation, Definitions 4 and 5 apply to this layout too. Thus, the corresponding formulas remain:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$
This confirms that, in binary classification, the computed values for Precision and Recall remain invariant under transposition of the confusion matrix—i.e., whether actual classes are placed in rows or columns. It is also straightforward to verify that the same invariance applies to Overall accuracy.
2.3. On the Choice of the Positive Class
In binary classification, either class can be designated as the positive class, depending on the application. The terms “positive” and “negative” are not inherent to the data, but rather are analytical conventions chosen by the practitioner.
For example, in credit scoring, it is common to define:
Class 1 (e.g., default) as the positive class, representing an adverse or minority outcome;
Class 0 (e.g., nondefault) as the negative class, representing the favorable or majority outcome.
This choice directly impacts how metrics, such as Precision, Recall, and F1-score, are computed and interpreted, as these metrics are defined with respect to the positive class. Therefore, it is essential to explicitly state which class is considered “positive” in any classification analysis to ensure clarity, reproducibility, and interpretability.
In binary classification, all instances are divided into two groups by the actual class and into two groups by the predicted class; TP, FP, FN, and TN are defined relative to the class designated as the “positive” class. Precision focuses on the predicted positives—the prefix “Pre” emphasizes its dependence on the classifier’s predictions—while Recall—or “Rell” as a helpful mnemonic—focuses on the Real (i.e., Actual) positives.
3. Multiclass Confusion Matrix with Actual Classes in Rows
Let the classification problem involve $r$ discrete (categorical) classes, labeled $1, 2, \ldots, r$. The confusion matrix is defined as a two-dimensional array (matrix or table) with the following convention, consistent with common implementations such as scikit-learn: each entry $c_{ij}$, for $1 \le i, j \le r$, represents the number of instances where the actual or true class is $i$ and the predicted class is $j$. This format aligns with standard evaluation tools (e.g., confusion_matrix() in scikit-learn), which assume this structure for computing metrics.
Table 4 below adheres to this layout; here, $k$ denotes an individual class index.
Let $c_{i\cdot}$ and $c_{\cdot j}$ denote the sum of the entries in row $i$ and column $j$, respectively, for $1 \le i, j \le r$. Then,
$$\sum_{i=1}^{r} c_{i\cdot} \;=\; \sum_{j=1}^{r} c_{\cdot j} \;=\; \sum_{i=1}^{r}\sum_{j=1}^{r} c_{ij} \;=\; n.$$
That is, the sum of all row totals and the sum of all column totals are both equal to the total sum of all entries in the confusion matrix, which corresponds to $n$, the total number of instances, or equivalently, the total number of predictions in the single-label classification setting.
3.1. Basic Metrics
Table A1 in Appendix A summarizes the notation, symbols, definitions, and formulas for multiclass metrics based on a confusion matrix where actual classes are represented in rows.
To help explain the derivations of the first four entries in Table A1, consider applying the one-vs.-all approach [9] to construct the binary confusion matrix for a particular class $k$. The multiclass confusion matrix in Table 4 is conceptually partitioned into four mutually exclusive and collectively exhaustive regions as follows:
(I) True Positive: Instances correctly predicted as class $k$. This corresponds to the diagonal element $c_{kk}$, as confirmed in the formula on row 2 of Table A1.
(II) False Negative: Instances whose actual class is $k$ but whose predicted class is not. These occupy the remainder of row $k$,
$$FN_k \;=\; \sum_{j \ne k} c_{kj} \;=\; c_{k\cdot} - c_{kk},$$
which aligns with the second item in Table A1.
(III) False Positive: Instances predicted as class $k$ whose actual class is different. These occupy the remainder of column $k$,
$$FP_k \;=\; \sum_{i \ne k} c_{ik} \;=\; c_{\cdot k} - c_{kk},$$
confirming the formula listed in the third item in Table A1.
(IV) True Negative: Instances that neither belong to class $k$ nor are predicted as class $k$,
$$TN_k \;=\; \sum_{i \ne k} \sum_{j \ne k} c_{ij},$$
or equivalently, $TN_k = n - c_{k\cdot} - c_{\cdot k} + c_{kk}$, in accordance with the fourth item in Table A1.
This partitioning is visually represented in Table 5. Structurally, Table 5 mirrors the canonical layout of the binary confusion matrix presented in Table 3, thereby unifying the derivation logic with the one-vs.-all evaluation framework.
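As an illustration of this partition, the following sketch (with hypothetical counts) extracts the four one-vs.-all quantities for each class from a confusion matrix laid out as in Table 4.

```python
import numpy as np

def one_vs_all_counts(cm: np.ndarray, k: int):
    """Per-class TP, FP, FN, TN from a confusion matrix with actual classes
    in rows and predicted classes in columns (layout of Table 4)."""
    n = cm.sum()
    tp = cm[k, k]                  # region (I): diagonal element
    fn = cm[k, :].sum() - tp       # region (II): rest of row k
    fp = cm[:, k].sum() - tp       # region (III): rest of column k
    tn = n - tp - fn - fp          # region (IV): everything else
    return tp, fp, fn, tn

# Illustrative 3-class confusion matrix (hypothetical counts).
cm = np.array([[20, 2, 1],
               [3, 15, 4],
               [0, 5, 10]])
for k in range(cm.shape[0]):
    print(k, one_vs_all_counts(cm, k))
```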
Remark 1. It is important to note that this approach assumes class independence and does not explicitly account for inter-class correlations. The resulting metrics for each class (e.g., Precision, Recall, F1) represent marginal, class-conditional performance. That is, they reflect how well the model distinguishes class $k$ from the rest, but do not capture joint misclassification patterns between other classes (e.g., systematic confusion between classes $i$ and $j$, where $i, j \ne k$). As such, while one-vs.-all metrics are useful for diagnostic evaluation and model comparison, they should be interpreted with care in domains where class relationships are known to be structured or correlated.
In addition, item 19 in Table A1 defines Overall accuracy as the proportion of correctly classified instances across the entire dataset. Although easy to compute, this metric should be interpreted with caution in imbalanced settings. Because it weighs all instances equally, Overall accuracy can be dominated by performance on the majority class. For example, if one class accounts for 90% of the data and is predicted correctly, Overall accuracy may appear high—even if predictions for the smaller classes are poor. This can give a misleading impression of model effectiveness. To provide a more balanced view across all classes, especially those underrepresented, alternative metrics, such as macro-averaged F1, are recommended.
3.2. Analytical Properties of Metrics
A natural starting point is the relationship between confusion matrix structure and the definitions of precision and recall. The one-vs.-all approach described in [9] treats each class $k$ as the “positive” label in turn, aggregating the remaining classes into a composite “negative” group. This decomposition yields intuitive matrix-based expressions: precision reflects column-wise proportions, whereas recall reflects row-wise proportions.
Using Formulas (9) and (10) for precision and recall (listed as items 6 and 7 in Table A1), the following relationships hold:
$$P_k = \frac{TP_k}{TP_k + FP_k}, \qquad R_k = \frac{TP_k}{TP_k + FN_k}.$$
Substituting the values of TP, FN, FP, and TN from Table 5 gives the following:
$$P_k = \frac{c_{kk}}{c_{\cdot k}}, \qquad R_k = \frac{c_{kk}}{c_{k\cdot}},$$
where $c_{\cdot k}$ and $c_{k\cdot}$, defined in Equations (11) and (12), denote the sums of the $k$-th column and $k$-th row, respectively. The diagonal element $c_{kk}$ lies at the intersection of the $k$-th row and $k$-th column, motivating the following property:
Property 1. In a multiclass confusion matrix where actual classes are represented by rows, the precision for class $k$ is the ratio of the diagonal element in the $k$-th column to the sum of all elements in that column. Likewise, the recall for class $k$ is the ratio of the diagonal element in the $k$-th row to the sum of all elements in that row.
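Property 1 can be checked numerically; the sketch below (with hypothetical labels) compares the direct diagonal-over-marginal computation with scikit-learn's per-class output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2, 0])

cm = confusion_matrix(y_true, y_pred)      # actual classes in rows (Table 4 layout)

# Property 1: precision_k = diagonal / column sum, recall_k = diagonal / row sum.
precision_manual = np.diag(cm) / cm.sum(axis=0)
recall_manual = np.diag(cm) / cm.sum(axis=1)

p, r, _, _ = precision_recall_fscore_support(y_true, y_pred, average=None)
assert np.allclose(precision_manual, p) and np.allclose(recall_manual, r)
print(precision_manual, recall_manual)
```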
Average accuracy and error rate are commonly used to summarize overall classifier performance. The following property follows directly from their formulas in Table A1.
Property 2. The Average accuracy and Error rate are complementary measures whose values always sum to 1.
The next property addresses the relationship between Average accuracy and Overall accuracy, highlighting how class-level correctness relates to global predictive performance. A formal proof of their relationship is provided in Appendix B.1.
Property 3. The Average accuracy and Overall accuracy satisfy the following relationship:
$$\text{Average accuracy} = 1 - \frac{2}{r}\left(1 - \text{Overall accuracy}\right).$$
Furthermore, for any number of classes $r \ge 2$, the Average accuracy is always greater than or equal to the Overall accuracy. In particular, when $r = 2$, the Average accuracy is equal to the Overall accuracy.
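As a brief sketch of why this linear relationship holds (the full proof is given in Appendix B.1), assume the standard one-vs.-all quantities from Table 5, with Average accuracy taken as the mean of the per-class accuracies $(TP_k + TN_k)/n$; this definition is assumed here for illustration and should be checked against Table A1.
$$\sum_{k=1}^{r}\left(FP_k + FN_k\right) = \sum_{k=1}^{r}\left(c_{\cdot k} - c_{kk}\right) + \sum_{k=1}^{r}\left(c_{k\cdot} - c_{kk}\right) = 2\left(n - \sum_{k=1}^{r} c_{kk}\right) = 2n\left(1 - \text{Overall accuracy}\right),$$
so that
$$\text{Average accuracy} = \frac{1}{r}\sum_{k=1}^{r}\frac{TP_k + TN_k}{n} = \frac{1}{r}\sum_{k=1}^{r}\frac{n - FP_k - FN_k}{n} = 1 - \frac{2}{r}\left(1 - \text{Overall accuracy}\right).$$
Since $\text{Average accuracy} - \text{Overall accuracy} = \left(1 - \text{Overall accuracy}\right)\left(1 - \tfrac{2}{r}\right) \ge 0$ for $r \ge 2$, the inequality and the equality at $r = 2$ follow immediately.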
Some performance metrics in multiclass classification turn out to be identical, even though they carry different names. This happens when Micro-averaging or Weighting by class prevalence is used, and recognizing these equalities helps clarify which metrics are truly distinct.
Property 4. In multiclass classification, the Micro-averaged precision, Micro-averaged recall, Micro-averaged F1-Score and Weighted recall are all equal, and they are also equal to the Overall accuracy.
Grandini et al. [9] proved that Micro-averaged precision, recall, and F1-score are all equal to Overall accuracy. Farhadpour et al. [10] extended this equivalence to Weighted recall using conceptual analysis and empirical evidence. For completeness, a formal proof is provided in Appendix B.2.
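As an illustration, the following sketch (with hypothetical labels) checks Property 4 with scikit-learn: micro-averaged precision, recall, and F1, as well as Weighted recall, all coincide with Overall accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical multiclass labels.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2, 0])

acc = accuracy_score(y_true, y_pred)
values = [
    precision_score(y_true, y_pred, average="micro"),
    recall_score(y_true, y_pred, average="micro"),
    f1_score(y_true, y_pred, average="micro"),
    recall_score(y_true, y_pred, average="weighted"),
]
assert all(np.isclose(v, acc) for v in values)   # Property 4
print(acc, values)
```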
4. Multiclass Confusion Matrix with Actual Classes in Columns
In Section 3, the confusion matrix adopts a row-based layout in which actual classes are indexed by rows and predicted classes by columns. Consider now an alternative orientation where actual classes are indexed by columns and predicted classes by rows, corresponding to a transposed configuration. Let $c^{t}_{ij}$ denote the $(i, j)$-th entry of the new confusion matrix for the same classification problem. Then $c^{t}_{ij}$ represents the number of instances where the predicted class is $i$ and the actual class is $j$. Clearly,
$$c^{t}_{ij} = c_{ji},$$
where $c_{ij}$ is the $(i, j)$-th entry of the original confusion matrix in Table 4. This relationship confirms that the new confusion matrix is the transpose of the original matrix, as illustrated in Table 6.
In this section, the same symbols and definitions from notation Table A1 are reused, now assuming that actual classes are indexed by columns rather than rows. To distinguish these symbols from their original orientation, a superscript $t$ is added to indicate transposition. The corresponding formulas will be shown to match those in Table A1, confirming their structural equivalence under this layout.
4.1. Invariant Metrics Under Transposition
Precision and recall for each class $k$, where $1 \le k \le r$, are defined using the same one-vs.-all approach as when actual classes are placed in rows. The quantities $TP^{t}_{k}$, $FN^{t}_{k}$, $FP^{t}_{k}$, and $TN^{t}_{k}$ are derived directly from the structure of the confusion matrix shown in Table 6 as follows:
$$TP^{t}_{k} = c^{t}_{kk} = c_{kk}, \qquad FN^{t}_{k} = \sum_{i \ne k} c^{t}_{ik} = \sum_{i \ne k} c_{ki}, \qquad FP^{t}_{k} = \sum_{j \ne k} c^{t}_{kj} = \sum_{j \ne k} c_{jk}, \qquad TN^{t}_{k} = n - TP^{t}_{k} - FN^{t}_{k} - FP^{t}_{k}.$$
By comparing Equations (17)–(20) with the first four rows (excluding the header) in Table A1, it can be seen that
$$TP^{t}_{k} = TP_{k}, \qquad FN^{t}_{k} = FN_{k}, \qquad FP^{t}_{k} = FP_{k}, \qquad TN^{t}_{k} = TN_{k},$$
which confirms that these four quantities remain unchanged under the transposed matrix orientation.
The support $s^{t}_{k}$ for class $k$, defined as the total number of true instances of that class, corresponds to the sum of the entries in column $k$ of Table 6. That is,
$$s^{t}_{k} = \sum_{i=1}^{r} c^{t}_{ik} = \sum_{i=1}^{r} c_{ki} = c_{k\cdot},$$
which matches the original definition, so $s^{t}_{k} = s_{k}$. Thus, the first five metrics in Table A1 are unaffected by matrix transposition.
The next theorem shows that the remaining metrics in Table A1 also remain unchanged under the transposed matrix orientation. Its formal proof is provided in Appendix C.
Theorem 1. (Metric Invariance under Transposition).
All classification metrics introduced in Section 3 remain unchanged when the confusion matrix is transposed (i.e., when actual classes are placed in columns instead of rows). These invariant metrics include the following:
- (1) Per-class precision, recall, and F1-score;
- (2) Macro-averaged precision, recall, and F1-score;
- (3) Micro-averaged precision, recall, and F1-score;
- (4) Weighted precision, recall, and F1-score;
- (5) Overall accuracy, Average accuracy, and Error rate.
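The following sketch, using hypothetical counts, illustrates Theorem 1 numerically: metrics computed from a row-based confusion matrix and from its transpose (with rows and columns reinterpreted accordingly) coincide.

```python
import numpy as np

def macro_metrics(cm, actual_in_rows=True):
    """Macro precision, macro recall, and overall accuracy.
    If actual_in_rows is False, the matrix is interpreted as transposed
    (actual classes in columns, predicted classes in rows)."""
    if not actual_in_rows:
        cm = cm.T                                 # restore the row-based orientation
    precision = np.diag(cm) / cm.sum(axis=0)      # diagonal over column sums
    recall = np.diag(cm) / cm.sum(axis=1)         # diagonal over row sums
    return precision.mean(), recall.mean(), np.trace(cm) / cm.sum()

cm = np.array([[20, 2, 1],
               [3, 15, 4],
               [0, 5, 10]])                        # hypothetical counts, actual in rows

assert np.allclose(macro_metrics(cm, True), macro_metrics(cm.T, False))   # Theorem 1
print(macro_metrics(cm, True))
```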
4.2. Preserved Analytical Properties Under Transposition
Analogous to Property 1, the following property follows from Equations (17)–(20).
Property 5. In a multiclass confusion matrix where actual classes are represented by columns, the precision for class $k$ is the ratio of the diagonal element in the $k$-th row to the sum of all elements in that row. Likewise, the recall for class $k$ is the ratio of the diagonal element in the $k$-th column to the sum of all elements in that column.
Since all metrics remain invariant under transposition as established in Theorem 1, the results presented in Properties 2 through 4 continue to hold. These findings affirm that both the definitions of evaluation metrics and the relationships among them are structurally consistent and unaffected by the orientation of the confusion matrix.
5. Permutation: Order of Classes
Let $\pi$ be one of the $(r! - 1)$ nonidentity permutations of the set $\{1, 2, \ldots, r\}$. The classes $1, 2, \ldots, r$ are reordered as $\pi(1), \pi(2), \ldots, \pi(r)$, so that classes $\pi(1), \pi(2), \ldots, \pi(k), \ldots, \pi(r)$ become the first, second, …, $k$-th, …, $r$-th classes, respectively. The corresponding confusion matrix, updated to reflect this new class order, is transformed from Table 4 (based on the original order $1, 2, \ldots, r$) to Table 7 shown below. As before, the classes appear in the same order across both rows and columns of the confusion matrix. Here, the convention of placing actual classes along the rows is adopted. This choice is without loss of generality, as Section 4 has already demonstrated that all relevant metrics remain unchanged when actual and predicted classes are interchanged.
To illustrate Table 7, a confusion matrix with three classes is shown in Table 8a below, which is indeed extracted from Section 6.1. Under the original class order (1, 2, 3), each entry $c_{ij}$ indicates the number of instances whose true class is $i$ and whose predicted class is $j$. In accordance with Table 5, the corresponding count values can be read directly from the entries of Table 8a.
A label permutation (1, 2, 3) → (3, 1, 2) is then applied to both the actual and predicted class indices. The confusion matrix is reorganized accordingly: actual classes are placed in row order as 3, 1, 2, and predicted classes in column order as 3, 1, 2. Under this relabeling, each entry $c^{\pi}_{ij}$ in Table 8b corresponds to the count of instances from actual class $\pi(i)$ (after permutation) predicted as class $\pi(j)$ (also permuted). For instance, the entry at position (1, 2) corresponds to actual class 3 and predicted class 1, so it contains $c_{31}$.
5.1. Invariant Metrics Under Permutation
Using Table A1, for classes ordered as $\pi(1), \pi(2), \ldots, \pi(r)$, the following identities corresponding to the first four entries are obtained immediately for any $k$ such that $1 \le k \le r$:
$$TP^{\pi}_{k} = c^{\pi}_{kk}, \qquad FN^{\pi}_{k} = \sum_{j \ne k} c^{\pi}_{kj}, \qquad FP^{\pi}_{k} = \sum_{i \ne k} c^{\pi}_{ik}, \qquad TN^{\pi}_{k} = n - TP^{\pi}_{k} - FN^{\pi}_{k} - FP^{\pi}_{k}.$$
In the permuted confusion matrix (Table 7), the $k$-th class corresponds to the $\pi(k)$-th class in the original confusion matrix (Table 4). Accordingly, the following invariance identities hold for all $1 \le k \le r$:
$$TP^{\pi}_{k} = TP_{\pi(k)}, \qquad FN^{\pi}_{k} = FN_{\pi(k)}, \qquad FP^{\pi}_{k} = FP_{\pi(k)}, \qquad TN^{\pi}_{k} = TN_{\pi(k)}.$$
As for the support of the $k$-th class in the permuted confusion matrix, it is equal to the support of class $\pi(k)$ in the original confusion matrix.
The next theorem shows that the remaining metrics in Table A1 also remain unchanged under permutation. Its formal proof can be found in Appendix D.
Theorem 2. (Metric Invariance under Permutation).
Let Table 7 be a multiclass confusion matrix with class labels reordered as $\pi(1), \pi(2), \ldots, \pi(r)$, where $\pi$ is a permutation of the original class order in Table 4. Then, all the following classification metrics computed from Table 7 are identical to those computed from Table 4 with the original class order:
- (1) Per-class precision, recall, and F1-score;
- (2) Macro-averaged precision, recall, and F1-score;
- (3) Micro-averaged precision, recall, and F1-score;
- (4) Weighted precision, recall, and F1-score;
- (5) Overall accuracy, Average accuracy, and Error rate.
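A small numerical check of Theorem 2 (hypothetical counts, with the relabeling (1, 2, 3) → (3, 1, 2) from Section 5): permuting rows and columns of the confusion matrix by the same reordering leaves the aggregate metrics unchanged.

```python
import numpy as np

def macro_metrics(cm):
    """Macro precision, macro recall, and overall accuracy
    (actual classes in rows, predicted classes in columns)."""
    precision = np.diag(cm) / cm.sum(axis=0)
    recall = np.diag(cm) / cm.sum(axis=1)
    return precision.mean(), recall.mean(), np.trace(cm) / cm.sum()

cm = np.array([[20, 2, 1],
               [3, 15, 4],
               [0, 5, 10]])          # original class order (1, 2, 3), hypothetical counts

perm = [2, 0, 1]                     # class 3 first, then 1, then 2 (0-indexed)
cm_perm = cm[np.ix_(perm, perm)]     # reorder rows and columns consistently

assert np.allclose(macro_metrics(cm), macro_metrics(cm_perm))   # Theorem 2
print(macro_metrics(cm))
```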
5.2. Preserved Analytical Properties Under Permutation
Analogous to Property 1, the following property follows from Equations (21)–(24).
Property 6. Let Table 7 be a multiclass confusion matrix with class labels reordered as $\pi(1), \pi(2), \ldots, \pi(r)$, where $\pi$ is a permutation of the original class order in Table 4. Then the precision for class $k$ is the ratio of the diagonal element in the $k$-th column to the sum of all elements in that column. Likewise, the recall for class $k$ is the ratio of the diagonal element in the $k$-th row to the sum of all elements in that row.
Average accuracy and Error rate are commonly used to summarize overall classifier performance. The following property follows directly from their formulas in Table A1.
Property 7. Let Table 7 be a multiclass confusion matrix with class labels reordered as $\pi(1), \pi(2), \ldots, \pi(r)$, where $\pi$ is a permutation of the original class order in Table 4. Then Average accuracy and Error rate are complementary metrics whose values always sum to 1.
From Part (5) of Theorem 2 and Property 3, the following property holds:
Property 8. Let Table 7 be a multiclass confusion matrix with class labels reordered as $\pi(1), \pi(2), \ldots, \pi(r)$, where $\pi$ is a permutation of the original class order in Table 4. Then, the Average accuracy and Overall accuracy satisfy the relationship $\text{Average accuracy} = 1 - \frac{2}{r}\left(1 - \text{Overall accuracy}\right)$. Moreover, for any number of classes $r \ge 2$, the Average accuracy is always greater than or equal to the Overall accuracy. In particular, when $r = 2$, the Average accuracy is equal to the Overall accuracy.
Applying Theorem 2 and Property 4, the following holds:
Property 9. Let Table 7 be a multiclass confusion matrix with class labels reordered as $\pi(1), \pi(2), \ldots, \pi(r)$, where $\pi$ is a permutation of the original class order in Table 4. Then the Micro-averaged precision, Micro-averaged recall, Micro-averaged F1-score, and Weighted recall are all equal, and they are also equal to the Overall accuracy.
6. Numerical Validation
To assess the analytical properties of multiclass evaluation metrics—specifically their invariance under confusion matrix transposition and class label permutation—numerical experiments using real-world datasets are performed. For implementation clarity and practical adoption, refer to the pseudocode in Appendix E.
6.1. Basic Metric Validation on the UCI Wine Dataset
The Wine dataset from the UCI Machine Learning Repository is used, which is accessible via sklearn.datasets.load_wine(). This dataset contains 178 samples of wine, each described by 13 physicochemical features, and categorized into three classes: Class 1, Class 2, and Class 3. In the dataset, these are encoded as the numeric labels 0, 1, and 2, respectively, which serve as the target variable for multiclass classification. This structure makes the dataset particularly suitable for validating the analytical properties of multiclass evaluation metrics, including their invariance under confusion matrix transposition and class label permutation.
For this study, only one feature is selected—‘Alcohol’—to construct a minimal yet valid multiclass classification problem. This ensures that our focus remains on the validation of evaluation metric behavior rather than on achieving high classification accuracy.
A Random Forest classifier using RandomForestClassifier(random_state = 42) is trained to ensure reproducibility. The dataset is partitioned into training and testing subsets using a 70–30% split via train_test_split, also with random_state = 42, guaranteeing consistent and replicable results. All performance metrics, including the confusion matrix, are computed on the held-out test set, which constitutes 30% of the data, to enable an unbiased evaluation of the model’s predictive capability.
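The following sketch reproduces this setup; it follows the configuration described above (single 'Alcohol' feature, 70–30% split, random_state = 42), although details such as the feature's column index are implementation assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = load_wine()
X = data.data[:, [0]]            # 'alcohol' is the first feature column
y = data.target                  # classes encoded as 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)   # actual classes in rows, predicted in columns
print(cm)
```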
For the original confusion matrix, shown in Table 9, classification performance is assessed using a comprehensive set of evaluation metrics. These include precision, recall, and F1-score, calculated via the precision_recall_fscore_support function from sklearn.metrics, under four distinct averaging strategies:
Per-class metrics (average = None): Computes per-class precision, recall, and F1. The results are in Table 10. To conserve space, all numerical values in the tables are rounded and displayed to five decimal places.
Macro-average (average = ‘macro’): Computes the unweighted mean of the per-class precision, recall, and F1 scores; this corresponds to the Macro F1 as defined in item 12 of Table A1. In addition, the harmonic mean of the macro precision and macro recall, given by item 11 in Table A1, is computed and reported as Macro F1 (MH) in the final row of Table 11.
Micro-average (average = ‘micro’): Computes precision, recall, and F1 by globally aggregating the counts of true positives, false positives, and false negatives across all classes.
Weighted-average (average = ‘weighted’): Computes the average of per-class metrics weighted by the class support (i.e., number of true instances per class).
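The four calls might look as follows in code; the labels below are hypothetical placeholders for y_test and y_pred from the setup above.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical placeholders for the held-out labels and predictions.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 1, 2, 2, 0])

for avg in (None, "macro", "micro", "weighted"):
    p, r, f1, support = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(avg, p, r, f1, support)

# Macro F1 (MH): harmonic mean of macro precision and macro recall.
p_m, r_m, _, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print("Macro F1 (MH):", 2 * p_m * r_m / (p_m + r_m))
```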
In addition to the above, aggregate metrics including Overall accuracy, Average accuracy, and Error rate are computed.
To verify the theoretical invariance of metrics under permutation, a fixed permutation of class labels (0, 1, 2) → (2, 0, 1) is applied and all calculations on the permuted labels are repeated. The per-class metrics are shown in Table 10, and the resulting aggregate metrics are shown in the Permuted column of Table 11.
Regarding transposition, where the confusion matrix is transposed (i.e., actual classes are represented as columns and predicted classes as rows), all metrics are computed via a dedicated Python function (Python 3.13.6) that correctly interprets precision, recall, and class support under the new orientation.
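The authors' exact implementation is not reproduced here; a minimal sketch of such a function, assuming the transposed layout of Table 6, could look as follows.

```python
import numpy as np

def metrics_from_transposed_cm(cm_t: np.ndarray):
    """Per-class precision, recall, and support for a confusion matrix with
    actual classes in columns and predicted classes in rows (Table 6 layout)."""
    precision = np.diag(cm_t) / cm_t.sum(axis=1)   # row sums = predicted totals
    recall = np.diag(cm_t) / cm_t.sum(axis=0)      # column sums = actual totals
    support = cm_t.sum(axis=0)                     # true instances per class
    return precision, recall, support

# Example: transpose a row-based matrix (hypothetical counts) and recover the
# same per-class precision and recall as in the original orientation.
cm = np.array([[20, 2, 1],
               [3, 15, 4],
               [0, 5, 10]])
print(metrics_from_transposed_cm(cm.T))
```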
6.1.1. Invariant Metrics
As shown in Table 10 and Table 11, the per-class metrics and the aggregate results for the Original, Permuted, and Transposed confusion matrices are numerically identical across all metrics, thereby empirically validating the analytical invariance properties established in previous sections.
6.1.2. Analytical Properties
For all Original, Permuted, and Transposed cases, the per-class precision and recall were independently computed directly from the confusion matrix by dividing the main diagonal elements by the corresponding column and row sums, respectively. These manual calculations yielded results identical to those produced by the precision_recall_fscore_support function and are consistent with the values reported in Table 10, further validating their correctness.
In addition, Table 11 illustrates that Overall accuracy and Error rate are complementary; specifically, their sum is close to 1 in all cases. The analytical identity
$$\text{Average accuracy} = 1 - \frac{2}{r}\left(1 - \text{Overall accuracy}\right)$$
was also validated, as established in the earlier theoretical analysis. This identity held exactly in all three settings—Original, Permuted, and Transposed—thus empirically supporting the derived formula and confirming the robustness of these metrics under structural transformations of the confusion matrix.
The numerical validation presented in Section 6.1.1 and Section 6.1.2 confirms that the proposed multiclass evaluation metrics maintain both analytical and structural invariance under matrix transposition and label permutation. This indicates that the metrics remain stable and reliable, even when the data is reorganized. These findings support the theoretical framework outlined in Section 4.1 and Section 5.1 and demonstrate that the metrics are well-suited for classification settings where label order may vary or structural ambiguity may occur.
6.2. Robustness Validation Using CIFAR-10
To extend the scope of numerical validation, robustness experiments were conducted using the widely studied CIFAR-10 benchmark dataset. CIFAR-10 comprises 60,000 color images of size 32 × 32, evenly distributed across 10 mutually exclusive classes. The dataset is split into 50,000 training images and 10,000 test images, with each class containing exactly 1000 test instances—yielding a perfectly balanced multiclass structure ideal for evaluating the stability of performance metrics under structural transformations.
The model was trained using a Convolutional Neural Network (CNN) architecture comprising three convolutional layers with 3 × 3 filters and ReLU activation functions, interleaved with 2 × 2 max-pooling layers to reduce spatial dimensionality and enhance feature abstraction. This feature extraction pipeline is followed by a fully connected layer with 64 hidden units and a final dense layer with softmax activation over 10 output classes. The model was trained for 10 epochs using the Adam optimizer and sparse categorical cross-entropy loss.
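A sketch of this architecture in Keras is given below; hyperparameters not stated in the text (such as the number of filters in each convolutional layer) are assumptions and may differ from the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# CIFAR-10: 50,000 training and 10,000 test images, 10 classes.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Three 3x3 convolutional layers with ReLU, interleaved with 2x2 max pooling,
# followed by a 64-unit dense layer and a 10-way softmax output.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # filter counts are assumed
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```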
To assess the robustness of multiclass metrics, four key evaluation measures were computed directly from the confusion matrix: Overall accuracy, Macro-averaged precision, Macro-averaged recall, and Weighted precision.
We then applied two representative permutations of class labels:
Permutation A: [2, 0, 1, 3, 4, 5, 6, 7, 8, 9].
Permutation B: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0].
For each configuration—original and permuted—the four metrics listed above were computed. All metric values remained numerically unchanged across permutations.
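A sketch of this check is shown below; it applies the same relabeling to both the true and predicted label vectors before recomputing the four metrics. The labels here are random placeholders standing in for the CIFAR-10 test labels and the CNN's predictions, and the convention that original class c is renamed perm[c] is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def remap(labels, perm):
    """Rename classes: original class c becomes perm[c]."""
    return np.asarray(perm)[np.asarray(labels)]

def core_metrics(y_true, y_pred):
    return (accuracy_score(y_true, y_pred),
            precision_score(y_true, y_pred, average="macro", zero_division=0),
            recall_score(y_true, y_pred, average="macro", zero_division=0),
            precision_score(y_true, y_pred, average="weighted", zero_division=0))

# Random placeholders for the CIFAR-10 test labels and CNN predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=1000)
y_pred = np.where(rng.random(1000) < 0.7, y_true, rng.integers(0, 10, size=1000))

perm_A = [2, 0, 1, 3, 4, 5, 6, 7, 8, 9]
perm_B = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

base = core_metrics(y_true, y_pred)
for perm in (perm_A, perm_B):
    assert np.allclose(base, core_metrics(remap(y_true, perm), remap(y_pred, perm)))
print(base)
```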
These results confirm the empirical validity of our analytical claims: confusion-matrix-derived metrics are invariant under class label permutation, even when applied to real-world classification models. Moreover, by focusing on a minimal and theoretically sufficient set of metrics, the framework reinforces clarity, robustness, and reproducibility.
7. Conclusions
This paper establishes that multiclass evaluation metrics derived from the confusion matrix are structurally invariant under both matrix transposition and class label permutation. These findings are supported by algebraic proofs and empirical validation.
A key insight from this work is that not all aggregate metrics need to be computed separately. Specifically, all 12 aggregate metrics can be determined using just four core metrics: Overall accuracy, Macro-averaged precision, Macro-averaged recall, and Weighted precision. The remaining eight can be calculated directly from these four (see Appendix F).
This unified framework helps streamline evaluation, reduce redundancy, and enhance reproducibility in multiclass classification tasks by establishing a small set of core metrics from which others can be systematically derived.
A limitation of the current framework is its focus on metric invariance under matrix transposition and class label permutation within single-label multiclass classification. However, real-world classification tasks often involve additional challenges.
In class-imbalanced settings, metrics, such as macro precision and recall, may disproportionately reflect performance on majority classes, while minority classes may be underrepresented. In multilabel classification, label co-occurrence and partial relevance complicate both the structure of the confusion matrix and the interpretation of aggregate metrics. Label noise—particularly when correlated across classes—can distort precision and recall estimates and may require correction strategies to maintain reliability. Hierarchical classification further increases complexity, as evaluation must account for semantic relationships and tree-based label dependencies.
Extending the current invariance results to these domains would require redefinition of the confusion matrix structure and averaging schemes. The present framework offers a formal foundation that may serve as a basis for generalizing to more complex classification frameworks.