Article

Logitwise Distillation Network: Improving Knowledge Distillation via Introducing Sample Confidence

Hebei Province Machine Vision Engineering Research Center, School of Cyber Security and Computer, Hebei University, Baoding 071002, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2285; https://doi.org/10.3390/app15052285
Submission received: 17 December 2024 / Revised: 12 February 2025 / Accepted: 18 February 2025 / Published: 20 February 2025

Abstract
Existing knowledge distillation (KD) methods typically force students to mimic teacher features without considering prediction reliability, a practice that risks propagating the teacher's erroneous supervision to the student. To address this, we propose the Logitwise Distillation Network (LDN), a novel framework that dynamically quantifies sample-wise confidence through the ranking of ground truth labels in teacher logits. Specifically, LDN introduces three key innovations: (1) weighted class means that prioritize high-confidence samples, (2) adaptive feature selection based on logit ranking, and (3) positive–negative sample adjustment (PNSA) to reverse error-prone supervision. These components are unified into a feature direction (FD) loss, which guides students to selectively emulate trustworthy teacher features. Experiments on CIFAR-100 and ImageNet demonstrate that LDN achieves state-of-the-art performance, improving accuracy by 0.3–0.5% over SOTA methods. Notably, LDN exhibits stronger compatibility with homogeneous networks (2.4% gain over baselines) and requires no additional training costs when integrated into existing KD pipelines. This work advances feature distillation by addressing error propagation, offering a plug-and-play solution for reliable knowledge transfer.

1. Introduction

Deep neural networks (DNNs) have found a wide range of applications in computer vision over recent decades. However, powerful networks typically benefit from large model capacities, which introduce high computational and storage costs [1,2,3]. One potential remedy is knowledge distillation (KD). The concept of KD was first proposed by Hinton et al. [4]; it covers a range of methods for transferring knowledge from a heavy model (the teacher) to a light model (the student), improving the performance of the light model without introducing additional costs.
Generally, KD can be classified into two categories: logit-based distillation [5,6,7,8,9,10,11,12] and feature-based distillation [13,14,15,16,17,18,19,20,21,22]. Logit-based distillation obtains the student model by minimizing the KL divergence between the logit layers of the student and teacher networks on the training data. Most logit-based distillation methods focus on regularization or optimization. Zhang et al. [7] proposed a mutual learning approach that trains the student and the teacher simultaneously. TAKD [8], proposed by Mirzadeh et al., introduces a medium-sized network called the "teacher assistant" that serves as a bridge between the teacher and the student. Huang et al. [9] proposed a correlation-based loss to explicitly capture the intrinsic inter-class relationships of the teacher. BookKD [10], proposed by Zhu et al., decouples knowledge distillation into "book-making" and "book-learning", significantly reducing the resource consumption of distillation. Other works focus on explaining logit distillation; DKD [11] decouples the KL loss into separate, meaningful parts. Logit-based distillation is simple, easy to use, and significantly improves performance. However, it may not fully exploit the potential of the teacher model, since only limited information is conveyed through the logits; for instance, features from other layers of the teacher network remain underutilized.
Compared to logit-based methods, feature-based methods align features in the intermediate layers, extracting knowledge from the intermediate feature maps with more diversity, flexibility, and selectivity. Most feature-based distillation methods [13,14,16,17] train the student model by encouraging similarity between the student's intermediate-layer features and the teacher's, for example by minimizing the $L_2$ distance between them. Chen et al. [13] investigated the cross-level factors of the connection paths between the teacher and student networks and revealed their importance. KD++ [18], proposed by Wang et al., introduces a correlation loss that encourages the student to generate features with large norms and directions consistent with the teacher's class means. Liu et al. proposed an adaptive multi-teacher multi-level knowledge distillation framework (AMTML-KD) [19] that allows the student to learn multi-level knowledge from multiple teachers. MulKD [20], proposed by Guermazi et al., uses sequential and parallel learning to gradually generate more suitable teacher models, thereby promoting the transfer of knowledge to the student. Most feature-based methods achieve better performance (significantly higher than logit-based methods), but forcing the student to generate features similar to the teacher's may not correctly extract the teacher's knowledge: if the student simply imitates the teacher, producing similar predictions or features, it may also make mistakes similar to those of the teacher.
To reduce the incorrect supervision caused by teacher prediction errors, we propose the Logitwise Distillation Network (LDN). The ranking of the ground truth in the logits reflects the confidence of the teacher model's intermediate-layer features for that sample. By incorporating this confidence as weights into feature-based distillation, the impact of teacher mispredictions on the student model is greatly reduced.
KD networks usually conduct knowledge distillation layer by layer, encouraging student features or logits to mimic the teacher's features or logits at the same layer as closely as possible. In LDN, by contrast, the credibility of the teacher features is extracted from the logit information of each sample and incorporated as weights into the feature-based distillation process. We also propose a simple and effective loss term called the FD loss, which applies LDN to the layer before the logit layer, i.e., the embedding layer. Unlike conventional feature-based distillation methods, LDN incorporates the ranking of the ground truth in the logits as weights at different steps of feature distillation. LDN is reflected in three components: weighted class means, feature selection, and positive and negative sample adjustment (PNSA). It effectively utilizes the sample confidence information in the logits to guide the distillation of the feature layers, thereby minimizing the impact of incorrect supervision caused by teacher prediction errors.
Our principal contributions are threefold:
  • Dynamic Sample Confidence Mechanism: Unlike prior works that crudely filter samples via binary correctness labels (e.g., Meng et al. [23]), LDN quantifies continuous confidence through ground truth rankings in teacher logits. This allows for the fine-grained weighting of samples, effectively suppressing error propagation while retaining nuanced supervision from partially correct predictions.
  • Multi-Stage Error Correction Framework: We propose the first synergistic integration of (i) weighted class means to amplify high-confidence prototypes, (ii) adaptive feature selection to toggle between raw features and class means, and (iii) PNSA to invert supervision for error-prone samples. This tripartite design systematically addresses error accumulation at feature, sample, and distribution levels.
  • Plug-and-Play Generalizability: FD loss requires no architectural modifications or additional training phases. When integrated into SOTA methods (e.g., DKD, ReviewKD), it consistently boosts accuracy by 0.3–2.4% across both homogeneous (ResNet families) and heterogeneous (ResNet→MobileNet) architectures, demonstrating unprecedented compatibility.
In the following sections, we present a comprehensive analysis and evaluation of the proposed method. In Section 2, we review related work, compare existing methods, and highlight the unique features of LDN. In Section 3, we describe the theoretical foundations and key concepts underlying our method. Section 4 presents the results and analysis of our experiments, demonstrating the effectiveness and performance of our method. Finally, we conclude the paper in Section 5, summarizing our contributions and outlining directions for future research.

2. Related Work

Knowledge distillation (KD) has evolved along two primary axes: logit-based and feature-based methods, with recent advancements incorporating sample selection strategies to mitigate error propagation. We systematically contextualize our work within these categories.

2.1. Logit-Based Distillation

Pioneered by Hinton et al. [4], logit distillation minimizes the KL divergence between teacher and student logits. Subsequent works enhance this paradigm through regularization (e.g., DKD [11] decouples target/non-target class contributions) or architectural innovations (e.g., TAKD [8] introduces teacher assistants). While effective, these methods inherently limit knowledge transfer to the logit layer, disregarding richer intermediate features and dynamic sample-wise reliability.

2.2. Feature-Based Distillation

Feature alignment methods, such as FitNet [24] and ReviewKD [13], align intermediate layer features to exploit the teacher’s representational hierarchy. For instance, ReviewKD [13] aggregates multi-level features via cross-layer connections, while KD++ [18] enforces directional consistency between student features and class-wise prototypes. However, these approaches enforce rigid feature mimicry, indiscriminately propagating the teacher’s errors. Our work diverges by integrating confidence-aware weighting into feature distillation, enabling students to discern and reject unreliable supervision.

2.3. Sample Selection Strategies

Recent efforts address error propagation by filtering low-confidence samples. Meng et al. [23] propose binary masking, where students learn only from correctly predicted samples. While effective, this binary thresholding discards partially informative samples (e.g., near-correct predictions). In contrast, LDN leverages continuous confidence scores derived from logit ranking, allowing nuanced reweighting of samples based on their reliability. This aligns with the work of Huang et al. [9], who emphasize relational consistency, but extends it by explicitly quantifying per-sample trustworthiness.

2.4. Positioning Our Work

LDN uniquely bridges feature distillation and sample selection. Unlike prior methods that treat these aspects independently, we synergize (1) logit-derived confidence weights, (2) adaptive feature alignment, and (3) error correction via PNSA. This tripartite framework mitigates error propagation while preserving fine-grained knowledge transfer, advancing beyond the static or binary paradigms of existing works.

3. Method

In this section, we present our network, the LDN. First, we describe the notation and the KD background. Then, we investigate an undesirable phenomenon in KD-based methods and introduce the corresponding solution, the Logitwise Distillation Network (LDN). Finally, we present the principles and algorithm of the FD loss.

3.1. Notations and Background

Generally speaking, we regard a classification neural network as two modules: a feature extractor and a classifier. For a given input sample $x$, we use $f^t$ to denote its embedded features in the teacher model and $z^t$ to denote the teacher logits. Similarly, $f^s$ and $z^s$ denote the embedded features and logits of the student model. The final softmax score is represented by the vector $q^t = \mathrm{softmax}(z^t; \tau)$, where $\tau$ is a temperature parameter.
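As a small illustration of this notation, the temperature-scaled softmax can be computed as in the PyTorch sketch below (the style used in Algorithm 1 later in this section; the batch size, class count, and temperature value here are placeholders of our choosing):

```python
import torch
import torch.nn.functional as F

# z_t: teacher logits of shape (N, C); tau: temperature parameter
z_t = torch.randn(4, 100)
tau = 4.0
q_t = F.softmax(z_t / tau, dim=1)  # softened teacher probabilities q^t
```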
The final loss function of KD can be written as Equation (1) [4]:
$$L = L_{ce} + \alpha L_{kd} \tag{1}$$
where $L_{kd}$ depends on the choice of distillation (logit distillation or feature distillation), and $\alpha$ is the weight of $L_{kd}$.
Logit distillation trains the student model by utilizing both the cross-entropy loss $L_{ce}$ and the knowledge distillation loss $L_{kd}$, enhancing the performance of the student through the transfer of teacher knowledge. The seminal work on KD [4] utilizes the KL divergence as the knowledge distillation loss, represented as Equation (2):
$$L_{kd} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{KL}\left(q_i^t, q_i^s\right) \tag{2}$$
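A minimal PyTorch sketch of this loss is given below. The default temperature and the $\tau^2$ rescaling are common conventions rather than part of Equation (2), and the function name is ours:

```python
import torch.nn.functional as F

def kd_loss(z_s, z_t, tau=4.0):
    """KL divergence between softened teacher and student distributions (Equation (2))."""
    log_p_s = F.log_softmax(z_s / tau, dim=1)
    p_t = F.softmax(z_t / tau, dim=1)
    # 'batchmean' gives the 1/N average over samples; the tau**2 factor keeps
    # gradient magnitudes comparable across temperatures (a common convention)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (tau ** 2)
```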
Feature distillation extracts teacher knowledge by minimizing the differences between intermediate features before the logits. The $L_2$ distance between features is a commonly used loss term in feature distillation, represented as Equation (3):
$$L_{kd} = \frac{1}{N}\sum_{i=1}^{N} L_2\left(f_i^t, f_i^s\right) \tag{3}$$
A class mean is defined as the average feature vector of the samples belonging to a specific class in the dataset; it represents the central tendency, or prototype, of the features within that class. The class mean of the teacher feature vectors is represented as Equation (4):
$$c_k^t = \frac{1}{N}\sum_{i=1}^{N} f_i^t \tag{4}$$
where $k$ denotes the class, $N$ the number of samples in that class, and $f_i^t$ the teacher's feature vector for the $i$-th sample.
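In practice, the class means can be accumulated over a batch (or the whole training set) of teacher features, as in this minimal sketch (tensor and function names are ours):

```python
import torch

def class_means(feats_t, labels, num_classes):
    """Per-class mean of teacher feature vectors (Equation (4))."""
    d = feats_t.size(1)
    sums = torch.zeros(num_classes, d, device=feats_t.device)
    counts = torch.zeros(num_classes, device=feats_t.device)
    sums.index_add_(0, labels, feats_t)                                    # per-class feature sums
    counts.index_add_(0, labels, torch.ones_like(labels, dtype=feats_t.dtype))  # per-class sample counts
    return sums / counts.clamp(min=1).unsqueeze(1)
```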
In typical knowledge distillation methods, the student model assigns the same weight to different samples when learning from the teacher. This is unreasonable, because for samples that the teacher predicts correctly, the teacher's feature information usually carries higher confidence. Meng et al. propose a simple method [23]: let the student model selectively learn from the teacher. If the teacher predicts the ground truth, the student learns from the teacher; otherwise, it learns only from the ground truth, as shown in Equation (5):
$$L_{kd} = \frac{1}{N}\sum_{i=1}^{N} a_i\, L_2\left(f_i^t, f_i^s\right) \tag{5}$$
where $a_i$ is 1 when the teacher's prediction for sample $i$ is correct and 0 otherwise.
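For concreteness, a minimal PyTorch sketch of this conditional scheme follows (our illustration of Equation (5); the squared $L_2$ distance is used for simplicity, and tensor names are ours):

```python
import torch

def masked_feature_kd(f_s, f_t, z_t, y):
    """Feature L2 distillation restricted to correctly predicted samples (Equation (5))."""
    a = (z_t.argmax(dim=1) == y).float()        # a_i = 1 iff the teacher's top-1 prediction is correct
    per_sample = ((f_s - f_t) ** 2).sum(dim=1)  # squared L2 distance per sample
    return (a * per_sample).mean()
```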

3.2. Logitwise Distillation Network

In KD and KD-based approaches, the student model is trained under the predictions of the teacher model, regardless of whether they are correct. Clearly, when the teacher makes mistakes, it is difficult for the student to correct the erroneous information on its own. Although Meng's method is simple and effective, it does not differentiate the confidence levels of different samples; it simply divides them into trustworthy and untrustworthy categories. However, we find that the ranking of the ground truth in the teacher logits accurately reflects the credibility of the teacher model's prediction for each sample: a higher ranking indicates greater credibility. Based on this observation, we propose the Logitwise Distillation Network (LDN). As shown in Figure 1, LDN introduces logit-ranking information in three places (weighted class means, feature selection, and PNSA) to help the student model avoid incorrect supervision from the teacher's erroneous predictions.
The class mean can serve as a strong classifier, but misclassified samples obviously reduce its reliability. Therefore, when calculating the class mean, logit information is introduced and each sample is assigned a weight based on the ranking of the ground truth in its logits. We call the result the weighted class mean. The weighted class mean of the teacher feature vectors is represented as Equations (6) and (7):
$$c_k^t = \frac{1}{N}\sum_{i=1}^{N} b_i\, f_i^t \tag{6}$$
$$b_i = \mathrm{softmax}\!\left(\frac{\mathrm{rank}(y_i)}{\tau}\right) \tag{7}$$
where $b_i$ is the confidence weight of sample $i$, derived from the ranking of its ground-truth label in the teacher's logits. Higher rankings (e.g., top-1) yield larger $b_i$, reflecting stronger confidence. To avoid extreme weighting, we normalize the weights using a temperature-scaled softmax over all samples within a class.
The ranking of the ground truth label in teacher logits correlates with feature discriminability. Higher rankings indicate that the teacher’s features for this sample are closer to the class centroid in the embedding space. Thus, weighting by b effectively emphasizes samples with well-clustered features, reducing noise from ambiguous or misclassified instances.
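A minimal sketch of the weighted class mean under our reading of Equations (6) and (7) is shown below. The mapping of better ranks (smaller rank index) to larger scores via a sign flip, the default temperature (τ = 2 from Section 4.1.4), and the omission of the extra 1/N factor (the softmax weights already sum to one) are our assumptions:

```python
import torch
import torch.nn.functional as F

def weighted_class_mean(feats_t, z_t, y, cls, tau=2.0):
    """Weighted mean of teacher features for class `cls` (Equations (6)-(7))."""
    idx = (y == cls).nonzero(as_tuple=True)[0]
    feats, logits = feats_t[idx], z_t[idx]
    # position of the ground-truth class in each sample's sorted logits (0 = top-1)
    rank = (logits > logits[:, cls].unsqueeze(1)).sum(dim=1).float()
    # better-ranked samples receive larger weights; the softmax over the
    # class's samples keeps the weights normalized
    b = F.softmax(-rank / tau, dim=0)
    return (b.unsqueeze(1) * feats).sum(dim=0)
```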
In LDN, weighted class means are used in place of teacher features to regularize student features, which helps the student model achieve higher classification accuracy. For the majority of samples, regularizing student features with the weighted class mean rather than the original teacher features has a positive effect on the student model. However, for samples where the teacher's prediction deviates significantly, i.e., samples whose ground truth is ranked low in the logits, opting for the weighted class mean may instead have a negative effect. Therefore, LDN adopts a flexible approach: student features are regularized with either the weighted class mean of the teacher features or the original teacher features, chosen according to the ranking of the sample's ground truth in the teacher logits. When the ground truth is ranked high, replacing the teacher features with the weighted class mean helps the student learn the typical features of that category; when it is ranked low, using the original teacher features yields better results. We refer to this part as feature selection.
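A sketch of this selection rule, under the assumption that "higher ranking" means the ground truth appears in the teacher's top-k predictions (Algorithm 1 below uses k = 5); `class_means` is assumed to stack the weighted class means row by row:

```python
import torch

def select_targets(z_t, f_t, class_means, y, k=5):
    """Choose the regularization target per sample: weighted class mean or raw teacher feature."""
    topk = z_t.topk(k, dim=1).indices                  # (N, k) teacher top-k classes
    high_rank = (topk == y.unsqueeze(1)).any(dim=1)    # is the ground truth ranked high?
    targets = torch.where(high_rank.unsqueeze(1), class_means[y], f_t)
    return targets, high_rank
```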
In general feature-based distillation methods, minimizing the L 2 distance between the student and teacher features in the same layer allows the student to generate features that are close to the teacher’s. In ideal cases, the student model can achieve performance similar to that of the teacher. However, when faced with different samples, different learning strategies should be adopted for the teacher features. When the teacher predicts correctly, the student is encouraged to generate features that closely resemble the teacher’s predictions. Conversely, when the teacher makes a prediction error, it is considered a “typical error”, and the goal is to minimize such errors when generating features in the student model.
Based on this reasoning, we propose a module called positive and negative sample adjustment (PNSA), which divides all samples into positive and negative samples according to the prediction results of the teacher model. Samples correctly predicted by the teacher are labeled positive, whereas incorrectly predicted ones are labeled negative. For positive samples, the student is encouraged to generate features similar to the teacher's as usual. For negative samples, the student is instead encouraged to generate features that differ from the teacher's. The specific operation adjusts the feature distillation loss $L_{kd}$ with a function $A(\cdot)$, as in Equations (8) and (9):
$$L_{PNSA} = A(L_{kd}) = \frac{1}{N}\sum_{i=1}^{N} A\!\left(L_2\left(f_i^t, f_i^s\right)\right) \tag{8}$$
$$A\!\left(L_2\left(f_i^t, f_i^s\right)\right) = \begin{cases} L_2\left(f_i^t, f_i^s\right), & q^t = y \\ -\eta\, L_2\left(f_i^t, f_i^s\right), & q^t \neq y \end{cases} \tag{9}$$
where $q^t$ denotes the class predicted by the teacher model, $y$ the ground truth, and $\eta$ a weighting factor. $A(\cdot)$ handles the two kinds of samples differently: for positive samples it leaves $L_{kd}$ unchanged, while for negative samples it reverses the sign of $L_{kd}$ and attaches the weight $\eta$.
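A sketch of $A(\cdot)$ under this reading (the default $\eta$ = 2.5 follows Section 4.1.4; the per-sample loss vector and the function name are ours):

```python
import torch

def pnsa(per_sample_loss, z_t, y, eta=2.5):
    """Positive-negative sample adjustment A(.) of a per-sample distillation loss (Equations (8)-(9))."""
    positive = z_t.argmax(dim=1) == y                        # teacher top-1 prediction correct?
    adjusted = torch.where(positive, per_sample_loss, -eta * per_sample_loss)
    return adjusted.mean()
```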

3.3. The Proposed FD Loss

As shown in Figure 2 and Algorithm 1, we obtain the weighted class mean $c_k^t$ from the teacher features $f^t$ and teacher logits $z^t$ of the samples in that class. Through the FS module, the decision of whether to use $c_k^t$ in place of $f^t$ for the subsequent calculations is made according to the ranking of the ground truth in the logits.
If the ground truth has a higher ranking in the logits, we calculate the cosine similarity between the student feature direction $f_i^s$ and the corresponding weighted class mean $c_k^t$, as in Equation (10):
$$L_d = \frac{1}{C}\sum_{k=1}^{C}\frac{1}{|I_k|}\sum_{i \in I_k}\left(1 - \cos\left(f_i^s, c_k^t\right)\right) \tag{10}$$
where $C$ is the number of classes and $I_k$ is the index set of samples belonging to class $k$.
If the ground truth has a lower ranking in the logits, we calculate the cosine similarity between the student feature $f_i^s$ and the corresponding teacher feature $f_i^t$, as in Equation (11):
$$L_d = \frac{1}{C}\sum_{k=1}^{C}\frac{1}{|I_k|}\sum_{i \in I_k}\left(1 - \cos\left(f_i^s, f_i^t\right)\right) \tag{11}$$
Although $L_d$ extracts the supervisory signals present in the teacher's feature map, pre-training the teacher model does not ensure that this supervision is always accurate. When the teacher makes incorrect predictions, the student is prone to making the same errors. To tackle this problem, we introduce the PNSA module and obtain $L_{fd}$ according to the different rankings in the logits, as shown in Equations (12) and (13):
$$L_{fd} = \begin{cases} \frac{1}{C}\sum_{k=1}^{C}\frac{1}{|I_k|}\sum_{i \in I_k}\left(1 - \cos\left(f_i^s, c_k^t\right)\right), & q^t = y \\ \frac{1}{C}\sum_{k=1}^{C}\frac{1}{|I_k|}\sum_{i \in I_k}\eta\left(1 + \cos\left(f_i^s, c_k^t\right)\right), & q^t \neq y \end{cases} \tag{12}$$
$$L_{fd} = \frac{1}{C}\sum_{k=1}^{C}\frac{1}{|I_k|}\sum_{i \in I_k}\eta\left(1 + \cos\left(f_i^s, f_i^t\right)\right) \tag{13}$$
where $q^t$ denotes the class predicted by the teacher model, $y$ the true label, and $\eta$ the weight.
$L_{fd}$ is compatible with existing KD methods and can be combined with $L_{ce}$ and $L_{kd}$ for student training. The total loss $L$ can be expressed as Equation (14):
$$L = L_{ce} + \beta L_{kd} + \gamma L_{fd} \tag{14}$$
where $\beta$ and $\gamma$ are the weights of $L_{kd}$ and $L_{fd}$, respectively, and $L_{kd}$ depends on the chosen distillation method.
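The overall objective can then be assembled as in this sketch, where `l_kd` is whichever base distillation loss is used and `l_fd` is the FD loss of Algorithm 1; γ = 0.8 follows Section 4.1.4, while β = 1.0 is a placeholder of ours, since β is method-dependent:

```python
import torch.nn.functional as F

def total_loss(z_s, y, l_kd, l_fd, beta=1.0, gamma=0.8):
    """L = L_ce + beta * L_kd + gamma * L_fd (Equation (14))."""
    l_ce = F.cross_entropy(z_s, y)   # supervision from the ground-truth labels
    return l_ce + beta * l_kd + gamma * l_fd
```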
Algorithm 1 FD loss computation in a PyTorch-like style, using PyTorch version 2.0.0.
  • Input: teacher logits $z^t$; student embedding features $f^s$; ground truth $y$; teacher embedding features $f^t$; weighted class means $c_k^t$.
  • Output: FD loss $L_{fd}$
pred ← z_t.data.topk(1, dim=1)
pred2 ← z_t.data.topk(5, dim=1)
L_fd ← 0.0
for s, t, i, l, n, center in zip(f_s, f_t, y, pred, pred2, c_k_t) do
    n ← n.tolist()[:5]
    e_c ← center / center.norm(p=2)
    t_c ← t / t.norm(p=2)
    if l == i then
        L_fd += 1 - torch.dot(s, e_c) / s.norm(p=2)
    else if i in n then
        L_fd += 2.5 * (1 + torch.dot(s, e_c) / s.norm(p=2))
    else
        L_fd += 1.5 * (1 + torch.dot(s, t_c) / s.norm(p=2))
    end if
end for
return L_fd
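For convenience, Algorithm 1 can also be written as a runnable PyTorch function. The sketch below assumes our own tensor shapes (logits of shape (N, C), embeddings of shape (N, d), and weighted class means stacked into a (C, d) tensor); the weights 2.5 and 1.5 follow the listing above:

```python
import torch

def fd_loss(z_t, f_s, y, f_t, c_t):
    """FD loss of Algorithm 1: cosine alignment with feature selection and PNSA."""
    pred1 = z_t.topk(1, dim=1).indices.squeeze(1)     # teacher top-1 class per sample
    pred5 = z_t.topk(5, dim=1).indices                # teacher top-5 classes per sample
    loss = f_s.new_zeros(())
    for s, t, label, top1, top5, center in zip(f_s, f_t, y, pred1, pred5, c_t[y]):
        e_c = center / center.norm(p=2)               # normalized weighted class mean
        t_c = t / t.norm(p=2)                         # normalized teacher feature
        cos_c = torch.dot(s, e_c) / s.norm(p=2)
        cos_t = torch.dot(s, t_c) / s.norm(p=2)
        if top1 == label:                             # positive sample: pull toward the class mean
            loss = loss + (1.0 - cos_c)
        elif (top5 == label).any():                   # mispredicted, but ground truth still ranked high
            loss = loss + 2.5 * (1.0 + cos_c)
        else:                                         # ground truth ranked low: push away from the teacher feature
            loss = loss + 1.5 * (1.0 + cos_t)
    return loss                                       # accumulated over the batch, as in Algorithm 1
```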

4. Experiments

We conduct experiments to validate the proposed LDN and FD loss on image classification. Section 4.1 introduces the datasets and the experimental setup. Section 4.2 compares our method with existing KD methods across various teacher–student networks. Section 4.3 presents the ablation experiments on the FD loss.

4.1. Experimental Settings

4.1.1. CIFAR-100

The CIFAR-100 dataset consists of 50,000 training images and 10,000 test images spanning 100 classes [25]. We trained all student networks from scratch using the weight initialization method described in a previous study [26], while the teacher loaded publicly available weights [27].

4.1.2. ImageNet

ImageNet is a large-scale classification dataset consisting of 1000 classes. The training set contains 1.28 million images, and the validation set contains 50k images [28].

4.1.3. Experimental Environment

All experiments were conducted on an NVIDIA RTX 3080 Ti GPU with the Ubuntu 18.04 operating system and CUDA 11.0.

4.1.4. Hyperparameter Setting

All models were trained using SGD with momentum 0.9, weight decay $5 \times 10^{-4}$, and an initial learning rate of 0.1 (divided by 10 at epochs 150 and 180). For CIFAR-100, we trained for 200 epochs with a batch size of 64; for ImageNet, 100 epochs with a batch size of 256. The temperature parameter, which controls confidence weight smoothing, was set to $\tau = 2$; the PNSA weight, which adjusts the loss inversion strength for mispredicted samples, to $\eta = 2.5$; and the FD loss weight to $\gamma = 0.8$. All hyperparameters were kept fixed across architectures and datasets to ensure reproducibility.
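For reference, this configuration corresponds to the following PyTorch setup (a sketch; the placeholder model stands in for the actual student network):

```python
import torch

model = torch.nn.Linear(512, 100)   # placeholder for the student network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# learning rate divided by 10 at epochs 150 and 180 (CIFAR-100 schedule)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 180], gamma=0.1)
```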

4.2. Experimental Results

4.2.1. CIFAR-100 Classification

Table 1 lists the performance of different knowledge distillation methods on the CIFAR-100 dataset. In this context, we selected two homogeneous network architectures and one heterogeneous network architecture, and conducted experiments using prominent feature distillation and logit distillation methods. +FD indicates the integration of our new FD loss into existing methods. From the data in Table 1, it can be seen that when using our FD loss in different methods and networks, these methods consistently outperform their original counterparts, achieving state-of-the-art performance on this dataset.
The superiority of LDN is particularly evident in homogeneous architectures. For example, when distilling from ResNet-56 to ResNet-20, LDN + KD achieves 72.36% accuracy, a 1.7 percentage point (2.4% relative) gain over the baseline KD (70.66%), outperforming even the teacher model (72.34%). This phenomenon arises because homogeneous networks share similar feature hierarchies, allowing PNSA to effectively invert erroneous patterns propagated through aligned feature deviations. Conversely, in heterogeneous pairs (e.g., ResNet-50→MobileNet-V2), structural discrepancies decouple the error distributions, limiting PNSA's corrective capacity (69.32% vs. the teacher's 79.34%). These results validate our hypothesis in Section 3.2: architectural alignment is critical for error correction.

4.2.2. ImageNet Classification

We conducted further in-depth research on the effectiveness of the FD loss on the larger ImageNet dataset. Table 2 and Table 3 present the performance of knowledge distillation on the ImageNet dataset. +FD indicates the integration of our new FD loss into existing methods. * represents the results we reproduced. From Table 2 and Table 3, it is evident that incorporating the FD loss into existing knowledge distillation methods leads to performance improvements, regardless of whether it is a homogeneous or heterogeneous network. Furthermore, the performance enhancement is more pronounced in homogeneous networks, which aligns with the results observed on the CIFAR-100 dataset.
Table 2 presents the top-1 and top-5 accuracy of LDN on ImageNet with ResNet-34 as the teacher and ResNet-18 as the student. The baseline KD method achieves 70.66% top-1 accuracy, while LDN + KD improves this to 71.03%. More notably, when integrated with DKD (a state-of-the-art logit distillation method), LDN achieves 71.56% top-1 accuracy, a 0.9% gain over the KD baseline (70.66%) and a 0.32% gain over the reproduced DKD (71.24%). Comparable improvements also appear in the heterogeneous setting (ResNet-50→MobileNet-V2) reported in Table 3, demonstrating LDN's versatility. The top-5 accuracy also improves, with LDN + DKD reaching 90.38%, outperforming the KD baseline (89.88%) by 0.5%. These results validate LDN's ability to enhance existing distillation frameworks without additional computational overhead.
Table 3 extends the evaluation to a more challenging setting: distilling from ResNet-50 to MobileNet-V2. Here, LDN + ReviewKD achieves 72.49% top-1 accuracy, surpassing the reproduced ReviewKD (72.13%) by 0.36%. The top-5 accuracy also improves, from 90.86% to 91.11%. Notably, the performance gap between LDN and non-LDN methods is smaller in this heterogeneous setting (0.36% vs. 0.9% in the homogeneous comparison), aligning with our earlier observation that architectural alignment enhances error correction. Despite this, LDN still delivers consistent gains, underscoring its robustness across diverse teacher–student pairs. Combining the results from Table 2 and Table 3, we observe two key trends: (1) Homogeneous vs. heterogeneous gains: LDN achieves larger improvements in homogeneous architectures (e.g., ResNet-34→ResNet-18: +0.9%) than in heterogeneous pairs (e.g., ResNet-50→MobileNet-V2: +0.36%). This is attributed to the shared feature hierarchies of homogeneous networks, which facilitate more effective error correction through PNSA. (2) Compatibility with SOTA methods: LDN consistently boosts the performance of existing distillation frameworks (e.g., DKD, ReviewKD) without requiring architectural modifications or additional training costs. This plug-and-play compatibility makes LDN a practical solution for real-world applications.

4.3. Ablation Study

In this section, we first conduct ablation experiments on the FD loss to validate its effectiveness as an independent module added to existing KD methods. We then investigate the impact of weighted class means (WCMs), feature selection (FS), and positive–negative sample adjustment (PNSA) on the FD loss, demonstrating the effectiveness of each component. Finally, we test the impact of different training sets on the FD loss.
We trained the teacher (ResNet-56) and student (ResNet-20) models on the CIFAR-100 dataset and report top-1 accuracy (%) on the test set. We use KD, a logit distillation method, as the baseline.
Table 4 and Figure 3 compare the performance of the FD loss under different configurations. The baseline (CE + KL) achieves 70.66% accuracy, while adding the FD loss as a standalone module (CE + KL + FD) boosts accuracy to 72.36%. This 1.7 percentage point (2.4% relative) improvement highlights the FD loss's ability to complement existing distillation frameworks. Notably, when the FD loss replaces the original KL term (CE + FD), the gain is smaller (71.09%), suggesting that the FD loss works best as an auxiliary component rather than a complete replacement.
Table 5 and Figure 4 demonstrate the impact of weighted class means (WCMs), feature selection (FS), and positive–negative sample adjustment (PNSA) on the FD loss. Using all three components yields the best performance (72.36%), while removing any one of them leads to varying degrees of degradation (71.97%, 71.71%, and 71.77%). Table 5 also reveals synergies among LDN's three core modules: WCMs prepare denoised prototypes for FS to enforce selectively; FS ensures that PNSA operates on contextually appropriate features (prototypes or raw features), maximizing error correction precision; and PNSA's inversion mechanism amplifies the contrast between reliable and erroneous supervision, guided by WCM/FS filtering.
Table 6 evaluates the FD loss on a subset of CIFAR-100 on which the teacher model predicts correctly (CIFAR-100-1). The minimal improvement there (69.09% vs. 68.87%) indicates that the FD loss primarily enhances performance by correcting erroneous supervision. This is further supported by the larger gain on the full dataset (72.36% vs. 70.66%), where teacher errors are present. These findings underscore the FD loss's role in mitigating error propagation rather than merely amplifying correct predictions.

5. Conclusions and Future Work

5.1. Conclusions

This work addresses a critical challenge in knowledge distillation: the propagation of erroneous supervision from teacher to student models. Traditional methods, by enforcing rigid feature mimicry, inadvertently amplify teacher errors, particularly in complex datasets like ImageNet. Our proposed LDN framework tackles this issue through three key innovations: (1) dynamic confidence weighting via logit ranking to suppress unreliable samples, (2) a multi-stage error correction mechanism (WCM-FS-PNSA) to address feature-level and sample-level noise, and (3) a plug-and-play design requiring zero architectural modifications.
The results demonstrate significant practical impacts: LDN reduces error-prone supervision compared to binary filtering methods, achieves state-of-the-art accuracy improvements (up to 2.4% on CIFAR-100 and 0.9% on ImageNet), and maintains computational efficiency. These advancements enable the more reliable deployment of lightweight models in resource-constrained environments—for instance, compressing ResNet-50 to MobileNet-V2 while preserving 95% of the teacher’s accuracy on edge devices. By fundamentally addressing error propagation, LDN establishes a new paradigm for trustworthy knowledge transfer in real-world applications.

5.2. Future Work

While LDN demonstrates significant improvements in error correction for knowledge distillation, several promising directions remain unexplored:
  • Dynamic Confidence Thresholding: Current fixed thresholds may limit adaptability across diverse tasks. Future work will explore adaptive thresholding mechanisms, such as meta-learning or reinforcement learning, to dynamically adjust thresholds based on task complexity and dataset characteristics.
  • Cross-Modal Knowledge Distillation: Extending LDN to vision language models (e.g., CLIP distillation) is a natural next step. Preliminary experiments suggest that logit ranking could serve as a proxy for cross-modal alignment confidence, enabling more robust distillation in multimodal settings.
  • Large Teacher Adaptation: The diminishing gains observed with stronger teachers (e.g., ResNet-152) indicate a need for hybrid ranking-entropy metrics. Future research will investigate combining logit ranking with entropy-based confidence measures to better handle overconfident predictions from large models.
These directions aim to broaden LDN’s applicability while deepening its theoretical foundations, paving the way for more reliable and efficient knowledge transfer in real-world applications.

Author Contributions

Conceptualization, T.S.; Methodology, T.S.; Software, Z.C.; Validation, T.S.; Formal analysis, T.S. and J.Q.; Resources, Z.C. and J.Q.; Data curation, T.S.; Writing—original draft, T.S.; Writing—review & editing, T.S.; Visualization, T.S.; Supervision, Z.C. and J.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy concerns.

Acknowledgments

This work was supported in part by the Science Research Project of Hebei Education Department under grant QN2022107, and in part by the Science and Technology Program of Hebei Province under grant 22370301D (corresponding author: Zhenchao Cui).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  2. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  3. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  4. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  5. Cho, J.H.; Hariharan, B. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4794–4802. [Google Scholar]
  6. Furlanello, T.; Lipton, Z.; Tschannen, M.; Itti, L.; Anandkumar, A. Born again neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1607–1616. [Google Scholar]
  7. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4320–4328. [Google Scholar]
  8. Mirzadeh, S.I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5191–5198. [Google Scholar]
  9. Huang, T.; You, S.; Wang, F.; Qian, C.; Xu, C. Knowledge distillation from a stronger teacher. Adv. Neural Inf. Process. Syst. 2022, 35, 33716–33727. [Google Scholar]
  10. Zhu, S.; Shang, R.; Tang, K.; Xu, S.; Li, Y. BookKD: A novel knowledge distillation for reducing distillation costs by decoupling knowledge generation and learning. Knowl.-Based Syst. 2023, 279, 110916. [Google Scholar] [CrossRef]
  11. Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11962. [Google Scholar]
  12. Sun, S.; Ren, W.; Li, J.; Wang, R.; Cao, X. Logit standardization in knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15731–15740. [Google Scholar]
  13. Chen, P.; Liu, S.; Zhao, H.; Jia, J. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5008–5017. [Google Scholar]
  14. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928. [Google Scholar]
  15. Yim, J.; Joo, D.; Bae, J.; Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4133–4141. [Google Scholar]
  16. Huang, Z.; Wang, N. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv 2017, arXiv:1707.01219. [Google Scholar]
  17. Beyer, L.; Zhai, X.; Royer, A.; Markeeva, L.; Anil, R.; Kolesnikov, A. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10925–10934. [Google Scholar]
  18. Wang, Y.; Cheng, L.; Duan, M.; Wang, Y.; Feng, Z.; Kong, S. Improving knowledge distillation via regularizing feature norm and direction. arXiv 2023, arXiv:2305.17007. [Google Scholar]
  19. Liu, Y.; Zhang, W.; Wang, J. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing 2020, 415, 106–113. [Google Scholar] [CrossRef]
  20. Guermazi, E.; Mdhaffar, A.; Jmaiel, M.; Freisleben, B. MulKD: Multi-layer Knowledge Distillation via collaborative learning. Eng. Appl. Artif. Intell. 2024, 133, 108170. [Google Scholar] [CrossRef]
  21. Guo, Z.; Yan, H.; Li, H.; Lin, X. Class attention transfer based knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 11868–11877. [Google Scholar]
  22. Yang, C.; An, Z.; Cai, L.; Xu, Y. Hierarchical self-supervised augmented knowledge distillation. arXiv 2021, arXiv:2107.13715. [Google Scholar]
  23. Meng, Z.; Li, J.; Zhao, Y.; Gong, Y. Conditional teacher-student learning. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6445–6449. [Google Scholar]
  24. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550. [Google Scholar]
  25. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  27. Tian, Y.; Krishnan, D.; Isola, P. Contrastive representation distillation. arXiv 2019, arXiv:1910.10699. [Google Scholar]
  28. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  29. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3967–3976. [Google Scholar]
  30. Passalis, N.; Tzelepi, M.; Tefas, A. Probabilistic knowledge transfer for lightweight deep representation learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 2030–2039. [Google Scholar] [CrossRef] [PubMed]
Figure 1. General framework of the LDN.
Figure 2. The calculation process of L f d . Teacher logits are ranked to compute sample-wise confidence weights. These weights dynamically adjust contributions to class means. Feature selection toggles between class means and raw features based on confidence. PNSA inverts supervision for mispredicted samples.
Figure 3. The test error rates of the teacher–student networks ResNet56–ResNet20 using different methods on the CIFAR100 dataset.
Figure 4. The test error rates of the teacher–student networks ResNet56–ResNet20 using different modules in FD loss on the CIFAR100 dataset.
Table 1. Top-1 accuracy (%) on the CIFAR-100 validation set.

Methods | Homogeneous: ResNet-56 → ResNet-20 | Homogeneous: ResNet-32×4 → ResNet-8×4 | Heterogeneous: ResNet-50 → MobileNet-V2
teacher (T) | 72.34 | 79.42 | 79.34
student (S) | 69.06 | 72.50 | 64.60
Feature distillation methods
FitNet [24] | 69.21 | 73.50 | 63.16
RKD [29] | 69.61 | 71.90 | 64.43
PKT [30] | 70.34 | 73.64 | 66.52
ReviewKD * [13] | 71.89 | 75.63 | 69.73
Logit distillation methods
KD [4] | 70.66 | 73.33 | 67.65
DIST [9] | 71.78 | 75.79 | 69.17
DKD * [11] | 71.97 | 75.44 | 69.78
KD++ * [18] | 72.05 | 74.65 | 68.87
DKD++ * [18] | 71.72 | 75.92 | 69.86
KD + FD (Ours) | 72.36 | 75.27 | 69.32
DKD + FD (Ours) | 71.98 | 76.33 | 70.02
ReviewKD + FD (Ours) | 72.08 | 76.05 | 69.83
* represents the results we reproduced. Bold numbers indicate the best performance under this structure.
Table 2. Top-1 and top-5 accuracy (%) on the ImageNet validation set. We set ResNet-34 as the teacher and ResNet-18 as the student.

Methods | Top-1 | Top-5
teacher | 73.31 | 91.42
student | 69.75 | 89.07
CRD [27] | 71.17 | 90.13
ReviewKD * [13] | 71.21 | 90.21
KD [4] | 70.66 | 89.88
DKD * [11] | 71.24 | 90.23
DKD++ * [18] | 71.42 | 90.12
KD + FD (ours) | 71.03 | 90.18
ReviewKD + FD (ours) | 71.48 | 90.32
DKD + FD (ours) | 71.56 | 90.38
* represents the results we reproduced. Bold numbers indicate the best performance under this structure.
Table 3. Top-1 and top-5 accuracy (%) on the ImageNet validation set. We set ResNet-50 as the teacher and MobileNet-V2 as the student.

Methods | Top-1 | Top-5
teacher | 76.16 | 92.86
student | 68.87 | 88.76
CRD [27] | 71.37 | 90.41
ReviewKD * [13] | 72.13 | 90.86
KD [4] | 68.58 | 88.98
DKD [11] | 72.05 | 91.05
ReviewKD++ * [18] | 72.44 | 90.83
KD + FD (ours) | 72.13 | 90.18
DKD + FD (ours) | 72.23 | 91.32
ReviewKD + FD (ours) | 72.49 | 91.11
* represents the results we reproduced. Bold numbers indicate the best performance under this structure.
Table 4. Analysis of the ablation experiments on the FD loss.

Case | Acc
CE + KL (baseline) | 70.66
CE + FD | 71.09
KL + FD | 71.47
CE + KL + FD | 72.36
Bold numbers indicate the best performance under this structure.
Table 5. Internal ablation experiments of the FD loss module.

Case | Acc
Baseline | 70.66
WCM + FS | 71.71
WCM + PNSA | 71.97
FS + PNSA | 71.77
WCM + FS + PNSA | 72.36
Bold numbers indicate the best performance under this structure.
Table 6. Ablation experiments of the FD loss on different datasets.

Case | CIFAR-100 | CIFAR-100-1
CE + KL (baseline) | 70.66 | 68.87
CE + KL + FD | 72.36 | 69.09
Bold numbers indicate the best performance under this structure.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
