Article

Language Bias-Driven Self-Knowledge Distillation with Generalization Uncertainty for Reducing Language Bias in Visual Question Answering

School of Information and Communication Engineering, University of Electronic Science and Technology of China, Xiyuan West Road 2006, Chengdu 611731, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2022, 12(15), 7588; https://doi.org/10.3390/app12157588
Submission received: 18 June 2022 / Revised: 25 July 2022 / Accepted: 25 July 2022 / Published: 28 July 2022
(This article belongs to the Special Issue Recent Advances of Learning Based Intelligent Vision System)

Abstract

To answer questions, visual question answering (VQA) systems rely on language bias but ignore the information in the images, which has a negative impact on their generalization. Mainstream debiased methods focus on removing language priors during inference. However, image samples are distributed unevenly in the dataset, so the feature sets acquired by the model often cannot cover the features (views) of the tail samples, and language bias arises. This paper proposes a language bias-driven self-knowledge distillation framework to implicitly learn multi-view feature sets so as to reduce language bias. Moreover, to measure the performance of the student models, the authors of this paper use a generalization uncertainty index to help the student models learn unbiased visual knowledge and force them to focus more on the questions that cannot be answered by language bias alone. In addition, the authors of this paper analyze the theory behind the proposed method and verify the positive correlation between generalization uncertainty and the expected test error. The authors of this paper validate the method’s effectiveness on the VQA-CP v2, VQA-CP v1 and VQA v2 datasets through extensive ablation experiments.

1. Introduction

Visual Question Answering (VQA) [1,2] is a cross-domain task of computer vision and natural language processing, and it has become increasingly important in the research and application of multimodal machine learning. In the past few decades, significant advances have been made in computer vision and natural language processing, with an explosion of visual and textual data to acquire and process. The most common VQA setting consists of an image and a question to be answered by the machine. Unlike many other computer vision tasks, the question is not fixed in advance, so the model must produce its answer at query time. Moreover, the VQA model is required to comprehend the multimodal information of images and texts in a more artificially intelligent [3] way, leading to an in-depth understanding of vision and language.
VQA remains a challenging and open research topic. Recent research has focused on how to solve language bias. Language bias [4,5,6,7,8] threatens the practical deployment of VQA and indicates that the current VQA models have an inadequate understanding of multimodal information. Language bias seems to be caused by the uneven distribution of datasets, a common problem in the real world. For example, if 90 percent of the bananas in the training set are yellow, then when asked “what color is the banana?”, the model would answer “yellow” all the time based on language bias. As shown in Figure 1, many VQA models tend to answer “yes” or “no” directly. Take another typical example: for the question “what color is the banana in the image?”, although the banana is green, the model still tends to predict “yellow”.
With language bias, the model overly relies on the correlation between the question and the answer while ignoring the information in the image. In essence, language bias arises from data imbalance, which leads to over-fitting of the model; that is, the model fits the head samples in the dataset [8]. Over-fitting is an inherent problem of the model itself. An individual deep neural network exhibits variance, and this variance can be reduced by an ensemble or by knowledge distillation [9,10,11,12] (over-fitting to the label imbalance may mean that some models do not train very well). At the feature level, the variance is caused by the incompleteness of the learned feature subgraph [12], so over-fitting arises easily. Recently, knowledge distillation [13,14] and self-knowledge distillation [15,16] have been shown to be able to learn multi-view features and reduce over-fitting [12,17,18].
Neural networks analyze information in a multi-view way: different views of the same object are semantically consistent [19,20,21], and this multi-view structure is ubiquitous at both the dataset and feature levels [12]. Therefore, the model can make predictions based on a learned feature subgraph. However, if the subgraph is not comprehensive, the predictions can be biased. As shown in Figure 2, for the same question, “what color is the banana?”, the model learns the feature of yellow bananas while ignoring the feature of green bananas, which are less frequent in the training set. In other words, the model ignores the view of green bananas, causing visual bias, which further leads to language bias. Therefore, the model needs to focus on the features of the less distributed samples in the training set and learn a comprehensive set of multi-view features so as to overcome VQA language bias and over-fitting.
The paper discusses how to reduce the language bias of the VQA model via self-knowledge distillation and proposes a new online learning framework, “language bias-driven self-knowledge distillation (LBSD)”, for implicit learning of multi-view visual features. Self-knowledge distillation enables the model to acquire more dark knowledge and improves its generalization ability. In short, with self-knowledge distillation, the model can gain a more comprehensive understanding of view features. Online knowledge distillation no longer uses a separate teacher model but allows student models to learn from each other by using KL divergence to constrain their outputs toward consistency. It is worth mentioning that the student network is effectively equivalent to a teacher network; the two peer networks have the same standing and each serves as the other’s teacher. However, the learning degree of the student models cannot be described by KL divergence alone [22,23]. Therefore, the authors of this paper put forward the concept of generalization uncertainty to help the model learn unbiased knowledge.
LBSD enables two debiased models to distill knowledge from each other to learn more complete visual features. It distinguishes between debiased students and biased students by calculating the generalization uncertainty of the prediction of student models and reinforces the mutual learning of the two models about unbiased knowledge. The paper also finds that heterogeneous student models can be used to reduce language bias. LBSD enables the model to learn a more complete set of visual features and to focus on the features of the less distributed samples in the training set by utilizing generalization uncertainty, thus reducing the language bias of the model and improving the robustness of the VQA model.
Contribution. In summary, the contributions of this paper are as follows:
(1) The authors of this paper propose a training framework (LBSD) based on online self-knowledge distillation, which can considerably reduce the VQA language bias. Moreover, the authors of this paper explore the different cases of student models (heterogeneous networks). The authors of this paper verify the effectiveness of the LBSD method and analyze the theory behind it.
(2) The authors of this paper propose a method to measure generalization uncertainty based on Top-k information entropy, and use it to distinguish between debiased students and biased students, so as to force the model to focus on the samples that cannot be directly answered by language bias in the VQA datasets. The authors of this paper also prove the proportional relationship between the generalized uncertainty and the expected test error.

2. Related Work

2.1. Language Bias in VQA

The language bias [8] in VQA has a negative impact on the general application of the model in real-world scenarios. The reason behind it is that there is often a strong correlation between questions and answers. Moreover, the questions tend to concern conspicuous objects in the image. In VQA v1 [1] and v2 [7], a positive answer or a question-related answer tends to have higher accuracy. When the questions and answers in the training set and the test set are distributed inconsistently, this language bias is obvious. Therefore, the VQA-CP v2 dataset was recently proposed to evaluate the language bias. Train and test splits of the VQA-CP v2 have different question-answer distributions. The current approach to language bias can be divided into (1) Strengthening visual information: AttAlign [24], HINT [24], SCR [25], ReGAT [26], ESR [27], VGQE [28] and so on; (2) Weakening language priors: AdvReg [29], GRL [30], RUBi [31], LM [32], LMH [32], Bias-Product (POE) [32], RMFE [33], CF-VQA [34] and GGE-DQ [35]; (3) Using various data enhancement: CSS [36], CL-VQA [37], GradSup [38], Loss-Rescaling [39], Mutant [40], RandImg [41], Unshuffling [42], ADA-VQA [43] and X-GGM [44].

2.2. Knowledge Distillation

In recent years, knowledge distillation [45,46,47,48] has been widely used in deep learning to transfer knowledge between different models. Hinton et al. [13] used knowledge distillation for model compression; that is, moving knowledge from powerful but complex models (teacher models) to simple models (student models). By minimizing the Kullback–Leibler (KL) divergence loss of the categorical output probability, the student can imitate the output of the teacher model. In addition, some new knowledge transfer goals have been proposed, such as intermediate feature maps [49], attention maps [50], second-order statistics [46], contrastive features [51,52] or structured knowledge [53,54,55].
However, these methods require a distinction between the roles of the teacher and the student and are typically distilled offline. Online knowledge distillation eliminates the cumbersome teacher model and instead distills knowledge among a set of student models (generally two). Based on the Kullback–Leibler divergence, Zhang et al. [16] proposed deep mutual learning (DML), in which pair-wise students learn from each other using a mimicry loss. By adding the distillation loss only after enough update steps, co-distillation [15] (similar to DML) enables student networks to sustain their diversity for a longer time. However, KL divergence alone cannot capture the learning degree of student models. The authors of this paper put forward the notion of generalization uncertainty as a way for the model to learn unbiased knowledge.

3. Methods

In order to reduce VQA language bias, the authors of this paper consider making the model focus on the less distributed samples in the training set to learn a more complete set of multi-view features. To this end, the authors of this paper propose a new online self-knowledge distillation learning framework (LBSD) for implicit learning of multi-view visual feature sets to alleviate language bias. The methods are divided into: (1) language bias-driven self-knowledge distillation and (2) using generalization uncertainty to help student models learn unbiased visual knowledge. In the following sections, the authors of this paper explain the workflow of LBSD and analyze the theory behind it. The block diagram of the method presented in this paper is shown in Figure 3 and Algorithm 1.
Algorithm 1: Language Bias-Driven Self-Distillation
Input: Training set $I, Q$ ($X$), label set $A$ ($Y$), learning rates $\gamma_{1,t}$ and $\gamma_{2,t}$.
Initialize: Debiased models $N_1$ and $N_2$ (different initial conditions or models).
Repeat:
   $t = t + 1$
   Randomly sample data $(I_i, Q_i)$ from $(I, Q)$.
   1: Update the predictions $p_1$ and $p_2$ of $(I_i, Q_i)$ for the current mini-batch.
   2: Compute the stochastic gradient and update $N_1$ by Equation (13): $N_1 \leftarrow N_1 + \gamma_{1,t}\, \partial L_{N_1} / \partial N_1$
   3: Update the prediction $p_1$ of $(I_i, Q_i)$.
   4: Compute the stochastic gradient and update $N_2$: $N_2 \leftarrow N_2 + \gamma_{2,t}\, \partial L_{N_2} / \partial N_2$
   5: Update the prediction $p_2$ of $(I_i, Q_i)$.
Until: convergence
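To make the alternating procedure of Algorithm 1 concrete, below is a minimal PyTorch-style sketch of the training loop. The model arguments, data loader and the loss helper `lbsd_loss` (which combines the cross-entropy term and the distillation terms defined in Sections 3.2 and 3.3) are placeholders of ours, not the authors' released implementation.

```python
import torch

def train_lbsd(n1, n2, train_loader, lbsd_loss, lr=1e-3, epochs=30):
    """Alternating mutual-update loop of Algorithm 1 (illustrative sketch, not the released code).

    n1, n2       -- two debiased VQA students (identical or heterogeneous architectures)
    train_loader -- yields (image_features, question_tokens, answer_targets)
    lbsd_loss    -- callable combining L_C and the GU-weighted KL term (Sections 3.2-3.3)
    """
    opt1 = torch.optim.Adamax(n1.parameters(), lr=lr)
    opt2 = torch.optim.Adamax(n2.parameters(), lr=lr)
    for _ in range(epochs):
        for v, q, a in train_loader:
            # Step 1: predictions p1, p2 of both students on the current mini-batch
            logits1, logits2 = n1(v, q), n2(v, q)
            # Step 2: update N1, treating N2's prediction as a fixed target
            loss1 = lbsd_loss(logits1, logits2.detach(), a)     # L_N1
            opt1.zero_grad(); loss1.backward(); opt1.step()
            # Steps 3-5: refresh p1 with the updated N1, then update N2 symmetrically
            logits1 = n1(v, q).detach()
            loss2 = lbsd_loss(logits2, logits1, a)              # L_N2
            opt2.zero_grad(); loss2.backward(); opt2.step()
    return n1, n2
```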

3.1. Preliminaries

To tackle the multi-class classification problem in the VQA field, the general form of VQA is as follows. A dataset $D = \{(I_i, Q_i, a_i)\}_{i=1}^{N}$ is given, containing $N$ triplets of images $I_i \in \mathcal{I}$, questions $Q_i \in \mathcal{Q}$ and answers $a_i \in \mathcal{A}$.
The aim of the VQA task is to learn a mapping function $f_{vqa}: \mathcal{I} \times \mathcal{Q} \rightarrow [0, 1]^{\mathcal{A}}$, which generates the answer distribution for any given image–question pair. The authors of this paper omit the subscript $i$ in the following.
For each question $Q$ and image $I$, the Bottom-Up Top-Down (UpDn) [56] model uses a question encoder $e_q$ and an object detector to separately extract a set of word embeddings $Q$ and a set of visual object embeddings $V$. The model is fed both $V$ and $Q$ to obtain the joint feature $mm(V, Q)$. Then, the joint feature is fed into the classifier $C$ to obtain the final predictions:
$P_{vqa}(a \mid I, Q) = f_{vqa}(V, Q) = C(mm(V, Q))$
For fair comparisons, the authors of this paper use the Bottom-Up Top-Down (UpDn) model [56], which is widely used by researchers as the backbone network.
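As a point of reference for the composition $C(mm(V, Q))$ above, the following is a simplified, hedged sketch of an UpDn-style forward pass; the module names and the crude mean-pooling fusion are illustrative simplifications of ours (the real UpDn model uses top-down attention over object features), not the official implementation.

```python
import torch
import torch.nn as nn

class UpDnStyleVQA(nn.Module):
    """Illustrative skeleton of f_vqa(V, Q) = C(mm(V, Q)); not the official UpDn code."""
    def __init__(self, vocab_size, num_answers, q_dim=512, v_dim=2048, joint_dim=2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 300)          # e.g., GloVe-initialized
        self.question_encoder = nn.GRU(300, q_dim, batch_first=True)
        self.v_proj = nn.Linear(v_dim, joint_dim)
        self.q_proj = nn.Linear(q_dim, joint_dim)
        self.classifier = nn.Linear(joint_dim, num_answers)     # C(.)

    def forward(self, v_feats, q_tokens):
        # v_feats: (B, num_objects, v_dim) pre-extracted object features (e.g., Faster-RCNN)
        # q_tokens: (B, seq_len) question token ids
        _, q_hidden = self.question_encoder(self.embedding(q_tokens))
        q = q_hidden.squeeze(0)                                  # (B, q_dim)
        v = v_feats.mean(dim=1)                                  # mean pooling stands in for attention
        joint = self.v_proj(v) * self.q_proj(q)                  # mm(V, Q): simple multiplicative fusion
        return self.classifier(joint)                            # logits over candidate answers
```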

3.2. Language Bias-Driven Self-Distillation

The method aims to learn unbiased visual knowledge via the mutual learning of two debiased models so as to reduce VQA language bias. The training strategy, which can be integrated with current debiased methods, consists of the mutual learning of two debiased models. A dataset $D = \{(I_i, Q_i, a_i)\}_{i=1}^{N}$ is given, containing $N$ triplets of images $I_i \in \mathcal{I}$, questions $Q_i \in \mathcal{Q}$ and answers $a_i \in \mathcal{A}$; it is input into two identical models with different random initializations, $N_1$ and $N_2$, and each model predicts a probability vector $p$ from its logits $z$ through the softmax:
$p_1^k(I_i, Q_i) = \dfrac{\exp(z_1^k)}{\sum_{k'=1}^{K} \exp(z_1^{k'})}$
where $k$ indexes the outputs (classes) of the neural network.
At the same time, the VQA model is generally formulated as a multi-class classifier. Therefore, the objective function for training network $N_1$ is defined as the cross-entropy error between the prediction and the correct label, as follows, where $K$ denotes the number of samples, $M$ the number of classes and $L_C$ the cross-entropy error:
$L_C = -\sum_{i=1}^{K} \sum_{m=1}^{M} I(a_i, m) \log p_1^m(I_i, Q_i)$
In order to allow the two student models to learn unbiased visual features from each other (similar to self-knowledge distillation), the authors of this paper use KL divergence to constrain all the predictions, thus distilling the unbiased knowledge of the two models. The formula of KL divergence between N 1 and N 2 is shown as follows:
$D_{KL}(p_2 \| p_1) = \sum_{i=1}^{K} \sum_{m=1}^{M} p_2^m(I_i, Q_i) \log \dfrac{p_2^m(I_i, Q_i)}{p_1^m(I_i, Q_i)}$
The two student models simultaneously start parameter optimization, and the optimization loss is shown as follows. The consistency constraint of the predictions of the two models can realize the mutual learning of unbiased knowledge between the two models.
$L_{N_1} = L_{C_1} + D_{KL}(p_2 \| p_1), \qquad L_{N_2} = L_{C_2} + D_{KL}(p_1 \| p_2)$
Since KL divergence is asymmetric, it can be replaced by Jensen–Shannon (JS) divergence (a variation of KL divergence) to ensure the consistency constraint between the two student models. Such replacement will not affect the final precision of the model.
$L_{JS}(p_1 \| p_2) = \frac{1}{2} KL\left(p_1 \,\middle\|\, \frac{p_1 + p_2}{2}\right) + \frac{1}{2} KL\left(p_2 \,\middle\|\, \frac{p_1 + p_2}{2}\right)$
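A minimal sketch of the mutual-learning objective described by the equations above, assuming PyTorch; the function name, the hard-label cross-entropy form and the choice to detach the peer's distribution are our assumptions for illustration, not the authors' released code. The `kl_coef` argument reflects the KL coefficient of 2 or 3 reported in Section 4.1.2.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_losses(logits1, logits2, targets, kl_coef=1.0, use_js=False):
    """L_N1 = L_C1 + D_KL(p2||p1) and L_N2 = L_C2 + D_KL(p1||p2), or a symmetric JS alternative."""
    log_p1, log_p2 = F.log_softmax(logits1, dim=1), F.log_softmax(logits2, dim=1)
    p1, p2 = log_p1.exp(), log_p2.exp()

    # Hard-label cross entropy, matching the L_C definition in the text
    ce1, ce2 = F.cross_entropy(logits1, targets), F.cross_entropy(logits2, targets)

    if use_js:
        # Symmetric Jensen-Shannon variant: JS = 0.5*KL(p1||m) + 0.5*KL(p2||m), m = (p1+p2)/2
        m_log = (0.5 * (p1 + p2)).clamp_min(1e-12).log()
        js = 0.5 * F.kl_div(m_log, p1, reduction="batchmean") \
           + 0.5 * F.kl_div(m_log, p2, reduction="batchmean")
        kl_for_1 = kl_for_2 = js
    else:
        # Each student treats the peer's distribution as a fixed (detached) target
        kl_for_1 = F.kl_div(log_p1, p2.detach(), reduction="batchmean")   # D_KL(p2 || p1)
        kl_for_2 = F.kl_div(log_p2, p1.detach(), reduction="batchmean")   # D_KL(p1 || p2)

    return ce1 + kl_coef * kl_for_1, ce2 + kl_coef * kl_for_2
```

In the alternating schedule of Algorithm 1, each of the two returned losses drives the update of its own student while the other student's prediction is held fixed.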
Moreover, all the current self-knowledge distillation models use student models with different random initializations. The strategy is effective because the model learns more complete sets of multi-view features. The authors of this paper also explore the case where two heterogeneous student networks serve as the student models. The heterogeneous networks have the same feature extraction structure, but they have different loss functions and network branches.

3.3. Debiased Mutual Students

As mentioned above, the language bias of datasets is, in essence, the distribution bias of image samples. For the same input image/text sample pair, the two student networks may have different outputs because of an inconsistent random seed, order of data reading or even network structure.
As shown in Figure 4, for the more-distributed image/text sample pairs in the dataset, the model can simply answer the question through language bias, and the confidence of the answer is very high; different student models tend to give the same answer. During the gradient update of the neural networks, image/text samples whose questions can be answered by language bias alone contribute minimal cross-entropy and KL divergence losses. However, for the less-distributed samples, the models are more likely to give different answers. Therefore, these differing answers can be measured and analyzed to help the model focus more on the samples that cannot be directly answered by language bias, so as to reduce language bias.
In general, the current self-distillation methods only use KL divergence for the mutual distillation of knowledge. As KL divergence is not symmetric, it cannot be interpreted as a “distance” that measures the information loss between two distributions. Simply constraining the KL divergence of the two student models cannot capture the difference between their outputs or help the two models learn from each other with more precision. As shown in Figure 4, KL divergence for different distributions and consistency constraints is not always consistent with our expectations. For this reason, the authors of this paper consider using information entropy to evaluate the output uncertainty of the two models and evaluate the output difference based on that uncertainty.
As shown in Figure 4, although information entropy $H$ is a common way to measure uncertainty, it is not always consistent with our intuition. For $p_a$ = [0.5, 0.25, 0.25] and $p_b$ = [0.5, 0.5, 0], the formula gives $H(p_a) > H(p_b)$. However, for general classification scenarios, it is clear that the prediction of $p_b$ is less certain than that of $p_a$, since its confidence is split evenly between two answers. Therefore, in order to describe the prediction uncertainty, the authors of this paper adopt a simple improved version: Top-k information entropy.
Suppose that $p_1, p_2, \ldots, p_k$ are the $k$ values with the highest probability; the following formulas can be obtained:
$H_{normal}(p) = -\sum_{i=1}^{m} p_i \log p_i$
$H_{top\text{-}k}(p) = -\sum_{i=1}^{k} \tilde{p}_i \log \tilde{p}_i$
$\tilde{p}_i = p_i \Big/ \sum_{j=1}^{k} p_j$
$C = H_{top\text{-}k}(p) / \log k$
By using the above formula, the authors of this paper can get a result in the range of 0 to 1 and take C as the final uncertainty measure.
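The following short sketch (the function name and the choice of k = 2 for this three-class toy example are ours) computes the normalized Top-k uncertainty $C$ and reproduces the comparison above: unlike the standard entropy, it assigns the tied prediction $p_b$ the larger uncertainty. The experiments in Section 4 use k = 3 over the full answer vocabulary.

```python
import math
import torch

def topk_uncertainty(p, k):
    """Normalized Top-k entropy C = H_top-k(p) / log k, in [0, 1]."""
    topk, _ = torch.topk(p, k, dim=-1)                  # k highest-probability answers
    p_tilde = topk / topk.sum(dim=-1, keepdim=True)     # renormalize over the top k
    h_topk = -(p_tilde * (p_tilde + 1e-12).log()).sum(dim=-1)
    return h_topk / math.log(k)

p_a = torch.tensor([0.50, 0.25, 0.25])
p_b = torch.tensor([0.50, 0.50, 0.00])
# Standard entropy says p_a is more uncertain (H(p_a) > H(p_b)), but restricting to the
# top answers exposes the tie in p_b: C(p_b) = 1.0 > C(p_a) ~ 0.92 for k = 2.
print(topk_uncertainty(p_a, k=2), topk_uncertainty(p_b, k=2))
```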
In order to measure the output difference between the two student models with uncertainties $C_1$ and $C_2$, the output difference can be defined as $|C_1 - C_2|$. In order to reinforce the mutual learning of the two student models when their outputs differ (i.e., on questions that cannot be answered directly with language bias), the authors of this paper define a generalization uncertainty index $GU = e^{|C_1 - C_2|}$ to represent the intensity, from which the final loss function is obtained as follows:
$L_{reg} = F_{scale}(GU, D_{KL}) = e^{|C_1 - C_2|}\, D_{KL}(p_2 \| p_1)$
$L_{N_1} = L_{C_1} + L_{reg}$
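Putting the pieces together, here is a hedged sketch of the generalization-uncertainty-weighted term $L_{reg} = e^{|C_1 - C_2|} D_{KL}(p_2 \| p_1)$, reusing the `topk_uncertainty` helper sketched above; applying the weight per sample and detaching it from the gradient path are our assumptions about details the text leaves open.

```python
import torch
import torch.nn.functional as F

def gu_weighted_kl(logits1, logits2, k=3):
    """L_reg for student 1: exp(|C1 - C2|) * D_KL(p2 || p1), applied per sample."""
    log_p1 = F.log_softmax(logits1, dim=1)
    p1 = log_p1.exp()
    p2 = F.softmax(logits2, dim=1).detach()              # peer output treated as a fixed target

    c1 = topk_uncertainty(p1, k)                          # per-sample uncertainties in [0, 1]
    c2 = topk_uncertainty(p2, k)                          # (helper from the previous sketch)
    gu = torch.exp((c1 - c2).abs()).detach()              # generalization uncertainty index, >= 1

    kl = (p2 * (p2.clamp_min(1e-12).log() - log_p1)).sum(dim=1)   # per-sample D_KL(p2 || p1)
    return (gu * kl).mean()

# Full objective for student 1 (preceding equation): loss_n1 = L_C1 + gu_weighted_kl(logits1, logits2)
```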
In the next section, the authors of this paper will prove that the generalization uncertainty index G U of the two student models can be used to estimate the test error of the model.

3.4. Theoretical Analysis

3.4.1. Theoretical Analysis of Generalization Uncertainty (GU)

In this section, the authors of this paper demonstrate that the generalization uncertainty index between two student models can be used to estimate the model’s test error on image-text sample pairs. Thus, generalization uncertainty is used in the training process to help the students learn unbiased knowledge. Following the research of Nakkiran and Bansal [57], Jiang et al. [58] and others [59,60,61], the authors of this paper use class-segregated calibration (or class-wise calibration) [58,62,63,64,65,66] to prove the proportional relationship between the generalization uncertainty and the test error.
Notation 1.
The authors of this paper define two neural networks trained from different random seeds as $n$ and $n'$. The data include $K$ categories, with input $(I_i, Q_i)$ and label $a_i$ ($Y$). The model is parameterized by stochastic learning. The predicted output of the model is expressed as the probability $p(\hat{a} \mid I, Q)$. $D_{vqa}$ is the distribution over $(\mathcal{I}, \mathcal{Q}) \times [K]$, and $p(\omega)$ is the sample estimate over the parameter distributions of the models. The parameters of the model are denoted by $\Omega$. $\mathbb{1}[\cdot]$ is the indicator function, which equals 1 if the prediction condition holds and 0 otherwise.
Definition 1.
The model N ( p ( a ^ , ω I , Q ) ) satisfies the generalization uncertainty proportional (GUP) on the distribution D v q a if:
$\mathrm{TestErr}_{D_{vqa}}(n) \equiv E_{D_{vqa}}\big[\mathbb{1}[n(I, Q) \neq a]\big] \;\propto\; \mathrm{GUErr}_{D_{vqa}}^{(Top\text{-}K)}(n, n') \equiv E_{D_{vqa}}\big[\mathbb{1}[n(I, Q) \neq n'(I, Q)]\big]$
Definition 2.
The self-knowledge distillation model $N(n, n')$ satisfies class-wise calibration (or class-segregated calibration) on $D_{vqa}$ if, for any confidence value $q \in [0, 1]$ and any class $k \in [K]$,
$p\big(a = k \mid \tilde{n}_k(I, Q) = q\big) = q$
$\dfrac{\sum_{k=0}^{K-1} p\big(a = k,\, \tilde{n}_k(I, Q) = q\big)}{\sum_{k=0}^{K-1} p\big(\tilde{n}_k(I, Q) = q\big)} = q$
Theorem 1.
If the self-knowledge distillation model N ( n , n ) satisfies class-wise calibration (or class-segregated calibration) on D v q a , then N satisfies the generalization uncertainty proportional (GUP) on D v q a .
$E_{\Omega, \Omega'}\big[\mathrm{GUErr}_{D_{vqa}}(n, n')\big] \;\propto\; E_{\Omega}\big[\mathrm{TestErr}_{D_{vqa}}(n)\big]$
Proof. 
The authors of this paper denote the expected test error by TE and the expected Top-K disagreement error of the two student models, weighted by generalization uncertainty, by GUE. By simplifying the two errors, the following results can be obtained, and the proportional relationship (GUP) between them follows. Since $K$ previously represented the number of categories, the authors of this paper use $J$ for the $K$ in Top-K and $p_i$ for the prediction corresponding to the $j$-th value.
$\mathrm{TE} = \int_{q \in [0, 1]} q(1 - q) \sum_{k=0}^{K-1} p\big(\tilde{n}_k(I, Q) = q\big)\, dq$
$\mathrm{GUE} = \sum_{j=0}^{J-1} \frac{1}{p_i} \sum_{k=0}^{K-1} \int_{q \in [0, 1]} p\big(\tilde{n}_k(I, Q) = q\big)\, q(1 - q)\, dq = \sum_{j=0}^{J-1} \frac{1}{p_i} \int_{q \in [0, 1]} q(1 - q) \sum_{k=0}^{K-1} p\big(\tilde{n}_k(I, Q) = q\big)\, dq \;\propto\; \mathrm{TE}.$
The detailed proof of generalization uncertainty can be found in Appendix A. □

3.4.2. Theoretical Analysis of Debiased Self-Distillation

In this section, the authors of this paper demonstrate that self-knowledge distillation and generalization uncertainty can enable models to learn more complete multi-view feature sets and reduce language bias in VQA. The authors of this paper follow the research of Allen-Zhu and Li [12,19].
Notation 2.
Let us set up a model whose dataset contains $K$ categories, $P$ input patches and the ReLU activation function. The model input is $(I_i, Q_i)$ and the label is $a_i$. To simplify the problem, the authors of this paper assume that each category contains related features that are orthogonal to each other. The authors of this paper define these features as vectors $vqa_{j,1}$ and $vqa_{j,2}$.
Following the settings of Allen-Zhu and Li’s research, the authors of this paper obtain the definitions as follows. The set of all features is:
$\mathcal{X}_{vqa} \overset{\mathrm{def}}{=} \{vqa_{j,1}, vqa_{j,2}\}_{j \in [k]}$
$vqa_{j,\ell} \perp vqa_{j',\ell'} \quad \text{when } (j, \ell) \neq (j', \ell')$
Definition 3.
(Data distribution) The authors of this paper define the multi-view and single-view distributions $D_{vqa}^m$ and $D_{vqa}^s$, with $D \in \{D_{vqa}^m, D_{vqa}^s\}$ and $(I_i, Q_i, a_i) \sim D$. Features are sampled with probability $s/k$ ($s \geq 1$, $k \geq 0.2$). The coefficient is $z_p$, $n_{p, vqa'}$ is the feature noise, and $\xi_p$ is random Gaussian noise. For each $p \in [P]$ of $P(I_i, Q_i)$, the authors of this paper set:
$x_p = z_p\, vqa + \sum_{vqa' \in \mathcal{X}_{vqa}} n_{p, vqa'}\, vqa' + \xi_p$
Definition 4.
(The final data distribution $D$ and the training dataset $S_d$.) Suppose $D$ contains $(1 - \mu)\, D_{vqa}^m$ and $\mu\, D_{vqa}^s$. For $N$ samples drawn from $D$, the training dataset is $S_d = S_d^m \cup S_d^s$, with $(I_i, Q_i, a_i)$ randomly sampled from the set $S_d$, $\mu = \frac{1}{\mathrm{poly}(k)}$ and $N = k^{1.2}/\mu$.
Definition 5.
The authors of this paper define a network $VQA(I_i, Q_i)$ trained with a cross-entropy loss function using a stochastic learning algorithm as follows:
$L(VQA) = E_{(I_i, Q_i, a_i) \sim S_d}\big[L(VQA; I_i, Q_i, a_i)\big]$
The logit function of the single model ($\eta \leq \frac{1}{\mathrm{poly}(k)}$, $T = \frac{\mathrm{poly}(k)}{\eta}$) can be defined as
$\mathrm{logit}_i(VQA, I, Q) \overset{\mathrm{def}}{=} \dfrac{e^{VQA_i(I, Q)}}{\sum_{j \in [k]} e^{VQA_j(I, Q)}}$
The logit function of the model using knowledge distillation can be defined as
$\mathrm{logit}_i^{\tau}(VQA, I, Q) = \dfrac{e^{\min\{\tau^2 VQA_i(I, Q),\, 1/\tau\}}}{\sum_{j \in [k]} e^{\min\{\tau^2 VQA_j(I, Q),\, 1/\tau\}}}$
Theorem 2.
For the single model, the authors of this paper give the prediction error as follows, for $i \in [k] \setminus \{a\}$:
$\Pr_{(I, Q, a) \sim D}\big[VQA_a^{(T)}(I, Q) < VQA_i^{(T)}(I, Q)\big] \in (0.5 \pm 0.01)\, \mu$
Theorem 3.
For self-knowledge distillation with the generalization uncertainty model, where $\lambda$ ($\lambda > 1$) is the gain from generalization uncertainty, the authors of this paper give the prediction error as follows, for $i \in [k] \setminus \{a\}$:
$\Pr_{(I, Q, a) \sim D}\big[VQA_a^{(T+T')}(I, Q) < VQA_i^{(T+T')}(I, Q)\big] \leq 0.26\, \mu / \lambda$
The authors of this paper find that when comparing Theorems 2 and 3, the prediction error of the model decreased. That means the LBSD method can reduce language bias. The detailed proof can be found in Appendix A.

4. Settings, Results and Discussion

In this section, the authors of this paper evaluate the effectiveness of all the LBSD methods in the three mainstream datasets (VQA-CP v2, VQA-CP v1 and VQA v2), carry out an ablation experiment with the typical debiased method and compare the performance of the LBSD methods and that of the latest method. Table 1 shows the statistics of all the datasets.

4.1. Experimental Setup

4.1.1. Datasets and Backbone

The paper uses the standard VQA evaluation metric [1] to evaluate the performance of the model on the VQA-CP v2 [67], VQA-CP v1 [67] and VQA v2 [7] datasets. For fair comparisons, all the methods are based on the UpDn model, and their best-recorded performance is compared. The experiment trains and tests the models on two Titan Xp GPUs.
Currently, for the VQA language bias issue, researchers evaluate the performance of the proposed models on the VQA-CP v2 dataset and conduct auxiliary verification on the VQA v2 dataset. Most findings test the models on VQA-CP v2 and VQA v2 and calculate the gap index [36] as an auxiliary index to verify the robustness of the model.
VQA-CP v2. The researchers propose the VQA-CP v2 dataset, which is derived from the re-classification of the samples in the VQA v2 dataset, to measure language bias. The VQA-CP v2 and VQA-CP v1 datasets are the only open-source datasets for language bias evaluation. The questions and answers in the training and testing sets are distributed in considerably different ways. In other words, for the same type of questions, the answers in the training set and testing set are distributed very differently. Therefore, the VQA-CP v2 dataset is suitable for measuring the language bias of the models. The training set consists of 121 K images, 438 K questions and 4.4 million answers, and the testing set consists of 98 K images, 220 K questions and 2.2 million answers.
VQA-CP v1. The VQA-CP v1 dataset, the first version of the VQA-CP dataset, is the first-ever dataset for language bias evaluation. It is derived from the re-classification of the VQA v1 [1] dataset. The VQA-CP v1 training set consists of 118 K images, 245 K questions and 2.5 million answers, and the VQA-CP v1 testing set consists of 87 K images, 125 K questions and 1.3 million answers.
VQA v2. The VQA v2 dataset is the second version of the VQA dataset. The training set consists of 82,783 images, 443,757 questions and 4,437,570 answers. The testing set consists of 40,504 images, 214,354 questions and 2,143,540 answers. The VQA v2 dataset is double the VQA v1 dataset in size.

4.1.2. Experimental Details

For LBSD, the $k$ in generalization uncertainty is set to 3, and the KL divergence coefficient is set to 2 or 3. The basic VQA network UpDn uses a pre-trained Faster-RCNN to extract image features, pretrained GloVe embeddings (300 dimensions) to extract text features and a single-layer GRU to obtain question-embedding vectors (512 dimensions); the joint embedding has 2048 dimensions. In addition, the batch size is set to 512, and the models are trained and tested on two Titan Xp GPUs. Because VQA-CP v1 and v2 lack validation splits, whereas VQA v2 results are generally reported on its validation split, the authors of this paper hold out 10% of the samples from the test set (or the validation set) to act as a validation set, select the model parameters on this validation set and then measure the precision of the model on the test set. The reported results of other methods are taken from the original published papers; for experiments that were not performed in the original papers, the authors reproduced them using the official code and mark those results with an asterisk. With regard to run time, the proposed method runs 30 epochs for 15 h in an environment with 256 GB of memory and two Titan Xp GPUs. For the statistical analysis, the authors performed multiple runs and report results at a 95% confidence level; for reproducibility, the median of the multiple runs is selected as the final result, and the final precision of each index is reported in the tables.
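For reference, the hyperparameters stated above can be summarized in a single configuration; the dictionary and its key names below are illustrative, not taken from the authors' code.

```python
# Illustrative summary of the reported training configuration (key names are ours).
LBSD_CONFIG = {
    "backbone": "UpDn",                  # Bottom-Up Top-Down attention model
    "image_features": "Faster-RCNN",     # pre-extracted object features
    "word_embedding_dim": 300,           # pretrained GloVe vectors
    "question_encoder": "GRU",
    "question_embedding_dim": 512,
    "joint_embedding_dim": 2048,
    "topk_entropy_k": 3,                 # k in the generalization uncertainty
    "kl_coefficient": 2,                 # set to 2 or 3, per the text
    "batch_size": 512,
    "epochs": 30,
    "hardware": "2x Titan Xp, 256 GB RAM",
}
```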

4.2. Ablation Studies

To verify the effectiveness of LBSD, the authors of this paper conduct an ablation experiment on every aspect. For fair comparisons, the authors of this paper select the mainstream VQA network UpDn as the skeleton and carry out ablation experiments on typical debiased methods such as Bias product, Reweight and LMH. In these tables, * indicates the results of our reimplementation from the official code.

4.2.1. Architecture Agnostic

Since LBSD is model-agnostic, it can be integrated into various VQA networks. To evaluate the performance of LBSD on debiased methods, the authors of this paper combine it with several typical methods and baselines, including UpDn, Bias Product (Product of Experts), Reweight and LMH. Reweight, a non-ensemble method, encourages the model to focus on the samples that are predicted erroneously by the language-bias model, while Bias Product and LMH are ensemble methods. Compared with these, the LBSD-integrated models have higher precision.
The authors of this paper conduct ablation experiments on the VQA-CP v2 and VQA-CP v1 datasets. As shown in Table 2, for typical debiased methods, including ensemble and non-ensemble methods, LBSD improves the precision of the model on the VQA-CP v2 dataset. For example, the performance of reweighting (non-ensemble) and LMH (ensemble) improves by 1.26% and 2.22%, respectively. Even for UpDn without debiased methods, LBSD improves the precision by 0.25%, which demonstrates that LBSD reduced the language bias from the perspective of feature learning. As shown in Table 3, for reweight (non-ensemble) and bias product (ensemble), LBSD improves the performance by 2.2% and 0.58% (“NUM” index has been improved by 7.51%), respectively, on the VQA-CP v1 dataset.

4.2.2. Effectiveness of GU

To verify the effectiveness of generalization uncertainty in the reduction of language bias, the authors of this paper conduct ablation experiments on VQA-CP v2. Two debiased methods, Reweight (non-ensemble) and LMH (ensemble), are selected for verification. As shown in Table 4, the results show that, compared with LBSD without the generalization uncertainty constraint, LBSD with the constraint improves the performance by 0.27% and 0.63% on Reweight and LMH, respectively. For the question types “Yes/No” and “Other”, which are highly dependent on language bias, adding the generalization uncertainty constraint reduces their language bias.

4.2.3. Heterogeneous Student Networks

Generally, the student models of self-knowledge distillation have identical network structures. The authors of this paper also explore heterogeneous student networks, where the two student models are not identical. The authors of this paper select two debiased methods based on the UpDn model to verify the effectiveness of heterogeneous student networks. As shown in Table 5, heterogeneous student networks can have similar effects to homogeneous student networks. Moreover, the precision of the two heterogeneous student models is improved.

4.3. Comparisons with State-of-the-Art Methods

To evaluate the performance of LBSD, the authors of this paper carry out an experiment on VQA-CP v2, VQA-CP v1, and VQA v2 and compare it with the state-of-the-art method. In these tables, * indicates the results of our reimplementation from the official code.

4.3.1. Performance on VQA-CP v2

Setting. The authors of this paper combine LBSD with LMH and name it LBSD-LMH. For fair comparisons, the authors of this paper choose the debiased method based on UpDn. According to the principles of reducing language bias, the authors of this paper divide the methods into groups: (1) Strengthening visual information [24,25]. (2) Weakening language priors [29,31,32]. (3) Using various data enhancement and data balance [36,68].
Since LBSD improves the performance by enabling the model to focus more on visual information and difficult samples (the model cannot answer based on language bias), the authors of this paper compare other methods with those in the first and second groups. Moreover, according to the experiment settings of CSS [36], the authors of this paper test and calculate the gap index as an auxiliary index on VQA v2 to verify the robustness of the model.
Results. Comparisons are reported in Table 6. As shown in Table 6, compared with other methods that use UpDn as the standard VQA model, LBSD improves the performance on VQA-CP v2. The gap index is also improved (“All” and “Other”). The results show that the proposed LBSD can reduce language bias in VQA. For individual items such as Num, Yes/No and Other, CF-VQA is slightly higher than our method on the Num index; however, CF-VQA is an ensemble method based on causal inference and, similar to boosting, uses more ensemble networks as additional information than ours, so a direct comparison on such individual indices is not entirely fair.

4.3.2. Performance on VQA-CP v1

Settings. The authors of this paper compare LBSD-LMH with the state-of-the-art methods on VQA-CP v1. According to the principle and method of reducing language bias, the authors of this paper divide them into groups: (1) Strengthening visual information [24,25]. (2) Weakening language priors [29,31,32]. (3) Using various data enhancement and data balance [36,68]. Moreover, the authors of this paper conducted additional experiments based on the official codes of the methods whose results on VQA-CP v1 were not reported.
Results. As shown in Table 7, compared with the methods in group 1 and group 2, LBSD realizes the best performance on VQA-CP v1. In particular, LBSD improves the performance of LMH and Reweight by 0.66% and 2.2%, respectively. The results show that the proposed method is effective for different datasets and is effective for different types of debiased methods. The results verify the effectiveness of LBSD.

4.4. Qualitative Examples

In order to better show the results, the authors of this paper conduct a visualization analysis of some representative findings of the model from the perspective of qualitative analysis and compare it with other methods. Figure 5 shows that our method is superior to the baseline method.

5. Conclusions

This paper discusses how to reduce the language bias of the VQA model via self-knowledge distillation and proposes a new online learning framework, “language bias-driven self-knowledge distillation (LBSD)”, for implicit learning of multi-view visual features. Moreover, in order to help student models learn unbiased visual knowledge, the authors of this paper propose generalization uncertainty to measure the learning results of the student models and use KL divergence to reinforce the debiased mutual learning of the student models. In this way, the student models can learn unbiased knowledge from each other through the output of Top-K information entropy. In addition, the paper also discusses the effect of heterogeneous student models on the reduction of language bias. The experiments prove that even heterogeneous student models can improve their unbiased learning ability through the LBSD method. Extensive experiments and ablation experiments on the VQA-CP v2, VQA-CP v1 and VQA v2 datasets verify the effectiveness of the proposed method. In the future, we will continue to explore how to better define the concept of unbiased knowledge, such as using multimodal knowledge graphs to help the model understand the types of knowledge in the dataset, and how to optimize the loss function to enable the model to distinguish biased from unbiased knowledge, so as to further reduce language bias.

Author Contributions

Conceptualization, D.Y., L.W., Q.W., F.M., K.N.N. and L.X.; methodology, visualization, experiments and writing, D.Y. and L.W.; writing, review and editing, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China; grant numbers 61831005 and 61971095.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Detailed Proof

Appendix A.1.1. Proof of Generalization Uncertainty (GU)

Proof. 
Predicting generalization and calibration. It is found that the distributions over predicted classes and ground-truth labels match each other within each confidence level; therefore, measuring the disagreement between two models drawn from an ensemble boils down to measuring disagreement against the ground truth.
Recall Theorem 1. The authors of this paper can express the expected disagreement rate between two debiased students as an integral over the confidence values.
In order to simplify the expected TOP-K disagreement rate (GU), the authors of this paper will first simplify the expected test error (TE) as follows.
$\mathrm{TE} \equiv E_{n \sim H_A}\big[p(n(I, Q) \neq a \mid n)\big] = E_{n \sim H_A} E_{(I, Q, a) \sim D}\big[\mathbb{1}[n(I, Q) \neq a]\big] = E_{(I, Q, a) \sim D} E_{n \sim H_A}\big[\mathbb{1}[n(I, Q) \neq a]\big] = E_{(I, Q, a) \sim D}\big[1 - \tilde{n}_a(I, Q)\big].$
To deal with the integrals, define $\tilde{n}_k(I, Q)$ as $q_k$ and $1 - \tilde{n}_k(I_i, Q_i)$ as $f_k$; the authors of this paper can then get:
$\mathrm{TE} = \sum_{k=0}^{K-1} \int_{(I_i, Q_i)} f_k\, p\big(I, Q = (I_i, Q_i),\, a = a_k\big)\, d(I_i, Q_i)$
$= \int_{q \in \Delta^K} \sum_{k=0}^{K-1} \int_{(I_i, Q_i)} f_k\, p\big(I, Q = (I_i, Q_i),\, a = a_k,\, \tilde{n}(I, Q) = q\big)\, d(I_i, Q_i)\, dq$
$= \int_{q \in \Delta^K} \sum_{k=0}^{K-1} p\big(a = a_k,\, \tilde{n}(I, Q) = q\big)\, (1 - q_k)\, dq$
$= \sum_{k=0}^{K-1} \int_{q \in \Delta^K} p\big(a = a_k,\, \tilde{n}(I, Q) = q\big)\, (1 - q_k)\, dq$
$= \sum_{k=0}^{K-1} \int_{q_k} p\big(a = a_k,\, \tilde{n}_k(I, Q) = q_k\big)\, (1 - q_k)\, dq_k$
$= \int_{q \in [0, 1]} \sum_{k=0}^{K-1} p\big(a = a_k,\, \tilde{n}_k(I, Q) = q\big)\, (1 - q)\, dq$
Using the calibration in aggregate assumption, the authors of this paper can get:
$\mathrm{TE} = \int_{q \in [0, 1]} q(1 - q) \sum_{k=0}^{K-1} p\big(\tilde{n}_k(I, Q) = q\big)\, dq$
As defined in Section 3.4.1, the authors of this paper need to prove a direct relationship between the GU error and TE. Different from the method proposed by Jiang et al., the GU can provide more soft information than the hard target. Similar to the definition of TE, the authors of this paper can get:
$\mathrm{GUerror}^{Top\text{-}k} \equiv E_{n, n' \sim H_A}\big[p(n(I, Q) \neq n'(I, Q) \mid n, n')\big] = E_{n, n' \sim H_A} E_{(I, Q, a) \sim D}\big[\mathbb{1}[n(I, Q) \neq n'(I, Q)]\big] = E_{(I, Q, a) \sim D} E_{n, n' \sim H_A}\big[\mathbb{1}[n(I, Q) \neq n'(I, Q)]\big]$
Similar to the simplified proof of TE, the authors of this paper can get:
$\mathrm{GUerror} = \sum_{j=0}^{J-1} \frac{1}{p_i} \sum_{k=0}^{K-1} \int_{q \in [0, 1]} p\big(\tilde{n}_k(I, Q) = q\big)\, q(1 - q)\, dq = \sum_{j=0}^{J-1} \frac{1}{p_i} \int_{q \in [0, 1]} q(1 - q) \sum_{k=0}^{K-1} p\big(\tilde{n}_k(I, Q) = q\big)\, dq \;\propto\; \mathrm{TE}.$
Proof finished. □

Appendix A.1.2. Proof of Debiased Self-Distillation

Proof. 
Referring to the research work of Allen Zhu, the authors of this paper use the same lottery winning theory and other lemmas to prove it. Refer to Allen Zhu’s research for the details of the lemma. The authors of this paper expand the research to VQA with GU.
For the single model. For every t < T , according to the noise lower bound and multi-view error claim, the authors of this paper can get:
$\sum_{t=T_0}^{T} E_{(I, Q, a) \sim S_d^m}\big[1 - \mathrm{logit}_a(VQA^{(t)}, I, Q)\big] \leq \tilde{O}\!\left(\frac{k}{\eta}\right)$
$\sum_{t=T_0}^{T} E_{(I, Q, a) \sim S_d^s}\big[1 - \mathrm{logit}_a(VQA^{(t)}, I, Q)\big] \leq \tilde{O}\!\left(\frac{N}{\eta\, \rho^{q-1}}\right)$
The training objective is:
$L\big(VQA^{(t)}\big) = E_{(I, Q, a) \sim S_d}\big[-\log \mathrm{logit}_a(VQA^{(t)}, I, Q)\big]$
For every data:
1. If $\mathrm{logit}_a(VQA^{(t)}, I, Q) \geq \frac{1}{2}$:
$-\log \mathrm{logit}_a(VQA^{(t)}, I, Q) \leq O\big(1 - \mathrm{logit}_a(VQA^{(t)}, I, Q)\big)$
2. If $\mathrm{logit}_a(VQA^{(t)}, I, Q) \leq \frac{1}{2}$: a naive bound gives
$-\log \mathrm{logit}_a(VQA^{(t)}, I, Q) \in [0, \tilde{O}(1)]$
Therefore, the authors of this paper can get that when $T \geq \mathrm{poly}(k)/\eta$:
$\frac{1}{T} \sum_{t=T_0}^{T} E_{(I, Q, a) \sim S_d}\big[-\log \mathrm{logit}_a(VQA^{(t)}, I, Q)\big] \leq \frac{1}{\mathrm{poly}(k)}$
However, the objective value is not monotone during training. As the authors of this paper are using gradient descent and the objective function is $O(1)$-Lipschitz continuous, defining $E_{(I, Q, a) \sim S_d}$ as $E_d$, they get:
$E_d\big[1 - \mathrm{logit}_a(VQA^{(T)}, I, Q)\big] \leq E_d\big[-\log \mathrm{logit}_a(VQA^{(T)}, I, Q)\big]$
$E_{(I, Q, a) \sim S_d}\big[-\log \mathrm{logit}_a(VQA^{(T)}, I, Q)\big] \leq \frac{1}{\mathrm{poly}(k)}$
As a result, the training accuracy is perfect. The single-view test accuracy satisfies:
$VQA_a^{(T)}(I, Q) \geq \max_{j \neq a} VQA_j^{(T)}(I, Q) - \frac{1}{\mathrm{polylog}(k)}$
For the single-view distribution $D^s$, the authors of this paper can get $|\mathcal{M}_{VQA}| \geq k(1 - o(1))$, and the prediction error holds with probability at least $\frac{1}{2}(1 - o(1))$, so the authors of this paper obtain Theorem 2.
For self-knowledge distillation with the generalization uncertainty model.
The logits function of the model using knowledge distillation can be defined as
$\mathrm{logit}_i^{\tau}(VQA, I, Q) = \dfrac{e^{\min\{\tau^2 VQA_i(I, Q),\, 1/\tau\}}}{\sum_{j \in [k]} e^{\min\{\tau^2 VQA_j(I, Q),\, 1/\tau\}}}$
Similar to the proof of knowledge distillation (from Allen-Zhu) and Theorem 1, for a network with $(i, \ell) \in \mathcal{M}$, the authors of this paper can get:
$S_{i,\ell} \overset{\mathrm{def}}{=} E_{(I, Q, a) \sim S_d^m}\Big[\mathbb{1}[a = i] \sum_{p \in \mathcal{P}_{vqa_{i,\ell}}(I, Q)} z_p^q\Big]$
$\mathcal{M} \overset{\mathrm{def}}{=} \Big\{(i, \ell^*) \in [k] \times [2] : \Lambda_{i,\ell^*}^{(0)} \geq \Lambda_{i,3-\ell^*}^{(0)} \Big(\tfrac{S_{i,3-\ell^*}}{S_{i,\ell^*}}\Big)^{\frac{1}{q-2}} \Big(1 + \tfrac{1}{\log^2(m)}\Big)\Big\}$
Assume that the distributions of $\sum_{p \in \mathcal{P}_{vqa}(I, Q)} z_p^q$ for $vqa \in \{vqa_{a,1}, vqa_{a,2}\}$ are the same. The authors of this paper can get:
$\mathcal{M}_{VQA_1} \overset{\mathrm{def}}{=} \Big\{(i, \ell^*) \in [k] \times [2] : \Lambda_{i,\ell^*}^{(0)} \geq \Lambda_{i,3-\ell^*}^{(0)} \Big(1 + \tfrac{2}{\log^2(m)}\Big)\Big\}$
Similar to the single model, for every $(I_i, Q_i, a_i) \in S_d^m$, the authors of this paper can get:
For $(i, \ell) \in \mathcal{M}_{VQA_1}$:
If $vqa_{i,\ell}$ is in $\mathcal{X}_{vqa}(I, Q)$:
$\mathrm{logit}_i^{\tau}(VQA_1, I, Q) \geq \frac{1}{s(I, Q)} - k^{-\Omega(\log k)}$
If $vqa_{i,\ell}$ is not in $\mathcal{X}_{vqa}(I, Q)$:
$\mathrm{logit}_i^{\tau}(VQA_1, I, Q) = k^{-\Omega(\log k)}$
For every $i \in [k]$:
If $vqa_{i,\ell}$ is in $\mathcal{X}_{vqa}(I, Q)$:
$\mathrm{logit}_i^{\tau}(VQA_1, I, Q) \leq \frac{1}{s(I, Q)} + k^{-\Omega(\log k)}$
If neither $vqa_{i,1}$ nor $vqa_{i,2}$ is in $\mathcal{X}_{vqa}(I, Q)$:
$\mathrm{logit}_i^{\tau}(VQA_1, I, Q) = k^{-\Omega(\log k)}$
Similar to the proof of the ensemble model, at the end of self-knowledge distillation, both networks provide the same (near-perfect) accuracy on multi-view data. With the generalization uncertainty, the test accuracy is improved by a factor of $\lambda$.
Therefore, with $|\mathcal{M}_{VQA_1}| \geq k(1 - o(1))$ and $|\mathcal{M}_{VQA_2}| \geq k(1 - o(1))$, and since they are completely independent random sets, the authors of this paper obtain $|\mathcal{M}_{VQA_1} \cup \mathcal{M}_{VQA_2}| \geq \frac{3}{2} k(1 - o(1))$. That means the model with self-distillation has an accuracy of $\frac{3}{4} \lambda (1 - o(1))$. Therefore, the authors of this paper can get:
$\Pr_{(I, Q, a) \sim D}\big[\exists\, i \in [k] \setminus \{a\} : VQA_a^{(T+T')}(I, Q) < VQA_i^{(T+T')}(I, Q)\big] \leq 0.26\, \mu / \lambda$
Proof finished. □

References

  1. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE International Conference On Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  2. Agrawal, A.; Lu, J.; Antol, S.; Mitchell, M.; Zitnick, C.L.; Parikh, D.; Batra, D. Vqa: Visual question answering. Int. J. Comput. Vis. 2017, 123, 4–31. [Google Scholar] [CrossRef]
  3. Teney, D.; Wu, Q.; van den Hengel, A. Visual question answering: A tutorial. IEEE Signal Process. Mag. 2017, 34, 63–75. [Google Scholar] [CrossRef]
  4. Agrawal, A.; Batra, D.; Parikh, D. Analyzing the behavior of visual question answering models. arXiv 2016, arXiv:1606.07356. [Google Scholar]
  5. Zhang, P.; Goyal, Y.; Summers-Stay, D.; Batra, D.; Parikh, D. Yin and yang: Balancing and answering binary visual questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  6. Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Lawrence Zitnick, C.; Girshick, R. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  7. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913. [Google Scholar]
  8. Yuan, D. Language bias in Visual Question Answering: A Survey and Taxonomy. arXiv 2021, arXiv:2111.08531. [Google Scholar]
  9. Brown, G.; Wyatt, J.L.; Tino, P.; Bengio, Y. Managing diversity in regression ensembles. J. Mach. Learn. Res. 2005, 6, 1621–1650. [Google Scholar]
  10. Mehta, P.; Bukov, M.; Wang, C.H.; Day, A.G.; Richardson, C.; Fisher, C.K.; Schwab, D.J. A high-bias, low-variance introduction to machine learning for physicists. Phys. Rep. 2019, 810, 1–124. [Google Scholar] [CrossRef]
  11. Munson, M.A.; Caruana, R. On feature selection, bias-variance, and bagging. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Geramny, 2009; pp. 144–159. [Google Scholar]
  12. Allen-Zhu, Z.; Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv 2020, arXiv:2012.09816. [Google Scholar]
  13. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  14. Yuan, L.; Tay, F.E.; Li, G.; Wang, T.; Feng, J. Revisiting Knowledge Distillation via Label Smoothing Regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3903–3911. [Google Scholar]
  15. Anil, R.; Pereyra, G.; Passos, A.; Ormandi, R.; Dahl, G.E.; Hinton, G.E. Large scale distributed neural network training through online distillation. arXiv 2018, arXiv:1804.03235. [Google Scholar]
  16. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep Mutual Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4320–4328. [Google Scholar]
  17. Lyu, S.; Zhao, Q.; Ma, Y.; Chen, L. Make Baseline Model Stronger: Embedded Knowledge Distillation in Weight-Sharing Based Ensemble Network. 2021. Available online: https://www.bmvc2021-virtualconference.com/assets/papers/0212.pdf (accessed on 17 June 2022).
  18. Lukasik, M.; Bhojanapalli, S.; Menon, A.K.; Kumar, S. Teacher’s pet: Understanding and mitigating biases in distillation. arXiv 2021, arXiv:2106.10494. [Google Scholar]
  19. Allen-Zhu, Z.; Li, Y. Backward feature correction: How deep learning performs deep learning. arXiv 2020, arXiv:2001.04413. [Google Scholar]
  20. Wu, L.; Li, J.; Wang, Y.; Meng, Q.; Qin, T.; Chen, W.; Zhang, M.; Liu, T.Y. R-drop: Regularized dropout for neural networks. Adv. Neural Inf. Process. Syst. 2021, 34, 10890–10905. [Google Scholar]
  21. Wen, Z.; Li, Y. Toward understanding the feature learning process of self-supervised contrastive learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; 2021; pp. 11112–11122. [Google Scholar]
  22. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory, 2004, ISIT 2004, Proceedings, Chicago, IL, USA, 27 June–2 July 2004; p. 31. [Google Scholar]
  23. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef] [Green Version]
  24. Selvaraju, R.R.; Lee, S.; Shen, Y.; Jin, H.; Batra, D.; Parikh, D. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019. [Google Scholar]
  25. Wu, J.; Mooney, R.J. Self-Critical Reasoning for Robust Visual Question Answering. In Proceedings of the Thirty-third Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  26. Li, L.; Gan, Z.; Cheng, Y.; Liu, J. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 10313–10322. [Google Scholar]
  27. Shrestha, R.; Kafle, K.; Kanan, C. A negative case analysis of visual grounding methods for VQA. arXiv 2020, arXiv:2004.05704. [Google Scholar]
  28. Kv, G.; Mittal, A. Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020. [Google Scholar]
  29. Ramakrishnan, S.; Agrawal, A.; Lee, S. Overcoming language priors in visual question answering with adversarial regularization. Adv. Neural Inform. Process. Syst. 2018, 31, 1541–1551. [Google Scholar]
  30. Grand, G.; Belinkov, Y. Adversarial regularization for visual question answering: Strengths, shortcomings, and side effects. arXiv 2019, arXiv:1906.08430. [Google Scholar]
  31. Cadene, R.; Dancette, C.; Ben-younes, H.; Cord, M.; Parikh, D. RUBi: Reducing Unimodal Biases in Visual Question Answering. In Proceedings of the Thirty-Third Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  32. Clark, C.; Yatskar, M.; Zettlemoyer, L. Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases. arXiv 2019, arXiv:1909.03683. [Google Scholar]
  33. Gat, I.; Schwartz, I.; Schwing, A.G.; Hazan, T. Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies. Adv. Neural Inf. Process. Syst. 2020, 33, 3197–3208. [Google Scholar]
  34. Niu, Y.; Tang, K.; Zhang, H.; Lu, Z.; Hua, X.S.; Wen, J.R. Counterfactual vqa: A cause-effect look at language bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12700–12710. [Google Scholar]
  35. Han, X.; Wang, S.; Su, C.; Huang, Q.; Tian, Q. Greedy Gradient Ensemble for Robust Visual Question Answering. In Proceedings of the ICCV 2021, Virtual, 11–17 October 2021. [Google Scholar]
  36. Chen, L.; Yan, X.; Xiao, J.; Zhang, H.; Pu, S.; Zhuang, Y. Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10800–10809. [Google Scholar]
  37. Liang, Z.; Jiang, W.; Hu, H.; Zhu, J. Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 3285–3292. [Google Scholar]
  38. Teney, D.; Abbasnedjad, E.; van den Hengel, A. Learning what makes a difference from counterfactual examples and gradient supervision. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 580–599. [Google Scholar]
  39. Guo, Y.; Nie, L.; Cheng, Z.; Tian, Q. Loss-rescaling VQA: Revisiting Language Prior Problem from a Class-imbalance View. arXiv 2020, arXiv:2010.16010. [Google Scholar]
  40. Gokhale, T.; Banerjee, P.; Baral, C.; Yang, Y. MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 878–892. [Google Scholar]
  41. Teney, D.; Abbasnejad, E.; Kafle, K.; Shrestha, R.; Kanan, C.; van den Hengel, A. On the Value of Out-of-Distribution Testing: An Example of Goodhart’s Law. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Event, 6–12 December 2020; Volume 33, pp. 407–417. [Google Scholar]
  42. Teney, D.; Abbasnejad, E.; Hengel, A.v.d. Unshuffling Data for Improved Generalization. arXiv 2020, arXiv:2002.11894. [Google Scholar]
  43. Guo, Y.; Nie, L.; Cheng, Z.; Ji, F.; Zhang, J.; Del Bimbo, A. Adavqa: Overcoming language priors with adapted margin cosine loss. arXiv 2021, arXiv:2105.01993. [Google Scholar]
  44. Jiang, J.; Liu, Z.; Liu, Y.; Nan, Z.; Zheng, N. X-GGM: Graph Generative Modeling for Out-of-Distribution Generalization in Visual Question Answering. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 199–208. [Google Scholar]
  45. Rashid, J.; Shah, S.M.A.; Irtaza, A. An efficient topic modeling approach for text mining and information retrieval through K-means clustering. Mehran Univ. Res. J. Eng. Technol. 2020, 39, 213–222. [Google Scholar] [CrossRef] [Green Version]
  46. Yim, J.; Joo, D.; Bae, J.; Kim, J. A Gift From Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  47. Feng, Z.; Lai, J.; Xie, X. Resolution-Aware Knowledge Distillation for Efficient Inference. IEEE Trans. Image Process. 2021, 30, 6985–6996. [Google Scholar] [CrossRef] [PubMed]
  48. Rashid, J.; Kim, J.; Hussain, A.; Naseem, U.; Juneja, S. A novel multiple kernel fuzzy topic modeling technique for biomedical data. BMC Bioinform. 2022, 23, 275. [Google Scholar] [CrossRef] [PubMed]
  49. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550. [Google Scholar]
  50. Komodakis, N.; Zagoruyko, S. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of the ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
  51. Tian, Y.; Krishnan, D.; Isola, P. Contrastive Representation Distillation. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  52. Xu, G.; Liu, Z.; Li, X.; Loy, C.C. Knowledge distillation meets self-supervision. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 588–604. [Google Scholar]
  53. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3967–3976. [Google Scholar]
  54. Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; Duan, Y. Knowledge distillation via instance relationship graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7096–7104. [Google Scholar]
  55. Passalis, N.; Tzelepi, M.; Tefas, A. Heterogeneous knowledge distillation using information flow modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2339–2348. [Google Scholar]
  56. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  57. Nakkiran, P.; Bansal, Y. Distributional generalization: A new kind of generalization. arXiv 2020, arXiv:2009.08092. [Google Scholar]
  58. Jiang, Y.; Nagarajan, V.; Baek, C.; Kolter, J.Z. Assessing generalization of sgd via disagreement. arXiv 2021, arXiv:2106.13799. [Google Scholar]
  59. Chuang, C.Y.; Torralba, A.; Jegelka, S. Estimating generalization under distribution shifts via domain-invariant representations. arXiv 2020, arXiv:2007.03511. [Google Scholar]
  60. Jiang, Y.; Krishnan, D.; Mobahi, H.; Bengio, S. Predicting the generalization gap in deep networks with margin distributions. arXiv 2018, arXiv:1810.00113. [Google Scholar]
  61. Jiang, Y.; Neyshabur, B.; Mobahi, H.; Krishnan, D.; Bengio, S. Fantastic generalization measures and where to find them. arXiv 2019, arXiv:1912.02178. [Google Scholar]
  62. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the NIPS 2017, Thirty-First Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  63. Dawid, A.P. The well-calibrated Bayesian. J. Am. Stat. Assoc. 1982, 77, 605–610. [Google Scholar] [CrossRef]
  64. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
  65. Gupta, C.; Podkopaev, A.; Ramdas, A. Distribution-free binary classification: Prediction sets, confidence intervals and calibration. Adv. Neural Inf. Process. Syst. 2020, 33, 3711–3723. [Google Scholar]
  66. Wu, X.; Gales, M. Should ensemble members be calibrated? arXiv 2021, arXiv:2101.05397. [Google Scholar]
  67. Agrawal, A.; Batra, D.; Parikh, D.; Kembhavi, A. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  68. Abbasnejad, E.; Teney, D.; Parvaneh, A.; Shi, J.; van den Hengel, A. Counterfactual vision and language learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10044–10054. [Google Scholar]
  69. Zhang, L.; Liu, S.; Liu, D.; Zeng, P.; Li, X.; Song, J.; Gao, L. Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4362–4373. [Google Scholar] [CrossRef]
  70. Teney, D.; Kafle, K.; Shrestha, R.; Abbasnejad, E.; Kanan, C.; van den Hengel, A. On the Value of Out-of-Distribution Testing: An Example of Goodhart’s Law. arXiv 2020, arXiv:2005.09241. [Google Scholar]
Figure 1. Example of the language bias in VQA. The output of the model is directly affected by the question. For example, the model answers “yellow” to all the questions regarding the color of the banana. If it is a yes-or-no question type, the model tends to simply answer “yes”.
Figure 2. The VQA-CP v2 dataset contains images with multiple views (features). The authors of this paper visualize the features at the same layer of the neural network. For the same question, although the images, views and features differ, the semantic information is the same. Broadly speaking, this “multi-view” structure [12] exists both in the original data and in the feature sets extracted from the middle layer.
Figure 3. The flowcharts of the language bias-driven self-distillation framework, including: (1) language bias-driven self-knowledge distillation and (2) using generalization uncertainty to help student models learn unbiased visual knowledge.
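A rough, illustrative sketch of how parts (1) and (2) of Figure 3 could fit together in a single training step is given below. It assumes PyTorch, a frozen teacher snapshot of the same network supplying the distillation targets, and a per-sample generalization-uncertainty weight that emphasizes questions that language bias alone cannot answer; all identifiers (student_logits, teacher_logits, uncertainty, alpha, T) and the exact loss composition are assumptions made for illustration, not the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels, uncertainty,
                           alpha=0.5, T=1.0):
    """Illustrative self-distillation objective (not the paper's exact loss).

    student_logits: [B, C] predictions of the student VQA model
    teacher_logits: [B, C] predictions of the frozen teacher snapshot
    labels:         [B, C] soft VQA answer targets
    uncertainty:    [B]    per-sample generalization-uncertainty weights in [0, 1]
    """
    # Standard VQA loss on soft answer scores (binary cross-entropy is common).
    task_loss = F.binary_cross_entropy_with_logits(student_logits, labels)

    # Per-sample KL(teacher || student) at temperature T.
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    kl_per_sample = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)

    # Questions that language bias alone cannot answer (high uncertainty)
    # contribute more strongly to the distillation signal.
    distill_loss = (uncertainty * kl_per_sample).mean() * (T ** 2)

    return task_loss + alpha * distill_loss
```

Under this reading, high-uncertainty samples contribute more to the distillation term, which matches the intent of part (2) of the figure.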
Figure 4. Examples used to show the difference between KL divergence and uncertainty.
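For readers who want the distinction in Figure 4 stated numerically, the following minimal example (with made-up distributions, not values from the paper) contrasts KL divergence, which compares two answer distributions, with uncertainty measured as the Shannon entropy of a single distribution.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions (no zero entries assumed)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    """Shannon entropy of a single distribution (a simple uncertainty measure)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

# Hypothetical answer distributions over three candidate answers.
teacher = [0.70, 0.20, 0.10]   # confident prediction
student = [0.40, 0.40, 0.20]   # less certain prediction

print(kl_divergence(teacher, student))      # mismatch between the two distributions
print(entropy(teacher), entropy(student))   # per-distribution uncertainty: student is higher
```

A pair of distributions can have a small KL divergence while each is individually very uncertain, and vice versa, which is why the two quantities play different roles in the framework.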
Figure 5. Qualitative examples from the VQA-CP v2 test set. Wrong and right answers are highlighted in red and green, respectively.
Table 1. Statistics of VQA-CP v2, VQA v2 and VQA-CP v1.
Dataset | VQA-CP v2 [67] (Train / Test / Total) | VQA-CP v1 [67] (Train / Test / Total) | VQA v2 [7] (Train / Test / Total)
Images | 121 K / 98 K / 219 K | 118 K / 87 K / 205 K | 440 K / 214 K / 654 K
Questions | 438 K / 220 K / 658 K | 245 K / 125 K / 370 K | 83 K / 41 K / 124 K
Answers | 4.4 M / 2.2 M / 6.6 M | 2.5 M / 1.3 M / 3.8 M | 4.4 M / 2.1 M / 6.5 M
Table 2. VQA-CP v2: Ablation experiments of the LBSD method on the VQA-CP v2 dataset. * indicates the results from our reimplementation using officially released codes.
Model | Overall | Yes/No | Num | Other
UpDn [56] | 39.74 | 42.27 | 11.93 | 46.05
   +LBSD | 39.99 | 42.76 | 12.36 | 46.12
Bias Product | 39.93 | – | – | –
Bias Product * | 39.86 | 41.96 | 12.59 | 46.25
   +LBSD | 40.47 | 44.28 | 12.28 | 46.21
Reweight | 40.06 | – | – | –
Reweight * | 40.02 | 45.09 | 12.30 | 44.96
   +LBSD | 41.28 | 47.07 | 12.30 | 46.20
LMH | 52.05 | 69.81 | 44.46 | 45.54
   +LBSD | 54.27 | 75.49 | 44.02 | 45.96
Table 3. VQA-CP v1: Ablation experiments of the LBSD method on the VQA-CP v1 dataset. * indicates the results from our reimplementation using officially released codes.
Model | Overall | Yes/No | Num | Other
UpDn [56] | 37.87 | 42.58 | 14.16 | 42.71
   +LBSD | 38.55 | 43.29 | 12.90 | 44.13
Bias Product * | 38.81 | 42.96 | 13.34 | 44.91
   +LBSD | 39.39 | 45.07 | 13.08 | 44.32
Reweight * | 41.46 | 61.52 | 13.02 | 32.94
   +LBSD | 43.66 | 66.63 | 12.23 | 33.45
LMH | 55.27 | 76.47 | 26.66 | 45.68
   +LBSD | 55.93 | 75.43 | 34.17 | 45.28
Table 4. VQA-CP v2: Ablation experiments of the generalization uncertainty method on the VQA-CP v2 dataset. * indicates the results from our reimplementation using officially released codes.
Model | Overall | Yes/No | Num | Other
Reweight | 40.06 | – | – | –
Reweight * | 40.02 | 45.09 | 12.30 | 44.96
   +LBSD without GU | 40.99 | 46.34 | 12.55 | 45.98
   +LBSD with GU | 41.28 | 47.07 | 12.30 | 46.20
LMH | 52.05 | 69.81 | 44.46 | 45.54
   +LBSD without GU | 53.64 | 75.44 | 40.19 | 45.91
   +LBSD with GU | 54.27 | 75.49 | 44.02 | 45.96
Table 5. VQA-CP v2: Heterogeneous student networks. * indicates the results from our reimplementation using officially released codes.
Model | Overall | Yes/No | Num | Other
Bias Product * | 39.86 | 41.96 | 12.59 | 46.25
   +LBSD-with LMH | 40.06 | 43.31 | 12.41 | 45.94
   +LBSD-same stu | 40.47 | 44.28 | 12.28 | 46.21
Reweight * | 40.02 | 45.09 | 12.30 | 44.96
   +LBSD-with LMH | 40.60 | 44.92 | 12.56 | 46.03
   +LBSD-same stu | 41.28 | 47.07 | 12.30 | 46.20
LMH | 52.05 | 69.81 | 44.46 | 45.54
   +LBSD-with Bias product | 53.89 | 75.25 | 43.58 | 45.52
   +LBSD-with Reweight | 53.82 | 75.56 | 41.79 | 45.73
   +LBSD-same stu | 54.27 | 75.49 | 44.02 | 45.96
Table 6. Performance comparison (accuracy, %) of state-of-the-art models on the VQA-CP v2 test set and the VQA v2 val set. Gap Δ denotes the performance difference between VQA v2 and VQA-CP v2.
Model | Venue | VQA-CP v2 Test ↑: All | Yes/No | Num | Other | VQA v2 val ↑: All | Yes/No | Num | Other | Gap Δ: All | Other
GVQA [67] | CVPR’18 | 31.30 | 57.99 | 13.68 | 22.14 | 48.24 | 72.03 | 31.17 | 34.65 | 16.94 | 12.51
UpDn [56] | CVPR’18 | 39.74 | 42.27 | 11.93 | 46.05 | 63.48 | 81.18 | 42.14 | 55.66 | 23.74 | 9.61
Methods based on strengthening visual information:
AttAlign [24] | ICCV’19 | 39.37 | 43.02 | 11.89 | 45.00 | 63.24 | 80.99 | 42.55 | 55.22 | 23.87 | 10.22
HINT [24] | ICCV’19 | 46.73 | 67.27 | 10.61 | 45.88 | 63.38 | 81.18 | 42.99 | 55.56 | 16.55 | 9.68
ReGAT [26] | ICCV’19 | 40.42 | – | – | – | 67.18 | – | – | – | 26.76 | –
VGQE [28] | ECCV’20 | 48.75 | – | – | – | 64.04 | – | – | – | 15.29 | –
ESR [27] | ACL’20 | 48.9 | 69.8 | 11.3 | 47.8 | 62.6 | – | – | – | 13.70 | –
KAN [69] | TNNLS’20 | 42.60 | 42.12 | 15.52 | 50.28 | – | – | – | – | – | –
Methods based on weakening language priors:
AReg [29] | NeurIPS’18 | 41.17 | 65.49 | 15.48 | 35.48 | 62.75 | 79.84 | 42.35 | 55.16 | 21.58 | 19.68
GRL [30] | ACL’19 | 42.33 | 59.74 | 14.78 | 40.76 | 51.92 | – | – | – | 9.59 | –
RUBi [31] | NeurIPS’19 | 45.23 | 64.85 | 11.83 | 44.11 | 50.56 | 49.45 | 41.02 | 53.95 | 5.33 | 9.84
LM [32] | EMNLP’19 | 48.78 | 72.78 | 14.61 | 45.58 | 63.26 | 81.16 | 42.22 | 55.22 | 14.48 | 9.64
LMH [32] | EMNLP’19 | 52.05 | 69.81 | 44.46 | 45.54 | 61.64 | 77.85 | 40.03 | 55.04 | 9.59 | 9.50
CF-VQA [34] | CVPR’21 | 53.55 | 91.15 | 13.03 | 44.97 | 63.54 | 82.51 | 43.96 | 54.30 | 9.99 | 9.33
LBSD-LMH | Ours | 54.27 | 75.49 | 44.02 | 45.96 | 57.95 | 69.22 | 38.19 | 54.63 | 3.68 | 8.67
Methods based on data augmentation:
RandImg [70] | NeurIPS’20 | 55.37 | 83.89 | 41.60 | 44.20 | 57.24 | 76.53 | 33.87 | 48.57 | 1.87 | 4.37
CVL [68] | CVPR’20 | 42.12 | 45.72 | 12.45 | 48.34 | – | – | – | – | – | –
LMH+CSS [36] | CVPR’20 | 58.95 | 84.37 | 49.42 | 48.21 | 59.91 | 73.25 | 39.77 | 55.11 | 0.96 | 6.90
LMH+CSS+CL [37] | EMNLP’20 | 59.18 | 86.99 | 49.89 | 47.16 | 57.29 | 67.27 | 38.40 | 54.71 | 1.89 | 7.55
Unshuffling [42] | ICCV’21 | 42.39 | 47.72 | 14.43 | 47.24 | 61.08 | 78.32 | 42.16 | 52.81 | 18.69 | 5.57
X-GGM [44] | ACM MM’21 | 45.71 | 43.48 | 27.65 | 52.34 | – | – | – | – | – | –
Table 7. Performance comparison (accuracy, %) with state-of-the-art models on the VQA-CP v1 test set. * indicates the results from our reimplementation using officially released codes.
Model | All | Yes/No | Num | Other
GVQA [67] | 39.23 | 64.72 | 11.87 | 24.86
UpDn [56] | 37.87 | 42.58 | 14.16 | 42.71
Group 1
Reweight * [32] | 41.46 | 61.52 | 13.02 | 32.94
LBSD-Reweight | 43.66 | 66.63 | 12.23 | 33.45
Group 2
AReg [29] | 41.17 | 65.49 | 15.48 | 35.48
RUBi [31] | 44.81 | 69.65 | 14.91 | 32.13
LMH [32] | 55.27 | 76.47 | 26.66 | 45.68
LBSD-LMH | 55.93 | 75.43 | 34.17 | 45.28
Group 3
CSS [36] | 60.95 | 85.60 | 40.57 | 44.62
CSS+GS [38] | 58.05 | 78.50 | 37.24 | 46.08
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
