Article

Two-Branch Attention Learning for Fine-Grained Class Incremental Learning

1 College of Electronics and Information Engineering, South-Central University for Nationalities, Wuhan 430074, China
2 Computer Information Systems Department, State University of New York at Buffalo State, Buffalo, NY 14222, USA
* Author to whom correspondence should be addressed.
Electronics 2021, 10(23), 2987; https://doi.org/10.3390/electronics10232987
Submission received: 27 October 2021 / Revised: 21 November 2021 / Accepted: 25 November 2021 / Published: 1 December 2021
(This article belongs to the Special Issue Advancements in Cross-Disciplinary AI: Theory and Application)

Abstract

As a long-standing research area, class incremental learning (CIL) aims to learn a unified classifier effectively as the number of classes grows. Fine-grained visual categorization (FGVC) is a challenging visual task because of its small inter-class variances and large intra-class variances, yet it has not attracted enough attention in CIL. The localization of critical regions specialized for fine-grained object recognition plays a crucial role in FGVC. Additionally, it is important to learn fine-grained features from critical regions in fine-grained CIL for the recognition of new object classes. This paper designs a network architecture named two-branch attention learning network (TBAL-Net) for fine-grained CIL. TBAL-Net can localize critical regions and learn fine-grained feature representations with a lightweight attention module. A training framework for fine-grained CIL is proposed by integrating TBAL-Net into an effective CIL process. This framework is tested on three popular fine-grained object datasets: CUB-200-2011, FGVC-Aircraft, and Stanford-Car. The comparative experimental results demonstrate that the proposed framework achieves state-of-the-art performance on the three fine-grained object datasets.

1. Introduction

In the real world, a visual system may encounter constantly emerging new objects. The system should maintain its recognition performance on existing objects while it keeps learning to recognize new ones [1]. A straightforward approach in computer vision is to finetune a pretrained model, such as VGG [2], Inception [3,4] or ResNet [5], on a new training dataset for the recognition of new objects. However, this may lead to a common issue: catastrophic forgetting. More specifically, a pretrained model finetuned on a new dataset suffers a considerable performance drop on previous datasets. Therefore, class incremental learning (CIL) is proposed to learn a unified classifier for both previous and new object classes. A major cause of catastrophic forgetting is the imbalance between previous and new training data [6,7,8]. Existing CIL methods [9,10,11,12,13] can be divided into three categories: replay-based [9], regularization-based [10,13], and architecture-based [12] methods. In replay-based methods, a tiny exemplar subset of the previous dataset is stored to reduce forgetting. In [9], the samples closest to the average sample of each class are selected and added to the tiny exemplar set. However, there is still considerable room for improvement. As a typical example of regularization-based methods, a distillation regularization term, which encourages the outputs of the current model to be similar to those of the reference model, is introduced into the loss function used in [13]. In [10], several regularization terms such as the forgetting-less constraint and inter-class separation are introduced to rebalance the previous and new data. In architecture-based methods, novel architectures are designed to solve existing issues in CIL, such as the stability–plasticity dilemma [14]. For example, there are two kinds of residual blocks in the adaptive aggregation network (AANet) [12]. Specifically, stable and plastic blocks are designed to preserve the previous knowledge and learn new knowledge, respectively.
Although some existing studies focus on CIL, the datasets used in their experiments, such as CIFAR100 [15] and ImageNet [16], are mostly coarse-grained. In these datasets, there is a wide gap between most of the categories. In other words, the inter-class differences in these datasets are large and easy for a learning system to capture. However, CIL for fine-grained objects has received less attention. Fine-grained visual categorization (FGVC) is a more challenging visual task due to the subtle differences between fine-grained subcategories. The primary goal of FGVC methods is to learn effective fine-grained feature representations. There are two general directions to approach this goal. The first is to localize and crop critical regions for the extraction of fine-grained features, and the second is to directly learn effective fine-grained feature representations in an end-to-end fashion. The main difference between these two directions is whether or not local critical regions are cropped.
In this paper, we focus on the model’s ability to learn fine-grained feature representations under a CIL setting, which has not been extensively studied in prior efforts. Our study strives to answer two questions: (1) How well do existing FGVC models perform in a CIL setting? (2) How can an attention mechanism help boost a model’s performance via better localization and usage of critical regions for fine-grained CIL? To answer the first question, we adopt the CIL process proposed in [10], which divides the training into multiple incremental phases [17]. Initially, a certain number of fine-grained categories are used to train a model. The obtained model is then further trained in the subsequent phases to recognize new fine-grained categories. To prevent catastrophic forgetting, a fixed-size exemplar set is kept and updated. During incremental learning, samples from the exemplar set also participate in training to refresh the memory of the model. Existing FGVC models are plugged into this CIL framework for evaluation. To answer the second question, we design a novel neural architecture named two-branch attention learning network (TBAL-Net), which leverages attention modules to better localize critical regions. These highlighted parts are further cropped and fed back into the backbone network to boost feature learning.
In summary, the core contribution of this study is the proposal of TBAL-Net for fine-grained CIL. TBAL-Net focuses on mining features from an object’s critical regions, which can be effectively learned through an attention mechanism. A series of experiments has been conducted to validate the efficacy of TBAL-Net in comparison with several baseline models on three widely used FGVC datasets: CUB-200-2011 [18], FGVC-Aircraft [19], and Stanford Cars [20]. Results demonstrate that TBAL-Net achieves consistent gains in top-1 accuracy compared to its peers. In addition, we have quantitatively verified the positive effect of the attention module via an ablation study.
The rest of this paper is organized as follows. Section 2 provides a review of relevant studies in CIL and FGVC tasks. Section 3 presents a detailed description of the proposed TBAL-Net model. Section 4 reports the experimental design, results, and analysis. Lastly, we summarize this work in Section 5.

2. Related Work

2.1. Class Incremental Learning

Class incremental learning (CIL) [7] methods aim to learn effective feature representations for both previous and new classes. “Catastrophic forgetting” [21,22], which is frequently reported for deep neural networks (DNNs) in CIL, refers to the degradation of performance on a previous dataset when the model is trained to adapt to a new dataset. Such a performance bias is widespread in current CIL approaches. In LwF [13], a knowledge distillation regularization term is first introduced into the loss function to retain the knowledge learnt from the previous training data. Knowledge distillation refers to distilling knowledge from a cumbersome teacher model and infusing it into a lightweight student model; it has been applied extensively in teacher-student learning [23,24,25,26] and can improve model generalization [27,28,29]. As shown in [9], LwF tends to favor new classes in the inference phase. To solve this problem, iCaRL [9] proposes a classification strategy called nearest-mean-of-exemplars. In this strategy, a prototype is computed by averaging the features extracted from all samples of the same class. In the inference phase, each testing sample is assigned the class label of its most similar prototype. iCaRL also constructs an exemplar set with a fixed memory size. The samples closest to the average prototype of each class are stored in the exemplar set. Although iCaRL has improved the performance of CIL, it still shows a bias toward new classes. The main reason is the imbalance between the previous and new classes. To further improve the performance, Ref. [10] introduced several regularization terms such as the forgetting-less constraint and inter-class separation to address the imbalance between previous and new classes. Besides this, Ref. [12] proposed a new network architecture with stable and plastic blocks to deal with the stability–plasticity dilemma in CIL.
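The nearest-mean-of-exemplars strategy described above can be illustrated with a minimal NumPy sketch. The function names, the toy 2-D features, and the use of Euclidean distance as the similarity measure are assumptions made for illustration; this is not the iCaRL implementation.

```python
import numpy as np

def class_prototypes(features, labels):
    """Average the feature vectors of each class into one prototype."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nme_predict(queries, classes, protos):
    """Assign each query the label of its nearest prototype.

    Euclidean distance is an assumption of this sketch.
    """
    # Pairwise distances, shape (num_queries, num_classes).
    dists = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Toy 2-D features for two classes (made-up numbers).
features = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]])
labels = np.array([0, 0, 1, 1])
classes, protos = class_prototypes(features, labels)
predictions = nme_predict(np.array([[0.1, 0.1], [0.9, 1.0]]), classes, protos)
```

In a real pipeline the features would come from the backbone network rather than being hand-written, and the prototypes would be recomputed whenever the exemplar set changes.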

2.2. Fine-Grained Visual Categorization

In FGVC, it is important to localize the critical regions for the recognition of fine-grained objects [30,31,32]. Localization subnetworks [33,34,35] are therefore designed in many existing FGVC methods. In [33], a navigator–teacher–scrutinizer network (NTS-Net) was proposed as a multi-agent learning framework to learn fine-grained features and localize informative regions simultaneously without any bounding boxes or part annotations. Ref. [34] proposed a network architecture called multi-branch and multi-scale attention learning (MMAL) for the localization of critical regions, which uses fewer parameters than the previous work. In MMAL, a large critical part is first localized, and then subtle critical parts are localized at multiple scales. In [35], a recurrent attention convolutional neural network (RA-CNN) was proposed to recursively learn discriminative region attention and region-based features at multiple scales.

3. Proposed Method

In this paper, a two-branch attention learning network (TBAL-Net) is designed for the recognition of fine-grained objects in fine-grained CIL. The network first trains in the initial phase and then learns to recognize new fine-grained objects with additional training data. In this section, the architecture of TBAL-Net is introduced in Section 3.1. Then the details of the CIL process applied to the experiments are discussed in Section 3.2.
The overall process of our method can be summarized as follows. TBAL-Net, with its backbone CNN pretrained on ImageNet, is trained on the existing classes in the initial phase. To mitigate catastrophic forgetting, an exemplar set with a fixed memory size is constructed. The samples selected for this exemplar set are those most similar to the prototype of each class. Catastrophic forgetting is addressed in CIL by finetuning the model on this rebalanced exemplar set. To further improve the performance of CIL, distillation regularization, forgetting-less constraint regularization, and inter-class separation regularization are introduced as in [10].

3.1. Network Architecture

The architecture of TBAL-Net is shown in Figure 1. Two attention modules are added to the backbone. The parameters used in the backbone network of TBAL-Net are defined as $P_{backbone}$, and the parameters used in the attention modules are defined as $P_{atten,i}$, $i = 1, 2, \dots, n$, where $n$ is the number of attention modules applied to TBAL-Net.
Attention module. Similar to [36], the attention module contains channel and spatial attention modules. The feature map is defined as $F \in \mathbb{R}^{C \times H \times W}$. In the channel attention module, average and maximum pooling operations are applied over the spatial dimensions. The results of the two pooling operations are defined as $F^{C}_{avg} \in \mathbb{R}^{C \times 1 \times 1}$ and $F^{C}_{max} \in \mathbb{R}^{C \times 1 \times 1}$. Each pooling result is first fed into a multilayer perceptron (MLP). The outputs of the MLP are then summed, and a sigmoid function is applied to the sum. In short, the output of the channel attention module is defined as
$$C_{attention} = \mathrm{sigmoid}(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$$
In the spatial attention module, average and maximum pooling operations are applied over the channel dimension. The results of the two pooling operations are defined as $F^{S}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{S}_{max} \in \mathbb{R}^{1 \times H \times W}$. These two maps are concatenated along the channel dimension. The resulting feature map is first convolved by a standard convolutional layer. The output of the convolutional layer is then fed into a sigmoid function. In short, the output of the spatial attention module is defined as
$$S_{attention} = \mathrm{sigmoid}(f^{s \times s}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]))$$
where $f^{s \times s}$ denotes a convolution with a kernel of size $s \times s$.
In the experiments, channel and spatial attention modules are integrated into a ResBlock, as shown in Figure 2.
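To make the two attention formulas concrete, the following NumPy sketch computes a channel attention map and a spatial attention map for a random feature map. It is a minimal illustration under several assumptions that the text does not state: a bias-free two-layer MLP with ReLU that is shared by both pooled vectors, zero padding in the convolution, and applying channel attention before spatial attention. The reduction ratio of 16 and the kernel size of 3 are taken from the implementation details in Section 4.3; all weights are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W1, W2):
    """C_attention = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    avg = F.mean(axis=(1, 2))  # pool over the spatial dimensions, shape (C,)
    mx = F.max(axis=(1, 2))    # shape (C,)

    def mlp(v):
        # Shared two-layer MLP with ReLU and no bias (assumption).
        return W2 @ np.maximum(W1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx)).reshape(-1, 1, 1)  # (C, 1, 1)

def spatial_attention(F, kernel):
    """S_attention = sigmoid(conv_{s x s}([AvgPool(F); MaxPool(F)]))."""
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])  # pool over channels, (2, H, W)
    s = kernel.shape[-1]
    pad = s // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    H, W = F.shape[1:]
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(padded[:, y:y + s, x:x + s] * kernel)
    return sigmoid(out)[None]  # (1, H, W)

rng = np.random.default_rng(0)
C, H, W, r, s = 32, 14, 14, 16, 3            # r: reduction ratio, s: kernel size
F = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C)) * 0.1  # placeholder MLP weights
W2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, s, s)) * 0.1

ca = channel_attention(F, W1, W2)
refined = ca * F                              # re-weight channels
sa = spatial_attention(refined, kernel)
refined = sa * refined                        # re-weight spatial positions
```

Both maps lie strictly between 0 and 1 and broadcast against the feature map, so the refined output keeps the shape of the input.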
Attention part localization module (APLM). In APLM, activation maps are used to localize the critical regions. Activations in the convolutional layer can be considered a measure of the informativeness of regions with a certain window size. In [34], the mean activation values are computed according to
$$\bar{a}_w = \frac{\sum_{x=0}^{W_w - 1} \sum_{y=0}^{H_w - 1} F_w(x, y)}{H_w \times W_w}$$
where $H_w$ and $W_w$ are the height and width of a feature map in a window. As an informativeness measure, the mean activation values of all windows are sorted to localize the most informative regions. To reduce region redundancy, non-maximum suppression (NMS) is adopted to select a fixed number of windows with different scales. In TBAL-Net, the parameters of the backbone CNN and the FC layer are shared by the two branches.
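The window scoring and NMS steps can be sketched as follows. This is an illustrative sketch, not the code of [34]: the activation map is assumed to be a single 2-D map (for example, a feature map already aggregated over channels), windows slide with stride 1, and the IoU threshold of 0.25 is a placeholder value.

```python
import numpy as np

def window_scores(act_map, win_h, win_w):
    """Mean activation of every win_h x win_w window (stride 1)."""
    H, W = act_map.shape
    boxes, scores = [], []
    for y in range(H - win_h + 1):
        for x in range(W - win_w + 1):
            boxes.append((x, y, x + win_w, y + win_h))
            scores.append(act_map[y:y + win_h, x:x + win_w].mean())
    return np.array(boxes), np.array(scores)

def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, keep, iou_thresh=0.25):
    """Greedily keep the highest-scoring windows that do not overlap too much."""
    selected = []
    for idx in np.argsort(-scores):
        if all(iou(boxes[idx], boxes[j]) <= iou_thresh for j in selected):
            selected.append(int(idx))
        if len(selected) == keep:
            break
    return selected

# Toy 14 x 14 activation map with a strong and a weaker hot spot.
act_map = np.zeros((14, 14))
act_map[2:6, 2:6] = 1.0
act_map[8:12, 8:12] = 0.5

boxes, scores = window_scores(act_map, win_h=4, win_w=4)
selected = nms(boxes, scores, keep=2)
top_boxes = boxes[selected].tolist()
```

On this toy map the first window covers the strong hot spot; every slightly shifted copy of it is suppressed, so the second selected window is the one covering the weaker hot spot.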
The category probabilities of the two branches in TBAL-Net are defined as $P_r$ and $P_p$. The loss function is defined as
$$L_{total} = L_{raw} + L_{parts}$$
where
$$L_{raw} = -\log(P_r(c))$$
$$L_{parts} = -\sum_{i=0}^{N-1} \log(P_p^{(i)}(c))$$
Here, $c$ denotes the ground-truth class of the input image and $N$ is the number of part images.
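As a small worked example of the two-branch loss, the sketch below evaluates it for one image, assuming each term is the usual negative log-likelihood of the ground-truth class `c`. The probability values are made up for illustration.

```python
import math

def total_loss(p_raw, p_parts, c):
    """L_total = L_raw + L_parts for one image with ground-truth class c."""
    l_raw = -math.log(p_raw[c])
    l_parts = -sum(math.log(p[c]) for p in p_parts)
    return l_raw + l_parts

# Made-up class probabilities for the raw branch and two part images.
p_raw = [0.7, 0.2, 0.1]
p_parts = [[0.6, 0.3, 0.1], [0.5, 0.25, 0.25]]
loss = total_loss(p_raw, p_parts, c=0)
```

Here the loss equals $-\log(0.7 \times 0.6 \times 0.5) = -\log(0.21)$, so every part image that is confidently classified lowers the total.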

3.2. Class Incremental Learning

In this paper, TBAL-Net is integrated into the CIL process introduced in [10]. The data of the previous classes $C_o$ is defined as $X_o$, and the data of the new classes $C_n$ is defined as $X_n$. As shown in Figure 3, CIL can be considered an $(N + 1)$-phase training process, i.e., one initial phase and $N$ incremental phases. In the initial phase, the training data $X_o$ is available for training the TBAL-Net parameterized by $\theta_o$. The FC layer of TBAL-Net is initialized as a fully connected layer. After the initial phase, only a small subset of $X_o$ can be stored in an exemplar set with a fixed size. In the following $N$ incremental phases, all samples from the new classes and the previously selected exemplar set are used to train the model. The output of the FC layer in TBAL-Net is extended to $|C_o| + |C_n|$.
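Extending the output of the FC layer amounts to appending one weight row per new class while leaving the rows of the previous classes untouched. The sketch below shows this with NumPy; the small random initialization of the new rows is an assumption, the feature dimension of 2048 corresponds to a ResNet50 backbone, and the class counts (50 old, 10 new) mirror the CUB-200-2011 setting described in Section 4.3.

```python
import numpy as np

def extend_fc(weight_old, num_new, rng):
    """Append rows for new classes; rows of old classes are kept unchanged."""
    feat_dim = weight_old.shape[1]
    # Small random initialization of the new rows (assumption of this sketch).
    weight_new = rng.standard_normal((num_new, feat_dim)) * 0.01
    return np.concatenate([weight_old, weight_new], axis=0)

rng = np.random.default_rng(0)
fc = rng.standard_normal((50, 2048)) * 0.01   # |C_o| = 50 classes after the initial phase
fc_ext = extend_fc(fc, num_new=10, rng=rng)   # |C_n| = 10 new classes
```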
To balance magnitudes across all classes, cosine normalization is applied to the last layer of TBAL-Net as follows.
$$p_i(x) = \frac{\exp(\lambda \langle \bar{\theta}_i, \bar{f}(x) \rangle)}{\sum_j \exp(\lambda \langle \bar{\theta}_j, \bar{f}(x) \rangle)},$$
where $\bar{v} = v / \|v\|_2$ denotes the $l_2$-normalized vector, $\langle \cdot, \cdot \rangle$ denotes the cosine similarity between two vectors, and $\lambda$ is a learnable scale parameter, which is introduced to control the peak of the softmax distribution, since the cosine similarity is restricted to $[-1, 1]$. $\lambda$ can be updated through back propagation. Through cosine normalization, all scores before the softmax are in the same range and thus are comparable. The distillation loss in [10] is defined as
$$L_{dis}(x) = 1 - \langle \bar{f}^{*}(x), \bar{f}(x) \rangle,$$
where $\bar{f}^{*}(x)$ and $\bar{f}(x)$ are the normalized features extracted by the original and current models, respectively. Different from the distillation loss used in LwF [13], this term encourages the orientation of the features extracted by the current network to be similar to that of the features extracted by the original model.
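The cosine-normalized classifier and the feature-level distillation loss can be sketched in a few lines of NumPy. In this sketch the scale `lam` is a fixed number, whereas the text treats it as a learnable parameter; the weights and features are toy values chosen so that the cosine similarities are easy to read.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def cosine_softmax(theta, f, lam):
    """p_i(x) = softmax_i(lam * <theta_i_bar, f_bar>)."""
    logits = lam * (l2_normalize(theta) @ l2_normalize(f))
    logits = logits - logits.max()  # subtract the maximum for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def distillation_loss(f_old, f_new):
    """L_dis = 1 - <f_old_bar, f_new_bar>; zero when the orientations agree."""
    return 1.0 - float(l2_normalize(f_old) @ l2_normalize(f_new))

theta = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # toy class weights
f = np.array([3.0, 0.0])                                  # toy feature

p = cosine_softmax(theta, f, lam=10.0)
# Rescaling weights or features leaves the probabilities unchanged.
p_rescaled = cosine_softmax(5.0 * theta, 2.0 * f, lam=10.0)

same_direction = distillation_loss(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orthogonal = distillation_loss(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
```

The invariance to rescaling is the point of the normalization: a class whose weight vector happens to have a large norm no longer receives systematically larger scores.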
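The inter-class separation regularization term is defined as
$$L_{mr}(x) = \sum_{k=1}^{K} \max(m - \langle \bar{\theta}(x), \bar{f}(x) \rangle + \langle \bar{\theta}_k, \bar{f}(x) \rangle, 0),$$
where, following [10], $m$ is the margin, $\bar{\theta}(x)$ is the normalized weight of the ground-truth class of $x$, and $\bar{\theta}_k$ is the normalized weight of one of the $K$ new classes that respond most strongly to $x$.
The objective of this process is defined as
$$L = \frac{1}{|\mathcal{N}|} \sum_{x \in \mathcal{N}} (L_{ce}(x) + \pi L_{dis}(x)) + \frac{1}{|\mathcal{N}_o|} \sum_{x \in \mathcal{N}_o} L_{mr}(x),$$
where $L_{ce}$ is the standard cross-entropy loss, $\pi$ weights the distillation term, and, following [10], $\mathcal{N}$ is a training batch and $\mathcal{N}_o \subset \mathcal{N}$ contains the samples of previous classes in that batch.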
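The sketch below evaluates the inter-class separation term for one sample of a previous class and then combines per-sample losses into the batch objective. It assumes the reading used in [10], where the negatives are the $K$ new classes with the highest cosine similarity to the sample; the margin `m = 0.5`, `K = 2`, the weight `pi = 5.0`, and all similarity and loss values are illustrative placeholders rather than settings reported in this paper.

```python
import numpy as np

def margin_ranking_loss(cos_sims, gt, new_class_ids, m=0.5, K=2):
    """L_mr for one sample of a previous class.

    cos_sims holds the cosine similarity between the sample's feature and
    every class weight; the K most similar new classes act as negatives.
    """
    pos = cos_sims[gt]
    hardest = np.sort(cos_sims[new_class_ids])[::-1][:K]
    return float(np.maximum(m - pos + hardest, 0.0).sum())

def objective(l_ce, l_dis, l_mr_old, pi):
    """Mean of (L_ce + pi * L_dis) over the batch plus mean L_mr over old-class samples."""
    l_ce, l_dis, l_mr_old = (np.asarray(a, dtype=float) for a in (l_ce, l_dis, l_mr_old))
    total = float(np.mean(l_ce + pi * l_dis))
    if l_mr_old.size:  # a batch may contain no old-class samples
        total += float(np.mean(l_mr_old))
    return total

cos_sims = np.array([0.9, 0.1, 0.6, 0.2, -0.3])  # classes 0-1 old, 2-4 new
l_mr = margin_ranking_loss(cos_sims, gt=0, new_class_ids=[2, 3, 4])

# Batch of two samples, one of which belongs to a previous class.
batch_loss = objective(l_ce=[1.0, 2.0], l_dis=[0.1, 0.3], l_mr_old=[l_mr], pi=5.0)
```

Only the negative with similarity 0.6 violates the margin (0.5 - 0.9 + 0.6 = 0.2), so the term is 0.2, and the batch objective is 2.5 + 0.2 = 2.7.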

4. Experiments

4.1. Datasets

Experiments in this paper are conducted on three popular fine-grained object datasets, i.e., CUB-200-2011 [18], FGVC-Aircraft [19], and Stanford Cars [20]. The details of these three datasets are introduced in Table 1. In the experiments, only image labels are used without involving any part annotations.
  • CUB-200-2011. It is the most widely used fine-grained visual categorization dataset. For each subcategory, about 30 images are used for training and 11–30 images for testing.
  • Stanford Cars. In this dataset, each subcategory contains 24–84 images for training and 24–83 images for testing.
  • FGVC-Aircraft. This dataset is organized into a three-layer label structure. The three layers, from bottom to top, consist of 100 variants, 70 families, and 30 manufacturers, respectively. It is split into 6667 training images and 3333 test images. In the experiments, we considered the case of dividing the images into 70 families.

4.2. Baselines

In the experiments, both a traditional CNN, ResNet50, and several FGVC methods, such as NTS-Net and MMAL, are evaluated. NTS-Net contains more parameters than MMAL. Following the discussion in Section 3.2, the FC layers in all of these baselines are extended in the experiments.
  • ResNet50. As a traditional CNN architecture, ResNet50 [5] pretrained on ImageNet is chosen as a feature extractor. The pretrained FC layer is removed from the architecture and a new, randomly initialized FC layer is added to the network. Following the experimental setting in [10], when adopting cosine normalization in the last layer, the ReLU in the penultimate layer is removed to allow the features to take both positive and negative values.
  • NTS-Net. Critical regions with different sizes and aspect ratios are automatically selected by a region proposal network. NTS-Net fuses both local and global features for recognition. ResNet50 is the backbone network of NTS-Net. The number of proposal regions is set to 3. In the experiments, the number of learnable parameters in NTS-Net is about 2.8 M. The backbone network in NTS-Net is pretrained on the ImageNet dataset. The final feature is obtained through the summation of global and local features. When NTS-Net is trained on the initial training data, cosine normalization is also added to the last layer. When facing new classes, trainable parameters are added to the FC layer for training.
  • MMAL. The backbone of MMAL is also ResNet50, which has been pretrained on the ImageNet dataset. In the attention object location module (AOLM), the outputs of Conv5_b and Conv5_c are used for the localization of objects. In the attention part proposal module (APPM), the settings for each dataset are the same as the settings used in this paper. In the experiments, the number of learnable parameters in MMAL is about 2.6 M. Similar to the setting in NTS-Net, the final feature in MMAL is also obtained through the summation of global and part features. The trainable parameters in the FC layer, which is shared by the global and local branches, are also added when facing new classes.

4.3. Implementation

In this paper, we apply TBAL-Net to the CIL framework proposed by [10]. All experiments are implemented with PyTorch and trained on a PC with four TITAN-X GPUs.
We adopt the ResNet50 pretrained on ImageNet as the backbone in TBAL-Net. For all three datasets, the learning rate starts from 0.001 and is divided by 10 every 30 epochs (90 epochs in total). TBAL-Net is trained by SGD with a batch size of 32 (8 for each GPU). In the training phase, the input image is first resized to 512 × 512, and then randomly flipped and cropped to a region of size 448 × 448. In the testing phase, the input image is first resized to 512 × 512, and then a region of size 448 × 448 is cropped from the image center. The part images are resized to 224 × 224. Three broad categories of window scales, {[4 × 4, 3 × 5], [6 × 6, 5 × 7], [8 × 8, 6 × 10, 7 × 9, 7 × 10]}, are constructed for the 14 × 14 feature map. The number of part images of a raw image is $N = 7$, of which $N_1 = 2$, $N_2 = 3$, and $N_3 = 2$. The number of parts in TBAL-Net is set to be the same as in [34]. The reason is that the regions selected based on the activations are basically stable. Moreover, the candidate regions processed by NMS contain meaningless regions. If the number of regions is too large, meaningless regions will be input into the model and affect the final performance. The optimal values of the hyperparameters in TBAL-Net, such as the reduction ratio and the size of the convolutional kernel in the attention module, are obtained empirically. The reduction ratio is set to 16 and the size of the convolutional kernel is set to 3. Under this setting, the number of trainable parameters of TBAL-Net is about 2.7 M, which sits between MMAL (2.6 M parameters) and NTS-Net (2.8 M parameters). The additional 0.1 M parameters, compared to MMAL, come mainly from the integration of the attention module, which has provided a satisfying return on investment in performance, as demonstrated in Section 4.5.
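The learning-rate schedule stated above (start at 0.001, divide by 10 every 30 epochs, 90 epochs in total) can be written out as a short sketch. The function name and its parameter names are illustrative; the per-GPU batch size simply follows from a batch of 32 split across four GPUs.

```python
def learning_rate(epoch, base_lr=0.001, step=30, gamma=0.1):
    """Step decay: the rate is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# One learning rate per epoch for the 90-epoch schedule.
schedule = [learning_rate(epoch) for epoch in range(90)]

# Batch of 32 split evenly across four GPUs.
per_gpu_batch = 32 // 4
```

In PyTorch, the same decay is commonly expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)`; whether the authors used that scheduler is not stated.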
Similar to [10], there are three components in our CIL method: cosine normalization (CN), less-forget constraint (LC), and inter-class separation (IS). In the results, (CN+LC+IS) means that all three components are applied. Two different classification strategies are used in the experiments: CNN predictions (CNN) and nearest-mean-of-exemplars (NME). The top-1 accuracy of both predictions is shown in the results.
In the CIL of CUB-200-2011, 50 classes are randomly selected as the initial training set for training the proposed TBAL-Net and all baselines. In each incremental phase, 10 new classes are fed into the models to train them to recognize new classes. In the construction of the exemplar set, the 20 samples closest to the average prototype of each class are selected. In the CIL of FGVC-Aircraft, half of the total classes (35 classes) are randomly selected as the initial training set for training the proposed TBAL-Net and all baselines. In each incremental phase, five new classes are fed into the models to train them to recognize new aircraft families. Following the strategy used in the CIL of CUB-200-2011, 20 samples per class are selected during the construction of the exemplar set. In the CIL of Stanford-Car, half of the total classes (98 classes) are randomly selected as the initial training set for training the proposed TBAL-Net and all baselines. In each incremental phase, 14 classes are fed into the models to train them to recognize new cars. Twenty samples per class are selected during the construction of the exemplar set. There is no constraint on the total size of the exemplar set in our experiments. It is worth noting that the outputs of the global branch of TBAL-Net were used as the basis for constructing the exemplar set, in order to avoid feature representation errors caused by localization errors in local regions. The results for each phase on the three fine-grained datasets are shown in the columns of Table 2, Table 3 and Table 4.
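The exemplar selection rule and the class schedule of the three protocols can be sketched as follows. The selection function directly picks the samples nearest to the class-mean feature, which is a simplification of the description above. The phase plan assumes that training continues until every class of the dataset has been seen; the totals of 70 families and 196 cars follow from the text, while using all 200 classes of CUB-200-2011 is an assumption.

```python
import numpy as np

def select_exemplars(features, per_class=20):
    """Indices of the samples closest to the mean feature of one class."""
    mean = features.mean(axis=0)
    dists = np.linalg.norm(features - mean, axis=1)
    return np.argsort(dists)[:per_class]

def phase_plan(total_classes, initial, per_phase):
    """Number of classes seen after the initial phase and after each incremental phase."""
    return list(range(initial, total_classes + 1, per_phase))

# Toy features of one class (made-up numbers); keep the two closest to the mean.
toy = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [0.5, 0.0]])
kept = select_exemplars(toy, per_class=2).tolist()

cub = phase_plan(200, initial=50, per_phase=10)
aircraft = phase_plan(70, initial=35, per_phase=5)
cars = phase_plan(196, initial=98, per_phase=14)
```

Under these assumptions the plans contain 15, 7, and 7 incremental phases after the initial phase, respectively.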

4.4. Ablation Study

To validate the design choices, we evaluated the proposed model under different settings. Specifically, the effects of the attention module and the number of incremental phases are evaluated:
  • Impact of attention module. Table 2, Table 3 and Table 4 present the top-1 accuracy of models with and without the attention module. Models with the attention module have a suffix -ATT. It is observed that the addition of the attention mechanism leads to consistent performance improvement for all three datasets in all incremental phases, demonstrating a reliable boosting effect.
  • Impact of incremental phases. Figure 4, Figure 5 and Figure 6 show the experimental results for different numbers of incremental phases. For each dataset, we chose two levels, corresponding to a low and a high number of incremental phases, as shown in subfigures (a) and (b), respectively. The models perform better with a lower number of incremental phases on all datasets, which is explainable by the nature of CIL. Essentially, the more incremental phases there are, the fewer classes there are per phase, and the harder it is for models to retain the features and patterns learned in previous phases.

4.5. Results

Evaluation on CUB-200-2011. Table 5 and Table 6 and Figure 4 show the experimental results on CUB-200-2011. Both the CNN and NME predictions of the proposed TBAL-Net outperform the baselines. MMAL performs better than the other methods in the initial two phases, showing its ability to capture more distinguishable patterns at the beginning of CIL training, when the number of classes is relatively low. However, as more new classes participate in training, the proposed TBAL-Net starts to outperform its peers: once the number of classes goes beyond 70, the TBAL-Net-(CN-LC-IS)-CNN model consistently outperforms the other methods. Furthermore, TBAL-Net with the CNN prediction is better than TBAL-Net with the NME prediction, showing that the former has a superior ability to extract discriminative fine-grained features. Moreover, the localization modules designed for FGVC, such as the region proposal network (RPN) in NTS-Net, may not be suitable for fine-grained CIL. The potential reasons are: (1) the RPN is randomly initialized, and (2) due to the limited data size per category, the RPN may not be trained well.
Evaluation on FGVC-Aircraft. Table 7 and Figure 5b show the experimental results on FGVC-Aircraft. The results are similar to those on CUB-200-2011. In the initial two phases, MMAL performs better, and TBAL-Net catches up from the third phase, when the number of classes reaches 45. The performance differences among the compared models are also relatively small. This may be caused by the lower number of classes in the FGVC-Aircraft dataset, which leads to more samples per class, allowing each model to learn more informative features.
Evaluation on Stanford-Car. Table 8 and Figure 6a show the experimental results on Stanford-Car. Similar to the previous two datasets, the only competitor of TBAL-Net is MMAL, which achieves a better accuracy than TBAL-Net at phase 2. However, TBAL-Net outperforms MMAL and the other models in all of the other incremental phases.

5. Conclusions

In this paper, we propose TBAL-Net, which contains an attention module similar to [36] and a localization module for fine-grained CIL. We adopt the CIL framework introduced in [10] and evaluate ResNet50, NTS-Net, MMAL, and the proposed TBAL-Net on three fine-grained object datasets: CUB-200-2011 [18], FGVC-Aircraft [19], and Stanford Cars [20]. As an effective FGVC method, MMAL can achieve better performance than other methods in the initial phase but has lower performance than the proposed TBAL-Net in the later incremental phases. NTS-Net can also achieve good performance in the initial phase, but its performance is lower than that of the other methods on all three fine-grained object datasets. As a traditional and effective network architecture, ResNet50 outperforms both NTS-Net and MMAL on Stanford-Car in CIL. These results lead to the following conclusions:
(1) The localization modules designed for FGVC, such as the region proposal network (RPN) in NTS-Net, may not be suitable for fine-grained CIL. The RPN is randomly initialized and, due to the limited data size, may not be trained well.
(2) The localization modules in MMAL add only a few parameters, and MMAL achieves good performance on FGVC. However, MMAL does not have enough learning ability in fine-grained CIL.
(3) The attention module similar to [36] is effective in fine-grained CIL. Therefore, TBAL-Net can extract many discriminative fine-grained features in the experiments, as shown in the NME predictions.

Author Contributions

Conceptualization, J.G. and X.L.; methodology, G.Q.; software, J.G.; validation, G.Q., S.X. and X.L.; writing—original draft preparation, J.G.; writing—review and editing, X.L. and G.Q.; supervision, S.X. and X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Ministry of Education 2021 University-Industry Cooperation Project, China (202002018063, 9 November 2020–9 November 2021), under the project entitled “Virtual prototype-based autonomous driving of miniature intelligent vehicles.” The funding agency has no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability Statement

The CUB-200-2011 [18] (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html (accessed on 2 June 2021)), FGVC-Aircraft [19] (https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/ (accessed on 2 June 2021)) and Stanford-Cars [20] (https://ai.stanford.edu/~jkrause/cars/car_dataset.html (accessed on 2 June 2021)) data sets used in this work are publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dang, S.; Cao, Z.; Cui, Z.; Pi, Y.; Liu, N. Class Boundary Exemplar Selection Based Incremental Learning for Automatic Target Recognition. IEEE Trans. Geosci. Remote Sens. 2020, 58, 5782–5792.
  2. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
  3. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
  4. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  6. Shmelkov, K.; Schmid, C.; Alahari, K. Incremental Learning of Object Detectors without Catastrophic Forgetting. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
  7. Zhang, J.; Zhang, J.; Ghosh, S.; Li, D.; Tasci, S.; Heck, L.; Zhang, H.; Kuo, C.-C.J. Class-incremental Learning via Deep Model Consolidation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020.
  8. Masana, M.; Liu, X.; Twardowski, B.; Menta, M.; Bagdanov, A.D.; van de Weijer, J. Class-incremental learning: Survey and performance evaluation on image classification. arXiv 2020, arXiv:2010.15277.
  9. Rebuffi, S.-A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
  10. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Learning a Unified Classifier Incrementally via Rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
  11. Castro, F.M.; Marín-Jiménez, M.J.; Guil, N.; Schmid, C.; Alahari, K. End-to-End Incremental Learning; Springer: Singapore, 2018.
  12. Liu, Y.; Schiele, B.; Sun, Q. Meta-Aggregating Networks for Class-Incremental Learning. arXiv 2020, arXiv:2010.05063.
  13. Li, Z.; Hoiem, D. Learning without Forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2935–2947.
  14. Mermillod, M.; Bugaiska, A.; Bonin, P. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Front. Psychol. 2013, 4, 504.
  15. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
  16. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
  17. Liu, Y.; Schiele, B.; Sun, Q. Adaptive Aggregation Networks for Class-Incremental Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
  18. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011. Available online: https://authors.library.caltech.edu/27452/ (accessed on 20 November 2021).
  19. Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.; Vedaldi, A. Fine-Grained Visual Classification of Aircraft. arXiv 2013, arXiv:1306.5151.
  20. Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D Object Representations for Fine-Grained Categorization. In Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, Sydney, NSW, Australia, 2–8 December 2013.
  21. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
  22. Kemker, R.; Abitino, A.; McClure, M.; Kanan, C. Measuring Catastrophic Forgetting in Neural Networks. arXiv 2018, arXiv:1708.02072.
  23. Yuan, L.; Tay, F.E.; Li, G.; Wang, T.; Feng, J. Revisiting Knowledge Distillation via Label Smoothing Regularization. arXiv 2020, arXiv:1909.11723.
  24. Shi, Y.; Hwang, M.-Y.; Lei, X.; Sheng, H. Knowledge Distillation for Recurrent Neural Network Language Modeling with Trust Regularization. arXiv 2019, arXiv:1904.04163.
  25. Yun, S.; Park, J.; Lee, K.; Shin, J. Regularizing Class-Wise Predictions via Self-Knowledge Distillation. arXiv 2020, arXiv:2003.13964.
  26. Yuan, L.; Tay, F.E.H.; Li, G.; Wang, T.; Feng, J. Revisit Knowledge Distillation: A Teacher-free Framework. arXiv 2019, arXiv:1909.11723.
  27. Kim, K.; Ji, B.; Yoon, D.; Hwang, S. Self-Knowledge Distillation with Progressive Refinement of Targets. arXiv 2020, arXiv:2006.12000.
  28. Wang, Y.; Li, H.; Chau, L.-P.; Kot, A.C. Embracing the Dark Knowledge: Domain Generalization Using Regularized Knowledge Distillation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021.
  29. Zhao, L.; Peng, X.; Chen, Y.; Kapadia, M.; Metaxas, D.N. Knowledge as Priors: Cross-Modal Knowledge Generalization for Datasets Without Superior Knowledge. arXiv 2020, arXiv:2004.00176.
  30. Liu, C.; Xie, H.; Zha, Z.-J.; Ma, L.; Yu, L.; Zhang, Y. Filtration and Distillation: Enhancing Region Attention for Fine-Grained Visual Categorization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11555–11562.
  31. Kermany, D.S.; Goldbaum, M.; Cai, W.; Valentim, C.C.; Liang, H.; Baxter, S.; McKeown, A.; Yang, G.; Wu, X.; Yan, F.; et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 2018, 172, 1122–1131.e9.
  32. Niu, Y.; Jiao, Y.; Shi, G. Attention-shift based deep neural network for fine-grained visual categorization. Pattern Recognit. 2021, 116, 107947.
  33. Yang, Z.; Luo, T.; Wang, D.; Hu, Z.; Gao, J.; Wang, L. Learning to Navigate for Fine-Grained Classification; Springer: Berlin, Germany, 2018.
  34. Zhang, F.; Li, M.; Zhai, G.; Liu, Y. Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual Categorization. arXiv 2020, arXiv:2003.09150.
  35. Fu, J.; Zheng, H.; Mei, T. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  36. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module; Springer: Singapore, 2018.

Figure 1. The framework of TBAL-Net. Channel and spatial attention modules (module 1 and module 2 in the figure) are added to the CNN backbone for feature extraction. The extracted feature maps are fed into the APLM, where critical regions are highlighted and used to guide the cropping module, which crops a number of top-informative regions from the raw image. Finally, the generated part images are fed into the CNN backbone for further feature extraction, so that more visual patterns are learned from the part images and the feature representation for FGVC is enhanced. The extracted feature map passes through a fully connected (FC) layer followed by the detection head.
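As an illustration of the cropping step described in the Figure 1 caption, the following Python sketch picks the top-k most informative square windows from a 2D attention map and converts them to image coordinates. It is a minimal sketch under stated assumptions, not the authors' APLM or cropping module: the window-sum score, the greedy rejection of overlapping windows, and the names `top_k_windows`, `window_to_box`, `win`, and `stride` are all hypothetical.

```python
def top_k_windows(attention, win, k):
    """Return up to k non-overlapping (row, col) window origins with the
    largest summed attention. `attention` is a 2D list of numbers.
    Assumption: window score = sum of attention values inside the window."""
    rows, cols = len(attention), len(attention[0])
    scored = []
    for r in range(rows - win + 1):
        for c in range(cols - win + 1):
            score = sum(attention[r + i][c + j]
                        for i in range(win) for j in range(win))
            scored.append((score, r, c))
    # Highest score first; ties broken by position for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    chosen = []
    for score, r, c in scored:
        # Assumption: greedily skip any window that overlaps a chosen one.
        overlaps = any(abs(r - pr) < win and abs(c - pc) < win
                       for pr, pc in chosen)
        if not overlaps:
            chosen.append((r, c))
        if len(chosen) == k:
            break
    return chosen


def window_to_box(r, c, win, stride):
    """Map a feature-map window to (top, left, bottom, right) pixel
    coordinates, assuming each feature cell covers `stride` pixels."""
    return (r * stride, c * stride, (r + win) * stride, (c + win) * stride)


attention_map = [
    [0, 0, 0, 0],
    [0, 5, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 9],
]
regions = top_k_windows(attention_map, win=2, k=2)
boxes = [window_to_box(r, c, win=2, stride=32) for r, c in regions]
```

The resulting boxes would then be cropped from the raw image and passed through the backbone again as part images.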
Figure 2. Channel and spatial attention modules, similar to those in [36]. Both modules can be inserted at any location in a CNN.
Figure 3. The CIL method used in our experiments. The whole training process consists of (N + 1) phases: one initial phase and N incremental phases. The initial phase trains an initial model on a subset of the data to recognize a subset of classes, while each following phase adds more samples to fine-tune the model to recognize new object classes. To prevent catastrophic forgetting, a fixed-size exemplar set is kept and updated so that it includes typical training samples for all classes seen so far. At each incremental phase, samples from the exemplar set are also used for training, and the exemplar set is updated after the phase completes.
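The phase structure described in the Figure 3 caption can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' training code: the names `update_exemplars`, `run_phases`, and `budget` are hypothetical, random sampling stands in for whatever rule selects "typical" samples, the equal per-class split of the budget is an assumption, and the model update is left as a comment.

```python
import random


def update_exemplars(memory, new_samples, budget, seed=0):
    """Rebuild a fixed-size exemplar set after one phase.

    memory:      dict, class label -> list of stored exemplars
    new_samples: dict, new class label -> list of training samples
    budget:      total number of exemplars kept over all seen classes
    """
    rng = random.Random(seed)
    merged = {label: list(items) for label, items in memory.items()}
    for label, items in new_samples.items():
        merged.setdefault(label, []).extend(items)

    # Assumption: the budget is split equally among all seen classes.
    per_class = budget // len(merged)
    updated = {}
    for label, items in merged.items():
        if label in memory:
            # Old class: shrink its stored exemplars to the new quota.
            updated[label] = items[:per_class]
        else:
            # New class: random choice is a placeholder for a selection
            # rule that picks typical samples.
            updated[label] = rng.sample(items, min(per_class, len(items)))
    return updated


def run_phases(phases, budget):
    """Run one initial phase plus the incremental phases in `phases`."""
    memory, history = {}, []
    for phase_data in phases:
        # Train on the new classes together with the stored exemplars.
        train_set = [(x, y) for y, xs in phase_data.items() for x in xs]
        train_set += [(x, y) for y, xs in memory.items() for x in xs]
        # A real pipeline would fine-tune the model on train_set here.
        memory = update_exemplars(memory, phase_data, budget)
        history.append((len(train_set), sum(len(v) for v in memory.values())))
    return memory, history


phases = [
    {0: list(range(10)), 1: list(range(10, 20))},  # initial phase
    {2: list(range(20, 30))},                      # one incremental phase
]
memory, history = run_phases(phases, budget=6)
```

In this toy run the exemplar set stays at the fixed budget of six samples while the per-class quota shrinks from three to two as a third class arrives.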
Figure 4. Performance with different numbers of incremental phases on CUB-200-2011.
Figure 5. Performance with different numbers of incremental phases on FGVC-Aircraft.
Figure 6. Performance with different numbers of incremental phases on Stanford-Car.
Table 1. Datasets in the experiments.

| Dataset | Number of Classes | Training Images | Testing Images |
|---|---|---|---|
| CUB-200-2011 | 200 | 5994 | 5794 |
| FGVC-Aircraft | 70 | 6667 | 3333 |
| Stanford Cars | 196 | 8144 | 8041 |

Table 2. Performance of TBAL-Net with/without attention modules on CUB-200-2011 (Top-1 Accuracy, %). The highest score in each column is marked in bold.

| Method | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 |
|---|---|---|---|---|---|---|---|---|
| TBAL-Net-(CN-LC-IS)-CNN | 92.912 | 91.012 | 89.793 | 88.874 | 87.572 | 86.476 | 85.317 | 84.216 |
| TBAL-Net-(CN-LC-IS)-NME | 92.230 | 90.973 | 89.741 | 89.167 | 87.830 | 86.376 | 84.917 | 83.853 |
| TBAL-Net-(CN-LC-IS)-CNN-ATT | **93.792** | **92.126** | **90.958** | **90.133** | **88.746** | **87.193** | **86.178** | **84.973** |
| TBAL-Net-(CN-LC-IS)-NME-ATT | 93.139 | 91.772 | 90.431 | 89.831 | 88.103 | 86.733 | 85.208 | 84.187 |

| Method | 130 | 140 | 150 | 160 | 170 | 180 | 190 | 200 |
|---|---|---|---|---|---|---|---|---|
| TBAL-Net-(CN-LC-IS)-CNN | 82.466 | 81.831 | 80.871 | 80.167 | 79.301 | 78.653 | 78.200 | 77.467 |
| TBAL-Net-(CN-LC-IS)-NME | 81.903 | 81.013 | 80.276 | 79.337 | 78.498 | 77.667 | 76.605 | 76.031 |
| TBAL-Net-(CN-LC-IS)-CNN-ATT | **83.240** | **82.617** | **81.420** | **80.707** | **79.910** | **79.379** | **79.088** | **78.210** |
| TBAL-Net-(CN-LC-IS)-NME-ATT | 82.740 | 81.650 | 80.873 | 79.921 | 79.121 | 78.310 | 77.141 | 76.563 |

Table 3. Performance of TBAL-Net with/without attention modules on FGVC-Aircraft (Top-1 Accuracy, %).

| Method | 35 | 40 | 45 | 50 | 55 | 60 | 65 | 70 |
|---|---|---|---|---|---|---|---|---|
| TBAL-Net-(CN-LC-IS)-CNN | 97.137 | 96.262 | 94.137 | 92.863 | 92.031 | 91.073 | 89.167 | 88.393 |
| TBAL-Net-(CN-LC-IS)-NME | 97.330 | 95.167 | 94.033 | 89.167 | 88.030 | 86.737 | 85.317 | 83.973 |
| TBAL-Net-(CN-LC-IS)-CNN-ATT | 97.846 | 96.91 | 94.87 | 93.65 | 92.879 | 91.798 | 89.86 | 89.08 |
| TBAL-Net-(CN-LC-IS)-NME-ATT | 98.012 | 96.12 | 94.96 | 93.45 | 92.325 | 91.02 | 89.658 | 88.233 |

Table 4. Performance of TBAL-Net with/without attention modules on Stanford-Car (Top-1 Accuracy, %).

| Method | 98 | 112 | 126 | 140 | 154 | 168 | 182 | 196 |
|---|---|---|---|---|---|---|---|---|
| TBAL-Net-(CN-LC-IS)-CNN | 95.863 | 93.317 | 92.073 | 89.915 | 88.767 | 87.876 | 86.717 | 85.916 |
| TBAL-Net-(CN-LC-IS)-NME | 95.315 | 93.073 | 91.930 | 89.390 | 88.130 | 87.176 | 86.527 | 85.353 |
| TBAL-Net-(CN-LC-IS)-CNN-ATT | 96.74 | 94.215 | 93.013 | 90.87 | 89.706 | 88.942 | 88.021 | 87.312 |
| TBAL-Net-(CN-LC-IS)-NME-ATT | 96.317 | 93.915 | 92.813 | 89.77 | 88.916 | 88.342 | 87.821 | 86.93 |

Table 5. Performance (Top-1 Accuracy %) on CUB-200-2011 as the number of classes increases from 50 to 120.

| Method | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 |
|---|---|---|---|---|---|---|---|---|
| ResNet50-(CN-LC-IS)-CNN | 92.968 | 89.924 | 88.200 | 87.957 | 86.749 | 85.650 | 84.187 | 83.088 |
| ResNet50-(CN-LC-IS)-NME | 92.686 | 89.807 | 87.650 | 87.174 | 85.978 | 83.986 | 82.616 | 81.847 |
| NTS-Net-(CN-LC-IS)-CNN | 93.192 | 91.626 | 89.158 | 88.833 | 87.746 | 86.193 | 85.178 | 83.973 |
| NTS-Net-(CN-LC-IS)-NME | 92.891 | 90.772 | 88.930 | 88.231 | 87.103 | 85.733 | 84.208 | 83.187 |
| MMAL-(CN-LC-IS)-CNN | 94.018 | 92.130 | 90.810 | 89.502 | 88.310 | 86.730 | 85.653 | 84.210 |
| MMAL-(CN-LC-IS)-NME | 94.107 | 92.070 | 90.531 | 89.312 | 88.512 | 86.037 | 84.702 | 83.903 |
| TBAL-Net-(CN-LC-IS)-CNN | 93.792 | 92.126 | 90.958 | 90.133 | 88.746 | 87.193 | 86.178 | 84.973 |
| TBAL-Net-(CN-LC-IS)-NME | 93.139 | 91.772 | 90.431 | 89.831 | 88.103 | 86.733 | 85.208 | 84.187 |

Table 6. Performance (Top-1 Accuracy %) on CUB-200-2011 as the number of classes increases from 130 to 200.

| Method | 130 | 140 | 150 | 160 | 170 | 180 | 190 | 200 |
|---|---|---|---|---|---|---|---|---|
| ResNet50-(CN-LC-IS)-CNN | 82.716 | 82.176 | 81.032 | 80.390 | 79.707 | 79.179 | 78.319 | 77.477 |
| ResNet50-(CN-LC-IS)-NME | 81.651 | 80.692 | 79.366 | 78.332 | 77.346 | 77.145 | 76.337 | 75.492 |
| NTS-Net-(CN-LC-IS)-CNN | 82.240 | 81.617 | 80.420 | 79.707 | 78.910 | 78.379 | 78.088 | 77.210 |
| NTS-Net-(CN-LC-IS)-NME | 81.940 | 80.650 | 79.873 | 78.921 | 78.121 | 77.310 | 76.141 | 75.563 |
| MMAL-(CN-LC-IS)-CNN | 82.720 | 82.312 | 80.921 | 80.218 | 79.150 | 78.940 | 78.210 | 77.501 |
| MMAL-(CN-LC-IS)-NME | 82.390 | 81.030 | 80.127 | 79.238 | 78.750 | 77.913 | 76.420 | 75.980 |
| TBAL-Net-(CN-LC-IS)-CNN | 83.240 | 82.617 | 81.420 | 80.707 | 79.910 | 79.379 | 79.088 | 78.210 |
| TBAL-Net-(CN-LC-IS)-NME | 82.740 | 81.650 | 80.873 | 79.921 | 79.121 | 78.310 | 77.141 | 76.563 |

Table 7. Performance (Top-1 Accuracy %) on FGVC-Aircraft as the number of classes increases from 35 to 70.

| Method | 35 | 40 | 45 | 50 | 55 | 60 | 65 | 70 |
|---|---|---|---|---|---|---|---|---|
| ResNet50-(CN-LC-IS)-CNN | 97.48 | 94.519 | 93.33 | 92.248 | 91.595 | 90.506 | 88.566 | 87.849 |
| ResNet50-(CN-LC-IS)-NME | 97.587 | 95.52 | 93.816 | 92.37 | 91.37 | 90.052 | 88.85 | 87.549 |
| NTS-Net-(CN-LC-IS)-CNN | 98.04 | 97.012 | 94.21 | 93.12 | 92.21 | 91.031 | 89.012 | 88.233 |
| NTS-Net-(CN-LC-IS)-NME | 97.921 | 96.21 | 94.037 | 92.67 | 91.47 | 90.19 | 88.921 | 87.75 |
| MMAL-(CN-LC-IS)-CNN | 98.127 | 97.23 | 94.796 | 93.545 | 92.439 | 91.32 | 89.47 | 88.676 |
| MMAL-(CN-LC-IS)-NME | 98.021 | 96.543 | 94.539 | 93.098 | 91.97 | 90.33 | 89.215 | 87.901 |
| TBAL-Net-(CN-LC-IS)-CNN | 97.846 | 96.91 | 94.87 | 93.65 | 92.879 | 91.798 | 89.86 | 89.08 |
| TBAL-Net-(CN-LC-IS)-NME | 98.012 | 96.12 | 94.96 | 93.45 | 92.325 | 91.02 | 89.658 | 88.233 |

Table 8. Performance (Top-1 Accuracy %) on Stanford-Car as the number of classes increases from 98 to 196.

| Method | 98 | 112 | 126 | 140 | 154 | 168 | 182 | 196 |
|---|---|---|---|---|---|---|---|---|
| ResNet50-(CN-LC-IS)-CNN | 96.053 | 93.965 | 92.620 | 90.392 | 89.182 | 88.069 | 87.716 | 87.054 |
| ResNet50-(CN-LC-IS)-NME | 95.854 | 93.857 | 92.466 | 90.704 | 89.308 | 88.460 | 88.212 | 87.303 |
| NTS-Net-(CN-LC-IS)-CNN | 95.940 | 94.021 | 92.473 | 89.794 | 88.329 | 87.440 | 86.217 | 85.233 |
| NTS-Net-(CN-LC-IS)-NME | 95.137 | 93.566 | 91.815 | 88.779 | 87.316 | 86.242 | 85.321 | 84.930 |
| MMAL-(CN-LC-IS)-CNN | 95.83 | 94.32 | 92.778 | 90.121 | 89.012 | 88.103 | 87.02 | 86.133 |
| MMAL-(CN-LC-IS)-NME | 96.03 | 93.97 | 92.531 | 89.33 | 87.93 | 87.531 | 86.39 | 85.127 |
| TBAL-Net-(CN-LC-IS)-CNN | 96.74 | 94.215 | 93.013 | 90.87 | 89.706 | 88.942 | 88.021 | 87.312 |
| TBAL-Net-(CN-LC-IS)-NME | 96.317 | 93.915 | 92.813 | 89.77 | 88.916 | 88.342 | 87.821 | 86.93 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Guo, J.; Qi, G.; Xie, S.; Li, X. Two-Branch Attention Learning for Fine-Grained Class Incremental Learning. Electronics 2021, 10, 2987. https://doi.org/10.3390/electronics10232987

AMA Style

Guo J, Qi G, Xie S, Li X. Two-Branch Attention Learning for Fine-Grained Class Incremental Learning. Electronics. 2021; 10(23):2987. https://doi.org/10.3390/electronics10232987

Chicago/Turabian Style

Guo, Jiaqi, Guanqiu Qi, Shuiqing Xie, and Xiangyuan Li. 2021. "Two-Branch Attention Learning for Fine-Grained Class Incremental Learning" Electronics 10, no. 23: 2987. https://doi.org/10.3390/electronics10232987

APA Style

Guo, J., Qi, G., Xie, S., & Li, X. (2021). Two-Branch Attention Learning for Fine-Grained Class Incremental Learning. Electronics, 10(23), 2987. https://doi.org/10.3390/electronics10232987

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.