Article

An Integrated Transformer with Collaborative Tokens Mining for Fine-Grained Recognition

1 The School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510275, China
2 Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2023, 12(12), 2635; https://doi.org/10.3390/electronics12122635
Submission received: 11 April 2023 / Revised: 7 May 2023 / Accepted: 13 May 2023 / Published: 12 June 2023

Abstract: Fine-grained recognition classifies images into hundreds of sub-categories of a generic class (e.g., Cape May warbler vs. Magnolia warbler) by locating discriminative regions. Because region localization through traditional backbone architectures is highly complex and non-differentiable, most existing approaches resort to multi-level reinforcement learning to distinguish the similar appearances of sub-categories. These methods exploit incomplete information, using only the intra-class informative regions in a single image or the inter-class and intra-class relationships of pairwise images, so the located regions tend to overlap. Since inter-class correlations and a backbone with complete contextual semantics play important roles in distinguishing fine-grained classes, we propose a novel Transformer with Collaborative Tokens Mining (TCTM) scheme that fully exploits the relationships between inter-class and intra-class regions. The proposed TCTM scheme, built on a new transformer backbone, consists of two modules that collaboratively explore spatially aware tokens: the Pyramid Tokens Multiplication (PTM) module, which exploits integrated multi-stage inter-class and intra-class correlations from the new transformer architecture, and the Tokens Proposals Generation (TPG) module, which captures two groups of top-four discriminative tokens. The two PTMs extract contrastive tokens for each image and learn to rank them, under the assumption that tokens from the same class and the same module should have smaller distances. The TPGs further sort and update the candidate tokens derived from the extracted attention tokens by ranking their probabilities against the ground-truth sub-category labels. Through the collaboration between PTM and TPG, TCTM takes the integrated correlations into account and mines discriminative tokens for the final fine-grained classification. Extensive experiments on four popular benchmarks show that the proposed TCTM outperforms state-of-the-art methods by a large margin.

1. Introduction

Fine-grained classification distinguishes subordinate classes under a generic class, such as recognizing a specific bird species, “Cape May warbler”. Subordinate classes share similar inter-class geometry and appearance while exhibiting varied intra-class postures and occlusions. Academia and industry mainly tackle this task with traditional backbones, such as the Visual Geometry Group network (VGG) [1] or the Deep Residual Network (ResNet) [2], which suffer from high complexity, non-differentiable region localization, and incomplete contextual semantics. Recently, vision transformers, which directly map image patches to tokens via self-attention, have achieved higher performance with lower complexity on standard classification tasks. Thus, choosing an appropriate backbone and locating the discriminative regions are crucial for fine-grained classification.
Previous works [3,4,5,6,7,8] tend to use traditional backbones to detect multiple discriminative regions for fine-grained classification. The main approach is multi-stage localization of the coordinates of discriminative parts, which is non-differentiable and can only be trained layer by layer with reinforcement learning, requiring a large amount of computation. Because these methods only consider the spatial relationships within a single image and ignore the relationships between different categories, the detected discriminative parts overlap heavily. Owing to the non-differentiability of traditional backbones, the works in [9,10,11,12,13,14,15] locate these regions by multi-level reinforcement learning. They focus on the intra-class spatial relationships within a single image, losing the influence of inter-class spatial relations and the mutual effects among the discriminative regions; consequently, the detected regions tend to overlap and the complexity remains high. Some methods [16,17,18] introduce inter-class and intra-class relations with paired images to exploit their feature differences, but fall back on channel attention. As a multi-head attention mechanism, the Transformer has improved performance in many vision tasks, replacing hard and soft attention to realize mutual interaction among the discriminative regions [19,20]. However, these two methods still rely on traditional neural network architectures and inevitably require multi-stage reinforcement learning, resulting in high computational complexity. To solve these problems, some recent methods [21,22,23] locate discriminative regions by replacing the traditional backbone with a transformer backbone. However, these transformer backbones neglect the relations among image patches and their spatial locations, and consume large amounts of data.
Since the choice of backbone plays a key role in extracting intra-class and inter-class relations, in this paper we incorporate a new transformer backbone to extract these relations and improve fine-grained classification accuracy. We propose the integrated Transformer with Collaborative Tokens Mining (TCTM) scheme to mine different discriminative tokens. As shown in Figure 1, the proposed TCTM consists of two collaborative modules: the Pyramid Tokens Multiplication (PTM), which produces and multiplies multi-stage tokens from the new transformer architecture to extract complete inter-class and intra-class tokens, and the Tokens Proposals Generation (TPG), which selects the most discriminative attention regions from candidate tokens with intra-class relations. Specifically, the two PTMs first locate two groups of distinct image tokens. To this end, we establish an ordered list by altering the class and PTM of the query, apply constraints to enlarge the distance between tokens from different PTMs and classes, and use three ranking losses to train the PTMs on all images within a batch rather than on image pairs. The TPG then predicts a probability value for each candidate group of tokens and selects the top-k discriminative tokens for the final classification. Meanwhile, the PTM learns different inter-class tokens that guide intra-class token mining in the TPG. Thus, the two modules collaboratively learn to select and locate crucial object parts for fine-grained classification.
Our contributions are summarized as follows: (1) we propose an integrated transformer with a collaborative tokens mining scheme for fine-grained recognition, which exploits inter-class and intra-class relations to capture discriminative tokens; (2) based on the TPG, we replace multi-level reinforcement learning with differentiable optimization to capture the discriminative attention regions. Extensive experiments on four widely used benchmark datasets, CUB-200-2011, Stanford Cars, FGVC Aircraft, and Stanford Dogs, show the effectiveness of the proposed TCTM.

2. Related Work

2.1. Fine-Grained Recognition

Early work on fine-grained recognition used supervised learning with extra bounding-box annotations from experts; later work adopted weakly supervised learning with only image-level labels, locating the most discriminative regions by one-stage or multi-level reinforcement learning. Some one-stage methods compute features of the whole image to explore intra-class relations, but the located parts differ greatly from the annotated parts. Ref. [3] proposes reinforcement learning to separately select two discriminative regions with the highest peak responses, without comparing the two parts. Ref. [4] alternately picks positive regions by averaging all the features, neglecting their mutual relations. Ref. [8] forces the Navigator and Scrutinizer to produce a consistent ordering via a reinforcement learning training mechanism, neglecting the mutual influence among the located regions. The above methods search the full image to select the most discriminative regions globally, which easily introduces noisy features. Ref. [9] roughly estimates the position of the whole object through saliency extraction, shifting the selection from the whole image to the salient attention region. The Scale-consistent Attention Part Network (SCAPNet) [24] not only introduces attention over the whole image but also builds relationships among intra-regions, yielding a significant improvement.
However, these methods locate the regions indirectly through multi-scale learning and overlook the spatial correlations of discriminative parts within and across classes. We instead mine the relations among different parts of pairwise images in a batch, which collaboratively locates two lists of the most discriminative regions.

2.2. Inter-Class and Intra-Class Relations by Metric Learning

Metric learning offers fine-grained recognition another category of attention-guided learning, directly measuring similarity to mine relations between images of the same or different classes. Following the contrastive differences of paired images, the Multi-Attention Multi-class Constraint (MAMC) [18] treats the spatial correlations of different parts as prior knowledge, divides them into four categories to obtain three groups of positive and negative samples for each query, and then applies the N-pair loss [25] to constrain them. Ref. [25] identifies one positive sample against the remaining negative samples, pulling the positive sample too close to the query and pushing the rest too far. MAMC [18] sacrifices the structure of the dataset and relies on an artificial division, which limits its constraints to three levels. Refs. [25,26,27,28] treat these relations equally, which restricts their ability to explore more informative relations. Refs. [29,30] introduce margin losses that weight all the positive and negative samples while keeping the data structure unchanged.
These methods are convenient, lightweight, plug-and-play, and fast; however, they achieve much lower accuracy than methods that search for intra-class relations.

2.3. Feature Representation

Although attention-guided mechanisms can directly locate discriminative regions, the features of these regions are often mixed with noisy features. Filtering out noisy features to obtain the most discriminative ones [31] is another key factor for successful recognition. Convolutional Neural Network (CNN) features can directly represent the global information of an image. Since CNN features are largely invariant to changes in the translation and posture of objects, some existing methods compute high-order information to suppress noisy features. Refs. [32,33] obtain high-order information by aggregating CNN features. Ref. [10] focuses on the spatial correlations across all channels rather than a single channel to strengthen the most discriminative features. However, these methods concentrate on channel or spatial attention within a single image, overlooking the mutual correlations within one image and among multiple images.
To model the mutual relations within one image, the Transformer, as a multi-head attention mechanism, has achieved strong performance across many vision tasks, and replacing hard or soft attention with Transformer attention has also proved effective. PART [20] appends a Transformer attention module to the fine-tuned model to learn the relationships among the extracted regions. Building on the Transformer attention module, CP-CNN [19] removes noisy information by learning the relationships among the selected parts, obtaining more discriminative part coordinates than PART [20]. Both methods still rely on traditional neural network architectures and inevitably require multi-stage reinforcement learning, resulting in non-differentiability and high computational complexity. A new backbone is therefore needed.
As a new backbone, the Vision Transformer performs well on large-scale standard tasks such as image classification, segmentation, and generation. Recent pre-training methods for Vision Transformers design object classification tasks, such as the Vision Transformer (ViT) [22], Generative Pretraining from Pixels (Image GPT) [34], and BERT Pre-Training of Image Transformers (BEiT) [35]. ViT [22] and Image GPT [34] directly replace CNNs with transformers to classify fine-grained datasets, but they require large amounts of data and neglect the correlations among different transformer layers. BEiT uses arbitrary unmasked patches to predict the visual tokens of masked image patches, merging the whole structural information through an unmasked version of Jigsaw pre-training [36] that treats the unmasked positions as different views of an image. However, BEiT must obtain the visual tokens of the whole image in advance from a pre-trained model [37]. These methods take all image patches as input, and their representations are feature-based or pixel-based, which may be suboptimal for capturing local structure. In contrast, Swin [38] extracts tokens with spatial coordinates, which is its most significant difference from ViT [22] and its horizontal-only locations. Swin [38] processes structured image patches with shifted windows to capture the structural information of varying patches, replacing the global multi-head self-attention (MSA) with window-based multi-head self-attention (W-MSA) to obtain adjacent local information. Because parts are fused every two layers, the model parameters are significantly reduced. TransFG [23], a Transformer-based model, considers the relationships among different layers but still uses ViT with horizontal locations as the backbone and extracts incomplete tokens. We first design the correlations of multiple object parts of multiple images between the inter-class and intra-class. We then propose the Pyramid Tokens Multiplication (PTM) module to locate two attention regions in each image and the Tokens Proposals Generation (TPG) module to locate two lists of discriminative regions.

3. Transformer Collaborative Region Mining

We propose the Transformer with Collaborative Tokens Mining (TCTM) scheme to locate two groups of discriminative regions by exploring various inter-class and intra-class relations. It consists of two sub-networks, Pyramid Tokens Multiplication (PTM) and Tokens Proposals Generation (TPG), as shown in Figure 1. The two PTMs share a similar structure but have different inputs and outputs; each obtains stacked multi-scale tokens by learning to rank the concatenated tokens extracted from different sub-networks, images, and classes. The extracted tokens guide the TPGs to precisely detect the corresponding groups of neighboring candidate boxes by ranking their token regions.

3.1. Pyramid Tokens Multiplications

Given a randomly sampled training batch $\{(I_i, \tilde{I}_i, y_i)\}_{i=1}^{N}$ of $N$ image pairs, for the image $I_i$ the two Pyramid Tokens Multiplications, PTM$^o$ and PTM$^{o'}$, attend to multi-scale tokens to obtain two contrastive tokens $f_i^o$ and $f_i^{o'}$. In VGG or ResNet backbones the extracted features have scales that differ from the input image, whereas the pyramid-merged tokens in Swin [38] keep the same spatial coordinates as the input patches. Hence, the merged tokens can be mapped directly to the input image without anchors. We obtain tokens for image patches through a pre-trained Swin [38]. The image $I_i$ is divided into multiple patches with spatial coordinate information. Swin consists of four stages in which adjacent patches interact through a sliding fusion operation applied every two Swin Transformer blocks. The patches are fused layer by layer, halving the resolution of the previous layer and doubling its number of channels. This location fusion is similar to pooling but does not lose the spatial coordinate information. Each Pyramid Tokens Multiplication (PTM) extracts the multi-scale tokens and multiplies them: the lowest-stage tokens attend accurately to small-scale regions, whereas the higher-stage tokens encode richer semantics and attend to different regions at a wider view. As in Figure 2, the multi-stage attention weights of each patch can be written as follows:
$$a^0 = \left\{ t_0^0, t_1^0, t_2^0, t_3^0 \right\},$$
$$a^i = \left\{ t_0^i, t_1^i, t_2^i, t_3^i \right\}, \quad i \in \{0, 1, \ldots, A\},$$
where $A$ is the sequence length of image patches. We integrate all the multi-scale attention regions by multiplication:
$$t^i = \prod_{j=0}^{3} t_j^i.$$
Since $t^i$ integrates multi-scale information from the first to the last stage, it provides a better basis for selecting discriminative regions than any single stage, as shown in Figure 2.
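To make the multiplication concrete, the following PyTorch sketch shows one way to align and multiply the tokens from four backbone stages. The module name, stage dimensions, and the nearest-neighbor upsampling used to bring every stage back to the stage-1 token length are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidTokensMultiplication(nn.Module):
    """Minimal sketch of the PTM integration step: project the token maps from
    the four backbone stages to a common channel width and token length, then
    integrate them by element-wise multiplication (t^i = prod_j t_j^i)."""

    def __init__(self, stage_dims=(96, 192, 384, 768), out_dim=96, out_len=3136):
        super().__init__()
        self.out_len = out_len
        # One linear embedding per stage to align the channel dimensions.
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in stage_dims)

    def forward(self, stage_tokens):
        # stage_tokens: list of 4 tensors (B, L_j, C_j); L_j shrinks by 4x per stage.
        aligned = []
        for tokens, proj in zip(stage_tokens, self.proj):
            x = proj(tokens).transpose(1, 2)                  # (B, out_dim, L_j)
            # Upsample every stage back to the stage-1 token length so each
            # stage contributes one value t_j^i for every patch i.
            x = F.interpolate(x, size=self.out_len, mode="nearest")
            aligned.append(x.transpose(1, 2))                 # (B, out_len, out_dim)
        integrated = aligned[0]
        for x in aligned[1:]:
            integrated = integrated * x                       # multiply across stages
        return integrated


if __name__ == "__main__":
    # Illustrative token counts (56^2, 28^2, 14^2, 7^2) and Swin-T-like widths.
    lens, dims = (3136, 784, 196, 49), (96, 192, 384, 768)
    stage_tokens = [torch.randn(2, L, C) for L, C in zip(lens, dims)]
    print(PyramidTokensMultiplication()(stage_tokens).shape)  # torch.Size([2, 3136, 96])
```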
The integrated tokens $f^o$ are fed into a fully connected layer, and the same operations are performed on $f^{o'}$. To obtain discriminative attended tokens with intra-class and inter-class relations, given a query $f_i^o$, we adaptively select four levels of positive and negative merged tokens within the batch and apply constraints to enlarge the distances among $f_i^o$, $f_i^{o'}$, and the tokens of other images from different classes.

3.2. Tokens Proposals Generation

Inspired by Faster R-CNN [39], the two TPGs choose sequences of neighboring attention tokens from $f_i^o$ and $f_i^{o'}$ as candidate boxes. Based on anchor sizes $\{48, 96, 192\}$ and a patch size of 16, the smallest number of sampled tokens is set to 3 and the largest to 6. Tokens whose values fall below the designed threshold are filtered out. Because the merged tokens keep the same spatial coordinates as the input patches, they can be mapped directly to image patches. Various neighboring concatenated tokens are created by a sliding-window mechanism, where the windows can be horizontally, vertically, or spatially adjacent. The input image size is $448 \times 448$ and the patch size is $16 \times 16$. During training, we sample and zoom into one patch block, represented by a black box with 49 markers, as shown in Figure 3. The red box indicates one candidate box; five candidate boxes are illustrated. We define all the proposed candidate boxes as $T^o = \{T_1^o, T_2^o, \ldots, T_S^o\}$, where $S$ is the number of candidate boxes, $T_1^o = \{t_1^o, t_2^o, t_5^o\}$, and $T_2^o = \{t_2^o, t_3^o, t_4^o\}$. Since each image has two contrastive instances, we define a second group of token proposals $T^{o'} = \{T_1^{o'}, T_2^{o'}, \ldots, T_S^{o'}\}$. We predict the probability of the ground-truth class for each concatenated group of tokens and select the top-four groups.
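The sliding-window proposal step can be sketched as follows. This is a simplified illustration, not the authors' code: the 7 × 7 grid, the mean-attention score used in place of the ground-truth-class probability, and the function name are all assumptions.

```python
import torch


def generate_token_proposals(attn, grid=7, min_len=3, max_len=6, threshold=0.15, top_k=4):
    """Sketch of the TPG idea: slide windows of adjacent patches over a
    (grid x grid) attention map, keep groups of 3-6 tokens whose values pass a
    threshold, score each group (here by mean attention), and return the top-k
    groups of patch indices."""
    attn2d = attn.reshape(grid, grid)
    proposals = []
    for length in range(min_len, max_len + 1):
        # Horizontally adjacent windows.
        for r in range(grid):
            for c in range(grid - length + 1):
                idx = [r * grid + c + i for i in range(length)]
                vals = attn2d[r, c:c + length]
                if (vals >= threshold).all():
                    proposals.append((vals.mean().item(), idx))
        # Vertically adjacent windows.
        for c in range(grid):
            for r in range(grid - length + 1):
                idx = [(r + i) * grid + c for i in range(length)]
                vals = attn2d[r:r + length, c]
                if (vals >= threshold).all():
                    proposals.append((vals.mean().item(), idx))
    proposals.sort(key=lambda p: p[0], reverse=True)
    return [idx for _, idx in proposals[:top_k]]


if __name__ == "__main__":
    attn = torch.rand(49)                 # integrated attention values for a 7x7 patch block
    print(generate_token_proposals(attn))
```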

3.3. Relations among Inter-Class and Intra-Class

To force the two PTMs to attend to two different groups of tokens in one image, we magnify the distance between tokens from different PTMs and classes. Given a query $f_i^o$, randomly extracted from the $i$-th image with PTM$^o$ and label $y_i$, we divide the tokens of the whole batch into four levels according to their inter-class and intra-class relations. To keep the distribution of the batch tokens unchanged, we first compute the distances between the query and the other groups of tokens and then sort these distances in descending order. Inspired by MAMC [18], we split the groups of tokens into four levels: the closest as the first level $S_1 = \{\tilde{f}_i^o\}$, the second level $S_2 = \{f_j^o, \tilde{f}_j^o\}$, the third level $S_3 = \{f_i^{o'}, \tilde{f}_i^{o'}\}$, and the fourth level $S_4 = \{f_j^{o'}, \tilde{f}_j^{o'}\}$, where $j \ne i$ and $o' \ne o$.
We select three groups of positive and negative tokens from these levels: the first positive tokens are $S_1$ and the negative samples are $S_2 \cup S_3 \cup S_4$; the second positive and negative tokens are $S_3$ and $S_4$; the third positive and negative tokens are $\{S_2, S_3\}$ and $\{S_2, S_4\}$. We design three contrastive losses to progressively enlarge the distance between the query and the different levels.
For each contrastive loss, we aim to make the distance between the query $f_i^o$ and the positive tokens smaller than $\alpha - m$ and the distances to all the negative tokens larger than $\alpha$. The optimization proceeds as follows: after randomly selecting one sequence of tokens as the query, we pull in the first group of positive tokens and push away the first group of negative tokens; the second and third groups of positive and negative tokens are optimized in the same way to implement the second and third divisions. The four levels are finally split into four hierarchies.
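The four-level division can be sketched in a few lines of PyTorch. The snippet below is illustrative: it assumes one feature vector per image and per PTM branch, treats "the paired image" simply as the other samples with the same label, and uses hypothetical names throughout.

```python
import torch


def split_levels(query_idx, query_branch, feats, labels):
    """Sketch of the four-level split used for a query f_i^o.
    feats: dict {branch name -> tensor (N, D)} with one token vector per image
    and PTM branch; labels: (N,) class ids."""
    o = query_branch
    o_prime = [b for b in feats if b != o][0]
    same_class = (labels == labels[query_idx])
    other_image = torch.arange(labels.numel()) != query_idx

    S1 = feats[o][same_class & other_image]          # same branch, same class
    S2 = feats[o][~same_class]                       # same branch, different class
    S3 = feats[o_prime][same_class]                  # other branch, same class
    S4 = feats[o_prime][~same_class]                 # other branch, different class
    return S1, S2, S3, S4


if __name__ == "__main__":
    N, D = 8, 16
    feats = {"o": torch.randn(N, D), "o_prime": torch.randn(N, D)}
    labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])  # paired images per class
    S1, S2, S3, S4 = split_levels(0, "o", feats, labels)
    print(S1.shape, S2.shape, S3.shape, S4.shape)
```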

3.4. Loss Function

  • Ranking Loss.
Inspired by [30], to eliminate the influences from different distance losses, we normalize them by weighting the positive and negative tokens. The weight of the positive tokens is generated as follows:
$$\omega_p = \exp\!\left(T \cdot \left(\left\| f_{i,y}^o - f^+ \right\|_2 - (\alpha - m)\right)\right), \quad f^+ \in P,$$
where the temperature parameter $T \ge 0$ controls the degree of weighting of the positive tokens. The normalized weight of the positive tokens is:
$$\omega_p^o = \frac{\omega_p}{\sum_{f^+ \in P} \omega_p}.$$
The loss over the positive tokens is minimized as:
$$L_P\!\left(f_{i,y}^o, f^+\right) = \sum_{f^+ \in P} \omega_p^o \left[\left\| f_{i,y}^o - f^+ \right\|_2 - (\alpha - m)\right]_+ .$$
Similarly, the weight of the negative tokens is as follows:
$$\omega_n = \exp\!\left(T \cdot \left(\alpha - \left\| f_{i,y}^o - f^- \right\|_2\right)\right), \quad f^- \in N,$$
where the temperature parameter $T \ge 0$ controls the degree of weighting of the negative tokens. The normalized weight of the negative tokens is:
$$\omega_p^n = \frac{\omega_n}{\sum_{f^- \in N} \omega_n}.$$
The loss over the negative tokens is minimized as:
$$L_N\!\left(f_{i,y}^o, f^-\right) = \sum_{f^- \in N} \omega_p^n \left[\alpha - \left\| f_{i,y}^o - f^- \right\|_2\right]_+ .$$
Each hierarchy's ranking loss can be represented as:
$$L_{\mathrm{ORL}}\!\left(f_{i,y}^o, f^+, f^-\right) = L_P\!\left(f_{i,y}^o, f^+\right) + \lambda L_N\!\left(f_{i,y}^o, f^-\right),$$
inspired by RLL [30], where $\lambda = 1$.
The three ranking losses are combined as:
$$L_{\mathrm{OL}} = L_{\mathrm{ORL}}^1 + L_{\mathrm{ORL}}^2 + L_{\mathrm{ORL}}^3,$$
inspired by MAMC [18], where we assume each ranking loss contributes equally to the optimization.
The complexity of splitting the three groups of negative and positive tokens is $2(4N-1) + 4(2N-1)^2 + 4(2N-1)$. Inspired by RLL [30], the complexity of each ranking loss is $O(N^2)$.
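A compact PyTorch sketch of one hierarchy ranking loss, using the weighted hinge formulation above with the hyperparameter values reported in Section 4.1 (α = 0.13, m = 0.05, T = 10, λ = 1). The function name and the batching convention are assumptions; it is meant as an illustration rather than the authors' implementation.

```python
import torch


def ranked_list_loss(query, pos, neg, alpha=0.13, m=0.05, T=10.0, lam=1.0):
    """One hierarchy ranking loss L_ORL = L_P + lambda * L_N (in the spirit of
    RLL [30]). query: (D,), pos: (P, D), neg: (Q, D)."""
    d_pos = torch.norm(query.unsqueeze(0) - pos, dim=1)        # ||f - f+||_2
    d_neg = torch.norm(query.unsqueeze(0) - neg, dim=1)        # ||f - f-||_2

    # Weights emphasise harder examples, then are normalised to sum to one.
    w_pos = torch.exp(T * (d_pos - (alpha - m)))
    w_pos = w_pos / w_pos.sum()
    w_neg = torch.exp(T * (alpha - d_neg))
    w_neg = w_neg / w_neg.sum()

    # Hinge terms: positives pulled inside (alpha - m), negatives pushed beyond alpha.
    loss_p = (w_pos * torch.clamp(d_pos - (alpha - m), min=0.0)).sum()
    loss_n = (w_neg * torch.clamp(alpha - d_neg, min=0.0)).sum()
    return loss_p + lam * loss_n


if __name__ == "__main__":
    q = torch.randn(16)
    print(ranked_list_loss(q, torch.randn(2, 16), torch.randn(6, 16)).item())
```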
To locate two object instances in one image, we apply the softmax function:
$$L_{\mathrm{SL}} = -\log C\!\left(f^o\right) - \log C\!\left(f^{o'}\right).$$
The guiding loss of the PTM is:
$$L_G = \lambda_1 L_{\mathrm{OL}} + L_{\mathrm{SL}},$$
where $\lambda_1$ is set to 1 after tuning.
  • Tokens List Loss.
The tokens list loss forces the predicted probabilities of the true label for the two lists of tokens $T^o = \{T_1^o, T_2^o, \ldots, T_S^o\}$ and $T^{o'} = \{T_1^{o'}, T_2^{o'}, \ldots, T_S^{o'}\}$ to be in descending order. The tokens list loss is:
$$L_{\mathrm{TLL}}(\pi_y) = \sum_{j=1}^{n} \frac{\exp\!\left(\log T_j^o\right)}{\sum_{k=j}^{n} \exp\!\left(\log T_k^o\right)} + \sum_{j=1}^{n} \frac{\exp\!\left(\log T_j^{o'}\right)}{\sum_{k=j}^{n} \exp\!\left(\log T_k^{o'}\right)}.$$
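One common way to realize this descending-order constraint is a Plackett–Luce / ListMLE-style objective over the ground-truth-class probabilities of the candidate groups. The sketch below follows that standard formulation and may differ in detail from the exact equation above; the function and argument names are illustrative.

```python
import torch


def tokens_list_loss(probs_o, probs_o_prime, eps=1e-8):
    """ListMLE-style list-ranking objective: given the ground-truth-class
    probabilities of the candidate token groups, already sorted into the target
    (descending) order, it penalises any position that is beaten by a
    lower-ranked group."""
    def listmle(p):
        s = torch.log(p + eps)                         # scores = log-probabilities
        losses = []
        for j in range(s.numel()):
            # Position j should dominate everything ranked at or after j.
            losses.append(-(s[j] - torch.logsumexp(s[j:], dim=0)))
        return torch.stack(losses).sum()
    return listmle(probs_o) + listmle(probs_o_prime)


if __name__ == "__main__":
    p1 = torch.tensor([0.7, 0.5, 0.3, 0.1])            # probabilities of list T^o
    p2 = torch.tensor([0.6, 0.4, 0.2, 0.1])            # probabilities of list T^o'
    print(tokens_list_loss(p1, p2).item())
```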
  • Predicting loss.
The final classification is obtained from the two groups of tokens $\{T_1^o, T_2^o, \ldots, T_k^o\}$ and $\{T_1^{o'}, T_2^{o'}, \ldots, T_k^{o'}\}$ together with the whole image $X$. The fine-grained classification loss is defined as:
$$L_P = -\log S\!\left(X, T_1^o, T_2^o, \ldots, T_k^o, T_1^{o'}, T_2^{o'}, \ldots, T_k^{o'}\right).$$
In all, our total loss is:
$$L_{\mathrm{total}} = \lambda_2 L_G + \lambda_3 L_{\mathrm{TLL}} + L_P,$$
where $\lambda_2$ and $\lambda_3$ are set to 1 after tuning.

4. Experiments

4.1. Experimental Setup

Our model is evaluated on four widely used benchmark datasets for fine-grained image classification: CUB-200-2011 [40], Stanford Cars [41], FGVC Aircraft [42], and Stanford Dogs [43]. We do not use the bounding-box or part annotations provided with the datasets. Table 1 shows the statistics of the datasets.
During training, we set the batch size to 24, the initial learning rate to 0.001, and the weight coefficient to 0.01. The threshold for candidate tokens is set to 0.15. During testing, we adopt the same evaluation protocol. We set $\alpha = 0.13$ to weigh the small differences between images with different labels: a smaller $\alpha$ cannot push the negative tokens far enough away, while a larger $\alpha$ pulls the positive tokens too strongly. Because images with the same label can differ considerably, we choose $m = 0.05$ to determine the margin of the final negative tokens. We set $T = 10$ to weight the positive and negative tokens. For each TPG, we keep $k = 4$ groups of candidate boxes. The experiments are conducted on four NVIDIA TITAN X (Pascal) GPUs with PyTorch 1.8.1.
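For reference, the hyperparameters listed above can be gathered into a single configuration object. The snippet is a hypothetical consolidation: the key names are invented for readability, and interpreting the "weight coefficient" as a weight-decay term is an assumption.

```python
# Hypothetical training configuration assembled from the values reported in Section 4.1.
TRAIN_CONFIG = {
    "batch_size": 24,
    "learning_rate": 1e-3,
    "weight_coefficient": 0.01,   # reported as the weight coefficient (possibly weight decay)
    "candidate_threshold": 0.15,  # minimum token value kept by the TPGs
    "alpha": 0.13,                # boundary pushed beyond by negative tokens
    "m": 0.05,                    # margin; positives pulled inside alpha - m
    "temperature_T": 10,          # weighting temperature for positive/negative tokens
    "top_k": 4,                   # candidate groups kept per TPG
    "input_size": 448,
    "patch_size": 16,
}
```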

4.2. Comparison with Previous Results

We compare our approach with current state-of-the-art methods. Based on the backbone, these methods fall into two categories: (1) methods based on Convolutional Neural Networks (CNNs) and (2) methods based on Transformers. Table 2 is divided into four groups, with the first three based on VGG or ResNet backbones and the fourth based on Transformer backbones. All methods use only category labels and no bounding-box annotations.
The first group of methods applies multi-stage and reinforcement learning to obtain good accuracy, focusing on intra-class relations and thereby introducing noise. Ref. [10] removes the noise by multiplying multi-scale features to suppress low-information responses. Ref. [15] brings in position information to align the detected regions, and Ref. [14] focuses on separating the confusing classes for each image. SCAPNet [24] confines the key regions to the attention of the whole image, which guides the localization of discriminative regions.
To introduce inter-class and intra-class relations, the third group of methods constructs paired images. Ref. [17] explores the relations between two images of the same or different classes, and MAMC [18] brings in relations among multiple images through an artificial division. Ref. [47] attends to the discriminative parts by clustering multiple object parts. To some extent, these works mine inter-class and intra-class relations, but they are disturbed by noise in the discriminative regions.
Although our method falls into the second category, it still outperforms the above results by a clear margin. Exploiting the advantages of the Transformer architecture, current classifiers are mainly based on ViT [22] or DeiT [21], which already exceed the accuracy of the CNN-based groups on the bird dataset. However, these architectures consume large amounts of data and do not consider the relationships between parts. We use Swin [38] as the backbone to extract intra-class and inter-class token features. In particular, when searching for intra-class relationships, there is no need to extract the coordinates of discriminative parts or to run the backbone again; multiple token features can be obtained directly from the token location information. Multi-stage reinforcement learning is not required, which removes a large amount of computation and significantly improves classification accuracy. Building on the Transformer, we extract tokens of image patches, which provide better representations than VGG-19 and ResNet-50. We first multiply multi-scale tokens to remove some noise and then perform coarse localization to guide intra-class tokens. We adaptively select four groups of tokens for each PTM, which improves accuracy by 1.5% over TransFG [23] on CUB. The results are shown in Table 2.
Our proposed model selects the top-four indexes of discriminative tokens from the four-scale multiplied tokens, obtaining at most eight discriminative regions. It can attend to two parts separately, such as a bird's head and belly, showing that the PTMs guide the TPGs through two feature maps and build a suitable interdependence between the two attention regions, locating more discriminative parts.

4.3. Ablation Experiments

We design a series of ablation experiments to evaluate the effectiveness of two modules and their corresponding losses.

4.3.1. Influence of Pyramid Tokens Multiplication

Without any region annotation data, we choose Swin [38] as the backbone to extract basic tokens. MAMC [18] uses extra artificial information to constrain inter-class and intra-class relationships; in contrast, we use the ranking loss to keep the data distribution unchanged and further introduce the PTM to extract multi-scale tokens, increasing the accuracy by 0.4% on CUB. To investigate the influence of tokens from different stages, we append the reshaped tokens to the PTMs step by step. The tokens from stage 4 have the greatest impact on the top-one accuracy, and the effects of tokens from stages 3, 2, and 1 decrease gradually. The results are shown in Table 3.

4.3.2. Influence of Tokens Proposals Generation

If we use the TPGs alone to detect two groups of tokens, TPG I and TPG II tend to select the same groups, which reduces the accuracy. When the PTM guides the TPG, the TPG can attend to groups of tokens surrounding two different attention parts. We can therefore select more groups of tokens in which both the large, critical distinguishing regions and the small, sub-critical ones are detected. Using PTM I to guide TPG I, or the PTMs to guide the TPGs, increases the accuracy by a large margin. The results are shown in Table 4.
Influence of $\lambda_1$. Because of the guiding relation between PTM and TPG, we first select different coefficients for the ranking loss in $L_G$ and then select appropriate coefficients for the total loss $L_{\mathrm{total}}$. We mainly assess the influence of the splitting and ranking losses, shown in Figure 4. The accuracy first increases and then decreases as the ranking-loss weight grows; in particular, when the ranking loss outweighs the splitting loss, the splitting loss drops rapidly and the accuracy decreases instead.
Influence of $\lambda_2$ and $\lambda_3$. For the total loss $L_{\mathrm{total}}$, we mainly consider the mutual effects of the guiding loss $L_G$ and the tokens list loss $L_{\mathrm{TLL}}$ on the top-one accuracy. The results are shown in Figure 4. The final result changes similarly as the guiding-loss weight grows from small to large, so the guiding loss plays a guiding and collaborative role in the final results. As the tokens-list-loss weight grows, the final result reaches a peak and then decreases, so the tokens list loss is more synergistic.

4.3.3. Complexity

Since the backbone is a pre-trained Swin [38], we mainly report the parameters and FLOPs of the localization modules. The results are shown in Table 5. The PTMs locate two parts of the image quickly and with few parameters, which helps the TPGs select discriminative tokens quickly.

4.3.4. Limitations

When the image has a simple structure, as with birds or dogs, TCTM may attend to fewer than eight discriminative regions. In addition, the TPG spends considerable time ranking candidates to select the most discriminative intra-regions.

5. Conclusions

This paper proposes detecting two groups of discriminative regions for fine-grained classification by integrating a Pyramid Tokens Multiplication sub-network with a Tokens Proposals Generation sub-network. The PTM sub-network extracts multiple object feature vectors that capture inter-class and intra-class relations and learns these correlations with an additive ranking loss. Under the guidance of the PTMs, the Tokens Proposals Generation sub-network selects two groups of discriminative tokens surrounding the attention feature maps. Experiments show high accuracy on four benchmark datasets.

Author Contributions

Methodology, W.Y. and J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangdong Basic and Applied Basic Research Foundation, grant number 2019B1515130001, and the Key-Area Research and Development Program of Guangdong Province, grant numbers 2018B010107005 and 2020B0101100001.

Informed Consent Statement

Written informed consent has been obtained from the authors to publish this paper.

Data Availability Statement

My manuscript has associated data in a data repository.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (U1811264, U1811262, U1811261, U1911203, U2001211), Guangdong Basic and Applied Basic Research Foundation (2019B1515130001), Key-Area Research and Development Program of Guangdong Province (2018B010107005, 2020B0101100001).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  2. Li, Z.; Yang, Y.; Liu, X.; Zhou, F.; Wen, S.; Xu, W. Dynamic Computational Time for Visual Attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1199–1209. [Google Scholar]
  3. Liu, X.; Xia, T.; Wang, J.; Lin, Y. Fully Convolutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition. arXiv 2016, arXiv:1603.06765. [Google Scholar]
  4. Zhang, X.; Xiong, H.; Zhou, W.; Lin, W.; Tian, Q. Picking Neural Activations for Fine-Grained Recognition. IEEE Trans. Multimedia 2017, 19, 2736–2750. [Google Scholar] [CrossRef]
  5. Liu, C.; Xie, H.; Zha, Z.; Yu, L.; Chen, Z.; Zhang, Y. Bidirectional Attention-Recognition Model for Fine-Grained Object Classification. IEEE Trans. Multimedia 2020, 22, 1785–1795. [Google Scholar] [CrossRef]
  6. Fu, J.; Zheng, H.; Mei, T. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4476–4484. [Google Scholar]
  7. Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5219–5227. [Google Scholar]
  8. Yang, Z.; Luo, T.; Wang, D.; Hu, Z.; Gao, J.; Wang, L. Learning to Navigate for Fine-Grained Classification. In Proceedings of the 15th Conference on European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 438–454. [Google Scholar]
  9. He, X.; Peng, Y.; Zhao, J. Fine-grained Discriminative Localization via Saliency-guided Faster R-CNN. In Proceedings of the 27th the Conference on ACM International Conference on Multimedia (ACM MM), Mountain View, CA, USA, 23–27 October 2017; pp. 627–635. [Google Scholar]
  10. Wang, Z.; Wang, S.; Zhang, P.; Li, H.; Zhong, W.; Li, J. Weakly Supervised Fine-grained Image Classification via Correlation-guided Discriminative Learning. In Proceedings of the 27th the Conference on ACM International Conference on Multimedia (ACM MM), Nice, France, 21–25 October 2019; pp. 1851–1860. [Google Scholar]
  11. Ding, Y.; Zhou, Y.; Zhu, Y.; Ye, Q.; Jiao, J. Selective Sparse Sampling for Fine-Grained Image Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6598–6607. [Google Scholar]
  12. Wang, Z.; Wang, S.; Li, H.; Dou, Z.; Li, J. Graph-Propagation Based Correlation Learning for Weakly Supervised Fine-Grained Image Classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 12289–12296. [Google Scholar]
  13. Wang, Z.; Wang, S.; Yang, S.; Li, H.; Li, J.; Li, Z. Weakly Supervised Fine-Grained Image Classification via Gaussian Mixture Model Oriented Discriminative Learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9746–9755. [Google Scholar]
  14. Sun, G.; Cholakkal, H.; Khan, S.; Khan, F.; Shao, L. Fine-Grained Recognition: Accounting for Subtle Differences between Similar Classes. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 12047–12054. [Google Scholar]
  15. Wang, S.; Li, H.; Wang, Z.; Ouyang, W. Dynamic Position-aware Network for Fine-grained Image Recognition. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), Virtually, 2–9 February 2021; pp. 2791–2799. [Google Scholar]
  16. Dubey, A.; Gupta, O.; Guo, P.; Raskar, R.; Farrell, R.; Naik, N.; Hebert, M.; Sminchisescu, C.; Weiss, Y. Pairwise Confusion for Fine-Grained Visual Classification. In Proceedings of the 15th Conference on European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 71–88. [Google Scholar]
  17. Zhuang, P.; Wang, Y.; Qiao, Y. Learning Attentive Pairwise Interaction for Fine-Grained Classification. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 13130–13137. [Google Scholar]
  18. Sun, M.; Yuan, Y.; Zhou, F.; Ding, E. Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition. In Proceedings of the 15th Conference on European Conference on Computer Vision (ECCV), European Conference, Munich, Germany, 8–14 September 2018; Volume 11220, pp. 834–850. [Google Scholar]
  19. Liu, M.; Zhang, C.; Bai, H.; Zhang, R.; Zhao, Y. Cross-Part Learning for Fine-Grained Image Classification. IEEE Trans. Image Process. 2022, 31, 748–758. [Google Scholar] [CrossRef] [PubMed]
  20. Zhao, Y.; Li, J.; Chen, X.; Tian, Y. Part-Guided Relational Transformers for Fine-Grained Visual Recognition. IEEE Trans. Image Process. 2021, 30, 9470–9481. [Google Scholar] [CrossRef] [PubMed]
  21. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; Volume 139, pp. 10347–10357. [Google Scholar]
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  23. He, J.; Chen, J.; Liu, S.; Kortylewski, A.; Yang, C.; Bai, Y.; Wang, C. TransFG: A Transformer Architecture for Fine-Grained Recognition. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), Virtual Event, 22 February–1 March 2022; pp. 852–860. [Google Scholar]
  24. Liu, H.; Li, J.; Li, D.; See, J.; Lin, W. Learning Scale-Consistent Attention Part Network for Fine-Grained Image Recognition. IEEE Trans. Multimedia 2022, 24, 2902–2913. [Google Scholar] [CrossRef]
  25. Sohn, K. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; pp. 1849–1857. [Google Scholar]
  26. Law, M.T.; Thome, N.; Cord, M. Quadruplet-Wise Image Similarity Learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Sydney, Australia, 1–8 December 2013; pp. 249–256. [Google Scholar]
  27. Song, H.O.; Xiang, Y.; Jegelka, S.; Savarese, S. Deep Metric Learning via Lifted Structured Feature Embedding. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4004–4012. [Google Scholar]
  28. Ge, W.; Huang, W.; Dong, D.; Scott, M.R. Deep Metric Learning with Hierarchical Triplet Loss. In Proceedings of the 15th Conference on European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11210, pp. 272–288. [Google Scholar]
  29. Manmatha, R.; Wu, C.; Smola, A.J.; Krähenbühl, P. Sampling Matters in Deep Embedding Learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2859–2867. [Google Scholar]
  30. Wang, X.; Hua, Y.; Kodirov, E.; Hu, G.; Garnier, R.; Robertson, N.M. Ranked List Loss for Deep Metric Learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5207–5216. [Google Scholar]
  31. Lam, M.; Mahasseni, B.; Todorovic, S. Fine-Grained Recognition as HSnet Search for Informative Image Parts. In Proceedings of the conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6497–6506. [Google Scholar]
  32. Cai, S.; Zuo, W.; Zhang, L. Higher-Order Integration of Hierarchical Convolutional Activations for Fine-Grained Visual Categorization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 511–520. [Google Scholar]
  33. Yu, C.; Zhao, X.; Zheng, Q.; Zhang, P.; You, X. Hierarchical Bilinear Pooling for Fine-Grained Visual Recognition. In Proceedings of the 15th Conference on European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 595–610. [Google Scholar]
  34. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative Pretraining From Pixels. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 13–18 July 2020; Volume 119, pp. 1691–1703. [Google Scholar]
  35. Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022. [Google Scholar]
  36. Noroozi, M.; Favaro, P. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In Proceedings of the 14th Conference on European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Volume 9910, pp. 69–84. [Google Scholar]
  37. Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 12873–12883. [Google Scholar]
  38. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  39. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 28th Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  40. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011; Dataset.Tech.rep.; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
  41. Krause, J.; Stark, M.; Deng, J.; Li, F.-F. 3D Object Representations for Fine-Grained Categorization. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Sydney, Australia, 1–8 December 2013; pp. 554–561. [Google Scholar]
  42. Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.B.; Vedaldi, A. Fine-Grained Visual Classification of Aircraft. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Sydney, Australia, 23–28 June 2013. [Google Scholar]
  43. Khosla, A.; Jayadevaprakash, N.; Yao, B.; Li, F.-F. Novel Dataset for Fine-Grained Image Categorization: Stanford Dogs. In Proceedings of the CVPR Workshop on Fine-Grained Visual Categorization (FGVC), Colorado Springs, CO, USA, 20–25 June 2011. [Google Scholar]
  44. Engin, M.; Wang, L.; Zhou, L.; Liu, X. DeepKSPD: Learning Kernel-Matrix-Based SPD Representation For Fine-Grained Image Recognition. In Proceedings of the 16th Conference on European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 629–645. [Google Scholar]
  45. Wang, Y.; Morariu, V.I.; Davis, L.S. Learning a Discriminative Filter Bank Within a CNN for Fine-Grained Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4148–4157. [Google Scholar]
  46. Chen, Y.; Bai, Y.; Zhang, W.; Mei, T. Destruction and Construction Learning for Fine-Grained Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5157–5166. [Google Scholar]
  47. Zheng, X.; Qi, L.; Ren, Y.; Lu, X. Fine-Grained Visual Categorization by Localizing Object Parts With Single Image. IEEE Trans. Multimedia 2021, 23, 1187–1199. [Google Scholar] [CrossRef]
Figure 1. The overview of our TCTM. Given a batch of paired and unpaired images, to generate two contrastive groups of region proposals, two Pyramid Tokens Multiplication sub-networks (PTMs) progressively extract stacked multi-scale tokens, mine the spatial structure distributions, and generate two attention tokens $f_i^o$ and $f_i^{o'}$ for each image. TCTM divides the tokens into three groups of positive and negative samples and uses $L_{\mathrm{ORL}}$ to pull the query towards the positive tokens and push it away from the negative tokens. After this division, TCTM locates two different attention regions for each image. Corresponding to these attention tokens, the TPGs acquire two lists of candidate tokens, $T^o$ and $T^{o'}$, and rank them to select two top-k lists of discriminative tokens.
Figure 2. The overview of our PTM. From the pre-trained Swin [38], we apply linear embeddings to the outputs of stages 2, 3, and 4 so that they have the same scale as stage 1, and integrate them recursively by multiplication. We define the integrated attention tokens as $f_i^o = \{t^0, t^1, \ldots, t^A\}$ for each image.
Figure 3. The overview of TPG. We enlarge one black box into $7 \times 7$ image patches. We select sequences of tokens from the integrated attention $f_i^o$ and illustrate five types of candidate boxes $T_i^o$, shown by the large red boxes.
Figure 4. (a) Influence of $\lambda_1$ on CUB-200-2011, Stanford Cars, FGVC Aircraft, and Stanford Dogs. The backbone is Swin [38], and the top-one accuracy depends only on the ranking loss. (b) Influence of $\lambda_2$ and $\lambda_3$ on CUB-200-2011. The backbone is Swin [38], and the top-one accuracy depends on the total loss.
Table 1. Statistics of benchmark datasets.

Dataset        | #Class | #Train | #Test
CUB-200-2011   | 200    | 5994   | 5794
Stanford Cars  | 196    | 8144   | 8041
FGVC Aircraft  | 100    | 6667   | 3333
Stanford Dogs  | 120    | 12,000 | 8580
Table 2. Top-one accuracy on CUB-200-2011, Stanford Cars, FGVC Aircraft, and Stanford Dogs.

Method         | Backbone    | CUB-200-2011 | Stanford Cars | FGVC Aircraft | Stanford Dogs
RA-CNN [6]     | VGGNet-19   | 85.3% | 92.5% | 88.2% | 87.3%
MA-CNN [7]     | VGGNet-19   | 86.5% | 92.8% | 89.9% | -
HIHCA [32]     | ResNet-50   | 85.3% | 91.7% | 88.3% | -
Deep KSPD [44] | VGGNet-19   | 86.5% | 92.8% | 89.9% | -
HBP [33]       | VGGNet-16   | 87.1% | 93.7% | 90.3% | -
NTS [8]        | ResNet-50   | 87.5% | 93.9% | -     | -
DFL-CNN [45]   |             | 87.4% | 93.8% | 92.0% | -
DCL [46]       |             | 87.8% | 94.5% | 93.0% | -
S3Ns [11]      |             | 88.5% | 94.7% | 92.8% | -
GCL [12]       |             | 88.3% | 94.0% | 93.2% | -
BARM [5]       |             | 88.5% | 94.3% | 92.5% | -
ASD [14]       |             | 88.6% | 94.9% | 93.5% | -
DF-GMM [13]    |             | 88.8% | 94.8% | 93.8% | -
DP-Net [15]    |             | 89.3% | 94.8% | 93.9% | -
SCAPNet [24]   |             | 89.5% | -     | 93.6% | -
PART [20]      | ResNet-50   | 89.6% | 94.4% | 95.1% | -
CP-CNN [19]    | ResNet-50   | 91.4% | 95.4% | 95.5% | -
PC [16]        | ResNet-50   | 80.2% | 93.4% | 83.4% | -
MAMC [18]      | ResNet-50   | 86.2% | 92.8% | -     | 84.8%
API-Net [17]   | ResNet-50   | 87.7% | 94.8% | 93.0% | 90.3%
LOPSI [47]     | ResNet-50   | 88.9% | 94.2% | 92.3% | -
DeiT           | DeiT-B [21] | 90.0% | 93.9% | -     | -
ViT            | ViT-B [22]  | 90.3% | 93.7% | -     | 91.7%
TransFG [23]   | ViT-B [22]  | 91.7% | 94.8% | -     | 92.3%
TCTM           | Swin-B [38] | 93.2% | 96.3% | 95.7% | 93.8%
Table 3. Top-one accuracy of Pyramid Tokens Multiplication on CUB-200-2011, Stanford Cars, FGVC Aircraft, and Stanford Dogs.

Method | CUB-200-2011 | Stanford Cars | FGVC Aircraft | Stanford Dogs
Swin [38] | 90.8% | 91.8% | 90.3% | 90.9%
Swin [38] + PTMs (stage 4) + Loss (Equation (11)) | 91.7% | 92.7% | 91.4% | 91.8%
Swin [38] + PTMs (stage 3, 4) + Loss (Equation (11)) | 92.0% | 93.6% | 92.3% | 92.1%
Swin [38] + PTMs (stage 2, 3, 4) + Loss (Equation (11)) | 92.2% | 94.4% | 94.2% | 92.3%
Swin [38] + PTMs + Loss (Equation (11)) | 92.3% | 95.1% | 94.2% | 92.5%
Table 4. Top-one accuracy of Tokens Proposals Generation on CUB-200-2011, Stanford Cars, FGVC Aircraft, and Stanford Dogs.

Method | CUB-200-2011 | Stanford Cars | FGVC Aircraft | Stanford Dogs
Swin [38] + PTM + Loss (Equation (11)) | 92.3% | 95.1% | 94.2% | 92.5%
Swin [38] + PTM I + TPG I + Loss (Equation (15)) | 91.2% | 91.9% | 90.7% | 89.3%
Swin [38] + PTM I + TPG + Loss (Equation (15)) | 92.6% | 93.7% | 92.3% | 90.6%
Swin [38] + PTM + TPG I + Loss (Equation (15)) | 92.5% | 93.4% | 92.1% | 90.4%
Swin [38] + PTM + TPG + Loss (Equation (15)) | 93.2% | 96.3% | 95.7% | 93.8%
Table 5. Complexity of PTM I, PTM II, TPG I, and TPG II.

Module | PTM I    | PTM II   | TPG I    | TPG II
Params | 2.64 M   | 2.62 M   | 0.73 M   | 0.71 M
FLOPs  | 388 MMac | 386 MMac | 294 MMac | 293 MMac