Real-Time RGBT Target Tracking Based on Attention Mechanism
Abstract
1. Introduction
- i. To achieve real-time RGBT tracking, we propose a tracking network based on the attention mechanism. The network uses attention for feature enhancement, which increases speed while preserving tracking accuracy, and performs the enhancement fusion operation only at the last layer to reduce computational complexity and redundant information.
- ii. We design a feature selection enhancement module that uses the channel attention mechanism to adaptively select and fuse the features learned by different convolutional kernels. We combine this module with a Transformer, which explores rich contextual information, so that useful information is enhanced and unimportant information is suppressed, improving tracking performance (see the first sketch after this list).
- iii. To better guide the tracker and produce better tracking results, we build a spatial channel adaptive adjustment fusion module, which adjusts and fuses the previously received information along both the spatial and channel dimensions (see the second sketch after this list).
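The paper's exact layer configuration is not reproduced here, so the following PyTorch sketch is only a minimal illustration of the idea behind the feature selection enhancement module in (ii): channel attention, in the spirit of selective kernel networks, adaptively weights features from convolution branches with different kernel sizes. The two branches, the kernel sizes, and the reduction ratio are our assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class FeatureSelectionEnhancement(nn.Module):
    """Illustrative sketch: channel attention adaptively selects and fuses
    features from convolution branches with different kernels (in the spirit
    of selective kernel networks); shapes and the reduction ratio are
    assumptions, not the paper's exact design."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global context
        self.fc = nn.Sequential(                     # excitation bottleneck
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        u3, u5 = self.branch3(x), self.branch5(x)    # multi-kernel branches
        s = self.pool(u3 + u5).flatten(1)            # fuse, then squeeze
        z = self.fc(s).view(b, 2, c)
        w = torch.softmax(z, dim=1)                  # per-channel branch weights
        w3 = w[:, 0].view(b, c, 1, 1)
        w5 = w[:, 1].view(b, c, 1, 1)
        return w3 * u3 + w5 * u5                     # adaptively selected fusion
```

In the paper the module is combined with a Transformer to mine contextual information; a generic stand-in would be to apply an `nn.TransformerEncoderLayer` to the flattened spatial positions of the output.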
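Similarly, a minimal sketch of what the spatial channel adaptive adjustment fusion module in (iii) could look like: each modality's features are re-weighted along the channel dimension and then along the spatial dimension before the two streams are fused. The CBAM-style mean/max spatial descriptor and the elementwise-sum fusion are illustrative assumptions, not the authors' exact operations.

```python
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """Illustrative sketch of spatial channel adaptive adjustment fusion:
    channel attention re-weights each modality's channels, a spatial map
    re-weights locations, and the two adjusted streams are summed."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.channel_fc = nn.Sequential(             # channel attention weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(                # single-channel spatial map
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def adjust(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        cw = self.channel_fc(self.pool(x).flatten(1)).view(b, c, 1, 1)
        x = x * cw                                   # channel adjustment
        sm = self.spatial(torch.cat(                 # spatial adjustment
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1))
        return x * sm

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor) -> torch.Tensor:
        return self.adjust(rgb) + self.adjust(tir)   # fuse the two modalities
```

Whether the attention weights are shared across modalities or computed per modality is a design choice; this sketch computes them per modality.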
2. Related Work
2.1. RGBT Target Tracking
2.2. MDNet Target Tracking Algorithm
2.3. Attention Mechanisms
3. Methods
3.1. An Attention-Based Real-Time Tracking Network Framework
3.2. Feature Selection Enhancement Module
3.3. Spatial Channel Adaptive Adjustment Fusion Module
4. Experiments
4.1. Implementation Details
4.2. Datasets and Evaluation Metrics
4.3. Results Comparisons
4.3.1. Evaluation of GTOT Dataset
4.3.2. Evaluation of RGBT234 Dataset
4.3.3. Evaluation of LasHeR Dataset
4.4. Analysis of Visualization Comparison Results
4.5. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Tang, Z.; Xu, T.; Wu, X.-J. A survey for deep RGBT tracking. arXiv 2022, arXiv:2201.09296.
- Yuan, D.; Zhang, H.; Shu, X.; Liu, Q.; Chang, X.; He, Z.; Shi, G. Thermal Infrared Target Tracking: A Comprehensive Review. IEEE Trans. Instrum. Meas. 2023, 73, 1–19.
- Schnelle, S.R.; Chan, A.L. Enhanced target tracking through infrared-visible image fusion. In Proceedings of the 14th International Conference on Information Fusion, Chicago, IL, USA, 5–8 July 2011; pp. 1–8.
- Chan, A.L.; Schnelle, S.R. Fusing concurrent visible and infrared videos for improved tracking performance. Opt. Eng. 2013, 52, 017004.
- Zhang, X.; Ye, P.; Peng, S.; Liu, J.; Xiao, G. DSiamMFT: An RGB-T fusion tracking method via dynamic Siamese networks using multi-layer feature fusion. Signal Process. Image Commun. 2020, 84, 115756.
- Lu, A.; Qian, C.; Li, C.; Tang, J.; Wang, L. Duality-gated mutual condition network for RGBT tracking. IEEE Trans. Neural Netw. Learn. Syst. 2022.
- He, F.; Chen, M.; Chen, X.; Han, J.; Bai, L. SiamDL: Siamese Dual-Level Fusion Attention Network for RGBT Tracking; SSRN 4209345; Elsevier: Amsterdam, The Netherlands, 2022.
- Wang, Y.; Wei, X.; Tang, X.; Wu, J.; Fang, J. Response map evaluation for RGBT tracking. Neural Comput. Appl. 2022, 34, 5757–5769.
- Wang, Y.; Li, C.; Tang, J. Learning soft-consistent correlation filters for RGB-T object tracking. In Proceedings of the First Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2018), Guangzhou, China, 23–26 November 2018; Part IV; pp. 295–306.
- Zhai, S.; Shao, P.; Liang, X.; Wang, X. Fast RGB-T tracking via cross-modal correlation filters. Neurocomputing 2019, 334, 172–181.
- Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302.
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional Siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Part II; pp. 850–865.
- Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Mei, J.; Zhou, D.; Cao, J.; Nie, R.; He, K. Differential reinforcement and global collaboration network for RGBT tracking. IEEE Sens. J. 2023, 23, 7301–7311.
- Cai, Y.; Sui, X.; Gu, G. Multi-modal multi-task feature fusion for RGBT tracking. Inf. Fusion 2023, 97, 101816.
- Liu, L.; Li, C.; Xiao, Y.; Ruan, R.; Fan, M. RGBT tracking via challenge-based appearance disentanglement and interaction. IEEE Trans. Image Process. 2024, 33, 1753–1767.
- Xue, Y.; Zhang, J.; Lin, Z.; Li, C.; Huo, B.; Zhang, Y. SiamCAF: Complementary Attention Fusion-Based Siamese Network for RGBT Tracking. Remote Sens. 2023, 15, 3252.
- Feng, L.; Song, K.; Wang, J.; Yan, Y. Exploring the potential of Siamese network for RGBT object tracking. J. Vis. Commun. Image Represent. 2023, 95, 103882.
- Li, C.; Cheng, H.; Hu, S.; Liu, X.; Tang, J.; Lin, L. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Trans. Image Process. 2016, 25, 5743–5756.
- Ye, M.; Huang, J. A Hierarchical Registration Method of the Chang’E-1 Stereo Images; Springer: Berlin/Heidelberg, Germany, 2014.
- Zhu, Y.; Li, C.; Tang, J.; Luo, B. Quality-aware feature aggregation network for robust RGBT tracking. IEEE Trans. Intell. Veh. 2020, 6, 121–130.
- Zhu, Y.; Li, C.; Luo, B.; Tang, J.; Wang, X. Dense feature aggregation and pruning for RGBT tracking. In Proceedings of the 27th ACM International Conference on Multimedia, New York, NY, USA, 21–25 October 2019; pp. 465–472.
- Gao, Y.; Li, C.; Zhu, Y.; Tang, J.; He, T.; Wang, F. Deep adaptive fusion network for high performance RGBT tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
- Li, C.; Liu, L.; Lu, A.; Ji, Q.; Tang, J. Challenge-aware RGBT tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 222–237.
- Zhang, P.; Wang, D.; Lu, H.; Yang, X. Learning adaptive attribute-driven representation for real-time RGB-T tracking. Int. J. Comput. Vis. 2021, 129, 2714–2729.
- Xiao, Y.; Yang, M.; Li, C.; Liu, L.; Tang, J. Attribute-based progressive fusion network for RGBT tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; pp. 2831–2838.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 783–792.
- Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. 1974, C-23, 90–93.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
- Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519.
- Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27.
- Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28.
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134.
- Chatfield, K.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. arXiv 2014, arXiv:1405.3531.
- Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580.
- Park, J.; Woo, S.; Lee, J.-Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514.
- Zhang, Q.-L.; Yang, Y.-B. SA-Net: Shuffle attention for deep convolutional neural networks. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239.
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135.
- Wang, Z.; Xu, J.; Liu, L.; Zhu, F.; Shao, L. RANet: Ranking attention network for fast video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3978–3987.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Part III; pp. 234–241.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25.
- Li, C.; Liang, X.; Lu, Y.; Zhao, N.; Tang, J. RGB-T object tracking: Benchmark and baseline. Pattern Recognit. 2019, 96, 106977.
- Li, C.; Xue, W.; Jia, Y.; Qu, Z.; Luo, B.; Tang, J.; Sun, D. LasHeR: A large-scale high-diversity benchmark for RGBT tracking. IEEE Trans. Image Process. 2021, 31, 392–404.
- Tu, Z.; Lin, C.; Zhao, W.; Li, C.; Tang, J. M5L: Multi-modal multi-margin metric learning for RGBT tracking. IEEE Trans. Image Process. 2021, 31, 85–98.
- Li, C.; Lu, A.; Zheng, A.; Tu, Z.; Tang, J. Multi-adapter RGBT tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
- Mei, J.; Zhou, D.; Cao, J.; Nie, R.; Guo, Y. HDINet: Hierarchical dual-sensor interaction network for RGBT tracking. IEEE Sens. J. 2021, 21, 16915–16926.
- Zhang, H.; Zhang, L.; Zhuo, L.; Zhang, J. Object tracking in RGB-T videos using modal-aware attention network and competitive learning. Sensors 2020, 20, 393.
- Li, C.; Zhao, N.; Lu, Y.; Zhu, C.; Tang, J. Weighted sparse representation regularized graph learning for RGB-T object tracking. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1856–1864.
- Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646.
- Jung, I.; Son, J.; Baek, M.; Han, B. Real-Time MDNet. Available online: https://openaccess.thecvf.com/content_ECCV_2018/html/Ilchae_Jung_Real-Time_MDNet_ECCV_2018_paper.html (accessed on 4 June 2024).
- Peng, J.; Zhao, H.; Hu, Z. Dynamic fusion network for RGBT tracking. IEEE Trans. Intell. Transp. Syst. 2022, 24, 3822–3832.
- Tang, Z.; Xu, T.; Li, H.; Wu, X.-J.; Zhu, X.; Kittler, J. Exploring fusion strategies for accurate RGBT visual object tracking. Inf. Fusion 2023, 99, 101881.
- Wang, C.; Xu, C.; Cui, Z.; Zhou, L.; Zhang, T.; Zhang, X.; Yang, J. Cross-modal pattern-propagation for RGB-T tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7064–7073.
- Zhang, P.; Zhao, J.; Bo, C.; Wang, D.; Lu, H.; Yang, X. Jointly modeling motion and appearance cues for robust RGB-T tracking. IEEE Trans. Image Process. 2021, 30, 3335–3347.
- Lu, A.; Li, C.; Yan, Y.; Tang, J.; Luo, B. RGBT tracking via multi-adapter network with hierarchical divergence loss. IEEE Trans. Image Process. 2021, 30, 5613–5625.
- Zhang, L.; Danelljan, M.; Gonzalez-Garcia, A.; Van De Weijer, J.; Shahbaz Khan, F. Multi-modal fusion for end-to-end RGB-T tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
Attribute-based evaluation on the GTOT dataset (PR/SR, %):

| Attribute | CAT | M5L | DFNet | DFAT | MDNet+RGBT | Ours |
|---|---|---|---|---|---|---|
| OCC | 89.9/69.2 | 87.1/66.6 | 88.7/68.9 | 86.3/68.7 | 82.9/64.1 | 90.4/70.3 |
| LSV | 85.0/67.9 | 91.0/70.2 | 84.2/69.7 | 92.4/75.0 | 77.0/57.3 | 86.1/70.3 |
| FM | 83.9/65.4 | 89.4/68.5 | 81.4/64.4 | 89.1/74.0 | 80.5/59.8 | 85.8/68.8 |
| LI | 89.2/72.3 | 91.7/73.0 | 89.6/73.3 | 92.2/74.1 | 79.5/64.3 | 91.1/74.2 |
| TC | 89.9/71.0 | 89.2/69.5 | 88.6/71.5 | 89.1/70.7 | 79.5/60.9 | 91.9/72.4 |
| SO | 94.7/69.9 | 96.0/70.2 | 94.3/71.3 | 94.4/71.9 | 87.0/62.2 | 94.6/71.8 |
| DEF | 92.5/75.5 | 92.2/74.6 | 92.8/74.8 | 91.9/73.5 | 81.6/68.8 | 92.1/75.4 |
| ALL | 88.9/71.7 | 89.6/71.0 | 88.1/71.9 | 89.3/72.3 | 80.0/63.7 | 90.0/73.0 |
Attribute-based evaluation on the RGBT234 dataset (PR/SR, %):

| Attribute | CAT | M5L | ADRNet | DFAT | APFNet | Ours |
|---|---|---|---|---|---|---|
| BC | 81.1/51.9 | 75.0/47.7 | 80.4/53.6 | 71.9/47.8 | 81.3/54.5 | 80.0/54.5 |
| CM | 75.2/52.7 | 75.2/52.9 | 74.3/52.9 | 74.2/54.7 | 77.9/56.3 | 78.8/57.1 |
| DEF | 76.2/54.1 | 73.6/51.1 | 74.3/52.8 | 76.0/57.6 | 78.5/56.4 | 79.2/56.9 |
| FM | 73.1/47.0 | 72.8/46.5 | 74.9/48.9 | 65.4/46.2 | 79.1/51.1 | 79.4/50.9 |
| HO | 70.0/48.0 | 66.5/45.0 | 71.4/49.6 | 63.9/45.5 | 73.8/50.7 | 74.8/53.3 |
| LI | 81.0/54.7 | 82.1/54.7 | 81.1/56.0 | 78.3/56.2 | 84.3/56.9 | 84.8/58.3 |
| LR | 82.0/53.9 | 82.3/53.5 | 83.8/56.2 | 75.2/51.5 | 84.4/56.5 | 85.4/59.0 |
| MB | 68.3/49.0 | 73.8/52.8 | 73.3/53.2 | 68.6/50.2 | 74.5/54.5 | 75.7/55.2 |
| NO | 93.2/66.8 | 93.1/64.6 | 91.6/66.0 | 93.3/69.6 | 94.8/68.0 | 93.2/67.0 |
| PO | 85.1/59.3 | 86.3/58.9 | 85.1/60.3 | 80.7/59.2 | 86.3/60.6 | 90.1/64.1 |
| SV | 79.7/56.6 | 79.6/54.2 | 78.6/56.2 | 77.4/57.5 | 83.1/57.9 | 83.0/59.4 |
| TC | 80.3/57.7 | 82.1/56.4 | 79.6/58.6 | 67.5/49.4 | 82.2/58.1 | 80.7/59.8 |
| ALL | 80.4/56.1 | 79.5/54.2 | 80.7/57.0 | 76.1/55.5 | 82.7/57.9 | 84.4/60.2 |
Ablation study on the GTOT and RGBT234 datasets (PR/SR):

| Dataset | Metric | Ours-FSEM | Ours-SCAAFM | Ours |
|---|---|---|---|---|
| GTOT | PR | 0.886 | 0.878 | 0.900 |
| GTOT | SR | 0.714 | 0.707 | 0.730 |
| RGBT234 | PR | 0.836 | 0.822 | 0.844 |
| RGBT234 | SR | 0.590 | 0.589 | 0.602 |
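For reference, the PR and SR scores in the tables above follow the usual RGBT benchmark protocol: PR (precision rate) is the fraction of frames whose predicted center lies within a pixel threshold of the ground truth (commonly 5 px on GTOT and 20 px on RGBT234/LasHeR), and SR (success rate) is the area under the success plot of IoU over thresholds. A minimal NumPy sketch, assuming per-frame boxes in (x, y, w, h) format; the threshold default and the 21-point threshold grid are conventional choices, not taken from this paper:

```python
import numpy as np

def center_error(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Intersection-over-union of axis-aligned boxes (x, y, w, h)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def pr_sr(pred: np.ndarray, gt: np.ndarray, px_thresh: float = 20.0):
    """PR: share of frames with center error below px_thresh.
    SR: area under the success curve over IoU thresholds in [0, 1]."""
    pr = float(np.mean(center_error(pred, gt) <= px_thresh))
    ious = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, 21)
    sr = float(np.mean([np.mean(ious > t) for t in thresholds]))
    return pr, sr
```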
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).