1. Introduction
Contrastive learning (CL) has achieved great success in visual understanding in recent years [1,2,3,4,5,6]. It is widely applied in cross-modal retrieval [7,8,9], action recognition [10], instance segmentation [11], and other fields [12,13,14]. Through a contrastive loss, the purpose of CL is to pull positive pairs together and push negative pairs apart in the feature embedding space. In other words, given the positive and negative pairs, we can build a meaningful feature embedding space using a contrastive loss.
The challenges that affect contrastive learning performance are mainly reflected in the following two aspects: (1) how to define positive and negative sample pairs; and (2) how to design an appropriate contrastive loss. The first aspect determines whether the learning paradigm is supervised or self-supervised. In supervised contrastive learning, a positive pair is made up of samples from the same category, whereas a negative pair is made up of samples from separate categories. In self-supervised contrastive learning, a positive pair is frequently generated from two views (e.g., distinct data augmentations) of the same sample, whereas a negative pair is built from a sample and additional samples of other categories or their augmented samples [15]. The second aspect strongly influences performance on high-level contrastive learning tasks, which is a major difficulty for CL. Broadly speaking, the contrastive loss is the most significant element influencing contrastive learning performance.
The triplet loss [1] is one of the most widely used contrastive loss functions for visual understanding tasks. A triplet consists of three parts: an anchor, a positive, and a negative. Taking image-to-text retrieval as an example, we take each image as an anchor. Captions that are relevant to the anchor image are positives, while those that are irrelevant are negatives. The case where the anchor is closer to the negative than to the positive is penalized via the triplet loss.
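The margin-based triplet objective described above can be sketched as follows. This is an illustrative NumPy implementation; the cosine-similarity scoring and the margin value `alpha` are assumptions for the sketch, not settings taken from the paper:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Margin-based triplet loss on similarity scores: penalize the
    case where the negative scores within `alpha` of the positive."""
    s_pos = cosine_sim(anchor, positive)
    s_neg = cosine_sim(anchor, negative)
    return max(0.0, alpha + s_neg - s_pos)
```

When the positive is already more similar to the anchor than the negative by at least the margin, the hinge is inactive and the triplet contributes no gradient.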
However, in visual understanding tasks, in addition to negatives that are completely opposite to positives, there are also many negatives that are very similar to positives in semantics or pixels. Specifically, each anchor has one positive and many negatives in a single batch. A large proportion of these negatives are further away from the anchor than the positive; such redundant negatives are often called easy negatives. Negatives that are closer to the anchor than the positive are defined as hard negatives [1]. Ignoring the contrastive loss of hard negatives prevents the network from learning discriminative features. In practice, the performance of the triplet loss is highly dependent on Hard Negative Mining (HNM), which mines hard negatives for the triplet loss. Many state-of-the-art visual understanding models [16,17,18,19,20,21] employ the Triplet loss with Hard Negative Mining (T-HNM) [22] as the optimization objective. T-HNM enables models to mine hard negative samples on various visual tasks, improving the performance of high-level tasks.
Nevertheless, some studies observe that HNM can make training difficult to converge [23]. HNM essentially increases the penalty strength for hard negatives: it provides a large gradient to hard negatives, which are optimized emphatically, while easy negatives are barely or not at all optimized. Focusing on optimizing hard negatives can help the model learn discriminative features [21]. However, is it true that the stronger the penalty, the better? Existing studies mainly design HNM strategies based on intuition and lack quantitative analysis; the appropriate level of penalty strength for hard negatives has not been studied.
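One way to make "penalty strength" concrete is to look at the gradient each negative's score receives under a given loss. The sketch below compares the per-negative gradients of the averaged triplet loss and its HNM variant; it is an illustrative analysis under assumed margin `alpha`, not the paper's exact metric:

```python
import numpy as np

def grads_sum(s_pos, s_negs, alpha=0.2):
    """Gradient of the averaged triplet loss w.r.t. each negative
    score: 1/N wherever the hinge is active, else 0."""
    s_negs = np.asarray(s_negs)
    active = (alpha + s_negs - s_pos) > 0
    return active.astype(float) / len(s_negs)

def grads_hnm(s_pos, s_negs, alpha=0.2):
    """T-HNM puts the entire gradient on the single hardest negative
    (when its hinge is active) and zero on all others."""
    s_negs = np.asarray(s_negs)
    g = np.zeros(len(s_negs))
    j = int(np.argmax(s_negs))
    if alpha + s_negs[j] - s_pos > 0:
        g[j] = 1.0
    return g
```

The contrast makes the trade-off visible: T-HNM concentrates the full penalty on one sample per batch, which is exactly what makes its optimization emphatic but also unstable.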
In order to solve the above-mentioned problems, we revisit hard negative mining in contrastive learning and propose a method for measuring and controlling the penalty strength of negatives. We first define a metric for the penalty strength of negatives. Then, we perform a quantitative analysis of common loss functions, which shows that the penalty strength of hard negatives and the difficulty of model optimization are conflicting: too large a penalty strength leads to optimization difficulties, and training becomes hard to converge. To this end, we further propose a Triplet loss with Penalty Strength Control (T-PSC), which introduces a temperature coefficient to control the penalty strength. By controlling the temperature coefficient, we can balance these two contradictory properties, which speeds up model convergence and improves retrieval performance. The major contributions of this paper are summarized as follows:
We define a metric for the penalty strength of negatives, which provides a quantitative analysis tool for HNM.
We find that the penalty strength of hard negatives and the difficulty of model optimization are contradictory. The design of loss functions needs to balance the two items.
Experiments on two visual understanding tasks with data of different modalities, i.e., Image–Text Retrieval (ITR) and Temporal Action Localization (TAL), verify that T-PSC can accelerate model training and improve the performance of current visual understanding models. T-PSC can be applied to existing ITR and TAL models in a plug-and-play manner without any other changes.
4. Experiments
We utilize the proposed T-PSC to conduct experiments on visual understanding tasks in a plug-and-play manner. To verify the effectiveness of T-PSC in different modalities, we apply it to two tasks: ITR and TAL. ITR focuses on the matching degree of image–text pairs, while TAL attends to the semantic similarity of related video frames. From the perspective of contrastive learning, the purpose of both tasks is to pull positive pairs together and push negative pairs apart in the feature embedding space. We first introduce our training details and evaluation metrics, then perform extensive ablation studies on different aspects of the ITR task to provide a better understanding of how T-PSC measures and controls the penalty strength of negatives. Finally, we apply T-PSC to existing ITR and TAL models and obtain performance improvements.
4.1. Datasets and Experiment Settings
4.2. Ablation Studies
We conduct ablation experiments on an ITR model to verify the impact of different parameters on the cross-modal performance of T-PSC.
4.2.1. Impact of Hyperparameters
There are two hyperparameters in T-PSC that can be tuned, i.e., the margin and the temperature coefficient. We experiment with several combinations of hyperparameters on Flickr30K using VSE++. All experiments on the effects of hyperparameters are shown in Figure 3.
We test the impact of the margin by fixing the temperature coefficient. Figure 3a,b shows the performance impact curves of the margin on image-to-text and text-to-image, respectively. As can be seen from these two plots, the ITR model achieves the best performance at an intermediate margin value. T-PSC degenerates into its margin-free form when the margin is zero, and with a positive margin the performance is always better than with a zero margin. This shows the significance of the margin for ITR.
We test the impact of the temperature coefficient by fixing the margin. Figure 3c,d shows the performance impact curves of the temperature coefficient on image-to-text and text-to-image, respectively. As shown in these two plots, when the temperature coefficient is small, the performance of T-PSC is always higher than that of T-HNM. According to Figure 2, with a small temperature coefficient, the penalty strength of T-PSC for hard negatives is comparable to that of T-HNM, while the optimization difficulty of T-PSC is lower. When the temperature coefficient is large, the performance of T-PSC falls below that of T-HNM; according to Figure 2, in this case the penalty strength of T-PSC for hard negatives is not large enough. In particular, at an intermediate temperature value, T-PSC achieves a large penalty strength together with a small optimization difficulty: the two contradictory properties are balanced, and the best performance is achieved.
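The interpolating behavior of the temperature coefficient discussed above can be sketched with a log-sum-exp weighting over the per-negative hinges. This is only a plausible instantiation for illustration, with assumed symbols `alpha` (margin) and `tau` (temperature), not the paper's exact T-PSC formula:

```python
import numpy as np

def tpsc_like_loss(s_pos, s_negs, alpha=0.2, tau=0.1):
    """Temperature-controlled penalty over negatives.
    Small tau -> approaches the max hinge (hard negative mining);
    large tau -> approaches the mean hinge over all negatives."""
    hinges = np.maximum(0.0, alpha + np.asarray(s_negs) - s_pos)
    # log-sum-exp softly selects the hard negatives as tau -> 0
    return float(tau * np.log(np.mean(np.exp(hinges / tau))))
```

Sweeping `tau` thus moves the loss continuously between the T-HNM-like regime (strong penalty, hard optimization) and the summed-triplet-like regime (weak penalty, easy optimization), matching the trade-off observed in the ablation.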
4.2.2. Impact of Loss Functions
T-PSC is designed based on the triplet loss and the contrastive loss. Therefore, we compare the T-PSC loss with the plain triplet loss, T-HNM, and the contrastive loss. We conduct experiments on VSE++ [22], reproduced with all experimental settings identical to [34]. Experimental results are shown in Table 1.
The T-PSC loss outperforms all three loss functions. In terms of Avg., T-PSC is 5.9%p, 2.3%p, and 1%p ahead of the three compared losses in the image-to-text sub-task, respectively. At the same time, its performance is 3.8%p, 2.5%p, and 1%p higher in the text-to-image sub-task, respectively. Two of the compared losses are special forms of T-PSC; as the more flexible form, T-PSC exhibits the best performance.
4.2.3. Comparisons with Existing Loss Functions
There are several loss functions proposed for ITR: SSP [
66], Meta-SPN [
67], AOQ [
68], and NCR [
69]. We compare T-PSC with these losses on Flickr30K. The experimental results are shown in
Table 2. Compared with these losses, T-PSC improves most evaluation metrics. T-PSC does not need to introduce many hyperparameters like SSP. Compared with Meta-SPN and NCR, T-PSC does not need to train an additional weight assignment network for loss functions. T-PSC also does not need to mine hard negatives on the entire dataset like AOQ. Overall, T-PSC achieves an impressive performance with a simple and easy-to-implement modification.
4.3. Comparisons with Existing ITR and TAL Models
4.3.1. Improvements to Existing ITR Models
T-PSC can be used in a plug-and-play manner to improve the performance of existing ITR models. We conduct experiments on three classic ITR models: VSE++ [22], BFAN [70], and SGRAF [71]. Except for replacing the loss function, the other experimental settings are the same. Table 3 shows the improvements to these models on Flickr30K. In the image-to-text sub-task, the Avg. of T-PSC is 0.5%p, 3.2%p, and 0.4%p higher than VSE++, BFAN, and SGRAF, respectively, while the improvements in the text-to-image sub-task are 0.9%p, 2.6%p, and 1.9%p. As shown in Table 4, on MS-COCO, applying T-PSC to VSE++, BFAN, and SGRAF improves Avg. by 0.4%p, 0.7%p, and 0.5%p in the image-to-text sub-task, and by 0.3%p, 0.5%p, and 0.7%p in the text-to-image sub-task. T-PSC can be easily integrated into existing ITR models and improves retrieval performance.
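The plug-and-play replacement amounts to swapping only the loss function in an existing training step. The sketch below is hypothetical: the batch similarity layout (positives on the diagonal), the function names, and the loss forms are illustrative, not taken from the cited codebases:

```python
import numpy as np

def hnm_loss(s_pos, s_negs, alpha=0.2):
    """Baseline objective: triplet loss with hard negative mining."""
    return float(max(0.0, alpha + np.max(s_negs) - s_pos))

def temperature_loss(s_pos, s_negs, alpha=0.2, tau=0.5):
    """Drop-in alternative with a temperature-softened penalty
    (an illustrative stand-in for a T-PSC-style loss)."""
    hinges = np.maximum(0.0, alpha + np.asarray(s_negs) - s_pos)
    return float(tau * np.log(np.mean(np.exp(hinges / tau))))

def batch_loss(sim_matrix, loss_fn):
    """Average a per-anchor loss over a batch similarity matrix
    whose diagonal holds the positive-pair scores."""
    sim = np.asarray(sim_matrix, dtype=float)
    losses = [loss_fn(sim[i, i], np.delete(sim[i], i))
              for i in range(sim.shape[0])]
    return float(np.mean(losses))

# Swapping the objective touches a single argument of the training step:
sim = [[0.9, 0.3, 0.6],
       [0.2, 0.8, 0.7],
       [0.1, 0.4, 0.85]]
base = batch_loss(sim, hnm_loss)             # original objective
swapped = batch_loss(sim, temperature_loss)  # plug-and-play replacement
```

Because only the loss function changes, model architecture, sampling, and all other training settings stay untouched, which is what makes this kind of integration "without any other changes".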
4.3.2. Improvements to Existing TAL Models
To maintain the plug-and-play nature, we build on our earlier work BAPG [72] on generating boundary-aware proposals in TAL, which, like T-PSC, uses contrastive learning with a hard negative mining strategy. To verify the contrastive loss performance on video, we replace the cosine similarity loss function used by BAPG with the T-PSC loss proposed in this paper. The experimental results of T-PSC together with state-of-the-art methods on THUMOS14 are shown in Table 5.
As can be seen from Table 5, after replacing the cosine similarity loss function in BAPG with T-PSC, the model performance is improved compared to both the original model and BAPG. In particular, at tIoU = 0.7, taking TriDet as an example, T-PSC obtains a gain of 0.1% over BAPG and 1.05% over the original TriDet. The experimental results of T-PSC together with state-of-the-art methods on ActivityNet-1.3 are shown in Table 6. The data in Table 6 show that T-PSC comprehensively improves the performance of existing models. Although the videos in ActivityNet-1.3 are more complicated and variable than those in THUMOS14, T-PSC still improves the original TriDet model and the BAPG model by 0.3% and 0.1% at tIoU = 0.8, respectively. These experimental results show that T-PSC is effective and improves TAL performance.
4.4. Convergence Analysis
Figure 4 compares the behavior of T-PSC and T-HNM during training. As shown in Figure 4a, T-PSC has better convergence than T-HNM: the T-PSC loss decreases rapidly in the early phase of training, while the T-HNM loss decreases slowly since its optimization is too difficult. Figure 4b,c shows that T-PSC reaches a higher performance faster. On the one hand, T-PSC reduces the model optimization difficulty by controlling the penalty strength, which accelerates training. On the other hand, T-PSC still provides a relatively large penalty strength for hard negatives. Coupled with the better training behavior, the final retrieval performance is also improved.
5. Conclusions
By revisiting hard negative mining in contrastive learning, this paper proposes T-PSC to effectively distinguish hard negative samples in visual understanding tasks. To overcome the convergence difficulties caused by traditional hard negative mining methods, we define a metric for the penalty strength of negatives, which lets us quantitatively analyze visual understanding models and find the appropriate penalty level. Moreover, T-PSC can balance the penalty strength of hard negatives against the difficulty of model optimization. We find that reasonable control of the penalty strength can speed up training and yield discriminative visual representations. T-PSC is flexible and can be seamlessly combined with current visual understanding models in a plug-and-play manner. To confirm that the characteristics of T-PSC generalize across visual understanding tasks, we conduct extensive experiments. By combining it with models in the field of Image–Text Retrieval, we verify the feature representation capabilities of T-PSC in both the image and text modalities; by combining it with models in the field of temporal action localization, we verify its effectiveness in the video modality. In future work, we will explore adaptive control of the penalty strength to avoid complicated parameter tuning and to find the optimal penalty strength for different visual understanding tasks.