5.1. Experimental Settings
Datasets. For pre-training, we use the training set of MS COCO [22], as we are mainly interested in real-world scene images with diverse and complex contents. The set contains ∼118k images and is broadly used for scene-level pre-training. COCO is also widely used for benchmarking dense prediction tasks such as object detection, instance segmentation, and semantic segmentation. Moreover, a COCO image contains 7.3 objects on average, in stark contrast to the meticulously curated ImageNet [20] images, which contain only 1.1 objects per image [9].
Architecture. We base our architecture on that of MoCo-v2+ [36]. Following [9], we add dense learning branches to the global learning branches. Specifically, the online encoder has a ResNet-50 [37] backbone, which is appended with a global projection head and a dense projection head. The former consists of two fully connected layers, while the latter consists of two convolutional layers. Both heads apply batch normalization followed by ReLU between the two layers. For both heads, the hidden and output dimensionalities are 2048 and 128, respectively. The global and dense heads are followed by their respective predictors, which share the heads' architectures but take 128-dimensional inputs. The target encoder has the same architecture as the online encoder except that it has no predictors.
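For concreteness, the following is a minimal PyTorch sketch of the two projection heads and their predictors. The layer sizes follow the description above, while the function names are our own and not from the paper.

```python
# Sketch of the global/dense projection heads and predictors described above.
import torch.nn as nn

def global_head(in_dim=2048, hid_dim=2048, out_dim=128):
    # Two fully connected layers with BN + ReLU in between.
    return nn.Sequential(
        nn.Linear(in_dim, hid_dim),
        nn.BatchNorm1d(hid_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hid_dim, out_dim),
    )

def dense_head(in_dim=2048, hid_dim=2048, out_dim=128):
    # Two 1x1 convolutions, so the spatial layout of the feature map is preserved.
    return nn.Sequential(
        nn.Conv2d(in_dim, hid_dim, kernel_size=1),
        nn.BatchNorm2d(hid_dim),
        nn.ReLU(inplace=True),
        nn.Conv2d(hid_dim, out_dim, kernel_size=1),
    )

# Predictors share the head architectures but take the 128-d projections as input.
global_predictor = global_head(in_dim=128)
dense_predictor = dense_head(in_dim=128)
```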
Data augmentation. Pre-training data augmentation follows [7]: each image is randomly cropped into two views, which are resized to 224 × 224 and then undergo random horizontal flipping, color distortion, Gaussian blur, and solarization. Pairs of crops without overlap are skipped.
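A possible torchvision realization of this pipeline is sketched below, assuming the standard BYOL/MoCo-v2-style recipe; the jitter strengths, blur kernel, and blur/solarization probabilities are assumptions rather than values taken from this paper.

```python
# Sketch of the two-view augmentation pipeline, assuming BYOL-style parameters.
from torchvision import transforms as T

def make_view(blur_p=1.0, solarize_p=0.0):
    return T.Compose([
        T.RandomResizedCrop(224),
        T.RandomHorizontalFlip(),
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),  # color distortion
        T.RandomGrayscale(p=0.2),
        T.RandomApply([T.GaussianBlur(23, sigma=(0.1, 2.0))], p=blur_p),
        T.RandomSolarize(threshold=128, p=solarize_p),
        T.ToTensor(),
    ])

# Asymmetric views as in BYOL: view 1 always blurs; view 2 rarely blurs but may solarize.
view1 = make_view(blur_p=1.0, solarize_p=0.0)
view2 = make_view(blur_p=0.1, solarize_p=0.2)
```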
Pre-training setup. Following [9], the negative-storing queues for both global and dense learning have a length of 65,536. The momentum for updating the target encoder is initially set to 0.99 and increased to 1 by the end of training [7]. Synchronized batch normalization [39] is used for all batch normalization layers [7]. The temperature τ is set to 0.2.
We use the SGD optimizer with an initial learning rate of 0.4 and a cosine learning rate decay schedule. The weight decay is 0.0001 and the optimizer momentum is 0.9. We train each model for 800 epochs on COCO with four GPUs and a total batch size of 512. Training is conducted under the MMSelfSup framework [44].
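The momentum schedule and target-network update can be sketched as follows; the cosine ramp from 0.99 to 1 is the usual BYOL-style recipe of [7] (the paper states only the start and end values), and the function names here are ours.

```python
# Sketch of the EMA-momentum schedule and target-encoder update.
import math
import torch

def ema_momentum(step, total_steps, base_m=0.99):
    # Cosine ramp: base_m at step 0, exactly 1.0 at the final step.
    return 1.0 - (1.0 - base_m) * (math.cos(math.pi * step / total_steps) + 1) / 2

@torch.no_grad()
def update_target(online, target, m):
    # target <- m * target + (1 - m) * online, parameter by parameter.
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(m).add_(p_o, alpha=1.0 - m)
```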
Evaluation settings. We follow previous work [2, 3, 9, 11, 16] to evaluate feature transferability by fine-tuning the pre-trained models on target downstream tasks. We then report the metrics used in the corresponding tasks, including VOC object detection [29], COCO object detection and instance segmentation [22], VOC semantic segmentation [29], and Cityscapes semantic segmentation [30].
For VOC object detection, we fine-tune a Faster R-CNN with a C4 backbone. Training is performed on the VOC trainval07+12 set for 24k iterations, and evaluation on the VOC test2007 set. Both training and evaluation use the Detectron2 [45] code base.
For COCO object detection and instance segmentation, we fine-tune a Mask R-CNN with an FPN backbone on COCO’s train2017 split with the standard schedule and evaluate the fine-tuned model on COCO’s val2017 split. Following previous work, we synchronize all the batch normalization layers. Detectron2 is used to conduct the training and evaluation.
We strictly follow the settings in [11] for VOC and Cityscapes semantic segmentation. Specifically, for VOC, an FPN is initialized with the pre-trained model, fine-tuned on the train_aug2012 set for 30k iterations, and evaluated on the val2012 set. For Cityscapes, we fine-tune on the train_fine set for 90k iterations and evaluate the fine-tuned model on val_fine. Training and evaluation are conducted using MMSegmentation [46].
The results, including ours and those of reproducible previous methods, are reported as averages over independent runs: five for VOC detection, three for COCO detection and instance segmentation, three for Cityscapes segmentation, and five for VOC segmentation.
5.3. Detailed Analysis
Similarity-based matching encourages learning regional semantics. Compared with the similarity-based matching used for PixCon-Sim, the coordinate-based matching of PixCon-Coord guarantees semantic consistency between the positive matches, as the matches represent the same patch in the image, which undergoes different augmentations. However, such strict geometric matching does not encourage relating spatially distant pixels associated with the same object and is thus limited in learning regional semantics.
Though similarity-based matches do not always enjoy such geometric proximity, their semantic consistency becomes increasingly better as training proceeds, provided the query feature has semantic correspondences in the key view [9]. For query pixels not lying in the intersection of the two views, i.e., out-of-box queries, their matches in the key view are guaranteed to be spatially apart from them. When such matches are semantically related, they can strengthen the correlation of spatially distant pixels belonging to the same semantic group. A qualitative investigation in the form of self-attention maps is provided in Figure 4, where semantically related but spatially distant pixel features are more holistically correlated for PixCon-Sim than for PixCon-Coord and MoCo-v2+. Moreover, Table 1 shows that PixCon-Sim delivers better transfer performance than PixCon-Coord, which may be attributed to the better regional semantics made possible by similarity-based matching.
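To make the contrast concrete, here is a minimal sketch of the two matching schemes for flattened, L2-normalized dense features; the function names are ours, and details such as how coordinates are mapped between views are simplified.

```python
# Sketch of similarity-based vs. coordinate-based positive matching.
import torch

def similarity_match(q, k):
    # Each query pixel is matched to its most similar key pixel (DenseCL-style).
    # q: (Nq, D), k: (Nk, D), both row-wise L2-normalized.
    sim = q @ k.t()                  # (Nq, Nk) cosine similarities
    best_sim, idx = sim.max(dim=1)   # argmax over key pixels
    return k[idx], best_sim

def coordinate_match(q_xy, k_xy, k):
    # Each query pixel is matched to the geometrically nearest key pixel,
    # with pixel coordinates mapped back to the original image space.
    dist = torch.cdist(q_xy, k_xy)   # (Nq, Nk) Euclidean distances
    idx = dist.argmin(dim=1)
    return k[idx]
```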
Semantic reweighting helps learn better regional semantics. The semantic reweighting strategy of PixCon-SR in Section 4.3 aims to discount the influence of inaccurate matches caused by semantically inconsistent views of scene images while utilizing as many semantically consistent matches as possible. We therefore expect the resulting features to be less correlated across different semantic classes and to have better intra-class coherence. Indeed, Figure 4 shows that PixCon-SR's self-attention maps allow for better localization of semantic objects than PixCon-Sim (less attention on features of different semantic classes) while guaranteeing sufficient coverage of whole objects (better intra-class cohesion), even when compared with the region-level method SlotCon [11]. Moreover, as shown in Table 1, PixCon-SR achieves better transfer performance than PixCon-Sim and PixCon-Coord, as well as previous region-level methods, further indicating the efficacy of semantic reweighting in learning the decent regional semantics crucial to better transfer performance. Figure 5 visualizes the semantic weights for the query features: semantic contents not shared by the two views receive small weights, while out-of-box query pixels with semantic correspondences in the key view are assigned nontrivial weights.
Designs of semantic reweighting. In Equation (7), spatial information is used to fully utilize the matches with better guarantees of semantic consistency regardless of their feature similarities: their queries, i.e., in-box queries, lie in the two views' intersection and thus always have semantic correspondences in the key view. Feature similarities, in turn, reweight the matches with out-of-box queries, diminishing the effect of semantically inconsistent ones while exploiting those that are still informative.
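The following sketch illustrates these two ingredients, with full weight for in-box queries and sharpened similarities for out-of-box ones. The exact form and normalization of Equation (7) may differ, and the name gamma for the sharpening factor is our own.

```python
# Sketch of the semantic-reweighting idea: in-box queries keep full weight,
# out-of-box queries are weighted by their sharpened matching similarity.
import torch

def semantic_weights(best_sim, in_box_mask, gamma=2.0):
    # best_sim: (Nq,) cosine similarity of each query to its matched key pixel.
    # in_box_mask: (Nq,) bool, True for queries inside the views' intersection.
    sim = best_sim.clamp(min=0.0) ** gamma  # sharpen out-of-box similarities
    return torch.where(in_box_mask, torch.ones_like(sim), sim)

# The weights then rescale the per-pixel loss terms, e.g.:
# w = semantic_weights(best_sim, in_box_mask)
# loss = (w * per_pixel_loss).sum() / w.sum()
```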
Table 3 examines the impact of these two ingredients through ablation studies. Interestingly, when using similarity-based matches with in-box queries alone, PixCon-SR (Spa.) achieves slightly better performance than PixCon-Coord, which likewise only utilizes matches with in-box queries but relies on coordinate-based matching. This indicates that similarity-based matching provides matches with sufficient semantic consistency. While using either spatial information or feature similarities alone does not yield an apparent performance gain, combining them, i.e., PixCon-SR (full), brings clear improvements in transfer performance, indicating the importance of simultaneously leveraging informative positives and mitigating the influence of false positives.
Effect of the sharpening factor. As shown in Table 3, the sharpening factor does not cause drastic fluctuations in transfer performance, but a value of 2 strikes a good balance between detection and semantic segmentation tasks and is therefore applied as the default.
A step-by-step investigation from DenseCL to PixCon-Sim. After applying the MoCo-v2+/BYOL training pipeline, MoCo-v2-based DenseCL becomes PixCon-Sim, which delivers consistently better transfer performance. It is thus interesting to investigate which newly introduced components of the pipeline contribute to the improvement.
As shown in Table 4, SyncBN can replace the ShuffleBN in MoCo-v2 without affecting transfer performance much. Asymmetric predictors make no apparent contribution. Momentum ascending, the symmetric loss, and BYOL augmentation all contribute to better transfer performance, consistent with the observations in the paper that introduced MoCo-v2+ [36]. However, we found that the symmetric loss and BYOL augmentation deliver a more consistent performance boost when applied together.
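As a reminder of what the symmetric loss entails, a minimal sketch is given below: each view acts as the query once, and the two loss terms are averaged. The helper names are ours.

```python
# Sketch of the symmetrized loss used in MoCo-v2+/BYOL-style training.
def symmetric_loss(loss_fn, view1, view2, online, target):
    l12 = loss_fn(online(view1), target(view2))  # view1 as queries, view2 as keys
    l21 = loss_fn(online(view2), target(view1))  # roles swapped
    return (l12 + l21) / 2
```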
Though asymmetric predictors and SyncBN do not improve transfer performance, they have been shown in [36] to contribute to linear probing accuracy on the pre-training dataset. If linear probing accuracy is not a concern, it might be interesting to investigate removing these two techniques. However, to align with previous region-level methods, which invariably incorporate all the BYOL components, we retain them by default and leave this investigation for future work.
SlotCon and PixPro do not benefit from image-level loss. DenseCL [9] and the proposed PixCon framework both require an image-level loss to work well. However, among the SOTA region-level methods SlotCon [11] and PixPro [15], the former does not contain an image-level loss, while the latter does not use one by default. We therefore investigate whether an additional image-level loss helps these two methods. The experiments are based on the officially released code of SlotCon and PixPro. As shown in Table 5, neither SlotCon nor PixPro benefits from the additional image-level learning.
Attempts to relax the use of prior knowledge in region-level learning. Among the region-level learning methods, two also consider pixel-level features: PixPro and SlotCon. As opposed to the pure pixel-level learning applied in DenseCL and the proposed PixCon, PixPro applies pixel-to-region matching based on self-attention to explicitly learn regional semantics. SlotCon, on the other hand, groups pixel-level features under learnable prototypes, whose number is tuned so that they capture region-level semantics; it additionally applies an attention-based region-level loss. The common first step of pixel-level and pixel-to-region losses is finding pixel-level positive matches. DenseCL and PixCon find such matches mainly by bootstrapping feature similarities, while PixPro and SlotCon utilize a safer source of information based on prior knowledge, i.e., spatial coordinates.
As we have discussed in Section 5.3 in the main text, similarity-based matching encourages learning regional semantics more than coordinate-based matching does. Thus, if we wish to learn regional semantics without explicitly applying region-level learning, similarity-based matching is the key. PixPro and SlotCon are equipped with coordinate-based matching, but they need to explicitly leverage region-level losses. A question that naturally arises is the following: will similarity-based matching facilitate explicit region-level learning? In other words, does it help to augment or replace the coordinate-based matching in PixPro or SlotCon with bootstrapping-driven similarity-based matching? We made several attempts in this direction but did not observe any improvements. The results are shown in Table 6; we provide our analyses below.
SlotCon+Pix. means that we augment SlotCon with an additional pixel-level learning branch, for which we apply the PixCon pixel-level loss (without semantic reweighting). We can observe that simply augmenting SlotCon with similarity-based pixel-level learning does not help.
SlotCon-Coord.+Sim. means that we replace coordinate-based matching with similarity-based matching; this leads to a significant performance drop. This is expected, as similarity-based matching needs the image-level loss as a basis for semantically meaningful features, whereas SlotCon's region-level loss, like similarity-based matching, also relies on bootstrapping feature similarities. Accordingly, the scenario SlotCon-Coord.+Sim.+Img., where the image-level loss is added, shows more reasonable performance, though it still does not match the original. Moreover, as shown in Table 5, SlotCon does not benefit from the image-level loss to begin with. When we instead augmented the original coordinate-based loss with the similarity-based loss on the same branch (SlotCon+Sim.), we observed a similar performance drop, and semantic reweighting (SR) helps regain part of the original performance. We observe similar trends for PixPro but only report SlotCon results here, as we have only managed to verify the reproducibility of SlotCon's code.
What could account for the failure? Compared with the straightforward pixel-level loss in PixCon, SlotCon, as well as PixPro, goes a step further and bootstraps feature similarities/attention for region-level learning. Compared with similarity-based matching, which is itself driven by bootstrapping, coordinate-based matching is apparently a safer tool for providing semantically meaningful features, at least in the initial stage, to support such region-level bootstrapping. Semantic reweighting avoids part of the negative effect of bootstrapping by incorporating spatial information, but it still relies on similarity-based matching.
Similar to PixPro and SlotCon, the proposed PixCon framework is another step towards making dense representation learning less restricted by human prior knowledge via relying more on bootstrapping. Attempting to combine PixCon and region-level bootstrapping is yet another effort in the same direction but remains challenging for now and interesting for future work.
COCO+ results. To investigate whether PixCon-SR can further benefit from more scene-centric training images, we conduct pre-training with the COCO+ dataset and provide the corresponding transfer results in Table 7. We can observe that all the reported methods gain from leveraging more scene-centric images for pre-training. It is interesting to see that SlotCon achieves substantially better performance on VOC detection, COCO detection, instance segmentation, and VOC segmentation. UniVIP also sees an impressive performance boost on VOC detection after utilizing COCO+ for pre-training. PixCon-SR enjoys consistent transfer performance improvements across the benchmarks and remains competitive with the region-level methods. Interestingly, PixCon-SR falls behind SlotCon on ADE20k when pre-trained on COCO but catches up after COCO+ pre-training; SlotCon's relative improvement on ADE20k from COCO+ pre-training is smaller than PixCon-SR's.
Visualizations of matches with in-box queries but low matching similarities. When formulating the semantic reweighting strategy, we assume that matches with in-box queries, which lie in the intersected area of the query and key views, are highly likely to have semantically consistent keys regardless of the query-key similarities, as they are guaranteed to have semantic correspondences in the key view. In Figure 6, we visualize the correspondences between in-box query pixels and their matched key pixels. Even at an early stage of training, most in-box queries with low matching similarities already have semantically consistent key pixels, which validates our assumption. As training proceeds, the matches become more accurate regardless of the similarity magnitudes.