PixCon: Pixel-Level Contrastive Learning Revisited

Pang, Zongshang; Nakashima, Yuta; Otani, Mayu; Nagahara, Hajime

doi:10.3390/electronics14081623

Open AccessArticle

PixCon: Pixel-Level Contrastive Learning Revisited^†

¹

D3 Center, Osaka University, Suita 565-0871, Japan

²

CyberAgent, Inc., Tokyo 150-0042, Japan

^*

Author to whom correspondence should be addressed.

^†

This paper is an extended version of our paper published in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024.

Electronics 2025, 14(8), 1623; https://doi.org/10.3390/electronics14081623

Submission received: 12 March 2025 / Revised: 1 April 2025 / Accepted: 11 April 2025 / Published: 17 April 2025

(This article belongs to the Special Issue Applications of Computer Vision, 3rd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

Contrastive image representation learning has been essential for pre-training vision foundation models to deliver excellent transfer learning performance. It was originally developed based on instance discrimination, which focuses on instance-level recognition tasks. Lately, the focus has shifted to directly working on the dense spatial features to improve transfer performance on dense prediction tasks such as object detection and semantic segmentation, for which pixel-level and region-level contrastive learning methods have been proposed. Region-level methods usually employ region-mining algorithms to capture holistic regional semantics and address the issue of semantically inconsistent scene image crops, as they assume that pixel-level learning struggles with both. In this paper, we revisit pixel-level learning’s potential and show that (1) it can effectively and more efficiently learn holistic regional semantics and (2) it intrinsically provides tools to mitigate the impact of semantically inconsistent views involved with scene-level training images. We prove this by proposing PixCon, a pixel-level contrastive learning framework, and testing different positive matching strategies based on this framework to rediscover the potential of pixel-level learning. Additionally, we propose a novel semantic reweighting approach tailored for pixel-level learning-based scene image pre-training, which outperforms or matches the performance of previous region-level methods in object detection and semantic segmentation tasks across multiple benchmarks.

Keywords:

self-supervised learning; computer vision; semantic segmentation; object detection; visual pre-training

1. Introduction

Contrastive image representation learning [1,2,3,4,5,6,7,8], which pulls closer the features of positive pairs produced by applying data augmentation to the same image while maximizing the distance between the features of negative samples, greatly advances the transfer learning performance of vision foundation models. Instance discrimination [1] methods work with global average-pooled image feature vectors and are thus referred to as image-level learning methods [9,10,11,12]. Such methods are highly effective in improving models’ image classification performance but often struggle to improve their performance on dense prediction tasks such as object detection [13] and semantic segmentation [14]. Various researchers propose to generalize image-level contrastive learning to work with dense spatial image features to facilitate transfer learning to dense prediction tasks [9,10,11,12,15,16,17,18,19]. Therefore, such methods are usually referred to as dense learning methods due to their focus on dense spatial features.

Though image-level learning methods are highly effective when applied on instance-centric images, e.g., ImageNet [20], they are less promising in pre-training with scene-centric images with multiple instances and complex structures [10,11,16,21], such as MS COCO images [22]. To better utilize scene images during the contrastive pre-training of vision foundation models, pixel-level [9,23] and region-level [10,11,12,15,16,17,18,19] methods have been proposed. Pixel-level learning works with individual spatial feature vectors, whereas region-level learning works with selective aggregations of them. To construct positive pairs for pixel-level learning, the semantically closest spatial feature vectors [9] in the two respective views are used. Region-level methods consider this to be insufficient for exploiting complex scene structures and leverage various region-mining algorithms, such as unsupervised object detection [5,7,24,25] or segmentation [8,26], to obtain regions of interest for constructing region-level positive pairs. A conceptual illustration of their positive matching processes is provided in Figure 1.

Moreover, the random cropping step used to create positive pairs for performing contrastive learning risks creating semantically inconsistent views, which causes features with different semantic meanings, e.g., different objects or objects and backgrounds, to be correlated. An example of such cases is provided in Figure 1, where the panda only appears in the first view but will be forced to correlate with the human’s features by contrastive learning. With the help of region-mining algorithms, region-level methods are usually considered to be better at handling such cases, as they can rely on unsupervised region masks to evaluate the semantic consistency between the views.

In the conference version of this paper [27], we primarily revisited pixel-level learning and showed that (1) the potential of the pixel-level learning baseline, DenseCL [9], has not been fully exploited; (2) regional semantics can also emerge by applying pixel-level learning; and (3) pixel-level learning readily provides tools to successfully address the problem of semantically inconsistent scene crops. Specifically, this paper makes the following contributions:

We propose PixCon, A stronger pixel-level contrastive learning framework, which augments DenseCL [9] by aligning its training pipeline with that of state-of-the-art (SOTA) region-level methods [10,11,12,16,19,28]. We show that PixCon outperforms SOTA region-level methods in terms of transfer learning tasks.
We thoroughly analyze pixel-level learning based on two positive matching schemes: semantic similarities [9] and spatial coordinates [11,23]. We name the corresponding models PixCon-Sim and PixCon-Coord. We show that the similarity-based scheme intrinsically encourages the learning of regional semantics that region-level methods focus on.
Finally, we propose PixCon-SR with a semantic reweighting strategy to deal with semantically inconsistent scene crops by jointly utilizing spatial and semantic information. PixCon-SR achieves better or competitive transfer performance compared with current SOTA methods on dense prediction tasks, including PASCAL VOC object detection [29], COCO object detection and instance segmentation [22], PASCAL VOC semantic segmentation [29], and Cityscapes semantic segmentation [30].

In this extended version of the conference paper, we provide further analyses of PixCon:

We provide a detailed analysis of how each new component in PixCon’s training pipeline contributes to improving DenseCL’s performance to match that of region-level methods.
As pixel-level learning frameworks rely on an additional image-level loss to work well, we add it to region-level methods for a fairer comparison. We show that the region-level methods cannot leverage the image-level loss.
We show that there exist challenges to improving region-level methods with pixel-level matching strategies, which opens new opportunities for future research toward more robust, dense contrastive learning frameworks.

2. Related Work

Image-level Self-Supervised Learning. Pretext tasks such as predicting colors [31], relative positions [32], or the rotations of pixels [33] are essential to self-supervised image representation learning. Instance discrimination [1] based on contrastive learning has recently become the most effective pretext, where augmented views of the same image are drawn closer to one another and pushed farther from different images [2,3,5]. Though both the pulling and pushing forces are proven to be essential in contrastive learning [6], BYOL [7] came up with techniques to only optimize the pulling part of contrastive loss.

As the aforementioned methods invariantly treat each image as a single feature vector, they are referred to as image-level learning methods. Though the resulting models excel at image classification, they perform less impressively in transferring to dense prediction tasks, which rely on sufficiently discriminative spatial features, which image-level methods do not explicitly optimize.

Dense Self-Supervised Learning. By directly optimizing spatial image features, dense learning methods yield better transfer performance in dense prediction tasks. Among them, pixel-level methods rely on crafting cross-view pixel-level positive matches utilizing either spatial coordinates [23] or bootstrapping semantic similarities [9]. As such pixel-level methods are considered insufficient for leveraging the rich semantics in complex scene images, region-level methods rely on region-mining algorithms, such as unsupervised object region proposal methods [5,7,24,25], used by [10,16,17], or unsupervised segmentation algorithms [26,34], used by [19,28], to find semantically meaningful regions, which are then used to aggregate spatial features for contrastive [5] or self-distillation learning [7]. Additionally, Wen et al. [11] and Zhang et al. [12] utilize learnable prototypes to perform unsupervised segmentation, while PixPro [15] relies on spatial distances to select semantically related features. However, we will show that region-mining algorithms are not as crucial to mining regional semantics as claimed for current region-level methods, as pixel-level learning methods can also be exploited to promote region-level learning.

Learning with Scene-Centric Images. The complex structures of scene-centric images, such as those from MS COCO [22], often cause challenges to the fundamental positive pair creation strategy, i.e., siamese learning with two augmented image views. Specifically, random crops of multi-object scene images may include totally different objects, and pulling their features closer does not contribute to learning semantically meaningful features. Region-level methods that rely on object proposals or segmentation masks can roughly evaluate the semantic consistency of the positive pairs and thus largely avoid such a problem, though at the cost of complicated pre-processing [16,21,28,35], nontrivial computational burden during training [19], or less transferable features [11,12] compared with pixel-level methods. However, we will show that tools to alleviate the negative influence of semantically inconsistent videos can be crafted with pixel-level learning alone.

3. Preliminaries

This section reviews two popular image-level learning pipelines, MoCo-v2 [2] and BYOL [7], where the latter is the default pipeline of most region-level methods. We also introduce a variant of MoCo-v2 with a similar architecture to that of BYOL, coined MoCo-v2+ by [36].

Common to MoCo-v2 and BYOL, each input image is augmented into two different views,

x_{1} \sim T_{1} (x)

and

x_{2} \sim T_{2} (x)

, which are then fed into the online encoder

f_{θ}

and the target encoder

f_{ξ}

, where

θ

represents the learnable parameters and

ξ

is the exponential moving average of

θ

. The encoders are backbone networks, e.g., ResNet [37], appended with two-layer multilayer perceptions (MLPs). The MLPs are usually called projection heads. The

f_{θ}

in BYOL has an additional two-layer MLP called the predictor, resulting in an asymmetric structure between the two encoders. Moreover, MoCo-v2 feeds each view into either the online or the target encoder to compute a loss

L_{img} (x_{1}, x_{2})

, while BYOL sends each view to both encoders and symmetrizes the loss computation with respect to the two views, i.e.,

L_{img} (x_{1}, x_{2}) + L_{img} (x_{2}, x_{1})

. Huang et al. [36] added, to MoCo-v2, the asymmetric encoder structure, where the online encoder contains a predictor, and the symmetrized loss, with

L_{img} (x_{1}, x_{2}) = - log \frac{exp (q \cdot k^{+} / τ)}{\sum_{k \in {k^{+}} \cup K} exp (q \cdot k / τ)},

(1)

where

q = f_{θ} (x_{1}) / {∥ f_{θ} (x_{1}) ∥}_{2}

is the query feature and

k^{+} = f_{ξ} (x_{2}) / {∥ f_{ξ} (x_{2}) ∥}_{2}

is the positive key feature.

K

is the set of

f_{ξ}

outputs from other images which are

q

’s negative key features stored in a fixed-length queue [2], and

τ

is the temperature coefficient.

L_{img} (x_{2}, x_{1})

is computed by obtaining the query from

x_{2}

and the positive key from

x_{1}

. The loss in Equation (1) is usually referred to as the InfoNCE loss [38]. In contrast, BYOL only aligns the positive features by maximizing their cosine similarities [7].

Additionally, BYOL also applies a momentum ascending strategy for updating

ξ

and synchronized batch normalization [39] as opposed to shuffling batch normalization [2] in MoCo-v2. When MoCo-v2 is equipped with these BYOL-style designs, it is called MoCo-v2+ in [36], demonstrating similar linear probing and transfer learning performance to those of BYOL but better than those of MoCO-v2. Moreover, SimSiamese [40] is a simplified version of BYOL, achieving better performance under similar training settings. For simplicity, we refer to BYOL, MoCo-v2+, and SimSiamese as BYOL pipelines if not stated otherwise.

4. Proposed Method

Based on MoCo-v2+, we add another asymmetric prediction structure to the backbone that outputs dense spatial feature maps, or pixel-level (pixels, in this context, refers to spatial components of dense feature maps as opposed to those of the input RGB images) features [10,11,12]. The online encoder

f_{θ}

now gives two sets of feature vectors,

q \in R^{C}

and

U \in R^{S^{2} \times C}

(after flattening the first two dimensions), where C is the feature dimensionality and S denotes the length and width of the dense feature maps, which are set as equal for simplicity. Similarly, the target encoder

f_{ξ}

gives

k \in R^{C}

and

V \in R^{S^{2} \times C}

. Figure 2 provides a schematic illustration of the forward process. Based on this forward pipeline, we propose different variants of a pixel-level contrastive learning framework, namely, PixCon, with the loss function being

L (x_{1}, x_{2}) = L_{img} (x_{1}, x_{2}) + L_{pix} (x_{1}, x_{2}),

(2)

where

L_{pix} (x_{1}, x_{2})

is the pixel-level contrastive loss to be defined. The final loss is symmetrized with respect to the two views, i.e.,

L (x_{1}, x_{2}) + L (x_{2}, x_{1})

.

4.1. PixCon-Sim

Let the backbone networks’ outputs be

F \in R^{S^{2} \times C}

and

F^{'} \in R^{S^{2} \times C}

for the query and the key views, respectively; the spatial positions of the features in F are matched to those in

F^{'}

by

l (i) = \underset{j}{arg max} s i m (F (i), F^{'} (j)),

(3)

where

i, j \in [0, S^{2} - 1]

and

s i m (a, b) = a^{⊤} b / ∥ a ∥ ∥ b ∥

. The similarity-based matching scheme aims to bootstrap feature similarities, i.e., features with better semantic correlation give more semantically meaningful matches, which are in turn used to strengthen the correlation of such features. Similar bootstrapping strategies are also applied in region-level methods [11,12,15,19].

With similarity-based matching, the pixel-level contrastive loss is then computed as follows:

L_{pix}^{l} (x_{1}, x_{2}) = - \frac{1}{S^{2}} \sum_{i} log \frac{exp (u_{i} \cdot v_{l (i)}^{+} / τ)}{\sum_{v \in {v_{l (i)}^{+}} \cup V} exp (u_{i} \cdot v / τ)},

(4)

where

u_{i} = U [i] \in R^{C}

,

v_{l (i)}^{+} = V [l (i)] \in R^{C}

, and

V

contains image-level negative key features from other images, in accordance with [9], for computational efficiency. The negative keys are stored in a fixed-length queue.

However, the matching function in Equation (3) hardly makes sense at the beginning of training. As demonstrated in DenseCL [9], jointly conducting image-level and pixel-level learning can help mitigate the problem, as image-level learning also encourages the emergence of semantic relations among spatial features [10,41]. Additionally, image-level learning is also commonly conducted along with dense learning [10,15,16] and brings benefits. Therefore, by using

L_{pix}^{l} (x_{1}, x_{2})

as the pixel-level loss in Equation (2) and symmetrizing the resulting loss with respect to the two views, we obtain the final loss for PixCon-Sim, i.e., pixel-level contrastive learning with similarity-based matches. When using the MoCo-v2 pipeline instead of MoCo-v2+ and not using the symmetrized loss, PixCon-Sim becomes DenseCL [9].

4.2. PixCon-Coord

Though similarity-based matching gives increasingly better matches as the training proceeds [9], it still retrieves semantically inconsistent matches, especially at the beginning of training. To further investigate its pros and cons, we compare it with the coordinate-based matching scheme [11,15,23], which matches two cross-view spatial features only if they have (approximately) the same coordinates when mapped back to the input image space, thus guaranteeing semantic consistency among the positive matches.

Therefore, we propose another variant of PixCon using coordinate-based matching based on inverse augmentation [11], which involves RoIAlign [42] and horizontal flipping if the input image has been flipped. The schematic illustrations of both similarity-based matching and coordinate-based matching are provided in Figure 3.

By slightly overloading the notations

U

and

V

as the pixel-level outputs of inverse augmentation, we have the corresponding pixel-level loss

L_{pix}^{c} (x_{1}, x_{2})

, which replaces the matching function l in Equation (4) with c, which is defined as

c (i) = i,

(5)

connecting the same positions in the two views’ feature maps aligned by inverse augmentation. By using

L_{pix}^{c} (x_{1}, x_{2})

as the pixel-level loss in Equation (2) and symmetrizing the resulting loss with respect to the two views, we obtain the final loss for PixCon-Coord, i.e., pixel-level contrastive learning with coordinate-based matches.

4.3. PixCon-SR

As shown in Figure 3, the two augmented views of the input multi-object image are semantically inconsistent, i.e., the panda only appears in the first view. Thus, similarity-based matches for such view-specific objects’ pixels will have different semantic classes. While coordinate-based matching helps mitigate such false matches, it only matches cross-view pixel-level features at (approximately) the same spatial location in the input image. As a result, it fails to relate semantically related but spatially distant features, whereas pulling such features closer is crucial to learning regional semantics for better transfer performance [10,11,16,28]. Therefore, it is natural to ask the following question: how do we leverage the benefits of both similarity- and coordinate-based matching schemes?

We start to craft a matching scheme that leverages both spatial and semantic information by further noting the hidden problems of similarity-based matching. Firstly, some matches with low similarities can actually be highly semantically close, constituting hard positive pairs that are important to leverage for better feature quality [43]. Moreover, although similarity-based positive matches share maximal similarities among cross-view samples, the similarities can still be low, indicating that they belong to different semantic classes. To exploit hard positive samples, we choose to fully trust positive matches whose query pixels lie in the intersection of two views regardless of the query–key similarity. We call such queries the “in-box” queries, as the intersection area is always a box. The matched key for an in-box query is highly likely to be meaningful, as the query is guaranteed to have semantic correspondences in the key view, e.g., the same pixel itself in the key view in the worst case. To address the negative influence of positive matches with low matching similarities, we propose to reweight such matches with “out-of-box” queries by their query–key similarities. We illustrate such a reweighting process in Figure 3.

We term the consequent reweighting strategy semantic reweighting, with which the pixel-level loss becomes

L_{pix}^{l, w} (x_{1}, x_{2}) = - \sum_{i} \frac{w (i)}{A} log \frac{exp (u_{i} \cdot v_{l (i)}^{+} / τ)}{\sum_{v \in {v_{l (i)}^{+}} \cup V} exp (u_{i} \cdot v / τ)},

(6)

where

A = \sum_{i} w (i)

is the normalization factor. Let

Y

be the set of indices of the in-box query features, which can be easily obtained during data augmentation; we compute

w (i)

as

w (i) = \{\begin{matrix} 1, & if i \in Y . \\ n o r m {(max_{j} s i m (F (i), F^{'} (j)))}^{α}, & otherwise . \end{matrix}

(7)

where

n o r m (x) = (x - min_{j \notin Y} w (j)) / (max_{j \notin Y} w (j) - min_{j \notin Y} w (j))

guarantees the continuity of weights and enlarges their contrast and

α

is for further sharpening the contrast and is set to 2 by default. Note that the formulation of Equation (6) is not related to inverse augmentation, which is more computationally expensive, i.e.,

U

and

V

are dense outputs from

f_{θ}

and

f_{ξ}

. By using

L_{pix}^{l, w} (x_{1}, x_{2})

as the pixel-level loss in Equation (2) and symmetrizing the resulting loss with respect to the two views, we obtain the final loss for PixCon-SR, i.e., pixel-level contrastive learning with semantic reweighting.

PixPro [15] also simultaneously utilizes spatial information and feature similarities. However, they use spatial information to retrieve positive matches, whose quality highly depends on the pre-defined size of a spatial neighborhood. We impose no spatial constraint on the positive matches at all and only bootstrap feature similarities. Due to the use of spatially close positive matches, they need to use self-attention maps to relate spatially distant pixels, whereas we merely rely on pixel-level features together with default random cropping and the inherent uncertainty of similarity-based matching to achieve this purpose.

5. Experiments

5.1. Experimental Settings

Datasets. For pre-training, as we are mainly interested in pre-training on real-world scene images containing diverse and complex contents, we use the training set of MS COCO [22], which contains ∼118k images and is broadly used for scene-level pre-training. COCO is also widely used for benchmarking dense prediction tasks such as object detection, instance segmentation, and semantic segmentation. Moreover, a COCO image contains 7.3 objects on average, which is in stark contrast to the meticulously curated ImageNet [20] images, for which the number of objects per image is 1.1 [9].

Architecture. We base our architecture on that of MoCo-V2+ [36]. Following [9], we add dense learning branches to the global learning branches. Specifically, the online encoder has a ResNet50 [37] backbone, which is appended with a global projection head and a dense projection head. The former has two fully connected layers, while the latter has two

1 \times 1

convolutional layers. Both heads have batch normalization followed by ReLU in between the two layers. For both heads, the hidden dimensionality and the output dimensionality are 2048 and 128, respectively. The global and dense heads are appended with their respective predictors, which have the same architectures as the heads with an input dimensionality of 128. The target encoder has the same architecture as the online encoder except that it does not have predictors.

Data augmentation. Pre-training data augmentation is in accordance to [7], where each image is randomly cropped into two views, which are then resized to

224 \times 224

, followed by random horizontal flipping, color distortion, Gaussian blur, and solarization. Crops without overlapping are skipped.

Pre-training setup. Following [9], the negative-storing queues for both global learning and dense learning are of length 65,536. The momentum for updating the target encoder is initially set to 0.99 and increased to 1 at the end of training [7]. Synchronized batch normalization [39] is used for all batch normalization layers [7]. The temperature

τ

is set to

0.2

. We use the SGD optimizer with an initial learning rate of 0.4 and a cosine learning rate decay schedule. We set the weight decay to 0.0001 and the momentum for the optimizer to 0.9. We train each model for 800 epochs on COCO with four GPUs and a total batch size of 512. Training is conducted under the MMSelfSup framework [44].

Evaluation settings. We follow previous work [2,3,9,11,16] to evaluate feature transferability by fine-tuning the pre-trained models on target downstream tasks. We then evaluate the resulting models by reporting the metrics used in the corresponding tasks, including VOC object detection [29], COCO object detection, COCO instance segmentation [22], VOC semantic segmentation [29], and Cityscapes semantic segmentation [30].

For VOC object detection, we fine-tune a Faster R-CNN with a C4-backbone. Training is performed on the VOC trainval07+12 set for 24k iterations. The evaluation is performed on the VOC test2007 set. Both training and evaluation use the Detectron2 [45] code base.

For COCO object detection and instance segmentation, we fine-tune a Mask R-CNN with an FPN backbone on COCO’s train2017 split with the standard

1 \times

schedule and evaluate the fine-tuned model on COCO’s val2017 split. Following previous work, we synchronize all the batch normalization layers. Detectron2 is used to conduct the training and evaluation.

We strictly follow the settings in [11] for VOC and Cityscapes semantic segmentation. Specifically, an FPN is initialized with the pre-trained model, fine-tuned on the train_aug2012 set for 30 k iterations, and evaluated on the val2012 set. For Cityscapes, we conduct fine tuning on the train_fine set for 90 k iterations and evaluate the fine-tuned model on val_fine. The training and evaluation are conducted by using MMSegmentation [46].

The results, including ours and those of reproducible previous methods, are reported as the average of five, three, three, and five independent runs for VOC detection, COCO detection and instance segmentation, Cityscapes segmentation, and VOC segmentation, respectively.

5.2. Main Results

As discussed in Section 4.1, PixCon-Sim boils down to DenseCL [9] when not applying the BYOL pipeline; this is, however, invariantly used by the region-level methods. As per Table 1, PixCon-Sim outperforms DenseCL across all the benchmarks. Additionally, with a simple pixel-level learning algorithm, PixCon-Sim is already competitive compared with region-level methods across all the benchmarks. PixCon-Coord, with a geometric matching scheme, is also competitive.

For all four tasks, PixCon-SR brings consistent performance boosts to its image-level baseline MoCo-v2+ and surpasses previous region-level methods, as well as the other two PixCon variants. Though PixCon-SR’s performance on COCO detection and instance segmentation is similar to that of UniVIP [16] and SlotCon [11], it has better performance in terms of the other three tasks. It achieves this without relying on any region-mining algorithms, as shown in Table 2, most of which resort to complex pre-processing or computationally expensive multi-stage training. Specifically, for prototype-based methods, i.e., DenseSiamese [12] and SlotCon [11], their transfer performance in VOC detection is conspicuously lower than that of the other methods. This is likely caused by the fact that the dense features are trained to cluster around a fixed number of prototypes, which may cause the features to be overfitted to the prototypes and thus may hurt the transfer performance due to overly small intra-class variances [47]. The pre-training based on a specific number of prototypes also struggles to serve multiple downstream tasks equally well [11]. Overall, Table 1 sufficiently indicates the potential of pixel-level learning and the effectiveness of PixCon-SR.

5.3. Detailed Analysis

Similarity-based matching encourages learning regional semantics. Compared with the similarity-based matching used for PixCon-Sim, the coordinate-based matching of PixCon-Coord guarantees semantic consistency between the positive matches, as the matches represent the same patch in the image, which undergoes different augmentations. However, such strict geometric matching does not encourage relating spatially distant pixels associated with the same object and is thus limited in learning regional semantics.

Though similarity-based matches do not always enjoy such geometric proximity, their semantic consistency becomes increasingly better as training proceeds if the query feature has semantic correspondences in the key view [9]. For query pixels not lying in the intersection of the two views, i.e., out-of-box queries, their matches in the key view are guaranteed to be spatially apart from them. When such matches are semantically related, they could strengthen the correlation of spatially distant pixels belonging to the same semantic group. A qualitative investigation in the form of self-attention maps is provided in Figure 4, where semantically related but spatially distant pixel features are more holistically correlated for PixCon-Sim than for PixCon-Coord and MoCo-v2+. Moreover, Table 1 shows that PixCon-Sim delivers better transfer performance compared with PixCon-Coord, which may be attributed to the better regional semantics made possible by the similarity-based matching.

Semantic reweighting helps learn better regional semantics. The semantic reweighting strategy of PixCon-SR in Section 4.3 aims to discount the influence of inaccurate matches caused by semantically inconsistent views of scene images while utilizing as many semantically consistent matches as possible. Therefore, we expect the resulting features to be less correlated when they are associated with different semantic classes and have better intra-class coherence. Indeed, Figure 4 shows that PixCon-SR’s self-attention maps allow for a better localization of semantic objects compared with PixCon-Sim (less attention on features of different semantic classes) while guaranteeing sufficient coverage of whole objects (better intra-class cohesion), even when compared with the region-level method SlotCon [11]. Moreover, as shown in Table 1, PixCon-SR achieves better transfer performance compared with PixCon-Sim and PixCon-Coord, as well as previous region-level methods, which further indicates the efficacy of the semantic reweighting strategy in helping learn decent regional semantics crucial to better transfer performance. Figure 5 provides visualizations of the semantic weights for the query features, where we can observe that the semantic contents not shared by the two views are given small weights and out-of-box query pixels with semantic correspondences in the key view are assigned nontrivial weights.

Designs of semantic reweighting. In Equation (7), spatial information is used to fully utilize matches with better guarantees for their semantic consistency regardless of their feature similarities, as their queries, i.e., in-box queries, are present in the two views’ intersected part and thus always have semantic correspondences in the key view. Additionally, feature similarities are used to reweight the matches with out-of-box queries to diminish the effect of semantically inconsistent ones while exploiting those that are still informative. Table 3 allows for an examination of the impact of these two tools based on ablation studies. Interestingly, when using similarity-based matches with in-box queries alone, PixCon-SR (Spa.) achieves slightly better performance than PixCon-Coord, which also merely utilizes matches having in-box queries but with coordinated-based matching. This indicates that similarity-based matching provides matches with sufficient semantic consistency. While only using either spatial information or feature similarities does not give apparent performance gain, combining them, i.e., PixCon-SR (full), offers immediate improvements in the transfer performance, indicating the importance of sufficiently leveraging informative positives and mitigating the influence of false positives simultaneously.

Effect of the sharpening factor $α$ . As shown in Table 3, the sharpening factor

α

does not cause drastic fluctuations in transfer performance, but a value of 2 helps strike a good balance between detection and semantic segmentation tasks, which is then applied as the default value.

A step-by-step investigation from DenseCL to PixCon-Sim. After applying the MoCo-v2+/BYOL training pipeline, MoCo-v2-based DenseCL becomes PixCon-Sim, which delivers consistently better transfer performance. It is thus interesting to investigate which newly introduced component in the new pipeline is contributing to better transfer performance.

As shown in Table 4, SyncBN can be used to replace the ShuffleBN in MoCo-v2 without affecting transfer performance much. Asymmetric predictors do not have an apparent contribution. Momentum ascending, symmetric loss, and BYOL augmentation all contribute to better transfer performance, which is consistent with the observation made in the paper where MoCo-v2+ is introduced [36]. However, we found that symmetric loss and BYOL augmentation deliver a more consistent performance boost when applied together.

Though asymmetric predictors and SyncBN do not improve transfer performance, they have been shown, in [36], to contribute to linear probing accuracy on the pre-training dataset. If linear probing accuracy is not considered, it might be interesting to investigate the effect of removing these two techniques. However, to align with previous region-level methods, which invariantly incorporate all the BYOL components, we do so as well by default and leave the investigation for future work.

SlotCon and PixPro do not benefit from image-level loss. DenseCL [9] and the proposed PixCon framework both require image-level loss to work well. However, for the SOTA region-level methods, SlotCon [11] and PixPro [15], the former does not contain an image-level loss, while the latter does not use it by default. Therefore, we would like to investigate whether an additional image-level loss will help these two methods. The experiments are based on the officially released codes of SlotCon and PixPro. As shown in Table 5, both SlotCon and PixPro fail to benefit from the additional image-level learning.

We can observe that all the reported methods have gained from leveraging more scene-centric images for pre-training. It is interesting to see that SlotCon has substantially better performance on VOC detection, COCO detection, instance segmentation, and VOC segmentation. UniVIP also witnessed an impressive performance boost on VOC detection after utilizing COCO+ for pre-training. PixCon-SR experienced consistent transfer performance improvements across the benchmarks and remains competitive compared with region-level methods. Interestingly, PixCon-SR falls behind SlotCon on ADE20k when pre-trained on COCO but catches up after COCO+ pre-training. SlotCon has a smaller relative improvement on ADE20k after pre-training on COCO+ compared with PixCon-SR.

Attempts to relax the use of prior knowledge in region-level learning. Among the region-level learning methods, there are two that also consider pixel-level features, i.e., PixPro and SlotCon. As opposed to pure pixel-level learning applied in DenseCL and the proposed PixCon, PixPro applies pixel-to-region matching based on self-attention to explicitly learn regional semantics. On the other hand, SlotCon enforces pixel-level features to be grouped under learnable prototypes, the number of which is tuned for them to capture region-level semantics. Additionally, SlotCon also applies an attention-based region-level loss. The common first step between pixel or pixel-to-region losses is to find pixel-level positive matches. DenseCL and PixCon find such matches mainly by bootstrapping feature similarities, while PixPro and SlotCon utilize a safer source of information based on prior knowledge, i.e., spatial coordinates.

As we have discussed in Section 5.3 in the main text, similarity-based matching encourages learning regional semantics more than coordinate-based matching. Thus, if we desire to learn regional semantics without explicitly applying region-level learning, similarity-based matching is the key. PixPro and SlotCon are equipped with coordinate-based matching, but they need to explicitly leverage region-level losses. One question that naturally comes to mind is the following: will similarity-based matching facilitate explicit region-level learning? In other words, we may want to know whether it helps to augment/replace the coordinate-based matching in PixPro or SlotCon with bootstrapping-driven similarity-based matching. We made several attempts in this direction but did not witness any improvements. The results are shown in Table 6. We provide our analyses of the results below.

SlotCon+Pix. means that we augment SlotCon with an additional pixel-level learning branch, for which we apply the PixCon pixel-level loss (without semantic reweighting). We can observe that simply augmenting SlotCon with similarity-based pixel-level learning does not help. SlotCon-Coord.+Sim. means that we replace coordinate-based matching with similarity-based matching, and this scenario leads to a significant performance drop. This is expected, as similarity-based matching needs the image-level loss as a basis for semantically meaningful features, whereas SlotCon’s region-level loss, similar to similarity-based matching, also relies on bootstrapping feature similarities. Therefore, the scenario SlotCon-Coord.+Sim.+Img., where the image-level loss is added, shows more reasonable performance, which still does not match the original performance. Moreover, as shown in Table 5, SlotCon does not benefit from the image-level loss to begin with. When we tried to augment the original coordinate-based loss with the similarity-based loss on the same branch (SlotCon+Sim.), we observed a similar performance drop. Semantic reweighting (SR) helps regain part of the original performance. We observe similar trends for PixPro but only report SlotCon results here, as we have only managed to verify the reproducibility of SlotCon’s code.

What could account for the failure? Compared with the straightforward pixel-level loss in PixCon, SlotCon, as well as PixPro, takes a step forward to further bootstrap feature similarities/attention for conducting region-level learning. Compared with similarity-based matching, which is already driven by bootstrapping, coordinate-based matching is apparently a safer tool for providing better semantically meaningful features, at least in the initial stage, to support such region-level bootstrapping. Semantic reweighting helps avoid part of the negative effect of bootstrapping by incorporating spatial information, but it still relies on similarity-based matching.

Similar to PixPro and SlotCon, the proposed PixCon framework is another step towards making dense representation learning less restricted by human prior knowledge via relying more on bootstrapping. Attempting to combine PixCon and region-level bootstrapping is yet another effort in the same direction but remains challenging for now and interesting for future work.

COCO+ results. To investigate whether PixCon-SR can further benefit from more scene-centric training images, we conduct pre-training with the COCO+ dataset and provide the corresponding transfer results in Table 7.

Visualizations of matches with in-box queries but low matching similarities. When formulating the semantic reweighting strategy, we assume that matches with in-box queries, which lie in the intersected area of query and key views, are highly likely to own semantically consistent keys regardless of the query–key similarities, as they are guaranteed to have semantic correspondences in the key view. In Figure 6, we visualize the correspondences between in-box query pixels and their matched key pixels. We can observe that even in an early stage of training, most of the in-box queries with low matching similarities still have semantically consistent key pixels. This validates our assumption that in-box queries tend to have semantically consistent keys regardless of their matching similarities. As training goes further, the matches also get more accurate despite the magnitudes of similarities.

6. Conclusions

In this paper, we exploited the potential of pixel-level learning on pre-training with scene images. We find that pixel-level learning baselines do not enjoy the same sophisticated training pipeline as employed in region-level methods. After training pipeline alignment, pixel-level methods can be improved to match region-level methods’ performance. Moreover, we show that pixel-level methods can also grasp regional semantics, where the key is the similarity-based positive matching strategy [9]. We eventually propose a semantic reweighting strategy to leverage both semantic and spatial cues to equip pixel-level learning with the capability of coping with semantically inconsistent scene image views. The semantic reweighting strategy helps pixel-level learning outperform or rival region-level methods but with a much simpler methodology. We believe there is still under-explored potential for pixel-level learning, and we will keep exploring this direction in future work.

Author Contributions

Conceptualization Z.P. and Y.N.; methodology, Z.P., Y.N., and M.O.; software, Z.P.; validation, Z.P.; formal analysis, Z.P.; investigation, Z.P. and Y.N.; resources, Y.N. and H.N.; writing—original draft preparation, Z.P.; writing—review and editing, Z.P., Y.N., and M.O.; visualization, Z.P.; supervision, Y.N., M.O., and H.N.; project administration, Y.N., M.O., and H.N.; funding acquisition, Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

Thiswork was partly supported by FOREST, grant No. JPMJFR216O.

Data Availability Statement

COCO: https://cocodataset.org/#home (accessed on 10 April 2025); PASCAL07: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ (accessed on 10 April 2025); PASCAL12: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ (accessed on 10 April 2025); CityScape: https://www.cityscapes-dataset.com/ (accessed on 10 April 2025).

Conflicts of Interest

Author Mayu Otani was employed by the company CyberAgent Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Virtual, 14–19 June 2020. [Google Scholar]
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020. [Google Scholar]
Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–12 December 2020. [Google Scholar]
Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
Wang, T.; Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020. [Google Scholar]
Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap your own latent: A new approach to self-supervised learning. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–12 December 2020. [Google Scholar]
Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021. [Google Scholar]
Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Virtual, 19–25 June 2021. [Google Scholar]
Xie, J.; Zhan, X.; Liu, Z.; Ong, Y.S.; Loy, C.C. Unsupervised object-level representation learning from scene images. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
Wen, X.; Zhao, B.; Zheng, A.; Zhang, X.; Qi, X. Self-supervised visual representation learning with semantic grouping. In Proceedings of the Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. Dense Siamese Network for Dense Unsupervised Learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Conference on Neural Information Processing Systems, Montréal, QC, Canada, 7–12 December 2015. [Google Scholar]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Boston, MA, USA, 8–12 June 2015. [Google Scholar]
Xie, Z.; Lin, Y.; Zhang, Z.; Cao, Y.; Lin, S.; Hu, H. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Virtual, 19–25 June 2021. [Google Scholar]
Li, Z.; Zhu, Y.; Yang, F.; Li, W.; Zhao, C.; Chen, Y.; Chen, Z.; Xie, J.; Wu, L.; Zhao, R.; et al. Univip: A unified framework for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
Selvaraju, R.R.; Desai, K.; Johnson, J.; Naik, N. Casting your model: Learning to localize improves self-supervised representations. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Virtual, 19–25 June 2021. [Google Scholar]
Henaff, O. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 12–18 July 2020. [Google Scholar]
Hénaff, O.J.; Koppula, S.; Shelhamer, E.; Zoran, D.; Jaegle, A.; Zisserman, A.; Carreira, J.; Arandjelović, R. Object discovery and representation networks. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Gool, L.V. Revisiting contrastive methods for unsupervised learning of visual representations. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
Pinheiro, P.O.; Almahairi, A.; Benmaleck, R.Y.; Golemo, F.; Courville, A. Unsupervised Learning of Dense Visual Representations. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–12 December 2020. [Google Scholar]
Nguyen, T.; Dax, M.; Mummadi, C.K.; Ngo, N.; Nguyen, T.H.P.; Lou, Z.; Brox, T. Deepusps: Deep robust unsupervised saliency prediction via self-supervision. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
Felzenszwalb, P.F.; Huttenlocher, D.P. Efficient graph-based image segmentation. Int. J. Comput. Vis. 2004, 59, 167–181. [Google Scholar] [CrossRef]
Pang, Z.; Nakashima, Y.; Otani, M.; Nagahara, H. Revisiting pixel-level contrastive pre-training on scene images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024. [Google Scholar]
Hénaff, O.J.; Koppula, S.; Alayrac, J.B.; Oord, A.v.d.; Vinyals, O.; Carreira, J. Efficient visual pretraining with contrastive detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021. [Google Scholar]
Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Zhang, R.; Isola, P.; Efros, A.A. Colorful image colorization. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Santiago, MN, USA, 7–13 December 2015. [Google Scholar]
Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised Representation Learning by Predicting Image Rotations. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016. [Google Scholar]
Huang, J.; Kong, X.; Zhang, X. Revisiting the Critical Factors of Augmentation-Invariant Representation Learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
Chen, X.; He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Virtual, 19–25 June 2021. [Google Scholar]
Chen, T.; Luo, C.; Li, L. Intriguing properties of contrastive losses. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Yeh, C.H.; Hong, C.Y.; Hsu, Y.C.; Liu, T.L.; Chen, Y.; LeCun, Y. Decoupled contrastive learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
Contributors, M. MMSelfSup: OpenMMLab Self-Supervised Learning Toolbox and Benchmark. 2021. Available online: https://github.com/open-mmlab/mmselfsup (accessed on 10 April 2025).
Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 10 April 2025).
Contributors, M. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 10 April 2025).
Zhao, N.; Wu, Z.; Lau, R.W.; Lin, S. What makes instance discrimination good for transfer learning? arXiv 2020, arXiv:2006.06606. [Google Scholar]

Figure 1. An illustration of the common assumptions regarding the differences in pixel and region-level learning methods. Girds’ colors roughly indicate pixels’ associated semantic classes based on the two input views for illustration purposes. The cross-view pixels connected by solid lines with round markers indicate positive matches. The matching process for pixel-level learning imitates the similarity-based matching from [9]. Region-level methods are motivated by the shown assumptions about pixel-level learning and rely on region-mining algorithms as tools to perform learning based on regional features. In this paper, we question these assumptions about pixel-level learning and revisit it to further exploit its potential.

Figure 2. Both the online and the target encoders output two sets of outputs: global image-level outputs (

q, k

) and dense outputs (

U, V

). The dense outputs are of size

S \times S \times C

before flattening the spatial dimensions. We leave out the visualization of global features and dense features’ last dimension (C).

Figure 2. Both the online and the target encoders output two sets of outputs: global image-level outputs (

q, k

) and dense outputs (

U, V

). The dense outputs are of size

S \times S \times C

before flattening the spatial dimensions. We leave out the visualization of global features and dense features’ last dimension (C).

Figure 3. An illustration of different PixCon variants’ matching schemes. The red bounding boxes indicate the intersected area of the two views. Girds’ colors roughly indicate pixels’ associated semantic classes for illustration purposes. We treat view 1 as the query view and view 2 as the key view. PixCon-Sim’s matching scheme is the similarity-based matching in Equation (3). PixCon-Coord uses the matching function in Equation (5), and the involved inverse augmentation includes RoIAlign [42] and optional horizontal flipping depending on whether the input is flipped. PixCon-SR uses similarity-based matching but applies the semantic reweighting in Equation (7). For the illustration of PixCon-SR, solid lines indicate matches with query pixels in the red bounding box, dashed lines represent the rest of the matches, and different line widths indicate the magnitudes of semantic weights. The matches are drawn for illustration purposes, and not all are drawn for clarity.

Figure 4. Visualizations of self-attention maps. For each row, the first image is the original image, with the red dot highlighting the pixel whose feature is used to calculate the cosine-similarity-based self-attention maps. The subsequent images are self-attention maps using different models’ features. See main texts for analyses.

Figure 5. Visualizations of semantic weights. The first row shows the raw images with the blue bounding boxes indicating the query views and the yellow bounding boxes the key views. The second row shows the heatmap of semantic weights for the query pixels (in the blue bounding box), where the red bounding boxes indicate the intersection between query and key views. All images and heatmaps are resized to the same size for visualization purposes.

Figure 6. For each query view (view 1), we calculate the cosine similarities between its backbone features and those of the key view (view 2) at different training epochs. We keep five in-box query pixels that have the lowest similarities with their matched keys using similarity-based matching. The input images are randomly cropped, resized to

1024 \times 1024

, and then go through the other default data augmentations. The large input size is to more precisely visualize the correspondences. “qk sim.” stands for the backbone feature similarities between the query and its matched key pixels and is only visualized for the query view.

Figure 6. For each query view (view 1), we calculate the cosine similarities between its backbone features and those of the key view (view 2) at different training epochs. We keep five in-box query pixels that have the lowest similarities with their matched keys using similarity-based matching. The input images are randomly cropped, resized to

1024 \times 1024

, and then go through the other default data augmentations. The large input size is to more precisely visualize the correspondences. “qk sim.” stands for the backbone feature similarities between the query and its matched key pixels and is only visualized for the query view.

Table 1. Main transfer results. All self-supervised models were pre-trained for 800 epochs on COCO, except that DetCon was trained for 1000 epochs. Among all the methods, MoCo-v2 and DenseCL are based on the MoCo-v2 pipeline, while the others are based on the BYOL pipeline. Refer to Section 4 for more details on the differences between the pipelines. We also categorize the methods into different types based on their training strategies, including image level, region level, and pixel level. Refer to Table 2 for more information about region- and pixel-level methods. On all the benchmarks, our method shows strong transfer performance. We use boldface to indicate single best results but underline multiple best results that have the same value (†: re-impl. w/official weights. ‡: full re-impl.).

Method	Type	VOC Detection			COCO Detection			COCO Instance Seg.			City. Seg.	VOC Seg.
Method	Type	AP	${AP}_{50}$	${AP}_{75}$	${AP}^{bb}$	${AP}_{50}^{bb}$	${AP}_{75}^{bb}$	${AP}^{mk}$	${AP}_{50}^{mk}$	${AP}_{75}^{mk}$	mIoU	mIoU
Random init. [9,11]	-	32.8	59.0	31.6	32.8	50.9	35.3	29.9	47.9	32.0	65.3	39.5
MoCo-v2 [5]	Image	54.7	81.0	60.6	38.5	58.1	42.1	34.8	55.3	37.3	73.8	69.2
BYOL ^‡ [7]		55.7	81.8	61.6	39.5	59.4	43.3	35.6	56.6	38.2	75.3	70.2
MoCo-v2+ ^‡ [36]		54.6	81.4	60.5	39.8	59.7	43.6	35.9	57.0	38.5	75.6	71.1
ORL ^† [10]	Region	55.8	82.1	62.3	40.2	60.0	44.3	36.4	57.4	38.8	75.4	70.7
PixPro [15]		-	-	-	40.5	60.5	44.0	36.6	57.8	39.0	75.2	72.0
DetCon [28]		-	-	-	39.8	59.5	43.5	35.9	56.4	38.7	76.1	70.2
UniVIP [16]		56.5	82.3	62.6	40.8	-	-	36.8	-	-	-	-
Odin ^‡ [19]		56.9	82.4	63.3	40.4	60.4	44.6	36.6	57.5	39.3	75.7	70.8
DenseSiam [12]		55.5	81.1	61.5	-	-	-	-	-	-	-	-
SlotCon ^† [11]		54.5	81.9	60.3	40.8	61.0	44.8	36.8	58.0	39.5	76.1	71.7
DenseCL [9]	Pixel	56.7	81.7	63.0	39.6	59.3	43.3	35.7	56.5	38.4	75.8	71.6
PixCon-Sim (ours)		57.3	82.4	63.9	40.5	60.5	44.2	36.6	57.5	39.2	76.1	72.6
PixCon-Coord (ours)		57.2	82.6	63.4	40.3	60.3	43.9	36.5	57.4	39.2	75.8	72.3
PixCon-SR (ours)		57.6	82.8	64.0	40.8	61.0	44.8	36.8	57.9	39.6	76.6	73.0

Table 2. Comparisons between region- and pixel-level methods. While most of the region-level methods require object priors, multi-stage training, or prototype learning, pixel-level methods need none of them.

Method	Scheme	Obj. Prior	Multi-Stage	Proto.
ORL [10]	Region level	✓	✓	✕
PixPro [15]		✕	✕	✕
DetCon [19]		✓	✕	✕
UniVIP [16]		✓	✕	✕
Odin [19]		✕	✓	✕
DenseSiam [12]		✕	✕	✓
SlotCon [11]		✕	✕	✓
DenseCL [9]	Pixel level	✕	✕	✕
PixCon-∗	Pixel level	✕	✕	✕

Table 3. We examine the influence of the tools used to formulate the semantic weights in Equation (7) based on ablation studies. PixCon-SR (Spa.) means that only matches whose query features lie in the two views’ intersected parts are accepted and the other matches have weights 0. Here, only the spatial information is used for formulating the semantic weights. PixCon-SR (Sim.) means that only the similarities between the matched features are used as semantic weights, regardless of whether the query features exist in the two views’ intersected area. PixCon-SR (full) utilizes both tools. The effect of the sharpening factor

α

in Equation (7) is also investigated here.

Table 3. We examine the influence of the tools used to formulate the semantic weights in Equation (7) based on ablation studies. PixCon-SR (Spa.) means that only matches whose query features lie in the two views’ intersected parts are accepted and the other matches have weights 0. Here, only the spatial information is used for formulating the semantic weights. PixCon-SR (Sim.) means that only the similarities between the matched features are used as semantic weights, regardless of whether the query features exist in the two views’ intersected area. PixCon-SR (full) utilizes both tools. The effect of the sharpening factor

α

in Equation (7) is also investigated here.

Method	$α$	COCO		VOC Seg.
Method	$α$	${AP}^{bb}$	${AP}^{mk}$	mIoU
PixCon-Sim	-	40.5	36.6	72.6
PixCon-Coord	-	40.3	36.5	72.3
PixCon-SR (Spa.)	2	40.5	36.5	72.5
PixCon-SR (Sim.)	2	40.3	36.4	72.3
PixCon-SR (Full)	2	40.8	36.8	73.0
PixCon-SR (Full)	1	40.5	36.5	73.2
PixCon-SR (Full)	4	40.5	36.6	73.0

Table 4. Investigating the effect of components in MoCo-v2+/BYOL on DenseCL’s transfer performance.

Method	COCO		VOC Seg.
Method	${AP}^{bb}$	${AP}^{mk}$	mIoU
DenseCL	39.6	35.7	71.6
+ SyncBN	39.6	35.6	71.7
+ Asymmetric predictor	39.6	35.7	71.7
+ Momentum ascending	40.1	36.2	72.1
+ Symmetric loss	40.3	36.4	71.5
+ BYOL Aug. (PixCon-Sim)	40.5	36.6	72.6
− Symmetric loss	39.8	36.0	72.2

Table 5. SlotCon and PixCon with image-level losses.

Method	COCO		VOC Seg.
Method	${AP}^{bb}$	${AP}^{mk}$	mIoU
SlotCon	40.8	36.8	71.7
SlotCon + image	40.5	36.6	70.2
PixPro	40.1	36.1	71.0
PixPro + image	40.5	36.6	69.8

Table 6. Attempts to combine similarity-based matching with SlotCon. See text for analyses.

Method	COCO		VOC Seg.
Method	${AP}^{bb}$	${AP}^{mk}$	mIoU
SlotCon	40.8	36.8	71.7
SlotCon+Pix.	40.7	36.6	70.6
SlotCon-Coord.+Sim.	39.7	35.7	68.3
SlotCon-Coord.+Sim.+Img.	40.5	36.5	69.7
SlotCon+Sim.	40.5	36.6	69.5
SlotCon+Sim.+SR	40.7	36.7	70.5

Table 7. Transfer results from COCO+. The results of SlotCon and PixCon-SR are reported as the averages of 5, 3, 3, 5, and 3 independent runs for VOC detection, COCO detection and instance segmentation, Cityscapes segmentation, VOC segmentation, and ADE20k segmentation, respectively. Except for PixCon-SR, all the methods are region-level methods. (†: re-prod. w/official weights).

Method	Dataset	VOC Detection			COCO		City. Seg.	VOC Seg.	ADE20k
Method	Dataset	AP	${AP}_{50}$	${AP}_{75}$	${AP}^{bb}$	${AP}^{mk}$	mIoU	mIoU	mIoU
ORL ^† [10]	COCO	55.8	82.1	62.3	40.2	36.4	75.4	70.7	-
UniVIP [16]		56.5	82.3	62.6	40.8	36.8	-	-	-
SlotCon ^† [11]		54.5	81.9	60.3	40.8	36.8	76.1	71.7	38.7
PixCon-SR (ours)		57.6	82.8	64.0	40.8	36.8	76.6	73.0	38.0
ORL [10]	COCO+	-	-	-	40.6	36.7	-	-	-
UniVIP [16]		58.2	83.3	65.2	41.1	37.1	-	-	-
SlotCon ^† [11]		57.0	83.0	63.4	41.7	37.6	76.6	74.1	38.9
PixCon-SR (ours)		58.5	83.4	65.2	41.2	37.1	77.0	73.9	38.8

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Pang, Z.; Nakashima, Y.; Otani, M.; Nagahara, H. PixCon: Pixel-Level Contrastive Learning Revisited. Electronics 2025, 14, 1623. https://doi.org/10.3390/electronics14081623

AMA Style

Pang Z, Nakashima Y, Otani M, Nagahara H. PixCon: Pixel-Level Contrastive Learning Revisited. Electronics. 2025; 14(8):1623. https://doi.org/10.3390/electronics14081623

Chicago/Turabian Style

Pang, Zongshang, Yuta Nakashima, Mayu Otani, and Hajime Nagahara. 2025. "PixCon: Pixel-Level Contrastive Learning Revisited" Electronics 14, no. 8: 1623. https://doi.org/10.3390/electronics14081623

APA Style

Pang, Z., Nakashima, Y., Otani, M., & Nagahara, H. (2025). PixCon: Pixel-Level Contrastive Learning Revisited. Electronics, 14(8), 1623. https://doi.org/10.3390/electronics14081623

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

PixCon: Pixel-Level Contrastive Learning Revisited^†

Abstract

1. Introduction

2. Related Work

3. Preliminaries