**1. Introduction**

Visual representation learning has long been studied through both supervised and unsupervised methods. Most supervised models learn visual representations by training on large labeled datasets and then transferring the learned knowledge to other tasks [1–5]. Most supervised learning frameworks tune their parameters to maximally compress the mapping from the input variables while preserving only the information relevant to the output variables [6–8]. As a result, most deep neural networks fail to generalize and remain robust when test samples come from distributions or domains that differ from the training data.

Self-supervised representation learning has emerged as a new approach to overcome these drawbacks of supervised learning [9–15]. These techniques have attracted significant attention for learning efficient, generalizable, and robust representations: when transferred to multiple downstream tasks, the learned representations perform on par with, or even outperform, supervised baselines. Furthermore, self-supervised learning methods remove the need for human supervision by leveraging the enormous availability of unlabeled data. Despite the variety of self-supervised frameworks, these methods typically involve some
form of joint embedding architecture built on a two-branch neural network, such as the Siamese network [16]. The two branches may share weights or use different weights. In joint embedding self-supervised frameworks, the common objective is to maximize the agreement between embedding vectors computed from different views of the same image. The biggest challenge, however, is avoiding collapse to a trivial constant solution, in which all output embedding vectors become identical. Strategies to prevent collapse fall into two main categories: contrastive and non-contrastive learning. Self-supervised contrastive learning [9,17] prevents collapse via negative sample pairs; however, it requires a large number of negative samples, which demands substantial computational resources. An efficient alternative is non-contrastive learning [13,14,18]. These frameworks rely only on positive pairs, using a momentum encoder [13] or an extra neural network on one branch together with a blocked gradient flow [14,18].

Most existing contrastive and non-contrastive objectives are optimized over whole-image semantic features across different augmented views. This assumption raises several challenges. First, popular contrastive methods such as SimCLR [9] and MoCo [17] require more computation and training samples than supervised methods. Second, and more importantly, there is no guarantee that the semantic representations of different objects remain distinguishable across different cropped views of the same image. For instance, several meaningful objects (vehicles, humans, animals, etc.) may appear in the same image; since the semantic representations of vehicles and humans differ, contrasting the similarity between views based on the whole-image semantic feature may be misleading. Research in cognitive psychology and neuroscience [19–22] has shown that early visual attention helps humans focus on the most important group of objects. In computer vision, the perceptual grouping principle is used to group visual features into meaningful parts, enabling a much more effective learned representation of the input context [21].

Motivated by perceptual grouping, we propose the **H**euristic **A**ttention **R**epresentation **L**earning (HARL) framework, which comprises two main components. First, an early attention mechanism uses unsupervised techniques to generate a heuristic mask for extracting object-level semantic features. Second, we construct a framework that abstracts and maximizes the object-level similarity agreement (foreground and background) across different views beyond augmentations of the same image [13,18,23]. This approach enriches the quantity and quality of the semantic representation by leveraging foreground and background features extracted from the training dataset.

We can summarize our main findings and contributions as follows:


The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 introduces the HARL framework in detail. Section 4.1 briefly describes the implementation of the HARL framework in self-supervised pretraining. Section 4.2 evaluates and benchmarks HARL on ImageNet, transfers the learned representation to other downstream tasks, and compares it with previous state-of-the-art methods. Section 5 analyzes the components affecting performance and the behavior of our proposed method. Finally, Section 6 concludes the paper.

#### **2. Related Works**

Our method is most closely related to unsupervised visual representation learning, which aims to exploit the internal distributions and semantic information of input signals without human supervision. Early works focused on designing and solving pretext tasks and on image generation. Pretext tasks center on aspects of image restoration such as denoising [25], predicting noise [26], colorization [27,28], inpainting [29], predicting image rotation [30], solving jigsaw puzzles [31], and more [32,33]. However, the representations learned by neural networks pre-trained on these pretext tasks still lack generalization and robustness when applied to different downstream tasks. Generative adversarial learning [34–36] and variational auto-encoding [25,37,38] operate directly in pixel space and model high-level detail for image generation, requiring costly computation that may be neither essential nor efficient for visual representation learning.

**Self-supervised contrastive learning.** Popular self-supervised contrastive learning frameworks [9,39,40] pull together semantic features from different cropped views of the same image while pushing away features from other images. However, the downside of contrastive methods is that they require a considerable number of negative pairs, leading to a significant computation and memory footprint. An efficient alternative is non-contrastive learning [13,18], which only maximizes the similarity of two views of the same image, without contrasting them against views of different images.

**Self-supervised non-contrastive learning.** Distillation-based frameworks [13,18], inspired by knowledge distillation [41], apply a joint embedding architecture in which one branch is a student network and the other a teacher network. The student is trained to predict the representation produced by the teacher; the teacher's weights are obtained either as a running average of the student's weights [13] or by sharing the student's weights while blocking the gradient flow through the teacher branch [18]. Non-contrastive frameworks are effective and computationally efficient compared with self-supervised contrastive frameworks [9,17,39].

However, most contrastive and non-contrastive self-supervised techniques maximize similarity agreements over the whole-image context representation of different augmented views. Meanwhile, work on localized attention that separates semantic features [42,43] through perceptual grouping has shown that adopting mid-level visual priors in pretraining improves the efficiency of representation learning. The study most closely related to ours [39] leverages visual attention with segmentation and obtained impressive results when transferring the learned representation to downstream object detection and segmentation tasks on multiple datasets. In contrast to our work, it employs pixel-level models for contrastive learning, relying on backbones specialized for semantic segmentation and on different loss functions; notably, its primary objective is difficult to transfer to other self-supervised frameworks. It also did not investigate the masking feature method or the impact of the dimension and size of the output spatial feature maps on the latent embedding representation, which we examine next.

## **3. Methods**

HARL's object-level objective is applicable in both contrastive and non-contrastive learning-based frameworks. For example, our study implements it on a non-contrastive framework inspired by BYOL [13], using an exponential moving average of one encoder's weights for the other encoder and an extra predictor network. HARL's objective maximizes the agreement of the object-level (foreground and background) latent embedding vectors across different cropped views beyond augmentations, as shown in Figure 1.

**Figure 1.** The HARL architecture. The heuristic binary mask can be estimated using either conventional computer vision or deep learning approaches. Data augmentation transformations are then applied to both the image and its mask (**bottom**). The image pairs flow into a convolutional feature extraction module, and the heuristic mask is applied to the resulting feature maps to separate the foreground from the background features (**middle**). These features are further processed by non-linear multi-layer perceptron (MLP) modules. Finally, the similarity objective maximizes the agreement between foreground and background embedding vectors across the different augmented views of the same image (**top**).

#### *3.1. HARL Framework*

The HARL framework consists of three essential steps. In step 1, we estimate a heuristic binary mask for the input image, segmenting it into foreground and background (described in detail in Section 3.2). These masks can be computed using either conventional computer vision methods such as DRFI [44] or unsupervised deep learning saliency prediction [42]. After the mask is estimated, we apply the same geometric transformations (cropping, flipping, resizing, etc.) to both the image and its mask. For RGB images, photometric transformations such as color distortion are additionally applied to the image only, following the image augmentation pipeline of SimCLR [9]; the detailed pipeline is described in Appendix A.1. After data augmentation, each image and mask pair generates two augmented images x and x′ aligned with two augmented masks m and m′, as illustrated in Figure 1.
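To make step 1 concrete, the sketch below applies one shared set of geometric parameters to an image and its mask using torchvision, with photometric distortion applied to the image only. The crop scale, flip probability, and color jitter strengths are illustrative assumptions, not the exact pipeline of Appendix A.1:

```python
import random
from torchvision import transforms as T
import torchvision.transforms.functional as TF

def paired_augment(image, mask, size=224):
    """Apply identical geometric transforms to a PIL image and its binary mask."""
    # Sample one random resized crop and apply it to BOTH image and mask.
    i, j, h, w = T.RandomResizedCrop.get_params(
        image, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3))
    image = TF.resized_crop(image, i, j, h, w, [size, size])
    mask = TF.resized_crop(mask, i, j, h, w, [size, size],
                           interpolation=T.InterpolationMode.NEAREST)
    # Shared horizontal flip so image and mask stay aligned.
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    # Photometric distortion is applied to the image only, never the mask.
    image = T.ColorJitter(0.4, 0.4, 0.2, 0.1)(image)
    return TF.to_tensor(image), TF.to_tensor(mask)
```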

In step 2, we use a standard ResNet-50 [45] convolutional residual network as the feature extractor, denoted ƒ. Each augmented image is encoded by the feature extractor into spatial feature maps of size 7 × 7 × 2048; this process can be formulated as *h* = ƒ(x), where *h* ∈ ℝ<sup>H×W×D</sup>. The feature maps are then separated into foreground and background feature maps by element-wise multiplication with the heuristic binary mask. In addition, we provide ablation studies analyzing the impact of spatial feature maps of various sizes and dimensions in Section 5.1. The foreground and background features are denoted *h<sub>f</sub>* and *h<sub>b</sub>* (Appendix A.2 details the masking feature method). The foreground and background spatial features are down-sampled using global average pooling and projected to a smaller dimension with a non-linear multi-layer perceptron (MLP) architecture g.
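The masking step can be sketched in a few lines of PyTorch under our reading of Appendix A.2: the binary mask is resized to the spatial grid of the feature maps, multiplied element-wise with them, and the result is globally average-pooled. Tensor shapes and function names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def mask_features(h, m):
    """Split feature maps into pooled foreground/background descriptors.

    h: feature maps of shape (B, D, H, W), e.g. (B, 2048, 7, 7)
    m: binary masks of shape (B, 1, H_img, W_img), 1 = foreground
    """
    # Downsample the mask to the spatial size of the feature maps.
    m = F.interpolate(m.float(), size=h.shape[-2:], mode='nearest')
    h_f = h * m            # foreground feature maps
    h_b = h * (1.0 - m)    # background feature maps
    # Global average pooling over the spatial dimensions.
    return h_f.mean(dim=(2, 3)), h_b.mean(dim=(2, 3))
```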

The HARL framework structure is adapted from BYOL [13]: one augmented image (x) is processed by the encoder *f<sub>θ</sub>* and projection network *g<sub>θ</sub>*, where *θ* denotes the learned parameters. The other augmented image (x′) is processed by *f<sub>ξ</sub>* and *g<sub>ξ</sub>*, where *ξ* is an exponential moving average of *θ*. The first augmented image is further processed by the predictor network *q<sub>θ</sub>*. The projection and predictor networks share the same non-linear multi-layer perceptron (MLP) architecture, as detailed in Section 4; the definitions of the encoder, projection, and prediction networks are adapted from BYOL. Finally, the latent embedding vectors corresponding to the foreground and background features of the two augmented images are denoted *z<sub>f</sub>*, *z<sub>b</sub>*, *z<sub>f′</sub>*, and *z<sub>b′</sub>* ∈ ℝ<sup>d</sup>:

$$z_f,\ z_b \triangleq q_{\theta}\!\left(g_{\theta}\!\left(h_f\right)\right),\ q_{\theta}\!\left(g_{\theta}\!\left(h_b\right)\right); \qquad z_{f'},\ z_{b'} \triangleq g_{\xi}\!\left(h_{f'}\right),\ g_{\xi}\!\left(h_{b'}\right).$$
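The target parameters *ξ* are not updated by gradient descent; following BYOL [13], they track the online parameters *θ* through an exponential moving average. A minimal PyTorch sketch of this update (the module names and the momentum value 0.996 are assumptions for illustration):

```python
import torch

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    # xi <- tau * xi + (1 - tau) * theta, applied parameter-wise.
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(tau).add_(p_o.data, alpha=1.0 - tau)
```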

In step 3, we compute HARL's loss over the foreground and background latent representations (*z<sub>f</sub>*, *z<sub>b</sub>*, *z<sub>f′</sub>*, and *z<sub>b′</sub>*, extracted from the two augmented images x and x′), defined as the mask loss in Equation (1). We apply ℓ<sub>2</sub>-normalization to these latent vectors and then minimize their negative cosine similarity agreement with a weighting coefficient α. We study the impact of the α value and of combining whole-image and object-level latent embedding vectors in the loss objective in Section 5.2.

$$\mathcal{L}_{\theta}^{\text{Mask loss}} = -\left(\alpha \cdot \frac{z_f}{\|z_f\|_2} \cdot \frac{z_{f'}}{\|z_{f'}\|_2} + (1 - \alpha) \cdot \frac{z_b}{\|z_b\|_2} \cdot \frac{z_{b'}}{\|z_{b'}\|_2}\right),\tag{1}$$

where ‖·‖<sub>2</sub> denotes the ℓ<sub>2</sub>-norm; this objective is equivalent to the mean squared error of the ℓ<sub>2</sub>-normalized vectors. The weighting coefficient α lies in the range [0, 1].
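For illustration, Equation (1) maps directly to a few lines of PyTorch; here `z_f`/`z_b` are the online-branch embeddings and `z_fp`/`z_bp` the target-branch embeddings *z<sub>f′</sub>*/*z<sub>b′</sub>* for a batch (a sketch, not the authors' released code):

```python
import torch.nn.functional as F

def mask_loss(z_f, z_fp, z_b, z_bp, alpha=0.5):
    """Negative cosine similarity of Eq. (1), averaged over the batch."""
    fg = F.cosine_similarity(z_f, z_fp, dim=-1)  # l2-normalize, then dot product
    bg = F.cosine_similarity(z_b, z_bp, dim=-1)
    return -(alpha * fg + (1.0 - alpha) * bg).mean()
```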

We symmetrize the loss by feeding the augmented image and mask of view one to the online network and those of view two to the target network, and vice versa, computing the loss both ways at each training step. We then perform a stochastic optimization step to minimize the symmetrized loss given in Equation (2):

$$\mathcal{L}_{symmetric} = \mathcal{L}_{\theta}^{\text{Mask loss}} + \widetilde{\mathcal{L}}_{\theta}^{\text{Mask loss}}.\tag{2}$$

After pretraining is complete, we keep only the encoder *f<sub>θ</sub>* and discard all other parts of the networks. The whole training procedure is summarized in the Python-style pseudo-code of Algorithm 1.


#### **Algorithm 1: HARL: Heuristic Attention Representation Learning**
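Below is a minimal PyTorch-style sketch of one HARL training step, consistent with the procedure described above; the module and helper names (`online`, `target`, `predictor`, and `paired_augment`, `mask_features`, `mask_loss`, `ema_update` from the sketches in Section 3.1) are illustrative rather than the authors' released code, and batching and data loading are omitted for clarity.

```python
import torch

def harl_train_step(image, mask, online, target, predictor, optimizer,
                    alpha=0.5, tau=0.996):
    # Two augmented (image, mask) views of the same input.
    x1, m1 = paired_augment(image, mask)
    x2, m2 = paired_augment(image, mask)

    def embed(net, x, m, predict):
        h = net.encoder(x)               # ResNet-50 spatial feature maps
        h_f, h_b = mask_features(h, m)   # pooled foreground/background features
        z_f, z_b = net.projector(h_f), net.projector(h_b)
        if predict:                      # predictor on the online branch only
            z_f, z_b = predictor(z_f), predictor(z_b)
        return z_f, z_b

    # Symmetrized loss: each view passes through both networks.
    z_f1, z_b1 = embed(online, x1, m1, predict=True)
    z_f2, z_b2 = embed(online, x2, m2, predict=True)
    with torch.no_grad():                # no gradient through the target network
        t_f1, t_b1 = embed(target, x1, m1, predict=False)
        t_f2, t_b2 = embed(target, x2, m2, predict=False)
    loss = mask_loss(z_f1, t_f2, z_b1, t_b2, alpha) \
         + mask_loss(z_f2, t_f1, z_b2, t_b1, alpha)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(online, target, tau)      # target weights track the online weights
    return loss.item()
```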

#### *3.2. Heuristic Binary Mask*

Our heuristic binary mask estimation technique relies neither on external supervision nor on training with a limited annotated dataset. We propose two approaches, one using conventional computer vision and one using unsupervised deep learning, and both appear to generalize well across various image datasets. First, we use the traditional computer vision method DRFI [44] to generate a diverse set of binary masks by varying two hyperparameters: the Gaussian filter variance σ and the minimum cluster size s. In our implementation, we set σ = 0.8 and s = 1000 to generate binary masks for the ImageNet [24] dataset. In the second approach, we leverage the self-supervised pre-trained ResNet-50 backbone from [9,42] as a feature extractor and pass its output feature maps into a 1 × 1 convolutional classification layer for saliency prediction. The classification layer predicts the saliency, or "foregroundness", of each pixel; we threshold its output at 0.5, so pixels with saliency values greater than the threshold are assigned to the foreground. Figure 2 shows example heuristic masks estimated by these two methods. The detailed implementation of both methods, DRFI and the deep learning feature extractor combined with the 1 × 1 convolutional layer, is described in Appendix C. In most of our experiments, we used masks generated by the deep learning method because it runs on GPU rather than CPU only, making it faster than DRFI.
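As an illustration of the second approach, the sketch below thresholds per-pixel saliency predictions into a binary mask; `backbone` (a pre-trained ResNet-50 trunk returning feature maps) and `saliency_head` (the trained 1 × 1 convolutional classifier) are assumed to be available:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def heuristic_mask(image, backbone, saliency_head, threshold=0.5):
    # image: (B, 3, H, W); backbone output: (B, 2048, h, w) feature maps.
    feats = backbone(image)
    saliency = torch.sigmoid(saliency_head(feats))   # (B, 1, h, w) in [0, 1]
    # Upsample saliency back to the input resolution before thresholding.
    saliency = F.interpolate(saliency, size=image.shape[-2:],
                             mode='bilinear', align_corners=False)
    return (saliency > threshold).float()            # 1 = foreground, 0 = background
```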

**Figure 2.** Examples of heuristic binary masks used in the mask contrastive learning framework. First row: random images from the ImageNet [24] training set. Second row: masks generated by the DRFI algorithm with a predefined σ of 0.8 and a minimum component size of 1000. Third row: masks obtained from the self-supervised pre-trained ResNet-50 feature extractor followed by a 1 × 1 convolutional classification layer for foreground and background prediction.
