**4. Experiments**

#### *4.1. Self-Supervised Pretraining Implementation*

HARL is trained on RGB images and the corresponding heuristic masks of the ImageNet ILSVRC-2012 [24] training set without labels. We use a standard ResNet encoder [45]. Following previous work on SimCLR and BYOL [9,13], the encoder's output representation is projected into a smaller dimension using a multi-layer perceptron (MLP). In our implementation, the MLP comprises a linear layer with an output size of 4096, followed by batch normalization [46], rectified linear units (ReLU) [47] and a final linear layer with 512 output units. We apply the LARS optimizer [48] with a cosine decay learning rate schedule without restarts [49] over 1000 epochs, using a base learning rate of 0.2 scaled linearly [50] with the batch size (LearningRate = 0.2 × BatchSize/256) and 10 warmup epochs. Furthermore, we apply a global weight decay of 5 × 10<sup>−7</sup> while excluding the biases and normalization parameters from the LARS adaptation and weight decay. The optimization of the online and target networks follows the protocol of BYOL [13]. We use a batch size of 4096 split over 8 Nvidia A100 GPUs. This setup takes approximately 149 h to train a ResNet-50 (×1).
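To make the learning rate schedule concrete, the sketch below reproduces the linear scaling rule with warmup and cosine decay described above. It is only an illustration of the schedule; the function name and the assumed steps_per_epoch value are ours, not part of the released implementation.

```python
import math

def harl_learning_rate(step, base_lr=0.2, batch_size=4096,
                       warmup_epochs=10, total_epochs=1000, steps_per_epoch=312):
    """Linearly scaled base LR with linear warmup followed by cosine decay (no restarts)."""
    scaled_lr = base_lr * batch_size / 256          # LearningRate = 0.2 x BatchSize / 256
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:                         # linear warmup over the first 10 epochs
        return scaled_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return scaled_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```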

The computational requirements of the self-supervised pretraining stage are largely due to forward and backward passes through the convolutional backbone. For the typical ResNet-50 architecture applied to 224 × 224 resolution images, a single forward pass requires approximately 4B FLOPs. The projection head MLP (2048 × 4096 + 4096 × 512) requires roughly 10M FLOPs. In our implementation, the convolutional backbone and MLP network are the same as in the baseline BYOL. Since we forward the foreground and background representations through the projection head twice instead of once, our framework incurs an additional 10M FLOPs, less than 0.25% of the total. Finally, the cost of computing the heuristic mask images is negligible because they can be computed once and reused throughout training. Therefore, the per-iteration complexity of our method and the baseline BYOL is almost the same in terms of both computational cost and training time.
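As a rough sanity check on these numbers, the multiply-accumulate count of the projection head can be computed directly. This is a back-of-the-envelope sketch; the backbone figure is the commonly quoted approximation for a single ResNet-50 forward pass at 224 × 224.

```python
backbone_flops = 4e9                      # approximate ResNet-50 forward pass at 224x224
mlp_flops = 2048 * 4096 + 4096 * 512      # projection head: ~10.5M multiply-accumulates
extra_flops = mlp_flops                   # one extra pass through the projection head
print(f"projection head: {mlp_flops / 1e6:.1f}M MACs, "
      f"about {extra_flops / backbone_flops:.2%} of a single backbone pass")
```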

#### *4.2. Evaluation Protocol*

We evaluate the learned representation from the self-supervised pretraining stage on various natural image datasets and tasks, including image classification, segmentation and object detection. First, we assess the obtained representation with linear classification and semi-supervised learning on ImageNet, following the protocols of [9,51]. Second, we evaluate the generalization and robustness of the learned representation by transferring it to other natural image datasets and vision tasks across image classification, object detection and segmentation. Finally, in Appendix B, we provide the detailed configuration and hyperparameter settings of the linear and fine-tuning protocols used in our transfer learning implementation.

#### 4.2.1. Linear Evaluation and Semi-Supervised Learning on the ImageNet Dataset

The evaluation for linear and semi-supervised learning follows the procedure in [9,52,53]. For the linear evaluation, we train a linear classifier on top of the frozen encoder representation and report Top-1 and Top-5 accuracies (in %) on the test set, as shown in Table 1. We then evaluate semi-supervised learning, i.e., fine-tuning the pre-trained encoder on small subsets with 1% and 10% of the labeled ILSVRC-2012 ImageNet [24] training set. We also report the Top-1 and Top-5 accuracies for the test set in Table 1. HARL obtains 54.5% and 69.5% Top-1 accuracy for semi-supervised learning using the standard ResNet-50 (×1), an improvement of +1.3% and +0.7% over the baseline framework BYOL [13] and a significant improvement over the strong supervised baseline.


**Table 1. Evaluation on ImageNet.** Linear evaluation and semi-supervised learning with a fraction (1% and 10%) of ImageNet labels, reporting Top-1 and Top-5 accuracies (in %) using the pre-trained ResNet-50 backbone. The best result is bolded.

#### 4.2.2. Transfer Learning to Other Downstream Tasks

We evaluated the quality of HARL's representation learning on linear classification and fine-tuned models following the evaluation protocols of [9,13,39,55], as detailed in Appendix B.2. HARL's learned representation performs well on all six natural image datasets: it is competitive with the baseline BYOL [13] across datasets of varying distributions and improves significantly over the SimCLR [9] approach on all six datasets, as shown in Table 2.

We further evaluated HARL's generalization ability and robustness on other computer vision tasks, including object detection on VOC07+12 [56] using the Faster R-CNN [57] architecture with an R50-C4 backbone and instance segmentation on COCO [58] using Mask R-CNN [59] with an R50-FPN backbone. The fine-tuning procedure and hyperparameter settings are detailed in Appendix B.3. We report the standard AP, AP50 and AP75 metrics in Table 3. HARL outperforms the baseline BYOL [13] and also performs significantly better than other self-supervised frameworks such as SimCLR [9] and MoCo\_v2 [17], as well as the supervised baseline, on object detection and segmentation.


**Table 2. Transfer via fine-tuning on the image classification task.** Transfer learning performance of the HARL framework and other self-supervised baselines across six natural image classification datasets, using the representation pre-trained on the ImageNet 1000 classes with the standard ResNet-50 backbone. The best result is bolded.

**Table 3. Transfer learning to other downstream vision tasks.** Transfer learning performance of the HARL framework and other self-supervised baselines on object detection and instance segmentation. We use Faster R-CNN with a C4 backbone for object detection and Mask R-CNN with an FPN backbone for instance segmentation. Both backbones are initialized with the ResNet-50 backbone pre-trained on the ImageNet 1000 classes. The best result is bolded.


#### **5. Ablation and Analysis**

We study HARL's components to give intuition about their behavior and impact on performance. We reproduce the HARL framework in multiple experiments, keeping the same set of hyperparameter configurations and changing only the configuration of the component under investigation. We perform our ablation experiments with the ResNet-50 and ResNet-18 architectures on the ImageNet training set without labels. We evaluate the learned representation with the ImageNet linear evaluation during the self-supervised pretraining stage. To do so, we attach a linear classifier on top of the base encoder and block gradient flow at the classifier's input, so that label information does not influence or update the encoder (a similar approach to SimCLR [9]). We run ablations over 100 epochs and evaluate the Top-1 accuracy on the public validation set of the original ILSVRC-2012 ImageNet [24] every 100 or 200 steps per epoch, following the protocol described in Appendix B.1.
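For reference, a minimal sketch of this probing setup is shown below. The wrapper and names are illustrative rather than the released implementation; the key point is that the classifier only ever sees detached features, so no label gradient reaches the encoder.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()
backbone.fc = nn.Identity()           # expose the 2048-d pooled representation
classifier = nn.Linear(2048, 1000)    # linear probe over the 1000 ImageNet classes

def probe_logits(images):
    with torch.no_grad():             # block gradient flow at the classifier's input
        features = backbone(images)
    return classifier(features)
```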

#### *5.1. The Output of Spatial Feature Map (Size and Dimension)*

In our HARL framework, separating foreground and background features from the output spatial feature map is essential to maximize the similarity objective across different augmented views. To verify this hypothesis, we analyze several spatial outputs of various sizes and dimensions by modifying the stride of the ResNet kernels to generate different feature map sizes with the same dimension. For illustration, the standard ResNet is a sequence of four convolutional building blocks (conv2\_x, conv3\_x, conv4\_x, conv5\_x). For the ResNet-50 architecture, the output feature map of the conv5\_x block is 7 × 7 × 2048. After changing the kernel stride of the conv5\_x block from two to one, its new output dimension is 14 × 14 × 2048. In this modified ResNet-50 architecture, the spatial size of the conv5\_x block's feature map is the same as that of the conv4\_x block output.
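One convenient way to obtain the larger feature map in practice is sketched below using torchvision. Note that the paper's modification changes only the stride, whereas this helper additionally dilates the convolutions to preserve the receptive field, so it is an approximation of the described setup.

```python
from torchvision.models import resnet50

# Setting the last entry to True replaces the stride-2 downsampling of layer4
# (conv5_x) with stride 1 plus dilation, so a 224x224 input yields a
# 14x14x2048 spatial feature map instead of 7x7x2048.
model = resnet50(replace_stride_with_dilation=[False, False, True])
```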

We conduct the experiment for three different sizes, including a deep ResNet-50 (7 × 7 × 2048, 14 × 14 × 2048, 28 × 28 × 2048) and a shallow ResNet-18 (7 × 7 × 512, 14 × 14 × 512, 28 × 28 × 512). Figure 3 shows how the various output sizes and dimensions in the pretraining stage affect the learned representation when it is evaluated on ImageNet with the linear evaluation protocol. Both the shallow and deep ResNet architectures learn better with the larger 14 × 14 output spatial feature map than with 7 × 7. In our experiments, performance decreases as we move to larger output sizes such as 28 × 28 or 56 × 56.

**Figure 3.** The ImageNet linear evaluation Top-1 accuracy (in %) of spatial output feature maps in various sizes and dimensions during the self-supervised pretraining stage. (**a**) The self-supervised pre-trained encoder uses the ResNet-18 backbone; (**b**) The self-supervised pre-trained encoder uses the ResNet-50 backbone.

#### *5.2. Objective Loss Functions*

The HARL framework reuses structural elements of BYOL [13]. We use two neural networks, denoted as the *online network* and the *target network*, defined by sets of parameters *θ* and *ξ*, respectively. The optimization objective minimizes the loss L*θ*,*ξ* with respect to the learnable parameters *θ*, while *ξ* is an exponential moving average of *θ*, as shown in Equation (3):

$$
\xi \leftarrow \tau \xi + (1 - \tau)\theta. \tag{3}
$$

Unlike previous approaches that minimize a loss based only on the whole-image latent embedding vectors of the two augmented views, HARL minimizes the similarity of object-level latent representations, which associate the same spatial regions abstracted from the segmentation mask and thus the same semantic meaning. As shown in Figure 1, we use the mask information to separate the spatial semantic object-level features (foreground and background) of the two augmented views. Then, we minimize their negative cosine similarity, denoted the mask loss in Equation (1). In addition to the mask loss objective, we combine the distance loss of the whole-image representation with the object-level loss, resulting in the hybrid loss described in Equation (4). We study these two loss objectives in the self-supervised pretraining stage and then evaluate the obtained representation on ImageNet with the linear evaluation protocol.

#### 5.2.1. Mask Loss

The mask loss objective minimizes the distance loss between the foreground and background latent embeddings in vector space, L<sub>foreground</sub>(*θ*, *ξ*) and L<sub>background</sub>(*θ*, *ξ*), weighted by the coefficient α, as described in Equation (1). We study the impact of α when it is set to a few predefined values and when it varies according to a cosine scheduling rule. In the first approach, we perform self-supervised pretraining sweeping over three different values {0.3, 0.5, 0.7}. In the second approach, we schedule α with a cosine rule, α = 1 − (1 − α*base*)·(cos(*πk*/*K*) + 1)/2, which gradually increases α from the starting α*base* value to 1 as the current training step *k* progresses over the total training steps *K*. We tried three α*base* values: 0.3, 0.5 and 0.7. We report the Top-1 accuracy on the ImageNet linear evaluation set during the self-supervised pretraining stage, as shown in Figure 4. A weighting coefficient α value of 0.7 yields consistent learned representations in both approaches. Furthermore, the experimental results demonstrate that the foreground is more important than the background latent representation. For example, many images in the ImageNet training set contain background covering more than 50% of the image.
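A minimal sketch of this schedule (the function name is ours) shows how α starts at α*base* and reaches 1 at the final training step:

```python
import math

def alpha_at(step, total_steps, alpha_base=0.7):
    # Cosine rule from above: alpha_base at step 0, increasing to 1 at step total_steps.
    return 1.0 - (1.0 - alpha_base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```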

**Figure 4.** The impact of the weighting coefficient α value on the obtained representation during the self-supervised pretraining stage with the ResNet-50 backbone. The evaluation during pretraining uses the ImageNet linear evaluation protocol in Top-1 accuracy (in %). (**a**) The α value is fixed; (**b**) The α value follows the cosine function scheduler.

#### 5.2.2. Hybrid Loss

The objective combines the whole-image representation embeddings *v*1 and *v*2 with the object-level mask loss described in Equation (1). *v*1 and *v*2 are extracted from the two augmented views *x* and *x*′ and are denoted as *v*1 ≜ *q<sub>θ</sub>*(*g<sub>θ</sub>*(*f<sub>θ</sub>*(*x*))) ∈ R*<sup>d</sup>* and *v*2 ≜ *g<sub>ξ</sub>*(*f<sub>ξ</sub>*(*x*′)) ∈ R*<sup>d</sup>*. The hybrid loss minimizes the negative cosine similarity with the weighting coefficient *λ*:

$$\mathcal{L}\_{\theta}^{hybrid} = -\left[\lambda \cdot \frac{v\_1}{||v\_1||\_2} \cdot \frac{v\_2}{||v\_2||\_2} + (1 - \lambda) \cdot \mathcal{L}\_{\theta}^{Maskloss}\right],\tag{4}$$

where *v*1 and *v*2 are the whole-image latent representations; L<sub>*θ*</sub><sup>*Maskloss*</sup> is the distance loss computed from the foreground and background latent representations described in Equation (1); ‖·‖<sub>2</sub> is the ℓ2-norm; and *λ* is a weighting coefficient in the range [0, 1].
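A minimal sketch of Equation (4) as printed is given below, assuming the object-level term L<sup>Maskloss</sup> has already been computed from the foreground and background terms of Equation (1); the tensor and function names are ours.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(v1, v2, mask_loss, lam):
    # Global term: cosine similarity between the whole-image embeddings of the two views.
    global_sim = (F.normalize(v1, dim=-1) * F.normalize(v2, dim=-1)).sum(dim=-1).mean()
    # Equation (4): weighted combination of the global similarity term and the mask loss.
    return -(lam * global_sim + (1.0 - lam) * mask_loss)
```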

To study the impact of the weighting coefficient *λ*, we use a cosine scheduling rule similar to that of α in the mask loss section. In our experiment, the cosine schedule for *λ* sweeps over four *λbase* values {0.3, 0.6, 0.7, 0.9}. We report the Top-1 accuracy of the ImageNet linear evaluation protocol on the validation set during the self-supervised pretraining stage, as shown in Figure 5. We found that a *λbase* value of 0.7 obtains consistent learned representations when transferring to downstream tasks.

**Figure 5.** The impact of the weighting coefficient *λ* value on the obtained representation of the pre-trained encoder (ResNet-50) during the self-supervised pretraining stage, evaluated with the ImageNet linear evaluation protocol in Top-1 accuracy (in %).

#### 5.2.3. Mask Loss versus Hybrid Loss

We compare the representations obtained using the mask loss and the hybrid loss in self-supervised pretraining. To do so, we implement the HARL framework with both loss objectives. We use the cosine schedule to control the weighting coefficients α and *λ*, sweeping over three initial values {0.3, 0.5, 0.7} for both coefficients. We evaluate the obtained representation of the pre-trained ResNet-50 encoder in Top-1 and Top-5 accuracy (in %) with the ImageNet linear evaluation protocol, as shown in Table 4. According to the experimental results, the hybrid loss, which incorporates both global and object-level latent representations, yields better representation learning during self-supervised pretraining.


**Table 4.** Comparison of the representations obtained by the HARL framework using the mask loss and the hybrid loss objectives. We report Top-1 and Top-5 accuracy (in %) on the ImageNet linear evaluation from a ResNet-50 backbone pre-trained for 100 epochs on the ImageNet 1000 classes.

#### *5.3. The Impact of Heuristic Mask Quality*

In our work, the HARL objective uses two different image segmentation techniques. Which one leads to the best representation? We first consider the heuristic masks retrieved with the conventional computer vision DRFI [44] approach by varying its two hyperparameters (the Gaussian filter variance σ and the minimum cluster size c), as described in detail in Appendix C.1. In our implementation, we generate a diverse set of binary masks from different combinations of σ ∈ {0.2, 0.4, 0.8} and c ∈ {1000, 1500}. The sets of generated masks are shown in Figure 6. We found that the setting σ = 0.8 and c = 1000 generates more stable mask quality than the other combinations. For the deep learning technique, we use a pre-trained deep convolutional neural network as the feature extractor and design a saliency prediction head on top of its output representation, following the three steps described in Appendix C.2. The generated masks depend on the pixel saliency threshold, which determines the foregroundness and backgroundness of each pixel. In our implementation, we tested saliency threshold values in {0.4, 0.6, 0.7}, as shown in Figure 7, and chose a threshold value of 0.5 for generating masks for the ImageNet dataset. After choosing the best configuration for each of the two techniques, we generate masks for the whole ImageNet [24] training set. We evaluate the quality of the generated masks by computing the mean Intersection-over-Union (mIoU) between them and the human-annotated ImageNet ground-truth masks from Pixel-ImageNet [60]. On this subset of 0.485 million images (946/1000 ImageNet classes), the deep learning masks achieve an mIoU of 0.485 versus 0.398 for the DRFI masks. We found that in complex scenes, where multiple objects exist in a single image, the masks generated by the DRFI technique are noisier and less accurate than the deep learning masks, as illustrated in Figure 8.

To fully evaluate the impact of representation learning on downstream performance, we inspect the obtained representation quality through transfer learning performance on object detection and segmentation, shown in Table 5. The results indicate that for most object detection and segmentation tasks, HARL trained with the deep learning masks outperforms the variant trained with the DRFI masks, although the difference is very small. This shows that the quality of the mask used by HARL has only a small impact on downstream task performance.

**Figure 6.** The heuristic binary masks generated using DRFI with σ ∈ {0.2, 0.4, 0.8} and c ∈ {1000, 1500}.

**Figure 7.** The heuristic binary masks generated using the unsupervised deep learning encoder with different saliency threshold values.

**Figure 8.** Examples comparing the heuristic binary masks generated by DRFI and by deep learning.

**Table 5.** The impact of mask quality on HARL framework performance on the downstream object detection and instance segmentation tasks. We use Faster R-CNN with a C4 backbone for object detection and Mask R-CNN with an FPN backbone for instance segmentation. The object detection and instance segmentation backbones are initialized with the ResNet-50 backbone pre-trained for 100 epochs on the ImageNet dataset. The best result is bolded.


#### **6. Conclusions and Future Work**

We introduce HARL, a new self-supervised visual representation learning framework that leverages visual attention with a heuristic binary mask. As a result, HARL captures higher-quality semantic information that considerably improves the representation learned during self-supervised pretraining compared to previous state-of-the-art methods [9,13,17,18] on semi-supervised and transfer learning across various benchmarks. The two main advantages of the proposed method are: (i) the early attention mechanism can be applied across different natural image datasets because we use unsupervised techniques to generate the heuristic mask and do not rely on external supervision; (ii) the entire framework can be transferred and adapted quickly to either contrastive or non-contrastive self-supervised learning frameworks. Furthermore, our method can be applied to and accelerate the current self-supervised learning direction on pixel-level objectives. Our object-level abstraction will make this objective more efficient than existing work based on computing pixel distances [61].

In our HARL framework, the heuristic binary mask is critical. However, estimating accurate masks is currently feasible mainly for datasets with one primary object, such as the ImageNet dataset. An alternative for complex datasets containing multiple objects is to mine object proposals from each image by producing heuristic semantic segmentation masks. Designing a new self-supervised framework that addresses datasets containing multiple objects is an essential next step and an exciting research direction for our future work.

**Supplementary Materials:** The following documents support our experimental results reported in this study. Our code implementations in PyTorch (https://github.com/TranNhiem/Heuristic\_Attention\_Represenation\_Learning\_SSL\_Pytorch, accessed on 19 September 2021) and TensorFlow (https://github.com/TranNhiem/Heuristic\_Attention\_Representation\_Learning\_SSL\_Tensorflow, accessed on 26 November 2021). Our experimental results and reports included in different sections can be downloaded at: https://www.hh-ri.com/2022/05/30/heuristic-attention-representation-learning-for-self-supervised-pretraining/ (accessed on 30 May 2022).

**Author Contributions:** Conceptualization, Y.-H.L.; methodology, Y.-H.L. and V.N.T.; software, S.-H.L. and V.N.T.; validation, S.-H.L. and V.N.T.; writing—original draft preparation, Y.-H.L. and V.N.T.; writing—review and editing, Y.-H.L. and V.N.T.; supervision, Y.-H.L. and J.-C.W.; project administration, Y.-H.L. and J.-C.W.; funding acquisition, Y.-H.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** In this study, we construct two novel sets of heuristic binary mask datasets for the ImageNet ILSVRC training set, which can be found here: https://www.hh-ri.com/2022/05/30/heuristic-attention-representation-learning-for-self-supervised-pretraining/ (accessed on 30 May 2022).

**Acknowledgments:** The authors would like to thank the following people for their help throughout the process of building this study: Timothy Ko for helping create the heuristic binary mask on the DRFI technique for the ImageNet ILSVRC-2012 dataset and Kai-Lin Yang who provided the setting virtual machines for experiments and gave suggestions during the development of the project.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Implementation Detail**

*Appendix A.1. Implementation Data Augmentation*

The HARL data augmentation pipeline starts with standard inception-style random cropping [62]. The cropped views are then transformed using the same set of image augmentations as in SimCLR [9], consisting of an arbitrary sequential composition of transformations (color distortion, grayscale conversion, Gaussian blur, solarization).

Each RGB image and its corresponding heuristic binary mask are transformed through an augmentation pipeline composed of the operations described below. First, the image undergoes a random crop with resizing and random flipping. The binary mask receives only the cropping and flipping applied to its corresponding RGB image. Then, the cropped images are transformed with probabilistic color distortion (color jittering, color dropping), random Gaussian blur and solarization.
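A sketch of the paired geometric transform using torchvision's functional API is shown below; the photometric operations are applied afterwards to the image only. The function name and parameter choices are illustrative, not the released implementation.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import RandomResizedCrop

def paired_crop_flip(image, mask, size=224):
    # The mask receives exactly the same crop and horizontal flip as its RGB image.
    i, j, h, w = RandomResizedCrop.get_params(image, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3))
    image = TF.resized_crop(image, i, j, h, w, [size, size])
    mask = TF.resized_crop(mask, i, j, h, w, [size, size])
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    return image, mask
```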


The two augmented image views *x* and *x*′ and the mask pair *m* and *m*′ result from augmentations sampled from the distributions T, T′, M and M′, respectively. These distributions apply the primitives described above with different probabilities and magnitudes, as shown in Table A1. These parameters are inherited from the BYOL framework [13] without modification.


**Table A1.** Parameters used to generate image augmentations.

#### *Appendix A.2. Implementation Masking Feature*

The masking feature step of the HARL framework is essential to leverage the object-level information from the heuristic binary mask. The masking feature method is composed of three steps. The first step takes the 7 × 7 × 2048 spatial feature map output, which is the final layer before the global average pooling of the ResNet architecture. Second, in our training loop, the mask image is directly resized to 7 × 7 × 3 to match the size of the output spatial feature maps, without passing through the encoder. The resized mask then indexes the features: one encodes the foreground and zero encodes the background. Finally, we multiply the indexing mask with the spatial feature maps to separate the foreground and background features (the corresponding output is two 7 × 7 × 2048 spatial feature maps, one for the foreground and one for the background features). These two spatial feature maps then undergo global average pooling and dimensionality reduction with a non-linear multi-layer perceptron (MLP).
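The following sketch illustrates the masking-feature step; a single-channel binary mask is used here for simplicity, whereas the implementation above resizes a three-channel mask, and the shapes follow ResNet-50.

```python
import torch
import torch.nn.functional as F

def split_features(feature_map, mask):
    # feature_map: (B, 2048, 7, 7); mask: (B, 1, H, W) with 1 = foreground, 0 = background.
    m = F.interpolate(mask.float(), size=feature_map.shape[-2:], mode="nearest")
    fg = (feature_map * m).mean(dim=(2, 3))          # foreground features, global average pooled
    bg = (feature_map * (1.0 - m)).mean(dim=(2, 3))  # background features, global average pooled
    return fg, bg                                    # each (B, 2048), then passed to the MLP projector
```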

#### **Appendix B. Evaluation on the ImageNet and Transfer Learning**

#### *Appendix B.1. Linear Evaluation and Semi-Supervised Protocol on ImageNet*

Our data preprocessing procedure is as follows. At training time, the images undergo simple augmentation strategies, including random flips and crops with resizing to 224 × 224 pixels. At testing time, all images are resized to 256 pixels along the shorter side using bicubic resampling, followed by a 224 × 224 center crop. At both training and testing time, images are normalized per color channel using the mean and standard deviation computed on ImageNet ([9,13] provide a similar data processing pipeline).

**Linear evaluation**: We train a linear classifier on top of the frozen pre-trained encoder representation without updating the network parameters or the batch statistics. The design and configuration follow the standard ImageNet protocol as in [9,51,54,55]. To train and optimize the linear classifier, we use the SGD optimizer with Nesterov momentum to minimize the cross-entropy loss over 100 epochs with a batch size of 1024 and a momentum of 0.9, without regularization methods such as weight decay or gradient clipping [63]. We report the accuracy on the test set (the public validation set of the original ILSVRC-2012 ImageNet [24] dataset).

**Semi-supervised evaluation**: We fine-tune the parameters of the pre-trained encoder following the semi-supervised learning protocol and procedure of [9]. Data preprocessing and augmentation strategies at training and testing time for the 1% and 10% subsets follow the linear evaluation procedure described above, except that we use a larger batch size of 2048 and train over 60 epochs for 1% labeled data and 30 epochs for 10% labeled data. In Table 1 of Section 4.2.1, we report the results of fine-tuning the representation on the 1% and 10% ImageNet splits from [9] with the ResNet-50 (1×) architecture.

**Datasets**: Following previous work [9,13], we transfer the representation for linear classification and fine-tuning on six different natural image datasets: Food-101 [64], CIFAR-10 and CIFAR-100 [65], the SUN397 scene dataset [66], Stanford Cars [67] and the Describable Textures Dataset (DTD) [68]. The details of each dataset are described in Table A2. We use the training and validation splits specified by the dataset creators to select hyperparameters. For datasets without a test or validation split, we use the validation examples as a test set or hold out a subset of the training examples as the validation set, as described in Table A2.

**Table A2.** The different image datasets used in transfer learning. When an official test split with labels is not publicly available, we use the official validation split as a test set and create a held-out validation set from the training examples.


**Standard evaluation metrics**: To evaluate HARL transfer learning on different datasets and vision tasks, we use each dataset's standard evaluation metrics to assess and benchmark our results: Top-1 accuracy, AP, AP50 and AP75.


#### *Appendix B.2. Transfer via Linear Classification and Fine-Tuning*

**Transfer linear classification:** We freeze the pre-trained encoder, initializing the network with its parameters and updating neither the network parameters nor the batch statistics. The standard linear evaluation protocol follows [9,51,55]. At training and testing time, the images are resized to 224 × 224 along the shorter side using bicubic resampling and then normalized with ImageNet statistics, without data augmentation; in both phases, the color channels are normalized by subtracting the average color and dividing by the standard deviation. We train a regularized multinomial logistic regression classifier on top of the frozen representation, optimizing the cross-entropy loss with ℓ2-regularization, with the regularization parameter selected from a range of 45 logarithmically spaced values between 10<sup>−6</sup> and 10<sup>5</sup> (similar to the optimization procedure of [13]). The model is retrained on the combined training and validation sets, and the accuracy is reported on the test set.
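For concreteness, the regularization grid referred to above can be generated as follows; this is only a sketch of the grid, with the best value selected on the validation set.

```python
import numpy as np

# 45 logarithmically spaced L2-regularization strengths between 1e-6 and 1e5.
reg_grid = np.logspace(-6, 5, num=45)
```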

**Transfer fine-tuning:** We follow the fine-tuning protocols of [9,51,69], initializing the network with the parameters of the pre-trained representation. At both training and testing time, we follow the image preprocessing and data augmentation strategies of the linear evaluation procedure in Appendix B.1. To fine-tune the network, we optimize the cross-entropy loss using the SGD optimizer with a Nesterov momentum value of 0.9, training over 20,000 steps with a batch size of 256. We select the hyperparameters, including the momentum parameter for batch statistics, the learning rate and the weight decay, with the same method as in [9,13]. After selecting the optimal hyperparameter configuration on the validation set, the model is retrained on the combined training and validation sets using the selected parameters. The absolute accuracy is reported on the test set.

#### *Appendix B.3. Transfer Learning to Other Vision Tasks*

**Object detection and instance segmentation**: We follow previous work [13,17] for the standard transfer setup on Pascal object detection. We use a Faster R-CNN [57] with the R50-C4 backbone. We fine-tune on the training and validation set (16K images) and report results on the test set of the PASCAL VOC07+12 [56] dataset. The backbone is initialized with our pre-trained ResNet-50. We use the SGD optimizer to optimize the network parameters for 24K iterations with a batch size of 16. We use an initial learning rate of 0.08, reduced to 10<sup>−2</sup> at 18K and 10<sup>−3</sup> at 22K iterations, with a linear warmup of slope 0.3333 for 1000 iterations and a region proposal loss weight of 0.2. We then report the final AP, AP50 and AP75 metrics on the test set. For the instance segmentation task on the COCO [58] dataset, we use Mask R-CNN with an FPN backbone, trained over 90K iterations with a batch size of 16. We initialize the learning rate at 0.03 and reduce it by a factor of 10 at the 60K and 80K iterations, with 50 warmup iterations.

#### **Appendix C. Heuristic Mask Proposal Methods**

In our HARL framework, to generate the heuristic binary mask we investigated various supervised and unsupervised techniques, from conventional machine learning to deep-learning-based approaches. The state-of-the-art computer vision approaches are benchmarked qualitatively and quantitatively in [70]. A comprehensive literature survey and benchmark [71] covers multiple supervised deep-learning-based methods for salient object detection across levels of supervision, network architectures and learning paradigms. Several unsupervised deep learning methods [72,73] use predictions obtained with hand-crafted priors as pseudo labels to train a deep neural network.

#### *Appendix C.1. Heuristic Binary Mask Generation Using DRFI*

Our first approach uses a conventional machine learning method, the DRFI [44] technique, to generate binary masks. This method detects a salient object inside an image in three main steps: multi-level segmentation, which segments an image into regions; regional saliency computation, which maps the features extracted from each region to a saliency score predicted by a random forest based on three elements (regional contrast, regional property and regional backgroundness); and, lastly, multi-level saliency fusion, which combines the saliency maps over all segmentation layers to obtain the final saliency map. To obtain a binary mask, we generate the saliency map of an image and then define a threshold of 40% (top 40% of saliency scores) to determine which regions are considered salient objects. Any region that is not a salient object is regarded as background. We generate a diverse set of binary masks by varying the two hyperparameters σ and the minimum cluster size c, using σ ∈ {0.2, 0.4, 0.8} and c ∈ {1000, 1500}; in our implementation, we chose σ = 0.8 and c = 1000 for generating masks for the ImageNet dataset. The masks produced by sweeping σ ∈ {0.2, 0.4, 0.8} and component sizes c ∈ {1000, 1500} are shown in Figure 6.
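The binarization step described above can be sketched as a simple quantile cut that keeps the top 40% of saliency scores as the foreground; the function name is ours.

```python
import numpy as np

def binarize_saliency(saliency_map, keep_fraction=0.40):
    # Pixels whose saliency score lies in the top 40% become the foreground object.
    threshold = np.quantile(saliency_map, 1.0 - keep_fraction)
    return (saliency_map >= threshold).astype(np.uint8)
```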

#### *Appendix C.2. Heuristic Binary Mask Generation Using Unsupervised Deep Learning*

The second approach among our mask generation techniques is based on a self-supervised pre-trained feature extractor from previous works [9,17,39,42]. We design a new saliency prediction head on top of the pre-trained encoder representation to generate the binary masks in three main steps. First, we take the output feature maps from a pre-trained ResNet-50 encoder [9,42]. Second, we pass the output feature map into a 1 × 1 convolutional classification layer for saliency prediction; the classification layer predicts the saliency, or "foregroundness", of each pixel. Finally, we take the classification layer's output values and set a threshold to decide which pixels belong to the foreground: pixels with a saliency value greater than the threshold are determined to be foreground. In our implementation, we define a threshold value of 0.5 for generating masks for the ImageNet dataset. We further experiment with several threshold values in {0.4, 0.6, 0.7}; examples of masks generated with all these configurations are shown in Figure 7.
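A minimal sketch of this saliency head is given below; the layer names are illustrative, and the frozen pre-trained encoder is assumed to supply the input feature map.

```python
import torch
import torch.nn as nn

class SaliencyHead(nn.Module):
    def __init__(self, in_channels=2048, threshold=0.5):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, 1, kernel_size=1)  # 1x1 saliency classifier
        self.threshold = threshold

    def forward(self, feature_map):                  # feature_map: (B, 2048, H, W)
        saliency = torch.sigmoid(self.classifier(feature_map))
        return (saliency > self.threshold).float()   # binary foreground mask
```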
