*Article* **Mutual Guidance Meets Supervised Contrastive Learning: Vehicle Detection in Remote Sensing Images**

**Hoàng-Ân Lê <sup>1,</sup>\*, Heng Zhang <sup>2</sup>, Minh-Tan Pham <sup>1</sup> and Sébastien Lefèvre <sup>1</sup>**

<sup>2</sup> Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université Rennes 1, F-35000 Rennes, France; heng.zhang@irisa.fr

**\*** Correspondence: hoang-an.le@irisa.fr

**Abstract:** Vehicle detection is an important but challenging problem in Earth observation due to the intricately small sizes and varied appearances of the objects of interest. In this paper, we turn these issues to our advantage by considering them the results of latent image augmentation. In particular, we propose using a supervised contrastive loss in combination with a mutual guidance matching process to help learn stronger object representations and to tackle the misalignment of localization and classification in object detection. Extensive experiments are performed to understand the combination of the two strategies and show their benefits for vehicle detection on aerial and satellite images, achieving performance on par with state-of-the-art methods designed for small and very small object detection. As the proposed method is domain-agnostic, it might also be used for visual representation learning in generic computer vision problems.

**Keywords:** contrastive learning; mutual guidance; spatial misalignment; vehicle detection

#### **1. Introduction**

Object detection consists of two tasks: localization and classification. As they are different in nature [1] yet both contribute toward the overall detection performance, deep architectures usually have two distinct prediction heads, which share the same features extracted from an input. The separate branches, despite the shared parameters, have proven inefficient, as classification scores might not reflect the localization quality well [2,3], while the intersection-over-union (IOU) scores of anchor boxes might miss the semantic information [4].

The misalignment of localization and classification may be aggravated depending on the domain of application. Vehicle detection is a challenging but important problem in Earth observation. It is instrumental for traffic surveillance and management [5], road safety [6], traffic modeling [7], and urban planning [8] due to large coverage from aerial viewpoints [9]. The intrinsic challenges include, but are not limited to, the small and diverse sizes of vehicles, inter-class similarity, illumination variation, and background complexity [10,11].

A simple method to combine the localization and classification scores to mutually guide the training process, recently introduced by Zhang et al. [4], has shown effectiveness in alleviating the task misalignment problem on the generic computer vision datasets MS-COCO [12] and PASCAL-VOC [13]. Its ability to cope with the intricacies of remote sensing vehicle detection, however, remains unexplored.

In this paper, we propose a framework inspired by the mutual guidance idea [4] for vehicle detection from remote sensing images (Figure 1). The idea is that the intersection-over-union (IOU) of an anchor box should contribute toward the predicted category and, vice versa, the learned semantic information could help in providing more fitting bounding boxes.

**Citation:** Lê, H.-Â.; Zhang, H.; Pham, M.-T; Lefèvre, S. Mutual Guidance Meets Supervised Contrastive Learning: Vehicle Detection in Remote Sensing Images. *Remote Sens.* **2022**, *14*, 3689. https://doi.org/ 10.3390/rs14153689

Academic Editors: Jukka Heikkonen, Fahimeh Farahnakian and Pouya Jafarzadeh

Received: 31 May 2022 Accepted: 27 July 2022 Published: 1 August 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

**Figure 1.** Vehicle detection from the VEDAI's aerial images performed by the proposed contrastive mutual guidance loss. Class labels include car (1), truck (2), pickup (3), tractor (4), camping (5), boat (6), van (7), other (8).

To improve the semantic understanding and overcome the varied object sizes and appearances, we also propose a loss module based on the contrastive learning notion [14,15]: for each detected object, the other objects of the same class are pulled closer in the embedding space, while those of different classes are pushed away. The underlying intuition is that the features of the same-class objects should be close together in the latent space, and by explicitly imposing this, the network is forced to learn representations that better underline intra-class characteristics.

Contrastive learning is a discriminative approach to visual representation learning, which has proven effective for pre-training networks before transferring to an actual downstream task [16–20]. The well-known SimCLR framework [16] proposes applying image augmentation to create an image's positive counterpart, eliminating the need for manual annotations for pretext tasks, hence the self-supervision. Our hypothesis is that different objects of the same class seen from aerial points of view could be considered the results of compositions of multiple augmentation operations, such as cropping, scaling, re-coloring, adding noise, etc., which, as shown by SimCLR, should be beneficial for representation learning (Figure 2). Thus, by pulling together same-class objects and pushing away the others, the network could learn to overcome the environmental diversity and better recognize the objects of interest.

As we rely on ground truth labels to form positive and negative contrastive pairs, the proposed contrastive loss could be seen as being inspired by supervised contrastive learning [17], but applied here to object detection. The differences are that the contrastive pairs are drawn from object-instance level, not image level, and that contrastive loss is employed as an auxiliary loss in combination with the mutually guided detection loss.

**Figure 2.** Different objects of the same class, "car", from an aerial point of view could be considered as having passed through various compositions of image augmentation, such as cropping, rotation, re-coloring, noise addition, etc.

The contributions of the paper are fourfold.


#### **2. Related Work**

#### *2.1. Vehicle Detection in Remote Sensing*

Deep-learning-based vehicle detection from aerial and satellite images has been an active research topic in remote sensing for Earth observation within the last decade due to its intrinsically challenging nature: intricately small vehicle sizes, various types and orientations, heterogeneous backgrounds, etc. General approaches include adapting state-of-the-art detectors from the computer vision community to the Earth observation context [11,23,24]. Similar to the general object detection task [25], most of the proposed methods can be divided into one-stage and two-stage approaches and are generally based on anchor box prediction. Well-known anchor-based detector families such as Faster-RCNN, SSD, and YOLO have been widely exploited in remote sensing object detection, including for vehicles. In [26,27], the authors proposed to modify and improve the Faster-RCNN detector for vehicle detection from aerial remote sensing images. Multi-scaled feature fusion and data augmentation techniques such as oversampling or homography transformation have been shown to help two-stage detectors provide better object proposals.

In [28,29], YOLOv3 and YOLOv4 were modified and adapted to tackle small vehicle detection from both Unmanned Aerial Vehicle (UAV) and satellite images with the objective of real-time operation. In the proposed YOLO-fine [28] and YOLO-RTUAV [29] models, the authors attempted to remove unnecessary network layers from the backbones of YOLOv3 and YOLOv4-tiny, respectively, while adding others to focus on small object search. In [23], the Tiramisu segmentation model and the YOLOv3 detector were evaluated and compared for their capacity to detect very small vehicles from 50-cm Pleiades satellite images. The authors finally proposed a late fusion technique to combine the benefits of both models. In [30], the authors focused on the detection of dense construction vehicles from UAV images using an orientation-aware feature fusion based on the one-stage SSD models.

As the use of anchor boxes introduces many hyper-parameters and design choices, such as the number of boxes, sizes, and aspect ratios [9], some recent works have also investigated anchor-free detection frameworks with feature enhancement or multi-scaled dense path feature aggregation to better characterize vehicle features in latent spaces [9,31,32]. We refer interested readers to these studies for more details about anchor-free methods. As anchor-free networks usually require various extra constraints on the loss functions, well-established anchor-based approaches remain popular in the computer vision community for their stability. Therefore, within the scope of this paper, we base our work on anchor-based approaches.

#### *2.2. Misalignment in Object Detection*

Object detection involves two tasks: classification and localization. Precise detection results clearly require high-quality joint predictions from both tasks. Most object detection models regard these two tasks as independent and ignore their potential interactions, leading to a misalignment between classification and localization. Indeed, detection results with correct classification but imprecise localization, or with precise localization but wrong classification, both reduce the overall precision and should be prevented.

The authors of IoU-Net [2] were the first to study this task-wise misalignment problem. Their solution is to use an additional prediction head to estimate the localization confidence (i.e., the intersection-over-union (IoU) between the regressed box and the true box), and then aggregate this localization confidence into the final classification score. In this way, the classification prediction contains information from the localization prediction, and the misalignment is greatly alleviated.

Along this direction, the authors of Double-Head RCNN [1] propose to apply different network architectures for classification and localization networks. Specifically, they find the fully connected layers more suitable for the classification task, and the convolutional layers more suitable for the localization task.

TSD [3] further proposes to use disentangled proposals for classification and localization predictions. To achieve the best performance of both tasks, two dedicated region of interest (RoI) proposals are estimated for classification and localization tasks, respectively, and the final detection result comes from the combination of both proposals.

The recently proposed MutualGuidance [4] addresses the misalignment problem from the perspective of label assignment. It introduces an adaptive matching strategy between anchor boxes and true objects, where the labels for one task are assigned according to the prediction quality on the other task, and vice versa. Compared to the aforementioned methods, the main advantage of MutualGuidance is that its improvement only involves the loss computation, while the architecture of the detection network remains unchanged, so it can be generalized to different detection models and application cases. These features motivate us to rely on this method in our study, and to explore its potential in Earth observation.

#### *2.3. Contrastive Learning*

Contrastive learning has been predominantly employed to transfer representations learned from a pretext task, usually without provided labels, to a different actual task by fine-tuning with the accompanying annotations [14–16,18–20,33]. The pretext tasks, involving mostly feature vectors in an embedding space, are usually trained with metric distance learning such as the N-pair loss [34] or the triplet loss [35].

Depending on the downstream tasks, the corresponding pretexts are chosen accordingly. Chen et al. [16] propose a simple framework, called SimCLR, exploiting image augmentation to pretrain a network using the temperature-scaled N-pair loss and demonstrate an improvement in classifying images. Pairing an image with its augmented version and contrasting this pair against the image's pairings with the other images in a mini-batch helps in learning decent visual representations. The representations can be further improved when they participate in the contrastive loss through a non-linearly transformed proxy. This notion is employed in our paper as the projection head.

Contrastive learning trained on image-level tasks, i.e., a single feature vector per image, however, has been shown to be sub-optimal for downstream tasks requiring instance-level or dense pixel-level predictions, such as detection [18] or segmentation [20], respectively. The reason is attributed to the absence of dedicated properties such as spatial sensitivity and translation and scale invariance. Consequently, different pretext schemes have been proposed to effectively pretrain a network conforming to particular downstream tasks, including but not limited to DenseCL [36], SoCo [18], DetCo [19], and PixPro [20]. The common

feature of these methods is the use of explicit image augmentation to generate positive pairs, following SimCLR's proposal, for pretraining networks. In our method, we adopt the augmentation principle yet consider the aerial views of different same-class objects as each other's augmented versions; hence, no extra views are generated during training. Moreover, the contrastive loss is used not as a pretext but as an auxiliary loss to improve the semantic information in the mutual guidance process.

In contrast to most works that apply contrastive learning in a self-supervised context, Khosla et al. [17] leverage label information and formulate the batch contrastive approach in the supervised setting by pulling together features of the same class and pushing apart those from different classes in the embedding space. They also unify the contrastive loss function to be used for either self-supervised or supervised learning while consistently outperforming cross-entropy on image classification. The contrastive loss employed in our paper could be considered as being inspired by the same work but repurposed for a detection problem.

#### **3. Method**

In this paper, we follow the generic one-stage architecture for anchor-based object detection comprising a backbone network for feature extraction and 2 output heads for localization and classification. The overview of our framework is shown in Figure 3. For illustration purposes, a 2-image batch size, single spatial resolution features, and 6 anchor boxes are shown, yet the idea is seamlessly applicable to larger batch sizes with different numbers of anchor boxes, and multi-scaled feature extraction such as FPN [37].

**Figure 3.** An overview of our framework: the backbone network encodes a batching input before passing the extracted features to the localization and classification heads, which predict 4-tuple bounding box values and *nc*-class confidence scores for each anchor box. The mutual guidance module re-ranks the anchor boxes based on semantic information from the classification branch and improves the confidence score with localization information. The ground truth categories of the anchor boxes are used to supervise the contrastive loss. The pipeline is illustrated with a batch size of 2 and the number of anchor boxes *na* = 6.

The 2 output heads have the same network architecture: two parallel branches with two 3 × 3 convolution layers, followed by one 1 × 1 convolution layer for localization and classification predictions. The classification branch classifies each anchor box into foreground (positive) or background (negative), while the localization branch refines anchor boxes via bounding-box regression to better fit the target boxes. Instead of optimizing the 2 head networks independently, mutual guidance [4] introduces a task-based bidirectional supervision strategy to align the model predictions of the localization and classification tasks.

#### *3.1. Generation of Detection Targets*

A general supervised object detection provides, for each input image, a list of ground truth bounding boxes $B \in \mathbb{R}^{n_B \times 4}$ accompanied by a list of labels $L \in \mathbb{R}^{n_B}$, where $n_B$ is the number of ground truth boxes annotated for the image. Each box is represented by a 4-tuple $(l, t, w, h)$ (in MS-COCO [12] format) or $(x_c, y_c, w, h)$ (in YOLO [38] format), where $(l, t)$ and $(x_c, y_c)$ are the $(x, y)$ coordinates of a box's top-left corner and center, respectively, and $w, h$ are the box's width and height. The ground truth boxes are arbitrary and unordered and are thus usually adapted into targets of a different form that is more compatible with optimization in a deep network. This process is called *matching*.
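As an illustration, the two box encodings can be converted into one another with simple arithmetic (a minimal sketch; the function names are ours):

```python
def coco_to_yolo(box):
    """(l, t, w, h), i.e., top-left corner plus size (MS-COCO style),
    to (xc, yc, w, h), i.e., center plus size (YOLO style)."""
    l, t, w, h = box
    return (l + w / 2.0, t + h / 2.0, w, h)

def yolo_to_coco(box):
    """Inverse conversion: center-based box back to corner-based."""
    xc, yc, w, h = box
    return (xc - w / 2.0, yc - h / 2.0, w, h)
```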

The idea is to define a list of fixed-size boxes called *anchors*, $A \in \mathbb{R}^{n_A \times 4}$, for each vector in a CNN output feature map, where $n_A$ is the total number of predefined anchors per image. For a 512 × 512 input image with $n_a = 6$ predefined anchor sizes per vector, a 3-level FPN-based feature extraction network with output scales of (8, 16, 32) can produce up to

$$\left(\frac{512}{8} \times \frac{512}{8} + \frac{512}{16} \times \frac{512}{16} + \frac{512}{32} \times \frac{512}{32}\right) \times 6 = 32{,}256\tag{1}$$

anchors. As the anchors are defined at every vector in an output feature map, they are directly compatible with loss calculation and thus are used as targets for optimization.
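The count in Equation (1) can be verified with a few lines (a sketch using the strides and $n_a$ from the example above):

```python
def num_anchors(image_size=512, strides=(8, 16, 32), n_a=6):
    """Total number of anchors: one set of n_a anchors per feature-map
    location, summed over the FPN output scales."""
    return sum((image_size // s) ** 2 for s in strides) * n_a

print(num_anchors())  # 32256 = (64*64 + 32*32 + 16*16) * 6
```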

**Conventional matching.** Depending on how similar each anchor is to the real ground truth boxes, it is marked as a positive (i.e., object) or negative target (i.e., background). The most common similarity metric is the Jaccard index [39], which measures the ratio of the overlapping area of 2 boxes (an anchor and a ground truth box) over their area of union, as shown in Equation (2).

$$\mathcal{J}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}. \tag{2}$$

Specifically, the matrix $M$ containing the Jaccard indices between all pairs of ground truth and anchor boxes is computed. We define the Jaccard index over the Cartesian product of two sets of boxes as the set of Jaccard indices of all pairs of boxes from the two sets as follows:

$$\mathcal{J}(\mathcal{X}\times\mathcal{Y}) = \{ \mathcal{J}(X, Y) \mid X \in \mathcal{X} \text{ and } Y \in \mathcal{Y} \}.\tag{3}$$

Thus, $M = \mathcal{J}(B \times A)$. An anchor is matched to a ground truth box if (1) this anchor is the closest one that the ground truth box can have (among all anchors) or (2) this ground truth box is the closest one that the anchor can have (among all other ground truths). A threshold can be applied to further filter out matched anchors with low intersection-over-union scores. Subsequently, each anchor is associated with, at most, 1 ground truth box, i.e., a *positive target*, or none, i.e., background or a *negative target*. Some of the positive targets can be marked as *ignored* and do not contribute to the optimization process. The concrete algorithm is shown in Algorithm 1.
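As an aside, the matrix $M$ can be computed in a vectorized fashion. Below is a minimal NumPy sketch (the function name is ours), assuming boxes in the $(l, t, w, h)$ format:

```python
import numpy as np

def jaccard_matrix(B, A):
    """Pairwise Jaccard indices (Equation (3)) between ground truth boxes
    B (n_B x 4) and anchors A (n_A x 4), each row being (l, t, w, h);
    returns M of shape (n_B, n_A)."""
    B, A = np.asarray(B, float), np.asarray(A, float)
    # Intersection rectangle: max of top-left corners, min of bottom-right.
    lt = np.maximum(B[:, None, :2], A[None, :, :2])
    rb = np.minimum(B[:, None, :2] + B[:, None, 2:], A[None, :, :2] + A[None, :, 2:])
    wh = np.clip(rb - lt, 0.0, None)            # zero if boxes are disjoint
    inter = wh[..., 0] * wh[..., 1]
    area_b = (B[:, 2] * B[:, 3])[:, None]
    area_a = (A[:, 2] * A[:, 3])[None, :]
    return inter / (area_b + area_a - inter)
```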

**Mutual matching.** Mutual guidance [4] formulates the process of label assignment in a mutual supervision manner. In particular, it constrains anchors that are well localized to be well classified (localize to classify), and those well classified to be well localized (classify to localize).

**Localize to classify.** The target anchor box corresponding to a feature vector that well localizes an object must cover semantically important parts of the underlying object; therefore, it should be prioritized as a target for classification. A step-by-step procedure is shown in Algorithm 2. To this end, the Jaccard matrix between all ground truth and predicted boxes is computed, i.e., $\hat{M} = \mathcal{J}(B \times \hat{B})$ (see Algorithm 2, Line 1). The top-$K$ anchors per ground truth box are shortlisted as positive classification targets, while the rest are considered negative targets. Concretely, we keep the Jaccard score of the best ground truth box (if any) for each anchor and zero out the other ground truth boxes, i.e., a column in the Jaccard matrix now has at most a single non-zero entry (Lines 3–5). Then, for each ground truth box, all anchors besides the $K$ with the highest scores are removed (Lines 6–7). The remaining ground truth box per anchor is associated with it. We also use their Jaccard scores as soft-label targets for the loss function by replacing the 1s in one-hot vectors with the corresponding scores. The loss is given in Section 3.2.


**Algorithm 1** Generating detection targets by conventional matching

```
Input:  list of ground truth boxes B ∈ R^(nB×4) and corresponding labels L ∈ R^nB,
        list of anchors A ∈ R^(nA×4),
        negative and positive thresholds θn, θp, where θn ≤ θp
Output: list of target boxes B̃ ∈ R^(nA×4),
        and corresponding target labels L̃ ∈ R^nA for each anchor
 1: M ← J(B × A)                      # M ∈ R^(nB×nA)
 2: L̃ ← (0, 0, ..., 0)
 3: B̃ ← A                            # the target boxes are the anchor boxes
 4: for each column index c of M do
 5:     iou ← max(M[∗,c])             # processing condition 2
 6:     i ← argmax(M[∗,c])
 7:     if iou ≥ θp then
 8:         L̃[c] ← L[i]
 9:         B̃[c,∗] ← B[i,∗]
10:     else if iou < θn then
11:         L̃[c] ← −1
12: for each row index r of M do
13:     iou ← max(M[r,∗])             # overwritten with condition 1
14:     i ← argmax(M[r,∗])
15:     if iou ≥ θp then
16:         L̃[i] ← L[r]
17:         B̃[i,∗] ← B[r,∗]
18:     else if iou < θn then
19:         L̃[i] ← −1
```
**Algorithm 2** Generating classification targets from predicted localization

```
Input:  list of ground truth boxes B ∈ R^(nB×4) and corresponding labels L ∈ R^nB,
        list of anchors A ∈ R^(nA×4),
        list of predicted boxes B̂ ∈ R^(nA×4)
Output: list of target labels for all anchors L̃ ∈ R^nA
 1: M̂ ← J(B × B̂)                     # M̂ ∈ R^(nB×nA)
 2: L̃ ← (0, 0, ..., 0)
 3: for each column index c of M̂ do
 4:     i ← argmax(M̂[∗,c])
 5:     M̂[k,c] ← 0, ∀ k ≠ i
 6: for each row index r of M̂ do
 7:     M̂[r,k] ← 0, ∀ k ∉ topk(M̂[r,∗])
 8: for each column index c of M̂ do
 9:     i ← argmax(M̂[∗,c])
10:     L̃[c] ← L[i]
```

**Classify to localize.** Likewise, a feature vector at the output layer that induces a correct classification indicates the notable location and shape of the corresponding target anchor box. As such, the anchor should be prioritized for bounding box regression. To this end, the Jaccard similarity between a ground truth and anchor box is scaled by the confidence score of the anchor's corresponding feature vector for the given ground truth box. Concretely, a curated list $\tilde{C} \in \mathbb{R}^{n_B \times n_A}$ of confidence scores for the class of each given ground truth box is obtained from the all-class input scores $\hat{C} \in \mathbb{R}^{n_A \times n_C}$, as shown in Algorithm 3, Lines 2–4, where $n_C$ is the number of classes in the classification task. The Jaccard similarity matrix $M$ between ground truth and anchor boxes (as in conventional detection matching) is scaled by the corresponding confidence scores and clamped to the range [0, 1] (Line 5, where $\odot$ indicates the Hadamard product). The rest of the algorithm proceeds as in the previous algorithm, with the updated similarity matrix $\tilde{M}$ in lieu of the predicted similarity matrix $\hat{M}$.

**Algorithm 3** Generating localization targets from predicted class labels

```
Input:  list of ground truth boxes B ∈ R^(nB×4) and corresponding labels L ∈ R^nB,
        list of anchors A ∈ R^(nA×4),
        list of confidence scores for all classes Ĉ ∈ R^(nA×nC)
Output: list of target box specifications for all anchors B̃ ∈ R^(nA×4)
 1: M ← J(B × A)                      # M ∈ R^(nB×nA)
 2: for each row index r of M do
 3:     l ← L[r]
 4:     C̃[r,∗] ← exp(Ĉ[∗,l] / σ)     # C̃ ∈ R^(nB×nA)
 5: M̃ ← max(0, min(1, M ⊙ C̃))
 6: L̃ ← (0, 0, ..., 0)
 7: for each column index c of M̃ do
 8:     i ← argmax(M̃[∗,c])
 9:     M̃[k,c] ← 0, ∀ k ≠ i
10: for each row index r of M̃ do
11:     M̃[r,k] ← 0, ∀ k ∉ topk(M̃[r,∗])
12: for each column index c of M̃ do
13:     i ← argmax(M̃[∗,c])
14:     B̃[c,∗] ← B[i,∗]
```

*3.2. Losses*

**Classification loss.** For classification, we adopt the Generalized Focal Loss [40] with soft targets given by the Jaccard scores between predicted localization and ground truth boxes. The loss is given by Equation (4):

$$\mathcal{L}\_{\text{class}}(\hat{y}, \tilde{y}) = -|\tilde{y} - \hat{y}|^2 \sum\_{i}^{n\_C} \tilde{y}\_i \log \hat{y}\_i, \tag{4}$$

where $\tilde{y} \in \mathbb{R}^{n_C}$ is the one-hot target label given by $\tilde{C}$, softened by the predicted Jaccard scores, and $\hat{y} \in \mathbb{R}^{n_C}$ is the anchor's confidence score.
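To make Equation (4) concrete, the following sketch (names ours) evaluates the loss for a single anchor, under the assumption that the modulating factor $|\tilde{y} - \hat{y}|^2$ is applied element-wise per class:

```python
import math

def gfocal_class_loss(y_hat, y_soft, eps=1e-7):
    """Sketch of the soft-target classification loss of Equation (4):
    cross-entropy against the Jaccard-softened one-hot target y_soft,
    modulated per class by the squared prediction error."""
    return -sum(abs(t - p) ** 2 * t * math.log(max(p, eps))
                for t, p in zip(y_soft, y_hat))
```

Note that the loss vanishes when the confidence scores exactly match the softened target, which is the intended focal behavior.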

**Localization loss.** We employ the balanced L1 loss [41], derived from the conventional smooth L1 loss, for the localization task to promote the crucial regression gradients from accurate samples (inliers) by separating inliers from outliers, and we clip the large gradients produced by outliers with a maximum value of *β*. This is expected to rebalance the involved samples and tasks, thus achieving a more balanced training within classification, overall localization, and accurate localization. We first define the balanced loss *Lb*(*x*) as follows:

$$L\_b(x) = \begin{cases} \frac{\alpha}{b}\left(b|x| + 1\right)\ln\left(\frac{b|x|}{\beta} + 1\right) - \alpha|x|, & \text{if } |x| < \beta\\ \gamma|x| + \frac{\gamma}{b} - \alpha\beta, & \text{otherwise,} \end{cases} \tag{5}$$

where $\alpha = 0.5$, $\beta = 0.11$, $\gamma = 1.5$, and $b$ is a constant such that

$$
\alpha\ln(b+1) = \gamma.\tag{6}
$$

The localization loss using the balanced L1 loss is then defined as $\mathcal{L}_{\text{loc}} = L_b(\mathit{pred} - \mathit{target})$.
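A minimal sketch of the balanced L1 loss with the stated constants (helper names ours); from Equation (6), $b = e^{\gamma/\alpha} - 1$:

```python
import math

# Constants as stated in the text: alpha = 0.5, beta = 0.11, gamma = 1.5;
# b solves alpha * ln(b + 1) = gamma (Equation (6)).
ALPHA, BETA, GAMMA = 0.5, 0.11, 1.5
B = math.exp(GAMMA / ALPHA) - 1.0

def balanced_l1(x):
    """Element-wise balanced L1 loss (Equation (5)) on a residual x."""
    ax = abs(x)
    if ax < BETA:
        return (ALPHA / B) * (B * ax + 1.0) * math.log(B * ax / BETA + 1.0) - ALPHA * ax
    return GAMMA * ax + GAMMA / B - ALPHA * BETA
```

One can verify that the two branches of Equation (5) agree at $|x| = \beta$, so the loss is continuous there.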

**Contrastive Loss.** The mutual guidance process assigns to each anchor box a confidence score $s_i \in [0, 1]$ from the prediction of the feature vector associated with it, and a category label $c_i > 0$ if the anchor box is deemed an object target or $c_i = 0$ if it is a background target. Let $\mathcal{B}^{\varphi}_k = \{i \neq k : c_i = \varphi\}$ be the index set of all anchor boxes other than $k$ whose labels satisfy the condition $\varphi$, and let $\mathbf{z}$ be a feature vector at the penultimate layer of the classification branch (Figure 3). Following SupCon [17], we experiment with two versions of the loss function: $\mathcal{L}_{\text{out}}$, with the summation outside of the logarithm, and $\mathcal{L}_{\text{in}}$, with it inside, given as follows:

$$\mathcal{L}\_{\text{in}} = \frac{-1}{|\mathcal{B}|} \sum\_{i \in \mathcal{B}} \log \left( \frac{1}{|\mathcal{B}\_i^{c\_i}|} \frac{\sum\_{j \in \mathcal{B}\_i^{c\_i}} \delta \left(\mathbf{z}\_i, \mathbf{z}\_j\right)}{\sum\_{k \in \mathcal{B}\_i} \delta \left(\mathbf{z}\_i, \mathbf{z}\_k\right)} \right), \tag{7}$$

$$\mathcal{L}\_{\text{out}} = \frac{-1}{|\mathcal{B}|} \sum\_{i \in \mathcal{B}} \frac{1}{|\mathcal{B}\_i^{c\_i}|} \sum\_{j \in \mathcal{B}\_i^{c\_i}} \log \frac{\delta \left(\mathbf{z}\_i, \mathbf{z}\_j\right)}{\sum\_{k \in \mathcal{B}\_i} \delta \left(\mathbf{z}\_i, \mathbf{z}\_k\right)},\tag{8}$$

where $\delta(\mathbf{v}_1, \mathbf{v}_2) = \exp\left(\frac{1}{\tau}\frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\|\,\|\mathbf{v}_2\|}\right)$ is the temperature-scaled cosine similarity function. In this paper, we choose $\tau = 1$.
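A compact sketch of the $\mathcal{L}_{\text{out}}$ variant over a small batch of embeddings (the function name is ours; anchors without any positive are simply skipped, an implementation choice on our part):

```python
import numpy as np

def supcon_out(Z, labels, tau=1.0):
    """Sketch of Equation (8): for each anchor i, average the log-ratios
    of the temperature-scaled cosine similarity delta over its same-class
    positives, normalized over all other anchors."""
    Z = np.asarray(Z, dtype=float)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = np.exp(Zn @ Zn.T / tau)          # delta(z_i, z_k)
    n = len(labels)
    loss = 0.0
    for i in range(n):
        others = [k for k in range(n) if k != i]
        pos = [j for j in others if labels[j] == labels[i]]
        if not pos:                        # anchor with no positives
            continue
        denom = sum(sim[i, k] for k in others)
        loss -= sum(np.log(sim[i, j] / denom) for j in pos) / len(pos)
    return loss / n
```

As expected, the loss is lower when the labels agree with the cluster structure of the embeddings than when they are shuffled across clusters.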

#### **4. Experiments**

#### *4.1. Setup*

In this section, the proposed modules are analyzed and tested using the YOLOX small (-s) and medium (-m) backbones, which are adopted directly from the YOLOv5 backbone and its scaling rules, as well as the YOLOv3 backbone (DarkNet53 with an SPP bottleneck) for its simplicity and broad compatibility, and hence popularity, in various applied domains. More detailed descriptions can be found in the YOLOX paper [42]. We also perform an ablation study to analyze the effects of the different components and a comparative study with state-of-the-art detectors including EfficientDet [43], YOLOv3 [38], YOLO-fine [28], YOLOv4, and Scaled-YOLOv4 [44].

For fair comparison, the input image size is fixed to 512 × 512 pixels for all experiments.

**Dataset**. We use the VEDAI aerial image dataset [21] and the xView satellite image dataset [22] to conduct our experiments. For VEDAI, there exist two RGB versions with 12.5-cm and 25-cm spatial resolutions, which we refer to as VEDAI12 and VEDAI25, respectively, in our experimental results. The original data contain 3757 vehicles of 9 different classes, including *car, truck, pickup, tractor, camper, ship, van, plane*, and *others*. As done by the authors in [28], we merge class *plane* into class *others* since there are only a few *plane* instances. The images from the xView dataset were collected from the WorldView-3 satellite at 30-cm spatial resolution. We followed the setup in [28] to gather the 19 vehicle classes into a single *vehicle* class. The dataset contains a total of around 35,000 vehicles. It should be noted that we benchmark these two datasets because of their complementary characteristics. The VEDAI dataset contains aerial images with multiple classes of vehicles over different types of backgrounds (urban, rural, desert, forest, etc.), while its numbers of images and objects are quite limited (1200 and 3757, respectively). Meanwhile, the xView dataset involves satellite images of lower resolution with a single merged class of very small vehicles, and it contains more images and objects (7400 and 35,000, respectively).

**Metric**. We report per-class average precision (AP) and the mean value (mAP) following the PASCAL VOC [13] metric. An intersection-over-union (IOU) threshold, computed via the Jaccard index [39], is used to identify positive boxes during evaluation. IOU values vary between 0 (no overlap) and 1 (perfect overlap). Within the context of vehicle detection in remote sensing images, we follow [28] and set a small threshold, i.e., the testing threshold is set to 0.1 unless stated otherwise.

To be more informative, we also show the widely used precision–recall (PR) curves in later experiments. The recall and precision are computed by Equations (9) and (10), respectively.

$$\text{Recall} = \frac{\text{number of correct detections}}{\text{number of existing objects}} = \frac{TP}{TP + FN} \tag{9}$$

$$\text{Precision} = \frac{\text{number of correct detections}}{\text{number of detected objects}} = \frac{TP}{TP + FP},\tag{10}$$

where *TP*, *FP*, and *FN* denote true positives, false positives, and false negatives, respectively. The PR curve plots the precision values, which usually decrease, at each recall rate. Higher recall rates correspond to lower confidence thresholds at test time, thus a higher likelihood of false positives and a lower precision rate. Conversely, lower recall rates mean stricter thresholds and fewer false positives, thus resulting in better precision. The precision–recall curve therefore gives a global view of the trade-off between precision and recall.
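
The curve can be traced by sweeping the confidence threshold over the ranked detections; the sketch below (a minimal illustration, not our evaluation code; the input format of one confidence score and one matched/unmatched flag per detection is an assumption) accumulates *TP* and *FP* counts in order of decreasing confidence:

```python
def precision_recall_curve(scores, is_tp, num_gt):
    """Precision and recall after each detection, ranked by confidence.

    scores : detection confidences; is_tp : whether each detection matched
    a ground-truth box (IOU above threshold); num_gt : number of objects.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))  # Equation (10)
        recalls.append(tp / num_gt)        # Equation (9)
    return precisions, recalls
```

Lowering the threshold walks rightward along the recall axis, which is why precision generally degrades toward high recall.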

#### *4.2. Mutual Guidance*

In this section, we show the impact of mutual guidance on remote sensing data by applying it directly to vehicle detection, independently of the other modules. The baseline is the same backbone with a generic setup, as used in [4]. Since focal loss [45] is used in that setup, we train mutual guidance with the same loss for a fair comparison.
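
As a rough illustration of the crossed assignment (a simplified sketch with a hypothetical top-k selection, not the exact matching rule of [4]): classification targets are given to the anchors whose regressed boxes best localize the object ("localize to classify"), while localization targets follow the anchors with the highest semantic scores ("classify to localize").

```python
import numpy as np

def mutual_assign(cls_scores, reg_ious, topk=3):
    """Simplified mutual-guidance assignment for one ground-truth object.

    cls_scores : per-anchor classification confidence for the object's class.
    reg_ious   : IOU between each anchor's regressed box and the ground truth.
    Returns (anchors positive for classification, anchors positive for regression).
    """
    cls_pos = np.argsort(-reg_ious)[:topk]    # localize to classify
    reg_pos = np.argsort(-cls_scores)[:topk]  # classify to localize
    return cls_pos, reg_pos
```

Each branch thus supervises the other, instead of both relying on the static IOU of the prior anchor boxes.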

The results in Table 1 show an improvement when switching from the IOU-based matching scheme to mutual guidance. The impact is smaller with YOLOX-m, which was already efficient to begin with. The use of GFocal loss brings further improvement for both architectures.

**Table 1.** Mutual guidance for different backbone architectures on VEDAI25 dataset. The best performance per column is shown in boldface.


#### *4.3. Contrastive Loss*

Similar to the previous subsection, we here assess the contrastive loss in the context of vehicle detection. To this end, the contrastive loss is used together with the detection losses under the IOU-based matching strategy. Following [17], we test both variants of the loss function, namely Lin (Equation (7)) and Lout (Equation (8)). The results are shown in Table 2.
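
The two variants differ only in whether the average over positive pairs is taken inside or outside the logarithm. A minimal NumPy sketch (assuming L2-normalized embeddings and a temperature τ, following [17]; this is not our training code):

```python
import numpy as np

def supcon_losses(z, labels, tau=0.1):
    """Return the (L_out, L_in) supervised contrastive losses of [17].

    z : (N, D) array of L2-normalized embeddings; labels : (N,) class ids.
    """
    sim = np.exp(z @ z.T / tau)  # pairwise terms exp(z_i . z_a / tau)
    np.fill_diagonal(sim, 0.0)   # exclude self-contrast
    denom = sim.sum(axis=1)      # sum over all a != i
    l_out, l_in, count = 0.0, 0.0, 0
    for i, y in enumerate(labels):
        pos = [j for j in range(len(labels)) if j != i and labels[j] == y]
        if not pos:
            continue  # anchors without positives do not contribute
        probs = sim[i, pos] / denom[i]
        l_out += -np.mean(np.log(probs))  # average of logs: sum outside the log
        l_in += -np.log(np.mean(probs))   # log of an average: sum inside the log
        count += 1
    return l_out / count, l_in / count
```

By Jensen's inequality, Lout ≥ Lin, and the two coincide when each anchor has a single positive.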

**Table 2.** YOLOX-s performance on VEDAI25 with different contrastive loss functions.


The contrastive loss seems to have the reverse effect of mutual guidance on the two YOLOX backbones: the auxiliary loss does not improve the performance of YOLOX-s as much as that of YOLOX-m, and, in the case of the outside loss, it even has a negative impact. This suggests that YOLOX-m does not suffer from the misalignment problem as much as YOLOX-s does and can thus benefit more from the improved visual representation brought by the contrastive loss.

#### *4.4. Mutual Guidance Meets Contrastive Learning*

The results of YOLOX with the mutual guidance strategy and contrastive learning are shown in Table 3. The contrastive loss greatly benefits the network once the misalignment between localization and classification is alleviated by mutual guidance, and the improvement is balanced between the two backbones. Although the inside contrastive loss dominates the outside one in the previous experiment, it becomes inferior when the semantic information from the classification branch and projection head is properly utilized in the localization process, conforming to the findings of [17]. We coin the combination of mutual guidance and the outside contrastive loss *contrastive mutual guidance* (CMG).

**Table 3.** Performance of YOLOX backbones on VEDAI25 when training with mutual guidance (MG) and contrastive loss.


**Multiple datasets.** We further show the results on the different datasets, with their different resolutions, in Table 4 and the corresponding precision–recall curves in Figure 4.

**Table 4.** Performance of YOLOX-s vanilla with mutual guidance (MG) and contrastive mutual guidance (CMG) on the 3 datasets. The contrastive mutual guidance strategy consistently outperforms other configurations, showing its benefit.


**Figure 4.** Precision–recall curve of YOLOX-s on 3 datasets, from left to right: VEDAI12, VEDAI25, and xView30. The methods with +CMG gain improvement over the others at around recall level of 0.5 for the VEDAI datasets and both +MG and +CMG outperform the vanilla method on the xView dataset.


Some qualitative results on the VEDAI25 and xView datasets are shown in Figures 5 and 6, respectively. Several objects are missed in the second and third columns, while the CMG strategy (last column) is able to recognize objects of complex shape and appearance.

**Comparison to the state of the art.** In Table 5, we compare our method with several state-of-the-art methods on the three datasets. Our YOLOX backbone with the CMG strategy outperforms the others on the VEDAI datasets and is on par with YOLO-fine on xView. From the qualitative results in Figures 7 and 8 (for VEDAI and xView, respectively), it can be seen that, although the xView dataset contains extremely small objects, our method, without operations dedicated to tiny object detection, approaches the state-of-the-art method specifically designed for small vehicle detection [28]. A breakdown of per-class performance on VEDAI is given in Table 6.

**Table 5.** Performance of different YOLOX backbones with CMG compared to the state-of-the-art methods. Our method outperforms or is on par with the methods designed for tiny object recognition.


**Table 6.** Per-class performance of YOLOX backbones with CMG on VEDAI25 dataset. Our method outperforms the state-of-the-art for all classes.


Two failure cases are shown in the last columns of Figures 7 and 8. Our method has difficulty recognizing the "other" class of VEDAI, which comprises various object types, and might wrongly detect visually very similar objects in xView.

**Figure 5.** Qualitative results of YOLOX-s on VEDAI25. The contrastive mutual guidance helps to recognize intricate objects. The number and color of each box correspond to one of the classes, i.e., (1) car, (2) truck, (3) pickup, (4) tractor, (5) camper, (6) ship, (7) van, and (8) plane.

**Figure 6.** Qualitative results of YOLOX-s on xView. The contrastive mutual guidance helps to recognize intricate objects. The number and color of each box indicate the vehicle class.

**Figure 7.** Qualitative results of our methods and state-of-the-art methods on VEDAI25. The number and color of each box correspond to one of the classes, i.e., (1) car, (2) truck, (3) pickup, (4) tractor, (5) camper, (6) ship, (7) van, and (8) plane. The last column shows a failure case: our method has difficulties recognizing the "other" class, which comprises various object types.

**Figure 8.** Qualitative results of our methods and state-of-the-art methods on xView. The number and color of each box indicate the vehicle class. The last column shows a failure case: our method can recognize objects of various shapes but may wrongly detect visually very similar objects (although this might be due to faulty annotations).

#### **5. Discussion**

Although the supervised contrastive loss has been shown to be able to replace cross-entropy for classification problems [17], in this paper it is applied as an auxiliary loss alongside the main localization and classification losses. This is because only a small number of anchors can take part in the contrastive process: the total number of anchors, especially negative ones, is too large for exhaustive pairwise comparison.
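
Concretely, the auxiliary term is computed over a subsampled set of anchors. One hypothetical sampling scheme (the cap `max_per_image` and the keep-all-positives rule are illustrative assumptions, not our exact procedure) would be:

```python
import random

def sample_contrastive_anchors(anchor_classes, max_per_image=64, seed=None):
    """Subsample anchors for the auxiliary contrastive loss.

    anchor_classes : per-anchor class id, with -1 for negative (background)
    anchors. All positives are kept (they are few); negatives fill the
    remaining budget so the pairwise similarity matrix stays tractable.
    """
    rng = random.Random(seed)
    positives = [i for i, c in enumerate(anchor_classes) if c >= 0]
    negatives = [i for i, c in enumerate(anchor_classes) if c < 0]
    budget = max(0, max_per_image - len(positives))
    return positives + rng.sample(negatives, min(budget, len(negatives)))
```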

However, the contrastive loss shows weakness when the annotations are noisy, as in the xView dataset, where several boxes are missing for (what appear to be) legitimate objects, as shown in Figure 9.

**Figure 9.** Examples of faulty annotations in the xView dataset: a non-vehicle annotation (red border) and missing annotations of container trucks (green border) and cars (blue border). The number and color of each box indicate the vehicle class.

The experimental results show that, unlike in [17], the inside contrastive loss is not always inferior to its outside counterpart. We speculate that this could be due to the auxiliary role of the contrastive loss in the detection problem and/or the characteristics of small objects in remote sensing images.

#### **6. Conclusions**

This paper presents the combination of a mutual guidance matching strategy and a supervised contrastive loss for the vehicle detection problem. Mutual guidance better connects the localization and classification branches of a detection network, while the contrastive loss improves the visual representation and thus provides better semantic information. Vehicle detection is generally complicated by the varied object sizes and similar appearances seen from the aerial point of view. This, however, provides an opportunity for contrastive learning, as such variation can be regarded as latent image augmentation, which has been shown to be beneficial for learning visual representations. Although the paper is presented in a remote sensing context, we believe the idea could be extended to generic computer vision applications.

**Author Contributions:** Conceptualization, H.-Â.L. and S.L.; methodology, H.-Â.L., H.Z. and M.-T.P.; software, H.-Â.L. and H.Z.; validation, H.-Â.L. and M.-T.P.; formal analysis, H.-Â.L.; investigation, H.-Â.L.; writing—original draft preparation, H.-Â.L., H.Z. and M.-T.P.; writing—review and editing, H.-Â.L., M.-T.P. and S.L.; visualization, H.-Â.L.; supervision, M.-T.P. and S.L.; project administration, M.-T.P. and S.L.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the SAD 2021-ROMMEO project (ID 21007759).

**Data Availability Statement:** The VEDAI and xView datasets are publicly available. Source code and dataset will be available at https://lhoangan.github.io/CMG\_vehicle/.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

