*3.2. Discriminator*

As described in the previous section, we use the relativistic discriminator *DRa* to train the generator *G*. The discriminator architecture is taken from ESRGAN [21], which employs the VGG-19 [62] architecture. We use Faster R-CNN [8] and SSD [10] as our detector networks. The discriminator (*DRa*) and the detector network jointly act as the discriminator for the generator module. We briefly describe the two detectors in the next two sections.

#### 3.2.1. Faster R-CNN

The Faster R-CNN [8] is a two-stage object detector consisting of two networks: a region proposal network (RPN) that generates region proposals from an image, and a second network that detects objects from these proposals. The second network also refines the bounding boxes around the detected objects.

The task of the RPN is to return image regions that have a high probability of containing an object. The RPN uses a backbone network such as VGG [62], ResNet, or ResNet with a feature pyramid network (FPN) [51]. These networks serve as feature extractors, and different feature extractors can be chosen based on their performance on public datasets. We use ResNet-50-FPN [51] as the backbone network for our Faster R-CNN because it achieves higher precision than VGG-19 and ResNet-50 without FPN, especially for small-object detection [51]. Even though a larger network might yield a further performance improvement, we chose ResNet-50-FPN for its comparably moderate hardware requirements and faster convergence.

After the RPN, there are two detection branches: a classifier and a regressor. The classification branch assigns each proposal to an object class, and the regression branch refines the bounding box of the object. In our case, both datasets contain objects of only one class; therefore, our classifier infers only two classes: the background class and the object class.

#### 3.2.2. SSD

The SSD [10] is a single-shot multibox detector that detects objects in a single stage. Here, single-stage means that classification and localization are performed in a single forward pass through the network. Like Faster R-CNN, SSD has a feature extractor network, and different networks can be used. To serve the primary purpose of SSD, which is speed, we use VGG-16 [62] as the feature extractor network. After this network, SSD has several convolutional feature layers of decreasing size. These layers resemble a pyramid of image representations at different scales; objects are therefore detected in every layer, and finally we get the object detection output as class values and bounding-box coordinates.

#### 3.2.3. Loss of the Discriminator

The relativistic discriminator loss (*LRa D*) was described in the previous section and is given in Equation (4). This loss is added to the detector loss to get the final discriminator loss.

Both Faster R-CNN and SSD have similar regression/localization losses but different classification losses. For regression/localization, both use the smooth *L*1 loss [8] between the detected and ground-truth bounding-box coordinates (*t*∗). The classification loss (*Lcls\_frcnn*), regression loss (*Lreg\_frcnn*), and overall loss (*Ldet\_frcnn*) of Faster R-CNN are given in the following:

$$L\_{cls\\_frcnn} = \mathbb{E}\_{I\_{LR}}\left[ -\log\left( \text{Det}\_{cls\\_frcnn}\left( G\_{Gen}\left( I\_{LR} \right) \right) \right) \right] \tag{11}$$

$$L\_{reg\\_frcnn} = \mathbb{E}\_{I\_{LR}}\left[\text{smooth}\_{L1}\left(\text{Det}\_{reg\\_frcnn}\left(G\_{Gen}(I\_{LR})\right), t\_\*\right)\right] \tag{12}$$

$$L\_{det\\_frcnn} = L\_{cls\\_frcnn} + \lambda L\_{reg\\_frcnn} \tag{13}$$
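The smooth *L*1 term in Equation (12) (and later in Equation (15)) is quadratic for small coordinate errors and linear for large ones, which keeps box regression robust to outliers. A small numeric sketch, with made-up box coordinates:

```python
import torch
import torch.nn.functional as F

# Hypothetical detected box and ground-truth box t_* (x1, y1, x2, y2).
pred_box = torch.tensor([10.0, 12.0, 50.0, 60.0])
gt_box = torch.tensor([10.5, 12.0, 49.0, 62.0])

# PyTorch's smooth L1 (default beta=1.0): 0.5*x^2 for |x| < 1,
# |x| - 0.5 otherwise, averaged over the four coordinates.
loss = F.smooth_l1_loss(pred_box, gt_box)
print(float(loss))  # → 0.53125
```

Here the errors (0.5, 0, 1, 2) contribute 0.125, 0, 0.5, and 1.5 respectively, averaging to 0.53125.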

Here, *λ* balances the two losses and is set to 1 empirically. *Detcls\_frcnn* and *Detreg\_frcnn* are the classifier and regressor of Faster R-CNN. The classification loss (*Lcls\_ssd*), regression loss (*Lreg\_ssd*), and overall loss (*Ldet\_ssd*) of SSD are as follows:

$$L\_{cls\\_ssd} = \mathbb{E}\_{I\_{LR}}\left[ -\log\left(\text{softmax}\left(\text{Det}\_{cls\\_ssd}\left(G\_{Gen}(I\_{LR})\right)\right)\right) \right] \tag{14}$$

$$L\_{reg\\_ssd} = \mathbb{E}\_{I\_{LR}}\left[\text{smooth}\_{L1}\left(\text{Det}\_{reg\\_ssd}\left(G\_{Gen}(I\_{LR})\right), t\_\*\right)\right] \tag{15}$$

$$L\_{det\\_ssd} = L\_{cls\\_ssd} + \alpha L\_{reg\\_ssd} \tag{16}$$

Here, *α* balances the two losses and is set to 1 empirically. *Detcls\_ssd* and *Detreg\_ssd* are the classifier and regressor of the SSD.
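The assembly of the final discriminator loss, the relativistic term of Equation (4) plus a detector loss of the form of Equation (13) or (16), can be sketched as follows. All values here are hypothetical placeholders, not outputs of the actual networks:

```python
import torch
import torch.nn.functional as F

# Hypothetical classifier logits for one proposal: [background, object].
logits = torch.tensor([[0.2, 1.5]])
target = torch.tensor([1])                  # ground truth: the object class
l_cls = F.cross_entropy(logits, target)     # -log(softmax(...)), as in Eq. (14)

# Hypothetical detected vs. ground-truth box coordinates t_*.
pred_box = torch.tensor([[10.0, 12.0, 50.0, 60.0]])
gt_box = torch.tensor([[10.5, 12.0, 49.0, 62.0]])
l_reg = F.smooth_l1_loss(pred_box, gt_box)  # as in Eq. (12)/(15)

balance = 1.0                               # λ (or α), set to 1 empirically
l_det = l_cls + balance * l_reg             # Eq. (13)/(16)

l_ra_d = torch.tensor(0.5)                  # placeholder for Eq. (4)
l_disc = l_ra_d + l_det                     # final discriminator loss
print(float(l_disc))
```

In actual training, `l_ra_d` would come from the relativistic discriminator and the detector losses from Faster R-CNN or SSD heads; the sketch only shows how the scalar terms combine.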
