*4.3. Results*

#### 4.3.1. Detection without Super-Resolution

We first ran the two detectors to document the baseline object detection performance on both LR and HR images. We used SSD with a VGG16 backbone [62] and Faster R-CNN (FRCNN) with a ResNet-50-FPN backbone [51]. We trained both models with HR images as well as with 4×-downscaled LR images, and tested each of them on both HR and LR images.
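For illustration, this setup can be sketched with the detector constructors available in torchvision; the two-class setting (one object class plus background) and the variable names are assumptions, not our exact training code:

```python
# Minimal sketch of the detector setup and LR image creation.
import torch
import torch.nn.functional as F
import torchvision

# Faster R-CNN with a ResNet-50-FPN backbone.
frcnn = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=2)

# SSD with a VGG16 backbone.
ssd = torchvision.models.detection.ssd300_vgg16(
    weights=None, num_classes=2)

def make_lr(hr_batch: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Create the 4x-downscaled LR counterpart of an HR batch (N, C, H, W)."""
    return F.interpolate(hr_batch, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)
```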

Table 1 shows the detection performance of the two detectors for the different train/test combinations. When using LR images for both training and testing, we observed 64% AP for Faster R-CNN. When training on HR images and testing on LR images, the accuracy dropped for both detectors. We also report detection results (training and testing on LR images) for both datasets using SSD with RFB modules (SSD-RFB) [52], which slightly improved the accuracy over the base SSD.

The last two rows of Table 1 show the accuracy of both detectors when training and testing on HR images. We achieved up to 98% AP with the Faster R-CNN detector. This demonstrates the large impact of resolution on object detection quality and sets a natural upper bound on how close an SR-based method can get when working on LR images. In the following sections, we demonstrate that our approaches considerably improve the detection rate on LR imagery and get remarkably close to the performance of working directly on HR imagery.


**Table 1.** Detection on LR (low-resolution) and HR (high-resolution) images without using super-resolution. Detectors are trained with both LR and HR images and AP (average precision) values are calculated using 10 different IoUs (intersection over union).
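For reference, the AP metric used throughout the tables can be sketched as an average over the 10 COCO-style IoU thresholds; the `ap_at_iou` helper is hypothetical and stands for any per-threshold AP computation:

```python
# Sketch of the evaluation protocol: AP averaged over the 10 IoU
# thresholds 0.50, 0.55, ..., 0.95.
import numpy as np

IOU_THRESHOLDS = np.arange(0.50, 1.00, 0.05)  # 10 thresholds

def mean_ap(detections, ground_truth, ap_at_iou) -> float:
    """Average the per-threshold AP values (IoU = 0.5:0.95)."""
    return float(np.mean([ap_at_iou(detections, ground_truth, t)
                          for t in IOU_THRESHOLDS]))
```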

#### 4.3.2. Separate Training with Super-Resolution

In this experiment, we created 4× upsampled images from the LR input images using bicubic upsampling and different SR methods. Note that no training was needed for bicubic upsampling, since it is a parameter-free function. We used the SR images as test data for the two detectors. We compared three GAN architectures for generating SR images: our new EESRGAN architecture, ESRGAN [21], and EEGAN [22]. Each network was trained separately on the training set before the object detector was trained. For the evaluation, we again compared detectors trained on the SR images from the particular architecture with detectors trained directly on the HR images.
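The parameter-free bicubic baseline can be sketched as follows (a simple illustration, not our exact pipeline; `frcnn` is the detector from the earlier sketch):

```python
# Sketch of the bicubic baseline: 4x-upsample each LR test image and
# feed it to a trained detector.
import torch
import torch.nn.functional as F

def bicubic_sr(lr_batch: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Upsample an LR batch (N, C, H, W) by `scale` with bicubic
    interpolation; no training is involved."""
    return F.interpolate(lr_batch, scale_factor=scale,
                         mode="bicubic", align_corners=False)

# frcnn.eval()
# with torch.no_grad():
#     preds = frcnn(list(bicubic_sr(lr_batch)))  # boxes, labels, scores
```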

Table 2 shows the detection output for the different combinations of SR methods, detectors, and train/test pairs. Our new EESRGAN architecture delivered the best results, already coming close to the detection rates observed when working with HR images only. Moreover, after training, EESRGAN can be applied directly to LR imagery where no HR data is available and still achieves very good results. We also observed that the other SR methods, EEGAN and ESRGAN, already improved the AP considerably when used to preprocess LR images. However, EESRGAN outperformed both methods on both datasets.

**Table 2.** Detection on SR (super-resolution) images with separately trained SR network. Detectors are trained with both SR and HR (high-resolution) images and AP (average precision) values are calculated using 10 different IoUs (intersection over union).




#### 4.3.3. End-to-End Training with Super-Resolution

For this experiment, we trained our EESRGAN network and the detectors end-to-end. The discriminator (*DRa*) and the detectors jointly acted as a discriminator for the entire architecture. The detector loss was backpropagated to the SR network and therefore contributed to the enhancement of the LR images. At training time, LR-HR image pairs were used to train the EESRGAN part, and the generated SR images were then passed to the detector for training. At test time, only LR images were fed to the network: our architecture first generated an SR image from the LR input before object detection was performed.
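This coupling can be sketched as a single training step; `generator` (the EESRGAN SR network with its edge branch), `discriminator` (*DRa*), `sr_criterion` (the combined SR losses), `detector`, and `loader` are hypothetical stand-ins, the detector and discriminator updates are omitted for brevity, and the loss weight is a placeholder:

```python
# Sketch of one end-to-end training step.
import torch

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
lambda_det = 1.0  # placeholder weight for the detector loss

for lr_imgs, hr_imgs, targets in loader:      # LR-HR pairs with boxes
    sr_imgs, edges = generator(lr_imgs)       # SR images + enhanced edges

    # SR losses computed against the HR ground truth.
    loss_sr = sr_criterion(sr_imgs, edges, hr_imgs, discriminator)

    # torchvision-style detectors return a dict of losses in train mode;
    # this loss is backpropagated into the SR generator as well.
    loss_det = sum(detector(list(sr_imgs), targets).values())

    opt_g.zero_grad()
    (loss_sr + lambda_det * loss_det).backward()
    opt_g.step()
```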

We also compared our results with other architectures, using ESRGAN [21] and EEGAN [22] together with the detectors. Table 3 shows that our method delivers superior results.

**Table 3.** Detection with end-to-end SR (super-resolution) network. Detectors are trained with SR images and AP (average precision) values are calculated using 10 different IoUs (intersection over union).


#### 4.3.4. AP Versus IoU Curve

We calculated AP values at different IoU thresholds. Figure 8 plots the AP versus IoU curves for both datasets, showing the performance of the separately trained EESRGAN-FRCNN, the end-to-end EESRGAN-FRCNN, and the standalone FRCNN. The end-to-end network performed better than the separately trained one; the difference is most evident at higher IoUs on the COWC dataset.
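Such curves can be produced with a few lines of matplotlib; the `results` dictionary and the `ap_at_iou` helper are hypothetical (the latter from the evaluation sketch above):

```python
# Sketch of how the AP-IoU curves in Figure 8 can be plotted.
import numpy as np
import matplotlib.pyplot as plt

ious = np.arange(0.50, 1.00, 0.05)
for name, (dets, gts) in results.items():   # e.g., "EESRGAN-FRCNN", ...
    plt.plot(ious, [ap_at_iou(dets, gts, t) for t in ious], label=name)
plt.xlabel("IoU threshold")
plt.ylabel("AP")
plt.legend()
plt.show()
```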

Our results come close to the upper-bound AP values obtained from the standalone FRCNN trained and tested on HR images.

The OGST dataset displayed less performance variation than the COWC dataset. Since the objects in the OGST dataset are larger than those in the COWC dataset, the gap between the standalone FRCNN and our method was smaller on OGST than on COWC. In conclusion, training our new architecture end-to-end improved performance on both datasets.

**Figure 8.** AP-IoU (average precision-intersection over union) curves for the datasets. Plotted results show the detection performance of standalone Faster R-CNN on HR (high-resolution) images and our proposed method (with and without end-to-end training) on SR (super-resolution) images.

#### 4.3.5. Precision Versus Recall

Figure 9 shows precision-recall curves for both datasets: Figure 9a depicts the curve for the COWC dataset and Figure 9b the curve for the OGST dataset. For each dataset, we plot curves for the standalone Faster R-CNN trained and tested on LR images and for our method with and without end-to-end training. We used IoU = 0.5 to calculate precision and recall.

The precision-recall curves for both datasets show that our method maintains higher precision at high recall than the standalone Faster R-CNN models, and that our end-to-end models performed better than the separately trained ones. In particular, the end-to-end models detected more than 99% of the cars with 96% AP on the COWC dataset, and more than 81% of the tanks with 97% AP on the OGST dataset.
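For illustration, precision and recall at a single IoU threshold can be computed by greedily matching score-sorted detections one-to-one to ground-truth boxes; this is a simplified sketch of standard practice, not our exact evaluation code:

```python
# Sketch of precision/recall at one IoU threshold (0.5 here).
import torch
from torchvision.ops import box_iou

def precision_recall(pred_boxes, pred_scores, gt_boxes, iou_thr=0.5):
    """Greedy one-to-one matching of score-sorted predictions to ground
    truth; returns (precision, recall) at the given IoU threshold."""
    matched = torch.zeros(len(gt_boxes), dtype=torch.bool)
    tp = 0
    for i in torch.argsort(pred_scores, descending=True).tolist():
        if len(gt_boxes) == 0:
            break
        ious = box_iou(pred_boxes[i].unsqueeze(0), gt_boxes).squeeze(0)
        j = int(torch.argmax(ious))
        if ious[j] >= iou_thr and not matched[j]:
            matched[j] = True
            tp += 1
    precision = tp / max(len(pred_boxes), 1)
    recall = tp / max(len(gt_boxes), 1)
    return precision, recall
```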

#### 4.3.6. Effects of Dataset Size

We trained our architecture with different training set sizes and tested on a fixed test set. In Figure 10, we plot AP values (IoU = 0.5:0.95) against the number of labeled training objects for both datasets. We used five training set sizes, {500, 1000, 3000, 6000, 10,000} cars and {100, 200, 400, 750, 1491} tanks, to train our model with and without the end-to-end setting.

**Figure 9.** Precision-recall curve for the datasets. Plotted results show the detection performance of standalone Faster R-CNN on LR (low-resolution) images and our proposed method (with and without end-to-end training) on SR (super-resolution) images.

We obtained the highest AP value of 95.5% with the full COWC training set (10,000 cars), using the same test set (1000 cars) for all training set sizes (with the end-to-end setting) and another 1000 labeled cars for validation. With 6000 cars, the AP was already close to this maximum, as shown in the plot of AP versus dataset size (COWC). The AP decreased significantly when we used only 3000 labeled cars, and was lowest with 500 labeled cars, with the trend still falling, as depicted in Figure 10a. We therefore infer that around 6000 labeled cars were needed to reach an AP above 90% on the COWC dataset. Without the end-to-end setting, we observed slightly lower AP values for all training set sizes, and the gap between the two settings widened when fewer than 6000 labeled cars were used.

**Figure 10.** AP (average precision) with varying number of training sets from the datasets. Plotted results show the detection performance of our proposed method (with and without end-to-end training) on SR (super-resolution) images.

On the OGST dataset, the full training set (1491 tanks) gave 83.2% AP with the end-to-end setting; 100 labeled tanks were used for testing and the same number for validation across all training set sizes. We still obtained high AP values with 50% of the full training set, as depicted in Figure 10b, but AP dropped below 80% when the training data was reduced further. As with COWC, the AP values without end-to-end training were comparatively lower for all OGST training set sizes, and the gap between the two settings was slightly larger when fewer than 400 labeled tanks were used, as shown in the plot of AP versus dataset size (OGST dataset).

We used 90% of the OGST dataset for training, compared to 80% of the COWC dataset. The accuracy on the OGST test data increased slightly as more training data was added, as depicted in Figure 10b. We therefore used a larger training fraction for the OGST dataset than for the COWC dataset, which slightly improved its relatively low test accuracy.
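The dataset-size experiment can be sketched as training on random subsets of the full training set while keeping the test set fixed; `CowcDataset` is a hypothetical dataset class, and the sketch samples dataset items for simplicity, whereas we count labeled objects:

```python
# Sketch of the dataset-size experiment.
import random
from torch.utils.data import Subset

sizes = [500, 1000, 3000, 6000, 10000]        # labeled cars (COWC)
full_train = CowcDataset(split="train")

for n in sizes:
    subset = Subset(full_train, random.sample(range(len(full_train)), n))
    # ... train the model on `subset`, then evaluate on the fixed test set
```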

#### 4.3.7. Enhancement and Detection

Figure 11 shows input LR images, the corresponding generated SR images, the enhanced edge information, and the final detections. The image enhancement helped the detectors achieve high AP values and also made the images visually clear enough to identify the objects easily. As the figure shows, the visual quality of the generated SR images is good compared to the corresponding LR images, and the FRCNN detector detected most of the objects correctly.
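At test time, this pipeline reduces to running the generator and then the detector; `generator` and `detector` are the hypothetical modules from the training sketch, with the enhanced edge map assumed as a second return value:

```python
# Sketch of the test-time pipeline of Figure 11.
import torch

generator.eval()
detector.eval()
with torch.no_grad():
    sr_img, edge_map = generator(lr_img.unsqueeze(0))  # (1, C, 4H, 4W)
    preds = detector(list(sr_img))  # list of {boxes, labels, scores}
```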

(**a**) Input LR image (**b**) Generated SR image (**c**) Enhanced edge (**d**) Detection

**Figure 11.** Examples of SR (super-resolution) images that are generated from input LR (low-resolution) images are shown in (**a**,**b**). The enhanced edges and detection results are shown in (**c**,**d**).

#### 4.3.8. Effects of Edge Consistency Loss (*Ledge*\_*cst*)

In EEGAN [22], only the image consistency loss (*Limg*\_*cst*) was used to enhance the edge information. This loss produced noisy edge information, and as a result the final SR images became blurry. The blurry output with noisy edges obtained using only the *Limg*\_*cst* loss is shown in Figure 12a; such blurry images yielded lower detection accuracy than sharp outputs.

Therefore, we introduced an edge consistency loss (*Ledge*\_*cst*) in addition to the *Limg*\_*cst* loss, which yields noise-free enhanced edge information similar to the edges extracted from the ground truth images; the effect of the *Ledge*\_*cst* loss is shown in Figure 12b. The ground truth HR image with its extracted edges is depicted in Figure 12c.
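A minimal sketch of this loss, assuming a Laplacian edge extractor and a Charbonnier-style penalty (the exact edge operator here is an illustrative assumption):

```python
# Sketch of the edge consistency loss.
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def extract_edges(img: torch.Tensor) -> torch.Tensor:
    """Per-channel Laplacian edge map of an image batch (N, C, H, W)."""
    c = img.shape[1]
    kernel = LAPLACIAN.to(img).repeat(c, 1, 1, 1)
    return F.conv2d(img, kernel, padding=1, groups=c)

def edge_consistency_loss(enhanced_edge: torch.Tensor,
                          hr_img: torch.Tensor, eps: float = 1e-3):
    """Charbonnier distance between the enhanced edge map and the edges
    extracted from the ground-truth HR image."""
    diff = enhanced_edge - extract_edges(hr_img)
    return torch.sqrt(diff * diff + eps * eps).mean()
```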

(**a**) Final SR image and enhanced edge with *Limg*\_*cst* loss

(**b**) Final SR image and enhanced edge with *Limg*\_*cst* and *Ledge*\_*cst* losses

(**c**) Ground truth HR image with extracted edge

**Figure 12.** Effects of the edge consistency loss (*Ledge*\_*cst*) on the final SR (super-resolution) images and enhanced edges, compared to the edges extracted from HR (high-resolution) images.
