#### *4.1. Datasets*

#### 4.1.1. Cars Overhead with Context Dataset

The cars overhead with context (COWC) dataset [31] contains 15 cm resolution satellite images (one pixel covers 15 cm at ground level) from six different regions. The dataset contains a large number of unique cars and covers Toronto in Canada, Selwyn in New Zealand, Potsdam and Vaihingen in Germany, and Columbus and Utah in the United States. Out of these six regions, we used the data from Toronto and Potsdam; therefore, when we refer to the COWC dataset, we refer to the data from these two regions. There are 12,651 cars in our selected dataset. The dataset contains only RGB images, and we used these images for training and testing.

We used 256-by-256 image tiles, and every image tile contains at least one car. The average length of a car was between 24 and 48 pixels, and the width was between 10 and 20 pixels. Therefore, the area of a car was between 240 and 960 pixels, which is small relative to other objects typically found in satellite imagery. We used bi-cubic downsampling with a factor of 4× to generate LR images from the COWC dataset, giving LR images of 64-by-64 pixels. Each image tile has an associated text file containing the coordinates of the bounding box for each car.
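For clarity, a minimal sketch of this LR-generation step is given below, assuming each HR tile is stored as a separate image file; the paths and helper name are illustrative rather than part of our actual pipeline.

```python
# Minimal sketch of generating an LR tile from an HR tile by 4x bi-cubic
# downsampling. File names and paths are hypothetical.
from PIL import Image

def make_lr_tile(hr_path, lr_path, scale=4):
    """Downsample an HR tile by `scale` using bi-cubic interpolation."""
    hr = Image.open(hr_path).convert("RGB")            # e.g., a 256x256 COWC tile
    lr_size = (hr.width // scale, hr.height // scale)  # 256/4 -> 64x64 for COWC
    lr = hr.resize(lr_size, resample=Image.BICUBIC)
    lr.save(lr_path)

# Example (hypothetical paths):
# make_lr_tile("cowc_hr/tile_0001.png", "cowc_lr/tile_0001.png", scale=4)
```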

Our experiments considered only one class, car, and did not consider any other type of object. Figure 6 shows examples from the COWC dataset. We experimented with a total of 3340 tiles for training and testing. Our train/test split was 80%/20%, and the training set was further divided into training and validation sets by an 80%/20% ratio. We trained our end-to-end architecture on the training set augmented with random horizontal flips and ninety-degree rotations.
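The splitting and augmentation just described can be sketched as follows; the helper functions and the random seed are illustrative, and the bounding-box coordinate transforms that must accompany the image augmentations are omitted for brevity.

```python
# Sketch of the 80/20 train/test split, the further 80/20 train/validation
# split, and the training-time augmentation (random horizontal flip and
# random multiple-of-90-degree rotation). Names are illustrative.
import random
import numpy as np

def split_tiles(tile_ids, test_frac=0.2, val_frac=0.2, seed=0):
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

def augment(image):
    """Randomly flip an HxWxC array horizontally and rotate it by 0/90/180/270 degrees.
    The matching bounding-box transforms are omitted here."""
    if random.random() < 0.5:
        image = np.flip(image, axis=1)   # horizontal flip
    k = random.randint(0, 3)             # number of 90-degree rotations
    image = np.rot90(image, k)
    return image
```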

**Figure 6.** COWC (cars overhead with context) dataset: LR-HR (low-resolution and high-resolution) image pairs are shown in (**a**,**b**), and GT (ground truth) images with bounding boxes for cars are shown in (**c**).

#### 4.1.2. Oil and Gas Storage Tank Dataset

The oil and gas storage tank (OGST) dataset was compiled at the Alberta Geological Survey (AGS) [67], a branch of the Alberta Energy Regulator (AER) [35]. AGS provides geoscience information and supports AER's regulatory functions so that energy developments are carried out in a manner that ensures public and environmental safety. To assist AER with sustainable land management and compliance assurance [68], AGS is utilizing remote sensing imagery to identify the number of oil and gas storage tanks inside well pad footprints in Alberta.

While the SPOT-6 satellite imagery at 1.5 m pixel resolution provided by the AGS has sufficient quality and detail for many regulatory functions, it is difficult to detect small objects within well pads, e.g., oil and gas storage tanks, with ordinary object detection methods. The diameter of a typical storage tank is about 3 m, and tanks are usually placed upright and side-by-side, less than 2 m apart. To train our architecture for this use case, we needed a dataset providing pairs of low- and high-resolution images. Therefore, we created the OGST dataset using free imagery from Bing Maps [69].

The OGST dataset contains 30 cm resolution remote sensing images (RGB) from the Cold Lake Oil Sands region of Alberta, Canada, where there is a high level of oil and gas activity and a concentration of well pad footprints. The dataset contains 1671 oil and gas storage tanks from this area.

We used 512-by-512 image tiles, and every tile in our experiment contains at least one oil and gas storage tank. The average area covered by an individual tank was between 800 and 1600 pixels. Some industrial tanks were large, but most tanks covered small regions of the imagery. We downscaled the HR images using bi-cubic downsampling with a factor of 4×, and therefore, we obtained LR tiles of 128-by-128 pixels. Every image tile is associated with a text file containing the coordinates of the bounding boxes for the tanks on that tile. We show examples from the OGST dataset in Figure 7.

**Figure 7.** OGST (oil and gas storage tank) dataset: LR-HR (low-resolution and high-resolution) image pairs are shown in (**a**,**b**), and GT (ground truth) images with bounding boxes for oil and gas storage tanks are shown in (**c**).
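The per-tile annotation text files mentioned above can be read with a small helper such as the one below; since the exact file layout is not given here, the assumed format (one bounding box per line as x_min y_min x_max y_max) is hypothetical.

```python
# Sketch of loading bounding boxes from a per-tile annotation text file.
# The assumed line format "x_min y_min x_max y_max" is hypothetical.
def load_boxes(txt_path):
    boxes = []
    with open(txt_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 4:
                x_min, y_min, x_max, y_max = map(float, parts[:4])
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```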

As with the COWC dataset, our experiments considered one class here, tank, and we had a total of 760 tiles for training and testing. We used a 90%/10% split for our train/test data, and the training data was further divided by 90%/10% for the train/validation split. We used a larger training proportion here than for the COWC dataset to compensate for its smaller size. The dataset is available at [66].

#### *4.2. Evaluation Metrics for Detection*

We obtained our detection output as bounding boxes with associated classes. To evaluate our results, we used average precision (AP), which we obtained by calculating intersection over union (IoU), precision, and recall.

We denote the set of correctly detected objects as true positives (*TP*) and the set of falsely detected objects as false positives (*FP*). The precision is then the ratio of the number of *TP*s to the number of all predicted objects:

$$Precision = \frac{|TP|}{|TP| + |FP|} \tag{19}$$

We denote the set of objects that are not detected by the detector as false negatives (*FN*). Then, the recall is defined as the ratio of the number of detected objects (*TP*) to the number of all objects in the dataset:

$$Recall = \frac{|TP|}{|TP| + |FN|} \tag{20}$$
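These two definitions translate directly into code; in the sketch below, tp, fp, and fn are the counts |*TP*|, |*FP*|, and |*FN*| obtained at a fixed IoU threshold.

```python
# Direct translations of Equations (19) and (20); tp, fp, fn are counts of
# true positives, false positives, and false negatives.
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```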

To measure the localization error of predicted bounding boxes, IoU computes the overlap between two bounding boxes: the detected box and the ground truth box. If we take all detections with an IoU ≥ *τ* as *TP* and consider all other detections as *FP*, then we get the precision at IoU threshold *τ*. If we now vary *τ* from 0.5 to 0.95 with a step size of 0.05, we obtain ten different precision values, which can be combined into the average precision (AP) at IoU = 0.5:0.95 [8]. Note that in the case of multi-class classification, we would need to compute the AP for each object class separately. To obtain a single performance measure for object detection, the mean AP (mAP) is computed, which is the most common measure of object detection quality.
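The IoU computation and the averaging over the ten thresholds described above can be sketched as follows; the ranking of detections by confidence score used in a full AP computation is omitted here for brevity, so this is a simplified illustration rather than a complete evaluation routine.

```python
# IoU between two axis-aligned boxes (x_min, y_min, x_max, y_max), and the
# average over the ten thresholds tau = 0.5, 0.55, ..., 0.95 described above.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def ap_over_thresholds(precision_at):
    """`precision_at` maps an IoU threshold tau to the precision measured at tau."""
    taus = [0.5 + 0.05 * i for i in range(10)]  # 0.5, 0.55, ..., 0.95
    return sum(precision_at(tau) for tau in taus) / len(taus)
```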

In this paper, both of our datasets contain only a single class, and hence, we used AP as our evaluation metric. We mainly report AP at IoU = 0.5:0.95, as our method performed increasingly better compared to the other models as we increased the IoU threshold used in the AP calculation. We show this trend in Section 4.3.4.
