*Article* **Sensor-Level Mosaic of Multistrip KOMPSAT-3 Level 1R Products**

**Changno Lee <sup>1</sup> and Jaehong Oh <sup>2,</sup>\***


**Featured Application: The proposed method can generate a mosaic image at the product level that is corrected only for radiometric and sensor distortions.**

**Abstract:** High-resolution satellite images such as KOMPSAT-3 data provide detailed geospatial information over areas of interest, even those located in inaccessible regions. High-resolution satellite cameras are designed with a long focal length and a narrow field of view to increase spatial resolution. Thus, the images show relatively narrow swath widths (10–15 km) compared to the dozens or hundreds of kilometers of mid-/low-resolution satellite data. Therefore, users often face obstacles in orthorectifying and mosaicking a bundle of delivered images to create a complete image map. With a single mosaicked image at the sensor level, delivered with only radiometric correction, users can process and manage the simplified data more efficiently. Thus, we propose sensor-level mosaicking to generate a seamless image product with geometric accuracy that meets mapping requirements. Among adjacent image data with some overlap, one image is set as the reference, whereas the others are projected onto it using their sensor model information together with Shuttle Radar Topography Mission elevation data. In the overlapping area, the geometric discrepancy between the data is modeled with a spline along the image line, based on image matching with outlier removal. The new sensor model information for the mosaicked image is generated by extending that of the reference image. Three strips of KOMPSAT-3 data were tested in the experiment. The data showed irregular image discrepancies between the adjacent strips along the image line, and the proposed method successfully modeled and removed them. Additionally, the sensor modeling information of the resulting mosaic was improved by the averaging effect of the input data.

**Keywords:** KOMPSAT-3A; strip; sensor modeling; RPCs; mosaic; matching; discrepancy

#### **1. Introduction**

High-resolution satellite images provide detailed geospatial information with a high geospatial resolution of up to 30–80 cm over the area of interest, even when it is located in inaccessible areas. There are many operating satellites, such as Ziyuan-3 (2.1 m), KOMPSAT-2 (1 m), Gaofen-2 (0.8 m), TripleSat (0.8 m), EROS B (0.7 m), KOMPSAT-3 (0.7 m), Pléiades 1A/1B (0.7 m), SuperView 1–4 (0.5 m), GeoEye-1 (0.46 m), WorldView-1/2 (0.46 m), and WorldView-3 (0.31 m) [1]. The satellites operate at low altitudes, such as 500–700 km, to achieve a high geospatial resolution. In addition, the satellite cameras are specially designed with focal lengths increased to around 10 m using a few aspherical mirrors. For example, WorldView-2, Pleiades-HR, and KOMPSAT-3 have focal lengths of 13.311, 12.905, and 8.562 m, respectively.

As a trade-off for the low altitude and long focal lengths, high-resolution satellite data show a relatively narrow field of view compared to mid- or low-resolution satellite data. WorldView-3, Pleiades-HR, and KOMPSAT-3, for example, have swath widths of 13.1, 20, and 16.8 km, respectively.
Note that mid-/low-resolution satellite data have swath widths of dozens or hundreds of kilometers. These high-resolution satellite cameras frequently use a combination of shorter CCD (Charge-Coupled Device) lines with a slight overlap to increase the swath width [2–6]. For example, IKONOS, QuickBird, and KOMPSAT-3 have three, six, and two overlapping PAN CCD lines, respectively, with shifts of the CCD lines in the scan direction. The merging of the sub-scenes from the CCD lines is carried out with precise camera calibration information. Each sub-scene is processed considering the sensor alignment, ephemeris effects, and terrain elevations to be merged into a single scene covering a larger swath [2,5].

After the sub-scene merging process, high-resolution satellite data are provided at different processing levels. For example, Maxar provides WorldView data in system-ready, view-ready, and map-ready categories. System-ready imagery allows users to perform custom photogrammetric processes, such as digital surface model (DSM) generation and orthorectification. View-ready imagery data are products that have already been photogrammetrically processed and are designed for users interested in remote sensing applications. Map-ready imagery is a base map that has been orthomosaicked. KOMPSAT-3 data from the Korea Aerospace Research Institute are available as Level 1R and Level 1G products. Level 1R is a product that has been corrected for radiometric and sensor distortions. Level 1G is the product corrected for geometric distortions, including optical distortions and terrain effects, and finally projected to the Universal Transverse Mercator coordinate system.

Many satellite data, including WorldView system-ready and KOMPSAT-3 products, are usually delivered as a single image. This is the case when the target area is small enough to lie within an archived image or when a new collection narrower than the swath width is requested. However, when the area of interest is large and crosses over multiple archived images, users receive a bundle of satellite images. The users then have to carry out a photogrammetric process for each image in the bundle to meet their application purposes.

Typical photogrammetric processes with the delivered bundle of images include orthorectification and mosaicking to create a complete image map. Orthorectification requires accurate sensor modeling information, such as a physical model or rational polynomial coefficients (RPCs), and a DSM of the target area. Before orthorectification and mosaicking, users should carry out bias compensation of the original sensor model information using ground controls to meet mapping requirements [7]. Then, each image is orthorectified onto the DSM, and the resulting orthoimages are mosaicked into an image map.

There have been many studies on high-resolution satellite image mosaics in ground coordinates [8–12]. The proposed algorithms deal with radiometric differences between images caused by seasonal changes [8], image registration and cloud detection with removal [9,10], efficient processing [11], and color balancing [12,13]. Most studies are carried out with photogrammetrically processed orthoimages. However, the cost of these photogrammetric processes increases with the number of images in the delivered bundle.

With a mosaicked image at the sensor level, delivered with only radiometric correction, users can take advantage of more efficient and convenient photogrammetric data processing and management of the simplified data. However, to our knowledge, no work on sensor-level image mosaicking carried out before photogrammetric processing has been reported. Firstly, if users receive a single image with single sensor model information instead of multiple data sets, the sensor modeling burden is reduced because users do not have to identify ground control points in multiple images. In addition, tie point extraction over multiple images is not required for accurate co-registration between the images. Secondly, the orthorectification and mosaic process is simplified because orthorectifying a single image is simpler, and mosaicking methods, including seamline generation, are not required.

Therefore, we propose a sensor-level mosaic to generate a seamless image product with geometric accuracy that meets mapping requirements. The approach differs from the ground-level mosaic, as depicted in Figure 1. The ground-level mosaic is carried out by orthorectifying each image strip to the ground, followed by seamline extraction and mosaicking. As a result, each pixel in the mosaicked image is assigned map coordinates. In contrast, in the sensor-level mosaic, each image is projected into a reference sensor plane to be merged. The resulting image has single sensor modeling information relating the mosaic image to the ground.

**Figure 1.** Sensor-level mosaic vs. ground-level mosaic.

The proposed method begins with setting one image as the reference. Each pixel of the other images is projected to the ground using its sensor model information and SRTM (Shuttle Radar Topography Mission) data [14] and then projected into the reference using the reference sensor model information. The problem is that the sensor model information is erroneous, such that a large geometric discrepancy occurs due to the satellite's inaccurate position and attitude information. Therefore, we model and remove the irregular difference along the image line using image matching and outlier removal in the overlapping area.

The paper is structured as follows. Section 2 describes the methodology with a flowchart, using RPCs as the sensor model for the image projections. Section 3 presents the experimental results for three KOMPSAT-3 strips. The discussion and conclusions are presented in Sections 4 and 5, respectively.

#### **2. Methods**

The flowchart of the study is given in Figure 2. Given multiple partially overlapping image strips (*n* images in the figure) and sensor models covering the area of interest, the image that partially overlaps with the other images is chosen as the reference image. Each pixel of the other images (collateral images) is first projected to the ground using the SRTM DEM and then back-projected onto the reference image space. These projections produce (*n* − 1) projected images partially overlapping the reference image. Next, image matching is carried out to extract tie points in the overlap area. Many matching outliers are expected because of radiometric and geometric differences, so they must be detected and removed accurately. The discrepancy is expected to show irregular patterns along the image line because of the push-broom sensor characteristics: each image line has different position and attitude information. Therefore, we modeled the discrepancy with polynomials after dividing the whole image strip into multiple sub-image regions. Based on the polynomial model, outliers are detected and removed in each sub-image region. This leads to an outlier-suppressed tie point set, which enables the irregular discrepancy estimation. The mosaicked image strip can be generated after compensating for the image line discrepancy. Finally, single sensor model information for the mosaic image strip is generated.

**Figure 2.** Flowchart of the proposed method for the sensor-level mosaic.

#### *2.1. Projection onto the Reference Image*

Except for the reference image, the other images, i.e., the collateral images, must be projected onto the reference image space using the sensor modeling information. This study used RPCs instead of the physical model for compatibility, with little difference in accuracy [15].

1. Ground-to-image projection:

Ground-to-image projection is called the forward projection, whose equations are given in Equation (1). Given 3D ground coordinates (*φ*, *λ*, *h*), the corresponding image coordinates (*l*, *s*) can be obtained from the non-linear equations with 78 coefficients (RPCs) [16].

$$Y = \frac{Num_L(U,V,W)}{Den_L(U,V,W)} = \frac{\mathbf{a}^{\mathrm{T}}\mathbf{u}}{\mathbf{b}^{\mathrm{T}}\mathbf{u}}, \qquad X = \frac{Num_S(U,V,W)}{Den_S(U,V,W)} = \frac{\mathbf{c}^{\mathrm{T}}\mathbf{u}}{\mathbf{d}^{\mathrm{T}}\mathbf{u}} \tag{1}$$

with

$$\begin{cases} U = \dfrac{\varphi - \varphi_O}{\varphi_S},\; V = \dfrac{\lambda - \lambda_O}{\lambda_S},\; W = \dfrac{h - h_O}{h_S},\; Y = \dfrac{l - L_O}{L_S},\; X = \dfrac{s - S_O}{S_S} \\[6pt] \mathbf{u} = \begin{bmatrix} 1 & V & U & W & VU & VW & UW & V^2 & U^2 & W^2 & UVW & V^3 & VU^2 & VW^2 & V^2U & U^3 & UW^2 & V^2W & U^2W & W^3 \end{bmatrix}^{\mathrm{T}} \\[6pt] \mathbf{a} = \begin{bmatrix} a_1 & a_2 & \cdots & a_{20} \end{bmatrix}^{\mathrm{T}},\; \mathbf{b} = \begin{bmatrix} 1 & b_2 & \cdots & b_{20} \end{bmatrix}^{\mathrm{T}},\; \mathbf{c} = \begin{bmatrix} c_1 & c_2 & \cdots & c_{20} \end{bmatrix}^{\mathrm{T}},\; \mathbf{d} = \begin{bmatrix} 1 & d_2 & \cdots & d_{20} \end{bmatrix}^{\mathrm{T}} \end{cases}$$

where (*φ*, *λ*, *h*) are the geodetic latitude, longitude, and ellipsoidal height; (*l*, *s*) are the image row and column coordinates; (*X*, *Y*) and (*U*, *V*, *W*) are the normalized image and ground coordinates, respectively; and (*φO*, *λO*, *hO*, *SO*, *LO*) and (*φS*, *λS*, *hS*, *SS*, *LS*) are the offset and scale factors, respectively, for the latitude, longitude, height, column, and row.
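As an illustration of Equation (1), the following minimal Python sketch evaluates the forward projection for a single ground point. The layout of the `rpc` dictionary (coefficient vectors `a`–`d` and the offset/scale keys) is an assumption for illustration, not a defined product format, and the 20 polynomial terms follow the ordering of the vector **u** above.

```python
import numpy as np

def rpc_terms(U, V, W):
    # The 20 cubic polynomial terms, in the order of the vector u in Equation (1)
    return np.array([1, V, U, W, V*U, V*W, U*W, V**2, U**2, W**2,
                     U*V*W, V**3, V*U**2, V*W**2, V**2*U,
                     U**3, U*W**2, V**2*W, U**2*W, W**3])

def forward_project(lat, lon, h, rpc):
    """Ground (lat, lon, h) -> image (line, sample) using Equation (1)."""
    U = (lat - rpc['lat_off']) / rpc['lat_scale']
    V = (lon - rpc['lon_off']) / rpc['lon_scale']
    W = (h - rpc['h_off']) / rpc['h_scale']
    u = rpc_terms(U, V, W)
    Y = (rpc['a'] @ u) / (rpc['b'] @ u)   # normalized line coordinate
    X = (rpc['c'] @ u) / (rpc['d'] @ u)   # normalized sample coordinate
    line = Y * rpc['line_scale'] + rpc['line_off']
    sample = X * rpc['samp_scale'] + rpc['samp_off']
    return line, sample
```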

However, the major problem is that the target elevation must be given, and there is no closed-form solution for the ground elevation computation. Figure 3 depicts the iterative ground elevation search process. Given an image point, the first image-to-ground projection is performed at a reference elevation, such as the mean elevation of the RPCs. The computed horizontal coordinates are used to look up the ground elevation in the SRTM DEM. Next, the second image-to-ground projection is carried out for the estimated ground elevation. This iterative process continues until the computed horizontal coordinates no longer change.
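A sketch of the iterative elevation search in Figure 3 is given below. It assumes a `backward_project(line, sample, h, rpc)` function implementing the image-to-ground projection of Equation (2) (see the sketch in the next subsection) and an illustrative `dem_lookup(lat, lon)` helper that returns the SRTM elevation at a horizontal position.

```python
def ground_point_on_dem(line, sample, rpc, dem_lookup, tol=1e-7, max_iter=20):
    """Iteratively intersect an image ray with the SRTM DEM (Figure 3)."""
    h = rpc['h_off']                          # start from the reference (mean) elevation of the RPCs
    lat, lon = backward_project(line, sample, h, rpc)
    for _ in range(max_iter):
        h = dem_lookup(lat, lon)              # terrain elevation at the current horizontal estimate
        lat_new, lon_new = backward_project(line, sample, h, rpc)
        if abs(lat_new - lat) < tol and abs(lon_new - lon) < tol:
            lat, lon = lat_new, lon_new
            break                             # horizontal coordinates no longer change
        lat, lon = lat_new, lon_new
    return lat, lon, h
```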

**Figure 3.** Iterative ground elevation search.

2. Image-to-ground projection:

Image-to-ground projection is called the backward projection. Given image coordinates (*l*, *s*) and a ground elevation (*h*), the horizontal ground coordinates (*φ*, *λ*) are computed using Equation (2). The backward projection is non-linear and must be solved iteratively after linearization, as in Equation (2). The linearized equation requires initial horizontal ground coordinates *φ*<sup>0</sup>, *λ*<sup>0</sup> for *U*<sup>0</sup>, *V*<sup>0</sup>. The solution is obtained by iterating until (*dU*, *dV*) approaches zero.

$$\begin{bmatrix} V \\ U \end{bmatrix} = \begin{bmatrix} V^0 \\ U^0 \end{bmatrix} + \begin{bmatrix} dV \\ dU \end{bmatrix} \tag{2}$$

$$\begin{bmatrix} dV \\ dU \end{bmatrix} = \begin{bmatrix} \left. \dfrac{\partial Y}{\partial V} \right|_{V=V^0} & \left. \dfrac{\partial Y}{\partial U} \right|_{U=U^0} \\[8pt] \left. \dfrac{\partial X}{\partial V} \right|_{V=V^0} & \left. \dfrac{\partial X}{\partial U} \right|_{U=U^0} \end{bmatrix}^{-1} \begin{bmatrix} Y - Y^0 \\ X - X^0 \end{bmatrix}$$

where

$$Y^0 = \frac{\mathbf{a}^{\mathrm{T}}\mathbf{u}^0}{\mathbf{b}^{\mathrm{T}}\mathbf{u}^0}, \qquad X^0 = \frac{\mathbf{c}^{\mathrm{T}}\mathbf{u}^0}{\mathbf{d}^{\mathrm{T}}\mathbf{u}^0}$$

$$\mathbf{u}^0 = \begin{bmatrix} 1 & V^0 & U^0 & W & V^0 U^0 & V^0 W & U^0 W & (V^0)^2 & (U^0)^2 & W^2 & U^0 V^0 W & (V^0)^3 & V^0 (U^0)^2 & V^0 W^2 & (V^0)^2 U^0 & (U^0)^3 & U^0 W^2 & (V^0)^2 W & (U^0)^2 W & W^3 \end{bmatrix}^{\mathrm{T}}$$

$$\frac{\partial Y}{\partial V} = \frac{\partial Y}{\partial \mathbf{u}^{\mathrm{T}}}\frac{\partial \mathbf{u}}{\partial V}, \quad \frac{\partial Y}{\partial U} = \frac{\partial Y}{\partial \mathbf{u}^{\mathrm{T}}}\frac{\partial \mathbf{u}}{\partial U}, \quad \frac{\partial X}{\partial V} = \frac{\partial X}{\partial \mathbf{u}^{\mathrm{T}}}\frac{\partial \mathbf{u}}{\partial V}, \quad \frac{\partial X}{\partial U} = \frac{\partial X}{\partial \mathbf{u}^{\mathrm{T}}}\frac{\partial \mathbf{u}}{\partial U}$$
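The following sketch solves the linearized system of Equation (2) with a simple Newton iteration. It reuses `rpc_terms` and the `rpc` dictionary from the forward-projection sketch; for brevity, the partial derivatives are evaluated numerically rather than analytically, which is an implementation choice, not the paper's stated method.

```python
import numpy as np

def backward_project(line, sample, h, rpc, tol=1e-10, max_iter=50):
    """Image (line, sample) plus elevation h -> ground (lat, lon), Equation (2)."""
    Y = (line - rpc['line_off']) / rpc['line_scale']
    X = (sample - rpc['samp_off']) / rpc['samp_scale']
    W = (h - rpc['h_off']) / rpc['h_scale']
    U, V = 0.0, 0.0                                        # initial normalized ground coordinates

    def f(UU, VV):
        u = rpc_terms(UU, VV, W)
        return np.array([(rpc['a'] @ u) / (rpc['b'] @ u),
                         (rpc['c'] @ u) / (rpc['d'] @ u)])  # [Y, X] evaluated at (UU, VV)

    for _ in range(max_iter):
        F0 = f(U, V)
        eps = 1e-6                                         # numerical partial derivatives
        J = np.column_stack([(f(U, V + eps) - F0) / eps,   # d[Y, X]/dV
                             (f(U + eps, V) - F0) / eps])  # d[Y, X]/dU
        dV, dU = np.linalg.solve(J, np.array([Y, X]) - F0)
        V, U = V + dV, U + dU
        if abs(dV) < tol and abs(dU) < tol:
            break
    lat = U * rpc['lat_scale'] + rpc['lat_off']
    lon = V * rpc['lon_scale'] + rpc['lon_off']
    return lat, lon
```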

#### *2.2. Image Matching and Outlier Removal*

Image matching in the overlap area is carried out to extract tie points for discrepancy compensation. This study uses template matching based on *NCC* (Normalized Cross-Correlation), as in Equation (3). The similarity between the reference and projected images is measured using *NCC*. A match with an *NCC* larger than 0.5 is typically considered similar, but a higher threshold such as 0.7 is preferred to reduce matching outliers.

$$NCC = \frac{\sum_{i=1}^{w} \sum_{j=1}^{w} \left( R_{ij} - \overline{R} \right) \left( P_{ij} - \overline{P} \right)}{\sqrt{\left[ \sum_{i=1}^{w} \sum_{j=1}^{w} \left( R_{ij} - \overline{R} \right)^2 \right] \left[ \sum_{i=1}^{w} \sum_{j=1}^{w} \left( P_{ij} - \overline{P} \right)^2 \right]}} \tag{3}$$

where *R* is a patch in the reference image and *P* is a patch within the established search region in the projected image, both of size *w* × *w*. The overbarred *R* and *P* in Equation (3) are the averages of all intensity values in the respective patches.
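A brute-force sketch of the NCC template matching in Equation (3) is shown below, using the window size (77 × 77), search range (60 pixels), and threshold (0.7) reported later in Section 3.4; in practice a library routine such as OpenCV's template matching would typically be used instead.

```python
import numpy as np

def ncc(ref_patch, proj_patch):
    """Normalized cross-correlation between two equally sized patches (Equation (3))."""
    r = ref_patch - ref_patch.mean()
    p = proj_patch - proj_patch.mean()
    denom = np.sqrt((r ** 2).sum() * (p ** 2).sum())
    return (r * p).sum() / denom if denom > 0 else 0.0

def match_point(ref_img, proj_img, row, col, w=77, search=60, thr=0.7):
    """Search the projected image around (row, col) for the best NCC match."""
    half = w // 2
    template = ref_img[row - half:row + half + 1, col - half:col + half + 1]
    best_pos, best_score = None, -1.0
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            patch = proj_img[row + dr - half:row + dr + half + 1,
                             col + dc - half:col + dc + half + 1]
            if patch.shape != template.shape:
                continue                      # skip windows falling outside the image
            score = ncc(template, patch)
            if score > best_score:
                best_pos, best_score = (row + dr, col + dc), score
    return (best_pos, best_score) if best_score >= thr else (None, best_score)
```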

These automated image matchings often produce many mismatches that must be detected and removed. RANSAC (Random Sample Consensus) is a popular outlier detection method [17] because it iteratively estimates model parameters from a data set that includes outliers.
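Below is a minimal RANSAC sketch for the outlier removal used later in Section 3.4: a second-order polynomial relating the image line to the coordinate difference is repeatedly fitted to random minimal samples, and the largest consensus set is kept. The iteration count and inlier tolerance are illustrative values, not those of the paper.

```python
import numpy as np

def ransac_poly2(lines, diffs, n_iter=1000, inlier_tol=2.0, seed=0):
    """Fit diff = f(line) with a 2nd-order polynomial under RANSAC; return (coef, inlier mask)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(lines), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(lines), size=3, replace=False)    # minimal sample for 3 coefficients
        coef = np.polyfit(lines[idx], diffs[idx], deg=2)
        resid = np.abs(np.polyval(coef, lines) - diffs)
        inliers = resid < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    coef = np.polyfit(lines[best_inliers], diffs[best_inliers], deg=2)  # refit on all inliers
    return coef, best_inliers
```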

#### *2.3. Piecewise Discrepancy Compensation*

High-resolution satellite image strips are acquired using a push-broom sensor, which uses a line of detectors arranged perpendicular to the flight direction of the spacecraft. As the spacecraft flies forward, the image is collected one line at a time, with all of the pixels in a line measured simultaneously.

This mechanism produces an irregular geometric discrepancy between adjacent strips along the image line. We applied a piecewise discrepancy compensation that models the local difference over a block of image lines, as depicted in Figure 4. However, there is a possibility of discontinuity between adjacent image pieces. Therefore, we model each local discrepancy with a spline curve.

**Figure 4.** Piecewise discrepancy compensation.

The sensor model for the mosaic image strip must also be generated for the photogrammetric processes. Since the mosaic image consists of several image strips with different sensor modeling information, the RPCs for the mosaic can be generated by bias-compensating the RPCs of the reference, considering the estimated compensations to the adjacent images [14].

#### **3. Experimental Results**

#### *3.1. Data*

The test data are three image strips of the KOMPSAT-3 Level 1R product over Romania; the specifications are listed in Table 1. The acquisition dates are 8 and 24 April and 4 May 2018. The strips have long image line sizes of up to 60,000–70,000 pixels with an image swath width of 24,060 pixels. Each image strip is made up of three image scenes with over 20,000 image lines each. The acquisition geometry includes the incidence and azimuth angles. Strips #1 and #3 have similar geometry and low incidence angles. The small incidence angles of Strips #1 and #3 produce a smaller GSD (Ground Sample Distance) than Strip #2, which has a relatively large incidence angle. Note that the azimuth angle of Strip #2 is in an almost opposite direction from those of the others.

**Table 1.** Test data specification.


Figure 5 shows the three data strips. Strip #2 is located in the center with partial overlap with the other strips.

**Figure 5.** Test image strip of three scenes: (**a**) Strip #1, (**b**) Strip #2, (**c**) Strip #3.

#### *3.2. Sensor Modeling of Each Image Strip*

The long strip images were delivered with ephemeris and attitude data for physical sensor modeling. However, RPCs are more compatible and easier to use than the physical sensor model, whereas the accuracy is similar. Therefore, we first converted the physical sensor model of each strip into RPCs. The conversion into RPCs was conducted by interpolating the satellite attitude information, such as the roll, pitch, and yaw angles, with a first-order equation.

Figure 6 depicts the interpolation residuals for the roll angles of Strip #1, demonstrating that the original roll angle varies locally along the image line. The conversion residuals from the physical model into RPCs are presented in Table 2 for two cases: using the original ephemeris and using the interpolated ephemeris. Using the interpolated ephemeris yields slightly better residuals than the other case, which is affected by the local variation in the ephemeris. In Strip #1, the residual in the sample direction improved by more than one pixel.

**Figure 6.** Difference between the original and the interpolated roll angles (Strip #1).



#### *3.3. Projection of Each Image onto the Reference*

We set the center strip (Strip #2) as the reference. Then, we projected each image onto the reference image space using the generated RPCs with the 1 arcsec SRTM DEM. First, the reference image space is extended to the sides for the image resampling. A point in the extended reference image space is iteratively projected onto the SRTM DEM as explained in Figure 3, followed by a ground-to-image projection to look up the corresponding digital number in the adjacent strips. Figure 7 depicts the three overlaid strips side by side.

**Figure 7.** Projected images onto the reference image space.

#### *3.4. Image Matching and Outlier Removal in an Overlap Area*

We generated a grid with spacings of 50 and 100 pixels along the line and sample directions in the overlap area, respectively. Then, we carried out *NCC* image matching between the reference and the adjacent projected images for the grid points. As matching parameters, we used a 77 × 77 pixel matching window and a search range of 60 pixels. We selected the matching parameters considering the geolocation accuracy of the KOMPSAT-3 sensor modeling, which is 48.5 m (CE90, Circular Error at 90% confidence).

Matching pairs showing an *NCC* larger than 0.7 were selected as matching candidates in this study. Then, the image coordinate differences were computed between the matching pairs and plotted in Figure 8. Figure 8a,b shows the line and sample coordinate differences between Strips #1 and #2. Figure 8c,d shows the line and sample coordinate differences between Strips #2 and #3. The blue dots show all the coordinate differences for the matching candidates.

We applied the RANSAC algorithm with second-order polynomial models to the line and sample coordinate differences to suppress the matching outliers. The polynomial model was applied to each scene in an image strip. The red dots show the results after the outlier removal.

**Figure 8.** Discrepancy between the image coordinates in the matching pair—(**a**) line difference between Strips #1 and #2; (**b**) sample difference between Strips #1 and #2; (**c**) line difference between Strips #2 and #3; (**d**) sample difference between Strips #2 and #3.

#### *3.5. Piecewise Discrepancy Compensation*

After removing the matching outliers, we can estimate the discrepancy compensation of the projected image by averaging the image coordinate differences between the matching pairs. However, the discrepancy varies for each image line. As shown in Figure 7, averaging the discrepancies of a single image line may produce inaccurate compensation values because there are no redundant matching pairs in an image line. Therefore, we estimated the local discrepancy compensation in the line and sample directions by averaging the discrepancies over a block of image lines, such as 500 image lines. In addition, we interpolated the averaged differences using a spline curve along the image line to ensure continuity between the compensated image blocks.
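The block averaging and spline interpolation can be sketched as follows, assuming the outlier-free tie points from the previous step; the 500-line block size is taken from the text, while the use of `scipy.interpolate.CubicSpline` is an implementation choice for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def piecewise_compensation(lines, diffs, block=500):
    """Average discrepancies per block of image lines and interpolate them with a spline."""
    lines = np.asarray(lines)
    diffs = np.asarray(diffs)
    block_ids = (lines // block).astype(int)
    centers, means = [], []
    for b in np.unique(block_ids):
        sel = block_ids == b
        centers.append(b * block + block / 2.0)    # block center line
        means.append(diffs[sel].mean())            # averaged discrepancy within the block
    return CubicSpline(centers, means)             # callable: compensation(line)

# usage sketch: separate splines for the line and sample directions
# comp_line = piecewise_compensation(tie_lines, line_diffs)
# comp_samp = piecewise_compensation(tie_lines, samp_diffs)
```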

Figure 9 shows the estimated local discrepancy compensation for the line and sample directions for every 500 image lines after the spline interpolation. In other words, the red line was derived by averaging the red dots in Figure 8 for every 500 image lines and interpolating them with the spline curve. Figure 9a,b shows the line and sample compensations for Strip #1, and Figure 9c,d are for Strip #3. The compensations for the sample coordinates, ranging from 30 to 44 pixels, are much larger than those for the line coordinates.

**Figure 9.** Estimated discrepancy compensation—(**a**) line compensation for Strip #1; (**b**) sample compensation for Strip #1; (**c**) line compensation for Strip #3; (**d**) sample compensation for Strip #3.

The piecewise image compensation produced the final strip mosaic shown in Figure 10. Note that color balancing was not carried out in this study. Some examples showing the geometric consistency at the strip boundary, even over building areas, are presented in Figure 11.

**Figure 10.** Final strip mosaic.

**Figure 11.** Sample images showing geometric consistency at a boundary.

#### *3.6. Sensor Model Information Generation*

After the sensor-level strip mosaic was completed, the sensor modeling information for the single mosaic strip was generated for the photogrammetric process. A 7 × 7 × 7 cubic grid covering the whole mosaic image strip was established over the ground, and the grid points were projected onto the mosaic strip to obtain the corresponding image coordinates. First, only the RPCs of the center strip (Strip #2) were extended to cover the whole mosaic image boundary. Secondly, the three sets of RPCs were processed together to generate the ground and image coordinate sets for single RPC generation.
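The grid-based generation of the fitting set for the mosaic RPCs can be sketched as below. `mosaic_forward_project`, the projection of a ground point into the mosaic image space after extension and discrepancy compensation, is an assumed helper name, and the least-squares RPC fit itself is omitted.

```python
import itertools
import numpy as np

def build_rpc_fitting_set(lat_rng, lon_rng, h_rng, mosaic_forward_project, n=7):
    """Generate (ground, image) coordinate pairs on an n x n x n grid for RPC estimation."""
    lats = np.linspace(lat_rng[0], lat_rng[1], n)
    lons = np.linspace(lon_rng[0], lon_rng[1], n)
    hs = np.linspace(h_rng[0], h_rng[1], n)
    ground, image = [], []
    for lat, lon, h in itertools.product(lats, lons, hs):
        line, sample = mosaic_forward_project(lat, lon, h)   # project each grid point into the mosaic
        ground.append((lat, lon, h))
        image.append((line, sample))
    return np.array(ground), np.array(image)                 # inputs to a least-squares RPC fit
```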

To check the accuracy of the generated RPCs, we collected 25 GCPs over the mosaic strip from Google Earth, as shown in Figure 12. We used Google Earth Pro to extract the horizontal and vertical coordinates. Though the accuracy of Google Earth may differ depending on the area, a positional accuracy of a few meters was reported near urban areas in Europe [18]. First, using the 25 GCPs as checkpoints, we estimated the accuracy of the aforementioned two sets of RPCs, for the center strip and the mosaic strip, as shown in Table 3. The RPCs of the center strip showed a rather low positional accuracy of 4.02 and 40.07 pixels in RMSE for the line and sample directions, respectively. However, the RPCs of the mosaic showed much better results of 2.88 and 21.07 pixels in RMSE for the line and sample directions. The accuracy improvement ranged from 18% to 47.4%. The geolocation performance of the resulting mosaic RPCs appears to be improved by the averaging effects of all the RPCs of the input data. The RPCs of the mosaic should become even more accurate if more image strips are used for the mosaic.

**Table 3.** Accuracy of mosaic strip RPCs (unit: pixels).


**Figure 12.** GCP distribution with the number.

Next, the bias compensation of the mosaic RPCs was carried out with the GCPs, and the improved accuracy is presented in Table 4. The bias compensation is a process to improve the input sensor modeling using ground controls. The biases are estimated in image coordinates using the GCPs and compensated for better accuracy [7]. The errors of the mosaic RPCs were compensated in the line and sample directions with constant values estimated from the GCPs. Table 4 shows the accuracy of the RPCs after the compensation process. The compensated RPCs showed adequate accuracies ranging from 1.4 to 3.3 pixels in RMSE compared to the ones shown in Table 3.
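A sketch of the constant-shift bias compensation is shown below: the mean line/sample residual at the GCPs is estimated and then applied to subsequent projections. It reuses the `forward_project` sketch from Section 2.1; the GCP tuple layout is illustrative.

```python
import numpy as np

def estimate_bias(gcps, rpc):
    """Estimate constant line/sample biases from GCPs given as (lat, lon, h, line_obs, samp_obs)."""
    residuals = []
    for lat, lon, h, line_obs, samp_obs in gcps:
        line_rpc, samp_rpc = forward_project(lat, lon, h, rpc)
        residuals.append((line_obs - line_rpc, samp_obs - samp_rpc))
    return np.mean(residuals, axis=0)                 # (bias_line, bias_sample)

def compensated_project(lat, lon, h, rpc, bias):
    """Forward projection with the constant biases applied."""
    line, samp = forward_project(lat, lon, h, rpc)
    return line + bias[0], samp + bias[1]
```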

**Table 4.** Accuracy of mosaic strip RPCs after the bias compensation (unit: pixels).


#### **4. Discussion**

In this study, we used RPCs instead of rigorous sensor modeling. This choice was made for easier and more efficient processing as well as compatibility. However, satellite image providers may use the same approach with their physical sensor model. Regarding image matching, the matching window size and search area can be better optimized considering the area of interest and the satellite data specification. For example, fewer features would require a larger matching window size, and satellites with precise sensor models would require a smaller search area. In addition, feature-based image matching methods can be used instead [19]. The discrepancy patterns between image strips in line and sample coordinates will differ between satellite data sets. Data with a stable ephemeris would show rather regular discrepancy patterns along the image lines. However, in any case, the compensation should not be estimated for each image line individually because there are no redundant matching pairs on a single image line. The sensor modeling of the mosaic tends to be more accurate than that of each image strip due to the averaging effects. Therefore, a mosaic of more image strips would produce better positional accuracy [20].

As shown in the resulting mosaic, radiometric differences between the three strips are observed due to the differences in acquisition dates and angles. The focus of this study is on minimizing the geometric discrepancy and generating single sensor model information. Therefore, we have not treated the radiometry in this study, and future research will include sensor-level radiometric adjustment between the input image strips.

Note that the proposed method is different from the conventional image mosaic carried out with orthorectified images. The proposed sensor-level mosaic is carried out before the photogrammetric processes, including the sensor modeling and orthorectification. Therefore, users can perform their photogrammetric function with the mosaic and the sensor model information.

#### **5. Conclusions**

High-resolution satellite images show relatively narrow swath widths, such that users often face obstacles in orthorectifying and mosaicking a bundle of delivered images to create a complete image map. Therefore, the proposed sensor-level mosaicking can generate a seamless image product with improved geometric accuracy. The experimental result with KOMPSAT-3 data showed that the irregular discrepancy between the input images due to the differences in acquisition angles could be minimized for geometric continuity in the resulting mosaic image. In addition, single sensor modeling information for the mosaic image could be generated for later photogrammetric processes. The accuracy improvement of the sensor modeling ranged from 18% to 47.4%. Therefore, we believe that the proposed sensor-level mosaic method enables users to take advantage of more efficient and convenient photogrammetric data processing.

**Author Contributions:** Conceptualization, C.L.; data curation, C.L.; formal analysis, C.L. and J.O.; methodology, C.L. and J.O.; validation, C.L. and J.O.; writing—original draft, J.O.; writing—review and editing, C.L. and J.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by National Research Foundation of Korea, grant number 2019R1I1A3A01062109.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


### *Article* **Coupling Denoising to Detection for SAR Imagery**

**Sujin Shin \*, Youngjung Kim, Insu Hwang, Junhee Kim and Sungho Kim**

Agency for Defense Development, Institute of Defense Advanced Technology Research, Daejeon 34186, Korea; read12300@add.re.kr (Y.K.); hciinsu@add.re.kr (I.H.); kjh1127@add.re.kr (J.K.); cocktail@add.re.kr (S.K.) **\*** Correspondence: sujinshin@add.re.kr; Tel.: +82-42-821-4639

**Featured Application: The proposed object detection framework aims to improve detection performance for noisy SAR images, which is applicable for general object detection in SAR imagery: recognition of militarily important targets such as ships and aircrafts or monitoring for abnormal civilian events.**

**Abstract:** Detecting objects in synthetic aperture radar (SAR) imagery has received much attention in recent years since SAR can operate in all-weather and day-and-night conditions. Due to the prosperity and development of convolutional neural networks (CNNs), many methodologies have been proposed for SAR object detection. In spite of this progress, existing detection networks are still limited in boosting detection performance because of the inherently noisy characteristics of SAR imagery; hence, a separate preprocessing step such as denoising (despeckling) is required before utilizing the SAR images for deep learning. However, inappropriate denoising techniques might cause loss of detailed information, and even proper denoising methods do not always guarantee performance improvement. In this paper, we therefore propose a novel object detection framework that combines an unsupervised denoising network with a traditional two-stage detection network and leverages a strategy for fusing region proposals extracted from both the raw SAR image and a synthetically denoised SAR image. Extensive experiments validate the effectiveness of our framework on our own object detection datasets constructed with remote sensing images from the TerraSAR-X and COSMO-SkyMed satellites. The proposed framework shows better performance than models using only noisy SAR images or only denoised SAR images after despeckling, under multiple backbone networks.

**Keywords:** denoising; detection; SAR imagery; fusing region proposals

#### **1. Introduction**

Synthetic aperture radar (SAR) is a type of radar system used to reconstruct 2D or 3D terrain and objects on the ground (or over oceans). The SAR system synthesizes a long virtual aperture through a coherent combination of the signals received from objects. The radar transmits pulses of microwave radiation, and the synthesized aperture has the effect of narrowing the effective beam width in the azimuth direction, thus achieving high resolution. By coherently combining the return signals received by the on-board radar antenna, SAR overcomes the main limitation of traditional systems, in which the azimuth resolution is determined by the physical antenna size. Optical and infrared sensors are passive since they detect objects by reflected light and emitted signals from the objects, respectively, while radars actively transmit and receive radar waves, operating in all-weather and day-and-night conditions.

Thanks to these useful characteristics, available under all weather conditions and also during night-time, SAR images are especially applied to military reconnaissance, as most military operations take place at night or in poor weather conditions. There is a variety of applications, such as information and electronic warfare, target recognition of aircraft that maneuver irregularly, battlefield situational awareness, and development of aircraft that are hard for the other party to track with radar.
In addition, it is necessary to study object detection using radar imagery for civilian applications (e.g., resource exploration, environmental monitoring, etc.).

With the recent rapid development of deep learning, many deep convolutional neural network (CNN)-based object detection approaches using SAR imagery have gained increased attention. The successes of deep detectors on SAR images facilitate a wide range of civil and military applications, such as detection of ships [1–5], aircraft [6–9], destroyed buildings [10], oceanic internal waves [11], oceanic eddies [12], oil spills [13], avalanches [14], and troughs [15]. For further research purposes, several SAR object detection datasets have also been released, namely AIR-SARShip-1.0 [16], SAR-Ship-Dataset [17], the SAR ship detection dataset (SSDD) [18], and HRSID [19].

SAR images are formed from a coherent sum of backscattered signal components at the boundaries of different media after pulsed transmission of microwave radiation, enabling observation of the interior of targets otherwise invisible to the naked eye. However, when obtaining SAR images, if the emitted pulses are reflected from the boundary of a target with an uneven surface, scattering and interference waves are created. These wave signals directly affect the SAR imaging of the target structure as noise components. The produced noise is often called *speckle noise*, which hinders the original image information and causes a speckle-corrupted SAR image, as shown in Figure 1. The scattering characterization of the target becomes more severe depending on changes in radial properties and orbital surfaces, leading to degradation of recognition performance. It is worth noting that a number of published studies have been conducted on denoising (or despeckling) SAR images [20–25].

**Figure 1.** Examples of real-world SAR images where noise-like speckle appears: (**a**) TerraSAR-X; (**b**) COSMO-SkyMed.

Many previous works first perform despeckling on SAR images as a preprocessing step and then utilize the SAR images for several tasks via deep learning, e.g., classification [26,27] and detection [28–30]. Separately processing the large number of SAR images results in high time consumption and low efficiency. Though various despeckling methods such as the Lee filter [22], Kuan filter [23], Frost filter [24], and Probabilistic Patch-Based (PPB) filter [25] have been proposed, taking an improper despeckling methodology without carefully considering the dataset characteristics may lead to poor performance due to the loss of information from the raw SAR images. Meanwhile, to further improve the visual quality of SAR images, there are other preprocessing methods such as contrast enhancement. Given that most SAR images are grayscale images, we can consider various processing methods, for example, fuzzy-based gray-level image contrast enhancement [31] or a fuzzy-based image processing algorithm [32].

To overcome this issue and directly promote object detection performance, it is significant and necessary to develop an object detection framework that incorporates a deep denoiser, replacing the separate denoising preprocessing step, into the classical object detection network. The motivation shares a similar spirit to the recent classification work proposed by Wang et al. [33], where a noise matrix is learned from an input noisy image and used to synthesize a despeckled image taken as the input of a subsequent classification network. To the best of our knowledge, we are the first to connect a denoising network to an object detection network. We additionally introduce a *fusing region proposals* approach, which fuses sets of Regions of Interest (RoIs) from both the noisy and denoised images, rather than simply ending with the coupling structure as in Wang et al. [33].

We propose a novel object detection framework whose core idea comprises two parts: (1) connecting an unsupervised denoising network to an object detection network to dynamically extract a denoised SAR image from a given noisy SAR image, and (2) forwarding an image pair of two SAR images (the given real SAR image and the synthetically generated SAR image) to the object detection network and fusing the region proposals from the two SAR images to complementarily integrate regional information. Here, *fusing region proposals* refers to merging two sets of RoIs yielded by a shared region proposal network within the object detection network. This is inspired by the observation that utilizing only the real SAR image may bring about false positives due to the inherent speckle noise of the image, while, on the contrary, depending only on the denoised SAR image may cause missing targets because inadequate denoising leads to loss of fine information from the raw data.

The rest of this paper is organized as follows. Section 2 mainly consists of two parts: the first part introduces our datasets constructed with SAR images from the TerraSAR-X and COSMO-SkyMed satellites, and the second part describes the detailed design of our proposed object detection framework, i.e., how to incorporate an unsupervised denoising network into an object detection network and fuse the region proposals within the object detection network. Section 3 reports comparative experimental results for the proposed object detection network on our own datasets. To validate the effectiveness of our approach, we carry out multiple experiments: (1) we experimentally demonstrate that our coupling structure between the denoising and detection networks can strengthen detection performance, (2) we further verify the proposed region proposal fusing strategy in terms of the input data for the detection network and the fusing method through ablation studies, and (3) we additionally perform comparative experiments with respect to the choice of the feature map extracted from either the real or synthetic SAR image, where the feature map refers to the output of the CNN backbone in the detection network. Section 4 presents the discussion of the experimental results together with an additional time complexity analysis. Finally, Section 5 includes the final remarks and a conclusion.

#### **2. Materials and Methods**

In this section, we describe SAR remote sensing datasets that we constructed and the proposed object detection framework which fuses region proposals utilizing denoised SAR image. The remote sensing datasets include not only SAR imagery but also corresponding labeled objects. We develop our object detection framework with the datasets and detail the proposed framework in the rest of this section.

#### *2.1. SAR Remote Sensing Dataset*

#### 2.1.1. Description

We constructed our datasets with 60 TerraSAR-X images from the German Aerospace Center [34] and 55 COSMO-SkyMed images from the Italian Space Agency [35], mainly covering harbor- and airport-peripheral areas. For the TerraSAR-X satellite, the images have resolutions from 0.6 m to 1 m and sizes ranging from about 6 k × 2 k to 11 k × 6 k pixels (sorted by area). For the COSMO-SkyMed satellite, the images have a resolution of 1 m and sizes ranging from about 13 k × 14 k to 20 k × 14 k pixels (sorted by area). Each remote sensing image is labeled by experts in aerial image interpretation with multiple categories: airplane (A), etcetera (E), and ship (S). The ship/airplane classes contain a variety of civil and military ships/airplanes, while the etcetera class includes support vehicles, air defense weapons, and air defense vehicles. Some example ship/airplane objects are shown in Figures 2 and 3 for TerraSAR-X and COSMO-SkyMed imagery, respectively.

**Figure 2.** Example airplane (top) and ship (bottom) objects in TerraSAR-X image. The groundtruth bounding boxes labeled as corresponding class are plotted in red color.

**Figure 3.** Example airplane (top) and ship (bottom) objects in COSMO-SkyMed image. The groundtruth bounding boxes labeled as corresponding class are plotted in red color.

Our labeled objects include a total of 15.7 k instances of 3 categories: 3.7 k instances for the A class, 0.2 k instances for the E class, and 11.8 k instances for the S class, which implies that our datasets are quite imbalanced between the categories and relatively skewed towards the S class. The class distribution by type of satellite imagery is plotted in Figure 4. Furthermore, target objects in our dataset exist at a variety of scales due to our multiresolution images and the variety of shapes, especially for ship objects. We measure the bounding box size of objects as *wbbox* × *hbbox* and present the frequency of boxes by size as a histogram in Figure 5, where *wbbox* and *hbbox* are the width and height of the bounding box, respectively.

**Figure 4.** Number of annotated instances per category for TerraSAR-X and COSMO-SkyMed imagery.

**Figure 5.** Histogram that exhibits the number of annotated instances with respect to area (width × height) in pixels.

#### 2.1.2. Comparison to other SAR Detection Datasets

Table 1 summarizes the detailed comparisons between our own constructed dataset and other publicly available SAR detection datasets, i.e., AIR-SARShip-1.0 [16], SSDD [18], SAR-Ship-Dataset [17], and HRSID [19]. SAR-Ship-Dataset is the dataset with the largest number of instances, followed by our own dataset. The primary differentiators of our dataset compared with the other datasets are (1) class diversity, covering the ship, aircraft, and etcetera classes, and (2) the number of scene areas. We obtained the SAR images from a variety of harbor and airport peripheral areas around the world and annotated objects of different shapes.

**Table 1.** Comparison of statistics among multiple datasets. We denote the number of instances, patches, and areas as # Instances, # Patches, and # Areas, respectively.


#### *2.2. Proposed Methodology*

Given the inherent speckle noise of SAR, researchers have previously performed a preprocessing step such as despeckling before training an object detection model. However, such prior preprocessing, independent of the performance of object detection, may not only be inefficient but also lead to weak detection performance because unintentionally improper denoising induces loss of detailed information. Therefore, we integrate a denoising network with a two-stage detection network so that the denoising network can directly receive feedback from the detection network, as illustrated in Figure 6.

We choose a blind-spot neural network [36]-based self-supervised scheme as the unsupervised denoising model and adopt Gamma noise modeling as in Speckle2Void [37], fitted to SAR speckle, although our framework is not limited to this model structure. We can train the unsupervised denoising model as a generator *G* that maps a real (noisy) SAR image *Ireal* to a synthetic (denoised) SAR image *G*(*Ireal*). The core idea of our model is to infer a synthetic denoised SAR image from the input SAR image and to merge the two sets of extracted RoIs to improve detection performance. Without any related material such as a corresponding denoised image for an input SAR image, we can autonomously simulate the denoised image and fuse the inferred information such as RoIs. The entire model enables effective end-to-end learning.

**Figure 6.** Overview of the proposed object detection framework: (1) connecting an unsupervised denoising network to an object detection network for dynamically extracting a denoised SAR image from a given noisy SAR image, and (2) forwarding an image pair of two SAR images to an object detection network and fusing region proposals from the two SAR images for complementarily integrating regional information.

The unsupervised denoising network *G* in our model first takes as input a real (noisy) SAR image *Ireal* and extracts a synthetic (denoised) SAR image *G*(*Ireal*) as the output. Then, the formed (real, synthetic) image pair (*Ireal*, *G*(*Ireal*)) is fed into a shared region proposal network, which outputs two corresponding feature maps and sets of RoIs. The two sets of RoIs B*real*, B*synth* are merged, and the redundant bounding boxes are subsequently removed by an NMS procedure, i.e., B*final* = *NMS*(B*real* ∪ B*synth*), where B*final* is the resultant set of fused bounding boxes. For each RoI in B*final*, the RoI feature vector pooled from the feature map of the real SAR image is then forwarded to obtain the classification and regression results, as in a traditional two-stage detection network.
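A minimal sketch of the fusing step B*final* = *NMS*(B*real* ∪ B*synth*) is shown below using `torchvision.ops.nms`; the box format (x1, y1, x2, y2) and the IoU threshold are illustrative assumptions rather than the paper's exact settings.

```python
import torch
from torchvision.ops import nms

def fuse_region_proposals(boxes_real, scores_real, boxes_synth, scores_synth, iou_thr=0.7):
    """Merge RoIs from the real and denoised images, then suppress redundant boxes with NMS."""
    boxes = torch.cat([boxes_real, boxes_synth], dim=0)     # (N_real + N_synth, 4), (x1, y1, x2, y2)
    scores = torch.cat([scores_real, scores_synth], dim=0)
    keep = nms(boxes, scores, iou_thr)                      # indices of surviving proposals
    return boxes[keep], scores[keep]
```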

Usually, only a single SAR image, either real or denoised (preprocessed), is employed for training an object detection network, as shown in Figure 7. Since real SAR images are inherently corrupted by speckle noise without any preprocessing, relying solely on the real SAR image for training may cause false alarms among the region proposals. On the other hand, utilizing denoised SAR images alone may be prone to missing targets because of the loss of detailed information. We, therefore, devise a novel denoising-based object detection network to make full use of the complementary advantages of the real and denoised SAR images.

**Figure 7.** Overview of the traditional two-stage object detection network given a real or denoised (preprocessed) SAR image as input.

To combine the extracted information from both the real and synthetic SAR images, we consider *fusing region proposals*, which merges the two sets of RoIs yielded by a region proposal network. Considering that there exist qualitative differences between the two sets of RoIs derived from the real and synthetic SAR images, the real and synthetic SAR images are separately trained by the region proposal network. After fusing the region proposals, we take the feature map from the real SAR image to preserve the global context information of the raw input SAR image.

The proposed architecture is trained end-to-end with a multi-task loss which mainly consists of (1) an unsupervised denoising loss, (2) region proposal losses, and (3) an RoI loss for classification and bounding-box regression. In particular, the region proposal network is trained on both the real and synthetic SAR images, and thus two distinct losses are defined. The final loss function that we propose is a weighted summation of all losses as follows.

$$\mathcal{L}(I_{real}) = \lambda_1 \mathcal{L}_{den}(I_{real}) + \lambda_2 \mathcal{L}_{rpn}^{real}(I_{real}) + \lambda_3 \mathcal{L}_{rpn}^{synth}\left( G(I_{real}) \right) + \lambda_4 \mathcal{L}_{roi}(\mathcal{B}_{final}) \tag{1}$$

where:

*Ireal* = a real (noisy) SAR image; *G*(*Ireal*) = a synthetic (denoised) image extracted from the denoising network *G*; B*final* = *NMS*(B*real* ∪ B*synth*), where B*real* and B*synth* are the sets of RoIs from *Ireal* and *G*(*Ireal*), respectively.

Here, L*den* denotes the unsupervised denoising loss; L*rpn* for the real and synthetic inputs gives the two region proposal losses of the RPN for *Ireal* and *G*(*Ireal*), respectively; and L*roi* refers to the sum of the classification and bounding-box regression losses over all RoIs in B*final*. *λ*1 to *λ*4 are hyperparameters that balance the interplay between the losses, and all are set to 1 in our experiments.
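The weighted summation in Equation (1) reduces to a simple combination of the four loss terms; a sketch with all λ set to 1, as in our experiments, is given below. The individual loss values are assumed to be computed elsewhere by the denoising network and the shared RPN/RoI heads.

```python
def total_loss(loss_den, loss_rpn_real, loss_rpn_synth, loss_roi,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of denoising, RPN (real/synthetic), and RoI losses (Equation (1))."""
    l1, l2, l3, l4 = lambdas
    return l1 * loss_den + l2 * loss_rpn_real + l3 * loss_rpn_synth + l4 * loss_roi
```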

#### **3. Results**

We first present the description of our experimental dataset settings in Section 3.1. Section 3.2 presents the details of our model architecture and the hyperparameter settings. Based on this implementation, we conduct extensive experiments to validate the contributions of the proposed model and Sections 3.3 and 3.4 contain the experimental results. Section 3.5 provides comprehensive ablation studies.

#### *3.1. Dataset Settings*

We acquired 60 TerraSAR-X raw scenes from the German Aerospace Center [34] and 55 COSMO-SkyMed raw scenes from the Italian Space Agency [35]. The raw scenes go through multiple stages, such as preprocessing, Doppler centroid estimation (DCE), and focusing, to obtain single look slant range complex (SSC) images. The SSC images are then converted to multi-look ground range detected (MGD) images by multi-looking procedures. From the MGD images, we create patches of size 800 × 800 via a sliding-window operation, with each patch containing at least one target object belonging to the airplane (A), etcetera (E), or ship (S) category. Finally, we randomly split the patches into 80% for training and 20% for testing.
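A sketch of the sliding-window patch extraction is given below; the 800 × 800 patch size is taken from the text, while the stride and the rule of keeping a patch when a ground-truth box center falls inside it are assumptions made for illustration.

```python
import numpy as np

def make_patches(image, boxes, patch=800, stride=800):
    """Cut an MGD scene into patch x patch windows that contain at least one labeled object."""
    H, W = image.shape[:2]
    centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]
    patches = []
    for y in range(0, max(H - patch, 0) + 1, stride):
        for x in range(0, max(W - patch, 0) + 1, stride):
            # keep the window if any ground-truth box center falls inside it
            if any(x <= cx < x + patch and y <= cy < y + patch for cx, cy in centers):
                patches.append(((x, y, x + patch, y + patch), image[y:y + patch, x:x + patch]))
    return patches
```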

#### *3.2. Implementation Details*

We implemented our unsupervised denoising model following the self-supervised Poisson-Gaussian denoising approach [38]; however, we adopted Gamma noise modeling as in Speckle2Void [37] to characterize the SAR speckle. Our detection framework implementation was based on the MMDetection toolbox [39], which is developed in PyTorch [40]. The stochastic gradient descent (SGD) optimizer [41,42] with a momentum of 0.9 was used for optimization. We trained for a total of 24 epochs, with an initial learning rate of 0.0025, a momentum of 0.9, and a weight decay of 0.0001. We experimented with ResNet-50-FPN and ResNet-101-FPN backbones [43,44]. All evaluations were carried out on a TITAN Xp GPU with 12 GB memory.
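Expressed as plain PyTorch, the optimization settings above correspond to the following sketch; `model` and `train_loader` are assumed placeholders, and the constant learning rate is an assumption since the schedule is not restated here.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=0.0001)

for epoch in range(24):                      # 24 training epochs
    for images, targets in train_loader:
        loss = model(images, targets)        # assumed to return the total multi-task loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```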

#### *3.3. Qualitative Evaluation*

Figure 8 shows paired examples of real SAR images and the corresponding synthetically denoised SAR images, where the denoised SAR images are intermediate results in our model. After the denoising stage, the general speckle noise is drastically reduced; however, there inevitably exists a trade-off between the noise level and image clarity. In particular, many buoys that usually look like actual ships are located in the first example of Figure 8; in the denoised SAR image, the brightness of the buoys fades relative to the surroundings, and the visual difference from the surrounding ships becomes clear. In addition, the scattering waves around target objects, which are one of the factors hindering accurate localization, are blurred after the denoising. The denoising within our network confirms such positive effects.

Some image triples of the ground truth, baseline detection, and our detection visualizations are presented in Figure 9. We train the baseline detection model with non-preprocessed, raw noisy SAR images. For a fair comparison, both the baseline and our detection model adopt Faster RCNN with the ResNet-101-FPN [43,44] backbone architecture. The detection results show that our model localizes objects accurately with higher confidence scores and detects with a smaller number of false alarms than the baseline detection model in the given patch images. Although the progress made by our detection models is encouraging, our detectors still have room for further improvement due to the few remaining false alarms and missing targets.

#### *3.4. Quantitative Evaluation*

To quantitatively evaluate the detection performance, we calculate the mean average precision (mAP). The mAP metric is widely used as a standard metric for object detection and is estimated as the average value of AP over all categories. Here, AP computes the average value of precision over the interval from recall = 0 to recall = 1. Precision weighs the fraction of detections that are true positives, while recall measures the fraction of positives that are correctly identified. Hence, the higher the mAP, the better the performance.
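For reference, a common all-point interpolation of AP (the area under the precision-recall curve) and the resulting mAP can be sketched as follows; the exact evaluation protocol used in the experiments is not restated here, so this is one standard convention rather than the paper's definitive implementation.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])                 # enforce a monotonically decreasing envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return float(np.mean(list(ap_per_class.values())))
```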

As shown in Table 2, we compare the proposed network with the traditional two-stage detection model under two different backbones, ResNet-50-FPN and ResNet-101-FPN [43,44]. By varying the despeckling approaches, we set several baseline models following previous work: (1) inputting non-preprocessed real SAR images, and (2) feeding denoised SAR images into the traditional two-stage detection model after denoising with representative techniques, namely the Lee filter [22] or the PPB filter [25]. We observe that the despeckling effect of the Lee filter is weaker than that of the PPB filter. The PPB filter reduces more speckle noise, but much detailed information is visually concealed. This explains our experimental results, in which the baseline model with the PPB filter performs slightly worse than the baseline model with the Lee filter. On the other hand, our detection network provides significant advances in performance under all backbone architectures. Based on observation of the test results, this is attributed to the suppression of many false positive detections resulting from the speckle noise of the real SAR images.

**Figure 8.** Two paired examples of (**a**) real (noisy) SAR and (**b**) synthetically denoised SAR images. Red bounding boxes for each image enlarge corresponding sub-regions. As shown in the enlarged windows, scattering waves and speckle noise are less observed in the denoised examples.

**Figure 9.** Image triples for the (**a**) airplane (A), (**b**) etcetera (E), and (**c**) ship (S) classes. In each triple, the left image is the ground truth, the middle image is for the baseline model (traditional two-stage detection model with real SAR images), and the right image is for our model. The ground-truth and predicted bounding boxes are plotted in blue for the A class, yellow for the E class, and pink for the S class. The numbers on the bounding boxes in the middle and right images denote the confidence score for each corresponding category. We visualize all detected bounding boxes after NMS and thresholding the detector confidence at 0.05.


**Table 2.** Comparison of detection performance on our constructed dataset with TerraSAR-X and COSMO-SkyMed images. By incorporating region proposals from denoised SAR images within the detection network, our model shows significant improvement in AP. The entries with the best APs for each object category are highlighted in bold.

#### *3.5. Ablation Study*

We conduct an ablation study to structurally verify the proposed region proposal fusing strategy. We first compare against the case without fusing, in which only the denoised version of the input noisy SAR image is fed to the detection network; this corresponds to the first experiment in Table 3. By comparing with this denoised-only input, we can identify whether using the real SAR image as another input to the detection network is important. This case shows the poorest detection performance and justifies the importance of fusing information from the raw noisy SAR images. Secondly, for the choice of the feature map after fusing, we perform experiments with the feature map from the denoised SAR image and the feature map from the real SAR image. As a result, keeping the feature map from the real SAR image, as proposed, is found to be much better.

**Table 3.** Ablation study across the input type of the detection network and the feature map forwarded to the subsequent sub-network for classification and bounding box regression for each RoI. The entries with the best APs for each object category are highlighted in bold. The backbone is ResNet-50-FPN.


#### **4. Discussion**

Our proposed detection framework achieves better performance by combining a denoising network with an existing detection network; however, the additional parameters and the more complex structure demand larger memory for model storage and a higher computing cost. We report average inference times (measured in seconds per patch image on a Titan Xp GPU) for the purpose of time complexity analysis, as presented in Table 4. Compared with an existing two-stage object detection network such as Faster RCNN [45] in the first row of Table 4, our detection framework further requires denoising time and time for fusing region proposals during inference. The denoising time makes up a large portion of the added running time, so the most promising way to reduce the average inference time would be adopting a relatively light denoising network.


**Table 4.** Comparison of running times for the time complexity analysis. We evaluated the running times on a patch image sized 800 × 800 with a Titan Xp GPU.

#### **5. Conclusions**

In this study, we developed a novel object detection framework in which an unsupervised denoising network is combined with a two-stage detection network and two sets of region proposals extracted from a real noisy SAR image and a synthetically denoised SAR image are complementarily merged. The coupling of the denoising network with the detection network is intended to replace the cumbersome denoising preprocessing step, while at the same time the integrated denoising network performs denoising that supports the subsequent object detection. To remedy the potential risk of fine information loss after denoising, we keep the raw information from the input SAR image within the detection network while also utilizing the set of region proposals inferred from the synthetically denoised SAR image. The extensive qualitative and quantitative experiments on our own datasets with TerraSAR-X and COSMO-SkyMed satellite images suggest that the proposed object detection framework performs adaptive denoising that directly benefits detection performance. Our method shows significant improvements over several detection baselines on the datasets constructed from the TerraSAR-X and COSMO-SkyMed satellite images.

**Author Contributions:** Conceptualization, all authors; methodology, S.S.; software, S.S.; validation, Y.K., I.H. and J.K.; formal analysis, J.K.; investigation, I.H.; resources, S.S.; data curation, S.S.; writing—original draft preparation, S.S.; writing—review and editing, S.S.; visualization, S.S. and Y.K.; supervision, S.K.; project administration, S.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** This research was supported by the Defense Challengeable Future Technology Program of Agency for Defense Development, Republic of Korea.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

