**2. Methodology**

ShadowDeNet is based on the mainstream two-stage framework Faster R-CNN [36]. A two-stage detector usually has better detection accuracy than a one-stage one [40], so we select the former as our experimental baseline in this paper. Figure 2 shows the shadow detection framework of ShadowDeNet. Faster R-CNN consists of a backbone network, a region proposal network (RPN), and a Fast R-CNN [36]. HESE is a preprocessing tool. TSAM and SDAL are used to improve the feature extraction ability of the backbone network. SGAAL is used to improve the proposal generation ability of the RPN. OHEM is used to improve the detection ability of Fast R-CNN.

**Figure 2.** Shadow detection framework of ShadowDeNet. HESE denotes the histogram equalization shadow enhancement. TSAM denotes the transformer self-attention mechanism. SDAL denotes the shape deformation adaptive learning. SGAAL denotes the semantic-guided anchor-adaptive learning. OHEM denotes the online hard-example mining.

From Figure 2, we first preprocess the input video SAR images using the proposed HESE technique to enhance the shadows' saliency and contrast. The detailed descriptions are introduced in Section 2.1. Then, a backbone network is used to extract shadow features. In ShadowDeNet, without loss of generality, we select the commonly used ResNet-50 [50] as the backbone network. One can leverage a more advanced backbone network, which may achieve better performance, but this is beyond the scope of this article. In the backbone network, the proposed TSAM and SDAL are embedded, both of which enable better feature extraction. The former pays more attention to regions of interest to suppress clutter interference based on the attention mechanism [51], and is introduced in detail in Section 2.2. The latter adapts to the deformed shadows of moving targets to overcome motion speed variations based on deformable convolutions [47], and is introduced in detail in Section 2.3.

Next, the feature maps are fed into an RPN to generate regions of interest (ROIs), or proposals. In the RPN, a classification network outputs a 2*k*-dimension vector to represent a proposal category, i.e., a positive or negative sample. Here, *k* denotes the number of anchor boxes, which is set to nine in line with the raw Faster R-CNN. The determination of positive and negative samples is based on the intersection over union (IOU), also called the Jaccard distance [52,53], with the corresponding ground truth (GT). Similar to the original Faster R-CNN, IOU > 0.70 indicates a positive sample, while IOU < 0.30 indicates a negative sample; samples with 0.30 < IOU < 0.70 are discarded. Moreover, a regression network outputs a 4*k*-dimension vector to represent a proposal location and shape, i.e., (*x*, *y*, *w*, *h*), where (*x*, *y*) denotes the proposal central coordinate, *w* denotes the width, and *h* denotes the height. Regression is performed to locate shadows in essence; its inputs are the feature maps extracted by the backbone network, and its outputs are the possible locations of shadows. In the RPN, the proposed SGAAL is inserted to improve the quality of proposals. It can generate optimized anchors to adaptively match shadow location and shape, inspired by the works of [23,45], and is introduced in detail in Section 2.4.
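To make the above IOU-based labeling rule concrete, the following minimal NumPy sketch assigns positive/negative labels to anchors according to their IOU with the ground truths (the box format and helper names are illustrative assumptions, not our exact implementation):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def label_anchors(anchors, gt_boxes, pos_thr=0.70, neg_thr=0.30):
    """Label each anchor: 1 = positive, 0 = negative, -1 = discarded."""
    labels = -np.ones(len(anchors), dtype=np.int64)
    for k, anchor in enumerate(anchors):
        best_iou = max(iou(anchor, gt) for gt in gt_boxes)
        if best_iou > pos_thr:
            labels[k] = 1       # shadow proposal (positive sample)
        elif best_iou < neg_thr:
            labels[k] = 0       # background (negative sample)
        # anchors with 0.30 <= IOU <= 0.70 keep the label -1 and are ignored
    return labels
```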

Afterwards, an ROIAlign layer [54] is used to map the proposals onto the feature maps of the backbone network for the subsequent refined classification and regression. Note that the raw Faster R-CNN used an ROIPooling layer for this purpose, but we replace it with ROIAlign because ROIAlign addresses the misalignment caused by twice-quantization [54], thereby avoiding feature loss. Finally, the refined classification and regression are completed by Fast R-CNN [55] to output the final shadow detection results. Moreover, during training, OHEM is applied in Fast R-CNN to select typical difficult negative samples to boost the background discrimination capacity, inspired by the works of [48,53]; it is introduced in detail in Section 2.5.
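For illustration, here is a minimal sketch of the ROIAlign mapping step using torchvision's `roi_align` (the feature-map size, channel count, and `spatial_scale = 1/16` are illustrative assumptions rather than the exact values used in ShadowDeNet):

```python
import torch
from torchvision.ops import roi_align

# Backbone feature map: batch of 1, 256 channels, 1/16 of the input resolution.
features = torch.randn(1, 256, 64, 64)

# Proposals from the RPN, given as (batch_index, x1, y1, x2, y2) in input-image pixels.
proposals = torch.tensor([[0, 100.0, 120.0, 180.0, 200.0],
                          [0, 300.0, 310.0, 360.0, 390.0]])

# Map each proposal onto the feature map and pool it to a fixed 7x7 grid.
# spatial_scale = 1/16 converts image coordinates to feature-map coordinates;
# aligned=True applies the half-pixel correction that avoids quantization error.
pooled = roi_align(features, proposals, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```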

Next, we introduce HESE, TSAM, SDAL, SGAAL, and OHEM in detail in the following subsections.

### *2.1. Histogram Equalization Shadow Enhancement (HESE)*

Moving target shadows in video SAR images are rather dim [16] and are easily submerged by surrounding clutter, making them less salient from the perspective of human vision. Figure 3 shows a video SAR image and the corresponding shadow ground truths. In Figure 3a, it is difficult for human vision to find the shadow of a moving target quickly and clearly without referring to the ground truths in Figure 3b. The contrast between the shadows and their surroundings is very low, resulting in their unclear appearance. Moreover, the patrol gate made of metal materials also has a serious negative effect on shadow detection. These experimental data are introduced in detail in Section 3.1. Therefore, some image preprocessing is necessary; otherwise, the learning benefits of the features would be reduced.

**Figure 3.** A video SAR image. (**a**) The raw video SAR image; (**b**) the corresponding shadow ground truths. Here, different vehicles are marked with boxes of different colors and numbers for intuitive visual observation. This video SAR image is the 50th frame of the SNL data.

Many previous scholars have proposed various techniques for image preprocessing, e.g., denoising [8], pixel density clustering [24], morphological filtering [17], visual saliency-based enhancement [15], etc. However, they all rely heavily on expert experience and involve a series of cumbersome steps, reducing model flexibility. Therefore, we adopt the simple but effective histogram equalization to preprocess video SAR images. For brevity, we denote this process as histogram equalization shadow enhancement (HESE).

For a video SAR image *I*, if *n*<sub>*i*</sub> denotes the number of occurrences of the gray value *i* (0 ≤ *i* < 256), then the occurrence probability of pixels with the gray value *i* is

$$p\_I(i) = \frac{n\_i}{n} \tag{1}$$

where *n* denotes the total number of pixels in the image, and *p*<sub>*I*</sub>(*i*) is, in fact, the histogram of the image at gray value *i*, normalized to [0, 1]. The HESE transform is described by

$$HESE(k) = \sum\_{i=0}^{k} p\_I(i), \quad k = 0, 1, 2, \dots, 255 \tag{2}$$

Figure 4 shows the histogram of the image in Figure 3a before and after HESE. From Figure 4, after HESE, the overall gray-value distribution (marked in red) is close to a uniform distribution, so the image has a large gray dynamic range and high contrast, and its details are richer. In essence, HESE stretches the image nonlinearly and redistributes the pixel values so that the number of pixels in each gray-level range is roughly equal. In this way, the contrast of the peak part in the middle of the original histogram is enhanced, while the contrast of the valley parts on both sides is reduced. Finally, the contrast of the entire image increases.
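As a concrete illustration of Equations (1) and (2), the following NumPy sketch equalizes a single 8-bit video SAR frame (a minimal sketch rather than the exact implementation used in ShadowDeNet; the 8-bit assumption and variable names are illustrative):

```python
import numpy as np

def hese(image: np.ndarray) -> np.ndarray:
    """Histogram equalization shadow enhancement for an 8-bit SAR frame."""
    assert image.dtype == np.uint8
    # Equation (1): normalized histogram p_I(i) = n_i / n.
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    p = hist / image.size
    # Equation (2): cumulative distribution HESE(k) = sum_{i <= k} p_I(i).
    cdf = np.cumsum(p)
    # Map each gray level through the scaled CDF to stretch the dynamic range.
    lut = np.round(cdf * 255.0).astype(np.uint8)
    return lut[image]
```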

**Figure 4.** Image pixel histogram before HESE and after HESE.

Figure 5 shows the moving target shadow enhancement results. From Figure 5, one can clearly find that, after HESE, the shadow in the zoomed region becomes clearer. In Figure 5a, the raw shadow is hardly perceptible to the human eye, but in Figure 5b, anyone can find the shadow quickly and easily. We also evaluate the shadow quality in the zoomed region using the classic 4-neighborhood method [56]. The evaluation results are shown in Table 1. From Table 1, the shadow contrast with HESE is far larger than that without HESE (29,215.43 >> 20,979.31). The shadow contrast enhancement reaches ~40%, i.e., (29,215.43 − 20,979.31)/20,979.31 ≈ 39.3%. Moreover, the running time is just 13.06 ms, i.e., about 76.6 images per second, which is acceptable. Compared with many previous preprocessing methods [8,15,17,24], HESE is rather fast with a rather simple theory and workflow. Readers can find more shadow enhancement results for other frames (#3, #13, #23, #33) in Figure 6.
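For completeness, a minimal NumPy sketch of a 4-neighborhood contrast measure is given below; the exact normalization in [56] may differ, so this is only an illustrative formulation (the function name and normalization are our own assumptions):

```python
import numpy as np

def four_neighborhood_contrast(region: np.ndarray) -> float:
    """Mean squared gray-level difference between each pixel and its 4 neighbors."""
    region = region.astype(np.float64)
    # Squared differences along horizontal and vertical neighbor pairs.
    dh = (region[:, 1:] - region[:, :-1]) ** 2
    dv = (region[1:, :] - region[:-1, :]) ** 2
    total = dh.sum() + dv.sum()
    count = dh.size + dv.size
    return total / count
```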

**Figure 5.** Moving target shadow before HESE and after HESE. (**a**) Before HESE; (**b**) after HESE. The raw video SAR image is in Figure 3a.

**Table 1.** Shadow quality evaluation results with and without HESE. The running time is obtained on the Intel(R) Core(TM) i9-10900KF CPU.



**Figure 6.** More results of the histogram equalization shadow enhancement (HESE). (**a**) Before HESE; (**b**) after HESE. Different vehicles are marked in boxes with different colors and numbers for an intuitive visual observation. #*N* denotes the *N*-th frame. The white arrows indicate the moving direction.

### *2.2. Transformer Self-Attention Mechanism (TSAM)*

Attention mechanisms, which can adaptively learn feature weights to focus on important information and suppress useless information, are widely used in the CV community. So far, scholars from the SAR community have applied them to various applications, e.g., SAR automatic target recognition (ATR) [57,58], SAR target detection [59,60] and classification [61,62], and so on. Recently, transformer detectors [63,64] have received increasing attention in the CV community. The remarkable characteristic of transformer models is the internal self-attention mechanism, which can effectively capture important long-range dependencies across the entire location space [65]. Video SAR images contain many clutter interferences, as seen in Figure 3a, so we adopt such a self-attention mechanism to suppress them and focus on more valuable regions of interest. We call this process the transformer self-attention mechanism (TSAM). Specifically, we insert TSAM into the backbone network, which enables efficient information flow to extract more representative features. As mentioned before, we select ResNet-50 as our backbone network in Figure 2; thus, we insert TSAM into the residual block to promote better residual learning, as shown in Figure 7.

In Figure 7, the first 1 × 1 convolution (conv) is used to reduce the input channel dimension, and the second 1 × 1 conv is used to increase the output channel dimension for the follow-up adding operation. The 3 × 3 conv is used to extract shadow features. TSAM is placed after the 3 × 3 conv, meaning that the extracted shadow features are prescreened by TSAM. In this way, the important features are retained while the useless interferences are suppressed. The above practice is similar to that in the convolutional block attention module (CBAM) [66] and squeeze-and-excitation (SE) [67], and can be described as

$$F' = F + f\_{1 \times 1} \{ TSAM(f\_{3 \times 3}(f\_{1 \times 1}(F))) \} \tag{3}$$

where *F* denotes the input of a residual block, *F'* denotes the output, *f*<sub>1×1</sub>(·) denotes the 1 × 1 conv operation, *f*<sub>3×3</sub>(·) denotes the 3 × 3 conv operation, and *TSAM*(·) denotes the TSAM operation.

**Figure 7.** Residual block in the backbone network. (**a**) The raw residual block in ResNet-50; (**b**) the improved residual block with TSAM.
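As a minimal PyTorch sketch of Equation (3), the modified residual block can be written as follows (batch normalization, activations, and stride handling are omitted for brevity; the `TSAM` argument is assumed to be any shape-preserving attention module, and channel sizes are illustrative):

```python
import torch
import torch.nn as nn

class TSAMBottleneck(nn.Module):
    """ResNet-50 bottleneck with a TSAM module placed after the 3x3 conv (Eq. (3))."""

    def __init__(self, channels: int, reduced: int, tsam: nn.Module):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)   # first f_1x1 (reduce)
        self.conv3x3 = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)
        self.tsam = tsam                                                         # prescreens the 3x3 features
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)    # second f_1x1 (expand)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.reduce(x)
        out = self.conv3x3(out)
        out = self.tsam(out)          # TSAM(f_3x3(f_1x1(F)))
        out = self.expand(out)
        return x + out                # F' = F + f_1x1{TSAM(f_3x3(f_1x1(F)))}
```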

Figure 8 shows the detailed implementation process of TSAM. In Figure 8, *H* denotes the height of the input feature map **X**, *W* denotes the width, and *C* denotes the channel number. According to Wang et al. [68], the general transformer self-attention can be summarized as

$$\mathbf{y}\_{i} = \frac{1}{\mathcal{C}(\mathbf{x})} \sum\_{\forall j} f(\mathbf{x}\_{i}, \mathbf{x}\_{j}) \, g(\mathbf{x}\_{j}) \tag{4}$$

where *i* is the index of the required output location (i.e., the response of the *i*-th location is to be calculated), and *j* is the index that enumerates all possible locations, i.e., ∀*j*. **x** denotes the input feature map, and **y** denotes the output feature map with the same dimension as **x**. The pairwise function *f* computes the relationship between *i* and all *j*. The unary function *g* calculates the representation of the input feature map at the *j*-th location. Finally, the response is normalized by the factor *C*(**x**).

**Figure 8.** Detailed implementation process of TSAM.

The pairwise function *f* can be implemented with an embedded Gaussian function so as to compute the similarity between *i* and all *j* in an embedding space, i.e.,

$$f(\mathbf{x}\_i, \mathbf{x}\_j) = e^{\theta(\mathbf{x}\_i)^T \boldsymbol{\phi}(\mathbf{x}\_j)} \tag{5}$$

where *θ* denotes the embedding of **x**<sub>*i*</sub> and *φ* denotes the embedding of **x**<sub>*j*</sub>. From Figure 8, they are implemented by two 1 × 1 convs *W*<sub>*θ*</sub> and *W*<sub>*φ*</sub>, i.e., *θ*(**x**<sub>*i*</sub>) = *W*<sub>*θ*</sub>**x**<sub>*i*</sub> and *φ*(**x**<sub>*j*</sub>) = *W*<sub>*φ*</sub>**x**<sub>*j*</sub>. Here, to reduce the computation cost, their kernel numbers are set to *C*/2 when the input channel number is *C*. The normalization factor is set as *C*(**x**) = Σ<sub>∀*j*</sub> *f*(**x**<sub>*i*</sub>, **x**<sub>*j*</sub>). Thus, for a given *i*, *f*(**x**<sub>*i*</sub>, **x**<sub>*j*</sub>)/*C*(**x**) becomes the *softmax* computation along dimension *j*, where *softmax* is defined by *e*<sup>*x*<sub>*i*</sub></sup>/Σ<sub>*j*</sub>*e*<sup>*x*<sub>*j*</sub></sup> [69]. Here, the *softmax* computation is responsible for generating the weight (i.e., importance level) of each location. Similarly, the representation of the input feature map at the *j*-th location is also calculated in an embedding space using another 1 × 1 conv *W*<sub>*g*</sub>. With the matrix multiplication, the output of the self-attention, **Y**, is obtained. Finally, in order to complete the residual operation (i.e., adding), one 1 × 1 conv *W*<sub>*z*</sub> is used to increase the channel number from *C*/2 back to *C*, i.e.,

$$\mathbf{Z} = W\_z \mathbf{Y} + \mathbf{X} \tag{6}$$

In essence, TSAM calculates the interaction between any two positions and directly captures long-range dependencies without being limited to adjacent points. It is equivalent to constructing a convolution kernel as large as the feature map itself, so that more background context information can be maintained. In this way, the network can focus on important regions of interest and suppress clutter interference or other negative effects of useless backgrounds.
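For reference, a compact PyTorch sketch of the embedded-Gaussian self-attention block of Equations (4)-(6) is given below (a sketch under our own simplifications rather than an exact reproduction; layer names mirror Figure 8). An instance of this module can be plugged into the residual block of Figure 7b:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSAM(nn.Module):
    """Embedded-Gaussian self-attention (non-local) block, Eqs. (4)-(6)."""

    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2                                    # reduced embedding dimension C/2
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)   # W_theta
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)     # W_phi
        self.g = nn.Conv2d(channels, inter, kernel_size=1)       # W_g
        self.z = nn.Conv2d(inter, channels, kernel_size=1)       # W_z (restore channel number)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)          # (N, HW, C/2)
        phi = self.phi(x).flatten(2)                              # (N, C/2, HW)
        g = self.g(x).flatten(2).transpose(1, 2)                  # (N, HW, C/2)
        # f(x_i, x_j)/C(x): pairwise similarities normalized by softmax over j.
        attn = F.softmax(theta @ phi, dim=-1)                     # (N, HW, HW)
        y = (attn @ g).transpose(1, 2).reshape(n, -1, h, w)       # (N, C/2, H, W)
        return self.z(y) + x                                      # Z = W_z Y + X
```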

### *2.3. Shape Deformation Adaptive Learning (SDAL)*

The contrast between the shadow generated by a moving target and its background area, the gradient of the shadow intensity along the moving direction, and the shape of the shadow are all closely related to the moving speed of the target [7]. Provided that a shadow can be formed, the smaller the moving speed of the target, the greater the shadow extension [70] and the clearer the shadow contour; conversely, the larger the moving speed, the lower the shadow extension and the more blurred the shadow contour. In other words, when the motion speed of the moving target changes, the shadow shape changes. When the motion speed changes continuously across multiframe video SAR images, the shadow of the same target becomes deformed. This challenges the robustness of the detector. Readers can refer to [11] for more details about the relationship between shadow size and speed.

Figure 9 shows the moving target shadow deformation with the change of moving speed. In Figure 9, due to the stop signal of the traffic light, vehicle-A's speed becomes smaller and smaller. One can find that the same vehicle-A exhibits shadows with different shapes (the zoomed region) in different frames of the video SAR. Thus, a good shadow detector should resist such shadow deformation. However, the feature extraction process of classical convolutional neural networks mainly depends on convolution kernels, and the geometric structure of the traditional convolution kernel is fixed; only fixed local feature information is extracted each time a convolution operation is performed. Thus, classical convolution cannot solve the shape deformation problem. Fortunately, the deformable convolution recently proposed by Dai et al. [47] can overcome this problem because its convolution kernel can deform freely to adapt to the geometric deformation of the target. The deformable convolution changes the sampling positions of the standard convolution kernel by adding additional offsets at the sampling points. These offsets can be learned through training without additional supervision. We call the above shape deformation adaptive learning (SDAL).

**Figure 9.** Moving target shadow deformation with the change of moving speed. From left to right (#64 → #74 → #84 → #94), the speed becomes smaller and smaller. The blue arrow indicates the moving direction.

Figure 10 shows a sketch of different convolutions. From Figure 10, the deformable convolution can effectively resist shadow deformation via the learned location offsets. Figure 11 shows the detailed implementation of SDAL.

**Figure 10.** Different convolutions. (**a**) Classical convolution; (**b**) deformable convolution.

**Figure 11.** Detailed implementation process of SDAL.

The standard convolution kernel is augmented with offsets Δ**p**<sub>*n*</sub>, which are adaptively learned during training to model various shape features, i.e.,

$$\mathbf{y}(\mathbf{p}\_0) = \sum\_{\mathbf{p}\_n \in \mathcal{R}} \mathbf{w}(\mathbf{p}\_n) \cdot \mathbf{x}(\mathbf{p}\_0 + \mathbf{p}\_n + \Delta \mathbf{p}\_n) \tag{7}$$

where **p**<sub>0</sub> denotes each location on the output feature map, ℛ denotes the convolution sampling region, **w** denotes the weight parameters, **x** denotes the input, **y** denotes the output, and Δ**p**<sub>*n*</sub> denotes the learned offset at the *n*-th sampling location. Δ**p**<sub>*n*</sub> is typically fractional, so bilinear interpolation is used to ensure the smooth implementation of the convolution, i.e.,

$$G(\mathbf{q}, \mathbf{p}) = g(q\_x, p\_x) \cdot g(q\_y, p\_y) \tag{8}$$

where *g*(*a*, *b*) = max(0, 1 − |*a* − *b*|). We add another convolution layer (marked in purple in Figure 11) to learn the offsets Δ**p**<sub>*n*</sub>, and then the standard convolution incorporating Δ**p**<sub>*n*</sub> is performed on the input feature maps. Moreover, inspired by [47], the traditional convolutions in the high-level layers of ResNet-50, i.e., conv3\_x, conv4\_x, and conv5\_x, are replaced with deformable ones to extract more robust shadow features. This is because even a slight change in the receptive field size of the high-level layers can make a remarkable difference to the following networks, thus yielding a better geometric modeling capacity for shape-changeable moving target shadows [71].
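To illustrate how the offsets in Equation (7) are learned and applied, the following minimal PyTorch sketch builds on torchvision's `DeformConv2d` (the module name `SDALConv` and the channel sizes are illustrative assumptions; the bilinear interpolation of Equation (8) is handled internally by `DeformConv2d`):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SDALConv(nn.Module):
    """3x3 deformable convolution whose offsets are learned by an extra conv layer (Eq. (7))."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Offset branch: predicts (dx, dy) for each of the 3x3 = 9 sampling points.
        self.offset_conv = nn.Conv2d(in_channels, 2 * 3 * 3, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)              # learned delta p_n, fractional values allowed
        return self.deform_conv(x, offsets)        # samples at p_0 + p_n + delta p_n

# Example: replace a 3x3 conv in a high-level ResNet stage with the deformable version.
feat = torch.randn(1, 256, 32, 32)
print(SDALConv(256, 256)(feat).shape)  # torch.Size([1, 256, 32, 32])
```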
