3.2. Breadth Search Compensation Module (BSCM)
In inshore scenes, complex background information introduces scattering noise through the SAR imaging mechanism, causing interference and false detections in the network. To tackle this concern, we propose the BSCM, which consists of two main parts: MLKA and NDCL. It performs an extensive information search that leverages the contextual cues surrounding the targets to enhance recognition and to supply shape and positional information.
First, we treat the input feature $X \in \mathbb{R}^{C \times H \times W}$ with a convolutional layer, a Batch Normalization (BN) layer, and an activation function, where $C$ is the channel number, and $H$ and $W$ give the spatial size of the input. This yields $Z \in \mathbb{R}^{C \times H \times W}$, serving as the BSCM input.
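For concreteness, a minimal PyTorch sketch of this stem; the paper names the three layer types but not the kernel size or the activation, so the 3 × 3 kernel and ReLU below are assumptions:

```python
import torch.nn as nn

C = 64  # example channel count
# Stem: convolution + BN + activation, as described above.
# Kernel size and activation choice are assumptions.
stem = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.BatchNorm2d(C),
    nn.ReLU(inplace=True),
)
```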
Multi-scale Large Kernel Attention (MLKA): We employed MLKA to achieve an extensive information search. The pivotal component of MLKA is Multi-scale Large Kernel Convolution (MLKC). MLKC utilizes various sizes of Large Kernel Convolution (LKC) to create a multi-scale search window. This approach enables the effective selection of appropriate search windows for different-sized ship targets, thereby enhancing the target recognition capability. Specifically, an LKC is achieved by decomposing a large convolution kernel into three consecutive convolutional layers, namely, an $a \times a$ depthwise convolution $\mathrm{DWConv}_{a}$, a $b \times b$ depthwise dilated convolution $\mathrm{DWDConv}_{b,d}$ ($d$ is the dilation rate), and a $1 \times 1$ pointwise convolution $\mathrm{PWConv}$, formulated as

$$\mathrm{LKC}(F) = \mathrm{PWConv}\big(\mathrm{DWDConv}_{b,d}\big(\mathrm{DWConv}_{a}(F)\big)\big).$$
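A minimal PyTorch sketch of this decomposition, using the $a$-$b$-1 kernel cascades listed next; the class and argument names are ours, and the padding is chosen to keep the spatial size unchanged:

```python
import torch
import torch.nn as nn

class LKC(nn.Module):
    """Large Kernel Convolution, decomposed as the a-b-1 cascade:
    a x a depthwise -> b x b depthwise dilated -> 1 x 1 pointwise."""
    def __init__(self, channels: int, a: int, b: int, d: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, a, padding=a // 2,
                            groups=channels)
        self.dwd = nn.Conv2d(channels, channels, b, padding=(b // 2) * d,
                             dilation=d, groups=channels)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.dwd(self.dw(x)))
```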
MLKC constructs four LKCs with different kernel sizes: 3-5-1, 5-7-1, 7-9-1, and 9-11-1, where $a$-$b$-1 means cascading an $a \times a$ depthwise convolution, a $b \times b$ depthwise dilated convolution, and a pointwise convolution. Different from related work [28], which used different dilation rates to realize receptive fields of different scales, this study uniformly set the dilation rate to 3, which reduces the number of hyperparameters and makes the network easier to understand and tune. Specifically, we first applied a $1 \times 1$ convolution and a GELU activation function to $Z$, obtaining $F \in \mathbb{R}^{C \times H \times W}$ while preserving both the spatial and channel dimensions. Subsequently, we evenly divided $F$ into $n$ parts $F_i \in \mathbb{R}^{(C/n) \times H \times W}$ ($i = 1, \ldots, n$, with $n = 4$) along the channel dimension. Each $F_i$ underwent processing through its own $\mathrm{LKC}_i$, and the outcomes were concatenated along the channel dimension to construct feature information $G$ with different receptive fields, formulated as follows:

$$G = \mathrm{Cat}\big(\mathrm{LKC}_1(F_1), \mathrm{LKC}_2(F_2), \mathrm{LKC}_3(F_3), \mathrm{LKC}_4(F_4)\big),$$

where $\mathrm{Cat}(\cdot)$ denotes feature map concatenation along the channel dimension.
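A sketch of this multi-scale branch, reusing the LKC module above; the split-then-concatenate bookkeeping is spelled out here under our naming:

```python
import torch
import torch.nn as nn

class MLKC(nn.Module):
    """1x1 conv + GELU, split into four channel groups, apply the four
    LKCs, and concatenate the multi-receptive-field results."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()
        # kernel configurations 3-5-1, 5-7-1, 7-9-1, 9-11-1, dilation 3
        self.branches = nn.ModuleList(
            LKC(c, a, b, d=3) for (a, b) in [(3, 5), (5, 7), (7, 9), (9, 11)])

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        f = self.act(self.proj(z))           # F, same shape as Z
        parts = torch.chunk(f, 4, dim=1)     # F_1 ... F_4
        return torch.cat([lkc(p) for lkc, p in zip(self.branches, parts)],
                         dim=1)              # G
```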
To enhance the connection between the different receptive fields, we employed average pooling and max pooling on $G$, which effectively extracts the spatial relationships among them:

$$G_{\max} = \mathrm{MaxPool}(G), \qquad G_{\mathrm{avg}} = \mathrm{AvgPool}(G),$$

where $\mathrm{MaxPool}$ and $\mathrm{AvgPool}$ are the max pooling and average pooling operators, both of which reduce the channel dimension to 1. We concatenated these outcomes to yield spatial attention $\mathrm{SA}$ with a channel size of 2 and subsequently utilized a convolution to expand the channel size to 4 to match the four distinct receptive fields. The sigmoid function processes $\mathrm{SA}$ to capture crucial information. Multiplying the processed $\mathrm{SA}$ with $G$ and summing the products achieves effective spatial information fusion, detailed as

$$\mathrm{SA} = \mathrm{Conv}\big(\mathrm{Cat}(G_{\max}, G_{\mathrm{avg}})\big), \qquad \mathrm{MSA} = \sum_{i=1}^{4} \delta(\mathrm{SA})_i \odot G_i,$$

where $\delta$ represents the sigmoid function. We employed $\delta(\mathrm{SA})_i$ to selectively extract feature information from the different receptive fields and subsequently summed these values to derive the multi-head spatial attention (MSA). We multiplied $\mathrm{MSA}$ with $F$ to obtain the output of the MLKC component. Finally, we performed convolutional processing on the MLKC output and applied a skip connection to obtain the MLKA output $F_{\mathrm{MLKA}}$.
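A PyTorch sketch of this pooling-and-reweighting step. To make the final multiplication with $F$ shape-consistent, we tile $\mathrm{MSA}$ across the four channel groups, which is one reasonable reading of the text; all module names here are ours:

```python
import torch
import torch.nn as nn

class MLKAFusion(nn.Module):
    """Channel max/avg pooling -> 2-channel SA -> 4 attention maps -> MSA,
    then multiplication with F, a final conv, and a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.expand = nn.Conv2d(2, 4, kernel_size=1)   # SA: 2 -> 4 channels
        self.final_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f: torch.Tensor, g: torch.Tensor,
                z: torch.Tensor) -> torch.Tensor:
        g_max = g.max(dim=1, keepdim=True).values      # channel max pooling
        g_avg = g.mean(dim=1, keepdim=True)            # channel average pooling
        attn = torch.sigmoid(self.expand(torch.cat([g_max, g_avg], dim=1)))
        parts = torch.chunk(g, 4, dim=1)               # G_1 ... G_4
        msa = sum(attn[:, i:i + 1] * parts[i] for i in range(4))
        out = msa.repeat(1, 4, 1, 1) * f               # multiply MSA with F
        return self.final_conv(out) + z                # conv + skip -> F_MLKA
```

Here `f` and `g` are the MLKC projection and its concatenated output, and `z` is the stem feature carried by the skip connection.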
Neural Discrete Codebook Learning (NDCL): MLKA adopts dilated convolutions to achieve extensive information exploration. However, because dilated convolutions contain holes, they risk losing information. To mitigate this loss, we introduced the NDCL method, which learns discrete information through a codebook, thereby compensating for the potential information deficiency of MLKA. As shown in Figure 3, the feature heatmap of MLKA had a large receptive field but lacked attention to local detail. NDCL made up for this shortcoming, making the heatmap more sensitive to fine-grained features in local areas. Finally, the BSCM combined the outputs of MLKA and NDCL to accurately capture the global information.
For the input feature $X$, we first obtained $Z$ through the stem block, which was then fed into our NDCL module. We utilized a learnable codebook $B = \{b_1, \ldots, b_K\} \in \mathbb{R}^{K \times N}$ to represent the dimensional information in $Z$, where $K$ signifies the number of codewords $b_k$ and $N$ is the dimension of each codeword. By employing the $N$-dimensional codewords, we discretely represented $Z$, which effectively compensates for the loss of fine-grained information. Unlike previous dictionary-learning methods [35,36] that only establish codewords in the channel dimension, we extended this concept to include codewords within the spatial dimensions ($H \times W$) to achieve a three-dimensional representation of local information. This was accomplished as follows:

$$B_c = \{b_1^c, \ldots, b_K^c\} \in \mathbb{R}^{K \times C}, \qquad B_s = \{b_1^s, \ldots, b_K^s\} \in \mathbb{R}^{K \times (H \cdot W)},$$

where $B_c$ and $B_s$ are the codebooks in the channel and spatial dimensions, respectively, and $b_k$ represents the $k$-th codeword. We replaced the corresponding dimensions of $Z$ with the codewords to obtain the quantized feature $v$. Additionally, we employed a learnable scale factor $s_k$ to adjust the similarity between each codeword and the dimensional information, whether in the channel or the spatial dimension:

$$v_k^c = \mathrm{softmax}\big({-s_k \lVert Z_c - b_k^c \rVert^2}\big)\,(Z_c - b_k^c), \qquad v_k^s = \mathrm{softmax}\big({-s_k \lVert Z_s - b_k^s \rVert^2}\big)\,(Z_s - b_k^s),$$
where $Z_c$ is the information in the channel dimension, $Z_s$ represents the feature information of each pixel in the spatial dimension, $s_k$ is the $k$-th scaling factor, $\mathrm{softmax}(\cdot)$ denotes the softmax function, and $v_k^c$ and $v_k^s$ denote the $k$-th quantized channel and spatial information, respectively. We computed the $L_2$ distance between $Z$ and the codewords, scaled it by $s_k$, and subsequently employed the softmax function to yield smoothed features. Following this, we employed $\phi$ to combine all $v_k^c$ and $v_k^s$, where $\phi$ comprises a BN layer with a ReLU activation layer and a mean layer. Based on this, the full information of the whole image with respect to the $K$ codewords is calculated as

$$e = \sum_{k=1}^{K} \phi\big(v_k^c, v_k^s\big).$$
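For concreteness, a PyTorch sketch of the scaled soft assignment for a single codebook; the residual form follows the standard encoding-layer formulation and is our assumption, as are the names:

```python
import torch

def soft_assign(z: torch.Tensor, codebook: torch.Tensor,
                scales: torch.Tensor) -> torch.Tensor:
    """Scaled soft assignment of descriptors to codewords.

    z:        (L, N) descriptors (rows of Z along one dimension)
    codebook: (K, N) learnable codewords b_k
    scales:   (K,)   learnable factors s_k
    returns:  (K, N) quantized information v_k
    """
    resid = z[:, None, :] - codebook[None, :, :]   # z_l - b_k, shape (L, K, N)
    dist = resid.pow(2).sum(-1)                    # squared L2 distance (L, K)
    w = torch.softmax(-scales * dist, dim=1)       # smoothed weights over k
    return (w.unsqueeze(-1) * resid).sum(0)        # aggregate over descriptors
```

For the channel codebook, `z` would be $Z$ reshaped to $(H \cdot W, C)$; for the spatial codebook, to $(C, H \cdot W)$.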
We performed element-wise multiplication of the aggregated codeword information $e$ and the input feature $Z$ along the channel dimension, followed by summing the products. The output value $a$ was obtained by applying the sigmoid function to the sum:

$$a = \delta\Big(\sum_{c=1}^{C} e_c \odot Z_c\Big),$$

where $\delta$ represents the sigmoid function. The outcome of the NDCL could then be determined using the following equation:

$$F_{\mathrm{NDCL}} = a \odot Z,$$

where $a$ aggregates the codeword information of the channel and spatial dimensions to adjust the required information by multiplying it with the feature $Z$.
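A sketch of this gating step under one plausible reading: we assume the channel and spatial summaries have already been brought to a common $(B, K, C)$ shape before $\phi$, a detail the paper leaves implicit; the module name is ours:

```python
import torch
import torch.nn as nn

class NDCLGate(nn.Module):
    """phi = BN + ReLU + mean over codewords, then a sigmoid gate on Z."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, z: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # z: (B, C, H, W); v: (B, K, C) combined codeword summaries
        e = torch.relu(self.bn(v.transpose(1, 2))).mean(dim=2)   # (B, C)
        # multiply along channels, sum the products, then sigmoid -> gate a
        a = torch.sigmoid((e[:, :, None, None] * z).sum(dim=1, keepdim=True))
        return a * z                                             # F_NDCL
```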
Finally, we fused $F_{\mathrm{MLKA}}$ and $F_{\mathrm{NDCL}}$ along the channel dimension to obtain the output of the BSCM:

$$F_{\mathrm{BSCM}} = \mathrm{Cat}\big(F_{\mathrm{MLKA}}, F_{\mathrm{NDCL}}\big).$$
3.3. Sine Fourier Transform Coding (SFTC)
To deal with the boundary discontinuity caused by the periodicity of the rotation angle, this section introduces the encoding and decoding of the box angle predicted by the detection head. As shown in Figure 4, we sine-encoded the predicted angle $\theta$. The angle was encoded using a four-step phase-shift method [37], with the initial phases set at 0, 90, 180, and 270 degrees. This angle representation complies with the sampling theorem and possesses encoding fault tolerance. The angle is represented by sine functions at two different frequencies:

$$\theta_1 = 2\theta, \qquad \theta_2 = 4\theta,$$

where $\theta_1$ and $\theta_2$ are the phases at the two frequencies, representing the conversion relationship with the predicted angle $\theta$.
Sine encoding: $\theta_1$ and $\theta_2$ are encoded as $s_{1,m}$ and $s_{2,m}$ using sine functions as follows:

$$s_{1,m} = \sin\!\Big(\theta_1 + \frac{2\pi m}{M}\Big), \qquad s_{2,m} = \sin\!\Big(\theta_2 + \frac{2\pi m}{M}\Big),$$

where $m = 0, 1, \ldots, M-1$ and $M$ is the number of sine components; with $M = 4$, the shifts are exactly the four initial phases of 0, 90, 180, and 270 degrees. We can deduce that $\theta_1$ represents a rotation period of $\pi$, while $\theta_2$ corresponds to a rotation period of $\pi/2$, matching the periods of the rectangular OBB and the square OBB, respectively.
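A numpy sketch of this encoder, assuming the frequency mapping $\theta_1 = 2\theta$, $\theta_2 = 4\theta$ reconstructed above:

```python
import numpy as np

M = 4                                     # four-step phase shift
shifts = 2 * np.pi * np.arange(M) / M     # initial phases 0, 90, 180, 270 deg

def encode(theta: float) -> np.ndarray:
    """Encode an angle as 2*M sine components at the two frequencies."""
    t1, t2 = 2 * theta, 4 * theta         # rotation periods pi and pi/2
    return np.concatenate([np.sin(t1 + shifts), np.sin(t2 + shifts)])
```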
Discrete Fourier transform (DFT): Directly performing regression on these sinusoidal components would lose their phase information. Inspired by wave-particle duality in quantum mechanics, where a wave carries both amplitude and phase attributes and the wave function of a particle fully describes its state through them, we regard the sinusoidal components as the wave functions of free particles. Furthermore, there is a connection between the wave function of a free particle and the discrete Fourier transform: in the wave function, frequency relates to the oscillatory nature of the wave, while in the Fourier transform, frequency represents the signal intensity. The superposition of free-particle wave functions is analogous to spatial superposition, while superposition in the Fourier transform occurs in the frequency domain. Based on interference effects, the superposition state of wave functions reflects the phase relationship between the particles. With the assistance of the DFT, we superpose the particles (sine components) to better observe wave phenomena in the results and to comprehensively utilize amplitude and phase information to describe angles. The superposition of the wave equations is expressed as

$$F_i(k) = \sum_{m=0}^{N-1} s_{i,m}\, e^{-j \frac{2\pi}{N} k m} = A_i(k)\, e^{j \varphi_i(k)}, \qquad i = 1, 2,$$

where $F_i(k)$ is the frequency-domain representation; $s_{1,m}$ and $s_{2,m}$ represent the discrete values of the particles at the two frequencies; $k$ denotes the wave number; $N$ is the number of particles, with $N = M$ sine components per frequency; $e$ is the base of the natural logarithm; $j$ is the imaginary unit; and $A_i(k)$ and $\varphi_i(k)$ are the amplitude and phase, respectively.
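To make the amplitude-and-phase reading concrete, a minimal numpy sketch; the bin $k = 1$ is the fundamental of the four-step pattern:

```python
import numpy as np

def amplitude_phase(s: np.ndarray) -> tuple[float, float]:
    """Superpose the M sine components via the DFT and read the
    amplitude A(1) and phase phi(1) of the fundamental bin."""
    F = np.fft.fft(s)                  # frequency-domain representation F(k)
    return np.abs(F[1]), np.angle(F[1])
```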
Decoding function: The two phases are decoded from the sine components via the four-step phase-shift formula

$$\theta_i = \arctan\!\left(\frac{s_{i,0} - s_{i,2}}{s_{i,1} - s_{i,3}}\right), \qquad i = 1, 2,$$

where the numerator and the denominator are calculated from the DFT as follows:

$$s_{i,0} - s_{i,2} = \mathrm{Re}\, F_i(1) = 2\sin\theta_i, \qquad s_{i,1} - s_{i,3} = -\mathrm{Im}\, F_i(1) = 2\cos\theta_i$$

(the four-quadrant arctangent is used, so $\theta_i$ is recovered modulo $2\pi$), and where $\theta_2$ has a twofold frequency relationship with respect to $\theta_1$.
We calculated the cosine of the difference between $\theta_1$ and $\theta_2/2$ to help restore the predicted angle:

$$\cos\Delta\varphi = \cos\!\left(\theta_1 - \frac{\theta_2}{2}\right),$$

where $\cos\Delta\varphi$ denotes the cosine of the angular difference between the two phases and was utilized to recover the ultimate predicted angle.
The formula for restoring the predicted angle using $\cos\Delta\varphi$, with $\theta_2$ taken in $[0, 2\pi)$, is

$$\theta = \begin{cases} \dfrac{\theta_2}{4}, & \cos\Delta\varphi \geq 0, \\[4pt] \dfrac{\theta_2}{4} + \dfrac{\pi}{2}, & \cos\Delta\varphi < 0, \end{cases}$$

where $\theta$ is the predicted angle output by the network.
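Putting the pieces together, a numpy sketch of the full decoder under the reconstruction above; the $+\pi/2$ offset compensates for the sine (rather than cosine) basis of the encoder, and the case rule follows the $\cos\Delta\varphi$ test:

```python
import numpy as np

M = 4
shifts = 2 * np.pi * np.arange(M) / M

def decode(s: np.ndarray) -> float:
    """Recover theta in [0, pi) from the 2*M sine components."""
    s1, s2 = s[:M], s[M:]
    # phase of the fundamental DFT bin; +pi/2 compensates the sine basis
    t1 = (np.angle(np.fft.fft(s1)[1]) + np.pi / 2) % (2 * np.pi)  # ~ 2*theta
    t2 = (np.angle(np.fft.fft(s2)[1]) + np.pi / 2) % (2 * np.pi)  # ~ 4*theta
    theta = t2 / 4                       # theta modulo pi/2
    if np.cos(t1 - t2 / 2) < 0:          # phase-difference test
        theta += np.pi / 2               # resolve the pi/2 ambiguity
    return theta

# Round-trip check against the encoder sketched earlier:
for theta in (0.1, 1.0, 2.0, 3.0):
    s = np.concatenate([np.sin(2 * theta + shifts),
                        np.sin(4 * theta + shifts)])
    assert abs(decode(s) - theta) < 1e-9
```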