5.1. Experimental Parameter Setting
All experiments are conducted using an Intel Core i5-12400F CPU (2.50 GHz) sourced from Intel Corporation, Santa Clara, California, USA, and an NVIDIA RTX 3080 GPU sourced from NVIDIA Corporation, Santa Clara, California, USA. The deep learning framework employed is PyTorch (version 1.10.0), and the optimizer used is Adaptive Moment Estimation (Adam).
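For concreteness, the setup above translates into a short training step like the sketch below. It is a minimal illustration only: the learning rate, the class count, and the `OSNet`/`train_loader` names are assumptions for illustration, not values reported in this work.

```python
import torch
import torch.nn as nn

# Minimal training-step sketch matching the stated setup (PyTorch + Adam +
# cross-entropy loss). Hyperparameters and names below are illustrative.
model = OSNet(num_classes=8).cuda()                          # hypothetical constructor; assumed class count
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # assumed learning rate
criterion = nn.CrossEntropyLoss()

for optical, sar, label in train_loader:                     # assumed dual-modality data loader
    optical, sar, label = optical.cuda(), sar.cuda(), label.cuda()
    logits = model(optical, sar)                             # (B, C, H, W) class scores
    loss = criterion(logits, label)                          # label: (B, H, W) class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```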
CrossEntropyLoss is used as the loss function, and precision (P), pixel accuracy (PA), and mean intersection over union (MIoU) are used as the evaluation indexes. The formula for each evaluation index is as follows:
$$
P = \frac{TP}{TP + FP}
$$

where TP (true positive) represents the number of pixels correctly predicted as the positive class, and FP (false positive) represents the number of pixels incorrectly predicted as the positive class.

$$
PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}
$$

where the diagonal elements $p_{ii}$ represent the number of pixels correctly predicted as class $i$; the non-diagonal elements $p_{ij}$ represent the number of pixels belonging to class $i$ but predicted as class $j$; and $k$ is the number of classes (excluding the background).

$$
MIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}
$$

where $p_{ij}$ represents the number of pixels belonging to class $i$ but predicted as class $j$; $p_{ji}$ represents the number of pixels belonging to class $j$ but predicted as class $i$; $p_{ii}$ represents the number of pixels correctly predicted as class $i$; and $k$ is the number of classes (excluding the background).
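As a concrete reference for these definitions, the NumPy sketch below computes PA and MIoU from a confusion matrix covering all $k+1$ classes (including the background). It follows the formulas above and is not the authors' evaluation code; the function names are illustrative.

```python
import numpy as np

def confusion_matrix(pred, label, num_classes):
    """Accumulate a (num_classes x num_classes) matrix whose entry [i, j]
    counts pixels of true class i predicted as class j."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    # PA: correctly predicted pixels (diagonal) over all pixels.
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    # Per-class IoU: p_ii / (row sum + column sum - p_ii), averaged over classes.
    inter = np.diag(cm)
    union = cm.sum(axis=1) + cm.sum(axis=0) - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = inter / union
    return np.nanmean(iou)   # classes absent from both prediction and label are ignored
```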
5.2. Ablation Experiments on WHU-OPT-SAR Dataset
The proposed modules (modified backbone (Laplacian convolution + MAFM), SAIM, and GLFM) are integrated into the model step by step, and network performance is evaluated using MIoU, PA, and Params, with the computation cost presented in Table 2. Params refers to the model's total number of trainable parameters and is used to measure the model's complexity and scale. FLOPs refers to the floating-point operations required for a single forward pass, representing the model's computational complexity.
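These two quantities can be measured with standard counters. The sketch below shows one possible way, assuming the third-party `thop` profiler (which reports multiply-accumulate counts commonly quoted as FLOPs) and illustrative input sizes; it is not the measurement script used for Table 2.

```python
import torch
from thop import profile   # third-party profiler (assumed here; not necessarily what was used)

# Hypothetical model constructor and input resolution, for illustration only.
model = OSNet(num_classes=8)
optical = torch.randn(1, 3, 256, 256)    # assumed optical input
sar = torch.randn(1, 1, 256, 256)        # assumed single-channel SAR input

params = sum(p.numel() for p in model.parameters() if p.requires_grad)   # Params
flops, _ = profile(model, inputs=(optical, sar))                         # one forward pass
print(f"Params: {params / 1e6:.2f} M  FLOPs: {flops / 1e9:.2f} G")
```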
Each module improves the model's accuracy but inevitably increases the computational cost. Among them, SAIM and MAFM are the main contributors to the increased computing cost, raising FLOPs by 17.19 G and 12.28 G, respectively. This is because both modules apply attention mechanisms in a high-dimensional feature space, where the large number of dot-product operations increases the computing cost. However, these two modules significantly improve model performance: SAIM increases accuracy by 5.45% in the village category, and MAFM increases accuracy by 4.35% in the water category. Laplacian convolution has a low computational cost, owing to its preset parameters and simple calculations, yet it improves the feature quality of the SAR branch and yields good improvements across categories. In the decoding stage, GLFM integrates multi-scale features with few additional parameters and little computational cost; the low-scale features it uses carry shallow information, such as edges and surface color, which effectively improves the accuracy of forest classification.
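To make the source of this cost concrete, the sketch below shows a generic cross-modal dot-product attention block over flattened feature maps, whose cost grows quadratically with the spatial size. It only illustrates the interaction pattern and is not the actual SAIM or MAFM implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Generic optical-to-SAR cross-attention over flattened feature maps.
    Illustrative only; not the SAIM or MAFM design from this paper."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, opt_feat, sar_feat):            # both (B, C, H, W)
        b, c, h, w = opt_feat.shape
        q = opt_feat.flatten(2).transpose(1, 2)       # (B, HW, C) optical queries
        kv = sar_feat.flatten(2).transpose(1, 2)      # (B, HW, C) SAR keys/values
        fused, _ = self.attn(q, kv, kv)               # cost scales with (HW)^2 * C
        return fused.transpose(1, 2).reshape(b, c, h, w)
```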
Through the heat maps in Figure 10, we can visualize the effect of each module more intuitively. The heat maps show how much attention the model pays to regions of different categories: the intensity of the red regions indicates the model's primary focus, followed by the yellow-green areas, with blue representing areas of lower attention.
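Such heat maps can be produced by upsampling and normalizing an intermediate feature map and overlaying it on the input. The sketch below (channel-averaged activations, bilinear upsampling, jet colormap) is one simple way to do this and is not necessarily the exact visualization procedure used for Figure 10.

```python
import torch
import torch.nn.functional as F

def activation_heatmap(feature, image_hw):
    """Convert a (C, h, w) feature map into a normalized heat map at image size.
    Channel-averaged activation is a simple stand-in for the attention maps in
    Figure 10; the exact layer and weighting used there are not specified."""
    heat = feature.mean(dim=0)[None, None]                       # (1, 1, h, w)
    heat = F.interpolate(heat, size=image_hw, mode="bilinear",
                         align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat.detach().cpu().numpy()

# Typical overlay (red = high attention, blue = low), assuming `image` is an
# (H, W, 3) array and `feat` a feature map taken from the model under study:
#   import matplotlib.pyplot as plt
#   plt.imshow(image)
#   plt.imshow(activation_heatmap(feat, image.shape[:2]), cmap="jet", alpha=0.5)
#   plt.axis("off"); plt.show()
```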
Heat maps of attention for water and roads are purposely selected and presented in rows one and two of the figure. Water and roads usually vary widely in scale, shape, and texture, and have complex boundaries and features similar to those of their surroundings. Accurately detecting and identifying water bodies and roads is therefore challenging.
Ablation of SAIM: Compared to ResNet, SAIM corrects the wrongly classified water area (rectangle in the bottom left corner of the first row). However, its outline is slightly rough (circle in the first row, square in the bottom right corner of the second row). Furthermore, the yellow and green areas cover almost the entire image, which means that the model still attends to too much redundant information.
Ablation of Laplacian Convolution: In the optical image of the second row, the color of the road looks similar to its surroundings, whereas in the SAR image the road has an obvious edge relative to its surroundings. Laplacian convolution helps the model capture the locations of feature boundaries better, so Column e shows a sharper outline of the region of interest compared to the previous heat map. The black box in the lower right corner of the first row shows that introducing the Laplacian operator makes the model notice the slender bridge.
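A fixed-kernel Laplacian convolution of the kind described (preset, non-learned parameters, cheap to compute) could look like the sketch below; the specific 3x3 kernel and the residual injection into the SAR branch are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianConv(nn.Module):
    """Edge-enhancing convolution with a fixed (non-learned) Laplacian kernel.
    The 3x3 kernel and single-channel SAR input are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        kernel = torch.tensor([[0., 1., 0.],
                               [1., -4., 1.],
                               [0., 1., 0.]]).view(1, 1, 3, 3)
        self.register_buffer("kernel", kernel)       # preset parameters, not trainable

    def forward(self, sar):                           # sar: (B, 1, H, W)
        edges = F.conv2d(sar, self.kernel, padding=1)
        return sar + edges                            # inject edge cues into the SAR branch
```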
Ablation of MAFM: The MAFM module eliminates unnecessary regions of interest by dynamically adjusting the weights between different modal channels. It selectively emphasizes feature channels that are critical to the task while suppressing responses from task-irrelevant or noisy channels. Column f clearly shows the decrease in the yellow–green area.
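A minimal channel-reweighting sketch in the spirit described (emphasizing informative channels and suppressing noisy ones across modalities) is shown below. It uses a squeeze-and-excitation-style gate as a stand-in and is not the actual MAFM implementation.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """SE-style channel gating over concatenated optical/SAR features.
    Illustrates dynamic channel reweighting; not the exact MAFM design."""
    def __init__(self, channels, reduction=16):       # channels = optical + SAR channels
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, opt_feat, sar_feat):
        fused = torch.cat([opt_feat, sar_feat], dim=1)   # (B, channels, H, W)
        return fused * self.fc(fused)                    # per-channel weights in [0, 1]
```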
Ablation of GLFM: The deep features of the model encapsulate the overall structure and semantic information of the entire image, whereas the shallow features focus on local details. The GLFM enables the model to comprehend local information from a global perspective. As depicted in the figure, the multi-scale module effectively eliminates unnecessary regions of interest, preserving and optimizing the details of the relevant regions.
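The described global-local behavior amounts to combining upsampled deep semantic features with shallow detail features in the decoder. The simplified fusion sketch below illustrates this idea; the layer widths are chosen arbitrarily rather than taken from GLFM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFusion(nn.Module):
    """Fuse deep semantic features with shallow detail features in the decoder.
    A simplified stand-in for GLFM; channel sizes are illustrative."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch + shallow_ch, out_ch, 1)
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, shallow):
        # Upsample the global (deep) features to the local (shallow) resolution, then fuse.
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.refine(self.reduce(torch.cat([deep, shallow], dim=1)))
```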
Figure 11 visualizes the impact of the introduction of different modules on the segmentation results.
In the first set of images, the lakes are fragmented and scattered, with complex contour shapes. The SAR images are almost covered by scattering noise, with only part of the waters showing clear contours. The second set of images shows a typical complex scene of a river running through a village. The river and the road are clearly distinguishable in the SAR image.
The segmentation results of ResNet-50 in Column d suffer from blurred edges. The scattered water areas in the middle of the first-row image are not correctly identified, and the water area in the second row is much smaller than in the label, resulting in many misidentifications. After the introduction of SAIM, the accuracy of the water area improves; however, considerable misidentification remains, indicating that fusing deep features alone cannot significantly improve model performance. After introducing Laplacian convolution in Column f, road recognition in the second-row image improves and the contour information is more precise. Column g introduces the MAFM module, which allows multi-scale features in the backbone network to be fused, improving the accuracy of roads and water. The heat maps also show that the MAFM module effectively weakens the influence of redundant information, so the falsely recognized area in the segmented image is reduced. There is a forest area in the lower right corner of the second-row image that is darker than the surrounding color in the optical image but similar to the surrounding backscattering in the SAR image. The GLFM module combines deep semantic and shallow information to distinguish this area from the surrounding land and improve the recognition rate of the forest area. Although some areas are still not identified, this improvement reflects the necessity of GLFM.
However, the model exhibits an excessive association between low backscattering and classification as the water category. In the first row of Figure 11, several scattered farmland areas are misclassified as water, typically corresponding to regions with low backscatter. In the upper right corner of the second row, indicated by the red circle, the model accurately identifies only the low-backscattering area as water.
5.3. Comparison Test of WHU-OPT-SAR Dataset
MCANet [4], ACNet [44], RDFNet [45], V-FuseNet [46], CMGFNet [47], and DeepLab v3+ [48] were selected for comparative experiments.
Figure 12 shows four groups of segmentation results.
It is evident that Columns e, h, and i are seriously affected by noise, with many erroneous pixels scattered across the image, demonstrating that the noise problem cannot be ignored.
In the red circle of the first row, the optical image shows an obvious outline. However, because water has a low backscattering coefficient, the SAR image allows the model to determine accurately that this region is not water. In the lower right corner of the second row, the SAR image shows low backscatter, but the optical image makes it possible to determine that the area is farmland. Analyzing the optical and SAR images together therefore makes the judgment more accurate. In contrast, other models produce many misjudgments in these two regions, showing that our model effectively exploits the difference between the two modalities. Notably, the failure to correctly identify the forest area in the second row is a common problem for all models; there is no obvious outline in either the optical image or the SAR image. In this case, distinguishing forest from farmland remains a problem to be solved in the future.
In the third row, the backscattering in the red-boxed area is significantly lower than in the surrounding areas, and OSNet misclassifies it as water.
In the WHU-OPT-SAR dataset, water labels constitute 38% of the total, leading to a class imbalance that affects model training. We addressed this by setting the cross-entropy loss weights based on the proportion of each class. Despite the model correctly identifying some road areas (e.g., the upper left corner of the third row in Figure 12 and the roads in the second row of Figure 11), the red-boxed area is still misclassified as water. This suggests that weight adjustment alone may not completely resolve the issue of class imbalance.
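Concretely, class-frequency-based loss weighting can be set up as in the sketch below. Inverse-frequency weighting is shown as one common choice; the exact weighting scheme and the pixel counts used here are assumptions for illustration.

```python
import torch
import torch.nn as nn

def class_weights_from_frequency(pixel_counts):
    """Inverse-frequency class weights, normalized to an average of 1.
    One common weighting scheme; the scheme used in this work may differ."""
    freq = pixel_counts / pixel_counts.sum()
    weights = 1.0 / (freq + 1e-8)
    return weights / weights.mean()

# Hypothetical per-class pixel counts (water dominating, as in WHU-OPT-SAR);
# the real counts come from the training labels.
pixel_counts = torch.tensor([3.8e8, 1.5e8, 0.9e8, 2.6e8, 1.2e8, 0.2e8, 0.4e8])
criterion = nn.CrossEntropyLoss(weight=class_weights_from_frequency(pixel_counts))
```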
We analyze the distribution of three low-backscattering categories in the labels (water, road, and farmland), which are 51.35%, 1.36%, and 47.29%, respectively. For these pixels, the predicted proportions are 53.8%, 1.17%, and 43.2%. Compared to the true values, the proportion of predicted water pixels increases, while the other two categories decrease. This indicates that the model has a tendency to classify low-backscattering areas as water, leading to misclassification of some farmland and road pixels.
To understand the model’s misclassification tendencies, we analyze the pixel values representing backscattering intensity in the SAR images. The average values for water, road, and farmland in the true labels are 27.87, 44.26, and 57.46, respectively; in the predictions, they increase to 28.38, 52.07, and 59.54. The significant rise in the road category’s value, nearing that of farmland, suggests that some road and farmland areas with lower backscattering intensity (potentially due to surface cover or soil type) are misclassified as water. Consequently, roads and farmlands in the predictions appear only when the backscattering intensity is relatively high, highlighting the model’s bias towards classifying low-backscattering regions as water.
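Both of these analyses reduce to masked statistics over the SAR intensity image. The sketch below shows one way to compute them, assuming (H, W) integer class maps and a single-band SAR intensity array; it is not the authors' analysis script.

```python
import numpy as np

def class_stats(sar, class_map, classes):
    """Pixel share and mean SAR intensity for the given classes.
    `sar` is an (H, W) backscatter-intensity array; `class_map` is an (H, W)
    integer map holding either ground-truth labels or model predictions."""
    total = sum(int((class_map == c).sum()) for c in classes)
    stats = {}
    for c in classes:
        mask = class_map == c
        stats[c] = {
            "proportion": mask.sum() / total,            # share among the selected classes
            "mean_intensity": float(sar[mask].mean()) if mask.any() else float("nan"),
        }
    return stats

# Applied to both the labels and the predictions for the low-backscatter classes
# (water, road, farmland), this reproduces the proportion/intensity comparison above.
```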
In addition, although our model performs well on contour detection, small areas within large outlines are not accurately captured. The small edges of the farmland areas within the city (last row) are challenging for current models to distinguish.
A quantitative evaluation is performed to compare the effectiveness of these methods, as presented in Table 3.
In Table 3, we compare the computational cost of the MCANet model (proposed by the authors of WHU-OPT-SAR) and its baseline DeepLab v3+ model. It can be observed that the FLOPs of MCANet nearly double compared to the baseline model, with the MIoU and PA metrics increasing by 1.5%. In contrast, our model, OSNet, is more computationally efficient than MCANet, achieving a % improvement compared to the baseline model.
OSNet achieves the highest pixel accuracy (PA) of 81.32% and the highest MIoU of 55.7% at a moderate computational cost. Compared to the other methods, our model clearly leads in accuracy for farmland, village, water, and road, attaining 84.02%, 55.07%, 76.3%, and 30.16%, respectively. In the city and forest categories, OSNet is slightly lower than the best model, by 0.19% and 0.17%, respectively.