Article

CMFPNet: A Cross-Modal Multidimensional Frequency Perception Network for Extracting Offshore Aquaculture Areas from MSI and SAR Images

1 School of Geographical Sciences, Liaoning Normal University, Dalian 116029, China
2 Liaoning Provincial Key Laboratory of Physical Geography and Geomatics, Liaoning Normal University, Dalian 116029, China
3 Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
4 School of Geosciences and Info-Physics, Central South University, Changsha 410012, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2024, 16(15), 2825; https://doi.org/10.3390/rs16152825
Submission received: 17 June 2024 / Revised: 23 July 2024 / Accepted: 30 July 2024 / Published: 1 August 2024

Abstract

The accurate extraction and monitoring of offshore aquaculture areas are crucial for the marine economy, environmental management, and sustainable development. Existing methods relying on unimodal remote sensing images are limited by natural conditions and sensor characteristics. To address this issue, we integrated multispectral imagery (MSI) and synthetic aperture radar (SAR) imagery to overcome the limitations of single-modal images. We propose a cross-modal multidimensional frequency perception network (CMFPNet) to enhance classification and extraction accuracy. CMFPNet includes a local–global perception block (LGPB) for combining local and global semantic information and a multidimensional adaptive frequency filtering attention block (MAFFAB) that dynamically filters frequency-domain information that is beneficial for aquaculture area recognition. We constructed six typical offshore aquaculture datasets and compared CMFPNet with other models. The quantitative results showed that CMFPNet outperformed the existing methods in classifying and extracting floating raft aquaculture (FRA) and cage aquaculture (CA), achieving mean intersection over union (mIoU), mean F1 score (mF1), and mean Kappa coefficient (mKappa) values of 87.66%, 93.41%, and 92.59%, respectively. Moreover, CMFPNet has low model complexity and successfully achieves a good balance between performance and the number of required parameters. Qualitative results indicate significant reductions in missed detections, false detections, and adhesion phenomena. Overall, CMFPNet demonstrates great potential for accurately extracting large-scale offshore aquaculture areas, providing effective data support for marine planning and environmental protection. Our code is available in the Data Availability Statement section.

1. Introduction

China is a major fishing country with the largest aquaculture production scale in the world. According to the Food and Agriculture Organization (FAO) of the United Nations, approximately 56.7% of the global aquaculture production of aquatic animals comes from China, most of it from marine aquaculture [1]. The rapid development of offshore aquaculture, an activity that promotes the reproduction and growth of marine plants and animals through artificial measures, strongly supports global food security and the economic growth of coastal areas [2,3,4]. However, while offshore aquaculture development has generated significant economic benefits, it has also created serious ecological and environmental problems, and aquaculture in many marine areas now faces unsustainable threats [5,6]. When residual feed and feces exceed the carrying capacity of an aquaculture area, seawater pollution increases, affecting benthic organisms and marine ecosystems [7,8]. Additionally, the large quantity of greenhouse gas emissions from aquaculture contributes to global warming, seriously threatening ecosystem stability and sustainable socioeconomic development [9]. Therefore, marine fishery management authorities must accurately monitor aquaculture areas while strengthening the environmental protection of marine resources to ensure the scientific management of coastal zones and the sustainable development of marine ecology [7].
Remote sensing technology, characterized by large-area observations and high timeliness, is widely used in environmental and resource monitoring tasks [10,11,12], and its rapid development has significantly improved the ability to monitor large-scale, cyclical offshore aquaculture. In recent years, researchers have proposed many methods for extracting marine aquaculture areas from remote sensing images. For example, Kang et al. used Landsat data and an object-oriented automatic extraction method to analyze the spatial and temporal distribution of marine aquaculture in Liaoning Province, China, from 2000 to 2018, providing a valuable reference for monitoring the dynamic spatial distribution of marine aquaculture [13]. Hou et al. designed a hyperspectral index for floating raft aquaculture (FRA) using hyperspectral remote sensing imagery and combined it with a decision tree model to extract the FRA contained in complex offshore marine environments [14]. However, these studies identify targets based only on the spectral information of remote sensing images and do not make full use of detailed features such as shape and texture; moreover, the methods generalize poorly and cannot cope with changes in natural conditions, such as water color, across multiple areas.
The development and application of deep learning techniques have provided a more accurate and effective way to extract offshore aquaculture areas. For example, Fu et al. designed TCNet based on a hybrid architecture of transformers and CNNs and used high spatial resolution (HSR) optical images to achieve a precise delineation of mariculture areas, which made an important contribution to developing precision agriculture in coastal areas [15]. Ai et al. designed SAMALNet for the aquaculture zone extraction task from GF-2 images of Jiaozhou Bay, Qingdao, using a self-attention mechanism and an auxiliary loss network and achieved a high extraction accuracy [16]. Although better accuracy has been obtained by combining deep learning methods with high-resolution optical images for offshore aquaculture extraction and observation tasks, the characteristics of optical imagery make it unsuitable for variable natural conditions in all-weather applications, and accurate monitoring cannot be achieved when the aquaculture facilities are below the sea surface. In contrast, synthetic aperture radar (SAR), which actively transmits signals and receives the signals reflected from features, has all-weather and day-and-night application capabilities [17]. For example, Gao et al. designed D-ResUNet for monitoring offshore FRA with Sentinel-1 imagery and achieved fine-grained aquaculture extraction [18]. Zhang et al. proposed a method to extract marine raft aquaculture areas from multitemporal synthesized Sentinel-1 images based on UNet and an attention mechanism, which attenuates noise interference and improves the salience of FRA features [19]. However, the scattering signals acquired from offshore aquaculture areas are affected by coherent speckle noise and sea states, resulting in complex features in SAR images. This complexity leads to blurred aquaculture area edges, especially in medium-resolution images, which are prone to adhesion. Therefore, relying solely on SAR remote sensing images is insufficient for the accurate extraction of offshore aquaculture areas.
Although deep learning-based methods have demonstrated improved extraction accuracy and generalizability in comparison with the traditional methods, the process of relying on only single-modal remote sensing image data is still limited by sensor characteristics, making it difficult to achieve refined all-weather offshore aquaculture area monitoring. For example, optical images are limited in all-weather applications, and edge blurring problems occur in SAR images. Therefore, how to effectively combine different advanced methods and efficiently utilize diverse remote sensing data remains a key challenge for improving the accuracy of offshore aquaculture area extraction.
With the development of remote sensing satellites and observation technologies, the combined use of multimodal remote sensing data is becoming an increasingly important strategy [20,21,22]. Compared with unimodal remote sensing data, multimodal remote sensing data provide complementary and cooperative information, which can effectively improve the reliability of Earth observation tasks [23,24,25]. Therefore, integrating the advantages of multimodal remote sensing data and combining optical images with SAR images can compensate for the shortcomings of single-modal remote sensing data and improve the accuracy and reliability of offshore aquaculture area monitoring. However, how to effectively utilize this complementary information to achieve cross-modal information fusion, especially in-depth fusion at different feature levels, is still a problem that is worth exploring.
Thus, this study applies multispectral imagery (MSI) in combination with SAR imagery to monitor offshore aquaculture areas and designs a cross-modal multidimensional frequency perception network (CMFPNet) to understand multimodal remote sensing data in a more comprehensive manner. CMFPNet is able to extract the features of offshore aquaculture areas from multimodal remote sensing imagery in a more effective way, and it efficiently fuses data from different modalities, thus realizing the accurate monitoring and extraction of offshore aquaculture areas in complex marine environments. Our main contributions are as follows:
  • We addressed the limitations of single-modal remote sensing imagery by incorporating two heterogeneous datasets, Sentinel-1 and Sentinel-2, into the offshore aquaculture area extraction task. This approach enriched features in these areas and provided the model with a more diverse and comprehensive set of land feature information for training. Spectral feature indices were computed for the MSI imagery to avoid data redundancy. Consequently, we established a multimodal remote sensing image dataset comprising visible light imagery, spectral index imagery, and SAR imagery, which was utilized for model training;
  • We propose a cross-modal multidimensional frequency perception network (CMFPNet) method to enhance offshore aquaculture area extraction accuracy. Through experimental validation on datasets from six typical offshore aquaculture areas, we found that CMFPNet outperforms all other models. It reduces instances of false negatives, false positives, and adhesion phenomena, resulting in extracted shapes closely aligning with the actual aquaculture area shapes. This demonstrates the significant potential of CMFPNet in accurately extracting large-scale offshore aquaculture areas;
  • In CMFPNet, we designed the local–global perception block (LGPB), which combines local and global semantic information to effectively understand the global relationship of the scene and the detailed target features to enhance model robustness in different sea environments. We also designed the multidimensional adaptive frequency filtering attention block (MAFFAB), which retains and enhances information that is beneficial to recognizing aquaculture areas by dynamically filtering the information in different frequency domains. It also efficiently aggregates the modal semantic features from different modalities to enhance the recognition accuracy of the model.

2. Materials

2.1. Study Area

Liaoning Province is located in northeastern China, bordered by the Bohai Sea to the west and the Yellow Sea to the south, with more than 2100 km of continental coastline. Its sea areas have the highest latitude and the lowest water temperature in China [26]. Influenced by the Liaodong Bay circulation, the northern Yellow Sea coastal currents, and upwelling, the region has a moderate environment, fertile water, and a long growth cycle that ensures the high quality of marine organisms, forming two major marine ecosystems and important fishery resource areas in the northern Yellow Sea and in Liaodong Bay of the Bohai Sea [13]. At present, Liaoning Province is vigorously developing marine fisheries and actively creating national-level sea farm demonstration areas, and the scale of offshore aquaculture is expanding. Therefore, this study took the sea area of Liaoning Province as a typical offshore aquaculture area; its geographical location is shown in Figure 1. Offshore aquaculture in Liaoning Province mainly includes floating raft aquaculture (FRA) and cage aquaculture (CA). FRA is constructed with buoyant balls and ropes and appears as regular black strips in images. CA is composed of plastic, wood, and other materials and appears as greyish-white or greyish-blue regular rectangles with clear boundaries in images [27]. Training the model on different aquaculture methods under different geographical conditions helped us establish a more comprehensive aquaculture area monitoring model and improved its generalizability. Therefore, to ensure the completeness of the training samples and avoid collecting duplicate data, we selected six typical aquaculture zones in the Bohai Sea and Yellow Sea waters of Liaoning Province as the data collection zones, namely, Gaoling Town, Bayuquan District, Pulandian Bay, Jinshitan Bay, Changhai County, and Shicheng Township. These six sampling areas include both FRA and CA at different scales and cover a wide range of geographic conditions, such as coastal islands, open ocean areas, and river inlets, effectively ensuring intraclass consistency and interclass differentiation, which helps to improve the efficiency of the model training process and avoid wasting computational resources on redundant data. The geographical extent and a sample of each area are shown in the true-color remote sensing images in Figure 1A–F.

2.2. Dataset and Processing

The aim of this study was to achieve the precise and effective monitoring of aquaculture areas. Therefore, we employed a combination of two different image modalities to enrich the characteristics of offshore aquaculture areas and provide the model with more diverse and comprehensive land feature information. This study used SAR images acquired from the Sentinel-1 remote sensing satellite and MSI data from Sentinel-2. The Sentinel series comprises Earth observation satellites launched by the European Space Agency (ESA) to provide high-resolution image data for environmental monitoring, resource management, and disaster response (https://dataspace.copernicus.eu/, accessed on 21 April 2024). Equipped with advanced multispectral imagers and radar sensors, these satellites can capture surface information under all-weather and all-terrain conditions with a spatial resolution of up to 10 m and a global revisit frequency of five days. The SAR imagery of Sentinel-1 offers single-polarization VV or HH and dual-polarization VV+VH and HH+HV combinations, which can penetrate clouds, enable all-weather surface observations, capture rich surface structure and material characteristic information, and improve the ability of the model to recognize surface features. The MSI imagery of Sentinel-2 has 13 spectral bands covering the spectral range from visible to infrared light, providing rich spectral features, a high spatial resolution, and detailed ground information, thus contributing to fine-grained classification and analysis tasks [28].
We used the Google Earth Engine (GEE) cloud platform (https://developers.google.com/earth-engine/, accessed on 21 April 2024) as the data downloading and preprocessing tool to acquire an image collection consisting of Sentinel-1 GRD and Sentinel-2 L2A images. Utilizing the GEE platform, we preprocessed the Sentinel-2 L2A images with geographic alignment, atmospheric correction, image declouding, and pixel value normalization operations. In addition, we filtered the Sentinel-2 data to exclude the bands whose spatial resolution is neither 10 m nor 20 m and resampled the 20 m bands to 10 m via bilinear interpolation. Similarly, we performed geographic alignment, thermal noise removal, radiometric calibration, and pixel value normalization on the Sentinel-1 GRD images in GEE. In addition, to reduce the effect of coherent speckle noise in the images, we performed refined Lee filtering [29], which effectively removed Gaussian and salt-and-pepper noise while preserving the detailed information of the images.
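For readers who wish to reproduce the acquisition step, the following is a minimal sketch of assembling the two collections with the GEE Python API. The dataset identifiers, date range, cloud threshold, projection, and study-area rectangle are illustrative assumptions, and the refined Lee speckle filter is not shown here.

```python
# Sketch: assemble Sentinel-1/Sentinel-2 monthly composites in Google Earth Engine.
import ee

ee.Initialize()

# Hypothetical rectangle roughly covering the Liaoning coastal waters
aoi = ee.Geometry.Rectangle([121.0, 38.7, 123.5, 40.2])

# Sentinel-2 L2A surface reflectance, cloud-filtered
s2 = (ee.ImageCollection("COPERNICUS/S2_SR_HARMONIZED")
      .filterBounds(aoi)
      .filterDate("2023-02-01", "2023-02-28")
      .filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", 20)))

# Sentinel-1 GRD, IW mode, VV polarization only
s1 = (ee.ImageCollection("COPERNICUS/S1_GRD")
      .filterBounds(aoi)
      .filterDate("2023-02-01", "2023-02-28")
      .filter(ee.Filter.eq("instrumentMode", "IW"))
      .filter(ee.Filter.listContains("transmitterReceiverPolarisation", "VV"))
      .select("VV"))

# Monthly mean composites; 20 m Sentinel-2 bands are brought to 10 m bilinearly
s2_mean = s2.mean().resample("bilinear").reproject(crs="EPSG:32651", scale=10)
s1_mean = s1.mean().reproject(crs="EPSG:32651", scale=10)
```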
To mitigate the data redundancy caused by the large number of bands in the multimodal data, we further selected bands from the above data by integrating a priori knowledge. In the Sentinel-1 data, the VV-polarized (vertically transmitted and vertically received) radar waves interact better with vertical structures (e.g., mariculture facilities), and the signals reflected from these structures are captured more easily and clearly [30,31]. Therefore, we selected only the Sentinel-1 images in the VV polarization mode as the data source for the SAR modality. In addition, on the basis of the spectral differences between the offshore aquaculture areas and the seawater background, we selected only the B2, B3, and B4 bands, in which the aquaculture targets differ most clearly from the background seawater, to alleviate the increase in data volume caused by spectral redundancy. However, to take full advantage of the multispectral bands contained in the MSI data and enhance the separability between seawater and aquaculture areas, we computed spectral indices from the data; the selected indices were the normalized difference water index (NDWI) [32], the vegetation red-edge-based water index (RWI) [33], and the enhanced water index (EWI) [34]. Figure A1 shows the spectral reflectance of FRA, CA, and seawater and a comparison of the various water body indices. The formulas for the water body indices are as follows:
$$\mathrm{NDWI} = \frac{B_3 - B_8}{B_3 + B_8},$$
$$\mathrm{RWI} = \frac{(B_3 + B_5) - (B_8 + B_{8A} + B_{12})}{B_3 + B_5 + B_8 + B_{8A} + B_{12}},$$
$$\mathrm{EWI} = \frac{B_3 - (B_8 + B_{11})}{B_3 + (B_8 + B_{11})},$$
where $B_3$, $B_5$, $B_8$, $B_{8A}$, $B_{11}$, and $B_{12}$ are the spectral reflectance values of the corresponding bands in the Sentinel-2 remote sensing image.
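The three indices are simple band-arithmetic operations; a small sketch of computing them from co-registered Sentinel-2 reflectance arrays is given below. The function name and the small stabilizing constant are our own additions.

```python
# Sketch: compute NDWI, RWI, and EWI from co-registered 10 m reflectance arrays.
import numpy as np

def water_indices(b3, b5, b8, b8a, b11, b12, eps=1e-6):
    """Return NDWI, RWI, and EWI arrays; eps guards against division by zero."""
    ndwi = (b3 - b8) / (b3 + b8 + eps)
    rwi = ((b3 + b5) - (b8 + b8a + b12)) / (b3 + b5 + b8 + b8a + b12 + eps)
    ewi = (b3 - (b8 + b11)) / (b3 + (b8 + b11) + eps)
    return ndwi, rwi, ewi
```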
In summary, the multimodal image dataset we constructed contains seven bands, namely, the VV-polarized images from Sentinel-1 and the B2, B3, B4, NDWI, RWI, and EWI images from Sentinel-2. We grouped the B2, B3, B4, NDWI, RWI, and EWI bands, which originate from the same Sentinel-2 source, as the MSI modality and the VV-polarized images as the SAR modality; a sample of the data is shown in Figure 2. According to our investigations, the deployment cycle of CA covers the whole year, whereas FRA is usually deployed around September each year and harvested and retrieved around May of the following year. Because CA is located above the seawater, its spectral features differ clearly from those of the background seawater in any season or month, and the model is able to learn its features well. In contrast, FRA is mostly located on or below the water surface, resulting in smaller spectral differences from the background seawater and making it susceptible to seasonal variations in spectral characteristics, which can affect the extraction accuracy. Therefore, we used monthly mean composite images for February and October, both within the cultivation period, to construct the dataset. This effectively avoided the instability of single-day images, improving the quality and usability of the data, and also helped the model learn spectral characteristics under different seasonal conditions. The image information obtained for each region is shown in Table 1. Additionally, we normalized the input spectral data to reduce the impacts of seasonal variations and differences in lighting conditions, thereby enhancing the robustness of the model.
Finally, the original multimodal image data were annotated at the pixel level by interpretation experts using high-resolution images from Google Earth. The annotations were categorized into FRA areas, CA areas, and background areas, resulting in the creation of corresponding original multimodal images and annotated images. Then, we performed sliding cropping (with a step size of 128) on the sample and labeled images in each region to generate 9063 pairs of 256 × 256 image and label blocks. After horizontal flipping, rotation, and vertical flipping, we rapidly expanded the data into 36,252 image–label pairs and divided them into training, validation, and testing sets with an approximate ratio of 3:1:1. Finally, we obtained 21,751 pairs of images for training, 7250 pairs of images for model validation, and 7250 pairs of images for model testing.
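The tiling and augmentation step can be summarized with the following sketch; file handling, annotation, and the 3:1:1 split bookkeeping are omitted, and the function names are hypothetical.

```python
# Sketch: 256 x 256 sliding-window tiling (stride 128) plus the three flip/rotation
# augmentations that quadruple the 9063 tile pairs to 36,252.
import numpy as np

def sliding_tiles(image, label, tile=256, stride=128):
    """Yield aligned image/label tiles; `image` is (C, H, W), `label` is (H, W)."""
    _, h, w = image.shape
    for top in range(0, h - tile + 1, stride):
        for left in range(0, w - tile + 1, stride):
            yield (image[:, top:top + tile, left:left + tile],
                   label[top:top + tile, left:left + tile])

def augment(img, lab):
    """Original tile plus horizontal flip, vertical flip, and 180-degree rotation."""
    yield img, lab
    yield img[:, :, ::-1], lab[:, ::-1]          # horizontal flip
    yield img[:, ::-1, :], lab[::-1, :]          # vertical flip
    yield img[:, ::-1, ::-1], lab[::-1, ::-1]    # rotation by 180 degrees
```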

3. Methodology

3.1. Architecture of the Proposed CMFPNet

The CMFPNet proposed in this study aims to enhance the accuracy of offshore aquaculture area extraction from multimodal remote sensing images. The network structure of CMFPNet is shown in Figure 3. The overall structure of CMFPNet is based on an encoder–decoder design. In the encoder, CMFPNet uses stacked inverted residual blocks (IRBs) [35] and LGPBs to construct two independent feature extraction branches: the lightweight IRBs extract local feature information in the shallow layers, and the LGPBs in the deep layers extract local and global semantic features, effectively capturing the overall scene structure and object-specific detail information in the image. At the junction of the encoder and decoder, the MAFFAB efficiently fuses the multimodal semantic information from the two feature extraction branches to acquire more comprehensive and accurate semantic features of the target objects. These fused features are fed into the decoder, which converts the high-level semantic feature maps back to the resolution of the original input image and combines them with the low-level features from the encoder to generate the final classification and extraction results for the offshore aquaculture areas.
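The layout described above can be summarized with the structural sketch below. It is not the authors' implementation: simple convolutions stand in for the IRB, LGPB, and MAFFAB blocks, and the channel sizes are illustrative assumptions.

```python
# Sketch: two-branch encoder-decoder layout with placeholder modules.
import torch
import torch.nn as nn

class CMFPNetSketch(nn.Module):
    def __init__(self, msi_ch=6, sar_ch=1, num_classes=3):
        super().__init__()
        # Shallow (IRB-like) and deep (LGPB-like) stages for each modality branch
        self.msi_shallow = nn.Sequential(nn.Conv2d(msi_ch, 32, 3, 2, 1), nn.ReLU())
        self.sar_shallow = nn.Sequential(nn.Conv2d(sar_ch, 32, 3, 2, 1), nn.ReLU())
        self.msi_deep = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.sar_deep = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        # Cross-modal fusion at the encoder-decoder junction (MAFFAB in the paper)
        self.fuse = nn.Conv2d(128, 64, 1)
        # Decoder: upsample fused deep features and combine with low-level features
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.decode = nn.Sequential(nn.Conv2d(64 + 32, 64, 3, 1, 1), nn.ReLU())
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, msi, sar):
        m1, s1 = self.msi_shallow(msi), self.sar_shallow(sar)   # 1/2 resolution
        m2, s2 = self.msi_deep(m1), self.sar_deep(s1)           # 1/4 resolution
        fused = self.fuse(torch.cat([m2, s2], dim=1))           # cross-modal fusion
        x = self.decode(torch.cat([self.up(fused), m1 + s1], dim=1))
        return self.up(self.head(x))                            # back to input resolution
```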

3.2. Local–Global Perception Block (LGPB)

Remote sensing images have wider geographic coverage than natural images, which means that target features can appear at any location in the image and at different scales, and the semantic features of the same category show a high degree of correlation that is independent of spatial location [36,37,38]. Global information helps to understand the overall scene structure, and local information helps to analyze specific objects in the scene in more detail. Therefore, we need to effectively model the global features of an image to enhance the recognition robustness of the model under the interference of complex and variable marine backgrounds. Additionally, focusing on local image features helps to improve the fineness of feature delineation, thus obtaining features with intraclass consistency and interclass differentiation. CNNs mainly perform local modeling, as the convolution operation considers only the relationships between neighboring pixels [39,40]. In contrast, the transformer can capture long-distance dependencies between different positions in the sequence in each layer, thus providing a global receptive field [41,42]. However, networks designed entirely based on the transformer require more computational resources and training data due to their high computational cost and lack of spatial inductive bias [37,43]. Therefore, many studies have combined CNNs with transformers to compensate for the slow convergence caused by the limited receptive field of CNNs and the insensitivity of transformers to spatial location [44,45,46].
We designed a lightweight LGPB to better fuse local and global features and effectively improve the ability of the network for remote sensing image analysis and understanding by combining the CNN structure and the transformer; the block structure is shown in Figure 4. The LGPB consists of three parts: local perception (LP), global perception (GP), and feature fusion (FF) modules. The LP module effectively captures the details and edge features of nearshore aquaculture areas, addressing the small-scale variations and subtle differences observed in images. This capability is crucial for identifying complex structures within nearshore aquaculture areas. For example, different types of aquaculture facilities and methods can lead to various local features in images. By acquiring global information, the GP module aids in understanding the positions and distribution of nearshore aquaculture areas across the entire input image, thereby enabling both large and small aquaculture areas to be more accurately identified. The FF module combines local and global information, allowing the model to focus on the details and overall structures of images, significantly enhancing the accuracy and robustness of nearshore aquaculture area recognition.
For the input feature map $F_{in} \in \mathbb{R}^{H \times W \times C}$, local feature extraction is performed by the LP module, which consists of depthwise separable convolutions [47], and global relationship modeling of the feature map is achieved by PoolFormer [48], a lightweight MetaFormer-style architecture with a structure similar to that of the transformer. PoolFormer replaces the multihead self-attention mechanism of the transformer with a pooling operation to mix information between tokens, which reduces the computational complexity and the number of parameters of the model while maintaining performance. After N cascaded PoolFormers, the features with global information are concatenated with the local features from the LP module along the channel dimension to achieve an initial fusion of local and global features. Finally, the FF module fuses the local and global semantic features from LP and GP using a 1 × 1 convolution, adjusts the number of channels to that of the original input, and performs a shortcut operation with the input $F_{in}$ to generate $F_{out} \in \mathbb{R}^{H \times W \times C}$ with local–global context. The above process can be expressed as follows:
$$F_{local} = C_{1 \times 1}(\mathrm{DW}(F_{in})),$$
$$\mathrm{PoolFormer}_{s1}(X) = P_{Avg}(\mathrm{Norm}(X)) + X,$$
$$\mathrm{PoolFormer}_{s2}(X) = \mathrm{MLP}(\mathrm{Norm}(X)) + X,$$
$$\mathrm{PoolFormer}(X) = \mathrm{PoolFormer}_{s2}(\mathrm{PoolFormer}_{s1}(X)),$$
$$F_{global} = \mathrm{PoolFormer}(C_{1 \times 1}(F_{in})),$$
$$F_{out} = F_{in} + C_{1 \times 1}(\mathcal{C}(F_{local}, F_{global})),$$
where $F_{in} \in \mathbb{R}^{H \times W \times C}$ is the input feature map, $\mathrm{DW}$ is the depthwise separable convolution with a $3 \times 3$ kernel, and $C_{1 \times 1}$ is the standard convolution with a $1 \times 1$ kernel. $\mathrm{PoolFormer}_{s1}$ and $\mathrm{PoolFormer}_{s2}$ represent the first and second stages within each PoolFormer, respectively, and $X$ is the input to each stage. $\mathrm{Norm}$ denotes the LayerNorm operation, $P_{Avg}$ denotes the average pooling operation, and $\mathcal{C}$ denotes concatenation along the channel dimension. $F_{out}$ is the feature map with local and global semantic information produced by the LGPB.
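The following is a minimal PyTorch sketch of the LGPB computation in the equations above, with a depthwise-separable local branch, a PoolFormer-style global branch, and a 1 × 1 fusion with a shortcut. The channel size, the number of cascaded PoolFormer blocks, and the use of GroupNorm as a LayerNorm stand-in are assumptions, not the authors' implementation.

```python
# Sketch: local-global perception block following the LGPB equations.
import torch
import torch.nn as nn

class PoolFormerBlock(nn.Module):
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)                    # LayerNorm-like normalization
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)     # token mixing by average pooling
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
                                 nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):
        x = x + self.pool(self.norm1(x))   # stage 1: P_Avg(Norm(X)) + X
        x = x + self.mlp(self.norm2(x))    # stage 2: MLP(Norm(X)) + X
        return x

class LGPBSketch(nn.Module):
    def __init__(self, dim, n_blocks=2):
        super().__init__()
        self.local = nn.Sequential(                          # LP: depthwise then pointwise conv
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1))
        self.global_in = nn.Conv2d(dim, dim, 1)
        self.global_branch = nn.Sequential(*[PoolFormerBlock(dim) for _ in range(n_blocks)])
        self.fuse = nn.Conv2d(2 * dim, dim, 1)               # FF: 1x1 fusion back to dim channels

    def forward(self, x):
        f_local = self.local(x)
        f_global = self.global_branch(self.global_in(x))
        return x + self.fuse(torch.cat([f_local, f_global], dim=1))   # shortcut with the input
```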

3.3. Multidimensional Adaptive Frequency Filtering Attention Block (MAFFAB)

In remote sensing imagery, MSI imagery provides rich information such as spectra, colors and boundaries, while SAR imagery provides rich texture features, independent of weather conditions, which are useful for feature classification and target detection [28,49]. Both modalities help to capture semantic information about the same features. Therefore, the target of multimodal learning is to efficiently fuse diverse information and features from different modalities. The introduction of an attention mechanism allows the selection of features from different modalities and the assignment of weights to them for the efficient fusion of multimodal features [50,51,52]. However, global average pooling (GAP) or global maximum pooling (GMP) in the attention mechanism is not sufficient to express the original richness of information of the features and will inhibit feature diversity, leading to the loss of some effective information [53]. Frequency-domain feature extraction has been shown to be a powerful method, containing not only low-frequency information equivalent to GAP but also a more comprehensive set of mid- and high-frequency components [54,55].
Therefore, to more comprehensively and accurately fuse multimodal feature information, we developed a multidimensional adaptive frequency filtering attention block (MAFFAB) that incorporates the fast Fourier transform (FFT) [56] technique, and the structure of the block is shown in Figure 5. The MAFFAB first preliminarily integrates feature maps from different modalities and then transforms them along different dimensions to convert multidimensional spatial-domain features into frequency-domain features. By dynamically filtering the contributions of different frequency components with a learnable weight matrix, it preserves the frequency components that are beneficial for identifying and extracting nearshore aquaculture areas. Subsequently, the MAFFAB maps the multidimensional frequency-domain features back to spatial-domain features and effectively integrates features across multiple dimensions, thereby efficiently aggregating semantic features derived from different modalities and significantly enhancing the recognition accuracy of the model.
The MAFFAB first applies a depthwise separable convolution with a $3 \times 3$ kernel to the features from the different modalities to obtain the shallowly fused features $F_{sf} \in \mathbb{R}^{H \times W \times C}$. Subsequently, $F_{sf}$ is permuted along the H and W dimensions, and a two-dimensional FFT is performed on these permuted features and on the original $F_{sf}$ in parallel to convert the features from the spatial domain to the frequency domain. We design learnable external weights of the same size as the frequency-domain feature maps and multiply them with the different frequency components to dynamically filter and select the effective frequency components. Finally, we use the inverse FFT to convert the frequency-domain features back to spatial-domain features, apply the inverse permutation to the features that were permuted along the H and W dimensions, and sum the features from the multiple dimensions. Then, we perform a shortcut operation with $F_{sf}$ to generate a unified multimodal fusion feature, $F_{df} \in \mathbb{R}^{H \times W \times C}$, which is fed into the network's decoder. The mechanism of the MAFFAB is expressed in the following equations:
$$F_{sf} = \mathrm{DW}(\mathcal{C}(F_{MSI}, F_{SAR})),$$
$$F_i^{(I,J)} = W_{(I,J)} \odot \mathcal{F}_{(I,J)}[F_i],$$
$$F_i = \mathcal{F}_{(I,J)}^{-1}[F_i^{(I,J)}],$$
$$F_{df} = \tfrac{1}{3}(F_H + F_W + F_C) + F_{sf},$$
where $\mathrm{DW}$ is a depthwise separable convolution with a $3 \times 3$ kernel and $\mathcal{C}$ is a concatenation operation along the channel dimension. $W_{(I,J)}$ and $\mathcal{F}_{(I,J)}$ denote the learnable external weights and the 2D FFT for the corresponding axes, respectively. When $i = 1$, $I$ and $J$ denote the channel–height axes; when $i = 2$, $I$ and $J$ denote the channel–width axes; and when $i = 3$, $I$ and $J$ denote the height–width axes. $\odot$ denotes an element-wise multiplication operation, and $\mathcal{F}_{(I,J)}^{-1}$ refers to the 2D inverse FFT.
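A minimal sketch of this frequency filtering in PyTorch is shown below. The fixed feature size required for the learnable weights, the complex-valued parameterization, and the permutation scheme are our own assumptions rather than the authors' implementation.

```python
# Sketch: MAFFAB-style multidimensional adaptive frequency filtering.
import torch
import torch.nn as nn

class MAFFABSketch(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        self.shallow = nn.Sequential(                          # DW fusion of concatenated modalities
            nn.Conv2d(2 * channels, 2 * channels, 3, padding=1, groups=2 * channels),
            nn.Conv2d(2 * channels, channels, 1))
        # One learnable filter per axis pair: height-width, channel-height, channel-width
        self.w_hw = nn.Parameter(torch.ones(channels, height, width, dtype=torch.cfloat))
        self.w_ch = nn.Parameter(torch.ones(width, channels, height, dtype=torch.cfloat))
        self.w_cw = nn.Parameter(torch.ones(height, channels, width, dtype=torch.cfloat))

    @staticmethod
    def _filter(x, weight, dims=(-2, -1)):
        freq = torch.fft.fft2(x, dim=dims)                     # spatial -> frequency domain
        return torch.fft.ifft2(weight * freq, dim=dims).real   # filter, then back to spatial domain

    def forward(self, f_msi, f_sar):
        f_sf = self.shallow(torch.cat([f_msi, f_sar], dim=1))  # (B, C, H, W)
        f_hw = self._filter(f_sf, self.w_hw)                                             # H-W axes
        f_ch = self._filter(f_sf.permute(0, 3, 1, 2), self.w_ch).permute(0, 2, 3, 1)     # C-H axes
        f_cw = self._filter(f_sf.permute(0, 2, 1, 3), self.w_cw).permute(0, 2, 1, 3)     # C-W axes
        return (f_hw + f_ch + f_cw) / 3.0 + f_sf               # average the branches + shortcut
```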

3.4. Experimental Setups

In the experiments in this paper, the unimodal approaches used the seven fused MSI and SAR bands as the network input, and the multimodal approaches employed the MSI modality and the corresponding SAR modality as separate inputs to a multibranch encoder. To ensure the fairness of the experiments, all methods were run under the same experimental conditions. The experiments were based on the PyTorch 2.0.2 framework on a single NVIDIA RTX 3080 graphics processing unit (GPU) for training and validation. In the training phase, the batch size was set to 16, Adam [57] was used as the optimizer, and the total number of training epochs was 50. The initial learning rate of each network was set to $1 \times 10^{-4}$ with a weight decay of 0.01, and cosine annealing [58] was used to dynamically adjust the learning rate. All the experimental models were trained with random data augmentation, including random scaling, flipping, and color transformations, to improve the generalizability of the networks. All networks used the equally weighted sum of the cross-entropy loss ($L_{ce}$) [59] and the Dice loss ($L_{dice}$) [60] as the loss function ($L_{Seg}$), balancing per-pixel classification accuracy with spatial overlap. Specifically, the expressions of $L_{ce}$ and $L_{dice}$ are as follows:
$$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{3} y_{i,c}\,\log \hat{y}_{i,c},$$
$$L_{dice} = 1 - \frac{2\,|\hat{y} \cap y|}{|\hat{y}| + |y|},$$
$$L_{Seg} = L_{dice} + L_{ce},$$
where $\hat{y}$ and $y$ denote the sets of predicted and true pixels, respectively. $N$ denotes the number of samples, and $y_{i,c}$ is the true label of category $c$ in sample $i$, which is 1 if the pixel belongs to category $c$ and 0 otherwise. $\hat{y}_{i,c}$ is the predicted probability of category $c$ in sample $i$.
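A compact sketch of this loss and of the training setup in PyTorch is given below; the softmax/one-hot handling and the smoothing constant are implementation assumptions.

```python
# Sketch: equal-weighted cross-entropy + Dice loss for 3-class segmentation.
import torch
import torch.nn.functional as F

def seg_loss(logits, target, eps=1e-6):
    """logits: (B, 3, H, W); target: (B, H, W) integer class indices."""
    ce = F.cross_entropy(logits, target)                        # pixel-wise cross-entropy
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)            # per-class Dice loss
    return ce + dice.mean()

# Training setup as described above (model is any nn.Module):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```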

3.5. Evaluation Metrics

We used three common evaluation metrics, the intersection over union ($IoU$), the F1 score ($F1$), and the Kappa coefficient ($Kappa$), to quantitatively assess the effectiveness of the proposed model. The formulas are as follows:
$$IoU_n = \frac{TP_n}{TP_n + FP_n + FN_n},$$
$$Pre_n = \frac{TP_n}{TP_n + FP_n}, \qquad Rec_n = \frac{TP_n}{TP_n + FN_n},$$
$$F1\text{-}Score_n = \frac{2 \times Pre_n \times Rec_n}{Pre_n + Rec_n},$$
$$OA_n = \frac{TP_n + TN_n}{TP_n + FP_n + TN_n + FN_n},$$
$$EA_n = \frac{(TP_n + FN_n)(TP_n + FP_n) + (TN_n + FN_n)(TN_n + FP_n)}{(TP_n + TN_n + FP_n + FN_n)^2},$$
$$Kappa_n = \frac{OA_n - EA_n}{1 - EA_n},$$
where $TP_n$, $FP_n$, $TN_n$, and $FN_n$ denote the numbers of true positives, false positives, true negatives, and false negatives, respectively, and $n$ is the category index.
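These per-class metrics follow directly from the confusion counts, as in the sketch below (the function name and the small smoothing constants are our own).

```python
# Sketch: per-class IoU, F1, and Kappa from predicted and reference class maps.
import numpy as np

def per_class_metrics(pred, gt, num_classes=3):
    """pred and gt are integer class maps of identical shape."""
    metrics = {}
    for n in range(num_classes):
        tp = np.sum((pred == n) & (gt == n))
        fp = np.sum((pred == n) & (gt != n))
        fn = np.sum((pred != n) & (gt == n))
        tn = np.sum((pred != n) & (gt != n))
        iou = tp / (tp + fp + fn + 1e-12)
        pre = tp / (tp + fp + 1e-12)
        rec = tp / (tp + fn + 1e-12)
        f1 = 2 * pre * rec / (pre + rec + 1e-12)
        total = tp + fp + tn + fn
        oa = (tp + tn) / total
        ea = ((tp + fn) * (tp + fp) + (tn + fn) * (tn + fp)) / (total ** 2)
        kappa = (oa - ea) / (1 - ea + 1e-12)
        metrics[n] = {"IoU": iou, "F1": f1, "Kappa": kappa}
    return metrics
```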

4. Results

4.1. Comparative Experiments

4.1.1. Quantitative Results

We compared CMFPNet with state-of-the-art semantic segmentation methods for a comprehensive evaluation. Among the comparison methods, we selected UNet [61], DeepLabV3+ [62], HRNet [63], and UNetFormer [64], which are designed for unimodal images. In addition, we selected four networks for multimodal remote sensing image segmentation, namely, FTransUNet [65], CMFFNet [66], BHENet [67], and JoiTriNet [51]. The main evaluation metrics for the two aquaculture area classes were the IoU, F1 score, and Kappa coefficient. We also calculated the averages of the evaluation metrics over the two foreground classes (FRA and CA, excluding the background) to assess the overall performance of each model. Moreover, to evaluate the differences between the networks in a more refined way, we assessed model complexity in terms of the number of floating-point operations (FLOPs) and the number of network parameters, where the input data size was 7 × 256 × 256. The quantitative evaluation results of the different methods are shown in Table 2 and Figure 6.
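The complexity figures can be reproduced along the following lines; the parameter count uses plain PyTorch, while the FLOPs estimate relies on an external profiler (thop is an assumed choice, not a tool named by the authors).

```python
# Sketch: model complexity measurement for the 7-band 256 x 256 input.
import torch

def count_parameters(model):
    """Number of trainable parameters of a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical profiling call for a two-branch model (MSI: 6 bands, SAR: 1 band):
# from thop import profile
# msi, sar = torch.randn(1, 6, 256, 256), torch.randn(1, 1, 256, 256)
# flops, params = profile(model, inputs=(msi, sar))
```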
The experimental results show that CMFPNet achieves the best values of the three metrics, IoU, F1, and Kappa, in the CA and FRA extraction tasks with lower model complexity. Compared with FTransUNet, the best-performing multimodal network, CMFPNet improves the mIoU, mF1, and mKappa by 1.2%, 0.69%, and 0.77%, respectively, in offshore aquaculture extraction, while its number of parameters and computational complexity are much smaller than those of FTransUNet. Compared with UNet, which has the strongest overall capability among the unimodal models, CMFPNet improves the mIoU, mF1, and mKappa by 4.4%, 2.64%, and 3.08%, respectively. In FRA extraction in particular, CMFPNet improves the IoU, F1, and Kappa values over UNet by 7.26%, 4.42%, and 5.28%, respectively.
Overall, the performances of the first four unimodal networks (UNet, DeepLabV3+, HRNet, and UNetFormer) varied in various metrics; however, the overall performance was generally slightly inferior to that of the multimodal networks (FTransUNet, CMFFNet, BHENet, and JoiTriNet). In particular, the advantages of multimodal networks are more obvious in the FRA area extraction task. However, multimodal network design also inevitably increases the number of model parameters and computational complexity. CMFPNet achieved excellent performance while maintaining low computational complexity through its optimized design, demonstrating its significant advantages in terms of efficiency and accuracy.

4.1.2. Qualitative Results

We visualized and compared the offshore aquaculture extraction results of each model on the test dataset to visually evaluate the segmentation performance of the proposed method; the extraction results are shown in Figure 7. In scenarios (a), (c), (e), (f), (g), and (h), each network roughly identifies and extracts the offshore aquaculture areas; however, different degrees of adhesion remain, so most of the models cannot segment the offshore aquaculture areas accurately. In contrast, our proposed CMFPNet benefits from learning complementary features from the two modalities, fully considers intra- and interclass differences, has finer segmentation granularity, effectively reduces adhesion in high-density aquaculture area segmentation, and provides more refined segmentation edges. In scenarios (b) and (d), the complexity of the mixed culture of shallow clouds and high-density, small-sized FRA and CA leads to varying degrees of missed detection and misclassification in all unimodal networks, with UNet and DeepLabV3+ in particular producing more misclassifications; their extraction results are unsatisfactory in dense, multicategory mixed-culture areas. Similarly, CMFFNet, BHENet, and JoiTriNet show small-scale omissions and misclassifications. In contrast, our proposed CMFPNet benefits from multiscale local and global semantic feature extraction and achieves an accurate and detailed delineation of multiple types of aquaculture areas without misclassification when dealing with high-density, small-target segmentation.
Overall, the qualitative and quantitative analyses of the network models show that the CMFPNet model is significantly better than the other models. CMFPNet performs well in dealing with different complex scenarios; it can provide high-precision segmentation results, effectively reduce the adhesion, misdetection, and omission phenomena, is particularly suitable for segmenting high-density and mixed aquaculture areas, and can accurately extract offshore aquaculture areas.

4.2. Ablation Experiments

We conducted ablation experiments to verify the effectiveness of each key element designed in the model and to further evaluate the contributions of the different components in CMFPNet. In these experiments, we first established a baseline model by removing both the LGPB and the MAFFAB. Subsequently, we incrementally added the LGPB and the MAFFAB to the baseline model. The experimental results are presented in Table 3. The results indicate that the classification accuracy for CA and FRA improves progressively with the addition of the different modules. After incorporating the LGPB into the baseline model, the classification accuracy for CA and FRA increases significantly due to the effective integration of local and global information. Specifically, the IoU, F1 score, and Kappa coefficient improve by 1.48%, 1.03%, and 1.18%, respectively, with a particularly notable enhancement in the classification accuracy for the FRA regions. Building on this, adding the MAFFAB, which yields CMFPNet, further enhances the classification accuracy for each category due to the efficient fusion of multimodal features at different levels. From the perspective of model complexity, the gradual incorporation of the LGPB and MAFFAB into our baseline model results in a slight increase in the number of model parameters but no significant increase in the number of FLOPs. This is because we integrated designs with low computational complexity, such as pooling operations and FFTs, into the LGPB and MAFFAB to minimize the increase in computational complexity. This not only helps to improve the prediction accuracy and generalizability of the model but also effectively controls the use of additional computational resources to ensure the efficiency and scalability of the model.
We performed qualitative analysis using GradCAM [68] on the baseline model and the models with progressively added LGPB and MAFFAB to further evaluate their effectiveness. This method generated gradient-weighted class activation maps for the network classification layer, producing attention heatmaps that visually explain the focus areas in the model. The results are shown in Figure 8.
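For reference, Grad-CAM for a segmentation network can be reproduced with plain PyTorch hooks, as in the sketch below; the choice of target layer, the aggregation of the class score over all pixels, and the two-input call signature are assumptions rather than the exact tooling used here.

```python
# Sketch: Grad-CAM heatmap for one class of a segmentation model.
import torch
import torch.nn.functional as F

def grad_cam(model, inputs, target_class, target_layer):
    """Return a [0, 1] class activation map upsampled to the output resolution."""
    store = {}
    h_fwd = target_layer.register_forward_hook(
        lambda m, i, o: store.update(act=o.detach()))
    h_bwd = target_layer.register_full_backward_hook(
        lambda m, gi, go: store.update(grad=go[0].detach()))

    model.eval()
    logits = model(*inputs)                     # (B, num_classes, H, W)
    model.zero_grad()
    logits[:, target_class].sum().backward()    # aggregate the class score over all pixels
    h_fwd.remove(); h_bwd.remove()

    weights = store["grad"].mean(dim=(2, 3), keepdim=True)            # GAP of gradients
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))   # weighted activations
    cam = F.interpolate(cam, size=logits.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)         # normalize to [0, 1]
```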
The visualization results indicate that, while the baseline model can classify CA and FRA, some details and boundary features remain unclear. After adding LGPB, the ability of the model to capture local and global image features is enhanced, resulting in more prominent details and clearer, sharper boundaries, particularly in FRA regions. The focus of the network on different aquaculture areas also increases. Adding MAFFAB on top of the baseline model and LGPB further improves the image detail representation, significantly enhancing the fine feature and edge detection in complex scenes. As shown, the boundaries of the aquaculture areas in the images become clearer, the color contrast is greater, and the details are more abundant.
Overall, progressively adding LGPB and MAFFAB to the baseline model resulted in significant improvements in both quantitative metrics and visual comparisons. The delineation of aquaculture area boundaries became clearer, proving the effectiveness of the different module designs in CMFPNet. This approach enhanced the accuracy of classifying and extracting multiclass offshore aquaculture areas, especially under complex marine conditions.

5. Discussion

5.1. Influence of Different Band Data Combinations

In constructing the dataset for the CMFPNet model, we selected only the B2, B3, and B4 bands (visible light) from the Sentinel-2 data, along with the NDWI, RWI, and EWI water spectral indices, as inputs for the MSI modality. We conducted experiments using different combinations of Sentinel-2 bands as inputs for the MSI modality to explore the optimal data selection method to validate the rationality and effectiveness of our dataset. The experiments were divided into three groups. Group 1 included combinations of Sentinel-2 B2, B3, and B4 bands with Sentinel-1 VV polarization. Group 2 included Sentinel-2 B2, B3, B4, B5, B6, B7, B8, B8A, B11, and B12 bands combined with Sentinel-1 VV polarization. Group 3 included the combination of Sentinel-2 B2, B3, and B4 bands with the NDWI, RWI, and EWI spectral indices as MSI modality data and Sentinel-1 VV polarization. The experimental results are presented in Table 4.
The experimental results from the three groups indicate that Group 3 performs the best across all the metrics, especially excelling in nearshore CA and overall averages compared to the other groups. Although Group 2 utilizes more bands, its performance is inferior to that of Group 3, which uses fewer bands supplemented with spectral indices. This suggests that simply increasing the number of bands does not enhance model performance and may introduce data redundancy. The selection of key bands (B2, B3, and B4) and three effective water spectral indices (NDWI, RWI, and EWI) in Group 3 reduces the data volume while maximizing the advantages of MSI data and providing more effective and enriched features, thereby improving model accuracy. Therefore, the rational selection and combination of spectral bands and indices can significantly enhance the extraction of characteristic information in nearshore aquaculture areas and improve overall model performance. In the future, we will continue to explore different data combination methods to further optimize and enrich the features of nearshore aquaculture areas, ultimately enhancing the overall performance of the model.

5.2. Application of the Model

We acquired a collection of images composed of multiple image blocks from Sentinel-1 (SAR) and Sentinel-2 (MSI) over the coastal waters of Liaoning Province in November 2023 using the GEE platform to further verify the reliability and generalizability of the CMFPNet model. Each image collection was stitched together, and data preprocessing and selection were performed according to the dataset creation methods described above. The coverage of these images is shown in Figure 9a. This process ensured the consistency and high quality of the image data, allowing for a more accurate reflection of the actual conditions in the coastal aquaculture areas of Liaoning Province.
The offshore aquaculture areas of Liaoning Province in 2023 were subsequently extracted with the trained CMFPNet model. The distribution of the aquaculture areas within the study region is shown in Figure 9b. According to the extraction results produced by CMFPNet, the area of CA in Liaoning Province in November 2023 was approximately 87.98 km², and the area of FRA was approximately 651.44 km². The aquaculture areas are mainly distributed in the northern part of the Yellow Sea, where FRA occupies a large sea area alongside CA. This type of aquaculture forms extensive, contiguous farming zones, exhibiting significant spatial continuity and intensive farming characteristics. In contrast, within Liaodong Bay of the Bohai Sea, FRA is relatively rare and is replaced by small-scale CA scattered across various regions. This CA is generally concentrated in nearshore waters with a relatively dispersed distribution, which adapts to the complex marine environment and hydrological conditions of the area. In Pulandian Bay within Liaodong Bay, we observed large-scale CA areas, demonstrating the advanced development and intensive management of cage aquaculture in the region. Compared with the original images from 2023, the results extracted by the CMFPNet model accurately match the distribution of offshore aquaculture zones in the remote sensing images, achieving the high-precision extraction of these areas in Liaoning Province. CMFPNet effectively identifies and classifies FRA and CA regions even when faced with challenges such as water color variations, suspended sediments, and cloud cover. In densely farmed FRA and CA areas, CMFPNet successfully reduces adhesion, maintaining clear and distinct classification results even in high-density regions. However, it should be noted that some false positives and false negatives may still occur at the edges of images and at the seams of multiple stitched images. Additionally, irregular small noise points can appear in offshore areas. These issues therefore need further optimization in future research. By increasing the quantity of both positive and negative samples for offshore aquaculture areas, the classification performance and robustness of the model can be enhanced.
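The reported areas follow directly from the 10 m pixel size; a small sketch of the conversion is given below (the function name is hypothetical).

```python
# Sketch: class area in km^2 from a 10 m resolution classification map.
import numpy as np

def class_area_km2(class_map, class_id, pixel_size_m=10.0):
    """Each 10 m pixel covers 100 m^2, so the area in km^2 is the pixel count * 1e-4."""
    n_pixels = int(np.sum(class_map == class_id))
    return n_pixels * (pixel_size_m ** 2) / 1e6
```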
In the future, we will further optimize the CMFPNet model to address more complex marine environments and diverse aquaculture patterns, thereby supporting the regulation and sustainable development of marine aquaculture. On the one hand, we plan to use remote sensing images derived from different periods to dynamically monitor and analyze the distributions of various types of aquaculture zones. By comparing data acquired from different periods, we can reveal the spatiotemporal evolution characteristics and development trends of aquaculture zones. This will provide efficient and accurate data support for fishery regulatory authorities, help formulate scientific and reasonable fishery management policies and promote the sustainable development of marine aquaculture. On the other hand, we will conduct quantitative assessments of areas with different types of aquaculture zones to evaluate the carbon sequestration potential of the fishery industry. This work will focus on analysing the carbon fixation potential of different types of aquaculture zones and assessing the long-term impacts and contributions of aquaculture to the marine environment. Such research can provide scientific evidence for policymakers, promoting low-carbon and environmentally friendly aquaculture practices and thus mitigating the negative impacts of climate change on marine ecosystems.

6. Conclusions

In this study, we proposed the CMFPNet model for extracting offshore aquaculture areas by integrating two heterogeneous types of remote sensing images, Sentinel-1 and Sentinel-2. Compared with existing research, we utilized Sentinel-2's B2, B3, and B4 bands along with the NDWI, RWI, and EWI spectral indices and combined them with Sentinel-1's VV polarization mode to construct a remote sensing image dataset containing multimodal data. This approach enriched the characteristics of offshore aquaculture areas and provided the model with more diverse and comprehensive land feature information. In the CMFPNet model, we employed two independent feature extraction branches to extract the semantic information of offshore aquaculture areas from the different remote sensing image modalities. Additionally, we designed the LGPB to capture both local and global feature information to enhance the recognition ability of the model under complex and variable background interference. Furthermore, we introduced the MAFFAB, which efficiently aggregates semantic information from different modalities using frequency-domain information and attention mechanisms, enabling the precise identification and extraction of various types of offshore aquaculture areas. To evaluate the performance of CMFPNet, we conducted aquaculture area extraction experiments on the test dataset and compared CMFPNet with four neural networks based on single-modal designs and four based on multimodal designs. Through quantitative and qualitative comparative analyses, our CMFPNet model demonstrated superior performance, achieving mIoU, mF1, and mKappa values of 87.66%, 93.41%, and 92.59%, respectively, in the extraction and classification of the FRA and CA regions in offshore aquaculture. Moreover, CMFPNet outperformed the other methods in terms of model complexity and successfully achieved a good balance between performance and resource consumption. We conducted ablation experiments to validate the effectiveness of the designed LGPB and MAFFAB, which confirmed that these two modules effectively enhance the accuracy of offshore aquaculture area extraction. Furthermore, we experimented with different combinations of Sentinel-2 bands to explore the impact of the data on the extraction results, demonstrating the rationality of our dataset design, which effectively alleviates data redundancy while providing diversified feature information, thereby improving the extraction accuracy of the model. Finally, we applied the trained CMFPNet model to accurately extract offshore aquaculture areas in Liaoning Province in November 2023. However, in practical applications, we observed minor instances of false negatives and false positives in the CMFPNet extraction results. Therefore, in the future, we aim to further enhance data quality, explore different data combination methods, collect more samples, and optimize the network model, with the goal of extending CMFPNet to larger-scale offshore aquaculture monitoring and extraction tasks. Through the exploration and validation in this study, we demonstrated that the effective fusion of multimodal remote sensing data significantly improves the accuracy of offshore aquaculture area extraction and that CMFPNet has substantial potential in practical applications.
This provides robust technical support for the scientific planning and management of offshore aquaculture areas and lays a solid foundation for further research on multimodal remote sensing image segmentation.

Author Contributions

Conceptualization, F.W. and Z.C.; methodology, H.Y.; software, J.W.; validation, F.W., Y.H. and Z.C.; formal analysis, F.W.; investigation, J.Z.; resources, J.W.; data curation, Y.H.; writing—original draft preparation, H.Y.; writing—review and editing, F.W.; visualization, H.Y.; supervision, Z.C.; project administration, F.W.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 42101257.

Data Availability Statement

Our code is available at https://github.com/Harsh-M1/CMFPNet (accessed on 20 June 2024). The datasets generated during the study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to sincerely thank the editors and the anonymous reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FRA	Floating raft aquaculture
CA	Cage aquaculture
CMFPNet	Cross-modal multidimensional frequency perception network
LGPB	Local–global perception block
MAFFAB	Multidimensional adaptive frequency filtering attention block
MSI	Multispectral image
SAR	Synthetic aperture radar
GEE	Google Earth Engine
IRB	Inverted residual block

Appendix A

In this study, we conducted a comparative analysis of Sentinel-2 imagery from February and October, focusing on the spectral reflectance differences between the two types of offshore aquaculture areas, FRA and CA, and seawater across the various bands. The comparative results are shown in Figure A1. The results indicate that CA, whose structures are typically situated on the sea surface, shows significantly higher spectral reflectance values across all bands than the surrounding seawater. Conversely, FRA, whose structures are mostly submerged, exhibits notable spectral reflectance differences from the background seawater primarily in the B2, B3, and B4 bands, with smaller differences observed in the other bands. Introducing the water indices (NDWI, RWI, and EWI) further enhances the distinguishability among these three types. Therefore, we selected the B2, B3, and B4 bands from Sentinel-2 imagery, as well as the NDWI, RWI, and EWI, to construct the multispectral imaging (MSI) dataset used in this study.
Figure A1. Comparison of the spectral reflectance and water body index values across various bands in Sentinel-2 imagery for two types of nearshore aquaculture (FRA, CA) and background seawater.

References

  1. FAO. Fishery and Aquaculture Statistics—Yearbook 2020; FAO Yearbook of Fishery and Aquaculture Statistics; FAO: Rome, Italy, 2023.
  2. Zhang, C.; Meng, Q.; Chu, J.; Liu, G.; Wang, C.; Zhao, Y.; Zhao, J. Analysis on the status of mariculture in China and the effectiveness of mariculture management in the Bohai Sea. Mar. Environ. Sci. 2021, 40, 887–894.
  3. Costello, C.; Cao, L.; Gelcich, S.; Cisneros-Mata, M.Á.; Free, C.M.; Froehlich, H.E.; Golden, C.D.; Ishimura, G.; Maier, J.; Macadam-Somer, I.; et al. The future of food from the sea. Nature 2020, 588, 95–100.
  4. Long, L.; Liu, H.; Cui, M.; Zhang, C.; Liu, C. Offshore aquaculture in China. Rev. Aquac. 2024, 16, 254–270.
  5. Yucel-Gier, G.; Eronat, C.; Sayin, E. The impact of marine aquaculture on the environment; the importance of site selection and carrying capacity. Agric. Sci. 2019, 10, 259–266.
  6. Dunne, A.; Carvalho, S.; Morán, X.A.G.; Calleja, M.L.; Jones, B. Localized effects of offshore aquaculture on water quality in a tropical sea. Mar. Pollut. Bull. 2021, 171, 112732.
  7. Simone, M.; Vopel, K. The need for proactive environmental management of offshore aquaculture. Rev. Aquac. 2024, 16, 603–607.
  8. Rubio-Portillo, E.; Villamor, A.; Fernandez-Gonzalez, V.; Anton, J.; Sanchez-Jerez, P. Exploring changes in bacterial communities to assess the influence of fish farming on marine sediments. Aquaculture 2019, 506, 459–464.
  9. Chen, G.; Bai, J.; Bi, C.; Wang, Y.; Cui, B. Global greenhouse gas emissions from aquaculture: A bibliometric analysis. Agric. Ecosyst. Environ. 2023, 348, 108405.
  10. Mahdavi, S.; Salehi, B.; Granger, J.; Amani, M.; Brisco, B.; Huang, W. Remote sensing for wetland classification: A comprehensive review. GISci. Remote Sens. 2018, 55, 623–658.
  11. Sun, W.; Chen, C.; Liu, W.; Yang, G.; Meng, X.; Wang, L.; Ren, K. Coastline extraction using remote sensing: A review. GISci. Remote Sens. 2023, 60, 2243671.
  12. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36.
  13. Kang, J.; Sui, L.; Yang, X.; Liu, Y.; Wang, Z.; Wang, J.; Yang, F.; Liu, B.; Ma, Y. Sea surface-visible aquaculture spatial-temporal distribution remote sensing: A case study in Liaoning province, China from 2000 to 2018. Sustainability 2019, 11, 7186.
  14. Hou, T.; Sun, W.; Chen, C.; Yang, G.; Meng, X.; Peng, J. Marine floating raft aquaculture extraction of hyperspectral remote sensing images based decision tree algorithm. Int. J. Appl. Earth Obs. Geoinf. 2022, 111, 102846.
  15. Fu, Y.; Zhang, W.; Bi, X.; Wang, P.; Gao, F. TCNet: A Transformer–CNN Hybrid Network for Marine Aquaculture Mapping from VHSR Images. Remote Sens. 2023, 15, 4406.
  16. Ai, B.; Xiao, H.; Xu, H.; Yuan, F.; Ling, M. Coastal aquaculture area extraction based on self-attention mechanism and auxiliary loss. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2022, 16, 2250–2261.
  17. Amani, M.; Mohseni, F.; Layegh, N.F.; Nazari, M.E.; Fatolazadeh, F.; Salehi, A.; Ahmadi, S.A.; Ebrahimy, H.; Ghorbanian, A.; Jin, S.; et al. Remote sensing systems for ocean: A review (part 2: Active systems). IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2022, 15, 1421–1453.
  18. Gao, L.; Wang, C.; Liu, K.; Chen, S.; Dong, G.; Su, H. Extraction of floating raft aquaculture areas from sentinel-1 SAR images by a dense residual U-Net model with pre-trained Resnet34 as the encoder. Remote Sens. 2022, 14, 3003.
  19. Zhang, Y.; Wang, C.; Chen, J.; Wang, F. Shape-constrained method of remote sensing monitoring of marine raft aquaculture areas on multitemporal synthetic sentinel-1 imagery. Remote Sens. 2022, 14, 1249.
  20. Xiao, S.; Wang, P.; Diao, W.; Rong, X.; Li, X.; Fu, K.; Sun, X. MoCG: Modality Characteristics-Guided Semantic Segmentation in Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–18.
  21. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926. [Google Scholar] [CrossRef]
  22. Wu, X.; Hong, D.; Chanussot, J. Convolutional neural networks for multimodal remote sensing data classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–10. [Google Scholar] [CrossRef]
  23. Ma, M.; Ma, W.; Jiao, L.; Liu, X.; Li, L.; Liu, F.; Feng, Z.; Yang, S. A multimodal hyper-fusion transformer for remote sensing image classification. Inf. Fusion 2023, 96, 66–79. [Google Scholar] [CrossRef]
  24. Li, Y.; Zhou, Y.; Zhang, Y.; Zhong, L.; Wang, J.; Chen, J. DKDFN: Domain knowledge-guided deep collaborative fusion network for multimodal unitemporal remote sensing land cover classification. ISPRS J. Photogramm. Remote Sens. 2022, 186, 170–189. [Google Scholar] [CrossRef]
  25. Fan, X.; Zhou, W.; Qian, X.; Yan, W. Progressive adjacent-layer coordination symmetric cascade network for semantic segmentation of multimodal remote sensing images. Expert Syst. Appl. 2024, 238, 121999. [Google Scholar] [CrossRef]
  26. Li, G.; Liu, C.; Liu, Y.; Yang, J.; Zhang, X.; Guo, K. Effects of climate, disturbance and soil factors on the potential distribution of Liaotung oak (Quercus wutaishanica Mayr) in China. Ecol. Res. 2012, 27, 427–436. [Google Scholar] [CrossRef]
  27. Hu, J.; Huang, M.; Yu, H.; Li, Q. Research on extraction method of offshore aquaculture area based on Sentinel-2 remote sensing imagery. Mar. Environ. Sci. 2022, 41, 619–627. [Google Scholar]
  28. Hafner, S.; Ban, Y.; Nascetti, A. Unsupervised domain adaptation for global urban extraction using Sentinel-1 SAR and Sentinel-2 MSI data. Remote Sens. Environ. 2022, 280, 113192. [Google Scholar] [CrossRef]
  29. Mullissa, A.; Vollrath, A.; Odongo-Braun, C.; Slagter, B.; Balling, J.; Gou, Y.; Gorelick, N.; Reiche, J. Sentinel-1 sar backscatter analysis ready data preparation in google earth engine. Remote Sens. 2021, 13, 1954. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Wang, C.; Ji, Y.; Chen, J.; Deng, Y.; Chen, J.; Jie, Y. Combining segmentation network and nonsubsampled contourlet transform for automatic marine raft aquaculture area extraction from sentinel-1 images. Remote Sens. 2020, 12, 4182. [Google Scholar] [CrossRef]
  31. Wang, D.; Han, M. SA-U-Net++: SAR marine floating raft aquaculture identification based on semantic segmentation and ISAR augmentation. J. Appl. Remote Sens. 2021, 15, 016505. [Google Scholar] [CrossRef]
  32. Gao, B.C. NDWI—A normalized difference water index for remote sensing of vegetation liquid water from space. Remote Sens. Environ. 1996, 58, 257–266. [Google Scholar] [CrossRef]
  33. Wu, Q.; Wang, M.; Shen, Q.; Yao, Y.; Li, J.; Zhang, F.; Zhou, Y. Small water body extraction method based on Sentinel-2 satellite multi-spectral remote sensing image. Natl. Remote Sens. Bull. 2022, 26, 781–794. [Google Scholar] [CrossRef]
  34. Yan, P.; Zhang, Y.; Zhang, Y. A study on information extraction of water system in semi-arid regions with the enhanced water index (EWI) and GIS based noise remove techniques. Remote Sens. Inf. 2007, 6, 62–67. [Google Scholar]
  35. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  36. Ni, Y.; Liu, J.; Chi, W.; Wang, X.; Li, D. CGGLNet: Semantic Segmentation Network for Remote Sensing Images Based on Category-Guided Global-Local Feature Interaction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
  37. Song, W.; Zhou, X.; Zhang, S.; Wu, Y.; Zhang, P. GLF-Net: A Semantic Segmentation Model Fusing Global and Local Features for High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 4649. [Google Scholar] [CrossRef]
  38. Liu, D.; Zhang, J.; Li, T.; Qi, Y.; Wu, Y.; Zhang, Y. A Lightweight Object Detection and Recognition Method Based on Light Global-Local Module for Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  39. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [Google Scholar] [CrossRef]
  40. Cong, S.; Zhou, Y. A review of convolutional neural network architectures and their optimizations. Artif. Intell. Rev. 2023, 56, 1905–1969. [Google Scholar] [CrossRef]
  41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  42. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41. [Google Scholar] [CrossRef]
  43. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  44. Wadekar, S.N.; Chaurasia, A. Mobilevitv3: Mobile-friendly vision transformer with simple and effective fusion of local, global and input features. arXiv 2022, arXiv:2209.15159. [Google Scholar]
  45. Yang, Y.; Jiao, L.; Li, L.; Liu, X.; Liu, F.; Chen, P.; Yang, S. LGLFormer: Local-global Lifting Transformer for Remote Sensing Scene Parsing. IEEE Trans. Geosci. Remote Sens. 2023, 62, 1–13. [Google Scholar] [CrossRef]
  46. Xue, J.; He, D.; Liu, M.; Shi, Q. Dual network structure with interweaved global-local feature hierarchy for transformer-based object detection in remote sensing image. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2022, 15, 6856–6866. [Google Scholar] [CrossRef]
  47. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  48. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
  49. Hafner, S.; Ban, Y.; Nascetti, A. Semi-Supervised Urban Change Detection Using Multi-Modal Sentinel-1 SAR and Sentinel-2 MSI Data. Remote Sens. 2023, 15, 5135. [Google Scholar] [CrossRef]
  50. Zheng, A.; He, J.; Wang, M.; Li, C.; Luo, B. Category-wise fusion and enhancement learning for multimodal remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  51. Liu, X.; Zou, H.; Wang, S.; Lin, Y.; Zuo, X. Joint Network Combining Dual-Attention Fusion Modality and Two Specific Modalities for Land Cover Classification Using Optical and SAR Images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2023, 17, 3236–3250. [Google Scholar] [CrossRef]
  52. Wu, W.; Guo, S.; Shao, Z.; Li, D. CroFuseNet: A semantic segmentation network for urban impervious surface extraction based on cross fusion of optical and SAR images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2023, 16, 2573–2588. [Google Scholar] [CrossRef]
  53. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 783–792. [Google Scholar]
  54. Ruan, J.; Xie, M.; Xiang, S.; Liu, T.; Fu, Y. MEW-UNet: Multi-axis representation learning in frequency domain for medical image segmentation. arXiv 2022, arXiv:2210.14007. [Google Scholar]
  55. Zhang, S.; Li, H.; Li, L.; Lu, J.; Zuo, Z. A high-capacity steganography algorithm based on adaptive frequency channel attention networks. Sensors 2022, 22, 7844. [Google Scholar] [CrossRef] [PubMed]
  56. Duhamel, P.; Vetterli, M. Fast Fourier transforms: A tutorial review and a state of the art. Signal Process. 1990, 19, 259–299. [Google Scholar] [CrossRef]
  57. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  58. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  59. Ruby, U.; Yendapalli, V. Binary cross entropy with deep learning technique for image classification. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 4. [Google Scholar]
  60. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3; Springer: Cham, Switzerland, 2017; pp. 240–248. [Google Scholar]
  61. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  62. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  63. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  64. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  65. Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  66. Guo, S.; Wu, W.; Shao, Z.; Teng, J.; Li, D. Extracting urban impervious surface based on optical and SAR images cross-modal multi-scale features fusion network. Int. J. Digit. Earth 2024, 17, 2301675. [Google Scholar] [CrossRef]
  67. Cai, B.; Shao, Z.; Huang, X.; Zhou, X.; Fang, S. Deep learning-based building height mapping using Sentinel-1 and Sentinel-2 data. Int. J. Appl. Earth Obs. Geoinf. 2023, 122, 103399. [Google Scholar]
  68. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. The geographical locations and appearances of the study area and data sampling areas are as follows: (A) Gaoling Town Sea Area, (B) Bayuquan District Sea Area, (C) Pulandian Bay Sea Area, (D) Jinshitan Bay Sea Area, (E) Changhai County Sea Area, and (F) Shicheng Township Sea Area.
Figure 2. A visualization of images acquired from the dataset with different modalities. From left to right: labeled image (yellow for a CA area, purple for an FRA area, and blue for a background area); composite image of B2, B3, and B4 from the MSI mode; NDWI, RWI, and EWI images from the MSI mode; and VV images from the SAR mode.
Figure 3. The CMFPNet structure.
Figure 4. The structure of the LGPB.
Figure 5. The structure of the MAFFAB.
Figure 6. Comparison of the complexity of different network models. (The top-left area indicates the optimal balance between performance and network size).
Figure 7. Visualization comparison results of various network models. In the image, (a–h) depict various scenes. The first four columns display images from different modalities and labeled images from the dataset, while the last nine columns show the extraction results of different models for each scene.
Figure 8. Visualization comparison results of the ablation experiments. The color gradient from blue to red indicates increasing network attention to that area.
Figure 9. Extraction results of offshore aquaculture in Liaoning Province. Subfigure (a) shows the image coverage area. Subfigure (b) shows the geographic location of the aggregated areas of offshore aquaculture, where (A–G) show details of the aggregated areas. In the images, purple represents the FRA area, yellow represents the CA area, blue represents the background area, and the red rectangular boxes highlight key comparison areas.
Table 1. Overview of imagery from each sampling area.

Area | Image Size | Geographical Scope | Image Dates
Gaoling Town | 5749 × 3456 | 119.85–120.46°E, 39.92–40.21°N | February and October 2023
Bayuquan District | 2582 × 6453 | 121.86–122.15°E, 39.94–40.40°N | February and October 2023
Pulandian Bay | 4400 × 2451 | 121.27–121.67°E, 39.16–39.38°N | February and October 2023
Jinshitan Bay | 3899 × 2933 | 121.91–122.26°E, 38.93–39.19°N | February and October 2023
Changhai County | 6643 × 2795 | 122.26–122.86°E, 39.12–39.37°N | February and October 2023
Shicheng Township | 5007 × 3011 | 122.84–123.29°E, 39.32–39.59°N | February and October 2023
Table 2. Comparison of different methods on FRA and CA extraction.
Methods | FRA IoU | FRA F1 | FRA Kappa | CA IoU | CA F1 | CA Kappa | Mean IoU | Mean F1 | Mean Kappa
UNet [61] | 88.75 | 94.04 | 93.94 | 77.76 | 87.49 | 85.08 | 83.26 | 90.77 | 89.51
DeepLabV3+ [62] | 80.32 | 89.08 | 88.90 | 69.95 | 82.32 | 78.97 | 75.14 | 85.70 | 83.94
HRNet [63] | 82.74 | 90.56 | 90.40 | 76.91 | 86.95 | 84.44 | 79.83 | 88.76 | 87.42
UNetFormer [64] | 82.10 | 90.17 | 90.01 | 79.77 | 88.75 | 86.59 | 80.94 | 89.46 | 88.30
CMFFNet [66] | 83.09 | 90.77 | 90.61 | 81.58 | 89.85 | 87.91 | 82.34 | 90.31 | 89.26
BHENet [67] | 82.01 | 90.12 | 89.95 | 77.07 | 87.05 | 84.59 | 79.54 | 88.59 | 87.27
JoiTriNet [51] | 81.69 | 89.92 | 89.75 | 81.10 | 89.56 | 87.51 | 81.40 | 89.74 | 88.63
FTransUNet [65] | 89.08 | 94.23 | 94.13 | 83.83 | 91.20 | 89.51 | 86.46 | 92.72 | 91.82
CMFPNet (ours) | 90.29 | 94.90 | 94.81 | 85.02 | 91.91 | 90.36 | 87.66 | 93.41 | 92.59
The best performance is marked in bold.
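For reference, the sketch below shows how the per-class IoU, F1, and Kappa values reported in Tables 2–4 are conventionally computed from a binary confusion matrix and then averaged over the FRA and CA classes to obtain mIoU, mF1, and mKappa. This is a hedged illustration under standard metric definitions, not the authors' evaluation code; the function names are assumptions.

```python
import numpy as np

def binary_metrics(pred: np.ndarray, target: np.ndarray):
    """pred, target: boolean masks for one class. Returns (IoU, F1, Kappa) in %."""
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    tn = np.logical_and(~pred, ~target).sum()
    n = tp + fp + fn + tn
    iou = tp / (tp + fp + fn + 1e-9)
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-9)
    po = (tp + tn) / n                                                # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)    # chance agreement
    kappa = (po - pe) / (1 - pe + 1e-9)
    return 100 * iou, 100 * f1, 100 * kappa

def mean_scores(preds, targets):
    """Average the per-class scores over the aquaculture classes (FRA, CA)."""
    scores = np.array([binary_metrics(p, t) for p, t in zip(preds, targets)])
    return scores.mean(axis=0)  # (mIoU, mF1, mKappa)
```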
Table 3. Quantitative results obtained in ablation experiments concerning the different blocks contained in CMFPNet.

Methods | FRA IoU | FRA F1 | FRA Kappa | CA IoU | CA F1 | CA Kappa | Mean IoU | Mean F1 | Mean Kappa | Params (M) | FLOPs (G)
Baseline | 88.15 | 93.71 | 93.60 | 81.70 | 89.61 | 87.61 | 84.93 | 91.66 | 90.61 | 11.70 | 20.50
Baseline + LGPB | 89.07 | 94.22 | 94.12 | 83.74 | 91.15 | 89.46 | 86.41 | 92.69 | 91.79 | 16.92 | 21.17
Baseline + LGPB + MAFFAB | 90.29 | 94.90 | 94.81 | 85.02 | 91.91 | 90.36 | 87.66 | 93.41 | 92.59 | 22.82 | 21.28
The best performance is marked in bold.
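The Params (M) column in Table 3 can be reproduced for any PyTorch model with a simple parameter count, sketched below under the assumption that the networks are implemented as torch.nn.Module objects; the FLOPs (G) column is typically obtained with an external profiler (e.g., thop or fvcore) and is not reproduced here.

```python
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Trainable parameters in millions, as reported in the Params (M) column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Example usage (hypothetical model object):
# model = CMFPNet(...)
# print(f"Params: {count_parameters_m(model):.2f} M")
```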
Table 4. Quantitative comparison results of different data combination methods.

Groups | Bands | FRA IoU | FRA F1 | FRA Kappa | CA IoU | CA F1 | CA Kappa | Mean IoU | Mean F1 | Mean Kappa
Group 1 | 4 | 88.00 | 93.61 | 93.50 | 83.52 | 91.02 | 89.31 | 85.76 | 92.32 | 91.41
Group 2 | 11 | 88.58 | 93.94 | 93.84 | 84.44 | 91.56 | 89.89 | 86.51 | 92.75 | 91.87
Group 3 | 7 | 90.29 | 94.90 | 94.81 | 85.02 | 91.91 | 90.36 | 87.66 | 93.41 | 92.59
The best performance is marked in bold.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

