Article

CIS: A Coral Instance Segmentation Network Model with Novel Upsampling, Downsampling, and Fusion Attention Mechanism

1 School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
2 Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning 530004, China
3 School of Marine Sciences, Guangxi University, Nanning 530004, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(9), 1490; https://doi.org/10.3390/jmse12091490
Submission received: 9 June 2024 / Revised: 20 August 2024 / Accepted: 26 August 2024 / Published: 28 August 2024
(This article belongs to the Section Marine Biology)

Abstract

Coral segmentation poses unique challenges due to its irregular morphology and camouflage-like characteristics. These factors often result in low precision, large model parameters, and poor real-time performance. To address these issues, this paper proposes a novel coral instance segmentation (CIS) network model. Initially, we designed a novel downsampling module, ADown_HWD, which operates at multiple resolution levels to extract image features, thereby preserving crucial information about coral edges and textures. Subsequently, we integrated the bi-level routing attention (BRA) mechanism into the C2f module to form the C2f_BRA module within the neck network. This module effectively removes redundant information, enhancing the ability to distinguish coral features and reducing computational redundancy. Finally, dynamic upsampling, Dysample, was introduced into the CIS to better retain the rich semantic and key feature information of corals. Validation on our self-built dataset demonstrated that the CIS network model significantly outperforms the baseline YOLOv8n model, with improvements of 6.3% and 10.5% in PB and PM and 2.3% and 2.4% in mAP50B and mAP50M, respectively. Furthermore, the reduction in model parameters by 10.1% correlates with a notable 10.7% increase in frames per second (FPS) to 178.6, thus effectively meeting real-time operational requirements.

1. Introduction

Coral reef ecosystems are of paramount importance to marine biodiversity and coastal protection. However, monitoring their health has been a significant challenge, particularly due to the complex three-dimensional (3D) structures of coral reefs. The richness of 3D information in coral reefs is essential, yet modern monitoring technologies ultimately rely on two-dimensional (2D) images as their primary input. This dimensional transformation presents a formidable challenge, as critical details and spatial relationships inherent in the 3D structure of coral reefs are inevitably lost or distorted in the transition to 2D. Advanced technologies such as remote sensing and photogrammetry have been widely adopted for global coral reef monitoring. Despite their utility, remote sensing methods are constrained by spatial resolution and lighting conditions, providing only indirect information about the condition of coral reefs [1]. This limitation results in insufficiently accurate analysis and assessment of coral reef conditions, precluding real-time monitoring capabilities. Additionally, remote sensing incurs high costs. Coral images obtained through photogrammetric techniques typically require manual interpretation and analysis. The unique and intricate morphology of corals makes it difficult to manually or automatically identify and analyze various coral characteristics.
In response to these challenges, deep learning methods have emerged as a powerful alternative. By leveraging multi-layer neural networks, these methods can automatically extract and learn complex features from data, adeptly handling high-dimensional, nonlinear patterns and large-scale datasets. This capability is particularly beneficial for coral monitoring, providing more accurate analyses and processing under ever-changing environmental conditions and coral states. The evolution of deep learning has inspired an influx of researchers to explore data acquisition and analysis in specialized and complex environments. Deep learning-based object detection and segmentation algorithms can generally be classified into two categories: one-stage regression-based detectors, exemplified by SSD [2] and the YOLO series [3,4,5], which offer faster detection speeds by directly predicting object locations and classes, and two-stage region proposal-based detectors such as R-CNN [6], Faster R-CNN [7], and Mask R-CNN [8], which involve a computationally intensive process of generating candidate regions followed by detailed classification and bounding box regression, resulting in slower processing speeds unsuitable for real-time tasks. The unique characteristics of the underwater environment, combined with the diverse and complex morphology of corals, for example, their branching, plate-like, and massive structures, create significant difficulties for coral detection and segmentation. Traditional underwater detection and segmentation algorithms suffer from low precision and are labor-intensive. Moreover, they often excel at detecting single corals but struggle with multiple targets. To address these issues, this paper proposes the coral instance segmentation network model. This model significantly reduces computational complexity and enhances instance segmentation precision. It also achieves higher FPS, enabling the system to perform coral instance segmentation in images or videos with minimal latency, thus meeting the requirements for real-time analysis. This allows scientists and researchers to quickly identify the health status of coral reefs, track changes over time, and rapidly respond to potential threats.
The main contributions of this paper can be outlined as follows:
  • We designed a novel down-sampling module, ADown_HWD. This innovation effectively preserves more detailed boundary and texture information of corals, enhancing the extraction of multi-scale features and improving the capture of multi-scale image information;
  • The C2f_BRA module is proposed by integrating the C2f module with bi-level routing attention (BRA) [9] for use in the neck network. It aims to address the characteristics of corals with irregular and variable morphology as well as their similarity in color and structure to the seabed environment. This module filters out most irrelevant key–value pairs at a coarse regional level, focusing on more critical information such as coral edges, shapes, colors, and textures. It also avoids the redundancy of unrelated features like plankton, rocks, and seaweed, optimizing computational resources and enhancing real-time processing capabilities;
  • The introduction of Dysample [10], an ultra-lightweight upsampling method, into the CIS network model effectively reduces both the computational cost and the number of parameters of the model, all while preserving precision. This makes the model ideally suited for deployment on resource-constrained devices.

2. Related Work

With the swift progress of deep learning, numerous academics have begun exploring underwater life studies. Despite the speed advantage of one-stage detection algorithms, their lower precision has prompted continuous improvement efforts. Hou et al. [11] proposed an enhanced YOLOv5-based algorithm designed for object detection in complex underwater environments. They incorporated the HorBlock module into the backbone network to boost feature extraction and significantly enhance detection precision for sea cucumbers and sea urchins. Che et al. [12] integrated adaptive self-supervised learning (Convnext V2), a lightweight network (SlimNeck), and dynamic sparse attention mechanisms into YOLOv8 to enhance the segmentation accuracy and efficiency of underwater targets. Chen et al. [13] enhanced YOLOv4 by substituting the upsampling module with a deconvolution module and integrating depth-wise separable convolutions, thereby decreasing network computation and increasing both the precision and speed of underwater target detection.
The diversity of underwater life often results in partial or complete occlusion of objects, leading to the loss of key feature information and making it challenging for detection algorithms to accurately identify and locate targets. Tao et al. [14] proposed the LKCA-YOLOv5 underwater target detection algorithm to address mutual occlusion between underwater targets. They employed the spatial fusion (S2F) module to enhance focus on spatial dimensional information, improving the detection of occluded targets. Li et al. [15] proposed an improved CME-YOLOv5 network to reduce environmental interference and address the challenges of detecting dense, occluded, and small fish targets. Some scholars have proposed solutions to the problem of inaccurate feature extraction due to the variable and complex underwater environment. Liu et al. [16] enhanced YOLOv7 with a residual network and global attention to improve feature extraction and speed for underwater detection. Considering the unique characteristics of underwater environments and the concealment of targets, Shen et al. [17] introduced the MIPAM for improved YOLO detector precision and real-time performance. Later, they developed the mDFLAM to reduce background interference and enhance underwater object perception [18].
Beyond object detection, YOLO networks have also been applied to segmentation, tracking, and other tasks. Hassanudin et al. [19] conducted fine-grained analysis of coral instance segmentation in Indonesia using the YOLOv8 model. Zhang et al. [20] proposed the BoTS-YOLOv5s-seg model for segmentation, integrating BoTNet and self-attention mechanisms to streamline parameters and enhance fish recognition in dense aquaculture environments. To tackle the challenges associated with the detection and tracking of shallow marine life, Liu et al. [21] designed a YOLOv5 multi-target detection and tracking algorithm with an attention mechanism. Lu et al. [22] removed the haze caused by water turbidity from the captured images and then used YOLO to recognize and track marine organisms including shrimps, squids, crabs, and sharks. Since most network models are designed for still images, they are not well suited for tasks involving video. Therefore, Jiang et al. [23] introduced a new video analysis technique using YOLOv5, binocular stereo vision, and tracking algorithms to automatically estimate coral coverage. Alshdaifat et al. [24] developed an enhanced deep learning approach for detecting multiple fish and performing instance segmentation in underwater videos. Park et al. [25] enhanced the YOLO network by improving application heuristics and network cumulative mean, proposing a method for the classification and counting of fish in underwater videos.
Compared to one-stage algorithms, two-stage algorithms have effectively improved the accuracy and precision of underwater creature identification and localization in complex scenes. They achieve this by separating the region proposal and precise classification steps. Some scholars have utilized two-stage algorithms to conduct research on underwater organisms, such as Song et al. [26], who proposed a method that integrates the MSRCR enhancement algorithm into the Mask R-CNN framework for detecting and segmenting underwater organisms using small datasets. However, its slow processing speed and limitations for long-distance recognition pose challenges for practical application. Yi et al. [27] proposed CAM-RCNN, a new framework using CoordConv and group normalization for segmenting underwater marine organisms, enhancing generalization and addressing class imbalance effectively. Gao et al. [28] improved Faster R-CNN for jellyfish detection by adding inflated convolution to the residual network, enhancing accuracy and small target recognition. In the study of coral detection and segmentation, Picek et al. [29] utilized Mask R-CNN and Bag of Tricks to annotate, localize, and classify coral reefs on a pixel-by-pixel basis. Jaisakthi et al. [30] applied Faster R-CNN for coral reef image annotation and localization, evaluating the impact of various backbones on performance. However, these two methods still yield low accuracy due to the complex and variable underwater environment and the possible occlusion between organisms. Moreover, the two-stage detection algorithm has a complex computational process and requires high computational resources, so it is not suitable for real-time detection and segmentation tasks.
While one-stage and two-stage object detection methods have made significant progress, a relatively new object detection algorithm based on the transformer architecture, DETR (Detection Transformer), has also been applied to the study of underwater organisms. Ke et al. [31] applied DETR to underwater object detection tasks and achieved certain improvements in standard evaluation metrics by fine-tuning the model. Later, Ke et al. [32] proposed a method using DN-DETR and DN-Deformable-DETR for general underwater object detection, employing denoising to accelerate convergence and reduce the training instability of the native DETR detector. Wang et al. [33] introduced DyFish-DETR for underwater object detection, utilizing DyFishNet to extract fish texture features and adopting a slim hybrid encoder to integrate feature information. To develop a real-time and accurate lightweight detection model, Yuan et al. [34] compared DETR and YOLOv5 for underwater sea cucumber detection. The experimental results demonstrated that YOLOv5 outperforms DETR in terms of low computational cost and high precision. However, due to its simple structure and attention mechanism, DETR shows promising potential in underwater object detection. Despite its commendable performance in object detection, the high computational cost of DETR limits its practicality. Therefore, RT-DETR [35] was proposed to achieve higher detection accuracy and faster inference speed.

3. Proposed Method

In this paper, we propose a novel network model, CIS, for instance segmentation of corals. The network structure is illustrated in Figure 1. The CIS network model demonstrates superior precision, a lower parameter count, and accelerated detection speeds compared to the YOLOv8 network model. The architecture of the CIS network model is elegantly divided into three main components: the backbone network, the neck network, and the head network. The process begins with the image entering the convolutional (Conv) layer for preliminary feature extraction. Subsequently, the image undergoes two rounds of down-sampling to generate two distinct scale features, denoted as B1 and B2. B2 is then further downsampled, and the C2f and ADown_HWD modules iteratively enhance features, resulting in the creation of feature maps B3, B4, and B5. The feature map B5 is subsequently channeled through the SPPF module to perform pooling operations at multiple scales, integrating global and local features as the output denoted as P5. P5 then serves as the input for the dynamic upsampling process facilitated by the Dysample module, which expands the feature map dimensions. The upsampled feature map is combined with the B4 feature map from the backbone network through a Concat layer. This combined feature map is subsequently refined by the C2f_BRA module, enhancing the feature representation further. The feature maps are once again processed through a dynamic upsampling module, Dysample, and then concatenated with the B3 feature maps from the backbone network using another Concat layer. These concatenated feature maps are funneled through the C2f_BRA module to produce smaller-scale feature maps, denoted as N3. N3 is then processed through a series of Conv and Concat layers, merging the feature map with the P4 feature map from the neck network. After this integration, the C2f_BRA module is employed to generate a medium-scale feature map, N4. In a similar fashion, the feature map is synthesized with P5 from the backbone network for the final time through Conv and Concat layers. The C2f_BRA module is then utilized to derive the large-scale feature map N5. Ultimately, N3, N4, and N5 are funneled through the segment layer, culminating in the output of the final segmentation result.
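To make the data flow described above concrete, the following sketch traces an input image through the stages of the CIS pipeline. Every argument is a placeholder callable standing in for the corresponding module (assumed interfaces; distinct module instances and channel counts are omitted for clarity), so this is an illustrative wiring diagram rather than the authors' implementation:

```python
def cis_forward_sketch(image, backbone, sppf, dysample, concat, c2f_bra, conv, segment_head):
    """Illustrative data flow of the CIS model; all arguments are placeholder callables."""
    # Backbone: Conv + C2f/ADown_HWD stages yield multi-scale features B3, B4, B5
    b3, b4, b5 = backbone(image)
    p5 = sppf(b5)                            # multi-scale pooling on the deepest feature map
    # Top-down path: dynamic upsampling, concatenation with backbone features, fusion attention
    p4 = c2f_bra(concat(dysample(p5), b4))
    n3 = c2f_bra(concat(dysample(p4), b3))   # small-scale feature map N3
    # Bottom-up path
    n4 = c2f_bra(concat(conv(n3), p4))       # medium-scale feature map N4
    n5 = c2f_bra(concat(conv(n4), p5))       # large-scale feature map N5
    return segment_head(n3, n4, n5)          # segmentation output from N3, N4, N5
```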

3.1. Downsampling Module ADown_HWD

In the domain of coral segmentation, preserving structural and textural image details is paramount: accurate delineation of coral morphology hinges on precisely capturing its shape, size, and texture. Traditional downsampling in the YOLOv8 network predominantly employs a uniform convolutional strategy, complemented by normalization and activation functions. While adept at feature extraction and nonlinear transformation, this approach cannot process multi-scale feature information concurrently and, constrained by stride and filter dimensions, may discard vital local details. To address this, we designed the ADown_HWD module, a downsampler that integrates the ADown [5] module with the Haar wavelet-based HWD module [36,37]. This integrated approach captures features across a spectrum of resolutions, improving both the fidelity and the efficiency of image processing. A comparative analysis revealed that ADown_HWD markedly outperforms its predecessor, the ADown module, which relies mainly on convolutional and pooling operations. The new downsampler not only preserves the integrity of coral edge and texture details but also efficiently condenses the data dimensions. Because coral imagery is often acquired in intricate underwater settings, it is frequently subject to diverse lighting conditions and visual aberrations. The ADown_HWD module exhibits strong adaptability and robustness in such complex environments, as it handles image content across multiple frequency bands, thereby optimizing the performance of the CIS. Figure 2 illustrates the structure of the ADown_HWD module, which was designed by integrating the HWD module into the original ADown module.
The HWD module consists of a lossless feature encoding block and a feature representation learning block. The lossless feature encoding block utilizes the Haar wavelet transform to reduce the resolution of the feature mapping while retaining all the information. This process transforms the features, effectively reducing the spatial resolution. The Haar wavelet effectively encodes the original spatial information into the new channel dimensions and does so without losing information. Each channel is used to capture different features of the original image. The feature representation learning block primarily serves to adapt the channel count of the feature map, ensuring compatibility with the architecture of subsequent layers. Additionally, it efficiently eliminates redundant information, significantly enhancing the capability of subsequent layers to learn and extract distinctive and meaningful features.
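A minimal PyTorch sketch of such a Haar-wavelet downsampling block is given below. It assumes even spatial dimensions, and the class and parameter names are illustrative rather than the exact implementation of [37] or of Figure 2:

```python
import torch
import torch.nn as nn

class HaarWaveletDown(nn.Module):
    """Sketch of Haar wavelet downsampling: lossless encoding + representation learning."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Feature representation learning block: adapt the 4*C wavelet channels
        # to the channel count expected by the next layer and remove redundancy.
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(inplace=True),
        )

    @staticmethod
    def haar_dwt(x):
        # Single-level 2D Haar transform over non-overlapping 2x2 blocks.
        # The four subbands (LL, LH, HL, HH) retain all information while
        # halving the spatial resolution.
        a = x[..., 0::2, 0::2]
        b = x[..., 0::2, 1::2]
        c = x[..., 1::2, 0::2]
        d = x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 2
        lh = (a - b + c - d) / 2
        hl = (a + b - c - d) / 2
        hh = (a - b - c + d) / 2
        return torch.cat([ll, lh, hl, hh], dim=1)  # (B, 4C, H/2, W/2)

    def forward(self, x):  # x: (B, C, H, W) with even H and W
        return self.conv(self.haar_dwt(x))
```

In the ADown_HWD module this block is combined with the average-pooling and max-pooling branches of the original ADown module (Figure 2), so that both the wavelet subbands and the pooled convolutional features contribute to the downsampled output.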

3.2. C2f_BRA Module

The range of coral reefs is extensive, spanning from shallow tropical coral reefs to mesophotic reefs and deep-water coral reefs. Corals predominantly inhabit shallow waters within tropical and subtropical regions. With the rapid development of underwater exploration and remote operation technologies, scientists have also gradually discovered deep-sea coral species and their distribution by utilizing advanced technologies such as real-time imaging systems coupled to remotely operated vehicles (ROVs) [38]. While processing the acquired coral images, we noticed similarities in the color and shape of some of the corals with their complex and variable underwater surroundings. However, this phenomenon is not a universal description of all coral images, and it may in some cases result in the inclusion of a degree of redundant data in the input image. In addition, moving objects such as large canopies of macroalgae and dense schools of fish may introduce artifacts such as blurring [39]. This blurring not only increases the amount of redundant information in the image but may also obscure or distort the features of the coral itself. Consequently, this may have a negative impact on the subsequent recognition and segmentation processes. In order to allow the network to focus primarily on the important features of the coral, we integrated the C2f [40] module with the bi-level routing attention mechanism. Therefore, the C2f_BRA module incorporates an attention mechanism during feature extraction, along with more flexible and content-aware computation allocation. This enables the network model to better focus on key information related to coral, such as shapes, colors, textures, and other details. As a result, it significantly improves segmentation precision in scenarios with small and densely packed targets. Additionally, it reduces computational complexity and extensive memory usage.
The bi-level routing attention is a dynamic sparse attention mechanism. It improves accuracy and efficiency through fine-grained attention modulation and routing strategies. The key idea is to filter out most of the irrelevant key–value pairs at a coarse region level so that only a small subset of routing regions is retained; fine-grained token-to-token attention is then applied within these routed regions. The input feature map $X \in \mathbb{R}^{H \times W \times C}$ is first partitioned into $S \times S$ non-overlapping regions, each containing $\frac{HW}{S^{2}}$ feature vectors. $X$ is reshaped into $X^{r} \in \mathbb{R}^{S^{2} \times \frac{HW}{S^{2}} \times C}$, and the query, key, and value tensors $Q, K, V \in \mathbb{R}^{S^{2} \times \frac{HW}{S^{2}} \times C}$ are then derived through linear projections:

$Q = X^{r} W^{q}$ (1)

$K = X^{r} W^{k}$ (2)

$V = X^{r} W^{v}$ (3)

where $X^{r}$ is the reshaped feature map and $W^{q}, W^{k}, W^{v} \in \mathbb{R}^{C \times C}$ are the projection weights of the query, key, and value, respectively.
Region-level queries and keys, $Q^{r}$ and $K^{r} \in \mathbb{R}^{S^{2} \times C}$, are then derived by averaging $Q$ and $K$ over each region. Multiplying $Q^{r}$ by the transpose of $K^{r}$ yields the adjacency matrix of inter-region correlations $A^{r}$:

$A^{r} = Q^{r} (K^{r})^{T}$ (4)

The affinity graph is then pruned by retaining only the top $k$ connections, effectively removing regions of weak correlation:

$I^{r} = \mathrm{topkIndex}(A^{r})$ (5)

where $I^{r}$ is a routing index matrix that stores the indices of the $k$ most relevant regions returned by the operator $\mathrm{topkIndex}$.
Because the routing regions may be dispersed across the entire feature map, the key and value tensors of these regions are first gathered into $K^{g}$ and $V^{g}$. Token-to-token attention is then applied over $K^{g}$ and $V^{g}$, producing the output $O$:

$K^{g} = \mathrm{gather}(K, I^{r})$ (6)

$V^{g} = \mathrm{gather}(V, I^{r})$ (7)

$O = \mathrm{Attention}(Q, K^{g}, V^{g}) + \mathrm{LCE}(V)$ (8)

where $\mathrm{LCE}(\cdot)$ [41] is a local context enhancement term, parameterized using a depth-wise convolution with a kernel size of 5. See Figure 3.
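A compact, single-head sketch of this computation in PyTorch is shown below. It assumes inputs whose height and width are divisible by the number of regions per side S; the explicit expand-and-gather step and the absence of multi-head splitting are simplifications for illustration, not the BiFormer implementation [9]:

```python
import torch
import torch.nn as nn

class BiLevelRoutingAttentionSketch(nn.Module):
    """Single-head sketch of bi-level routing attention (BRA) for (B, H, W, C) inputs."""

    def __init__(self, dim, num_regions=7, topk=4):
        super().__init__()
        self.s, self.topk = num_regions, topk
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)                        # W^q, W^k, W^v fused
        self.lce = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # LCE: depth-wise 5x5 conv
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, H, W, C = x.shape
        S, k = self.s, self.topk
        h, w = H // S, W // S
        # Partition into S*S non-overlapping regions: (B, S^2, h*w, C)
        xr = x.view(B, S, h, S, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, S * S, h * w, C)
        q, kk, v = self.qkv(xr).chunk(3, dim=-1)                  # Eqs. (1)-(3)
        # Region-level routing: per-region means, adjacency matrix, top-k indices (Eqs. (4)-(5))
        qr, kr = q.mean(dim=2), kk.mean(dim=2)
        ar = qr @ kr.transpose(-1, -2)                            # (B, S^2, S^2)
        idx = ar.topk(k, dim=-1).indices                          # (B, S^2, k)
        # Gather key/value tokens of the routed regions (Eqs. (6)-(7));
        # the expand is written for clarity, not memory efficiency.
        idx_exp = idx[..., None, None].expand(-1, -1, -1, h * w, C)
        kg = torch.gather(kk.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_exp).flatten(2, 3)
        vg = torch.gather(v.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_exp).flatten(2, 3)
        # Fine-grained token-to-token attention within the routed regions
        attn = (q @ kg.transpose(-1, -2) * self.scale).softmax(dim=-1)
        out = attn @ vg                                           # (B, S^2, h*w, C)
        # LCE(V): apply the depth-wise conv to V laid out as an image (Eq. (8))
        v_img = v.reshape(B, S, S, h, w, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        lce = self.lce(v_img).reshape(B, C, S, h, S, w).permute(0, 2, 4, 3, 5, 1).reshape(B, S * S, h * w, C)
        out = self.proj(out + lce)
        # Restore the (B, H, W, C) layout
        return out.reshape(B, S, S, h, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```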

3.3. Dynamic Upsampling Module DySample

The upsampling method used in the YOLOv8 network is nearest neighbor interpolation. This method, when upsampling, only selects the closest pixel values without generating new pixel values, which fails to capture minute variations and detailed semantic features in images. Consequently, it often leads to the generation of images with jagged edges, resulting in lower image quality. This is particularly problematic for images rich in details, such as for coral, where it may result in the loss of critical feature information. With the development of dynamic filtering networks [42], dynamic upsamplers like CARAFE [43], FADE [44], and SAPA [45] have shown commendable performance in certain tasks. However, these dynamic upsamplers generally have complex structures, which increase the computational workload and consequently lead to longer inference times. To address the above problems, this paper introduces the dynamic upsampler Dysample. Dysample is a lightweight and fast dynamic upsampler with reduced inference latency, memory footprint, and number of parameters. This method can generate new pixel values based on sampling offsets, which helps to preserve the rich semantic information of coral images. At the same time, Dysample can generate feature maps with different resolutions according to different scales to adapt to coral structures at various scales. This reduces problems of jagged edges and pixel distortions, thereby improving image quality.
Dysample takes the original feature map $X$ and the sampling set $S$ and obtains the upsampled feature map $X'$ through the grid_sample function, as in Equation (9). Compared with other upsampling methods that rely on time-consuming dynamic convolutions and additional subnetworks to generate dynamic kernels, Dysample avoids dynamic convolution and instead adopts point sampling to save resources. First, given a feature map $X$ of size $C \times H \times W$ and an upsampling scale factor $s$, a linear layer with $C$ input channels and $2s^{2}$ output channels is used to generate an offset $O$ of size $2s^{2} \times H \times W$. This offset is then reshaped into a tensor of size $2 \times sH \times sW$ through pixel shuffling. Finally, the sampling set $S$ is obtained by combining the original grid positions $G$ with the offset $O$, as shown in Equation (10). See Figure 4.

$X' = \mathrm{grid\_sample}(X, S)$ (9)

$S = G + O$ (10)
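Equations (9) and (10) can be realized with a few built-in PyTorch operations, as the following sketch illustrates. The 0.25 offset scaling and the zero initialization (which makes the module start out as plain bilinear upsampling) are assumptions in the spirit of the "linear + pixel shuffle" variant described in [10], not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Point-sampling dynamic upsampler in the spirit of DySample."""

    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # Linear layer (1x1 conv) producing 2*s^2 offset channels per input position
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)
        nn.init.zeros_(self.offset.weight)  # start as plain bilinear upsampling
        nn.init.zeros_(self.offset.bias)

    def forward(self, x):                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        s = self.scale
        # O: pixel-shuffle the 2*s^2 offset channels into a (B, 2, sH, sW) offset map,
        # scaled by 0.25 so that sampling points stay near their initial positions
        offset = F.pixel_shuffle(self.offset(x), s) * 0.25
        # G: regular bilinear sampling grid of the sH x sW output, in input pixel coordinates
        ys, xs = torch.meshgrid(
            torch.arange(s * H, device=x.device, dtype=x.dtype),
            torch.arange(s * W, device=x.device, dtype=x.dtype),
            indexing="ij",
        )
        grid = torch.stack(((xs + 0.5) / s - 0.5, (ys + 0.5) / s - 0.5))  # (2, sH, sW): x, then y
        # S = G + O (Eq. (10)), then normalize to [-1, 1] for grid_sample
        coords = grid.unsqueeze(0) + offset
        norm = torch.tensor([W, H], device=x.device, dtype=x.dtype).view(1, 2, 1, 1)
        coords = 2.0 * (coords + 0.5) / norm - 1.0
        # X' = grid_sample(X, S) (Eq. (9)): resample the input at the dynamic positions
        return F.grid_sample(
            x, coords.permute(0, 2, 3, 1),
            mode="bilinear", align_corners=False, padding_mode="border",
        )
```

Because the upsampler consists only of a 1 × 1 convolution and a grid_sample call, it adds few parameters and little inference latency, which is consistent with the lightweight behavior reported in the experiments below.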

4. Experimental Evaluation

4.1. Experimental Dataset

The coral dataset used to train the segmentation model should include images of various coral species to enable the model to capture morphological features of corals under different environmental, lighting, and water conditions. Given the scarcity of publicly available, annotated coral datasets that are suitable for our experiment, particularly those containing a variety of coral species, we undertook a comprehensive data collection and annotation process. Our objective was to assemble a high-quality dataset that would enable the training of accurate and robust models for coral identification and segmentation. To achieve this, we first scoured the internet for existing coral datasets but found them limited in terms of either their diversity, annotation quality, or suitability for our specific experimental needs. Therefore, we resorted to collecting coral images from multiple sources, primarily from publicly available coral datasets such as CoralNet and Coral_Segmentation, as well as keyframes extracted from underwater coral monitoring videos provided by the Guangxi Laboratory on the Study of Coral Reefs in the South China Sea. The final dataset encompasses approximately 20 species of reef corals, predominantly from the genera Acropora, Montipora, Porites, Favia, and Goniastrea. To augment our dataset and enhance the robustness of our model, we adopted data augmentation technology, including flipping, cropping, adjusting brightness, and introducing controlled noise levels to the coral images. This culminated in a coral dataset comprising 2047 high-quality images. The dataset was divided into a training set of 1858 images and a test set of 189 images. To ensure accurate and precise training and testing, we utilized an online annotation platform available at https://universe.roboflow.com (accessed on 1 May 2024) for labeling the coral images. Figure 5 provides an illustrative showcase of a subset of the images from the dataset alongside their corresponding annotations.
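As an illustration of the augmentation step mentioned above, the following sketch applies flips, crops, brightness jitter, and mild Gaussian noise to a coral image; the probabilities and parameter ranges are assumptions, not the settings used to build the dataset, and for instance segmentation the corresponding masks and boxes must be transformed consistently (omitted here):

```python
import random
import numpy as np
import cv2

def augment(image):
    """Hypothetical augmentation pipeline: flip, crop, brightness jitter, Gaussian noise."""
    h, w = image.shape[:2]
    if random.random() < 0.5:                       # horizontal flip
        image = cv2.flip(image, 1)
    if random.random() < 0.5:                       # random crop, resized back to original size
        ch, cw = int(0.9 * h), int(0.9 * w)
        y, x = random.randint(0, h - ch), random.randint(0, w - cw)
        image = cv2.resize(image[y:y + ch, x:x + cw], (w, h))
    alpha = random.uniform(0.8, 1.2)                # brightness adjustment
    image = np.clip(image.astype(np.float32) * alpha, 0, 255)
    noise = np.random.normal(0, 5, image.shape)     # controlled noise level
    return np.clip(image + noise, 0, 255).astype(np.uint8)
```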

4.2. Experimental Environment

To evaluate the effectiveness of the proposed CIS network model, experiments were designed and executed to validate its performance. The experiments were run on the Ubuntu 20.04 operating system, using the PyTorch 1.11.0 deep learning framework. The configuration details of the specific experimental setup are outlined in Table 1. The model was optimized with stochastic gradient descent (SGD), starting from an initial learning rate of 0.01. The input images have a resolution of 640 × 640 pixels, and the model was trained for 300 epochs to ensure stable convergence. The specific hyperparameters applied throughout training are shown in Table 2 below.
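For reference, a training run with the settings of Tables 1 and 2 could be configured as follows, assuming the Ultralytics training interface as a stand-in for the authors' pipeline; "cis-seg.yaml" and "coral-seg.yaml" are hypothetical model and dataset description files:

```python
from ultralytics import YOLO

# Hypothetical model config standing in for the CIS architecture;
# "yolov8n-seg.yaml" would reproduce the YOLOv8n baseline instead.
model = YOLO("cis-seg.yaml")

model.train(
    data="coral-seg.yaml",   # hypothetical dataset description file
    epochs=300,              # training epochs (Table 2)
    imgsz=640,               # input size 640 x 640
    batch=16,                # batch size
    optimizer="SGD",         # stochastic gradient descent
    lr0=0.01,                # initial learning rate
    lrf=0.01,                # final learning rate factor (Table 2 reports 0.01)
    momentum=0.937,          # optimizer momentum
    weight_decay=0.0005,     # weight decay
    device=0,                # single GPU (RTX 4090, Table 1)
)
```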

4.3. Evaluation Metrics

The model was evaluated using several metrics, including precision, recall, average precision, mean average precision, and frames per second, among others.
Precision (P): Precision indicates the ratio of correctly predicted positive cases to all predicted positive cases and is used to assess how accurate the predictions are. TP represents the count of correctly identified positive samples, FP denotes negative samples (e.g., background) mistakenly identified as positive, and FN refers to positive samples that are missed, i.e., incorrectly classified as background.

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (11)

Recall (R): Recall indicates the ratio of correctly predicted positive cases to all actual positive cases and is used to assess how completely the targets of each category are found.

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (12)

Average precision (AP): The area of the region enclosed by the precision–recall curve and the axes is the AP value of the current category. AP is utilized as a metric to evaluate the performance of the model in each individual category.

$AP = \int_{0}^{1} P \, \mathrm{d}R$ (13)

Mean average precision (mAP): mAP is the mean of the per-category AP values, utilized to evaluate the performance of a model across all categories.

$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} AP_{i}$ (14)
Frames per second (FPS): FPS is a metric used to evaluate the processing speed of a model on a given hardware, indicating the number of images that can be processed per second. FPS is typically calculated by dividing the total number of images processed by the total time it takes to process them, including preprocessing, inference, and postprocessing time for non-maximum suppression (NMS).
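As a concrete illustration of Equations (13) and (14), the following sketch computes AP as the area under a precision–recall curve using the common all-point interpolation convention (an assumption; the paper does not state which interpolation it uses) and averages per-class APs into mAP:

```python
import numpy as np

def average_precision(precision, recall):
    """Area under the precision-recall curve (all-point interpolation).

    `precision` and `recall` are arrays over detections sorted by confidence.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Enforce a monotonically non-increasing precision envelope.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Integrate P dR over the segments where recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP: the mean of per-class AP values."""
    return float(np.mean(ap_per_class))
```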

4.4. Comparison Experiments with Other Models

To further evaluate the effectiveness of the proposed model in this paper and validate its superiority over other mainstream models for coral instance segmentation tasks, comparative experiments were conducted. The proposed CIS network model was compared with other network models, including Mask R-CNN, YOLACT [46], YOLOv5, YOLOv7, YOLOv8, and YOLOv9. The results of these comparative experiments are presented in Table 3 below. See also Figure 6.
Based on the findings from the comparative experiments, it is evident that the CIS network model significantly outperforms the Mask R-CNN and YOLACT models in terms of precision, mAP50, and especially FPS. Compared with the YOLOv5n-seg model, the CIS network model has a slightly higher number of parameters. However, the mAP50B and mAP50M of the CIS were significantly improved by 6.3% and 7.5%, respectively. Additionally, the FPS is also higher than that of the YOLOv5n-seg model. Compared with YOLOv7-seg, CIS has only slightly lower recall, while all other metrics are higher, and the number of parameters and FPS are significantly better than those of YOLOv7-seg. Compared with the baseline model YOLOv8n-seg, the metrics of the CIS network model, i.e., mAP50B and mAP50M, increased by 2.3% and 2.4%, respectively. Additionally, the number of parameters decreased by 10.1%, and the FPS increased by 10.7%, meeting real-time requirements. Finally, for the latest YOLOv9 network model, CIS is ahead in all the metrics except the recall rate, which is slightly lower. The CIS network model has fewer parameters, and its mAP50B and mAP50M were improved by 0.9% and 1.3%, respectively. Figure 7 shows the comparison between CIS and other network models.

4.5. Comparative Experiment of the ADown Module and the ADown_HWD Module Enhanced with Haar Wavelet Downsampling (HWD)

This experiment aimed to validate the effectiveness of the ADown_HWD module, formed by adding Haar wavelet downsampling (HWD) to the ADown module in the CIS network model. We compared processing speed and segmentation accuracy with and without HWD to assess its impact on network performance. Both models were trained under identical conditions, ensuring consistent training parameters such as learning rate, batch size, and number of training epochs. The experimental results are shown in Table 4 below.
The experimental results demonstrate that the ADown_HWD module achieved a remarkable improvement in processing speed, reaching 178.6 FPS, significantly surpassing the 81.3 FPS of the ADown module. This clearly indicates the positive effect of ADown_HWD on processing speed, stemming from its efficient feature extraction capability. In terms of segmentation accuracy, while the ADown_HWD module exhibited a slight decrease in the RB and RM metrics, the overall improvement in the mAP50 metrics suggests that ADown_HWD has advantages in capturing global features and segmentation accuracy. This result reflects a balance between precision and speed, showcasing the potential of ADown_HWD in practical applications. The improvement primarily stems from the ADown_HWD module’s use of the Discrete Wavelet Transform (DWT) for feature processing, enabling more effective capture and integration of both low- and high-frequency information. Specifically, the introduction of wavelet transforms enhances the network’s ability to concurrently recognize image details (such as edges and textures) and global structures, thereby increasing the distinctiveness of features and the accuracy of the network’s detection capabilities. Moreover, compared to the direct pooling and convolution approach of ADown, the hybrid feature processing strategy of ADown_HWD is more complex, offering a richer representation of features, which in turn facilitates more effective learning and generalization. This result underscores the importance of integrating advanced mathematical transformation techniques, like wavelet analysis, in deep learning models and highlights their potential in enhancing model performance. Additionally, it is noteworthy that the ADown_HWD module has a parameter count of 2.93 M, slightly higher than the 2.62 M of the ADown module, introducing some additional computational and storage costs. However, considering the significant performance gains, this increase is deemed acceptable. In conclusion, this study not only validates the effectiveness of HWD within the ADown_HWD module but also provides valuable insights for future optimizations of model architecture and performance enhancements.

4.6. Comparative Experiments on Attention Mechanisms

We conducted comparative experiments to demonstrate the effectiveness of the C2f_BRA module, which integrates the bi-level routing attention mechanism into the C2f module. This approach was compared against other attention mechanisms to highlight its superiority. Four other attention mechanisms, SE [47], CA [48], ECA [49], and CBAM [50], were integrated into the C2f module at the same position. This resulted in four new modules: C2f_SE, C2f_CA, C2f_ECA, and C2f_CBAM. These were then compared with the proposed C2f_BRA module. The experimental results are presented in Table 5 below. The results indicate that after integrating these four attention mechanisms, C2f_SE, C2f_CA, and C2f_ECA modules improved the mAP50M by 0.7%, 1.5%, and 1.2%, respectively. C2f_CBAM achieved the same improvement in mAP50M as C2f_BRA, with an increase of 1.9%. However, the parameter count and module size of C2f_BRA are superior to the other modules. This suggests that integrating the bi-level routing attention mechanism into the C2f module achieves better mAP50M compared to other attention mechanisms. Additionally, it requires fewer parameters, resulting in a smaller model size. This makes it more suitable for deployment in environments with limited computational resources.

4.7. Ablation Experiments

Multiple sets of ablation experiments were conducted to evaluate the impact of each component on the overall performance of the model. These experiments were performed in the same experimental environment. The aim was to evaluate the impact of each improvement module on the baseline model. The results of the ablation experiments are shown in Table 6 below.
The ablation sequence in this experiment was conducted by progressively adding each module to evaluate their impact on model performance. First, we started with the baseline model (YOLOv8n) without adding any new modules to establish an initial performance benchmark. Then, we sequentially added the ADown_HWD module, DySample module, and C2f_BRA module to the baseline model, conducting experiments to assess the individual contributions of each module to model accuracy and performance. Following this, we combined modules, adding ADown_HWD and DySample together, ADown_HWD and C2f_BRA together, and DySample and C2f_BRA together to the baseline model to observe their combined effects on model performance. Finally, we integrated all modules (ADown_HWD, DySample, and C2f_BRA) into the baseline model, forming our proposed CIS network model, and performed a comprehensive performance evaluation. This ablation sequence was designed to systematically analyze the effects of each module and their combinations on the overall performance of the model, thereby validating the specific contributions of each module in enhancing model performance.
From the results, it is evident that replacing the Conv module with the proposed ADown_HWD module brings significant improvements. Although RB and RM decreased slightly, there were notable improvements in other metrics. Specifically, PB and PM increased by 2.7% and 3.1%, respectively. Additionally, the mean average precision at 50% threshold, i.e., mAP50B and mAP50M, improved by 1.3% and 1.2%, respectively. These changes suggest that the ADown_HWD module enhances the ability to perform multiscale feature extraction on coral and better capture multi-scale information in images. The main reason for the improvement is the introduction of richer and multi-layered feature representations. ADown_HWD enhances the detection accuracy by decomposing the image into different frequency components, thus extracting detailed and textural features. Specifically, the Discrete Wavelet Transform (DWT) within the HWD module decomposes the image into low-frequency and high-frequency components. These components provide a more comprehensive view of the image, encompassing both global structure and fine details. The low-frequency components capture the overall information, while the high-frequency components focus on local details. This multi-layered fusion of information aids in more accurate target segmentation. Additionally, ADown_HWD combines average pooling and max pooling strategies to smooth the image while preserving important features. By concatenating feature maps from different processing paths, it enhances the model’s ability to recognize features. This approach not only effectively reduces noise but also better captures target details and structure, ultimately leading to improved detection accuracy.
After introducing the dynamic upsampling module Dysample in the neck network, compared with the baseline model, PB and PM increased by 4.5% and 4.1%, respectively. Likewise, RB and RM improved by 1.0% and 2.8%, respectively. The mean average precision at 50% threshold, i.e., mAP50B and mAP50M, also saw increases of 1.0% and 1.8%. The improvement observed is primarily attributed to the dynamic sampling capabilities of the DySample module, which dynamically adjusts the sampling locations based on the input features, contrasting with the fixed interpolation rules of nn.Upsample. Specifically, DySample predicts sampling offsets through a convolutional layer, finely tuning the sampling strategy based on the local information in the input feature maps, thereby optimizing the efficiency of information recovery. Additionally, there was a noticeable increase in FPS, indicating that Dysample not only boosts accuracy but also meets the demands for real-time performance.
Finally, the C2f_BRA module was added to form the CIS network model. Although RB and RM decreased by 0.7% and 1.6%, respectively, significant improvements were observed in all other metrics. PB and PM increased by 6.3% and 10.5%, respectively. mAP50B and mAP50M also rose by 2.3% and 2.4%. Furthermore, the total number of model parameters decreased by 10.1%, enhancing the efficiency of the model. Additionally, FPS increased by 10.7%, reaching a new high of 178.6. The performance improvement is primarily attributed to the introduction of the BRA module. This module significantly enhances the network feature extraction and fusion capabilities through its advanced attention mechanism and feature routing strategy.
Specifically, BRA improves feature integration by accurately routing both local and global information while dynamically selecting key features, thus reducing redundancy and noise. These advancements enable C2f_BRA to more effectively preserve crucial detail information during downsampling, leading to higher precision in segmentation tasks. Therefore, the incorporation of C2f_BRA not only optimizes the network feature representation but also enhances overall segmentation performance. This improvement indicates that integrating the bi-level routing attention mechanism into C2f_BRA effectively enhances the identification and extraction of key coral features. It also optimizes information weight allocation, making the model more adept at handling complex image processing tasks.
The results of ablation experiments demonstrate that the CIS network model proposed significantly enhances the precision and mAP of coral instance segmentation. It also reduces the number of model parameters, lowering memory usage and computational demands. Additionally, the inference speed is faster. The proposed model also improves FPS, enhancing real-time performance and processing speed.

5. Conclusions

In this paper, we propose the CIS network model, a coral instance segmentation network with novel upsampling, downsampling, and fusion attention mechanism. This model not only improves the mAP of coral instance segmentation but also meets real-time processing requirements. A new downsampling module, ADown_HWD, is proposed to enable the network model to effectively retain more detailed information and achieve multi-scale feature extraction. Incorporating the HWD module into the ADown module results in higher accuracy compared to YOLOv8, which uses only Conv modules composed of convolutional layers, batch normalization layers, and activation functions for downsampling. Additionally, the bi-level routing attention mechanism was integrated into the C2f module, filtering out redundant and irrelevant information to focus on critical data. This optimizes the usage of computational resources and enhances real-time processing capabilities. We also introduced a lightweight upsampling module, Dysample, to improve segmentation precision while reducing both model parameters and computational complexity. Dysample primarily achieves resource savings by replacing time-consuming operations such as dynamic convolution with point sampling techniques, and it is implemented solely using built-in functions in PyTorch. With these improvements, the precision of the CIS network model, i.e., PB and PM, reached 92.3% and 95.1%, respectively, while mAP50B and mAP50M increased by 2.3% and 2.4% over the baseline. Moreover, the number of model parameters decreased by 10.1%, and FPS improved by 10.7%. The experimental results confirmed that the CIS network model delivers high precision and strong real-time performance in coral instance segmentation, making it suitable for this task. However, the method proposed in this study exhibits inadequate recall performance. Future work will focus on addressing this issue.

Author Contributions

Conceptualization, T.L.; data curation, T.L. and S.Z.; formal analysis, T.L.; investigation, T.L.; methodology, T.L.; project administration, T.L. and Z.L.; resources, T.L.; software, T.L.; supervision, Z.L.; validation, T.L.; visualization, T.L.; writing—original draft, T.L.; writing—review and editing, T.L., Z.L. and S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Undergraduate Innovation and Entrepreneurship Training Program of Guangxi University (S202310593348).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data are presented in this article in the form of figures and tables.

Acknowledgments

The authors acknowledge the support of the Undergraduate Innovation and Entrepreneurship Training Program of Guangxi University, the Guangxi Key Laboratory of Multimedia Communications and Network Technology, and the Guangxi Laboratory on the Study of Coral Reefs in the South China Sea for this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Candela, A.; Edelson, K.; Gierach, M.M.; Thompson, D.R.; Woodward, G.; Wettergreen, D. Using remote sensing and in situ measurements for efficient mapping and optimal sampling of coral reefs. Front. Mar. Sci. 2021, 8, 689489. [Google Scholar] [CrossRef]
  2. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  3. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  4. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  5. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. 2024. Available online: https://github.com/WongKinYiu/yolov9 (accessed on 14 August 2024).
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  8. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  9. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 10323–10333. [Google Scholar]
  10. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6027–6037. [Google Scholar]
  11. Hou, C.; Guan, Z.; Guo, Z.; Zhou, S.; Lin, M. Engineering. An Improved YOLOv5s-Based Scheme for Target Detection in a Complex Underwater Environment. J. Mar. Sci. Eng. 2023, 11, 1041. [Google Scholar] [CrossRef]
  12. Che, S.; Li, Z.; Shi, Z.; Gao, M.; Tang, H. Research on an underwater image segmentation algorithm based on YOLOv8. J. Phys. Conf. Ser. 2023, 2644, 012013. [Google Scholar] [CrossRef]
  13. Chen, L.; Zheng, M.; Duan, S.; Luo, W.; Yao, L. Underwater target recognition based on improved YOLOv4 neural network. Electronics 2021, 10, 1634. [Google Scholar] [CrossRef]
  14. Tao, Y.; Zhong, B.; Zhao, W.; Zhou, K. Underwater Object Detection Algorithm Integrating Explicit Visual Center and Attention Mechanism. Laser Optoelectron. Prog. 2023, 61, 1–14. [Google Scholar]
  15. Li, J.; Liu, C.; Lu, X.; Wu, B. CME-YOLOv5: An efficient object detection network for densely spaced fish and small targets. Water 2022, 14, 2412. [Google Scholar] [CrossRef]
  16. Liu, K.; Sun, Q.; Sun, D.; Peng, L.; Yang, M.; Wang, N. Engineering. Underwater target detection based on improved YOLOv7. J. Mar. Sci. Eng. 2023, 11, 677. [Google Scholar] [CrossRef]
  17. Shen, X.; Wang, H.; Cui, T.; Guo, Z.; Fu, X. Multiple information perception-based attention in YOLO for underwater object detection. Vis. Comput. 2024, 40, 1415–1438. [Google Scholar] [CrossRef]
  18. Shen, X.; Sun, X.; Wang, H.; Fu, X. Applications. Multi-dimensional, multi-functional and multi-level attention in YOLO for underwater object detection. Neural Comput. Appl. 2023, 35, 19935–19960. [Google Scholar] [CrossRef]
  19. Hassanudin, W.M.; Utomo, V.G.; Apriyanto, R. Fine-Grained Analysis of Coral Instance Segmentation using YOLOv8 Models. Sinkron 2024, 8, 1047–1055. [Google Scholar] [CrossRef]
  20. Zhang, L.; Qiu, Y.; Fan, J.; Li, S.; Hu, Q.; Xing, B.; Xu, J. Underwater fish detection and counting using image segmentation. Aquac. Int. 2024, 32, 4799–4817. [Google Scholar] [CrossRef]
  21. Liu, Y.; An, B.; Chen, S.; Zhao, D. Multi-target detection and tracking of shallow marine organisms based on improved YOLO v5 and DeepSORT. IET Image Process. 2024, 18, 2273–2290. [Google Scholar] [CrossRef]
  22. Lu, H.; Uemura, T.; Wang, D.; Zhu, J.; Huang, Z.; Kim, H. Applications. Deep-sea organisms tracking using dehazing and deep learning. Mob. Netw. Appl. 2020, 25, 1008–1015. [Google Scholar]
  23. Jiang, Y.; Qu, M.; Chen, Y. Coral Detection, Ranging, and Assessment (CDRA) algorithm-based automatic estimation of coral reef coverage. Mar. Environ. Res. 2023, 191, 106157. [Google Scholar] [CrossRef] [PubMed]
  24. Alshdaifat, N.F.F.; Talib, A.Z.; Osman, M.A. Improved deep learning framework for fish segmentation in underwater videos. Ecol. Inform. 2020, 59, 101121. [Google Scholar] [CrossRef]
  25. Park, J.-H.; Kang, C. Engineering. A study on enhancement of fish recognition using cumulative mean of YOLO network in underwater video images. J. Mar. Sci. Eng. 2020, 8, 952. [Google Scholar] [CrossRef]
  26. Song, S.; Zhu, J.; Li, X.; Huang, Q. Integrate MSRCR and mask R-CNN to recognize underwater creatures on small sample datasets. IEEE Access 2020, 8, 172848–172858. [Google Scholar] [CrossRef]
  27. Yi, D.; Ahmedov, H.B.; Jiang, S.; Li, Y.; Flinn, S.J.; Fernandes, P.G. Coordinate-Aware Mask R-CNN with Group Normalization: A underwater marine animal instance segmentation framework. Neurocomputing 2024, 583, 127488. [Google Scholar] [CrossRef]
  28. Gao, M.; Li, S.; Liu, Z.; Zhang, B.; Bai, Y.; Guan, N.; Wang, P.; Chang, Q. Jellyfish Detection and Recognition Algorithm Based on Improved Faster R-CNN. Acta Metrol. Sin. 2023, 44, 54–61. [Google Scholar]
  29. Picek, L.; Říha, A.; Zita, A. Coral Reef Annotation, Localisation and Pixel-Wise Classification Using Mask R-CNN and Bag of Tricks. 2020. Available online: https://ceur-ws.org/Vol-2696/paper_83.pdf (accessed on 10 August 2024).
  30. Jaisakthi, S.; Mirunalini, P.; Aravindan, C. Coral Reef Annotation and Localization using Faster R-CNN. In Proceedings of the CLEF (Working Notes), Lugano, Switzerland, 9–12 September 2019. [Google Scholar]
  31. Ali, K.; Moetesum, M.; Siddiqi, I.; Mahmood, N. Marine object detection using transformers. In Proceedings of the 2022 19th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, Pakistan, 16–20 August 2022; pp. 951–957. [Google Scholar]
  32. Mai, K.; Cheng, W.; Wang, J.; Liu, S.; Wang, Y.; Yi, Z.; Wu, X. Underwater Object Detection Based on DN-DETR. In Proceedings of the 2023 IEEE International Conference on Real-time Computing and Robotics (RCAR), Datong, China, 17–20 July 2023; pp. 762–767. [Google Scholar]
  33. Wang, Z.; Ruan, Z.; Chen, C. Engineering. DyFish-DETR: Underwater Fish Image Recognition Based on Detection Transformer. J. Mar. Sci. Eng. 2024, 12, 864. [Google Scholar] [CrossRef]
  34. Yuan, X.; Fang, S.; Li, N.; Ma, Q.; Wang, Z.; Gao, M.; Tang, P.; Yu, C.; Wang, Y.; Martínez Ortega, J.-F.; et al. Performance Comparison of Sea Cucumber Detection by the Yolov5 and DETR Approach. J. Mar. Sci. Eng. 2023, 11, 2043. [Google Scholar] [CrossRef]
  35. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, DC, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  36. Haar, A. Zur Theorie der Orthogonalen Funktionensysteme; Georg-August-Universitat, Gottingen: Göttingen, Germany, 1909. [Google Scholar]
  37. Xu, G.; Liao, W.; Zhang, X.; Li, C.; He, X.; Wu, X. Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation. Pattern Recognit. 2023, 143, 109819. [Google Scholar] [CrossRef]
  38. Carvalho, N.F.; Waters, L.G.; Arantes, R.C.; Couto, D.M.; Cavalcanti, G.H.; Güth, A.Z.; Falcão, A.P.C.; Nagata, P.D.; Hercos, C.M.; Sasaki, D.K.; et al. Underwater surveys reveal deep-sea corals in newly explored regions of the southwest Atlantic. Commun. Earth Environ. 2023, 4, 282. [Google Scholar] [CrossRef]
  39. Remmers, T.; Grech, A.; Roelfsema, C.; Gordon, S.; Lechene, M.; Ferrari, R. Close-range underwater photogrammetry for coral reef ecology: A systematic literature review. Coral Reefs 2024, 43, 35–52. [Google Scholar] [CrossRef]
  40. Ultralytics. YOLOv8: v8.1.0. Available online: https://github.com/ultralytics/ultralytics/releases/tag/v8.1.0 (accessed on 20 May 2024).
  41. Ren, S.; Zhou, D.; He, S.; Feng, J.; Wang, X. Shunted self-attention via multi-scale token aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10853–10862. [Google Scholar]
  42. Jia, X.; De Brabandere, B.; Tuytelaars, T.; Gool, L.V. Dynamic filter networks. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  43. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  44. Lu, H.; Liu, W.; Fu, H.; Cao, Z. FADE: Fusing the assets of decoder and encoder for task-agnostic upsampling. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 231–247. [Google Scholar]
  45. Lu, H.; Liu, W.; Ye, Z.; Fu, H.; Liu, Y.; Cao, Z. SAPA: Similarity-aware point affiliation for feature upsampling. Adv. Neural Inf. Process. Syst. 2022, 35, 20889–20901. [Google Scholar]
  46. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
  47. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on COMPUTER vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  48. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  49. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  50. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Figure 1. Structure of the CIS network model, primarily divided into three parts: backbone, neck, and head. The proposed improvements are highlighted with a blue dotted outline.
Figure 2. (a) The proposed ADown_HWD module for downsampling; (b) the downsampling module used in YOLOv8.
Figure 3. Structure of the bi-level routing attention mechanism. By collecting key–value pairs from the top k relevant windows, we exploit sparsity to bypass computations in the least relevant regions.
Figure 4. Structure of the Dysample module. The input feature, upsampled feature, and sampling set are denoted by X, X′, and S, respectively. The sampling set is produced by the sampling point generator, which resamples the input feature using the grid sample function.
Figure 5. Some coral images and their corresponding annotations. The top row shows the unannotated images, while the bottom row displays the corresponding annotations.
Figure 6. Line plot of comparison experiments with other models.
Figure 7. Comparison of CIS with other network models.
Table 1. Experimental environment and configuration.

Environment               Configuration
Operating system          Ubuntu 20.04
GPU                       GeForce RTX 4090, 24 GB (NVIDIA, Santa Clara, CA, USA)
CUDA version              CUDA 11.3
Python version            Python 3.8
Deep learning framework   PyTorch 1.11.0
Table 2. Hyperparameter configuration.

Hyperparameter          Configuration
Training epochs         300
Input size              640 × 640
Batch size              16
Initial learning rate   0.01
Final learning rate     0.01
Optimizer momentum      0.937
Weight decay            0.0005
Table 3. Comparison experiment with other models.

Model         PB (%) ↑   RB (%) ↑   mAP50B (%) ↑   PM (%) ↑   RM (%) ↑   mAP50M (%) ↑   Params (M) ↓   FPS ↑
Mask R-CNN    80.4       83.3       80.4           79.7       83.0       79.7           63.7           37.2
YOLACT        79.7       83.0       79.7           78.6       81.9       78.6           49.59          17.7
YOLOv5n-seg   80.8       77.5       78.6           79.9       76.0       76.8           1.88           153.8
YOLOv7-seg    81.7       78.4       79.5           81.7       78.4       79.8           37.84          60.9
YOLOv8n-seg   86.0       76.0       82.6           85.6       75.6       81.9           3.26           161.3
YOLOv9-seg    84.2       77.4       84.0           84.3       77.0       83.2           27.36          126.6
CIS (Ours)    92.3       75.3       84.9           95.1       74.0       84.3           2.93           178.6
Note: “↑” indicates that a larger value of the metric is more favorable for the model, while “↓” indicates that a smaller value is more favorable.
Table 4. Performance comparison of the ADown and ADown_HWD modules.

Model       PB     RB     mAP50B   PM     RM     mAP50M   Params/10^6   FPS
ADown       90.2   76.7   84.4     89.9   76.3   82.5     2.62          81.3
ADown_HWD   92.3   75.3   84.9     95.1   74.0   84.3     2.93          178.6
Table 5. Comparative experiments on attention mechanisms.

Model      PB     RB     mAP50B   PM     RM     mAP50M   Params/10^6   Weight/MB
C2f        86.0   76.0   82.6     85.6   75.6   81.9     3.26          6.8
C2f_SE     86.5   75.3   83.5     87.6   75.3   82.6     3.26          6.8
C2f_CA     88.4   77.4   83.8     87.9   77.0   83.4     3.27          6.9
C2f_ECA    90.8   76.0   84.6     92.2   73.9   83.1     3.26          6.8
C2f_CBAM   89.6   78.0   83.6     89.9   77.6   83.8     3.28          6.9
C2f_BRA    93.7   74.9   83.9     93.4   74.9   83.8     2.90          6.1
Table 6. Comparison of the results from ablation experiments.

Baseline   ADown_HWD   DySample   C2f_BRA   PB     RB     mAP50B   PM     RM     mAP50M   Params/10^6   FPS
√                                           86.0   76.0   82.6     85.6   75.6   81.9     3.26          161.3
√          √                                88.7   75.3   83.9     88.7   75.3   83.1     3.27          175.4
√                      √                    87.2   76.0   83.7     86.8   75.8   83.3     3.27          137.0
√                                 √         93.7   74.9   83.9     93.4   74.9   83.8     2.90          126.6
√          √           √                    90.5   77.0   83.6     89.7   78.4   83.7     3.28          169.5
√          √                      √         91.8   72.5   83.3     92.2   72.8   83.4     2.91          109.9
√                      √          √         82.3   79.4   83.8     85.5   75.3   82.3     2.92          212.8
√          √           √          √         92.3   75.3   84.9     95.1   74.0   84.3     2.93          178.6
Note: “√” indicates that the module was used, while a blank space represents that the module was not used. The last row corresponds to the CIS network model proposed in this paper.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
