3.1. An Improved Approach to Feature Fusion
In underwater environment monitoring, fishing nets are damaged by the impact of water currents, and the resulting broken regions have a complex, variable morphology that makes them difficult to distinguish. Traditional methods introduce attention mechanisms to improve recognition accuracy [19]. However, the global attention mechanism requires computation-intensive interactions, which conflicts with the limited computational capacity of edge devices.
Therefore, this paper proposes DA2D, a low-computation attention module optimized for underwater scenes. DA2D is a flexible and efficient self-attention mechanism that dynamically adapts the positions of keys and values according to the input data, allowing the model to prioritize potential hotspot regions, focus on relevant areas, and capture more informative features. The flowchart of the DA2D sub-network module is shown in
Figure 1.
Building on the self-attention mechanism, DA2D first generates a set of uniformly distributed reference points on the feature map and then uses an offset network to learn offsets for these points from the query features. The offset network module diagram is shown in
Figure 2.
For each query q, the offset vectors of all reference points are computed from the query features by the offset network O, as shown in Equation (1):

$$\Delta p = O(q) \quad (1)$$

where $\Delta p$ denotes the new offset of each reference point relative to its original position p. In the feature sampling and deformation phase, the offset network takes the query features as inputs and samples the feature map x by bilinear interpolation at the offset positions to obtain the deformed keys and values, as shown in Equations (2) and (3):

$$\tilde{k}^{(m)} = F(x;\, p + \Delta p) \quad (2)$$

$$\tilde{v}^{(m)} = G(x;\, p + \Delta p) \quad (3)$$

where (m) denotes the m-th attention head, and F and G are the functions that map the deformed positions back to the feature map to extract the keys and values, respectively. In this way, the corresponding offsets are calculated for each reference point. A relative positional bias is also calculated from the deformed points to enhance the multi-head attention mechanism and to output the transformed feature representation. The multi-head attention computation follows the principle of the standard transformer module, except that the keys and values have already undergone the deformation operations described above, and the relative position bias B computed from the deformed points is added to the attention logits. The formula is provided as follows:

$$z^{(m)} = \sigma\!\left(\frac{q^{(m)}\,\tilde{k}^{(m)\top}}{\sqrt{d}} + B\right)\tilde{v}^{(m)}$$
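The offset prediction and bilinear sampling steps described by Equations (1)-(3) can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the layer sizes, the `tanh` offset scaling, and the use of adaptive pooling to align the query features with the reference grid are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSampler(nn.Module):
    """Sketch of Eqs. (1)-(3): predict per-reference-point offsets from the
    query features, then bilinearly sample the deformed key/value features."""
    def __init__(self, channels, n_ref=7):
        super().__init__()
        self.n_ref = n_ref  # reference grid is n_ref x n_ref
        # offset network O(q): depthwise conv -> GELU -> 1x1 conv -> (dx, dy)
        self.offset_net = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
            nn.GELU(),
            nn.Conv2d(channels, 2, 1),
        )

    def forward(self, q_feat, feat):
        # uniform reference points in normalized coordinates [-1, 1]
        ys = torch.linspace(-1, 1, self.n_ref)
        xs = torch.linspace(-1, 1, self.n_ref)
        ref = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        ref = ref.flip(-1)  # grid_sample expects (x, y) order
        # align query features with the reference grid, then predict offsets
        q_small = F.adaptive_avg_pool2d(q_feat, self.n_ref)
        delta = self.offset_net(q_small)                 # (B, 2, n, n)
        delta = delta.permute(0, 2, 3, 1).tanh() / self.n_ref  # keep offsets small
        grid = ref.unsqueeze(0).to(feat) + delta         # deformed positions p + Δp
        # bilinear sampling at the deformed positions (the role of F / G)
        return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```

The sampled output then feeds the key and value projections of the attention step.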
where σ is a softmax function that normalizes the attention weights at each position, and d is half of the feature dimension of each attention head. By projecting queries, keys, and values into separate subspaces and performing independent attention computations, the model can capture different kinds of dependencies simultaneously. DA2D further introduces the ability to dynamically adjust the key and value locations, making the attention more focused on essential regions.
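The multi-head computation over the deformed keys and values can be sketched as below. This is an illustrative PyTorch function, not the paper's code: the projection matrices are passed in as plain tensors, `d` is taken here as the per-head dimension, and the relative positional bias term is noted but omitted.

```python
import math
import torch

def deformable_mha(q_feat, sampled, w_q, w_k, w_v, num_heads):
    """Sketch of the attention step: keys/values are projected from the
    already-deformed sampled features; sigma is the softmax."""
    B, N, C = q_feat.shape
    Ns = sampled.shape[1]
    d = C // num_heads  # per-head feature dimension
    q = (q_feat @ w_q).view(B, N, num_heads, d).transpose(1, 2)    # (B, M, N, d)
    k = (sampled @ w_k).view(B, Ns, num_heads, d).transpose(1, 2)  # (B, M, Ns, d)
    v = (sampled @ w_v).view(B, Ns, num_heads, d).transpose(1, 2)
    # a relative position bias computed from the deformed points would be
    # added to these logits before the softmax; omitted for brevity
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, N, C)
```

Each head attends independently within its own subspace, then the heads are concatenated back to the full feature dimension.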
Drawing on the concept of deformable convolution [20], the DA2D module gives the model the ability to adaptively adjust the locations of its attention sampling points according to the input content. This innovation eschews the traditional approach of employing a uniform or fixed sampling strategy over the global feature map. In the implementation, the module predicts a set of offsets for each query location and applies these offsets to a predetermined grid of reference coordinates, generating a series of new, contextually relevant sampling locations. This mechanism dynamically focuses on and extracts crucial information, avoids the computational redundancy of processing all pixels indiscriminately, and thus improves computational efficiency. Further, the DA2D module adopts a sparse connectivity strategy [21] that involves only a small subset of parameters in each forward pass rather than all of them; this may increase the parameter count but reduces the number of floating point operations (FLOPs). With this approach, the study effectively improves broken-region detection performance while keeping the computational cost low, achieving a substantial performance gain from a small computational budget.
In summary, the DA2D module offers several advantages that make it particularly suitable for underwater target detection. Through its sparse connectivity mechanism, the module lets the model focus on the visual features critical for target detection while ignoring background noise. Given the wide variations in lighting, turbidity, and color in underwater environments, sparse connectivity helps the model concentrate on key structural features of fishing nets, such as mesh patterns and textures, which can then be identified even in low-visibility conditions. In addition, deformable convolution enhances the model's ability to handle perspective and morphology changes, which is crucial for detecting nets whose shape has been altered by water currents, marine organisms, or physical damage. By learning the offsets, deformable convolution dynamically adjusts the receptive field to capture more accurate target positions, ensuring precise localization in complex underwater environments.
3.2. Introducing an Attention Mechanism
In the field of underwater environmental monitoring, objects such as fishing nets often present multiple elongated slit features. Traditional convolutional neural networks (CNNs) may fail to recognize and model such features effectively because their modelling is inherently local: CNNs lack the ability to model long-range dependencies, which is crucial for accurately recognizing and processing elongated features. To address this problem, we introduce the transformer module, which has powerful global information modelling capabilities. However, the self-attention mechanism in the standard transformer architecture computes attention weights mainly through the interaction between queries and keys, without fully considering the interconnections between keys. For this reason, we adopt the CoT (contextual transformer) block [22], which is designed to mine the contextual information of the keys and use it to guide the computation of dynamic attention weights. In this way, the CoT block effectively enhances the model's ability to process visual representations, combining the advantages of contextual information mining and the self-attention mechanism. As shown in
Figure 3, the implementation of this structure significantly improves the model’s recognition and modelling effect on slender gap features.
In the initial stage of processing the input features, we set three key variables, K, Q, and V, and let K and Q initially share the same input value, X. To capture and represent features with local context, we perform a k × k grouped convolution on K, which yields an augmented key (denoted as K*). This step statically models the local information. Further, we merge K* with Q, integrating the local contextual information with the original query information. Immediately after that, we perform two successive rounds of convolution on the merged result; these steps further refine and optimize the fused feature representation, enhancing the model's ability to represent the input data.
Unlike traditional self-attention mechanisms, the construction of the attention matrix A in our approach relies not only on the direct relationship between query and key but is realized through the interaction between Q and the locally context-augmented K*. This design allows the attention mechanism to incorporate local contextual information in addition to the direct query-key connection, significantly improving the performance of the self-attention mechanism. Next, by multiplying this dynamically generated attention map with the value vector V, we implement dynamic context-based modelling. This process allows the model to dynamically adjust its emphasis on information, which in turn significantly improves the quality of the feature representation.
Ultimately, the contextual transformer (CoT) module effectively fuses local and global contexts by integrating features obtained from local static context modelling with those obtained based on dynamic context modelling. The CoT module is able to better distinguish fishing nets from the background through its ability to efficiently integrate contextual information and maintain a high recognition accuracy even in the case of poor visibility. In addition, dynamic context modelling allows the model to adapt to the changes of the fishing nets in different scenarios, such as the changes in the state of the net under the action of water currents.
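The static/dynamic context pipeline described above can be sketched as a small PyTorch module. This is a hedged illustration of the idea, not the CoT authors' implementation: the group count, the ReLU between the two 1×1 convolutions, and the additive fusion at the end are illustrative choices.

```python
import torch
import torch.nn as nn

class CoTBlock(nn.Module):
    """Sketch of the CoT idea: K and Q start from the input X; a k x k grouped
    convolution on K gives the static context K*; [K*, Q] passes through two
    successive convolutions to form the attention map A, which weights V."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.key_conv = nn.Conv2d(channels, channels, k, padding=k // 2,
                                  groups=4, bias=False)          # K* (static context)
        self.value_conv = nn.Conv2d(channels, channels, 1, bias=False)  # V
        self.attn = nn.Sequential(                               # two successive convs
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        k_star = self.key_conv(x)                     # local static context
        v = self.value_conv(x)
        a = self.attn(torch.cat([k_star, x], dim=1))  # A from [K*, Q], with Q = X
        a = torch.softmax(a.flatten(2), dim=-1).view_as(a)
        dynamic = a * v                               # dynamic context modelling
        return k_star + dynamic                       # fuse static + dynamic contexts
```

The returned feature thus combines the grouped-convolution static context with the attention-weighted dynamic context.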
3.3. Integration of the SEAM Module
In studying the detection of underwater fishing net vulnerabilities, the recognition system may miss breaks when water currents superimpose a vulnerability on the intact net behind it; we therefore developed techniques to enhance detection accuracy. We incorporate the SEAM module [23] into the YOLO detection framework to enhance the spatial transformation invariance of the model. SEAM effectively bridges the gap between fully supervised and weakly supervised semantic segmentation by constructing a twin-network structure with shared weights to achieve covariant regularization. In this structure, one branch directly processes the original input image, while the other branch first applies an affine transformation (e.g., scaling, rotation, or flipping) to the input image before forward propagation. This design ensures that the generated class activation maps (CAMs) remain consistent when the input image is transformed, mimicking the way pixel-level labels transform with the image under fully supervised conditions. SEAM introduces covariant regularization so that the CAMs predicted from the variously transformed images [24] provide self-supervision for network learning:

$$F(A(I)) \approx A(F(I))$$

where F(·) denotes the network, A(·) denotes the affine transformation, and I is the input image.
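The covariant-regularization constraint on F(·) and A(·) can be sketched as a consistency loss. This is a minimal illustration under assumed names (`net` returns a CAM tensor, a horizontal flip stands in for the affine transform A); the actual SEAM loss and transform set may differ.

```python
import torch

def equivariance_loss(net, image, affine):
    """Sketch of covariant regularization: the CAM of a transformed image
    should match the transformed CAM of the original, F(A(I)) ~= A(F(I))."""
    transformed_cam = affine(net(image))  # A(F(I))
    cam_of_transformed = net(affine(image))  # F(A(I))
    return (transformed_cam - cam_of_transformed).abs().mean()  # L1 consistency

# usage: a horizontal flip as the affine transform A(.)
flip = lambda t: torch.flip(t, dims=[-1])
```

For a perfectly equivariant network the loss is zero; minimizing it pushes the CAMs toward behaving like pixel-level labels under transformation.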
Further, SEAM introduces a pixel correlation module (PCM) at the end of the network to refine the CAM by capturing the contextual information of each pixel through a self-attention mechanism. The PCM measures feature similarity between pixels using the cosine distance and computes affinity by normalizing the inner product in the feature space, optimizing the CAM to fit object boundaries more accurately. This approach not only ensures the consistency of the CAM under different transformations but also effectively improves detection performance by integrating low-level features and adjusting inter-pixel similarity. The block diagram of the PCM module is shown in
Figure 4.
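The PCM refinement step can be sketched as follows: an inter-pixel affinity matrix is built from cosine similarity of the features, row-normalized, and used to propagate CAM scores between similar pixels. This is an illustrative sketch, not the reference implementation; the ReLU on the similarity and the normalization details are assumptions.

```python
import torch
import torch.nn.functional as F

def pcm_refine(feat, cam):
    """Sketch of the pixel correlation module: cosine-similarity affinity
    between pixels, normalized and applied to the CAM so that activations
    better fit object boundaries."""
    B, C, H, W = feat.shape
    f = F.normalize(feat.flatten(2), dim=1)        # (B, C, HW), unit-norm per pixel
    affinity = torch.relu(f.transpose(1, 2) @ f)   # (B, HW, HW) cosine similarity
    affinity = affinity / affinity.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    cam_flat = cam.flatten(2)                      # (B, K, HW) class activation maps
    refined = cam_flat @ affinity.transpose(1, 2)  # each pixel pools similar pixels
    return refined.view_as(cam)
```

Each output pixel is a weighted average of the CAM values at feature-similar pixels, which smooths activations within objects while respecting boundaries.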
When integrating SEAM and PCM into the YOLO framework, the PCM offers the following advantages. By removing unnecessary skip connections, reducing parameters, and using the ReLU activation function instead of the sigmoid, the model structure is streamlined, overfitting is avoided, and the model's sensitivity to contextual information is improved. These improvements enable the model to identify and localize vulnerabilities more efficiently, so that broken regions of underwater fishing nets are accurately detected even under different spatial transformations, enhancing the accuracy of underwater fishing net detection.
3.4. Block Diagram of Light-YOLO
This part introduces the Light-YOLO architecture, depicted in
Figure 5, which enhances YOLOv8n by substituting its SPPF module with the custom-designed DA2D module. Unlike the SPPF's parallel pooling layers, the DA2D module adjusts its sampling more flexibly, better capturing target shapes and details, which makes it well suited to complex geometric transformations and occlusions. It also incorporates the CoT and SEAM modules to strengthen the interaction between the detection head and feature extraction, markedly enhancing performance and efficiency over YOLOv8n.
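The module-substitution pattern can be sketched generically as below. This is purely illustrative: `DA2DBlock` is a placeholder stand-in (not the paper's module), and the toy `nn.Sequential` backbone stands in for the real YOLOv8n graph, where the SPPF stage would be located by its position in the model definition.

```python
import torch
import torch.nn as nn

class DA2DBlock(nn.Module):
    """Placeholder for a DA2D-style attention stage with an unchanged
    channel interface (same in/out width as the stage it replaces)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in op
    def forward(self, x):
        return self.body(x)

def replace_stage(backbone: nn.Sequential, index: int, channels: int):
    """Swap the stage at `index` (e.g., where SPPF sits) for a DA2D-style
    block, keeping the surrounding channel widths intact."""
    backbone[index] = DA2DBlock(channels)
    return backbone
```

Because the replacement preserves the channel interface, the rest of the backbone and the detection head need no changes.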
In summary, the proposed model uses the lightweight attention mechanism introduced in this paper, based on sparse connectivity and deformable convolution, which not only helps the network focus on critical features but also improves overall efficiency by reducing unnecessary computational paths. A modularized network structure allows flexible adjustment of model depth and width according to task complexity, reducing unnecessary computational load while maintaining high accuracy.