Article

A Cross-Level Iterative Subtraction Network for Camouflaged Object Detection

1 College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China
2 Information Center, Ministry of Water Resources, Beijing 100053, China
3 Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing 211100, China
4 Water Resources Service Center of Jiangsu Province, Nanjing 210029, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 8063; https://doi.org/10.3390/app14178063
Submission received: 9 May 2024 / Revised: 3 August 2024 / Accepted: 7 September 2024 / Published: 9 September 2024

Abstract

Camouflaged object detection (COD) is a challenging task, aimed at segmenting objects that are similar in color and texture to their background. Sufficient multi-scale feature fusion is crucial for accurately segmenting object regions. However, most methods usually focus on information compensation, overlooking the difference between features, which is important for distinguishing the object from the background. To this end, we propose the cross-level iterative subtraction network (CISNet), which integrates information from cross-layer features and enhances details through iteration mechanisms. CISNet involves a cross-level iterative structure (CIS) for feature complementarity, where texture information is used to enrich high-level features and semantic information is used to enhance low-level features. In particular, we present a multi-scale strip convolution subtraction (MSCSub) module within CIS to extract difference information between cross-level features and fuse multi-scale features, which improves the feature representation and guides accurate segmentation. Furthermore, an enhanced guidance attention (EGA) module is presented to refine features by deeply mining local context information and capturing a broader range of relationships between different feature maps in a top-down manner. Extensive experiments conducted on four benchmark datasets demonstrate that our model outperforms the state-of-the-art COD models in all evaluation metrics.

1. Introduction

Camouflaged object detection (COD) is a segmentation task that aims to segment objects that closely resemble their background in terms of color and texture. Additionally, camouflaged objects vary widely in appearance (e.g., shape and size) and often have rugged edges, which further increases the difficulty of this task. Exploring methods that overcome the challenges of scale variation and edge ambiguity in COD not only contributes to the advancement of camouflaged object detection but also benefits other closely related computer vision tasks, such as medical image segmentation [1,2,3,4], transparent object detection [5], species discovery [6], and military reconnaissance [7]. In addition to its academic value, COD can also promote practical applications in the fields of water body biometrics, pest and disease identification, and environmental change monitoring.
In recent years, significant progress has been made in COD due to the advancement and widespread application of deep learning [8] and the proposal of large-scale COD datasets (e.g., COD10K [9] and NC4K [10]). Deep learning-based COD methods have achieved impressive performance in terms of segmentation accuracy and speed. These methods commonly utilize either convolutional neural networks (CNNs) such as ResNet50 [11] and Res2Net50 [12] or transformers like PVT [13] and Swin Transformer [14] as the backbone of their models. While both have shown promising results, transformers possess the advantage of capturing long-range dependencies among all elements within the input and modeling global information. Moreover, whether CNNs or transformers are used as the backbone, feature integration approaches are essential for feature interaction and information fusion between different levels, yielding enhanced feature representations. Analyzing several commonly used feature integration approaches in COD, as shown in Figure 1, we find that existing approaches fall into three strategies. The first strategy integrates features from adjacent levels, either for specific level features or across all levels, as depicted in Figure 1a [15] and Figure 1d [16]. The second, represented by Figure 1b [17], adopts an FPN-like architecture, progressively aggregating multi-level features in a top-down manner to generate a high-resolution semantic feature representation. The third strategy introduces auxiliary tasks (e.g., edge detection) after the backbone, as shown in Figure 1c [18]. From this analysis, we draw two findings. Firstly, existing feature integration approaches primarily focus on adjacent levels and are deficient in integrating cross-level features. The high-level features extracted from the backbone contain more semantic information, which is beneficial for locating objects, while the low-level features contain more spatial information, which is more helpful in recovering object edge details. Although previous works integrate features from different levels, we argue that a more comprehensive and effective integration approach should consider the difference and complementarity of cross-level features and leverage their distinct characteristics for improved segmentation performance. Secondly, previous methods concentrate on information compensation, ignoring the impact of information differences on feature fusion. Current methods for feature integration involve element-wise addition [19], concatenation operations [20,21], and specially designed modules that contain multiple operations [15,18,22]. However, all of these methods integrate features purely for the purpose of information complementation; they ignore the differences between different-level features and may introduce redundancy, leading to inaccurate localization and indistinct edges.
Based on the above analysis, in this paper, we propose a transformer-based cross-level iterative subtraction network (CISNet) for COD, which is designed to perform difference complementation between cross-level features and then progressively improve the accuracy of segmentation in a top-down manner. In detail, we present a cross-level iterative structure (CIS), as depicted in Figure 1e, specifically designed for cross-level feature integration. And, within CIS, we adopt a sequential iteration strategy to facilitate the information interaction across multiple levels. This structure incorporates the utilization of multi-scale strip convolution subtraction (MSCSub) to integrate the highest-level features with their relevant lower-level features, leading to more comprehensive and semantically expressive feature representation. Within MSCSub, strip convolution layers with multiple kernel sizes are employed to capture features in different directions and scales, facilitating the improved extraction of local detail information and increasing the adaptability of the model to handle camouflaged objects with varying appearances. And, through a subtraction operation in MSCSub, difference information between different levels can be extracted, and feature complementarity is performed. Moreover, we aggregate the outputs of each MSCSub at the same level to capture expansive horizontal information. To further refine the features at each level, we design enhanced guidance attention (EGA) to improve the sensing of local context information and capture a broader range of feature relationships with greater diversity. Our main contributions can be summarized as follows:
  • We propose a cross-level iterative subtraction network for camouflaged object detection, which focuses on difference complementation between cross-level features to enhance the segmentation performance;
  • We present a cross-level iterative structure and multi-scale strip convolution subtraction module for the efficient interaction and integration of features across multiple levels, which captures difference information between cross-level features and guides the network to focus on the details of objects;
  • We present the enhanced guidance attention module to deeply mine multi-scale local context information and capture the relationships between different features, which aims to improve the segmentation accuracy of camouflaged objects.
The rest of this paper is organized as follows. In Section 2, we discuss related work. In Section 3, we provide an overview of our model and detailed descriptions of each module. Section 4 presents experimental results and related analysis, and Section 5 shows some use cases. Finally, we conclude the paper in Section 6.

2. Related Work

2.1. Camouflaged Object Detection

Traditional COD methods extract a variety of hand-crafted features between the foreground and the background to segment the camouflaged object [23,24,25]. However, these methods are limited in their applicability to simple scenes and exhibit inefficiency in complex scenes.
In recent years, CNN-based COD methods have developed significantly, and they can be broadly categorized as follows: (1) multi-scale feature fusion: ZoomNet [26] designs a mixed-scale triplet network that takes images as input at three scales, learning their discriminative mixed-scale semantics and mining the subtle differences between the target and background. C2FNet [15] incorporates an attention-induced cross-layer fusion module and a two-branch context-aware module to obtain richer global context information; (2) multi-stage strategy: SINet [9] mimics the predation process of natural organisms and designs a strategy of searching before recognizing. PFNet [17] proposes a position-then-focus approach and introduces the strategy of distraction mining. LSR [10] models visual perception behavior and achieves effective edge priors and cross-comparison between potential camouflaged regions and the background. MRRNet [27] proposes a match–recognize–refine strategy to recognize camouflaged objects by matching appropriate receptive fields to camouflaged objects of different sizes and shapes. SegMaR [28] proposes a segment–magnify–reiterate strategy that obtains detailed distraction information through the magnification process in multi-stage training, enabling the accurate segmentation of camouflaged targets with complex shapes; (3) comparative learning: UJSC [29] introduces an uncertainty-aware adversarial training strategy that combines SOD and COD to improve the accuracy of both simultaneously. UGTR [30] combines a convolutional neural network and a transformer and uses a probabilistic representation model to learn the uncertainty of the camouflaged object, allowing the model to focus more on uncertain regions and leading to more accurate segmentation.
Due to their convolutional layers, CNNs excel at extracting local features from images. These convolutional layers can effectively capture spatial hierarchies of features, which is beneficial in scenarios where local details are critical for object detection. Based on this, we adopt the multi-scale feature fusion strategy to combine information from different scales, enhancing the feature representation ability. Different from the above methods, CISNet integrates cross-level features and enables full information interaction through an iterative approach, which makes fuller use of the complementarity between different-level features.

2.2. Multi-Level Feature Integration

Multi-level feature integration is a common approach in object detection and segmentation. Low-level features capture detailed local information, while high-level features capture global semantic context. Combining features from different levels enables the incorporation of local and global contextual information, which can enhance the model’s ability to handle appearance variations in COD tasks. Previous works have proposed several approaches for multi-level feature integration. For example, C2FNet [15] is designed with an attention-induced cross-level fusion module. It integrates cross-level features by utilizing Multi-Scale Channel Attention (MSCA) to obtain both local and global contexts, which allows it to exploit multi-scale information to alleviate scale changes. However, it is noteworthy that this method incurs significant computational overhead. MINet [31] uses element-wise addition to fuse adjacent two-level or three-level features, integrating contextual information from neighboring resolutions and enhancing the representation of features at different resolutions. This straightforward feature fusion not only makes it difficult to utilize effective information but also leads to redundancy. In order to maintain the consistency and correlations of features and enrich semantic context, FSNet [32] proposes a cross-connection strategy to transfer information between low-resolution and high-resolution features. Although this method enables sufficient interaction across different feature levels, it may not fully exploit their inherent characteristics.
Unlike these methods, our approach takes into account the difference in features between cross-level features. To effectively capture the subtle distinctions within the object, we propose to integrate strip convolution with element-wise subtraction. This approach is specifically designed to extract information regarding the difference between the high-level features and their corresponding lower-level features, thereby achieving feature complementarity. Subsequently, we fuse the extracted difference features from various convolutional layers using element-wise addition. In this way, the fused features we obtain are not only rich in discriminative information but also free of redundancy.

2.3. Transformers in Computer Vision

Transformers were originally designed for NLP [33,34]; in recent years, researchers have widely applied them to a variety of computer vision tasks [35,36,37,38], such as image classification [39,40,41], object detection [42,43,44], and semantic segmentation [45,46,47,48,49], achieving impressive performance. Compared with CNN-based methods [11,50,51], the multi-head self-attention layers in transformers have dynamic weights and global receptive fields, which gives transformers the advantage of modeling global contextual information.
Transformer-based models are also turning out to be a big trend in COD. HitNet [52] realizes the refinement of high-resolution features to low-resolution feature representations with iterative feedback by establishing global loop connections between multi-scale resolutions. SARNet [16] mimics the human vision mechanism to observe the behavioral patterns of camouflaged objects and formulates a search–amplify–recognize strategy to break camouflage by zooming in on the object region and alternately focusing on the foreground and background. FSPNet [53] designs a feature shrinkage decoder with a neighbor interaction module, which aggregates neighboring features step-by-step via a shrinkage pyramid for accurate decoding. CamoFormer [20] investigates a more effective way to utilize the self-attention mechanism by assigning different features to different self-attention heads to handle foreground and background regions separately to obtain more accurate segmentation results.
Each approach in the literature offers strengths and limitations in addressing the challenges of COD. Traditional methods excel in simpler scenarios, but they often falter in complex scenes due to their reliance on hand-crafted features. CNN-based approaches have significantly advanced the field by leveraging deep learning techniques to capture intricate details and semantic context, as evidenced by the diverse strategies described above. However, CNNs primarily focus on local feature extraction and struggle to capture long-range dependencies across the entire image. This limitation can hinder their ability to effectively model global contextual information, which is crucial in tasks where understanding relationships between distant parts of the image is necessary. In contrast, transformer-based models capitalize on their ability to model global dependencies and adaptively attend to relevant features, making them robust in diverse and challenging scenes. Therefore, following the baseline of transformer-based COD approaches, we propose a new scheme to achieve more accurate camouflaged object detection. The key advantage of the proposed scheme is that we take into account not only the compensation between cross-level features but also their differences, yielding more promising results. We also demonstrate that our method captures more detailed and discriminative information.

3. CISNet Method

The overall architecture of CISNet, as depicted in Figure 2, comprises three main components: a transformer-based backbone, the cross-level iterative structure (CIS) (incorporating multi-scale strip convolution subtraction (MSCSub)), and enhanced guidance attention (EGA). We adopt PVTv2 [13] as the backbone to generate multi-scale features $\{E_i\}_{i=1}^{4}$ from the input image $I \in \mathbb{R}^{H \times W \times 3}$. The resolution of $E_i$ is $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$, and the numbers of channels are 64, 128, 320, and 512, respectively. Firstly, we separately apply a $1 \times 1$ convolution to compress the number of channels in each feature map obtained from the backbone to 64, which reduces the parameters. Then, to make more effective use of the distinctive characteristics of different levels, we feed features from the highest level and each lower level into MSCSub separately. This allows the extraction of difference information across levels and generates complementary features $F_{hi}$, where $h$ and $i$ denote the level of the highest-level feature and the lower-level feature, respectively. This process is iterated to capture higher-order complementarity information across multiple levels. The outputs of MSCSub at the same level are then fused using an addition operation, yielding complementarity-enhanced features $\{F_{level}^{i}\}_{i=1}^{3}$. Subsequently, these enhanced features are fed into the EGA module to deeply mine local context information, outputting the refined features $\{X_i\}_{i=1}^{3}$. We use EGA to refine $F_{level}^{i}$ with $X_{i+1}$ in a top-down manner to progressively improve the accuracy of camouflaged object segmentation. Finally, we use $X_1$, generated by the last EGA, to predict the camouflage map $P$, which is supervised by the ground truth.
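To make the data flow described above concrete, the following is a minimal PyTorch sketch of the forward pass, with MSCSub and EGA treated as injected sub-modules. The class name, the exact iteration schedule (each lower level collecting MSCSub outputs from every higher backbone level), and the output handling are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CISNetSketch(nn.Module):
    """Simplified data flow of CISNet: backbone -> channel squeeze -> CIS
    (cross-level MSCSub pairing + per-level fusion) -> top-down EGA -> head."""

    def __init__(self, backbone, mscsub, ega_modules, channels=64):
        super().__init__()
        self.backbone = backbone          # callable returning [E1, E2, E3, E4]
        # 1x1 convolutions compress each backbone stage (64/128/320/512) to 64 channels
        self.squeeze = nn.ModuleList(
            nn.Conv2d(c, channels, kernel_size=1) for c in (64, 128, 320, 512)
        )
        self.mscsub = mscsub              # shared MSCSub(F_h, F_i) -> F_hi (an assumption)
        self.ega = nn.ModuleList(ega_modules)  # three EGA modules, deep to shallow
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(3)
        )
        self.head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, image):
        feats = [s(e) for s, e in zip(self.squeeze, self.backbone(image))]  # E1..E4

        # Cross-level pairing: every level i collects MSCSub outputs from all
        # higher levels h, matching F_level^i = Conv(sum_{h>i} F_hi).
        collected = [[] for _ in range(3)]
        for h in range(3, 0, -1):                      # indices 3, 2, 1 -> E4, E3, E2
            for i in range(h):
                collected[i].append(self.mscsub(feats[h], feats[i]))
        f_level = [conv(sum(outs)) for conv, outs in zip(self.fuse, collected)]

        # Top-down refinement: EGA takes F_level^i and the deeper output X_{i+1};
        # the first EGA takes F_level^3 and E4.
        x = feats[3]
        for i in (2, 1, 0):
            x = self.ega[i](f_level[i], x)

        pred = self.head(x)                            # camouflage map P (logits)
        return F.interpolate(pred, size=image.shape[-2:], mode="bilinear",
                             align_corners=False)
```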

3.1. Cross-Level Iterative Structure

The highest-level features extracted by the transformer contain the most semantic information, which is beneficial for locating objects, while the lowest-level features contain the most spatial information, which is advantageous for constructing object edge regions. Leveraging this characteristic, we propose a cross-level iterative structure (CIS) to perform the information integration and detail enhancement of cross-level features iteratively. Specifically, CIS integrates the highest-level features and all relevant lower-level features separately at each stage, which is referred to as cross-level integration. This integration approach promotes the mutual sharing of semantic and spatial information between high-level and low-level features. Compared to the integration of adjacent level features, our cross-level integration approach allows for a more comprehensive fusion of semantic and spatial information, utilizing the strengths of both high-level and low-level features to enhance the overall representation of features. To further enrich the multi-level information captured by our approach, we adopt a sequential iteration strategy for cross-level integration. This strategy enables the generation of a series of information-rich features with varying orders and receptive fields. By iteratively refining the integration process, our approach effectively captures a broader range of context information across multiple levels, enabling the acquisition of more comprehensive and semantically expressive feature representation.

Multi-Scale Strip Convolution Subtraction

The features generated at different levels possess unique characteristics. Exploiting the differences between these features can highlight subtle features in objects and their backgrounds. Therefore, we propose the multi-scale strip convolution subtraction (MSCSub), as shown in Figure 2, to emphasize the differences from texture to structure between different levels and capture their complementary information, providing features with rich information for subsequent operations.
To cope with the challenges posed by the diverse sizes and shapes of camouflaged objects, as well as the particular “ruggedness” of their edges, we utilize strip convolution layers within MSCSub. Strip convolution uses convolution kernels of different scales in the horizontal and vertical directions to capture features, which better captures subtle features such as the texture and shape of the target object in various directions. Specifically, it uses kernels that are longer in one direction and narrower in the other to aggregate both global and local context information. The longer kernel dimension captures remote dependencies between isolated regions, while the narrower dimension effectively captures local context information and prevents interference from extraneous regions, thereby enhancing the ability to capture details along object boundaries. Furthermore, considering that the receptive field of a single-scale convolution kernel is limited, we employ strip convolutions of four sizes, $1\times3$, $3\times1$, $1\times5$, and $5\times1$, in addition to $1\times1$. After obtaining multi-scale features, we directly subtract features of the same scale to capture the complementarity information between high-level and low-level features and highlight detail and structure differences according to pixel-to-pixel and region-to-region patterns. In this way, MSCSub introduces strip convolution kernels of different sizes to adaptively capture contextual information at different scales and allows multi-scale features to interact with each other, thereby generating more effective and discriminative feature information. The entire multi-scale strip convolution subtraction process can be formulated as follows:
$$F_{hi} = \mathrm{Conv}_{3\times3}\Big(\big|\Gamma_{up}(F_h) \ominus F_i\big| + \big|\Gamma_{up}(\mathrm{Conv}_{1\times3}(F_h)) \ominus \mathrm{Conv}_{1\times3}(F_i)\big| + \big|\Gamma_{up}(\mathrm{Conv}_{3\times1}(F_h)) \ominus \mathrm{Conv}_{3\times1}(F_i)\big| + \big|\Gamma_{up}(\mathrm{Conv}_{1\times5}(F_h)) \ominus \mathrm{Conv}_{1\times5}(F_i)\big| + \big|\Gamma_{up}(\mathrm{Conv}_{5\times1}(F_h)) \ominus \mathrm{Conv}_{5\times1}(F_i)\big|\Big)$$
where $F_h$ and $F_i$ represent feature maps from the highest level and a lower level at the current stage. For example, for features generated by the backbone, $F_h$ corresponds to $E_4$ and $F_i$ corresponds to $E_3$, $E_2$, or $E_1$. $\Gamma_{up}$ is a bilinear upsampling operation that keeps the shapes of the features matched. The symbol $\ominus$ is the element-wise subtraction operation, $|\cdot|$ represents the absolute value, and $\mathrm{Conv}_{m\times n}(\cdot)$ denotes the strip convolution layer.
Subsequently, we aggregate the cross-scale complementary features $F_{hi}$ at the same level to generate complementarity-enhanced features $\{F_{level}^{i}\}_{i=1}^{3}$. This process can be formulated as follows:
$$F_{level}^{i} = \mathrm{Conv}_{3\times3}\Big(\sum_{h=i+1}^{4} F_{hi}\Big), \quad i = 1, 2, 3.$$
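The following is a minimal PyTorch sketch of the MSCSub computation defined by the two equations above, assuming 64-channel inputs and strip-convolution weights shared between the two levels at each scale; padding choices are assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSCSub(nn.Module):
    """Multi-scale strip convolution subtraction between a higher-level feature
    F_h and a lower-level feature F_i (sketch of the formulation above)."""

    def __init__(self, channels=64):
        super().__init__()
        # Strip convolutions at 1x3, 3x1, 1x5 and 5x1 (weights shared between
        # the two levels in this sketch -- an assumption).
        self.strips = nn.ModuleList([
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
            nn.Conv2d(channels, channels, (1, 5), padding=(0, 2)),
            nn.Conv2d(channels, channels, (5, 1), padding=(2, 0)),
        ])
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_h, f_i):
        def up(t):  # bilinear upsampling Γ_up to the lower level's resolution
            return F.interpolate(t, size=f_i.shape[-2:], mode="bilinear",
                                 align_corners=False)

        # |Γ_up(F_h) ⊖ F_i|  (identity branch)
        diff = torch.abs(up(f_h) - f_i)
        # |Γ_up(Conv_{m×n}(F_h)) ⊖ Conv_{m×n}(F_i)| for each strip kernel
        for conv in self.strips:
            diff = diff + torch.abs(up(conv(f_h)) - conv(f_i))
        return self.out_conv(diff)                    # F_hi
```

The per-level feature $F_{level}^{i}$ is then obtained by summing the $F_{hi}$ produced at level $i$ and applying a $3\times3$ convolution, as in the second equation above.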

3.2. Enhanced Guidance Attention

Although we obtain complementarity-enhanced features from the previous process, some noise and blurred edges may still exist in $\{F_{level}^{i}\}_{i=1}^{3}$. The most common solution is to use fully convolutional decoders, which have the ability to capture high-level semantics, but their local convolution and pooling operations restrict the receptive field. Likewise, self-attention mechanisms face difficulties in modeling high-frequency information and capturing fine contours, which may result in biases when predicting object boundaries. Inspired by the Multi-Dconv Head Transposed Attention [54], we design enhanced guidance attention (EGA) to further explore multi-scale local contextual information and capture a broader range of relationships between different feature maps, improving the segmentation accuracy of camouflaged objects. The structural details of EGA are shown in Figure 3.
Given the complementarity-enhanced feature $F_{level}^{i}$ and the output $X_{i+1}$ from the deeper EGA (for the initial EGA, the two inputs are $F_{level}^{3}$ and $E_4$), EGA first generates query ($Q$), key ($K$), and value ($V$) matrices. $Q$ and $V$ are generated by adding up the results of three parallel convolutions of $1\times5$, $5\times1$, and $3\times3$ applied on normalized $X_{i+1}$ and $F_{level}^{i}$, respectively. In this process, the $1\times5$ and $5\times1$ strip convolutions capture long-range dependencies in the horizontal and vertical directions, while the standard $3\times3$ convolution gathers more local context information. Combining these three convolution layers is conducive to detecting camouflaged objects of different sizes and shapes, because we detect features with both square and strip windows, which makes our approach more advantageous than methods such as dilated convolution and pyramid pooling. $K$ is obtained by normalizing $X_{i+1}$. Subsequently, we concatenate $Q$ and $K$ to generate an enhanced key $\hat{K}$, which can be formulated as
$$\hat{K} = \mathrm{Conv}_{1\times1}([Q, K])$$
where $[Q, K]$ denotes the concatenation operation and $\mathrm{Conv}_{1\times1}$ is a $1\times1$ convolution layer that compresses the number of feature channels from 128 to 64.
In the next step, we reshape the $Q$ and $\hat{K}$ matrices such that their dot-product interaction generates an attention map $A$. Then, we apply the attention map $A$ to $V$ to obtain a more precise detection than $F_{level}^{i}$. In addition, we fuse the original $F_{level}^{i}$ into the result via a skip connection to further complement its semantic information. Overall, the EGA process is expressed as
$$X_i = V \cdot \mathrm{Softmax}\Big(\frac{Q\hat{K}}{\alpha}\Big) + F_{level}^{i}, \quad i = 1, 2, 3,$$
where $\alpha$ is a learnable scaling parameter that controls the magnitude of the dot product of $Q$ and $\hat{K}$ before the softmax function is applied. We use $P$, predicted from $X_1$, as the final prediction map, which is supervised by the ground truth.
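A single-head PyTorch sketch of the EGA computation described above follows; the channel-attention reshaping follows the transposed-attention style of [54], and the normalization layers, the internal upsampling of $X_{i+1}$, and the single-head formulation are assumptions rather than the exact released design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EGA(nn.Module):
    """Enhanced guidance attention (single-head sketch)."""

    def __init__(self, channels=64):
        super().__init__()
        self.norm_x = nn.GroupNorm(1, channels)       # normalization of X_{i+1} (assumed)
        self.norm_f = nn.GroupNorm(1, channels)       # normalization of F_level^i (assumed)

        def multi_conv():
            # Parallel 1x5, 5x1 and 3x3 convolutions whose outputs are summed
            return nn.ModuleList([
                nn.Conv2d(channels, channels, (1, 5), padding=(0, 2)),
                nn.Conv2d(channels, channels, (5, 1), padding=(2, 0)),
                nn.Conv2d(channels, channels, 3, padding=1),
            ])

        self.q_convs = multi_conv()                   # Q from normalized X_{i+1}
        self.v_convs = multi_conv()                   # V from normalized F_level^i
        self.k_proj = nn.Conv2d(2 * channels, channels, 1)  # enhanced key K_hat
        self.alpha = nn.Parameter(torch.ones(1))      # learnable scaling

    def forward(self, f_level, x_deep):
        # Bring the deeper output to the current resolution (assumed)
        x_deep = F.interpolate(x_deep, size=f_level.shape[-2:], mode="bilinear",
                               align_corners=False)
        xn, fn = self.norm_x(x_deep), self.norm_f(f_level)
        q = sum(conv(xn) for conv in self.q_convs)
        v = sum(conv(fn) for conv in self.v_convs)
        k_hat = self.k_proj(torch.cat([q, xn], dim=1))

        b, c, h, w = q.shape
        q, k_hat, v = q.flatten(2), k_hat.flatten(2), v.flatten(2)   # B x C x HW
        attn = torch.softmax(q @ k_hat.transpose(1, 2) / self.alpha, dim=-1)  # B x C x C
        out = (attn @ v).view(b, c, h, w)
        return out + f_level                          # skip connection with F_level^i
```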

3.3. Loss Function

We denote the prediction $P$, generated from $X_1$ through a $3\times3$ convolution followed by a Sigmoid function, as the final output. During training, $P$ is upsampled to match the size of the input image and supervised by the weighted binary cross-entropy loss ($L_{BCE}^{w}$) [55] and the weighted IoU loss ($L_{IoU}^{w}$) [55]. Each pixel is assigned a weight based on the difference between the center pixel and its surrounding environment, allowing for increased attention to hard pixels. Additionally, we employ an ImageNet pre-trained VGG16 classification network as LossNet [56], as depicted in Figure 4, to compute the difference $L_f$ between features extracted from the prediction and the ground truth, aiding network optimization, which is formulated as
$$L_f = \sum_{i=1}^{4} l_f^{i}$$
We take $F_P^i$ and $F_G^i$ to represent the multi-scale features extracted from the prediction and the ground truth by the pre-trained VGG16, and $l_f^i$ is calculated as their Euclidean distance (L2 loss):
$$l_f^i = \big\| F_P^i - F_G^i \big\|_2, \quad i = 1, 2, 3, 4.$$
In summary, the total loss function of our model is defined as
$$L_{total} = L_{BCE}^{w} + L_{IoU}^{w} + L_f$$
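The sketch below assembles the total loss under these definitions. The weighted BCE/IoU terms follow the common F3Net-style [55] implementation, and the VGG16 stage indices used for the feature loss are assumptions; the prediction is treated here as a pre-sigmoid logit map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16


def weighted_bce_iou(pred, gt):
    """Weighted BCE + weighted IoU loss (F3Net-style [55]); pred is a logit map."""
    # Pixels that differ from their local neighbourhood receive larger weights
    weight = 1 + 5 * torch.abs(F.avg_pool2d(gt, 31, stride=1, padding=15) - gt)
    bce = F.binary_cross_entropy_with_logits(pred, gt, reduction="none")
    wbce = (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = (prob * gt * weight).sum(dim=(2, 3))
    union = ((prob + gt) * weight).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()


class FeatureLoss(nn.Module):
    """LossNet-style feature loss [56]: L2 distance between VGG16 features of the
    prediction and the ground truth at four stages (stage indices assumed)."""

    def __init__(self):
        super().__init__()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.stages = (3, 8, 15, 22)      # ReLU outputs at the end of blocks 1-4

    def forward(self, pred, gt):
        x = torch.sigmoid(pred).repeat(1, 3, 1, 1)   # single-channel -> 3-channel
        y = gt.repeat(1, 3, 1, 1)
        loss = 0.0
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in self.stages:
                loss = loss + F.mse_loss(x, y)
            if idx == self.stages[-1]:
                break
        return loss


def total_loss(pred, gt, feature_loss):
    """L_total = L^w_BCE + L^w_IoU + L_f."""
    return weighted_bce_iou(pred, gt) + feature_loss(pred, gt)
```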

4. Experiments

4.1. Datasets

In the experiments, we evaluate the performance of CISNet using four benchmark datasets: CHAMELEON [57], CAMO [58], COD10K [9], and NC4K [10]. CHAMELEON consists of 76 high-resolution images collected from Google search using the following search string: camouflaged animals. CAMO includes 1250 images containing camouflaged objects. COD10K is a large annotated dataset that includes 5066 camouflaged images. NC4K is currently the largest COD test dataset, consisting of 4121 images downloaded from the Internet. Following the approach taken in previous works [9,26,28], we adopt 1000 images from the CAMO dataset and 3040 images from COD10K for training, while the remaining images are used for testing.

4.2. Evaluation Metrics

Following previous works [9,26,28], we use four commonly used metrics for evaluation: Structure-measure ($S_\alpha$) [59], mean E-measure ($E_\phi$) [60], weighted F-measure ($F_\beta^\omega$) [61], and mean absolute error ($M$) [62]. $S_\alpha$ simultaneously evaluates the region-aware and object-aware structural similarity between the prediction and the ground truth. $E_\phi$ evaluates element-wise similarity and provides statistics at the image level. $F_\beta^\omega$ is an overall performance measure that considers both precision and recall. $M$ represents the average pixel-level relative difference between the normalized prediction map and the ground truth. For $S_\alpha$, $E_\phi$, and $F_\beta^\omega$, larger values indicate better model performance, whereas, for $M$, a smaller value indicates better performance.
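As a concrete reference for the simplest of the four metrics, the sketch below computes $M$ for a single prediction; $S_\alpha$, $E_\phi$, and $F_\beta^\omega$ involve more elaborate definitions and are usually computed with publicly available evaluation toolboxes.

```python
import numpy as np


def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error M between a normalized prediction map and a binary
    ground-truth mask, both with values in [0, 1] and the same spatial size."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    return float(np.abs(pred - gt).mean())
    # The reported M is this value averaged over all test images.
```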

4.3. Implementation Details

We implemented CISNet using the PyTorch library [63]. A PVTv2 [13] pre-trained on the ImageNet [64] dataset was applied as the backbone of our network to extract multi-scale features. During training, we employed AdamW [65] as the optimizer and trained for a total of 100 epochs with a learning rate of $2\times10^{-5}$. The input images were resized to $704\times704$ for both the training and testing phases, and the test outputs were resized back to the original input size. The entire model was trained on a single NVIDIA A40 GPU with a batch size of 4.
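A minimal training-loop sketch reflecting these settings (AdamW optimizer, learning rate $2\times10^{-5}$, 100 epochs, batch size 4, $704\times704$ inputs); the model, loss function, and dataset are placeholders passed in by the caller, not the authors' released code.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader


def train_cisnet(model, train_set, loss_fn, epochs=100, lr=2e-5,
                 batch_size=4, device="cuda"):
    """Train a COD model with the reported settings.

    `model` maps a Bx3x704x704 image batch to a Bx1x704x704 logit map, and
    `loss_fn(pred, mask)` returns a scalar loss (e.g. the total loss sketched
    in Section 3.3).
    """
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True,
                        num_workers=4, pin_memory=True)
    model = model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        for images, masks in loader:          # resized to 704x704 by the dataset
            images, masks = images.to(device), masks.to(device)
            pred = model(images)
            loss = loss_fn(pred, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```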

4.4. Performance Comparison

4.4.1. Quantitative Comparisons

We compare the proposed CISNet regarding S-measure, mean E-measure, weighted F-measure, and mean absolute error scores with 19 state-of-the-art COD methods, including SINet [9], SINet-V2 [66], PFNet [17], C2FNet [15], LSR [10], OSFormer [67], TPRNet [68], SegMaR [28], BSANet [18], DGNet [69], AGNet [70], BGNet [71], ZoomNet [26], CINet [72], EAMNet [73], MFFN [74], FDNet [75], FEDER [76], and HitNet [52]. For accurate comparison, all results were either provided by the published papers or reproduced by the released source codes. And the quantitative comparison results are summarized in Table 1.
As can be seen from Table 1, the proposed CISNet achieves the best performance on all indicators across these datasets compared with all other methods. Compared with the second-best method, HitNet, CISNet takes greater account of the extraction of local information and utilizes difference information between cross-level features to iteratively refine these features. These designs effectively exploit the characteristics of features across different scales, making CISNet more adept at capturing refined edge details. As Table 1 shows, on the most challenging COD10K test dataset, CISNet increases $S_\alpha$ and $F_\beta^\omega$ by 1.2% and 1.7%, respectively, and decreases the MAE by 8% compared with HitNet. On the NC4K dataset, CISNet increases $S_\alpha$ and $F_\beta^\omega$ by 1.1% and 1.4%, respectively, and decreases the MAE by 12% compared with HitNet. We also achieve improvements on the other two datasets compared with HitNet. This demonstrates that our method has stronger generalization ability.

4.4.2. Visual Comparisons

Figure 5 displays the COD maps produced by CISNet and other SOTA COD methods. We present some representative prediction results to further demonstrate the advantages of CISNet. Compared with other models, CISNet is capable of detecting camouflaged objects closer to the ground truths in various challenging scenarios. As shown in the first to fourth rows, CISNet can segment objects with various shapes and sizes and does not ignore thin edges. In the examples shown in the fifth row, CISNet correctly segments multi-scale camouflaged objects in a manner clearly superior to the others. Objects with complex topological structures and abundant edge details are difficult to segment, even for manual annotation (e.g., the sixth and seventh rows), but CISNet is capable of segmenting clearer edges and boundaries, even against complex backgrounds (seventh row), whereas the other results are blurred or lack correct details. Even for objects with complex occlusions, CISNet is able to recognize the target more accurately, while other models fail in these kinds of complex scenes. This favorable performance of CISNet can be mainly attributed to two key factors. Firstly, the MSCSub module plays a crucial role in perceiving richer details of camouflaged objects with diverse receptive fields. It enables feature complementarity at different levels, enhancing the ability to capture fine-grained information. Secondly, the CIS promotes the cross-level transmission of both semantic and texture information. This facilitates the interaction of complementary features across multiple levels, gradually enhancing the representative ability of the extracted features.

4.5. Ablation Studies

In this section, we conduct ablation studies on COD10K and NC4K to validate the effectiveness of each core module in CISNet. In addition, we also analyze the influence of LossNet and channels in MSCSub on model performance.

4.5.1. Effectiveness of Cross-Level Integration Mechanism and Strip Convolutions in MSCSub

On the one hand, we utilize features from adjacent levels as inputs to validate the effectiveness of our cross-level integration mechanism. On the other hand, we evaluate the effectiveness of strip convolutions in MSCSub by either removing all strip convolutions or replacing them with ordinary convolutions using kernel sizes of 3 × 3 and 5 × 5 . The experimental results, as shown in Table 2 (#1-#5 and #OUR), indicate that our method achieves the best performance compared to the other experimental setups.
We also visualize the feature maps generated by different convolution methods and integration strategies in MSCSub to demonstrate the superiority of our method more intuitively. In Figure 6, (a) uses $3\times3$ and $5\times5$ convolutions in MSCSub with adjacent-level features as input; (b) uses $1\times3$, $3\times1$, $1\times5$, and $5\times1$ convolutions with adjacent-level features as input; (c) uses $3\times3$ and $5\times5$ convolutions while feeding the highest-level features with their relevant lower-level features as input; and (d) uses $1\times3$, $3\times1$, $1\times5$, and $5\times1$ convolutions while feeding the highest-level features with their relevant lower-level features as input. As Figure 6 illustrates, in (a), the first feature map contains rich texture information but exhibits more background noise, and the object localization in the third feature map is somewhat thin. (b) has issues similar to (a), but the edge texture information in its first feature map is more obvious than that in (a), with less background noise. This indicates that our strip convolutions can effectively capture richer local context information while reducing noise interference. In contrast, we cannot see distinct texture information in the first feature map of (c). However, in the first feature map of (d), we can observe abundant edge texture information and less influence of background noise, and the third feature map has fuller object localization. Comparing (b) and (d), our cross-level integration mechanism focuses more on edge details, possesses stronger background noise suppression capability, and achieves more precise object localization.
Based on the above, the cross-level integration mechanism and multi-scale strip convolution subtraction effectively complement the differences between cross-level features and enrich the semantic context and edge details required for COD. This allows us to better exploit the multi-scale characteristics of different features.

4.5.2. Effectiveness of EGA

To illustrate the effectiveness of EGA in improving segmentation accuracy, we conducted an ablation study in which we replaced EGA with an element-wise addition operation followed by a normal convolution layer while keeping the rest of the CISNet model unchanged. The results, as shown in Table 3 (#6 and #OUR), indicate significantly degraded performance ($E_\phi$: −5.4% on COD10K-Test, −1.7% on NC4K; $F_\beta^\omega$: −3.1% on COD10K-Test, −1.6% on NC4K). This shows that our EGA greatly improves the performance of the proposed model and demonstrates that EGA is capable of deeply mining local context information to enrich details and capturing the relationships between different features to enhance their expressive ability. As a result, COD performance is significantly improved.

4.5.3. Effectiveness of Iterative Strategy

As shown in Table 3, we directly feed the outputs of the MSCSub modules that operate on the backbone features to EGA to evaluate the effectiveness of our iterative strategy (#7). Comparatively, the results obtained with the iterative approach (#OUR) show a marked improvement in performance ($E_\phi$: +2.5% on COD10K-Test, +1.2% on NC4K; $F_\beta^\omega$: +3.1% on COD10K-Test, +1.6% on NC4K). This demonstrates that our difference-complementary features are refined effectively by the sequential iterative strategy.

4.5.4. Influence of LossNet

To verify the influence of LossNet on model performance, we conduct ablation experiments by removing LossNet from our model. The results, as shown in Table 3 (#8 and #OUR), reveal that all the indicators of our model improve after adding LossNet ($E_\phi$ increases from 0.932 to 0.938 on COD10K-Test and from 0.929 to 0.931 on NC4K, and $F_\beta^\omega$ increases from 0.807 to 0.812 on COD10K-Test and from 0.832 to 0.837 on NC4K). This suggests that LossNet plays a crucial role in constraining the model.

4.5.5. Influence of Channels in MSCSub

In MSCSub, we compress all feature channels to 64 to reduce computation. However, it is important to determine whether 64 channels is the optimal choice. To investigate this, we conduct experiments varying the number of channels to 32, 128, and 256 while keeping the rest of CISNet unchanged. The results of these experiments are summarized in Table 4. Compared to #OUR, all indicators with 32 channels in MSCSub (#9) show a significant decrease. With 128 channels (#10), there is no obvious advantage in most indicators, and there is even a slight decrease in $E_\phi$ on COD10K-Test. With 256 channels (#11), there is a slight increase in all indicators, but this comes at the cost of increased computation. Overall, we conclude that MSCSub with 64 channels is the most advantageous choice.
To more intuitively show the feature differences between different scales, we conduct the visualization of some features generated in MSCSub. These visualization results are presented in Figure 7 and Figure 8. We can see from Figure 8 that our multi-scale strip convolutions excel in perceiving features with rich texture details while effectively suppressing noise from the background. Furthermore, MSCSub can clearly highlight the differences between features at different levels and facilitate the transfer of texture and location information between low-level and high-level features. In Figure 7, it is noticeable that, after continuous iterative processing, the horizontally fused feature maps output by MSCSub at each level capture subtle features and regional characteristics comprehensively. Consequently, through visual analysis, we can conclude that CIS with MSCSub fully utilizes feature differences at different levels, effectively captures edge details, and provides some descriptions of global structural information and local texture details. This further demonstrates the superiority of our method in the task of camouflaged object segmentation.

5. Use Cases

In this section, we apply CISNet to some downstream tasks related to COD to evaluate its practical application value.

5.1. Camouflage Pattern People Detection

Camouflage Pattern People Detection is a key technology in the field of military reconnaissance, as it plays a crucial role in detecting and identifying enemy special forces, agents, or spies with advanced camouflage and concealment capabilities. This technology aids in enhancing situational awareness and improving security measures by enabling the identification of hidden threats in various operational environments. To evaluate the effectiveness of CISNet in camouflage pattern people detection, we retrained CISNet on CamouflageData [7], where 80% of the samples are used for training and 20% for testing. The results are shown in Figure 9.

5.2. Polyp Segmentation

Accurate segmentation of polyps in colonoscopy images is indeed crucial for the early detection and intervention of these tumorous lesions. By precisely delineating the boundaries of polyps, segmentation techniques can aid in identifying and characterizing potential areas of concern, enabling healthcare professionals to make informed decisions regarding patient care and treatment. We retrained the CISNet on the training set of Kvasir [77] and the CVC-ClinicDB [78] dataset, in which 80% are for training and 20% are for testing. Figure 9 illustrates the visual results generated by CISNet.

5.3. Transparent Object Detection

Transparent objects, such as glass or windows, can be difficult to detect visually due to their transparency and reflective properties. Accurately recognizing transparent obstacles in the environment is indeed critical for ensuring the safe navigation of robots and drones in daily life and industrial production settings. We further investigated the effectiveness of CISNet in a transparent object detection task on Trans10K [5]. The visual results presented in Figure 9 further demonstrate the practical application value of CISNet.

6. Conclusions

The application of camouflaged object detection (COD) techniques can improve the monitoring of water bodies and the environment, enhancing the protection and management of water resources and ecosystems and contributing to sustainable development. In this paper, we propose a cross-level iterative subtraction network (CISNet) for detecting camouflaged objects. Unlike most existing methods that integrate features from adjacent levels, the proposed cross-level iterative structure (CIS) iteratively integrates the highest-level features with their relevant lower-level features, making information complementation more efficient. Within CIS, we present a multi-scale strip convolution subtraction (MSCSub) module to extract difference information between cross-level features and fuse the cross-scale complementary features, which enhances the network’s ability to capture object details. Furthermore, we present an enhanced guidance attention (EGA) module to improve segmentation accuracy by deeply mining local context information and capturing a broader range of relationships between different features. Extensive experiments on benchmark datasets demonstrate that our model achieves superior performance compared to other SOTA models on the COD task.
Although our approach achieves good performance in certain scenarios, it may not perform optimally in highly complex backgrounds. Additionally, the current training dataset’s lack of sufficient diversity may limit the model’s adaptability to a wide range of camouflage strategies, potentially impacting its overall detection accuracy. Overcoming these limitations will be crucial for improving the robustness and applicability of the COD system in real-world situations. In future work, we intend to incorporate similarity-based learning techniques into COD to enhance the network’s ability to refine its parameters by comparing the distinctive features of camouflaged objects with the standard visual attributes of generic objects.

Author Contributions

Conceptualization, T.H., X.L., S.C., T.Z. and J.C.; methodology, T.H., X.L., S.C., T.Z. and J.C.; software, T.H.; validation, T.H., S.C. and J.C.; formal analysis, T.H., C.Z., X.L., T.Z. and J.C.; investigation, C.Z., X.S., S.C. and T.Z.; resources, T.H. and X.L.; data curation, T.H., C.Z. and S.C.; writing—original draft preparation, T.H.; writing—review and editing, T.H., C.Z., X.L., X.S., S.C., T.Z. and J.C.; visualization, T.H.; supervision, X.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (Grant Nos. 2023YFC3209301, 2023YFC3209201), the Excellent Post-doctoral Program of Jiangsu Province (Grant No. 2022ZB166), and the Fundamental Research Funds for the Central Universities (Grant Nos. B230201007, B230204009, B220206006).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dong, B.; Wang, W.; Fan, D.P.; Li, J.; Fu, H.; Shao, L. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv 2021, arXiv:2108.06932. [Google Scholar] [CrossRef]
  2. Fan, D.P.; Zhou, T.; Ji, G.P.; Zhou, Y.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Inf-net: Automatic covid-19 lung infection segmentation from ct images. IEEE Trans. Med. Imaging 2020, 39, 2626–2637. [Google Scholar] [CrossRef] [PubMed]
  3. Fan, D.P.; Ji, G.P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Pranet: Parallel reverse attention network for polyp segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lima, Peru, 4–8 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 263–273. [Google Scholar]
  4. Deng, J.; Gong, H.; Ming, L. Medical Image Segmentation Based on Object Detection. J. Univ. Electron. Sci. Technol. China 2023, 52, 254. [Google Scholar]
  5. Xie, E.; Wang, W.; Wang, W.; Ding, M.; Shen, C.; Luo, P. Segmenting transparent objects in the wild. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 696–711. [Google Scholar]
  6. Pérez-de la Fuente, R.; Delclòs, X.; Peñalver, E.; Speranza, M.; Wierzchos, J.; Ascaso, C.; Engel, M.S. Early evolution and ecology of camouflage in insects. Proc. Natl. Acad. Sci. USA 2012, 109, 21414–21419. [Google Scholar] [CrossRef] [PubMed]
  7. Zheng, Y.; Zhang, X.; Wang, F.; Cao, T.; Sun, M.; Wang, X. Detection of people with camouflage pattern via dense deconvolution network. IEEE Signal Process. Lett. 2018, 26, 29–33. [Google Scholar] [CrossRef]
  8. Dai, X.; Gong, H.; Wu, S.; Yuan, X.; Ma, Y. Fully convolutional line parsing. Neurocomputing 2022, 506, 1–11. [Google Scholar] [CrossRef]
  9. Fan, D.P.; Ji, G.P.; Sun, G.; Cheng, M.M.; Shen, J.; Shao, L. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2777–2787. [Google Scholar]
  10. Lv, Y.; Zhang, J.; Dai, Y.; Li, A.; Liu, B.; Barnes, N.; Fan, D.P. Simultaneously localize, segment and rank the camouflaged objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11591–11601. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  12. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef]
  13. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  15. Chen, G.; Liu, S.J.; Sun, Y.J.; Ji, G.P.; Wu, Y.F.; Zhou, T. Camouflaged object detection via context-aware cross-level fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6981–6993. [Google Scholar] [CrossRef]
  16. Xing, H.; Wang, Y.; Wei, X.; Tang, H.; Gao, S.; Zhang, W. Go Closer To See Better: Camouflaged Object Detection via Object Area Amplification and Figure-ground Conversion. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5444–5457. [Google Scholar] [CrossRef]
  17. Mei, H.; Ji, G.P.; Wei, Z.; Yang, X.; Wei, X.; Fan, D.P. Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8772–8781. [Google Scholar]
  18. Zhu, H.; Li, P.; Xie, H.; Yan, X.; Liang, D.; Chen, D.; Wei, M.; Qin, J. I can find you! boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 3608–3616. [Google Scholar]
  19. Ji, G.P.; Zhu, L.; Zhuge, M.; Fu, K. Fast camouflaged object detection via edge-based reversible re-calibration network. Pattern Recognit. 2022, 123, 108414. [Google Scholar] [CrossRef]
  20. Yin, B.; Zhang, X.; Hou, Q.; Sun, B.Y.; Fan, D.P.; Van Gool, L. Camoformer: Masked separable attention for camouflaged object detection. arXiv 2022, arXiv:2212.06570. [Google Scholar] [CrossRef]
  21. Li, X.; Xu, F.; Yong, X.; Chen, D.; Xia, R.; Ye, B.; Gao, H.; Chen, Z.; Lyu, X. SSCNet: A Spectrum-Space Collaborative Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2023, 15, 5610. [Google Scholar] [CrossRef]
  22. Li, X.; Xu, F.; Liu, F.; Tong, Y.; Lyu, X.; Zhou, J. Semantic Segmentation of Remote Sensing Images by Interactive Representation Refinement and Geometric Prior-Guided Inference. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5400318. [Google Scholar] [CrossRef]
  23. Galun, M.; Sharon, E.; Basri, R.; Brandt, A. Texture segmentation by multiscale aggregation of filter responses and shape elements. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Washington, DC, USA, 13–16 October 2003; pp. 716–723. [Google Scholar]
  24. Guo, H.; Dou, Y.; Tian, T.; Zhou, J.; Yu, S. A robust foreground segmentation method by temporal averaging multiple video frames. In Proceedings of the 2008 International Conference on Audio, Language and Image Processing, Shanghai, China, 7–9 July 2008; pp. 878–882. [Google Scholar]
  25. Hall, J.R.; Cuthill, I.C.; Baddeley, R.; Shohet, A.J.; Scott-Samuel, N.E. Camouflage, detection and identification of moving targets. Proc. R. Soc. B Biol. Sci. 2013, 280, 20130064. [Google Scholar] [CrossRef] [PubMed]
  26. Pang, Y.; Zhao, X.; Xiang, T.Z.; Zhang, L.; Lu, H. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2160–2170. [Google Scholar]
  27. Yan, X.; Sun, M.; Han, Y.; Wang, Z. Camouflaged Object Segmentation Based on Matching–Recognition–Refinement Network. IEEE Trans. Neural Netw. Learn. Syst.
  28. Jia, Q.; Yao, S.; Liu, Y.; Fan, X.; Liu, R.; Luo, Z. Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4713–4722. [Google Scholar]
  29. Li, A.; Zhang, J.; Lv, Y.; Liu, B.; Zhang, T.; Dai, Y. Uncertainty-aware joint salient object and camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10071–10081. [Google Scholar]
  30. Yang, F.; Zhai, Q.; Li, X.; Huang, R.; Luo, A.; Cheng, H.; Fan, D.P. Uncertainty-guided transformer reasoning for camouflaged object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 4146–4155. [Google Scholar]
  31. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9413–9422. [Google Scholar]
  32. Song, Z.; Kang, X.; Wei, X.; Liu, H.; Dian, R.; Li, S. FSNet: Focus Scanning Network for Camouflaged Object Detection. IEEE Trans. Image Process. 2023, 32, 2267–2278. [Google Scholar] [CrossRef] [PubMed]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  34. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  35. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
  36. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  37. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 22–31. [Google Scholar]
  38. Li, X.; Xu, F.; Gao, H.; Liu, F.; Lyu, X. A Frequency Domain Feature-Guided Network for Semantic Segmentation of Remote Sensing Images. IEEE Signal Process. Lett. 2024, 31, 1369–1373. [Google Scholar] [CrossRef]
  39. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  40. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  41. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  42. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  43. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual saliency transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4722–4732. [Google Scholar]
  44. Zhuge, M.; Fan, D.P.; Liu, N.; Zhang, D.; Xu, D.; Shao, L. Salient object detection via integrity learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3738–3752. [Google Scholar] [CrossRef]
  45. Jiang, Z.H.; Hou, Q.; Yuan, L.; Zhou, D.; Shi, Y.; Jin, X.; Wang, A.; Feng, J. All tokens matter: Token labeling for training better vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 18590–18602. [Google Scholar]
  46. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  47. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  48. Li, X.; Xu, F.; Li, L.; Xu, N.; Liu, F.; Yuan, C.; Chen, Z.; Lyu, X. AAFormer: Attention-Attended Transformer for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5002805. [Google Scholar] [CrossRef]
  49. Li, X.; Xu, F.; Lyu, X.; Gao, H.; Tong, Y.; Cai, S.; Li, S.; Liu, D. Dual attention deep fusion semantic segmentation networks of large-scale satellite remote-sensing images. Int. J. Remote Sens. 2021, 42, 3583–3610. [Google Scholar] [CrossRef]
  50. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  51. Li, X.; Xu, F.; Liu, F.; Lyu, X.; Tong, Y.; Xu, Z.; Zhou, J. A Synergistical Attention Model for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400916. [Google Scholar] [CrossRef]
  52. Hu, X.; Wang, S.; Qin, X.; Dai, H.; Ren, W.; Luo, D.; Tai, Y.; Shao, L. High-resolution iterative feedback network for camouflaged object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 881–889. [Google Scholar] [CrossRef]
  53. Huang, Z.; Dai, H.; Xiang, T.Z.; Wang, S.; Chen, H.X.; Qin, J.; Xiong, H. Feature shrinkage pyramid for camouflaged object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5557–5566. [Google Scholar]
  54. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
  55. Wei, J.; Wang, S.; Huang, Q. F3Net: Fusion, feedback and focus for salient object detection. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12321–12328. [Google Scholar]
  56. Zhao, X.; Zhang, L.; Lu, H. Automatic polyp segmentation via multi-scale subtraction network. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021, Part I; Springer: Berlin/Heidelberg, Germany, 2021; pp. 120–130. [Google Scholar]
  57. Skurowski, P.; Abdulameer, H.; Błaszczyk, J.; Depta, T.; Kornacki, A.; Kozieł, P. Animal camouflage analysis: Chameleon database. Unpubl. Manuscr. 2018, 2, 7. [Google Scholar]
  58. Le, T.N.; Nguyen, T.V.; Nie, Z.; Tran, M.T.; Sugimoto, A. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 2019, 184, 45–56. [Google Scholar] [CrossRef]
  59. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4548–4557. [Google Scholar]
  60. Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. arXiv 2018, arXiv:1805.10421. [Google Scholar]
  61. Margolin, R.; Zelnik-Manor, L.; Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 248–255. [Google Scholar]
  62. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740. [Google Scholar]
  63. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
  64. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  65. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  66. Fan, D.P.; Ji, G.P.; Cheng, M.M.; Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6024–6042. [Google Scholar] [CrossRef]
  67. Pei, J.; Cheng, T.; Fan, D.P.; Tang, H.; Chen, C.; Van Gool, L. Osformer: One-stage camouflaged instance segmentation with transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 19–37. [Google Scholar]
  68. Zhang, Q.; Ge, Y.; Zhang, C.; Bi, H. TPRNet: Camouflaged object detection via transformer-induced progressive refinement network. Vis. Comput. 2022, 39, 4593–4607. [Google Scholar] [CrossRef]
  69. Ji, G.P.; Fan, D.P.; Chou, Y.C.; Dai, D.; Liniger, A.; Van Gool, L. Deep gradient learning for efficient camouflaged object detection. Mach. Intell. Res. 2023, 20, 92–108. [Google Scholar] [CrossRef]
  70. Yu, J.; Chen, S.; Lu, L.; Chen, Z.; Xu, X.; Hu, X.; Zhu, J. Alternate guidance network for boundary-aware camouflaged object detection. Mach. Vis. Appl. 2023, 34, 69. [Google Scholar] [CrossRef]
  71. Sun, Y.; Wang, S.; Chen, C.; Xiang, T.Z. Boundary-guided camouflaged object detection. arXiv 2022, arXiv:2207.00794. [Google Scholar]
  72. Li, X.; Li, H.; Zhou, H.; Yu, M.; Chen, D.; Li, S.; Zhang, J. Camouflaged object detection with counterfactual intervention. Neurocomputing 2023, 553, 126530. [Google Scholar] [CrossRef]
  73. Sun, D.; Jiang, S.; Qi, L. Edge-Aware Mirror Network for Camouflaged Object Detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2465–2470. [Google Scholar]
  74. Zheng, D.; Zheng, X.; Yang, L.T.; Gao, Y.; Zhu, C.; Ruan, Y. Mffn: Multi-view feature fusion network for camouflaged object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6232–6242. [Google Scholar]
  75. Song, Y.; Li, X.; Qi, L. Camouflaged Object Detection with Feature Grafting and Distractor Aware. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2459–2464. [Google Scholar]
  76. He, C.; Li, K.; Zhang, Y.; Tang, L.; Zhang, Y.; Guo, Z.; Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22046–22055. [Google Scholar]
  77. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Halvorsen, P.; De Lange, T.; Johansen, D.; Johansen, H.D. Kvasir-SEG: A segmented polyp dataset. In Proceedings of the MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, Republic of Korea, 5–8 January 2020, Part II; Springer: Berlin/Heidelberg, Germany, 2020; pp. 451–462. [Google Scholar]
  78. Bernal, J.; Sánchez, F.J.; Fernández-Esparrach, G.; Gil, D.; Rodríguez, C.; Vilariño, F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 2015, 43, 99–111. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Simplified schematics of feature integration approaches in different networks. Yellow, blue, and green blocks denote encoder blocks, feature integration modules, and decoder modules, respectively; solid lines indicate the direction of feature transfer.
Figure 2. Overall architecture of the proposed CISNet.
Figure 3. Illustration of the enhanced guided attention (EGA) module.
Figure 4. The architecture of LossNet.
Figure 5. Visual performance of the proposed CISNet.
Figure 6. Visual comparison of the results obtained by applying different convolution types and feature-integration strategies to the backbone features: (a) ordinary convolutions with adjacent-level features as input; (b) strip convolutions with adjacent-level features; (c) ordinary convolutions with cross-level features; (d) strip convolutions with cross-level features.
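To make the convolution comparison in Figure 6 concrete, the following PyTorch sketch contrasts an ordinary square-convolution block with a strip-convolution block built from 1×3, 3×1, 1×5, and 5×1 kernels. It is a minimal, hypothetical illustration only: the block names, the channel width, and the summation of the strip outputs are assumptions made for demonstration, not the exact blocks used in CISNet.

```python
import torch
import torch.nn as nn


class OrdinaryConvBlock(nn.Module):
    """Baseline block: a single 3 x 3 square convolution."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)


class StripConvBlock(nn.Module):
    """Strip alternative: parallel 1x3, 3x1, 1x5 and 5x1 convolutions whose
    outputs are summed, gathering horizontal and vertical context separately."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.h3 = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.v3 = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.h5 = nn.Conv2d(channels, channels, kernel_size=(1, 5), padding=(0, 2))
        self.v5 = nn.Conv2d(channels, channels, kernel_size=(5, 1), padding=(2, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.h3(x) + self.v3(x) + self.h5(x) + self.v5(x)


if __name__ == "__main__":
    x = torch.randn(1, 64, 44, 44)        # dummy backbone feature map
    print(OrdinaryConvBlock()(x).shape)   # torch.Size([1, 64, 44, 44])
    print(StripConvBlock()(x).shape)      # torch.Size([1, 64, 44, 44])
```

The strip pair covers a cross-shaped receptive field with fewer parameters than a single large square kernel, which is the usual motivation for strip convolutions in segmentation backbones.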
Figure 7. Visualization of features produced in our model.
Figure 8. Visualization of features generated when MSCSub is applied to the backbone features. (a–e) show, respectively, the results of E4 − Ej, Conv1×3(E4) − Conv1×3(Ej), Conv1×5(E4) − Conv1×5(Ej), Conv3×1(E4) − Conv3×1(Ej), and Conv5×1(E4) − Conv5×1(Ej), where j ∈ {1, 2, 3}.
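As a rough, hedged sketch of the subtraction terms listed in the Figure 8 caption, the code below computes the raw difference |E4 − Ej| together with the strip-convolved differences |Conv_k(E4) − Conv_k(Ej)| for the four strip kernels, after resizing E4 to the resolution of Ej. The use of absolute values, bilinear interpolation, convolution weights shared across the two levels, and channel-wise concatenation are assumptions for illustration; they may differ from the authors' released MSCSub module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossLevelStripSubtraction(nn.Module):
    """Hypothetical sketch of the difference maps in Figure 8: the raw
    difference |E4 - Ej| plus |Conv_k(E4) - Conv_k(Ej)| for the strip
    kernels 1x3, 1x5, 3x1 and 5x1. Channel width is a placeholder."""

    def __init__(self, channels: int = 64):
        super().__init__()
        kernels = [(1, 3), (1, 5), (3, 1), (5, 1)]
        self.strips = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=k,
                       padding=(k[0] // 2, k[1] // 2)) for k in kernels]
        )

    def forward(self, e4: torch.Tensor, ej: torch.Tensor) -> torch.Tensor:
        # Bring the coarse high-level feature E4 to the spatial size of Ej.
        e4 = F.interpolate(e4, size=ej.shape[-2:], mode="bilinear",
                           align_corners=False)
        diffs = [torch.abs(e4 - ej)]                              # raw difference
        diffs += [torch.abs(conv(e4) - conv(ej)) for conv in self.strips]
        return torch.cat(diffs, dim=1)                            # stack all cues


if __name__ == "__main__":
    e4 = torch.randn(1, 64, 11, 11)   # high-level feature (coarse)
    e1 = torch.randn(1, 64, 88, 88)   # low-level feature (fine)
    out = CrossLevelStripSubtraction()(e4, e1)
    print(out.shape)                  # torch.Size([1, 320, 88, 88])
```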
Figure 9. Visualization results of downstream applications. From top to bottom: image (first row), ground truth (second row), and the results of CISNet (third row).
Table 1. Quantitative comparison with 19 other COD methods on four benchmark datasets.
Method | Year | CHAMELEON: S_α↑/E_ϕ↑/F_β^ω↑/M↓ | CAMO-Test: S_α↑/E_ϕ↑/F_β^ω↑/M↓ | COD10K-Test: S_α↑/E_ϕ↑/F_β^ω↑/M↓ | NC4K: S_α↑/E_ϕ↑/F_β^ω↑/M↓
SINet [9] | 2020 | 0.869/0.891/0.740/0.044 | 0.751/0.771/0.606/0.100 | 0.771/0.806/0.551/0.051 | 0.810/0.873/0.772/0.057
SINetV2 [66] | 2021 | 0.888/0.942/0.816/0.030 | 0.820/0.882/0.743/0.070 | 0.815/0.887/0.680/0.037 | 0.847/0.903/0.769/0.048
PFNet [17] | 2021 | 0.882/0.942/0.810/0.033 | 0.782/0.852/0.695/0.085 | 0.800/0.868/0.660/0.036 | 0.829/0.892/0.745/0.053
C2FNet [15] | 2021 | 0.888/0.932/0.828/0.032 | 0.796/0.864/0.719/0.080 | 0.813/0.890/0.686/0.036 | 0.838/0.898/0.762/0.049
LSR [10] | 2021 | 0.893/0.938/0.839/0.033 | 0.793/0.826/0.725/0.085 | 0.847/0.924/0.775/0.028 | 0.839/0.883/0.779/0.053
OSFormer [67] | 2022 | 0.897/0.931/0.839/0.028 | 0.801/0.859/0.769/0.071 | 0.811/0.881/0.701/0.034 | 0.833/0.887/0.792/0.047
TPRNet [68] | 2022 | 0.814/0.870/0.781/0.076 | 0.814/0.870/0.781/0.076 | 0.829/0.892/0.725/0.034 | 0.854/0.903/0.790/0.047
SegMaR [28] | 2022 | 0.897/0.950/0.835/0.027 | 0.815/0.872/0.742/0.071 | 0.833/0.899/0.724/0.034 | 0.841/0.905/0.781/0.046
BSANet [18] | 2022 | 0.895/0.946/0.841/0.027 | 0.796/0.851/0.717/0.079 | 0.818/0.891/0.699/0.034 | 0.841/0.897/0.771/0.048
DGNet [69] | 2022 | 0.890/0.934/0.816/0.029 | 0.839/0.901/0.769/0.057 | 0.822/0.911/0.693/0.033 | 0.857/0.907/0.784/0.042
BGNet [71] | 2022 | 0.885/0.942/0.815/0.032 | 0.812/0.870/0.749/0.073 | 0.831/0.901/0.722/0.033 | 0.843/0.901/0.764/0.048
ZoomNet [26] | 2022 | 0.902/0.952/0.845/0.023 | 0.820/0.883/0.752/0.066 | 0.838/0.893/0.729/0.029 | 0.853/0.907/0.784/0.043
AGNet [70] | 2023 | 0.900/0.952/0.864/0.027 | 0.808/0.859/0.783/0.075 | 0.835/0.896/0.760/0.033 | 0.852/0.900/0.816/0.046
CINet [72] | 2023 | 0.905/0.947/0.843/0.028 | 0.827/0.888/0.763/0.066 | 0.830/0.904/0.710/0.033 | 0.855/0.910/0.789/0.043
EAMNet [73] | 2023 | 0.899/0.942/0.855/0.023 | 0.831/0.890/0.763/0.064 | 0.839/0.907/0.733/0.029 | 0.862/0.916/0.801/0.040
MFFN [74] | 2023 | 0.905/0.963/0.852/0.021 | 0.829/0.881/0.793/0.062 | 0.846/0.917/0.745/0.028 | 0.856/0.915/0.791/0.042
FDNet [75] | 2023 | 0.909/0.947/0.856/0.025 | 0.836/0.886/0.777/0.066 | 0.857/0.918/0.763/0.028 | 0.865/0.911/0.803/0.042
FEDER [76] | 2023 | 0.903/0.947/0.856/0.026 | 0.836/0.897/0.807/0.066 | 0.844/0.911/0.748/0.029 | 0.862/0.913/0.824/0.042
HitNet [52] | 2023 | 0.915/0.962/0.875/0.020 | 0.844/0.902/0.801/0.057 | 0.868/0.932/0.798/0.024 | 0.870/0.921/0.825/0.039
CISNet | – | 0.916/0.970/0.877/0.019 | 0.846/0.908/0.808/0.056 | 0.879/0.938/0.812/0.021 | 0.880/0.931/0.837/0.034
"↑" and "↓" indicate that larger or smaller is better. The best results are highlighted in bold.
Table 2. Ablation study on the effectiveness of cross-level integration mechanism and strip convolutions in MSCSub on COD10K and NC4K.
No. | cross_level | adjacent_level | strip_conv | ordinary_conv | COD10K-Test: S_α↑/E_ϕ↑/F_β^ω↑/M↓ | NC4K: S_α↑/E_ϕ↑/F_β^ω↑/M↓
#1 |  |  |  |  | 0.878/0.930/0.810/0.022 | 0.879/0.927/0.834/0.036
#2 |  |  |  |  | 0.876/0.932/0.809/0.023 | 0.880/0.928/0.834/0.036
#3 |  |  |  |  | 0.873/0.930/0.804/0.023 | 0.878/0.927/0.831/0.036
#4 |  |  |  |  | 0.876/0.933/0.809/0.022 | 0.880/0.928/0.832/0.036
#5 |  |  |  |  | 0.878/0.931/0.811/0.022 | 0.880/0.930/0.836/0.034
#OUR |  |  |  |  | 0.879/0.938/0.812/0.021 | 0.880/0.931/0.837/0.034
"↑" and "↓" indicate that larger or smaller is better. The best results are highlighted in bold.
Table 3. Ablation study on the effectiveness of different modules on COD10K and NC4K.
No. | EGA | LossNet | Iteration | COD10K-Test: S_α↑/E_ϕ↑/F_β^ω↑/M↓ | NC4K: S_α↑/E_ϕ↑/F_β^ω↑/M↓
#6 |  |  |  | 0.873/0.884/0.781/0.023 | 0.878/0.914/0.821/0.037
#7 |  |  |  | 0.871/0.911/0.787/0.023 | 0.878/0.919/0.821/0.037
#8 |  |  |  | 0.876/0.932/0.807/0.022 | 0.880/0.929/0.832/0.036
#OUR |  |  |  | 0.879/0.938/0.812/0.021 | 0.880/0.931/0.837/0.034
"↑" and "↓" indicate that larger or smaller is better. The best results are highlighted in bold.
Table 4. Ablation study on the influence of channels in MSCSub on COD10K and NC4K.
No. | Channels in MSCSub | COD10K-Test: S_α↑/E_ϕ↑/F_β^ω↑/M↓ | NC4K: S_α↑/E_ϕ↑/F_β^ω↑/M↓
#9 | 32 | 0.878/0.934/0.811/0.022 | 0.880/0.927/0.834/0.036
#10 | 128 | 0.879/0.935/0.813/0.022 | 0.881/0.931/0.838/0.034
#11 | 256 | 0.883/0.939/0.820/0.020 | 0.884/0.932/0.840/0.033
#OUR | 64 | 0.879/0.938/0.812/0.021 | 0.880/0.931/0.837/0.034
"↑" and "↓" indicate that larger or smaller is better.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
