Article

CGNet: Remote Sensing Instance Segmentation Method Using Contrastive Language–Image Pretraining and Gated Recurrent Units

1 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
2 Beijing Automatic and Control Institute, Beijing 100854, China
3 School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
4 National Key Laboratory of Science and Technology on Multi-Spectral Information Processing, Huazhong University of Science and Technology, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(19), 3305; https://doi.org/10.3390/rs17193305
Submission received: 16 June 2025 / Revised: 10 September 2025 / Accepted: 17 September 2025 / Published: 26 September 2025

Highlights

What are the main findings?
  • CGNet achieves superior instance segmentation performance in remote sensing datasets. In the NWPU VHR-10 dataset, it reaches an average precision (AP) of 68.1%, which is 0.9% higher than that of the second-best method; in the SSDD dataset, it achieves an AP of 67.4%, outperforming the second-best method by 3.2%. It also shows strong performances across different metrics (e.g., AP50 and AP75) and target scales (small, medium, and large), especially excelling in segmenting small targets and large ships.
  • CGNet maintains a lightweight architecture while delivering high accuracy. With only 64.2 million trainable parameters, it has 17% fewer parameters than Cascade Mask R-CNN (77.3 M) and 33% fewer than HQ-ISNet (95.6 M). This proves that its design—including the ConvGRU-based iterative refinement, fusion head, and CLIP-enhanced backbone—effectively balances segmentation accuracy and computational efficiency without relying on a heavy backbone.
What are the implications of the main findings?
  • CGNet provides an effective solution to the key problems of remote sensing instance segmentation. The integration of CLIP’s semantic supervision addresses the issues of missed detections and false detections caused by complex backgrounds and similar target contours in remote sensing images. Meanwhile, the joint refinement of contour and mask branches via ConvGRU solves the problem of dimensional mismatch between the two types of information, offering a feasible approach to enhance segmentation precision for small and blurred targets.
  • CGNet promotes the practical application of remote sensing instance segmentation. As a lightweight and high-performance model, CGNet meets the real-time requirements of scenarios like land planning and aerospace. Its parameter efficiency and computational economy mean it can be deployed on resource-constrained platforms (e.g., edge devices for on-site remote sensing data processing), expanding the scope of practical applications for remote sensing instance segmentation technology.

Abstract

Instance segmentation in remote sensing imagery is a significant application area within computer vision, holding considerable value in fields such as land planning and aerospace. The target scales of remote sensing images are often small, the contours of different categories of targets can be remarkably similar, and the background information is complex, containing more noise interference. Therefore, it is essential for the network model to utilize the background and internal instance information more effectively. Considering all the above, to fully adapt to the characteristics of remote sensing images, a network named CGNet, which combines an enhanced backbone with a contour–mask branch, is proposed. This network employs gated recurrent units for the iteration of contour and mask branches and adopts the attention head for branch fusion. Additionally, to address the common issues of missed detections and false detections in target detection, a supervised backbone network using contrastive pretraining for feature supplementation is introduced. The proposed method has been experimentally validated in the NWPU VHR-10 and SSDD datasets, achieving average precision metrics of 68.1% and 67.4%, respectively, which are 0.9% and 3.2% higher than those of the second-best methods.

1. Introduction

Remote sensing image instance segmentation is a branch of the field of image instance segmentation, aimed at distinguishing instances and dividing pixels for specific remote sensing targets. Common segmentation targets include special scenes, such as baseball fields and athletic tracks, as well as salient objects, like airplanes, ships, buildings, and vehicles. Remote sensing image instance segmentation plays significant roles in both civilian and military domains.
Currently, deep-learning-based methods for remote sensing image instance segmentation consider different approaches for instance representation. Each generated instance not only corresponds to a specific category but also has an individual segmentation result. This segmentation result is generally represented by a category-agnostic mask. Instance segmentation methods based on convolutional neural networks (CNNs) can be divided into one-stage and two-stage instance segmentation methods according to their model frameworks. One-stage instance segmentation methods [1,2,3] are based on one-stage object detection models [4,5,6], incorporating additional feature maps and interactive branches for instance representation and simultaneously performing object detection, classification, and mask generation. In contrast, two-stage instance segmentation methods [7,8,9] are based on two-stage object detection networks [10], first performing object detection and then feeding the detection boxes into subsequent branches for further object classification and binary mask generation within the detection boxes. One-stage instance segmentation methods have a significant advantage in speed and can easily meet real-time requirements. However, since one-stage methods perform segmentation directly on feature maps, the results are generally coarser and have certain disadvantages in accuracy. Two-stage instance segmentation methods follow the paradigm of detection followed by segmentation, using an additional segmentation head for precise mask computation. This approach can improve the accuracy of segmentation results to some extent but significantly increases the computation time, especially since the number of instances in remote sensing images is much higher than that in natural images. Because a binary mask must be segmented for each detected instance, meeting the real-time requirements of instance segmentation algorithms remains a significant challenge.
Most existing remote sensing image instance segmentation algorithms are improved upon mask-based segmentation methods [11,12]. However, as mentioned above, mask-based methods require the dense prediction of image pixels. Remote sensing image instance segmentation tasks are characterized by a high number of instances, necessitating the generation of a vast number of mask representations. This undoubtedly increases the inference workload of the network model, reduces the inference speed, and makes it difficult to meet real-time requirements. Meanwhile, contour-based instance segmentation methods [13,14] transform the dense prediction problem to a regression problem, enhancing the inference speed, and, thus, may present a feasible alternative.
Nevertheless, directly applying contour-based instance segmentation methods to remote sensing images still poses certain issues. First, the backgrounds of remote sensing images are complex and variable, containing a wealth of effective information, whereas contour-based methods only consider the features of target boundaries and cannot effectively incorporate background information. Second, since different categories of remote sensing targets may have similar boundaries, the information within the targets is equally important. Additionally, remote sensing images generally have more noise, and the target contours may be blurred or inaccurate due to noise interference. Therefore, simply using contour-based instance segmentation methods in remote sensing images may lead to significant accuracy loss. Recent work by Zhang et al. [15] has shown that introducing an auxiliary segmentation branch with geometric priors can effectively guide the main detection task in aircraft imagery, which is methodologically aligned with our use of a contour branch to refine instance segmentation. Additionally, Wang et al. [16] highlight the challenges posed by complex backgrounds and noise in remote sensing imagery, further motivating our design of a dual-branch architecture that jointly optimizes contour and mask representations.
Due to the dimensional differences between contour information and mask information, they cannot be directly combined to improve the segmentation accuracy of the model. Therefore, a new integration method must be designed to achieve feature consolidation through a unified dimensional representation of contours and masks. Moreover, during the regression process, as the predicted results gradually align with the ground truth, a sequential relationship is inherently established to some extent. This suggests that sequence-related neural network units, such as recurrent neural networks (RNNs) [17] and gated recurrent units (GRUs) [18], may yield better outcomes. These units are adept at capturing temporal dependencies and patterns in sequential data, which can be particularly beneficial in refining the accuracy of predictions in tasks where the evolution of predictions over time mirrors a sequence, such as in the iterative refinement of target contours in instance segmentation tasks.
To overcome the inherent limitations of independent contour and mask branches—namely, their incompatible dimensions and unequal susceptibilities to background clutter—we introduce a mutually refining iteration block powered by stacked GRUs. The driving insight is that contour evolution and mask fine tuning should co-evolve instead of being solved in isolation; GRUs provide the temporal memory required to remember helpful cues while forgetting noise accumulated across iterations. By first unifying dimensions with the discrete cosine transform (DCT) [19], we allow the two branches to speak a common frequency-domain language, enabling the GRU’s update gate to cross-select the most discriminative signals from either source. This synergistic loop suppresses redundant gradients, cancels imaging noise, and propagates only task-relevant context, yielding noticeably crisper boundaries without extra parameters—GRU weights are shared across all the iterations. Finally, an attentional fusion head re-weights the refined contour and mask tokens so that the network listens to the branch that is the most trustworthy at every spatial location, giving a consistent boost in segmentation accuracy, especially for small and blurred instances.
The backbone network is crucial for the algorithm to extract image features, and the performance of the backbone network will directly affect the performance of the instance segmentation algorithm. Instance segmentation models are generally improved from object detection networks, and for targets missed by the object detection algorithm, the instance segmentation network will not be able to perform any mask segmentation operations on them. A larger backbone network can increase the feature extraction capability to some extent, thereby increasing the information input to the object detection branch, reducing missed and misdetections, and bringing about some performance improvements. However, a larger backbone network means more parameters and longer computation times, which undoubtedly pose a significant challenge to the real-time requirements of instance segmentation.
Missed detections and false alarms are still disturbingly common in remote sensing pipelines, largely because visual features alone are easily confounded by highly similar textures or complex backgrounds. We therefore inject semantic guidance straight into the early feature extraction stage by aligning each pixel with text prototypes produced by a frozen CLIP [20] encoder. The rationale is simple: Language embeddings encode categorical context (“a pixel of ship” or “a pixel of bridge”) that is invariant to illumination, scale, or viewpoint, acting as an external memory that tells the backbone what to look for. A lightweight MLP bridges any domain gap between the frozen linguistic space and the image CNN, so the enhanced feature maps remain cheap to compute yet rich in semantics. The lightweight MLP projects the CLIP text embeddings to the same channel-dimensional space as the visual features from the DLA backbone, enabling pixel-level similarity computation and alignment. These pixel–text-aligned features are simultaneously fed to both the detector and the segmentation head, reducing confusion among lookalike categories, shrinking missed detections, and sharpening instance boundaries—all without increasing the online inference cost, because the text encoder is offline and fixed.
In summary, this paper makes the following contributions: (1) We propose a unified contour–mask co-refinement framework that iteratively optimizes both representations in a shared DCT space using ConvGRUs, boosting the mask AP by 0.9% in NWPU VHR-10 and 3.2% in SSDD without adding any parameters. (2) We introduce a cross-attention fusion mechanism to adaptively combine contour and mask cues for more accurate segmentation. (3) We enhance the backbone network with a frozen CLIP text encoder to provide semantic supervision, reducing false positives and missed detections. These designs collectively improve segmentation accuracy in challenging remote sensing datasets while maintaining a lightweight architecture.

1.1. Related Work

In the field of image processing, a multitude of segmentation-related methods have emerged. These methods can be broadly categorized into threshold-based segmentation methods, edge-based segmentation methods, and clustering-based segmentation methods. Traditional image segmentation techniques offer fast processing speeds and, due to their combination of mathematical theory and image-related concepts, generally have strong interpretability and can achieve a certain level of segmentation accuracy. However, due to the limitations of the models themselves, these methods can only handle relatively simple segmentation problems. As image segmentation scenarios and target categories become increasingly complex and the requirements for segmentation accuracy grow higher, traditional image segmentation techniques are becoming less capable of meeting current demands. Meanwhile, deep-learning-based image segmentation methods have found widespread application in new scenarios and task requirements.

1.1.1. One-Stage Instance Segmentation

One-stage instance segmentation methods can be further divided into mask-based methods and contour-based methods. For mask-based methods, YOLACT [1] uses RetinaNet [6] as the object detection network and designs an additional branch in the detection network to generate mask representation coefficients specific to each instance. BlendMask [2] further improves upon YOLACT, using FCOS [5] as the object detection network. Moreover, BlendMask no longer uses simple mask representation coefficients but instead employs more complex instance-specific mask weight maps and performs matrix multiplication with the obtained prototype maps to obtain the final instance masks. CondInst [3] innovatively uses dynamic filters, replacing BlendMask’s branch for learning mask weight maps with a branch for learning dynamic filter parameters. Furthermore, CondInst does not compute prototype maps but designs additional dynamic filter branches that use the learned parameters for convolutions to refine the feature maps in an instance-specific manner and uses the results as the instance’s mask segmentation representation. SparseInst [21] uses sparse activation maps to represent instances, avoiding dense prediction for each pixel in the image, and employs a bipartite graph-matching strategy to avoid the introduction of non-maximum suppression, thereby speeding up inferences. Nevertheless, they still demand pixel-wise dense supervision, which explodes the computational load once the tiny, but numerous, objects of remote sensing images appear.
In addition to using CNNs for one-stage mask representations, there are also methods based on the transformer architecture. For example, SOLQ [22] uses DETR [23] as the basic object detection model, and increases the dimensionality of the queries in DETR by appending a 128-dimension DCT vector to each query to represent the target’s mask. Conversely, Mask2Former [24] uses each query as a search key to interact with the image’s low-level feature maps, obtaining the final mask segmentation results through the class of the query and the target information of the feature maps. Yet their heavy self-attention modules are inherently memory hungry, aggravating the already severe “small target, large image” imbalance in satellite/aerial imagery.
Some existing models approach one-stage instance segmentation from the perspective of contours. PolarMask [25] represents contours in polar coordinates. It first uses the FCOS network for object detection and adds a new branch. Unlike mask representation methods, the newly added branch generates the distance from the contour boundary to the target center. FourierNet [26] transforms contour segmentation to the frequency domain. The FCOS object detection network is still used for object detection, while the newly added branch is responsible for predicting a series of Fourier coefficients. After obtaining the coefficients, they can be inversely transformed back to the spatial domain to obtain the instance’s spatial contour result. However, by modeling only the object boundary, they throw away the very background context that is indispensable for distinguishing “ship-like pier” from “real ship” in noisy harbors.
In the field of remote sensing image instance segmentation, some methods also consider using one-stage instance segmentation models with improvements specific to remote sensing targets. For example, FB-ISNet [27] improves upon CondInst, using a deep-layer aggregation (DLA) network [28] as the backbone network and adding a bidirectional feature pyramid network (FPN) [29]. Shi et al. [30] retain FCOS as the object detection network but introduce an additional bounding box refinement module to improve detection accuracy and use two parallel branches on the prototype map to jointly obtain the final mask segmentation results. Although these methods design modules specific to remote sensing images to adapt to the special needs of remote sensing scenarios, they still cannot achieve satisfactory performance in terms of accuracy. Consequently, their improvements barely touch the chronic dilemma: similar contours across categories and blurred edges caused by sub-meter resolution.

1.1.2. Two-Stage Instance Segmentation

Unlike one-stage instance segmentation methods, two-stage instance segmentation methods first use an object detection network to obtain object detection results and then design an additional segmentation generation step to obtain instance segmentation results. Two-stage instance segmentation methods are also divided into mask-based segmentation methods and contour-based segmentation methods. Mask R-CNN [7] improves upon Faster R-CNN [10] by introducing a mask segmentation branch, using sampled features to generate the final mask results. For more accurate segmentation representation, Mask R-CNN adjusts the region-of-interest (RoI) process, changing from direct rounding to floating-point alignment, to improve accuracy. The price of ever-heavier backbones is sluggish inference, exactly what real-time remote sensing applications strive to avoid when hundreds of instances per tile must be segmented.
Due to the difference in the input quality of the RoI between the training and inference stages of Mask R-CNN, a mismatch between training and inference arises. To address this issue, Cascade R-CNN [8] progressively increases the RoI threshold, gradually improving the quality of the input detection boxes, thereby unifying the training and inference stages. However, the mask segmentation part of Cascade R-CNN is still performed in isolation. To create connections between the segmentation parts, HTC [9] connects the mask segmentation results from previous steps together with the feature maps, using them as input for the next stage’s mask segmentation branch. Additionally, HTC designs a global supervision module to further enhance the model’s accuracy.
In the field of remote sensing image instance segmentation, most models have improved upon these methods. For example, LFG-Net [11] uses an additional FPN and further refines the low-level feature maps of Cascade R-CNN to obtain more precise segmentation results. Ye et al. [12] use an attention mechanism to fuse feature maps at different scales and employ convolutional kernels of varying sizes to extract features of different-sized targets while also improving the mask generation branch using an attention mechanism. HQ-ISNet [31] also enhances the FPN layer to obtain better image features. Furthermore, HQ-ISNet introduces hierarchical refinement and interacts with the predicted masks. In recent years, larger backbone networks have gradually gained popularity. For example, GLFRNet [32] uses Vmamba [33] as a branch in the backbone network and designs a fusion module to integrate global and local features. Yu et al. [34] improve upon Mask R-CNN and use the Swin transformer [35] as an encoder to enhance network performance. It is evident that most recent methods have attempted to use larger backbone networks to achieve better segmentation results, which consequently lead to further decreases in inference speeds.
Contour-based two-stage instance segmentation methods, inspired by active contour models, continuously refine the obtained contour results to fit the target boundaries. DeepSnake [13] first uses CenterNet [4] as the object detection network, extracts initial contours from bounding boxes, and iteratively refines the initial contours using a recurrent convolutional network to ultimately obtain contour results that conform to the target boundaries. E2EC [14] improves upon DeepSnake, using a multilayer perceptron (MLP) to learn an initial contour. Additionally, to address the issues of path intersection and lateral movement during iteration, E2EC designs an additional loss function to assist in training. DANCE also considers the problem of path intersection. Unlike E2EC, DANCE [36] uses a segmented matching strategy during contour-point-sampling path intersection. Although PolySnake [17] also employs GRUs for contour refinement, its iterative module is designed exclusively for contour evolution, with the mask branch remaining independent and non-iterative. In contrast, CGNet introduces a dual-branch ConvGRU iteration mechanism that simultaneously refines both contour and mask representations in a shared 256-D DCT space. This design is motivated by two key insights: (1) Unified dimensional alignment: By projecting both the mask (via DCT) and contour to the same frequency-domain vector space, ConvGRU can perform cross-modal temporal fusion, allowing each branch to benefit from the complementary evolution of the other. (2) Cross-branch memory sharing: Unlike PolySnake’s GRU, which only maintains the intra-contour state, CGNet’s ConvGRU maintains inter-branch hidden states via the channel-wise concatenation of contour and mask features at each iteration step. This enables the update gate to selectively suppress noise from either branch while propagating task-relevant context across iterations. Furthermore, CGNet’s ConvGRU employs dilated circular convolution to explicitly model the topological continuity of contours while leveraging 1 × 1 conv + max-pooling fusion to integrate multiscale mask cues. These structural designs make ConvGRU particularly suitable for remote sensing scenarios where small targets, blurred edges, and noisy backgrounds demand joint optimization of boundary and regional cues—a capability that single-branch GRU frameworks, like PolySnake, inherently lack.
In two-stage instance segmentation methods, contour-based methods can significantly reduce computation time and improve inference speed compared to those of mask-based methods. This is because contour-based methods sample fewer points and transform the classification problem to a regression problem, avoiding dense prediction for pixels. However, these methods do not perform well for targets with complex boundaries, especially non-convex targets.

2. Materials and Methods

2.1. Datasets

In the field of remote sensing image instance segmentation, numerous datasets have emerged, covering various image sources, such as optical images and synthetic aperture radar (SAR) images. Among them, optical image datasets include the iSAID dataset [37], NWPU VHR-10 dataset [38], and BITCC dataset [39], while SAR image datasets comprise the HRSID dataset [40] and SSDD dataset [41]. Since the proposed method has a relatively low number of parameters and a less complex model structure, it is more suitable for smaller-scale datasets. Therefore, this study uses the NWPU VHR-10 dataset and SSDD dataset for experimentation.
The NWPU VHR-10 Dataset is a geospatial remote sensing dataset designed for object detection and segmentation tasks. It consists of 800 images in total, comprising 650 target images and 150 background images. These images were carefully selected, cropped, and manually annotated from two source datasets: the Google Earth dataset and the Vaihingen dataset. Specifically, 715 images were extracted from the Google Earth dataset, which offers spatial resolutions ranging from 0.5 to 2 m. The remaining 85 images are sharpened infrared optical images sourced from the Vaihingen dataset, which has a higher spatial resolution of 0.08 m. The dataset encompasses 757 airplanes, 302 ships, 655 storage tanks, 390 baseball fields, 524 tennis courts, 159 basketball courts, 163 athletic tracks, 224 harbors, 124 bridges, and 477 vehicles. In this study, the NWPU VHR-10 dataset was randomly divided into 70% for training and the remaining 30% for testing purposes.
SSDD is the first openly available dataset widely used for research on deep-learning-based ship detection and instance segmentation in SAR images. Due to the varying scales of ships in SSDD, vessels of different sizes produce different numbers of contour points. Larger ships provide more points, while smaller ones offer fewer. SSDD contains a total of 1160 images, officially divided into a training set with 928 images and a test set with 232 images.

2.2. Evaluation Metrics

Currently, in the field of image instance segmentation, common dataset evaluation metrics refer to the evaluation method of the MS COCO dataset [42]. First, in a manner similar to that in the object detection field, it is necessary to calculate the intersection over union ($IoU$) between the instance segmentation mask predicted by the model and the ground-truth instance segmentation mask of the target, obtaining $IoU_{mask}$. This is used to derive the average precision ($AP$) for the instance segmentation results ($AP_{mask}$). The calculation formula for $IoU_{mask}$ is as follows:
$$IoU_{mask} = \frac{|M_{pred} \cap M_{gt}|}{|M_{pred} \cup M_{gt}|}$$
where $M_{pred}$ represents the mask prediction result of the instance segmentation model for a specific target, and $M_{gt}$ represents the ground truth of that target. Using a predetermined $IoU$ threshold, the precision and recall can be calculated using the following equations:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
where $TP$ represents the true positive samples, $FP$ represents the false positive samples, and $FN$ represents the false negative samples. $AP_{mask}$ can be calculated using the following formula:
$$AP_{mask} = \int_{0}^{1} Precision(r)\, dr$$
where $r$ denotes the recall rate. This formula represents the integral of the precision over recall. In MS COCO, the calculation of $AP_{mask}$ is simplified as follows: Fixed $IoU_{mask}$ thresholds are selected, ranging from 0.5 to 0.95 with a step size of 0.05, yielding precision values at ten different $IoU_{mask}$ thresholds. A weighted average of these values is then computed to obtain the evaluation metric ($AP_{mask}$).
In addition to evaluating prediction results using $AP_{mask}$, the metrics $AP_{50}$ and $AP_{75}$, computed at $IoU_{mask}$ thresholds of 0.5 and 0.75, respectively, can be used to preliminarily assess the model’s detection and segmentation capabilities. Furthermore, based on the scale of the targets, objects in the image can be categorized into small targets (with an area smaller than $32 \times 32$ in the image), medium targets (with an area larger than $32 \times 32$ but smaller than $96 \times 96$), and large targets (with an area larger than $96 \times 96$). These are evaluated separately using $AP_S$, $AP_M$, and $AP_L$.
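For concreteness, the mask IoU and the COCO-style threshold averaging can be sketched as follows. This is a minimal illustration rather than the full COCO evaluation protocol (which additionally integrates precision over recall per class); the function names are ours.

```python
import numpy as np

def mask_iou(mask_pred: np.ndarray, mask_gt: np.ndarray) -> float:
    """IoU between two binary masks: |intersection| / |union|."""
    pred, gt = mask_pred.astype(bool), mask_gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

# COCO-style AP averages the per-threshold AP over ten IoU thresholds,
# 0.50:0.05:0.95; AP50 and AP75 are the values at the first and sixth threshold.
COCO_THRESHOLDS = np.arange(0.50, 1.00, 0.05)
```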

2.3. Model Architecture

The overall architecture of the model is shown in Figure 1 and consists of three parts: the backbone network, the object detection network, and the segmentation network. The backbone network employs DLA with an additional CLIP text encoder for auxiliary supervision. The object detection network uses CenterNet, while the segmentation network is divided into two branches: a contour branch and a mask branch. Both branches utilize convolutional GRU modules for iterative refinement and ultimately produce the final results through an attention-based fusion head.

2.3.1. Backbone Supervision

RemoteCLIP [43] refers to a domain-adapted variant of CLIP, specifically fine tuned for remote sensing imagery. In CGNet, a DLA backbone network enhanced with RemoteCLIP is designed to supervise image features, thereby improving the network’s detection and segmentation performances. A pixel–text alignment feature supervision approach is proposed, incorporating additional textual information. Unlike the task-specific geometric prior in Zhang et al. [15], our frozen CLIP encoder provides category-aware cues that are agnostic to the object shape yet is equally effective in suppressing background confusion (Wang et al. [16]).
For text description, CGNet employs fixed template prompt phrases. For instance, if a pixel in the input image belongs to the “ship” category, the prompt phrase “a pixel of a ship” is used. During the initialization of the backbone network, CGNet predefines prompt phrases for all the existing categories in the dataset, generating N prompt outputs, where N denotes the total number of target classes in the dataset. These N prompt phrases are then fed into the RemoteCLIP text encoder for text encoding. To further enhance the alignment between textual features and input image features, a simple fully connected (FC) layer is introduced as a feedforward network to refine the extracted text features. This FC layer also ensures scale consistency between the text-encoded features and the backbone network features. During both training and inference, the RemoteCLIP text encoder remains frozen to prevent gradient explosion. CLIP’s alignment loss encourages the text encoder to act as a class name oracle: Any phrase that reliably points to the visual concept is sufficient, whereas additive context may increase variance. Remote sensing nouns, such as “ship” or “bridge”, already serve as strong visual anchors; extra domain qualifiers (e.g., “satellite image pixel of …”) lie outside the short-phrase distribution on which CLIP was pretrained and can shift the textual embedding away from the visual centroid, reducing mutual information without providing new supervisory signals. Consequently, the minimal prompt “a pixel of cls” is the maximum a posteriori choice under the CLIP prior, a conclusion consistent with prompt-sensitivity analyses in natural-image tasks [44]. For this reason, we did not perform an exhaustive template search.
The text-embedding process is formulated as follows:
$$Token = Tokenizer(T(Cls))$$
$$F_{text} = FC(Encoder(Token))$$
where $Cls$ represents the target class, $T$ denotes the prompt template, $Tokenizer$ performs text tokenization, $Encoder$ refers to the RemoteCLIP text encoder, and $F_{text}$ is the final textual feature.
After obtaining the text prompts, CGNet employs pixel–text alignment to supervise the feature learning for each input image. Specifically, it uses the fused feature map from the DLA backbone, which has a spatial resolution of $[\frac{H}{4}, \frac{W}{4}, C]$, where $C$ is the number of channels. For each pixel in the feature map, its $C$-dimensional feature vector is compared with the text embeddings via similarity computation. The pixel–text alignment process is formulated as follows:
$$M_{score} = Einsum(Norm(F_{text}), Norm(F_{img}))$$
where $F_{img}$ is the fused feature map from the backbone, $Norm$ represents the normalization function, and $Einsum$ denotes the multiplication of the tensors.
After obtaining the aligned features, it is necessary to supervise the features to achieve the goal of enhancing the backbone network. Binary cross-entropy is used as the loss function, and the supervision masks are derived from the ground truth of the instance segmentation. Specifically, $N$ blank masks are first generated, corresponding to the scale $[\frac{H}{4}, \frac{W}{4}]$. Then, for each image, if there exists an instance belonging to the $i$th class, the instance mask is added to the $i$th mask. After processing each image in a batch, the ground truth for supervision is obtained. The loss function used can be expressed as follows:
$$\mathcal{L}_{CLIP} = BCE(M_{score}, GT_{CLIP})$$
and Figure 2 illustrates the overall workflow.
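A minimal PyTorch-style sketch of this pixel–text alignment is given below. The text encoder stands in for the frozen RemoteCLIP encoder, and the class names, dimensions, and module names are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelTextAlignment(nn.Module):
    """Sketch: align backbone pixels with frozen text prompts ("a pixel of a <cls>")."""

    def __init__(self, text_encoder, tokenizer, class_names, text_dim=512, feat_dim=256):
        super().__init__()
        self.text_encoder = text_encoder          # frozen RemoteCLIP text encoder
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        self.tokenizer = tokenizer
        self.prompts = [f"a pixel of a {c}" for c in class_names]
        self.fc = nn.Linear(text_dim, feat_dim)   # bridges the text space to image channels

    def forward(self, f_img):
        # f_img: [B, C, H/4, W/4] fused feature map from the DLA backbone
        with torch.no_grad():
            f_text = self.text_encoder(self.tokenizer(self.prompts))  # [N, text_dim]
        f_text = F.normalize(self.fc(f_text), dim=-1)                 # [N, C]
        f_pix = F.normalize(f_img, dim=1)                             # [B, C, H, W]
        score = torch.einsum("nc,bchw->bnhw", f_text, f_pix)          # pixel-text similarity
        return score

# Supervision: BCE against the N per-class ground-truth masks at 1/4 resolution, e.g.
#   loss_clip = F.binary_cross_entropy_with_logits(score, gt_masks)
```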

2.3.2. Mask Information Branch Alignment and Representation

Due to the dimensional inconsistency between the mask information representation and contour information representation, additional processing steps are required. This paper employs image DCT to perform dimensional conversion on mask prediction results.
First, a two-dimensional DCT transform is applied to the target mask ($M$), converting the mask to a frequency-domain representation ($M_{DCT}$). In $M_{DCT}$, the upper-left region represents low-frequency components, where $M_{DCT}[0,0]$ corresponds to the DC component. For the two-dimensional frequency-domain result ($M_{DCT}$), a zigzag encoding scheme is used to compress it to a one-dimensional representation ($V$), with high-frequency components beyond the encoding limit being truncated. In the proposed method, the maximum length of $V$ is set at 256, ensuring alignment with the number of coordinate points in the contour branch.
The mask branch starts by producing an initial contour from center-point heatmaps and backbone features. Specifically, the center-point heatmaps first yield an object-level bounding box, from which a fixed number of uniformly spaced contour points are sampled; these 2-D coordinates are concatenated with the corresponding RoI-aligned backbone features and fed into two FC layers that regress the first 256-D DCT vector representing the initial mask. In the following steps, the branch iteratively updates the DCT coefficients so that the mask gradually approaches the target mask; the refined coefficients are also sent to the contour branch by concatenating the updated DCT vector with the contour-point features at each GRU iteration, helping the contour to evolve toward the target contour.
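The mask-to-vector conversion can be sketched as below, assuming SciPy's DCT routines; the exact zigzag ordering used by the authors is not specified, so this anti-diagonal scan is illustrative.

```python
import numpy as np
from scipy.fft import dctn, idctn

def zigzag_order(h, w):
    """Anti-diagonal scan that visits low-frequency coefficients first."""
    return sorted(((i, j) for i in range(h) for j in range(w)),
                  key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]))

def mask_to_dct_vector(mask, length=256):
    """2-D DCT of the mask, zigzag-scanned and truncated to a 256-D vector."""
    coeffs = dctn(mask.astype(np.float32), norm="ortho")
    order = zigzag_order(*coeffs.shape)[:length]
    return np.array([coeffs[i, j] for i, j in order], dtype=np.float32)

def dct_vector_to_mask(vec, shape):
    """Inverse path: scatter the kept coefficients back and apply the inverse 2-D DCT."""
    coeffs = np.zeros(shape, dtype=np.float32)
    for v, (i, j) in zip(vec, zigzag_order(*shape)[:len(vec)]):
        coeffs[i, j] = v
    return idctn(coeffs, norm="ortho")
```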

2.3.3. Iteration Module

The iterative module takes the features extracted from the backbone network, the initial contour, and the initial DCT prediction results as input. It first employs stacked basic modules for further feature extraction. These basic modules consist of a combination of convolution, nonlinear activation functions, and normalization.
The convolutional operation utilizes a dilated circular convolutional function. Circular convolution is a specialized form of one-dimensional convolution that connects the head and tail of the input one-dimensional vector, eliminating the need for additional zero-padding while producing outputs at the same scale as that of the inputs. This makes circular convolution particularly suitable for topological structures, such as contours. Dilated circular convolution is used to model the topological structures of contours without breaking their cyclic nature. The dilation allows for a larger receptive field along the contour, capturing more context while preserving spatial continuity.
For the nonlinear activation function, LeakyReLU [45] is used. Due to its computational simplicity, LeakyReLU mitigates the vanishing-gradient problem to some extent and addresses the “dead ReLU problem” associated with the standard ReLU function. As a result, it is widely used in the field of computer vision.
The iterative module employs a 1 × 1 convolution and a max-pooling layer as the fusion head to combine the feature outputs extracted by the basic modules with different dilation rates. Finally, a convolutional GRU is used as the prediction head to estimate the offsets of different branch vectors. The predicted offsets are then added to the original input to obtain the refined prediction result for the current iteration. More details are illustrated in Figure 3.
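As a rough illustration, one such basic module could be written with PyTorch's circular padding mode; the channel size, normalization choice, and dilation value are assumptions for the sketch.

```python
import torch.nn as nn

class DilatedCircularBlock(nn.Module):
    """One basic module: a dilated 1-D convolution with circular padding
    (keeping the contour's head and tail connected), normalization, and LeakyReLU."""

    def __init__(self, ch=128, dilation=1, k=3):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, k, dilation=dilation,
                              padding=dilation * (k // 2), padding_mode="circular")
        self.norm = nn.BatchNorm1d(ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):                 # x: [B, ch, num_contour_points]
        return self.act(self.norm(self.conv(x)))
```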
The GRU unit takes the sequential input ($x$) and hidden information ($h$) as inputs. In the convolutional GRU, the branch vectors are concatenated channel-wise with the sampled features from the backbone network and serve as the input ($x$). The initial hidden state ($h_0$) is obtained by applying a tanh activation function to $x_0$.
Unlike standard GRU units, the convolutional GRU merges the input (x) and hidden state (h) through channel-wise concatenation instead of element-wise addition. Furthermore, before applying nonlinear activation, a 1D convolutional operation is used instead of a fully connected layer mapping.
To fully exploit hidden information, the convolutional GRU incorporates an additional feedforward network (FFN), refining the prediction results through a simple MLP layer, thereby producing outputs that better adhere to the ground truth.
The update gate is formulated as follows:
$$z_t = \sigma(Conv(Cat(x_t, h_{t-1})))$$
where $Cat$ denotes channel-wise concatenation, $Conv$ represents 1D convolution, and $\sigma$ is the sigmoid function. The reset gate is expressed as follows:
$$r_t = \sigma(Conv(Cat(x_t, h_{t-1})))$$
where the convolutional operations do not share parameters with those in the update gate. The candidate hidden state is computed as follows:
$$\tilde{h}_t = \tanh(Conv(Cat(x_t, r_t \odot h_{t-1})))$$
where $\odot$ denotes element-wise multiplication, and $\tanh$ is the activation function. The hidden-state update is computed as follows:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
Finally, the prediction is given by
$$Prediction = x_t + FFN(h_t)$$
where $FFN(h_t)$ predicts the offset between the current input and the ground truth. The structure of the convolutional GRU unit is illustrated in Figure 4. In each branch, we stack $K$ iterative modules whose weights are shared across the layers. By ablating $K \in \{4, 6, 9, 12\}$ on the SSDD validation split, we observe that the mask AP climbs steadily and plateaus at $K = 12$; further enlarging $K$ yields negligible gains, so 12 is adopted as the saturation point.
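A compact sketch of the convolutional GRU cell following the update, reset, candidate, and output equations above is given next; the channel sizes and the depth of the FFN offset head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU: x and h are concatenated channel-wise and mixed with
    1-D convolutions; an FFN on the hidden state predicts the residual offset."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        self.conv_z = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=pad)  # update gate
        self.conv_r = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=pad)  # reset gate
        self.conv_h = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=pad)  # candidate state
        self.ffn = nn.Sequential(nn.Conv1d(hid_ch, hid_ch, 1), nn.ReLU(),
                                 nn.Conv1d(hid_ch, in_ch, 1))            # offset head

    def forward(self, x, h=None):
        if h is None:                       # h0 = tanh(x0); assumes in_ch == hid_ch here
            h = torch.tanh(x)
        z = torch.sigmoid(self.conv_z(torch.cat([x, h], dim=1)))
        r = torch.sigmoid(self.conv_r(torch.cat([x, h], dim=1)))
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h], dim=1)))
        h = (1 - z) * h + z * h_tilde
        return x + self.ffn(h), h           # refined prediction, new hidden state
```

Iterating a weight-shared cell of this form $K = 12$ times per branch corresponds to the saturation point reported above.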

2.3.4. Fusion Head

CGNet proposes a novel branch fusion module. This module effectively leverages DCT information and mask information, generating the final instance segmentation results through a carefully designed fusion process. The module employs cross-attention, treating the contour branch results as queries and the mask branch results as keys and values. Specifically, the fusion module first concatenates the contour prediction results, DCT vectors, and image features separately and then applies different linear layers for projection. Subsequently, the module performs standard cross-attention computations to generate offsets between the predicted contours and ground truth. These offsets are added to the input contours to produce the final refined results. Notably, the module does not use FFNs as post-processing.
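The cross-attention fusion described here can be sketched as follows; the token dimensions, head count, and projection layout are assumptions for illustration rather than the exact configuration.

```python
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Contour tokens query the mask/DCT tokens; the attended features regress
    per-point offsets that are added to the input contour (no FFN post-processing)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.offset = nn.Linear(dim, 2)        # per-point (dx, dy) offset

    def forward(self, contour_tok, mask_tok, contour_pts):
        # contour_tok: [B, N, dim] contour features; mask_tok: [B, M, dim] DCT features
        q, kv = self.q_proj(contour_tok), self.kv_proj(mask_tok)
        fused, _ = self.attn(q, kv, kv)
        return contour_pts + self.offset(fused)   # refined contour coordinates [B, N, 2]
```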

2.3.5. Loss Function

The final loss is a weighted combination of four terms:
$$\mathcal{L}_{total} = \mathcal{L}_{det} + \mathcal{L}_{poly} + \mathcal{L}_{dct} + \lambda_{clip} \mathcal{L}_{CLIP},$$
where
  • $\mathcal{L}_{det}$: the focal loss of CenterNet for object detection;
  • $\mathcal{L}_{poly}$: the smooth-L1 loss between predicted polygonal offsets and ground-truth contour coordinates;
  • $\mathcal{L}_{dct}$: the smooth-L1 loss between the predicted 256-D DCT vector and its ground truth;
  • $\mathcal{L}_{CLIP}$: the binary cross-entropy (BCE) loss for pixel–text alignment supervision;
  • $\lambda_{clip}$ is set at 0.5 in all the experiments.
All the losses are computed per image and averaged over the batch. Gradients are backpropagated jointly through the entire network.
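A direct translation of this objective reads as follows (a sketch; the tensor shapes, reduction choices, and the externally supplied detection loss are assumptions):

```python
import torch.nn.functional as F

def total_loss(l_det, pred_poly, gt_poly, pred_dct, gt_dct,
               clip_score, clip_gt, lambda_clip=0.5):
    """L_total = L_det + L_poly + L_dct + lambda_clip * L_CLIP."""
    l_poly = F.smooth_l1_loss(pred_poly, gt_poly)                     # contour offsets
    l_dct = F.smooth_l1_loss(pred_dct, gt_dct)                        # 256-D DCT vectors
    l_clip = F.binary_cross_entropy_with_logits(clip_score, clip_gt)  # pixel-text alignment
    return l_det + l_poly + l_dct + lambda_clip * l_clip
```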

2.4. Training Details

CGNet employs adaptive moment estimation (Adam) [46] as the optimizer to backpropagate network loss, with an initial learning rate set at $10^{-4}$. The learning rate decays at the 80th and 120th epochs, and the total number of training epochs is fixed at 150. The batch size is set at 4. In this study, DLA is used as the backbone network of the model, with a depth of 34 layers. Additionally, a deformable convolutional network (DCN) [47] is incorporated to enhance the backbone network’s performance. The smooth-L1 loss is adopted as the loss function, while the binary cross-entropy loss (BCE loss) is applied for CLIP supervision. The overall training objective follows Equation (14); $\lambda_{clip}$ is fixed at 0.5 without further tuning.
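For reference, the reported schedule corresponds to a training loop of roughly the following shape; the decay factor, data loader, and model interface are assumptions, while the optimizer, learning rate, milestones, epoch count, and batch size follow the text above.

```python
import torch

# `model` is assumed to be the assembled CGNet returning the combined loss;
# `train_loader` is an assumed DataLoader with batch size 4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(150):
    for batch in train_loader:
        loss = model(batch)          # combined objective from Section 2.3.5
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```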

3. Results

3.1. Comparison Results

Comparison in the NWPU VHR-10 dataset: Table 1 compares the instance segmentation performance of CGNet with those of various state-of-the-art and classical methods in the NWPU VHR-10 dataset. The evaluated models include specialized remote sensing image (RSI) approaches, such as ARE-Net [48], models by Kumar [49] and Shi and Zhang [30], FB-ISNet [27], HQ-ISNet [31], and YOLOv5s-MLS [50], as well as general-purpose models, like YOLACT [1], Mask R-CNN [7], Cascade R-CNN [8], PointRend [51], Vmamba-IR [33], SG-Former [52], GLFRNet [32], and HQ-ISNet-v2 [53]. CGNet outperforms all the other methods in terms of $AP$ and $AP_{75}$, with its $AP$ score being 0.9% higher than that of the second-best model. This demonstrates CGNet’s strong segmentation capability, particularly for small objects. Additionally, CGNet ranks second in $AP_S$, $AP_M$, and $AP_L$, confirming its effectiveness across objects at varying scales. Quantitatively, CGNet possesses only 64.2 M trainable parameters—17% and 33% fewer than those of Cascade Mask R-CNN (77.3 M) and HQ-ISNet (95.6 M), respectively—while still delivering the highest mask AP, verifying its lightweight and high-efficiency design.
Comparison in the SSDD dataset: The SSDD benchmark includes specialized SAR image instance segmentation models, such as C-SE Mask R-CNN [54], EMIN [55], FL-CSE-ROIE [56], MAI-SE-Net [57], SA R-CNN [58], and LFG-Net [11], alongside other established methods. As shown in Table 2, CGNet achieves the highest performance in this dataset, surpassing the second-best method (LFG-Net) by a significant margin of 3.2% in overall accuracy. Additionally, CGNet excels in $AP_{75}$, $AP_M$, and $AP_L$, indicating robust segmentation capabilities across objects of varying sizes, particularly large-scale ships. In SSDD, CGNet retains the same 64.2 M parameters, outperforming LFG-Net (55.7 M) by a 3.2% higher mask $AP$ and nearly doubling the $AP_L$ value of HQ-ISNet (42.1 M), further confirming that the accuracy gains do not rely on a heavier backbone.

3.2. Ablation Study

To validate the model’s efficacy, comprehensive experiments were performed on multiple datasets. The results consistently show that these proposed enhancements achieve favorable outcomes and significantly improve CGNet’s performance.
ConvGRU: The GRU is a type of recurrent neural network, derived as an improvement over the LSTM. Compared to the LSTM, the GRU reduces the number of gates. To validate the impacts of different recurrent units on this model, Table 3 compares CGNet with the proposed model using ConvLSTM units. It can be observed that employing ConvGRU slightly enhances the network’s performance, which may be attributed to the GRU’s design being more suitable as an iterative module.
Fusion head: The fusion head is used to combine the results from different branches and output the final prediction. Table 3 presents a comparison between the attention-based fusion head and the MLP-based fusion head. Clearly, the attention-based fusion mechanism can more effectively filter useful information and achieve more accurate results. Additionally, the attention-based fusion head operates faster and requires fewer computational resources.
CLIP supervision: CLIP supervision is used to supplement textual information and enhance the performance of the backbone network, thereby improving the detection and segmentation networks. In Table 3, the model with CLIP supervision performs the best in all the metrics, demonstrating the importance of CLIP’s auxiliary guidance.
Iteration layer: In Table 4, as the number of iterative layers grows, all the metrics consistently improve. However, when the depth increases from nine to twelve, the gain in AP becomes marginal (only +0.8%), indicating that the performance is approaching the saturation point. Continuing to stack layers beyond 12 would raise the parameter count and noticeably prolong the inference time without a meaningful accuracy benefit.

3.3. Visualization

For intuitive performance assessment, CGNet’s visual results in the NWPU VHR-10 and SSDD datasets are displayed. The comparisons reveal that CGNet consistently produces enhanced visual outputs in both.
NWPU VHR-10: The experimental outcomes in the NWPU VHR-10 dataset are illustrated in Figure 5. The first row indicates that competing methods frequently misclassify background regions as objects, highlighting the complexity of RSI. In contrast, CGNet effectively suppresses such false detections. Rows 2-4 demonstrate CGNet’s superior boundary delineation capability, where conventional methods exhibit noticeable edge artifacts. Quantitative comparison across columns confirms CGNet’s substantial improvement in mask precision, particularly for small objects, where the boundary segmentation accuracy is critically enhanced.
SSDD: Figure 6 displays the comparative results in the SSDD dataset. While all the methods achieve satisfactory performances for isolated offshore vessels, conventional approaches exhibit significantly higher false positives for inshore ships due to their strong visual similarity with coastal environments. Furthermore, SAR imaging characteristics cause competing methods to fragment large targets, whereas CGNet maintains object integrity. These visual comparisons confirm CGNet’s advantages in reducing false alarms, minimizing missed detections, and producing more coherent object boundaries.

4. Discussion

CGNet achieves 68.1% and 67.4% mask APs in NWPU VHR-10 and SSDD, outperforming the second-best methods by 0.9% and 3.2% while keeping only 64.2 M parameters, confirming the value of joint contour–mask refinement. By projecting both signals to a shared DCT space, ConvGRU can evolve them simultaneously and suppress noise accumulated in single-branch iterations; frozen CLIP text priors inject category-aware context into early features and noticeably reduce confusion between lookalike categories, such as ships and ship-like piers. Nevertheless, large instances in NWPU (harbors and stadiums) are embedded in multiclass clutter; the fixed prompt “a pixel of a harbor” is applied equally to interior pixels that, in fact, belong to cars or buildings, lowering the local IoU. The attention fusion head downweights the text stream whenever visual features contradict the prior, so regions that need extra guidance receive less supervisory signal and medium/large-object accuracies drop. In SSDD, large vessels lie on the homogeneous sea, the text prior is valid for almost every labeled pixel, and the fusion head maintains high attention on the text and successfully suppresses sea-clutter false positives, so the same architecture exhibits opposite trends in these two datasets. Dynamic prompt adaptation, lightweight prior calibration, or robust fusion strategies could allow semantic cues to adjust flexibly to local contexts and further unleash the accuracy potential of real-time remote sensing instance segmentation.

5. Conclusions

Instance segmentation of remote sensing imagery represents a crucial computer vision task with extensive applications in many areas. However, this task presents unique challenges: Targets typically appear at small scales with highly similar inter-class contours, while complex backgrounds introduce substantial noise interference. These characteristics demand that segmentation models effectively leverage both contextual information and intrinsic instance features. As demonstrated by Wang et al. [16], background noise and visual clutter remain major challenges in remote sensing imagery. We address these issues by integrating semantic guidance from CLIP and jointly refining contour and mask representations—a strategy that parallels the auxiliary prior injection mechanism proposed by Zhang et al. [15]. This paper proposes CGNet, a novel architecture specifically designed for remote sensing imagery characteristics. The network incorporates an optimized backbone coupled with a dedicated contour prediction branch, where gated recurrent units facilitate iterative refinement between contour and mask predictions. For effective feature integration, we implement an attention-based fusion mechanism. Furthermore, recognizing the prevalent issues of target omission and false detection, we enhance the backbone network through contrastive-learning-based supervision to enrich feature representation.
Although CGNet has the aforementioned advantages, there remains room for improvement and potential research directions for this model. First, CGNet still lags behind current state-of-the-art models. Future improvements could explore stronger backbone networks, such as transformers or Vmamba, to enhance performance. Additionally, the design of the contour and mask branches in the segmentation head could be further refined, for instance, by introducing attention mechanisms to help the model better focus on targets at various scales and locations. Second, the background-induced mismatch of the CLIP text prior explains why CGNet underperforms on medium and large objects in NWPU VHR-10 yet excels on large objects in SSDD. NWPU’s large instances (harbors and stadiums) are embedded in multiclass clutter; the fixed template prior (‘a pixel of a harbor’) is equally applied to interior pixels that actually belong to cars, buildings, or airplanes. The attention fusion head downweights the text stream whenever local visual features contradict the prior, so the very regions that need extra guidance receive a weaker supervisory signal, lowering the IoU for medium and large objects. In contrast, SSDD’s large ships sit on the homogeneous sea; the text prior is contextually valid for almost every labeled pixel, allowing the fusion head to maintain high attention on the text and suppress sea-clutter false positives. Because no additional parameters are introduced, this neighborhood–prior alignment effect is purely data dependent; hence, the same architecture exhibits opposite trends in the two datasets. To mitigate this data-dependent prior mismatch, future work could explore dynamic prompt adaptation, lightweight prior calibration, or robust fusion strategies that allow the text cue to flexibly adjust to local contexts without sacrificing efficiency. While stacking more modules or deepening the backbone network could also enhance the performance in large datasets, such modifications would inevitably reduce the inference speed. Therefore, striking a balance between segmentation accuracy and inference speed may be a key focus of future research. Finally, carefully designed loss functions can further improve the model’s performance. Subsequent work could focus on developing loss functions for both the contour and mask branches to better guide network training. Although CGNet already achieves a competitive accuracy, its parameter count (64.2 M) is markedly lower than those of most two-stage competitors (≥77 M). Future work will continue to balance parameter efficiency and accuracy by exploring stronger yet economical backbones.

Author Contributions

Conceptualization, H.Z. and Z.T.; methodology, Z.T. and Z.C.; software, T.L.; validation, H.Z., Z.T. and X.X.; formal analysis, T.L.; investigation, X.X. and J.L.; resources, H.Z. and X.Q.; data curation, T.L. and J.L.; writing—original draft preparation, T.L.; writing—review and editing, H.Z. and Z.T.; visualization, T.L.; supervision, Z.C.; project administration, H.Z.; funding acquisition, H.Z. and Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Civil Aerospace Technology Pre-Research Project of China’s 14th Five-Year Plan, Guide No. D040404, and by the project “Research on Suspicious Area Detection Technology for Underground/Semi-Underground Targets with Multisource Data”.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We are grateful to the handling editor and the anonymous reviewers for their thorough review and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
  2. Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. Blendmask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8573–8581. [Google Scholar]
  3. Tian, Z.; Shen, C.; Chen, H. Conditional Convolutions for Instance Segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 282–298. [Google Scholar]
  4. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  5. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 69–76. [Google Scholar]
  6. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  7. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  8. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  9. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4974–4983. [Google Scholar]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91. [Google Scholar] [CrossRef] [PubMed]
  11. Wei, S.; Zeng, X.; Zhang, H.; Zhou, Z.; Shi, J.; Zhang, X. LFG-Net: Low-Level Feature Guided Network for Precise Ship Instance Segmentation in SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  12. Ye, W.; Zhang, W.; Lei, W.; Zhang, W.; Chen, X.; Wang, Y. Remote sensing image instance segmentation network with transformer and multi-scale feature representation. Expert Syst. Appl. 2023, 234, 121007. [Google Scholar] [CrossRef]
  13. Peng, S.; Jiang, W.; Pi, H.; Li, X.; Bao, H.; Zhou, X. Deep snake for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8533–8542. [Google Scholar]
  14. Zhang, T.; Wei, S.; Ji, S. E2ec: An end-to-end contour-based method for high-quality high-speed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 4443–4452. [Google Scholar]
  15. Zhang, Y.; Liu, X.; Zhao, H. Auxiliary Geometric Prior-Guided Segmentation for Aircraft Detection in Remote Sensing Images. Pattern Recognit. 2025, 153, 111503. [Google Scholar] [CrossRef]
  16. Wang, J.; Chen, Y.; Li, M. Background-Robust Feature Learning for Remote Sensing Instance Segmentation Under Noise and Clutter. Remote Sens. 2025, 17, 125. [Google Scholar] [CrossRef]
  17. Feng, H.; Zhou, K.; Zhou, W.; Yin, Y.; Deng, J.; Sun, Q.; Li, H. Recurrent generic contour-based instance segmentation with progressive learning. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7947–7961. [Google Scholar] [CrossRef]
  18. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  19. Strang, G. The discrete cosine transform. SIAM Rev. 1999, 41, 135–147. [Google Scholar] [CrossRef]
  20. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  21. Cheng, T.; Wang, X.; Chen, S.; Zhang, W.; Zhang, Q.; Huang, C.; Zhang, Z.; Liu, W. Sparse instance activation for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 4433–4442. [Google Scholar]
  22. Dong, B.; Zeng, F.; Wang, T.; Zhang, X.; Wei, Y. Solq: Segmenting objects by learning queries. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 21898–21909. [Google Scholar]
  23. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  24. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 1290–1299. [Google Scholar]
  25. Xie, E.; Sun, P.; Song, X.; Wang, W.; Liu, X.; Liang, D.; Shen, C.; Luo, P. Polarmask: Single shot instance segmentation with polar representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12193–12202. [Google Scholar]
  26. Riaz, H.U.M.; Benbarka, N.; Zell, A. Fouriernet: Compact mask representation for instance segmentation using differentiable shape decoders. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 7833–7840. [Google Scholar]
  27. Su, H.; Huang, P.; Yin, J.; Zhang, X. Faster and Better Instance Segmentation for Large Scene Remote Sensing Imagery. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 2187–2190. [Google Scholar]
  28. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2403–2412. [Google Scholar]
  29. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  30. Shi, F.; Zhang, T. An Anchor-Free Network With Box Refinement and Saliency Supplement for Instance Segmentation in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6516205. [Google Scholar] [CrossRef]
  31. Su, H.; Wei, S.; Liu, S.; Liang, J.; Wang, C.; Shi, J.; Zhang, X. HQ-ISNet: High-quality instance segmentation for remote sensing imagery. Remote Sens. 2020, 12, 989. [Google Scholar] [CrossRef]
  32. Zhao, J.; Wang, Y.; Zhou, Y.; Du, W.L.; Yao, R.; El Saddik, A. GLFRNet: Global-Local Feature Refusion Network for Remote Sensing Image Instance Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5610112. [Google Scholar] [CrossRef]
  33. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Volume 37, pp. 103031–103063. [Google Scholar]
  34. Yu, D.; Ji, S. A Novel Shape Guided Transformer Network for Instance Segmentation in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 8325–8339. [Google Scholar] [CrossRef]
  35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  36. Liu, Z.; Liew, J.H.; Chen, X.; Feng, J. Dance: A deep attentive contour model for efficient instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 345–354. [Google Scholar]
  37. Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 28–37. [Google Scholar]
  38. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  39. Wu, K.; Zheng, D.; Chen, Y.; Zeng, L.; Zhang, J.; Chai, S.; Xu, W.; Yang, Y.; Li, S.; Liu, Y.; et al. A dataset of building instances of typical cities in China. Chin. Sci. Data 2021, 6, 182–190. [Google Scholar] [CrossRef]
  40. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar] [CrossRef]
  41. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  42. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft Coco: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  43. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. Remoteclip: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216. [Google Scholar] [CrossRef]
  44. Pham, N.T. On the prompt sensitivity of contrastive vision-language models. In Proceedings of the NeurIPS Workshop, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  45. Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv 2015, arXiv:1505.00853. [Google Scholar]
  46. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  47. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  48. Zeng, X.; Wei, S.; Shi, J.; Zhang, X. A Lightweight Adaptive RoI Extraction Network for Precise Aerial Image Instance Segmentation. IEEE Trans. Instrum. Meas. 2021, 70, 5018617. [Google Scholar] [CrossRef]
  49. Kumar, D. Accurate object detection & instance segmentation of remote sensing imagery using cascade Mask R-CNN with HRNet backbone. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4502105. [Google Scholar]
  50. Gong, L.; Huang, X.; Chen, J.; Xiao, M.; Chao, Y. Multiscale leapfrog structure: An efficient object detector architecture designed for unmanned aerial vehicles. Eng. Appl. Artif. Intell. 2024, 127, 107270. [Google Scholar] [CrossRef]
  51. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9799–9808. [Google Scholar]
  52. Yu, D.; Ji, S. Shape-Guided Transformer for Instance Segmentation in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 125–138. [Google Scholar] [CrossRef]
  53. Su, H.; Wei, S.; Liu, S.; Liang, J.; Wang, C.; Shi, J.; Zhang, X. HQ-ISNet-v2: High-Quality Instance Segmentation with Dual-Scale Mask Refinement for Remote Sensing Imagery. Remote Sens. 2024, 16, 420. [Google Scholar]
  54. Zhang, T.; Zhang, X.; Li, J.; Shi, J. Contextual squeeze-and-excitation mask r-cnn for sar ship instance segmentation. In Proceedings of the 2022 IEEE Radar Conference (RadarConf22), Paris, France, 24–29 April 2022; IEEE: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
  55. Zhang, T.; Zhang, X. Enhanced Mask Interaction Network for SAR Ship Instance Segmentation. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 3508–3511. [Google Scholar] [CrossRef]
  56. Zhang, T.; Zhang, X. A Full-Level Context Squeeze-and-Excitation ROI Extractor for SAR Ship Instance Segmentation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4506705. [Google Scholar] [CrossRef]
  57. Zhang, T.; Zhang, X. A Mask Attention Interaction and Scale Enhancement Network for SAR Ship Instance Segmentation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4511005. [Google Scholar] [CrossRef]
  58. Gao, F.; Huo, Y.; Wang, J.; Hussain, A.; Zhou, H. Anchor-Free SAR Ship Instance Segmentation With Centroid-Distance Based Loss. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11352–11371. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of CGNet. CGNet consists of three parts: the backbone network, the detection head, and the segmentation head. The segmentation head includes a mask branch and a contour branch; the two branches undergo iterative refinement using ConvGRU, and their results are fused through an attention head. Additionally, the backbone network of CGNet receives auxiliary supervision from the CLIP text encoder.
Figure 2. The workflow of the CLIP-enhanced backbone and supervision.
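As a rough illustration of the kind of auxiliary supervision depicted in Figure 2, the sketch below aligns projected backbone features with frozen CLIP text embeddings of per-class prompts through a pixel-wise classification objective. The function name, the 1×1 projection, the temperature value, and the exact loss form are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_pixel_supervision_loss(feat: torch.Tensor,
                                text_embeds: torch.Tensor,
                                pixel_labels: torch.Tensor,
                                proj: nn.Conv2d,
                                tau: float = 0.07) -> torch.Tensor:
    """Illustrative auxiliary loss (assumed form): push each labeled pixel's projected
    feature toward the CLIP text embedding of its class prompt.

    feat:         (B, C, H, W) backbone feature map
    text_embeds:  (K, D) frozen CLIP text embeddings, one per class prompt
    pixel_labels: (B, H, W) integer class map, -1 for ignored pixels
    proj:         1x1 convolution mapping C visual channels to the CLIP dimension D
    """
    v = F.normalize(proj(feat), dim=1)                   # (B, D, H, W)
    t = F.normalize(text_embeds, dim=1)                  # (K, D)
    logits = torch.einsum("bdhw,kd->bkhw", v, t) / tau   # per-pixel similarity to each class prompt
    return F.cross_entropy(logits, pixel_labels, ignore_index=-1)
```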
Figure 3. An illustration of the iteration module in CGNet.
Figure 4. A detailed display of ConvGRU.
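For readers unfamiliar with the recurrent unit shown in Figure 4, a generic convolutional GRU cell, i.e., the standard GRU of [18] with its linear maps replaced by 2D convolutions, can be written as follows. The kernel size and channel choices here are placeholders and not necessarily those used in CGNet.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Generic ConvGRU cell: a GRU whose linear transforms are 2D convolutions,
    so the hidden state is a feature map that can be refined across iterations."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # update (z) and reset (r) gates, computed jointly from input and hidden state
        self.gates = nn.Conv2d(in_channels + hidden_channels, 2 * hidden_channels,
                               kernel_size, padding=padding)
        # candidate hidden state
        self.cand = nn.Conv2d(in_channels + hidden_channels, hidden_channels,
                              kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde   # blend of previous state and candidate
```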
Figure 5. Visualization results in the NWPU VHR-10 dataset.
Figure 6. Visualization results in the SSDD dataset.
Table 1. Instance segmentation results (mask APs) in the NWPU VHR-10 dataset (%). Bold numbers indicate the best results.
Model | Backbone | AP | AP50 | AP75 | APS | APM | APL | Params (M)
* YOLACT [1] | ResNet-50-FPN | 43.3 | 77.5 | 44.0 | 23.1 | 40.5 | 54.3 | 50.7
Mask R-CNN [7] | ResNet-50-FPN | 54.9 | 83.0 | 65.2 | 61.3 | 56.1 | 37.2 | 44.2
Cascade Mask R-CNN [8] | ResNet-50-FPN | 58.9 | 93.7 | 65.4 | 45.1 | 57.6 | 69.2 | 77.3
PointRend [51] | ResNet-50-FPN | 61.1 | 90.1 | 64.7 | 52.8 | 59.9 | 61.0 | 44.4
ARE-Net [48] | ResNet-101-FPN | 64.8 | 93.2 | 71.5 | 53.9 | 65.3 | 72.9 | -
Kumar [49] | HRNetV2p-W32 | 65.1 | 91.9 | 71.5 | 49.5 | 64.7 | 69.8 | -
Shi and Zhang [30] | ResNet-FPN | 65.2 | 94.9 | 72.1 | 49.4 | 65.7 | 71.2 | -
* FB-ISNet [27] | DLA-BiFPN | 67.0 | 94.5 | 72.0 | - | - | - | -
HQ-ISNet [31] | HRFPN-W40 | 67.2 | 94.6 | 74.2 | 51.9 | 67.8 | 77.5 | 95.6
* YOLOv5s-MLS [50] | - | 57.2 | 95.5 | - | - | - | - | -
Vmamba-IR [33] | Vmamba-B | 67.0 | 91.8 | 74.9 | 56.3 | 65.5 | 75.3 | 82.6
SG-Former [52] | Swin-T + shape head | 66.5 | 91.4 | 74.2 | 55.9 | 64.8 | 74.1 | 79.4
GLFRNet [32] | ResNeXt-64×4d | 65.9 | 90.7 | 73.8 | 54.7 | 64.0 | 73.5 | 71.3
HQ-ISNet-v2 [53] | HRNetV2-W48 | 66.8 | 92.1 | 74.6 | 56.1 | 65.2 | 74.4 | 97.2
CGNet | DLASeg | 68.1 | 92.9 | 76.1 | 58.7 | 66.2 | 74.9 | 64.2
Models with * are one-stage models, while others are two-stage models.
Table 2. Instance segmentation results (mask APs) in the SSDD dataset (%). Bold numbers indicate the best results.
Model | Backbone | AP | AP50 | AP75 | APS | APM | APL | Params (M)
* YOLACT [1] | ResNet-50-FPN | 57.8 | 91.4 | 70.9 | 58.4 | 56.5 | 58.6 | 50.7
Mask R-CNN [7] | ResNet-50-FPN | 64.8 | 94.3 | 81.7 | 66.7 | 59.0 | 19.4 | 44.2
Cascade Mask R-CNN [8] | ResNet-50-FPN | 65.5 | 94.3 | 82.3 | 66.7 | 62.2 | 40.1 | 77.3
PointRend [51] | ResNet-50-FPN | 65.6 | 94.5 | 82.3 | 67.0 | 62.2 | 16.8 | 44.4
C-SE Mask R-CNN [54] | ResNet-50-FPN | 58.6 | 89.2 | 71.2 | 58.3 | 60.7 | 26.7 | 45.6
EMIN [55] | ResNet-101-FPN | 61.7 | 94.3 | 76.8 | 62.1 | 61.3 | 61.3 | 60.1
FL-CSE-ROIE [56] | ResNet-101-FPN | 62.6 | 93.7 | 78.3 | 63.3 | 61.2 | 75.0 | 61.4
MAI-SE-Net [57] | ResNet-101-FPN | 63.0 | 94.4 | 77.6 | 63.3 | 62.5 | 47.7 | 60.3
HQ-ISNet [31] | HRNetV2-W40 | 57.6 | 86.0 | 72.6 | 56.7 | 61.3 | 50.2 | 42.1
* SA R-CNN [58] | ResNet-50-GCB-FPN | 59.4 | 90.4 | 77.6 | 63.3 | 62.5 | 47.7 | 48.9
LFG-Net [11] | ResNeXt-64×4d | 64.2 | 95.0 | 81.1 | 63.1 | 68.2 | 43.1 | 55.7
SG-Former [52] | Swin-T + shape head | 64.2 | 91.7 | 80.9 | 61.8 | 71.6 | 77.3 | 79.4
GLFRNet [32] | ResNeXt-64×4d | 63.8 | 91.2 | 80.3 | 60.9 | 70.8 | 76.5 | 71.3
CGNet | DLASeg | 67.4 | 94.4 | 84.4 | 64.5 | 75.1 | 80.2 | 64.2
Models with * are one-stage models, while others are two-stage models.
Table 3. Ablation experiments on the NWPU VHR-10 dataset. Bold numbers indicate the best results.
Ablation | AP | AP50 | AP75 | APS | APM | APL
ConvLSTM | 67.6 | 91.4 | 75.8 | 57.5 | 66.2 | 72.8
ConvGRU | 68.1 | 92.9 | 76.1 | 58.7 | 66.2 | 74.9
MLP fusion head | 67.5 | 92.1 | 76.4 | 49.7 | 66.0 | 76.1
Attention fusion head | 68.1 | 92.9 | 76.1 | 58.7 | 66.2 | 74.9
No CLIP supervision | 67.2 | 91.6 | 75.2 | 57.3 | 65.1 | 73.3
With CLIP supervision | 68.1 | 92.9 | 76.1 | 58.7 | 66.2 | 74.9
Table 4. Iteration-layer ablation on the SSDD dataset. Bold numbers indicate the best results.
K | AP | AP50 | AP75 | APS | APM | APL
4 | 63.9 | 88.3 | 70.3 | 51.3 | 62.0 | 69.3
6 | 65.8 | 90.8 | 73.2 | 55.3 | 65.3 | 72.3
9 | 66.6 | 91.3 | 73.7 | 55.5 | 64.7 | 71.7
12 | 67.4 | 94.4 | 84.4 | 64.5 | 75.1 | 80.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
