1. Introduction
As the need for human–robot interaction grows, computer vision has increasingly embraced cross-modal fusion, a technique that combines different data types such as visual and linguistic information. This interdisciplinary approach has given rise to various tasks, including referring image segmentation (RIS), visual question answering, and video–text retrieval [1,2,3]. This paper specifically addresses RIS, a fundamental yet complex challenge within the multimodal domain. RIS requires a deep understanding of both visual and linguistic elements to segment specific instances within an image. Unlike other tasks, RIS demands a nuanced comprehension of the image’s content to identify and segment the region mentioned in a given textual description [4,5]. Typically, the region of interest in RIS is an object or substance within the image, and the accompanying text provides clues about the object’s action, category, color, location, and other attributes. The overall process is shown in Figure 1. This task requires a model that not only interprets the text but also correlates it with the visual elements of the image to accurately delineate the specified area [4,5,6].
Researchers have traditionally approached RIS by adapting standard image segmentation pipelines: features are extracted from both visual and textual inputs and then integrated into a composite set of multimodal features that informs the prediction of segmentation masks [4,7,8]. Among these works, Chen et al. and Li et al. adopted iterative approaches that refine the segmentation mask by gradually integrating visual and textual features [5,9], while Ding et al., Feng et al., and Hu et al. introduced attention mechanisms to enhance the model’s focus on key visual regions and textual descriptions [10,11,12]. Liu et al. and Margffoy-Tuay et al. emphasized the dynamic interaction of visual and textual information during segmentation [6,13], while Shi et al. and Ye et al. focused on extracting key information from textual descriptions to better integrate it with visual information [14,15]. Conventional models for image segmentation typically use dense binary classification networks to determine the membership of each pixel to the target object [16,17,18,19]. However, this pixel-centric approach often overlooks the relational structure between global pixels, which is crucial for accurate segmentation. To address this limitation, Liu et al. introduced PolyFormer [20], a Transformer-based sequence-to-sequence (seq2seq) framework. PolyFormer represents segmentation masks not as dense binary classifications but as a collection of sparsely distributed polygonal vertices that trace the contours of the objects referenced in the textual descriptions, offering a more structurally aware approach to segmentation [21,22]. In a similar vein, Acuna et al. and Castrejon et al. used polygon vertex prediction for instance segmentation; these methods provide a more structured and sparse representation by predicting the vertices of an object’s contour instead of performing pixel-level classification [23,24]. Liang et al. extended this idea by combining a mask generated by a segmentation network with a deformation network that optimizes polygons, allowing them to fit object boundaries more accurately [25]. In addition, Xie et al. used a polar coordinate system to represent and predict object contours, which performs well for objects with complex shapes and orientation changes [26]. PolyFormer’s prediction framework has led to significant performance improvements, achieving state-of-the-art (SOTA) results on the widely recognized RefCOCO [27], RefCOCO+ [27], and RefCOCOg [28] datasets. Despite these advancements, PolyFormer’s extensive parameter count and substantial training data requirements present new challenges, particularly for optimization on consumer-grade graphics cards. Moreover, in some complex scenarios, PolyFormer detects the target correctly but produces an incorrect segmentation result, as shown in Figure 2.
To address these challenges, we developed a strategy that exploits the fact that a Kolmogorov–Arnold network (KAN) [29] can serve as a drop-in substitute for a multi-layer perceptron (MLP). Our solution builds upon the PolyFormer framework by integrating a KAN decoder branch that functions as a classification head. This branch is structurally similar to the existing MLP decoder branch and utilizes the multimodal features generated by the same encoder. To enhance the KAN decoder branch’s performance, we introduced a multiscale feature fusion module that combines features extracted from the image and text encoders with those processed by the PolyFormer encoder. These multiscale features enrich the model’s ability to interpret and segment the image based on the textual description. At the same time, to minimize the training costs associated with the PolyFormer framework, we adopted a freeze–fine-tuning approach: the encoders, which constitute a significant portion of PolyFormer’s parameters, are frozen, and only the newly added KAN decoder branch is fine-tuned. This strategy allows for efficient training while preserving the model’s foundational capabilities. Furthermore, to fully harness the potential of both the MLP and KAN classification heads, we designed a dual-branch decoder architecture based on the PolyFormer framework and employ an ensemble learning strategy. By combining the insights from both branches, we aim to maximize the model’s predictive accuracy and robustness in segmenting the image regions described in the text. While the dual-branch decoder, enhanced by the KAN classification head, offers some improvement in segmentation accuracy, the impact is not as pronounced as desired. The limitation arises from the similarity of the features derived by the two decoder branches during the encoding stage, which leads to minimal variance in their respective predictions. To fully capitalize on the ensemble learning strategy, it is essential to incorporate a segmentation model that can effectively complement the outputs of PolyFormer. Recognizing the compatibility between our decoder’s output and the input requirements of the segment anything model (SAM) [30], we devised an approach to further enhance image segmentation. SAM, developed by Meta, is a prompt-based model renowned for its zero-shot segmentation capabilities, segmenting arbitrary images from simple prompts such as points, boxes, or masks. Leveraging this compatibility, we introduced a SAM-based algorithmic framework tailored for seq2seq RIS. This framework uses our decoder’s output as input prompts for SAM and integrates SAM’s predictions with the outputs of our dual decoder to enhance segmentation performance in RIS tasks.
To substantiate the efficacy of our SAM-complemented dual-decoder ensemble RIS framework, we conducted comprehensive experiments on the widely recognized public datasets: RefCOCO, RefCOCO+, and RefCOCOg. These experiments benchmarked our method against other leading models in the field, demonstrating that it achieves SOTA performance. Additionally, we conducted ablation experiments to gain a deeper understanding of the contributions of individual components within our framework. These experiments highlighted how each element contributes to the overall improvement in segmentation performance.
In this paper, our contributions are fourfold:
A novel hybrid framework for referring image segmentation: our dual-decoder model with SAM complementation achieves better referring image segmentation results than other SOTA models.
A novel dual-decoder framework with a KAN is proposed to improve the prediction accuracy of the coordinate points along the segmentation target’s edges.
We propose a SAM-based segmentation completion module for referring image segmentation, which further complements the segmentation results predicted by our decoder.
We apply our framework to referring image segmentation on three open datasets and surpass other state-of-the-art methods.
3. Method
3.1. Overall Framework
In our research, we introduce a novel algorithmic framework that integrates a dual decoder with SAM complementation, as illustrated in Figure 3. This framework addresses the shortcomings of conventional referring image segmentation techniques, particularly when faced with intricate visual scenes. It also aims to overcome the challenges posed by current models, especially their parameter counts and the volume of training data they require. The core innovation of this framework lies in enhancing the precision and reliability of image segmentation through the synergistic application of SAM and sequence-to-sequence methods, coupled with a dual-decoder architecture. The framework employs an ensemble learning strategy, allowing for a more holistic and nuanced understanding of the image and its associated text.
The process begins with a multimodal feature extraction encoder, which includes a visual encoder based on the Swin Transformer. This encoder efficiently extracts both local and global features from the image using its hierarchical structure and windowing mechanism. In parallel, a text encoder based on BERT captures the semantic information of the input text. These visual and textual features are fused through a fully connected layer and a projection operation to form a unified multimodal feature representation, laying the groundwork for segmentation. We implement a two-branch decoder structure with an MLP and a KAN as classification heads. The MLP decoder predicts continuous floating-point coordinates directly, avoiding quantization errors, while the KAN decoder enhances the model’s ability to capture multi-granularity information through a multiscale feature fusion module. The outputs from both decoders are merged through an ensemble learning strategy to improve segmentation accuracy. To further enhance segmentation performance under the constraint of limited model parameters, we introduce a SAM-based segmentation target complementation module. This module uses SAM’s zero-shot learning capability to refine the ensemble results of the dual decoder. When the dual-decoder results are suboptimal, prompt data are fed into SAM via a judgment module and a positive sample point generation module. SAM then produces complementary results that are refined by a noise processing module, enhancing the accuracy of the final segmentation.
This comprehensive approach significantly improves the accuracy and robustness of image segmentation by combining advanced visual and textual coding techniques with a dual-decoder architecture and an ensemble learning strategy, all complemented by SAM’s powerful segmentation capabilities.
3.2. Encoder
The proposed hybrid framework encoder consists of a visual encoder, a text encoder, and a multimodal Transformer encoder.
The visual encoder is built upon the Swin Transformer, a Transformer-based architecture adept at handling image processing tasks. It extracts image features via a hierarchical structure and a windowing mechanism, capturing both local and global information within an image. Specifically, we utilize the fourth-stage features extracted by the Swin Transformer as our visual representation, with dimensions $H \times W \times C$, where $H$ and $W$ are the height and width of the feature map and $C$ is the channel depth.
The text encoder employs BERT, a pre-trained deep bidirectional Transformer widely used in natural language processing. BERT’s proficiency in extracting rich semantic features from text grants our model a nuanced comprehension of referential expressions. For the input linguistic description, BERT generates linguistic embedding features with dimensions $L \times C_t$, where $L$ is the length of the text and $C_t$ is the embedding dimension.
To effectively merge the visual and textual features, the image feature map is first flattened into a sequence and then projected into the same embedding space as the text feature through a fully connected layer. The visual and textual features are resized to a uniform dimension by learned projection matrices and bias vectors, and the transformed features are concatenated to form a comprehensive multimodal feature sequence. This sequence is then processed by a multimodal Transformer encoder consisting of N Transformer layers, each equipped with a multi-head self-attention mechanism, layer normalization, and a feed-forward network. Additionally, we incorporate absolute position encoding and relative position bias for both image and text features, ensuring that positional information is conserved throughout the encoding process.
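As a rough illustration of this fusion step, the following PyTorch sketch projects flattened Swin features and BERT embeddings into a shared space and concatenates them before the multimodal Transformer encoder; the module names, feature dimensions, and layer count are illustrative assumptions rather than the exact configuration used in our implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusionEncoder(nn.Module):
    """Sketch of the visual/text fusion described above (dimensions are assumptions)."""
    def __init__(self, vis_dim=1024, txt_dim=768, embed_dim=256, num_layers=6, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)   # learned matrix + bias for visual features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # learned matrix + bias for textual features
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, H, W, C) from the Swin backbone; txt_feat: (B, L, C_t) from BERT
        B, H, W, C = vis_feat.shape
        vis_seq = vis_feat.view(B, H * W, C)            # flatten the spatial grid into a sequence
        fused = torch.cat([self.vis_proj(vis_seq),      # project both modalities to embed_dim
                           self.txt_proj(txt_feat)], dim=1)
        # position encodings are omitted here for brevity
        return self.encoder(fused)                      # (B, H*W + L, embed_dim)

# Example usage with dummy tensors
enc = MultimodalFusionEncoder()
out = enc(torch.randn(2, 16, 16, 1024), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 276, 256])
```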
3.3. Dual-Decoder
The decoder of our model is a sophisticated two-branch structure featuring an MLP and a KAN as the classification heads. The MLP classification head is equipped with a regression-based Transformer decoder that predicts continuous floating-point coordinates directly. This approach circumvents the quantization errors typical of methods that discretize coordinates, ensuring a more precise representation of the segmented object’s edges. The objective of this decoder is to forecast a sequence of coordinates that delineate the vertices of the target polygon, effectively outlining the target object. It is composed of N Transformer decoder layers, each incorporating a multi-head self-attention mechanism, a multi-head cross-attention layer, and a feed-forward network. These components work in concert to discern the relationships between the multimodal features and the 2D coordinate embeddings. In contrast to traditional techniques, our framework, following PolyFormer, devises a 2D coordinate codebook to accurately capture the embedding of any floating-point coordinate. The codebook is defined as $\mathcal{D} \in \mathbb{R}^{H_b \times W_b \times C_e}$, where $H_b$ and $W_b$ denote the height and width of the coordinate grid, and $C_e$ is the embedding dimension. The decoder retrieves the precise embedding of a floating-point coordinate $(x, y)$ from the surrounding grid points through bilinear interpolation:

$$e_{(x,y)} = (1-\alpha)(1-\beta)\,\mathcal{D}[\lfloor x \rfloor, \lfloor y \rfloor] + \alpha(1-\beta)\,\mathcal{D}[\lceil x \rceil, \lfloor y \rfloor] + (1-\alpha)\beta\,\mathcal{D}[\lfloor x \rfloor, \lceil y \rceil] + \alpha\beta\,\mathcal{D}[\lceil x \rceil, \lceil y \rceil],$$

where $\alpha = x - \lfloor x \rfloor$ and $\beta = y - \lfloor y \rfloor$ are the fractional offsets of the coordinate within its grid cell.
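A minimal PyTorch sketch of this codebook lookup is given below; the grid size, embedding dimension, and tensor names are assumptions made for illustration, not the exact values used by PolyFormer or by our framework.

```python
import torch
import torch.nn as nn

class CoordinateCodebook(nn.Module):
    """2D coordinate codebook with bilinear interpolation (illustrative sizes)."""
    def __init__(self, grid_h=64, grid_w=64, embed_dim=256):
        super().__init__()
        # learnable embedding for every grid vertex: (H_b, W_b, C_e)
        self.codebook = nn.Parameter(torch.randn(grid_h, grid_w, embed_dim))

    def forward(self, coords):
        # coords: (N, 2) floating-point (x, y) positions on the grid
        x, y = coords[:, 0], coords[:, 1]
        x0, y0 = x.floor().long(), y.floor().long()
        x1 = (x0 + 1).clamp(max=self.codebook.shape[0] - 1)
        y1 = (y0 + 1).clamp(max=self.codebook.shape[1] - 1)
        a = (x - x0.float()).unsqueeze(-1)   # fractional offset alpha
        b = (y - y0.float()).unsqueeze(-1)   # fractional offset beta
        # weighted sum of the four surrounding codebook entries
        return ((1 - a) * (1 - b) * self.codebook[x0, y0]
                + a * (1 - b) * self.codebook[x1, y0]
                + (1 - a) * b * self.codebook[x0, y1]
                + a * b * self.codebook[x1, y1])

emb = CoordinateCodebook()(torch.tensor([[3.25, 10.75], [0.0, 0.5]]))
print(emb.shape)  # torch.Size([2, 256])
```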
The MLP decoder branch concludes with a design that yields precise coordinate predictions for the segmentation target’s edge points. Specifically, the final output from the last Transformer decoder layer is processed by two lightweight prediction heads, both constructed from multi-layer perceptrons (MLPs). The category prediction head is a two-layer MLP that outputs a marker type. This marker indicates the nature of the current output, whether it is a coordinate marker, a separator marker, or an end-of-sequence marker, which is crucial for understanding the structure of the segmentation output. Simultaneously, the coordinate prediction head is a three-layer feed-forward network tasked with predicting the top-left and bottom-right coordinates of the segmented target’s bounding box, as well as the 2D coordinates of the segmented target’s edge vertices. This allows for a detailed and accurate delineation of the target object within the image.
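The sketch below illustrates one plausible layout of these two heads, assuming a hidden size of 256 and a three-way token vocabulary (coordinate, separator, end-of-sequence); the exact widths and token inventory in our implementation may differ.

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Category head (2-layer MLP) and coordinate head (3-layer FFN); sizes are assumptions."""
    def __init__(self, hidden_dim=256, num_token_types=3):
        super().__init__()
        # classifies each decoder output as coordinate / separator / end-of-sequence
        self.category_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_token_types),
        )
        # regresses a continuous (x, y) pair: box corners or polygon vertices
        self.coord_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2), nn.Sigmoid(),  # normalized coordinates in [0, 1]
        )

    def forward(self, decoder_states):
        # decoder_states: (B, T, hidden_dim) from the last Transformer decoder layer
        return self.category_head(decoder_states), self.coord_head(decoder_states)
```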
The KAN decoder branch, while fundamentally similar to the MLP branch, introduces two main structural differences. First, the prediction head is a KAN instead of an MLP, potentially offering a more nuanced modeling of the multimodal features. Second, the KAN branch incorporates a multiscale feature fusion module at the decoder’s input stage. This module merges the initial multimodal features extracted from the visual and text encoders with those from the multimodal Transformer encoder, providing a more comprehensive feature set for the decoder to process. The KAN branch decoder is intended to complement the output of the MLP branch decoder quickly, even when trained on a consumer graphics card. The inclusion of the multiscale feature fusion module allows the decoder to ingest unfused visual and textual features, capturing coarse-grained information and offering a prediction result differentiated from that of the MLP decoder branch. As shown in Figure 4, to further improve the performance of the framework with limited hardware resources, we use only the KAN decoder branch output as the final output of the framework during the training phase, while freezing the rest of the parameters and training only the KAN decoder branch. The primary structure of the KAN decoder is depicted in Figure 5. This design ensures that our framework is not only accurate but also efficient, capable of handling complex segmentation tasks with limited computational resources. The dual-branch strategy, with ensemble learning over the MLP and KAN decoders, significantly enhances the segmentation performance of our framework.
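As a sketch of this freeze–fine-tune setup, the snippet below freezes every parameter except those of a hypothetical `kan_decoder_branch` submodule and builds an optimizer over the remaining trainable weights; the attribute name and optimizer choice are placeholders, not the actual identifiers in our code.

```python
import torch

def freeze_except_kan_branch(model, lr=1e-4):
    """Freeze all parameters except the KAN decoder branch (names are placeholders)."""
    for name, param in model.named_parameters():
        # only parameters under the KAN decoder branch stay trainable
        param.requires_grad = name.startswith("kan_decoder_branch")
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```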
3.4. Dual-Decoder Output Ensemble Module
Following the dual-branch decoder’s output, our framework employs an ensemble learning strategy to synthesize the results, prioritizing the MLP branch’s output as the primary result and using the KAN branch’s output as a supplementary one. This integration is designed to leverage the strengths of both branches and enhance overall segmentation accuracy, as shown in Algorithm 1. The process begins by calculating the average Euclidean distance between the predicted boundary points of the segmentation target in each output. This calculation is crucial because it provides a measure of the precision of the segmentation predictions: as illustrated by the characteristics of incorrectly segmented images in Figure 2, erroneous predictions tend to have smaller average point distances. In the next step, the framework compares the average point distances of the two branches’ predictions. The rationale behind this comparison is to select the prediction that best captures the segmentation target’s boundary: the output with the larger average point distance is deemed to delineate the target’s edges better and is therefore chosen as the final prediction of the model. This method ensures that the final segmentation result is not only accurate but also robust, capable of handling variations in image complexity and segmentation difficulty. By prioritizing the output with the larger average point distance, the framework optimizes for a more precise and detailed segmentation, which is essential for tasks requiring high accuracy. This output also feeds the subsequent SAM complementation module.
Algorithm 1 Dual-branch decoder output ensemble module.
Input: Output_MLP, Output_KAN, N. Output: Output_Final.
procedure Integrate(Output_MLP, Output_KAN)
    Dis_MLP ← average Euclidean distance between the predicted boundary points of Output_MLP
    Dis_KAN ← average Euclidean distance between the predicted boundary points of Output_KAN
    if Dis_MLP < N and Dis_MLP < Dis_KAN then
        return Output_KAN    ▹ Prefer the KAN output for better boundary delineation
    else
        return Output_MLP    ▹ Prefer the MLP output for better boundary delineation
    end if
end procedure
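A compact Python rendering of Algorithm 1 is shown below, assuming the "average Euclidean metric" is the mean distance between consecutive predicted polygon vertices; that interpretation, together with the function names and output structure, is our own illustrative assumption.

```python
import numpy as np

def mean_point_distance(vertices):
    """Mean Euclidean distance between consecutive boundary vertices (assumed metric)."""
    pts = np.asarray(vertices, dtype=float)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).mean())

def integrate_outputs(output_mlp, output_kan, n_threshold=4.0):
    """Algorithm 1: prefer the KAN branch when the MLP boundary points collapse together."""
    dis_mlp = mean_point_distance(output_mlp["vertices"])
    dis_kan = mean_point_distance(output_kan["vertices"])
    if dis_mlp < n_threshold and dis_mlp < dis_kan:
        return output_kan   # KAN output delineates the boundary better
    return output_mlp       # otherwise keep the primary MLP output
```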
3.5. SAM-Based Segmentation Completion Module
Given the constraints that prevent fine-tuning the full set of model parameters, our two-branch decoder ensemble occasionally produces subpar segmentations on certain datasets. To counter this, we incorporate the segment anything model (SAM), leveraging its robust zero-shot learning capability to enhance the model’s segmentation outcomes. As depicted in Figure 6, the implementation involves a sequence of modules: a judgment module, a positive sample point generation module, the SAM module itself, and a final noise processing module. The SAM-based segmentation target complementation module starts from the decoder’s output results and formulates the prompt data required for SAM. The type of prompt data is decided by the average point distance of the input results: (1) if the average point distance is small, only the predicted bounding box coordinates are used as the SAM prompt data; (2) if the average point distance is large, randomly selected positive sample points inside the segmentation target are used alongside the predicted bounding box coordinates as the SAM prompt data.
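The following sketch shows how such prompts could be passed to SAM through the official segment-anything predictor interface; the checkpoint path, model variant, thresholds, and the routine that approximates interior positive points are placeholders simplified for illustration rather than our exact implementation.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# one-time setup; model type and checkpoint path are placeholders
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def sam_complement(image_rgb, box_xyxy, polygon, avg_point_dist,
                   dist_threshold=4.0, num_points=2):
    """Build SAM prompt data from the decoder output and run SAM once."""
    predictor.set_image(image_rgb)  # H x W x 3 uint8 RGB image

    point_coords, point_labels = None, None
    if avg_point_dist >= dist_threshold:
        # approximate interior points by moving random vertices halfway toward the centroid
        poly = np.asarray(polygon, dtype=float)
        centroid = poly.mean(axis=0)
        idx = np.random.choice(len(poly), size=num_points, replace=False)
        point_coords = (poly[idx] + centroid) / 2.0
        point_labels = np.ones(num_points, dtype=int)  # 1 = positive (foreground) point

    masks, _, _ = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        box=np.asarray(box_xyxy, dtype=float),
        multimask_output=False,
    )
    return masks[0]  # binary mask used to complement the dual-decoder prediction
```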
However, these supplementary results from SAM may contain errors or noise, as observed in Figure 7. These errors often manifest as multiple prediction blocks. To address this, we developed a block counting module that uses a breadth-first search to count the number of segmentation blocks in SAM’s output. The decision-making process is as follows: (1) if the count of segmentation blocks exceeds a predefined threshold, the SAM result is discarded, and the two-branch decoder’s prediction is taken as the final segmentation result; (2) if the count does not exceed the threshold, SAM’s output is merged with the two-branch decoder’s prediction to form a new, refined segmentation result.
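A straightforward implementation of such a block counter is sketched below; it runs a 4-connected breadth-first search over the binary mask (the connectivity choice is our assumption) and can be thresholded as described above.

```python
from collections import deque
import numpy as np

def count_blocks(mask):
    """Count connected foreground regions in a binary mask via 4-connected BFS."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    blocks = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not visited[sy, sx]:
                blocks += 1
                queue = deque([(sy, sx)])
                visited[sy, sx] = True
                while queue:                      # flood-fill one block
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            queue.append((ny, nx))
    return blocks

# e.g., discard the SAM result when count_blocks(sam_mask) > 15 (the threshold used in Section 4.1)
```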
This approach ensures that while SAM provides valuable complementary information, its integration is conditional on the quality of its output, thereby maintaining the integrity and accuracy of the final segmentation result. This method exemplifies our commitment to creating a flexible and adaptive framework that maximizes the strengths of multiple models while mitigating their weaknesses.
4. Experiment
4.1. Experimental Settings and Evaluation Indicators
4.1.1. Experimental Settings
The model in this paper is constructed using PyTorch. The input image size of the model is . The learning rate is set to 0.0001 for fine-tuning the KAN decoder branch module, and the number of training epochs is 100. During training, only the parameters of the KAN decoder branch are updated, while the parameters of the remaining model modules are frozen. In the testing phase, the mean Euclidean distance threshold is set to 4, the number of generated positive points is set to 2, and the discard threshold of the SAM prediction noise module is set to 15. In particular, the mean Euclidean distance threshold is set according to the results in Figure 8 and corresponds to N in Algorithms 1 and 2. For the overall model architecture, we use Swin-B as the visual encoder and BERT as the text encoder. The multimodal Transformer encoder and each branch of the two-branch decoder consist of six encoder layers and six decoder layers, respectively. Additionally, the SAM model is chosen to correspond to the Swin-B visual encoder.
Algorithm 2 Two-branch decoder integration with SAM.
Input: Decoder_output, Dis_output, N, Block_N. Output: Final_mask.
procedure SAMComplement(Decoder_output, Dis_output)
    if Dis_output < N then
        Prompt_data ← Box(Decoder_output)    ▹ Use only the predicted box as the prompt
    else
        Prompt_data ← Box(Decoder_output) ∪ PositivePoints(Decoder_output)    ▹ Add positive sample points
    end if
    SAM_mask ← SAM(Prompt_data)    ▹ Run SAM with the prompt data
    Blocks ← CountBlocks(SAM_mask)    ▹ Count prediction blocks
    if Blocks > Block_N then
        return Decoder_output    ▹ Discard the SAM result
    else
        return Merge(SAM_mask, Decoder_output)    ▹ Merge the SAM result with the decoder output
    end if
end procedure
4.1.2. Evaluation Indicators
We use the mean intersection over union (mIoU) as the primary evaluation metric for referring image segmentation (RIS). For comparison with PolyFormer and other works that report the overall intersection over union (oIoU), we also report oIoU for our model.
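For reference, the two metrics can be computed from predicted and ground-truth binary masks as sketched below (mIoU averages the per-sample IoU, while oIoU pools intersections and unions over the whole split); this is a generic formulation rather than the exact evaluation script used here.

```python
import numpy as np

def miou_oiou(pred_masks, gt_masks, eps=1e-6):
    """Compute mIoU (mean of per-sample IoU) and oIoU (dataset-level IoU) for binary masks."""
    ious, total_inter, total_union = [], 0.0, 0.0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / (union + eps))
        total_inter += inter
        total_union += union
    return float(np.mean(ious)), float(total_inter / (total_union + eps))
```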
4.2. Dataset
In this study, we aimed to evaluate our referring image segmentation (RIS) model against existing architectures by testing it on three prominent RIS datasets: RefCOCO, RefCOCO+, and RefCOCOg. The RefCOCO dataset is composed of 142,209 annotated textual descriptions associated with 50,000 objects across 19,994 images. Similarly, RefCOCO+ also contains a substantial number of annotations, with 141,564 expressions for 49,856 objects in 19,992 images. A distinguishing feature of RefCOCO+ is the absence of positional words in the descriptions, which increases the complexity of the segmentation task by requiring the model to infer spatial relationships without explicit cues. Additionally, the "Test A" and "Test B" sets in RefCOCO and RefCOCO+ contain only people and only non-person objects, respectively. Expanding on this, RefCOCOg introduces an additional layer of complexity with 85,474 reference expressions for 54,822 objects within 26,711 images. The expressions in RefCOCOg were sourced through Amazon Mechanical Turk, resulting in longer and more intricate descriptions, averaging 8.4 words per expression compared to the 3.5 words in RefCOCO and RefCOCO+.
4.3. Main Results
Table 1 and Table 2 illustrate the comparative segmentation effectiveness of various classical models on these datasets. Our framework demonstrates superior performance, surpassing current state-of-the-art (SOTA) models on almost all counts. The results attest to the robustness and adaptability of our method, which excels in translating textual descriptions into precise image segmentations. Specifically, our framework performs well in terms of both mean intersection over union (mIoU) and overall intersection over union (oIoU). On the RefCOCO dataset, while our framework is slightly inferior to the recently introduced S²RM model on the Test A split, we exceed the scores achieved by S²RM on the val and Test B splits. Moreover, S²RM underperforms our model on all five splits of the more challenging RefCOCO+ and RefCOCOg datasets, and our framework outperforms S²RM on all eight splits in terms of oIoU. This indicates that S²RM is slightly better on the simpler person-only segmentation split, whereas our framework is more robust across more comprehensive and complex referring image segmentation scenarios.
The RefCOCO dataset serves as a foundational benchmark, and our model’s success here lays a solid foundation for its performance on more complex datasets. As we transition to the RefCOCO+ and RefCOCOg datasets, which are known for their increased complexity due to the lack of positional words and the intricacy of the reference expressions, our model continues to perform well. It misses PolyFormer’s oIoU score by only 0.01% on the RefCOCOg val split and exceeds the long-standing SOTA scores set by PolyFormer on the other splits. In terms of mIoU, the metric PolyFormer focuses on, our framework outperforms PolyFormer on all eight splits.
In addition, as shown in Figure 9, Figure 10 and Figure 11, for images in the RefCOCO, RefCOCO+, and RefCOCOg datasets that PolyFormer fails to segment accurately, our framework is not only able to fully complement incomplete segmentation results but is even able to correctly segment images that PolyFormer segments incorrectly. This is a significant indicator of the strong generalization capability of our framework, which can adapt to the various levels of challenge presented by different datasets. In summary, our research introduces a novel RIS model that sets new benchmarks for segmentation accuracy. Its performance across a range of datasets underscores its potential for real-world applications, where the accurate segmentation of images based on textual descriptions is crucial.
4.4. Ablation Studies
4.4.1. Model Structure Ablation Experiment
In Table 3, we detail the incremental performance improvements achieved by enhancing the PolyFormer model with our additions. The table presents a clear breakdown of how each modification contributes to the overall efficacy of the model. Initially, we used the KAN decoder branch as a standalone head for segmentation prediction. This modification resulted in a modest improvement on the RefCOCO and RefCOCO+ datasets, indicating its potential, although we observed a slight decrease in performance on the RefCOCOg dataset, suggesting the need for further refinement. Our next step was to integrate the predictions from the KAN and the existing MLP decoder branches using our ensemble learning strategy. This integration led to a consistent improvement in model performance across all three datasets, showcasing the synergistic effect of combining the strengths of both decoders. The most significant leap in performance came with the introduction of the SAM-based complementation module. When this module was added on top of the two-branch decoder ensemble, the model achieved remarkable results. On the RefCOCO dataset, the model reached a mean intersection over union (mIoU) of 77.14% and an overall intersection over union (oIoU) of 75.37%. On the more complex RefCOCO+ and RefCOCOg datasets, the model delivered 71.75% mIoU and 68.07% oIoU, and 70.72% mIoU and 67.75% oIoU, respectively. These results are particularly noteworthy because they represent the best segmentation performance achieved with only a single decoder branch fine-tuned on a consumer-grade graphics card, demonstrating not only the superior accuracy of our model but also its practicality and accessibility.
4.4.2. SAM Input Ablation Experiment
In Figure 6, we illustrate how the segment anything model (SAM) is used in our completion module, highlighting its capability to accept a variety of input prompts, including bounding box coordinates, positive and negative sample points, and image masks. To investigate the optimal utilization of our model’s outputs for enhancing segmentation through SAM, we conducted a series of ablation experiments, the findings of which are presented in Table 4. To quickly identify the most effective input format for SAM, we selected the RefCOCO+ dataset, which poses a greater challenge than RefCOCO, as our validation set. Initially, when we fed SAM only the target box coordinates derived from the dual-branch model’s predictions, SAM achieved 67.46% mIoU and 62.67% oIoU on the RefCOCO+ val split, and only 71.50% mIoU and 68.91% oIoU on Test A and 61.06% mIoU and 54.85% oIoU on Test B. We then enriched the prompt by adding randomly generated positive sample points to the target box coordinates. This addition led to a noticeable improvement in SAM’s segmentation performance, with val scores rising to 69.07% mIoU and 64.42% oIoU; performance also improved on the Test A and Test B splits. Subsequently, introducing the noise detection and discard module further refined SAM’s predictions, yielding 70.81% mIoU and 67.34% oIoU on val, 74.48% mIoU and 72.51% oIoU on Test A, and 64.52% mIoU and 58.84% oIoU on Test B. Ultimately, by integrating SAM’s refined predictions as complementary information back into the dual-branch model, the overall framework achieved 71.75% mIoU and 68.07% oIoU on val, 75.70% mIoU and 73.46% oIoU on Test A, and 65.69% mIoU and 59.47% oIoU on Test B, thereby attaining the SOTA results. This outcome underscores the significance of each component in the framework and the synergistic impact of their integration.