Article

Multimodal Attention-Based Instruction-Following Part-Level Affordance Grounding

Computer Science and Technology, Dalian Maritime University, Gaoxin District, Dalian 116026, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(11), 4696; https://doi.org/10.3390/app14114696
Submission received: 22 March 2024 / Revised: 15 April 2024 / Accepted: 29 April 2024 / Published: 29 May 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The integration of language and vision for object affordance understanding is pivotal for the advancement of embodied agents. Current approaches are often limited by their reliance on segregated pre-processing stages for language interpretation and object localization, leading to inefficiencies and error propagation in affordance segmentation. To overcome these limitations, this study introduces a unique task, part-level affordance grounding, in direct response to natural language instructions. We present the Instruction-Following Affordance Grounding Network (IAG-Net), a novel architecture that unifies language–vision interactions through a varied-scale multimodal attention mechanism. Unlike existing models, IAG-Net employs two textual–visual feature fusion strategies, capturing both sentence-level and task-specific textual features alongside multiscale visual features for precise and efficient affordance prediction. Our evaluation on two newly constructed vision–language affordance datasets, IIT-AFF VL and UMD VL, demonstrates a significant leap in performance, with improvements of 11.78% and 0.42% in mean Intersection over Union (mIoU) over cascaded models, bolstering both accuracy and processing speed. We contribute to the research community by releasing our source code and datasets, fostering further innovation and replication of our findings.

1. Introduction

The ability to comprehend and execute human instructions in natural language is important for the functionality of embodied agents. Once agents gain this ability, they can engage in natural and intuitive communication with humans. This ability necessitates an in-depth understanding of both natural language processing and visual content analysis. To address this challenge, a significant body of research has emerged, encompassing areas such as vision and language navigation [1,2,3], instruction-following manipulation [4], and visual grounding [5].
Despite considerable progress in various related research areas, formidable challenges, particularly in complex, real-world scenarios, remain in the domain of instruction-following manipulation. Most existing methods predominantly focus on object-level vision–language comprehension, which limits their ability to achieve fine-grained, part-level comprehension. Object-level grounding enables agents to identify the target object, such as “the glass on the right”. However, agents fail to consider the specific functional parts of the object that are necessary to complete the task. For example, the instruction “Hand over the glass on the right”, as depicted in Figure 1, requires the agent to understand not only the object itself (the glass on the right) but also the specific interaction ("hand over" involves the body of the glass in the green area of Figure 1). Conversely, for the instruction “Please pour some water into the right glass”, the agent needs to associate the mouth area of the right glass (the pink area in Figure 1) with the task of “pouring water into”. Object-level comprehension methods [6,7,8,9,10,11] excel at locating objects but struggle to discern part-level differences for the same object.
While object-level grounding is sufficient for basic manipulation tasks such as “pick up”, “hand over”, and “move”, it is inadequate for more complex actions like ‘use the knife to cut’. These actions require semantic grasping [12]—a nuanced comprehension of the object’s different parts and their associated functionalities or “affordance”, a term coined by psychologist James J. Gibson [13]. In robotic manipulation, understanding affordances at a granular level is crucial for determining the exact location for task-specific actions within a real-world context. Interpreting directives such as "Hand over the glass" or "Pour water into the glass" requires an advancement beyond traditional object-level grounding techniques. Existing research typically concentrates on either tracking bounding boxes with language cues [6,7,8,9,10,11] or segmenting object areas without considering the semantic context of tasks [14,15,16], as depicted at the bottom of Figure 1. Such methods, while effective for object detection and instance segmentation, fall short in their granularity, failing to meet the demands of manipulation-oriented robotic interactions. Current affordance detection methods can identify parts of objects, but they are confined to simplistic, pre-defined affordance categories and do not factor in the task semantics. This necessitates additional pre-processing steps for language interpretation and object localization prior to affordance segmentation. Our work seeks to fill this gap, providing a direct and nuanced understanding of instructions for fine-grained affordance segmentation that aligns with the complexities of real-world applications.
To overcome this challenge, we explore the problem of part-level affordance grounding based on natural language instructions. Specifically, we address the problem of segmenting the relevant functional region of a referenced object when given an image and an expression describing a manipulation instruction. In this paper, we introduce the Instruction-following Affordance Grounding Network (IAG-Net), targeting the nuanced task of part-level affordance grounding based on natural language instructions, as illustrated in the upper right of Figure 1. Unlike previous methods that rely on coarse object-level prediction and fixed task representations, our approach represents a significant advancement, achieving an end-to-end understanding of manipulation instructions for precise, fine-grained affordance grounding within images. We introduce a varied-scale multimodal attention mechanism for efficient feature fusion and extract task-specific contexts for precise affordance prediction. Our approach bridges the gap between language instructions and visual cues, facilitating a more nuanced interaction between humans and robots—beyond the capabilities of current object recognition systems. In addition, different from several works using simulated or fixed environment settings [8,11], we establish an instruction image dataset with realistic images and semantic grasping tasks.
Affordance grounding from natural language expressions has broad applications, including human–robot communication and natural language-facilitated human–robot interaction [17]. For instance, when faced with different instructions for the same object, such as “Help me move the table” and “Help me clean the table”, a robot should distinguish the relevant object regions and act appropriately. The part-level affordance grounding can enhance human–robot interaction capabilities and bolster the autonomy of robots in decision-making processes. In the context of dialog response generation, natural language-based affordance grounding can provide specific cues for predicting future human actions and understanding the semantics of manipulation tasks. Compared to one-shot manipulation task representation, natural language offers flexibility for various tasks, such as “Pour the milk” and “Hand over the milk box”. Therefore, this research has the potential to advance the field of robotics by empowering robots with enhanced interaction and decision-making abilities, thus allowing them to operate more autonomously and effectively [18,19] in dynamic environments or under remote control. Moreover, by integrating robust security measures and defenses against potential cyberattacks [20,21], this work contributes to enhancing reliability and trustworthiness in practical applications. In a brief overview, our contributions consist of the following:
(1)
We introduce an end-to-end framework for instruction-following part-level affordance grounding that streamlines the process and enables the development of more efficient robotics applications.
(2)
We construct a realistic instruction dataset for instruction-following affordance grounding. This multimodal dataset, featuring realistic images paired with manipulation instruction expressions, offers a more comprehensive and practical resource.
(3)
We introduce two language–vision feature fusion strategies. By leveraging both sentence-level and task-specific textual features, our approach incorporates a soft event detection strategy in the task-information extraction module, enhancing semantic information guidance for part-level affordance grounding.
The remainder of this paper is structured as follows. Section 2 reviews existing work related to ours. In Section 3, we delineate the architecture and the three components of our IAG-Net. Section 4 discusses the experimental results and analysis conducted on two datasets. We conclude the paper and discuss potential applications in Section 5.

2. Related Work

In this section, we survey three key areas related to our work: instruction-following robot manipulation, visual grounding, and affordance detection. We outline the interconnections between these areas and highlight the gaps targeted by this work. Table 1 shows a comparison of existing tasks in related work.

2.1. Instruction-Following Robot Manipulation

Understanding robot manipulation in response to instructions is foundational to our research. In robotics, language grounding includes grounding navigation instructions and grounding manipulation instructions. Both types ground the spatial relations of verbs [22,23,24] and the visual attributes of objects, such as colors [25,26] and shapes [27]. The former line of work focuses on grounding navigation verbs, as in “Go near the blue cone” and “Walk to the white barrel”, whereas the latter focuses on grounding manipulation verbs, as in “Pick up the yellow box on the upper side” and “Fetch the red box on the left”.
Our work is most related to grounding manipulation instructions. Early works used semantic structures such as spatial description clauses [28] and the verb-environment-instruction library structure [4] to transform natural language into structured sequences. More recently, many studies have applied deep neural networks to this task by encoding and comparing image and language features [29,30]. To resolve the spatial and perspective ambiguities in “pick up and place” tasks, studies have introduced probability models [30], commonsense reasoning [31], feedback and conversation [8,26], user gestures [32], spatial reasoning [23], and perspective analysis [33]. Beyond outputting the referred object location, some works directly map an image and language to motor control [27], grasp strategies [34], and structured action sequences. Recently, Mees et al. [11] performed language-conditioned manipulation with visual affordance following a U-Net [35] architecture; however, their work focused on completing long-horizon manipulation tasks in a simulated environment. Our method is most similar to deep-learning-based instruction grounding methods [7,9]. Li et al. [9] used “Grasp A to B” as the robot control language format, in which A and B correspond to the target object and the delivery place, respectively. However, existing methods use a two-stage neural network that first generates candidate objects and then infers the most likely one. In contrast, we propose a one-stage framework that learns part-level segmentation based on an object part’s suitability for a given task specified in natural language. In addition, existing methods focus on grounding “pick up and place” tasks, whereas our work addresses semantic grasping tasks such as “cut” and “contain”, which form the foundation for learning complex manipulation. Moreover, existing language-following datasets [6,7,8,9,10] usually provide object-level annotations rather than part-level annotations and grounding.

2.2. Visual Grounding

As a fundamental multimodal task, visual grounding mainly involves vision and language comprehension and aims to localize the referent object in an image according to the given natural language. This task has gained much attention in both the natural language and computer vision community. Typical visual grounding tasks include referring expression comprehension (REC) and referring image segmentation (RIS).
REC focuses on locating the most relevant object in an image based on a given referring expression [36]. In this task, it is assumed that the referring expression unambiguously describes an object. Typical two-stage methods [37,38,39] first generate several candidate regions via a pre-trained proposal framework, e.g., Faster-RCNN [40] or Mask-RCNN [41]. Afterward, the candidate regions and the expression are mapped into a common embedding space, and joint embedding is used to maximize the matching score of region–expression pairs. However, whole-sentence embedding may fail to capture fine-grained details that are crucial for accurate localization. Compositional modular networks [37] first identify the entities and their relationships in the input expression and then perform end-to-end visual reasoning. MAttNet [38] uses a Bi-LSTM to parse an expression into subject, location, and relationship representations and computes individual matching scores. However, these two-stage methods tend to be restricted by the quality of the proposals generated in the first stage. More recently, one-stage methods [42,43] eliminate the offline object detector and directly output the target bounding boxes of the referring expression. Following one-stage methods, several researchers [44,45,46] have explored end-to-end frameworks based on transformers.
Similar to the REC task, the RIS task aims to segment the referring object in an image that matches the corresponding referring expression [47]. In early works [48,49], textual features and image features were usually fused together and then the label for each pixel was predicted following classical image segmentation methods. To improve the performance of feature fusion, many methods have utilized attention mechanisms in image segmentation. Shi et al. [50] proposed a key-word-aware network to evaluate the importance of words according to their contribution to identifying an object. Ye et al. [51] considered the long-range correlations between text and vision modalities. Since most approaches usually perform multimodal fusion between language and vision at the decoding side, the EFN [52] employs a coattention mechanism to achieve simultaneous updates of multimodal features. Due to the excellent performance of transformers [53] based on self-attention mechanisms, many researchers [54,55,56] have extracted and fused vision and language features with self-attention mechanisms.
Although the above methods also take a picture and an expression as input and output the target object guided by the expression, they differ from our method in two respects. First, referring expressions and user instructions differ in content. While referring expressions describe the referent object within an image (e.g., “the man on the left of the tree”), manipulation instructions consist of objects and tasks (e.g., “Use the knife to cut”). Second, the above methods achieve object-level localization or segmentation; they do not adequately address the part-level granularity required for complex manipulation tasks.

2.3. Object Affordance Detection

The concept of affordance encompasses both object affordance and environment affordance [57]. We mainly focus on object affordance in this paper. Vision-based object affordance understanding aims to learn object affordance from vision, which falls within the domain of affordance learning. Visual object affordance is normally associated with three main tasks: object affordance categorization, object affordance detection, and affordance reasoning.
Similar to vision tasks such as object detection and semantic segmentation, affordance detection aims to detect the object and assign pixel-level labels for the object according to the affordance categories [58,59,60,61,62,63]. In early work, manually designed geometric features were extracted from RGB-D images to detect the affordance of objects [58]. As deep learning has progressed, an increasing number of deep-learning-based methods have been proposed to avoid hand-designed features. Nguyen et al. [59] trained an encoder–decoder convolutional network to obtain pixel label predictions. Subsequently, Ref. [60] employed a sequence of deconvolutional layers with a resizing strategy and introduced a multitask loss function to effectively handle the multiclass problem in the affordance mask. AffordanceNet [62] uses two branches to localize the object and assign affordance labels separately. Without object candidates, Zhao et al. [63] designed a relationship-aware network to produce pixel-wise affordance segmentation in an end-to-end way. Considering the object shape, ADOSMNet [16] uses a novel shape mask-guided encoder that aggregates global features and locally relevant object features. In addition, several researchers have introduced weakly supervised and few-shot affordance detection methods [64].
In these studies, the affordance detection and segmentation problem is considered a classification or pixel-level prediction problem. Due to the predefined and fixed nature of the affordance category, it is challenging for robots to directly understand language instructions. Consequently, existing approaches often fail to consider the crucial relation between visual features and task-specific semantic contexts. Our approach innovatively integrates natural language instruction with affordance detection and bridges the gap between language instructions and visual affordance, facilitating a more convenient interaction between humans and robots.

3. Method

In this section, we initially present the problem formulation and describe the overall architecture of our method. Then, we explain the three main modules: the sentence-level multiscale vision language encoder, task-specific context extraction module, and part-level affordance-grounding module.

3.1. Problem Formulation

The primary objective of our method is to generate a part-level mask, M, for the image, I, based on a corresponding natural language instruction, T, that identifies the functional part of the target object relevant to the given manipulation task. With θ denoted as the model parameters, the task can be formulated as follows:
$M = \mathrm{Aff}_{\theta}(I, T). \quad (1)$
In contrast to existing object-level segmentation tasks, M is a part-level mask corresponding to the affordance described in the language instruction. This task requires not only grounding between the image and the natural language but also grounding between the manipulation task and the functional part of the object. The former is object-level grounding and the latter is part-level grounding. Inspired by the characteristics of this task, we developed a method that employs two language–vision feature fusion strategies, one for sentence-level semantic features and the other for task-specific semantic features.
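To make this formulation concrete, the following minimal PyTorch-style sketch illustrates the expected interface of such a model; the function name, shapes, and thresholding shown here are illustrative assumptions rather than the released implementation.

```python
import torch

def ground_affordance(model, image: torch.Tensor, instruction_tokens) -> torch.Tensor:
    """Apply M = Aff_theta(I, T): map an image and a tokenized instruction
    to a part-level probability mask with the same spatial size as the image."""
    logits = model(image, instruction_tokens)     # (B, 1, H, W) mask logits
    return torch.sigmoid(logits)                  # per-pixel probabilities in [0, 1]

# Usage (shapes only; `model` is any module exposing this interface):
# image = torch.rand(1, 3, 512, 512)                                          # I
# tokens = tokenizer("Hand over the glass on the right", return_tensors="pt") # T
# mask = ground_affordance(model, image, tokens) > 0.5                        # M
```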

3.2. Overview

The architecture of our approach, as shown in Figure 2, comprises two foundational modules for extracting visual and linguistic features, followed by a multiscale vision language encoder to integrate sentence-level linguistic features with visual features from multiple scales. To effectively learn the fused feature from two different modalities, the encoder fuses sentence-level language features with multiscale visual features. The multiscale vision language encoder focuses on grounding the image with the sentence-level language feature. In addition to sentence-level semantic features, different words in the instructions contribute differently to the manipulation task. Recognizing the varying importance of different words in the instructions for the manipulation task, we introduce a task-specific context extraction and fusion module. This module is designed to identify and emphasize words that are crucial for understanding the manipulation task. Finally, the part-level affordance-grounding module combines the outputs of the above two modules and predicts a mask, M, for the functional part of the target object.

3.3. Vision and Language Encoders

For multimodal tasks involving both vision and language, the two-tower model architecture has emerged as a widely adopted approach for effectively extracting features from each modality. Therefore, our approach employs two distinct backbones, a visual encoder and a language encoder, to separately extract visual and textual features. The network inputs consist of an image, $I \in \mathbb{R}^{F_H \times F_W \times C}$ ($C = 3$), and a language instruction, $T = \{w_1, w_2, \ldots, w_t\}$, where $F_H$, $F_W$, and $C$ represent the height, width, and number of color channels of the image, respectively, and $t$ denotes the length of the instruction. The visual encoder processes the image input, while the language encoder handles the textual input.
Visual Features: We adopt ResNet 50 [65] as our visual backbone, following the approach of many prior works in image segmentation and detection [55,66]. ResNet 50 is known for its ability to effectively capture detailed multiscale visual features. The backbone produces multilevel visual features, $Res_i \in \mathbb{R}^{F_{H_i} \times F_{W_i} \times C_i}$, where $Res_i$ corresponds to a ResNet 50 feature and $i \in \{1, 2, 3, 4\}$ denotes the block index. Here, $F_{H_i}$ and $F_{W_i}$ represent the height and width of the feature map, respectively, and $C_i$ indicates the dimension of the feature channels.
Text Features: Language features are extracted using BERT [67], a language encoder that leverages a bidirectional self-attention mechanism within transformer layers to generate rich linguistic representations. Given an instruction sentence, $T = \{w_1, w_2, \ldots, w_t\}$, where $w_i$ represents the $i$-th word with $i \in \{1, \ldots, t\}$, BERT outputs a final hidden state, $F_w = \{f_1, f_2, \ldots, f_t\} \in \mathbb{R}^{t \times d}$, where $d$ is the dimension of the word embedding. Each embedding vector, $f_i$, corresponds to the contextual representation of the $i$-th word, $w_i$, in the instruction.
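As an illustration of this two-tower setup, the following sketch extracts multilevel ResNet-50 features and BERT word embeddings with standard torchvision and Hugging Face components; the pretrained weights and layer grouping are assumptions for illustration, not the authors' exact configuration.

```python
import torch
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

# Visual tower: collect Res_1..Res_4 from the four ResNet-50 blocks.
backbone = resnet50(weights="IMAGENET1K_V1")
stem = torch.nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
blocks = [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]

def visual_features(image: torch.Tensor):
    x = stem(image)
    feats = []
    for block in blocks:              # Res_i, i = 1..4, with decreasing resolution
        x = block(x)
        feats.append(x)
    return feats

# Language tower: BERT final hidden states F_w = {f_1, ..., f_t}.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def text_features(instruction: str):
    tokens = tokenizer(instruction, return_tensors="pt")
    return bert(**tokens).last_hidden_state   # (1, t, d) word embeddings

# Example:
# res_feats = visual_features(torch.rand(1, 3, 512, 512))
# f_w = text_features("Use the knife to cut")
```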

3.4. Multiscale Vision Language Encoder

A multimodal learning study [68] demonstrated the significance of high-level visual features, especially when language terms are directly related to objects such as “Hand over the cup on the right of the desk”. This insight serves as a guiding principle for our approach, where we harness high-level visual features and sentence-level textual features to capture object-level vision and language attributes effectively. The multiscale visual language encoder is designed to extract object-level multimodal features for affordance grounding.
We obtain the global sentence representation by concatenating the representations of all words in the sentence, as expressed in the following equation:
$F_{gt} = [f_1, \ldots, f_t], \quad (2)$
where $[\cdot\,, \cdot]$ denotes the concatenation operation. Initially, we transform both the textual and visual features into a uniform feature dimension using two separate linear layers. Subsequently, these features are fused, as shown in Equation (3).
$F_m = [g(F_v W_v), g(F_{gt} W_t)], \quad (3)$
where $W_v$ and $W_t$ are two learnable parameter matrices and $g$ denotes the ReLU activation function. The concatenation operation, $[\cdot\,, \cdot]$, is employed to combine features from the two modalities. The visual feature, $F_v$, is extracted from the visual backbone, specifically the feature $Res_4$. The textual feature, $F_{gt}$, is computed according to Equation (2). The multimodal feature, $F_m$, results from fusing the sentence-level semantic feature and the high-level visual feature, providing a global multimodal representation.
In order to enhance the fusion of visual–linguistic features, we propose a varied-scale multimodal attention (VSMA) mechanism designed to effectively capture cross-modal features. The incorporation of multiscale visual features has demonstrated efficacy in image segmentation, particularly in handling objects of varied scales. Additionally, cross-modal attention mechanisms guide the network to focus on image regions relevant to the language context, thereby enhancing performance in vision–language tasks. Motivated by these insights, we introduce the VSMA mechanism, which integrates a multiscale structure with cross-modal attention to refine the features and concentrate on the essential multimodal features of the target object. The structure of the VSMA mechanism is depicted in Figure 3. Specifically, VSMA comprises channel attention, spatial attention at various scales, and multimodal attention.
For channel attention, we compute the attention weights from the average pooling and max pooling of the multimodal feature, $F_m$, following the channel attention in CBAM [69]. Since varied scales have proven effective for various image processing tasks in previous works such as ASPP [70], we introduce varied-scale spatial attention to improve the robustness of affordance-part localization in the images. For the feature of modality $m$, we first compute the max-pooled and average-pooled features, $F^m_{max}$ and $F^m_{avg}$, respectively. Next, atrous convolutional layers [70] with different sampling rates, $Acon_s$, are employed to capture varied-scale spatial information, and their results are concatenated. Then, a convolution operation is applied to obtain the varied-scale spatial attention weights:
$VSA(F_m) = Conv(Acon_s([F^m_{max}, F^m_{avg}])) \otimes F_m, \quad (4)$
where $Acon_s$ is the atrous convolutional layer for scale $s$, as in ASPP, $Conv$ is a convolution layer, and $\otimes$ denotes element-wise multiplication.
In addition to spatial attention, multimodal attention is useful for determining the role of each modality in the affordance-grounding tasks. Therefore, multimodal attention is utilized to leverage features from different modalities before fusion, as shown in Figure 3. The whole VSMA can be represented as follows:
$F^h_{ma} = VSMA(F_m) = MA(VSA(CA(F_m))), \quad (5)$
where $CA$ represents the channel attention, $VSA$ is the varied-scale spatial attention, and $MA$ denotes the modality attention. $F^h_{ma}$ is the refined multimodal feature after the $h$-th VSMA head, where $h = \{1, \ldots, H\}$ indexes the heads of the varied-scale multimodal attention (VSMA). The $H$ heads collectively capture varied-scale cross-modal attention. The settings for the head number, $H$, are described in the experiments (Section 4.3).
Finally, the outputs of the different heads are fused by concatenation and convolution operations to obtain the multiscale attention feature, $F_s$:
$F_s = conv_{1 \times 1}([F^1_{ma}, \ldots, F^H_{ma}]). \quad (6)$
Through channel attention, varied-scale spatial attention, and multimodal attention, the features from the two modalities are fused deeply and densely. To further obtain part-level results, the output of this module is passed to the part-level affordance-grounding module to refine the segmentation.
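For illustration, the following is a minimal PyTorch sketch of one VSMA head, assuming CBAM-style channel attention, ASPP-style atrous branches for the varied-scale spatial attention, and a two-way softmax over the concatenated modality halves for the modality attention; the channel reduction, dilation rates, and pooling choices are assumptions, not the paper's exact configuration. In the full mechanism, H such heads would run in parallel and be fused with a 1 × 1 convolution as in Equation (6).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSMAHead(nn.Module):
    """One varied-scale multimodal attention head: CA -> VSA -> MA (Eq. 5)."""
    def __init__(self, channels: int, rates=(1, 6, 12)):
        super().__init__()
        # Channel attention (CBAM-style): shared MLP on avg- and max-pooled features.
        self.ca_mlp = nn.Sequential(nn.Conv2d(channels, channels // 8, 1),
                                    nn.ReLU(),
                                    nn.Conv2d(channels // 8, channels, 1))
        # Varied-scale spatial attention: atrous convs over [max-pool, avg-pool] maps.
        self.atrous = nn.ModuleList(
            [nn.Conv2d(2, 1, 3, padding=r, dilation=r) for r in rates])
        self.vsa_fuse = nn.Conv2d(len(rates), 1, 1)
        # Modality attention: one weight per modality half of the fused feature F_m.
        self.ma = nn.Linear(channels, 2)

    def forward(self, f_m: torch.Tensor) -> torch.Tensor:   # f_m: (B, C, H, W)
        # Channel attention
        ca = torch.sigmoid(self.ca_mlp(F.adaptive_avg_pool2d(f_m, 1)) +
                           self.ca_mlp(F.adaptive_max_pool2d(f_m, 1)))
        x = ca * f_m
        # Varied-scale spatial attention (Eq. 4)
        pooled = torch.cat([x.max(dim=1, keepdim=True).values,
                            x.mean(dim=1, keepdim=True)], dim=1)        # (B, 2, H, W)
        vsa = torch.sigmoid(self.vsa_fuse(
            torch.cat([conv(pooled) for conv in self.atrous], dim=1)))
        x = vsa * x
        # Modality attention: F_m is assumed to concatenate two equal-size modality chunks.
        w = torch.softmax(self.ma(x.mean(dim=(2, 3))), dim=-1)          # (B, 2)
        half = x.shape[1] // 2
        return torch.cat([x[:, :half] * w[:, :1, None, None],
                          x[:, half:] * w[:, 1:, None, None]], dim=1)
```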

3.5. Task-Specific Context Extraction

Task information, exemplified by actions and objects such as “cup” and “contain”, plays a crucial role in grounding object parts for specific tasks. For instance, given two similar language instructions, “Grasp the cup” and “Grasp the knife”, distinguishing the target object in the image is essential. Similarly, for instructions like “Use the cup to contain” and “Grasp the cup”, the different actions lead to distinct affordance areas. Both action and object information are necessary for part-level affordance grounding. Therefore, this module is designed to extract a task-related textual context, which we denote as the task-specific context for part-level affordance grounding.
The target object and the corresponding action/verb are the two key elements of a task, as in “Use the knife to cut” and “Hand over the cup”. We conceptualize the task context as a pair of key elements: <object, action>. Drawing inspiration from event extraction (EE) [71] in natural language processing (NLP), we treat task-specific context extraction as analogous to event extraction, where the action serves as the event trigger and the object serves as the argument. Since there is no annotation for the trigger or the argument in the datasets, we use a soft classifier to predict the probability that each word belongs to one of three types (i.e., trigger, argument, or unnecessary word). Then, the task-specific context of the entire expression can be calculated from the probability of each word.
Word Feature Embedding: By capturing the syntactic and semantic essence of each word, we map the words in the instruction to individual vectors using pre-trained BERT, as outlined in Section 3.3. After this step, the instruction is represented as a sequence of vectors, $\{f_1, f_2, \ldots, f_t\}$. These word representations are then augmented with a position embedding, $p_i$, resulting in a composite feature for each word, $w_i$, denoted by $[f_i, p_i]$.
Graph-Based Feature Encoding: Following word embedding, we transform the dependency tree of the instructions into an undirected syntactic graph. This graph effectively links words to their contextual counterparts, enhancing relational understanding. For instance, in the instruction "Use the knife on the right to cut", the action word "cut" and its corresponding object "the knife" are closely linked in the graph, despite being distant in the sequential representation. In the sequential representation, the two words are 5 steps apart, whereas in the dependency tree they are 2 steps apart (as shown in Figure 4).
In the generated graph $G = \{V, E\}$, each node, $v \in V$, represents a word in the given instruction, while each edge, $e \in E$, corresponds to a syntactic arc in the dependency tree (see Figure 4). To learn the syntactic context of each word, we employ a graph neural network (GNN) [72] to aggregate information from the connected neighbors. After passing through $L$ layers, the GNN outputs a representation, $f_i^L$, for each word.
Task-Specific Context Computing: Each word’s feature from the graph, denoted as $f_i^L$, is passed through a multi-layer perceptron (MLP), resulting in a soft probability vector, $s_i \in \mathbb{R}^{1 \times 3}$. Each vector, $s_i$, indicates the likelihood of each word, $w_i$, belonging to one of three categories: action, object, or unnecessary word. The task-specific context of the instruction, $T$, is then computed as a weighted combination of all the words, as shown in Equation (7).
$f_c = \sum_{i=1}^{t} \left( s_i^1 \times f_i^L + s_i^2 \times f_i^L \right), \quad (7)$
where $s_i^1$ and $s_i^2$ represent the probabilities that the word $w_i$ is the action and the object, respectively, and $f_i^L$ is the GNN-encoded embedding of the word $w_i$. $f_c$ represents the task-specific context for the entire sentence. The task-specific context, $f_c$, is then fused with the visual feature, $Res_1$, through a $1 \times 1$ convolution operation, resulting in $F_{tc}$ as follows:
$F_{tc} = conv_{1 \times 1}([Res_1, f_c]), \quad (8)$
where the combined feature, $F_{tc}$, is subsequently utilized in the part-level affordance-grounding module. $F_{tc}$ fuses the task-specific context with the visual features, providing rich semantics of the action and the target object.
In summary, the task-specific context extraction module extracts a task-related textual context, represents it as a pair of key elements, <object, action>, and integrates this context with visual object features for further use in the part-level affordance-grounding module.
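The following sketch illustrates the soft event-detection step: per-word class probabilities weight the GNN-encoded word features to form the task context (Equation (7)), which is then fused with $Res_1$ (Equation (8)). The MLP sizes and broadcasting scheme are stand-ins, and the dependency parsing and graph construction are omitted for brevity.

```python
import torch
import torch.nn as nn

class TaskContext(nn.Module):
    """Soft event detection: weight GNN word features by action/object probabilities."""
    def __init__(self, word_dim: int, vis_channels: int):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(word_dim, word_dim // 2),
                                        nn.ReLU(),
                                        nn.Linear(word_dim // 2, 3))  # action/object/other
        self.fuse = nn.Conv2d(vis_channels + word_dim, vis_channels, kernel_size=1)

    def forward(self, f_graph: torch.Tensor, res1: torch.Tensor) -> torch.Tensor:
        # f_graph: (B, t, d) GNN-encoded word features; res1: (B, C, H, W) visual feature.
        s = torch.softmax(self.classifier(f_graph), dim=-1)       # (B, t, 3)
        weights = s[..., 0:1] + s[..., 1:2]                       # action + object probs
        f_c = (weights * f_graph).sum(dim=1)                      # (B, d), Eq. (7)
        f_c_map = f_c[:, :, None, None].expand(-1, -1, *res1.shape[2:])
        return self.fuse(torch.cat([res1, f_c_map], dim=1))       # F_tc, Eq. (8)
```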

3.6. Part-Level Affordance Grounding

In this module, we leverage the extracted task-specific information to precisely guide part-level affordance grounding. The multiscale feature, $F_s$, provides cross-modal features at varied scales, offering global multimodal information. The task-specific feature, $F_{tc}$, further refines the semantic information and attends more to local information. To closely align object parts with their respective functional tasks, we employ an “upsample after fusion” strategy, wherein we first resize the multiscale feature, $F_s$, and the task-specific feature, $F_{tc}$, to a uniform size. These features are then concatenated and subjected to convolution operations for effective integration. This fusion process is crucial for aligning visual and task-specific cues at a fine-grained level. We repeatedly apply deconvolution layers to the fused features until the resulting feature map is upscaled to match the size of the original image. This process ensures that the part-level affordance grounding is detailed and aligns well with the original image scale, which is denoted as:
$M_a = deconv_{repeated}(conv_{3 \times 3}([reshape(F_s), F_{tc}])). \quad (9)$
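A compact sketch of this “upsample after fusion” decoder, with transposed convolutions standing in for the repeated deconvolution layers; the number of upsampling stages and channel widths are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffordanceDecoder(nn.Module):
    """Fuse F_s and F_tc, then repeatedly upsample to the input resolution (Eq. 9)."""
    def __init__(self, fs_channels: int, ftc_channels: int, num_upsamples: int = 4):
        super().__init__()
        self.fuse = nn.Conv2d(fs_channels + ftc_channels, 256, kernel_size=3, padding=1)
        self.deconvs = nn.ModuleList(
            [nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
             for _ in range(num_upsamples)])
        self.head = nn.Conv2d(256, 1, kernel_size=1)   # per-pixel affordance logit

    def forward(self, f_s: torch.Tensor, f_tc: torch.Tensor) -> torch.Tensor:
        # Resize F_s to the spatial size of F_tc before fusion ("upsample after fusion").
        f_s = F.interpolate(f_s, size=f_tc.shape[2:], mode="bilinear", align_corners=False)
        x = torch.relu(self.fuse(torch.cat([f_s, f_tc], dim=1)))
        for deconv in self.deconvs:
            x = torch.relu(deconv(x))                  # double the spatial size each step
        return self.head(x)                            # mask logits M_a
```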

3.7. Loss Function

To optimize the part-level affordance grounding, we define our loss function using binary cross-entropy [73]. This loss function is particularly suited for binary classification tasks, where each pixel in the image is classified as relevant or irrelevant for the task. The function is formulated as follows:
$L_{aff} = -\sum_{k=1}^{H \times W} \left( y_k \log(p_k) + (1 - y_k) \log(1 - p_k) \right), \quad (10)$
where $y_k$ represents a pixel of the ground-truth mask and $p_k$ represents the corresponding pixel of the predicted mask. The pseudocode for the IAG-Net training process is outlined in Algorithm 1.
Algorithm 1: Training process for IAG-Net
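Since the algorithm body is rendered as an image in the published article, the following PyTorch-style sketch outlines a plausible training iteration consistent with the loss in Equation (10) and the settings in Section 4.2.1; the epoch count, data loader fields, and `model` interface are assumed placeholders, not the released code.

```python
import torch
import torch.nn as nn

def train_iag_net(model, train_loader, epochs: int = 60, device: str = "cuda"):
    """One possible training loop: SGD + polynomial LR decay + per-pixel BCE (Eq. 10)."""
    criterion = nn.BCEWithLogitsLoss()                      # numerically stable BCE
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=1e-4)
    total_steps = epochs * len(train_loader)
    step = 0
    model.to(device).train()
    for _ in range(epochs):
        for image, token_ids, attn_mask, gt_mask in train_loader:
            # Polynomial learning-rate decay (power 0.9 is a common choice).
            lr = 1e-3 * (1 - step / total_steps) ** 0.9
            for group in optimizer.param_groups:
                group["lr"] = lr
            logits = model(image.to(device), token_ids.to(device), attn_mask.to(device))
            loss = criterion(logits, gt_mask.to(device))    # L_aff over all pixels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
```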

4. Experiments and Evaluation

To thoroughly assess our method, we conducted extensive experiments on two public datasets, focusing on investigating the following research problems:
RQ1: How does IAG-Net perform compared with existing related methods?
RQ2: How does each design choice made in IAG-Net affect its performance?
RQ3: How sensitive is IAG-Net to different hyperparameters and optimization choices?

4.1. Dataset Construction

This task necessitates paired data, <T, I, M>, comprising natural language instructions, T, corresponding RGB images, I, and pixel-level part affordance masks, M. As no existing datasets fulfill these requirements, we constructed a new affordance-grounding dataset (https://github.com/WenQu-NEU/Affordance-Grounding-Dataset accessed on 30 March 2024) with the following methodology:
  • Image Selection: The first step involves selecting an image, I, with existing affordance annotations from two widely recognized computer vision datasets: the IIT-AFF dataset [59] and the UMD dataset [58]. We detail the specific characteristics and selection criteria for each dataset below.
  • Language Instruction Labeling: After image selection, we manually added instructions, T, for each image, I.
  • Affordance Mask Generation: We generated the affordance mask, M, based on the image annotations corresponding to the given instructions, T.
Image Selection: In computer vision, there are two widely used datasets with pixel-level ground-truth affordance labels for objects, i.e., the IIT-AFF dataset [59] and the UMD dataset [58]. The IIT-AFF dataset (https://sites.google.com/site/iitaffdataset/) has 8835 real-world images with cluttered scenes. This dataset contains 10 object categories (bowl, tv, pan, hammer, knife, cup, drill, racket, spatula, and bottle) and 9 affordance classes (contain, cut, display, engine, grasp, hit, pound, support, and w-grasp). In contrast, the images in the UMD dataset (https://users.umiacs.umd.edu/~fer/affordance/part-affordance-dataset/index.html) show objects on a rotating table in a clutter-free setup. These images were captured by a Kinect camera. We utilized the IIT-AFF and UMD datasets for their extensive annotations and the affordance categories they offer, which align with our objective of examining object affordances in diverse scenarios. To ensure the UMD dataset’s variety and to avoid redundancy, we implemented a filtering process to remove highly similar images. By computing pairwise image similarity, any pair with a similarity score exceeding the threshold of 0.9 had one image randomly removed. This curation resulted in a refined UMD dataset consisting of 7070 unique images for our experiments. Consequently, our enhanced dataset comprises 14,642 object bounding boxes and 24,677 pixel-level affordance regions, making it a robust and comprehensive resource for affordance analysis.
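A sketch of the similarity-based filtering described above, assuming a perceptual-hash similarity as the pairwise measure (the paper does not specify which similarity metric was used); any pair whose similarity exceeds 0.9 has one member removed at random.

```python
import random
from itertools import combinations
from PIL import Image
import imagehash

def filter_similar_images(paths, threshold: float = 0.9, seed: int = 0):
    """Drop one image from every pair whose similarity exceeds `threshold`."""
    random.seed(seed)
    hashes = {p: imagehash.phash(Image.open(p)) for p in paths}
    max_dist = len(next(iter(hashes.values())).hash) ** 2   # 64 bits for default phash
    removed = set()
    for a, b in combinations(paths, 2):
        if a in removed or b in removed:
            continue
        similarity = 1 - (hashes[a] - hashes[b]) / max_dist  # normalized to [0, 1]
        if similarity > threshold:
            removed.add(random.choice([a, b]))
    return [p for p in paths if p not in removed]
```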
Language Instruction Labeling: To create a semantic link between functional areas and language instructions, we manually crafted language instructions for each image, adhering to standardized guidelines to ensure consistency, relevance, and diversity across annotations. Two volunteers were tasked with generating instructions based on the content of each image. Each instruction was designed to encompass three critical elements: the object, the action to be performed, and the affordance related to the action, such as “Use the cup to contain”. To ensure accuracy and consistency, each volunteer worked independently, followed by a review process in which any discrepancies between their instructions were discussed and resolved. This process resulted in 8835 and 7070 expressions for the IIT-AFF and UMD datasets, respectively. To enrich the diversity of our language instructions, we specifically focused on varying the expressions used for objects with multiple affordances. For example, a knife, which possesses affordances such as “grasp” and “cut”, was labeled with distinct instructions such as “Grasp the knife” or “Use the knife to cut”. Additionally, we incorporated positional descriptors such as “the right” and “the left” to provide more precise and contextually relevant descriptions of the objects. Figure 5 shows examples of our dataset. The first and third rows display the original images along with their affordance annotations from the IIT-AFF and UMD datasets. The second and fourth rows further demonstrate how these images are paired with language instructions and their corresponding ground-truth masks.
Affordance Mask Generation: Generating affordance masks posed challenges in accurately aligning textual instructions with visual annotations, which we addressed through iterative refinement and validation. The IIT-AFF and UMD datasets provide affordance annotations for each image; the value of each pixel ranges from 0 to 9 for IIT-AFF and from 0 to 7 for UMD. In the third row of Figure 5, each color corresponds to a different affordance category. We automatically generated affordance masks corresponding to the text instructions, so each instruction corresponds to a binary mask focusing on one affordance area at a time, which aligns more closely with our research questions.
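The following sketch shows how a binary mask for one instruction can be derived from a pixel-level affordance annotation, assuming the annotation is a label map whose integer values index affordance categories; the specific id mapping below is hypothetical, since the real dataset stores the instruction-to-affordance association explicitly.

```python
import numpy as np

# Hypothetical mapping from affordance names to IIT-AFF label-map ids.
AFFORDANCE_IDS = {"contain": 1, "cut": 2, "display": 3, "engine": 4, "grasp": 5,
                  "hit": 6, "pound": 7, "support": 8, "w-grasp": 9}

def instruction_mask(label_map: np.ndarray, affordance: str) -> np.ndarray:
    """Return a binary mask selecting the pixels of one affordance category."""
    return (label_map == AFFORDANCE_IDS[affordance]).astype(np.uint8)

# Example: mask = instruction_mask(annotation, "cut")   # for "Use the knife to cut"
```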
Table 2 shows the statistics of the constructed affordance-grounding datasets, which are denoted as the IIT-AFF VL dataset and the UMD VL dataset. Since IIT-AFF VL and UMD VL have different affordance categories, we use “-” to represent absent affordance categories in the table. For a comprehensive and fair evaluation, we randomly divided the dataset into distinct sets: training (80%), testing (10%), and validation (the remaining 10%). This division ensures a balanced representation of the data across all experimental phases. Considering the diversity of expressions, we used different affordance categories to construct instructions for each object category. For instance, the object “knife” provides two affordance categories, i.e., “cut” and “grasp”; we used different instructions to cover them, such as “Grasp the knife” and “Use the knife to cut”.
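A minimal sketch of the random 80/10/10 split; the fixed seed is an assumption added for reproducibility.

```python
import random

def split_dataset(samples, seed: int = 42):
    """Randomly split <T, I, M> triples into 80% train, 10% test, 10% validation."""
    random.seed(seed)
    samples = samples[:]               # copy before in-place shuffle
    random.shuffle(samples)
    n = len(samples)
    n_train, n_test = int(0.8 * n), int(0.1 * n)
    return (samples[:n_train],                         # training
            samples[n_train:n_train + n_test],         # testing
            samples[n_train + n_test:])                # validation (remaining ~10%)
```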

4.2. Experimental Settings

4.2.1. Implementation Details

We implemented our method in Python with the PyTorch framework and ran the experiments on an NVIDIA RTX 3060 GPU under Windows. In the experiments, the input images were resized to 512 × 512 pixels, following previous works [70,74]. For model optimization, we utilized stochastic gradient descent (SGD) [75] with a weight decay of $1 \times 10^{-4}$ and a momentum of 0.9. The initial learning rate was set to 0.001 and gradually decreased with a polynomial decay strategy.

4.2.2. Evaluation Metrics

We employed the mean intersection-over-union (mIoU) and $F_\beta^\omega$ metrics [76] because they are widely recognized for assessing segmentation quality in contexts similar to ours. The mIoU is calculated as the average ratio between the intersection and union areas, computed over all pixels of the ground-truth mask, M, and the predicted probability result, P. For affordance segmentation, most studies employ the $F_\beta^\omega$ metric with the parameters defined in [76]; therefore, we also report $F_\beta^\omega$ for our experiments. Since our datasets include diverse affordance categories, we conducted evaluations per affordance category, allowing for a more granular understanding of the model performance across different types of object interactions.
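As an illustration of the primary metric, the following sketch computes IoU per binary mask and averages the values; the thresholding of predicted probabilities is an assumption, and the weighted F-measure $F_\beta^\omega$ follows [76] and is omitted here for brevity.

```python
import numpy as np

def mean_iou(pred_masks, gt_masks, threshold: float = 0.5, eps: float = 1e-6):
    """Mean IoU over binary affordance masks (predictions given as probabilities)."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        p = pred > threshold
        g = gt.astype(bool)
        intersection = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append((intersection + eps) / (union + eps))
    return float(np.mean(ious))
```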

4.3. Comparison Results (RQ1)

To demonstrate the effect of our model, we conducted comparative analyses with three distinct methodological paradigms: image-only segmentation methods, cascaded methods, and multimodal-based RIS methods. Image-only methods were selected to determine the influence of visual data in isolation, providing a baseline to analyze the impact of incorporating linguistic instructions. Cascaded methods and multimodal-based methods were chosen to represent common practical applications and to assess how the fusion of text and image data informs the performance within these contexts. Specifically, we selected image-only segmentation methods (DeepLabV3+ [74], OCRNet [77], AffordanceNet [62], RelaNet [63], BPN [14], GSE [15], and ADOSMNet [16]) that are considered benchmarks within the field of affordance segmentation. It is important to note the fundamental difference between image segmentation and affordance segmentation tasks. While image segmentation predicts masks for an entire image indiscriminately, affordance segmentation is more intricate, often requiring a pre-processing step for object detection or an integrated approach that accomplishes object detection and segmentation simultaneously. In the realm of multimodal approaches, we incorporated cascaded methods (combining Faster-RCNN [40] and DeepLabV3+ [74]) owing to their prevalent adoption in practical scenarios, thus enabling an evaluation that reflects real-world application performance. Furthermore, we evaluated our model against a state-of-the-art RIS method (BKINet [56]), which represents the latest innovations in multimodal understanding. These baselines were chosen due to their proven robustness and widespread recognition in the literature, providing a well-understood standard from language-independent to language-integrated approaches, against which to measure our advancements. The details of these methods are as follows:
  • DeepLabV3+ [74]: DeepLabV3+ was employed for its strong semantic segmentation capabilities. By fine-tuning this method on our dataset, we aimed to assess the additional impact, if any, that language processing contributes to overall affordance-grounding performance.
  • OCRNet [77]: OCRNet is notable for its object-contextual representations for semantic segmentation, which characterize each pixel by leveraging the features of its corresponding object class. Since affordance segmentation is closely related to object class, OCRNet serves as an appropriate image-only baseline for our study.
  • AffordanceNet [62]: AffordanceNet, a classical affordance detection method, segments images based on affordance categories without the integration of language instructions, providing a baseline for comparing language-agnostic approaches.
  • RelaNet [63]: RelaNet is an end-to-end affordance segmentation method that requires no separate, intermediate object detection step. It considers relationships between affordances and the corresponding objects. Because the implementation of this method is not publicly available, we report its results on the IIT-AFF dataset from the original paper (ResNet50 backbone).
  • BPN [14]: BPN considers the potential relationship between object categories and object affordances and proposes a boundary-preserving network. It is an image-only method for affordance detection. Because the implementation of this method is not publicly available, we report its results on the IIT-AFF dataset from the original paper.
  • GSE [15]: GSE utilizes a repeated multi-scale feature-map-fusion network to produce category-relevant feature maps. Because the implementation of this method is not publicly available, we report its results on the IIT-AFF dataset from the original paper (ResNet50 backbone).
  • ADOSMNet [16]: This state-of-the-art affordance detection method operates without language instructions. We consider it an upper bound in our comparisons, since it eliminates the need to interpret language and achieves state-of-the-art results on the two datasets (IIT-AFF and UMD). Because the implementation of this method is not publicly available, we report its results on the IIT-AFF dataset from the original paper (ResNet50 backbone). Unfortunately, ADOSMNet’s performance on the UMD VL dataset could not be compared because our curated UMD VL subset is specific to this study (details in Section 4.1).
  • Cascaded Method: Our custom-designed cascaded method combines object detection and affordance segmentation. It begins by localizing the object mentioned in the instruction using Faster-RCNN [40]. The target object is the region with the highest prediction probability, and affordance segmentation is then performed within the target region with DeepLabV3+ [74]. This approach provides insights into the impact of object grounding on the whole process.
  • BKINet [56]: BKINet is a state-of-the-art method for referring image segmentation that directly segments the referred object based on the input text. We trained this model on our dataset to evaluate its performance under the same conditions as our proposed method.
  • TransVG [44]: TransVG is a classical visual grounding method that predicts object localization from language. We trained this model on our dataset to provide object-level grounding results, a task that is easier than our part-level grounding.
In Table 3 and Table 4, we present the accuracy values of the evaluated methods across the two datasets. The clear background and minimal environmental interference of the UMD dataset likely contribute to its generally higher performance compared with IIT-AFF. This observation underscores the importance of dataset characteristics in influencing model accuracy. Among the three categories of methods, image-only methods, by concentrating solely on visual cues, avoid the complexities and potential inaccuracies introduced by language data integration, which explains their superior performance. However, when object grounding is integrated with the image-only method in the cascaded approach, there is a notable performance decrease. For instance, compared with DeepLabV3+, the cascaded method shows performance decreases of 1.41%, 1.69%, 8.59%, and 43.07% across the two datasets. The cascaded method’s performance drop, as indicated by the comparison with DeepLabV3+, suggests that the sequential process of object grounding followed by affordance segmentation may introduce compounded errors, leading to lower overall accuracy. Considering the image characteristics of UMD and IIT-AFF, we find that the more complex the image content, the larger the performance decrease for the cascaded method. We also evaluated the accuracy of object localization with the visual grounding method TransVG [44], which achieves average accuracy values of approximately 87.28 on the IIT-AFF dataset and 99.78 on the UMD dataset. This result highlights a challenge for both cascaded and multimodal methods, which process both image and language instructions, as they tend to underperform compared to image-centric approaches. Our IAG-Net notably outperforms the cascaded and multimodal RIS methods across most categories, demonstrating its robustness and effectiveness. BKINet’s underperformance in categories such as “cut”, “hit”, “pound”, and “support” could be attributed to the limited sample size of these affordance categories within the IIT-AFF VL dataset. Our method, IAG-Net, demonstrates results competitive with those of image-only methods and appears to be less influenced by dataset variability. In addition to the above methods, we also tested the large multimodal model NExT-Chat [78] for this task using the provided demo (https://next-chatv.github.io/). NExT-Chat is a large multimodal model for detection and segmentation that can handle multiple tasks, such as visual grounding, region captioning, and grounded reasoning. While NExT-Chat can produce bounding boxes or segmentation masks for entire objects, it is unable to achieve part-level visual understanding. Enhancing large multimodal models such as NExT-Chat for more granular tasks points to an area for future exploration and could be a valuable direction for subsequent research.
Visualization: To further illustrate the subjective performance of IAG-Net, we present a series of visualizations that compare our model’s affordance-grounding results with the ground truth. Each set in our visual representation, as displayed in Figure 6, comprises four parts. The first column displays the original image paired with its corresponding language instruction. The second column shows the overlap of the predicted affordance mask on the original image. The third and fourth columns provide a side-by-side comparison of the affordance prediction maps of IAG-Net against the ground-truth annotations. These visual examples demonstrate that IAG-Net is capable of effectively interpreting language instructions and aligning closely with ground-truth affordance maps.
Statistical Analysis: To evaluate the statistical significance of the performance disparities among multimodal-based approaches, we undertook a statistical analysis focusing on the mean IoU and F-score values across the two datasets. The results revealed a statistically significant difference in segmentation performance among the three methods. In Figure 7, we present a visualization depicting the mean IoU and F-score values alongside their respective confidence intervals for the two datasets. These confidence intervals were calculated using the t-distribution, assuming a 95% confidence level and accounting for the sample size. The inclusion of confidence intervals in our visualization effectively conveys the uncertainty associated with these performance estimates. Notably, our method is distinguished by its tighter confidence intervals for the mean IoU across both datasets, indicating a more uniform performance across affordance categories and underscoring the robustness of our approach. Nevertheless, it is important to note that there remains room for improvement in the F-score performance of our method. While our method shows promising results in terms of mean IoU, it requires further refinement to improve its F-score outcomes.
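A brief sketch of the t-distribution confidence interval used here, computed with SciPy over per-category scores; the example score values are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

def confidence_interval(scores, confidence: float = 0.95):
    """t-distribution confidence interval for the mean of per-category scores."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)                       # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return mean, mean - half_width, mean + half_width

# Example with placeholder per-category mIoU values:
# mean, lo, hi = confidence_interval([0.62, 0.71, 0.58, 0.66, 0.69])
```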
Model Efficiency Analysis: In terms of model efficiency, IAG-Net has a more streamlined architecture with fewer parameters than the cascaded method, as illustrated in Table 5. This efficiency stems from IAG-Net’s integrated approach, in contrast to the cascaded method’s separate training for visual grounding and affordance segmentation, which results in a larger parameter set and lower FPS. Compared to image-only methods, IAG-Net achieves a similar balance of FPS and parameters, indicating its efficiency in handling both visual and language data without significant resource overhead. BKINet achieves the highest FPS, partly because of the parallel computing of the architecture.
Failure Cases: Our analysis of failure cases (as shown in Figure 8), particularly in complex scenes from the IIT-AFF dataset, reveals that IAG-Net sometimes struggles with images containing multiple objects matching the instruction, leading to partial target recognition. Occlusions present another challenge, affecting the model’s performance. These errors predominantly arise from the current limitations of the model’s relatively simple decoder architecture. Future work will focus on enhancing this component to better interpret instructions and handle complex visual scenes.

4.4. Ablation Study (RQ2)

To verify the contribution of each component, we conducted extensive ablation experiments on the two datasets and analyzed the results of different operations. As shown in Table 6, the components of the network include the following:
  • Baseline: In the baseline model, the encoder extracts visual and textual features, and the fused multimodal features are fed into the decoder for affordance segmentation. This model serves as a foundational comparison point without the task-specific context and VSMA.
  • Baseline+TaskContext: This model incorporates task-specific context into the baseline, enhancing the multimodal fusion process.
  • IAG-Net: The full model integrates both the multihead VSMA (MH-VSMA) and the task-specific context, offering advanced affordance-grounding capabilities.
For the baseline, we fused the visual and textual features in the encoder, and the decoder outputs the affordance segmentation mask. For Baseline+TaskContext, we added the task-specific context and its multimodal fusion to the encoder. In this case, improvements are observed to varying degrees across all five measurements (overall accuracy, mean accuracy, frequency accuracy, mIoU, and $F_\beta^\omega$). After adding MH-VSMA, the full model significantly improves in both mIoU (+3.5%) and $F_\beta^\omega$ (+2.1%). The results are presented in Table 6. The introduction of the task-specific context significantly improves the overall and frequency accuracies, demonstrating its effectiveness in capturing task-relevant information. The inclusion of MH-VSMA further improves the mean accuracy, mIoU, and $F_\beta^\omega$, underscoring its role in refining affordance segmentation. The overall accuracy and frequency accuracy decline slightly. While mIoU and F-score focus on segmentation quality and boundary delineation, overall accuracy and frequency accuracy consider pixel-level classification correctness. The increase in mIoU and F-score suggests an improvement in segmentation quality, which may come at the cost of overall accuracy, as the model might prioritize correctly segmenting complex regions over accurately classifying all pixels. Sample imbalance, in which certain classes are underrepresented in the dataset, could also contribute to these results. Since the frequency accuracy is computed by averaging the frequency accuracies over all classes in the dataset, sample imbalance will cause a decrease if less frequent classes are misclassified more often. The ablation study indicates that both the task-specific context and MH-VSMA benefit the affordance-grounding task. These enhancements validate our two-stage fusion strategy, which combines sentence-level fusion and task-context fusion for more accurate affordance grounding.

4.5. Influence of Hyperparameters and Optimization (RQ3)

Using the multi-head VSMA in the encoder improved both mIoU and $F_\beta^\omega$. For the multi-head VSMA module, we evaluated the method with different numbers of heads. As shown in Table 7, our evaluation of the multi-head VSMA module with varying head numbers reveals a trade-off between accuracy and FPS. While increasing the number of heads enhances the mean accuracy, mIoU, and $F_\beta^\omega$, it concurrently reduces the FPS. The use of 5 heads strikes a balance, achieving improved accuracy while maintaining reasonable computational efficiency. The results presented in Table 7 are derived from averaging multiple runs, with data randomly sampled from the two datasets for each iteration. This process ensures that our findings are robust and representative of various data scenarios.
We also conducted comparative experiments with two different optimizers, SGD and Adam. Although Adam demonstrated faster convergence, both during the initial epochs and overall, SGD achieved higher mIoU and mean accuracy. This outcome underscores the suitability of SGD for affordance segmentation tasks. Consequently, in our experimental setup we opted for the SGD optimizer, configured with a weight decay of $1 \times 10^{-4}$ and a momentum of 0.9, as detailed in the experimental settings subsection.

5. Conclusions

In this work, we address the problem of affordance grounding by introducing the instruction-following affordance-grounding network (IAG-Net), which advances affordance segmentation by directly interpreting natural language instructions and bypassing separate object detection stages. Our key findings emphasize the importance of both sentence-level and task-specific context in vision–language tasks, particularly for achieving precise part-level affordance grounding.
We further contribute two multimodal datasets, IIT-AFF VL and UMD VL, built upon the existing image-only datasets IIT-AFF and UMD. By pairing affordance-annotated images with text instructions, we obtain instruction-following affordance masks. These datasets provide comprehensive resources for future research on fine-grained vision–language affordance understanding and offer rich opportunities for exploring advanced multimodal interactions. While our evaluations demonstrate the effectiveness and generalizability of IAG-Net, limitations remain: because existing affordance datasets lack variety, performance may degrade in highly cluttered scenes or for objects with complex or subtle affordances.
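As an illustration, a single training sample in such a dataset can be thought of as an image paired with an instruction and the corresponding part-level mask; the field names and paths below are hypothetical and do not reflect the released dataset schema.

```python
# Hypothetical record layout for one instruction-following affordance sample;
# field names and paths are illustrative and do not reflect the released
# dataset schema.
sample = {
    "image": "images/tools_0042.jpg",                 # RGB image from IIT-AFF or UMD
    "instruction": "Use the knife on the right to cut",
    "affordance": "cut",                              # task-relevant affordance class
    "mask": "masks/tools_0042_cut.png",               # part-level affordance mask
}
```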
In summary, our work represents a significant step forward in vision–language affordance grounding, and opportunities remain for further exploration and refinement. Future enhancements to IAG-Net will capitalize on large-scale multimodal models and improve robustness in challenging contexts. Another avenue is integrating additional knowledge sources, such as knowledge graphs, to deepen affordance understanding, and we aim to explore broader applications of our method, particularly in interactive robotics and complex task execution.

Author Contributions

Conceptualization, W.Q.; methodology, W.Q., L.G. and J.C.; software, J.C., L.G. and W.Q.; validation, X.J.; writing—original draft preparation, L.G. and X.J.; writing—review and editing, W.Q.; visualization, L.G. and W.Q.; supervision, W.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is available at https://github.com/WenQu-NEU/Affordance-Grounding-Dataset (accessed on 30 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Comparison of object-level grounding and part-level affordance grounding. The red boxes indicate the detected object bounding boxes. Additionally, the pink and green colors in the masks represent two affordance categories: contain and w-grasp.
Figure 2. An overview of our multimodal attention-based instruction-following affordance-grounding network (IAG-Net).
Figure 3. The structure of varied-scale multimodal attention.
Figure 4. An example of a dependency tree for “Use the knife on the right to cut”.
Figure 5. Example of our affordance-grounding dataset. Each color corresponds to an affordance category.
Figure 6. Visual examples of our methods.
Figure 7. Performance comparison of three multimodal-based methods with confidence intervals. The left plot displays results on the IIT-AFF VL dataset, while the right plot presents results from the UMD VL dataset. Confidence intervals, calculated at a 95% confidence level, are included in both plots to provide insights into the uncertainty surrounding the estimated performance metrics.
Figure 8. Example of a failure case of hand occlusion.
Table 1. Comparison of affordance grounding with existing related tasks (instruction-following manipulation, REC, RIS, affordance detection, and affordance grounding), characterized by whether each task involves language following, object location, object segmentation, and affordance segmentation.
Table 2. Statistics of the instruction-following affordance-grounding datasets.

| Affordance | IIT-AFF VL Images | IIT-AFF VL Instructions | UMD VL Images | UMD VL Instructions | All Pairs |
|---|---|---|---|---|---|
| contain | 1143 | 1143 | 949 | 949 | 2092 |
| cut | 101 | 101 | 790 | 790 | 891 |
| display | 1594 | 1594 | - | - | 1594 |
| engine | 71 | 71 | - | - | 71 |
| grasp | 1617 | 1617 | 2611 | 2611 | 4228 |
| hit | 414 | 414 | - | - | 414 |
| pound | 180 | 180 | 419 | 419 | 599 |
| support | 440 | 440 | 1060 | 1060 | 1500 |
| w-grasp | 926 | 926 | 671 | 671 | 1597 |
| scoop | - | - | 570 | 570 | 570 |
| Average | 721 | 721 | 1010 | 1010 | 1731 |
| All | 8835 | 8835 | 7070 | 7070 | 13,556 |
Table 3. Performance on the UMD vision–language dataset. Bold text is utilized to highlight the best values achieved for mIoU and F-score among multimodal-based methods. DeepLabV3+, OCRNet, and AffordanceNet are image-only methods; Cascaded, BKINet, and IAG-Net (Ours) are multimodal-based.

| Affordance | DeepLabV3+ mIoU | DeepLabV3+ Fβω | OCRNet mIoU | OCRNet Fβω | AffordanceNet Fβω | Cascaded mIoU | Cascaded Fβω | BKINet mIoU | BKINet Fβω | IAG-Net (Ours) mIoU | IAG-Net (Ours) Fβω |
|---|---|---|---|---|---|---|---|---|---|---|---|
| contain | 85.80 | 81.27 | 85.40 | 81.72 | 59.75 | 86.11 | 83.38 | 79.03 | 47.81 | 85.24 | 82.02 |
| cut | 82.3 | 78.6 | 82.76 | 79.42 | 64.26 | 82.07 | 80.62 | 42.32 | 43.65 | 78.34 | 79.89 |
| grasp | 69.5 | 80.03 | 71.02 | 80.54 | 65.97 | 71.39 | 78.70 | 75.98 | 51.80 | 76.24 | 81.08 |
| pound | 85.42 | 81.29 | 85.61 | 81.70 | 76.73 | 85.20 | 82.08 | 61.45 | 67.17 | 85.13 | 83.14 |
| scoop | 85.57 | 79.74 | 85.92 | 80.65 | 77.0 | 85.79 | 81.65 | 70.84 | 86.83 | 86.34 | 81.22 |
| support | 83.77 | 81.11 | 85.03 | 82.04 | 79.45 | 85.86 | 81.34 | 89.37 | 90.47 | 85.20 | 81.41 |
| w-grasp | 80.80 | 76.04 | 81.01 | 77.06 | 78.65 | 78.97 | 73.00 | 86.46 | 88.08 | 81.81 | 75.45 |
| Average | 81.88 | 79.73 | 82.39 | 80.45 | 71.6 | 82.19 | 80.11 | 72.20 | 67.97 | 82.61 | 80.60 |
Table 4. Performance on the IIT-AFF vision–language dataset. Bold text is utilized to highlight the best values achieved for mIoU and F-score among multimodal-based methods. DeepLabV3+, OCRNet, AffordanceNet, RelaNet, GSE, BPN, and ADOSMNet are image-only methods; Cascaded, BKINet, and IAG-Net (Ours) are multimodal-based.

| Affordance | DeepLabV3+ mIoU | DeepLabV3+ Fβω | OCRNet mIoU | OCRNet Fβω | AffordanceNet Fβω | RelaNet Fβω | GSE Fβω | BPN Fβω | ADOSMNet Fβω | Cascaded mIoU | Cascaded Fβω | BKINet mIoU | BKINet Fβω | IAG-Net (Ours) mIoU | IAG-Net (Ours) Fβω |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| contain | 85.68 | 63.98 | 88.35 | 70.85 | 79.61 | 80.20 | 87.92 | 80.62 | 87.77 | 73.18 | 66.11 | 81.79 | 65.90 | 87.72 | 70.25 |
| cut | 69.29 | 61.58 | 79.20 | 61.94 | 75.68 | 78.04 | 65.34 | 79.23 | 84.68 | 67.45 | 58.68 | 63.37 | 43.01 | 73.29 | 65.11 |
| display | 87.03 | 39.30 | 89.46 | 41.42 | 77.81 | 79.14 | 91.90 | 80.55 | 84.26 | 74.37 | 39.18 | 96.68 | 77.17 | 91.55 | 42.84 |
| engine | 78.60 | 73.99 | 79.82 | 73.96 | 77.50 | 81.22 | 81.91 | 81.49 | 84.37 | 58.94 | 51.53 | 78.52 | 75.87 | 85.39 | 79.58 |
| grasp | 75.95 | 29.28 | 78.59 | 31.61 | 68.48 | 71.59 | 79.76 | 72.96 | 83.07 | 70.60 | 29.42 | 69.12 | 59.31 | 83.49 | 33.68 |
| hit | 85.08 | 40.07 | 94.97 | 39.12 | 70.75 | 88.52 | 90.51 | 88.84 | 84.64 | 89.96 | 35.93 | 64.25 | 75.00 | 92.55 | 43.19 |
| pound | 79.47 | 54.82 | 79.21 | 53.16 | 69.57 | 76.91 | 75.95 | 77.59 | 82.68 | 72.66 | 46.09 | 35.56 | 51.28 | 75.43 | 55.21 |
| support | 79.60 | 67.88 | 81.36 | 61.09 | 69.81 | 80.12 | 78.41 | 80.96 | 83.74 | 76.98 | 56.59 | 71.44 | 68.63 | 77.99 | 51.31 |
| w-grasp | 88.58 | 70.08 | 82.13 | 73.20 | 70.98 | 74.56 | 89.43 | 74.56 | 83.35 | 67.56 | 64.06 | 52.43 | 76.06 | 90.29 | 70.31 |
| Average | 81.03 | 55.66 | 83.68 | 56.26 | 73.35 | 78.92 | 82.33 | 79.64 | 84.28 | 72.41 | 49.73 | 68.13 | 65.83 | 84.19 | 56.83 |
Table 5. Model efficiency analysis of image-only and multimodal-based methods. The ↑ signifies that larger values are preferable.

| Method | Type | Params | FPS ↑ |
|---|---|---|---|
| DeepLabV3+ | Image only | 39.76 | 4.86 |
| Cascaded | Multimodal-based | 81.86 | 2.61 |
| BKINet | Multimodal-based | 138.55 | 10.85 |
| IAG-Net (Ours) | Multimodal-based | 42.35 | 4.24 |
Table 6. Results of the ablation study for each component. The symbol ✓ indicates the addition of the corresponding component. Bold text is utilized to emphasize the best values achieved for each metric.

| Method | TaskInf | MH-VSMA | Overall Acc | Mean Acc | Freq Acc | mIoU | Fβω |
|---|---|---|---|---|---|---|---|
| Base | | | 94.8 | 70.5 | 90.5 | 74.9 | 43.0 |
| Base+TaskInf | ✓ | | 95.4 | 71.7 | 91.4 | 75.8 | 43.9 |
| IAG-Net | ✓ | ✓ | 94.0 | 85.0 | 89.0 | 79.3 | 46.0 |
Table 7. Hyperparameters for MH-VSMA, with different head numbers of VSMA. The bold text is used to highlight the best value for each metric.

| Head Number | Mean Acc | mIoU | Fβω | FPS |
|---|---|---|---|---|
| 1-VSMA | 74.63 | 74.08 | 41.18 | 4.10 |
| 2-VSMA | 82.85 | 76.39 | 44.10 | 4.02 |
| 3-VSMA | 81.99 | 76.31 | 44.30 | 3.82 |
| 4-VSMA | 83.94 | 76.68 | 45.86 | 3.61 |
| 5-VSMA | 84.99 | 79.32 | 45.99 | 3.56 |
| 6-VSMA | 81.41 | 75.48 | 41.77 | 3.55 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

