Cascaded Searching Reinforcement Learning Agent for Proposal-Free Weakly-Supervised Phrase Comprehension
Abstract
1. Introduction
- This paper makes the first attempt to address phrase comprehension (PC) in a proposal-free, weakly supervised paradigm.
- We propose a cascaded searching reinforcement learning agent (CSRLA) that formulates PC as a Markov decision process (MDP) within an RL framework: target grounding is decomposed into sequential cascaded searching actions that expand from an initially salient region toward the complete referent (see the sketch after this list).
- We design a novel confidence discrimination reward function (ConDis_R) that constrains the agent to search for a complete and exclusive target.
- Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method.
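Only the contribution list survives in this extract, so the following is a minimal, hypothetical sketch of the kind of MDP loop the second bullet describes: an agent that starts from an initially salient region and emits cascaded window-expansion actions until it decides to stop. The `State`, `ACTIONS`, `step_env`, and `rollout` names, the five-action set, and the fixed stride are illustrative assumptions rather than the authors' implementation (the ablation in Section 4.3 compares against a "fixed stride" variant, so CSRLA itself presumably adapts its stride).

```python
# Illustrative sketch (not the authors' code): phrase comprehension as an MDP
# in which an agent expands a window outward from the initial salient region.
import random
from dataclasses import dataclass

@dataclass
class State:
    box: tuple[int, int, int, int]  # current search window (x1, y1, x2, y2)
    step: int                       # cascaded actions taken so far

# Hypothetical discrete actions: grow the window toward one side, or stop.
ACTIONS = ("expand_left", "expand_right", "expand_up", "expand_down", "stop")

def step_env(state: State, action: str, stride: int = 8) -> State:
    """Apply one cascaded searching action to the current window."""
    x1, y1, x2, y2 = state.box
    if action == "expand_left":
        x1 -= stride
    elif action == "expand_right":
        x2 += stride
    elif action == "expand_up":
        y1 -= stride
    elif action == "expand_down":
        y2 += stride
    return State((x1, y1, x2, y2), state.step + 1)

def rollout(initial_box, policy, max_steps: int = 20):
    """Search outward from the initially salient region until 'stop'."""
    state, trajectory = State(initial_box, 0), []
    while state.step < max_steps:
        trajectory.append(state)
        action = policy(state)
        if action == "stop":
            break
        state = step_env(state, action)
    return trajectory

# Usage with a random stand-in policy (a trained actor would replace it):
traj = rollout((50, 40, 120, 110), lambda s: random.choice(ACTIONS))
print(len(traj), traj[-1].box)
```

In the full method, the trained actor-critic policy of Algorithm 1 would replace the random policy, and ConDis_R would score each transition during training.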
2. Related Work
2.1. Fully-Supervised Phrase Comprehension
2.2. Weakly Supervised Phrase Comprehension
2.3. Vision-and-Language Pre-Training
2.4. Reinforcement Learning in Computer Vision
3. Methodology
3.1. Overview
3.2. Visual Encoder
3.3. Textual Encoder
3.4. Initially Salient Region Localization
3.5. Formulation of Reinforcement Learning
3.5.1. State
3.5.2. Action
3.5.3. Reward
3.5.4. Optimization
Algorithm 1 Advantage-Actor-Critic for CSRLA
Input: Image I and referring text T
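The body of Algorithm 1 did not survive extraction; as a rough reconstruction of the update its title names, here is a generic advantage actor-critic (A2C) step in PyTorch. The two-head network, `state_dim`, loss coefficients, and batching are assumptions; in CSRLA the state would fuse the visual and textual encodings of I and T, and `returns` would be discounted sums of the ConDis_R reward.

```python
# Hedged sketch of a generic A2C update; shapes and reward are placeholders.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)  # policy head pi(a|s)
        self.critic = nn.Linear(hidden, 1)         # value head V(s)

    def forward(self, s):
        h = self.backbone(s)
        dist = torch.distributions.Categorical(logits=self.actor(h))
        return dist, self.critic(h).squeeze(-1)

def a2c_update(model, optimizer, states, actions, returns,
               value_coef=0.5, entropy_coef=0.01):
    """One A2C step: advantage-weighted policy gradient + value regression."""
    dist, values = model(states)
    advantage = returns - values.detach()           # A(s,a) = R - V(s)
    policy_loss = -(dist.log_prob(actions) * advantage).mean()
    value_loss = (returns - values).pow(2).mean()   # fit V(s) to returns
    entropy = dist.entropy().mean()                 # exploration bonus
    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```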
3.5.5. Discussion
4. Experiments
4.1. Experimental Setup
4.2. Experimental Results
4.3. Ablation Studies
4.4. Visualization and Analysis
4.4.1. Visualization of Success Cases
4.4.2. Visualization of Failure Cases
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Xiang, N.; Chen, L.; Liang, L.; Rao, X.; Gong, Z. Semantic-Enhanced Cross-Modal Fusion for Improved Unsupervised Image Captioning. Electronics 2023, 12, 3549.
2. Zhao, W.; Yang, W.; Chen, D.; Wei, F. DFEN: Dual Feature Enhancement Network for Remote Sensing Image Caption. Electronics 2023, 12, 1547.
3. Jiang, L.; Meng, Z. Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph. Electronics 2023, 12, 1390.
4. Zhu, H.; Togo, R.; Ogawa, T.; Haseyama, M. Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data. Electronics 2023, 12, 2183.
5. Rohrbach, A.; Rohrbach, M.; Hu, R.; Darrell, T.; Schiele, B. Grounding of Textual Phrases in Images by Reconstruction. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 817–834.
6. Chen, K.; Gao, J.; Nevatia, R. Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4042–4050.
7. Liu, X.; Li, L.; Wang, S.; Zha, Z.J.; Meng, D.; Huang, Q. Adaptive reconstruction network for weakly supervised referring expression grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2611–2620.
8. Yu, L.; Poirson, P.; Yang, S.; Berg, A.C.; Berg, T.L. Modeling context in referring expressions. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 69–85.
9. Sun, M.; Xiao, J.; Lim, E.G. Iterative shrinking for referring expression grounding using deep reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14060–14069.
10. Liu, S.; Huang, D.; Wang, Y. Pay Attention to Them: Deep Reinforcement Learning-Based Cascade Object Detection. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 2544–2556.
11. Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705.
12. Zhao, H.; Zhou, J.T.; Ong, Y.S. Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 1523–1533.
13. Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1769–1779.
14. Hong, R.; Liu, D.; Mo, X.; He, X.; Zhang, H. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 684–696.
15. Hu, R.; Rohrbach, M.; Andreas, J.; Darrell, T.; Saenko, K. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1115–1124.
16. Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10880–10889.
17. Liu, D.; Zhang, H.; Wu, F.; Zha, Z.J. Learning to assemble neural module tree networks for visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4673–4682.
18. Deng, J.; Yang, Z.; Liu, D.; Chen, T.; Zhou, W.; Zhang, Y.; Li, H.; Ouyang, W. TransVG++: End-to-end visual grounding with language conditioned vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13636–13652.
19. Su, W.; Miao, P.; Dou, H.; Wang, G.; Qiao, L.; Li, Z.; Li, X. Language adaptive weight generation for multi-task visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10857–10866.
20. Yang, Z.; Kafle, K.; Dernoncourt, F.; Ordonez, V. Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19165–19174.
21. Li, K.; Li, J.; Guo, D.; Yang, X.; Wang, M. Transformer-based Visual Grounding with Cross-modality Interaction. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–19.
22. Li, M.; Sigal, L. Referring transformer: A one-step approach to multi-task visual grounding. Adv. Neural Inf. Process. Syst. 2021, 34, 19652–19664.
23. Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4683–4693.
24. Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving one-stage visual grounding by recursive sub-query construction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 387–404.
25. Niu, Y.; Zhang, H.; Lu, Z.; Chang, S.F. Variational context: Exploiting visual and textual context for grounding referring expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 347–359.
26. Zhang, Z.; Zhao, Z.; Lin, Z.; He, X. Counterfactual contrastive learning for weakly-supervised vision-language grounding. Adv. Neural Inf. Process. Syst. 2020, 33, 18123–18134.
27. Sun, M.; Xiao, J.; Lim, E.G.; Zhao, Y. Cycle-free Weakly Referring Expression Grounding with Self-paced Learning. IEEE Trans. Multimed. 2023, 25, 1611–1621.
28. Liu, X.; Li, L.; Wang, S.; Zha, Z.J.; Li, Z.; Tian, Q.; Huang, Q. Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3003–3018.
29. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual Event, 18–24 July 2021; pp. 4904–4916.
30. Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565.
31. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual Event, 18–24 July 2021; pp. 8748–8763.
32. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
33. Ren, L.; Lu, J.; Wang, Z.; Tian, Q.; Zhou, J. Collaborative deep reinforcement learning for multi-object tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 586–602.
34. Luo, W.; Sun, P.; Zhong, F.; Liu, W.; Zhang, T.; Wang, Y. End-to-end active object tracking and its real-world deployment via reinforcement learning. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1317–1332.
35. Bellver, M.; Giro-I-Nieto, X.; Marques, F.; Torres, J. Hierarchical Object Detection with Deep Reinforcement Learning. Adv. Parallel Comput. 2016, 31, 3.
36. Uzkent, B.; Yeh, C.; Ermon, S. Efficient object detection in large images using deep reinforcement learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1824–1833.
37. Liao, X.; Li, W.; Xu, Q.; Wang, X.; Jin, B.; Zhang, X.; Wang, Y.; Zhang, Y. Iteratively-refined interactive 3D medical image segmentation with multi-agent reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9394–9402.
38. Zeng, N.; Li, H.; Wang, Z.; Liu, W.; Liu, S.; Alsaadi, F.E.; Liu, X. Deep-reinforcement-learning-based images segmentation for quantitative analysis of gold immunochromatographic strip. Neurocomputing 2021, 425, 173–180.
39. Mansour, R.F.; Escorcia-Gutierrez, J.; Gamarra, M.; Villanueva, J.A.; Leal, N. Intelligent video anomaly detection and classification using faster RCNN with deep reinforcement learning model. Image Vis. Comput. 2021, 112, 104229.
40. Liu, T.; Meng, Q.; Huang, J.J.; Vlontzos, A.; Rueckert, D.; Kainz, B. Video summarization through reinforcement learning with a 3D spatio-temporal u-net. IEEE Trans. Image Process. 2022, 31, 1573–1586.
41. Ren, Z.; Wang, X.; Zhang, N.; Lv, X.; Li, L.J. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 290–298.
42. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024.
43. Hu, R.; Andreas, J.; Rohrbach, M.; Darrell, T.; Saenko, K. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 804–813.
44. Lu, J.; Ye, X.; Ren, Y.; Yang, Y. Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4921–4930.
45. Cai, G.; Zhang, J.; Jiang, X.; Gong, Y.; He, L.; Yu, F.; Peng, P.; Guo, X.; Huang, F.; Sun, X. Ask&confirm: Active detail enriching for cross-modal retrieval with partial query. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1835–1844.
46. Yan, S.; Yu, L.; Xie, Y. Discrete-continuous action space policy gradient-based attention for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8096–8105.
47. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
48. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
49. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
50. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
51. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
52. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937.
53. Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.L.; Murphy, K. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 11–20.
54. Sun, M.; Xiao, J.; Lim, E.G.; Liu, S.; Goulermas, J.Y. Discriminative triad matching and reconstruction for weakly referring expression grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4189–4195.
Results on RefCOCO+ ("-" denotes a result not reported; setting markers that did not survive extraction are left blank):

| Methods | Venue | Supervision | Settings | Val | Test A | Test B | AVG |
|---|---|---|---|---|---|---|---|
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 19.74 | 24.05 | 21.90 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 25.79 | 25.54 | 25.67 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 34.68 | 28.10 | 31.39 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 18.79 | 24.14 | 21.47 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 23.24 | 24.91 | 24.08 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 34.60 | 31.58 | 33.09 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 33.53 | 36.40 | 29.23 | 33.05 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised | − | 31.73 | 34.23 | 29.35 | 31.77 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised |  | 32.78 | 34.35 | 32.13 | 33.09 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 34.53 | 36.01 | 33.75 | 34.76 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 31.13 | 34.44 | 29.59 | 31.72 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 34.29 | 36.91 | 33.56 | 34.92 |
| C-FREE [27] | TMM-21 | proposal-based weakly-supervised | − | 39.20 | 39.63 | 37.59 | 38.81 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 35.50 | 37.39 | 33.65 | 35.51 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 35.31 | 33.46 | 37.27 | 35.35 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 37.54 | 37.58 | 37.92 | 37.68 |
| CSRLA (Ours) | - | proposal-free weakly-supervised | − | 32.37 | 32.56 | 31.26 | 32.06 |
Results on RefCOCO:

| Methods | Venue | Supervision | Settings | Val | Test A | Test B | AVG |
|---|---|---|---|---|---|---|---|
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 17.14 | 22.30 | 19.72 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 20.91 | 21.77 | 21.34 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 32.68 | 27.22 | 29.95 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 13.59 | 21.65 | 17.62 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 17.34 | 20.98 | 19.16 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 33.29 | 30.13 | 31.71 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 33.07 | 36.43 | 29.09 | 32.86 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised | − | 31.58 | 35.50 | 28.32 | 31.80 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised |  | 32.17 | 35.35 | 30.28 | 32.60 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 34.26 | 36.01 | 33.07 | 34.45 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 31.05 | 34.39 | 28.16 | 31.20 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 34.78 | 37.64 | 32.59 | 35.00 |
| DTMR [54] | TPAMI-21 | proposal-based weakly-supervised | − | 39.21 | 41.14 | 37.72 | 39.36 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 35.31 | 37.07 | 32.66 | 35.01 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 34.93 | 33.76 | 36.98 | 35.22 |
| CSRLA (Ours) | - | proposal-free weakly-supervised | − | 31.86 | 32.06 | 32.19 | 32.04 |
Results on RefCOCOg (Val), together with the mean accuracy (mA) over all seven evaluation splits of the three benchmarks:

| Methods | Venue | Supervision | Settings | Val | mA |
|---|---|---|---|---|---|
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | 28.14 | 22.27 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised |  | 33.66 | 25.53 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | 29.65 | 30.47 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | 25.14 | 20.66 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised | − | 33.79 | 24.05 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 33.19 | 32.99 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised | − | 32.60 | 31.90 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised |  | 33.09 | 32.88 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 34.66 | 34.61 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 32.17 | 31.56 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 34.92 | 34.96 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 38.99 | 35.80 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 38.37 | 35.73 |
| CSRLA (Ours) | - | proposal-free weakly-supervised | − | 31.89 | 32.03 |
Ablation study on RefCOCO+:

| Settings | Val | Test A | Test B | Avg |
|---|---|---|---|---|
| w/o RL (ALBEF for FWPC) | 19.17 | 15.82 | 22.65 | 19.21 |
| MCC | 25.20 | 23.54 | 27.14 | 25.34 |
| fixed stride | 29.05 | 28.32 | 30.17 | 29.18 |
| CSRLA | 32.37 | 32.56 | 31.26 | 32.06 |
Ablation on the CAAM threshold (RefCOCO+):

| CAAM Threshold | Val | Test A | Test B | Avg |
|---|---|---|---|---|
| 0.11 | 26.52 | 28.76 | 27.71 | 27.66 |
| 0.13 | 31.64 | 32.48 | 30.69 | 31.60 |
| 0.15 | 32.37 | 32.56 | 31.26 | 32.06 |
| 0.17 | 30.22 | 29.83 | 28.20 | 29.42 |
| 0.20 | 24.14 | 27.54 | 25.35 | 25.68 |
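For intuition about what this threshold controls, here is a hedged sketch of turning a normalized cross-attention activation map (CAAM) into an initial salient region by binarizing at the best-performing value of 0.15. The min-max normalization and the box-of-all-activations rule are assumptions for illustration, not the paper's exact localization procedure (Section 3.4).

```python
# Hypothetical sketch: initial salient region from a thresholded CAAM.
import numpy as np

def initial_region_from_caam(caam: np.ndarray, threshold: float = 0.15):
    """Binarize a min-max-normalized activation map and return the bounding
    box of above-threshold activations as (x1, y1, x2, y2), or None."""
    norm = (caam - caam.min()) / (caam.max() - caam.min() + 1e-8)
    ys, xs = np.where(norm >= threshold)
    if len(xs) == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())

# Usage with a random map standing in for a real CAAM:
rng = np.random.default_rng(0)
print(initial_region_from_caam(rng.random((14, 14))))
```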
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Wang, Y.; Yue, L.; Li, M. Cascaded Searching Reinforcement Learning Agent for Proposal-Free Weakly-Supervised Phrase Comprehension. Electronics 2024, 13, 898. https://doi.org/10.3390/electronics13050898