Cascaded Searching Reinforcement Learning Agent for Proposal-Free Weakly-Supervised Phrase Comprehension
Abstract
1. Introduction
- This paper makes the first attempt to address phrase comprehension (PC) in a proposal-free, weakly supervised paradigm.
- We propose a cascaded searching reinforcement learning agent (CSRLA) that formulates PC as a Markov decision process (MDP) within an RL framework: target grounding is decomposed into sequential cascaded searching actions that expand from an initially salient region toward the complete referent (see the sketch after this list).
- We design a novel confidence discrimination reward function (ConDis_R) that constrains the agent to search for a complete and exclusive target.
- Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method.
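Only the contribution list survives in this extract, so the following is a minimal, hypothetical sketch of the kind of MDP loop the second bullet describes: an agent that starts from an initially salient region and emits cascaded window-expansion actions until it decides to stop. The `State`, `ACTIONS`, `step_env`, and `rollout` names, the five-action set, and the fixed stride are illustrative assumptions rather than the authors' implementation (the ablation in Section 4.3 compares against a "fixed stride" variant, so CSRLA itself presumably adapts its stride).

```python
# Illustrative sketch (not the authors' code): phrase comprehension as an MDP
# in which an agent expands a window outward from the initial salient region.
import random
from dataclasses import dataclass

@dataclass
class State:
    box: tuple[int, int, int, int]  # current search window (x1, y1, x2, y2)
    step: int                       # cascaded actions taken so far

# Hypothetical discrete actions: grow the window toward one side, or stop.
ACTIONS = ("expand_left", "expand_right", "expand_up", "expand_down", "stop")

def step_env(state: State, action: str, stride: int = 8) -> State:
    """Apply one cascaded searching action to the current window."""
    x1, y1, x2, y2 = state.box
    if action == "expand_left":
        x1 -= stride
    elif action == "expand_right":
        x2 += stride
    elif action == "expand_up":
        y1 -= stride
    elif action == "expand_down":
        y2 += stride
    return State((x1, y1, x2, y2), state.step + 1)

def rollout(initial_box, policy, max_steps: int = 20):
    """Search outward from the initially salient region until 'stop'."""
    state, trajectory = State(initial_box, 0), []
    while state.step < max_steps:
        trajectory.append(state)
        action = policy(state)
        if action == "stop":
            break
        state = step_env(state, action)
    return trajectory

# Usage with a random stand-in policy (a trained actor would replace it):
traj = rollout((50, 40, 120, 110), lambda s: random.choice(ACTIONS))
print(len(traj), traj[-1].box)
```

In the full method, the trained actor-critic policy of Algorithm 1 would replace the random policy, and ConDis_R would score each transition during training.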
2. Related Work
2.1. Fully-Supervised Phrase Comprehension
2.2. Weakly Supervised Phrase Comprehension
2.3. Vision-and-Language Pre-Training
2.4. Reinforcement Learning in Computer Vision
3. Methodology
3.1. Overview
3.2. Visual Encoder
3.3. Textual Encoder
3.4. Initially Salient Region Localization
3.5. Formulation of Reinforcement Learning
3.5.1. State
3.5.2. Action
3.5.3. Reward
3.5.4. Optimization
Algorithm 1 Advantage-Actor-Critic for CSRLA
Input: Image I and referring text T
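The body of Algorithm 1 did not survive extraction; as a rough reconstruction of the update its title names, here is a generic advantage actor-critic (A2C) step in PyTorch. The two-head network, `state_dim`, loss coefficients, and batching are assumptions; in CSRLA the state would fuse the visual and textual encodings of I and T, and `returns` would be discounted sums of the ConDis_R reward.

```python
# Hedged sketch of a generic A2C update; shapes and reward are placeholders.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)  # policy head pi(a|s)
        self.critic = nn.Linear(hidden, 1)         # value head V(s)

    def forward(self, s):
        h = self.backbone(s)
        dist = torch.distributions.Categorical(logits=self.actor(h))
        return dist, self.critic(h).squeeze(-1)

def a2c_update(model, optimizer, states, actions, returns,
               value_coef=0.5, entropy_coef=0.01):
    """One A2C step: advantage-weighted policy gradient + value regression."""
    dist, values = model(states)
    advantage = returns - values.detach()           # A(s,a) = R - V(s)
    policy_loss = -(dist.log_prob(actions) * advantage).mean()
    value_loss = (returns - values).pow(2).mean()   # fit V(s) to returns
    entropy = dist.entropy().mean()                 # exploration bonus
    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```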
3.5.5. Discussion
4. Experiments
4.1. Experimental Setup
4.2. Experimental Results
4.3. Ablation Studies
4.4. Visualization and Analysis
4.4.1. Visualization of Success Cases
4.4.2. Visualization of Failure Cases
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Xiang, N.; Chen, L.; Liang, L.; Rao, X.; Gong, Z. Semantic-Enhanced Cross-Modal Fusion for Improved Unsupervised Image Captioning. Electronics 2023, 12, 3549.
2. Zhao, W.; Yang, W.; Chen, D.; Wei, F. DFEN: Dual Feature Enhancement Network for Remote Sensing Image Caption. Electronics 2023, 12, 1547.
3. Jiang, L.; Meng, Z. Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph. Electronics 2023, 12, 1390.
4. Zhu, H.; Togo, R.; Ogawa, T.; Haseyama, M. Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data. Electronics 2023, 12, 2183.
5. Rohrbach, A.; Rohrbach, M.; Hu, R.; Darrell, T.; Schiele, B. Grounding of Textual Phrases in Images by Reconstruction. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 817–834.
6. Chen, K.; Gao, J.; Nevatia, R. Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4042–4050.
7. Liu, X.; Li, L.; Wang, S.; Zha, Z.J.; Meng, D.; Huang, Q. Adaptive reconstruction network for weakly supervised referring expression grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2611–2620.
8. Yu, L.; Poirson, P.; Yang, S.; Berg, A.C.; Berg, T.L. Modeling context in referring expressions. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 69–85.
9. Sun, M.; Xiao, J.; Lim, E.G. Iterative shrinking for referring expression grounding using deep reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14060–14069.
10. Liu, S.; Huang, D.; Wang, Y. Pay Attention to Them: Deep Reinforcement Learning-Based Cascade Object Detection. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 2544–2556.
11. Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705.
12. Zhao, H.; Zhou, J.T.; Ong, Y.S. Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 1523–1533.
13. Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1769–1779.
14. Hong, R.; Liu, D.; Mo, X.; He, X.; Zhang, H. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 684–696.
15. Hu, R.; Rohrbach, M.; Andreas, J.; Darrell, T.; Saenko, K. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1115–1124.
16. Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10880–10889.
17. Liu, D.; Zhang, H.; Wu, F.; Zha, Z.J. Learning to assemble neural module tree networks for visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4673–4682.
18. Deng, J.; Yang, Z.; Liu, D.; Chen, T.; Zhou, W.; Zhang, Y.; Li, H.; Ouyang, W. TransVG++: End-to-end visual grounding with language conditioned vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13636–13652.
19. Su, W.; Miao, P.; Dou, H.; Wang, G.; Qiao, L.; Li, Z.; Li, X. Language adaptive weight generation for multi-task visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10857–10866.
20. Yang, Z.; Kafle, K.; Dernoncourt, F.; Ordonez, V. Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19165–19174.
21. Li, K.; Li, J.; Guo, D.; Yang, X.; Wang, M. Transformer-based Visual Grounding with Cross-modality Interaction. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–19.
22. Li, M.; Sigal, L. Referring transformer: A one-step approach to multi-task visual grounding. Adv. Neural Inf. Process. Syst. 2021, 34, 19652–19664.
23. Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4683–4693.
24. Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving one-stage visual grounding by recursive sub-query construction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 387–404.
25. Niu, Y.; Zhang, H.; Lu, Z.; Chang, S.F. Variational context: Exploiting visual and textual context for grounding referring expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 347–359.
26. Zhang, Z.; Zhao, Z.; Lin, Z.; He, X. Counterfactual contrastive learning for weakly-supervised vision-language grounding. Adv. Neural Inf. Process. Syst. 2020, 33, 18123–18134.
27. Sun, M.; Xiao, J.; Lim, E.G.; Zhao, Y. Cycle-free Weakly Referring Expression Grounding with Self-paced Learning. IEEE Trans. Multimed. 2023, 25, 1611–1621.
28. Liu, X.; Li, L.; Wang, S.; Zha, Z.J.; Li, Z.; Tian, Q.; Huang, Q. Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3003–3018.
29. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual Event, 18–24 July 2021; pp. 4904–4916.
30. Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565.
31. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual Event, 18–24 July 2021; pp. 8748–8763.
32. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
33. Ren, L.; Lu, J.; Wang, Z.; Tian, Q.; Zhou, J. Collaborative deep reinforcement learning for multi-object tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 586–602.
34. Luo, W.; Sun, P.; Zhong, F.; Liu, W.; Zhang, T.; Wang, Y. End-to-end active object tracking and its real-world deployment via reinforcement learning. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1317–1332.
35. Bellver, M.; Giro-I-Nieto, X.; Marques, F.; Torres, J. Hierarchical Object Detection with Deep Reinforcement Learning. Adv. Parallel Comput. 2016, 31, 3.
36. Uzkent, B.; Yeh, C.; Ermon, S. Efficient object detection in large images using deep reinforcement learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1824–1833.
37. Liao, X.; Li, W.; Xu, Q.; Wang, X.; Jin, B.; Zhang, X.; Wang, Y.; Zhang, Y. Iteratively-refined interactive 3D medical image segmentation with multi-agent reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9394–9402.
38. Zeng, N.; Li, H.; Wang, Z.; Liu, W.; Liu, S.; Alsaadi, F.E.; Liu, X. Deep-reinforcement-learning-based images segmentation for quantitative analysis of gold immunochromatographic strip. Neurocomputing 2021, 425, 173–180.
39. Mansour, R.F.; Escorcia-Gutierrez, J.; Gamarra, M.; Villanueva, J.A.; Leal, N. Intelligent video anomaly detection and classification using faster RCNN with deep reinforcement learning model. Image Vis. Comput. 2021, 112, 104229.
40. Liu, T.; Meng, Q.; Huang, J.J.; Vlontzos, A.; Rueckert, D.; Kainz, B. Video summarization through reinforcement learning with a 3D spatio-temporal u-net. IEEE Trans. Image Process. 2022, 31, 1573–1586.
41. Ren, Z.; Wang, X.; Zhang, N.; Lv, X.; Li, L.J. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 290–298.
42. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024.
43. Hu, R.; Andreas, J.; Rohrbach, M.; Darrell, T.; Saenko, K. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 804–813.
44. Lu, J.; Ye, X.; Ren, Y.; Yang, Y. Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4921–4930.
45. Cai, G.; Zhang, J.; Jiang, X.; Gong, Y.; He, L.; Yu, F.; Peng, P.; Guo, X.; Huang, F.; Sun, X. Ask&confirm: Active detail enriching for cross-modal retrieval with partial query. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1835–1844.
46. Yan, S.; Yu, L.; Xie, Y. Discrete-continuous action space policy gradient-based attention for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8096–8105.
47. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
48. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
49. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
50. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
51. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
52. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937.
53. Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.L.; Murphy, K. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 11–20.
54. Sun, M.; Xiao, J.; Lim, E.G.; Liu, S.; Goulermas, J.Y. Discriminative triad matching and reconstruction for weakly referring expression grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4189–4195.
Results on RefCOCO+ ("-" denotes a result not reported; setting markers that did not survive extraction are left blank):

| Methods | Venue | Supervision | Settings | Val | Test A | Test B | AVG |
|---|---|---|---|---|---|---|---|
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 19.74 | 24.05 | 21.90 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 25.79 | 25.54 | 25.67 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 34.68 | 28.10 | 31.39 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 18.79 | 24.14 | 21.47 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 23.24 | 24.91 | 24.08 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 34.60 | 31.58 | 33.09 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 33.53 | 36.40 | 29.23 | 33.05 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised | − | 31.73 | 34.23 | 29.35 | 31.77 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised |  | 32.78 | 34.35 | 32.13 | 33.09 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 34.53 | 36.01 | 33.75 | 34.76 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 31.13 | 34.44 | 29.59 | 31.72 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 34.29 | 36.91 | 33.56 | 34.92 |
| C-FREE [27] | TMM-21 | proposal-based weakly-supervised | − | 39.20 | 39.63 | 37.59 | 38.81 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 35.50 | 37.39 | 33.65 | 35.51 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 35.31 | 33.46 | 37.27 | 35.35 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 37.54 | 37.58 | 37.92 | 37.68 |
| CSRLA (Ours) | - | proposal-free weakly-supervised | − | 32.37 | 32.56 | 31.26 | 32.06 |
Results on RefCOCO:

| Methods | Venue | Supervision | Settings | Val | Test A | Test B | AVG |
|---|---|---|---|---|---|---|---|
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 17.14 | 22.30 | 19.72 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 20.91 | 21.77 | 21.34 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 32.68 | 27.22 | 29.95 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 13.59 | 21.65 | 17.62 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised | − | - | 17.34 | 20.98 | 19.16 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | - | 33.29 | 30.13 | 31.71 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 33.07 | 36.43 | 29.09 | 32.86 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised | − | 31.58 | 35.50 | 28.32 | 31.80 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised |  | 32.17 | 35.35 | 30.28 | 32.60 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 34.26 | 36.01 | 33.07 | 34.45 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 31.05 | 34.39 | 28.16 | 31.20 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 34.78 | 37.64 | 32.59 | 35.00 |
| DTMR [54] | TPAMI-21 | proposal-based weakly-supervised | − | 39.21 | 41.14 | 37.72 | 39.36 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 35.31 | 37.07 | 32.66 | 35.01 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 34.93 | 33.76 | 36.98 | 35.22 |
| CSRLA (Ours) | - | proposal-free weakly-supervised | − | 31.86 | 32.06 | 32.19 | 32.04 |
Results on RefCOCOg (Val), together with the mean accuracy (mA) over all seven evaluation splits of the three benchmarks:

| Methods | Venue | Supervision | Settings | Val | mA |
|---|---|---|---|---|---|
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | 28.14 | 22.27 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised |  | 33.66 | 25.53 |
| VC (det) [25] | TPAMI-19 | proposal-based weakly-supervised | − | 29.65 | 30.47 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised |  | 25.14 | 20.66 |
| VC [25] | TPAMI-19 | proposal-based weakly-supervised | − | 33.79 | 24.05 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 33.19 | 32.99 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised | − | 32.60 | 31.90 |
| ARN (det) [7] | ICCV-19 | proposal-based weakly-supervised |  | 33.09 | 32.88 |
| ARN [7] | ICCV-19 | proposal-based weakly-supervised |  | 34.66 | 34.61 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 32.17 | 31.56 |
| IGN [26] | NIPS-20 | proposal-based weakly-supervised |  | 34.92 | 34.96 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 38.99 | 35.80 |
| EARN [28] | TPAMI-23 | proposal-based weakly-supervised |  | 38.37 | 35.73 |
| CSRLA (Ours) | - | proposal-free weakly-supervised | − | 31.89 | 32.03 |
Ablation study on RefCOCO+:

| Settings | Val | Test A | Test B | Avg |
|---|---|---|---|---|
| w/o RL (ALBEF for FWPC) | 19.17 | 15.82 | 22.65 | 19.21 |
| MCC | 25.20 | 23.54 | 27.14 | 25.34 |
| fixed stride | 29.05 | 28.32 | 30.17 | 29.18 |
| CSRLA | 32.37 | 32.56 | 31.26 | 32.06 |
Ablation on the CAAM threshold (RefCOCO+):

| CAAM Threshold | Val | Test A | Test B | Avg |
|---|---|---|---|---|
| 0.11 | 26.52 | 28.76 | 27.71 | 27.66 |
| 0.13 | 31.64 | 32.48 | 30.69 | 31.60 |
| 0.15 | 32.37 | 32.56 | 31.26 | 32.06 |
| 0.17 | 30.22 | 29.83 | 28.20 | 29.42 |
| 0.20 | 24.14 | 27.54 | 25.35 | 25.68 |
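For intuition about what this threshold controls, here is a hedged sketch of turning a normalized cross-attention activation map (CAAM) into an initial salient region by binarizing at the best-performing value of 0.15. The min-max normalization and the box-of-all-activations rule are assumptions for illustration, not the paper's exact localization procedure (Section 3.4).

```python
# Hypothetical sketch: initial salient region from a thresholded CAAM.
import numpy as np

def initial_region_from_caam(caam: np.ndarray, threshold: float = 0.15):
    """Binarize a min-max-normalized activation map and return the bounding
    box of above-threshold activations as (x1, y1, x2, y2), or None."""
    norm = (caam - caam.min()) / (caam.max() - caam.min() + 1e-8)
    ys, xs = np.where(norm >= threshold)
    if len(xs) == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())

# Usage with a random map standing in for a real CAAM:
rng = np.random.default_rng(0)
print(initial_region_from_caam(rng.random((14, 14))))
```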
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Wang, Y.; Yue, L.; Li, M. Cascaded Searching Reinforcement Learning Agent for Proposal-Free Weakly-Supervised Phrase Comprehension. Electronics 2024, 13, 898. https://doi.org/10.3390/electronics13050898