A Survey of Robot Intelligence with Large Language Models
Abstract
1. Introduction
- This paper summarizes the foundational elements of LLM architectures and the methods used to tune them.
- It organizes prompting techniques that enhance the problem-solving abilities of LLMs.
- It reviews how LLMs and VLMs have been employed to augment robot intelligence across five topics, as shown in Figure 1: (1) reward design for reinforcement learning, (2) low-level control, (3) high-level planning, (4) manipulation, and (5) scene understanding; a minimal illustrative sketch of topic (1) follows this list.
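As a concrete illustration of topic (1), the sketch below shows the common pattern in which an LLM is asked to write a dense reward function as executable code, which is then compiled and evaluated on rollouts (in the spirit of Eureka or Text2Reward). The LLM call is stubbed with a fixed string, and the function name `reach_target_reward`, the state fields, and the prompt wording are hypothetical placeholders rather than the interface of any cited system.

```python
import math

# Hypothetical prompt an LLM would receive; systems such as Eureka or Text2Reward
# additionally include environment source code and task descriptions in the prompt.
PROMPT = (
    "Write a Python function reach_target_reward(state) that returns a dense reward "
    "for moving a gripper toward a target. state has fields ee_pos and target_pos."
)

# Stubbed LLM response (a real pipeline would query an LLM with PROMPT here).
LLM_RESPONSE = '''
def reach_target_reward(state):
    # Negative distance plus a bonus when the end effector is within 5 cm of the target.
    d = math.dist(state["ee_pos"], state["target_pos"])
    return -d + (1.0 if d < 0.05 else 0.0)
'''

# Compile the generated code into a callable reward function.
namespace = {"math": math}
exec(LLM_RESPONSE, namespace)
reward_fn = namespace["reach_target_reward"]

# Evaluate the generated reward on a toy rollout; a real pipeline would instead run RL
# training and feed the resulting statistics back to the LLM for iterative refinement.
rollout = [
    {"ee_pos": (0.30, 0.00, 0.20), "target_pos": (0.50, 0.10, 0.20)},
    {"ee_pos": (0.45, 0.08, 0.20), "target_pos": (0.50, 0.10, 0.20)},
    {"ee_pos": (0.50, 0.10, 0.20), "target_pos": (0.50, 0.10, 0.20)},
]
for step, state in enumerate(rollout):
    print(f"step {step}: reward = {reward_fn(state):.3f}")
```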
2. Review Protocol
- The titles and abstracts of the articles were reviewed to eliminate duplicates and irrelevant articles.
- The full texts of the selected articles from the first iteration were thoroughly examined and categorized.
- The article search began on 18 September 2023.
- Only publications after 2020 were considered.
- The search keywords combined robotics and language-model terms, i.e., ((“Robotic” OR “Robotics”) AND (“LLM” OR “LM” OR “Large Language Model” OR “Language Model”)), and only relevant journal and conference articles written in English were included, as sketched in the example following this list.
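A minimal sketch of how these inclusion criteria (publication year after 2020, Robotics AND language-model keyword match, English-language journal or conference articles) could be applied to candidate records. The record fields and sample entries are illustrative assumptions, not the actual screening tool used for this survey.

```python
import re

# Keyword groups from the review protocol: (Robotic OR Robotics) AND (LLM OR LM OR ...).
ROBOT_TERMS = ["robotic", "robotics"]
LM_TERMS = ["llm", "lm", "large language model", "language model"]

def matches(text: str, terms: list[str]) -> bool:
    """True if any term occurs as a whole word/phrase in the text (case-insensitive)."""
    return any(re.search(rf"\b{re.escape(t)}\b", text, re.IGNORECASE) for t in terms)

def include(record: dict) -> bool:
    """Apply the survey's inclusion criteria to one candidate record."""
    text = f'{record["title"]} {record["abstract"]}'
    return (
        record["year"] > 2020                      # publications after 2020
        and record["language"] == "English"
        and record["type"] in {"journal", "conference"}
        and matches(text, ROBOT_TERMS)
        and matches(text, LM_TERMS)
    )

# Illustrative candidate records (not taken from the actual search results).
candidates = [
    {"title": "LLM-based task planning for mobile robots", "abstract": "...",
     "year": 2023, "language": "English", "type": "conference"},
    {"title": "A statistical language model for speech", "abstract": "...",
     "year": 2019, "language": "English", "type": "journal"},
]
selected = [c for c in candidates if include(c)]
print([c["title"] for c in selected])
```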
3. Related Works
3.1. Language Model
3.2. LLM Architectures and Tunings
3.3. Prompt Techniques for Increasing LLM Performance
4. Language Models for Robotic Intelligence
4.1. Reward Design in Reinforcement Learning
4.2. Low-Level Control
4.3. High-Level Planning (Including Decision-Making and Reasoning)
4.4. Manipulation by LLMs
4.5. Scene Understanding in LLMs and VLMs
5. Discussion and Future Directions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 13 August 2024).
- Vemprala, S.H.; Bonatti, R.; Bucker, A.; Kapoor, A. ChatGPT for Robotics: Design Principles and Model Abilities. IEEE Access 2024, 12, 55682–55696. [Google Scholar] [CrossRef]
- Hu, Y.; Xie, Q.; Jain, V.; Francis, J.; Patrikar, J.; Keetha, N.; Kim, S.; Xie, Y.; Zhang, T.; Zhao, S.; et al. Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. arXiv 2023, arXiv:2312.08782. [Google Scholar]
- Xiao, X.; Liu, J.; Wang, Z.; Zhou, Y.; Qi, Y.; Cheng, Q.; He, B.; Jiang, S. Robot Learning in the Era of Foundation Models: A Survey. arXiv 2023, arXiv:2311.14379. [Google Scholar]
- Mao, Y.; Ge, Y.; Fan, Y.; Xu, W.; Mi, Y.; Hu, Z.; Gao, Y. A Survey on LoRA of Large Language Models. arXiv 2024, arXiv:2407.11046. [Google Scholar]
- Hunt, W.; Ramchurn, S.D.; Soorati, M.D. A Survey of Language-Based Communication in Robotics. arXiv 2024, arXiv:2406.04086. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. Proc. Mach. Learn. Res. 2021, 139, 8748–8763. [Google Scholar]
- Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Dabis, J.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; et al. RT-1: Robotics Transformer for Real-World Control at Scale. In Proceedings of the Robotics: Science and Systems 2023, Daegu, Republic of Korea, 10–14 July 2023. [Google Scholar] [CrossRef]
- Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Chen, X.; Choromanski, K.; Ding, T.; Driess, D.; Dubey, A.; Finn, C.; et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv 2023, arXiv:2307.15818. [Google Scholar]
- Ahn, M.; Dwibedi, D.; Finn, C.; Arenas, M.G.; Gopalakrishnan, K.; Hausman, K.; Ichter, B.; Irpan, A.; Joshi, N.; Julian, R.; et al. AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents. arXiv 2024, arXiv:2401.12963. [Google Scholar]
- Ma, Y.J.; Liang, W.; Wang, G.; Huang, D.-A.; Bastani, O.; Jayaraman, D.; Zhu, Y.; Fan, L.; Anandkumar, A. Eureka: Human-Level Reward Design via Coding Large Language Models. arXiv 2023, arXiv:2310.12931. [Google Scholar]
- Ma, Y.; Song, Z.; Zhuang, Y.; Hao, J.; King, I. A Survey on Vision-Language-Action Models for Embodied AI. arXiv 2024, arXiv:2405.14093. [Google Scholar]
- Zhou, H.; Yao, X.; Meng, Y.; Sun, S.; Bing, Z.; Huang, K.; Knoll, A. Language-Conditioned Learning for Robotic Manipulation: A Survey. arXiv 2023, arXiv:2312.10807. [Google Scholar]
- Firoozi, R.; Tucker, J.; Tian, S.; Majumdar, A.; Sun, J.; Liu, W.; Zhu, Y.; Song, S.; Kapoor, A.; Hausman, K.; et al. Foundation Models in Robotics: Applications, Challenges, and the Future. arXiv 2023, arXiv:2312.07843. [Google Scholar] [CrossRef]
- Gu, J.; Stefani, E.; Wu, Q.; Thomason, J.; Wang, X.E. Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. Proc. Annu. Meet. Assoc. Comput. Linguist. 2022, 1, 7606–7623. [Google Scholar] [CrossRef]
- Zhai, C. Statistical Language Models for Information Retrieval; Association for Computational Linguistics: Morristown, NJ, USA, 2007; Volume 94, ISBN 9781598295900. [Google Scholar]
- Gao, J.; Lin, C.Y. Introduction to the Special Issue on Statistical Language Modeling. ACM Trans. Asian Lang. Inf. Process. 2004, 3, 87–93. [Google Scholar] [CrossRef]
- Rosenfeld, R. Two Decades of Statistical Language Modeling: Where Do We Go from Here? Proc. IEEE 2000, 88, 1270–1275. [Google Scholar] [CrossRef]
- Gondala, S.; Verwimp, L.; Pusateri, E.; Tsagkias, M.; Van Gysel, C. Error-Driven Pruning of Language Models for Virtual Assistants. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7413–7417. [Google Scholar] [CrossRef]
- Liu, X.; Croft, W.B. Statistical Language Modeling for Information Retrieval. Annu. Rev. Inf. Sci. Technol. 2005, 39, 1–31. [Google Scholar] [CrossRef]
- Thede, S.M.; Harper, M.P. A Second-Order Hidden Markov Model for Part-of-Speech Tagging. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, College Park, MD, USA, 20–26 June 1999; Association for Computational Linguistics: Morristown, NJ, USA, 1999; pp. 175–182. [Google Scholar]
- Bahl, L.R.; Brown, P.F.; De Souza, P.V.; Mercer, R.L. A Tree-Based Statistical Language Model for Natural Language Speech Recognition. IEEE Trans. Acoust. 1989, 37, 1001–1008. [Google Scholar] [CrossRef]
- Brants, T.; Popat, A.C.; Xu, P.; Och, F.J.; Dean, J. Large Language Models in Machine Translation. In Proceedings of the EMNLP-CoNLL 2007-Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 28–30 June 2007; Volume 1, pp. 858–867. [Google Scholar]
- Popov, M.; Kulnitskiy, B.; Perezhogin, I.; Mordkovich, V.; Ovsyannikov, D.; Perfilov, S.; Borisova, L.; Blank, V. Catalytic 3D Polymerization of C60. Fuller. Nanotub. Carbon Nanostruct. 2018, 26, 465–470. [Google Scholar] [CrossRef]
- Mikolov, T.; Karafiát, M.; Burget, L.; Jan, C.; Khudanpur, S. Recurrent Neural Network Based Language Model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Chiba, Japan, 26–30 September 2010; pp. 1045–1048. [Google Scholar]
- Kombrink, S.; Mikolov, T.; Karafiát, M.; Burget, L. Recurrent Neural Network Based Language Modeling in Meeting Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Florence, Italy, 27–31 August 2011; pp. 2877–2880. [Google Scholar]
- Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural Language Processing (Almost) from Scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013-Workshop Track Proceedings, Scottsdale, AZ, USA, 2–4 May 2013; pp. 1–12. [Google Scholar]
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the NAACL HLT 2018—2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 14 February 2018; Volume 1, pp. 2227–2237. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL HLT 2019—2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 10 October 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999–6009. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. Proc. Annu. Meet. Assoc. Comput. Linguist. 2020, 58, 7871–7880. [Google Scholar] [CrossRef]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 1–40. [Google Scholar]
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. Language Models Are Unsupervised Multitask Learners. arXiv 2021, arXiv:2109.08270. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Sanh, V.; Webson, A.; Raffel, C.; Bach, S.H.; Sutawika, L.; Alyafeai, Z.; Chaffin, A.; Stiegler, A.; Le Scao, T.; Raja, A.; et al. Multitask Prompted Training Enables Zero-Shot Task Generalization. In Proceedings of the ICLR 2022—10th International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Wang, T.; Roberts, A.; Hesslow, D.; Le Scao, T.; Chung, H.W.; Beltagy, I.; Launay, J.; Raffel, C. What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? Proc. Mach. Learn. Res. 2022, 162, 22964–22984. [Google Scholar]
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. [Google Scholar]
- Shanahan, M. Talking about Large Language Models. Commun. ACM 2024, 67, 68–79. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
- Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 30016–30030. [Google Scholar]
- Taylor, R.; Kardas, M.; Cucurull, G.; Scialom, T.; Hartshorn, A.; Saravia, E.; Poulton, A.; Kerkez, V.; Stojnic, R. Galactica: A Large Language Model for Science. arXiv 2022, arXiv:2211.09085. [Google Scholar]
- Fausk, H.; Isaksen, D.C. T-Model Structures. Homol. Homotopy Appl. 2007, 9, 399–438. [Google Scholar] [CrossRef]
- Groeneveld, D.; Beltagy, I.; Walsh, P.; Bhagia, A.; Kinney, R.; Tafjord, O.; Jha, A.H.; Ivison, H.; Magnusson, I.; Wang, Y.; et al. OLMo: Accelerating the Science of Language Models. Allen Inst. Artif. Intell. 2024, 62, 15789–15809. [Google Scholar]
- Lozhkov, A.; Li, R.; Allal, L.B.; Cassano, F.; Lamy-Poirier, J.; Tazi, N.; Tang, A.; Pykhtar, D.; Liu, J.; Wei, Y.; et al. StarCoder 2 and The Stack v2: The Next Generation. arXiv 2024, arXiv:2402.19173. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- The Claude 3 Model Family: Opus, Sonnet, Haiku. Available online: https://api.semanticscholar.org/CorpusID:268232499 (accessed on 13 August 2024).
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 35, 1877–1901. [Google Scholar]
- Cai, Z.; Cao, M.; Chen, H.; Chen, K.; Chen, K.; Chen, X.; Chen, X.; Chen, Z.; Chen, Z.; Chu, P.; et al. InternLM2 Technical Report. arXiv 2024, arXiv:2403.17297. [Google Scholar]
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar]
- Lieber, O.; Lenz, B.; Bata, H.; Cohen, G.; Osin, J.; Dalmedigos, I.; Safahi, E.; Meirom, S.; Belinkov, Y.; Shalev-Shwartz, S.; et al. Jamba: A Hybrid Transformer-Mamba Language Model. arXiv 2024, arXiv:2403.19887. [Google Scholar]
- Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. In Proceedings of the ICLR 2022—10th International Conference on Learning Representations, Virtual, 25–29 April 2022; pp. 1–46. [Google Scholar]
- Pinnaparaju, N.; Adithyan, R.; Phung, D.; Tow, J.; Baicoianu, J.; Datta, A.; Zhuravinskyi, M.; Mahan, D.; Bellagente, M.; Riquelme, C.; et al. Stable Code Technical Report. arXiv 2024, arXiv:2404.01226. [Google Scholar]
- Yoo, K.M.; Han, J.; In, S.; Jeon, H.; Jeong, J.; Kang, J.; Kim, H.; Kim, K.-M.; Kim, M.; Kim, S.; et al. HyperCLOVA X Technical Report. arXiv 2024, arXiv:2404.01954. [Google Scholar]
- Rae, J.W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoffmann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Young, S.; et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv 2021, arXiv:2112.11446. [Google Scholar]
- Grok-1.5 Vision Preview. Available online: https://x.ai/blog/grok-1.5v (accessed on 13 August 2024).
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
- Vallabh Shrimangale Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Available online: https://medium.com/@shrimangalevallabh789/introducing-meta-llama-3-the-most-capable-openly-available-llm-to-date-12de163151e1 (accessed on 13 August 2024).
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar]
- Abdin, M.; Jacobs, S.A.; Awan, A.A.; Aneja, J.; Awadallah, A.; Awadalla, H.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219. [Google Scholar]
- Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open Pre-Trained Transformer Language Models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Claude 3.5 Sonnet. Available online: https://www.anthropic.com/news/claude-3-5-sonnet (accessed on 13 August 2024).
- Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T. Alpaca: A Strong, Replicable Instruction-Following Model. Available online: https://crfm.stanford.edu/2023/03/13/alpaca.html (accessed on 13 August 2024).
- GPT-4o Mini: Advancing Cost-Efficient Intelligence. Available online: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed on 13 August 2024).
- Malartic, Q.; Chowdhury, N.R.; Cojocaru, R.; Farooq, M.; Campesan, G.; Djilali, Y.A.D.; Narayan, S.; Singh, A.; Velikanov, M.; Boussaha, B.E.A.; et al. Falcon2-11B Technical Report. Available online: https://huggingface.co/tiiuae/falcon-11B (accessed on 13 August 2024).
- Li, R.; Allal, L.B.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; et al. StarCoder: May the Source Be with You! arXiv 2023, arXiv:2305.06161. [Google Scholar]
- Introducing Llama 3.1: Our Most Capable Models to Date. Available online: https://ai.meta.com/blog/meta-llama-3-1/ (accessed on 13 August 2024).
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Mistral AI Mistral Large. Available online: https://mistral.ai/news/mistral-large/?utm_source=www.turingpost.com&utm_medium=referral&utm_campaign=the-ultimate-guide-to-llm-benchmarks-evaluating-language-model-capabilities (accessed on 13 August 2024).
- Yang, A.; Xiao, B.; Wang, B.; Zhang, B.; Bian, C.; Yin, C.; Lv, C.; Pan, D.; Wang, D.; Yan, D.; et al. Baichuan 2: Open Large-Scale Language Models. arXiv 2023, arXiv:2309.10305. [Google Scholar]
- Gemma Team, Google DeepMind. Gemma 2: Improving Open Language Models at a Practical Size. arXiv 2024, arXiv:2408.00118. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
- An, S.; Bae, K.; Choi, E.; Choi, S.J.; Choi, Y.; Hong, S.; Hong, Y.; Hwang, J.; Jeon, H.; Gerrard, J.J.; et al. EXAONE 3.0 7.8B Instruction Tuned Language Model. arXiv 2024, arXiv:2408.03541. [Google Scholar]
- Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.K.; et al. DeepSeek-Coder: When the Large Language Model Meets Programming—The Rise of Code Intelligence. arXiv 2024, arXiv:2401.14196. [Google Scholar]
- Grok-2 Beta Release. Available online: https://x.ai/blog/grok-2 (accessed on 13 August 2024).
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. Lect. Notes Comput. Sci. 2020, 12346, 213–229. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. Available online: https://github.com/haotian-liu/LLaVA (accessed on 13 August 2024).
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers & Distillation through Attention. arXiv 2021, arXiv:2012.12877. [Google Scholar]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. Available online: http://arxiv.org/abs/2304.10592 (accessed on 13 August 2024).
- Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. arXiv 2021, arXiv:2102.12092. [Google Scholar]
- GPT-4 System Card. Available online: https://cdn.openai.com/papers/gpt-4-system-card.pdf (accessed on 13 August 2024).
- Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. arXiv 2023, arXiv:2311.06242. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. Available online: https://github.com/microsoft/Swin-Transformer (accessed on 13 August 2024).
- Bar-Tal, O.; Chefer, H.; Tov, O.; Herrmann, C.; Paiss, R.; Zada, S.; Ephrat, A.; Hur, J.; Liu, G.; Raj, A.; et al. Lumiere: A Space-Time Diffusion Model for Video Generation. arXiv 2024, arXiv:2401.12945. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 15, 12077–12090. [Google Scholar]
- Adept Fuyu-Heavy: A New Multimodal Model. Available online: https://www.adept.ai/blog/adept-fuyu-heavy (accessed on 13 August 2024).
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16X16 Words: Transformers for Image Recognition At Scale. In Proceedings of the ICLR 2021—9th International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
- Gemini Team; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context. arXiv 2024, arXiv:2403.05530. [Google Scholar]
- Bao, H.; Dong, L.; Piao, S.; Wei, F. Beit: Bert Pre-Training of Image Transformers. In Proceedings of the ICLR 2022—10th International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Dong, X.; Zhang, P.; Zang, Y.; Cao, Y.; Wang, B.; Ouyang, L.; Zhang, S.; Duan, H.; Zhang, W.; Li, Y.; et al. InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD. arXiv 2024, arXiv:2404.06512. [Google Scholar]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollar, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15979–15988. [Google Scholar]
- Introducing Idefics2: A Powerful 8B Vision-Language Model for the Community. Available online: https://huggingface.co/blog/idefics2 (accessed on 13 August 2024).
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. Available online: https://github.com/CompVis/latent-diffusion (accessed on 13 August 2024).
- Laurençon, H.; Tronchon, L.; Cord, M.; Sanh, V. What Matters When Building Vision-Language Models? arXiv 2024, arXiv:2405.02246. [Google Scholar] [CrossRef]
- Nair, S.; Rajeswaran, A.; Kumar, V.; Finn, C.; Gupta, A. R3M: A Universal Visual Representation for Robot Manipulation. arXiv 2022, arXiv:2203.12601. [Google Scholar]
- Chameleon Team. Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv 2024, arXiv:2405.09818. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models. Available online: https://github.com/salesforce/LAVIS/tree/main/projects/blip2 (accessed on 13 August 2024).
- Beyer, L.; Steiner, A.; Pinto, A.S.; Kolesnikov, A.; Wang, X.; Salz, D.; Neumann, M.; Alabdulmohsin, I.; Tschannen, M.; Bugliarello, E.; et al. PaliGemma: A Versatile 3B VLM for Transfer. arXiv 2024, arXiv:2407.07726. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. Available online: https://segment-anything.com (accessed on 13 August 2024).
- Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Raedle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
- Qwen2-VL: To See the World More Clearly. Available online: https://qwenlm.github.io/blog/qwen2-vl/ (accessed on 1 September 2024).
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416. [Google Scholar]
- Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified Language Model Pre-Training for Natural Language Understanding and Generation. Adv. Neural Inf. Process. Syst. 2019, 32, 13063–13075. [Google Scholar]
- Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. GLM-130B: An Open Bilingual Pre-Trained Model. arXiv 2022, arXiv:2210.02414. [Google Scholar]
- Tay, Y.; Wei, J.; Chung, H.W.; Tran, V.Q.; So, D.R.; Shakeri, S.; Garcia, X.; Zheng, H.S.; Rao, J.; Chowdhery, A.; et al. Transcending Scaling Laws with 0.1% Extra Compute. In Proceedings of the EMNLP 2023—2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 1471–1486. [Google Scholar]
- Kaufmann, T.; Weng, P.; Bengs, V.; Hüllermeier, E. A Survey of Reinforcement Learning from Human Feedback. arXiv 2023, arXiv:2312.14925. [Google Scholar]
- Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv 2023, arXiv:2305.18290. [Google Scholar]
- Han, Z.; Gao, C.; Liu, J.; Zhang, J.; Zhang, S.Q. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. arXiv 2024, arXiv:2403.14608. [Google Scholar]
- Houlsby, N.; Giurgiu, A.; Jastrzçbski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; pp. 4944–4953. [Google Scholar]
- Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.P.; Bing, L.; Xu, X.; Poria, S.; Lee, R.K.W. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. In Proceedings of the EMNLP 2023—2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 5254–5276. [Google Scholar]
- Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the EMNLP 2021—2021 Conference on Empirical Methods in Natural Language Processing, Virtual Event, Punta Cana, 7–11 November 2021; pp. 3045–3059. [Google Scholar]
- Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT Understands, Too. AI Open, 2023; in press. [Google Scholar] [CrossRef]
- Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the ACL-IJCNLP 2021—59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual Event, 1–6 August 2021; pp. 4582–4597. [Google Scholar]
- Hu, E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the ICLR 2022—10th International Conference on Learning Representations, Virtual, 25–29 April 2022; pp. 1–26. [Google Scholar]
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115. [Google Scholar]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
- Zhang, Z.; Zhang, A.; Li, M.; Zhao, H.; Karypis, G.; Smola, A. Multimodal Chain-of-Thought Reasoning in Language Models. arXiv 2023, arXiv:2302.00923. [Google Scholar]
- Diao, S.; Wang, P.; Lin, Y.; Zhang, T. Active Prompting with Chain-of-Thought for Large Language Models. arXiv 2023, arXiv:2302.12246. [Google Scholar]
- Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-Aided Language Models. Proc. Mach. Learn. Res. 2023, 202, 10764–10799. [Google Scholar]
- Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822. [Google Scholar]
- Trautmann, D. Large Language Model Prompt Chaining for Long Legal Document Classification. arXiv 2023, arXiv:2308.04138. [Google Scholar]
- Liu, J.; Liu, A.; Lu, X.; Welleck, S.; West, P.; Le Bras, R.; Choi, Y.; Hajishirzi, H. Generated Knowledge Prompting for Commonsense Reasoning. Proc. Annu. Meet. Assoc. Comput. Linguist. 2022, 1, 3154–3169. [Google Scholar] [CrossRef]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Paranjape, B.; Lundberg, S.; Singh, S.; Hajishirzi, H.; Zettlemoyer, L.; Ribeiro, M.T. ART: Automatic Multi-Step Reasoning and Tool-Use for Large Language Models. arXiv 2023, arXiv:2303.09014. [Google Scholar]
- Zhou, Y.; Muresanu, A.I.; Han, Z.; Paster, K.; Pitis, S.; Chan, H.; Ba, J. Large Language Models Are Human-Level Prompt Engineers. arXiv 2022, arXiv:2211.01910. [Google Scholar]
- Li, Z.; Peng, B.; He, P.; Galley, M.; Gao, J.; Yan, X. Guiding Large Language Models via Directional Stimulus Prompting. Adv. Neural Inf. Process. Syst. 2023, 36, 62630–62656. [Google Scholar]
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv 2022, arXiv:2210.03629. [Google Scholar]
- Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. Adv. Neural Inf. Process. Syst. 2023, 36, 8634–8652. [Google Scholar]
- Mittal, M.; Yu, C.; Yu, Q.; Liu, J.; Rudin, N.; Hoeller, D.; Yuan, J.L.; Singh, R.; Guo, Y.; Mazhar, H.; et al. Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments. IEEE Robot. Autom. Lett. 2023, 8, 3740–3747. [Google Scholar] [CrossRef]
- Ma, Y.J.; Liang, W.; Wang, H.-J.; Wang, S.; Zhu, Y.; Fan, L.; Bastani, O.; Jayaraman, D. DrEureka: Language Model Guided Sim-To-Real Transfer. arXiv 2024, arXiv:2406.01967. [Google Scholar]
- Zhao, W.; Queralta, J.P.; Westerlund, T. Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence, SSCI 2020, Canberra, Australia, 1–4 December 2020; pp. 737–744. [Google Scholar]
- Xie, T.; Zhao, S.; Wu, C.H.; Liu, Y.; Luo, Q.; Zhong, V.; Yang, Y.; Yu, T. Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning. arXiv 2023, arXiv:2309.11489. [Google Scholar]
- Di Palo, N.; Byravan, A.; Hasenclever, L.; Wulfmeier, M.; Heess, N.; Riedmiller, M. Towards A Unified Agent with Foundation Models. arXiv 2023, arXiv:2307.09668. [Google Scholar]
- Du, Y.; Konyushkova, K.; Denil, M.; Raju, A.; Landon, J.; Hill, F.; De Freitas, N.; Cabi, S. Vision-Language Models As Success Detectors. Proc. Mach. Learn. Res. 2023, 232, 120–136. [Google Scholar]
- Du, Y.; Watkins, O.; Wang, Z.; Colas, C.; Darrell, T.; Abbeel, P.; Gupta, A.; Andreas, J. Guiding Pretraining in Reinforcement Learning with Large Language Models. Proc. Mach. Learn. Res. 2023, 202, 8657–8677. [Google Scholar]
- Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An Embodied Multimodal Language Model. Proc. Mach. Learn. Res. 2023, 202, 8469–8488. [Google Scholar]
- Chen, X.; Djolonga, J.; Padlewski, P.; Mustafa, B.; Changpinyo, S.; Wu, J.; Ruiz, C.R.; Goodman, S.; Wang, X.; Tay, Y.; et al. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv 2023, arXiv:2305.18565. [Google Scholar]
- Asimov, I. Runaround. Astounding Sci. Fict. 1942, 29, 94–103. [Google Scholar]
- Jang, E.; Irpan, A.; Khansari, M.; Kappler, D.; Ebert, F.; Lynch, C.; Levine, S.; Finn, C. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. Proc. Mach. Learn. Res. 2021, 164, 991–1002. [Google Scholar]
- Tang, Y.; Yu, W.; Tan, J.; Zen, H.; Faust, A.; Harada, T. SayTap: Language to Quadrupedal Locomotion. Proc. Mach. Learn. Res. 2023, 229, 3556–3570. [Google Scholar]
- Mandi, Z.; Jain, S.; Song, S. RoCo: Dialectic Multi-Robot Collaboration with Large Language Models. arXiv 2023, arXiv:2307.04738. [Google Scholar]
- Wang, Y.-J.; Zhang, B.; Chen, J.; Sreenath, K. Prompt a Robot to Walk with Large Language Models. arXiv 2023, arXiv:2309.09969. [Google Scholar]
- Liang, J.; Huang, W.; Xia, F.; Xu, P.; Hausman, K.; Ichter, B.; Florence, P.; Zeng, A. Code as Policies: Language Model Programs for Embodied Control. In Proceedings of the IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 9493–9500. [Google Scholar]
- Mirchandani, S.; Xia, F.; Florence, P.; Ichter, B.; Driess, D.; Arenas, M.G.; Rao, K.; Sadigh, D.; Zeng, A. Large Language Models as General Pattern Machines. Proc. Mach. Learn. Res. 2023, 229. [Google Scholar]
- Yoneda, T.; Fang, J.; Li, P.; Zhang, H.; Jiang, T.; Lin, S.; Picker, B.; Yunis, D.; Mei, H.; Walter, M.R. Statler: State-Maintaining Language Models for Embodied Reasoning. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 1–19. [Google Scholar] [CrossRef]
- Mu, Y.; Zhang, Q.; Hu, M.; Wang, W.; Ding, M.; Jin, J.; Wang, B.; Dai, J.; Qiao, Y.; Luo, P. EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought. Adv. Neural Inf. Process. Syst. 2023, 36, 25081–25094. [Google Scholar]
- Chen, H.; Tan, H.; Kuntz, A.; Bansal, M.; Alterovitz, R. Enabling Robots to Understand Incomplete Natural Language Instructions Using Commonsense Reasoning. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 1963–1969. [Google Scholar]
- Huang, W.; Xia, F.; Shah, D.; Driess, D.; Zeng, A.; Lu, Y.; Florence, P.; Mordatch, I.; Levine, S.; Hausman, K.; et al. Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents. Adv. Neural Inf. Process. Syst. 2023, 36, 59636–59661. [Google Scholar]
- Huang, W.; Xia, F.; Xiao, T.; Chan, H.; Liang, J.; Florence, P.; Zeng, A.; Tompson, J.; Mordatch, I.; Chebotar, Y.; et al. Inner Monologue: Embodied Reasoning through Planning with Language Models. Proc. Mach. Learn. Res. 2023, 205, 1769–1782. [Google Scholar]
- Lykov, A.; Tsetserukou, D. LLM-BRAIn: AI-Driven Fast Generation of Robot Behaviour Tree Based on Large Language Model. arXiv 2023, arXiv:2305.19352. [Google Scholar]
- Song, C.H.; Sadler, B.M.; Wu, J.; Chao, W.L.; Washington, C.; Su, Y. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. In Proceedings of the IEEE International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2986–2997. [Google Scholar]
- Singh, I.; Blukis, V.; Mousavian, A.; Goyal, A.; Xu, D.; Tremblay, J.; Fox, D.; Thomason, J.; Garg, A. ProgPrompt: Generating Situated Robot Task Plans Using Large Language Models. In Proceedings of the IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 11523–11530. [Google Scholar]
- Rana, K.; Haviland, J.; Garg, S.; Abou-Chakra, J.; Reid, I.; Sünderhauf, N. SayPlan: Grounding Large Language Models Using 3D Scene Graphs for Scalable Robot Task Planning. Proc. Mach. Learn. Res. 2023, 229. [Google Scholar]
- Zeng, A.; Attarian, M.; Ichter, B.; Choromanski, K.; Wong, A.; Welker, S.; Tombari, F.; Purohit, A.; Ryoo, M.; Sindhwani, V.; et al. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. arXiv 2022, arXiv:2204.00598. [Google Scholar]
- Lin, K.; Agia, C.; Migimatsu, T.; Pavone, M.; Bohg, J. Text2Motion: From Natural Language Instructions to Feasible Plans. Auton. Robots 2023, 47, 1345–1365. [Google Scholar] [CrossRef]
- Wu, J.; Antonova, R.; Kan, A.; Lepert, M.; Zeng, A.; Song, S.; Bohg, J.; Rusinkiewicz, S.; Funkhouser, T. TidyBot: Personalized Robot Assistance with Large Language Models. Auton. Robots 2023, 47, 1087–1102. [Google Scholar] [CrossRef]
- Stone, A.; Xiao, T.; Lu, Y.; Gopalakrishnan, K.; Lee, K.H.; Vuong, Q.; Wohlhart, P.; Kirmani, S.; Zitkovich, B.; Xia, F.; et al. Open-World Object Manipulation Using Pre-Trained Vision-Language Models. Proc. Mach. Learn. Res. 2023, 229, 1–18. [Google Scholar]
- Gao, J.; Sarkar, B.; Xia, F.; Xiao, T.; Wu, J.; Ichter, B.; Majumdar, A.; Sadigh, D. Physically Grounded Vision-Language Models for Robotic Manipulation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar] [CrossRef]
- Wang, R.; Mao, J.; Hsu, J.; Zhao, H.; Wu, J.; Gao, Y. Programmatically Grounded, Compositionally Generalizable Robotic Manipulation. arXiv 2023, arXiv:2304.13826. [Google Scholar]
- Ha, H.; Florence, P.; Song, S. Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition. Proc. Mach. Learn. Res. 2023, 229. [Google Scholar]
- Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; Fei-Fei, L. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. Proc. Mach. Learn. Res. 2023, 229. [Google Scholar]
- Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Fu, C.; Gopalakrishnan, K.; Hausman, K.; et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. Proc. Mach. Learn. Res. 2023, 205, 287–318. [Google Scholar]
- Huang, S.; Jiang, Z.; Dong, H.; Qiao, Y.; Gao, P.; Li, H. Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model. arXiv 2023, arXiv:2305.11176. [Google Scholar]
- Chen, W.; Hu, S.; Talak, R.; Carlone, L. Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding. arXiv 2022, arXiv:2209.05629. [Google Scholar]
- Yang, J.; Chen, X.; Qian, S.; Madaan, N.; Iyengar, M.; Fouhey, D.F.; Chai, J. LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar] [CrossRef]
- Chen, B.; Xia, F.; Ichter, B.; Rao, K.; Gopalakrishnan, K.; Ryoo, M.S.; Stone, A.; Kappler, D. Open-Vocabulary Queryable Scene Representations for Real World Planning. In Proceedings of the IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 11509–11522. [Google Scholar]
- Elhafsi, A.; Sinha, R.; Agia, C.; Schmerling, E.; Nesnas, I.A.D.; Pavone, M. Semantic Anomaly Detection with Large Language Models. Auton. Robots 2023, 47, 1035–1055. [Google Scholar] [CrossRef]
- Hong, Y.; Zhen, H.; Chen, P.; Zheng, S.; Du, Y.; Chen, Z.; Gan, C. 3D-LLM: Injecting the 3D World into Large Language Models. Adv. Neural Inf. Process. Syst. 2023, 36, 20482–20494. [Google Scholar]
- Shah, D.; Osinski, B.; Ichter, B.; Levine, S.; Osiński, B.; Ichter, B.; Levine, S. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action. Proc. Mach. Learn. Res. 2023, 205, 492–504. [Google Scholar]
- Zhou, G.; Hong, Y.; Wu, Q. NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7641–7649. [Google Scholar]
- Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Visual Language Maps for Robot Navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 10608–10615. [Google Scholar]
- Triantafyllidis, E.; Christianos, F.; Li, Z. Intrinsic Language-Guided Exploration for Complex Long-Horizon Robotic Manipulation Tasks. arXiv 2023, arXiv:2309.16347. [Google Scholar]
- Yu, W.; Gileadi, N.; Fu, C.; Kirmani, S.; Lee, K.H.; Arenas, M.G.; Chiang, H.T.L.; Erez, T.; Hasenclever, L.; Humplik, J.; et al. Language to Rewards for Robotic Skill Synthesis. Proc. Mach. Learn. Res. 2023, 229. [Google Scholar]
- Perez, J.; Proux, D.; Roux, C.; Niemaz, M. LARG, Language-Based Automatic Reward and Goal Generation. arXiv 2023, arXiv:2306.10985. [Google Scholar]
- Song, J.; Zhou, Z.; Liu, J.; Fang, C.; Shu, Z.; Ma, L. Self-Refined Large Language Model as Automated Reward Function Designer for Deep Reinforcement Learning in Robotics. arXiv 2023, arXiv:2309.06687. [Google Scholar]
- Mahmoudieh, P.; Pathak, D.; Darrell, T. Zero-Shot Reward Specification via Grounded Natural Language. In Proceedings of Machine Learning Research, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 14743–14752. [Google Scholar]
- Park, J.; Lim, S.; Lee, J.; Park, S.; Chang, M.; Yu, Y.; Choi, S. CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents. IEEE Robot. Autom. Lett. 2024, 9, 1059–1066. [Google Scholar] [CrossRef]
- Wake, N.; Kanehira, A.; Sasabuchi, K.; Takamatsu, J.; Ikeuchi, K. ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application. IEEE Access 2023, 11, 95060–95078. [Google Scholar] [CrossRef]
- Palnitkar, A.; Kapu, R.; Lin, X.; Liu, C.; Karapetyan, N.; Aloimonos, Y. ChatSim: Underwater Simulation with Natural Language Prompting. In Proceedings of the Oceans Conference Record (IEEE), Biloxi, MS, USA, 25–28 September 2023. [Google Scholar]
- Yang, R.; Hou, M.; Wang, J.; Zhang, F. OceanChat: Piloting Autonomous Underwater Vehicles in Natural Language. arXiv 2023, arXiv:2309.16052. [Google Scholar]
- Lin, B.Y.; Huang, C.; Liu, Q.; Gu, W.; Sommerer, S.; Ren, X. On Grounded Planning for Embodied Tasks with Language Models. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13192–13200. [Google Scholar]
- Dai, Z.; Asgharivaskasi, A.; Duong, T.; Lin, S.; Tzes, M.-E.; Pappas, G.; Atanasov, N. Optimal Scene Graph Planning with Large Language Model Guidance. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar] [CrossRef]
- Yang, Z.; Raman, S.S.; Shah, A.; Tellex, S. Plug in the Safety Chip: Enforcing Constraints for LLM-Driven Robot Agents. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar] [CrossRef]
- Sun, J.; Zhang, Q.; Duan, Y.; Jiang, X.; Cheng, C.; Xu, R. Prompt, Plan, Perform: LLM-Based Humanoid Control via Quantized Imitation Learning. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar] [CrossRef]
- Liu, Z.; Bahety, A.; Song, S. REFLECT: Summarizing Robot Experiences for FaiLure Explanation and CorrecTion. Proc. Mach. Learn. Res. 2023, 229. [Google Scholar]
- Cao, Y.; Lee, C.S.G. Robot Behavior-Tree-Based Task Generation with Large Language Models. CEUR Workshop Proc. 2023, 3433. [Google Scholar]
- Zhen, Y.; Bi, S.; Xing-tong, L.; Wei-qin, P.; Hai-peng, S.; Zi-rui, C.; Yi-shu, F. Robot Task Planning Based on Large Language Model Representing Knowledge with Directed Graph Structures. arXiv 2023, arXiv:2306.05171. [Google Scholar]
- You, H.; Ye, Y.; Zhou, T.; Zhu, Q.; Du, J. Robot-Enabled Construction Assembly with Automated Sequence Planning Based on ChatGPT: RoboGPT. Buildings 2023, 13, 1772. [Google Scholar] [CrossRef]
- Ren, A.Z.; Dixit, A.; Bodrova, A.; Singh, S.; Tu, S.; Brown, N.; Xu, P.; Takayama, L.; Xia, F.; Varley, J.; et al. Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. Proc. Mach. Learn. Res. 2023, 229. [Google Scholar]
- Chen, Y.; Arkin, J.; Zhang, Y.; Roy, N.; Fan, C. Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [CrossRef]
- Kannan, S.S.; Venkatesh, V.L.N.; Min, B.-C. SMART-LLM: Smart Multi-Agent Robot Task Planning Using Large Language Models. arXiv 2023, arXiv:2309.10062. [Google Scholar]
- Ding, Y.; Zhang, X.; Paxton, C.; Zhang, S. Task and Motion Planning with Large Language Models for Object Rearrangement. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Detroit, MI, USA, 1–5 October 2023; pp. 2086–2092. [Google Scholar]
- Chen, Y.; Arkin, J.; Dawson, C.; Zhang, Y.; Roy, N.; Fan, C. AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar] [CrossRef]
- Shafiullah, N.M.; Cui, Z.J.; Altanzaya, A.; Pinto, L. Behavior Transformers: Cloning k Modes with One Stone. Adv. Neural Inf. Process. Syst. 2022, 35, 22955–22968. [Google Scholar]
- Zhao, X.; Li, M.; Weber, C.; Hafez, M.B.; Wermter, S. Chat with the Environment: Interactive Multimodal Perception Using Large Language Models. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Detroit, MI, USA, 1–5 October 2023; pp. 3590–3596. [Google Scholar]
- Guo, Y.; Wang, Y.-J.; Zha, L.; Jiang, Z.; Chen, J. DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment. arXiv 2023, arXiv:2307.00329. [Google Scholar]
- Kim, G.; Kim, T.; Kannan, S.S.; Venkatesh, V.L.N.; Kim, D.; Min, B.-C. DynaCon: Dynamic Robot Planner with Contextual Awareness via LLMs. arXiv 2023, arXiv:2309.16031. [Google Scholar]
- Dagan, G.; Keller, F.; Lascarides, A. Dynamic Planning with a LLM. arXiv 2023, arXiv:2308.06391. [Google Scholar]
- Wu, Z.; Wang, Z.; Xu, X.; Lu, J.; Yan, H. Embodied Task Planning with Large Language Models. arXiv 2023, arXiv:2307.01848. [Google Scholar]
- Gkanatsios, N.; Jain, A.; Xian, Z.; Zhang, Y.; Atkeson, C.; Fragkiadaki, K. Energy-Based Models Are Zero-Shot Planners for Compositional Scene Rearrangement. In Proceedings of the Robotics: Science and Systems 2023, Daegu, Republic of Korea, 10–14 July 2023. [Google Scholar] [CrossRef]
- Ni, Z.; Deng, X.; Tai, C.; Zhu, X.; Xie, Q.; Huang, W.; Wu, X.; Zeng, L. GRID: Scene-Graph-Based Instruction-Driven Robotic Task Planning. arXiv 2023, arXiv:2309.07726. [Google Scholar]
- Ming, C.; Lin, J.; Fong, P.; Wang, H.; Duan, X.; He, J. HiCRISP: A Hierarchical Closed-Loop Robotic Intelligent Self-Correction Planner. arXiv 2023, arXiv:2309.12089. [Google Scholar]
- Ding, Y.; Zhang, X.; Amiri, S.; Cao, N.; Yang, H.; Kaminski, A.; Esselink, C.; Zhang, S. Integrating Action Knowledge and LLMs for Task Planning and Situation Handling in Open Worlds. Auton. Robots 2023, 47, 981–997. [Google Scholar] [CrossRef]
- Jin, C.; Tan, W.; Yang, J.; Liu, B.; Song, R.; Wang, L.; Fu, J. AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation. arXiv 2023, arXiv:2305.18898. [Google Scholar]
- Cui, Y.; Niekum, S.; Gupta, A.; Kumar, V.; Rajeswaran, A. Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation? Proc. Mach. Learn. Res. 2022, 168, 893–905. [Google Scholar]
- Tang, C.; Huang, D.; Ge, W.; Liu, W.; Zhang, H. GraspGPT: Leveraging Semantic Knowledge From a Large Language Model for Task-Oriented Grasping. IEEE Robot. Autom. Lett. 2023, 8, 7551–7558. [Google Scholar] [CrossRef]
- Parakh, M.; Fong, A.; Simeonov, A.; Gupta, A.; Chen, T.; Agrawal, P. Human-Assisted Continual Robot Learning with Foundation Models. arXiv 2023, arXiv:2309.14321. [Google Scholar]
- Bucker, A.; Figueredo, L.; Haddadin, S.; Kapoor, A.; Ma, S.; Vemprala, S.; Bonatti, R. LATTE: LAnguage Trajectory TransformEr. In Proceedings of the IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 7287–7294. [Google Scholar]
- Ren, P.; Zhang, K.; Zheng, H.; Li, Z.; Wen, Y.; Zhu, F.; Ma, M.; Liang, X. RM-PRT: Realistic Robotic Manipulation Simulator and Benchmark with Progressive Reasoning Tasks. arXiv 2023, arXiv:2306.11335. [Google Scholar]
- Xiao, T.; Chan, H.; Sermanet, P.; Wahid, A.; Brohan, A.; Hausman, K.; Levine, S.; Tompson, J. Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models. Proc. Mach. Learn. Res. 2023, 19, 1–18. [Google Scholar] [CrossRef]
- Wang, T.; Li, Y.; Lin, H.; Xue, X.; Fu, Y. WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model. arXiv 2023, arXiv:2308.15962. [Google Scholar]
- Shen, W.; Yang, G.; Yu, A.; Wong, J.; Kaelbling, L.P.; Isola, P. Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation. Proc. Mach. Learn. Res. 2023, 229, 1–18. [Google Scholar]
- Sharma, S.; Shivakumar, K.; Huang, H.; Hoque, R.; Imran, A.; Ichter, B.; Goldberg, K. From Occlusion to Insight: Object Search in Semantic Shelves Using Large Language Models. arXiv 2023, arXiv:2302.12915. [Google Scholar]
- Mees, O.; Borja-Diaz, J.; Burgard, W. Grounding Language with Visual Affordances over Unstructured Data. In Proceedings of the IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 11576–11582. [Google Scholar]
- Xu, Y.; Hsu, D. “Tidy Up the Table”: Grounding Common-Sense Objective for Tabletop Object Rearrangement. arXiv 2023, arXiv:2307.11319. [Google Scholar]
- Nanwani, L.; Agarwal, A.; Jain, K.; Prabhakar, R.; Monis, A.; Mathur, A.; Jatavallabhula, K.M.; Abdul Hafez, A.H.; Gandhi, V.; Krishna, K.M. Instance-Level Semantic Maps for Vision Language Navigation. In Proceedings of the 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Busan, Republic of Korea, 28–31 August 2023; pp. 507–512. [Google Scholar] [CrossRef]
- Yu, B.; Kasaei, H.; Cao, M. L3MVN: Leveraging Large Language Models for Visual Target Navigation. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Detroit, MI, USA, 1–5 October 2023; pp. 3554–3560. [Google Scholar]
- Kanazawa, N.; Kawaharazuka, K.; Obinata, Y.; Okada, K.; Inaba, M. Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot. Lect. Notes Networks Syst. 2024, 795, 547–560. [Google Scholar] [CrossRef]
- Seenivasan, L.; Islam, M.; Kannan, G.; Ren, H. SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery. In Proceedings of the 26th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2023), Vancouver, BC, Canada, 8–12 October 2023. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2158–2170. [Google Scholar] [CrossRef]
- Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv 2020, arXiv:2101.00027. [Google Scholar]
- Wang, B.; Komatsuzaki, A. GPT-J 6B. Available online: https://github.com/kingoflolz/mesh-transformer-jax (accessed on 13 August 2024).
- Liu, Y.; Zhang, K.; Li, Y.; Yan, Z.; Gao, C.; Chen, R.; Yuan, Z.; Huang, Y.; Sun, H.; Gao, J.; et al. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. arXiv 2024, arXiv:2402.17177. [Google Scholar]
- Azeem, R.; Hundt, A.; Mansouri, M.; Brandão, M. LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions. arXiv 2024, arXiv:2406.08824. [Google Scholar]
- Rebedea, T.; Dinu, R.; Sreedhar, M.; Parisien, C.; Cohen, J. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. arXiv 2023, arXiv:2310.10501. [Google Scholar]
- Wu, X.; Chakraborty, S.; Xian, R.; Liang, J.; Guan, T.; Liu, F.; Sadler, B.M.; Manocha, D.; Bedi, A.S. Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics. arXiv 2024, arXiv:2402.10340. [Google Scholar]
Title | Keywords | Ref. |
---|---|---|
Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis | Foundation Models | [3] |
A Survey of Large Language Models | LLM | [5] |
Vision-Language Models for Vision Tasks: A Survey | VLM | [12] |
Language-conditioned Learning for Robotic Manipulation: A Survey | LLM, VLM, Manipulation | [13] |
Foundation Models in Robotics: Applications, Challenges, and the Future | Foundation Models | [14] |
Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions | VLN | [15] |
Release Date | Model Name | Developer | Ref. | Release Date | Model Name | Developer | Ref. |
---|---|---|---|---|---|---|---|
2018-06 | GPT-1 | OpenAI | [44] | 2024-02 | OLMo | Allen Institute for AI | [45] |
2019-02 | GPT-2 | OpenAI | [35] | 2024-02 | StarCoder2 | Hugging Face | [46] |
2019-10 | T5 | Google | [47] | 2024-03 | Claude 3 | Anthropic | [48] |
2020-05 | GPT-3 | OpenAI | [49] | 2024-03 | InternLM2 | Shanghai AI Lab | [50] |
2021-07 | Codex | OpenAI | [51] | 2024-03 | Jamba | AI21Labs | [52] |
2021-09 | FLAN | Google | [53] | 2024-04 | Stable Code | Stability AI | [54] |
2021-10 | T0 | Hugging Face | [37] | 2024-04 | HyperCLOVA | Naver | [55] |
2021-12 | Gopher | DeepMind | [56] | 2024-04 | Grok-1.5 | xAI | [57] |
2022-03 | InstructGPT | OpenAI | [58] | 2024-04 | Llama3 | Meta AI Research | [59] |
2022-04 | PaLM | Google | [60] | 2024-04 | Phi-3 | Microsoft | [61] |
2022-05 | OPT | Meta AI Research | [62] | 2024-05 | GPT-4o | OpenAI | [1] |
2023-02 | LLaMA | Meta AI Research | [63] | 2024-06 | Claude 3.5 | Anthropic | [64] |
2023-03 | Alpaca | Stanford Univ. | [65] | 2024-07 | GPT-4o mini | OpenAI | [66] |
2023-03 | GPT-4 | OpenAI | [50] | 2024-07 | Falcon2-11B | TII | [67] |
2023-05 | StarCoder | Hugging Face | [68] | 2024-07 | Llama 3.1 405B | Meta AI Research | [69] |
2023-07 | LLaMA2 | Meta AI Research | [70] | 2024-07 | Large2 | Mistral AI | [71] |
2023-09 | Baichuan2 | Baichuan Inc. | [72] | 2024-07 | Gemma2 | Gemma Team, Google DeepMind | [73] |
2023-10 | Mistral | Mistral AI | [74] | 2024-08 | EXAONE 3 | LG AI Research | [75] |
2024-01 | DeepSeek-Coder | DeepSeek-AI | [76] | 2024-08 | Grok-2 and Grok-2 mini | xAI | [77] |
Release Date | Model Name | Developer | Ref. | Release Date | Model Name | Developer | Ref. |
---|---|---|---|---|---|---|---|
2020-05 | DETR | Facebook AI | [78] | 2023-04 | LLaVA | UW–Madison | [79] |
2020-12 | DeiT | Facebook AI | [80] | 2023-04 | MiniGPT-4 | KAUST | [81] |
2021-02 | DALL-E | OpenAI | [82] | 2023-09 | GPT-4V | OpenAI | [83] |
2021-02 | CLIP | OpenAI | [7] | 2023-11 | Florence-2 | Microsoft | [84] |
2021-03 | Swin Transformer | Microsoft | [85] | 2024-01 | Lumiere | Google Research | [86] |
2021-05 | SegFormer | Univ. of Hong Kong | [87] | 2024-01 | Fuyu | Adept | [88] |
2021-06 | Vision Transformer | Google Research, Brain | [89] | 2024-03 | Gemini 1.5 | Gemini Team, Google | [90] |
2021-06 | BEiT | HIT, Microsoft Research | [91] | 2024-04 | InternLMXComposer2 | Shanghai AI Lab. | [92] |
2021-11 | ViTMAE | Facebook AI | [93] | 2024-04 | IDEFICS 2 | Hugging Face | [94] |
2021-12 | Stable Diffusion | LMU Munich, IWR | [95] | 2024-05 | Idefics2 | Hugging Face | [96] |
2022-03 | R3M | Meta AI, Stanford Univ. | [97] | 2024-05 | Chameleon | Meta AI Research | [98] |
2022-04 | Flamingo | DeepMind | [99] | 2024-07 | InternLM-XComposer-2.5 | Shanghai AI Lab. | [98] |
2023-01 | BLIP-2 | Salesforce Research | [100] | 2024-07 | PaliGemma | Google DeepMind | [101] |
2023-04 | SAM | Meta AI Research | [102] | 2024-08 | SAM 2 | Meta AI | [103] |
2023-04 | DINOv2 | Meta AI Research | [104] | 2024-08 | Qwen2-VL | Alibaba Group | [105] |
Name | Explanation | Ref. |
---|---|---|
Zero-Shot Prompting | Enabling the model to perform new tasks without any examples | [53] |
Few-Shot Prompting | Providing a few examples to enable performing new tasks | [49] |
Chain-of-Thought | Explicitly generating intermediate reasoning steps to perform step-by-step inference | [41] |
Self-Consistency | Generating various reasoning paths independently through Few-Shot CoT, with each path going through a prompt generation process to select the most consistent answer | [120] |
Generated Knowledge Prompting | Integrating knowledge and information relevant to a question, and then providing it along with the question to generate accurate answers | [126] |
Prompt Chaining | Dividing a task into sub-tasks and connecting prompts for each sub-task as input–output pairs | [125] |
Tree of Thoughts | Dividing a problem into subproblems with intermediate steps that serve as “thoughts” towards solving the problem, where each thought undergoes an inference process and self-evaluates its progress towards solving the problem | [124] |
Retrieval Augmented Generation | Combining external information retrieval with natural language generation | [127] |
Automatic Reasoning and Tool-use | Using external tools to automatically generate intermediate reasoning steps | [128] |
Automatic Prompt Engineer | Automatically generating and selecting commands | [129] |
Active Prompt | Addressing the issue that the effectiveness may be limited by human annotations | [122] |
Directional Stimulus Prompting | Guiding the model to think and generate responses in a specific direction | [130] |
Program-Aided Language Models | Using models to understand natural language problems and generate programs as intermediate reasoning steps | [123] |
ReAct | Combining reasoning and actions within a model | [131]
Reflexion | Enhancing language-based agents through language feedback | [132] |
Multimodal CoT | A two-stage framework that integrates text and vision modalities | [121] |
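To make a few of these techniques concrete, the sketch below builds a few-shot chain-of-thought prompt and applies self-consistency by majority-voting over several sampled completions. The `generate` function is a hypothetical stand-in for any LLM completion API (here it returns canned reasoning so the snippet runs without a model), and the "The answer is" extraction convention is an assumption for illustration.

```python
import random
from collections import Counter

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical LLM call; replace with a real completion API.
    Returns canned reasoning strings so this example runs without a model."""
    return random.choice([
        "The robot holds 3 + 2 = 5 blocks. The answer is 5.",
        "Three red plus two blue gives five blocks. The answer is 5.",
        "Counting only the red blocks gives 3. The answer is 3.",
    ])

# Few-shot chain-of-thought exemplar: the intermediate reasoning steps precede the
# final answer, and the model is expected to imitate this format.
COT_EXAMPLES = """\
Q: A robot stacks 4 cubes and removes 1. How many cubes remain in the stack?
A: It stacks 4 cubes and removes 1, so 4 - 1 = 3 remain. The answer is 3.
"""

def cot_prompt(question: str) -> str:
    return f"{COT_EXAMPLES}\nQ: {question}\nA: Let's think step by step."

def self_consistent_answer(question: str, samples: int = 5) -> str:
    """Self-consistency: sample several reasoning paths, keep the most common final answer."""
    answers = []
    for _ in range(samples):
        completion = generate(cot_prompt(question), temperature=0.7)
        answers.append(completion.rsplit("The answer is", 1)[-1].strip(" .\n"))
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_answer("A robot picks up 3 red blocks and 2 blue blocks. How many blocks does it hold?"))
```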
Name | Explanation | Ref. |
---|---|---|
Reward Design in RL | | [11,134,136,137,138,139,176,177,178,179,180] |
Low-level Control | | [8,9,10,144,145,146,147,148,181,182,183] |
High-level Planning | | [149,150,151,152,153,154,155,156,157,158,159,160,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207] |
Manipulation | | [161,162,163,164,165,166,167,208,209,210,211,212,213,214,215] |
Scene Understanding | | [168,169,170,171,172,173,174,175,216,217,218,219,220,221,222,223] |
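Tied to the high-level planning topic above, the following sketch shows the widely used pattern of prompting an LLM to translate a natural-language instruction into calls against a small, named robot skill library (in the spirit of Code as Policies or ProgPrompt). The skill names, the `llm_plan` stub, and the plan format are illustrative assumptions, not the API of any cited system.

```python
# Minimal skill library the planner is allowed to call; real systems expose grounded,
# perception-aware skills (e.g., detected object poses) rather than print statements.
def pick(obj: str) -> None:
    print(f"picking {obj}")

def place(obj: str, location: str) -> None:
    print(f"placing {obj} on {location}")

SKILLS = {"pick": pick, "place": place}

def llm_plan(instruction: str) -> list[tuple[str, list[str]]]:
    """Hypothetical LLM call returning a plan as (skill, arguments) tuples.
    A real system would prompt an LLM with the skill signatures and the instruction."""
    # Canned response standing in for an actual LLM completion.
    return [("pick", ["apple"]), ("place", ["apple", "shelf"])]

def execute(instruction: str) -> None:
    for skill_name, args in llm_plan(instruction):
        skill = SKILLS.get(skill_name)
        if skill is None:  # reject hallucinated skills instead of executing them
            print(f"skipping unknown skill: {skill_name}")
            continue
        skill(*args)

execute("put the apple on the shelf")
```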