LLaVA-Pose: Keypoint-Integrated Instruction Tuning for Human Pose and Action Understanding
Abstract
1. Introduction
- We introduce a method for generating vision–language instruction-following data by integrating human keypoints, filling a critical gap in existing multimodal models for human-centric visual understanding.
- We demonstrate substantial improvements in human-centric visual tasks through comprehensive experiments comparing our fine-tuned LLaVA-1.5-7B model (hereafter referred to as LLaVA-Pose) with its original version and other models.
- We offer insights into how different types of fine-tuning data impact model capabilities for specific domains.
2. Related Work
2.1. Instruction-Following Multimodal Models
2.2. Multimodal Human-Centric Visual Understanding
2.3. Micro-Action Recognition
3. Keypoint-Integrated Visual Instruction Data Generation
- Conversation: Dynamic interactions simulating real-world conversational exchanges about human poses and actions, such as asking what individuals are doing in a given scene (see detailed prompts and curation process in Table 1).
- Detailed description: In-depth descriptions focusing on human body language and environmental interactions, transcending simple object identification to emphasize narrative-style outputs useful in applications requiring detailed human observation (see detailed prompts and curation process in Table 2).
- Complex reasoning: Challenges requiring multi-step reasoning about human activities, such as understanding intentions behind specific actions or predicting next possible movements based on current poses (see detailed prompts and curation process in Table 3).
messages = [{"role": "system", "content": f"""You are an AI visual assistant, and you are seeing a single image. What you see are provided with five sentences, describing the same image you are looking at. Answer all questions as you are seeing the image.

Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers. Your primary focus should be on generating questions and answers about the poses, actions, and movements of people in the image. Please generate diverse questions that relate primarily to human poses and actions, and provide detailed answers as if you are seeing the image.

Only include questions that have definite answers: (1) one can see the content in the image that the question asks about and can answer confidently; (2) one can determine confidently from the image that it is not in the image. Do not ask any question that cannot be answered confidently."""}]

for sample in fewshot_samples:
    messages.append({"role": "user", "content": sample["context"]})
    messages.append({"role": "assistant", "content": sample["response"]})

messages.append({"role": "user", "content": "Can you generate some new questions and answers focusing on the poses and actions of people in the image?"})
import random

messages = [{"role": "system", "content": f"""You are an AI visual assistant specializing in analyzing human poses and actions in images. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes and human keypoints, represented as (x1, y1, x2, y2) for bounding boxes and (x, y, visibility) for keypoints, with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y for bounding boxes, and x, y coordinates along with visibility (0: not labeled, 1: labeled but not visible, 2: labeled and visible) for keypoints.

The keypoints represent the following body parts: 1. nose 2. left eye 3. right eye 4. left ear 5. right ear 6. left shoulder 7. right shoulder 8. left elbow 9. right elbow 10. left wrist 11. right wrist 12. left hip 13. right hip 14. left knee 15. right knee 16. left ankle 17. right ankle

Using the provided captions and bounding box/human keypoint information, describe the scene in a detailed manner, focusing on the human poses and actions. Instead of directly mentioning the bounding box or keypoint coordinates, utilize this data to explain the scene using natural language. Include details like the number of people, their actions, poses, interactions, and relative positions. When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box/keypoints. Always answer as if you are directly looking at the image."""}]

for annotation in annotations_group:
    messages.append({"role": "user", "content": annotation["context"]})

# One instruction is sampled from the detailed-description list shown below.
messages.append({"role": "user", "content": random.choice(detailed_description)})
messages = [{"role": "system", "content": f"""You are an AI visual assistant specializing in analyzing human poses and actions in images. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes and human keypoints, represented as (x1, y1, x2, y2) for bounding boxes and (x, y, visibility) for human keypoints, with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y for bounding boxes, and x, y coordinates along with visibility (0: not labeled, 1: labeled but not visible, 2: labeled and visible) for human keypoints.

The human keypoints represent the following body parts: 1. nose 2. left eye 3. right eye 4. left ear 5. right ear 6. left shoulder 7. right shoulder 8. left elbow 9. right elbow 10. left wrist 11. right wrist 12. left hip 13. right hip 14. left knee 15. right knee 16. left ankle 17. right ankle

The task is to use the provided caption and bounding box/human keypoint information to create a plausible question about the human poses and actions in the image, and provide the answer in detail. Create complex questions beyond describing the scene. To answer such questions, one should require first understanding the human poses and actions, then based on the background knowledge or reasoning, either explain why the actions are happening that way, or provide guidance and help to the user’s request. Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first.

Do not include any coordinates or numerical values in your explanation. Instead, utilize the data to explain the scene using natural language. Include details like the number of people, their actions, poses, interactions, relative positions, as well as the relationships and interactions between people and objects in the scene. Describe how people are using objects, their proximity to objects, and any activities involving both people and objects. When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box/human keypoints. Always answer as if you are directly looking at the image."""}]

for annotation in annotations_group:
    messages.append({"role": "user", "content": annotation["context"]})
The detailed_description instruction pool sampled by random.choice above consists of the following equivalent phrasings:
- "Describe the actions and poses of people in the following image in detail."
- "Provide a detailed description of the poses of people in the given image."
- "Explain the various details of the actions of people you see in the image."
- "Share a comprehensive analysis of the behaviors of people presented in the image."
- "Offer a thorough analysis of the actions of people in the image."
- "Explain the various poses and actions of people in the displayed image with great detail."
- "Characterize the poses of people in the image using a well-detailed description."
- "Break down the actions of people in the image in a detailed manner."
- "Walk through the important details of the actions of people in the image."
- "Portray the poses and actions of people in the image with a rich, descriptive narrative."
- "Narrate the actions and poses of people in the image with precision."
- "Analyze the poses and actions of people in the image in a comprehensive and detailed manner."
- "Illustrate the actions and poses of people in the image through a descriptive explanation."
- "Examine the actions and poses of people in the image closely and share their details."
- "Write an exhaustive depiction of the actions of people in the given image."
- "Carefully observe the people in the image and share the details of their poses and actions."
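The prompts above consume a plain-text context (annotation["context"]) that packs the image captions together with normalized bounding boxes and keypoints. The snippet below is a minimal sketch of how such a context string could be assembled from COCO-style person annotations; the function name and intermediate formatting are illustrative assumptions, while the bbox and keypoints fields follow the standard COCO annotation format. Normalizing by image width and height yields the 0-to-1 coordinate convention described in the system prompts.

# Sketch (assumed helper): serialize COCO-style captions, person bounding boxes, and
# 17 keypoints into the textual context consumed by the prompts above. Coordinates are
# normalized to [0, 1], matching the convention described in the system prompts.

COCO_KEYPOINT_NAMES = [
    "nose", "left eye", "right eye", "left ear", "right ear",
    "left shoulder", "right shoulder", "left elbow", "right elbow",
    "left wrist", "right wrist", "left hip", "right hip",
    "left knee", "right knee", "left ankle", "right ankle",
]

def build_context(captions, person_annotations, img_w, img_h):
    lines = list(captions)  # the five caption sentences for the image
    for i, ann in enumerate(person_annotations, start=1):
        x, y, w, h = ann["bbox"]  # COCO bbox format: top-left x, top-left y, width, height
        x1, y1, x2, y2 = x / img_w, y / img_h, (x + w) / img_w, (y + h) / img_h
        lines.append(f"person {i}: ({x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f})")
        kps = ann["keypoints"]  # flat COCO list: x1, y1, v1, x2, y2, v2, ... (17 triplets)
        for j, name in enumerate(COCO_KEYPOINT_NAMES):
            kx, ky, v = kps[3 * j], kps[3 * j + 1], kps[3 * j + 2]
            lines.append(f"{name}: ({kx / img_w:.3f}, {ky / img_h:.3f}, {v})")
    return "\n".join(lines)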
4. Model Architecture and Fine-Tuning Approach
- Vision Encoder: The input image is first processed by a pre-trained CLIP ViT-L encoder [41], which extracts visual features.
- Vision–Language Connector: The visual features are passed through a two-layer multi-layer perceptron (MLP), which projects them into the embedding space of the LLM.
- Large Language Model: The mapped visual embeddings are concatenated with the text token embeddings, and the combined embeddings are input into the LLM Vicuna-v1.5 [42]. The LLM generates textual responses conditioned on both the visual and textual information.
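To make this data flow concrete, the following is a schematic sketch of how the three components connect, written in PyTorch-style code; the class, argument names, and dimensions are illustrative assumptions rather than the actual LLaVA-1.5 implementation.

import torch
import torch.nn as nn

class PoseTunedVLM(nn.Module):
    # Schematic only: mirrors the pipeline described above (CLIP ViT-L encoder ->
    # two-layer MLP connector -> Vicuna-v1.5 LLM); names and dimensions are assumptions.
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder          # pre-trained CLIP ViT-L image encoder
        self.projector = nn.Sequential(               # two-layer MLP vision-language connector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                                # Vicuna-v1.5 language model

    def forward(self, image, text_token_embeddings):
        visual_features = self.vision_encoder(image)           # (batch, num_patches, vision_dim)
        visual_embeddings = self.projector(visual_features)    # mapped into the LLM embedding space
        # Concatenate projected visual tokens with the text token embeddings along the
        # sequence dimension; the LLM then generates text conditioned on both modalities.
        inputs_embeds = torch.cat([visual_embeddings, text_token_embeddings], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)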
5. Experiments
5.1. Qualitative Evaluation
5.2. Quantitative Evaluation
- Conversation: the model fine-tuned on conversation data alone scores 58.9 on the conversation task vs. the original LLaVA-1.5-7B’s 54.0;
- Detailed description: 69.1 vs. 47.2;
- Complex reasoning: 75.1 vs. 55.2;
- Overall performance: The model fine-tuned on all three data types achieves 69.4 vs. the original LLaVA-1.5-7B’s 52.1, representing a 33.2% increase.
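The 33.2% figure follows directly from the two overall scores reported above:

original_score, finetuned_score = 52.1, 69.4
relative_gain = (finetuned_score - original_score) / original_score
print(f"{relative_gain:.1%}")  # 33.2%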
6. Discussion
6.1. Key Findings
6.2. Limitations and Future Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
VLMs | vision–language models |
E-HPAUB | extended human pose and action understanding benchmark |
AI | artificial intelligence |
LLaVA | large language and vision assistant |
LLMs | large language models |
SMPL | skinned multi-person linear |
MANet | micro-action network |
RNNs | recurrent neural networks |
References
- Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.D.; et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 958–979. [Google Scholar]
- Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
- Wu, J.; Gan, W.; Chen, Z.; Wan, S.; Yu, P.S. Multimodal large language models: A survey. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2247–2256. [Google Scholar]
- Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
- Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv 2023, arXiv:2306.14824. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
- Zhang, D.; Hussain, T.; An, W.; Shouno, H. PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment. arXiv 2025, arXiv:2507.09139. [Google Scholar]
- Kyrollos, D.G.; Fuller, A.; Greenwood, K.; Harrold, J.; Green, J.R. Under the cover infant pose estimation using multimodal data. IEEE Trans. Instrum. Meas. 2023, 72, 5007212. [Google Scholar] [CrossRef]
- Zhou, H.; Wang, D.; Yu, Y.; Zhang, Z. Research progress of human–computer interaction technology based on gesture recognition. Electronics 2023, 12, 2805. [Google Scholar] [CrossRef]
- Wang, T.; Zheng, P.; Li, S.; Wang, L. Multimodal human–robot interaction for human-centric smart manufacturing: A survey. Adv. Intell. Syst. 2024, 6, 2300359. [Google Scholar] [CrossRef]
- Yildirim, N.; Richardson, H.; Wetscherek, M.T.; Bajwa, J.; Jacob, J.; Pinnock, M.A.; Harris, S.; Coelho De Castro, D.; Bannur, S.; Hyland, S.; et al. Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 11–16 May 2024. CHI ’24. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Feng, Y.; Lin, J.; Dwivedi, S.K.; Sun, Y.; Patel, P.; Black, M.J. Chatpose: Chatting about 3d human pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2093–2103. [Google Scholar]
- Feng, D.; Guo, P.; Peng, E.; Zhu, M.; Yu, W.; Wang, P. PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 2951–2959. [Google Scholar]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. ACM Trans. Graph. 2015, 34, 248. [Google Scholar] [CrossRef]
- Wu, Z.; Chen, X.; Pan, Z.; Liu, X.; Liu, W.; Dai, D.; Gao, H.; Ma, Y.; Wu, C.; Wang, B.; et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv 2024, arXiv:2412.10302. [Google Scholar]
- Wu, P.; Xie, S. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13084–13094. [Google Scholar]
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
- Zhu, J.; Wang, W.; Chen, Z.; Liu, Z.; Ye, S.; Gu, L.; Tian, H.; Duan, Y.; Su, W.; Shao, J.; et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv 2025, arXiv:2504.10479. [Google Scholar]
- Chen, X.; Wu, Z.; Liu, X.; Pan, Z.; Liu, W.; Xie, Z.; Yu, X.; Ruan, C. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv 2025, arXiv:2501.17811. [Google Scholar]
- Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ‘23, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ‘22, Red Hook, NY, USA, 28 November–9 December 2022. [Google Scholar]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26296–26306. [Google Scholar]
- Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-based Human Pose Estimation: A Survey. ACM Comput. Surv. 2023, 56, 1–37. [Google Scholar] [CrossRef]
- Surek, G.A.S.; Seman, L.O.; Stefenon, S.F.; Mariani, V.C.; Coelho, L.d.S. Video-based human activity recognition using deep learning approaches. Sensors 2023, 23, 6384. [Google Scholar] [CrossRef]
- Morshed, M.G.; Sultana, T.; Alam, A.; Lee, Y.K. Human action recognition: A taxonomy-based survey, updates, and opportunities. Sensors 2023, 23, 2182. [Google Scholar] [CrossRef]
- Le, V.H. Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset. Multimed. Tools Appl. 2023, 82, 20771–20818. [Google Scholar] [CrossRef]
- Delmas, G.; Weinzaepfel, P.; Moreno-Noguer, F.; Rogez, G. Posefix: Correcting 3D human poses with natural language. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 15018–15028. [Google Scholar]
- Delmas, G.; Weinzaepfel, P.; Lucas, T.; Moreno-Noguer, F.; Rogez, G. Posescript: 3d human poses from natural language. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 346–362. [Google Scholar]
- Delmas, G.; Weinzaepfel, P.; Lucas, T.; Moreno-Noguer, F.; Rogez, G. Posescript: Linking 3d human poses and natural language. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 5146–5159. [Google Scholar] [CrossRef] [PubMed]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Guo, D.; Li, K.; Hu, B.; Zhang, Y.; Wang, M. Benchmarking Micro-Action Recognition: Dataset, Methods, and Applications. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6238–6252. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Gu, J.; Li, K.; Wang, F.; Wei, Y.; Wu, Z.; Fan, H.; Wang, M. Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. arXiv 2025, arXiv:2507.21977. [Google Scholar]
- Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. LAION-5B: An open large-scale dataset for training next generation image–text models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ‘22. Red Hook, NY, USA, 28 November–9 December 2022. [Google Scholar]
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3558–3568. [Google Scholar]
- Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S. COYO-700M: Image-Text Pair Dataset. 2022. Available online: https://github.com/kakaobrain/coyo-dataset (accessed on 18 July 2025).
- OpenAI. Hello gpt-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 18 July 2025).
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023. Available online: https://lmsys.org/blog/2023-03-30-vicuna/ (accessed on 18 July 2025).
- Rasley, J.; Rajbhandari, S.; Ruwase, O.; He, Y. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 3505–3506. [Google Scholar]
- Zhang, D.; An, W.; Shouno, H. Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models. In Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS), Tokyo, Japan, 28–30 July 2025. [Google Scholar]
- Anthropic. Introducing Claude 4. 2025. Available online: https://www.anthropic.com/news/claude-4 (accessed on 13 August 2025).
- Zhang, Z.; Liu, W.; Zheng, Y.; Du, L.; Sun, L. Learning spatio-temporal context for basketball action pose estimation with a multi-stream network. Sci. Rep. 2025, 15, 29173. [Google Scholar] [CrossRef]
- Gu, J.; Yi, Y.; Li, Q. Motion sensitive network for action recognition in control and decision-making of autonomous systems. Front. Neurosci. 2024, 18, 1370024. [Google Scholar] [CrossRef]
- Chen, D.; Chen, M.; Wu, P.; Wu, M.; Zhang, T.; Li, C. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Sci. Rep. 2025, 15, 4982. [Google Scholar] [CrossRef]
- Li, Y.; Liu, Z.; Kong, Y.; Li, G.; Zhang, J.; Bian, C.; Liu, F.; Yao, L.; Sun, Z. Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding. arXiv 2025, arXiv:2501.16786. [Google Scholar]
Hyperparameter | Value |
---|---|
batch size | 8 |
gradient accumulation steps | 2 |
learning rate | |
learning rate schedule | cosine decay |
learning warmup ratio | 0.03 |
optimizer | AdamW |
epoch | 1 |
weight decay | 0 |
DeepSpeed stage | 3 |
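These values map directly onto standard Hugging Face TrainingArguments; the snippet below sketches that mapping and is not the exact launch configuration used for LLaVA-Pose. The learning rate is a placeholder because its value is not listed in the table, and the DeepSpeed config filename is an assumption.

from transformers import TrainingArguments

# Sketch only: the hyperparameters from the table expressed as Hugging Face TrainingArguments.
training_args = TrainingArguments(
    output_dir="./llava-pose-finetune",     # illustrative output path
    per_device_train_batch_size=8,          # batch size from the table (assumed per device)
    gradient_accumulation_steps=2,
    learning_rate=2e-5,                     # placeholder; the value is left blank in the table above
    lr_scheduler_type="cosine",             # cosine decay schedule
    warmup_ratio=0.03,
    optim="adamw_torch",                    # AdamW optimizer
    num_train_epochs=1,
    weight_decay=0.0,
    deepspeed="zero3.json",                 # DeepSpeed ZeRO stage-3 config (assumed filename)
)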
Source: http://images.cocodataset.org/val2014/COCO_val2014_000000060992.jpg (accessed on 18 August 2025) |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. DeepSeek-VL2: The image captures a tennis player in action on a court during a match. The player is wearing a red shirt, white shorts, and green and white wristbands. He is in a tennis racket in his right hand, appears to be in the middle of a forehand shot. The player is in a fast-paced moment, with his body leaning forward and focused expression. In the background, there are two ball boys. One boy is holding a tennis ball, wearing a blue shirt and black shorts, while the other is wearing a blue shirt and black pants. They appear to be ball boys. The court has a blue wall in the background with the words “VISILAB” and “CHALLENEFR TOUR” written on it. There is also a digital scoreboard on the left side of the court displaying “142 Km/h” and “Cronet.” The court is a professional tennis match, likely part of a Challenger Tour. |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. V*-7B: In the image, a man is swinging a tennis racket on a court, while two other people are present in the scene. One of the individuals appears to be a young girl, who might be watching the man playing tennis or waiting for her turn to participate. The man, wearing a red shirt, is actively engaged in the game, demonstrating his skills by swinging the racket with precision and power. It seems like an enjoyable and lively moment where the players share a common interest in tennis, possibly making it a family or social event where people of different ages can bond over the sport. |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. Qwen2-VL-7B: In the image, a tennis player is captured in the midst of an intense rally, showcasing his agility and skill. He is leaning forward, his body coiled like a spring, as he prepares to strike the incoming tennis ball with his racket. His eyes are focused intently on the ball, his muscles tensed and ready to spring into action. In the background, another person is holding a tennis ball, likely waiting for the right moment to serve or pass the ball to the player. This person’s stance is relaxed yet attentive, as they watch the ongoing match with keen interest. A third person is standing nearby, possibly a coach or a fellow player, observing the game with a critical eye. Their presence adds an element of support and encouragement to the scene, as they watch the player’s performance and offer guidance when needed. The image captures the dynamic and competitive nature of tennis, with each player fully engaged in the game and focused on their respective tasks. The vibrant colors of the court and the players’ attire add to the overall energy and excitement of the scene. |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. InternVL3-8B: In the image, a man is playing tennis on a court, swinging his racket to hit a tennis ball. He is wearing a red shirt and white shorts, and his pose suggests that he is in the middle of a powerful swing. There are other people in the background, possibly watching the game or waiting for their turn to play. The scene takes place on a tennis court, which is a common setting for tennis matches and practice sessions. The presence of the other people and the tennis ball in the air indicate that this is an active and engaging moment in the game. |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. ChatPose: The image captures a lively tennis match with several people on the court. A man is in the middle of a swing, holding a tennis racket and preparing to hit the ball. He is focused and determined, showcasing his athleticism and skill in the sport. Another tennis racket can be seen in the scene, suggesting that there might be another player waiting for their turn or practicing nearby. There are also two sports balls visible in the image, one of which is likely the tennis ball the man is about to hit. Another person is present on the court, possibly a fellow player, a coach, or a spectator. A chair is located near the edge of the court, providing a place for rest or observation during breaks in the game. Overall, the scene depicts an engaging and dynamic tennis match with various individuals involved in the sport. |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. Janus-Pro-7B: In the image, a dynamic scene unfolds on a tennis court, capturing the intensity and focus of a player in action. The central figure, a tennis player dressed in a red shirt and white shorts, is mid-swing, his body leaning forward with a powerful motion. His right arm extends outward, gripping a tennis racket, while his left arm is slightly bent, maintaining balance. His legs are positioned in a wide stance, with the right leg slightly bent and the left leg extended backward, emphasizing the force and precision of his movement. The player’s expression is one of concentration and determination, his eyes fixed on the ball as it approaches. The ball is suspended in the air, a blur of motion, indicating the speed and agility required in this moment. The player’s grip on the racket is firm, his wrist steady, suggesting a practiced and skilled hand. In the background, two individuals stand observing the match. One, dressed in a light blue shirt and dark shorts, holds a tennis ball, ready to serve or return a shot. The other, wearing a teal shirt and black pants, leans forward slightly, perhaps in anticipation of the next move. Their presence adds a layer of context, indicating that this is a competitive or practice session. The backdrop of the court features a digital display showing the speed of the serve, with the text “142 km/h” and “Coulée” visible, adding a modern and technological touch to the scene. The wall behind the players is adorned with logos, including “VISIAB” and “CHALLENGER TOUR,” suggesting a professional or high-level amateur event. The overall atmosphere is one of energy and focus, with each individual playing a role in the unfolding drama of the tennis match. The image captures a moment of athletic prowess and competitive spirit, highlighting the dedication and skill required in the sport. |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. LLaVA-1.5-7B: The image captures a tennis match in progress, with a man in a red shirt and white shorts playing the game. He is in the middle of a swing, holding a tennis racket and preparing to hit the ball. Another person is standing nearby, likely watching the game or waiting for their turn to play. There are a few more people in the scene, one of them holding a tennis ball, possibly waiting for their turn to play or observing the match. The presence of multiple people in the scene suggests that this is a social event or a gathering where people are enjoying a friendly game of tennis. |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. GPT-4o: In this dynamic tennis scene, the focal player—clad in a vivid red shirt and white shorts—is captured mid-action, exuding intensity and athleticism. His left foot plants firmly on the court, knee bent, while his right leg extends powerfully behind him, hinting at both agility and control. His torso twists slightly as he follows through on a forehand stroke, his racket slicing through the air with purpose. The focused gaze locked on the ball reveals his anticipation and sharp reflexes. His right arm is extended, wrist flicked for control, while his left arm counterbalances behind him, fingers splayed—a gesture of fluid coordination. The tension in his muscles and the sharpness of his movement suggest this is a high-stakes rally. In the background, a ball boy stands alert with another tennis ball tucked behind his back, poised to spring into action. To the right, a line judge leans slightly forward, hands on knees, watching the trajectory of the ball with sharp concentration. Both background figures, though static, add a layer of tension and professionalism to the scene. The signage and scoreboard indicate this is part of a Challenger Tour event, and the speed of the last serve—142 km/h—is visible, reinforcing the elite level of play. The atmosphere is electric, the motion frozen in time yet bursting with energy and precision. |
User: Portray the poses and actions of people in the image with a rich, descriptive narrative. LLaVA-Pose: In the image, there are three individuals engaged in a dynamic scene on a tennis court. The primary focus is on a man who is actively swinging a tennis racket, likely in the midst of a powerful serve or return shot. His body is positioned with a slight forward lean, indicating the force and precision of his movement. His left shoulder is slightly raised, and his right shoulder is lower, suggesting the follow-through of a swing. His left elbow is bent, and his right elbow is extended, with his right wrist positioned to control the racket’s direction. His legs are spread apart, with his left knee slightly bent and his right knee more flexed, providing a stable base for his powerful swing. To the left of the main player, there is a woman who appears to be observing the action. She stands upright with a relaxed posture, her arms hanging naturally by her sides. Her head is turned towards the main player, indicating her focus on the ongoing play. Her stance is casual, with her feet firmly planted on the ground, suggesting she is not currently involved in the game but is instead watching attentively. Nearby, there is another woman who is also watching the game. She is positioned slightly behind the main player, with her body turned towards the action. Her posture is slightly more engaged than the observer on the left, with her shoulders squared and her head turned towards the main player. Her arms are relaxed, and her legs are slightly apart, indicating a stance that is ready to move if needed. The scene captures the intensity and focus of a tennis match, with the main player in the midst of a powerful action, while the other two individuals are observing, possibly waiting for their turn to play or simply enjoying the game. The overall atmosphere is one of concentration and athleticism, with each person contributing to the dynamic environment of the tennis court. |
Source: http://images.cocodataset.org/val2014/COCO_val2014_000000004742.jpg (accessed on 18 August 2025) |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? DeepSeek-VL2: The person is standing with knees bent and arms extended while riding the surfboard to maintain balance. |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? V*-7B: The person standing with knees bent and arms extended while riding the surfboard is likely doing so to maintain balance and control on the surfboard as it moves through the water. By bending the knees and extending the arms, the individual is able to increase the surface area of the body in contact with the water, which helps with stability. This posture is also known as the “surf stance” or “paddling stance”, and it allows the surfer to efficiently paddle and maneuver the board in the water before catching a wave. |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? Qwen2-VL-7B: The person is standing with knees bent and arms extended while riding the surfboard to maintain balance and control while surfing. This stance allows the surfer to adjust their body position and weight distribution to navigate the wave effectively. By bending their knees and extending their arms, the surfer can shift their center of gravity and react to the changing dynamics of the wave, ensuring a smooth and controlled ride. This technique is essential for maintaining stability and preventing falls while surfing. |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? InternVL3-8B: The person is standing with knees bent and arms extended to maintain balance and control while riding the surfboard on the wave. |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? ChatPose: The person is standing with knees bent and arms extended while riding the surfboard to maintain balance and control as they catch a wave. Surfing requires a combination of strength, agility, and coordination to navigate the constantly changing ocean conditions. By bending his knees and extending his arms, the surfer is able to adjust his body position and center of gravity to stay stable on the surfboard. Additionally, this stance allows the surfer to quickly react to any changes in the wave or water, ensuring a smoother and more controlled ride. |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? Janus-Pro-7B: The person is standing with knees bent and arms extended while riding the surfboard to maintain balance and stability. Bending the knees lowers the center of gravity, which helps in maintaining balance on the moving surfboard. Extending the arms can help with balance and provide additional stability. This posture is common among surfers to help them stay upright and maneuver the board effectively. |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? LLaVA-1.5-7B: The person is standing with knees bent and arms extended while riding the surfboard to maintain balance and control on the surfboard. By bending their knees, they lower their center of gravity, which helps them stay stable on the surfboard and adapt to the changing water conditions. Extending their arms outward provides additional balance and stability, as well as helps them maintain a proper stance on the surfboard. This positioning allows the surfer to effectively maneuver the surfboard and ride the waves, ensuring a smooth and enjoyable surfing experience. |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? GPT-4o: The person in the image is standing with knees bent and arms extended while riding the surfboard to maintain balance, control, and stability on the moving waves. Here’s why each of these posture elements is important: Bent Knees: Shock Absorption: Bent knees act like natural shock absorbers, helping the surfer absorb the motion and bumps from the waves. Lower Center of Gravity: A lower stance helps stabilize the body, making it less likely to tip over. Mobility: It allows quick adjustments in weight distribution and faster reaction to changes in the wave. Extended Arms: Balance: Like a tightrope walker, the surfer uses their arms to shift and correct their balance. Steering: Arm position helps with directional changes—leaning slightly forward, backward, or to the sides. Coordination: Extending arms assists in maintaining an aligned posture, ensuring the surfer’s upper and lower body move in harmony. This stance is a fundamental technique in surfing, especially for beginners and intermediates, as it maximizes control while minimizing the risk of falling. |
User: Why is the person standing with knees bent and arms extended while riding the surfboard? LLaVA-Pose: The person is standing with knees bent and arms extended to maintain balance and stability while riding the surfboard. This stance is crucial for several reasons: 1. Balance and Stability: Bent knees lower the center of gravity, which helps the surfer stay balanced on the moving surfboard. This is particularly important when riding waves, as the board can be unstable and the water can be unpredictable. 2. Control: Extended arms act as a counterbalance to the body’s movements. By spreading the arms, the surfer can make fine adjustments to their balance, allowing them to shift their weight and direction as needed to navigate the wave. 3. Shock Absorption: The bent knees act as shock absorbers, allowing the surfer to absorb the impact of the waves and maintain a smooth ride. This is especially important for beginners or those riding smaller waves. 4. Flexibility and Agility: This stance allows the surfer to be more agile and responsive to changes in the wave’s shape and speed. It enables quick shifts in weight and direction, which are essential for maneuvering the surfboard effectively. In summary, the person’s pose is a combination of balance, control, shock absorption, and agility, all of which are essential for successfully riding a surfboard on the waves. |
Fine-Tuning Data | Conversation | Detailed Description | Complex Reasoning | All |
---|---|---|---|---|
Full data | 64.3 | 78.9 | 65.0 | 69.4 |
Conversation | 58.9 | 45.1 | 50.4 | 51.5 |
Detailed description | 52.7 | 69.1 | 65.3 | 62.4 |
Complex reasoning | 62.4 | 59.7 | 75.1 | 65.7 |
LLaVA-1.5-7B [25] | 54.0 | 47.2 | 55.2 | 52.1 |
Model | Conversation | Detailed Description | Complex Reasoning | All |
---|---|---|---|---|
DeepSeek-VL2 [18] | 60.1 | 41.8 | 42.6 | 48.1 |
V*-7B [19] | 62.6 | 38.8 | 68.8 | 56.7 |
Qwen2-VL-7B [20] | 73.8 | 52.8 | 68.8 | 65.1 |
InternVL3-8B [21] | 75.4 | 56.2 | 67.1 | 66.3 |
ChatPose [15] | 69.7 | 55.4 | 77.4 | 67.5 |
Janus-Pro-7B [22] | 70.2 | 68.7 | 66.1 | 68.3 |
LLaVA-Pose | 77.4 | 58.6 | 72.9 | 69.6 |
Model | Conversation | Detailed Description | Complex Reasoning | All |
---|---|---|---|---|
DeepSeek-VL2 [18] | 56.7 | 39.3 | 45.2 | 47.1 |
V*-7B [19] | 55.2 | 33.6 | 67.3 | 52.0 |
Qwen2-VL-7B [20] | 70.9 | 45.0 | 72.3 | 62.7 |
InternVL3-8B [21] | 71.2 | 52.3 | 65.7 | 63.1 |
ChatPose [15] | 64.9 | 54.9 | 72.5 | 64.1 |
Janus-Pro-7B [22] | 65.0 | 61.4 | 67.0 | 64.5 |
LLaVA-Pose | 68.7 | 56.9 | 71.0 | 65.5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).