Abstract
Current vision–language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision–language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model on this dataset and evaluate the resulting LLaVA-Pose model on the benchmark. Experimental results show an overall improvement of 33.2% over the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding.
1. Introduction
The development of multimodal models integrating vision and language has become a central focus in artificial intelligence (AI) research [1,2,3,4]. Models like LLaVA (Large Language and Vision Assistant) [5] bridge the gap between visual perception and natural language understanding, utilizing large language models (LLMs) and visual encoders to process a wide range of image-related tasks [6,7,8,9].
Despite their achievements, current models struggle with specialized tasks that require nuanced understanding of human activities, particularly those involving poses and actions. This limitation constrains their application in assistive robotics, healthcare, and human–computer interaction [10,11,12,13]. A significant challenge is the lack of specialized vision–language instruction-following data. While LLaVA introduces a method for converting image–text pairs into instruction-following data using GPT-4 [14], it primarily relies on image captions and object bounding boxes, which lack the precision needed to interpret complex human activities. Consequently, models trained on such data show limited performance in tasks requiring detailed understanding of human pose and action.
Recent efforts have started to address this limitation by exploring language-guided pose understanding. For example, ChatPose [15] leverages LLMs to reason about 3D human poses from images and text, and PoseLLaVA [16] introduces a pose-centric multimodal LLM capable of unified pose estimation, adjustment, and generation. However, they primarily focus on 3D representations based on the Skinned Multi-Person Linear (SMPL) model [17] and do not explicitly target instruction-following tasks designed to facilitate interpretable human pose and action understanding in natural image scenarios.
To address this gap, we propose a novel approach that integrates human keypoints into the instruction-following data generation process. Our keypoint-integrated method provides a more comprehensive representation of human pose and action, enabling models to reason not just about objects in an image but about how individuals interact with those objects and each other. This significantly enhances the model’s ability to describe human movements, reason about their purposes, and respond to queries about human interactions.
Our contributions can be summarized as follows:
- We introduce a method for generating vision–language instruction-following data by integrating human keypoints, filling a critical gap in existing multimodal models for human-centric visual understanding.
- We demonstrate substantial improvements in human-centric visual tasks through comprehensive experiments comparing our fine-tuned LLaVA-1.5-7B model (hereafter referred to as LLaVA-Pose) with its original version and other models.
- We offer insights into how different types of fine-tuning data impact model capabilities for specific domains.
2. Related Work
2.1. Instruction-Following Multimodal Models
The LLaVA model [5] has made noteworthy progress by integrating vision encoders with LLMs to tackle various vision–language tasks. Similarly, models such as DeepSeek-VL2 [18], [19], Qwen2-VL [20], InternVL3 [21], Janus-Pro [22], VisionLLM [23] and Flamingo [24] have been developed for general vision understanding. Although effective for image description and elementary visual reasoning, these models are not specifically designed for interpreting detailed human poses and actions. We introduce a method for generating instruction-following data specifically for human pose and action understanding by leveraging human keypoint information alongside traditional visual features. By integrating this enriched dataset into the fine-tuning process of the LLaVA-1.5-7B model [25], we enhance its capacity for complex reasoning and detailed description of human-centric activities.
2.2. Multimodal Human-Centric Visual Understanding
Traditional human activity recognition typically relies on distinct models for specific tasks [26,27,28,29], but recent research shows a trend toward unifying these capabilities within a single multimodal framework. For instance, ChatPose [15] uses LLMs to combine language-based reasoning with visual input for understanding and generating 3D human poses. PoseLLaVA [16] further advances this direction by introducing a pose-centric multimodal LLM that integrates a pose encoder-decoder into the LLaVA framework. It enables unified processing of pose estimation, pose adjustment, and pose generation tasks across pose, image, and text modalities. Moreover, PoseLLaVA introduces the PosePart dataset to complement existing pose-language datasets such as PoseFix [30] and PoseScript [31,32], improving the model’s ability to perform fine-grained pose manipulation. While these methods leverage SMPL parameters [17] for 3D pose representation, they face challenges in real-world applicability: (1) dependence on precise 3D ground truth lacking in natural images, (2) reconstruction errors (e.g., global orientation drift [15]), and (3) computational overhead from parametric modeling. Our approach addresses these gaps by operating in a 2D context: We integrate human keypoints into instruction-following data, bypassing 3D ambiguities while explicitly linking spatial cues to language semantics. This design enhances interpretability for human actions in daily scenes through diverse instruction types (conversation, description, and reasoning) and leverages ubiquitous 2D annotations (e.g., COCO [33]), encouraging the model to associate pose–action cues with natural language queries for explainable human-centric understanding.
2.3. Micro-Action Recognition
Recent studies have investigated fine-grained human motion understanding, particularly in the field of micro-action recognition. Guo et al. [34] establish a benchmark for micro-action recognition. Their work introduces the Micro-action-52 dataset, a large-scale collection focusing on whole-body, low-intensity movements captured during professional psychological interviews, uniquely emphasizing lower-limb micro-actions often missed in prior datasets. Alongside the dataset, they propose the micro-action network (MANet), which integrates squeeze-and-excitation blocks and temporal shift modules into a ResNet backbone [35], and introduce a joint-embedding loss to enhance discrimination between visually similar categories. Furthermore, they comprehensively evaluate MANet against nine prevalent action recognition methods and demonstrate the utility of their approach in an emotion recognition application. Gu et al. [36] propose a motion-guided modulation network that incorporates skeleton-based motion cues to enhance the recognition of subtle and rapid human actions. These works share a similar objective with ours in recognizing fine-grained human motion patterns. However, our approach differs in that we leverage vision–language models (VLMs) to integrate visual and textual modalities for human pose and action understanding.
3. Keypoint-Integrated Visual Instruction Data Generation
Large-scale multimodal datasets like LAION-5B [37], CC-12M [38], and COYO-700M [39] have advanced VLMs. However, leveraging these datasets specifically for instruction-following tasks involving nuanced understanding of human pose and action remains underexplored.
Previous research such as LLaVA demonstrates promising results in generating visual instruction-following data using symbolic representations (captions and bounding boxes) to prompt language-only GPT-4 [14]. Our approach enhances this foundation by integrating human keypoint data into the instruction-following data generation process. While LLaVA focuses primarily on captions and bounding boxes to represent visual content, our method enriches this representation by including annotations of keypoints, which capture precise positions and movements of human body parts within the scene.
To enhance the visual understanding capabilities of our model, we extend the data generation methodology originally used in LLaVA by incorporating human-centric annotations, using GPT-4o [40] as a teacher model. To represent images as visual features for prompting GPT-4o, our approach considers the following: (1) captions describing the visual scene from various perspectives; (2) bounding boxes localizing objects and providing spatial information; and (3) keypoints representing precise locations of joints and critical body parts. This enriched representation (example in the top block of Figure 1) allows for comprehensive understanding of human pose and action by detailing the exact positioning of body parts. The captions, bounding boxes, and keypoint annotations are obtained directly from the COCO dataset [33].
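As a concrete illustration, the symbolic context for one image could be assembled roughly as follows. This is a hypothetical sketch, not our actual pipeline code: the helper name build_context and the exact line layout are illustrative, while the annotation fields ("bbox" as [x, y, width, height] in pixels and "keypoints" as a flat [x, y, visibility] list) follow the COCO annotation format.

```python
# Hypothetical sketch: serializing COCO-style annotations into the textual
# context used to prompt GPT-4o. The output layout is an illustrative
# assumption; the input field conventions follow the COCO format.

def build_context(captions, annotations, img_w, img_h):
    """Flatten captions, bounding boxes, and keypoints into one prompt string."""
    lines = list(captions)  # e.g., the five COCO captions for the image
    for ann in annotations:
        # COCO bbox is [x, y, width, height] in pixels; normalize to 0-1
        # and convert to (x1, y1, x2, y2) as described in Section 3.
        x, y, w, h = ann["bbox"]
        box = (x / img_w, y / img_h, (x + w) / img_w, (y + h) / img_h)
        lines.append("person: ({:.3f}, {:.3f}, {:.3f}, {:.3f})".format(*box))
        # COCO keypoints are a flat [x1, y1, v1, x2, y2, v2, ...] list.
        kps = ann.get("keypoints", [])
        for i in range(0, len(kps), 3):
            kx, ky, v = kps[i] / img_w, kps[i + 1] / img_h, kps[i + 2]
            lines.append(f"keypoint {i // 3 + 1}: ({kx:.3f}, {ky:.3f}, {v})")
    return "\n".join(lines)

example = build_context(
    captions=["A man skiing down a slope."],
    annotations=[{"bbox": [100, 50, 200, 300],
                  "keypoints": [150, 80, 2, 160, 78, 2]}],
    img_w=640, img_h=480,
)
```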
Figure 1.
One example to demonstrate the structure of instruction-following data. The top block displays the context information, including captions, bounding boxes (shown as solid rectangles in the visual image), and keypoints (shown as green circular markers in the visual image) used to prompt GPT-4o, and the bottom block displays the three types of responses generated. It is important to note that the visual image itself is not used to prompt GPT-4o; it is shown solely for reference purposes.
Using these symbolic representations, we generate three distinct types of instruction-following data (example in the bottom block of Figure 1) from the COCO dataset [33]:
- Conversation: Dynamic interactions simulating real-world conversational exchanges about human poses and actions, such as asking what individuals are doing in a given scene (see detailed prompts and curation process in Table 1).
- Detailed description: In-depth descriptions focusing on human body language and environmental interactions, transcending simple object identification to emphasize narrative-style outputs useful in applications requiring detailed human observation (see detailed prompts and curation process in Table 2).
- Complex reasoning: Challenges requiring multi-step reasoning about human activities, such as understanding intentions behind specific actions or predicting next possible movements based on current poses (see detailed prompts and curation process in Table 3).
Table 1.
For each query, we demonstrate the process of building the prompt for GPT-4o to generate a multi-turn conversation about the image. The examples come from fewshot_samples, where each example provides a short caption. We first construct few-shot demonstrations using context–response pairs, then query GPT-4o to generate new questions and answers focused on human poses and actions. The messages form the final prompt.
messages = [{"role": "system", "content": f"""You are an AI visual assistant, and you are seeing a single image. What you see are provided with five sentences, describing the same image you are looking at. Answer all questions as you are seeing the image. Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers. Your primary focus should be on generating questions and answers about the poses, actions, and movements of people in the image. Please generate diverse questions that relate primarily to human poses and actions, and provide detailed answers as if you are seeing the image. Only include questions that have definite answers: (1) one can see the content in the image that the question asks about and can answer confidently; (2) one can determine confidently from the image that it is not in the image. Do not ask any question that cannot be answered confidently."""}]

for sample in fewshot_samples:
    messages.append({"role": "user", "content": sample["context"]})
    messages.append({"role": "assistant", "content": sample["response"]})

messages.append({"role": "user", "content": "Can you generate some new questions and answers focusing on the poses and actions of people in the image?"})
Table 2.
For each query, we demonstrate the prompt construction process for GPT-4o to generate detailed descriptions about human poses and actions. The examples come from annotations_group, with each example containing an input annotation["context"]. The final instruction in the prompt is randomly selected from detailed_description, which consists of instructions listed in Table 4. The messages form the final prompt.
messages = [{"role": "system", "content": f"""You are an AI visual assistant specializing in analyzing human poses and actions in images. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes and human keypoints, represented as (x1, y1, x2, y2) for bounding boxes and (x, y, visibility) for keypoints, with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y for bounding boxes, and x, y coordinates along with visibility (0: not labeled, 1: labeled but not visible, 2: labeled and visible) for keypoints. The keypoints represent the following body parts: 1. nose 2. left eye 3. right eye 4. left ear 5. right ear 6. left shoulder 7. right shoulder 8. left elbow 9. right elbow 10. left wrist 11. right wrist 12. left hip 13. right hip 14. left knee 15. right knee 16. left ankle 17. right ankle Using the provided captions and bounding box/human keypoint information, describe the scene in a detailed manner, focusing on the human poses and actions. Instead of directly mentioning the bounding box or keypoint coordinates, utilize this data to explain the scene using natural language. Include details like the number of people, their actions, poses, interactions, and relative positions. When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box/keypoints. Always answer as if you are directly looking at the image."""}]

for annotation in annotations_group:
    messages.append({"role": "user", "content": annotation["context"]})

messages.append({"role": "user", "content": random.choice(detailed_description)})
Table 3.
For each query, we demonstrate the prompt construction process for GPT-4o to generate complex reasoning responses about human poses and actions. The examples come from annotations_group, with each example containing an input annotation["context"]. The messages form the final prompt.
messages = [{"role": "system", "content": f"""You are an AI visual assistant specializing in analyzing human poses and actions in images. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes and human keypoints, represented as (x1, y1, x2, y2) for bounding boxes and (x, y, visibility) for human keypoints, with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y for bounding boxes, and x, y coordinates along with visibility (0: not labeled, 1: labeled but not visible, 2: labeled and visible) for human keypoints. The human keypoints represent the following body parts: 1. nose 2. left eye 3. right eye 4. left ear 5. right ear 6. left shoulder 7. right shoulder 8. left elbow 9. right elbow 10. left wrist 11. right wrist 12. left hip 13. right hip 14. left knee 15. right knee 16. left ankle 17. right ankle The task is to use the provided caption and bounding box/human keypoint information to create a plausible question about the human poses and actions in the image, and provide the answer in detail. Create complex questions beyond describing the scene. To answer such questions, one should require first understanding the human poses and actions, then based on the background knowledge or reasoning, either explain why the actions are happening that way, or provide guidance and help to the user's request. Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first. Do not include any coordinates or numerical values in your explanation. Instead, utilize the data to explain the scene using natural language. Include details like the number of people, their actions, poses, interactions, relative positions, as well as the relationships and interactions between people and objects in the scene. Describe how people are using objects, their proximity to objects, and any activities involving both people and objects. When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box/human keypoints. Always answer as if you are directly looking at the image."""}]

for annotation in annotations_group:
    messages.append({"role": "user", "content": annotation["context"]})
Table 4.
The list of instructions for detailed image description about human poses and actions. These instructions exhibit semantic invariance under lexical transformation.
➤ "Describe the actions and poses of people in the following image in detail."
➤ "Provide a detailed description of the poses of people in the given image."
➤ "Explain the various details of the actions of people you see in the image."
➤ "Share a comprehensive analysis of the behaviors of people presented in the image."
➤ "Offer a thorough analysis of the actions of people in the image."
➤ "Explain the various poses and actions of people in the displayed image with great detail."
➤ "Characterize the poses of people in the image using a well-detailed description."
➤ "Break down the actions of people in the image in a detailed manner."
➤ "Walk through the important details of the actions of people in the image."
➤ "Portray the poses and actions of people in the image with a rich, descriptive narrative."
➤ "Narrate the actions and poses of people in the image with precision."
➤ "Analyze the poses and actions of people in the image in a comprehensive and detailed manner."
➤ "Illustrate the actions and poses of people in the image through a descriptive explanation."
➤ "Examine the actions and poses of people in the image closely and share their details."
➤ "Write an exhaustive depiction of the actions of people in the given image."
➤ "Carefully observe the people in the image and share the details of their poses and actions."
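Putting the pieces of Tables 1–4 together, the generation loop can be sketched as follows. This is an illustrative sketch rather than our released code: build_messages and generate are hypothetical helper names and the trimmed system prompt is a placeholder, while the GPT-4o call itself uses the official OpenAI Python client's chat.completions.create interface.

```python
# Hypothetical sketch of the data-generation loop around the prompts in
# Tables 1-4. Helper names and the abbreviated prompt text are assumptions.

def build_messages(system_prompt, fewshot_samples, final_query):
    """Assemble the few-shot prompt in the shape shown in Table 1."""
    messages = [{"role": "system", "content": system_prompt}]
    for sample in fewshot_samples:
        messages.append({"role": "user", "content": sample["context"]})
        messages.append({"role": "assistant", "content": sample["response"]})
    messages.append({"role": "user", "content": final_query})
    return messages

def generate(messages, model="gpt-4o"):
    """Query GPT-4o once and return the generated text."""
    from openai import OpenAI  # official client; reads OPENAI_API_KEY
    client = OpenAI()
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content
```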
This approach generates 200,328 unique vision–language instruction-following samples (112,980 conversations, 45,174 detailed descriptions, and 42,174 complex reasonings). These samples are specifically tailored to enrich the multimodal model’s ability to interpret and engage with human pose and action understanding. For example, in scenarios involving skiing, as shown in Figure 1, our approach uses keypoint data to provide nuanced understanding of the skier’s posture, balance, and motion.
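The reported composition can be sanity-checked with a few lines of arithmetic:

```python
# Check that the three instruction types sum to the reported total and
# compute each type's share of the dataset.
conversations, descriptions, reasonings = 112_980, 45_174, 42_174
total = conversations + descriptions + reasonings
assert total == 200_328
shares = {name: round(100 * n / total, 1)
          for name, n in [("conversation", conversations),
                          ("detailed description", descriptions),
                          ("complex reasoning", reasonings)]}
```

Conversations dominate the mix at roughly 56% of the samples, with detailed descriptions and complex reasonings splitting the remainder almost evenly.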
4. Model Architecture and Fine-Tuning Approach
Our LLaVA-Pose model adopts the simple yet powerful architecture of LLaVA-1.5 [25], which has demonstrated strong capabilities in visual instruction tuning while remaining highly data-efficient. We follow this architecture without structural modification, and focus on augmenting it with keypoint-integrated instruction-following data to enable improved understanding of human poses and actions.
Figure 2 provides an overview of our LLaVA-Pose model architecture:
Figure 2.
LLaVA-Pose model architecture.
- Vision Encoder: The input image is first processed by a pre-trained CLIP ViT-L encoder [41], which extracts visual features.
- Vision–Language Connector: The visual features are passed through a two-layer multi-layer perceptron (MLP), which projects them into the embedding space of the LLM.
- Large Language Model: The mapped visual embeddings are concatenated with the text token embeddings, and the combined embeddings are input into the LLM Vicuna-v1.5 [42]. The LLM generates textual responses conditioned on both the visual and textual information.
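The data flow above can be sketched numerically. This is a minimal NumPy sketch, not the actual implementation: the dimensions (576 image patches, 1024-d CLIP features, 4096-d LLM embeddings) are typical of LLaVA-1.5-7B but are assumptions here, and the random weights merely stand in for the learned connector.

```python
import numpy as np

# Sketch of the LLaVA-1.5-style connector: CLIP patch features are projected
# by a two-layer MLP (GELU in between) into the LLM embedding space and
# concatenated with the text token embeddings. All values are random
# stand-ins; only the shapes illustrate the architecture.

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_vis, d_llm, n_patches, n_text = 1024, 4096, 576, 32

vision_feats = rng.standard_normal((n_patches, d_vis))  # from CLIP ViT-L
W1 = rng.standard_normal((d_vis, d_llm)) * 0.01         # MLP layer 1
W2 = rng.standard_normal((d_llm, d_llm)) * 0.01         # MLP layer 2
text_embeds = rng.standard_normal((n_text, d_llm))      # text token embeddings

projected = gelu(vision_feats @ W1) @ W2                      # (576, 4096)
llm_input = np.concatenate([projected, text_embeds], axis=0)  # (608, 4096)
```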
We freeze the vision encoder and fine-tune only the MLP connector and LLM on our keypoint-integrated instruction-following data described in Section 3. Through this fine-tuning process, the model enhances its ability to converse about, describe, and reason over complex human-centric visual content. As shown in our experiments (Section 5), this architecture significantly improves performance on human pose and action understanding tasks.
5. Experiments
We fine-tune the LLaVA-1.5-7B model to enhance its instruction-following ability in human pose and action understanding tasks, using a dataset of 200,328 unique samples generated from the COCO training dataset [33] by GPT-4o. The dataset consists of three instruction categories: conversation, detailed description, and complex reasoning (Section 3). We refer to the resulting fine-tuned model as LLaVA-Pose. We train our model using the DeepSpeed framework [43] on 2 × A100 GPUs, following hyperparameters similar to those used in the original LLaVA-1.5 model [25], with adjustments for computational resources and stability, including a batch size of 8 and gradient accumulation steps of 2. Table 5 summarizes the hyperparameters used during fine-tuning.
Table 5.
Fine-tuning hyperparameters used in our experiments.
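Assuming the batch size of 8 is per GPU (DeepSpeed's per-device convention, an assumption on our part) and combining it with the gradient accumulation steps and GPU count stated above, the effective global batch size works out as:

```python
# Effective global batch size under the stated settings, assuming the
# batch size of 8 is per GPU rather than global.
per_gpu_batch = 8
grad_accum_steps = 2
num_gpus = 2
effective_batch = per_gpu_batch * grad_accum_steps * num_gpus  # 32
```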
5.1. Qualitative Evaluation
We conduct a qualitative evaluation to compare the responses of eight models: DeepSeek-VL2 [18], [19], Qwen2-VL-7B [20], InternVL3-8B [21], ChatPose [15], Janus-Pro-7B [22], LLaVA-1.5-7B [25], and our LLaVA-Pose. Table 6 and Table 7 present detailed comparisons of these models’ outputs for two representative images from the COCO Validation dataset [33], focusing on queries related to human pose and action understanding.
Table 6.
Comparison of responses from DeepSeek-VL2, , Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B, LLaVA-1.5-7B, LLaVA-Pose, and GPT-4o for the given tennis player image. Blue-highlighted content indicates key information relevant to the question.
Table 7.
Comparison of responses from DeepSeek-VL2, , Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B, LLaVA-1.5-7B, LLaVA-Pose, and GPT-4o for the given surfing image. Blue-highlighted content indicates key information relevant to the question.
When asked to provide a detailed description of the poses and actions of the characters in the tennis player image (Table 6), the responses from the DeepSeek-VL2, , Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B, and LLaVA-1.5-7B models offer simple descriptions, concentrating on general aspects of the scene. Their analyses remain relatively superficial, lacking detailed explanations of the characters’ postures, movements, positions, or interactions. In contrast, our LLaVA-Pose model delivers a more nuanced and contextually aware analysis. It identifies key elements of the main player’s posture, including specific details such as the flexion of the knees, the positioning of the elbows and wrists, and the alignment of the shoulders. Additionally, our model clearly explains how these body parts contribute to the power and precision of the player’s tennis swing. It also captures the passive involvement of two women in the scene, accurately describing their postures, attention levels, and engagement, which other models fail to recognize.
Similarly, when asked about the surfing image (Table 7), LLaVA-Pose again outperforms the other models in providing a comprehensive and fine-grained analysis. While the baseline models typically state that the person is standing with bent knees and extended arms for balance, their explanations lack depth and contextual reasoning. In contrast, LLaVA-Pose offers a detailed description of the surfer’s posture and movement: it highlights the bent knees lowering the center of gravity, the extended arms functioning as a counterbalance, and how these pose elements contribute to agility, shock absorption, and control while maneuvering on the surfboard. Furthermore, it explains how the surfer’s posture dynamically adapts to the changing conditions of the waves, which is an aspect not captured by the other models.
We include GPT-4o’s [40] responses in Table 6 and Table 7 as a reference, given its role as the teacher model in our instruction data generation (Section 3). Qualitative analysis reveals that LLaVA-Pose generates nuanced explanations comparable to GPT-4o. This demonstrates that fine-tuning with keypoint-integrated data enables LLaVA-Pose to emulate the teacher model’s reasoning capabilities for human-centric understanding, while outperforming other VLMs.
5.2. Quantitative Evaluation
To systematically evaluate performance, drawing inspiration from prior work [5,42], we utilize GPT-4o [40] to measure response quality across different models. Following LLaVA’s methodology, we create triplets (image, ground-truth descriptions, and question) and have each model generate responses. A language-only GPT-4o then evaluates these responses on a scale of 1–10, considering helpfulness, relevance, accuracy, and detail level.
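A common implementation of this judging protocol, used in LLaVA's evaluation scripts, has the judge score both a reference answer and the candidate answer, then reports the candidate's aggregate score relative to the reference as a percentage. The sketch below assumes that protocol and assumes the judge's first output line contains the two scores; the helper names and exact output format are assumptions, not our released code.

```python
# Hypothetical sketch of aggregating judge outputs into a relative score,
# assuming each judge reply starts with "<reference score> <candidate score>"
# on its first line, followed by a free-form explanation.

def parse_scores(judge_reply):
    first_line = judge_reply.strip().splitlines()[0]
    ref, cand = [float(s) for s in first_line.replace(",", " ").split()[:2]]
    return ref, cand

def relative_score(judge_replies):
    """Candidate total as a percentage of the reference total."""
    refs, cands = zip(*(parse_scores(r) for r in judge_replies))
    return 100.0 * sum(cands) / sum(refs)
```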
To evaluate model performance on human-centric tasks, we construct the Extended Human Pose and Action Understanding Benchmark (E-HPAUB), an extended version of HPAUB proposed in our previous work [44]. In this extended version, we randomly select 90 images of human-centric scenes from the COCO Validation 2014 dataset [33] and generate three distinct types of questions for each image: conversation, detailed description, and complex reasoning, resulting in a total of 270 questions. The questions are crafted using the data generation approach outlined in Section 3. This extended benchmark enables a more comprehensive evaluation of the model’s ability to interpret and respond to diverse human-centric visual scenarios involving human poses and actions. By systematically varying the training datasets, we analyze the impact of different types of instruction-following data on model performance, as shown in Table 8. Results demonstrate substantial improvements compared to the original LLaVA-1.5-7B model, such as the following:
Table 8.
Ablation study on E-HPAUB with various training data. We prompt GPT-4o to evaluate and compare responses generated by our LLaVA-Pose model against those from the original LLaVA-1.5-7B model. GPT-4o is tasked with assessing and providing ratings based on the overall quality of the answers, accompanied by detailed explanations.
- Conversation: The fine-tuned model scores 58.9 vs. original’s 54.0;
- Detailed description: 69.1 vs. 47.2;
- Complex reasoning: 75.1 vs. 55.2;
- Overall performance: The model fine-tuned on all three data types achieves 69.4 vs. the original LLaVA-1.5-7B’s 52.1, representing a 33.2% increase.
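The 33.2% figure is the relative improvement of the overall score:

```python
# Relative improvement of the fine-tuned model's overall score over the
# original LLaVA-1.5-7B's score.
fine_tuned, original = 69.4, 52.1
relative_gain = 100 * (fine_tuned - original) / original  # ~33.2
```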
As shown in Table 8, all models fine-tuned with keypoint-integrated instruction-following data outperform the baseline LLaVA-1.5-7B model that uses only captions and bounding boxes, demonstrating the significant contribution of keypoint information.
Using the E-HPAUB benchmark, we further conduct a comparative quantitative evaluation of DeepSeek-VL2, , Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B, and our LLaVA-Pose model. The results, summarized in Table 9, demonstrate that LLaVA-Pose achieves the highest overall average score of 69.6, outperforming DeepSeek-VL2 (48.1), (56.7), Qwen2-VL-7B (65.1), InternVL3-8B (66.3), ChatPose (67.5, despite its 13B parameters), and Janus-Pro-7B (68.3). Notably, LLaVA-Pose surpasses the second-best model, Janus-Pro-7B, by an absolute margin of 1.3 points, corresponding to a relative improvement of 1.9%. These results further demonstrate that even when compared against state-of-the-art VLMs in the multimodal understanding domain, our LLaVA-Pose model maintains a clear performance advantage in understanding and reasoning about complex human poses and actions.
Table 9.
Comparative performance evaluated by GPT-4o on E-HPAUB.
To address potential bias toward GPT-style responses, we conduct an additional evaluation using Claude Sonnet 4 [45] as an independent evaluator (Table 10). This model-agnostic approach mitigates style preference risks by cross-validating results across distinct LLM families (OpenAI GPT vs. Anthropic Claude). Both evaluators rate responses on the same 1–10 scale using identical criteria: helpfulness, relevance, accuracy, and detail level. Results in Table 9 (GPT-4o) and Table 10 (Claude Sonnet 4) show strong inter-evaluator agreement: LLaVA-Pose consistently ranks first overall (69.6 vs. 65.5), outperforming second-place Janus-Pro-7B by +1.3 (GPT-4o) and +1.0 (Claude Sonnet 4) absolute points. Notably, the relative performance hierarchy remains stable across evaluators. This inter-evaluator consensus suggests our findings are robust to model-specific stylistic preferences.
Table 10.
Comparative performance evaluated by Claude Sonnet 4 on E-HPAUB.
6. Discussion
6.1. Key Findings
Fine-tuning the LLaVA-1.5-7B model on keypoint-integrated instruction-following data significantly enhances its ability to understand human poses and actions. This fine-tuning process leads to substantial improvements in human-centric tasks, enabling multimodal AI systems to operate more effectively in real-world environments. Notably, our LLaVA-Pose model achieves the highest overall performance on the E-HPAUB benchmark. These results demonstrate the effectiveness of integrating keypoint-level information to improve multimodal understanding of complex human activities. Certain models (e.g., Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B) achieve higher scores on specific sub-tasks; this may be partly due to differences in model capacity and pretraining scale: for instance, ChatPose has 13B parameters, InternVL3-8B has 8B parameters, and Qwen2-VL-7B and Janus-Pro-7B are trained on substantially larger and more diverse datasets. In contrast, LLaVA-Pose is built on a 7B model and fine-tuned on our targeted pose-aware dataset, prioritizing pose and action reasoning over exhaustive visual detail coverage.
6.2. Limitations and Future Work
While LLaVA-Pose excels at understanding static images, it currently lacks any temporal modeling capability, which is a limitation for tasks involving dynamic human actions and interactions. Many human activities unfold over time and require sequential context to be correctly interpreted. Static, frame-by-frame pose analysis often fails to capture crucial motion cues or to resolve occlusions without temporal continuity [46]. Indeed, effective spatio-temporal modeling is widely recognized as essential for robust action recognition in video data [47]. Without temporal context, a model may miss subtle transitions or ambiguous poses that are only distinguishable when viewed as part of a sequence. This limitation impacts applications such as video-based activity analysis and sports performance monitoring, where understanding the temporal evolution of poses is critical for accurate interpretation of complex behaviors. To address this gap, future work will explore extending our framework to sequential visual data by incorporating explicit temporal modeling techniques, such as recurrent neural networks (RNNs) or Transformer-based attention mechanisms. Sequence models like RNNs (and their gated variants) have long been used to capture the dynamics of human pose sequences, treating poses over time as sequential data for action recognition [48]. More recently, transformer-based video models have demonstrated that adding dedicated temporal encoders can significantly improve a model’s performance and temporal reasoning abilities in video understanding tasks [49]. Drawing on these advances, we plan to integrate a temporal module into LLaVA-Pose so that it can learn the continuity and changes of human poses over time. We anticipate that enabling temporal context will substantially enhance the model’s ability to reason about human behaviors in complex, time-varying environments, thereby overcoming the current limitation and broadening the applicability of our approach.
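One natural data-side step toward the temporal extension sketched above is to serialize per-frame keypoints into a sequence-aware instruction sample, mirroring how the single-image pipeline injects keypoints into prompts. The snippet below is a hedged illustration only: the keypoint names, coordinate format, and prompt wording are assumptions, not the paper's actual data-generation format.

```python
# Sketch: turning a short sequence of per-frame 2D keypoints into a
# temporal instruction prompt. Names and wording are illustrative.
def format_frame(t, keypoints):
    """Render one frame's keypoints as 'name: (x, y)' pairs."""
    parts = ", ".join(f"{name}: ({x}, {y})" for name, (x, y) in keypoints.items())
    return f"Frame {t}: {parts}"

def build_temporal_prompt(frames):
    """Assemble a header, per-frame keypoint lines, and a question."""
    lines = [format_frame(t, kp) for t, kp in enumerate(frames)]
    header = ("The following are 2D keypoint coordinates of one person "
              "across consecutive video frames.")
    question = "Describe how the person's pose changes over time."
    return "\n".join([header, *lines, question])

# Three frames of a hypothetical arm-raising motion.
frames = [
    {"right_wrist": (210, 340), "right_elbow": (200, 300)},
    {"right_wrist": (215, 300), "right_elbow": (202, 295)},
    {"right_wrist": (220, 260), "right_elbow": (205, 290)},
]
prompt = build_temporal_prompt(frames)
print(prompt)
```

Pairing such prompts with sequence-aware responses would let the same teacher-model pipeline used for static images produce training data for the temporal module discussed above.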
Beyond temporal modeling, there are two further limitations to acknowledge. First, inference efficiency has not been systematically analyzed. Although our motivation is to distill GPT-4o’s capabilities into a smaller, open-source model suitable for local deployment, future work should explicitly investigate runtime latency and efficiency to validate the practical advantages in edge-device scenarios. Second, our ablation studies have primarily focused on the data generation process. Exploring architectural variations, such as lightweight pose-aware attention modules or multi-branch connectors, may further improve performance and will be addressed in future work.
7. Conclusions
In this paper, we introduced a method for generating vision–language instruction-following data by integrating human keypoints alongside traditional caption and bounding box information. This keypoint-integrated approach significantly enhances a multimodal model’s ability to understand and reason about human poses and actions. Unlike the original LLaVA method, which relied primarily on coarse object-level annotations, our method leverages fine-grained spatial and structural cues from keypoints to improve interpretability in human-centric visual scenarios. Through rigorous experimentation and evaluation, our fine-tuned model, LLaVA-Pose, demonstrated superior performance across various tasks related to human pose and action understanding. These findings underscore the potential of integrating fine-grained human keypoint data to enhance the capabilities of multimodal AI systems.
Author Contributions
Conceptualization, D.Z. and H.S.; methodology, D.Z.; software, D.Z. and W.A.; validation, D.Z.; formal analysis, D.Z.; investigation, D.Z.; resources, D.Z.; data curation, D.Z.; writing—original draft preparation, D.Z.; writing—review and editing, D.Z., T.H. and H.S.; visualization, D.Z.; supervision, H.S.; project administration, D.Z. and H.S.; funding acquisition, D.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by JST SPRING (grant number JPMJSP2131).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The dataset generated and analyzed during this study is available within the article. In addition, the source code is available at: https://github.com/Ody-trek/LLaVA-Pose.
Conflicts of Interest
Author Wangpeng An was employed by the company TikTok Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| VLMs | vision–language models |
| E-HPAUB | extended human pose and action understanding benchmark |
| AI | artificial intelligence |
| LLaVA | large language and vision assistant |
| LLMs | large language models |
| SMPL | skinned multi-person linear |
| MANet | micro-action network |
| RNNs | recurrent neural networks |
References
- Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.D.; et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 958–979. [Google Scholar]
- Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
- Wu, J.; Gan, W.; Chen, Z.; Wan, S.; Yu, P.S. Multimodal large language models: A survey. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2247–2256. [Google Scholar]
- Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
- Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv 2023, arXiv:2306.14824. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
- Zhang, D.; Hussain, T.; An, W.; Shouno, H. PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment. arXiv 2025, arXiv:2507.09139. [Google Scholar]
- Kyrollos, D.G.; Fuller, A.; Greenwood, K.; Harrold, J.; Green, J.R. Under the cover infant pose estimation using multimodal data. IEEE Trans. Instrum. Meas. 2023, 72, 5007212. [Google Scholar] [CrossRef]
- Zhou, H.; Wang, D.; Yu, Y.; Zhang, Z. Research progress of human–computer interaction technology based on gesture recognition. Electronics 2023, 12, 2805. [Google Scholar] [CrossRef]
- Wang, T.; Zheng, P.; Li, S.; Wang, L. Multimodal human–robot interaction for human-centric smart manufacturing: A survey. Adv. Intell. Syst. 2024, 6, 2300359. [Google Scholar] [CrossRef]
- Yildirim, N.; Richardson, H.; Wetscherek, M.T.; Bajwa, J.; Jacob, J.; Pinnock, M.A.; Harris, S.; Coelho De Castro, D.; Bannur, S.; Hyland, S.; et al. Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 11–16 May 2024. CHI ’24. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Feng, Y.; Lin, J.; Dwivedi, S.K.; Sun, Y.; Patel, P.; Black, M.J. Chatpose: Chatting about 3d human pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2093–2103. [Google Scholar]
- Feng, D.; Guo, P.; Peng, E.; Zhu, M.; Yu, W.; Wang, P. PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 2951–2959. [Google Scholar]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. ACM Trans. Graph. 2015, 34, 248. [Google Scholar] [CrossRef]
- Wu, Z.; Chen, X.; Pan, Z.; Liu, X.; Liu, W.; Dai, D.; Gao, H.; Ma, Y.; Wu, C.; Wang, B.; et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv 2024, arXiv:2412.10302. [Google Scholar]
- Wu, P.; Xie, S. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13084–13094. [Google Scholar]
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
- Zhu, J.; Wang, W.; Chen, Z.; Liu, Z.; Ye, S.; Gu, L.; Tian, H.; Duan, Y.; Su, W.; Shao, J.; et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv 2025, arXiv:2504.10479. [Google Scholar]
- Chen, X.; Wu, Z.; Liu, X.; Pan, Z.; Liu, W.; Xie, Z.; Yu, X.; Ruan, C. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv 2025, arXiv:2501.17811. [Google Scholar]
- Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ‘23, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ‘22, Red Hook, NY, USA, 28 November–9 December 2022. [Google Scholar]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26296–26306. [Google Scholar]
- Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-based Human Pose Estimation: A Survey. ACM Comput. Surv. 2023, 56, 1–37. [Google Scholar] [CrossRef]
- Surek, G.A.S.; Seman, L.O.; Stefenon, S.F.; Mariani, V.C.; Coelho, L.d.S. Video-based human activity recognition using deep learning approaches. Sensors 2023, 23, 6384. [Google Scholar] [CrossRef]
- Morshed, M.G.; Sultana, T.; Alam, A.; Lee, Y.K. Human action recognition: A taxonomy-based survey, updates, and opportunities. Sensors 2023, 23, 2182. [Google Scholar] [CrossRef]
- Le, V.H. Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset. Multimed. Tools Appl. 2023, 82, 20771–20818. [Google Scholar] [CrossRef]
- Delmas, G.; Weinzaepfel, P.; Moreno-Noguer, F.; Rogez, G. Posefix: Correcting 3D human poses with natural language. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 15018–15028. [Google Scholar]
- Delmas, G.; Weinzaepfel, P.; Lucas, T.; Moreno-Noguer, F.; Rogez, G. Posescript: 3d human poses from natural language. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 346–362. [Google Scholar]
- Delmas, G.; Weinzaepfel, P.; Lucas, T.; Moreno-Noguer, F.; Rogez, G. Posescript: Linking 3d human poses and natural language. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 5146–5159. [Google Scholar] [CrossRef] [PubMed]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Guo, D.; Li, K.; Hu, B.; Zhang, Y.; Wang, M. Benchmarking Micro-Action Recognition: Dataset, Methods, and Applications. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6238–6252. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Gu, J.; Li, K.; Wang, F.; Wei, Y.; Wu, Z.; Fan, H.; Wang, M. Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. arXiv 2025, arXiv:2507.21977. [Google Scholar]
- Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. LAION-5B: An open large-scale dataset for training next generation image–text models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ‘22, Red Hook, NY, USA, 28 November–9 December 2022. [Google Scholar]
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3558–3568. [Google Scholar]
- Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S. COYO-700M: Image-Text Pair Dataset. 2022. Available online: https://github.com/kakaobrain/coyo-dataset (accessed on 18 July 2025).
- OpenAI. Hello gpt-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 18 July 2025).
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023. Available online: https://lmsys.org/blog/2023-03-30-vicuna/ (accessed on 18 July 2025).
- Rasley, J.; Rajbhandari, S.; Ruwase, O.; He, Y. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 3505–3506. [Google Scholar]
- Zhang, D.; An, W.; Shouno, H. Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models. In Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS), Tokyo, Japan, 28–30 July 2025. [Google Scholar]
- Anthropic. Introducing Claude 4. 2025. Available online: https://www.anthropic.com/news/claude-4 (accessed on 13 August 2025).
- Zhang, Z.; Liu, W.; Zheng, Y.; Du, L.; Sun, L. Learning spatio-temporal context for basketball action pose estimation with a multi-stream network. Sci. Rep. 2025, 15, 29173. [Google Scholar] [CrossRef]
- Gu, J.; Yi, Y.; Li, Q. Motion sensitive network for action recognition in control and decision-making of autonomous systems. Front. Neurosci. 2024, 18, 1370024. [Google Scholar] [CrossRef]
- Chen, D.; Chen, M.; Wu, P.; Wu, M.; Zhang, T.; Li, C. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Sci. Rep. 2025, 15, 4982. [Google Scholar] [CrossRef]
- Li, Y.; Liu, Z.; Kong, Y.; Li, G.; Zhang, J.; Bian, C.; Liu, F.; Yao, L.; Sun, Z. Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding. arXiv 2025, arXiv:2501.16786. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).