Abstract
Current vision–language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision–language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model on this dataset and evaluate the resulting LLaVA-Pose model on the benchmark. Experimental results show an overall improvement of 33.2% over the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding.
1. Introduction
The development of multimodal models integrating vision and language has become a central focus in artificial intelligence (AI) research [1,2,3,4]. Models like LLaVA (Large Language and Vision Assistant) [5] bridge the gap between visual perception and natural language understanding, utilizing large language models (LLMs) and visual encoders to process a wide range of image-related tasks [6,7,8,9].
Despite their achievements, current models struggle with specialized tasks that require nuanced understanding of human activities, particularly those involving poses and actions. This limitation constrains their application in assistive robotics, healthcare, and human–computer interaction [10,11,12,13]. A significant challenge is the lack of specialized vision–language instruction-following data. While LLaVA introduces a method for converting image–text pairs into instruction-following data using GPT-4 [14], it primarily relies on image captions and object bounding boxes, which lack the precision needed to interpret complex human activities. Consequently, models trained on such data show limited performance in tasks requiring detailed understanding of human pose and action.
Recent efforts have started to address this limitation by exploring language-guided pose understanding. For example, ChatPose [15] leverages LLMs to reason about 3D human poses from images and text, and PoseLLaVA [16] introduces a pose-centric multimodal LLM capable of unified pose estimation, adjustment, and generation. However, they primarily focus on 3D representations based on the Skinned Multi-Person Linear (SMPL) model [17] and do not explicitly target instruction-following tasks designed to facilitate interpretable human pose and action understanding in natural image scenarios.
To address this gap, we propose a novel approach that integrates human keypoints into the instruction-following data generation process. Our keypoint-integrated method provides a more comprehensive representation of human pose and action, enabling models to reason not just about objects in an image but about how individuals interact with those objects and each other. This significantly enhances the model’s ability to describe human movements, reason about their purposes, and respond to queries about human interactions.
Our contributions can be summarized as follows:
- We introduce a method for generating vision–language instruction-following data by integrating human keypoints, filling a critical gap in existing multimodal models for human-centric visual understanding.
- We demonstrate substantial improvements in human-centric visual tasks through comprehensive experiments comparing our fine-tuned LLaVA-1.5-7B model (hereafter referred to as LLaVA-Pose) with its original version and other models.
- We offer insights into how different types of fine-tuning data impact model capabilities for specific domains.
2. Related Work
2.1. Instruction-Following Multimodal Models
The LLaVA model [5] has made noteworthy progress by integrating vision encoders with LLMs to tackle various vision–language tasks. Similarly, models such as DeepSeek-VL2 [18], [19], Qwen2-VL [20], InternVL3 [21], Janus-Pro [22], VisionLLM [23] and Flamingo [24] have been developed for general vision understanding. Although effective for image description and elementary visual reasoning, these models are not specifically designed for interpreting detailed human poses and actions. We introduce a method for generating instruction-following data specifically for human pose and action understanding by leveraging human keypoint information alongside traditional visual features. By integrating this enriched dataset into the fine-tuning process of the LLaVA-1.5-7B model [25], we enhance its capacity for complex reasoning and detailed description of human-centric activities.
2.2. Multimodal Human-Centric Visual Understanding
Traditional human activity recognition typically relies on distinct models for specific tasks [26,27,28,29], but recent research shows a trend toward unifying these capabilities within a single multimodal framework. For instance, ChatPose [15] uses LLMs to combine language-based reasoning with visual input for understanding and generating 3D human poses. PoseLLaVA [16] further advances this direction by introducing a pose-centric multimodal LLM that integrates a pose encoder-decoder into the LLaVA framework. It enables unified processing of pose estimation, pose adjustment, and pose generation tasks across pose, image, and text modalities. Moreover, PoseLLaVA introduces the PosePart dataset to complement existing pose-language datasets such as PoseFix [30] and PoseScript [31,32], improving the model’s ability to perform fine-grained pose manipulation. While these methods leverage SMPL parameters [17] for 3D pose representation, they face challenges in real-world applicability: (1) dependence on precise 3D ground truth lacking in natural images, (2) reconstruction errors (e.g., global orientation drift [15]), and (3) computational overhead from parametric modeling. Our approach addresses these gaps by operating in a 2D context: We integrate human keypoints into instruction-following data, bypassing 3D ambiguities while explicitly linking spatial cues to language semantics. This design enhances interpretability for human actions in daily scenes through diverse instruction types (conversation, description, and reasoning) and leverages ubiquitous 2D annotations (e.g., COCO [33]), encouraging the model to associate pose–action cues with natural language queries for explainable human-centric understanding.
2.3. Micro-Action Recognition
Recent studies have investigated fine-grained human motion understanding, particularly in the field of micro-action recognition. Guo et al. [34] establish a benchmark for micro-action recognition. Their work introduces the Micro-action-52 dataset, a large-scale collection focusing on whole-body, low-intensity movements captured during professional psychological interviews, uniquely emphasizing lower-limb micro-actions often missed in prior datasets. Alongside the dataset, they propose the micro-action network (MANet), which integrates squeeze-and-excitation blocks and temporal shift modules into a ResNet backbone [35], and introduce a joint-embedding loss to enhance discrimination between visually similar categories. Furthermore, they comprehensively evaluate MANet against nine prevalent action recognition methods and demonstrate the utility of their approach in an emotion recognition application. Gu et al. [36] propose a motion-guided modulation network that incorporates skeleton-based motion cues to enhance the recognition of subtle and rapid human actions. These works share a similar objective with ours in recognizing fine-grained human motion patterns. However, our approach differs in that we leverage vision–language models (VLMs) to integrate visual and textual modalities for human pose and action understanding.
3. Keypoint-Integrated Visual Instruction Data Generation
Large-scale multimodal datasets like LAION-5B [37], CC-12M [38], and COYO-700M [39] have advanced VLMs. However, leveraging these datasets specifically for instruction-following tasks involving nuanced understanding of human pose and action remains underexplored.
Previous research such as LLaVA demonstrates promising results in generating visual instruction-following data using symbolic representations (captions and bounding boxes) to prompt language-only GPT-4 [14]. Our approach enhances this foundation by integrating human keypoint data into the instruction-following data generation process. While LLaVA focuses primarily on captions and bounding boxes to represent visual content, our method enriches this representation by including annotations of keypoints, which capture precise positions and movements of human body parts within the scene.
To enhance the visual understanding capabilities of our model, we extend the data generation methodology originally used in LLaVA by incorporating human-centric annotations, using GPT-4o [40] as a teacher model. To represent images as visual features for prompting GPT-4o, our approach considers the following: (1) captions describing the visual scene from various perspectives; (2) bounding boxes localizing objects and providing spatial information; and (3) keypoints representing precise locations of joints and critical body parts. This enriched representation (example in the top block of Figure 1) allows for comprehensive understanding of human pose and action by detailing the exact positioning of body parts. The captions, bounding boxes, and keypoint annotations are obtained directly from the COCO dataset [33].
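As a concrete illustration, the symbolic context for one image could be assembled roughly as follows. This is a hypothetical sketch, not our actual pipeline code: the helper name build_context and the exact line layout are illustrative, while the annotation fields ("bbox" as [x, y, width, height] in pixels and "keypoints" as a flat [x, y, visibility] list) follow the COCO annotation format.

```python
# Hypothetical sketch: serializing COCO-style annotations into the textual
# context used to prompt GPT-4o. The output layout is an illustrative
# assumption; the input field conventions follow the COCO format.

def build_context(captions, annotations, img_w, img_h):
    """Flatten captions, bounding boxes, and keypoints into one prompt string."""
    lines = list(captions)  # e.g., the five COCO captions for the image
    for ann in annotations:
        # COCO bbox is [x, y, width, height] in pixels; normalize to 0-1
        # and convert to (x1, y1, x2, y2) as described in Section 3.
        x, y, w, h = ann["bbox"]
        box = (x / img_w, y / img_h, (x + w) / img_w, (y + h) / img_h)
        lines.append("person: ({:.3f}, {:.3f}, {:.3f}, {:.3f})".format(*box))
        # COCO keypoints are a flat [x1, y1, v1, x2, y2, v2, ...] list.
        kps = ann.get("keypoints", [])
        for i in range(0, len(kps), 3):
            kx, ky, v = kps[i] / img_w, kps[i + 1] / img_h, kps[i + 2]
            lines.append(f"keypoint {i // 3 + 1}: ({kx:.3f}, {ky:.3f}, {v})")
    return "\n".join(lines)

example = build_context(
    captions=["A man skiing down a slope."],
    annotations=[{"bbox": [100, 50, 200, 300],
                  "keypoints": [150, 80, 2, 160, 78, 2]}],
    img_w=640, img_h=480,
)
```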
Figure 1.
One example to demonstrate the structure of instruction-following data. The top block displays the context information, including captions, bounding boxes (shown as solid rectangles in the visual image), and keypoints (shown as green circular markers in the visual image) used to prompt GPT-4o, and the bottom block displays the three types of responses generated. It is important to note that the visual image itself is not used to prompt GPT-4o; it is shown solely for reference purposes.
Using these symbolic representations, we generate three distinct types of instruction-following data (example in the bottom block of Figure 1) from the COCO dataset [33]:
- Conversation: Dynamic interactions simulating real-world conversational exchanges about human poses and actions, such as asking what individuals are doing in a given scene (see detailed prompts and curation process in Table 1).
- Detailed description: In-depth descriptions focusing on human body language and environmental interactions, transcending simple object identification to emphasize narrative-style outputs useful in applications requiring detailed human observation (see detailed prompts and curation process in Table 2).
- Complex reasoning: Challenges requiring multi-step reasoning about human activities, such as understanding intentions behind specific actions or predicting next possible movements based on current poses (see detailed prompts and curation process in Table 3).
Table 1.
For each query, we demonstrate the process of building the prompt for GPT-4o to generate a multi-turn conversation about the image. The examples come from fewshot_samples, where each example provides a short caption. We first construct few-shot demonstrations using context–response pairs, then query GPT-4o to generate new questions and answers focused on human poses and actions. The messages form the final prompt.
messages = [{"role": "system", "content": f"""You are an AI visual assistant, and you are seeing a single image. What you see are provided with five sentences, describing the same image you are looking at. Answer all questions as you are seeing the image. Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers. Your primary focus should be on generating questions and answers about the poses, actions, and movements of people in the image. Please generate diverse questions that relate primarily to human poses and actions, and provide detailed answers as if you are seeing the image. Only include questions that have definite answers: (1) one can see the content in the image that the question asks about and can answer confidently; (2) one can determine confidently from the image that it is not in the image. Do not ask any question that cannot be answered confidently."""}]

for sample in fewshot_samples:
    messages.append({"role": "user", "content": sample["context"]})
    messages.append({"role": "assistant", "content": sample["response"]})

messages.append({"role": "user", "content": "Can you generate some new questions and answers focusing on the poses and actions of people in the image?"})
Table 2.
For each query, we demonstrate the prompt construction process for GPT-4o to generate detailed descriptions about human poses and actions. The examples come from annotations_group, with each example containing an input annotation["context"]. The final instruction in the prompt is randomly selected from detailed_description, which consists of instructions listed in Table 4. The messages form the final prompt.
messages = [{"role": "system", "content": f"""You are an AI visual assistant specializing in analyzing human poses and actions in images. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes and human keypoints, represented as (x1, y1, x2, y2) for bounding boxes and (x, y, visibility) for keypoints, with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y for bounding boxes, and x, y coordinates along with visibility (0: not labeled, 1: labeled but not visible, 2: labeled and visible) for keypoints. The keypoints represent the following body parts: 1. nose 2. left eye 3. right eye 4. left ear 5. right ear 6. left shoulder 7. right shoulder 8. left elbow 9. right elbow 10. left wrist 11. right wrist 12. left hip 13. right hip 14. left knee 15. right knee 16. left ankle 17. right ankle Using the provided captions and bounding box/human keypoint information, describe the scene in a detailed manner, focusing on the human poses and actions. Instead of directly mentioning the bounding box or keypoint coordinates, utilize this data to explain the scene using natural language. Include details like the number of people, their actions, poses, interactions, and relative positions. When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box/keypoints. Always answer as if you are directly looking at the image."""}]

for annotation in annotations_group:
    messages.append({"role": "user", "content": annotation["context"]})

messages.append({"role": "user", "content": random.choice(detailed_description)})
Table 3.
For each query, we demonstrate the prompt construction process for GPT-4o to generate complex reasoning responses about human poses and actions. The examples come from annotations_group, with each example containing an input annotation["context"]. The messages form the final prompt.
messages = [{"role": "system", "content": f"""You are an AI visual assistant specializing in analyzing human poses and actions in images. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes and human keypoints, represented as (x1, y1, x2, y2) for bounding boxes and (x, y, visibility) for human keypoints, with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y for bounding boxes, and x, y coordinates along with visibility (0: not labeled, 1: labeled but not visible, 2: labeled and visible) for human keypoints. The human keypoints represent the following body parts: 1. nose 2. left eye 3. right eye 4. left ear 5. right ear 6. left shoulder 7. right shoulder 8. left elbow 9. right elbow 10. left wrist 11. right wrist 12. left hip 13. right hip 14. left knee 15. right knee 16. left ankle 17. right ankle The task is to use the provided caption and bounding box/human keypoint information to create a plausible question about the human poses and actions in the image, and provide the answer in detail. Create complex questions beyond describing the scene. To answer such questions, one should require first understanding the human poses and actions, then based on the background knowledge or reasoning, either explain why the actions are happening that way, or provide guidance and help to the user's request. Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first. Do not include any coordinates or numerical values in your explanation. Instead, utilize the data to explain the scene using natural language. Include details like the number of people, their actions, poses, interactions, relative positions, as well as the relationships and interactions between people and objects in the scene. Describe how people are using objects, their proximity to objects, and any activities involving both people and objects. When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box/human keypoints. Always answer as if you are directly looking at the image."""}]

for annotation in annotations_group:
    messages.append({"role": "user", "content": annotation["context"]})
Table 4.
The list of instructions for detailed image description about human poses and actions. These instructions exhibit semantic invariance under lexical transformation.
➤ "Describe the actions and poses of people in the following image in detail."
➤ "Provide a detailed description of the poses of people in the given image."
➤ "Explain the various details of the actions of people you see in the image."
➤ "Share a comprehensive analysis of the behaviors of people presented in the image."
➤ "Offer a thorough analysis of the actions of people in the image."
➤ "Explain the various poses and actions of people in the displayed image with great detail."
➤ "Characterize the poses of people in the image using a well-detailed description."
➤ "Break down the actions of people in the image in a detailed manner."
➤ "Walk through the important details of the actions of people in the image."
➤ "Portray the poses and actions of people in the image with a rich, descriptive narrative."
➤ "Narrate the actions and poses of people in the image with precision."
➤ "Analyze the poses and actions of people in the image in a comprehensive and detailed manner."
➤ "Illustrate the actions and poses of people in the image through a descriptive explanation."
➤ "Examine the actions and poses of people in the image closely and share their details."
➤ "Write an exhaustive depiction of the actions of people in the given image."
➤ "Carefully observe the people in the image and share the details of their poses and actions."
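Putting the pieces of Tables 1–4 together, the generation loop can be sketched as follows. This is an illustrative sketch rather than our released code: build_messages and generate are hypothetical helper names and the trimmed system prompt is a placeholder, while the GPT-4o call itself uses the official OpenAI Python client's chat.completions.create interface.

```python
# Hypothetical sketch of the data-generation loop around the prompts in
# Tables 1-4. Helper names and the abbreviated prompt text are assumptions.

def build_messages(system_prompt, fewshot_samples, final_query):
    """Assemble the few-shot prompt in the shape shown in Table 1."""
    messages = [{"role": "system", "content": system_prompt}]
    for sample in fewshot_samples:
        messages.append({"role": "user", "content": sample["context"]})
        messages.append({"role": "assistant", "content": sample["response"]})
    messages.append({"role": "user", "content": final_query})
    return messages

def generate(messages, model="gpt-4o"):
    """Query GPT-4o once and return the generated text."""
    from openai import OpenAI  # official client; reads OPENAI_API_KEY
    client = OpenAI()
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content
```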
This approach generates 200,328 unique vision–language instruction-following samples (112,980 conversations, 45,174 detailed descriptions, and 42,174 complex reasonings). These samples are specifically tailored to enrich the multimodal model’s ability to interpret and engage with human pose and action understanding. For example, in scenarios involving skiing, as shown in Figure 1, our approach uses keypoint data to provide nuanced understanding of the skier’s posture, balance, and motion.
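The reported composition can be sanity-checked with a few lines of arithmetic:

```python
# Check that the three instruction types sum to the reported total and
# compute each type's share of the dataset.
conversations, descriptions, reasonings = 112_980, 45_174, 42_174
total = conversations + descriptions + reasonings
assert total == 200_328
shares = {name: round(100 * n / total, 1)
          for name, n in [("conversation", conversations),
                          ("detailed description", descriptions),
                          ("complex reasoning", reasonings)]}
```

Conversations dominate the mix at roughly 56% of the samples, with detailed descriptions and complex reasonings splitting the remainder almost evenly.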
4. Model Architecture and Fine-Tuning Approach
Our LLaVA-Pose model adopts the simple yet powerful architecture of LLaVA-1.5 [25], which has demonstrated strong capabilities in visual instruction tuning while remaining highly data-efficient. We follow this architecture without structural modification, and focus on augmenting it with keypoint-integrated instruction-following data to enable improved understanding of human poses and actions.
Figure 2 provides an overview of our LLaVA-Pose model architecture:
Figure 2.
LLaVA-Pose model architecture.
- Vision Encoder: The input image is first processed by a pre-trained CLIP ViT-L encoder [41], which extracts visual features.
- Vision–Language Connector: The visual features are passed through a two-layer multi-layer perceptron (MLP), which projects them into the embedding space of the LLM.
- Large Language Model: The mapped visual embeddings are concatenated with the text token embeddings, and the combined embeddings are input into the LLM Vicuna-v1.5 [42]. The LLM generates textual responses conditioned on both the visual and textual information.
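The data flow above can be sketched numerically. This is a minimal NumPy sketch, not the actual implementation: the dimensions (576 image patches, 1024-d CLIP features, 4096-d LLM embeddings) are typical of LLaVA-1.5-7B but are assumptions here, and the random weights merely stand in for the learned connector.

```python
import numpy as np

# Sketch of the LLaVA-1.5-style connector: CLIP patch features are projected
# by a two-layer MLP (GELU in between) into the LLM embedding space and
# concatenated with the text token embeddings. All values are random
# stand-ins; only the shapes illustrate the architecture.

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_vis, d_llm, n_patches, n_text = 1024, 4096, 576, 32

vision_feats = rng.standard_normal((n_patches, d_vis))  # from CLIP ViT-L
W1 = rng.standard_normal((d_vis, d_llm)) * 0.01         # MLP layer 1
W2 = rng.standard_normal((d_llm, d_llm)) * 0.01         # MLP layer 2
text_embeds = rng.standard_normal((n_text, d_llm))      # text token embeddings

projected = gelu(vision_feats @ W1) @ W2                      # (576, 4096)
llm_input = np.concatenate([projected, text_embeds], axis=0)  # (608, 4096)
```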
We freeze the vision encoder and fine-tune only the MLP connector and LLM on our keypoint-integrated instruction-following data described in Section 3. Through this fine-tuning process, the model enhances its ability to converse about, describe, and reason over complex human-centric visual content. As shown in our experiments (Section 5), this architecture significantly improves performance on human pose and action understanding tasks.
5. Experiments
We fine-tune the LLaVA-1.5-7B model to enhance its instruction-following ability in human pose and action understanding tasks, using a dataset of 200,328 unique samples generated from the COCO training dataset [33] by GPT-4o. The dataset consists of three instruction categories: conversation, detailed description, and complex reasoning (Section 3). We refer to the resulting fine-tuned model as LLaVA-Pose. We train our model using the DeepSpeed framework [43] on 2 × A100 GPUs, following hyperparameters similar to those used in the original LLaVA-1.5 model [25], with adjustments for computational resources and stability, including a batch size of 8 and gradient accumulation steps of 2. Table 5 summarizes the hyperparameters used during fine-tuning.
Table 5.
Fine-tuning hyperparameters used in our experiments.
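Assuming the batch size of 8 is per GPU (DeepSpeed's per-device convention, an assumption on our part) and combining it with the gradient accumulation steps and GPU count stated above, the effective global batch size works out as:

```python
# Effective global batch size under the stated settings, assuming the
# batch size of 8 is per GPU rather than global.
per_gpu_batch = 8
grad_accum_steps = 2
num_gpus = 2
effective_batch = per_gpu_batch * grad_accum_steps * num_gpus  # 32
```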
5.1. Qualitative Evaluation
We conduct a qualitative evaluation to compare the responses of eight models: DeepSeek-VL2 [18], [19], Qwen2-VL-7B [20], InternVL3-8B [21], ChatPose [15], Janus-Pro-7B [22], LLaVA-1.5-7B [25], and our LLaVA-Pose. Table 6 and Table 7 present detailed comparisons of these models’ outputs for two representative images from the COCO Validation dataset [33], focusing on queries related to human pose and action understanding.
Table 6.
Comparison of responses from DeepSeek-VL2, , Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B, LLaVA-1.5-7B, LLaVA-Pose, and GPT-4o for the given tennis player image. Blue-highlighted content indicates key information relevant to the question.
Table 7.
Comparison of responses from DeepSeek-VL2, , Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B, LLaVA-1.5-7B, LLaVA-Pose, and GPT-4o for the given surfing image. Blue-highlighted content indicates key information relevant to the question.
When asked to provide a detailed description of the poses and actions of the characters in the tennis player image (Table 6), the responses from the DeepSeek-VL2, , Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B, and LLaVA-1.5-7B models offer simple descriptions, concentrating on general aspects of the scene. Their analyses remain relatively superficial, lacking detailed explanations of the characters’ postures, movements, positions, or interactions. In contrast, our LLaVA-Pose model delivers a more nuanced and contextually aware analysis. It identifies key elements of the main player’s posture, including specific details such as the flexion of the knees, the positioning of the elbows and wrists, and the alignment of the shoulders. Additionally, our model clearly explains how these body parts contribute to the power and precision of the player’s tennis swing. It also captures the passive involvement of two women in the scene, accurately describing their postures, attention levels, and engagement, which other models fail to recognize.
Similarly, when asked about the surfing image (Table 7), LLaVA-Pose again outperforms the other models in providing a comprehensive and fine-grained analysis. While the baseline models typically state that the person is standing with bent knees and extended arms for balance, their explanations lack depth and contextual reasoning. In contrast, LLaVA-Pose offers a detailed description of the surfer’s posture and movement: it highlights the bent knees lowering the center of gravity, the extended arms functioning as a counterbalance, and how these pose elements contribute to agility, shock absorption, and control while maneuvering on the surfboard. Furthermore, it explains how the surfer’s posture dynamically adapts to the changing conditions of the waves, which is an aspect not captured by the other models.
We include GPT-4o’s [40] responses in Table 6 and Table 7 as a reference, given its role as the teacher model in our instruction data generation (Section 3). Qualitative analysis reveals that LLaVA-Pose generates nuanced explanations comparable to GPT-4o. This demonstrates that fine-tuning with keypoint-integrated data enables LLaVA-Pose to emulate the teacher model’s reasoning capabilities for human-centric understanding, while outperforming other VLMs.
5.2. Quantitative Evaluation
To systematically evaluate performance, drawing inspiration from prior work [5,42], we utilize GPT-4o [40] to measure response quality across different models. Following LLaVA’s methodology, we create triplets (image, ground-truth descriptions, and question) and have each model generate responses. A language-only GPT-4o then evaluates these responses on a scale of 1–10, considering helpfulness, relevance, accuracy, and detail level.
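A common implementation of this judging protocol, used in LLaVA's evaluation scripts, has the judge score both a reference answer and the candidate answer, then reports the candidate's aggregate score relative to the reference as a percentage. The sketch below assumes that protocol and assumes the judge's first output line contains the two scores; the helper names and exact output format are assumptions, not our released code.

```python
# Hypothetical sketch of aggregating judge outputs into a relative score,
# assuming each judge reply starts with "<reference score> <candidate score>"
# on its first line, followed by a free-form explanation.

def parse_scores(judge_reply):
    first_line = judge_reply.strip().splitlines()[0]
    ref, cand = [float(s) for s in first_line.replace(",", " ").split()[:2]]
    return ref, cand

def relative_score(judge_replies):
    """Candidate total as a percentage of the reference total."""
    refs, cands = zip(*(parse_scores(r) for r in judge_replies))
    return 100.0 * sum(cands) / sum(refs)
```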
To evaluate model performance on human-centric tasks, we construct the Extended Human Pose and Action Understanding Benchmark (E-HPAUB), an extended version of HPAUB proposed in our previous work [44]. In this extended version, we randomly select 90 images of human-centric scenes from the COCO Validation 2014 dataset [33] and generate three distinct types of questions for each image: conversation, detailed description, and complex reasoning, resulting in a total of 270 questions. The questions are crafted using the data generation approach outlined in Section 3. This extended benchmark enables a more comprehensive evaluation of the model’s ability to interpret and respond to diverse human-centric visual scenarios involving human poses and actions. By systematically varying the training datasets, we analyze the impact of different types of instruction-following data on model performance, as shown in Table 8. Results demonstrate substantial improvements compared to the original LLaVA-1.5-7B model, such as the following:
Table 8.
Ablation study on E-HPAUB with various training data. We prompt GPT-4o to evaluate and compare responses generated by our LLaVA-Pose model against those from the original LLaVA-1.5-7B model. GPT-4o is tasked with assessing and providing ratings based on the overall quality of the answers, accompanied by detailed explanations.
- Conversation: The fine-tuned model scores 58.9 vs. original’s 54.0;
- Detailed description: 69.1 vs. 47.2;
- Complex reasoning: 75.1 vs. 55.2;
- Overall performance: The model fine-tuned on all three data types achieves 69.4 vs. the original LLaVA-1.5-7B’s 52.1, representing a 33.2% increase.
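The 33.2% figure is the relative improvement of the overall score:

```python
# Relative improvement of the fine-tuned model's overall score over the
# original LLaVA-1.5-7B's score.
fine_tuned, original = 69.4, 52.1
relative_gain = 100 * (fine_tuned - original) / original  # ~33.2
```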
As shown in Table 8, all models fine-tuned with keypoint-integrated instruction-following data outperform the baseline LLaVA-1.5-7B model that uses only captions and bounding boxes, demonstrating the significant contribution of keypoint information.
Using the E-HPAUB benchmark, we further conduct a comparative quantitative evaluation of DeepSeek-VL2, , Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B, and our LLaVA-Pose model. The results, summarized in Table 9, demonstrate that LLaVA-Pose achieves the highest overall average score of 69.6, outperforming DeepSeek-VL2 (48.1), (56.7), Qwen2-VL-7B (65.1), InternVL3-8B (66.3), ChatPose (67.5, despite its 13B parameters), and Janus-Pro-7B (68.3). Notably, LLaVA-Pose surpasses the second-best model, Janus-Pro-7B, by an absolute margin of 1.3 points, corresponding to a relative improvement of 1.9%. These results further demonstrate that even when compared against state-of-the-art VLMs in the multimodal understanding domain, our LLaVA-Pose model maintains a clear performance advantage in understanding and reasoning about complex human poses and actions.
Table 9.
Comparative performance evaluated by GPT-4o on E-HPAUB.
To address potential bias toward GPT-style responses, we conduct an additional evaluation using Claude Sonnet 4 [45] as an independent evaluator (Table 10). This model-agnostic approach mitigates style preference risks by cross-validating results across distinct LLM families (OpenAI GPT vs. Anthropic Claude). Both evaluators rate responses on the same 1–10 scale using identical criteria: helpfulness, relevance, accuracy, and detail level. Results in Table 9 (GPT-4o) and Table 10 (Claude Sonnet 4) show strong inter-evaluator agreement: LLaVA-Pose consistently ranks first overall (69.6 vs. 65.5), outperforming second-place Janus-Pro-7B by +1.3 (GPT-4o) and +1.0 (Claude Sonnet 4) absolute points. Notably, the relative performance hierarchy remains stable across evaluators. This inter-evaluator consensus suggests our findings are robust to model-specific stylistic preferences.
Table 10.
Comparative performance evaluated by Claude Sonnet 4 on E-HPAUB.
6. Discussion
6.1. Key Findings
Fine-tuning the LLaVA-1.5-7B model on keypoint-integrated instruction-following data significantly enhances its ability to understand human poses and actions. This fine-tuning process leads to substantial improvements in human-centric tasks, enabling multimodal AI systems to operate more effectively in real-world environments. Notably, our LLaVA-Pose model achieves the highest overall performance on the E-HPAUB benchmark. These results demonstrate the effectiveness of integrating keypoint-level information to improve multimodal understanding of complex human activities. Certain models (e.g., Qwen2-VL-7B, InternVL3-8B, ChatPose, Janus-Pro-7B) achieve higher scores on specific sub-tasks; this may be partly due to differences in model capacity and pretraining scale: for instance, ChatPose has 13B parameters, InternVL3-8B has 8B parameters, and Qwen2-VL-7B and Janus-Pro-7B are trained on substantially larger and more diverse datasets. In contrast, LLaVA-Pose is built on a 7B model and fine-tuned on our targeted pose-aware dataset, prioritizing pose and action reasoning over exhaustive visual detail coverage.
6.2. Limitations and Future Work
While LLaVA-Pose excels at understanding static images, it currently lacks any temporal modeling capability, which is a limitation for tasks involving dynamic human actions and interactions. Many human activities unfold over time and require sequential context to be correctly interpreted. Static, frame-by-frame pose analysis often fails to capture crucial motion cues or to resolve occlusions without temporal continuity [46]. Indeed, effective spatio-temporal modeling is widely recognized as essential for robust action recognition in video data [47]. Without temporal context, a model may miss subtle transitions or ambiguous poses that are only distinguishable when viewed as part of a sequence. This limitation impacts applications such as video-based activity analysis and sports performance monitoring, where understanding the temporal evolution of poses is critical for accurate interpretation of complex behaviors. To address this gap, future work will explore extending our framework to sequential visual data by incorporating explicit temporal modeling techniques, such as recurrent neural networks (RNNs) or Transformer-based attention mechanisms. Sequence models like RNNs (and their gated variants) have long been used to capture the dynamics of human pose sequences, treating poses over time as sequential data for action recognition [48]. More recently, transformer-based video models have demonstrated that adding dedicated temporal encoders can significantly improve a model’s performance and temporal reasoning abilities in video understanding tasks [49]. Drawing on these advances, we plan to integrate a temporal module into LLaVA-Pose so that it can learn the continuity and changes of human poses over time. We anticipate that enabling temporal context will substantially enhance the model’s ability to reason about human behaviors in complex, time-varying environments, thereby overcoming the current limitation and broadening the applicability of our approach.
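One natural data-side step toward the temporal extension sketched above is to serialize per-frame keypoints into a sequence-aware instruction sample, mirroring how the single-image pipeline injects keypoints into prompts. The snippet below is a hedged illustration only: the keypoint names, coordinate format, and prompt wording are assumptions, not the paper's actual data-generation format.

```python
# Sketch: turning a short sequence of per-frame 2D keypoints into a
# temporal instruction prompt. Names and wording are illustrative.
def format_frame(t, keypoints):
    """Render one frame's keypoints as 'name: (x, y)' pairs."""
    parts = ", ".join(f"{name}: ({x}, {y})" for name, (x, y) in keypoints.items())
    return f"Frame {t}: {parts}"

def build_temporal_prompt(frames):
    """Assemble a header, per-frame keypoint lines, and a question."""
    lines = [format_frame(t, kp) for t, kp in enumerate(frames)]
    header = ("The following are 2D keypoint coordinates of one person "
              "across consecutive video frames.")
    question = "Describe how the person's pose changes over time."
    return "\n".join([header, *lines, question])

# Three frames of a hypothetical arm-raising motion.
frames = [
    {"right_wrist": (210, 340), "right_elbow": (200, 300)},
    {"right_wrist": (215, 300), "right_elbow": (202, 295)},
    {"right_wrist": (220, 260), "right_elbow": (205, 290)},
]
prompt = build_temporal_prompt(frames)
print(prompt)
```

Pairing such prompts with sequence-aware responses would let the same teacher-model pipeline used for static images produce training data for the temporal module discussed above.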
Beyond temporal modeling, there are two further limitations to acknowledge. First, inference efficiency has not been systematically analyzed. Although our motivation is to distill GPT-4o’s capabilities into a smaller, open-source model suitable for local deployment, future work should explicitly investigate runtime latency and efficiency to validate the practical advantages in edge-device scenarios. Second, our ablation studies have primarily focused on the data generation process. Exploring architectural variations, such as lightweight pose-aware attention modules or multi-branch connectors, may further improve performance and will be addressed in future work.
7. Conclusions
In this paper, we introduced a method for generating vision–language instruction-following data by integrating human keypoints alongside traditional caption and bounding box information. This keypoint-integrated approach significantly enhances a multimodal model’s ability to understand and reason about human poses and actions. Unlike the original LLaVA method, which relied primarily on coarse object-level annotations, our method leverages fine-grained spatial and structural cues from keypoints to improve interpretability in human-centric visual scenarios. Through rigorous experimentation and evaluation, our fine-tuned model, LLaVA-Pose, demonstrated superior performance across various tasks related to human pose and action understanding. These findings underscore the potential of integrating fine-grained human keypoint data to enhance the capabilities of multimodal AI systems.
Author Contributions
Conceptualization, D.Z. and H.S.; methodology, D.Z.; software, D.Z. and W.A.; validation, D.Z.; formal analysis, D.Z.; investigation, D.Z.; resources, D.Z.; data curation, D.Z.; writing—original draft preparation, D.Z.; writing—review and editing, D.Z., T.H. and H.S.; visualization, D.Z.; supervision, H.S.; project administration, D.Z. and H.S.; funding acquisition, D.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by JST SPRING (grant number JPMJSP2131).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The dataset generated and analyzed during this study is available within the article. In addition, the source code is available at: https://github.com/Ody-trek/LLaVA-Pose.
Conflicts of Interest
Author Wangpeng An was employed by the company TikTok Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| VLMs | vision–language models |
| E-HPAUB | extended human pose and action understanding benchmark |
| AI | artificial intelligence |
| LLaVA | large language and vision assistant |
| LLMs | large language models |
| SMPL | skinned multi-person linear |
| MANet | micro-action network |
| RNNs | recurrent neural networks |
References
- Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.D.; et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 958–979. [Google Scholar]
- Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
- Wu, J.; Gan, W.; Chen, Z.; Wan, S.; Yu, P.S. Multimodal large language models: A survey. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2247–2256. [Google Scholar]
- Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
- Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv 2023, arXiv:2306.14824. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
- Zhang, D.; Hussain, T.; An, W.; Shouno, H. PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment. arXiv 2025, arXiv:2507.09139. [Google Scholar]
- Kyrollos, D.G.; Fuller, A.; Greenwood, K.; Harrold, J.; Green, J.R. Under the cover infant pose estimation using multimodal data. IEEE Trans. Instrum. Meas. 2023, 72, 5007212. [Google Scholar] [CrossRef]
- Zhou, H.; Wang, D.; Yu, Y.; Zhang, Z. Research progress of human–computer interaction technology based on gesture recognition. Electronics 2023, 12, 2805. [Google Scholar] [CrossRef]
- Wang, T.; Zheng, P.; Li, S.; Wang, L. Multimodal human–robot interaction for human-centric smart manufacturing: A survey. Adv. Intell. Syst. 2024, 6, 2300359. [Google Scholar] [CrossRef]
- Yildirim, N.; Richardson, H.; Wetscherek, M.T.; Bajwa, J.; Jacob, J.; Pinnock, M.A.; Harris, S.; Coelho De Castro, D.; Bannur, S.; Hyland, S.; et al. Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 11–16 May 2024. CHI ’24. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Feng, Y.; Lin, J.; Dwivedi, S.K.; Sun, Y.; Patel, P.; Black, M.J. Chatpose: Chatting about 3d human pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2093–2103. [Google Scholar]
- Feng, D.; Guo, P.; Peng, E.; Zhu, M.; Yu, W.; Wang, P. PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 2951–2959. [Google Scholar]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. ACM Trans. Graph. 2015, 34, 248. [Google Scholar] [CrossRef]
- Wu, Z.; Chen, X.; Pan, Z.; Liu, X.; Liu, W.; Dai, D.; Gao, H.; Ma, Y.; Wu, C.; Wang, B.; et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv 2024, arXiv:2412.10302. [Google Scholar]
- Wu, P.; Xie, S. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13084–13094. [Google Scholar]
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
- Zhu, J.; Wang, W.; Chen, Z.; Liu, Z.; Ye, S.; Gu, L.; Tian, H.; Duan, Y.; Su, W.; Shao, J.; et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv 2025, arXiv:2504.10479. [Google Scholar]
- Chen, X.; Wu, Z.; Liu, X.; Pan, Z.; Liu, W.; Xie, Z.; Yu, X.; Ruan, C. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv 2025, arXiv:2501.17811. [Google Scholar]
- Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ‘23, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ‘22, Red Hook, NY, USA, 28 November–9 December 2022. [Google Scholar]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26296–26306. [Google Scholar]
- Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-based Human Pose Estimation: A Survey. ACM Comput. Surv. 2023, 56, 1–37. [Google Scholar] [CrossRef]
- Surek, G.A.S.; Seman, L.O.; Stefenon, S.F.; Mariani, V.C.; Coelho, L.d.S. Video-based human activity recognition using deep learning approaches. Sensors 2023, 23, 6384. [Google Scholar] [CrossRef]
- Morshed, M.G.; Sultana, T.; Alam, A.; Lee, Y.K. Human action recognition: A taxonomy-based survey, updates, and opportunities. Sensors 2023, 23, 2182. [Google Scholar] [CrossRef]
- Le, V.H. Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset. Multimed. Tools Appl. 2023, 82, 20771–20818. [Google Scholar] [CrossRef]
- Delmas, G.; Weinzaepfel, P.; Moreno-Noguer, F.; Rogez, G. Posefix: Correcting 3D human poses with natural language. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 15018–15028. [Google Scholar]
- Delmas, G.; Weinzaepfel, P.; Lucas, T.; Moreno-Noguer, F.; Rogez, G. Posescript: 3d human poses from natural language. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 346–362. [Google Scholar]
- Delmas, G.; Weinzaepfel, P.; Lucas, T.; Moreno-Noguer, F.; Rogez, G. Posescript: Linking 3d human poses and natural language. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 5146–5159. [Google Scholar] [CrossRef] [PubMed]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Guo, D.; Li, K.; Hu, B.; Zhang, Y.; Wang, M. Benchmarking Micro-Action Recognition: Dataset, Methods, and Applications. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6238–6252. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Gu, J.; Li, K.; Wang, F.; Wei, Y.; Wu, Z.; Fan, H.; Wang, M. Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. arXiv 2025, arXiv:2507.21977. [Google Scholar]
- Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. LAION-5B: An open large-scale dataset for training next generation image–text models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ‘22, Red Hook, NY, USA, 28 November–9 December 2022. [Google Scholar]
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3558–3568. [Google Scholar]
- Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S. COYO-700M: Image-Text Pair Dataset. 2022. Available online: https://github.com/kakaobrain/coyo-dataset (accessed on 18 July 2025).
- OpenAI. Hello gpt-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 18 July 2025).
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023. Available online: https://lmsys.org/blog/2023-03-30-vicuna/ (accessed on 18 July 2025).
- Rasley, J.; Rajbhandari, S.; Ruwase, O.; He, Y. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 3505–3506. [Google Scholar]
- Zhang, D.; An, W.; Shouno, H. Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models. In Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS), Tokyo, Japan, 28–30 July 2025. [Google Scholar]
- Anthropic. Introducing Claude 4. 2025. Available online: https://www.anthropic.com/news/claude-4 (accessed on 13 August 2025).
- Zhang, Z.; Liu, W.; Zheng, Y.; Du, L.; Sun, L. Learning spatio-temporal context for basketball action pose estimation with a multi-stream network. Sci. Rep. 2025, 15, 29173. [Google Scholar] [CrossRef]
- Gu, J.; Yi, Y.; Li, Q. Motion sensitive network for action recognition in control and decision-making of autonomous systems. Front. Neurosci. 2024, 18, 1370024. [Google Scholar] [CrossRef]
- Chen, D.; Chen, M.; Wu, P.; Wu, M.; Zhang, T.; Li, C. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Sci. Rep. 2025, 15, 4982. [Google Scholar] [CrossRef]
- Li, Y.; Liu, Z.; Kong, Y.; Li, G.; Zhang, J.; Bian, C.; Liu, F.; Yao, L.; Sun, Z. Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding. arXiv 2025, arXiv:2501.16786. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).