1. Introduction
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks, from text generation [1,2,3,4,5,6,7] to solving difficult Olympiad-level math questions with reasoning [8,9]. Recent advancements have extended LLMs toward computer vision, yielding vision–language models (VLMs) and multimodal large language models (MLLMs) that learn simultaneously from several data sources rather than from text alone. VLMs are trained on large and diverse datasets and exhibit state-of-the-art performance on a variety of vision–text tasks [10,11,12,13,14,15,16], such as classification [17,18,19,20] and segmentation [21,22,23].
The growing reliance on automated systems in fields such as quality control, surveillance, and automated inspection has increased the need for robust classification models capable of precise object recognition. VLMs offer an intriguing alternative to traditional computer vision models in these applications because of their potential for zero-shot and few-shot learning, which allows them to adapt to previously unseen situations. However, VLMs face unique challenges in visual recognition, particularly when distinguishing visually similar objects or processing complex scenes with multiple objects and classes (Figure 1). One critical yet under-explored aspect of these challenges is LLMs’ ability to differentiate real objects from artificial ones and to classify human-like figures across different categories. In this study, we investigate these limitations by evaluating the performance of various VLMs on these specific classification tasks, highlighting areas that require improvement.
We organized the evaluation of language models around two primary classification tasks: real vs. fake object classification and human vs. human-like object classification (Figure 2). The first evaluation challenges the models to distinguish real objects from their artificial counterparts across various categories, such as fruits, vegetables, and hands. The second evaluation requires models to differentiate human and human-like figures (mannequins, banners), which adds complexity in scenarios involving multi-class and multi-object settings. Through these tasks, we seek to assess state-of-the-art vision–language models’ strengths and weaknesses in handling visually complex information and to identify the architectural and training modifications necessary to improve their classification performance.
By evaluating a range of VLMs with different architectures, sizes, and training configurations, we aim to provide a comprehensive analysis of their current performance to reveal both their capabilities and limitations in visual classification tasks. Our findings emphasize the necessity for advancements in multimodal integration, prompt engineering, and adaptive learning, which could enable LLMs to fulfill a broader range of practical applications that demand precise and reliable object classification. The primary contributions of this study are as follows:
We provide an in-depth empirical assessment of a broad range of vision–language models (VLMs) across two challenging tasks: differentiating real objects from artificial replicas and distinguishing real humans from human-like figures (e.g., mannequins, banners). Our evaluation is structured into step-by-step tasks that separately measure basic object recognition, authenticity verification, and spatial reasoning. This multi-task framework reveals how even high-performing models (e.g., GPT-4o) achieve near-perfect accuracy on isolated object identification (up to 98.6%) but suffer drastic performance drops (often near-zero) under more complex multi-object or nuanced authenticity conditions.
We introduce a novel evaluation approach that leverages tailored question prompts and a custom dataset. The dataset includes everyday objects, their artificial replicas, and visualizations of human versus human-like figures, enabling us to disentangle different aspects of visual classification. By categorizing performance into distinct evaluation steps (e.g., Step 1 for identification and Step 2 for authenticity or positional reasoning), our framework provides a reproducible methodology to benchmark and analyze the limitations of current LLM-based vision systems under consistent conditions.
Our analysis not only quantifies current system limitations but also offers critical insights into the challenges of multimodal reasoning in LLMs. We highlight that while basic recognition capabilities are robust, issues such as compromised spatial reasoning and difficulty in integrating subtle authenticity cues persist, particularly in scenes with multiple objects or complex visual arrangements. These findings underscore the need for enhanced multimodal training, refined prompt engineering, and architectural innovations that improve spatial and contextual understanding.
In the following sections, we present the detailed evaluation scheme employed for both classification tasks, the experimental setup and evaluation metrics, and a comprehensive discussion of our findings, ultimately underscoring the potential and limitations of LLMs in complex visual environments.
2. Related Work
2.1. Vision–Language Models
Recent progress in vision–language models (VLMs) (Table 1) has expanded their utility beyond pure text generation toward unified visual–linguistic tasks [10,17]. Building on work such as CLIP, which learns visual representations from natural language supervision [20], researchers have proposed advanced multimodal transformers (e.g., ViLBERT, VisualBERT) that are jointly pre-trained on large-scale image–text datasets for downstream tasks such as image captioning, segmentation, and object detection [13,19,24,25]. While these methods demonstrate impressive zero-shot or open-vocabulary recognition capabilities, they often encounter difficulties in effectively handling multiple objects in cluttered scenes or in making fine-grained distinctions, particularly when the authenticity of an object (real vs. fake) is in question [6]. Recent large-scale vision–language models, including GPT-4-like systems, show promise in bridging some of these gaps, yet continue to struggle with tasks that require higher-level reasoning about object context and attributes [26].
To address these challenges, recently published models such as Claude 3 [27], Emu3 [28], NVLM [29], Qwen2-VL [16], and Pixtral [30] have refined multimodal alignment to better handle diverse real-world data. Similarly, systems like LLaMA 3.2 Vision [3], Baichuan Ocean Mini [31], TransFusion [32], and DeepSeekVL2 [33] incorporate richer training resources—including video, audio, and wiki-style corpora—to enhance scene understanding. Other notable approaches, including BLIP-3 [34], OLMo-2 [35], and Qwen2.5-VL [7], focus on scaling to larger datasets or increasing parameter sizes while preserving fine-grained recognition capabilities. Despite these advances, reliably verifying authenticity and distinguishing closely resembling objects remain significant hurdles. In this evaluation study, we systematically examine how both established and newly released VLMs handle near-identical real versus fake items and human-like figures, with an emphasis on nuanced visual reasoning.
2.2. Object Classification and Multi-Object Recognition
Early convolutional networks such as ResNet and EfficientNet achieved high accuracy on single-object recognition benchmarks (e.g., ImageNet), providing a baseline for more advanced architectures [36,37]. However, the emergence of open-vocabulary and zero-shot classification approaches has shifted attention toward models that can adapt to novel object categories without additional training data. For example, a two-step strategy combining an LLM with a VLM can refine one-class classification by prompting the LLM to propose visually confusing “negative” categories, thus enhancing the decision boundary [38]. Beyond single-object tasks, multi-object recognition intensifies complexity—models must classify each object independently while resolving occlusions or overlapping visual cues [39]. Efforts to address these challenges include large-scale open-vocabulary detection systems such as YOLO-World, which integrates language embeddings for real-time, zero-shot object detection [40], and the recognize anything model (RAM), designed for generalized image tagging in cluttered scenes [41]. Yet, correctly interpreting relationships among multiple objects and maintaining accuracy across all items remain open research problems [39].
2.3. Human Classification and Human-like Figures
Distinguishing real humans from human-like objects—such as mannequins, statues, or life-sized banners—remains a significant challenge in computer vision. Traditional human detectors often rely on silhouette or facial cues, which can be misleading if an object mimics human proportions or features. For example, Karthika and Chandran [42] reduce false positives in pedestrian detection by adding images of mannequins and statues during training, indicating that augmenting datasets with human-like decoys can significantly improve robustness. Going a step further, Ju et al. [43] present the Human-Art dataset, comprising both natural and “artificial” (sculptures, drawings) depictions of humans. Their findings show that standard models trained only on real humans suffer marked drops in performance when tested on artificial domains, underscoring the domain gap that arises between real and synthetic or artistic imagery.
A closely related problem appears in face anti-spoofing and liveness detection, where systems must differentiate genuine biometrics (e.g., a live face or finger) from fake representations (e.g., photos, masks, silicone molds). Yu et al. [44] offer a comprehensive survey of deep-learning-based methods for face anti-spoofing, highlighting how texture, depth, and motion cues can thwart presentation attacks. Classical work by Kollreider et al. [45] proposed non-intrusive techniques—analyzing facial images directly for signs of authenticity—while more recently, competitions have focused on advanced spoofing scenarios. In particular, the LivDet series evaluates fingerprint-liveness methods against increasingly realistic forgeries. Orrù et al. [46] illustrate this evolving arms race in the 2019 LivDet in Action competition, demonstrating how constant innovation is necessary to detect ever more elaborate fake fingerprints. Although these advances apply primarily to face or fingerprint biometrics, the underlying principle—verifying the authenticity of the object—also informs broader tasks like real-human detection, since mannequins and statues can be viewed as a form of “presentation attack” in the domain of person recognition.
Multimodal approaches have also emerged to improve human recognition. Zhu et al. [47] integrated full-body shape and pose information for person identification in surveillance settings, while Nagrani et al. [48] introduced an audio–visual matching framework that correlates a speaker’s voice with their facial appearance. These multimodal systems not only increase robustness in challenging scenarios but also offer promising strategies to mitigate the misclassification of static human-like objects.
3. Methodology
In this study, we evaluate the performance of large-scale vision–language models (VLMs) on two main classification tasks: identifying objects and distinguishing between humans and human-like figures. Each task tests the models’ visual reasoning capabilities in object detection and classification, as well as their spatial perception. In addition, we evaluate the performance of state-of-the-art open-vocabulary models on the same tasks.
3.1. Dataset
In this study, we created a custom dataset to evaluate whether vision–language models can distinguish real objects from fake replicas, and differentiate humans from mannequin/banner figures. The dataset contains 82 images of everyday items (e.g., fruits, hands) photographed outdoors, paired with visually similar toys or replicas, as well as 18 images depicting real individuals indoors alongside mannequins or banners. All images were manually labeled to indicate their category or authenticity and the presence of human-like figures.
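To make the annotation scheme concrete, the sketch below shows one possible way such labels could be organized and loaded; the manifest format, field names, and loader are illustrative assumptions rather than the actual files used in this study.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Sample:
    """One annotated image (hypothetical schema mirroring the labels described above)."""
    image_path: Path
    task: str             # "object" (82 images) or "human" (18 images)
    categories: list      # e.g., ["orange", "orange"] or ["human", "mannequin"]
    authenticity: list    # e.g., ["real", "fake"]; empty for the human task


def load_manifest(path: str) -> list[Sample]:
    # Assumes a JSON list of records with the fields used in Sample (illustrative only).
    records = json.loads(Path(path).read_text())
    return [
        Sample(Path(r["image_path"]), r["task"], r["categories"], r["authenticity"])
        for r in records
    ]
```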
3.2. Object Classification
In the object classification task, VLMs were evaluated in terms of their ability to recognize and categorize objects, as well as to determine whether those objects are real or fake. The dataset used for this task consisted of everyday objects that can be easily found in our surroundings and also included fake replicas corresponding to those objects. To assess model performance not only in simple scenarios but also in more complex ones, the dataset was organized into two types: single-object scenarios (featuring only one object per class) and multiple-object scenarios (featuring several objects). Examples of images from this dataset are presented in Figure 3.
To evaluate the impact of prompts used as inputs to the VLMs for object classification, detection, and recognition, we created a series of step-by-step questions. We then categorized these questions into different levels according to how much information they provide about the desired output. For Questions 1 and 2, we integrated all of the data into a single set of queries. Meanwhile, for Questions 3 and 4, we divided the data into single-object and multiple-object scenarios to evaluate the model’s performance in more detail. The evaluation prompts are given in Figure 4.
The evaluated models were categorized by parameter size (small, medium, or large), and their classification accuracies were calculated for each question. Accuracy was determined in two steps. If the model passed the first step, it proceeded to the second; otherwise, it did not advance. The first step evaluated whether the model correctly identified the object, while the second step evaluated whether the model correctly determined if the object was real or fake. This procedure was applied for both single-object and multiple-object scenarios to evaluate the reasoning capabilities of the models.
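As a concrete illustration of this two-step scoring rule, the sketch below computes Step 1 and Step 2 accuracy with the gating described above; the data structures and function name are hypothetical and serve only to make the procedure explicit.

```python
def score_two_step(predictions, ground_truth):
    """Hierarchical accuracy: Step 2 is credited only when Step 1 was passed.

    predictions / ground_truth: lists of dicts (hypothetical structure) with keys
      "objects"      -> set of object names recognized in the image
      "authenticity" -> mapping object name -> "real" or "fake"
    """
    step1_hits, step2_hits, total = 0, 0, len(ground_truth)
    for pred, gt in zip(predictions, ground_truth):
        step1_ok = pred["objects"] == gt["objects"]        # correct identification
        if not step1_ok:
            continue                                        # failed Step 1: no Step 2 credit
        step1_hits += 1
        if pred["authenticity"] == gt["authenticity"]:      # correct real/fake labels
            step2_hits += 1
    # e.g., 3 images, Step 1 passed on 2, Step 2 on 1 -> (2/3, 1/3)
    return step1_hits / total, step2_hits / total
```

Under this rule, Step 2 accuracy is bounded above by Step 1 accuracy, which is consistent with the Step 2 columns never exceeding the corresponding Step 1 columns in Tables 2–4.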
Figure 4.
Experiment prompts used to evaluate VLMs on object-related tasks.
3.3. Human Classification
The second task, referred to as human classification, assessed the model’s ability to distinguish between actual humans and human-like figures such as mannequins and banners. It also evaluated spatial reasoning skills, including recognizing the order in which objects are arranged. The dataset for this task was created by combining humans, mannequins, and banners in various scenarios. Examples of images from this dataset are shown in Figure 5. We also created a series of step-by-step questions to evaluate how prompts entered into the VLM affect human classification. The evaluation prompts are given in Figure 6. The evaluation criteria were as follows:
Step 1: Did the model identify what the objects are?
Step 2: Did the model describe the position of the objects (e.g., their order)?
We evaluated the results by following the same procedure used for object classification but with different criteria for each evaluation step. The first step assessed whether the model correctly distinguished among humans, mannequins, and banners. The second step assessed whether the model accurately described the position of each object (for example, their order). Following these steps, we conducted the evaluation for human classification.
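A corresponding sketch for the human task is given below, again with hypothetical input structures; it assumes that object positions are reduced to a left-to-right sequence of labels, and it applies the same gating as the object task.

```python
def score_human_task(pred_labels, pred_order, gt_labels, gt_order):
    """Step 1: the set of labels (human / mannequin / banner) must match.
    Step 2: the left-to-right order must also match (checked only if Step 1 passes).

    Inputs are illustrative lists such as ["human", "mannequin", "banner"].
    Returns (step1_passed, step2_passed) for a single image.
    """
    step1_ok = sorted(pred_labels) == sorted(gt_labels)
    step2_ok = step1_ok and pred_order == gt_order
    return step1_ok, step2_ok
```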
Figure 6.
Experiment prompts used to evaluate VLMs on human-related tasks.
4. Experiments
We evaluated models encompassing diverse architectures, training methodologies, and parameter scales. The collection of evaluated models represents various approaches to vision–language integration, ranging from tightly integrated architectures to modular, two-stage designs. Integrated models such as Qwen2-VL (7B and 72B) and Llama-3.2 (11B and 90B) merge vision and language processing within unified Transformer layers. Specifically, Qwen2-VL employs dynamic visual tokenization and multimodal rotary position embeddings, enabling flexible handling of varying image sizes and extended video inputs. Conversely, Llama-3.2 models incorporate a dedicated two-stage vision encoder with interleaved cross-attention layers, infusing multi-level visual features directly into the text generation process. In comparison, Phi-3.5-3.8B, though primarily text-oriented, distinguishes itself through an extensive context window and robust multilingual capabilities, underscoring how architectural innovations can also prioritize training efficiency and context management.
On the modular spectrum, models such as the LLaVA series (llava-v1.6-13B, llava-v1.6-34B, and llava-next-72B), the InternVL2.5 series (26B, 38B, and 78B), Ovis1.6-Gemma2-9B, and MiniCPM-V-2.6-8B emphasize separation between frozen vision encoders and language models, typically connected via a learned projection layer. For instance, LLaVA models maintain simplicity and computational efficiency by combining a pre-trained vision encoder with a frozen large language model, scaling the underlying LLM to enhance multimodal reasoning. InternVL2.5 leverages progressive scaling and modular design to optimize vision-text fusion for complex tasks, whereas Ovis1.6 introduces innovations by mapping continuous visual outputs into discrete “visual word” embeddings for improved integration. MiniCPM-V-2.6-8B expands this modular approach into real-time, multi-sensory processing by integrating vision, audio, and text within a streaming architecture. These models illustrate how diverse design decisions—from unified to modular architectures—affect trade-offs in flexibility, scalability, and real-time performance in multimodal AI systems.
We categorized the evaluated models into four groups based on accessibility and parameter scale: the "closed" category included models accessible exclusively via a paid API, while publicly available models were classified as small-scale (under 10B parameters), medium-scale (10B to 40B parameters), and large-scale (above 40B parameters). For the closed category, we utilized the GPT-4o model version, updated as of 20 November 2024. Computational resources varied by model scale: some small- and medium-sized models were evaluated using two NVIDIA RTX 4090 GPUs, whereas larger models and a subset of medium-sized models were evaluated using eight NVIDIA RTX 3090 GPUs (Nvidia, Santa Clara, CA, USA).
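For the publicly available models, a typical query loop resembles the following sketch. It is illustrative only: it assumes the Hugging Face LLaVA-1.6 interface and a plausible checkpoint name, and the prompt template, processor class, and preprocessing differ for other model families such as Qwen2-VL or InternVL2.5.

```python
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

# Assumed checkpoint for the evaluated llava-v1.6-13B model (illustrative).
MODEL_ID = "llava-hf/llava-v1.6-vicuna-13b-hf"

processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)


def ask(image_path: str, question: str) -> str:
    """Send one image and one evaluation prompt, return the decoded answer."""
    image = Image.open(image_path).convert("RGB")
    # Vicuna-style chat template with an <image> placeholder, as used by LLaVA-1.6.
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Hypothetical file name; the question mirrors the style of the prompts in Figure 4.
answer = ask("orange_01.jpg", "What is this object? Is it real or fake?")
```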
Accuracy was used as the evaluation metric for both tasks. We manually reviewed each model’s responses to verify the correct object classification. This manual verification was crucial, as differing reasoning processes and text-output formats across models resulted in varied response styles.
4.1. Performance Analysis
All models generally showed increased accuracy as the question level advanced. Furthermore, GPT-4o delivered the best performance compared to the other models for most of the questions.
Object Classification: In Table 2 and Table 3, we present the performance of all evaluated models on all object classification questions. From these tables, it is evident that most models perform well in Step 1—focusing on identifying the object in single-object scenarios—with accuracy often exceeding 90%. For instance, GPT-4o achieves 98.59% on Q1 Step 1, demonstrating strong fundamental recognition capabilities. However, once the task shifts to Step 2, which asks whether the identified object is real or fake, performance drops considerably, especially in multi-object contexts where the complexity is higher. Even large-scale models such as GPT-4o can decline to near-zero accuracy on certain query sets (e.g., Q2 Step 2). Meanwhile, smaller models, although they maintain moderate performance for single-object classification, also show a steep reduction in accuracy when presented with multiple objects. These observations underscore the challenges current vision–language models face in handling nuanced reasoning and authenticity verification in more complex scenes.
Human Classification: In Table 4, we present the results for the human classification task, where Step 1 asks models to identify the objects in an image, and Step 2 requires them to describe each object’s position. While most models perform reasonably well in Step 1—particularly on Q1, with accuracies often above 50%—their performance frequently declines in Step 2, where they must specify the arrangement or order of objects. However, the best-performing model, GPT-4o, maintains high accuracy across both steps, achieving 94.44% on Q1 Step 1 and 100% on Q2 Step 2, indicating strong spatial reasoning capabilities. By contrast, many smaller or mid-sized models, such as llama3-llava-next-8B and Ovis1.6-Gemma2-9B, show marked drops in Step 2 accuracy even when they can correctly identify the objects in Step 1. This gap highlights the continued challenge for current vision–language models in going beyond basic classification to deliver coherent scene descriptions that accurately capture positional details.
4.2. Evaluation of Open-Vocabulary Vision Foundation Models
Despite the advancements in open-vocabulary vision models such as YOLO-World [39] and the recognize anything model (RAM) [40], their ability to accurately classify objects and distinguish between real and fake entities remains limited. In many cases, these models struggle to identify whether an object is real or fake. Additionally, they face challenges in basic object recognition, especially in complex and multi-class scenarios. This section evaluates the performance of open-vocabulary vision models on the challenging tasks of real-or-fake classification and object classification.
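For reference, the sketch below shows how an open-vocabulary detector can be prompted with authenticity-aware class names through the Ultralytics YOLO-World interface; the checkpoint name, image file, and class prompts are illustrative assumptions rather than the exact configuration used in our experiments.

```python
from ultralytics import YOLOWorld

# Open-vocabulary detection with authenticity-aware class prompts (illustrative).
model = YOLOWorld("yolov8l-worldv2.pt")
model.set_classes(["real orange", "fake orange", "real hand", "fake hand"])

# Hypothetical image of mixed real and fake oranges.
results = model.predict("mixed_oranges.jpg", conf=0.25)
for box in results[0].boxes:
    label = results[0].names[int(box.cls)]
    print(label, float(box.conf))
```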
Single-object Scenarios: In tasks involving isolated objects, such as distinguishing between real and fake carrots, hands, oranges, and pimentos (peppers), the models entirely failed to distinguish between the two categories (Figure 7). Moreover, in some cases, the models were unable to identify the type of object. Despite the presence of distinguishable features, such as shape or color, the models often misclassified fake objects (Figure 8).
Multi-object Scenarios: The models exhibited a complete inability to classify objects correctly in scenes containing multiple objects. For example, in a mixed arrangement of real and fake oranges, the models consistently failed to differentiate between individual objects, leading to pervasive misclassifications across the scene.
5. Discussion
5.1. Object Classification
Figure 9 illustrates how the model responses are categorized for a question (Q1), serving as an example of the overall approach used for all queries. Rather than displaying every question and answer in full, we group the responses into distinct stages—“Incorrect output”, “Correct for step 1”, and “Correct for steps 1 and 2”—to highlight how the models refine their answers when given additional instructions or follow-up prompts. While this figure focuses on Q1, the same categorization method is applied throughout our study to allow for a consistent comparison of how each model’s reasoning evolves across different types of questions.
In Q2, the model must classify each item strictly as “carrot”, “orange”, “bell pepper”, or “hand”, with no explicit call to distinguish real from fake. Consequently, it often responds with statements like “This object is a carrot”, omitting any mention of authenticity. As a result, if the model fails to label a fake item as “fake”, we count it as having classified that object as “real”. Because many items are, in fact, artificial, the final accuracy averages around 50%. Thus, while the model’s basic object identification (e.g., recognizing a carrot shape) may look accurate, it does not address the real-versus-fake question—highlighting the challenge of incorporating authenticity judgments into what otherwise appears to be straightforward classification.
A key reason for the lower accuracy on Q3 compared to Q1 is the more complex and context-heavy reasoning it requires. While Q1 focuses on straightforward identification (“What is this object?”), Q3 compels the model to infer additional attributes, synthesize multiple clues, and explicitly determine whether an object is fake. Consequently, as shown in Table 3, GPT-4o’s accuracy on Q3 Step 1 (Single Object) dips to about 95.77%, down from 98.59% on Q1 Step 1—a modest drop that becomes a 10–15% decline for many smaller models. This underlines how tasks involving subtler distinctions or extra context (e.g., identifying fake details) present a substantially higher cognitive load for vision–language models. Furthermore, Q3 explicitly calls for fake objects to be labeled as “fake”, so models that merely note an object type (e.g., “carrot”) without acknowledging its authenticity risk being scored as incorrect.
Turning to Q4, Table 3 reveals that single-object performance (Q4-1) is generally higher than in previous tasks, likely because Q4 provides clearer textual cues for both classification and authenticity. In contrast, Q2 and Q3 often involve more nuanced features or multiple attributes, compounding the difficulty of deciding “real vs. fake”. By focusing on a single, distinctly defined object, Q4-1 reduces ambiguity, and even smaller models achieve their strongest results.
Nevertheless, when multiple objects appear in one image, performance universally declines. An example of the results is shown in Figure 10. As seen under the “Multi-Object” columns in Table 2 and Table 3, some models that achieve near-perfect scores in single-object tasks drop sharply when forced to identify and assess several items simultaneously. Importantly, the second stage of our evaluation—Step 2—applies additional correctness checks, further magnifying any errors. Even GPT-4o—consistently strong in single-object recognition—experiences a noticeable dip. In summary, these findings emphasize how handling multiple objects, each with its own attributes and potential authenticity considerations, pushes the models’ capacities beyond simpler classification demands, revealing ongoing limitations in their current reasoning capabilities.
5.2. Human Classification
Table 4 illustrates how each model’s responses are organized for the human classification tasks, mirroring the approach used in the object classification experiments. Rather than listing every question and answer, we show how the models refine their outputs when given additional instructions. Although Q1 more directly asks whether the subject is a human, mannequin, or banner, and Q2 places additional emphasis on arrangement or positioning, both questions include single-object and multi-object variants. This design enables us to evaluate each model’s capabilities under different levels of complexity. An example of the results is shown in Figure 11.
A recurring challenge involves distinguishing humans from mannequins, especially for models that rely heavily on color or shape cues. Realistic mannequins sometimes fool smaller or less fine-tuned systems into labeling them as actual humans, whereas GPT-4o and other large-scale architectures more reliably detect subtle giveaways like stiff poses or unnatural textures. Banners featuring images of people further blur the line: if the model merely perceives a human form, it may incorrectly assign the “human” label instead of recognizing a two-dimensional, printed surface.
When multiple objects appear, many models that successfully classify a single figure show a clear drop in accuracy, as they must now identify multiple subjects and, in some cases, describe their relative positions. While larger models often maintain high overall scores, reaching 90–100% in certain scenarios, smaller and mid-scale models consistently see double-digit declines under these more visually crowded conditions. As in the object classification tasks, this multi-object complexity reveals ongoing limitations in current vision–language systems, reinforcing that tracking multiple categories and managing authenticity or pose cues remain formidable hurdles for modern AI.
5.3. Challenges and Limitations in Both Classification Scenarios
Across both tasks, the performance of LLMs revealed several key limitations related to processing and classifying complex scenes. In particular, our analysis identified the following challenges in both classification evaluations:
Limited Spatial Recognition: In the real vs. fake classification task, models struggled to interpret spatial arrangements. This limitation was evident as the systems frequently misclassified frames containing multiple similar objects, which indicates that the models do not reliably differentiate spatially close or overlapping items. This poses significant implications for practical applications of LLMs such as quality inspection and monitoring systems, where precise identification and differentiation of closely positioned or overlapping objects is critical to operational accuracy.
Challenges in Multi-Class Discrimination: In the person, mannequin, and banner classification task, the models’ difficulties in separating similar object types on complex visual backgrounds underscored a deficiency in multi-class recognition. This limitation is evident in real-world environments like security surveillance, where accurate identification amidst visually complex backgrounds directly impacts decision-making and operational efficiency.
High Incidence of Missing Values: Both tasks demonstrated significant rates of missing data points. Such gaps imply that the models occasionally fail to register crucial details within an image, which further affects overall performance in accurately capturing scene dynamics. For practical systems, particularly in quality assurance and monitoring applications, missing critical visual cues can lead to significant oversights, potentially compromising reliability and system integrity. Techniques such as retrieval-augmented generation (RAG) may offer a promising approach to mitigating this issue by enhancing model comprehension of fine-grained image contexts.
Prevalence of Misclassifications: The occurrence of numerous misclassifications across tasks reinforces the notion that the models struggle with integrating contextual and spatial information effectively. These errors highlight underlying issues in neural architectures when confronted with varying object poses, densities, and subtle visual distinctions. In real-world scenarios, particularly in precision-driven sectors such as industrial quality control, frequent misclassifications could substantially undermine productivity, accuracy, and safety.
These limitations suggest that further architectural adjustments are necessary. Integrating multimodal training methods—by combining both textual and visual data—could enhance spatial processing and improve class differentiation. Such improvements would broaden the applicability of LLMs to fields like inventory management, quality control, and surveillance, where accurate and precise categorization is critical.
6. Conclusions and Future Research Directions
In this study, we provide a comprehensive evaluation of LLMs’ strengths and limitations in complex visual classification tasks to highlight the need for advancements in spatial processing and multimodal learning. While models demonstrated adequate performance in single-object tasks, their effectiveness decreased significantly in multi-object and multi-class settings, where spatial awareness and feature prioritization are critical for accurate classification.
The findings from our study underscore several promising avenues for advancing LLMs’ capabilities in visual classification tasks. One primary focus for future research could be the integration of multimodal training, which combines text and visual data to strengthen models’ spatial and contextual understanding. By training on datasets that include diverse object types, lighting conditions, and spatial arrangements, future LLMs may gain an improved ability to handle complex classification tasks. Expanding training datasets to include real-world complexities—such as varied lighting conditions, object clutter, and diverse environmental settings—could further improve model generalization. By training on data that closely mirror operational environments, future LLMs may better handle nuanced differences across complex scenes, enhancing their applicability in fields demanding reliable visual classification.
Prompt engineering offers another avenue for improvement, especially in directing models to focus on intrinsic object features like posture, material, and texture. Refining prompt designs to emphasize these fundamental aspects over superficial cues, such as color or attire, can guide models to perform more accurate and context-sensitive classifications. Effective prompt engineering may also help mitigate errors in complex, multi-class tasks by aligning model focus with core object characteristics. Moreover, continuous learning mechanisms that allow models to iteratively adapt to new data can provide long-term improvements. Implementing such adaptive learning processes may help models refine their performance in real-world scenarios where accuracy and reliability are critical.
Author Contributions
Data curation, I.P., Y.J. and S.Y.; Formal Analysis, H.J.; Funding Acquisition, S.K.; Investigation, H.J.; Methodology, H.J.; Project Administration, S.K.; Resources, S.K.; Software, H.J. and I.P.; Supervision, S.K.; Validation, H.J. and I.P.; Visualization, H.J. and I.P.; Writing—Original Draft, H.J., I.P. and Y.N.; Writing—Review and Editing, H.J., I.P., Y.N. and S.K. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by a Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korean government (MOTIE) (RS-2022-00154651, 3D semantic camera module development capable of material and property recognition).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI 2019. Available online: https://paperswithcode.com/paper/language-models-are-unsupervised-multitask (accessed on 15 November 2024).
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
- DeepSeek-AI; Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; et al. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
- Abdin, M.; Aneja, J.; Awadalla, H.; Awadallah, A.; Awan, A.A.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219. [Google Scholar]
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar]
- Qwen, A.Y.; Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2.5 Technical Report. arXiv 2025, arXiv:2412.15115. [Google Scholar]
- Imani, S.; Du, L.; Shrivastava, H. MathPrompter: Mathematical Reasoning using Large Language Models. arXiv 2023, arXiv:2303.05398. [Google Scholar]
- Yang, A.; Zhang, B.; Hui, B.; Gao, B.; Yu, B.; Li, C.; Liu, D.; Tu, J.; Zhou, J.; Lin, J.; et al. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv 2024, arXiv:2409.12122. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Sun, Z.; Fang, Y.; Wu, T.; Zhang, P.; Zang, Y.; Kong, S.; Xiong, Y.; Lin, D.; Wang, J. Alpha-clip: A clip model focusing on wherever you want. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13019–13029. [Google Scholar]
- Laurençon, H.; Saulnier, L.; Tronchon, L.; Bekman, S.; Singh, A.; Lozhkov, A.; Wang, T.; Karamcheti, S.; Rush, A.; Kiela, D.; et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Adv. Neural Inf. Process. Syst. 2024, 36, 71683–71702. [Google Scholar]
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10965–10975. [Google Scholar]
- Bao, H.; Wang, W.; Dong, L.; Liu, Q.; Mohammed, O.K.; Aggarwal, K.; Som, S.; Piao, S.; Wei, F. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Adv. Neural Inf. Process. Syst. 2022, 35, 32897–32912. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32, 1–11. [Google Scholar]
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557. [Google Scholar]
- Gan, Z.; Chen, Y.C.; Li, L.; Zhu, C.; Cheng, Y.; Liu, J. Large-scale adversarial training for vision-and-language representation learning. Adv. Neural Inf. Process. Syst. 2020, 33, 6616–6628. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
- Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar]
- Luo, J.; Khandelwal, S.; Sigal, L.; Li, B. Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4029–4040. [Google Scholar]
- Ren, W.; Xia, R.; Zheng, M.; Wu, Z.; Tang, Y.; Sebe, N. Cross-Class Domain Adaptive Semantic Segmentation with Visual Language Models. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 5005–5014. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Enis, M.; Hopkins, M. From llm to nmt: Advancing low-resource machine translation with claude. arXiv 2024, arXiv:2404.13813. [Google Scholar]
- Wang, X.; Zhang, X.; Luo, Z.; Sun, Q.; Cui, Y.; Wang, J.; Zhang, F.; Wang, Y.; Li, Z.; Yu, Q.; et al. Emu3: Next-token prediction is all you need. arXiv 2024, arXiv:2409.18869. [Google Scholar]
- Dai, W.; Lee, N.; Wang, B.; Yang, Z.; Liu, Z.; Barker, J.; Rintamaki, T.; Shoeybi, M.; Catanzaro, B.; Ping, W. Nvlm: Open frontier-class multimodal llms. arXiv 2024, arXiv:2409.11402. [Google Scholar]
- Agrawal, P.; Antoniak, S.; Hanna, E.B.; Bout, B.; Chaplot, D.; Chudnovsky, J.; Costa, D.; De Monicault, B.; Garg, S.; Gervet, T.; et al. Pixtral 12B. arXiv 2024, arXiv:2410.07073. [Google Scholar]
- Li, Y.; Sun, H.; Lin, M.; Li, T.; Dong, G.; Zhang, T.; Ding, B.; Song, W.; Cheng, Z.; Huo, Y.; et al. Ocean-omni: To Understand the World with Omni-modality. arXiv 2024, arXiv:2410.08565. [Google Scholar]
- Zhou, C.; Yu, L.; Babu, A.; Tirumala, K.; Yasunaga, M.; Shamis, L.; Kahn, J.; Ma, X.; Zettlemoyer, L.; Levy, O. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv 2024, arXiv:2408.11039. [Google Scholar]
- Wu, Z.; Chen, X.; Pan, Z.; Liu, X.; Liu, W.; Dai, D.; Gao, H.; Ma, Y.; Wu, C.; Wang, B.; et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv 2024, arXiv:2412.10302. [Google Scholar]
- Xue, L.; Shu, M.; Awadalla, A.; Wang, J.; Yan, A.; Purushwalkam, S.; Zhou, H.; Prabhu, V.; Dai, Y.; Ryoo, M.S.; et al. xgen-mm (blip-3): A family of open large multimodal models. arXiv 2024, arXiv:2408.08872. [Google Scholar]
- Team OLMo; Walsh, P.; Soldaini, L.; Groeneveld, D.; Lo, K.; Arora, S.; Bhagia, A.; Gu, Y.; Huang, S.; Jordan, M.; et al. 2 OLMo 2 Furious. arXiv 2024, arXiv:2501.00656. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
- Bendou, Y.; Lioi, G.; Pasdeloup, B.; Mauch, L.; Hacene, G.B.; Cardinaux, F.; Gripon, V. LLM meets Vision-Language Models for Zero-Shot One-Class Classification. arXiv 2024, arXiv:2404.00675. [Google Scholar]
- Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911. [Google Scholar]
- Zhang, Y.; Huang, X.; Ma, J.; Li, Z.; Luo, Z.; Xie, Y.; Qin, Y.; Luo, T.; Li, Y.; Liu, S.; et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1724–1732. [Google Scholar]
- Guan, X.; Zhang, L.L.; Liu, Y.; Shang, N.; Sun, Y.; Zhu, Y.; Yang, F.; Yang, M. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv 2025, arXiv:2501.04519. [Google Scholar]
- Karthika, N.; Chandran, S. Addressing the false positives in pedestrian detection. In Proceedings of the Electronic Systems and Intelligent Computing: Proceedings of ESIC 2020, Arunachal Pradesh, India, 2–4 March 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1083–1092. [Google Scholar]
- Ju, X.; Zeng, A.; Wang, J.; Xu, Q.; Zhang, L. Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 618–629. [Google Scholar]
- Yu, Z.; Qin, Y.; Li, X.; Zhao, C.; Lei, Z.; Zhao, G. Deep learning for face anti-spoofing: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5609–5631. [Google Scholar] [CrossRef]
- Kollreider, K.; Fronthaler, H.; Bigun, J. Non-intrusive liveness detection by face images. Image Vis. Comput. 2009, 27, 233–244. [Google Scholar] [CrossRef]
- Orrù, G.; Casula, R.; Tuveri, P.; Bazzoni, C.; Dessalvi, G.; Micheletto, M.; Ghiani, L.; Marcialis, G.L. Livdet in action-fingerprint liveness detection competition 2019. In Proceedings of the 2019 International Conference on Biometrics (ICB), Crete, Greece, 4–7 June 2019; pp. 1–6. [Google Scholar]
- Zhu, H.; Zheng, W.; Zheng, Z.; Nevatia, R. Sharc: Shape and appearance recognition for person identification in-the-wild. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6290–6300. [Google Scholar]
- Nagrani, A.; Albanie, S.; Zisserman, A. Seeing voices and hearing faces: Cross-modal biometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8427–8436. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Lu, S.; Li, Y.; Chen, Q.G.; Xu, Z.; Luo, W.; Zhang, K.; Ye, H.J. Ovis: Structural embedding alignment for multimodal large language model. arXiv 2024, arXiv:2405.20797. [Google Scholar]
- Yao, Y.; Yu, T.; Zhang, A.; Wang, C.; Cui, J.; Zhu, H.; Cai, T.; Li, H.; Zhao, W.; He, Z.; et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv 2024, arXiv:2408.01800. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
- Chen, Z.; Wang, W.; Cao, Y.; Liu, Y.; Gao, Z.; Cui, E.; Zhu, J.; Ye, S.; Tian, H.; Liu, Z.; et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv 2024, arXiv:2412.05271. [Google Scholar]
Figure 1.
Illustrative examples of real-world challenges that VLMs struggle to comprehend.
Figure 2.
Scheme for evaluating the performance of vision-based LLMs in distinguishing realistic objects.
Figure 3.
Examples from the object classification dataset. The first row shows fake objects, the second row shows real objects, and the third row depicts scenes containing multiple objects—some fake and some real.
Figure 5.
Human classification data.
Figure 7.
Examples of classification results in different categories using the YOLO-World model. The model misclassified real objects as fake. (a) Real orange. (b) Real hand. (c) Real pimento.
Figure 8.
Examples of classification results in different categories. The top row shows results from the YOLO-World model. The bottom row presents results from the RAM model. (a) The top two oranges are fake, and the bottom two oranges are real. (b) Fake hand (top) and real hand (bottom). Boxes in different colors indicate different misclassified labels.
Figure 9.
Object classification Q1 results (red: incorrect object classification; yellow: correct object classification but cannot distinguish between real and fake; green: correct object classification and also distinguishes between real and fake).
Figure 10.
Object classification Q3-2 results (red: incorrect object classification; yellow: correct object classification but cannot distinguish between real and fake; green: correct object classification and also correctly distinguishes between real and fake).
Figure 11.
Human classification Q2 results (red: incorrect object classification; yellow: correct object classification but cannot give the correct order; green: correct object classification and gives the right order).
Table 1.
Recent vision–language models and their specifications.
Model | Model Specification |
---|---|
CLIP | Learns visual representations from natural language supervision. |
ViLBERT | A multimodal transformer that jointly pre-trains on large-scale image–text datasets for various vision–language tasks. |
VisualBERT | Integrates visual and textual inputs in a transformer framework to support tasks such as image captioning, segmentation, and object detection. |
Claude 3 | An advanced vision–language model that refines multimodal alignment to better handle diverse real-world data, improving fine-grained recognition. |
Emu3 | Designed for improved multimodal alignment, leveraging visual and language data to boost performance across multiple modalities. |
NVLM | A vision–language model that incorporates multimodal data for enhanced scene understanding and downstream visual–linguistic tasks. |
Pixtral | Focuses on refining multimodal alignment to better manage diverse real-world image and language data. |
LLaMA 3.2 Vision | An extension of the LLaMA architecture that integrates vision components and enriches training with video, audio, and wiki-style corpora. |
Baichuan Ocean Mini | Leverages extensive multimodal training resources to enhance scene understanding in vision–language tasks. |
TransFusion | Incorporates diverse training resources—including video and audio—to improve reasoning about object context and attributes across modalities. |
DeepSeekVL2 | Uses a combination of visual, textual, and other modality data to advance scene and object recognition capabilities. |
BLIP-3 | Scales vision–language pre-training to larger datasets while preserving fine-grained recognition capabilities. |
OLMo | Focused on scaling vision–language learning, this approach integrates modalities for improved alignment and task performance. |
Qwen2.5-VL | Represents a refinement over previous Qwen-based models, emphasizing sophisticated multimodal alignment and fine-grained visual recognition. |
Phi | Enhances multimodal fusion with improved scaling and refined alignment for robust visual–language reasoning. |
Ovis | Targets open-world scene understanding using novel vision–language training techniques. |
MiniCPM-X | A lightweight variant emphasizing efficient parameter usage while retaining strong multimodal performance. |
InternVL | Integrates internal vision–language representations to boost performance in downstream tasks with robust scene understanding. |
Llava-Next | An evolution of prior Llava models that emphasizes enhanced visual reasoning and interactive capabilities. |
Table 2.
Object classification results corresponding to Questions 1 and 2. Bold values indicate the best scores and underlined values indicate the second-best scores for each question.
Scale | Model | Q1 Step 1 Single | Q1 Step 1 Multi | Q1 Step 2 Single | Q1 Step 2 Multi | Q2 Step 1 Single | Q2 Step 1 Multi | Q2 Step 2 Single | Q2 Step 2 Multi |
---|---|---|---|---|---|---|---|---|---|
Closed model | GPT-4o [49] | 98.59 | 100.00 | 64.79 | 12.50 | 100.00 | 100.00 | 49.30 | 0.00 |
Small-scale models | Qwen2-VL-7B [16] | 76.06 | 100.00 | 46.48 | 0.00 | 100.00 | 100.00 | 49.30 | 0.00 |
Small-scale models | Phi-3.5-3.8B [5] | 50.70 | 75.00 | 25.35 | 0.00 | 90.14 | 100.00 | 39.44 | 0.00 |
Small-scale models | llama3-llava-next-8B [3] | 76.06 | 87.50 | 45.07 | 0.00 | 81.69 | 87.50 | 38.03 | 0.00 |
Small-scale models | Ovis1.6-Gemma2-9B [50] | 84.51 | 100.00 | 52.11 | 0.00 | 100.00 | 100.00 | 54.93 | 0.00 |
Small-scale models | MiniCPM-V-2.6-8B [51] | 85.92 | 87.50 | 52.11 | 0.00 | 94.37 | 100.00 | 53.52 | 0.00 |
Medium-scale models | Llama-3.2-11B [3] | 80.28 | 100.00 | 47.89 | 0.00 | 95.77 | 100.00 | 47.89 | 0.00 |
Medium-scale models | llava-v1.6-13B [52] | 88.73 | 87.50 | 53.52 | 0.00 | 98.59 | 100.00 | 47.89 | 0.00 |
Medium-scale models | InternVL2.5-26B [53] | 85.91 | 100.00 | 60.56 | 0.00 | 92.96 | 100.00 | 49.30 | 0.00 |
Medium-scale models | llava-v1.6-34B [52] | 74.65 | 75.00 | 40.85 | 0.00 | 100.00 | 100.00 | 49.30 | 0.00 |
Medium-scale models | InternVL2.5-38B [53] | 90.14 | 75.00 | 63.38 | 12.50 | 94.37 | 100.00 | 47.89 | 0.00 |
Large-scale models | Qwen2-VL-72B [16] | 59.15 | 75.00 | 43.66 | 0.00 | 94.37 | 100.00 | 43.66 | 0.00 |
Large-scale models | Llama-3.2-90B [3] | 78.87 | 75.00 | 46.48 | 0.00 | 90.14 | 100.00 | 49.30 | 0.00 |
Large-scale models | InternVL2.5-78B [53] | 73.24 | 50.00 | 53.52 | 0.00 | 92.96 | 100.00 | 47.89 | 0.00 |
Large-scale models | llava-next-72B [52] | 78.87 | 87.50 | 47.89 | 0.00 | 98.59 | 100.00 | 49.30 | 0.00 |
Table 3.
Object classification results corresponding to Questions 3 and 4. Bold values indicate the best scores and underlined values indicate the second-best scores for each question.
Scale | Model | Q3_1 Step 1 Single | Q3_1 Step 2 Single | Q3_2 Step 1 Multi | Q3_2 Step 2 Multi | Q4_1 Step 1 Single | Q4_1 Step 2 Single | Q4_2 Step 1 Multi | Q4_2 Step 2 Multi |
---|---|---|---|---|---|---|---|---|---|
Closed model | GPT-4o [49] | 95.77 | 56.34 | 100.00 | 25.00 | 100.00 | 71.83 | 87.50 | 12.50 |
Small-scale models | Qwen2-VL-7B [16] | 76.06 | 46.48 | 87.50 | 12.50 | 100.00 | 59.15 | 100.00 | 0.00 |
Small-scale models | Phi-3.5-3.8B [5] | 43.66 | 18.31 | 62.50 | 0.00 | 81.69 | 45.07 | 100.00 | 0.00 |
Small-scale models | llama3-llava-next-8B [3] | 74.65 | 26.76 | 75.00 | 25.00 | 80.28 | 42.25 | 12.50 | 0.00 |
Small-scale models | Ovis1.6-Gemma2-9B [50] | 85.92 | 50.70 | 87.50 | 12.50 | 100.00 | 61.97 | 100.00 | 0.00 |
Small-scale models | MiniCPM-V-2.6-8B [51] | 83.10 | 43.66 | 87.50 | 0.00 | 98.59 | 52.11 | 100.00 | 0.00 |
Medium-scale models | Llama-3.2-11B [3] | 70.42 | 38.03 | 87.50 | 37.50 | 95.77 | 64.79 | 100.00 | 0.00 |
Medium-scale models | llava-v1.6-13B [52] | 92.96 | 32.39 | 100.00 | 0.00 | 97.18 | 49.30 | 100.00 | 0.00 |
Medium-scale models | InternVL2.5-26B [53] | 84.51 | 52.11 | 87.50 | 0.00 | 91.55 | 64.79 | 100.00 | 12.50 |
Medium-scale models | llava-v1.6-34B [52] | 76.06 | 30.99 | 87.50 | 37.50 | 98.59 | 50.70 | 87.50 | 0.00 |
Medium-scale models | InternVL2.5-38B [53] | 88.73 | 50.70 | 100.00 | 0.00 | 91.55 | 61.97 | 100.00 | 12.50 |
Large-scale models | Qwen2-VL-72B [16] | 60.56 | 36.62 | 100.00 | 0.00 | 100.00 | 61.97 | 100.00 | 0.00 |
Large-scale models | Llama-3.2-90B [3] | 83.10 | 45.07 | 100.00 | 0.00 | 95.77 | 57.75 | 100.00 | 0.00 |
Large-scale models | InternVL2.5-78B [53] | 81.69 | 47.89 | 100.00 | 12.50 | 94.37 | 59.15 | 100.00 | 12.50 |
Large-scale models | llava-next-72B [52] | 73.24 | 36.62 | 75.00 | 25.00 | 97.18 | 36.62 | 100.00 | 0.00 |
Table 4.
Human classification results. Bold values indicate the best scores and underlined values indicate the second-best scores for each question.
Scale | Model | Q1 Step 1 | Q1 Step 2 | Q2 Step 1 | Q2 Step 2 |
---|---|---|---|---|---|
Closed model | GPT-4o | 94.44 | 94.44 | 100.00 | 100.00 |
Small-scale models | Qwen2-VL-7B | 77.78 | 66.67 | 77.78 | 66.67 |
Small-scale models | Phi-3.5-3.8B | 38.89 | 38.89 | 61.11 | 55.56 |
Small-scale models | llama3-llava-next-8B | 55.56 | 50.00 | 50.00 | 38.89 |
Small-scale models | Ovis1.6-Gemma2-9B | 66.67 | 55.56 | 100.00 | 100.00 |
Small-scale models | MiniCPM-V-2.6-8B | 50.00 | 50.00 | 77.78 | 55.56 |
Medium-scale models | Llama-3.2-11B | 50.00 | 50.00 | 55.56 | 50.00 |
Medium-scale models | llava-v1.6-13B | 50.00 | 50.00 | 27.78 | 27.78 |
Medium-scale models | InternVL2.5-26B | 100.00 | 66.67 | 94.44 | 83.33 |
Medium-scale models | llava-v1.6-34B | 50.00 | 50.00 | 72.22 | 72.22 |
Medium-scale models | InternVL2.5-38B | 66.67 | 61.11 | 88.89 | 83.33 |
Large-scale models | Qwen2-VL-72B | 77.78 | 72.22 | 100.00 | 100.00 |
Large-scale models | Llama-3.2-90B | 55.56 | 50.00 | 88.89 | 83.33 |
Large-scale models | InternVL2.5-78B | 72.22 | 50.00 | 94.44 | 94.44 |
Large-scale models | llava-next-72B | 72.22 | 55.56 | 83.33 | 83.33 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).