1. Introduction
1.1. Multimodal Large Language Models: Foundations and Architectures
With recent advancements in artificial intelligence (AI), large language models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, human language comprehension extends beyond mere text-based processing; it integrates multiple sensory modalities, including vision, hearing, and contextual reasoning, to achieve a more holistic understanding. To overcome this limitation, multimodal large language models (MLLMs) have emerged as a new paradigm. These models are designed to process and interpret not only textual data but also diverse input modalities such as images, audio, and video, thereby enabling a more comprehensive and context-aware understanding of information.
While conventional LLMs are trained exclusively on textual data, MLLMs integrate visual and auditory information, allowing them to leverage richer contextual cues. This multimodal capability extends beyond traditional language comprehension, enabling more sophisticated decision making and reasoning by combining linguistic and perceptual information. However, incorporating multimodal data introduces new technical challenges, particularly concerning the architecture and training methodologies of MLLMs. These challenges arise from the need to effectively align, fuse, and interpret multiple modalities within a unified framework, necessitating advancements in model design and optimization strategies.
The architecture of MLLMs is primarily composed of three key components: (1) pre-trained modality encoder, (2) pre-trained large language model (LLM), and (3) cross-modality transformer [1].
The pre-trained modality encoder is responsible for processing and extracting features from non-textual data, such as images, audio, and video. A prominent example of such an encoder is CLIP (Contrastive Language–Image Pretraining) [2], which plays a crucial role in learning the relationships between vision and language. These encoders enable MLLMs to bridge the gap between different modalities by effectively mapping non-textual inputs into a representational space that aligns with linguistic information.
The pre-trained LLM serves as the core text processing component of MLLMs. It utilizes existing LLM architectures, such as GPT [3,4,5], Llama [6,7], Gemini [8], and Mistral [9], to generate refined responses by integrating textual information with extracted multimodal features. These LLMs act as the reasoning engine of MLLMs, enabling context-aware and semantically coherent responses by leveraging both linguistic and non-linguistic information.
The cross-modality transformer facilitates the effective fusion of features extracted from non-textual data with the linguistic representations processed by the LLM. This component is essential for aligning, integrating, and contextualizing multimodal information, allowing MLLMs to learn semantic relationships across different modalities. By incorporating multimodal reasoning capabilities, the cross-modality transformer enables MLLMs to generate more accurate, context-aware, and semantically enriched outputs across diverse multimodal tasks.
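To make this three-component structure more concrete, the following is a minimal, illustrative PyTorch-style sketch of how a pre-trained modality encoder, a cross-modality fusion module, and an LLM backbone could be wired together. The class and module names, dimensions, and the HuggingFace-style inputs_embeds call are hypothetical placeholders rather than the architecture of any specific published MLLM.

```python
# Illustrative sketch only: the vision_encoder and llm_backbone arguments are
# assumed to be pre-trained modules supplied by the user (e.g., a CLIP vision
# tower and a decoder-only LLM); dimensions are placeholders.
import torch
import torch.nn as nn

class SimpleMLLM(nn.Module):
    def __init__(self, vision_encoder, llm_backbone, vision_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # (1) pre-trained modality encoder (often frozen)
        self.llm = llm_backbone               # (2) pre-trained LLM backbone
        # (3) cross-modality fusion: a projection into the LLM embedding space
        #     plus a single transformer layer that mixes visual and text tokens.
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.fusion = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        vision_feats = self.vision_encoder(image)       # (B, N_patches, vision_dim)
        vision_tokens = self.projector(vision_feats)     # map into LLM embedding space
        fused = self.fusion(torch.cat([vision_tokens, text_embeds], dim=1))
        # Hypothetical HuggingFace-style call: feed fused embeddings to the LLM.
        return self.llm(inputs_embeds=fused)
```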
To optimize the performance of MLLMs, various training strategies, including pretraining, instruction tuning, and alignment tuning, are employed [1].
Pretraining serves as the foundational phase, in which the model learns fundamental multimodal representations from large-scale multimodal datasets. During this stage, MLLMs are trained on image–text, audio–text, and other modality–text combinations, allowing them to understand and capture the relationships between different modalities. This step is crucial for enabling MLLMs to process and integrate information from diverse sources effectively. Following pretraining, instruction tuning is applied to enhance the model’s ability to generate task-specific responses. This process fine-tunes the MLLM to align with user prompts, ensuring that the model can produce outputs that are more coherent, relevant, and tailored to specific tasks. By learning from structured instructions, MLLMs become more adept at following user queries and delivering accurate and context-aware responses. To further refine the quality, reliability, and trustworthiness of the model’s outputs, alignment tuning is incorporated. This involves techniques such as reinforcement learning from human feedback (RLHF) [10], which adjusts the model’s responses to better reflect human preferences and ethical considerations. In particular, RLHF plays a critical role in reducing hallucination in LLMs [11]. Alignment tuning plays a vital role in mitigating hallucinations and biases, ensuring that MLLMs produce factually accurate and contextually appropriate outputs. By integrating these training methods, MLLMs can achieve improved multimodal understanding and enhanced user interaction capabilities.
1.2. Technical Challenges and Hallucination in MLLMs
Despite the powerful capabilities of MLLMs enabled by their architectural design and training methodologies, several performance limitations remain. One of the most critical challenges is hallucination, which refers to instances where the model generates responses that do not accurately correspond to the actual visual information [12]. This phenomenon occurs when MLLMs produce information that is not present in the training data or misinterpret visual content, leading to inaccurate or misleading outputs. Hallucination is particularly problematic in tasks such as image captioning, object recognition, and scene understanding, where precise alignment between textual descriptions and visual data is crucial. Recent studies [13] have highlighted the risks associated with semantic gaps and misalignment between different modalities in MLLMs. These issues arise when the textual and visual components of the model fail to integrate effectively, leading to inconsistencies in generated responses. To address this, it is essential to develop effective modality alignment techniques that ensure a coherent and accurate representation of multimodal data. Furthermore, improper alignment strategies can lead to unnecessary increases in model parameters without guaranteeing performance improvements, underscoring the need for careful selection of alignment methods to optimize both efficiency and accuracy in MLLMs.
1.3. Prompt Engineering for Enhancing MLLM Performance
To mitigate the hallucination problem and enhance the performance of MLLMs, various prompt engineering (PE) techniques have been proposed, similar to those developed for LLMs [14,15,16]. However, unlike LLMs, which rely solely on textual inputs, MLLMs process visual content in addition to text. As a result, strategic prompt design must go beyond simple text-based prompting and consider alignment with visual information to ensure coherence and accuracy in multimodal reasoning.
First, in-context learning (ICL) [17] requires providing relevant examples within a given multimodal image–text pair context to enable the model to generate appropriate responses. Chain of thought (CoT) [18] guides the model to solve complex reasoning tasks by leveraging sequential textual explanations based on image analysis. Similarly, step-by-step reasoning (SSR) [19] encourages the model to perform spatial and stepwise visual analysis, ensuring a structured reasoning process. Tree of thought (ToT) [19] extends this concept by considering multiple cognitive pathways derived from the image, allowing the model to select the most reliable response based on different analytical perspectives. Meanwhile, retrieval-augmented generation (RAG) [20] enhances multimodal understanding by retrieving external knowledge related to the given image, enabling the model to generate evidence-based responses even when dealing with previously unseen information. In summary, prompt engineering in MLLMs must evolve beyond simple text-based design to strategically integrate visual information, ensuring that the model effectively utilizes multimodal inputs to improve response accuracy and reliability.
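As a concrete illustration of how these strategies differ at the prompt level, the snippet below sketches simplified text templates for each technique. These templates are illustrative only and are not the exact prompts used in our experiments (representative prompts appear in Table 17).

```python
# Simplified, illustrative prompt templates for the multimodal PE techniques
# discussed above. Placeholders in braces are filled at inference time.
PE_TEMPLATES = {
    "ICL": (
        "Example image: {example_image}\nExample caption: {example_caption}\n"
        "Now describe the following image: {image}"
    ),
    "CoT": (
        "Look at the image {image} and answer: {question}\n"
        "Let's reason step by step about what the image shows before answering."
    ),
    "SSR": (
        "Analyze the image {image} step by step: first list the objects, "
        "then their spatial relations, then answer: {question}"
    ),
    "ToT": (
        "Consider several possible interpretations of the image {image}, "
        "evaluate each one, and give the most reliable answer to: {question}"
    ),
    "RAG": (
        "Retrieved reference image: {retrieved_image}\n"
        "Retrieved caption: {retrieved_caption}\n"
        "Using this context, describe the target image: {image}"
    ),
}

# Example: build a CoT-style prompt for a visual question.
prompt = PE_TEMPLATES["CoT"].format(image="<image_token>", question="What is the dog doing?")
print(prompt)
```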
In fact, existing prompt engineering research aimed at mitigating hallucination has predominantly focused on LLMs, while the systematic optimization of PE strategies for multimodal data remains underdeveloped. However, hallucination phenomena arising specifically from multimodal inputs present challenges that cannot be fully addressed by conventional approaches alone. Therefore, the development of prompt engineering techniques tailored to multimodal data is essential for generating accurate and contextually grounded responses.
This study aims to develop an optimal prompt engineering strategy that maximizes user satisfaction and response accuracy in practical MLLM service deployment while minimizing computational resource requirements. Specifically, instead of performing additional fine-tuning on pre-trained modality encoders, pre-trained LLMs, or cross-modality transformers, we explore how multimodal-specific prompt engineering techniques alone can enhance MLLM performance.
To achieve this, we systematically investigate the application of RAG, CoT, ICL, SSR, and ToT as effective prompt engineering strategies. In particular, we evaluate the impact of these techniques on state-of-the-art MLLMs, including Phi [21,22], Llama [23], Pixtral [24], and Qwen [25,26], providing empirical insights into their effectiveness across different architectures.
To ensure that our evaluation closely aligns with user satisfaction, we employ a diverse set of performance metrics, including bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), sentence-bidirectional encoder representations from transformers (S-BERT), MoverScore, and consensus-based image description evaluation (CIDEr). These metrics collectively assess the quality, fluency, and relevance of multimodal-generated responses.
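As an illustration of how such reference-based metrics can be computed in practice, the following sketch scores a generated caption against a reference using common open-source implementations (nltk for BLEU, rouge-score for ROUGE-L, sentence-transformers for S-BERT cosine similarity). The example sentences and the all-MiniLM-L6-v2 model choice are illustrative assumptions; the exact metric configurations used in our evaluation may differ.

```python
# Minimal sketch of reference-based caption scoring with common libraries.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "a brown dog runs across a grassy field"   # illustrative example
candidate = "a dog is running on the grass"

# BLEU: n-gram precision with smoothing for short captions.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence based F-measure.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# S-BERT: cosine similarity between sentence embeddings (semantic similarity).
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb_ref, emb_cand = sbert.encode([reference, candidate], convert_to_tensor=True)
sbert_sim = util.cos_sim(emb_ref, emb_cand).item()

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  S-BERT={sbert_sim:.3f}")
```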
For benchmark datasets, we utilize MathVista [27], CVBench [28], ScienceQA [29], nocaps [30], MSCOCO [31], and Flickr30k [32]. These datasets span a variety of domains and multimodal tasks, allowing for a comprehensive analysis of prompt engineering strategies in multimodal natural language generation. Based on the results, we propose a greedy prompt engineering strategy (Greedy PES) that optimally selects the most effective prompt engineering technique for each dataset and MLLM model, maximizing response quality and reliability.
The proposed Greedy PES method enables the identification of the optimal MLLM model and the most effective PE combination for each dataset, based on the exhaustive evaluation of all possible PE configurations. Furthermore, by employing a weighted metric computation scheme that adaptively reflects the characteristics of each dataset and user preferences, this approach achieves a closer alignment with user satisfaction compared to conventional methods.
2. Related Works
MLLMs [21,22,23,24,25,26], which aim to generate text by processing diverse multimodal inputs through LLMs, have been the focus of extensive research in terms of architectural advancements [1,15] and training methodologies [1]. Despite continuous improvements in performance, MLLMs still face significant challenges, particularly in regard to generating hallucinated responses that fail to accurately reflect the provided visual or contextual inputs [12]. To mitigate these limitations, prompt engineering (PE) strategies have emerged as a promising solution to enhance response quality and reliability [14,15,16]. Additionally, recent research has explored user experience optimization methods specifically tailored for MLLMs, employing diverse evaluation frameworks to assess model effectiveness [11,33,34,35].
In terms of architectural studies, Fu et al. [1] categorized MLLM architectures into three core components: pre-trained modality encoders, pre-trained LLMs, and cross-modality transformers. This structure allows MLLMs to integrate multimodal information efficiently while leveraging LLMs’ textual reasoning capabilities. In contrast, Zhang et al. [15] proposed a more fine-grained MM-LLM framework, decomposing it into modality encoders, input projectors, LLM backbones, output projectors, and modality generators. This expanded framework extends beyond text generation, encompassing multimodal content generation, including images and audio, thus broadening the application scope of MLLMs.
In contrast to architectural research, studies on MLLM training methodologies have primarily focused on pretraining, instruction tuning, and alignment tuning [1]. Pretraining aims to align different modalities while embedding multimodal world knowledge into the model [1]. Instruction tuning is designed to teach MLLMs how to follow user instructions and effectively perform assigned tasks. Meanwhile, alignment tuning ensures that MLLMs are aligned with specific human preferences, improving their ability to generate responses that are both reliable and contextually appropriate.
Despite these advanced training strategies, achieving perfect alignment between different modality encoders remains a fundamental challenge. This misalignment issue often leads to multimodal hallucination, where the content generated by an MLLM does not accurately correspond to the provided visual input. When combined with an LLM, multimodal hallucination can manifest in three distinct types [12]: existence hallucination, attribute hallucination, and relationship hallucination. Existence hallucination is the most fundamental form of hallucination, where the model incorrectly asserts the presence of objects that do not actually exist in the image. Attribute hallucination occurs when the model misdescribes the attributes of an object, such as failing to correctly identify the color of a dog. This type of hallucination is often correlated with existence hallucination, as attribute descriptions should be grounded in the actual objects present in the image. Relationship hallucination is a more complex phenomenon that extends beyond the existence of objects. It refers to incorrect descriptions of relationships between objects, such as relative positioning or interactions, leading to misinterpretations of the scene’s contextual meaning. These hallucination challenges highlight the inherent difficulties in aligning multimodal representations, necessitating effective prompt engineering strategies to mitigate the issue and improve response reliability.
To help address the hallucination problem in LLMs, various prompt engineering techniques such as ICL [17], CoT [18], SSR [19], ToT [19], and RAG [20] have been introduced. These approaches aim to guide the model in structured reasoning, contextual retrieval, and incremental step-wise reasoning, thereby improving response accuracy. However, applying these techniques directly to MLLMs presents performance limitations, as MLLMs require strategic prompt design that accounts for alignment with visual information, rather than relying solely on text-based prompts. To overcome these challenges, recent studies have proposed multimodal-specific prompt engineering techniques. Yin et al. [14,36,37] introduced multimodal ICL (M-ICL), multimodal CoT (M-CoT), and LLM-aided visual reasoning (LAVR) to mitigate multimodal hallucination by integrating visual and textual reasoning. Similarly, He et al. [38] proposed prompt optimization for enhancing multimodal reasoning (POEM), a visual analysis system that optimizes prompts to enhance multimodal reasoning capabilities in large language models. Expanding on this, Zhang et al. [15] introduced Multimodal-CoT, a framework that extends chain-of-thought reasoning to process multimodal inputs (text and images), thereby improving joint linguistic and visual inference. Additionally, Wu et al. [16] explored visual prompting techniques for MLLMs, categorizing different types of visual prompts and investigating their impact on compositional reasoning, visual grounding, and object reference within multimodal contexts.
In parallel, MLLM evaluation methodologies have also been a subject of extensive research. Xu et al. [33] provided a comprehensive review of MM-LLM efficiency improvement techniques, introducing various benchmarks for measuring multimodal effectiveness. Li et al. [34] highlighted the limitations of existing evaluation methods, noting that most benchmarks require fixed answers, which constrains the evaluation of creative responses. Additionally, they emphasized the lack of effective hallucination assessment, the inadequate evaluation of multimodal knowledge learning, and the absence of causality understanding metrics. To address these shortcomings, they proposed adopting user-centric evaluation, multimodal expansion evaluation, and interactive and dynamic evaluation methods. Similarly, Huang et al. [11] systematized MLLM evaluation concepts, categorizing evaluation approaches based on what to evaluate (evaluation objectives), how to evaluate (evaluation methodologies), and where to evaluate (evaluation scope). Furthermore, Xie et al. [35] proposed a standardized evaluation framework that incorporates accuracy-based metrics (BLEU, ROUGE, CIDEr, and MoverScore based on Wasserstein distance), as well as human evaluation, to ensure a more holistic assessment of MLLM performance.
The key contributions of this paper are as follows:
Comprehensive Performance Analysis: We present an extensive performance evaluation of various MLLMs, along with prompt engineering techniques designed to enhance their capabilities. This analysis is conducted across multiple datasets and assessed using a diverse set of performance metrics.
Weighted Aggregate Performance Metric: We introduce a weighted aggregate performance metric that integrates multiple evaluation metrics, such as BLEU, ROUGE, METEOR, S-BERT, MoverScore, and CIDEr, to provide a holistic assessment of prompt engineering strategies.
Optimization Strategy for Prompt Engineering in MLLMs: We investigate the impact of prompt engineering strategies on different datasets and MLLM architectures, leading to the formulation of an optimized strategy tailored to dataset characteristics and MLLM models. Additionally, we propose a greedy prompt engineering strategy (Greedy PES) to further refine the application of prompt engineering for improved model performance.
The structure of this paper is as follows: Section 3 provides a detailed explanation of the MLLM models used in this study. Section 4 introduces the evaluation metrics employed for performance measurement, while Section 5 describes the benchmark datasets used for experimentation. Section 6 presents the proposed Greedy PES for optimizing MLLM performance. Section 7 discusses the experimental results, including the impact of different parameters, the effect of Greedy PES, the derived MLLM optimization strategies, and further insights. Section 8 highlights featured applications of MLLMs in practical deployment scenarios. Finally, Section 9 concludes the paper.
6. Greedy Prompt Engineering Strategy
This section describes the greedy prompt engineering strategy (Greedy PES), which is designed to identify and apply optimal prompt engineering techniques for different MLLM deployment environments, including the various MLLM models and benchmark datasets discussed in the previous sections.
In addition, the RAG approach was extended by integrating it with CoT, ToT, and SSR, whereby external information is retrieved and reformulated based on each respective reasoning strategy. These variants are denoted as R(C), R(T), and R(S), respectively.
The greedy prompt engineering strategy (Greedy PES) aims to determine the optimal combination of MLLM models and prompt engineering techniques for each dataset by identifying the highest achievable performance across all possible prompt engineering (PE) combinations. To formalize this, let d represent a dataset, p a prompt engineering technique, e an evaluation metric, and m an MLLM model. The evaluation score derived from these parameters is denoted as S(d, p, e, m). Furthermore, the weight assigned to each evaluation metric is defined as w_e, which accounts for the varying dynamic ranges of different evaluation metrics to prevent imbalance when aggregating scores. Additionally, these weights reflect the relative importance of each metric in assessing model performance.
The applied prompt engineering techniques are represented using the following abbreviations: B (baseline without PE), I (ICL), C (CoT), S (SSR), T (ToT), R (RAG), and R(C), R(T), and R(S) for the RAG variants combined with CoT, ToT, and SSR, respectively.
Then, the objective is to identify the MLLM model m* and prompt engineering technique p* that maximize the aggregated evaluation score across multiple evaluation metrics. This can be formulated as the following optimization equation:

(m*, p*) = argmax_{m, p} Σ_e w_e · S(d, p, e, m)

The optimal MLLM model m* and the optimal prompt engineering technique p* may vary depending on the dataset d.
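The optimization above amounts to an exhaustive search over all (model, PE technique) pairs for a given dataset. The sketch below illustrates this selection rule, assuming a dictionary of pre-computed, normalized evaluation scores and illustrative metric weights; the weight values and score tables used in this study are those reported in the experiments, and the concrete numbers here are placeholders.

```python
# Illustrative sketch of the Greedy PES selection rule:
# (m*, p*) = argmax_{m,p} sum_e w_e * S(d, p, e, m) for a fixed dataset d.
from itertools import product

def greedy_pes(scores, weights, models, pe_techniques):
    """scores[(model, pe, metric)] holds the (normalized) score S(d, p, e, m)."""
    best_pair, best_value = None, float("-inf")
    for m, p in product(models, pe_techniques):
        aggregate = sum(weights[e] * scores[(m, p, e)] for e in weights)
        if aggregate > best_value:
            best_pair, best_value = (m, p), aggregate
    return best_pair, best_value

# Hypothetical usage with two models, two PE techniques, and two metrics.
weights = {"BLEU": 0.5, "CIDEr": 0.5}
scores = {
    ("Qwen2-VL-7B", "ToT", "BLEU"): 0.31, ("Qwen2-VL-7B", "ToT", "CIDEr"): 0.95,
    ("Qwen2-VL-7B", "RAG", "BLEU"): 0.25, ("Qwen2-VL-7B", "RAG", "CIDEr"): 0.80,
    ("Phi-3.5-4.2B", "ToT", "BLEU"): 0.22, ("Phi-3.5-4.2B", "ToT", "CIDEr"): 0.70,
    ("Phi-3.5-4.2B", "RAG", "BLEU"): 0.35, ("Phi-3.5-4.2B", "RAG", "CIDEr"): 1.02,
}
print(greedy_pes(scores, weights, ["Qwen2-VL-7B", "Phi-3.5-4.2B"], ["ToT", "RAG"]))
```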
7. Simulation Results
This section presents the experimental setup designed to validate the effectiveness of the proposed Greedy PES algorithm for optimizing MLLM performance, along with a detailed performance analysis across different benchmark datasets.
Table 3 presents the experimental setup used in this study. The selected MLLM models include Llama-3.2-11B, Phi-3.5-4.2B, Pixtral-12B, and Qwen2-VL-7B, and the performance of various PE strategies, including ICL, CoT, RAG, ToT, SSR, and their hybrid combinations, was analyzed. In Sections 7.1 through 7.6, performances are evaluated, compared, and analyzed based on the experimental setup in Table 3 across the six datasets.
The responses for performance evaluation were generated using prompt formats derived from the corresponding PE strategies, with temperature = 0.1 and top-P = 0.9 applied as decoding parameters. For RAG, the prompt is automatically augmented with an image that exhibits high cosine similarity to the input image. Specifically, RAG employs a retrieval-augmented strategy to enhance multimodal reasoning. A subset of the dataset is pre-embedded using the CLIP [2] model to construct a retrieval database via ChromaDB. When the original image is given, it is encoded into a vector using the same CLIP model, and the most semantically similar image is retrieved based on cosine similarity. Prior to presenting the target image, the retrieved image and its caption are shown to provide relevant contextual knowledge and assist the model in generating more accurate responses.
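The following is a minimal sketch of this retrieval step under the stated setup (CLIP embeddings indexed in ChromaDB, nearest neighbor by cosine similarity). The CLIP checkpoint name, file paths, and captions are illustrative placeholders, and the actual pipeline may differ in detail.

```python
# Minimal sketch of CLIP + ChromaDB retrieval for the RAG prompt augmentation.
import chromadb
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # assumed CLIP checkpoint
client = chromadb.Client()
collection = client.create_collection(
    name="image_captions", metadata={"hnsw:space": "cosine"}  # cosine-similarity index
)

# Index a subset of the dataset: one CLIP embedding per image, caption as metadata.
dataset = [("img_001.jpg", "a dog playing in the park"),        # illustrative entries
           ("img_002.jpg", "a plate of pasta on a table")]
for idx, (path, caption) in enumerate(dataset):
    emb = clip.encode(Image.open(path)).tolist()
    collection.add(ids=[str(idx)], embeddings=[emb],
                   metadatas=[{"caption": caption, "path": path}])

# At inference time, embed the query image and retrieve the closest indexed image.
query_emb = clip.encode(Image.open("query.jpg")).tolist()
hit = collection.query(query_embeddings=[query_emb], n_results=1)
retrieved = hit["metadatas"][0][0]
print("Retrieved context:", retrieved["path"], "-", retrieved["caption"])
```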
For performance analysis, inference was conducted by applying various prompt engineering techniques to the pretrained models, including Llama-3.2-11B, Phi-3.5-4.2B, Pixtral-12B, and Qwen2-VL-7B, utilizing NVIDIA H100 Tensor Core GPU computing resources.
Finally, Section 7.7 analyzes the best-performing PE strategy for each dataset and MLLM model to derive a PE optimization strategy and discuss insights for performance enhancement.
7.1. MSCOCO
Analyzing the baseline performance (B) in Table 4 and Table 5, it is evident that Qwen-2 achieves the highest performance across most evaluation metrics. This is followed by Pixtral, Llama 3.2, and Phi-3.5 in descending order of performance. Notably, Llama 3.2 exhibits the best results in semantic similarity metrics (S-BERT, MoverScore), suggesting that the generated captions are likely to be more semantically appropriate. Meanwhile, Phi-3.5 achieves a higher CIDEr score than Pixtral and Llama 3.2, although it records the lowest performance in other metrics.
However, when applying the Greedy PES, the optimal model and optimal prompt engineering strategy are found to be Phi-3.5 with the base RAG technique. This indicates that although Phi-3.5 initially exhibited the lowest baseline performance, it outperforms all other models when Greedy PES is applied. This underscores the significant impact of PE on MLLM performance. Additionally, it is noteworthy that Phi-3.5, despite being the smallest model at 4.2B parameters, achieves superior performance compared to larger models when optimized using Greedy PES. Furthermore, following Phi-3.5, the models rank in performance as Qwen-2, Pixtral, and Llama 3.2. Interestingly, Qwen-2, which had the lowest baseline performance, demonstrates the second-best performance under Greedy PES. The BLEU score improvement for Qwen-2 through Greedy PES is nearly tenfold, highlighting the effectiveness of prompt engineering optimization. On the other hand, for the CIDEr metric, the ToT technique proves to be the most effective, with Qwen-2 emerging as the best-performing model.
7.2. Flickr30k
Analyzing the baseline performance (B) in Table 6 and Table 7, it is evident that Qwen-2 achieves the highest performance across most evaluation metrics. In particular, Qwen-2 records the highest scores in BLEU, ROUGE, and CIDEr, indicating its superior baseline performance in caption generation. Meanwhile, Pixtral achieves the highest performance in METEOR and also records a high MoverScore, suggesting strong semantic similarity between generated and reference captions. Phi-3.5 demonstrates a higher CIDEr score than Pixtral and Llama 3.2, but it records the lowest performance in most other metrics. When applying the greedy prompt engineering strategy (Greedy PES), the optimal model m* and optimal PE strategy p* are found to be Qwen-2 with the ToT technique. Notably, this combination achieves the highest performance across METEOR, S-BERT, MoverScore, and CIDEr, further confirming its effectiveness in enhancing captioning performance. Additionally, Phi-3.5, despite being the smallest model with only 4.2B parameters, demonstrates comparable performance. This suggests that Phi-3.5 could be a resource-efficient alternative for general captioning tasks, particularly in hardware-constrained environments where computational efficiency is a key requirement.
7.3. nocaps
As observed in Table 8 and Table 9, the baseline performance (B) analysis reveals that Qwen-2 achieves the highest scores in BLEU, ROUGE, and METEOR, indicating that it possesses the strongest baseline performance in caption generation. In contrast, Pixtral outperforms the other models in MoverScore and CIDEr, while Llama 3.2 achieves the highest score in S-BERT, demonstrating its strength in semantic similarity evaluation.
When applying the Greedy PES, the optimal model and optimal PE strategy are found to be Qwen-2 with the ToT technique. However, it is noteworthy that Phi-3.5, despite being the smallest model, achieves the best performance in BLEU and ROUGE. Additionally, Phi-3.5 also demonstrates performance comparable to Qwen-2 across METEOR, S-BERT, and MoverScore, indicating its efficiency in multimodal captioning tasks despite its lower parameter count.
7.4. ScienceQA
As observed in Table 10 and Table 11, the baseline performance (B) analysis demonstrates that Qwen-2 achieves the highest performance across all evaluation metrics. Following Qwen-2, Phi-3.5, Llama 3.2, and Pixtral exhibit strong performance in descending order.
Upon applying the greedy prompt engineering strategy (Greedy PES), the optimal model and optimal prompt engineering strategy are identified as Phi-3.5 with the base RAG technique. Additionally, Phi-3.5 with the ICL and SSR combination also demonstrates strong performance, albeit with a marginal difference. This result highlights the importance of knowledge expansion and step-by-step reasoning techniques, such as base RAG, ICL, and SSR, in scientific question-answering tasks, where structured reasoning and contextual information retrieval are crucial for generating accurate responses.
Furthermore, after applying Greedy PES, the combination of Qwen-2 with SSR follows Phi-3.5 in terms of performance. Notably, Qwen-2 also exhibited strong performance in the baseline results, reinforcing its effectiveness in scientific-domain-specific response generation.
7.5. MathVista
As observed in Table 12 and Table 13, the baseline performance (B) analysis demonstrates that Phi-3.5 outperforms all models across all evaluation metrics, followed by Qwen-2, Llama 3.2, and Pixtral in descending order. This result indicates that Phi-3.5 and Qwen-2 exhibit strong mathematical problem-solving and reasoning capabilities, whereas Pixtral demonstrates relatively lower performance in this domain.
Upon applying the greedy prompt engineering strategy (Greedy PES), the optimal model and optimal prompt engineering strategy are identified as Qwen-2 with the ToT approach. However, it is also notable that Phi-3.5 with the ICL approach achieves the best performance in the S-BERT and MoverScore metrics. These findings confirm that even after applying prompt engineering techniques, Phi-3.5 and Qwen-2 maintain a performance advantage over other models in mathematical reasoning and problem-solving tasks. Additionally, ToT and ICL emerge as the most effective prompt engineering strategies for optimizing MLLM performance in mathematical domains.
7.6. CVBench
As observed in Table 14 and Table 15, the base performance (B) indicates that Qwen-2 demonstrates the highest overall performance, with Phi-3.5 also exhibiting comparable proficiency in understanding multimodal data. Specifically, Phi-3.5 achieves the highest scores in BLEU, ROUGE, and METEOR, while Qwen-2 records the best performance in S-BERT, MoverScore, and CIDEr. This suggests that responses generated by Qwen-2 are semantically more appropriate and natural compared to other models.
When applying the Greedy PES, the optimal model and prompt engineering strategy are determined to be Phi-3.5 combined with the ICL technique. Furthermore, since ICL consistently emerges as the most effective prompt engineering method across various evaluation metrics for other MLLM models, this indicates that ICL is particularly advantageous for datasets such as CVBench, which require a fundamental yet comprehensive understanding of both text and image-based inputs.
7.7. Performance Analysis and Discussion
This section provides a quantitative analysis of the best-performing MLLM, the best-performing MLLM with PES, and the degree of performance enhancement, based on the previously presented results across datasets, MLLM models, and evaluation metrics.
Table 16 presents the optimal prompt engineering strategies (PES) and MLLM models across various datasets and evaluation metrics, derived from the results in Tables 4–15. As observed in the results, the optimal PES varies significantly depending on the dataset and the chosen MLLM model. Generally, for general category datasets, ICL, ToT, and RAG are predominantly utilized. This trend can be attributed to the characteristics of multimodal data, where generating captions from recognized objects and input text requires in-context reasoning, multi-path inference, and knowledge expansion to deepen the relationship between objects and textual context. In contrast, for math-related datasets, ICL, SSR, and ToT are the primary techniques, while for science-related datasets, RAG and SSR are more frequently employed. The emphasis on SSR in the math and science domains compared to the general domain is notable, as solving mathematical and scientific problems inherently demands step-by-step reasoning, which is crucial for handling complex problem-solving tasks.
Additionally, while Qwen-2 consistently achieves the highest performance across most cases when no PES is applied, it is noteworthy that Phi-3.5 also emerges as a strong contender when Greedy PES is applied. More significantly, despite being the smallest model, Phi-3.5 exhibits substantial performance improvement when PES is applied, demonstrating the effectiveness of PE in enhancing MLLM performance. These findings suggest that Greedy PES has strong potential for MLLM model optimization, highlighting its applicability for further expansion and future advancements in multimodal AI research.
A more detailed analysis of each dataset is now presented to examine the optimal MLLM and PES combinations for different multimodal tasks.
The MSCOCO dataset is designed for image captioning, encompassing diverse scenes and objects. The optimal MLLM–PES combinations identified through Greedy PES are Phi-3.5 with RAG and Qwen-2 with ToT. The results indicate that Qwen-2 exhibits strong image captioning capabilities even without additional prompt engineering, suggesting that it is inherently well trained for general multimodal image–text alignment. In contrast, Phi-3.5, when integrated with RAG, demonstrates a more effective retrieval-based approach, allowing the model to extract relevant information from the image and generate high-quality captions.
Flickr30k focuses on understanding relationships between people and objects within an image to generate relevant captions. The optimal MLLM–PES combination is Qwen-2 with ToT, reinforcing the finding that Qwen-2 is a strong candidate for text generation in general multimodal datasets. The results further suggest that the ToT-based approach facilitates enhanced logical reasoning, allowing the model to establish deeper semantic connections between elements in the image, ultimately producing more contextually relevant captions.
The nocaps dataset is designed for open-domain image captioning, where models must generate captions that describe the main content of an image, even for unseen objects. As observed in prior datasets, the optimal MLLM-PES combination remains Qwen-2 with ToT, reinforcing its capability in open-domain captioning. Furthermore, in the baseline setting (B), Qwen-2 outperforms the other models, highlighting its robustness in unconstrained image captioning tasks.
The ScienceQA dataset evaluates scientific reasoning and question answering, requiring the model to comprehend scientific concepts and principles. While Qwen-2 achieves the highest performance in the baseline setting (B), applying Greedy PES leads to optimal MLLM–PES combinations of Phi-3.5 with RAG or Phi-3.5 with ICL and SSR. This suggests that RAG and structured step-by-step reasoning (ICL, SSR) are the most effective strategies for solving scientific problems, as they facilitate information retrieval, logical deduction, and structured reasoning.
MathVista is designed to assess mathematical problem solving, numerical computation, and logical reasoning in a multimodal context. In the baseline setting (B), Phi-3.5 emerges as the best-performing model. However, when applying Greedy PES, the optimal MLLM–PES combination shifts to Qwen-2 with ToT, demonstrating that the ToT framework enhances logical reasoning and enables structured multi-step problem solving, particularly for mathematical tasks requiring iterative hypothesis evaluation and validation.
CVBench serves as a computer-vision-focused multimodal benchmark, where models are assessed on object recognition and scene description based on image–text relationships. In the baseline setting (B), Phi-3.5 and Qwen-2 achieve the highest performance, while Greedy PES identifies Phi-3.5 with ICL as the optimal combination. This finding indicates that ICL effectively optimizes image descriptions by incorporating diverse in-context examples, making it the most suitable approach for tasks requiring fine-grained multimodal understanding.
Ultimately, the application of the Greedy PES resulted in significant performance improvements across different multimodal tasks. The observed performance improvements are as follows:
184.3% increase in evaluation scores for general image captioning tasks compared to conventional methods.
90.3% increase in evaluation scores for mathematical VQA.
49.1% increase in evaluation scores for science VQA.
These results underscore the importance of prompt engineering in MLLM optimization, illustrating how Greedy PES can significantly enhance model performance by aligning multimodal reasoning techniques with dataset-specific requirements.
7.8. Prompt Examples
Table 17 presents examples of the prompts used in the aforementioned experiments. Table 18, Table 19, Table 20 and Table 21 present comparative results obtained by applying various PE techniques using images and questions from Figure 1 as inputs. The images and captions in Figure 1 were extracted from the nocaps dataset.
We now summarize and analyze the above prompt examples. B generally elicited strong visual grounding and basic descriptions, though Phi-3.5 and Llama 3.2 occasionally misinterpreted scenes negatively. I offered concise referencing but lacked contextual depth and emotional nuance across models. C encouraged creativity and narrative richness, but some models misread humorous cues. R yielded clear and concise outputs, though fine-grained detail was sometimes inconsistent. S and T aimed to deepen reasoning, revealing model-specific differences in analytical and emotional interpretation. R(C) supported creative, emotional framing but sometimes induced speculative responses. Model-wise, Phi-3.5 performed well with B and C; Llama 3.2 with I and S; Pixtral with C and R(C); and Qwen-2 with T and R. These results suggest that each prompt strategy effectively exposes the strengths and limitations of different MLLMs.
8. Featured Application
The rapid progress of MLLMs has opened up diverse application domains where natural language generation is required to be grounded in multimodal inputs. MLLMs have demonstrated strong potential in a wide range of use cases including image–text visual question answering (VQA) [27], medical image captioning [52,53], multimodal dialogue systems [54,55], robotics-based visual reasoning [56], legal document visual summarization [38], and math education support systems [15,27]. These applications leverage the capability of MLLMs to reason across textual, visual, and sometimes auditory modalities to deliver more informed and context-aware responses.
Despite these promising applications, deploying MLLMs in real-world scenarios faces several key challenges. One major limitation arises from the computational overhead of advanced prompt engineering strategies. Specifically, the proposed Greedy PES exhaustively explores all available combinations of prompts to identify optimal strategies for a given dataset and model. This approach, while empirically effective, is computationally intensive and resource demanding, making it less feasible in resource-constrained environments such as mobile or embedded devices [57,58].
To mitigate such constraints, recent work has proposed several solutions. For example, meta prompt selectors dynamically choose suitable prompts based on input domain or task characteristics [59]; heuristic rules can be used to predefine prompt configurations based on prior dataset analysis [38]; and prompt distillation techniques attempt to consolidate multiple prompt types into a unified, lighter-weight form [58]. These approaches enable more scalable and deployment-friendly usage of prompt engineering in practical settings.
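As a simple illustration of the heuristic-rule approach, the sketch below maps a dataset's domain category to a default PE configuration, using the domain-level trends observed in Section 7.7 (ICL/ToT/RAG for general captioning, ICL/SSR/ToT for mathematical tasks, RAG/SSR for scientific tasks). The rule table and function names are hypothetical and are meant only to show how a predefined configuration could stand in for an exhaustive Greedy PES search at deployment time.

```python
# Hypothetical heuristic selector: maps a dataset's domain category to a
# predefined PE configuration, based on the domain-level trends reported in
# Section 7.7. A real deployment would derive this table from prior analysis.
HEURISTIC_PE_RULES = {
    "general_captioning": ["ICL", "ToT", "RAG"],   # e.g., MSCOCO, Flickr30k, nocaps
    "math_reasoning":     ["ICL", "SSR", "ToT"],   # e.g., MathVista
    "science_qa":         ["RAG", "SSR"],          # e.g., ScienceQA
}

def select_pe(domain: str, fallback: str = "ICL") -> list[str]:
    """Return the predefined PE techniques for a domain, or a safe fallback."""
    return HEURISTIC_PE_RULES.get(domain, [fallback])

print(select_pe("science_qa"))       # ['RAG', 'SSR']
print(select_pe("unknown_domain"))   # ['ICL']
```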
Additionally, the effectiveness of each PE technique, such as ICL [60], CoT [15], SSR [19], ToT [19], and RAG [20], varies considerably depending on the task domain and dataset characteristics. For instance, ToT has proven particularly effective in scenarios requiring structured reasoning over visual inputs, while RAG is optimal in tasks that demand external knowledge retrieval and grounding, such as in scientific QA tasks [59]. These findings suggest that domain-aware prompt adaptation is essential for achieving optimal performance across applications.
While MLLMs have demonstrated strong generalization and reasoning capabilities, their effective deployment in real-world applications relies heavily on prompt strategies that are computationally efficient, domain-specific, and adaptively optimized. The proposed Greedy PES provides an empirical framework for identifying such strategies but also highlights the need for future research in lightweight and domain-adaptive prompt optimization.
9. Conclusions
This study investigated optimal PE strategies to mitigate one of the key limitations of MLLMs—the hallucination phenomenon. To achieve this, we analyzed representative multimodal PE techniques, including ICL, CoT, SSR, ToT, and RAG. These techniques were systematically applied across multiple datasets with distinct domain characteristics, allowing for a comprehensive performance evaluation.
The primary contribution of this work is the proposal of the greedy prompt engineering strategy (Greedy PES), a methodology designed to select the optimal prompt engineering strategy based on dataset and model characteristics. To ensure an objective and quantitative evaluation of MLLM responses, we employed a range of evaluation metrics, including BLEU, ROUGE, METEOR, S-BERT, MoverScore, and CIDEr. Additionally, a weighted aggregate evaluation score was introduced to facilitate a holistic comparison of model performance.
Experimental results demonstrate that the optimal PES varies depending on the dataset and the model used. General image captioning datasets benefited most from ICL, ToT, and RAG, suggesting that multimodal models require enhanced contextual reasoning, structured thought processing, and external knowledge retrieval for effective caption generation. Mathematical reasoning tasks (mathematical category) were best addressed by ICL, SSR, and ToT, highlighting the importance of incremental, structured reasoning in mathematical problem-solving. Scientific reasoning tasks (science category) showed the highest gains with RAG and SSR, reinforcing the need for external knowledge augmentation and systematic logical inference in scientific domains.
In the absence of prompt engineering, Qwen-2 emerged as the most effective model across various benchmarks. However, when Greedy PES was applied, Phi-3.5 also achieved competitive performance, despite being the smallest model in terms of parameter count. This finding underscores the potential of PES to significantly enhance the efficiency of smaller-scale models, making Phi-3.5 a highly efficient and accurate model when coupled with optimized prompt strategies.
These results empirically validate the hypothesis that PE can significantly enhance model performance and compensate for inherent model limitations. Moving forward, future research should extend the validation of Greedy PES to a broader range of multimodal applications and explore additional techniques to mitigate hallucination effects within MLLMs. Furthermore, domain-specific optimizations (e.g., medical, legal applications) should be investigated to refine PES methodologies for specialized fields where precision and reliability are paramount.