
Promptception: How Sensitive Are Large Multimodal Models to Prompts?
[EMNLP 2025]

¹Mohamed bin Zayed University of Artificial Intelligence,
²Swiss Federal Institute of Technology Lausanne (EPFL), ³Australian National University

Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, and MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.

Figure 1: Categorization of prompts proposed in our Promptception framework. It consists of 61 prompt types, spanning 15 categories (e.g. Answer Handling, Penalty-Based Prompts, Poor Linguistic Formatting) and 6 supercategories (e.g. Task-Specific Instructions, Choice Formatting and Presentation).

🔥Highlights

The contributions of this paper can be summarized as follows:
  1. Comprehensive Prompt Sensitivity Analysis: We present the most extensive study to date on the impact of prompt variations across diverse multimodal benchmarks and LMM architectures. To facilitate this study, we introduce Promptception, a systematic evaluation framework comprising 61 prompt types, organized into 15 categories and 6 supercategories, each designed to probe specific aspects of prompt formulation in LMMs.

  2. Evaluation Across Models, Modalities, and Benchmarks: We assess prompt sensitivity across a diverse set of model sizes and architectures, including both open-source and proprietary LMMs. Our analysis spans multiple modalities and benchmarks: MMStar (single image), MMMU-Pro (multi-image), and MVBench (video). We further evaluate sensitivity across various question dimensions within these benchmarks to ensure a comprehensive understanding (a minimal evaluation sketch follows this list).

  3. Best Practices for Prompting: We identify key trends in prompting and propose Prompting Principles for effective and consistent evaluation of LMMs.
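To make the evaluation setup concrete, the sketch below scores a single prompt template on one benchmark's MCQA items. It is an illustrative assumption, not the paper's actual harness: `query_model`, the item field names, and the `{question}`/`{options}` placeholders are all hypothetical.

```python
# Illustrative sketch of a prompt-sensitivity sweep (not the paper's evaluation code).
from typing import Callable

def evaluate_prompt(template: str,
                    items: list[dict],
                    query_model: Callable[[str, list], str]) -> float:
    """Accuracy of one prompt template over a benchmark's MCQA items.

    Each item is assumed to provide 'question', 'options' (list of strings),
    'visuals' (images or video frames), and 'answer' (a letter such as 'A').
    """
    correct = 0
    for item in items:
        # Label options as "(A) ...", "(B) ..." and fill the template's placeholders.
        options = "\n".join(f"({chr(65 + i)}) {opt}"
                            for i, opt in enumerate(item["options"]))
        prompt = template.format(question=item["question"], options=options)
        prediction = query_model(prompt, item["visuals"])  # e.g. returns "B"
        correct += prediction.strip().upper().startswith(item["answer"])
    return correct / len(items)

# A Promptception-style sweep then reduces to nested loops over
# (prompt template, model, benchmark) triples, recording one accuracy per triple.
```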

Sensitivity of state-of-the-art LMMs to prompt variations.

Examples from the MMStar benchmark illustrating divergent model outputs despite identical user queries, caused solely by changes in prompt phrasing (Left: InternVL-38B, Middle: GPT-4o, Right: Gemini 1.5 Pro). This demonstrates the models’ sensitivity to how instructions are framed.

How Does Variation in Prompts Impact Accuracy?

Proprietary Models Performance
Average Prompt Performance for Proprietary Models. PRA (Percentage Relative Accuracy) with respect to the Baseline Prompt Accuracy is averaged across Proprietary Models and the 3 Benchmarks (MMStar, MMMU-Pro & MVBench) for each Prompt Type.
Open Source Models Performance
Average Prompt Performance for Open-Source Models. PRA (Percentage Relative Accuracy) with respect to the Baseline Prompt Accuracy is averaged across Open-source Models and the 3 Benchmarks (MMStar, MMMU-Pro & MVBench) for each Prompt Type.
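For reference, here is a minimal sketch of how such numbers could be aggregated, assuming PRA is a prompt's accuracy expressed as a percentage of the baseline prompt's accuracy and that prompt 1.1 ("Answer with the option letter from the given choices directly.") serves as the baseline; the exact definition and aggregation in the paper may differ.

```python
# Hedged sketch: Percentage Relative Accuracy (PRA) per prompt type, averaged over
# models and benchmarks. Assumes PRA = 100 * accuracy(prompt) / accuracy(baseline).
from collections import defaultdict

def average_pra(accuracy: dict, models: list, benchmarks: list,
                prompt_ids: list, baseline_id: str = "1.1") -> dict:
    """`accuracy` maps (model, benchmark, prompt_id) -> accuracy in [0, 1]."""
    per_prompt = defaultdict(list)
    for model in models:
        for bench in benchmarks:
            base = accuracy[(model, bench, baseline_id)]
            for pid in prompt_ids:
                per_prompt[pid].append(100.0 * accuracy[(model, bench, pid)] / base)
    # Average each prompt type's PRA across all (model, benchmark) pairs.
    return {pid: sum(vals) / len(vals) for pid, vals in per_prompt.items()}
```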

Prompting Principles

Based on the insights from our study, we outline best practices for optimizing LMM performance on the MCQA task. These strategies are designed to enhance both accuracy and consistency. While our insights are based on MCQA evaluations, we believe these principles can be broadly applied to other tasks and extended to LLMs and LMMs. An important observation underlying these principles is the clear difference in behavior between open-source and proprietary models. Open-source models are often not extensively instruction-tuned, which makes them less responsive to prompt variations. In contrast, proprietary models typically undergo rigorous instruction tuning with large-scale, high-quality data, as well as advanced reinforcement learning and post-training techniques. This makes them considerably more sensitive to user instructions, where even subtle changes in prompt phrasing can lead to notable differences in performance. Given these differences in instruction-following capabilities, we present prompting principles separately for open-source and proprietary models. This distinction allows us to account for their varying adherence to instructions and to highlight strategies that are most effective for each category.


Each numbered principle contrasts open-source models with proprietary models; numbers in parentheses refer to prompt types and categories in the Promptception taxonomy.

1. Open-Source Models:
   - Concise prompts yield better performance: Keeping prompts short and direct improves accuracy. "Answer with the option letter from the given choices directly." (1.1)
   - Overly short or vague prompts reduce accuracy: When the prompt is too brief and lacks clarity, the model may not understand the expected format or task. "Best Choice: $LETTER" (12.3)
   - Detailed prompts are ineffective: Long or highly descriptive prompts do not improve accuracy. (Notably Category 5 and other long prompts)
   Proprietary Models:
   - Prompt length and detail have minimal impact: Unlike open-source models, proprietary models perform consistently across prompts of varying lengths and complexity.
   - Restricting responses to the letter choice is detrimental: Limiting the model to respond with just a letter (e.g., A, B, C, D) can suppress reasoning and reduce accuracy. (12.2)

2. Open-Source Models:
   - Complex or structured formatting decreases accuracy: Using formats such as JSON, YAML, or Markdown negatively impacts model performance. (2.3–2.9)
   - Clear separation of option letters enhances clarity: Using parentheses for option labels improves model understanding. (1.2)
     "(A) choice 1
     (B) choice 2
     (C) choice 3
     (D) choice 4"
   - Explicit labeling of question and options is beneficial: Using clear section headers improves comprehension. (2.2)
     "Question: <QUESTION>
     Options:
     <OPTIONS>
     Answer with the option letter from the given choices directly."
   - Placing the question and options at the end helps: Structuring prompts so that the question and answer choices appear after the instruction leads to better results. (3.1)
     "Answer with the option letter from the given choices directly.
     <QUESTION>
     <OPTIONS>"
   Proprietary Models:
   - Complex formatting does not impair accuracy: Unlike open-source models, proprietary models can handle structured formats such as JSON, Markdown, or YAML without a drop in performance. (Category 2)

3. Open-Source Models:
   - Poor linguistic formatting hinders performance: Use of all upper case, poor grammar, or misspellings negatively impacts accuracy. (Category 4)
   Proprietary Models:
   - Poor linguistic formatting does not affect performance: These models are robust to grammatical errors, casing, and minor typos, likely due to stronger pretraining and instruction tuning. (Category 4)

4. Open-Source Models:
   - Chain-of-Thought reasoning is ineffective: Step-by-step reasoning does not improve accuracy in this context. (Category 6)
   Proprietary Models:
   - Allowing room for reasoning significantly improves accuracy: Letting the model think before answering leads to higher accuracy. (Categories 6 & 12.5)

5. Open-Source Models:
   - Penalties, incentives, or competitive framing are ineffective: Using competitive language, penalizing mistakes, or offering rewards often introduces ambiguity. (Categories 13, 14 & 15)
   Proprietary Models:
   - Competitive framing degrades performance: Prompts that use game-like or adversarial language introduce unnecessary pressure or distraction, reducing answer accuracy. (Category 15)
   - Penalties or incentives improve performance: Framing prompts with rewards or penalties can enhance performance, possibly due to better contextual understanding. (Categories 13 & 14)

6. Open-Source Models:
   - Specifying personas or target audiences is ineffective: Tailoring prompts by specifying a persona or intended audience does not improve model performance. (Categories 8 & 9)
   Proprietary Models:
   - Persona-based prompting has mixed effects: Positive persona prompts do not enhance accuracy, while negative persona prompts can significantly degrade performance. (Category 9)

7. Open-Source Models:
   - Overemphasis on answer format is unhelpful: Excessive instruction about answer formatting can degrade performance. (Category 12 & 11.3)
   Proprietary Models:
   - Answer format plays an important role in accuracy: Proprietary models are sensitive to how the answer is requested. (Category 12 & 11.3)

8. Open-Source Models:
   - Temporal reasoning enhances video comprehension: Prompts that emphasize temporal order improve accuracy on video-based tasks. (11.4 & 11.5)
   Proprietary Models:
   - Temporal reasoning enhances video comprehension: Prompts that emphasize temporal aspects of events in videos result in more accurate responses. (11.4 & 11.5)

9. Open-Source Models:
   - Image-focused prompting helps: Directing the model to rely solely on the image content improves answer accuracy. (11.1)
   Proprietary Models:
   - Asking to focus on the image or question hinders performance: In contrast to open-source models, proprietary models do worse when explicitly told to focus only on the image or only on the question. (11.1 & 11.2)

10. Open-Source Models:
    - Answer leakage degrades performance: Including unintended hints or answer cues leads to lower accuracy. (Category 7)
    Proprietary Models:
    - Asking to avoid bias or stereotypes helps: Prompts that explicitly instruct the model to avoid bias or stereotypes lead to more accurate responses. (Category 10)
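As a usage illustration, the small prompt builder below encodes several of these principles: parenthesized option letters (1.2), explicit Question/Options labels with the instruction placed first for open-source models (2.2, 3.1), and an instruction that leaves room for reasoning for proprietary models (principle 4). The function and its exact wording are assumptions for illustration, not part of the Promptception release.

```python
# Illustrative helper encoding the principles above (not the official Promptception code).
def build_mcqa_prompt(question: str, options: list[str], proprietary: bool = False) -> str:
    # Parenthesized option letters, e.g. "(A) choice 1" (prompt type 1.2).
    labeled = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    if proprietary:
        # Proprietary models benefit from room to reason (principle 4) and are hurt
        # by being forced to output a bare letter (principle 1); wording is illustrative.
        instruction = ("Consider the question carefully, then give the letter of the "
                       "best option.")
    else:
        # Open-source models do best with a short, direct instruction (1.1),
        # explicit Question/Options labels (2.2), and the question placed last (3.1).
        instruction = "Answer with the option letter from the given choices directly."
    return f"{instruction}\nQuestion: {question}\nOptions:\n{labeled}"

# Example: build an open-source-friendly prompt for a single-image MCQA item.
print(build_mcqa_prompt("Which animal appears in the image?",
                        ["cat", "dog", "horse", "rabbit"]))
```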

BibTeX


@misc{ismithdeen2025promptceptionsensitivelargemultimodal,
      title={Promptception: How Sensitive Are Large Multimodal Models to Prompts?}, 
      author={Mohamed Insaf Ismithdeen and Muhammad Uzair Khattak and Salman Khan},
      year={2025},
      eprint={2509.03986},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.03986}, 
}
  