
Promptception: How Sensitive Are Large Multimodal Models to Prompts?
[EMNLP 2025]

¹Mohamed bin Zayed University of Artificial Intelligence,
²Swiss Federal Institute of Technology Lausanne (EPFL), ³Australian National University

Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, and MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.

Figure 1: Categorization of prompts proposed in our Promptception framework. It consists of 61 prompt types, spanning 15 categories (e.g. Answer Handling, Penalty-Based Prompts, Poor Linguistic Formatting) and 6 supercategories (e.g. Task-Specific Instructions, Choice Formatting and Presentation).

🔥Highlights

The contributions of this paper can be summarized as follows:
  1. Comprehensive Prompt Sensitivity Analysis: We present the most extensive study to date on the impact of prompt variations across diverse multimodal benchmarks and LMM architectures. To facilitate this study, we introduce Promptception, a systematic evaluation framework comprising 61 prompt types, organized into 15 categories and 6 supercategories, each designed to probe specific aspects of prompt formulation in LMMs.

  2. Evaluation Across Models, Modalities, and Benchmarks: We assess prompt sensitivity across a diverse set of model sizes and architectures, including both open-source and proprietary LMMs. Our analysis spans multiple modalities and benchmarks: MMStar (single image), MMMU-Pro (multi-image), and MVBench (video). We further evaluate sensitivity across various question dimensions within these benchmarks to ensure a comprehensive understanding (a minimal evaluation sketch follows this list).

  3. Best Practices for Prompting: We identify key trends in prompting and propose Prompting Principles for effective and consistent evaluation of LMMs.
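To make the evaluation setup concrete, the sketch below scores a single prompt template on one benchmark's MCQA items. It is an illustrative assumption, not the paper's actual harness: `query_model`, the item field names, and the `{question}`/`{options}` placeholders are all hypothetical.

```python
# Illustrative sketch of a prompt-sensitivity sweep (not the paper's evaluation code).
from typing import Callable

def evaluate_prompt(template: str,
                    items: list[dict],
                    query_model: Callable[[str, list], str]) -> float:
    """Accuracy of one prompt template over a benchmark's MCQA items.

    Each item is assumed to provide 'question', 'options' (list of strings),
    'visuals' (images or video frames), and 'answer' (a letter such as 'A').
    """
    correct = 0
    for item in items:
        # Label options as "(A) ...", "(B) ..." and fill the template's placeholders.
        options = "\n".join(f"({chr(65 + i)}) {opt}"
                            for i, opt in enumerate(item["options"]))
        prompt = template.format(question=item["question"], options=options)
        prediction = query_model(prompt, item["visuals"])  # e.g. returns "B"
        correct += prediction.strip().upper().startswith(item["answer"])
    return correct / len(items)

# A Promptception-style sweep then reduces to nested loops over
# (prompt template, model, benchmark) triples, recording one accuracy per triple.
```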

Sensitivity of state-of-the-art LMMs to prompt variations.

Examples from the MMStar benchmark illustrating divergent model outputs despite identical user queries, caused solely by changes in prompt phrasing (Left: InternVL-38B, Middle: GPT-4o, Right: Gemini 1.5 Pro). This demonstrates the models’ sensitivity to how instructions are framed.

How Does Variation in Prompts Impact Accuracy?

Proprietary Models Performance
Average Prompt Performance for Proprietary Models. PRA (Percentage Relative Accuracy) with respect to the Baseline Prompt Accuracy is averaged across Proprietary Models and the 3 Benchmarks (MMStar, MMMU-Pro & MVBench) for each Prompt Type.
Open Source Models Performance
Average Prompt Performance for Open-Source Models. PRA (Percentage Relative Accuracy) with respect to the Baseline Prompt Accuracy is averaged across Open-source Models and the 3 Benchmarks (MMStar, MMMU-Pro & MVBench) for each Prompt Type.
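For reference, here is a minimal sketch of how such numbers could be aggregated, assuming PRA is a prompt's accuracy expressed as a percentage of the baseline prompt's accuracy and that prompt 1.1 ("Answer with the option letter from the given choices directly.") serves as the baseline; the exact definition and aggregation in the paper may differ.

```python
# Hedged sketch: Percentage Relative Accuracy (PRA) per prompt type, averaged over
# models and benchmarks. Assumes PRA = 100 * accuracy(prompt) / accuracy(baseline).
from collections import defaultdict

def average_pra(accuracy: dict, models: list, benchmarks: list,
                prompt_ids: list, baseline_id: str = "1.1") -> dict:
    """`accuracy` maps (model, benchmark, prompt_id) -> accuracy in [0, 1]."""
    per_prompt = defaultdict(list)
    for model in models:
        for bench in benchmarks:
            base = accuracy[(model, bench, baseline_id)]
            for pid in prompt_ids:
                per_prompt[pid].append(100.0 * accuracy[(model, bench, pid)] / base)
    # Average each prompt type's PRA across all (model, benchmark) pairs.
    return {pid: sum(vals) / len(vals) for pid, vals in per_prompt.items()}
```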

Prompting Principles

Based on the insights from our study, we outline best practices for optimizing LMM performance on the MCQA task. These strategies are designed to enhance both accuracy and consistency. While our insights are based on MCQA evaluations, we believe these principles can be broadly applied to other tasks and extended to LLMs and LMMs. An important observation underlying these principles is the clear difference in behavior between open-source and proprietary models. Open-source models are often not extensively instruction-tuned, which makes them less responsive to prompt variations. In contrast, proprietary models typically undergo rigorous instruction tuning with large-scale, high-quality data, as well as advanced reinforcement learning and post-training techniques. This makes them considerably more sensitive to user instructions, where even subtle changes in prompt phrasing can lead to notable differences in performance. Given these differences in instruction-following capabilities, we present prompting principles separately for open-source and proprietary models. This distinction allows us to account for their varying adherence to instructions and to highlight strategies that are most effective for each category.


Each numbered principle contrasts open-source models with proprietary models; numbers in parentheses refer to prompt types and categories in the Promptception taxonomy.

1. Open-Source Models:
   - Concise prompts yield better performance: Keeping prompts short and direct improves accuracy. "Answer with the option letter from the given choices directly." (1.1)
   - Overly short or vague prompts reduce accuracy: When the prompt is too brief and lacks clarity, the model may not understand the expected format or task. "Best Choice: $LETTER" (12.3)
   - Detailed prompts are ineffective: Long or highly descriptive prompts do not improve accuracy. (Notably Category 5 and other long prompts)
   Proprietary Models:
   - Prompt length and detail have minimal impact: Unlike open-source models, proprietary models perform consistently across prompts of varying lengths and complexity.
   - Restricting responses to the letter choice is detrimental: Limiting the model to respond with just a letter (e.g., A, B, C, D) can suppress reasoning and reduce accuracy. (12.2)

2. Open-Source Models:
   - Complex or structured formatting decreases accuracy: Using formats such as JSON, YAML, or Markdown negatively impacts model performance. (2.3–2.9)
   - Clear separation of option letters enhances clarity: Using parentheses for option labels improves model understanding. (1.2)
     "(A) choice 1
     (B) choice 2
     (C) choice 3
     (D) choice 4"
   - Explicit labeling of question and options is beneficial: Using clear section headers improves comprehension. (2.2)
     "Question: <QUESTION>
     Options:
     <OPTIONS>
     Answer with the option letter from the given choices directly."
   - Placing the question and options at the end helps: Structuring prompts so that the question and answer choices appear after the instruction leads to better results. (3.1)
     "Answer with the option letter from the given choices directly.
     <QUESTION>
     <OPTIONS>"
   Proprietary Models:
   - Complex formatting does not impair accuracy: Unlike open-source models, proprietary models can handle structured formats such as JSON, Markdown, or YAML without a drop in performance. (Category 2)

3. Open-Source Models:
   - Poor linguistic formatting hinders performance: Use of all upper case, poor grammar, or misspellings negatively impacts accuracy. (Category 4)
   Proprietary Models:
   - Poor linguistic formatting does not affect performance: These models are robust to grammatical errors, casing, and minor typos, likely due to stronger pretraining and instruction tuning. (Category 4)

4. Open-Source Models:
   - Chain-of-Thought reasoning is ineffective: Step-by-step reasoning does not improve accuracy in this context. (Category 6)
   Proprietary Models:
   - Allowing room for reasoning significantly improves accuracy: Letting the model think before answering leads to higher accuracy. (Categories 6 & 12.5)

5. Open-Source Models:
   - Penalties, incentives, or competitive framing are ineffective: Using competitive language, penalizing mistakes, or offering rewards often introduces ambiguity. (Categories 13, 14 & 15)
   Proprietary Models:
   - Competitive framing degrades performance: Prompts that use game-like or adversarial language introduce unnecessary pressure or distraction, reducing answer accuracy. (Category 15)
   - Penalties or incentives improve performance: Framing prompts with rewards or penalties can enhance performance, possibly due to better contextual understanding. (Categories 13 & 14)

6. Open-Source Models:
   - Specifying personas or target audiences is ineffective: Tailoring prompts by specifying a persona or intended audience does not improve model performance. (Categories 8 & 9)
   Proprietary Models:
   - Persona-based prompting has mixed effects: Positive persona prompts do not enhance accuracy, while negative persona prompts can significantly degrade performance. (Category 9)

7. Open-Source Models:
   - Overemphasis on answer format is unhelpful: Excessive instruction about answer formatting can degrade performance. (Category 12 & 11.3)
   Proprietary Models:
   - Answer format plays an important role in accuracy: Proprietary models are sensitive to how the answer is requested. (Category 12 & 11.3)

8. Open-Source Models:
   - Temporal reasoning enhances video comprehension: Prompts that emphasize temporal order improve accuracy on video-based tasks. (11.4 & 11.5)
   Proprietary Models:
   - Temporal reasoning enhances video comprehension: Prompts that emphasize temporal aspects of events in videos result in more accurate responses. (11.4 & 11.5)

9. Open-Source Models:
   - Image-focused prompting helps: Directing the model to rely solely on the image content improves answer accuracy. (11.1)
   Proprietary Models:
   - Asking to focus on the image or question hinders performance: In contrast to open-source models, proprietary models do worse when explicitly told to focus only on the image or only on the question. (11.1 & 11.2)

10. Open-Source Models:
    - Answer leakage degrades performance: Including unintended hints or answer cues leads to lower accuracy. (Category 7)
    Proprietary Models:
    - Asking to avoid bias or stereotypes helps: Prompts that explicitly instruct the model to avoid bias or stereotypes lead to more accurate responses. (Category 10)
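As a usage illustration, the small prompt builder below encodes several of these principles: parenthesized option letters (1.2), explicit Question/Options labels with the instruction placed first for open-source models (2.2, 3.1), and an instruction that leaves room for reasoning for proprietary models (principle 4). The function and its exact wording are assumptions for illustration, not part of the Promptception release.

```python
# Illustrative helper encoding the principles above (not the official Promptception code).
def build_mcqa_prompt(question: str, options: list[str], proprietary: bool = False) -> str:
    # Parenthesized option letters, e.g. "(A) choice 1" (prompt type 1.2).
    labeled = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    if proprietary:
        # Proprietary models benefit from room to reason (principle 4) and are hurt
        # by being forced to output a bare letter (principle 1); wording is illustrative.
        instruction = ("Consider the question carefully, then give the letter of the "
                       "best option.")
    else:
        # Open-source models do best with a short, direct instruction (1.1),
        # explicit Question/Options labels (2.2), and the question placed last (3.1).
        instruction = "Answer with the option letter from the given choices directly."
    return f"{instruction}\nQuestion: {question}\nOptions:\n{labeled}"

# Example: build an open-source-friendly prompt for a single-image MCQA item.
print(build_mcqa_prompt("Which animal appears in the image?",
                        ["cat", "dog", "horse", "rabbit"]))
```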

BibTeX


@misc{ismithdeen2025promptceptionsensitivelargemultimodal,
      title={Promptception: How Sensitive Are Large Multimodal Models to Prompts?}, 
      author={Mohamed Insaf Ismithdeen and Muhammad Uzair Khattak and Salman Khan},
      year={2025},
      eprint={2509.03986},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.03986}, 
}
  