MME-Unify

A Comprehensive Benchmark for Unified Multimodal Understanding and Generation

Wulin Xie1*, Yi-Fan Zhang1*, Chaoyou Fu3, Yang Shi2, Bingyan Nie1, Hongkai Chen4, Zhang Zhang1, Liang Wang1, Tieniu Tan3
1CASIA, 2PKU, 3NJU, 4Vivo
* Equal Contribution   † Project leader

Abstract

   Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) the lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; and 2) the absence of benchmarks for mixed-modality generation, which prevents assessment of multimodal reasoning capabilities.
   We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes:
   1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, to ensure consistent and fair comparisons across studies.
   2. Unified Task Assessment. We introduce five novel tasks that test multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning.
   3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, Emu3, and VILA-U, alongside specialized understanding models (e.g., Claude-3.5) and generation models (e.g., DALL-E-3).
   Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively.

Visualization

Overview figure: the MME-Unify evaluation framework for Unified MLLMs.

Leaderboard

Models are ranked by their average performance across the understanding, generation, and unify domains, from highest to lowest. "SIPU", "MITIU", "VPU", "CIVG", "FIR", "TIE", "TIG", "TVG", "VP", "IEE", "CSQ", "AL", "SD", and "VCoT" each denote a specific subtask (full names are given in the Benchmark Comparison section below). "Avg" is the average accuracy across the subtasks in each domain, and "-" indicates that the model is unable to perform the corresponding task.

By default, the leaderboard is sorted by the Overall score. To sort by another metric, click on the corresponding column header.
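The page does not include the aggregation script, so the following is a minimal Python sketch (illustrative names only) of how the listed numbers appear to be computed: subtask scores marked "-" are counted as 0 inside a domain average, and Overall is the unweighted mean of the three domain averages. This reproduces most rows exactly (the MIO-Instruct row is used as a worked example); it is an inference from the table, not the benchmark's official code.

```python
# Minimal sketch of how the leaderboard numbers appear to be aggregated
# (inferred from the published scores; not an official MME-Unify script).
# Assumption: a "-" cell counts as 0, and Overall is the unweighted mean
# of the three domain averages.

def domain_avg(scores):
    """Average subtask accuracies; None stands for a '-' cell and counts as 0."""
    return sum(s if s is not None else 0.0 for s in scores) / len(scores)

# Worked example: the MIO-Instruct row from the table below.
understanding = [52.00, 33.50, 39.01]                    # SIPU, MITIU, VPU
generation = [51.24, 59.29, 43.66, 48.23, 51.88, 66.37]  # CIVG, FIR, TIE, TIG, TVG, VP
unify = [24.16, 38.50, 8.66, 11.50, 0.0]                 # IEE, CSQ, AL, SD, VCoT

domain_avgs = [domain_avg(d) for d in (understanding, generation, unify)]
overall = sum(domain_avgs) / len(domain_avgs)
print(domain_avgs, overall)  # ≈ [41.50, 53.45, 16.56] and ≈ 37.17, matching the row
```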

| # | Method | Institution | LLM | Date | Overall | SIPU | MITIU | VPU | Und. Avg | CIVG | FIR | TIE | TIG | TVG | VP | Gen. Avg | IEE | CSQ | AL | SD | VCoT | Unify Avg |
|---|--------|-------------|-----|------|---------|------|-------|-----|----------|------|-----|-----|-----|-----|----|----------|-----|-----|----|----|------|-----------|
| | QA pairs | | | | 1964 | 1200 | 400 | 364 | 1964 | 600 | 200 | 200 | 200 | 200 | 194 | 1594 | 200 | 100 | 52 | 104 | 90 | 546 |
| 1 | Gemini2.0-flash-exp | Google DeepMind | - | 2025-03-12 | 45.57 | 72.58 | 68.25 | 54.90 | 65.24 | - | 77.61 | 43.54 | 57.56 | - | - | 29.79 | 38.42 | 74.75 | 47.12 | 26.00 | 12.41 | 40.74 |
| 2 | MIO-Instruct | Beihang University | MIO-7B | 2024-09-26 | 37.17 | 52.00 | 33.50 | 39.01 | 41.50 | 51.24 | 59.29 | 43.66 | 48.23 | 51.88 | 66.37 | 53.45 | 24.16 | 38.50 | 8.66 | 11.50 | 0 | 16.56 |
| 3 | SEED-LLaMA | Tencent AI Lab | LLaMA2-Chat-13B | 2023-12-18 | 28.45 | 49.17 | 33.00 | 36.26 | 39.48 | - | 57.00 | 42.26 | 41.96 | - | - | 23.54 | 22.00 | 51.49 | 12.50 | 22.00 | 3.61 | 22.32 |
| 4 | Anole | GAIR | - | 2024-07-08 | 18.59 | 17.17 | 14.50 | 9.00 | 13.56 | - | 36.64 | 43.42 | 41.52 | - | - | 19.91 | 18.55 | 59.65 | 14.42 | 15.00 | 3.89 | 22.30 |
| 5 | VILA-U | Tsinghua University | LLaMA-7B | 2024-09-06 | 18.58 | 51.04 | 32.25 | 36.54 | 39.95 | - | - | - | 45.10 | 49.64 | - | 15.79 | - | - | - | - | - | - |
| 6 | Janus-Pro | DeepSeek-AI | DeepSeek-LLM-7B-base | 2025-01-29 | 18.10 | 59.56 | 43.50 | 42.22 | 48.43 | - | - | - | 35.29 | - | - | 5.88 | - | - | - | - | - | - |
| 7 | MiniGPT-5 | University of California | Vicuna-7B | 2023-10-03 | 16.43 | 19.25 | 10.92 | 15.93 | 15.37 | - | 38.96 | 35.04 | 35.48 | - | - | 18.25 | 22.80 | 34.13 | 14.37 | 5.00 | 2.08 | 15.67 |
| 8 | JanusFlow | DeepSeek-AI | DeepSeek-LLM-1.5B-base | 2024-11-12 | 16.31 | 41.49 | 32.00 | 35.16 | 43.44 | - | - | - | 32.88 | - | - | 5.48 | - | - | - | - | - | - |
| 9 | GILL | Carnegie Mellon University | OPT-6.7B | 2023-03-26 | 15.10 | 22.18 | 6.00 | 3.56 | 10.58 | - | 50.67 | 35.71 | 46.60 | - | - | 22.16 | 24.25 | 21.29 | 8.66 | 6.67 | 1.90 | 12.55 |
| 10 | HermesFlow | Peking University | Phi-1.5 | 2025-02-17 | 14.01 | 41.49 | 33.00 | 28.32 | 34.27 | - | - | - | 46.48 | - | - | 7.75 | - | - | - | - | - | - |
| 11 | Emu3 | BAAI | LLaMA-8B | 2024-09-27 | 13.79 | 45.75 | 30.50 | 23.32 | 33.19 | - | - | - | 49.08 | - | - | 8.18 | - | - | - | - | - | - |
| 12 | Show-o | Show Lab | Phi-1.5 | 2024-08-22 | 12.74 | 32.47 | 34.75 | 25.66 | 30.96 | - | - | 43.54 | - | - | - | 7.26 | - | - | - | - | - | - |

Benchmark

Data Examples


Diagram of MME-Unify: Our benchmark consists of 3 main domains, encompassing 15 subtasks to comprehensively evaluate U-MLLMs' understanding, generation, and unified capabilities. Each unify task includes at least one question, an input image, multiple text choices, and several image choices; the image choices consist of the correct answer image and a set of manually crafted negative samples. During evaluation, we feed the model the image, the question, and the text options, and the U-MLLM is required to select the correct text answer and generate an image. The text answer is scored by matching it against the ground-truth option, while the generated image is compared with the constructed image choices: if its CLIP score with the correct answer image is the highest among the choices, the generation is counted as correct; otherwise, it is counted as incorrect.
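The image-side check described in the caption can be reproduced with any off-the-shelf CLIP encoder. The sketch below is illustrative only: the checkpoint (openai/clip-vit-base-patch32), the function name, and the file-path arguments are assumptions, not the benchmark's official evaluation code.

```python
# Minimal sketch of the image-side check: the generated image is "correct"
# iff its CLIP similarity to the ground-truth answer image is the highest
# among all image choices. Checkpoint choice is an assumption; MME-Unify
# may use a different CLIP variant.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_choice_correct(generated_path, choice_paths, answer_index):
    """Return True if the generated image is closest (in CLIP space) to the
    ground-truth answer image among all image choices."""
    images = [Image.open(p).convert("RGB") for p in [generated_path, *choice_paths]]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    gen, choices = feats[0], feats[1:]
    sims = choices @ gen                       # cosine similarity to each choice
    return int(sims.argmax()) == answer_index  # highest score must be the answer
```

The text-side check is a straightforward match of the selected option against the ground-truth answer, so it is omitted from the sketch.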

Benchmark Statistics


A comprehensive visualization of the diverse tasks in MME-Unify: The figure illustrates the wide-ranging nature of the tasks covered in our benchmark, which spans from traditional understanding tasks to complex mixed-modality generation challenges.

Benchmark Comparison


Comparison of MME-Unify and other benchmarks: SIPU: Single Image Perception & Understanding; MITIU: Multiple & Interleaved Image-Text Understanding; VPU: Video Perception & Understanding; CIVG: Conditional Image-to-Video Generation; FIR: Fine-grained Image Reconstruction; TIE: Text-Guided Image Editing; TIG: Text-to-Image Generation; TVG: Text-to-Video Generation; VP: Video Prediction; IEE: Image Editing and Explaining; CSQ: Common Sense Question Answering; AL: Auxiliary Lines; SD: SpotDiff; VCoT: Visual CoT.

Experiment Results

Accuracy on Visual CoT Task


Accuracy distribution across different dimensions on the Visual CoT task: (a) action, (b) location, and (c) image.

Experimental Results on All Task Splits

Citation


      @article{xie2025mme,
        title={MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation},
        author={Xie, Wulin and Zhang, Yi-Fan and Fu, Chaoyou and Shi, Yang and Nie, Bingyan and Chen, Hongkai and Zhang, Zhang and Wang, Liang and Tan, Tieniu},
        journal={arXiv preprint arXiv:2504.03641},
        year={2025}
      }