MME-Unify

A Comprehensive Benchmark for Unified Multimodal Understanding and Generation

Wulin Xie1*, Yi-Fan Zhang1*, Chaoyou Fu3, Yang Shi2, Bingyan Nie1, Hongkai Chen4, Zhang Zhang1, Liang Wang1, Tieniu Tan3
1CASIA, 2PKU, 3NJU, 4Vivo
* Equal Contribution   † Project leader

Abstract

   Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) the lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; and 2) the absence of benchmarks for mixed-modality generation, which prevents assessment of multimodal reasoning capabilities.
   We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes:
   1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, to ensure consistent and fair comparisons across studies.
   2. Unified Task Assessment. We introduce five novel tasks that test multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning.
   3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, Emu3, and VILA-U, alongside specialized understanding models (e.g., Claude-3.5) and generation models (e.g., DALL-E-3).
   Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively.

Visualization

Overview figure: the MME-Unify evaluation framework for Unified MLLMs.

Leaderboard

Models are ranked by their average performance across the understanding, generation, and unify domains, from highest to lowest. "SIPU", "MITIU", "VPU", "CIVG", "FIR", "TIE", "TIG", "TVG", "VP", "IEE", "CSQ", "AL", "SD", and "VCoT" each denote a specific subtask (full names are given in the Benchmark Comparison section below). "Avg" is the average accuracy across the subtasks in each domain, and "-" indicates that the model is unable to perform the corresponding task.

By default, the leaderboard is sorted by the Overall score. To sort by another metric, click on the corresponding column header.
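The page does not include the aggregation script, so the following is a minimal Python sketch (illustrative names only) of how the listed numbers appear to be computed: subtask scores marked "-" are counted as 0 inside a domain average, and Overall is the unweighted mean of the three domain averages. This reproduces most rows exactly (the MIO-Instruct row is used as a worked example); it is an inference from the table, not the benchmark's official code.

```python
# Minimal sketch of how the leaderboard numbers appear to be aggregated
# (inferred from the published scores; not an official MME-Unify script).
# Assumption: a "-" cell counts as 0, and Overall is the unweighted mean
# of the three domain averages.

def domain_avg(scores):
    """Average subtask accuracies; None stands for a '-' cell and counts as 0."""
    return sum(s if s is not None else 0.0 for s in scores) / len(scores)

# Worked example: the MIO-Instruct row from the table below.
understanding = [52.00, 33.50, 39.01]                    # SIPU, MITIU, VPU
generation = [51.24, 59.29, 43.66, 48.23, 51.88, 66.37]  # CIVG, FIR, TIE, TIG, TVG, VP
unify = [24.16, 38.50, 8.66, 11.50, 0.0]                 # IEE, CSQ, AL, SD, VCoT

domain_avgs = [domain_avg(d) for d in (understanding, generation, unify)]
overall = sum(domain_avgs) / len(domain_avgs)
print(domain_avgs, overall)  # ≈ [41.50, 53.45, 16.56] and ≈ 37.17, matching the row
```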

| # | Method | Institution | LLM | Date | Overall | SIPU | MITIU | VPU | Und. Avg | CIVG | FIR | TIE | TIG | TVG | VP | Gen. Avg | IEE | CSQ | AL | SD | VCoT | Unify Avg |
|---|--------|-------------|-----|------|---------|------|-------|-----|----------|------|-----|-----|-----|-----|----|----------|-----|-----|----|----|------|-----------|
| | QA pairs | | | | 1964 | 1200 | 400 | 364 | 1964 | 600 | 200 | 200 | 200 | 200 | 194 | 1594 | 200 | 100 | 52 | 104 | 90 | 546 |
| 1 | Gemini2.0-flash-exp | Google DeepMind | - | 2025-03-12 | 45.57 | 72.58 | 68.25 | 54.90 | 65.24 | - | 77.61 | 43.54 | 57.56 | - | - | 29.79 | 38.42 | 74.75 | 47.12 | 26.00 | 12.41 | 40.74 |
| 2 | MIO-Instruct | Beihang University | MIO-7B | 2024-09-26 | 37.17 | 52.00 | 33.50 | 39.01 | 41.50 | 51.24 | 59.29 | 43.66 | 48.23 | 51.88 | 66.37 | 53.45 | 24.16 | 38.50 | 8.66 | 11.50 | 0 | 16.56 |
| 3 | SEED-LLaMA | Tencent AI Lab | LLaMA2-Chat-13B | 2023-12-18 | 28.45 | 49.17 | 33.00 | 36.26 | 39.48 | - | 57.00 | 42.26 | 41.96 | - | - | 23.54 | 22.00 | 51.49 | 12.50 | 22.00 | 3.61 | 22.32 |
| 4 | Anole | GAIR | - | 2024-07-08 | 18.59 | 17.17 | 14.50 | 9.00 | 13.56 | - | 36.64 | 43.42 | 41.52 | - | - | 19.91 | 18.55 | 59.65 | 14.42 | 15.00 | 3.89 | 22.30 |
| 5 | VILA-U | Tsinghua University | LLaMA-7B | 2024-09-06 | 18.58 | 51.04 | 32.25 | 36.54 | 39.95 | - | - | - | 45.10 | 49.64 | - | 15.79 | - | - | - | - | - | - |
| 6 | Janus-Pro | DeepSeek-AI | DeepSeek-LLM-7B-base | 2025-01-29 | 18.10 | 59.56 | 43.50 | 42.22 | 48.43 | - | - | - | 35.29 | - | - | 5.88 | - | - | - | - | - | - |
| 7 | MiniGPT-5 | University of California | Vicuna-7B | 2023-10-03 | 16.43 | 19.25 | 10.92 | 15.93 | 15.37 | - | 38.96 | 35.04 | 35.48 | - | - | 18.25 | 22.80 | 34.13 | 14.37 | 5.00 | 2.08 | 15.67 |
| 8 | JanusFlow | DeepSeek-AI | DeepSeek-LLM-1.5B-base | 2024-11-12 | 16.31 | 41.49 | 32.00 | 35.16 | 43.44 | - | - | - | 32.88 | - | - | 5.48 | - | - | - | - | - | - |
| 9 | GILL | Carnegie Mellon University | OPT-6.7B | 2023-03-26 | 15.10 | 22.18 | 6.00 | 3.56 | 10.58 | - | 50.67 | 35.71 | 46.60 | - | - | 22.16 | 24.25 | 21.29 | 8.66 | 6.67 | 1.90 | 12.55 |
| 10 | HermesFlow | Peking University | Phi-1.5 | 2025-02-17 | 14.01 | 41.49 | 33.00 | 28.32 | 34.27 | - | - | - | 46.48 | - | - | 7.75 | - | - | - | - | - | - |
| 11 | Emu3 | BAAI | LLaMA-8B | 2024-09-27 | 13.79 | 45.75 | 30.50 | 23.32 | 33.19 | - | - | - | 49.08 | - | - | 8.18 | - | - | - | - | - | - |
| 12 | Show-o | Show Lab | Phi-1.5 | 2024-08-22 | 12.74 | 32.47 | 34.75 | 25.66 | 30.96 | - | - | 43.54 | - | - | - | 7.26 | - | - | - | - | - | - |

Benchmark

Data Examples


Diagram of MME-Unify: Our benchmark consists of 3 main domains, encompassing 15 subtasks to comprehensively evaluate U-MLLMs' understanding, generation, and unified capabilities. Each unify task includes at least one question, an input image, multiple text choices, and several image choices; the image choices consist of the correct answer image and a set of manually crafted negative samples. During evaluation, we feed the model the image, the question, and the text options, and the U-MLLM is required to select the correct text answer and generate an image. The text answer is scored by matching it against the ground-truth option, while the generated image is compared with the constructed image choices: if its CLIP score with the correct answer image is the highest among the choices, the generation is counted as correct; otherwise, it is counted as incorrect.
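The image-side check described in the caption can be reproduced with any off-the-shelf CLIP encoder. The sketch below is illustrative only: the checkpoint (openai/clip-vit-base-patch32), the function name, and the file-path arguments are assumptions, not the benchmark's official evaluation code.

```python
# Minimal sketch of the image-side check: the generated image is "correct"
# iff its CLIP similarity to the ground-truth answer image is the highest
# among all image choices. Checkpoint choice is an assumption; MME-Unify
# may use a different CLIP variant.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_choice_correct(generated_path, choice_paths, answer_index):
    """Return True if the generated image is closest (in CLIP space) to the
    ground-truth answer image among all image choices."""
    images = [Image.open(p).convert("RGB") for p in [generated_path, *choice_paths]]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    gen, choices = feats[0], feats[1:]
    sims = choices @ gen                       # cosine similarity to each choice
    return int(sims.argmax()) == answer_index  # highest score must be the answer
```

The text-side check is a straightforward match of the selected option against the ground-truth answer, so it is omitted from the sketch.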

Benchmark Statistics


A comprehensive visualization of the diverse tasks in MME-Unify: The figure illustrates the wide-ranging nature of the tasks covered in our benchmark, which spans from traditional understanding tasks to complex mixed-modality generation challenges.

Benchmark Comparison


Comparison of MME-Unify and other benchmarks: SIPU: Single Image Perception & Understanding; MITIU: Multiple & Interleaved Image-Text Understanding; VPU: Video Perception & Understanding; CIVG: Conditional Image-to-Video Generation; FIR: Fine-grained Image Reconstruction; TIE: Text-Guided Image Editing; TIG: Text-to-Image Generation; TVG: Text-to-Video Generation; VP: Video Prediction; IEE: Image Editing and Explaining; CSQ: Common Sense Question Answering; AL: Auxiliary Lines; SD: SpotDiff; VCoT: Visual CoT.

Experiment Results

Accuracy on Visual CoT Task


Accuracy distribution across different dimensions on the Visual CoT task: (a) action, (b) location, and (c) image.

Experimental Results on All Task Splits

Citation


      @article{xie2025mme,
        title={MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation},
        author={Xie, Wulin and Zhang, Yi-Fan and Fu, Chaoyou and Shi, Yang and Nie, Bingyan and Chen, Hongkai and Zhang, Zhang and Wang, Liang and Tan, Tieniu},
        journal={arXiv preprint arXiv:2504.03641},
        year={2025}
      }