AVBench | Human-Aligned AV Evaluation Benchmark

Figure 1. Overview of our AVBench. It integrates a multi-dimensional evaluation suite covering cross-modal consistency, audio metrics, and video metrics for human-centered real-world scenarios, together with a hierarchical AV prompt design containing normal and hard subsets. The framework supports automated large-scale assessment and human preference-based alignment verification to ensure reliable perceptual alignment.

Abstract

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage, with only a few coarse-grained benchmarks for human-related scenarios and relying on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook.

(ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgment, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering and serving as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).

10

Evaluation Dimensions

300K

Hard-Negative SFT Pairs

470

Test Prompts (Normal + Hard)

Benchmark Construction and Analysis Overview

Key visual summaries adapted from the paper to explain data curation, evaluator design, and model-level performance patterns.

Figure 2. Overview of the AVBench construction pipeline. Our framework comprises two parallel workflows. The upper branch illustrates the training of the automated evaluators: clips are densely annotated and then branched into positive and hard-negative samples for SFT. The lower branch outlines benchmark prompt curation via heuristic sampling and strict filtering. Together, robust SFT evaluators and high-quality prompts form a fully automated evaluation framework.

Figure 3. Data distribution of AVBench's normal and hard subsets. The multi-layer chart illustrates dataset diversity over key human-centric attributes including language, number of speakers, interaction complexity, emotional expression, and camera shot type. The hard subset contains a higher ratio of challenging scenarios to rigorously test fine-grained cross-modal alignment.

Figure 4. Taxonomy of multi-dimensional hard negatives in AVBench. The chart shows the comprehensive distribution of constructed negatives across major alignment axes. Rather than random perturbations, these dimensions are explicitly designed to target common T2AV failure modes, ensuring robust and fine-grained evaluation of cross-modal consistency.

Figure 5. Holistic model performance. Radar comparison across AVBench dimensions reveals complementary strengths and weaknesses among representative T2AV systems, highlighting the persistent gap between technical quality and strict cross-modal instruction fidelity.

Main Contributions

Human-Centric, Fine-Grained Suite

A 10-dimension protocol covers cross-modal alignment, speech content/realism, and audio-video perceptual quality in realistic human scenes.

Specialized Evaluators via SFT

Dedicated AV/AT/VT evaluators are trained on large-scale hard negatives, greatly improving sensitivity to subtle semantic and temporal mismatches.

Automated and Human-Aligned

A Normal/Hard hierarchical split exposes robustness limits, and evaluator outputs align well with human preference annotations.

Comprehensive Evaluation Suite

AVBench evaluates 10 dimensions. Each dimension is paired with a concrete evaluation method for reproducible scoring.

Cross-Modal Alignment and Synchronization

01

AV Consistency

Specialized SFT AV evaluator estimates audio-video semantic and temporal alignment via confidence-normalized Yes/No scoring.

Method: SFT AV Evaluator

02

AT Consistency

Specialized SFT AT evaluator measures whether generated audio faithfully matches prompt semantics and intent.

Method: SFT AT Evaluator

03

VT Consistency

Specialized SFT VT evaluator checks visual adherence to textual instructions under fine-grained human-centric conditions.

Method: SFT VT Evaluator

04

Lip Sync Consistency

SyncNet-based alignment confidence and temporal offset analysis quantify speech-mouth synchronization fidelity.

Method: SyncNet / LatentSync

Unimodal Generation Quality

05

Speech Content Accuracy

Whisper transcription with completeness, lexical accuracy, and hallucination penalty composes final speech-content score.

Method: Whisper-based Scoring

06

Speech Realism

DF-Arena discriminator evaluates naturalness and authenticity of synthesized voices against real human speech priors.

Method: DF-Arena Discriminator

07

Audio Quality (NISQA)

NISQAv2 MOS prediction provides perceptual audio-quality estimation across speech and environmental sounds.

Method: NISQAv2 MOS

08

Audio Aesthetics (Audiobox)

Audiobox aesthetic sub-scores are aggregated to reflect production quality, usefulness, and listening experience.

Method: Audiobox Aesthetics

09

Video Quality (DOVER++)

DOVER++ evaluates technical visual fidelity and structural stability from a multi-perspective quality assessment view.

Method: DOVER++

10

Video Aesthetics

LAION-based aesthetic predictor scores overall visual composition and high-level perceptual appeal.

Method: LAION Aesthetic Predictor

Main Results

Quantitative results on AVBench Normal and Hard splits.

Normal Split

Model	AV	AT	VT	SyncNet	SC	DF-Arena	NISQA	Audiobox	DOVER++	Aesthetic
Sora 2	0.8713	0.8675	0.7599	4.9057	87.8391	0.4328	2.3784	3.1759	60.0125	4.0704
Veo 3 Fast	0.6924	0.8300	0.7235	6.5943	77.4950	0.3043	2.8191	3.5877	69.2275	4.9967
Wan 2.6	0.8207	0.8227	0.7556	4.5016	91.5568	0.0441	3.0289	3.9271	71.6473	4.7790
Kling 2.6	0.7626	0.8061	0.7501	8.1027	68.7844	0.1665	3.3141	3.8082	65.6786	5.4885
Seedance 1.5 Pro	0.6536	0.8554	0.7363	5.0146	84.9268	0.1602	3.6411	4.1686	71.7205	4.7373

Hard Split

Model	AV	AT	VT	SyncNet	SC	DF-Arena	NISQA	Audiobox	DOVER++	Aesthetic
Sora 2	0.9320	0.8575	0.7190	3.7932	76.7905	0.5498	2.0564	3.1339	58.1538	4.0434
Veo 3 Fast	0.7766	0.8117	0.6943	3.4535	70.3144	0.3827	2.3321	3.6113	67.0833	5.1438
Wan 2.6	0.8780	0.8418	0.7482	3.0488	84.4512	0.0498	3.0726	4.0924	71.5229	4.7721
Kling 2.6	0.8813	0.7602	0.7105	3.9844	69.0691	0.1469	3.2425	3.8912	62.9994	5.5033
Seedance 1.5 Pro	0.7409	0.8646	0.7398	3.3239	80.8029	0.2059	3.4093	4.1618	69.4430	4.7707

Citation

@misc{yang2026avbenchhumanalignedautomatedevaluation,
      title={AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models}, 
      author={Jialiang Yang and Bin Xia and Ruihang Chu and Dingdong Wang and Wanke Xia and Zhun Mou and Tianyang Zhong and Yiting Zhao and Wenming Yang},
      year={2026},
      eprint={2605.24652},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.24652}, 
}