AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

AVBench targets a core bottleneck in evaluating modern text-to-audio-video (T2AV) systems: existing metrics miss subtle but critical human-centric failures. We provide a fully automated, fine-grained, and human-aligned evaluation protocol with specialized evaluators and continuous confidence scores.

Figure 1. Overview of our AVBench. It integrates a multi-dimensional evaluation suite covering cross-modal consistency, audio metrics, and video metrics for human-centered real-world scenarios, together with a hierarchical AV prompt design containing normal and hard subsets. The framework supports automated large-scale assessment and human preference-based alignment verification to ensure reliable perceptual alignment.

Abstract

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation of AV generation remains at an early stage: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with generic multimodal LLMs, which leads to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation:

(i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook.

(ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgments, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation and shows strong potential for data filtering and as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).

10 Evaluation Dimensions
300K Hard-Negative SFT Pairs
470 Test Prompts (Normal + Hard)

Benchmark Construction and Analysis Overview

Key visual summaries adapted from the paper to explain data curation, evaluator design, and model-level performance patterns.

Figure 2. Overview of the AVBench construction pipeline. Our framework comprises two parallel workflows. The upper branch illustrates the training of the automated evaluators: clips are densely annotated and then branched into positive and hard-negative samples for SFT. The lower branch outlines benchmark prompt curation via heuristic sampling and strict filtering. Together, robust SFT evaluators and high-quality prompts form a fully automated evaluation framework.
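The branching of annotated clips into positive and hard-negative samples can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Clip` and `Pair` records, the `build_av_pairs` helper, and the two perturbations (a fixed audio time shift and an audio swap from another clip) are hypothetical and are not taken from the AVBench pipeline, which does not specify its perturbations at this level of detail.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str  # hypothetical identifier for a real-world source clip


@dataclass(frozen=True)
class Pair:
    video_id: str
    audio_id: str
    audio_shift_s: float  # seconds by which the audio track is delayed
    label: str            # "Yes" = consistent, "No" = perturbed


def build_av_pairs(clips, shift_s=0.5, seed=0):
    """Emit one positive and two assumed hard negatives per clip."""
    clips = list(clips)
    rng = random.Random(seed)
    pairs = []
    for i, clip in enumerate(clips):
        # Positive: the clip's own audio, untouched.
        pairs.append(Pair(clip.clip_id, clip.clip_id, 0.0, "Yes"))
        # Temporal hard negative (assumed): same audio, shifted in time.
        pairs.append(Pair(clip.clip_id, clip.clip_id, shift_s, "No"))
        # Semantic hard negative (assumed): audio taken from another clip.
        others = clips[:i] + clips[i + 1:]
        if others:
            donor = rng.choice(others)
            pairs.append(Pair(clip.clip_id, donor.clip_id, 0.0, "No"))
    return pairs


if __name__ == "__main__":
    for pair in build_av_pairs([Clip("a"), Clip("b"), Clip("c")]):
        print(pair)
```

Keeping each negative one controlled change away from its positive is what makes it "hard": the evaluator cannot separate the two on coarse cues alone.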

Figure 3. Data distribution of AVBench's normal and hard subsets. The multi-layer chart illustrates dataset diversity over key human-centric attributes including language, number of speakers, interaction complexity, emotional expression, and camera shot type. The hard subset contains a higher ratio of challenging scenarios to rigorously test fine-grained cross-modal alignment.

Figure 4. Taxonomy of multi-dimensional hard negatives in AVBench. The chart shows the comprehensive distribution of constructed negatives across major alignment axes. Rather than random perturbations, these dimensions are explicitly designed to target common T2AV failure modes, ensuring robust and fine-grained evaluation of cross-modal consistency.

Figure 5. Holistic model performance. Radar comparison across AVBench dimensions reveals complementary strengths and weaknesses among representative T2AV systems, highlighting the persistent gap between technical quality and strict cross-modal instruction fidelity.

Main Contributions

Human-Centric, Fine-Grained Suite

A 10-dimension protocol covers cross-modal alignment, speech content/realism, and audio-video perceptual quality in realistic human scenes.

Specialized Evaluators via SFT

Dedicated AV/AT/VT evaluators are trained on large-scale hard negatives, greatly improving sensitivity to subtle semantic and temporal mismatches.

Automated and Human-Aligned

A Normal/Hard hierarchical split exposes robustness limits, and evaluator outputs align well with human preference annotations.

Comprehensive Evaluation Suite

AVBench evaluates 10 dimensions. Each dimension is paired with a concrete evaluation method for reproducible scoring.

Cross-Modal Alignment and Synchronization

01. AV Consistency

Specialized SFT AV evaluator estimates audio-video semantic and temporal alignment via confidence-normalized Yes/No scoring.

Method: SFT AV Evaluator
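One plausible reading of confidence-normalized Yes/No scoring is to take the evaluator's logits for the two answer tokens and renormalize them over just those two options, which gives a continuous score in [0, 1]. The sketch below assumes that reading. The function name `yes_no_confidence` and its inputs are hypothetical, and the exact normalization used by AVBench is not stated here.

```python
import math


def yes_no_confidence(logit_yes: float, logit_no: float) -> float:
    """Return P("Yes") renormalized over the two answer tokens only.

    Assumption: the evaluator exposes one logit per answer token; all
    other vocabulary entries are ignored.
    """
    # Two-way softmax, shifted by the max logit for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


if __name__ == "__main__":
    # Equal logits mean the evaluator is undecided.
    print(yes_no_confidence(0.0, 0.0))
    # A higher "Yes" logit pushes the score toward 1.
    print(yes_no_confidence(2.0, 0.0))
```

Unlike a hard Yes/No answer, this score separates a confident judgment from a borderline one, which is what makes it usable for ranking or filtering.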

02. AT Consistency

Specialized SFT AT evaluator measures whether generated audio faithfully matches prompt semantics and intent.

Method: SFT AT Evaluator

03. VT Consistency

Specialized SFT VT evaluator checks visual adherence to textual instructions under fine-grained human-centric conditions.

Method: SFT VT Evaluator

04. Lip Sync Consistency

SyncNet-based alignment confidence and temporal offset analysis quantify speech-mouth synchronization fidelity.

Method: SyncNet / LatentSync
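A SyncNet-style offset analysis is commonly described as follows: compare per-frame video and audio embeddings at a range of temporal shifts, take the shift with the smallest mean distance as the offset, and use the gap between the median and the minimum mean distance as the confidence. The sketch below follows that general idea on plain Python lists. It assumes both feature sequences are sampled at the same rate, and the function `sync_offset` and its sign convention are illustrative choices, not the AVBench implementation.

```python
import math
from statistics import median


def sync_offset(video_feats, audio_feats, max_shift=3):
    """Return (best_offset, confidence) for two feature sequences.

    A positive offset means the audio lags the video by that many frames.
    """
    mean_dists = {}
    for shift in range(-max_shift, max_shift + 1):
        dists = []
        for t, v in enumerate(video_feats):
            a_idx = t + shift
            # Only compare frames that overlap after shifting.
            if 0 <= a_idx < len(audio_feats):
                dists.append(math.dist(v, audio_feats[a_idx]))
        if dists:
            mean_dists[shift] = sum(dists) / len(dists)
    best = min(mean_dists, key=mean_dists.get)
    # Confidence: how far the best shift stands out from a typical shift.
    confidence = median(mean_dists.values()) - mean_dists[best]
    return best, confidence


if __name__ == "__main__":
    video = [[float(t), float((t * t) % 7)] for t in range(20)]
    # Toy audio features: a copy of the video features delayed by 2 frames.
    audio = [[0.0, 0.0], [0.0, 0.0]] + video[:-2]
    print(sync_offset(video, audio))
```

A sharp minimum gives a large confidence, while a flat distance curve (no clear alignment at any shift) gives a confidence near zero.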

Unimodal Generation Quality

05. Speech Content Accuracy

Whisper transcriptions are scored for completeness and lexical accuracy, with a hallucination penalty, to form the final speech-content score.

Method: Whisper-based Scoring
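To make the three ingredients concrete, here is a minimal sketch of how a transcript could be compared against the intended speech. The definitions are assumptions chosen for illustration: completeness as bag-of-words recall, lexical accuracy as an order-aware similarity ratio, and hallucination as the share of transcript words absent from the reference. The weights, the 0-100 scale, and the function `speech_content_score` are likewise hypothetical. The transcript is taken as a plain string, so no ASR call is made.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def _words(text):
    # Lowercase and keep alphanumeric word tokens only.
    return re.findall(r"[a-z0-9']+", text.lower())


def speech_content_score(reference, transcript, penalty_weight=0.5):
    """Combine completeness, lexical accuracy, and a hallucination penalty."""
    ref_words, hyp_words = _words(reference), _words(transcript)
    ref_counts, hyp_counts = Counter(ref_words), Counter(hyp_words)
    overlap = sum((ref_counts & hyp_counts).values())

    # Completeness: share of reference words that were actually spoken.
    completeness = overlap / len(ref_words) if ref_words else 1.0
    # Lexical accuracy: order-aware similarity between the word sequences.
    accuracy = SequenceMatcher(None, ref_words, hyp_words).ratio()
    # Hallucination: share of spoken words that the prompt never asked for.
    hallucination = (len(hyp_words) - overlap) / len(hyp_words) if hyp_words else 0.0

    raw = 0.5 * completeness + 0.5 * accuracy - penalty_weight * hallucination
    return 100.0 * max(0.0, raw)


if __name__ == "__main__":
    print(speech_content_score("hello world", "hello world"))
    print(speech_content_score("hello world", "hello there world friend"))
```

Separating the penalty from the accuracy term means a model that says everything requested but keeps talking is scored lower than one that stops on time.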

06. Speech Realism

DF-Arena discriminator evaluates naturalness and authenticity of synthesized voices against real human speech priors.

Method: DF-Arena Discriminator

07. Audio Quality (NISQA)

NISQAv2 MOS prediction provides perceptual audio-quality estimation across speech and environmental sounds.

Method: NISQAv2 MOS

08. Audio Aesthetics (Audiobox)

Audiobox aesthetic sub-scores are aggregated to reflect production quality, usefulness, and listening experience.

Method: Audiobox Aesthetics

09. Video Quality (DOVER++)

DOVER++ evaluates technical visual fidelity and structural stability from multiple quality-assessment perspectives.

Method: DOVER++

10. Video Aesthetics

LAION-based aesthetic predictor scores overall visual composition and high-level perceptual appeal.

Method: LAION Aesthetic Predictor

Main Results

Quantitative results on AVBench Normal and Hard splits.

Normal Split

Model             AV      AT      VT      SyncNet  SC       DF-Arena  NISQA   Audiobox  DOVER++  Aesthetic
Sora 2            0.8713  0.8675  0.7599  4.9057   87.8391  0.4328    2.3784  3.1759    60.0125  4.0704
Veo 3 Fast        0.6924  0.8300  0.7235  6.5943   77.4950  0.3043    2.8191  3.5877    69.2275  4.9967
Wan 2.6           0.8207  0.8227  0.7556  4.5016   91.5568  0.0441    3.0289  3.9271    71.6473  4.7790
Kling 2.6         0.7626  0.8061  0.7501  8.1027   68.7844  0.1665    3.3141  3.8082    65.6786  5.4885
Seedance 1.5 Pro  0.6536  0.8554  0.7363  5.0146   84.9268  0.1602    3.6411  4.1686    71.7205  4.7373

Hard Split

Model             AV      AT      VT      SyncNet  SC       DF-Arena  NISQA   Audiobox  DOVER++  Aesthetic
Sora 2            0.9320  0.8575  0.7190  3.7932   76.7905  0.5498    2.0564  3.1339    58.1538  4.0434
Veo 3 Fast        0.7766  0.8117  0.6943  3.4535   70.3144  0.3827    2.3321  3.6113    67.0833  5.1438
Wan 2.6           0.8780  0.8418  0.7482  3.0488   84.4512  0.0498    3.0726  4.0924    71.5229  4.7721
Kling 2.6         0.8813  0.7602  0.7105  3.9844   69.0691  0.1469    3.2425  3.8912    62.9994  5.5033
Seedance 1.5 Pro  0.7409  0.8646  0.7398  3.3239   80.8029  0.2059    3.4093  4.1618    69.4430  4.7707
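The two splits can be compared column by column. As an illustration only, the sketch below takes the VT scores reported above for each model and computes the Hard-minus-Normal difference. The dictionary and variable names are hypothetical, and the choice of VT is arbitrary: any other column can be substituted the same way.

```python
# VT consistency scores transcribed from the Normal and Hard result tables.
VT_NORMAL = {
    "Sora 2": 0.7599,
    "Veo 3 Fast": 0.7235,
    "Wan 2.6": 0.7556,
    "Kling 2.6": 0.7501,
    "Seedance 1.5 Pro": 0.7363,
}
VT_HARD = {
    "Sora 2": 0.7190,
    "Veo 3 Fast": 0.6943,
    "Wan 2.6": 0.7482,
    "Kling 2.6": 0.7105,
    "Seedance 1.5 Pro": 0.7398,
}

# Negative values mean the model scores lower on the Hard split.
deltas = {model: VT_HARD[model] - VT_NORMAL[model] for model in VT_NORMAL}

if __name__ == "__main__":
    for model, delta in sorted(deltas.items(), key=lambda item: item[1]):
        print(f"{model:<18}{delta:+.4f}")
```

On this column, four of the five models score lower on the Hard split, with Sora 2 showing the largest drop and Seedance 1.5 Pro a slight increase.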

Citation

@misc{yang2026avbenchhumanalignedautomatedevaluation,
      title={AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models}, 
      author={Jialiang Yang and Bin Xia and Ruihang Chu and Dingdong Wang and Wanke Xia and Zhun Mou and Tianyang Zhong and Yiting Zhao and Wenming Yang},
      year={2026},
      eprint={2605.24652},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.24652}, 
}