Can AI Understand Mandarin Speech Prosody? A Framework and Benchmark Showcase
Modeling and estimating speech prosody is a challenging task in understanding and generating natural speech. We introduce the Mandarin Speech Prosody Benchmark (MSPB), a linguistically grounded dataset for evaluating Speech Large Language Models (Speech LLMs) in Mandarin. MSPB comprises eight tasks covering crucial prosodic features and their interactions with syntax, semantics, and pragmatics. All MSPB items were designed according to Mandarin linguistic principles, validated by experts, and phonetically recorded and verified. We evaluated six Speech LLMs (GPT-4o, Gemini-1.5-Pro, Gemini-2-Flash, Qwen2-Audio-7B-Instruct, GLM-4-Voice, MiniCPM-o 2.6). Although some models perform well with context-rich cues (e.g., irony), they generally struggle with subtle prosodic variations (e.g., focus marking) and underperform humans. MSPB provides a valuable tool to assess and enhance prosodic comprehension, underscoring the need for improved prosodic integration in future research.