Aurelius: Relation Aware Text-to-Audio Generation At Scale

Yuhang He; He Liang; Yash Jain; Andrew Markham; Vibhav Vineet


ICLR | February 2026


We present Aurelius, a new framework that enables relation aware text-to-audio (TTA) generation research at scale. To address the lack of essential audio event and relation corpora, Aurelius contributes a large-scale audio event corpus, AudioEventSet, and a large-scale relation corpus, AudioRelSet. AudioEventSet comprises 110 event categories chosen to cover commonly heard audio events as fully as possible; each event is unique, realistic, and of high quality. AudioRelSet consists of 100 relations, comprehensively covering relations that are present in the physical world or can be neatly described by text. Because the two corpora provide audio events and relations independently, they can be combined through our pair generation strategy to create a massive number of pairs, supporting relation aware TTA investigation at scale. We comprehensively benchmark all existing TTA models from both general and relation aware evaluation perspectives. We further provide an in-depth investigation into scaling existing TTA models’ relation aware generation, either by training from scratch or by leveraging cross-domain general TTA knowledge. The introduced corpora and the findings from this investigation can potentially facilitate future research on relation aware TTA generation.