Audio ControlNet

This work studies fine-grained text-to-audio (T2A) generation with explicit control over key audio attributes, including loudness, pitch, and sound events. Instead of retraining models for individual control types, a ControlNet-based framework is built on top of pre-trained T2A backbones to enable flexible and extensible controllable generation. Two designs, T2A-ControlNet and the more lightweight T2A-Adapter, are introduced. With only 38M additional parameters, T2A-Adapter achieves state-of-the-art performance on AudioSet-Strong while maintaining strong control ability. The framework is further extended to audio editing, enabling the insertion and removal of audio events at specified time locations.

Caption	Sound Events Roll	T2A-Adapter (Ours)	T2A-ControlNet (Ours)	AudioComposer	EzAudio-XL	GT
People speak and clap, a child speaks and a camera clicks.	Female speech, woman speaking: 0.00s-3.97s, 7.91s-8.16s, 8.19s-9.65s; Child speech, kid speaking: 9.72s-10.00s
Background noise, tapping, and cat sounds are interspersed with purring.	Cat: 0.98s-2.29s, 9.03s-10.00s
Animals, dogs, and people are growling, shouting, and speaking.	Dog: 0.01s-0.17s, 0.72s-1.53s, 1.98s-3.14s, 3.57s-4.56s, 4.87s-5.96s, 6.39s-7.62s, 8.07s-8.98s, 9.30s-9.88s; Speech: 0.15s-0.74s, 1.61s-1.95s, 4.58s-4.89s, 7.63s-8.02s, 9.01s-9.29s; Male speech, man speaking: 3.20s-3.53s, 5.98s-6.38s, 9.88s-10.00s
Water flows and dishes clatter with child speech and laughter.	Child speech, kid speaking: 0.00s-1.50s, 1.73s-2.12s, 2.94s-3.54s, 7.80s-8.49s; Dishes, pots, and pans: 1.98s-2.16s, 3.18s-3.30s, 4.77s-5.08s, 5.71s-5.83s, 6.08s-6.24s, 6.42s-7.01s; Male speech, man speaking: 8.55s-9.56s; Water tap, faucet: 0.00s-10.00s
Speech babble and clattering dishes and silverware can be heard, along with a child's voice.	Dishes, pots, and pans: 0.85s-0.97s, 1.39s-1.50s, 7.72s-7.87s; Male speech, man speaking: 0.75s-1.17s; Cutlery, silverware: 4.69s-4.84s, 5.30s-5.52s; Female speech, woman speaking: 1.63s-3.41s; Child speech, kid speaking: 8.76s-9.35s
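
The timestamped annotations in the table above can be rasterized into a frame-level sound event roll, a binary matrix with one row per event class and one column per time frame. The sketch below is illustrative only: the function name `events_to_roll`, the resolution of 10 frames per second, and the fixed 10-second clip length are assumptions, not details stated on this page.

```python
import math

def events_to_roll(events, duration=10.0, frame_rate=10):
    """Rasterize {label: [(start_s, end_s), ...]} into a binary roll.

    Returns (labels, roll), where roll[i][t] is 1 if labels[i] is active
    at any point inside frame t. frame_rate is an assumed resolution.
    """
    n_frames = round(duration * frame_rate)
    labels = sorted(events)
    roll = [[0] * n_frames for _ in labels]
    for i, label in enumerate(labels):
        for start, end in events[label]:
            # Round before floor/ceil to absorb floating-point noise.
            first = int(round(start * frame_rate, 6))
            last = min(n_frames, math.ceil(round(end * frame_rate, 6)))
            for t in range(first, last):
                roll[i][t] = 1
    return labels, roll

# Example using the "Cat" annotation from the table above.
labels, roll = events_to_roll({"Cat": [(0.98, 2.29), (9.03, 10.00)]})
print(labels, sum(roll[0]), "active frames out of", len(roll[0]))
```

A frame is marked active if the event overlaps it at all, so short events are never dropped; a stricter rule (for example, majority overlap) would be an equally valid choice.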

Caption	Loudness	T2A-Adapter (Ours)	T2A-ControlNet (Ours)	EzAudio-L-Energy	EzAudio-XL	GT
A man speaks and an arrow is shot as more speech is heard.
A humming noise is heard with dishes, pots, and pans being moved on a surface.
Dogs are whimpering, howling, and other domestic animals are heard.
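
A loudness condition is a time-varying curve with one value per frame. This page does not say how the curve is computed, so the sketch below assumes a common stand-in: frame-wise RMS energy in decibels. The function name `loudness_curve`, the 0.1-second frame length, and the -80 dB floor are all hypothetical choices.

```python
import math

def loudness_curve(samples, sample_rate, frame_seconds=0.1, floor_db=-80.0):
    """Frame-wise RMS level in dB (relative to full scale) for a mono signal.

    Assumed definition of "loudness"; silent frames are clamped to floor_db.
    """
    frame_len = max(1, round(sample_rate * frame_seconds))
    curve = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        level = 20.0 * math.log10(rms) if rms > 0 else floor_db
        curve.append(max(floor_db, level))
    return curve

# One second of a full-scale 100 Hz sine followed by one second of silence.
sr = 1000
tone = [math.sin(2 * math.pi * 100 * n / sr) for n in range(sr)]
curve = loudness_curve(tone + [0.0] * sr, sr)
print(len(curve), round(curve[0], 2), curve[-1])
```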

Caption	Pitch	T2A-Adapter (Ours)	T2A-ControlNet (Ours)	FluxAudio	EzAudio-XL	GT
An alarm and beeping sounds alternate with music and ratcheting noises.
A man is speaking, breathing, and answering the phone while other sounds like clicking and thumping occur.
Various mechanisms make sounds in the background while a vacuum cleaner is used and there is female speech.

Caption	Input Controls (Loudness+Events)	Generated Audio (T2A-Adapter)
Telephone ringing, dialing, and speech occur in a small room amidst laughter and hubbub.	Loudness; Sound events: Telephone bell ringing: 0.52s-1.43s, 4.14s-5.05s; Speech: 5.05s-6.54s; Male speech, man speaking: 7.59s-10.00s	Loudness (Generated audio)
Music and sizzling sounds accompany a man singing and speaking, and slapping is heard.	Loudness; Sound events: Sizzle: 0.02s-9.97s; Male speech, man speaking: 5.69s-5.91s	Loudness (Generated audio)
People are speaking outside and dogs barking.	Loudness; Sound events: Telephone bell ringing: 0.52s-1.43s, 4.14s-5.05s; Speech: 5.05s-6.54s; Male speech, man speaking: 7.59s-10.00s	Loudness (Generated audio)

The core idea is to attach an auxiliary conditional network that injects control signals into the pretrained FluxAudio backbone, enabling controllable audio generation without retraining the backbone. We introduce two variants of Audio ControlNet: T2A-ControlNet and T2A-Adapter. T2A-ControlNet replicates FluxAudio, including both its MMDiT and DiT layers, and directly adds its processed latent to each layer in FluxAudio. T2A-Adapter instead employs a lightweight encoder that extracts features from the time-varying conditions and injects them into the latent of the backbone.
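
The adapter-style injection described above can be sketched as a toy example in plain Python. This is a minimal illustration, not the actual model: the encoder here is a single linear projection, the injection is a plain element-wise addition, and the names `encode_condition`, `inject`, and `scale` are hypothetical. The real encoder architecture and the exact injection points are not specified on this page.

```python
def encode_condition(condition, weights):
    """Project a (T x C) time-varying condition to (T x D) features.

    Stand-in for the lightweight encoder; weights is a (C x D) matrix.
    """
    columns = list(zip(*weights))  # D columns, each of length C
    return [[sum(c * w for c, w in zip(frame, col)) for col in columns]
            for frame in condition]

def inject(latent, features, scale=1.0):
    """Add condition features to the backbone latent, frame by frame.

    The backbone itself is untouched; only the added features depend on
    trainable parameters. scale=0.0 recovers the unconditioned latent.
    """
    return [[h + scale * f for h, f in zip(h_row, f_row)]
            for h_row, f_row in zip(latent, features)]

# Three frames of a 1-D loudness condition, projected to a 2-D latent space.
condition = [[1.0], [0.0], [2.0]]
weights = [[0.5, -1.0]]
latent = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(inject(latent, encode_condition(condition, weights)))
```

In this toy picture, the T2A-ControlNet variant differs in where the added features come from: a replicated copy of the backbone produces them for each layer, rather than a small encoder producing them once.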

Our experiments demonstrate that both T2A-ControlNet and T2A-Adapter achieve consistent improvements in controlling loudness, pitch, and sound events. Notably, T2A-Adapter achieves state-of-the-art performance on AudioSet-Strong in both event-level and segment-level F1 scores while introducing only 38M additional parameters. Subjective evaluations further indicate that our methods outperform all baseline models.
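
For readers unfamiliar with the metric, a segment-level F1 score compares a reference event roll with a roll detected in the generated audio, after pooling frames into fixed-length segments. The sketch below is a simplified illustration under assumptions: the function name `segment_f1`, the segment length, and micro-averaging over classes are choices made here, and the evaluation protocol actually used for the reported numbers is not described on this page.

```python
def segment_f1(ref_roll, est_roll, frames_per_segment=10):
    """Micro-averaged segment-level F1 between two binary event rolls.

    Each roll is a list of rows (one per event class) of 0/1 frames.
    A class counts as active in a segment if any frame in it is active.
    """
    tp = fp = fn = 0
    for ref_row, est_row in zip(ref_roll, est_roll):
        for s in range(0, len(ref_row), frames_per_segment):
            ref_active = any(ref_row[s:s + frames_per_segment])
            est_active = any(est_row[s:s + frames_per_segment])
            if ref_active and est_active:
                tp += 1
            elif est_active:
                fp += 1
            elif ref_active:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# One class, three 2-frame segments: one hit and one false alarm.
ref = [[1, 1, 0, 0, 0, 0]]
est = [[1, 0, 0, 0, 1, 0]]
print(round(segment_f1(ref, est, frames_per_segment=2), 3))
```

Event-level F1 is stricter: it matches whole events by onset (and optionally offset) within a tolerance rather than comparing pooled segments.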

Audio ControlNet for Fine-Grained Audio Generation and Editing

Sound Events Controlled Audio Generation

Loudness Controlled Audio Generation

Pitch Controlled Audio Generation

Multi-condition Control

Overall Architecture

Main Results