UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

arXiv Paper GitHub Code
Text-to-Audio Text-to-Speech Zero-shot TTS Mixed Speech+Audio Audio Editing Speech Editing Timed Generation Flow Matching MM-DiT LLM Fusion
Abstract. We present UNISON, a unified latent diffusion framework for audio generation and editing. Using a single set of weights, UNISON integrates speech generation, sound generation, and audio editing within one model. The model supports Text-to-Audio, Text-to-Speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, and speech-in-scene editing. It features layer-wise deep LLM fusion, which injects depth-matched semantic conditioning from a frozen large language model to improve instruction following for compositional audio prompts.
UNISON overview teaser
[Teaser figure — place images/fig1.png here]

Figure 1. Overview of UNISON. A single flow-matching model handles text-to-audio generation, zero-shot TTS, gender control, audio-scene editing, and timed temporal composition. All tasks share the same architecture and weights, differentiated only by a task mask channel and optional source latent concatenation.

0

Model Architecture

All tasks share the same VAE encoder/decoder, MM-DiT backbone, and forward pass. Task identity is encoded solely by a channel-wise mask; source audio is concatenated as additional input channels in the latent space via VAE encoding. Text conditioning uses layer-wise deep LLM fusion: hidden states from uniformly sampled layers of the frozen Qwen2.5-Omni-7B backbone are injected into corresponding MM-DiT double-stream blocks via learned linear projections.

UNISON architecture diagram
[Architecture figure — place images/fig2.png here]

Figure 2. UNISON architecture. The frozen MLLM backbone (Qwen2.5-Omni-7B) provides layer-wise hidden states injected into the corresponding MM-DiT double-stream blocks via learned projections. Task identity is encoded by a channel-wise mask; source audio is provided via VAE-encoded channel concatenation.

Key Results (single checkpoint)

T2A · AudioCaps
1.56
FAD (VGGish) ↓
T2A · AudioCaps
0.503
CLAP ↑
Zero-shot TTS (EN)
1.50%
WER ↓
Zero-shot TTS (ZH)
0.89%
CER ↓
TTS (ZH)
0.92%
CER ↓
TTS (EN)
1.27%
CER ↓
TTS Gender
100%
Accuracy ↑
Mixed Speech+Audio
0.444
CLAP ↑
Audio Editing
0.364
CLAP ↑
Timed Composition
0.308
Per-seg CLAP ↑
1

Text-to-Audio Generation

Input: a text caption of a sound scene. Instruction format: [Audio] {caption}. Source latent zs = zeros; task mask m = 0. Output: 10-second audio clip. Evaluated on the AudioCaps test set (881 clips).

Prompt Ground Truth UNISON-16k (ours) UNISON-44k (ours) Audio-Omni GenAU-L Tango MMAudio-L Make-An-Audio 2 AudioLDM2-Large Stable Audio Open
Clip-clops gallop as the wind blows and thunder cracks
A train running on railroad tracks as a train horn blows and steam hisses
A very low-pitched hum occurs, followed by an explosion
Wind blowing followed by people speaking then a loud burst of thunder
A man speaking over an intercom as a helicopter engine runs followed by several gunshots firing
Footsteps shuffling on snow alongside a camera muffling while wind blows into a microphone
Tribal drums playing as footsteps shuffle on wet dirt as frogs and crickets chirp in the background
2

Text-to-Speech (Instruction-based)

Input: a plain-text instruction specifying speaker gender and content. Instruction format: [Speech] A {gender} voice saying “{text}”. Source latent zs = zeros; task mask m = 0. No phoneme encoder or speaker embedding used. Output: speech audio in the specified gender. Evaluated on the Seed-TTS test set (EN & ZH).
Note: Audio-Omni does not support explicit gender control; its output defaults to a male voice and is shown in the male rows. Audio-Omni also does not support Chinese TTS; those cells are left blank.

Instruction UNISON 16k (ours) UNISON 44k (ours) Audio-Omni
EN · female
A female voice saying "It now has five chapters: Butterfield Trail, Magazine Mountain, Ozark, Razorback and Cornerstone."
EN · male
A male voice saying "It now has five chapters: Butterfield Trail, Magazine Mountain, Ozark, Razorback and Cornerstone."
EN · female
A female voice saying "Some large firms or specialized jobs require a master's degree."
EN · male
A male voice saying "Some large firms or specialized jobs require a master's degree."
EN · female
A female voice saying "He is a good player and deserves a chance at this level."
EN · male
A male voice saying "He is a good player and deserves a chance at this level."
ZH · female
A female voice saying “你是不是想死呀你,你问什么呀你。”
ZH · male
A male voice saying “你是不是想死呀你,你问什么呀你。”
ZH · female
A female voice saying “大大的眼睛望着镜头闪闪发亮,肉肉的小胳膊小腿也惹人怜爱。”
ZH · male
A male voice saying “大大的眼睛望着镜头闪闪发亮,肉肉的小胳膊小腿也惹人怜爱。”
ZH · female
A female voice saying “尾号四六七幺的乘客刚夸了你,你是电你是光,你是出行的神话。”
ZH · male
A male voice saying “尾号四六七幺的乘客刚夸了你,你是电你是光,你是出行的神话。”
3

Zero-shot Speaker Cloning

Input: a short reference audio clip (3–10 s) + target text. Instruction format: [Speech with voice] {text}. Source latent zs = VAE(reference clip + zero padding); task mask m = 2. No dedicated speaker encoder. Output: full target utterance in the reference speaker's voice. Evaluated on the Seed-TTS test set (EN & ZH).

Target Text Reference Speaker Ground Truth UNISON 16k (ours) UNISON 44k (ours) Audio-Omni
EN · zero-shot
"The King of Portugal was being shaved by his barber."
reference
EN · zero-shot
"The bark of the pine tree was shiny and dark."
reference
ZH · zero-shot
"是对全球时装产业,与时装学院伟大成就的曼妙回眸。"
reference
ZH · zero-shot
"顺风时提高警惕,逆风时笃定前行。"
reference
4

Mixed Speech-and-Audio Generation (T2AS)

Input: a joint instruction specifying both speech and background sound. Instruction format: [Speech] … [Audio] … (speech tokens describe what to say and gender; audio tokens describe the background soundscape). Source latent zs = zeros; task mask m = 0. Output: a unified audio clip containing intelligible speech mixed with the described background, generated in a single forward pass.

Instruction UNISON 16k (ours) UNISON 44k (ours)
Speech (female) · EN
“She put on her slippers, scaled the stairs, and went to bed.”
Background: A bird is cooing and flapping its wings
Speech (female) · EN
“He chairs the executive committee of council and the community policing committee.”
Background: Bee buzzes while man speaks
Speech (male) · EN
“It then became involved with pharmaceuticals, food additives, and industrial and consumer chemicals.”
Background: A man speaks with a high frequency hum and some banging and clanking
Speech (male) · ZH
“目前,葛某及其朋友也被另案处理。”
Background: Several very loud explosions occur
Speech (female) · ZH
“二儿子同样是不会浪费时间的人。”
Background: A piano playing as plastic bonks
Speech (female) · ZH
“不光俺村群众不方便,谁从这过都不方便。”
Background: Man speaks in a crowd, a distant horn blows, then a race car goes by
5

Audio Scene Editing (Add / Remove / Replace)

Input: source audio clip + text edit instruction. Instruction formats — Add: [Edit][Audio] Add {event}; Remove: [Edit][Audio] Remove {event}; Replace: [Edit][Audio] Remove {old} [Edit][Audio] Add {new}. Source latent zs = VAE(source clip); task mask m = 1. Output: edited audio clip. Compared with MMEDIT, AudioEditCode (DDPM & SDEdit), and Audio-Omni.

Edit Instruction Source Audio UNISON 16k (ours) UNISON 44k (ours) Audio-Omni MMEdit AudioEditCode (DDPM) AudioEditCode (SDEdit)
Add
Add Thunder to the background.
source
Add
Add Bow-wow and Whimper (dog) to the background.
source
Remove
Eliminate the sound of Gunshot, gunfire.
source
Remove
Eliminate the sound of Bird and Domestic animals, pets.
source
Replace
Remove the Duck. Add Sizzle.
source
Replace
Remove the Burst, pop. Add Telephone dialing, DTMF.
source
Replace
Remove the Bell. Add Truck and Skidding.
source
6

Speech-in-Scene Editing (Insert / Delete / Rewrite)

Input: source audio clip (speech-in-scene) + text edit instruction. Instruction formats — Insert: [Edit][Speech] Add a voice saying “{text}”; Delete: [Edit][Speech] Remove the speech; Rewrite: [Edit][Speech] Change speech to “{new}”. Source latent zs = VAE(background) or VAE(background + old speech); task mask m = 1. Output: edited audio with modified speech, background preserved.

Edit Instruction Source Audio UNISON 16k (ours) UNISON 44k (ours)
Rewrite content
Alter the spoken content to “I will not be alarmed though your sister does play so well.”
source
Insert speech
Add a voice saying “The facade is constructed from white Portland stone and red brick.”
source
Delete speech
Remove the speech; keep the background sounds.
source
7

Speech Denoising

Input: noisy speech recording. Instruction format: [Edit][Speech] Denoise the record. Source latent zs = VAE(noisy speech); task mask m = 1. Output: clean speech waveform (background noise removed).

Transcript Noisy Input Clean Reference UNISON 16k (ours) UNISON 44k (ours)
EN · SNR 11.3 dB
“It does not comment on specific disputes.”
noisy
clean
EN · SNR 6.4 dB
“Man with guitar holding a microphone in one hand and extending his other arm.”
noisy
clean
ZH · SNR 9.4 dB
“乘车系好安全带,将你安全带向爱。”
noisy
clean
ZH · SNR 10.9 dB
“自动驾驶将为现有的司机运力提供补充。”
noisy
clean
8

Timed Temporal Composition

Input: a text instruction with explicit temporal anchors. Instruction format: [Audio] From {t1}s to {t2}s, {event}. From {t3}s to {t4}s, {event}. … Source latent zs = zeros; task mask m = 0. Timestamps parsed directly by the frozen LLM. Output: 10-second audio where each sound event occurs at the specified time interval.

Timed Instruction UNISON 16k (ours) UNISON 44k (ours)
From 0.0s to 3.1s, Ambulance (siren). From 3.6s to 6.1s, Chainsaw. From 6.6s to 10.0s, Subway, metro, underground.
From 0.0s to 4.1s, Thunder. From 4.6s to 10.0s, Stream.
From 0.0s to 4.2s, Rub. From 4.7s to 10.0s, Clip-clop.
[Overlapping] From 0.0s to 4.8s, Children playing. From 3.1s to 7.9s, Ambulance (siren). From 6.1s to 10.0s, Chainsaw.