UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

Abstract. We present UNISON, a unified latent diffusion framework for audio generation and editing. Using a single set of weights, UNISON integrates speech generation, sound generation, and audio editing within one model. The model supports Text-to-Audio, Text-to-Speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, and speech-in-scene editing. It features layer-wise deep LLM fusion, which injects depth-matched semantic conditioning from a frozen large language model to improve instruction following for compositional audio prompts.

Figure 1. Overview of UNISON. A single flow-matching model handles text-to-audio generation, zero-shot TTS, gender control, audio-scene editing, and timed temporal composition. All tasks share the same architecture and weights, differentiated only by a task mask channel and optional source latent concatenation.

Model Architecture

All tasks share the same VAE encoder/decoder, MM-DiT backbone, and forward pass. Task identity is encoded solely by a channel-wise mask; source audio is VAE-encoded and concatenated to the input as additional latent channels. Text conditioning uses layer-wise deep LLM fusion: hidden states from uniformly sampled layers of the frozen Qwen2.5-Omni-7B backbone are injected into the corresponding MM-DiT double-stream blocks via learned linear projections.
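The depth matching described above can be illustrated with a minimal sketch. It assumes that "uniformly sampled" means evenly spaced layer indices from the shallowest to the deepest LLM layer, one per double-stream block. The function name `sample_llm_layers`, the layer and block counts (28 and 8), the toy dimensions, and the random matrices standing in for learned projections are all hypothetical choices for illustration, not values taken from this page.

```python
import numpy as np


def sample_llm_layers(num_llm_layers: int, num_dit_blocks: int) -> list[int]:
    """Pick one LLM layer index per DiT block, evenly spaced in depth.

    Assumption: even spacing with rounding; the actual sampling rule
    used by UNISON is not specified here.
    """
    if not 1 <= num_dit_blocks <= num_llm_layers:
        raise ValueError("need 1 <= num_dit_blocks <= num_llm_layers")
    positions = np.linspace(0, num_llm_layers - 1, num_dit_blocks)
    return np.round(positions).astype(int).tolist()


# Toy sizes, chosen only so the example runs quickly.
rng = np.random.default_rng(0)
d_llm, d_dit, num_tokens = 16, 8, 5

layers = sample_llm_layers(num_llm_layers=28, num_dit_blocks=8)

# Stand-ins for frozen LLM hidden states, one (tokens, d_llm) array per layer.
hidden_states = {l: rng.standard_normal((num_tokens, d_llm)) for l in layers}

# Stand-ins for the learned linear projections, one per DiT block.
projections = [rng.standard_normal((d_llm, d_dit)) for _ in layers]

# Block i receives the projected hidden state of its depth-matched layer.
conditioning = [hidden_states[l] @ w for l, w in zip(layers, projections)]
```

In this sketch, shallow blocks are conditioned on shallow LLM layers and deep blocks on deep ones, which is one plausible reading of "depth-matched" conditioning.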

Figure 2. UNISON architecture. The frozen MLLM backbone (Qwen2.5-Omni-7B) provides layer-wise hidden states injected into the corresponding MM-DiT double-stream blocks via learned projections. Task identity is encoded by a channel-wise mask; source audio is provided via VAE-encoded channel concatenation.

Key Results (single checkpoint)

Task	Metric	Value
T2A · AudioCaps	FAD (VGGish) ↓	1.56
T2A · AudioCaps	CLAP ↑	0.503
Zero-shot TTS (EN)	WER ↓	1.50%
Zero-shot TTS (ZH)	CER ↓	0.89%
TTS (ZH)	CER ↓	0.92%
TTS (EN)	CER ↓	1.27%
TTS Gender	Accuracy ↑	100%
Mixed Speech+Audio	CLAP ↑	0.444
Audio Editing	CLAP ↑	0.364
Timed Composition	Per-seg CLAP ↑	0.308

Text-to-Audio Generation

Input: a text caption of a sound scene. Instruction format: [Audio] {caption}. Source latent z_s = zeros; task mask m = 0. Output: 10-second audio clip. Evaluated on the AudioCaps test set (881 clips).
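As a rough illustration of how one forward pass can serve every task, the sketch below assembles the model input from a noisy latent, a source latent, and a task mask. It assumes the three parts are stacked along the channel axis and that the mask is a single constant-valued channel; the function name `build_model_input`, the `(channels, frames)` layout, and the toy shapes are hypothetical. For text-to-audio, the source latent is all zeros and the mask value is 0, as stated above.

```python
import numpy as np


def build_model_input(noisy_latent, source_latent=None, task_mask=0):
    """Stack [noisy latent, source latent, mask channel] along channels.

    noisy_latent:  array of shape (channels, frames)
    source_latent: same shape, or None for pure generation (zeros are used)
    task_mask:     scalar task id (0 = generation, 1 = editing, 2 = cloning)

    Assumption: the mask is one extra channel filled with the task id.
    """
    _, frames = noisy_latent.shape
    if source_latent is None:
        source_latent = np.zeros_like(noisy_latent)
    if source_latent.shape != noisy_latent.shape:
        raise ValueError("source latent must match the noisy latent shape")
    mask = np.full((1, frames), float(task_mask), dtype=noisy_latent.dtype)
    return np.concatenate([noisy_latent, source_latent, mask], axis=0)


# Text-to-audio: no source audio, mask value 0.
z_t = np.ones((4, 10), dtype=np.float32)
x = build_model_input(z_t)
```

The same function covers editing by passing a VAE-encoded source clip with `task_mask=1`, so only the input channels change between tasks.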

Prompt	Ground Truth	UNISON-16k (ours)	UNISON-44k (ours)	Audio-Omni	GenAU-L	Tango	MMAudio-L	Make-An-Audio 2	AudioLDM2-Large	Stable Audio Open
Clip-clops gallop as the wind blows and thunder cracks
A train running on railroad tracks as a train horn blows and steam hisses
A very low-pitched hum occurs, followed by an explosion
Wind blowing followed by people speaking then a loud burst of thunder
A man speaking over an intercom as a helicopter engine runs followed by several gunshots firing
Footsteps shuffling on snow alongside a camera muffling while wind blows into a microphone
Tribal drums playing as footsteps shuffle on wet dirt as frogs and crickets chirp in the background

Text-to-Speech (Instruction-based)

Input: a plain-text instruction specifying speaker gender and content. Instruction format: [Speech] A {gender} voice saying “{text}”. Source latent z_s = zeros; task mask m = 0. No phoneme encoder or speaker embedding used. Output: speech audio in the specified gender. Evaluated on the Seed-TTS test set (EN & ZH).
Note: Audio-Omni does not support explicit gender control; its output defaults to a male voice and is shown in the male rows. Audio-Omni also does not support Chinese TTS; those cells are left blank.

Instruction	UNISON 16k (ours)	UNISON 44k (ours)	Audio-Omni
EN · female A female voice saying "It now has five chapters: Butterfield Trail, Magazine Mountain, Ozark, Razorback and Cornerstone."			—
EN · male A male voice saying "It now has five chapters: Butterfield Trail, Magazine Mountain, Ozark, Razorback and Cornerstone."
EN · female A female voice saying "Some large firms or specialized jobs require a master's degree."			—
EN · male A male voice saying "Some large firms or specialized jobs require a master's degree."
EN · female A female voice saying "He is a good player and deserves a chance at this level."			—
EN · male A male voice saying "He is a good player and deserves a chance at this level."
ZH · female A female voice saying “你是不是想死呀你，你问什么呀你。”			—
ZH · male A male voice saying “你是不是想死呀你，你问什么呀你。”			—
ZH · female A female voice saying “大大的眼睛望着镜头闪闪发亮，肉肉的小胳膊小腿也惹人怜爱。”			—
ZH · male A male voice saying “大大的眼睛望着镜头闪闪发亮，肉肉的小胳膊小腿也惹人怜爱。”			—
ZH · female A female voice saying “尾号四六七幺的乘客刚夸了你，你是电你是光，你是出行的神话。”			—
ZH · male A male voice saying “尾号四六七幺的乘客刚夸了你，你是电你是光，你是出行的神话。”			—

Zero-shot Speaker Cloning

Input: a short reference audio clip (3–10 s) + target text. Instruction format: [Speech with voice] {text}. Source latent z_s = VAE(reference clip + zero padding); task mask m = 2. No dedicated speaker encoder. Output: full target utterance in the reference speaker's voice. Evaluated on the Seed-TTS test set (EN & ZH).
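The "reference clip + zero padding" step can be sketched as follows. This assumes the padding is applied to the waveform before VAE encoding and that silence is appended after the reference; the function name `pad_reference`, the 16 kHz sample rate, and the 10-second total length are hypothetical parameters chosen for illustration.

```python
import numpy as np


def pad_reference(reference, sample_rate, total_seconds=10.0):
    """Place the reference clip at the start and fill the rest with zeros.

    Assumption: padding is appended after the reference in the waveform
    domain; the padded signal would then be passed to the VAE encoder.
    """
    total = int(round(total_seconds * sample_rate))
    if len(reference) > total:
        raise ValueError("reference clip is longer than the target length")
    padded = np.zeros(total, dtype=reference.dtype)
    padded[: len(reference)] = reference
    return padded


# A 3-second stand-in reference clip at an assumed 16 kHz sample rate.
reference = np.ones(3 * 16000, dtype=np.float32)
source_waveform = pad_reference(reference, sample_rate=16000)
```

Under this reading, the zero region marks where the model must synthesize the target utterance, while the non-zero region carries the speaker's voice.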

Target Text	Reference Speaker	Audio-Omni
EN · zero-shot "The King of Portugal was being shaved by his barber."	reference
EN · zero-shot "The bark of the pine tree was shiny and dark."	reference
ZH · zero-shot "是对全球时装产业，与时装学院伟大成就的曼妙回眸。"	reference	—
ZH · zero-shot "顺风时提高警惕，逆风时笃定前行。"	reference	—

Mixed Speech-and-Audio Generation (T2AS)

Input: a joint instruction specifying both speech and background sound. Instruction format: [Speech] … [Audio] … (speech tokens describe what to say and gender; audio tokens describe the background soundscape). Source latent z_s = zeros; task mask m = 0. Output: a unified audio clip containing intelligible speech mixed with the described background, generated in a single forward pass.

Instruction	UNISON 16k (ours)	UNISON 44k (ours)
Speech (female) · EN “She put on her slippers, scaled the stairs, and went to bed.” Background: A bird is cooing and flapping its wings
Speech (female) · EN “He chairs the executive committee of council and the community policing committee.” Background: Bee buzzes while man speaks
Speech (male) · EN “It then became involved with pharmaceuticals, food additives, and industrial and consumer chemicals.” Background: A man speaks with a high frequency hum and some banging and clanking
Speech (male) · ZH “目前，葛某及其朋友也被另案处理。” Background: Several very loud explosions occur
Speech (female) · ZH “二儿子同样是不会浪费时间的人。” Background: A piano playing as plastic bonks
Speech (female) · ZH “不光俺村群众不方便，谁从这过都不方便。” Background: Man speaks in a crowd, a distant horn blows, then a race car goes by

Audio Scene Editing (Add / Remove / Replace)

Input: source audio clip + text edit instruction. Instruction formats. Add: [Edit][Audio] Add {event}; Remove: [Edit][Audio] Remove {event}; Replace: [Edit][Audio] Remove {old} [Edit][Audio] Add {new}. Source latent z_s = VAE(source clip); task mask m = 1. Output: edited audio clip. Compared with MMEdit, AudioEditCode (DDPM & SDEdit), and Audio-Omni.
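The three instruction templates above can be generated with a small helper. This is a sketch only: the function name `edit_instruction` and its argument names are hypothetical, and the strings follow the templates exactly as written in this section.

```python
def edit_instruction(op, event, new_event=None):
    """Format an audio-scene edit instruction from the templates above.

    op:        "add", "remove", or "replace"
    event:     the event to add or remove (the old event for "replace")
    new_event: the event to add, required only for "replace"
    """
    if op == "add":
        return f"[Edit][Audio] Add {event}"
    if op == "remove":
        return f"[Edit][Audio] Remove {event}"
    if op == "replace":
        if new_event is None:
            raise ValueError("replace needs both an old and a new event")
        # Replace is expressed as a removal followed by an addition.
        return f"[Edit][Audio] Remove {event} [Edit][Audio] Add {new_event}"
    raise ValueError(f"unknown edit operation: {op!r}")


print(edit_instruction("replace", "Duck", "Sizzle"))
```

Expressing Replace as Remove followed by Add means the model only needs to learn two primitive edit operations in this format.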

Edit Instruction	Source Audio	UNISON 16k (ours)	UNISON 44k (ours)	Audio-Omni	MMEdit	AudioEditCode (DDPM)	AudioEditCode (SDEdit)
Add: Add Thunder to the background.	source
Add: Add Bow-wow and Whimper (dog) to the background.	source
Remove: Eliminate the sound of Gunshot, gunfire.	source
Remove: Eliminate the sound of Bird and Domestic animals, pets.	source
Replace: Remove the Duck. Add Sizzle.	source
Replace: Remove the Burst, pop. Add Telephone dialing, DTMF.	source
Replace: Remove the Bell. Add Truck and Skidding.	source

Speech-in-Scene Editing (Insert / Delete / Rewrite)

Input: source audio clip (speech-in-scene) + text edit instruction. Instruction formats — Insert: [Edit][Speech] Add a voice saying “{text}”; Delete: [Edit][Speech] Remove the speech; Rewrite: [Edit][Speech] Change speech to “{new}”. Source latent z_s = VAE(background) or VAE(background + old speech); task mask m = 1. Output: edited audio with modified speech, background preserved.

Edit Instruction	Source Audio	UNISON 16k (ours)	UNISON 44k (ours)
Rewrite content: Alter the spoken content to “I will not be alarmed though your sister does play so well.”	source
Insert speech: Add a voice saying “The facade is constructed from white Portland stone and red brick.”	source
Delete speech: Remove the speech; keep the background sounds.	source

Speech Denoising

Input: noisy speech recording. Instruction format: [Edit][Speech] Denoise the record. Source latent z_s = VAE(noisy speech); task mask m = 1. Output: clean speech waveform (background noise removed).

Transcript	Noisy Input	Clean Reference
EN · SNR 11.3 dB “It does not comment on specific disputes.”	noisy	clean
EN · SNR 6.4 dB “Man with guitar holding a microphone in one hand and extending his other arm.”	noisy	clean
ZH · SNR 9.4 dB “乘车系好安全带，将你安全带向爱。”	noisy	clean
ZH · SNR 10.9 dB “自动驾驶将为现有的司机运力提供补充。”	noisy	clean

Timed Temporal Composition

Input: a text instruction with explicit temporal anchors. Instruction format: [Audio] From {t1}s to {t2}s, {event}. From {t3}s to {t4}s, {event}. … Source latent z_s = zeros; task mask m = 0. Timestamps parsed directly by the frozen LLM. Output: 10-second audio where each sound event occurs at the specified time interval.
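A timed instruction in the format above can be assembled from a list of segments. The sketch below is an illustration under stated assumptions: the function name `timed_instruction`, the tuple layout `(start, end, event)`, and the validation against a 10-second clip are hypothetical; only the output string format comes from this section.

```python
def timed_instruction(segments, clip_seconds=10.0):
    """Build "[Audio] From {t1}s to {t2}s, {event}. ..." from segments.

    segments: iterable of (start_seconds, end_seconds, event_name).
    Overlapping segments are allowed; each must lie inside the clip.
    """
    parts = []
    for start, end, event in segments:
        if not 0.0 <= start < end <= clip_seconds:
            raise ValueError(f"invalid interval for {event!r}: {start} to {end}")
        # One decimal place matches the timestamps shown in the examples.
        parts.append(f"From {start:.1f}s to {end:.1f}s, {event}.")
    return "[Audio] " + " ".join(parts)


print(timed_instruction([(0.0, 4.1, "Thunder"), (4.6, 10.0, "Stream")]))
```

Because the timestamps are plain text, no separate timing encoder is needed in this format; the frozen LLM reads them as part of the instruction.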

Timed Instruction	UNISON 16k (ours)	UNISON 44k (ours)
From 0.0s to 3.1s, Ambulance (siren). From 3.6s to 6.1s, Chainsaw. From 6.6s to 10.0s, Subway, metro, underground.
From 0.0s to 4.1s, Thunder. From 4.6s to 10.0s, Stream.
From 0.0s to 4.2s, Rub. From 4.7s to 10.0s, Clip-clop.
[Overlapping] From 0.0s to 4.8s, Children playing. From 3.1s to 7.9s, Ambulance (siren). From 6.1s to 10.0s, Chainsaw.