High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers
Heitor R. Guimarães1,2, Jiaqi Su1, Rithesh Kumar1, Tiago H. Falk2, Zeyu Jin1
1 Adobe Research, 2 Université du Québec (INRS-EMT)
Abstract: Real-world speech recordings frequently suffer from degradations such as background noise and reverberation. Speech enhancement aims to mitigate these issues by generating clean, high-fidelity signals. While recent generative approaches to speech enhancement have shown promising results, they still face two major challenges: (1) content hallucination, where the model generates plausible phonemes that differ from the original utterance; and (2) inconsistency, a failure to preserve the speaker's identity and the prosodic cues of the input speech. In this work, we introduce DiTSE (Diffusion Transformer for Speech Enhancement), which addresses the quality issues of degraded speech at full bandwidth. Our approach employs a latent diffusion model together with robust conditioning features, effectively addressing these challenges while remaining computationally efficient. Experimental results from both subjective and objective evaluations demonstrate that DiTSE produces studio-quality audio while preserving speaker identity. Furthermore, DiTSE significantly improves content fidelity and alleviates hallucinations, reducing the word error rate (WER) across datasets compared to state-of-the-art enhancers.
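To make the pipeline concrete, here is a minimal sketch of conditional reverse diffusion over latents: encode the degraded input, iteratively refine a noisy latent under conditioning features, then decode. This is a toy illustration, not the DiTSE implementation; the function names, the deterministic update rule, and the tensor shapes are all assumptions for illustration only.

```python
import numpy as np

def reverse_diffusion(z_T, cond, denoise_fn, num_steps=50):
    """Toy reverse-diffusion loop over a latent sequence.

    z_T:        initial noisy latent, shape (batch, channels, frames)
    cond:       conditioning features from the degraded input (hypothetical)
    denoise_fn: denoise_fn(z, cond, t) -> estimate of the clean latent
                (a stand-in for the diffusion transformer)
    """
    z = z_T.copy()
    for step in range(num_steps, 0, -1):
        t = step / num_steps                    # normalized time in (0, 1]
        z_hat = denoise_fn(z, cond, t)          # model's clean-latent estimate
        z = z + (z_hat - z) / step              # move z toward the estimate
    return z                                    # decoded to audio by a codec/vocoder

# Toy usage: a "denoiser" that always predicts zeros pulls z to zero.
rng = np.random.default_rng(0)
z_T = rng.standard_normal((1, 8, 16))           # (batch, channels, frames)
cond = rng.standard_normal((1, 32, 16))         # conditioning features
z_clean = reverse_diffusion(z_T, cond, lambda z, c, t: np.zeros_like(z))
print(z_clean.shape)  # (1, 8, 16)
```

In a real system the latent would come from a neural audio codec and `denoise_fn` would be the conditioned transformer; the point here is only the iterative, conditioned refinement loop.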
For more details, read our paper on arXiv.
Audio demos (players omitted in this text version) compare the following systems on five test sets: DAPS (48 kHz), DEMO (16 kHz), AQECC (16 kHz), VBDMD (48 kHz), and EARS (48 kHz):

- Noisy Input
- HiFi-GAN-2
- SGMSE+
- Genhancer
- DiTSE [Tiny] (ours)
- DiTSE [Base] (ours)
- DiTSE [Base + Post] (ours)
To foster collaboration and comparisons, we release the entire evaluation sets in the following repository: [URL]
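For reproducing the content-fidelity comparison on these evaluation sets, WER is the standard word-level edit-distance metric between an ASR transcript of the enhanced audio and the reference transcript. Below is a minimal, self-contained implementation of that metric (not code from the paper; in practice a library such as `jiwer` plus a pretrained ASR model would be used):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# A hallucinated word ("jumped" for "jumps") counts as one substitution.
print(word_error_rate("the quick brown fox jumps",
                      "the quick brown fox jumped"))  # 0.2
```

Lower WER on enhanced audio indicates fewer hallucinated or dropped words relative to the reference transcript.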