High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers
Heitor R. Guimarães1,2, Jiaqi Su1, Rithesh Kumar1, Tiago H. Falk2, Zeyu Jin1
1 Adobe Research, 2 Université du Québec (INRS-EMT)
Abstract: Real-world speech recordings frequently suffer from degradations such as background noise and reverberation. Speech enhancement aims to mitigate these issues by generating clean high-fidelity signals. While recent generative approaches for speech enhancement have shown promising results, they still face two major challenges: (1) content hallucination, where the generated phonemes are plausible but differ from the original utterance; and (2) inconsistency, where the output fails to preserve the speaker's identity and prosodic cues from the input speech. In this work, we introduce DiTSE (Diffusion Transformer for Speech Enhancement), which addresses quality issues of degraded speech at full bandwidth. Our approach employs a latent diffusion model together with robust conditioning features, effectively addressing these challenges while remaining computationally efficient. Experimental results from both subjective and objective evaluations demonstrate that DiTSE produces studio-quality audio while preserving speaker identity. Furthermore, DiTSE significantly improves content fidelity and alleviates hallucinations, reducing the word error rate (WER) across datasets compared to state-of-the-art enhancers.
For more details, read our paper on arXiv.
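The word error rate (WER) cited in the abstract is conventionally the word-level edit distance between a reference transcript and a recognizer's transcript of the enhanced audio, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the paper's evaluation pipeline. The function name `word_error_rate`, the lowercase-and-split normalization, and the example hypothesis are assumptions made for illustration. The reference is one of the sample transcripts on this page with punctuation removed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Reference: a transcript from this page, punctuation stripped (15 words).
reference = "thanks again for sharing your story so others can get help sooner than i did"
# Hypothetical recognizer output with one hallucinated word ("later" for "sooner").
hypothesis = "thanks again for sharing your story so others can get help later than i did"

wer = word_error_rate(reference, hypothesis)  # 1 substitution over 15 reference words
print(f"WER = {wer:.3f}")
```

A hallucinated phoneme that changes a word counts as a substitution, so content hallucination shows up directly as a higher WER. Practical evaluations usually apply stronger text normalization (punctuation, numerals, contractions) before scoring.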
Method | DAPS (48 kHz) | DEMO (16 kHz) | AQECC (16 kHz) | VBDMD (48 kHz) | EARS (48 kHz) |
---|---|---|---|---|---|
Noisy Input | |||||
HiFi-GAN-2 | |||||
SGMSE+ | |||||
Genhancer | |||||
(Ours) DiTSE [Base] | |||||
(Ours) DiTSE [Base + Post] | |||||
Below, we compare our method on challenging examples taken from the sample pages of several other models.
Notes:
Clean | Degraded | DiTSE | Miipher w/o text | Genhancer | SGMSE+ |
---|---|---|---|---|---|
Transcript: Thanks again for sharing your story so others can get help sooner than I did.
The first sample is a degraded speech signal from the Miipher demo dataset.
Clean | Degraded | DiTSE | Miipher w/o text | Genhancer | SGMSE+ |
---|---|---|---|---|---|
Degraded | DiTSE | Low-latency SE | HiFi-GAN-2 | StoRM | SGMSE+ |
---|---|---|---|---|---|
This is a challenging example for all methods, especially at the end of the utterance. Although Genhancer achieves good acoustic quality, it hallucinates some content towards the end of the utterance. In contrast, DiTSE preserves the content well even in the presence of strong noise.
Clean | Degraded | DiTSE | Low-latency SE | Genhancer | SGMSE+ |
---|---|---|---|---|---|
Traditional enhancement methods struggle to generalize to out-of-distribution (OOD) data, as HiFi-GAN-2 shows here. Most of the generative methods, however, perform well.
Degraded | DiTSE | UNIVERSE | Miipher | HiFi-GAN-2 | StoRM |
---|---|---|---|---|---|
Degraded | DiTSE | UNIVERSE | Genhancer | Miipher | SGMSE+ |
---|---|---|---|---|---|
Degraded | DiTSE | UNIVERSE | HiFi-GAN-2 | StoRM | SGMSE+ |
---|---|---|---|---|---|
Noisy sample from the StoRM demo page. Note that even the clean signal contains some low-frequency noise.
Clean | Degraded | DiTSE | StoRM [Original] | StoRM [Our dataset] | Miipher |
---|---|---|---|---|---|
Note that the original StoRM model was trained exclusively for dereverberation.
Clean | Degraded | DiTSE | StoRM [Original] | StoRM [Our dataset] | HiFi-GAN-2 |
---|---|---|---|---|---|
To foster collaboration and comparison, we release the complete evaluation sets in the following repository: [URL]