DiTSE

High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers

Heitor R. Guimarães1,2, Jiaqi Su1, Rithesh Kumar1, Tiago H. Falk2, Zeyu Jin1

1 Adobe Research, 2 Université du Québec (INRS-EMT)

Abstract: Real-world speech recordings frequently suffer from degradations such as background noise and reverberation. Speech enhancement aims to mitigate these issues by generating clean, high-fidelity signals. While recent generative approaches to speech enhancement have shown promising results, they still face two major challenges: (1) content hallucination, where generated phonemes, though plausible, differ from the original utterance; and (2) inconsistency, where the output fails to preserve the speaker's identity and prosodic cues of the input speech. In this work, we introduce DiTSE (Diffusion Transformer for Speech Enhancement), which addresses quality issues of degraded speech at full bandwidth. Our approach employs a latent diffusion model together with robust conditioning features, effectively addressing these challenges while remaining computationally efficient. Experimental results from both subjective and objective evaluations demonstrate that DiTSE produces studio-quality audio while preserving speaker identity. Furthermore, DiTSE significantly improves content fidelity and alleviates hallucinations, reducing the word error rate (WER) across datasets compared to state-of-the-art enhancers.
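At a high level, a latent diffusion enhancer learns to iteratively denoise a latent representation of clean speech, conditioned on features extracted from the degraded input. The following is a minimal, illustrative numpy sketch of one DDPM-style training step; the toy latents, the conditioning placeholder, and the "perfect" noise predictor are hypothetical stand-ins, not the DiTSE implementation described in the paper.

```python
import numpy as np

def cosine_alpha_bar(T):
    # Cumulative alpha-bar schedule (cosine variant), common in DDPM training.
    t = np.linspace(0.0, 1.0, T + 1)
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return f[1:] / f[0]

def diffusion_training_step(z_clean, cond, T=1000, seed=0):
    """One simulated denoising training step.

    z_clean: latent of the clean speech (toy placeholder for a codec latent).
    cond:    conditioning features from the degraded input (unused by the
             toy predictor below; a real model would attend to them).
    """
    rng = np.random.default_rng(seed)
    alpha_bar = cosine_alpha_bar(T)
    t = int(rng.integers(0, T))                 # random diffusion timestep
    eps = rng.standard_normal(z_clean.shape)    # target Gaussian noise
    # Forward (noising) process: z_t = sqrt(a_bar)*z0 + sqrt(1-a_bar)*eps
    z_t = np.sqrt(alpha_bar[t]) * z_clean + np.sqrt(1 - alpha_bar[t]) * eps
    # A real model would predict eps from (z_t, t, cond) with a transformer;
    # here we substitute a perfect prediction just to show the loss target.
    eps_hat = eps
    loss = float(np.mean((eps_hat - eps) ** 2))
    return z_t, loss

z0 = np.zeros((8, 16))    # toy clean latent
cond = np.zeros((8, 16))  # toy conditioning features
z_t, loss = diffusion_training_step(z0, cond)
print(loss)  # 0.0 for the perfect predictor
```

At inference, the same network would be applied iteratively, starting from Gaussian noise and denoising step by step while conditioned on the degraded input's features.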

DiTSE Architecture

For more details, read our paper on arXiv.


Examples

Samples from datasets discussed in our paper

Comparison with original reported methods


Here, we compare our method on challenging examples taken from the demo pages of several published models.

Sample 1 - Bandwidth limitation + Reverberation (Miipher demo)

Transcript: Thanks again for sharing your story so others can get help sooner than I did.


Sample 2 - Noise + Reverberation (Miipher demo)

This sample is a degraded speech signal from the Miipher demo page.


Sample 3 - Noise + Distortion (Low-latency SE demo)
Sample 4 - Noise + Reverberation (Low-latency SE demo)

This is a challenging example for all methods, especially at the end of the utterance. Although Genhancer achieves good acoustic quality, some content hallucination can be noted toward the end of the utterance. In contrast, DiTSE preserves the content well even in the presence of strong noise.


Sample 5 - OOD emotional speech (UNIVERSE demo)

Traditional enhancement methods struggle to generalize to out-of-distribution (OOD) data, as shown by HiFi-GAN-2. Most of the generative methods, however, achieve good performance.


Sample 6 - New language (UNIVERSE demo)
Sample 7 - Multi-Speaker (UNIVERSE demo)
Sample 8 - Noise (StoRM demo)

Noise sample from the StoRM demo page. Note that even the clean signal has some low-frequency noise.


Sample 9 - Strong Reverberation (StoRM demo)

Note that the original StoRM method was trained exclusively for dereverberation.



Test sets download

To foster collaboration and comparison, we release the full evaluation sets in the following repository: [URL]