DiTSE

High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers

Heitor R. Guimarães¹,², Jiaqi Su¹, Rithesh Kumar¹, Tiago H. Falk², Zeyu Jin¹

¹ Adobe Research, ² Université du Québec (INRS-EMT)

Abstract: Real-world speech recordings frequently suffer from degradations such as background noise and reverberation. Speech enhancement aims to mitigate these issues by generating clean, high-fidelity signals. While recent generative approaches for speech enhancement have shown promising results, they still face two major challenges: (1) content hallucination, where the generated phonemes are plausible but differ from the original utterance; and (2) inconsistency, where the output fails to preserve the speaker's identity and the prosodic cues of the input speech. In this work, we introduce DiTSE (Diffusion Transformer for Speech Enhancement), which addresses quality issues of degraded speech at full bandwidth. Our approach employs a latent diffusion model together with robust conditioning features, effectively addressing these challenges while remaining computationally efficient. Experimental results from both subjective and objective evaluations demonstrate that DiTSE produces studio-quality audio while preserving speaker identity. Furthermore, DiTSE significantly improves content fidelity and alleviates hallucinations, reducing the word error rate (WER) across datasets compared to state-of-the-art enhancers.

For more details, read our paper on arXiv.

Examples

Samples from datasets discussed in our paper

Method	DAPS (48 kHz)	DEMO (16 kHz)	AQECC (16 kHz)	VBDMD (48 kHz)	EARS (48 kHz)
Noisy Input
HiFi-GAN-2
SGMSE+
Genhancer
(Ours) DiTSE [Base]
(Ours) DiTSE [Base + Post]

Comparison with original reported methods

Here we compare our method with several other models on challenging examples taken from those models' sample pages.

Notes:

Models marked in red are the ones whose samples come from that method's original sample page.
All samples presented here are at 16 kHz. If the original sample was published at a different sampling rate, we resampled it to 16 kHz.
For some samples (e.g., samples 8 and 13), we report both the original result from the method's sample page and the result obtained with the same method trained on our dataset.
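The resampling mentioned in the notes can be sketched as follows. This is a minimal illustration, not the procedure actually used for this page: the helper names `resample_ratio` and `resample_to_target` are hypothetical, and the choice of SciPy's polyphase resampler (`scipy.signal.resample_poly`) is an assumption.

```python
from math import gcd

TARGET_SR = 16_000  # sampling rate of the comparison samples on this page


def resample_ratio(orig_sr: int, target_sr: int = TARGET_SR) -> tuple[int, int]:
    """Return reduced (up, down) factors with target_sr == orig_sr * up / down."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive integers")
    g = gcd(orig_sr, target_sr)
    return target_sr // g, orig_sr // g


def resample_to_target(samples, orig_sr: int, target_sr: int = TARGET_SR):
    """Resample a 1-D (or time-first) array with polyphase filtering.

    Assumption: SciPy is installed. resample_poly applies its own
    anti-aliasing low-pass filter, which matters when downsampling.
    """
    # Imported here so resample_ratio stays usable without SciPy.
    from scipy.signal import resample_poly

    if orig_sr == target_sr:
        return samples
    up, down = resample_ratio(orig_sr, target_sr)
    return resample_poly(samples, up, down, axis=0)
```

For a 48 kHz source the factors reduce to (1, 3), and for a 44.1 kHz source to (160, 441). A polyphase resampler is used here rather than plain decimation because dropping samples without low-pass filtering would alias content above 8 kHz into the 16 kHz output.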

Sample 1 - Bandwidth limitation + Reverberation (Miipher demo)

Clean	Degraded	DiTSE	Miipher w/o text	Genhancer	SGMSE+

Transcript: Thanks again for sharing your story so others can get help sooner than I did.
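A reference transcript like the one above is what makes content fidelity measurable: the abstract reports word error rate (WER), which compares a transcription of the enhanced audio against the reference text. The sketch below is a generic word-level edit-distance implementation under simple assumptions (lowercasing and whitespace tokenization only). The function name `word_error_rate` is hypothetical, and the speech recognizer and text normalization used in the paper are not specified on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (Levenshtein distance, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up hypothesis for illustration; it is not the output of any system here.
reference = "thanks again for sharing your story"
hypothesis = "thanks again for sharing her story"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A hallucinated word that sounds plausible still counts as a substitution here, which is why WER is a useful complement to acoustic quality scores for generative enhancers.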

Sample 2 - Noise + Reverberation (Miipher demo)

This sample is a degraded speech signal from the Miipher demo dataset.

Clean	Degraded	DiTSE	Miipher w/o text	Genhancer	SGMSE+

Sample 3 - Noise + Distortion (Low-latency SE demo)

Degraded	DiTSE	Low-latency SE	HiFi-GAN-2	StoRM	SGMSE+

Sample 4 - Noise + Reverberation (Low-latency SE demo)

This is a challenging example for all methods, especially at the end of the utterance. Although Genhancer achieves good acoustic quality, it hallucinates some content towards the end of the utterance. In contrast, DiTSE preserves the content well even in the presence of strong noise.

Clean	Degraded	DiTSE	Low-latency SE	Genhancer	SGMSE+

Sample 5 - OOD emotional speech (UNIVERSE demo)

Traditional enhancement methods struggle to generalize to out-of-distribution (OOD) data, as the HiFi-GAN-2 output shows. Most of the generative methods, however, perform well on this sample.

Degraded	DiTSE	UNIVERSE	Miipher	HiFi-GAN-2	StoRM

Sample 6 - New language (UNIVERSE demo)

Degraded	DiTSE	UNIVERSE	Genhancer	Miipher	SGMSE+

Sample 7 - Multi-Speaker (UNIVERSE demo)

Degraded	DiTSE	UNIVERSE	HiFi-GAN-2	StoRM	SGMSE+

Sample 8 - Noise (StoRM demo)

Noise sample from the StoRM demo page. Note that even the clean signal has some low-frequency noise.

Clean	Degraded	DiTSE	StoRM [Original]	StoRM [Our dataset]	Miipher

Sample 9 - Strong Reverberation (StoRM demo)

Note that the original StoRM model was trained exclusively for dereverberation.

Clean	Degraded	DiTSE	StoRM [Original]	StoRM [Our dataset]	HiFi-GAN-2

Test sets download

To foster collaboration and comparisons, we release the complete evaluation sets in the following repository: [URL]