Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With recent advances in deep neural networks, it has become a popular technology to activate and control small devices, such as voice assistants. Relying on such models for edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks against voice-based technologies have increased, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors on the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
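The geometric priors mentioned above are in the spirit of variance-invariance-covariance regularization (as in VICReg). The following PyTorch sketch shows one way such a regularizer could be applied to teacher and student embeddings; the function name, term weights, and exact formulation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vic_regularizer(z_student, z_teacher, gamma=1.0, eps=1e-4):
    """VICReg-style regularizer over batches of embeddings (B, D).

    Invariance pulls student and teacher embeddings together, variance keeps
    each dimension's standard deviation above a margin, and covariance
    decorrelates dimensions. Weights are placeholders, not the paper's values.
    """
    inv = F.mse_loss(z_student, z_teacher)  # invariance term

    def var_cov(z):
        z = z - z.mean(dim=0)
        std = torch.sqrt(z.var(dim=0) + eps)
        var = F.relu(gamma - std).mean()               # hinge on per-dimension std
        cov = (z.T @ z) / (z.shape[0] - 1)             # (D, D) covariance matrix
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_pen = (off_diag ** 2).sum() / z.shape[1]   # penalize off-diagonal terms
        return var, cov_pen

    var_s, cov_s = var_cov(z_student)
    var_t, cov_t = var_cov(z_teacher)
    return 25.0 * inv + 25.0 * (var_s + var_t) + 1.0 * (cov_s + cov_t)
```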
2023
Environment-Aware Knowledge Distillation for Improved Resource-Constrained Edge Speech Recognition
Arthur Pimentel, Heitor R Guimarães, Anderson Avila, and 1 more author
Applied Sciences, 2023
Journal
Analysis of Oral Exams With Speaker Diarization and Speech Emotion Recognition: A Case Study
Wesley Beccaro, Miguel Arjona Ramírez, William Liaw, and 1 more author
Self-supervised speech pre-training has emerged as a useful tool to extract representations from speech that can be used across different tasks. While these models are starting to appear in commercial systems, their robustness to so-called adversarial attacks has yet to be fully characterized. This paper evaluates the vulnerability of three self-supervised speech representations (wav2vec 2.0, HuBERT, and WavLM) to three white-box adversarial attacks under different signal-to-noise ratios (SNR). The study uses keyword spotting as a downstream task and shows that the models are very vulnerable to attacks, even at high SNRs. The paper also investigates the transferability of attacks between models and analyses the generated noise patterns in order to develop more effective defence mechanisms. The modulation spectrum is shown to be a potential tool for detecting adversarial attacks on speech systems.
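For intuition, the sketch below shows a single-step white-box perturbation whose magnitude is scaled to a target SNR, in the spirit of the evaluation above; `fgsm_at_snr` is a hypothetical helper and not the paper's exact attack configuration.

```python
import torch
import torch.nn.functional as F

def fgsm_at_snr(model, wav, label, snr_db):
    """Single-step white-box perturbation scaled to a target SNR (illustrative).

    `model` is assumed to map a waveform batch (B, T) to keyword logits.
    """
    wav = wav.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(wav), label)
    loss.backward()
    delta = wav.grad.sign()  # FGSM direction
    # Scale the perturbation so that 10*log10(P_signal / P_noise) equals snr_db.
    sig_pow = wav.detach().pow(2).mean(dim=-1, keepdim=True)
    noise_pow = delta.pow(2).mean(dim=-1, keepdim=True) + 1e-12
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return (wav + scale * delta).detach()
```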
Conference
RobustDistiller: Compressing Universal Speech Representations for Enhanced Environment Robustness
Heitor R. Guimarães, Arthur Pimentel, Anderson R. Avila, and 3 more authors
In 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Self-supervised speech pre-training enables deep neural network models to capture meaningful and disentangled factors from raw waveform signals. The learned universal speech representations can then be used across numerous downstream tasks. These representations, however, are sensitive to distribution shifts caused by environmental factors, such as noise and/or room reverberation. Their large sizes, in turn, make them infeasible for edge applications. In this work, we propose a knowledge distillation methodology termed RobustDistiller which compresses universal representations while making them more robust against environmental artifacts via a multi-task learning objective. The proposed layer-wise distillation recipe is evaluated on top of three well-established universal representations, as well as with three downstream tasks. Experimental results show the proposed methodology applied on top of the WavLM Base+ teacher model outperforming all other benchmarks across noise types and levels, as well as reverberation times. Oftentimes, the student model (24M parameters) achieved results in line with those of the teacher model (95M).
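A minimal sketch of how a layer-wise distillation term could be combined with a waveform-enhancement term under a multi-task objective is given below; the specific distance functions, weights, and tensor shapes are assumptions for illustration, not the exact RobustDistiller recipe.

```python
import torch
import torch.nn.functional as F

def robust_distill_loss(student_layers, teacher_clean_layers,
                        enhanced_wav, clean_wav, alpha=1.0, beta=0.5):
    """Layer-wise distillation plus waveform-enhancement objective (sketch).

    Assumes the student processes the noisy waveform and predicts a subset of
    the teacher's clean-input layer representations (lists of (B, T, D)
    tensors), while an auxiliary decoder outputs `enhanced_wav`. The distance
    functions and the weights alpha/beta are illustrative.
    """
    distill = sum(
        F.l1_loss(s, t) - F.cosine_similarity(s, t, dim=-1).mean()
        for s, t in zip(student_layers, teacher_clean_layers)
    ) / len(student_layers)
    enhance = F.l1_loss(enhanced_wav, clean_wav)  # clean-waveform reconstruction
    return alpha * distill + beta * enhance
```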
Preprint
On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications
Vamsikrishna Chemudupati, Marzieh Tahaei, Heitor Guimaraes, and 5 more authors
Large self-supervised pre-trained speech models have achieved remarkable success across various speech-processing tasks. The self-supervised training of these models leads to universal speech representations that can be used for different downstream tasks, ranging from automatic speech recognition (ASR) to speaker identification. Recently, Whisper, a transformer-based model, was proposed and trained on a large amount of weakly supervised data for ASR; it outperformed several state-of-the-art self-supervised models. Given the superiority of Whisper for ASR, in this paper we explore the transferability of its representations to four other speech tasks in the SUPERB benchmark. Moreover, we explore the robustness of the Whisper representations for “in the wild” tasks where speech is corrupted by environmental noise and room reverberation. Experimental results show Whisper achieves promising results across tasks and environmental conditions, thus showing potential for cross-task real-world deployment.
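As an illustration of extracting Whisper encoder representations for a downstream probe, the snippet below uses the Hugging Face transformers implementation with a dummy waveform; the paper does not necessarily rely on this library or on mean pooling.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# A 1-second dummy waveform at 16 kHz stands in for real speech.
wav = torch.randn(16000).numpy()

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

inputs = extractor(wav, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    enc = model.encoder(inputs.input_features)   # log-Mel features -> encoder states
frame_feats = enc.last_hidden_state              # (1, frames, hidden)
utt_embedding = frame_feats.mean(dim=1)          # simple utterance-level pooling
```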
2022
Workshop
Improving the Robustness of DistilHuBERT to Unseen Noisy Conditions via Data Augmentation, Curriculum Learning, and Multi-Task Enhancement
Heitor R. Guimarães, Arthur Pimentel, Anderson R. Avila, and 2 more authors
In NeurIPS 2022 Efficient Natural Language and Speech Processing Workshop, 2022
Self-supervised speech representation learning aims to extract meaningful factors from the speech signal that can later be used across different downstream tasks, such as speech and/or emotion recognition. Existing models, such as HuBERT, however, can be fairly large and thus may not be suitable for edge speech applications. Moreover, realistic applications typically involve speech corrupted by noise and room reverberation, hence models need to provide representations that are robust to such environmental factors. In this study, we build on the so-called DistilHuBERT model, which distils HuBERT to a fraction of its original size, with three modifications, namely: (i) augment the training data with noise and reverberation, while the student model needs to distill the clean representations from the teacher model; (ii) introduce a curriculum learning approach where increasing levels of noise are introduced as the model trains, thus helping with convergence and with the creation of more robust representations; and (iii) introduce a multi-task learning approach where the model also reconstructs the clean waveform jointly with the distillation task, thus also acting as an enhancement step to ensure additional environment robustness to the representation. Experiments on three SUPERB tasks show the advantages of the proposed method not only relative to the original DistilHuBERT, but also to the original HuBERT, thus highlighting its benefits for “in the wild” edge speech applications.
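A rough sketch of modifications (i) and (ii) is shown below: noise is mixed into the training waveforms at an SNR that decreases as training progresses; the linear schedule and the helper names are illustrative assumptions, not the paper's exact curriculum.

```python
import torch

def snr_for_step(step, total_steps, snr_high=30.0, snr_low=0.0):
    """Linear curriculum: start with easy, high-SNR mixtures and anneal
    toward harder, low-SNR ones as training progresses (illustrative)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return snr_high - frac * (snr_high - snr_low)

def mix_at_snr(clean, noise, snr_db):
    """Additively mix `noise` into `clean` (both (B, T)) at the requested SNR."""
    clean_pow = clean.pow(2).mean(dim=-1, keepdim=True)
    noise_pow = noise.pow(2).mean(dim=-1, keepdim=True) + 1e-8
    gain = torch.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + gain * noise
```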
Workshop
An Exploration into the Performance of Unsupervised Cross-Task Speech Representations for "In the Wild" Edge Applications
Heitor R. Guimarães, Arthur Pimentel, Anderson R. Avila, and 2 more authors
Unsupervised speech models are becoming ubiquitous in the speech and machine learning communities. Upstream models are responsible for learning meaningful representations from raw audio. Later, these representations serve as input to downstream models to solve a number of tasks, such as keyword spotting or emotion recognition. As edge speech applications start to emerge, it is important to gauge how robust these cross-task representations are on edge devices with limited resources and different noise levels. To this end, in this study we evaluate the robustness of four different versions of HuBERT, namely the base, large, and extra-large versions, as well as a recent version termed Robust-HuBERT. Tests are conducted under different additive and convolutive noise conditions for three downstream tasks: keyword spotting, intent classification, and emotion recognition. Our results show that while larger models can provide some important robustness to environmental factors, they may not be applicable to edge applications. Smaller models, on the other hand, showed substantial accuracy drops in noisy conditions, especially in the presence of room reverberation. These findings suggest that cross-task speech representations are not yet ready for edge applications and that innovations are still needed.
Workshop
How Robust is Robust wav2vec 2.0 for Edge Applications? An Exploration into the Effects of Quantization and Model Pruning on “In-the-Wild” Speech Recognition
Arthur Pimentel, Heitor R. Guimarães, Anderson R. Avila, and 2 more authors
Recent advances in self-supervised learning have allowed speech recognition systems to achieve state-of-the-art (SOTA) word error rates (WER) while requiring only a fraction of the labeled training data needed by their predecessors (e.g., the wav2vec 2.0 model). Notwithstanding, while such models have been shown to achieve SOTA performance in matched conditions, their performance degrades in unmatched conditions, which is typically the case in edge applications. To overcome this problem, strategies such as data augmentation and/or multi-condition training have been explored and a robust version of wav2vec 2.0 has been implemented. It is argued here, however, that such models are still too large to be considered for edge applications on resource-constrained devices, which justifies why model compression tools are needed. In this paper, we report findings on the effects of quantization and model pruning on speech recognition tasks in noisy conditions. Our results show that model compression has minimal impact on final WER, while signal-to-noise ratio (SNR) has a stronger impact.
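The snippet below illustrates the two compression tools discussed, magnitude pruning and post-training dynamic quantization, using standard PyTorch utilities on a small stand-in model; the paper applies them to wav2vec 2.0, and the sparsity level shown here is only an example.

```python
import torch
import torch.nn.utils.prune as prune

# Small stand-in model; the paper compresses wav2vec 2.0 instead.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768), torch.nn.ReLU(), torch.nn.Linear(768, 32)
)

# Magnitude (L1) pruning: zero out 30% of the smallest weights per Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# Post-training dynamic quantization of the remaining Linear layers to int8.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```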
Conference
A Perceptual Loss Based Complex Neural Beamforming for Ambix 3D Speech Enhancement
Heitor R. Guimarães, Wesley Beccaro, and Miguel A. Ramirez
In Proc. L3DAS22: Machine Learning for 3D Audio Signal Processing, 2022
This work proposes a novel approach to B-Format AmbiX 3D speech enhancement based on the short-time Fourier transform (STFT) representation. The model is a Fully Complex Convolutional Network (FC2N) that estimates a mask to be applied to the input features. Then, a final layer is responsible for converting the B-format to a monaural representation, to which we apply the inverse STFT (ISTFT) operation. For the optimization process, we use a compound loss function, applied in the time domain, based on the short-time objective intelligibility (STOI) metric combined with a perceptual loss on top of the wav2vec 2.0 model. The approach is applied to Task 1 of the L3DAS22 challenge, where our model achieves a score of 0.845 in the metric proposed by the challenge, using a subset of the development set as reference.
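The perceptual part of such a compound loss can be sketched with a frozen wav2vec 2.0 feature extractor, as below; the layer choice, the L1 distance, and the omission of the STOI term are simplifications, not the paper's exact objective.

```python
import torch
import torchaudio

# Frozen wav2vec 2.0 used only as a feature extractor for the perceptual term.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
w2v = bundle.get_model().eval()
for p in w2v.parameters():
    p.requires_grad_(False)

def perceptual_loss(estimate, reference):
    """L1 distance between wav2vec 2.0 features of the enhanced and clean
    signals (both (B, T) waveforms at 16 kHz); the layer choice is arbitrary."""
    feats_est, _ = w2v.extract_features(estimate)
    feats_ref, _ = w2v.extract_features(reference)
    return torch.nn.functional.l1_loss(feats_est[-1], feats_ref[-1])
```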
Thesis
On Self-Supervised Representations for 3D Speech Enhancement
Noise in 3D reverberant environments is detrimental to several downstream applications. In this work, we propose a novel approach to 3D speech enhancement directly in the time domain through the usage of Fully Convolutional Networks (FCN) with a custom loss function based on the combination of a perceptual loss, built on top of the wav2vec model, and a soft version of the short-time objective intelligibility (STOI) metric. The dataset and experiments were based on Task 1 of the L3DAS21 challenge. Our model achieves a STOI score of 0.82, a word error rate (WER) of 0.36, and a score of 0.73 in the metric proposed by the challenge, which combines STOI and WER, using the development set as reference. Our submission, based on this method, was ranked second in Task 1 of the L3DAS21 challenge.
2020
Journal
Monaural speech enhancement through deep wave-U-net
Heitor R. Guimarães, Hitoshi Nagano, and Diego W. Silva
In this paper, we present Speech Enhancement through Wave-U-Net (SEWUNet), an end-to-end approach to reduce noise in speech signals. This background noise is detrimental to several downstream systems, including automatic speech recognition (ASR) and word spotting, which in turn can negatively impact end-user applications. We show that our proposal improves the signal-to-noise ratio (SNR) and word error rate (WER) compared with existing mechanisms in the literature. In the experiments, the network input is a 16 kHz sample rate audio waveform corrupted by additive noise. Our method is based on the Wave-U-Net architecture with some adaptations to our problem. Four simple enhancements are proposed and tested with ablation studies to prove their validity. In particular, we highlight the weight initialization through an autoencoder before training for the main denoising task, which leads to a more efficient use of training time and higher performance. Through quantitative metrics, we show that our method is preferred over classical Wiener filtering and shows better performance than other state-of-the-art proposals.
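A toy illustration of the autoencoder-based weight initialization highlighted above: the same encoder/decoder is first warmed up to reconstruct clean speech and then trained on the noisy-to-clean mapping. The architecture, loss, and function names below are placeholders, not the SEWUNet model.

```python
import torch
import torch.nn.functional as F

class TinyDenoiser(torch.nn.Module):
    """Toy 1-D encoder/decoder standing in for the Wave-U-Net used in the paper."""
    def __init__(self):
        super().__init__()
        self.enc = torch.nn.Conv1d(1, 16, kernel_size=15, stride=2, padding=7)
        self.dec = torch.nn.ConvTranspose1d(16, 1, kernel_size=15, stride=2,
                                            padding=7, output_padding=1)

    def forward(self, x):  # x: (B, 1, T) waveform
        return self.dec(torch.relu(self.enc(x)))

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def pretrain_step(clean):
    """Autoencoder warm-up: reconstruct clean speech from clean speech."""
    opt.zero_grad()
    loss = F.l1_loss(model(clean), clean)
    loss.backward()
    opt.step()
    return loss.item()

def denoise_step(noisy, clean):
    """Main task: map the noisy waveform to its clean counterpart,
    starting from the autoencoder-initialized weights."""
    opt.zero_grad()
    loss = F.l1_loss(model(noisy), clean)
    loss.backward()
    opt.step()
    return loss.item()
```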
2018
Thesis
Music Information Retrieval: Deep Learning Approach