Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With recent advances in deep neural networks, it has become a popular technology to activate and control small devices, such as voice assistants. Relying on such models for edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks against voice-based technologies have increased, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors on the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
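The geometric priors mentioned above are in the spirit of variance-invariance-covariance regularization (as in VICReg). The following PyTorch sketch shows one way such a regularizer could be applied to teacher and student embeddings; the function name, term weights, and exact formulation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vic_regularizer(z_student, z_teacher, gamma=1.0, eps=1e-4):
    """VICReg-style regularizer over batches of embeddings (B, D).

    Invariance pulls student and teacher embeddings together, variance keeps
    each dimension's standard deviation above a margin, and covariance
    decorrelates dimensions. Weights are placeholders, not the paper's values.
    """
    inv = F.mse_loss(z_student, z_teacher)  # invariance term

    def var_cov(z):
        z = z - z.mean(dim=0)
        std = torch.sqrt(z.var(dim=0) + eps)
        var = F.relu(gamma - std).mean()               # hinge on per-dimension std
        cov = (z.T @ z) / (z.shape[0] - 1)             # (D, D) covariance matrix
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_pen = (off_diag ** 2).sum() / z.shape[1]   # penalize off-diagonal terms
        return var, cov_pen

    var_s, cov_s = var_cov(z_student)
    var_t, cov_t = var_cov(z_teacher)
    return 25.0 * inv + 25.0 * (var_s + var_t) + 1.0 * (cov_s + cov_t)
```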
2023
Environment-Aware Knowledge Distillation for Improved Resource-Constrained Edge Speech Recognition
Arthur Pimentel, Heitor R Guimarães, Anderson Avila, and 1 more author
Applied Sciences, 2023
Journal
Analysis of Oral Exams With Speaker Diarization and Speech Emotion Recognition: A Case Study
Wesley Beccaro, Miguel Arjona Ramírez, William Liaw, and 1 more author
Self-supervised speech pre-training has emerged as a useful tool to extract representations from speech that can be used across different tasks. While these models are starting to appear in commercial systems, their robustness to so-called adversarial attacks has yet to be fully characterized. This paper evaluates the vulnerability of three self-supervised speech representations (wav2vec 2.0, HuBERT, and WavLM) to three white-box adversarial attacks under different signal-to-noise ratios (SNR). The study uses keyword spotting as a downstream task and shows that the models are very vulnerable to attacks, even at high SNRs. The paper also investigates the transferability of attacks between models and analyses the generated noise patterns in order to develop more effective defence mechanisms. The modulation spectrum is shown to be a potential tool for detecting adversarial attacks on speech systems.
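For intuition, the sketch below shows a single-step white-box perturbation whose magnitude is scaled to a target SNR, in the spirit of the evaluation above; `fgsm_at_snr` is a hypothetical helper and not the paper's exact attack configuration.

```python
import torch
import torch.nn.functional as F

def fgsm_at_snr(model, wav, label, snr_db):
    """Single-step white-box perturbation scaled to a target SNR (illustrative).

    `model` is assumed to map a waveform batch (B, T) to keyword logits.
    """
    wav = wav.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(wav), label)
    loss.backward()
    delta = wav.grad.sign()  # FGSM direction
    # Scale the perturbation so that 10*log10(P_signal / P_noise) equals snr_db.
    sig_pow = wav.detach().pow(2).mean(dim=-1, keepdim=True)
    noise_pow = delta.pow(2).mean(dim=-1, keepdim=True) + 1e-12
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return (wav + scale * delta).detach()
```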
Conference
RobustDistiller: Compressing Universal Speech Representations for Enhanced Environment Robustness
Heitor R. Guimarães, Arthur Pimentel, Anderson R. Avila, and 3 more authors
In 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Self-supervised speech pre-training enables deep neural network models to capture meaningful and disentangled factors from raw waveform signals. The learned universal speech representations can then be used across numerous downstream tasks. These representations, however, are sensitive to distribution shifts caused by environmental factors, such as noise and/or room reverberation. Their large sizes, in turn, make them infeasible for edge applications. In this work, we propose a knowledge distillation methodology termed RobustDistiller which compresses universal representations while making them more robust against environmental artifacts via a multi-task learning objective. The proposed layer-wise distillation recipe is evaluated on top of three well-established universal representations, as well as with three downstream tasks. Experimental results show the proposed methodology applied on top of the WavLM Base+ teacher model outperforming all other benchmarks across noise types and levels, as well as reverberation times. Oftentimes, the student model (24M parameters) achieved results in line with those of the teacher model (95M).
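A minimal sketch of how a layer-wise distillation term could be combined with a waveform-enhancement term under a multi-task objective is given below; the specific distance functions, weights, and tensor shapes are assumptions for illustration, not the exact RobustDistiller recipe.

```python
import torch
import torch.nn.functional as F

def robust_distill_loss(student_layers, teacher_clean_layers,
                        enhanced_wav, clean_wav, alpha=1.0, beta=0.5):
    """Layer-wise distillation plus waveform-enhancement objective (sketch).

    Assumes the student processes the noisy waveform and predicts a subset of
    the teacher's clean-input layer representations (lists of (B, T, D)
    tensors), while an auxiliary decoder outputs `enhanced_wav`. The distance
    functions and the weights alpha/beta are illustrative.
    """
    distill = sum(
        F.l1_loss(s, t) - F.cosine_similarity(s, t, dim=-1).mean()
        for s, t in zip(student_layers, teacher_clean_layers)
    ) / len(student_layers)
    enhance = F.l1_loss(enhanced_wav, clean_wav)  # clean-waveform reconstruction
    return alpha * distill + beta * enhance
```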
Preprint
On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications
Vamsikrishna Chemudupati, Marzieh Tahaei, Heitor Guimaraes, and 5 more authors
Large self-supervised pre-trained speech models have achieved remarkable success across various speech-processing tasks. The self-supervised training of these models leads to universal speech representations that can be used for different downstream tasks, ranging from automatic speech recognition (ASR) to speaker identification. Recently, Whisper, a transformer-based model, was proposed and trained on a large amount of weakly supervised data for ASR; it outperformed several state-of-the-art self-supervised models. Given the superiority of Whisper for ASR, in this paper we explore the transferability of its representations to four other speech tasks in the SUPERB benchmark. Moreover, we explore the robustness of the Whisper representations for “in the wild” tasks where speech is corrupted by environmental noise and room reverberation. Experimental results show Whisper achieves promising results across tasks and environmental conditions, thus showing potential for cross-task real-world deployment.
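As an illustration of extracting Whisper encoder representations for a downstream probe, the snippet below uses the Hugging Face transformers implementation with a dummy waveform; the paper does not necessarily rely on this library or on mean pooling.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# A 1-second dummy waveform at 16 kHz stands in for real speech.
wav = torch.randn(16000).numpy()

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

inputs = extractor(wav, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    enc = model.encoder(inputs.input_features)   # log-Mel features -> encoder states
frame_feats = enc.last_hidden_state              # (1, frames, hidden)
utt_embedding = frame_feats.mean(dim=1)          # simple utterance-level pooling
```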
2022
Workshop
Improving the Robustness of DistilHuBERT to Unseen Noisy Conditions via Data Augmentation, Curriculum Learning, and Multi-Task Enhancement
Heitor R. Guimarães, Arthur Pimentel, Anderson R. Avila, and 2 more authors
In NeurIPS 2022 Efficient Natural Language and Speech Processing Workshop, 2022
Self-supervised speech representation learning aims to extract meaningful factors from the speech signal that can later be used across different downstream tasks, such as speech and/or emotion recognition. Existing models, such as HuBERT, however, can be fairly large and thus may not be suitable for edge speech applications. Moreover, realistic applications typically involve speech corrupted by noise and room reverberation, hence models need to provide representations that are robust to such environmental factors. In this study, we build on the so-called DistilHuBERT model, which distils HuBERT to a fraction of its original size, with three modifications, namely: (i) augment the training data with noise and reverberation, while the student model needs to distill the clean representations from the teacher model; (ii) introduce a curriculum learning approach where increasing levels of noise are introduced as the model trains, thus helping with convergence and with the creation of more robust representations; and (iii) introduce a multi-task learning approach where the model also reconstructs the clean waveform jointly with the distillation task, thus also acting as an enhancement step to ensure additional environment robustness to the representation. Experiments on three SUPERB tasks show the advantages of the proposed method not only relative to the original DistilHuBERT, but also to the original HuBERT, thus highlighting its benefits for “in the wild” edge speech applications.
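A rough sketch of modifications (i) and (ii) is shown below: noise is mixed into the training waveforms at an SNR that decreases as training progresses; the linear schedule and the helper names are illustrative assumptions, not the paper's exact curriculum.

```python
import torch

def snr_for_step(step, total_steps, snr_high=30.0, snr_low=0.0):
    """Linear curriculum: start with easy, high-SNR mixtures and anneal
    toward harder, low-SNR ones as training progresses (illustrative)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return snr_high - frac * (snr_high - snr_low)

def mix_at_snr(clean, noise, snr_db):
    """Additively mix `noise` into `clean` (both (B, T)) at the requested SNR."""
    clean_pow = clean.pow(2).mean(dim=-1, keepdim=True)
    noise_pow = noise.pow(2).mean(dim=-1, keepdim=True) + 1e-8
    gain = torch.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + gain * noise
```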
Workshop
An Exploration into the Performance of Unsupervised Cross-Task Speech Representations for "In the Wild" Edge Applications
Heitor R. Guimarães, Arthur Pimentel, Anderson R. Avila, and 2 more authors
Unsupervised speech models are becoming ubiquitous in the speech and machine learning communities. Upstream models are responsible for learning meaningful representations from raw audio. Later, these representations serve as input to downstream models to solve a number of tasks, such as keyword spotting or emotion recognition. As edge speech applications start to emerge, it is important to gauge how robust these cross-task representations are on edge devices with limited resources and different noise levels. To this end, in this study we evaluate the robustness of four different versions of HuBERT, namely the base, large, and extra-large versions, as well as a recent version termed Robust-HuBERT. Tests are conducted under different additive and convolutive noise conditions for three downstream tasks: keyword spotting, intent classification, and emotion recognition. Our results show that while larger models can provide some important robustness to environmental factors, they may not be applicable to edge applications. Smaller models, on the other hand, showed substantial accuracy drops in noisy conditions, especially in the presence of room reverberation. These findings suggest that cross-task speech representations are not yet ready for edge applications and that innovations are still needed.
Workshop
How Robust is Robust wav2vec 2.0 for Edge Applications? An Exploration into the Effects of Quantization and Model Pruning on “In-the-Wild” Speech Recognition
Arthur Pimentel, Heitor R. Guimarães, Anderson R. Avila, and 2 more authors
Recent advances in self-supervised learning have allowed speech recognition systems to achieve state-of-the-art (SOTA) word error rates (WER) while requiring only a fraction of the labeled training data needed by their predecessors (e.g., the wav2vec 2.0 model). Notwithstanding, while such models have been shown to achieve SOTA performance in matched conditions, their performance degrades in unmatched conditions, which is typically the case in edge applications. To overcome this problem, strategies such as data augmentation and/or multi-condition training have been explored and a robust version of wav2vec 2.0 has been implemented. It is argued here, however, that such models are still too large to be considered for edge applications on resource-constrained devices, which justifies why model compression tools are needed. In this paper, we report findings on the effects of quantization and model pruning on speech recognition tasks in noisy conditions. Our results show that model compression has minimal impact on final WER, while signal-to-noise ratio (SNR) has a stronger impact.
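The snippet below illustrates the two compression tools discussed, magnitude pruning and post-training dynamic quantization, using standard PyTorch utilities on a small stand-in model; the paper applies them to wav2vec 2.0, and the sparsity level shown here is only an example.

```python
import torch
import torch.nn.utils.prune as prune

# Small stand-in model; the paper compresses wav2vec 2.0 instead.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768), torch.nn.ReLU(), torch.nn.Linear(768, 32)
)

# Magnitude (L1) pruning: zero out 30% of the smallest weights per Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# Post-training dynamic quantization of the remaining Linear layers to int8.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```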
Conference
A Perceptual Loss Based Complex Neural Beamforming for Ambix 3D Speech Enhancement
Heitor R. Guimarães, Wesley Beccaro, and Miguel A. Ramirez
In Proc. L3DAS22: Machine Learning for 3D Audio Signal Processing, 2022
This work proposes a novel approach to B-Format AmbiX 3D speech enhancement based on the short-time Fourier transform (STFT) representation. The model is a Fully Complex Convolutional Network (FC2N) that estimates a mask to be applied to the input features. Then, a final layer is responsible for converting the B-format to a monaural representation, to which we apply the inverse STFT (ISTFT) operation. For the optimization process, we use a compound loss function, applied in the time domain, based on the short-time objective intelligibility (STOI) metric combined with a perceptual loss on top of the wav2vec 2.0 model. The approach is applied to Task 1 of the L3DAS22 challenge, where our model achieves a score of 0.845 in the metric proposed by the challenge, using a subset of the development set as reference.
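The perceptual part of such a compound loss can be sketched with a frozen wav2vec 2.0 feature extractor, as below; the layer choice, the L1 distance, and the omission of the STOI term are simplifications, not the paper's exact objective.

```python
import torch
import torchaudio

# Frozen wav2vec 2.0 used only as a feature extractor for the perceptual term.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
w2v = bundle.get_model().eval()
for p in w2v.parameters():
    p.requires_grad_(False)

def perceptual_loss(estimate, reference):
    """L1 distance between wav2vec 2.0 features of the enhanced and clean
    signals (both (B, T) waveforms at 16 kHz); the layer choice is arbitrary."""
    feats_est, _ = w2v.extract_features(estimate)
    feats_ref, _ = w2v.extract_features(reference)
    return torch.nn.functional.l1_loss(feats_est[-1], feats_ref[-1])
```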
Thesis
On Self-Supervised Representations for 3D Speech Enhancement
Noise in 3D reverberant environments is detrimental to several downstream applications. In this work, we propose a novel approach to 3D speech enhancement directly in the time domain through the usage of Fully Convolutional Networks (FCN) with a custom loss function based on the combination of a perceptual loss, built on top of the wav2vec model, and a soft version of the short-time objective intelligibility (STOI) metric. The dataset and experiments were based on Task 1 of the L3DAS21 challenge. Our model achieves a STOI score of 0.82, a word error rate (WER) of 0.36, and a score of 0.73 in the metric proposed by the challenge, which combines STOI and WER, using the development set as reference. Our submission, based on this method, was ranked second in Task 1 of the L3DAS21 challenge.
2020
Journal
Monaural speech enhancement through deep wave-U-net
Heitor R. Guimarães, Hitoshi Nagano, and Diego W. Silva
In this paper, we present Speech Enhancement through Wave-U-Net (SEWUNet), an end-to-end approach to reduce noise in speech signals. This background noise is detrimental to several downstream systems, including automatic speech recognition (ASR) and word spotting, which in turn can negatively impact end-user applications. We show that our proposal improves the signal-to-noise ratio (SNR) and word error rate (WER) compared with existing mechanisms in the literature. In the experiments, the network input is a 16 kHz sample rate audio waveform corrupted by additive noise. Our method is based on the Wave-U-Net architecture with some adaptations to our problem. Four simple enhancements are proposed and tested with ablation studies to prove their validity. In particular, we highlight the weight initialization through an autoencoder before training for the main denoising task, which leads to a more efficient use of training time and higher performance. Through quantitative metrics, we show that our method is preferred over classical Wiener filtering and shows better performance than other state-of-the-art proposals.
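A toy illustration of the autoencoder-based weight initialization highlighted above: the same encoder/decoder is first warmed up to reconstruct clean speech and then trained on the noisy-to-clean mapping. The architecture, loss, and function names below are placeholders, not the SEWUNet model.

```python
import torch
import torch.nn.functional as F

class TinyDenoiser(torch.nn.Module):
    """Toy 1-D encoder/decoder standing in for the Wave-U-Net used in the paper."""
    def __init__(self):
        super().__init__()
        self.enc = torch.nn.Conv1d(1, 16, kernel_size=15, stride=2, padding=7)
        self.dec = torch.nn.ConvTranspose1d(16, 1, kernel_size=15, stride=2,
                                            padding=7, output_padding=1)

    def forward(self, x):  # x: (B, 1, T) waveform
        return self.dec(torch.relu(self.enc(x)))

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def pretrain_step(clean):
    """Autoencoder warm-up: reconstruct clean speech from clean speech."""
    opt.zero_grad()
    loss = F.l1_loss(model(clean), clean)
    loss.backward()
    opt.step()
    return loss.item()

def denoise_step(noisy, clean):
    """Main task: map the noisy waveform to its clean counterpart,
    starting from the autoencoder-initialized weights."""
    opt.zero_grad()
    loss = F.l1_loss(model(noisy), clean)
    loss.backward()
    opt.step()
    return loss.item()
```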
2018
Thesis
Music Information Retrieval: Deep Learning Approach