ASR Model Fine-Tuning Series: Navigating Data Scarcity with Finesse



Published on: 2023-10-19

By Emer Butler

When it comes to fine-tuning speech models in Automatic Speech Recognition (ASR), here’s the deal: we need data, lots of it. The more audio data we have with accurate transcriptions to train an ASR model, the better the model performs. However, we often face two related problems: 

These issues are collectively known as “data scarcity.” Data scarcity is a common hurdle in ASR model fine-tuning. 

In this article, we’ll dive into four approaches that can help overcome this challenge and improve the performance of your ASR model: data augmentation, transfer learning, self-supervised learning, and active learning.

Data augmentation

Data augmentation is a technique that creates new training data by applying changes to the data you already have. You can think of it as creating variations of your existing recordings, so that you effectively have more data in total. For speech modelling and audio data, this includes techniques such as noise injection, pitch scaling, and spectral warping, among others. The idea is to use these changes to increase the diversity of the audio data, thereby training the ASR model to stay robust across a wider range of conditions.
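To make noise injection concrete, here is a minimal sketch in plain Python. It mixes white Gaussian noise into a signal at a chosen signal-to-noise ratio (SNR). The function name add_noise_at_snr, the synthetic 220 Hz tone, and the SNR values are illustrative assumptions, not something taken from a particular toolkit; in practice you would load real recordings and often mix in recorded background noise instead.

```python
import math
import random


def add_noise_at_snr(samples, snr_db, seed=0):
    """Return a copy of `samples` with white Gaussian noise mixed in at `snr_db`."""
    rng = random.Random(seed)
    # Average power of the clean signal.
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, noise_std) for s in samples]


# Toy "recording" (an assumption for this sketch): one second of a 220 Hz tone at 16 kHz.
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]

# One clean recording becomes several noisier training variants.
augmented = [add_noise_at_snr(clean, snr_db, seed=i) for i, snr_db in enumerate([20, 10, 5])]
```

Each variant keeps the original transcript, so one labelled recording yields several training examples at different noise levels.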

For example, if you want to fine-tune an ASR model for children’s speech, you can use spectral warping to synthesise speech with different apparent vocal tract lengths (in other words, you shift the resonant frequencies of the speaker’s voice up or down) and therefore capture more speaker variability. A study from the Journal of Signal Processing Systems found that, with their under-resourced dataset of school children’s speech, spectral warping had the most significant impact on performance compared to other augmentation methods.
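One simple way to picture spectral warping is to stretch or compress the frequency axis of a magnitude spectrum. The sketch below does this for a single frame with a plain linear warp. It is an assumption about how such a warp could look, not the procedure used in the study mentioned above, and the name warp_frequency_axis and the toy 32-bin frame are made up for illustration.

```python
def warp_frequency_axis(spectrum, alpha):
    """Linearly warp one magnitude-spectrum frame along the frequency axis.

    Output bin k takes its value from input position k / alpha, so alpha > 1
    moves spectral peaks up in frequency and alpha < 1 moves them down.
    """
    n_bins = len(spectrum)
    warped = []
    for k in range(n_bins):
        src = k / alpha
        if src > n_bins - 1:
            # The source position lies beyond the last bin: nothing to copy.
            warped.append(0.0)
            continue
        lo = int(src)
        hi = min(lo + 1, n_bins - 1)
        frac = src - lo
        # Linear interpolation between the two neighbouring input bins.
        warped.append((1 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped


# Toy frame (an assumption): 32 frequency bins with a single peak at bin 8.
frame = [0.0] * 32
frame[8] = 1.0

raised = warp_frequency_axis(frame, 1.25)   # peak moves from bin 8 to bin 10
lowered = warp_frequency_axis(frame, 0.8)   # peak moves down in frequency
```

Warp factors close to 1 keep the speech natural-sounding. Practical implementations often use a piecewise-linear warp rather than this plain linear one, so that the top of the frequency band is preserved.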

Often, combining multiple audio augmentation techniques gives the best results in improving model performance when you have limited resources.

Transfer learning

Transfer learning is a machine learning technique that helps us train our models with less data. It works by using the knowledge from a large, general dataset (e.g. a set of adult speech) and applying it to a smaller, more specific dataset (e.g. a set of elderly speech). This way, we can use the common and unique features of both datasets to make better predictions.
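A common form of transfer learning is to keep a pre-trained encoder frozen and train only a small new output layer on the target data. The toy sketch below shows the shape of that idea in plain Python. Everything in it is an assumption for illustration: pretrained_encoder uses hard-coded weights as a stand-in for a model trained on the large general dataset, and the six two-dimensional target examples stand in for a small, specific dataset.

```python
import math


def pretrained_encoder(x):
    """Stand-in for an encoder trained on the large general dataset (kept frozen)."""
    # Hypothetical fixed weights; in a real system these come from pre-training.
    return [math.tanh(0.8 * x[0] - 0.3 * x[1]), math.tanh(0.2 * x[0] + 0.9 * x[1])]


def train_head(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on top of the frozen encoder features."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            # For the log loss, the gradient with respect to z is (p - y).
            w = [wi - lr * (p - y) * fi for wi, fi in zip(w, f)]
            b -= lr * (p - y)
    return w, b


def predict(x, w, b):
    """Classify a raw input by encoding it and applying the trained head."""
    f = pretrained_encoder(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Small target-domain dataset (made up): only the head is trained on it.
target_inputs = [[1.0, 0.2], [0.9, -0.1], [1.2, 0.4], [-1.0, 0.1], [-0.8, -0.3], [-1.1, 0.2]]
target_labels = [1, 1, 1, 0, 0, 0]

target_features = [pretrained_encoder(x) for x in target_inputs]  # encoder is never updated
w, b = train_head(target_features, target_labels)
```

Because only the head’s few parameters are learned, a handful of target examples is enough here. With a real ASR model you would typically also unfreeze some encoder layers once the head has settled.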

A great example of transfer learning in speech modelling is emotion recognition in elderly people. Emotion detection from speech can help identify the mental health state of elderly patients in care homes. In one study, researchers used a larger dataset of emotion recognition in adult voices and applied transfer learning to improve their model’s accuracy in detecting emotion in elderly speech, without having to employ feature engineering techniques.

Self-supervised learning and synthetic data

Self-supervised learning is a technique for learning patterns from unlabeled data by solving tasks derived from the data itself, such as predicting what comes next in a sequence or reconstructing masked stretches of speech. The representations learned from such tasks can be used to initialise or enhance an ASR model. A related way to stretch limited data is synthetic voice data, which is voice data that you create from text.
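The masked-reconstruction idea can be sketched without any labels at all. The example below hides random spans of feature frames and defines a loss over only the hidden positions; a model would be trained to drive that loss down. The names mask_spans and reconstruction_loss, the span settings, and the toy frames are assumptions for illustration, and real systems differ in both the masking scheme and the training objective.

```python
import random


def mask_spans(frames, span_len=3, num_spans=2, seed=0):
    """Zero out `num_spans` random spans of `span_len` frames each.

    Returns the corrupted sequence and the sorted masked indices.
    Spans may overlap, so fewer than span_len * num_spans frames can be masked.
    """
    rng = random.Random(seed)
    masked = set()
    for _ in range(num_spans):
        start = rng.randrange(0, len(frames) - span_len + 1)
        masked.update(range(start, start + span_len))
    corrupted = [
        [0.0] * len(frame) if i in masked else list(frame)
        for i, frame in enumerate(frames)
    ]
    return corrupted, sorted(masked)


def reconstruction_loss(predicted, original, masked_indices):
    """Mean squared error computed only over the masked frames."""
    total, count = 0.0, 0
    for i in masked_indices:
        for p, o in zip(predicted[i], original[i]):
            total += (p - o) ** 2
            count += 1
    return total / count


# Toy unlabeled "features" (made up): 12 frames of 4 values each, no transcripts needed.
frames = [[float(t + d) for d in range(4)] for t in range(12)]
corrupted, masked_indices = mask_spans(frames)
```

The training signal comes entirely from the audio itself: the original frames act as the targets, which is why this kind of pre-training can use large amounts of untranscribed speech.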

The paper “Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis” proposes a new method for personalising automatic speech recognition (ASR) models. The authors use synthetic data generated by controllable speech synthesis (in other words, creating synthetic voice data from large amounts of input text) to train ASR models to recognize the speech of specific individuals. The interesting discovery from this study was that the speech content, i.e. what was said, had a much greater effect on the model’s accuracy for speaker adaptation (adapting the model to a specific speaker) than the speech style, i.e. how it was said. This study has important implications for the development of ASR models that can be used in personalised settings, such as nursing homes or hospitals, or even in language learning classrooms, where individualised learning paths can help develop accent and pronunciation.

Active learning

Active learning is a technique that helps us select the most informative examples to label for training our models. Instead of labelling random or easy examples, we choose those that our model is most uncertain about. This way, we can improve our model faster with less labelling effort. Active learning can be very useful for ASR fine-tuning, especially when we have limited labelled data for our target domain or speaker. A study by Drugman et al. (2019) demonstrated how active learning can significantly enhance the performance of ASR models, for both the acoustic and the language model, by using confidence filtering as a criterion for selecting data. This approach can significantly reduce the word error rate in automated transcription, yielding more accurate transcripts.
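The selection step can be sketched in a few lines. The example below assumes each utterance in an unlabelled pool already has an ASR hypothesis and a confidence score between 0 and 1; it sends the least confident utterances to human annotators and keeps the most confident automatic transcripts as they are. The pool, the field names, the budget, and the threshold are all hypothetical, and this is not the exact procedure from the study cited above.

```python
def select_for_labelling(utterances, budget):
    """Pick the `budget` utterances the model is least confident about."""
    ranked = sorted(utterances, key=lambda u: u["confidence"])
    return [u["id"] for u in ranked[:budget]]


def filter_confident(utterances, threshold):
    """Keep automatic transcripts whose confidence meets `threshold`."""
    return [u["id"] for u in utterances if u["confidence"] >= threshold]


# Hypothetical pool: each entry has an ASR hypothesis and a confidence score in [0, 1].
pool = [
    {"id": "utt_01", "hypothesis": "turn the lights off", "confidence": 0.97},
    {"id": "utt_02", "hypothesis": "call the nurse please", "confidence": 0.88},
    {"id": "utt_03", "hypothesis": "i need my medic asian", "confidence": 0.41},
    {"id": "utt_04", "hypothesis": "what time is lunch", "confidence": 0.93},
    {"id": "utt_05", "hypothesis": "can you open the win dough", "confidence": 0.56},
]

to_label = select_for_labelling(pool, budget=2)         # ['utt_03', 'utt_05']
auto_labelled = filter_confident(pool, threshold=0.9)   # ['utt_01', 'utt_04']
```

The labelling budget goes to the utterances where a human correction teaches the model the most, while confident transcripts can be reused without manual work.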

Every technique above depends on quality transcripts of your audio data, both for training and for ultimately fine-tuning your chosen ASR model. The best way to build them is to work with an audio transcription platform that transcribes audio data in bulk and provides a tool that makes correcting and annotating transcripts straightforward.

The role of ASR platforms

Regardless of the amount of audio data you have available for your project, all instances of fine-tuning and improving an ASR model rely on accurate transcripts of the given data. While generating such transcripts is as easy as receiving an output text file from your model after giving it an input audio file, tweaking and editing those transcripts to perfection can easily become tricky work. Identifying the exact timestamps in the audio file that correspond to your transcript can become cumbersome and time-consuming.

That’s exactly why we’ve built Transcribe ASR. Transcribe ASR is a platform that automatically transcribes your audio files in bulk, with short processing times. Its intuitive interface lets you validate and correct transcripts to suit accent variations and domain-specific jargon, so you can focus on labelling only the instances that the ASR model is most uncertain about. This approach streamlines the labelling process, making active learning more efficient and effective.

Data scarcity can be a formidable challenge when fine-tuning ASR models. Employing these four strategies can help you navigate this tricky terrain and improve your ASR model’s performance, even when resources are limited. Alongside a large-volume dataset, accurate transcriptions are key to fine-tuning your model. We invite you to take advantage of the Transcribe ASR platform’s free trial, and of course, we’d love to hear your feedback.

Contact us at [email protected] to discuss how we can assist you.