We use voice commands almost every day, asking Google Assistant, Siri, and others to get things done. These voice assistants depend on accurate speech recognition to respond correctly to the user’s query. However, one of the major problems for these technologies is overlapping speech. Filtering out speech from interfering speakers is a hard challenge, and it is the one Google’s 2018 VoiceFilter system was built to address.
VoiceFilter has proven its mettle in improving the source-to-distortion ratio (SDR). However, on-device streaming speech recognition must operate under tight constraints on CPU, model size, and memory, among others. This is where the latest VoiceFilter-Lite from Google comes into play. It has shown significant improvement in on-device speech recognition when subjected to overlapping speech, and it does so with the assistance of the enrolled voice of the target speaker. VoiceFilter-Lite delivered a 25.1 percent improvement in word error rate (WER) on overlapping speech.
VoiceFilter-Lite: Improving on-device speech recognition
The new VoiceFilter-Lite operates directly on the input features of the speech recognition model. It enhances those features by suppressing the unwanted components of the speech that don’t belong to the target speaker, and it is optimized across a number of runtime operations and network topologies. The model is then quantized with TensorFlow Lite, and the resulting model size is around 2.2MB, which is suitable for on-device applications.
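As an illustrative sketch, this is roughly what quantizing a small enhancement network with TensorFlow Lite looks like. The network below is a hypothetical stand-in with assumed dimensions, not Google’s actual architecture:

```python
import tensorflow as tf

# Hypothetical stand-in network: the real VoiceFilter-Lite
# architecture and dimensions are not given in this article.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 128)),           # 100 frames of 128-dim filterbanks (assumed)
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dense(128, activation="sigmoid"),  # per-bin enhancement mask
])

# Post-training dynamic-range quantization stores weights as 8-bit
# integers, shrinking the model for on-device deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("voicefilter_lite_demo.tflite", "wb") as f:
    f.write(tflite_model)
```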
The model architecture takes filterbank features computed from the noisy speech, together with an embedding vector of the target speaker’s enrolled voice for identification. A neural network then produces enhanced filterbanks, and during training the difference between these enhanced filterbanks and the filterbanks of the clean speech is minimized to get the speech right.
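A minimal sketch of this idea, assuming a masking-style network; the function and variable names here are illustrative, not from Google’s implementation:

```python
import tensorflow as tf

def enhance_filterbanks(noisy_fbank, speaker_dvector, mask_net):
    """Sketch of VoiceFilter-style enhancement on filterbank features.

    noisy_fbank:     (time, n_mels) filterbanks of the noisy speech
    speaker_dvector: (embed_dim,) embedding of the target speaker's
                     enrolled voice
    mask_net:        any network mapping the concatenated input to a
                     per-bin mask in [0, 1] (hypothetical component)
    """
    frames = tf.shape(noisy_fbank)[0]
    # Repeat the speaker embedding for every frame and attach it to
    # the filterbank features so the network knows whom to keep.
    dvec = tf.tile(speaker_dvector[tf.newaxis, :], [frames, 1])
    net_input = tf.concat([noisy_fbank, dvec], axis=-1)[tf.newaxis]
    mask = mask_net(net_input)[0]   # (time, n_mels), values in [0, 1]
    # The mask passes target-speaker energy and attenuates the rest;
    # training then minimizes the gap to the clean-speech filterbanks.
    return mask * noisy_fbank
```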
The best thing about VoiceFilter-Lite is that it is plug-and-play: it can be trained and upgraded separately from the existing recognition model, which eliminates complexity when deploying it.
Two types of errors can occur when working on improving speech recognition. Under-suppression occurs when the system fails to remove noisy components from a speech signal. Over-suppression, on the other hand, occurs when the system removes so much of the signal that it fails to preserve its useful parts, resulting in lost words.
Speech recognition models are already fairly robust to under-suppression, so VoiceFilter-Lite focuses on over-suppression, addressing it with two approaches: an asymmetric loss that penalizes over-suppression more heavily during training, and an adaptive suppression strength at runtime.
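A rough sketch of the first idea, an asymmetric L2 loss over filterbank features; the penalty factor alpha below is an illustrative value, not the paper’s tuned setting:

```python
import tensorflow as tf

def asymmetric_l2_loss(clean_fbank, enhanced_fbank, alpha=10.0):
    """Asymmetric L2 loss on filterbank features.

    Over-suppression (enhanced energy falling below the clean energy,
    i.e. diff > 0) is scaled by alpha > 1, so the model learns to avoid
    deleting useful parts of the signal; under-suppression is penalized
    at the normal rate.
    """
    diff = clean_fbank - enhanced_fbank
    penalized = tf.where(diff > 0, alpha * diff, diff)
    return tf.reduce_mean(tf.square(penalized))
```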
VoiceFilter-Lite is promising for on-device speech applications. For now it is limited to the English language only, though the researchers intend to train the model on other languages later on. Training the system to improve speech recognition even further is the next goal.