Some melodies get stuck in your head, and few things are more irritating than not being able to get them out. These are called earworms, and the easiest cure is simply to sing or listen to the original song. If only it were that easy.
Shazam can identify a song once you place your phone near the audio/studio recording. But when you can't recall the song itself, all you are left with is a melody you can only hum. Google introduced its Hum to Search feature in October 2020, and it remains one of Google's best features. A hummed tune is, of course, very different from an actual studio recording: it lacks the background instruments, vocals, lyrics, and other musical traits that song recognition tools normally rely on to find a match in a huge database.
Google's "Hum to Search" is a fully machine-learned system that lets users find a song by humming its melody. The approach converts the hummed audio into a spectrogram and matches it directly against the polyphonic (original) recording, without the need for any intermediate recordings. This is what makes Hum to Search efficient enough to sift through an enormous database of songs to find a match.
Hum to Search: Background
A typical music recognition system takes an audio sample (sourced from a studio recording) and converts it into a spectrogram. The system then searches its database of spectrograms for a close match. Hummed melodies, however, lack much of this information, including lyrics and the background score.
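The spectrogram step can be sketched with a plain short-time Fourier transform. This is a simplified illustration (function name and parameters are my own, not Google's); production systems typically add a mel filterbank and log scaling on top:

```python
import numpy as np

def spectrogram(audio, frame_len=1024, hop=256):
    """Short-time Fourier transform magnitudes: one column per time frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Example: one second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec.mean(axis=1).argmax()
peak_hz = peak_bin * sr / 1024  # frequency resolution = sr / frame_len
```

Averaged over time, the brightest frequency bin of the spectrogram sits at roughly 440 Hz, which is what lets a matcher compare melodies as images.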
The model must extract the dominant melody from the hummed version and compare it with the corresponding audio recordings to find a match, while coping with noise, room reverberation, and the absence of background vocals and instruments. Google first demonstrated its song recognition capabilities with 'Now Playing' and 'Sound Search' in 2017, shifting from server-based recognition to an on-device recognition system built on deep neural networks.
The Machine Learning Setup
Naturally, the first step was to adapt existing music recognition models to identify hummed recordings. A neural network is fed pairs of hummed/sung and studio-recorded audio of the same song, and must produce embeddings that are then used for matching.
The model is trained so that the embeddings of the hummed and recorded versions of the same melody end up as close to each other as possible, while embeddings of different melodies end up far apart. In the process, the model learns to ignore the singing voice and the particular musical instruments used. The last leg of the journey is simply comparing the embedding of the hummed version against the embeddings of the recorded audio to find the match.
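The final matching step can be sketched as a nearest-neighbor search over precomputed song embeddings. Cosine similarity here stands in for whatever distance the production system actually uses, and all names are illustrative:

```python
import numpy as np

def best_match(query_emb, db_embs):
    """Return the index of the database embedding closest to the query.

    db_embs is a (num_songs, dim) array of precomputed song embeddings;
    cosine similarity is used as the matching score.
    """
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    return int(np.argmax(db @ q))

# Toy database of three "song" embeddings
rng = np.random.default_rng(0)
db = rng.normal(size=(3, 8))
# A hummed query: a noisy version of song 1's embedding
query = db[1] + 0.1 * rng.normal(size=8)
match = best_match(query, db)
```

Even with the noise a hummed rendition introduces, the query's embedding stays closest to the correct song, which is the whole point of training the embeddings this way.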
Training Data with Neural Network
The researchers at Google had to train the model to recognize hummed versions of songs. They started from a pre-existing dataset and augmented it using SPICE, a pitch extraction model, which let them alter the pitch of existing songs and generate melodies of discrete tones that mimic a hummed version. They later refined this step with a neural network capable of producing audio resembling whistled and hummed tunes.
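The idea of turning an extracted pitch track into "hum-like" training audio can be illustrated by rendering a sequence of pitches as bare sine tones, so that only the dominant melody survives with no lyrics or accompaniment. This sketch does not use SPICE itself; the function and note values are my own:

```python
import numpy as np

def synth_melody(pitches_hz, sr=16000, note_dur=0.25):
    """Render a pitch track as a plain sine-wave melody of discrete tones."""
    n = int(sr * note_dur)
    t = np.arange(n) / sr
    notes = [np.sin(2 * np.pi * f * t) for f in pitches_hz]
    return np.concatenate(notes)

# A five-note pitch track, e.g. as a pitch extractor might output
melody = synth_melody([262.0, 294.0, 330.0, 349.0, 392.0])
```

The result is a stripped-down audio clip carrying only melodic information, which is exactly the kind of signal a hummed query provides.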
Improvements with Machine Learning
Google researchers made a few improvements to the machine learning behind Hum to Search using a triplet loss function. Given an anchor embedding, a matching (positive) melody, and a non-matching (negative) melody, the loss pushes the positive closer to the anchor than the negative. Triplets that are either too easy or too hard contribute little to training, so the model effectively ignores them and concentrates on genuinely confusable melodies.
Google has also used augmentations, variations, and superpositions of the training data, which help the neural network recognize hummed or sung melodies with a higher accuracy rate. The model can already recognize more than half a million songs, and it is constantly being updated with more songs every day.
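The kind of training-data variation described above can be sketched with a few simple audio augmentations: random gain, a small time shift, and added noise. This is a hedged illustration; the actual augmentations in Google's pipeline are not specified in the source:

```python
import numpy as np

def augment(audio, rng, sr=16000):
    """Apply random gain, a time shift of up to 100 ms, and light noise."""
    gain = rng.uniform(0.5, 1.5)
    shift = int(rng.integers(0, sr // 10))
    noisy = np.roll(audio * gain, shift)
    noisy = noisy + 0.01 * rng.normal(size=audio.shape)
    return noisy

rng = np.random.default_rng(42)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
augmented = augment(clean, rng)
```

Training on many such perturbed copies of each melody pushes the embedding model to be invariant to exactly the kinds of distortion a real hummed query contains.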