Detecting Sounds with Deep Learning | by Hyeongchan Kim | Dec, 2020


ResNeSt for Audio

How to convert audio to images and analyze it with ResNeSt

Hyeongchan Kim

Created together with Dmytro Karabash, Maxim Korotkov, Tony Chen.

ResNeSt architecture (Image by ResNeSt paper)

Have you ever woken up without understanding what it was, but knowing for sure that some sound isn’t right?

Sound identification is one of the instincts that has kept human beings safe. Sounds play a significant role in our lives, from recognizing a predator nearby to being inspired by music, and from groups of human voices to the cry of a bird. Developing audio classifiers is therefore an important task.

Classifying the source of a sound is essential in many settings and is already widely used for various purposes. In music, there are classifiers for genre. Recently, similar systems have begun to be used to classify birdcalls, a task historically done by ornithologists. Their goal is to categorize birds, which is challenging because birdcalls are hard to pick out in the field or in noisy surroundings.

Recently, Deep Learning (DL) has become one of the most popular technologies for solving many tasks, thanks to its accuracy and to improvements in computing devices such as CPUs (Central Processing Units) and GPUs (Graphics Processing Units). The chart below shows how influential the deep learning market is and its expected size across software, hardware, and services.

Deep Learning market of U.S. from 2014 to 2025 (Image from

In this post, we take on the task of reading an audio file that contains zero to a few birdcalls and using deep learning to identify which birds are present. It is based on the Cornell Birdcall Identification Challenge, where we earned a silver medal (top 2%).

Countless articles explain how to load sound data, how to convert it to a spectrogram, and why that step is critical. Here’s an example of a spectrogram of the birdcall of an Alder Flycatcher and a photo of the bird, just in case you are curious.

log-mel spectrogram of birdcall of Alder Flycatcher (Image by Author)
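For readers who have not seen the steps behind a log-mel spectrogram, here is a minimal CPU sketch in plain NumPy. It is only an illustration, not the torchlibrosa implementation: the window size, hop length, number of mel bins, and the HTK-style mel formula are all assumptions we picked for the example, and a synthetic tone stands in for a real recording.

```python
import numpy as np


def hz_to_mel(hz):
    # HTK-style mel scale (one of several common conventions)
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels triangles need n_mels + 2 edge points, evenly spaced in mel
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_spectrogram(wave, sr=32000, n_fft=1024, hop=320, n_mels=64):
    """Return a (n_mels, n_frames) log-mel spectrogram in decibel-like units."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T  # (n_frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T


# One second of a 2 kHz tone stands in for a real recording
sr = 32000
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 2000.0 * t), sr=sr)
```

The result is a 2-D array with mel bins on one axis and time frames on the other, which is what lets us treat audio as an image.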

The speed of data processing is one of the keys to employing a deep learning model. Despite the growth in computing power, audio processing is still expensive on a CPU. If we choose a better computing resource such as a GPU, processing can be about ten to one hundred times faster! We will show how to compute spectrograms quickly with a library called torchlibrosa, which lets us process spectrograms on a GPU.

torchlibrosa is a Python library that implements some audio processing functions in PyTorch, so they can use GPU resources. Here’s an example of extracting spectrogram features using torchlibrosa.

from torchlibrosa.stft import Spectrogram

spectrogram_extractor = Spectrogram(

We can load audio data via the librosa library, one of the popular Python audio processing libraries.

import librosa
import torch

# get raw audio data
example, _ = librosa.load('example.wav', sr=32000, mono=True)

# add a batch dimension, move the audio to the GPU, and extract the spectrogram
raw_audio = torch.Tensor(example).unsqueeze(0).cuda()
spectrogram = spectrogram_extractor(raw_audio)

Adopting torchlibrosa lets us process audio data on the GPU. You may wonder how much faster the GPU is than the CPU. Here’s a benchmark of processing speed on the two devices. We picked an audio clip from the publicly available Cornell Birdcall Identification Kaggle Challenge data and compared how long it takes on CPU and GPU. We ran the test on Colab so the result can be reproduced; computing a log-mel spectrogram from about 5 minutes of audio is about 15 times faster on the GPU than on the CPU.

Processing time between CPU (Intel Xeon 2.20 GHz) and GPU (NVIDIA T4). (Image by Author)
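If you want to run a similar comparison yourself, a small timing helper along these lines may be a useful starting point. It is a sketch rather than the exact script behind the chart: `benchmark` and `speedup` are hypothetical helper names, and the `sync` argument is our assumption about how you would pass something like `torch.cuda.synchronize`, which matters because CUDA kernels launch asynchronously.

```python
import statistics
import time


def benchmark(fn, *args, repeats=5, sync=None):
    """Return the median wall-clock time in seconds of fn(*args).

    `sync` is an optional zero-argument callable run around each clock read.
    For GPU work it would typically be torch.cuda.synchronize.
    """
    fn(*args)  # warm-up run, not timed
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the second measurement is than the first."""
    return cpu_seconds / gpu_seconds


# Stand-in workload; replace it with the CPU and GPU spectrogram calls
elapsed = benchmark(sum, range(100_000))
```

Taking the median over several runs after a warm-up call keeps one-off costs, such as library initialization, out of the comparison.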

Deep Learning has shown strong performance in the audio domain. It can capture numerous patterns of target classes in time-series data. More importantly, the environment and the data matter for birdcalls. Environments such as fields or mountainsides produce a lot of noise that interferes with the birdcalls, and several birds can appear in one long recording. Consequently, we need to build a noise-robust, multi-label audio classifier.
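To make "multi-label" concrete, here is a minimal NumPy sketch of how such targets and predictions are commonly handled. Each class gets its own sigmoid and its own yes/no decision, so a clip can contain no birds, one bird, or several. The three class codes, the threshold, and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical label set; the challenge itself covers many more species
CLASSES = ["aldfly", "ameavo", "amebit"]


def multi_hot(labels, classes=CLASSES):
    """Encode a collection of species names as a 0/1 vector."""
    target = np.zeros(len(classes), dtype=np.float32)
    for name in labels:
        target[classes.index(name)] = 1.0
    return target


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def binary_cross_entropy(logits, target, eps=1e-7):
    """Mean BCE over classes; each class is an independent yes/no decision."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))


def predict(logits, threshold=0.5, classes=CLASSES):
    """Return every class whose probability clears the threshold."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    return [c for c, p in zip(classes, probs) if p >= threshold]


target = multi_hot(["aldfly", "ameavo"])   # two birds in one clip
logits = np.array([3.0, 2.0, -4.0])        # made-up model outputs
loss = binary_cross_entropy(logits, target)
found = predict(logits)
```

Unlike a softmax classifier, nothing here forces the probabilities to sum to one, which is what allows an empty prediction for a clip with only background noise.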

We will present the deep learning architecture used by our team (Dragonsong) in the Cornell Birdcall Identification Kaggle Challenge.

We built a novel audio classifier architecture that effectively captures time-series features by combining CNN, RNN, and attention modules. Here is a brief diagram of the architecture we used in the challenge.

Our architecture of birdcall classifier (Image by Author)

We convert raw audio to a log-mel spectrogram as the input to our architecture and pass it through the ResNeSt50 backbone, an image classification architecture. Afterward, we feed the resulting features, which contain both spatial and temporal information, to the RoI (Region of Interest) pooling and bi-GRU layers. These layers capture time-wise information while reducing the feature dimension; we considered extracting temporal features pivotal for classifying multiple birdcalls in long audio. Finally, we pass the data into the attention module, which scores each time step to find out at which time steps the birds are present.
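The last step, scoring each time step with attention, can be sketched in a few lines of NumPy. This is an attention-pooling head in the style often used for sound event detection, written under our own assumptions: random matrices stand in for the learned layers, random features stand in for the bi-GRU outputs, and the exact module in our model may differ in its details.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention_pool(features, w_att, w_cla):
    """Pool frame-wise features (T, D) into clip-level scores (C,).

    w_att and w_cla are (D, C) matrices standing in for learned layers.
    """
    attention = softmax(features @ w_att, axis=0)    # (T, C), sums to 1 over time
    framewise = sigmoid(features @ w_cla)            # (T, C), per-step probability
    clipwise = (attention * framewise).sum(axis=0)   # (C,), attention-weighted average
    return clipwise, framewise, attention


rng = np.random.default_rng(0)
T, D, C = 20, 8, 3                    # time steps, feature size, classes (arbitrary)
features = rng.normal(size=(T, D))    # stand-in for the bi-GRU outputs
w_att = rng.normal(size=(D, C))
w_cla = rng.normal(size=(D, C))
clipwise, framewise, attention = attention_pool(features, w_att, w_cla)
best_step = framewise.argmax(axis=0)  # most confident time step per class
```

The clip-level score answers "is this bird in the recording?", while the frame-wise scores and attention weights indicate where in time the call occurs.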

Not only the architecture that represents the data but also how the model is trained (a.k.a. the training recipe) is vital. To classify audio that contains various birdcalls against a noisy background, we mix several birdcalls into one clip and add noise such as white noise. Also, to cover the many variations of birdcalls, we augment the pitch and mask some audio frames using SpecAugment.

Here is a short example (a mix of Alder Flycatcher and American Avocet) with our augmentations applied.
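In code, the mixing, noise, and masking steps described above might look roughly like the following NumPy sketch. It is a simplified illustration under our own assumptions, not our training pipeline: the function names, the mixing weight, the signal-to-noise ratio, and the mask sizes are hypothetical, pitch augmentation is left out, and zero-filled masks are only one common SpecAugment variant.

```python
import numpy as np


def mix_calls(wave_a, label_a, wave_b, label_b, weight=0.5):
    """Blend two equal-length waveforms; the target is the union of labels."""
    mixed = weight * wave_a + (1.0 - weight) * wave_b
    return mixed, np.maximum(label_a, label_b)


def add_white_noise(wave, snr_db, rng):
    """Add Gaussian white noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)


def spec_mask(spec, max_freq=8, max_time=16, rng=None):
    """SpecAugment-style masking: zero one frequency band and one time band."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    f = rng.integers(0, max_freq + 1)        # band height, possibly zero
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time + 1)        # band width, possibly zero
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out


rng = np.random.default_rng(0)
sr = 32000
t = np.arange(sr) / sr
call_a = np.sin(2 * np.pi * 3000.0 * t)   # synthetic stand-ins for two birdcalls
call_b = np.sin(2 * np.pi * 5000.0 * t)
mixed, target = mix_calls(call_a, np.array([1.0, 0.0]), call_b, np.array([0.0, 1.0]))
noisy = add_white_noise(mixed, snr_db=10.0, rng=rng)
masked = spec_mask(np.ones((64, 100)), rng=rng)
```

Because the mixed clip keeps the labels of both sources, the model learns to report every bird it hears instead of only the loudest one.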

Have you ever woken up without understanding what it was, but knowing for sure that some sound isn’t right? With great algorithms, machines will be able to identify what it was and help you sleep better. Stay tuned!

Originally published at YourDataBlog.
