Seeing is Believing: Converting Audio Data into Images | by Tony Chen | Dec, 2020


Tony Chen

Created together with Dmytro Karabash, Maxim Korotkov, and Hyeongchan Kim.

Photo by Franco Antonio Giovanella on Unsplash

Close your eyes and listen to the sounds around you. Whether you are in a crowded office, a cozy home, or the open space of nature, you can recognize your environment from what you hear. Hearing is one of the five major human senses, so audio plays a significant role in our lives. Organizing and extracting value from audio data with deep learning is therefore a crucial step for AI to understand our world. An important task in sound processing is enabling computers to distinguish one sound from another. This capability supports applications ranging from detecting metal wear in power plants to monitoring and optimizing cars' fuel efficiency. In this post, we use bird sound identification as an example: we detect the locations of bird calls in recordings made in natural settings and classify the species. By converting audio data to image data and applying computer vision models, we earned a silver medal (top 2%) in the Kaggle Cornell Birdcall Identification challenge.

When a doctor diagnoses heart problems, they can either listen directly to the patient's heartbeat or look at the patient's ECG, a diagram that describes the heartbeat. The former usually takes longer, because the doctor needs time to listen, and is harder, because remembering what you heard is difficult. In contrast, the visual form of an ECG lets a doctor absorb spatial information at a glance, which speeds up the task.

The same rationale applies to our sound detection task. Here are spectrograms of four bird species. You can listen to the original audio clips here. Even the human eye can instantly see the differences between species from the colors and shapes.

Stepping through audio waves over time takes more computational resources, and a 2-dimensional image exposes structure across both time and frequency that is harder to extract from a 1-dimensional wave. In addition, the recent rapid development of computer vision, especially with the help of Convolutional Neural Networks (CNNs), can significantly benefit our approach if we treat audio as images, as we (along with pretty much everyone) did in the competition.

The specific image representation that we use is called a spectrogram: a visual representation of the spectrum of frequencies of a signal as it varies with time.

Sounds can be represented as waves, and waves have two important properties: frequency and amplitude, as illustrated in the picture below. The frequency determines the pitch of the sound, and the amplitude determines how loud it is.

Sound Wave Parameters Explanation (Image by Author)
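
To make these two properties concrete, here is a minimal sketch using NumPy. The helper names `make_tone` and `dominant_frequency` are ours, introduced only for illustration, and the 32 kHz sample rate is an assumption. The sketch synthesizes two sine waves, then reads back the frequency of the strongest component and the peak amplitude of each.

```python
import numpy as np

def make_tone(freq_hz, amplitude, sr=32000, duration=1.0):
    """Return a sine wave with the given frequency (Hz) and peak amplitude."""
    t = np.arange(int(sr * duration)) / sr
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

def dominant_frequency(wave, sr=32000):
    """Return the frequency (Hz) of the strongest component in the wave."""
    spectrum = np.abs(np.fft.rfft(wave))
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    return float(freqs[np.argmax(spectrum)])

quiet_low = make_tone(440.0, 0.2)    # lower pitch, quieter
loud_high = make_tone(2000.0, 0.8)   # higher pitch, louder

for name, wave in [("quiet_low", quiet_low), ("loud_high", loud_high)]:
    print(name, dominant_frequency(wave), np.max(np.abs(wave)))
```

The first number printed for each tone tracks its pitch, and the second tracks its loudness.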

In a spectrogram of an audio clip, the horizontal direction represents time, and the vertical direction represents frequency. The amplitude of the sound at a particular frequency and a particular point in time is represented by the color of the point at the corresponding x-y coordinates.

Spectrogram Explanation (Image by Author)
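
As a rough illustration of how such a picture is computed, the sketch below builds a power spectrogram with a short-time Fourier transform in plain NumPy. The function name `spectrogram`, the window size `n_fft`, and the step `hop` are illustrative choices, not the settings we used in the competition. The test signal is a 1 kHz tone followed by a 4 kHz tone, so the strongest row of the spectrogram should move upward over time.

```python
import numpy as np

def spectrogram(wave, n_fft=1024, hop=512):
    """Power spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # The FFT of each windowed frame gives amplitude per frequency; square for power
    return (np.abs(np.fft.rfft(frames, axis=1)) ** 2).T

sr = 32000
t = np.arange(sr) / sr
# First half second: 1 kHz tone; second half second: 4 kHz tone
wave = np.where(t < 0.5,
                np.sin(2 * np.pi * 1000 * t),
                np.sin(2 * np.pi * 4000 * t))

spec = spectrogram(wave)
freqs = np.fft.rfftfreq(1024, d=1.0 / sr)     # frequency (Hz) of each row
first_peak = freqs[np.argmax(spec[:, 0])]     # strongest frequency at the start
last_peak = freqs[np.argmax(spec[:, -1])]     # strongest frequency at the end
print(spec.shape, first_peak, last_peak)
```

Each column of `spec` is one time slice and each row is one frequency, which is exactly the layout of the 2D picture described above.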

To see more intuitively how frequencies are embodied in spectrograms, here's a 3D visualization that shows the amplitude as an extra dimension. Again, the x-axis is time and the y-axis is frequency. The z-axis is the amplitude of the sound at the frequency given by the y-coordinate at the moment given by the x-coordinate. As the z-value increases, the color changes from blue to red, which produces the colors we saw in the previous 2D spectrogram.

3D Spectrogram Visualization (Image by Author)

Spectrograms are helpful because they extract exactly the information we need: frequencies, the features that shape the sound we hear. Different bird species, and in fact all objects that produce sound, have their own characteristic frequency ranges, which is why they sound different to our ears. Our model then mainly needs to learn to distinguish between frequency patterns to classify sounds well.

However, human ears do not perceive differences equally across all frequency ranges. As frequencies increase, it becomes harder for us to tell them apart. To better emulate human hearing with deep learning models, we measure frequencies on the mel scale. On the mel scale, equal distances between frequencies sound equally different to the human ear. The mel scale converts a frequency in Hertz (f) to mel (m) with the following equation:

m = 2595 * log10(1 + f/700)

A mel scale spectrogram is simply a spectrogram with frequencies measured in mel.
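
The conversion is easy to try out by hand. The sketch below assumes the logarithm in the formula is base 10, which is the usual convention for the constant 2595; the function names `hz_to_mel` and `mel_to_hz` are ours. Note that librosa's default mel filter bank uses a slightly different variant of the scale, so treat this as an illustration of the formula rather than of librosa's internals.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hertz to mel with m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Invert the formula above: f = 700 * (10 ** (m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The same 100 Hz gap covers far more mel at low frequencies than at high ones
low_step = hz_to_mel(600) - hz_to_mel(500)
high_step = hz_to_mel(8100) - hz_to_mel(8000)
print(hz_to_mel(1000), low_step, high_step)
```

With this formula, 1000 Hz maps to roughly 1000 mel, and a 100 Hz step near 500 Hz spans several times more mel than the same step near 8000 Hz, which matches the observation that high frequencies are harder for us to tell apart.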

To create a mel spectrogram from audio waves, we will use the librosa library.

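import numpy as np
import librosa
# Load the clip as mono audio, resampled to 32 kHz
y, sr = librosa.load('img-tony/amered.wav', sr=32000, mono=True)
# Power mel spectrogram with 128 mel bands (newer librosa requires y as a keyword)
melspec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
# Convert power to decibels and store as 32-bit floats
melspec = librosa.power_to_db(melspec).astype(np.float32)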

Here y denotes the raw wave data, sr denotes the sample rate of the audio, and n_mels sets the number of mel bands in the generated spectrogram. When calling melspectrogram, you can also set the fmin and fmax parameters to bound the frequency range. The power_to_db method then converts the mel spectrogram from a power (amplitude squared) scale to a decibel scale.
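
For intuition about the decibel step, here is a simplified NumPy sketch of a power-to-decibel conversion. The name `power_to_db_sketch` and its defaults are our assumptions for illustration; it leaves out the dynamic range clipping that librosa's power_to_db applies by default, so it is not a drop-in replacement.

```python
import numpy as np

def power_to_db_sketch(power, ref=1.0, amin=1e-10):
    """Convert power values to decibels relative to `ref`: 10 * log10(power / ref)."""
    # Clamp tiny values so the logarithm never sees zero
    power = np.maximum(np.asarray(power, dtype=np.float64), amin)
    return 10.0 * np.log10(power / ref)

example_db = power_to_db_sketch([1.0, 10.0, 100.0, 0.001])
print(example_db)
```

Every tenfold increase in power adds 10 dB, so the huge range of raw power values is compressed into a scale that is easier for both people and models to work with.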

To visualize the generated spectrogram, run

import matplotlib.pyplot as plt
import librosa.display
librosa.display.specshow(melspec, x_axis='time', y_axis='mel', sr=sr, fmax=16000)
plt.show()

Alternatively, if you are using a GPU, you can accelerate mel spectrogram generation with the torchlibrosa library.

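import torch
from torchlibrosa.stft import Spectrogram, LogmelFilterBank
spectrogram_extractor = Spectrogram()
logmel_extractor = LogmelFilterBank()
# torchlibrosa expects a float tensor of shape (batch_size, samples)
x = torch.from_numpy(y).float().unsqueeze(0)
x = spectrogram_extractor(x)  # (batch_size, 1, time_steps, freq_bins)
x = logmel_extractor(x)       # (batch_size, 1, time_steps, mel_bins)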
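
Once a decibel-scaled mel spectrogram is available from either library, it still has to be shaped like an image before a typical computer vision model can consume it. The sketch below shows one common way to do that, under our own assumptions: min-max scaling to 8-bit values and repeating the result over three channels. The function name `spec_to_image` and the random stand-in data are hypothetical, and this is not necessarily the preprocessing we used in the competition.

```python
import numpy as np

def spec_to_image(spec_db, eps=1e-6):
    """Min-max scale a dB spectrogram to uint8 and repeat it over 3 channels."""
    spec_db = np.asarray(spec_db, dtype=np.float32)
    lo, hi = spec_db.min(), spec_db.max()
    scaled = (spec_db - lo) / max(hi - lo, eps)     # values in [0, 1]
    gray = (scaled * 255.0).round().astype(np.uint8)
    # Flip vertically so low frequencies sit at the bottom, as in the plots
    gray = gray[::-1]
    return np.stack([gray, gray, gray], axis=-1)    # (height, width, 3)

# Stand-in for a real mel spectrogram: 128 mel bands by 313 time frames
rng = np.random.default_rng(0)
fake_melspec = rng.uniform(-80.0, 0.0, size=(128, 313)).astype(np.float32)

image = spec_to_image(fake_melspec)
print(image.shape, image.dtype, image.min(), image.max())
```

The resulting array has the same layout as an RGB picture, so it can be saved to disk or passed to an image model that expects three channels.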

In conclusion, we can take advantage of recent developments in computer vision for audio-related tasks by converting audio clips into image data. We do so with spectrograms, which show the frequency, amplitude, and time information of audio data in a single image. Using the mel scale and mel spectrograms helps computers emulate how human hearing distinguishes sounds of different frequencies. To generate spectrograms in Python, we can use the librosa library, or torchlibrosa for GPU acceleration. Treating audio-related tasks this way lets us build efficient deep learning models that identify and classify sounds, much like doctors diagnose heart-related diseases with an ECG.

Originally published at YourDataBlog.
