Hguimaraes random tech notes

Music genre classification: A Deep Learning approach

tl;dr: A Deep Learning approach to music genre classification using a simple Convolutional Neural Network (CNN), based on my undergraduate thesis. You can find the Jupyter Notebook with all the code in the nbs folder here: https://github.com/Hguimaraes/gtzan.keras

This is the second post on music genre classification using Machine Learning. The previous post, which covers a classical approach and some useful definitions, is here: http://hguimaraes.me/journal/classical-ml-mir.html


In the previous post we explained why musical genres are difficult to classify: the labels are subjective and depend on complex social interactions.

Our objective is to use deep learning, specifically CNNs, for the task of music genre classification. This kind of network is a good fit for data with a known grid-like topology and has been successfully applied to computer vision, speech recognition and other tasks.

Reading the data

The first step of our pipeline is to import all the data into memory. The function below walks the GTZAN folder structure and saves each file path together with its genre.

import os
from sklearn.model_selection import train_test_split

def read_data(src_dir, genres, song_samples):
    # Lists holding the file names and the genre of each file
    arr_fn = []
    arr_genres = []

    # Get file list from the folders (one folder per genre)
    for genre_name, genre_id in genres.items():
        folder = os.path.join(src_dir, genre_name)
        for root, subdirs, files in os.walk(folder):
            for file in files:
                file_name = os.path.join(root, file)

                # Save the file name and the genre
                arr_fn.append(file_name)
                arr_genres.append(genre_id)

    # Split into train and test (by song, stratified by genre)
    X_train, X_test, y_train, y_test = train_test_split(
        arr_fn, arr_genres, test_size=0.3, random_state=42, stratify=arr_genres
    )

    # Split into small segments and convert to spectrogram
    X_train, y_train = split_convert(X_train, y_train)
    X_test, y_test = split_convert(X_test, y_test)

    return X_train, X_test, y_train, y_test

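In this function we also split the dataset into train and test sets before the next step. Splitting at the song level first means that segments of the same song never end up in both sets, which avoids leakage.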
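To illustrate why the split happens before the segmentation, here is a minimal sketch on toy data. The song names and counts are hypothetical, and a plain random split stands in for the stratified split used in the post.

```python
import numpy as np

# Hypothetical toy data: 6 songs, each later cut into 4 segments.
songs = [f"song_{i}" for i in range(6)]
n_segments = 4

# Split at the song level first (plain random split, not stratified).
rng = np.random.default_rng(42)
order = rng.permutation(len(songs))
n_test = 2
test_songs = {songs[i] for i in order[:n_test]}
train_songs = {songs[i] for i in order[n_test:]}

def segment_ids(song_set):
    # Each segment is identified by (song, position inside the song)
    return [(s, k) for s in sorted(song_set) for k in range(n_segments)]

train_segments = segment_ids(train_songs)
test_segments = segment_ids(test_songs)

# No song contributes segments to both sets
train_sources = {s for s, _ in train_segments}
test_sources = {s for s, _ in test_segments}
print(train_sources.isdisjoint(test_sources))
```

Had the segments been created first and shuffled afterwards, overlapping windows of the same recording could land on both sides of the split.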

The next step is to read the raw (PCM-based) audio. Each signal is truncated to a constant length, song_samples, that covers almost 30 seconds of audio.

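import librosa
import numpy as np
from tensorflow.keras.utils import to_categorical

def split_convert(X, y):
    arr_specs, arr_genres = [], []
    # Convert to spectrograms and split into small windows
    for fn, genre in zip(X, y):
        # song_samples is defined at module level
        signal, sr = librosa.load(fn)
        signal = signal[:song_samples]

        # Split the song into overlapping segments, each labelled with the genre
        signals, seg_genres = splitsongs(signal, genre)

        # Convert to "spec" representation
        specs = to_melspectrogram(signals)

        # Save the segments and their labels
        arr_genres.extend(seg_genres)
        arr_specs.extend(specs)

    return np.array(arr_specs), to_categorical(arr_genres)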

As you can see, the previous function has two steps: split the original audio into smaller segments and convert each segment to a melspectrogram.

# Split a song into multiple shorter songs using overlapping windows
def splitsongs(X, y, window = 0.05, overlap = 0.5):
    # Empty lists to hold our results
    temp_X = []
    temp_y = []

    # Get the input song array size
    xshape = X.shape[0]
    chunk = int(xshape*window)
    offset = int(chunk*(1.-overlap))

    # Split the song and create new ones on windows
    spsong = [X[i:i+chunk] for i in range(0, xshape - chunk + offset, offset)]
    for s in spsong:
        # Skip a trailing window that is shorter than the others
        if s.shape[0] != chunk:
            continue

        temp_X.append(s)
        temp_y.append(y)

    return np.array(temp_X), np.array(temp_y)

The idea behind splitting into smaller segments is to increase the amount of training data available to the Deep Neural Network (which also helps against overfitting). Each song is split into windows that each cover 5% of the original audio (about 1.5 seconds). As the window slides to the right, each slice overlaps the previous one by 50%.
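The window arithmetic can be checked with a few lines of Python. This is a sketch under assumptions: 22050 Hz is the default sample rate of librosa.load, while the value 660000 for song_samples is a hypothetical choice that gives almost 30 seconds at that rate.

```python
# Assumed values for illustration (see the note above).
sr = 22050
song_samples = 660000
window, overlap = 0.05, 0.5

chunk = int(song_samples * window)       # samples per segment
offset = int(chunk * (1.0 - overlap))    # distance between segment starts
starts = list(range(0, song_samples - chunk + offset, offset))

n_segments = len(starts)
segment_seconds = chunk / sr

print(chunk, offset, n_segments, round(segment_seconds, 2))
```

Under these assumptions each song yields 39 segments of roughly 1.5 seconds, and the last window ends exactly at the end of the signal.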


You have probably heard that a spectrogram is a time-frequency representation of a signal, obtained through the short-time Fourier transform (STFT), and that it is usually more compact than the raw audio format. The mel scale is based on studies of how humans perceive sound. The idea is to apply a set of non-linear transformations to the frequency axis of the spectrogram to obtain the melspectrogram. We use the Librosa package for this task as well.
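As a rough illustration of the non-linearity, here is one common closed form for the mel scale (the HTK-style formula). This is only a sketch: librosa's default mel filter bank uses a slightly different, Slaney-style variant, and the 0 to 8000 Hz range below is an arbitrary choice.

```python
import numpy as np

def hz_to_mel(freq_hz):
    # HTK-style formula: roughly linear at low frequencies, logarithmic above
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Six band edges equally spaced in mel between 0 Hz and 8000 Hz
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 6))
widths_hz = np.diff(edges_hz)

print(np.round(edges_hz))
print(np.round(widths_hz))
```

Bands that are equally wide in mel get progressively wider in Hz, so low frequencies are resolved more finely than high ones.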

import librosa
import numpy as np

# Convert a list of song segments to a np array of melspectrograms
def to_melspectrogram(songs, n_fft=1024, hop_length=256):
    # Transformation function; the trailing axis is the channel expected by Conv2D
    melspec = lambda x: librosa.feature.melspectrogram(y=x, n_fft=n_fft,
        hop_length=hop_length, n_mels=128)[:,:,np.newaxis]

    # Map each segment to its melspectrogram (no log/dB conversion is applied here)
    tsongs = map(melspec, songs)
    return np.array(list(tsongs))
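To get a feel for the size of each network input, the sketch below computes the shape of one melspectrogram. It assumes librosa's default centered framing, where the number of frames is 1 + n_samples // hop_length, and a hypothetical segment length of 33000 samples (5% of an assumed 660000-sample clip).

```python
def melspec_shape(n_samples, hop_length=256, n_mels=128):
    # With centered framing there is one frame per hop, plus one
    n_frames = 1 + n_samples // hop_length
    # (mel bands, time frames, channels)
    return (n_mels, n_frames, 1)

shape = melspec_shape(33000)
print(shape)
```

Under these assumptions every segment becomes a 128 x 129 single-channel "image".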

Data Augmentation

Training Deep Neural Networks can be tricky when you have little data. To help avoid overfitting, we apply some data augmentation to our melspectrograms. The transformations implemented are horizontal flip (as in image processing) and Cutout/Random Erasing. The idea of Cutout is to randomly erase some frequency bands and some small windows of time while the CNN is learning. That way the CNN sees slightly different data at each epoch.

from copy import copy

import numpy as np
from tensorflow.keras.utils import Sequence

class GTZANGenerator(Sequence):
    def __init__(self, X, y, batch_size=64, is_test = False):
        super().__init__()
        self.X = X
        self.y = y
        self.batch_size = batch_size
        self.is_test = is_test

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.X)/self.batch_size))

    def __getitem__(self, index):
        # Get batch indexes
        signals = self.X[index*self.batch_size:(index+1)*self.batch_size]

        # Apply data augmentation (training only)
        if not self.is_test:
            signals = self.__augment(signals)
        return signals, self.y[index*self.batch_size:(index+1)*self.batch_size]

    def __augment(self, signals, hor_flip = 0.5, random_cutout = 0.5):
        spectrograms =  []
        for s in signals:
            signal = copy(s)
            # Perform horizontal flip (reverse the time axis)
            if np.random.rand() < hor_flip:
                signal = np.flip(signal, 1)

            # Perform random cutout of some frequency/time
            if np.random.rand() < random_cutout:
                lines = np.random.randint(signal.shape[0], size=3)
                cols = np.random.randint(signal.shape[1], size=4)
                signal[lines, :, :] = -80 # dB
                signal[:, cols, :] = -80 # dB

            spectrograms.append(signal)
        return np.array(spectrograms)

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.X))
        return None
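To see the two transformations in isolation, here is a standalone NumPy sketch applied to a single dummy spectrogram. The function name, its parameters and the fill value are illustrative assumptions, not part of the generator above.

```python
import numpy as np

def augment_one(spec, rng, p_flip=0.5, p_cutout=0.5,
                n_rows=3, n_cols=4, fill=-80.0):
    # Work on a copy so the stored training data is never modified
    out = spec.copy()

    # Horizontal flip: reverse the time axis
    if rng.random() < p_flip:
        out = np.flip(out, axis=1)

    # Cutout: blank a few random frequency rows and time columns
    if rng.random() < p_cutout:
        rows = rng.integers(out.shape[0], size=n_rows)
        cols = rng.integers(out.shape[1], size=n_cols)
        out[rows, :, :] = fill
        out[:, cols, :] = fill
    return out

rng = np.random.default_rng(0)
spec = np.zeros((128, 129, 1))

# Force both transformations so their effect is visible
augmented = augment_one(spec, rng, p_flip=1.0, p_cutout=1.0)
masked_rows = int((augmented == -80.0).all(axis=(1, 2)).sum())
masked_cols = int((augmented == -80.0).all(axis=(0, 2)).sum())
print(augmented.shape, masked_rows, masked_cols)
```

Because the row and column indices are drawn with replacement, fewer than three rows or four columns may be erased in a given call.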

Model definition

Each convolutional block (an abstraction over a set of layers) is composed of a convolution layer, an activation function, a pooling operation and a regularization operation (Dropout in our case).

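from tensorflow.keras.layers import (Activation, Conv2D, Dense, Dropout,
                                     Flatten, Input, MaxPooling2D)
from tensorflow.keras.models import Model

def conv_block(x, n_filters, pool_size=(2, 2)):
    x = Conv2D(n_filters, (3, 3), strides=(1, 1), padding='same')(x)
    x = Activation('relu')(x)
    x = MaxPooling2D(pool_size=pool_size, strides=pool_size)(x)
    x = Dropout(0.25)(x)
    return x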

Here we define the entire model, appending fully connected layers at the end to perform the classification with a softmax output.

def create_model(input_shape, num_genres):
    inpt = Input(shape=input_shape)
    x = conv_block(inpt, 16)
    x = conv_block(x, 32)
    x = conv_block(x, 64)
    x = conv_block(x, 128)
    x = conv_block(x, 256)
    # Flatten the feature maps and classify with an MLP
    x = Flatten()(x)
    x = Dropout(0.5)(x)
    # Any further arguments to the two Dense layers are cut off in this listing
    x = Dense(512, activation='relu')(x)
    x = Dropout(0.25)(x)
    predictions = Dense(num_genres, activation='softmax')(x)
    model = Model(inputs=inpt, outputs=predictions)
    return model
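The effect of the five blocks on the feature-map size can be traced without building the network. This is a sketch that assumes an input of 128 x 129 (an illustrative melspectrogram size): the 'same'-padded convolutions keep the spatial size and each 2 x 2 max pooling halves it, rounding down.

```python
def trace_shapes(height, width, filters=(16, 32, 64, 128, 256), pool=2):
    shapes = []
    for n_filters in filters:
        # Conv2D with padding='same' keeps height and width;
        # MaxPooling2D with default 'valid' padding floors the division
        height, width = height // pool, width // pool
        shapes.append((height, width, n_filters))
    return shapes

shapes = trace_shapes(128, 129)
flat_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

for s in shapes:
    print(s)
print(flat_units)
```

Under this assumption the last block outputs 4 x 4 x 256 feature maps, so Flatten hands 4096 values to the dense layers.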

Training Procedure

We used the Adam optimization algorithm for this task, together with a scheduler that reduces the learning rate when the validation loss stops decreasing.


reduceLROnPlat = ReduceLROnPlateau(
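To make the scheduling rule concrete, here is a plain-Python sketch of the idea behind a reduce-on-plateau scheduler. It is a simplified stand-in, not the Keras implementation: the class name and all values (factor, patience, min_lr, the loss sequence) are assumptions for illustration, and Keras's ReduceLROnPlateau additionally supports options such as cooldown and min_delta.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` after `patience` epochs
    without an improvement of the validation loss."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:
            # Improvement: remember it and reset the counter
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # Plateau: shrink the learning rate, but not below min_lr
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

scheduler = PlateauScheduler(lr=0.1)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.8]]
print(history)
```

In this toy run the loss stalls at 0.9 for two epochs, so the learning rate is halved once and then kept when the loss improves again.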

Using a batch_size of 128 elements and 150 epochs, we are ready to start our learning procedure!

batch_size = 128
hist = model.fit_generator(

Majority Voting

After training the network, we perform one more operation: majority voting. Since we split each song into small pieces, we classify every piece and take the class predicted most often as the result for the whole song.

def majority_vote(scores):
    values, counts = np.unique(scores,return_counts=True)
    ind = np.argmax(counts)
    return values[ind]
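As a small usage sketch, the vote can be applied row by row to a matrix of per-segment predictions. The class ids and the 3 songs x 5 segments layout below are made up for illustration.

```python
import numpy as np

def majority_vote(scores):
    values, counts = np.unique(scores, return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical predicted class ids: one row per song, one column per segment
segment_preds = np.array([
    [3, 3, 5, 3, 1],
    [0, 2, 2, 2, 0],
    [7, 7, 4, 4, 9],   # tie between classes 4 and 7
])

song_preds = np.array([majority_vote(row) for row in segment_preds])
print(song_preds)
```

Note that np.unique returns the labels in sorted order and np.argmax picks the first maximum, so a tie is resolved in favor of the smallest class id.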



Using majority voting, we reach an accuracy of 82% on the GTZAN test set! It is not state of the art, but it is a cool result for such a simple network.

I hope you find this small report useful! You can check the full implementation on my GitHub: https://github.com/Hguimaraes/gtzan.keras

