tl;dr: A Deep Learning approach for music genre classification using a simple Convolutional Neural Network (CNN). Based on my undergraduate thesis. You can find the Jupyter Notebook with all the code in the nbs folder here: https://github.com/Hguimaraes/gtzan.keras
This is the second post on music genre classification using Machine Learning. Here is a link to the previous post, with a classical approach and useful definitions: http://hguimaraes.me/journal/classical-ml-mir.html
In the previous post we explained why musical genres are difficult to classify: the labels are subjective and depend on complex social interactions.
Our objective is to use deep learning for the task of music genre classification by using CNNs. This kind of network is a good fit for data with a known grid-like topology and has been successfully applied to computer vision, speech recognition, and other tasks.
The first step of our pipeline is to import all data into memory. We created a function to read each file in the GTZAN structure and save the file path and the genre associated with it, shown below.
def read_data(src_dir, genres, song_samples):
    # Empty lists to hold the file paths and genres of all files
    arr_fn = []
    arr_genres = []

    # Get file list from the folders
    for x, _ in genres.items():
        folder = src_dir + x
        for root, subdirs, files in os.walk(folder):
            for file in files:
                file_name = folder + "/" + file

                # Save the file name and the genre
                arr_fn.append(file_name)
                arr_genres.append(genres[x])

    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        arr_fn, arr_genres, test_size=0.3, random_state=42, stratify=arr_genres
    )

    # Split into small segments and convert to spectrograms
    X_train, y_train = split_convert(X_train, y_train)
    X_test, y_test = split_convert(X_test, y_test)

    return X_train, X_test, y_train, y_test
In this function we also split the dataset into train and test data before the next step, to avoid data leakage.
The next step is to read the data in raw audio format (PCM based). We use a constant song_samples size that covers almost 30 seconds of audio.
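As a quick sanity check, here is how that constant maps to seconds. This is a sketch: the value 660000 and the 22050 Hz rate (librosa's default resampling rate) are assumptions about the setup, not shown in the snippets above.

```python
# Assumed values: librosa resamples to 22050 Hz by default,
# and song_samples = 660000 keeps just under 30 s of each track.
sr = 22050             # sampling rate in Hz (librosa default)
song_samples = 660000  # number of PCM samples kept per track

duration = song_samples / sr
print(f"{duration:.2f} s")  # ~29.93 s
```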
""" """ def split_convert(X, y): arr_specs, arr_genres = ,  # Convert to spectrograms and split into small windows for fn, genre in zip(X, y): signal, sr = librosa.load(fn) signal = signal[:song_samples] # Convert to dataset of spectograms/melspectograms signals, y = splitsongs(signal, genre) # Convert to "spec" representation specs = to_melspectrogram(signals) # Save files arr_genres.extend(y) arr_specs.extend(specs) return np.array(arr_specs), to_categorical(arr_genres)
As you can see, there are two steps in the previous function: split the original audio into smaller representations and convert them to melspectrograms.
""" @description: Method to split a song into multiple songs using overlapping windows """ def splitsongs(X, y, window = 0.05, overlap = 0.5): # Empty lists to hold our results temp_X =  temp_y =  # Get the input song array size xshape = X.shape chunk = int(xshape*window) offset = int(chunk*(1.-overlap)) # Split the song and create new ones on windows spsong = [X[i:i+chunk] for i in range(0, xshape - chunk + offset, offset)] for s in spsong: if s.shape != chunk: continue temp_X.append(s) temp_y.append(y) return np.array(temp_X), np.array(temp_y)
The idea behind splitting into smaller segments is to increase our training data for a Deep Neural Network (and to avoid overfitting). Each song will be split into smaller pieces that correspond to 5% of the original audio (about 1.5 seconds each). When sliding the window to the right, we allow an overlap of 50% with the previous slice.
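A back-of-the-envelope check of how much this multiplies the data. The track length of 660000 samples (~30 s at 22050 Hz) is an assumption; the window and overlap values are the defaults from splitsongs above.

```python
# How many 1.5 s windows does one track yield?
# Assumes a track of 660000 samples (~30 s at 22050 Hz).
n = 660000
chunk = int(n * 0.05)            # 5% window -> 33000 samples (~1.5 s)
offset = int(chunk * (1 - 0.5))  # 50% overlap -> hop of 16500 samples

starts = range(0, n - chunk + offset, offset)
n_windows = sum(1 for i in starts if i + chunk <= n)
print(n_windows)  # 39 full windows per track
```

So each ~30 s track becomes 39 training examples, a ~39x increase in dataset size.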
You have probably heard that a spectrogram is a time-frequency representation of a signal obtained through the short-time Fourier transform (STFT), and it is usually more compact than the raw audio format. The mel scale comes from studies of human perception of sound. The idea is to apply a set of non-linear transformations to the frequency axis of the spectrogram to obtain the melspectrogram. We use the Librosa package for this task as well.
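To make the non-linearity concrete, here is a small sketch of the standard Hz-to-mel formula (the HTK variant, which librosa also exposes as librosa.hz_to_mel). The specific frequencies below are just illustrative values.

```python
import numpy as np

# A common mel-scale formula: equal steps in mel correspond to
# roughly equal perceived pitch steps.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

# 1000 Hz maps to ~1000 mel by construction, but 8000 Hz maps to
# only ~2840 mel: high frequencies get squeezed together.
print(hz_to_mel(1000.0), hz_to_mel(8000.0))
```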
import librosa
import numpy as np

"""
@description: Method to convert a list of songs to a np array of melspectrograms
"""
def to_melspectrogram(songs, n_fft=1024, hop_length=256):
    # Transformation function
    melspec = lambda x: librosa.feature.melspectrogram(
        x, n_fft=n_fft, hop_length=hop_length, n_mels=128
    )[:, :, np.newaxis]

    # Map the transformation over the input songs
    tsongs = map(melspec, songs)
    return np.array(list(tsongs))
Training Deep Neural Networks can be tricky when you have little data. To further avoid overfitting, we perform some data augmentation on our melspectrograms. The transformations implemented are: horizontal flip (similar to image processing) and Cutout/random erasing. The idea of Cutout is to randomly erase some frequency bands and some small windows of time while the CNN is learning. That way the CNN sees slightly different data at each epoch.
from copy import copy
import numpy as np
from tensorflow.keras.utils import Sequence

class GTZANGenerator(Sequence):
    def __init__(self, X, y, batch_size=64, is_test=False):
        self.X = X
        self.y = y
        self.batch_size = batch_size
        self.is_test = is_test

    def __len__(self):
        return int(np.ceil(len(self.X) / self.batch_size))

    def __getitem__(self, index):
        # Get the batch
        signals = self.X[index*self.batch_size:(index+1)*self.batch_size]

        # Apply data augmentation on training batches only
        if not self.is_test:
            signals = self.__augment(signals)
        return signals, self.y[index*self.batch_size:(index+1)*self.batch_size]

    def __augment(self, signals, hor_flip=0.5, random_cutout=0.5):
        spectrograms = []
        for s in signals:
            signal = copy(s)

            # Perform horizontal flip
            if np.random.rand() < hor_flip:
                signal = np.flip(signal, 1)

            # Perform random cutout of some frequency/time bands
            if np.random.rand() < random_cutout:
                lines = np.random.randint(signal.shape[0], size=3)
                cols = np.random.randint(signal.shape[1], size=4)
                signal[lines, :, :] = -80  # dB
                signal[:, cols, :] = -80   # dB

            spectrograms.append(signal)
        return np.array(spectrograms)

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.X))
        np.random.shuffle(self.indexes)
        return None
Each convolutional block (an abstraction of a set of layers) is composed of a convolution layer, an activation function, a pooling operation, and a regularization operation (Dropout in our case).
def conv_block(x, n_filters, pool_size=(2, 2)):
    x = Conv2D(n_filters, (3, 3), strides=(1, 1), padding='same')(x)
    x = Activation('relu')(x)
    x = MaxPooling2D(pool_size=pool_size, strides=pool_size)(x)
    x = Dropout(0.25)(x)
    return x
And here we define the entire model by appending a fully-connected block at the end to perform the classification with a Softmax output.
def create_model(input_shape, num_genres):
    inpt = Input(shape=input_shape)
    x = conv_block(inpt, 16)
    x = conv_block(x, 32)
    x = conv_block(x, 64)
    x = conv_block(x, 128)
    x = conv_block(x, 256)

    # Flatten and MLP head
    x = Flatten()(x)
    x = Dropout(0.5)(x)
    x = Dense(512, activation='relu',
              kernel_regularizer=tf.keras.regularizers.l2(0.02))(x)
    x = Dropout(0.25)(x)
    predictions = Dense(num_genres,
                        activation='softmax',
                        kernel_regularizer=tf.keras.regularizers.l2(0.02))(x)

    model = Model(inputs=inpt, outputs=predictions)
    return model
We used the Adam optimization algorithm for this task, along with a scheduler that reduces the learning rate in case the validation loss stops decreasing.
model.compile(
    loss=tf.keras.losses.categorical_crossentropy,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
)

reduceLROnPlat = ReduceLROnPlateau(
    monitor='val_loss', factor=0.95, patience=3, verbose=1,
    mode='min', min_delta=0.0001, cooldown=2, min_lr=1e-5
)
Using a batch_size of 128 elements and 150 epochs, we are ready to start our learning procedure!
batch_size = 128
hist = model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    validation_data=validation_generator,
    validation_steps=val_steps,
    epochs=150,
    verbose=1,
    callbacks=[reduceLROnPlat]
)
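The fit call references a few names not defined in the snippets here (train_generator, steps_per_epoch, and so on). A hedged sketch of how they might be wired, where the example counts are my own assumptions based on the 1000 GTZAN tracks, the 70/30 split, and roughly 39 windows per track:

```python
import numpy as np

# Hypothetical sizes: 700 train songs * 39 windows = 27300 spectrograms,
# 300 test songs * 39 windows = 11700. Steps are just ceil(len / batch).
batch_size = 128
n_train, n_val = 27300, 11700

steps_per_epoch = int(np.ceil(n_train / batch_size))
val_steps = int(np.ceil(n_val / batch_size))
print(steps_per_epoch, val_steps)
```

The generators themselves would be instances of the GTZANGenerator class defined earlier, with is_test=True for the validation one so no augmentation is applied there.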
After training the network, we perform one more operation: majority voting. Since we split each song into small pieces, we classify each piece individually, and the final result is the consensus of the network's predictions.
def majority_vote(scores):
    values, counts = np.unique(scores, return_counts=True)
    ind = np.argmax(counts)
    return values[ind]
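A small self-contained usage sketch (the function is repeated here so the snippet runs on its own; the per-window predictions are made-up illustrative values):

```python
import numpy as np

def majority_vote(scores):
    values, counts = np.unique(scores, return_counts=True)
    return values[np.argmax(counts)]

# Suppose one song was split into windows and the network predicted a
# class label for each window; the song-level label is the most
# frequent window-level prediction.
window_preds = np.array([3, 3, 7, 3, 1, 3, 3])
print(majority_vote(window_preds))  # -> 3
```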
Using majority voting, we see an accuracy of 82% on the GTZAN test set! It is not state of the art, but it is a nice result for such a simple network.
I hope you find this small report useful! You can check the full implementation on my GitHub: https://github.com/Hguimaraes/gtzan.keras
Cheers,

Written on January 19th, 2020 by Heitor Guimarães