Hguimaraes random tech notes

Music genre classification: A classical ML approach

“The only truth is music - the only meaning is without meaning - Music blends with the heartbeat universe and we forget the brain beat.” — Jack Kerouac, Desolation Angels.

tl;dr: Hands-on post about how to create a music genre classifier from raw audio using the GTZAN dataset and handcrafted features. Based on my undergraduate thesis. In the future I will write about my deep learning approach. You can find a Jupyter Notebook with all the code in the nbs folder here: https://github.com/Hguimaraes/gtzan.keras

Music Genres

Musical genres are categorical labels created to organize the huge musical universe. The definitions of genres are subjective, and the lines separating them are blurry, because those labels come from complex interactions between people, history and culture. Even so, songs of the same genre share features related to instruments, rhythm and other patterns, which is what makes the task tractable.

In a study of human music genre classification, Perrot showed that people pick the correct genre in a 10-genre problem with about 70% accuracy from a 3-second music sample. Longer samples did not improve this accuracy, while shorter ones lowered it.


We are going to use one of the most famous datasets for this task, GTZAN. The dataset was built by Tzanetakis from songs collected between 2000 and 2001 from multiple sources such as CDs and radio. It consists of 1000 raw audio files in .au format, each 30 seconds long, organized in 10 musical genres with 100 samples per genre. The audio has a sample rate of 22050 Hz, a single channel (mono) and 16-bit quantization. The genres are: Metal, Disco, Classical, Hiphop, Jazz, Country, Pop, Blues, Reggae and Rock.

The dataset can be found here: http://marsyasweb.appspot.com/download/data_sets/

Reading the dataset

First of all, we need to import some Python dependencies to read the audio files and to compute the handcrafted features described in the next topic.

import os
import librosa
import numpy as np
import pandas as pd
from scipy.stats import kurtosis
from scipy.stats import skew

Then we can define a function that walks through the folders of the uncompressed GTZAN archive, one folder per genre. The idea is to build a list of dictionaries that we will later turn into a pandas DataFrame. Each song gets one dictionary, which becomes one row of the dataframe. The target is the "genre" key.

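# Genre name -> integer label (assumed mapping; keys must match the folder names)
genres = {'metal': 0, 'disco': 1, 'classical': 2, 'hiphop': 3, 'jazz': 4,
          'country': 5, 'pop': 6, 'blues': 7, 'reggae': 8, 'rock': 9}

def read_process_songs(src_dir, debug = True):
    # List of dicts with the processed features from all files
    arr_features = []

    # Read files from the folders
    for x, _ in genres.items():
        folder = os.path.join(src_dir, x)
        for root, subdirs, files in os.walk(folder):
            for file in files:
                # Read the audio file (librosa resamples to 22050 Hz mono by default)
                file_name = os.path.join(root, file)
                signal, sr = librosa.load(file_name)
                # Debug process
                if debug:
                    print("Reading file: {}".format(file_name))
                # Extract the features, attach the label and store the row
                features = get_features(signal, sr)
                features['genre'] = genres[x]
                arr_features.append(features)
    return arr_features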

The function get_features will be defined next.


In the classical machine learning approach we need some knowledge about the field we are tackling in order to extract useful features. In this post we are going to extract timbral and tempo features from short windows of the raw audio (windows of a few tens of milliseconds are typical). Each feature is computed on a short window because the signal can be assumed stationary over such a small interval. The exception is the tempo feature, which we calculate over the entire signal instead of using a window function.
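
To make the window size concrete, here is a small sketch of the arithmetic behind the framing. The function name `frame_stats` is hypothetical, the defaults `n_fft=1024` and `hop_length=512` are the ones used by `get_features` in this post, and the frame count assumes centered framing, which is librosa's default behavior.

```python
def frame_stats(n_samples, sr=22050, n_fft=1024, hop_length=512):
    """Return window length (ms), hop length (ms) and number of frames."""
    window_ms = 1000.0 * n_fft / sr
    hop_ms = 1000.0 * hop_length / sr
    # Assumption: centered framing, so the count is 1 + n_samples // hop_length
    n_frames = 1 + n_samples // hop_length
    return window_ms, hop_ms, n_frames


# A nominal 30-second GTZAN clip at 22050 Hz
window_ms, hop_ms, n_frames = frame_stats(30 * 22050)
print("window: {:.1f} ms, hop: {:.1f} ms, frames: {}".format(
    window_ms, hop_ms, n_frames))
```

With these settings each window covers roughly 46 ms and consecutive windows overlap by half, so a nominal 30-second clip yields about 1292 values per feature. Real GTZAN files differ slightly in length, so the exact count varies from file to file.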

  1. Timbral Features

    Timbral features are low-level features that capture how similar songs sound and supply semantic information about the signal (e.g. centroid, flux, etc).

    1. Spectral Centroid
    2. Spectral Rolloff
    3. Spectral Flux
    4. RMSE (root-mean-square energy)
    5. ZCR (zero-crossing rate)
    6. MFCC (Mel-frequency cepstral coefficients)

    In the case of the MFCCs we are going to use the first 13 coefficients.

  2. Tempo Features

    At a high level, tempo is a measure of how fast (or slow) a song is. It is usually expressed in BPM (beats per minute).

    1. Global tempo
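
To give some intuition for what two of these descriptors measure, here is a minimal NumPy sketch of the spectral centroid and the zero-crossing rate computed frame by frame. It is an illustration only, not what librosa does internally: librosa also pads the signal and applies a window function before the FFT, and the function names below are hypothetical.

```python
import numpy as np


def frame_signal(y, frame_length=1024, hop_length=512):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.stack([y[i * hop_length: i * hop_length + frame_length]
                     for i in range(n_frames)])


def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of one frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = mag.sum()
    # A silent frame has no spectrum to average, so return 0 by convention
    return float((freqs * mag).sum() / total) if total > 0 else 0.0


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))


# Toy signal: one second of a 440 Hz sine at the GTZAN sample rate
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(y)
centroids = np.array([spectral_centroid(f, sr) for f in frames])
zcrs = np.array([zero_crossing_rate(f) for f in frames])
print(frames.shape, centroids.mean(), zcrs.mean())
```

Each descriptor produces one value per frame, so a song becomes a vector per descriptor. Brighter, noisier sounds push both the centroid and the zero-crossing rate up, which is part of why they help separate genres such as metal and classical.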

Extracting Features using LibROSA

Now let's turn this into code. First we extract the timbral features and put them into a dictionary, then we summarize each feature vector with four statistics (mean, standard deviation, kurtosis and skewness). The last step is to get the global tempo.

def get_features(y, sr, n_fft = 1024, hop_length = 512):
    # Features to concatenate in the final dictionary
    features = {'centroid': None, 'roloff': None,
        'flux': None, 'rmse': None, 'zcr': None}
    # Using librosa to calculate the features (one value per frame)
    features['centroid'] = librosa.feature.spectral_centroid(y=y, sr=sr,
        n_fft=n_fft, hop_length=hop_length).ravel()
    features['roloff'] = librosa.feature.spectral_rolloff(y=y, sr=sr,
        n_fft=n_fft, hop_length=hop_length).ravel()
    features['zcr'] = librosa.feature.zero_crossing_rate(y,
        frame_length=n_fft, hop_length=hop_length).ravel()
    # RMS energy (this function was named librosa.feature.rmse before librosa 0.7)
    features['rmse'] = librosa.feature.rms(y=y, frame_length=n_fft,
        hop_length=hop_length).ravel()
    # Onset strength envelope, used here as the spectral flux descriptor
    features['flux'] = librosa.onset.onset_strength(y=y, sr=sr).ravel()
    # MFCC treatment: one vector per coefficient
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft = n_fft,
        hop_length = hop_length, n_mfcc=13)
    for idx, v_mfcc in enumerate(mfcc):
        features['mfcc_{}'.format(idx)] = v_mfcc.ravel()
    # Get statistics from the vectors
    def get_moments(descriptors):
        result = {}
        for k, v in descriptors.items():
            result['{}_mean'.format(k)] = np.mean(v)
            result['{}_std'.format(k)] = np.std(v)
            result['{}_kurtosis'.format(k)] = kurtosis(v)
            result['{}_skew'.format(k)] = skew(v)
        return result
    dict_agg_features = get_moments(features)
    # Global tempo in BPM (newer librosa releases expose this as librosa.feature.tempo)
    dict_agg_features['tempo'] = librosa.beat.tempo(y=y, sr=sr)[0]
    return dict_agg_features

By the end of this process we will have 73 features for each sample of the dataset (18 descriptors times 4 statistics, plus the tempo) and a target. We can run our functions as:

# Get list of dicts with features and convert to dataframe
features = read_process_songs("FOLDER_GTZAN", debug=False)
df_features = pd.DataFrame(features)
df_features.to_csv('GTZAN_Features.csv', index=False)

The last step is to save our dataframe to a CSV and process everything later using scikit-learn.
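
Before moving on, it can be worth checking that the CSV has the columns you expect. The sketch below rebuilds the expected column names from the naming scheme used in `get_features`; the helper `expected_columns` is hypothetical and assumes the dictionary keys shown in this post.

```python
def expected_columns(n_mfcc=13):
    """Column names produced by get_features, plus the 'genre' target."""
    descriptors = ['centroid', 'roloff', 'flux', 'rmse', 'zcr']
    descriptors += ['mfcc_{}'.format(i) for i in range(n_mfcc)]
    stats = ['mean', 'std', 'kurtosis', 'skew']
    # One column per (descriptor, statistic) pair
    columns = ['{}_{}'.format(d, s) for d in descriptors for s in stats]
    return columns + ['tempo', 'genre']


print(len(expected_columns()))
```

Comparing `set(df_features.columns)` with `set(expected_columns())` is a cheap way to catch a feature that silently failed to be added.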


Now that you have a CSV file, you are probably comfortable using scikit-learn/xgboost/lightgbm and your own methodology to go on. One last tip (depending on the algorithm you are using) is to normalize the columns before running the classifier. Using 5-fold CV, the best algorithm here was the SVM with RBF kernel [ svm = SVC(C=2, kernel='rbf') ], with a mean accuracy of 76.2% and a standard deviation of 3.4%. You can find the confusion matrix right below.

[Figure: confusion matrix of the SVM classifier over the 10 GTZAN genres]
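
For reference, here is a minimal sketch of how the normalization tip and the 5-fold evaluation could be wired together in scikit-learn. Only `SVC(C=2, kernel='rbf')` and the 5 folds come from this post. The stratified, shuffled splits, the fixed seed and the function name `evaluate_svm` are assumptions of this sketch, so do not expect it to reproduce the numbers above exactly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_svm(X, y, n_splits=5, seed=42):
    """Return mean and standard deviation of the k-fold CV accuracy."""
    # The scaler sits inside the pipeline so it is fit on the training
    # folds only and never sees the held-out fold
    model = make_pipeline(StandardScaler(), SVC(C=2, kernel='rbf'))
    # Assumption: stratified folds with shuffling and a fixed seed
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    return scores.mean(), scores.std()
```

With the CSV from this post, `X` would be the dataframe without the `genre` column and `y` the `genre` column itself. Keeping the scaler inside the pipeline matters because normalizing the whole table before splitting leaks statistics from the test folds into training.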

After this post you can build your own model that learns to classify songs better than humans! Cool, right?!

I hope that you have found this mini-tutorial useful and fun! Let's play with music and machine learning more!

