Introduction

1ras Jornadas de Inteligencia Artificial del Litoral

An event organized by the Research Institute for Signals, Systems and Computational Intelligence took place in November in the city of Santa Fe, Argentina. The event hosted a competition open to students whose main objective was to classify audio. It was exciting and, like every competition, challenging. I had the opportunity to meet awesome people and listen to very good talks.

The Challenge

The dataset consists of audio files of cows 🐮 eating, and the main objective is to identify the type of chewing sound the cow makes while eating grass.

The possible categories are:

  • bite
  • chew
  • chew-bite

The evaluation metric is Balanced Accuracy, and the dataset is relatively small.
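Since the dataset is unbalanced (more on that below), this metric matters: Balanced Accuracy is the average of the per-class recalls, so a model cannot score well by always predicting the majority class. A minimal sketch with scikit-learn, using made-up labels for illustration:

```python
# Balanced Accuracy = mean of per-class recalls
from sklearn.metrics import balanced_accuracy_score

# Illustrative labels: 0=bite, 1=chew, 2=chew-bite
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]  # the only chew-bite is misclassified as chew

# Per-class recalls: 2/2, 3/3, 0/1 -> (1.0 + 1.0 + 0.0) / 3 ≈ 0.667
print(balanced_accuracy_score(y_true, y_pred))
```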

The result

I managed to get a Balanced Accuracy of 0.897 using ensembled models, Mel-frequency cepstral coefficients (MFCC) and Convolutional Neural Networks. That result earned me second place and an amazing Logitech G433 headset.

The solution

The first thing I did was take a look at the data, loading the audio files with librosa and plotting them:
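Something like the following reproduces that exploration step; the file path is illustrative, and waveshow assumes librosa ≥ 0.9 (older versions used waveplot):

```python
# Load an audio file and plot its raw waveform
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Illustrative path; sr=None keeps the native sample rate
y, sr = librosa.load("audio/sample_bite.wav", sr=None)

plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr)
plt.title("Raw waveform")
plt.tight_layout()
plt.show()
```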

The dataset is unbalanced and the audio files contained some noise, so the first step is preprocessing.

Preprocessing the audio files

Noise detected using Spectrogram

The horizontal lines represent the noise. To remove it, I used librosa, separating the sound into percussive and harmonic components.

After applying the HPSS filter provided by librosa, I successfully split the sound into its harmonic and percussive parts.
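A minimal sketch of that split (the file path is illustrative):

```python
# Harmonic/percussive source separation (HPSS) with librosa
import librosa

y, sr = librosa.load("audio/sample_bite.wav", sr=None)

# The steady, tonal noise ends up in the harmonic component, while the
# short bite/chew transients end up in the percussive component.
y_harmonic, y_percussive = librosa.effects.hpss(y)
```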

Preparing the data

After preprocessing the audio files, it's time to convert them into a format my model can understand; as I mentioned earlier, I used MFCC.

MFCC Representation of audio file

The MFCC representation works very well with Convolutional Neural Networks. My first approach was to save the MFCC or Mel-spectrogram image representations and apply transfer learning with a simple image classification model, but I had better results feeding the model directly with the MFCC matrix, adding one more dimension as the channel.
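As a sketch of that feature-extraction step (n_mfcc and the fixed frame length are assumptions for illustration, not necessarily the competition values):

```python
# Turn an audio file into a fixed-size, CNN-ready MFCC "image"
import numpy as np
import librosa

y, sr = librosa.load("audio/sample_bite.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)       # shape: (40, frames)

# Pad/trim to a fixed number of frames so every example has the same shape
max_frames = 128
mfcc = librosa.util.fix_length(mfcc, size=max_frames, axis=1)

# Add the trailing channel dimension that Keras Conv2D layers expect
x = mfcc[..., np.newaxis]                                 # shape: (40, 128, 1)
```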

The Model

Since I have a little bit of experience classifying images, I used a VGG19-like architecture:

If you want to know more about each layer you can go to my Colab and explore it.

So: three convolutional blocks, max pooling after each block, and dropout between blocks in order to prevent overfitting.
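A hedged sketch of that kind of stack in Keras; the filter counts, dropout rates and input shape are my assumptions for illustration, not the exact competition model (the Colab has the real one):

```python
# VGG-style model: three conv blocks, each followed by max pooling and dropout
from tensorflow.keras import layers, models

def build_model(input_shape=(40, 128, 1), n_classes=3):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (32, 64, 128):  # three convolutional blocks
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.25))  # dropout between blocks against overfitting
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```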

Training

The training process is very interesting. I used Google Colab and did not find a way to reliably seed the random number generators, so I used ensembled models in order to get the best accuracy instead of trusting the calibration of a single model.

I divided the dataset in the following way: 85% for training and 15% for validation. Also, I used class weights to tell Keras how much each class should count in the loss.
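The class weights can be computed with scikit-learn; a self-contained sketch with made-up labels:

```python
# Balanced class weights: rarer classes get proportionally larger weights
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative unbalanced labels: 0=bite, 1=chew, 2=chew-bite
y_train = np.array([0, 0, 0, 0, 1, 1, 2])

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
print(class_weight)  # e.g. {0: 0.58, 1: 1.17, 2: 2.33}

# Later passed to Keras as: model.fit(..., class_weight=class_weight)
```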

For the training process, I used 6 stratified K-folds with 70% for training and 30% for validation. For each fold, I saved the best model.
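A sketch of that fold loop, assuming scikit-learn's StratifiedKFold and reusing build_model and class_weight from the sketches above (X, y and the file names are illustrative):

```python
# Train one model per stratified fold and checkpoint the best weights
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.callbacks import ModelCheckpoint

skf = StratifiedKFold(n_splits=6, shuffle=True)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    model = build_model()
    checkpoint = ModelCheckpoint(f"model_fold_{fold}.h5",
                                 monitor="val_accuracy",
                                 save_best_only=True)  # keep the best epoch only
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=50,
              class_weight=class_weight,
              callbacks=[checkpoint])
```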

Fold 1: single=0.909, ensembled=0.909
Fold 2: single=0.887, ensembled=0.896
Fold 3: single=0.891, ensembled=0.891
Fold 4: single=0.891, ensembled=0.896
Fold 5: single=0.887, ensembled=0.896
Fold 6: single=0.887, ensembled=0.896
Accuracy, single models: 0.892 (±0.008)
Accuracy, ensembled: 0.897 (±0.005)

Prediction

The prediction was made using the best model from each fold: I passed the unknown audio file through each model, averaged the predicted probabilities, and took the argmax as the category.
As you can see, passing the audio file through the ensemble gives better accuracy than any single model.
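A sketch of that averaging step, reusing the per-fold model files from the training sketch above:

```python
# Ensemble prediction: average the per-model softmax outputs, then argmax
import numpy as np
from tensorflow.keras.models import load_model

models = [load_model(f"model_fold_{fold}.h5") for fold in range(1, 7)]

def predict_ensemble(x):
    """x: batch of MFCC inputs, shape (n, 40, 128, 1)."""
    probs = np.mean([m.predict(x) for m in models], axis=0)
    return np.argmax(probs, axis=1)
```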


My best single model gave me 0.873 without folding, and using K-folds plus ensembling I achieved 0.897.

Result

I passed to the final round with a score of 0.81900 on the public leaderboard, and then achieved second place thanks to my private score and the quality of my Colab notebook.

You can run, fork and test my solution here:

jeffersonlicet/litoral-audio-classification

Thanks.