The modules in this directory are used to extract acoustic and/or deep learning features from ‘.wav’ files. The features are used as input for the classifiers, i.e. the SVM and the CNN.
We extract several feature sets as follows:
For our analyses we chunk all recordings into 0.5-second frames (with 0.25-second overlap between chunks). Before extracting features, we apply a Butterworth bandpass filter that keeps audio between 100 and 2000 Hz. We then create MFCC and RASTA-PLPC low-level descriptors (LLDs) from the filtered signal. For each horizontal band of the MFCC and RASTA-PLPC representations we calculate $\Delta$ and $\Delta^2$, and we extract statistical features from the plain LLDs, $\Delta$, and $\Delta^2$.
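A minimal sketch of this pipeline, using scipy and librosa, might look as follows. The file name, filter order, and number of MFCC coefficients are assumptions, the RASTA-PLPC LLDs are omitted (librosa does not provide them), and only two statistical functionals are shown, so this does not reproduce the full 1140-feature set:

```python
import librosa
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(y, sr, low=100.0, high=2000.0, order=5):
    # Butterworth bandpass between 100 and 2000 Hz, applied forward and
    # backward (zero phase) using second-order sections for stability.
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, y)

# Hypothetical input file; the scripts read these from --input_dir.
y, sr = librosa.load("example.wav", sr=48000)
y = bandpass(y, sr)

# 0.5-second frames with 0.25-second overlap (hop = half a frame).
frames = librosa.util.frame(y, frame_length=sr // 2, hop_length=sr // 4)

features = []
for frame in frames.T:
    mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13)  # MFCC LLDs
    d1 = librosa.feature.delta(mfcc)                        # delta
    d2 = librosa.feature.delta(mfcc, order=2)               # delta-delta
    # Statistical functionals (here: mean and std per horizontal band).
    features.append(np.hstack([m.mean(axis=1) for m in (mfcc, d1, d2)] +
                              [m.std(axis=1) for m in (mfcc, d1, d2)]))
```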
We extend the feature set with features from an Automatic Analysis Architecture.
The script results in a feature set of 1140 features per audio frame.
Use the shell script `run_svm.sh` to start `extract_features_svm.py` from the command line. The following arguments should be specified:
- `--input_dir`: directory where the ‘.wav’ files are located.
- `--output_dir`: directory where the feature files (‘.csv’) should be stored.
- `--frame_length`: subdivide the ‘.wav’ files into frames of this length (in number of samples; if the sample rate is 48000 samples per second, choose e.g. 24000 for 0.5-second frames).
- `--hop_length`: overlap between frames, in number of samples per hop.
- `--filter`: Butterworth bandpass filter variables.

In `./config` the user can specify which features to extract.
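A hypothetical invocation could look as follows (paths and values are illustrative, the format of the filter argument is an assumption, and the flags may instead be edited inside `run_svm.sh`):

```sh
sh run_svm.sh --input_dir ./data/wav --output_dir ./data/features \
    --frame_length 24000 --hop_length 12000 --filter 100 2000
```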
If you get an error about a ‘snd_file’ dependency on an Ubuntu machine, it can be fixed by installing the following C library:
sudo apt-get install libsndfile-dev
To extract audio features for the CNN classifier, ‘.wav’ files are converted to log-mel spectrograms using librosa. Log-mel spectrograms gave the best results in [1]. As future work, we could try alternatives such as log-spectrograms and gammatone-spectrograms.
In this study, we first apply a Butterworth bandpass filter to keep frequencies between 100 and 2000 Hz for further processing. Then the short-time Fourier transform (STFT) is applied to create spectrograms, which we convert to the mel scale and log-compress, as is often done for speech processing.
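A minimal sketch of this conversion with librosa (the file name is hypothetical, and the parameter values mirror the arguments listed below):

```python
import librosa

# Hypothetical chunk of audio; parameters mirror the CLI arguments below.
y, sr = librosa.load("chunk.wav", sr=48000)

S = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=750,       # STFT window of ~15 ms at 48 kHz (window_length)
    hop_length=376,  # samples between successive windows (hop_length)
    n_mels=64,       # number of mel bands (n_mel)
)
log_S = librosa.power_to_db(S)  # log-mel spectrogram
# The spectrogram is then resized to 64 x 64 (new_img_size) for the CNN.
```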
Open a command line and run the following command:
sh run_dl.sh
This command applies `extract_features_dl.py` to the whole dataset. The following arguments should be specified:
- `--input_dir`: directory where the ‘.wav’ files are located.
- `--output_dir`: directory where the feature files (‘.pkl’) should be stored.
- `--label`: the label of the ‘.wav’ file, i.e. `chimpanze` or `background`.
- `--window_length`: subdivide the ‘.wav’ files into frames of this length (in number of samples; in our case the sample rate is 48000 samples per second and we chose 750 for 15-millisecond frames).
- `--hop_length`: overlap between frames, in number of samples per hop (in our case we chose 376).
- `--n_mel`: number of mel features, i.e. horizontal bands in the spectrogram; 64 in our case.
- `--new_img_size`: the number of rows and columns of the log-mel spectrogram that is fed to the CNN as an image; 64 × 64 in our case.
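For example, a hypothetical direct invocation of `extract_features_dl.py` (paths and the exact format of `--new_img_size` are assumptions):

```sh
python extract_features_dl.py --input_dir ./data/wav --output_dir ./data/features \
    --label chimpanze --window_length 750 --hop_length 376 \
    --n_mel 64 --new_img_size 64
```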