Animal Sounds

Detecting primate vocalizations in a jungle of audio recordings

Background

Biodiversity monitoring in tropical rainforests in the context of logging certification


Bioacoustic monitoring:

  • is a non-invasive method to monitor biodiversity
  • works under the dense canopy, where visual monitoring is difficult

Background

Applied Data Science seed grant for interdisciplinary collaboration


Team:

  • Researcher in biology
  • 2 biology students
  • Researcher in human speech recognition
  • 3 research engineers

Challenges and objectives

  • Data processing requires automation
  • Machine learning requires labeled data
  • Low vocalization density
  • Training data is not available
  • Noisy environment

Challenges and objectives

  • Can we use data from a zoo to train a model, and detect vocalizations in the wild?
  • If so, create a pipeline that automates the process for reuse in other projects

Training data

Species               # vocalizations
Chimpanzee                       1190
Guenon                            554
Mandrill                         2717
Red-capped mangabey               584

Creating synthetic data

  • “We need more”

  • Combine vocalizations with jungle noise

  • Dampen the vocalizations to simulate distance

    • 0 dB
    • -3.3 dB
    • -6.6 dB
    • -10 dB
  • For each segment, 4 new segments are created
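A minimal sketch of this augmentation step, assuming 16 kHz mono clips and using librosa and soundfile; the paths, names, and mixing details are illustrative, not the project's exact pipeline:

```python
import numpy as np
import librosa
import soundfile as sf

GAINS_DB = [0, -3.3, -6.6, -10]  # attenuation levels simulating distance

def make_synthetic_segments(vocalization_path, noise_path, out_prefix, sr=16000):
    """Mix one vocalization into jungle noise at several gain levels."""
    voc, _ = librosa.load(vocalization_path, sr=sr)
    noise, _ = librosa.load(noise_path, sr=sr)
    noise = noise[: len(voc)]  # assumes the noise clip is at least as long

    for gain_db in GAINS_DB:
        gain = 10 ** (gain_db / 20)           # dB -> linear amplitude
        mix = gain * voc + noise              # dampened call over background
        mix /= max(1e-9, np.abs(mix).max())   # normalize to avoid clipping
        sf.write(f"{out_prefix}_{gain_db}dB.wav", mix, sr)
```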

Training and testing classifiers


How to test the classifiers?

  • cross-validation
  • need for an independent test set
  • very low density of vocalizations in the wild
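A sketch of how cross-validation can be combined with a held-out test set in scikit-learn; the generated data merely stands in for extracted audio features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Placeholder features/labels standing in for real audio segments,
# imbalanced to mimic the low vocalization density.
X, y = make_classification(n_samples=400, n_features=50, weights=[0.9], random_state=0)

# Keep an independent test set aside; cross-validate on the remainder.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# recall_macro equals the Unweighted Average Recall (UAR) reported later.
scores = cross_val_score(SVC(class_weight="balanced"), X_dev, y_dev, cv=5, scoring="recall_macro")
print("cross-validated UAR:", scores.mean())
```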

Classical machine learning

Feature extraction inspired by human speech recognition
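The slides do not list the exact features; a common speech-inspired choice is MFCCs with delta coefficients, sketched here with librosa as an assumption rather than the project's definitive feature set:

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    """Speech-style features: MFCCs plus deltas, summarized over time."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    feats = np.vstack([mfcc, delta])
    # Collapse the time axis into fixed-length statistics for classical ML.
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])
```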

Scikit-learn: Feature selection

Determining the number of features to select with RFE (recursive feature elimination)
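A sketch of choosing that number automatically with scikit-learn's RFECV; the estimator and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=100, random_state=0)

# RFECV repeatedly drops the weakest features and uses cross-validation
# to decide how many features to keep.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5, scoring="recall_macro")
selector.fit(X, y)
print("optimal number of features:", selector.n_features_)
```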

Scikit-learn: SVM

  • Using the feature_importances_ attribute of ExtraTreesClassifier to select the 50 most important features

  • Train SVM model with selected features
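A minimal sketch of this two-step recipe on synthetic stand-in data: rank features with ExtraTreesClassifier, then train the SVM on the top 50:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=200, random_state=0)

# Rank features by impurity-based importance.
trees = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
top50 = np.argsort(trees.feature_importances_)[-50:]

# Train the SVM on the 50 most important features only.
svm = SVC().fit(X[:, top50], y)
print("training accuracy:", svm.score(X[:, top50], y))
```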

Deep learning models

Convolutional Neural Networks (CNN)

  • Convolutional layer
    • detects features (e.g., edges, textures) by applying filters
  • Pooling layer
    • reduces the dimensionality of the feature maps
  • Fully connected layer
    • maps the features to the final output
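A toy PyTorch illustration of these three layer types applied to a one-channel spectrogram patch; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # detects local features
        self.pool = nn.MaxPool2d(2)                             # halves each spatial dimension
        self.fc = nn.Linear(16 * 32 * 32, n_classes)            # maps features to the output

    def forward(self, x):  # x: (batch, 1, 64, 64) spectrogram patch
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(1))

logits = TinyCNN()(torch.randn(4, 1, 64, 64))  # -> shape (4, 2)
```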

How does a CNN work?

How do we turn audio into an image?

Spectrogram: represents the intensity of different frequencies as they change over time, typically using a color map.

Log-mel spectrogram: a variation of the standard spectrogram that applies a mel filter bank and a log function on top of it:

  • makes quieter sounds more detectable
  • aligns the representation with human auditory perception
  • normalizes the features
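In code, the log-mel transformation is two librosa calls; the random signal here merely stands in for a real recording:

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr)  # stand-in for one second of audio

# Mel filter bank, then log compression: the log-mel spectrogram.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (64, n_frames): a one-channel "image" for the CNN
```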

Model Architecture - Derived from PANNs

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

  • designed for audio event detection and classification

  • a combination of convolutional blocks and pooling operations

CNN10 Architecture
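In the PANNs paper, CNN10 stacks four convolutional blocks (two 3×3 convolutions with batch norm per block, 64→128→256→512 channels, each followed by 2×2 pooling) ahead of a global pooling head. A simplified PyTorch sketch of that pattern, not the reference implementation:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convs with batch norm and ReLU, then 2x2 average pooling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.AvgPool2d(2),
        )

    def forward(self, x):
        return self.net(x)

class CNN10Like(nn.Module):
    """Simplified CNN10-style model: four conv blocks, then global pooling."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.blocks = nn.Sequential(
            ConvBlock(1, 64), ConvBlock(64, 128),
            ConvBlock(128, 256), ConvBlock(256, 512),
        )
        self.fc = nn.Linear(512, n_classes)

    def forward(self, x):       # x: (batch, 1, mel_bins, frames)
        h = self.blocks(x)
        h = h.mean(dim=(2, 3))  # global average pooling over time and frequency
        return self.fc(h)

out = CNN10Like()(torch.randn(2, 1, 64, 64))  # -> shape (2, 2)
```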

Results

Trained on              SVM    CNN    CNN10
Sanctuary               0.86   0.81   0.83
Synthetic               0.65   0.82   0.85
Sanctuary + Synthetic   0.87   0.83   0.87


Numbers represent: Unweighted Average Recall (UAR)
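UAR is the macro-averaged recall, so each class contributes equally no matter how rare it is; with scikit-learn:

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 1, 1]  # imbalanced labels, as in low-density field data
y_pred = [0, 0, 1, 1, 0]

# UAR = mean of the per-class recalls: (2/3 + 1/2) / 2 ≈ 0.583
uar = recall_score(y_true, y_pred, average="macro")
print(uar)
```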

Deliverables (1/2)

Python modules for audio analysis

  • Preprocessing
    • Condensing recordings to speed up annotation
    • Extracting relevant audio segments
    • Generating synthetic data
  • Feature extraction and selection
  • Classification
    • SVM
    • Deep learning models

Deliverables (2/2)

Public training and test dataset

Challenges & Learning points

  • Collaboration with an ML audio expert
  • Involvement of the researcher
  • Data management
  • Code management (MATLAB scripts, repositories with playground folders, no real Git workflow; lesson: document and package from an early stage)
  • Interspeech challenge

Future work

  • Publication

  • Generic Audio Analysis Platform

    • Modular Architecture