AST: Audio Spectrogram Transformer

The paper can be accessed from here

This paper is the first convolution-free, purely attention-based model for audio classification.

Background

CNN have been widely used to learn representations from raw spectrograms for end-to-end modelling
Add a self-attention mechanism on the top of the CNN to better capture long-range global context
Have achieved state-of-the-art (SOTA) results for many audio classification tasks
The success of purely attention-based models in the vision domain

CNN-based models typically require architecture tuning to obtain optimal performance for different tasks.
SOTA CNN-attention hybrid models have more complicated architecture with lots of parameters.

Build a convolution-free, purely attention-based model for audio classification which features a simple architecture and superior performance.
Achieve better performance with a simpler architecture with fewer parameters.

The input audio waveform of t seconds is converted into a sequence of 128-D log Mel filterbank features computed with a 25 ms Hamming window every 10ms. (128 * 100t spectrogram)
Split the spectrogram into a sequence of N 16 * 16 patches with an overlap of 6 in both time and frequency dimension
Flatten each 16 * 16 patch to a 1D patch embedding of size 768 using a linear projection layer
Add a trainable positional embedding (size 768) to each patch embedding
Append a [CLS] token at the beginning of the sequence
Propose an approach for transferring knowledge from the Vision Transformer (ViT) pretrained on ImageNet to AST
propose a cut and bi-linear interpolate method for positional embedding adaptation.

AudioSet – a collection of over 2 million 10-second audio clips contains from a set of 527 labels
ESC-50 – consists of 2,000 5-second environmental audio recordings organized into 50 classes
Speech Commands V2 – consists of 105,829 1-second recordings of 35 common speech commands

Use weight averaging strategies
Use ensemble strategies

Ensemble-S: run three times with the exact same setting, but with a different random seed.

Ensemble-M: ensemble the three models in Ensemble-S together with another three models trained with different patch split strategies.
Compare AST with the SOTA models in two setting

AST-S: train an AST model with only ImageNet pretraining

AST-P: train an AST model with ImageNet and AudioSet pretraining

Impact of ImageNet Pretraining: ImageNet pretraining can greatly reduce the demand for in-domain audio data for AST.
Impact of Positional Embedding Adaptation: reinitializing positional embedding does not completely break the pretrained model, but it leads to a noticeable performance drop. Bi-linear interpolation and nearest-neighbor interpolation do not result in a big difference.
Impact of Patch Split Overlap: the performance improves with the overlap size for both balanced and full set experiments.
Impact of Patch Shape and Size: when the area of the patch is the same (256), 128 * 2 rectangle patches perform better than 16 * 16 square patches. But considering there is not 128 * 2 patch based ImageNet pretrained models, using 16 * 16 patches is still the current optimal solution.

AST feature a simple architecture and superior performance
The final performance of AST on all three datasets indicated the potential for AST use as a generic audio classifier.