nnAudio2 2.0.2
nnAudio2 is an audio feature extraction toolbox for deep learning, built on PyTorch.
Spectrograms and other audio transforms are implemented as nn.Module layers — they run
on-device (CUDA, MPS, or CPU), are fully differentiable, and can be embedded directly
inside a neural network. Filter banks (Mel, CQT, STFT kernels) can optionally be made
trainable.
nnAudio2 is developed and maintained by the AMAAI Lab at SUTD. It is a modernised successor to nnAudio, which is no longer actively maintained. The original codebase has been fully overhauled to work with modern PyTorch and the current scientific Python ecosystem.
If you use nnAudio2 in your work, please cite both the original nnAudio paper and the nnAudio2 paper (see Citation below).
Quick Start
import torch
from nnAudio2.features.mel import MelSpectrogram
mel = MelSpectrogram(sr=22050, n_fft=1024, hop_length=512, n_mels=128)
mel = mel.to('cuda') # or 'mps' on Apple Silicon
audio = torch.randn(4, 22050, device='cuda') # batch of 4 × 1-second clips
spec = mel(audio) # [4, 128, T] — on GPU
Because the transform is an nn.Module, it moves with your model and its parameters
participate in backpropagation. Passing trainable_mel=True or trainable_STFT=True
allows the filter banks themselves to be optimised during training.
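The mechanics behind a trainable filter bank can be sketched in plain PyTorch. This is an illustrative stand-in, not nnAudio2's internals: the filter matrix is registered as an nn.Parameter, so gradients from any downstream loss flow into the filters themselves.

```python
import torch
import torch.nn as nn

class TrainableFilterBank(nn.Module):
    """Sketch of a learnable (n_mels, n_freq_bins) filter matrix.

    A real mel bank would be initialised from triangular mel filters;
    random init here keeps the example self-contained.
    """
    def __init__(self, n_freq_bins=513, n_mels=128, trainable=True):
        super().__init__()
        fb = torch.rand(n_mels, n_freq_bins)  # placeholder initialisation
        self.fb = nn.Parameter(fb, requires_grad=trainable)

    def forward(self, power_spec):
        # power_spec: [batch, n_freq_bins, time] -> [batch, n_mels, time]
        return torch.matmul(self.fb, power_spec)

bank = TrainableFilterBank()
spec = torch.rand(4, 513, 44)   # fake power spectrogram
mel = bank(spec)                # [4, 128, 44]
mel.mean().backward()           # gradients reach the filter bank
print(bank.fb.grad is not None) # True: the filters are being optimised
```

Because the filters are ordinary parameters, any optimiser that sees the module's `parameters()` will update them alongside the rest of the network.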
For inverse STFT, use the uniform-bin configuration (freq_scale='no'). The
non-uniform linear, log, and log2 scales are analysis-only; attempting
inversion raises an explicit error.
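The restriction makes sense when seen through torch's own STFT, which always uses uniform bins (the analogue of freq_scale='no'): on a uniform grid the transform is invertible to near machine precision, which is exactly what non-uniform scales give up. A torch-only round trip, not using nnAudio2's API:

```python
import torch

sr, n_fft, hop = 22050, 1024, 512
audio = torch.randn(1, sr)              # 1-second mono clip
window = torch.hann_window(n_fft)       # Hann + 50% hop satisfies COLA

# Uniform-bin analysis...
spec = torch.stft(audio, n_fft=n_fft, hop_length=hop, window=window,
                  return_complex=True)
# ...and exact synthesis back to the waveform
recon = torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window,
                    length=audio.shape[-1])

print(torch.allclose(audio, recon, atol=1e-5))  # True
```

With log or log2 frequency scales the bins no longer tile the spectrum uniformly, so no such exact synthesis formula exists, hence the explicit error.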
The source code is on GitHub.
Getting Started
API Documentation
Examples & Tutorials
GitHub
Citation