nnAudio2 2.0.2

nnAudio2 is an audio feature extraction toolbox for deep learning, built on PyTorch. Spectrograms and other audio transforms are implemented as nn.Module layers — they run on-device (CUDA, MPS, or CPU), are fully differentiable, and can be embedded directly inside a neural network. Filter banks (Mel, CQT, STFT kernels) can optionally be made trainable.

nnAudio2 is developed and maintained by the AMAAI Lab at SUTD. It is a modernised successor to nnAudio, which is no longer actively maintained. The original codebase has been fully overhauled to work with modern PyTorch and the current scientific Python ecosystem.

If you use nnAudio2, please cite both the original nnAudio paper and the nnAudio2 paper.

Quick Start

import torch
from nnAudio2.features.mel import MelSpectrogram

mel = MelSpectrogram(sr=22050, n_fft=1024, hop_length=512, n_mels=128)
mel = mel.to('cuda')          # or 'mps' on Apple Silicon

audio = torch.randn(4, 22050).to('cuda')   # batch of 4 × 1-second clips
spec  = mel(audio)                          # [4, 128, T] — on GPU

Because the transform is an nn.Module, it moves with your model and its parameters participate in backpropagation. Passing trainable_mel=True or trainable_STFT=True allows the filter banks themselves to be optimised during training.
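The n_mels=128 filter bank in the Quick Start is built on the mel scale. As background, here is a minimal sketch of the common HTK-style mel mapping and how evenly spaced mel edges translate into increasingly wide frequency bands (plain Python; the exact scale variant nnAudio2 uses internally is an assumption here):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # HTK-style mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    # Inverse of the mapping above
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

sr, n_mels = 22050, 128
f_max = sr / 2  # Nyquist

# n_mels triangular filters need n_mels + 2 edge points,
# evenly spaced on the mel axis between 0 Hz and Nyquist.
m_lo, m_hi = hz_to_mel(0.0), hz_to_mel(f_max)
mel_edges = [m_lo + i * (m_hi - m_lo) / (n_mels + 1) for i in range(n_mels + 2)]
hz_edges = [mel_to_hz(m) for m in mel_edges]

# Equal spacing in mel becomes widening spacing in Hz:
print(round(hz_edges[1] - hz_edges[0], 2))    # narrow low-frequency band
print(round(hz_edges[-1] - hz_edges[-2], 2))  # much wider top band
```

When trainable_mel=True, it is precisely these filter weights (initialised from such a bank) that become learnable parameters and are updated by the optimiser alongside the rest of the model.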

For inverse STFT, use the uniform-bin configuration (freq_scale='no'). The non-uniform linear, log, and log2 scales are analysis-only; attempting inversion raises an explicit error.
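The restriction follows from how the bins tile the spectrum. A small sketch (plain Python; the f_min value is an assumed lower bound chosen only for illustration) contrasts uniform bin centres, which match a standard STFT and therefore admit overlap-add inversion, with a log2 spacing whose bins no longer cover the spectrum uniformly:

```python
import math

sr, n_fft = 22050, 1024
n_bins = n_fft // 2 + 1  # 513 one-sided STFT bins

# freq_scale='no': uniform bins at k * sr / n_fft, constant spacing,
# i.e. an ordinary STFT grid that standard inversion can reconstruct from.
linear_bins = [k * sr / n_fft for k in range(n_bins)]
spacings = [b - a for a, b in zip(linear_bins, linear_bins[1:])]

# A log2 scale (analysis-only) places the same number of bins
# geometrically between f_min and Nyquist; the spacing grows with
# frequency, so high frequencies are sampled ever more coarsely and a
# generic inverse transform is ill-posed.
f_min = 32.70  # assumed lower bound (roughly C1), for illustration only
f_max = sr / 2
log2_bins = [f_min * 2 ** (k * math.log2(f_max / f_min) / (n_bins - 1))
             for k in range(n_bins)]

print(round(spacings[0], 3))                      # constant linear spacing
print(round(log2_bins[1] - log2_bins[0], 3))      # tiny spacing at the bottom
print(round(log2_bins[-1] - log2_bins[-2], 3))    # huge spacing at the top
```

This is why the non-uniform linear, log, and log2 scales are useful for analysis (they concentrate resolution where it matters) but are rejected with an explicit error on inversion.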

The source code is on GitHub.

Examples & Tutorials

GitHub

Citation
