nnAudio2 2.0.2

nnAudio2 is an audio feature extraction toolbox for deep learning, built on PyTorch. Spectrograms and other audio transforms are implemented as nn.Module layers — they run on-device (CUDA, MPS, or CPU), are fully differentiable, and can be embedded directly inside a neural network. Filter banks (Mel, CQT, STFT kernels) can optionally be made trainable.

nnAudio2 is developed and maintained by the AMAAI Lab at SUTD. It is a modernised successor to nnAudio, which is no longer actively maintained. The original codebase has been fully overhauled to work with modern PyTorch and the current scientific Python ecosystem.

If you use nnAudio2, please cite both the original nnAudio paper and the nnAudio2 paper.

Quick Start

import torch
from nnAudio2.features.mel import MelSpectrogram

mel = MelSpectrogram(sr=22050, n_fft=1024, hop_length=512, n_mels=128)
mel = mel.to('cuda')          # or 'mps' on Apple Silicon

audio = torch.randn(4, 22050).to('cuda')   # batch of 4 × 1-second clips
spec  = mel(audio)                          # [4, 128, T] — on GPU

Because the transform is an nn.Module, it moves with your model and its parameters participate in backpropagation. Passing trainable_mel=True or trainable_STFT=True allows the filter banks themselves to be optimised during training.
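The n_mels=128 filter bank in the Quick Start is built on the mel scale. As background, here is a minimal sketch of the common HTK-style mel mapping and how evenly spaced mel edges translate into increasingly wide frequency bands (plain Python; the exact scale variant nnAudio2 uses internally is an assumption here):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # HTK-style mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    # Inverse of the mapping above
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

sr, n_mels = 22050, 128
f_max = sr / 2  # Nyquist

# n_mels triangular filters need n_mels + 2 edge points,
# evenly spaced on the mel axis between 0 Hz and Nyquist.
m_lo, m_hi = hz_to_mel(0.0), hz_to_mel(f_max)
mel_edges = [m_lo + i * (m_hi - m_lo) / (n_mels + 1) for i in range(n_mels + 2)]
hz_edges = [mel_to_hz(m) for m in mel_edges]

# Equal spacing in mel becomes widening spacing in Hz:
print(round(hz_edges[1] - hz_edges[0], 2))    # narrow low-frequency band
print(round(hz_edges[-1] - hz_edges[-2], 2))  # much wider top band
```

When trainable_mel=True, it is precisely these filter weights (initialised from such a bank) that become learnable parameters and are updated by the optimiser alongside the rest of the model.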

For inverse STFT, use the uniform-bin configuration (freq_scale='no'). The non-uniform linear, log, and log2 scales are analysis-only; attempting inversion raises an explicit error.
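The restriction follows from how the bins tile the spectrum. A small sketch (plain Python; the f_min value is an assumed lower bound chosen only for illustration) contrasts uniform bin centres, which match a standard STFT and therefore admit overlap-add inversion, with a log2 spacing whose bins no longer cover the spectrum uniformly:

```python
import math

sr, n_fft = 22050, 1024
n_bins = n_fft // 2 + 1  # 513 one-sided STFT bins

# freq_scale='no': uniform bins at k * sr / n_fft, constant spacing,
# i.e. an ordinary STFT grid that standard inversion can reconstruct from.
linear_bins = [k * sr / n_fft for k in range(n_bins)]
spacings = [b - a for a, b in zip(linear_bins, linear_bins[1:])]

# A log2 scale (analysis-only) places the same number of bins
# geometrically between f_min and Nyquist; the spacing grows with
# frequency, so high frequencies are sampled ever more coarsely and a
# generic inverse transform is ill-posed.
f_min = 32.70  # assumed lower bound (roughly C1), for illustration only
f_max = sr / 2
log2_bins = [f_min * 2 ** (k * math.log2(f_max / f_min) / (n_bins - 1))
             for k in range(n_bins)]

print(round(spacings[0], 3))                      # constant linear spacing
print(round(log2_bins[1] - log2_bins[0], 3))      # tiny spacing at the bottom
print(round(log2_bins[-1] - log2_bins[-2], 3))    # huge spacing at the top
```

This is why the non-uniform linear, log, and log2 scales are useful for analysis (they concentrate resolution where it matters) but are rejected with an explicit error on inversion.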

The source code is on GitHub.

Examples & Tutorials

GitHub

Citation
