nnAudio2.features.mel.MFCC

class nnAudio2.features.mel.MFCC(sr=22050, n_mfcc=20, norm='ortho', verbose=True, ref=1.0, amin=1e-10, top_db=80.0, **kwargs)

Bases: Module

This class calculates the Mel-frequency cepstral coefficients (MFCCs) of the input signal. The algorithm first extracts a Mel spectrogram from the audio clip, then applies the discrete cosine transform to obtain the final MFCCs. Therefore, the Mel spectrogram part can be made trainable via trainable_mel and trainable_STFT. Only the type-II DCT is supported at the moment. The input signal should be in one of the following shapes:

  1. (len_audio)

  2. (num_audio, len_audio)

  3. (num_audio, 1, len_audio)

The correct shape will be inferred automatically if the input follows one of these three shapes. Most of the arguments follow the librosa convention. This class inherits from nn.Module, so it is used in the same way as any nn.Module.
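The type-II DCT step mentioned above can be sketched in plain Python. This is a minimal, loop-based illustration of the 'ortho'-normalized transform applied to a single frame, not the library's batched implementation:

```python
import math

def dct_ii_ortho(x):
    """Type-II DCT with 'ortho' normalization for one frame.

    X_k = s_k * sum_n x[n] * cos(pi * (n + 0.5) * k / N),
    where s_0 = sqrt(1/N) and s_k = sqrt(2/N) for k > 0.
    """
    N = len(x)
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        acc = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        out.append(scale * acc)
    return out
```

In the MFCC pipeline this transform is applied along the Mel-filter axis of the log-scaled Mel spectrogram, and only the first n_mfcc coefficients are kept.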

Parameters:
  • sr (int) – The sampling rate for the input audio. It is used to calculate the correct fmin and fmax. Setting the correct sampling rate is very important for calculating the correct frequency.

  • n_mfcc (int) – The number of Mel-frequency cepstral coefficients

  • norm (string) – Normalization mode for the DCT basis. The default value is ‘ortho’.

  • **kwargs – Other arguments for Melspectrogram such as n_fft, n_mels, hop_length, and window

Returns:

MFCCs – A tensor of MFCCs with shape (num_samples, n_mfcc, time_steps).

Return type:

torch.Tensor

Examples

>>> spec_layer = Spectrogram.MFCC()
>>> mfcc = spec_layer(x)

Methods

__init__

Initialize internal Module state, shared by both nn.Module and ScriptModule.

extra_repr

Return the extra representation of the module.

forward

Convert a batch of waveforms to MFCC.

extra_repr() str

Return the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x)

Convert a batch of waveforms to MFCC.

Parameters:

x (torch tensor) –

Input signal should be in either of the following shapes.

  1. (len_audio)

  2. (num_audio, len_audio)

  3. (num_audio, 1, len_audio)

The input will be automatically broadcast to the right shape.
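The broadcasting rule described above can be sketched as a small helper operating on shape tuples. This is a hypothetical illustration of the shape inference, not the library's actual code:

```python
def normalize_shape(shape):
    """Map an accepted input shape to the canonical (num_audio, 1, len_audio)."""
    if len(shape) == 1:                    # (len_audio,)
        return (1, 1, shape[0])
    if len(shape) == 2:                    # (num_audio, len_audio)
        return (shape[0], 1, shape[1])
    if len(shape) == 3 and shape[1] == 1:  # already (num_audio, 1, len_audio)
        return shape
    raise ValueError(f"unsupported input shape: {shape}")
```

Any of the three accepted shapes thus maps onto the same canonical batch layout before the Mel spectrogram and DCT stages run.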