Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training




Abstract

With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that must be taken into account when building inclusive speech synthesizers. Inclusive speech technology aims to remove biases against specific groups, such as speakers with a particular accent. We note that state-of-the-art Text-to-Speech (TTS) systems may not currently be suitable for all people, regardless of their background, because they are designed to generate high-quality voices without focusing on accent. In this paper, we propose a TTS model that uses a Multi-Level Variational Autoencoder (MLVAE) with adversarial learning to address accented speech synthesis and conversion in TTS, with a vision for more inclusive systems in the future. We evaluate performance through both objective metrics and subjective listening tests. The results show an improvement in accent conversion ability compared to the baseline.

Code repo link


Non-converted audio samples from the listening test:

Below are audio samples for the mean opinion score (MOS) voice quality comparison from the paper. These samples are non-converted (no accent change).
Columns: Ground Truth | GST | MLVAE | MLVAE-ADV
ABA (Arabic): And you always want to see it in the superlative degree.
EBVS (Spanish): What was the object of your little sensation.
HKK (Korean): I came for information more out of curiosity than anything else.
NCC (Chinese): I will go over tomorrow afternoon.
SLT (American): Will we ever forget it.
SVBI (Hindi): He will knock you off a few sticks in no time.
THV (Vietnamese): For the twentieth time that evening the two men shook hands.
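
For readers unfamiliar with MOS, the following is a minimal sketch of how a mean opinion score is commonly summarized: the mean of listener ratings with an approximate 95% confidence interval. It is not taken from the paper. The function name `mean_opinion_score`, the 1 to 5 rating scale, the normal-approximation interval, and the ratings themselves are all assumptions made for illustration.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mean, CI half-width) for a list of listener ratings.

    Assumes ratings on a 1-5 scale and a normal approximation for the
    confidence interval (z=1.96 for roughly 95%). Both are illustrative
    assumptions, not the paper's stated procedure.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    stdev = statistics.stdev(ratings)  # sample standard deviation
    half_width = z * stdev / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings for one system; NOT data from the listening test.
ratings = [4, 5, 3, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With few listeners, a t-distribution interval would be a more careful choice than the fixed z value used here.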

Accent-converted audio samples from the listening test:

Below are audio samples for the speaker and accent similarity evaluation from the paper. These samples are accent-converted.
Columns: Original Speaker | New Accent Reference | MLVAE | MLVAE-ADV
ABA (Arabic) into Vietnamese: I came for information more out of curiosity than anything else.
EBVS (Spanish) into Korean: What was the object of your little sensation.
HKK (Korean) into Arabic: But what they want with your toothbrush is more than I can imagine.
NCC (Chinese) into American: I graduated last of my class.
SLT (American) into Chinese: But what they want with your toothbrush is more than I can imagine.
SVBI (Hindi) into Spanish: I will go over tomorrow afternoon.
THV (Vietnamese) into Hindi: He will knock you off a few sticks in no time.