DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech

Accepted paper at NeurIPS 2024, Audio Imagination workshop

This is the sample site for the DART paper, accepted at the Audio Imagination workshop of NeurIPS 2024. Audio samples from the paper are listed below.
For code, please refer to: https://github.com/amaai-lab/DART

Non-converted samples

These samples were used for evaluating audio quality in the listening test.

Utterance 1: The eastern heavens were equally spectacular.
Utterance 2: Philip did not pursue the subject.
Utterance 3: At the same time spears and arrows began to fall among the invaders.
Utterance 4: Men who endure it call it living death.
Utterance 5: In the crib the baby sat up and began to prattle.
Utterance 6: In the Bohemian Club of San Francisco there are some crack sailors.
Utterance 7: Everything was working smoothly, better than I had expected.

Systems compared (one audio sample per system for each speaker): Ground Truth, MLVAE-Tacotron, Fastspeech2-GE2E, Fastspeech2-GST, Fastspeech2-GST-GE2E, DARTscratch, DART without VQ, DART
Speaker: ABA (Arabic)
Speaker: EBVS (Spanish)
Speaker: HKK (Korean)
Speaker: LXC (Chinese)
Speaker: NCC (Chinese)
Speaker: SVBI (Hindi)
Speaker: THV (Vietnamese)

Accent-converted samples

These speakers had their accent converted to the target accent.

Utterance 1: This piece of cake is so yummy, I can't wait to bake another one.
Utterance 2: Without you, I would not be able to do it.
Utterance 3: I will go inside and tell the truth.
Utterance 4: And you always come to that shop to order the same meal.

Note that the ground-truth reference is a different sentence...

Systems compared (one audio sample per system for each speaker): Source Ground Truth, DART no conversion (for reference), MLVAE-Tacotron, Fastspeech2-GST, Fastspeech2-GST-GE2E, DARTscratch, DART without VQ, DART64, DART128, DART512
Speaker: ABA (Arabic), Accent: Vietnamese
Speaker: NCC (Chinese), Accent: Arabic
Speaker: SVBI (Hindi), Accent: Chinese
Speaker: THV (Vietnamese), Accent: Hindi