DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech

Accepted paper at the NeurIPS 2024 Audio Imagination workshop

This is the sample page for the DART paper, accepted at the Audio Imagination workshop of NeurIPS 2024. Audio samples from the paper are listed below.
For the code, please refer to: https://github.com/amaai-lab/DART

Non-converted samples

These samples were used for evaluating audio quality in the listening test.

Utterance 1: The eastern heavens were equally spectacular.
Utterance 2: Philip did not pursue the subject.
Utterance 3: At the same time spears and arrows began to fall among the invaders.
Utterance 4: Men who endure it call it living death.
Utterance 5: In the crib the baby sat up and began to prattle.
Utterance 6: In the Bohemian Club of San Francisco there are some crack sailors.
Utterance 7: Everything was working smoothly, better than I had expected.

Ground Truth	MLVAE-Tacotron	Fastspeech2-GE2E	Fastspeech2-GST	Fastspeech2-GST-GE2E	DARTscratch	DART without VQ	DART
Speaker: ABA (Arabic)
Speaker: EBVS (Spanish)
Speaker: HKK (Korean)
Speaker: LXC (Chinese)
Speaker: NCC (Chinese)
Speaker: SVBI (Hindi)
Speaker: THV (Vietnamese)

Accent-converted samples

In these samples, each speaker's accent was converted to the target accent.

Utterance 1: This piece of cake is so yummy, I can't wait to bake another one.
Utterance 2: Without you, I would not be able to do it.
Utterance 3: I will go inside and tell the truth.
Utterance 4: And you always come to that shop to order the same meal.

Note that the ground truth reference is a different sentence...

Source Ground Truth	DART no conversion (for reference)	MLVAE-Tacotron	Fastspeech2-GST	Fastspeech2-GST-GE2E	DARTscratch	DART without VQ	DART64	DART128	DART512
Speaker: ABA (Arabic), Accent: Vietnamese
Speaker: NCC (Chinese), Accent: Arabic
Speaker: SVBI (Hindi), Accent: Chinese
Speaker: THV (Vietnamese), Accent: Hindi
