Glow-WaveGAN 2: high-quality zero-shot text-to-speech synthesis and any-to-any voice conversion

Yi Lei1, Shan Yang2, Jian Cong1, Lei Xie1, Dan Su2
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Tencent AI Lab, China


Abstract

The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting to a new voice in the zero-shot scenario exist in both stages -- acoustic modeling and vocoding -- previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem in both stages for high-quality zero-shot text-to-speech synthesis (TTS) and any-to-any voice conversion (VC). We first build a universal WaveGAN model that extracts the latent distribution p(z) of speech and reconstructs the waveform from it. A flow-based acoustic model then only needs to learn the same p(z) from text, which naturally avoids the mismatch between the acoustic model and the vocoder and yields high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, so we can further perform high-quality zero-shot speech generation for new speakers. We particularly investigate two methods to construct the speaker space, namely a pre-trained and a jointly-trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments on the LibriTTS and VCTK corpora.
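To make the two-stage idea above concrete, the following is a minimal PyTorch sketch of the pipeline under our own simplifying assumptions (module sizes, a single affine flow step, no duration modeling, no GAN or reconstruction losses); it illustrates the described architecture and is not the authors' implementation.

    # Hypothetical sketch of the two-stage pipeline: a VAE-style WaveGAN
    # encoder/decoder that defines p(z) over speech, and a flow acoustic model that
    # maps between z and a simple prior, conditioned on text states and a speaker
    # embedding.  Sizes, the single affine step, and the omission of duration
    # modeling and GAN/reconstruction losses are illustrative assumptions only.
    import torch
    import torch.nn as nn

    class WaveEncoder(nn.Module):
        """Stage 1 (universal WaveGAN): map a waveform to samples from p(z)."""
        def __init__(self, frame: int = 256, z_dim: int = 64):
            super().__init__()
            self.frame = frame
            self.proj = nn.Linear(frame, 2 * z_dim)   # predicts mean and log-variance

        def forward(self, wav: torch.Tensor) -> torch.Tensor:
            frames = wav.unfold(-1, self.frame, self.frame)            # (B, T, frame)
            mean, logvar = self.proj(frames).chunk(2, dim=-1)
            return mean + torch.randn_like(mean) * (0.5 * logvar).exp()

    class WaveDecoder(nn.Module):
        """Stage 1: reconstruct the waveform from z (adversarial losses omitted)."""
        def __init__(self, frame: int = 256, z_dim: int = 64):
            super().__init__()
            self.proj = nn.Linear(z_dim, frame)

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            return self.proj(z).flatten(-2)                            # (B, T * frame)

    class FlowAcousticModel(nn.Module):
        """Stage 2: an invertible affine map between z and a Gaussian prior,
        conditioned on text-encoder states and a speaker embedding."""
        def __init__(self, text_dim: int = 128, spk_dim: int = 32, z_dim: int = 64):
            super().__init__()
            self.cond = nn.Linear(text_dim + spk_dim, 2 * z_dim)       # scale and shift

        def _scale_shift(self, text_h, spk):
            spk = spk.unsqueeze(1).expand(-1, text_h.size(1), -1)
            log_s, b = self.cond(torch.cat([text_h, spk], dim=-1)).chunk(2, dim=-1)
            return log_s, b

        def forward(self, z, text_h, spk):        # training direction: z -> prior noise
            log_s, b = self._scale_shift(text_h, spk)
            return (z - b) * torch.exp(-log_s)

        def inverse(self, noise, text_h, spk):    # inference direction: noise -> z
            log_s, b = self._scale_shift(text_h, spk)
            return noise * torch.exp(log_s) + b

    enc, dec, flow = WaveEncoder(), WaveDecoder(), FlowAcousticModel()
    text_h = torch.randn(1, 50, 128)              # hypothetical text-encoder states
    spk = torch.randn(1, 32)                      # speaker embedding (pre-trained or joint)

    # Training direction: the flow learns to match the same p(z) the WaveGAN extracts.
    noise = flow(enc(torch.randn(1, 50 * 256)), text_h, spk)

    # TTS inference: sample the prior, invert the flow for the target speaker,
    # and decode z directly with the universal WaveGAN -- no vocoder fine-tuning.
    wav = dec(flow.inverse(torch.randn(1, 50, 64), text_h, spk))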



1. Visualization -- speaker embedding visualization of generated seen and unseen speakers from both VCTK and LibriTTS

Short summary: The figure above visualizes speaker embeddings; each shape corresponds to a different speaker. For both seen (green and red dots) and unseen (blue and yellow dots) speakers on the TTS and VC tasks, the generated audio and the corresponding reference speech of each speaker form a distinct cluster under both proposed methods, which shows that the proposed methods can effectively model and control speaker identity.
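Such a plot can be reproduced from per-utterance speaker embeddings; the short sketch below (our assumption, not the authors' plotting script) projects embeddings to 2-D with t-SNE and colors them by speaker, using random placeholder data in place of real speaker-encoder outputs.

    # Minimal t-SNE visualization of speaker embeddings; the arrays are placeholders
    # standing in for speaker-encoder outputs of reference and generated utterances.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    embeddings = np.random.randn(200, 256)        # e.g. 256-dim speaker embeddings
    speakers = np.repeat(np.arange(10), 20)       # 10 speakers, 20 utterances each

    points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    for spk in np.unique(speakers):
        mask = speakers == spk
        plt.scatter(points[mask, 0], points[mask, 1], s=8, label=f"spk {spk}")
    plt.legend(fontsize=6)
    plt.title("Speaker embeddings of reference and generated speech (t-SNE)")
    plt.savefig("speaker_tsne.png", dpi=200)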

2. Demos -- Speech synthesis of different models for seen and unseen speakers

2.1 Text-to-speech for seen speakers

Seen speakers (audio samples omitted)
VCTK speakers: p254, p255, p268, p278
LibriTTS speakers: 118, 4598, 251, 5717
Systems: Ground-Truth / Reference, GlowTTS-HiFiGAN, VITS, Glow-WaveGAN, Glow-WaveGAN2-joint, Glow-WaveGAN2-pre

Short summary: Results of TTS on seen speakers show that the Glow-WaveGAN family and VITS perform better than GlowTTS-HiFiGAN in both audio quality and speaker similarity, especially on the LibriTTS corpus because of the lower quality of its original recordings.

2.2 Zero-shot text-to-speech for unseen speakers

Unseen speakers (zero-shot; audio samples omitted)
VCTK speakers: p276, p251, p329, p334
LibriTTS speakers: 237, 2300, 1580, 7176
Systems: Reference speech, Glow-WaveGAN2-joint, Glow-WaveGAN2-pre

Short summary: Results of zero-shot TTS on unseen speakers demonstrate that our proposed methods can generate high-quality speech in voices unseen during training from only a single reference utterance.

2.3 Zero-shot text-to-speech for unseen speakers from another dataset

Unseen speakers (zero-shot; audio samples omitted)
Trained on LibriTTS, tested on VCTK speakers: p340, p254, p269_400, p287_320
Trained on VCTK, tested on LibriTTS speakers: 4957, 5126, 5400, 707
Systems: Reference speech, Glow-WaveGAN2-joint, Glow-WaveGAN2-pre

Short summary: Results of zero-shot TTS for unseen speakers from another dataset demonstrate the generalization ability of our proposed methods to target speakers from a different dataset. Notably, the model trained on the LibriTTS corpus achieves higher speaker similarity than the model trained on VCTK, since the larger dataset yields a richer speaker space with more speakers.

3. Demos -- Voice conversion

3.1 Many-to-many VC of seen speakers

Many-to-many VC on seen speakers (audio samples omitted)
Corpora: VCTK, LibriTTS
For each corpus: source speech, target speaker, and conversion results from GlowTTS-HiFiGAN, VITS, Glow-WaveGAN, Glow-WaveGAN2-joint, Glow-WaveGAN2-pre

Short summary: Results of many-to-many voice conversion show that the Glow-WaveGAN family and the VITS model both achieve satisfactory naturalness and quality on the VC task, and the speaker similarity of the Glow-WaveGAN family is slightly better than that of VITS.

3.2 Any-to-any VC with the proposed methods

For better understanding, we cross-convert 2 seen and 2 unseen speakers from both VCTK and LibriTTS to cover seen-to-seen, seen-to-unseen, unseen-to-seen, and unseen-to-unseen conversion; in each table below, the first column contains the ground-truth utterance of each speaker.
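One plausible reading of how the flow supports any-to-any conversion (an assumption on our part, since the exact conditioning is not spelled out on this page): run the flow forward under the source speaker embedding to strip speaker identity from the WaveGAN latent, then invert it under the target speaker embedding and decode. The sketch below illustrates that idea with a single speaker-conditioned affine step; all names and dimensions are hypothetical.

    # Hypothetical flow-based any-to-any conversion: z_src comes from the universal
    # WaveGAN encoder, and the returned z would be fed to the WaveGAN decoder.
    import torch
    import torch.nn as nn

    class SpeakerConditionedFlow(nn.Module):
        """A single affine step whose scale/shift depend only on a speaker embedding."""
        def __init__(self, spk_dim: int = 32, z_dim: int = 64):
            super().__init__()
            self.cond = nn.Linear(spk_dim, 2 * z_dim)

        def forward(self, z, spk):                # z -> speaker-independent prior
            log_s, b = self.cond(spk).unsqueeze(1).chunk(2, dim=-1)
            return (z - b) * torch.exp(-log_s)

        def inverse(self, noise, spk):            # prior -> speaker-specific z
            log_s, b = self.cond(spk).unsqueeze(1).chunk(2, dim=-1)
            return noise * torch.exp(log_s) + b

    def convert(z_src, spk_src, spk_tgt, flow):
        prior = flow(z_src, spk_src)              # remove the source identity
        return flow.inverse(prior, spk_tgt)       # impose the target identity

    flow = SpeakerConditionedFlow()
    z_src = torch.randn(1, 120, 64)               # latent frames of the source utterance
    spk_src, spk_tgt = torch.randn(1, 32), torch.randn(1, 32)
    z_tgt = convert(z_src, spk_src, spk_tgt, flow)  # decode z_tgt to get the converted speech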

VCTK results:

Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Seen speakers: p245, p301; unseen (zero-shot) speakers: p251, p276
Rows: source speakers; columns: target speakers


LibriTTS results:

Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Seen speakers: 1093, 4595; unseen (zero-shot) speakers: 237, 8230
Rows: source speakers; columns: target speakers

Short summary: The results demonstrate that our proposed methods achieve high quality and high speaker similarity when performing any-to-any VC between both seen and unseen speakers.

3.3 Any-to-any cross-corpus VC

We test the VC performance when the unseen target speaker is from another dataset.


Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Source speech (rows) and target speech (columns): p245, p301 (VCTK); 1093, 4595 (LibriTTS)
Only cross-corpus source-target pairs are provided; same-corpus pairs are marked "-".

We also test the VC performance when both the unseen source speaker and the unseen target speaker are from another dataset.


Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Unseen VCTK speakers: p252, p257, p271, p351
Rows: source speech; columns: target speech



Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Unseen LibriTTS speakers: 724, 1752, 5322, 7789
Rows: source speech; columns: target speech

Short summary: From the results of any-to-any cross-corpus VC, we find that when the unseen target speaker is from another dataset, our proposed methods can successfully convert the voice to the target speaker. When both the unseen source speaker and the unseen target speaker are from another dataset, the model trained on the LibriTTS corpus performs better than the model trained on the VCTK corpus. This may be because the original recordings in VCTK are of higher quality than those in LibriTTS while VCTK contains fewer speakers, so models trained on LibriTTS generalize better.