Glow-WaveGAN 2: high-quality zero-shot text-to-speech synthesis and any-to-any voice conversion
Contents
- Abstract
- Visualization -- speaker visualization of generated seen and unseen speakers from both VCTK and LibriTTS
- Demos -- Text-to-speech
- Text-to-speech for seen speakers
- Zero-shot text-to-speech for unseen speakers
- Zero-shot text-to-speech for unseen speakers of another dataset
- Demos -- Voice conversion
Abstract
The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice from only one utterance of the target speaker. Although the challenge of adapting to a new voice in the zero-shot scenario exists in both stages -- acoustic modeling and vocoding -- previous works usually consider the problem in only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem in both stages for high-quality zero-shot text-to-speech synthesis (TTS) and any-to-any voice conversion (VC). We first build a universal WaveGAN model for extracting the latent distribution p(z) of speech and reconstructing the waveform from it. A flow-based acoustic model then only needs to learn the same p(z) from text, which naturally avoids the mismatch between the acoustic model and the vocoder and yields high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, so we can further conduct high-quality zero-shot speech generation for new speakers. We particularly investigate two methods to construct the speaker space, namely a pre-trained and a jointly-trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments conducted on the LibriTTS and VCTK corpora.
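To make the two-stage design concrete, below is a minimal PyTorch sketch of the pipeline. All layer choices, shapes, and names are illustrative assumptions rather than the paper's implementation, and a single conditional affine transform stands in for the full flow stack.

```python
import torch
import torch.nn as nn

LATENT = 64

class WaveGAN(nn.Module):
    """Stage 1 (sketch): a universal vocoder that encodes speech into a
    latent distribution p(z) and reconstructs the waveform from it."""
    def __init__(self):
        super().__init__()
        # Illustrative conv encoder/decoder; the real model is trained
        # adversarially on raw waveforms.
        self.enc = nn.Conv1d(1, 2 * LATENT, kernel_size=256, stride=128, padding=64)
        self.dec = nn.ConvTranspose1d(LATENT, 1, kernel_size=256, stride=128, padding=64)

    def encode(self, wav):                        # wav: (B, 1, T)
        mu, logvar = self.enc(wav).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # z ~ q(z|x)

    def decode(self, z):                          # z: (B, LATENT, T // 128)
        return self.dec(z)

class AffineFlow(nn.Module):
    """Stage 2 (sketch): one conditional affine step of an invertible flow.
    forward() maps z toward a simple prior for likelihood training;
    inverse() generates z from prior noise at synthesis time."""
    def __init__(self, cond_dim):
        super().__init__()
        self.proj = nn.Conv1d(cond_dim, 2 * LATENT, kernel_size=1)

    def forward(self, z, cond):                   # z -> eps (toward the prior)
        m, logs = self.proj(cond).chunk(2, dim=1)
        return (z - m) * torch.exp(-logs)

    def inverse(self, eps, cond):                 # eps -> z (generation)
        m, logs = self.proj(cond).chunk(2, dim=1)
        return m + eps * torch.exp(logs)

vocoder, flow = WaveGAN(), AffineFlow(cond_dim=80)
wav = torch.randn(1, 1, 128 * 100)                # dummy waveform
cond = torch.randn(1, 80, 100)                    # stand-in for text + speaker conditioning

z = vocoder.encode(wav)                           # latent learned by the vocoder
eps = flow(z, cond)                               # training: push z toward the prior
z_gen = flow.inverse(torch.randn_like(eps), cond) # synthesis: sample the prior, invert
wav_gen = vocoder.decode(z_gen)                   # decode without any fine-tuning
```

Because both stages share the same latent space p(z), the flow's inverse pass plus the vocoder's decoder form the complete synthesis path, with no mismatch to repair between them.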

1. Visualization -- speaker visualization of generated seen and unseen speakers from both VCTK and LibriTTS
- Red dots: ground-truth seen speakers
- Green dots: generated seen speakers
- Blue dots: ground-truth unseen speakers
- Yellow dots: generated unseen speakers

Short summary: The figure above visualizes the speaker embeddings; each marker corresponds to a different speaker. For both seen (red and green dots) and unseen (blue and yellow dots) speakers on the TTS and VC tasks, the generated audio and the corresponding reference speech of each speaker form a distinct cluster under our two proposed methods, which shows that the proposed methods can effectively model and control speaker identity.
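For reference, a visualization like this can be produced by projecting utterance-level speaker embeddings to 2-D with t-SNE. A minimal sketch follows, assuming some speaker encoder is available; random vectors stand in for the real embeddings here.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Stand-ins for speaker embeddings (e.g. d-vectors) extracted from
# ground-truth and generated utterances; one row per utterance.
groups = {
    "ground-truth seen":   ("red",    rng.normal(0.0, 1, (40, 256))),
    "generated seen":      ("green",  rng.normal(0.2, 1, (40, 256))),
    "ground-truth unseen": ("blue",   rng.normal(4.0, 1, (40, 256))),
    "generated unseen":    ("yellow", rng.normal(4.2, 1, (40, 256))),
}

# Project all embeddings to 2-D with t-SNE and plot each group in the
# color used by the figure's legend.
emb = np.concatenate([e for _, e in groups.values()])
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)

start = 0
for label, (color, e) in groups.items():
    plt.scatter(xy[start:start + len(e), 0], xy[start:start + len(e), 1],
                c=color, s=10, label=label)
    start += len(e)
plt.legend()
plt.savefig("speaker_tsne.png")
```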
2. Demos -- Speech synthesis of different models for seen and unseen speakers
2.1 Text-to-speech for seen speakers
Seen speakers

| System | VCTK p254 | VCTK p255 | VCTK p268 | VCTK p278 | LibriTTS 118 | LibriTTS 4598 | LibriTTS 251 | LibriTTS 5717 |
|---|---|---|---|---|---|---|---|---|
| Ground-Truth / Reference |  |  |  |  |  |  |  |  |
| GlowTTS-HiFiGAN |  |  |  |  |  |  |  |  |
| VITS |  |  |  |  |  |  |  |  |
| Glow-WaveGAN |  |  |  |  |  |  |  |  |
| Glow-WaveGAN2-joint |  |  |  |  |  |  |  |  |
| Glow-WaveGAN2-pre |  |  |  |  |  |  |  |  |
Short summary: TTS results on seen speakers show that the Glow-WaveGAN family and VITS outperform GlowTTS-HiFiGAN in both audio quality and speaker similarity, especially on the LibriTTS corpus because of the lower quality of its original recordings.
2.2 Zero-shot text-to-speech for unseen speakers
Unseen speakers (zero-shot)

| System | VCTK p276 | VCTK p251 | VCTK p329 | VCTK p334 | LibriTTS 237 | LibriTTS 2300 | LibriTTS 1580 | LibriTTS 7176 |
|---|---|---|---|---|---|---|---|---|
| Reference speech |  |  |  |  |  |  |  |  |
| Glow-WaveGAN2-joint |  |  |  |  |  |  |  |  |
| Glow-WaveGAN2-pre |  |  |  |  |  |  |  |  |
Short summary: Zero-shot TTS results on unseen speakers demonstrate that our proposed methods can generate high-quality speech with similar voices for speakers unseen during training, using only a single utterance.
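For intuition, zero-shot inference only needs one reference utterance to obtain a speaker embedding that conditions the flow. A minimal sketch follows, with an assumed mean-pooling speaker encoder; the paper's pre-trained and jointly-trained encoders are more sophisticated.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Sketch of a speaker encoder: maps one reference utterance to a
    fixed-size embedding (architecture is an illustrative assumption)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Conv1d(1, dim, kernel_size=400, stride=160)

    def forward(self, ref_wav):                  # ref_wav: (B, 1, T)
        return self.net(ref_wav).mean(dim=-1)    # temporal mean pooling -> (B, dim)

# Zero-shot inference (sketch): a single utterance of an unseen speaker is
# enough to condition generation; neither the flow nor the vocoder is
# fine-tuned for the new voice.
ref_wav = torch.randn(1, 1, 16000)               # one reference utterance
spk = SpeakerEncoder()(ref_wav)                  # (1, 256) speaker embedding
# Text encoding and the flow/vocoder calls then follow the pipeline sketch
# in the Abstract:
#   eps ~ N(0, I); z = flow.inverse(eps, cond=[text, spk]); wav = vocoder.decode(z)
```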
2.3 Zero-shot text-to-speech for unseen speakers of another dataset
Unseen speakers (zero-shot). The first four speakers use the model trained on LibriTTS and tested on VCTK; the last four use the model trained on VCTK and tested on LibriTTS.

| System | p340 | p254 | p269_400 | p287_320 | 4957 | 5126 | 5400 | 707 |
|---|---|---|---|---|---|---|---|---|
| Reference speech |  |  |  |  |  |  |  |  |
| Glow-WaveGAN2-joint |  |  |  |  |  |  |  |  |
| Glow-WaveGAN2-pre |  |  |  |  |  |  |  |  |
Short summary: Zero-shot TTS results for unseen speakers from another dataset demonstrate the generalization ability of our proposed methods to target speakers from a different dataset. Notably, the model trained on LibriTTS achieves higher speaker similarity than the model trained on VCTK, since the larger dataset, with more speakers, learns a richer speaker space.
3. Demos -- Voice conversion
3.1 Many-to-many VC of seen speakers
| Corpus | Source speech | Target speaker | GlowTTS-HiFiGAN | VITS | Glow-WaveGAN | Glow-WaveGAN2-joint | Glow-WaveGAN2-pre |
|---|---|---|---|---|---|---|---|
| VCTK |  |  |  |  |  |  |  |
| LibriTTS |  |  |  |  |  |  |  |
Short summary: Many-to-many voice conversion results from different models demonstrate that both the Glow-WaveGAN family and VITS achieve satisfactory naturalness and quality on the VC task, while the speaker similarity of the Glow-WaveGAN family is slightly better than that of VITS.
3.2 Any-to-any VC of proposed methods
For a comprehensive comparison, we cross-convert two seen and two unseen speakers from both VCTK and LibriTTS to cover seen-to-seen, seen-to-unseen, unseen-to-seen, and unseen-to-unseen conversion; in each table, the first column contains the ground-truth utterance of each source speaker.
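One way to read how the flow's reversibility enables these conversions (a sketch under our assumptions, not necessarily the paper's exact procedure): run the flow forward with the source speaker's embedding to strip identity from the vocoder latent, then run the inverse pass with the target speaker's embedding and decode with the shared vocoder.

```python
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    """Single conditional affine step (same sketch as in the Abstract);
    the forward/inverse pair is what makes conversion possible."""
    def __init__(self, latent=64, spk_dim=256):
        super().__init__()
        self.proj = nn.Linear(spk_dim, 2 * latent)

    def forward(self, z, spk):                   # z -> speaker-independent eps
        m, logs = self.proj(spk).unsqueeze(-1).chunk(2, dim=1)
        return (z - m) * torch.exp(-logs)

    def inverse(self, eps, spk):                 # eps -> z for a new speaker
        m, logs = self.proj(spk).unsqueeze(-1).chunk(2, dim=1)
        return m + eps * torch.exp(logs)

flow = AffineFlow()
z_src = torch.randn(1, 64, 100)                  # source latent (from the vocoder encoder)
spk_src = torch.randn(1, 256)                    # source speaker embedding
spk_tgt = torch.randn(1, 256)                    # target speaker embedding (one utterance suffices)

eps = flow(z_src, spk_src)                       # strip source identity via the forward pass
z_tgt = flow.inverse(eps, spk_tgt)               # re-inject target identity via the inverse pass
# wav_converted = vocoder.decode(z_tgt)          # decode with the universal WaveGAN
```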
VCTK results:
- Seen speakers: p245, p301
- Unseen speakers: p251, p276
Glow-WaveGAN2-joint results are in the left four target columns, Glow-WaveGAN2-pre in the right four; the Ground-Truth column holds each source speaker's reference utterance.

| Source ↓ / Target → | Ground-Truth | p245 | p301 | p251 | p276 | p245 | p301 | p251 | p276 |
|---|---|---|---|---|---|---|---|---|---|
| p245 (seen) |  |  |  |  |  |  |  |  |  |
| p301 (seen) |  |  |  |  |  |  |  |  |  |
| p251 (unseen) |  |  |  |  |  |  |  |  |  |
| p276 (unseen) |  |  |  |  |  |  |  |  |  |
LibriTTS results:
- Seen speakers: 1093, 4595
- Unseen speakers: 237, 8230
Glow-WaveGAN2-joint results are in the left four target columns, Glow-WaveGAN2-pre in the right four; the Ground-Truth column holds each source speaker's reference utterance.

| Source ↓ / Target → | Ground-Truth | 1093 | 4595 | 237 | 8230 | 1093 | 4595 | 237 | 8230 |
|---|---|---|---|---|---|---|---|---|---|
| 1093 (seen) |  |  |  |  |  |  |  |  |  |
| 4595 (seen) |  |  |  |  |  |  |  |  |  |
| 237 (unseen) |  |  |  |  |  |  |  |  |  |
| 8230 (unseen) |  |  |  |  |  |  |  |  |  |
Short summary: The any-to-any VC results demonstrate that our proposed methods can produce high-quality speech with similar voices when converting between both seen and unseen speakers.
3.3 Any-to-any cross-corpus VC
We test the VC performance when the unseen target speaker is from another dataset.
- VCTK speakers: p245, p301
- LibriTTS speakers: 1093, 4595
Glow-WaveGAN2-joint results are in the left four target columns, Glow-WaveGAN2-pre in the right four; a dash marks same-corpus pairs, which fall outside this cross-corpus test.

| Source speech ↓ / Target speech → | p245 | p301 | 1093 | 4595 | p245 | p301 | 1093 | 4595 |
|---|---|---|---|---|---|---|---|---|
| p245 | - | - |  |  | - | - |  |  |
| p301 | - | - |  |  | - | - |  |  |
| 1093 |  |  | - | - |  |  | - | - |
| 4595 |  |  | - | - |  |  | - | - |
We also test the VC performance when both the unseen source speaker and the unseen target speaker are from another dataset.
- Training dataset: LibriTTS
- Inference dataset: VCTK
Glow-WaveGAN2-joint results are in the left four target columns, Glow-WaveGAN2-pre in the right four.

| Source speech ↓ / Target speech → | p252 | p257 | p271 | p351 | p252 | p257 | p271 | p351 |
|---|---|---|---|---|---|---|---|---|
| p252 |  |  |  |  |  |  |  |  |
| p257 |  |  |  |  |  |  |  |  |
| p271 |  |  |  |  |  |  |  |  |
| p351 |  |  |  |  |  |  |  |  |
- Training dataset: VCTK
- Inference dataset: LibriTTS
Glow-WaveGAN2-joint results are in the left four target columns, Glow-WaveGAN2-pre in the right four.

| Source speech ↓ / Target speech → | 724 | 1752 | 5322 | 7789 | 724 | 1752 | 5322 | 7789 |
|---|---|---|---|---|---|---|---|---|
| 724 |  |  |  |  |  |  |  |  |
| 1752 |  |  |  |  |  |  |  |  |
| 5322 |  |  |  |  |  |  |  |  |
| 7789 |  |  |  |  |  |  |  |  |
Short summary: From the cross-corpus any-to-any VC results, we find that when the unseen target speaker comes from another dataset, our proposed methods can still successfully convert the voice to the target speaker. When both the unseen source and target speakers come from another dataset, the model trained on the LibriTTS corpus performs better than the model trained on VCTK. This is likely because, although the original recordings in VCTK are of higher quality than those in LibriTTS, VCTK contains far fewer speakers, so models trained on LibriTTS generalize better.