Glow-WaveGAN 2: high-quality zero-shot text-to-speech synthesis and any-to-any voice conversion

Yi Lei1, Shan Yang2, Jian Cong1, Lei Xie1, Dan Su2
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Tencent AI Lab, China


Abstract

The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting to a new voice in the zero-shot scenario exist in both stages -- acoustic modeling and vocoding -- previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem in both stages for high-quality zero-shot text-to-speech synthesis (TTS) and any-to-any voice conversion (VC). We first build a universal WaveGAN model that extracts the latent distribution p(z) of speech and reconstructs the waveform from it. A flow-based acoustic model then only needs to learn the same p(z) from text, which naturally avoids the mismatch between the acoustic model and the vocoder and yields high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, so we can further perform high-quality zero-shot speech generation for new speakers. We particularly investigate two methods to construct the speaker space, namely a pre-trained and a jointly-trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments on the LibriTTS and VCTK corpora.
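To make the two-stage idea above concrete, the following is a minimal PyTorch sketch of the pipeline under our own simplifying assumptions (module sizes, a single affine flow step, no duration modeling, no GAN or reconstruction losses); it illustrates the described architecture and is not the authors' implementation.

    # Hypothetical sketch of the two-stage pipeline: a VAE-style WaveGAN
    # encoder/decoder that defines p(z) over speech, and a flow acoustic model that
    # maps between z and a simple prior, conditioned on text states and a speaker
    # embedding.  Sizes, the single affine step, and the omission of duration
    # modeling and GAN/reconstruction losses are illustrative assumptions only.
    import torch
    import torch.nn as nn

    class WaveEncoder(nn.Module):
        """Stage 1 (universal WaveGAN): map a waveform to samples from p(z)."""
        def __init__(self, frame: int = 256, z_dim: int = 64):
            super().__init__()
            self.frame = frame
            self.proj = nn.Linear(frame, 2 * z_dim)   # predicts mean and log-variance

        def forward(self, wav: torch.Tensor) -> torch.Tensor:
            frames = wav.unfold(-1, self.frame, self.frame)            # (B, T, frame)
            mean, logvar = self.proj(frames).chunk(2, dim=-1)
            return mean + torch.randn_like(mean) * (0.5 * logvar).exp()

    class WaveDecoder(nn.Module):
        """Stage 1: reconstruct the waveform from z (adversarial losses omitted)."""
        def __init__(self, frame: int = 256, z_dim: int = 64):
            super().__init__()
            self.proj = nn.Linear(z_dim, frame)

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            return self.proj(z).flatten(-2)                            # (B, T * frame)

    class FlowAcousticModel(nn.Module):
        """Stage 2: an invertible affine map between z and a Gaussian prior,
        conditioned on text-encoder states and a speaker embedding."""
        def __init__(self, text_dim: int = 128, spk_dim: int = 32, z_dim: int = 64):
            super().__init__()
            self.cond = nn.Linear(text_dim + spk_dim, 2 * z_dim)       # scale and shift

        def _scale_shift(self, text_h, spk):
            spk = spk.unsqueeze(1).expand(-1, text_h.size(1), -1)
            log_s, b = self.cond(torch.cat([text_h, spk], dim=-1)).chunk(2, dim=-1)
            return log_s, b

        def forward(self, z, text_h, spk):        # training direction: z -> prior noise
            log_s, b = self._scale_shift(text_h, spk)
            return (z - b) * torch.exp(-log_s)

        def inverse(self, noise, text_h, spk):    # inference direction: noise -> z
            log_s, b = self._scale_shift(text_h, spk)
            return noise * torch.exp(log_s) + b

    enc, dec, flow = WaveEncoder(), WaveDecoder(), FlowAcousticModel()
    text_h = torch.randn(1, 50, 128)              # hypothetical text-encoder states
    spk = torch.randn(1, 32)                      # speaker embedding (pre-trained or joint)

    # Training direction: the flow learns to match the same p(z) the WaveGAN extracts.
    noise = flow(enc(torch.randn(1, 50 * 256)), text_h, spk)

    # TTS inference: sample the prior, invert the flow for the target speaker,
    # and decode z directly with the universal WaveGAN -- no vocoder fine-tuning.
    wav = dec(flow.inverse(torch.randn(1, 50, 64), text_h, spk))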



1. Visualization -- speaker embedding visualization of generated seen and unseen speakers from both VCTK and LibriTTS

Short summary: The figure above visualizes speaker embeddings; each shape corresponds to a different speaker. For both seen (green and red dots) and unseen (blue and yellow dots) speakers on the TTS and VC tasks, the generated audio and the corresponding reference speech of each speaker form a distinct cluster under both proposed methods, which shows that the proposed methods can effectively model and control speaker identity.
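Such a plot can be reproduced from per-utterance speaker embeddings; the short sketch below (our assumption, not the authors' plotting script) projects embeddings to 2-D with t-SNE and colors them by speaker, using random placeholder data in place of real speaker-encoder outputs.

    # Minimal t-SNE visualization of speaker embeddings; the arrays are placeholders
    # standing in for speaker-encoder outputs of reference and generated utterances.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    embeddings = np.random.randn(200, 256)        # e.g. 256-dim speaker embeddings
    speakers = np.repeat(np.arange(10), 20)       # 10 speakers, 20 utterances each

    points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    for spk in np.unique(speakers):
        mask = speakers == spk
        plt.scatter(points[mask, 0], points[mask, 1], s=8, label=f"spk {spk}")
    plt.legend(fontsize=6)
    plt.title("Speaker embeddings of reference and generated speech (t-SNE)")
    plt.savefig("speaker_tsne.png", dpi=200)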

2. Demos -- Speech synthesis of different models for seen and unseen speakers

2.1 Text-to-speech for seen speakers

Seen speakers (audio samples omitted)
VCTK speakers: p254, p255, p268, p278
LibriTTS speakers: 118, 4598, 251, 5717
Systems: Ground-Truth / Reference, GlowTTS-HiFiGAN, VITS, Glow-WaveGAN, Glow-WaveGAN2-joint, Glow-WaveGAN2-pre

Short summary: Results of TTS on seen speakers show that the Glow-WaveGAN family and VITS perform better than GlowTTS-HiFiGAN in both audio quality and speaker similarity, especially on the LibriTTS corpus because of the lower quality of its original recordings.

2.2 Zero-shot text-to-speech for unseen speakers

Unseen speakers (zero-shot; audio samples omitted)
VCTK speakers: p276, p251, p329, p334
LibriTTS speakers: 237, 2300, 1580, 7176
Systems: Reference speech, Glow-WaveGAN2-joint, Glow-WaveGAN2-pre

Short summary: Results of zero-shot TTS on unseen speakers demonstrate that our proposed methods can generate high-quality speech in voices unseen during training from only a single reference utterance.

2.3 Zero-shot text-to-speech for unseen speakers from another dataset

Unseen speakers (zero-shot; audio samples omitted)
Trained on LibriTTS, tested on VCTK speakers: p340, p254, p269_400, p287_320
Trained on VCTK, tested on LibriTTS speakers: 4957, 5126, 5400, 707
Systems: Reference speech, Glow-WaveGAN2-joint, Glow-WaveGAN2-pre

Short summary: Results of zero-shot TTS for unseen speakers from another dataset demonstrate the generalization ability of our proposed methods to target speakers from a different dataset. Notably, the model trained on the LibriTTS corpus achieves higher speaker similarity than the model trained on VCTK, since the larger dataset yields a richer speaker space with more speakers.

3. Demos -- Voice conversion

3.1 Many-to-many VC of seen speakers

Many-to-many VC on seen speakers (audio samples omitted)
Corpora: VCTK, LibriTTS
For each corpus: source speech, target speaker, and conversion results from GlowTTS-HiFiGAN, VITS, Glow-WaveGAN, Glow-WaveGAN2-joint, Glow-WaveGAN2-pre

Short summary: Results of many-to-many voice conversion show that the Glow-WaveGAN family and the VITS model both achieve satisfactory naturalness and quality on the VC task, and the speaker similarity of the Glow-WaveGAN family is slightly better than that of VITS.

3.2 Any-to-any VC with the proposed methods

For better understanding, we cross-convert 2 seen and 2 unseen speakers from both VCTK and LibriTTS to cover seen-to-seen, seen-to-unseen, unseen-to-seen, and unseen-to-unseen conversion; in each table below, the first column contains the ground-truth utterance of each speaker.
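One plausible reading of how the flow supports any-to-any conversion (an assumption on our part, since the exact conditioning is not spelled out on this page): run the flow forward under the source speaker embedding to strip speaker identity from the WaveGAN latent, then invert it under the target speaker embedding and decode. The sketch below illustrates that idea with a single speaker-conditioned affine step; all names and dimensions are hypothetical.

    # Hypothetical flow-based any-to-any conversion: z_src comes from the universal
    # WaveGAN encoder, and the returned z would be fed to the WaveGAN decoder.
    import torch
    import torch.nn as nn

    class SpeakerConditionedFlow(nn.Module):
        """A single affine step whose scale/shift depend only on a speaker embedding."""
        def __init__(self, spk_dim: int = 32, z_dim: int = 64):
            super().__init__()
            self.cond = nn.Linear(spk_dim, 2 * z_dim)

        def forward(self, z, spk):                # z -> speaker-independent prior
            log_s, b = self.cond(spk).unsqueeze(1).chunk(2, dim=-1)
            return (z - b) * torch.exp(-log_s)

        def inverse(self, noise, spk):            # prior -> speaker-specific z
            log_s, b = self.cond(spk).unsqueeze(1).chunk(2, dim=-1)
            return noise * torch.exp(log_s) + b

    def convert(z_src, spk_src, spk_tgt, flow):
        prior = flow(z_src, spk_src)              # remove the source identity
        return flow.inverse(prior, spk_tgt)       # impose the target identity

    flow = SpeakerConditionedFlow()
    z_src = torch.randn(1, 120, 64)               # latent frames of the source utterance
    spk_src, spk_tgt = torch.randn(1, 32), torch.randn(1, 32)
    z_tgt = convert(z_src, spk_src, spk_tgt, flow)  # decode z_tgt to get the converted speech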

VCTK results:

Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Seen speakers: p245, p301; unseen (zero-shot) speakers: p251, p276
Rows: source speakers; columns: target speakers


LibriTTS results:

Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Seen speakers: 1093, 4595; unseen (zero-shot) speakers: 237, 8230
Rows: source speakers; columns: target speakers

Short summary: The results demonstrate that our proposed methods achieve high quality and high speaker similarity when performing any-to-any VC between both seen and unseen speakers.

3.3 Any-to-any cross-corpus VC

We test the VC performance when the unseen target speaker is from another dataset.


Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Source speech (rows) and target speech (columns): p245, p301 (VCTK); 1093, 4595 (LibriTTS)
Only cross-corpus source-target pairs are provided; same-corpus pairs are marked "-".

We also test the VC performance when both the unseen source speaker and the unseen target speaker are from another dataset.


Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Unseen VCTK speakers: p252, p257, p271, p351
Rows: source speech; columns: target speech



Systems: Glow-WaveGAN2-joint, Glow-WaveGAN2-pre (audio samples omitted)
Unseen LibriTTS speakers: 724, 1752, 5322, 7789
Rows: source speech; columns: target speech

Short summary: From the results of any-to-any cross-corpus VC, we find that when the unseen target speaker is from another dataset, our proposed methods can successfully convert the voice to the target speaker. When both the unseen source speaker and the unseen target speaker are from another dataset, the model trained on the LibriTTS corpus performs better than the model trained on the VCTK corpus. This may be because the original recordings in VCTK are of higher quality than those in LibriTTS while VCTK contains fewer speakers, so models trained on LibriTTS generalize better.