UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Contents
- Abstract
- Examples of training data
- Demos - Synthetic samples by UniSyn
- Demos - Comparison of various unified models on TTS for speakers and singers
- Demos - Comparison of various unified models on SVS for speakers and singers
- Demos - Ablation studies
Abstract
Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial for applications that require both. Existing methods usually suffer from limitations: they rely either on both singing and speaking data from the same person or on cascaded models for multiple tasks. To address these problems, this paper proposes UniSyn, a simplified and elegant framework for TTS and SVS. It is an end-to-end unified model that can make a voice speak and sing with only singing or only speaking data from that person. Specifically, UniSyn introduces a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with speaker- and style-related (i.e., speak or sing) conditions for flexible control. Moreover, a supervised guided-VAE and timbre perturbation with a Wasserstein distance constraint are adopted to further disentangle speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voices without the corresponding training data. The proposed approach outperforms state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
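The core of the MC-VAE design is that the priors of two latent sub-spaces are predicted independently from the speaker condition and the style (speak/sing) condition, so a speaker's timbre latent can be paired with the "sing" style latent even when that speaker has no singing data. The sketch below illustrates this idea in PyTorch; all module names, dimensions, and the exact conditioning scheme are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the MC-VAE idea (illustrative, not the paper's code).
import torch
import torch.nn as nn


class ConditionalPrior(nn.Module):
    """Predicts a Gaussian prior over one latent sub-space from a condition
    embedding (speaker identity, or style: speak vs. sing)."""

    def __init__(self, cond_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Linear(cond_dim, 2 * latent_dim)

    def forward(self, cond: torch.Tensor):
        mu, logvar = self.net(cond).chunk(2, dim=-1)
        return mu, logvar


class MCVAE(nn.Module):
    """Two independent latent sub-spaces: one tied to the speaker condition,
    one tied to the style condition, so timbre and style can be recombined
    freely at inference time."""

    def __init__(self, cond_dim: int = 256, latent_dim: int = 16):
        super().__init__()
        self.speaker_prior = ConditionalPrior(cond_dim, latent_dim)
        self.style_prior = ConditionalPrior(cond_dim, latent_dim)

    @staticmethod
    def reparameterize(mu, logvar):
        # z = mu + sigma * eps, the standard VAE sampling trick.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def sample(self, speaker_cond: torch.Tensor, style_cond: torch.Tensor):
        z_spk = self.reparameterize(*self.speaker_prior(speaker_cond))
        z_sty = self.reparameterize(*self.style_prior(style_cond))
        # The concatenated latent conditions the decoder. Pairing a
        # speaker's z_spk with the "sing" style's z_sty is what would let a
        # TTS-only voice sing in this sketch.
        return torch.cat([z_spk, z_sty], dim=-1)
```

At inference, swapping the style condition between "speak" and "sing" embeddings while keeping the speaker condition fixed is what allows one voice to perform both tasks.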
1. Examples of training data
A singing corpus and a speech corpus are used to train UniSyn. The singing corpus contains two singers, denoted 'Singer-1' and 'Singer-2', with only singing data. The speech corpus contains two speakers, denoted 'Speaker-1' and 'Speaker-2', with only speech recordings. Below, we list a few examples from the training corpora so that listeners can get familiar with each target speaker/singer timbre. In the tables that follow, '(audio)' marks an audio sample on the demo page, and '\' marks a recording that does not exist for that voice.
- Singer-1/Singer-2: Singers with only singing training data
- Speaker-1/Speaker-2: Speakers with only speech training data
Target Timbre | Ground-truth speech recording | Ground-truth singing recording
---|---|---
Singer-1 | \ | (audio)
Singer-2 | \ | (audio)
Speaker-1 | (audio) | \
Speaker-2 | (audio) | \
2. Demos - Synthetic samples by UniSyn
Target timbre | Ground-truth timbre | TTS | SVS
---|---|---|---
Singer-1 | (audio) | (audio) | (audio)
Singer-2 | (audio) | (audio) | (audio)
Speaker-1 | (audio) | (audio) | (audio)
Speaker-2 | (audio) | (audio) | (audio)
3. Demos - Comparison of various unified models on TTS for speakers and singers
We compare the synthetic speech of target speakers and singers generated by different systems.
Compared models:
- VITS-tts [1]: a VITS system trained only on the speech data, treating TTS as the only task.
- VITS-unify: the unified model built on VITS, using the flow decoder to generate both speaking and singing voices.
- UniSyn-tts: the proposed system trained only on the speech data, for TTS only.
- UniSyn-unify (proposed): the proposed unified model for both TTS and SVS.
3.1 Synthetic speech of target singers, who have only singing training data
Target speaker | Ground-truth timbre | VITS-tts | VITS-unify | UniSyn-tts | UniSyn-unify (proposed)
---|---|---|---|---|---
Singer-1 | \ | \ | (audio ×4) | (audio ×4) | (audio ×4)
Singer-2 | \ | \ | (audio ×4) | (audio ×4) | (audio ×4)
Short summary: These samples demonstrate the superiority of the proposed method in producing speaking voices for target singers who have no speech training data.
3.2 Synthetic speech of target speakers
Target speaker | Ground-truth timbre | VITS-tts | VITS-unify | UniSyn-tts | UniSyn-unify (proposed)
---|---|---|---|---|---
Speaker-1 | (audio) | (audio) | (audio) | (audio) | (audio)
Speaker-2 | (audio) | (audio) | (audio) | (audio) | (audio)
Short summary: The results demonstrate that the proposed UniSyn, with its interpretable latent distribution, matches VITS in speech generation ability.
4. Demos - Comparison of various unified models on SVS for speakers and singers
We compare the synthetic singing voices of target speakers and singers generated by different systems.
Compared models:
- Learn2Sing [2]: a system that teaches speakers to sing, using a HiFi-GAN vocoder to synthesize audio from the mel-spectrogram.
- VITS-svs: a VITS system trained only on the singing data, treating SVS as the only task.
- VITS-unify: the unified model built on VITS, using the flow decoder to generate both speaking and singing voices.
- UniSyn-svs: the proposed system trained only on the singing data, for SVS only.
- UniSyn-unify (proposed): the proposed unified model for both TTS and SVS.
4.1 Synthetic singing voices of target speakers, who have only speech training data
Target speaker | Ground-truth timbre | Learn2Sing | VITS-svs | VITS-unify | UniSyn-svs | UniSyn-unify (proposed)
---|---|---|---|---|---|---
Speaker-1 | \ | (audio ×4) | \ | (audio ×4) | (audio ×4) | (audio ×4)
Speaker-2 | \ | (audio ×4) | \ | (audio ×4) | (audio ×4) | (audio ×4)
Short summary: The results demonstrate the effectiveness of the proposed model in generating natural singing voices for speakers who have no singing training data, while preserving the target timbre.
4.2 Synthetic singing voices of target singers
Target speaker | Ground-truth timbre | Learn2Sing | VITS-svs | VITS-unify | UniSyn-svs | UniSyn-unify (proposed)
---|---|---|---|---|---|---
Singer-1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Singer-2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
Short summary: The proposed model achieves performance similar to the SOTA voice generation system VITS in generating natural singing voices of singers.
5. Demos - Ablation studies
We conduct ablation studies to evaluate the proposed components in UniSyn, where "-pert" and "-GVAE" denote removing the speaker timbre perturbation and the supervised guided-VAE, respectively, from the proposed method.
Target Timbre | Ground-truth recording | UniSyn-unify (proposed) | -pert | -GVAE | -pert -GVAE
---|---|---|---|---|---
TTS: Singer-1 | (audio) | (audio) | (audio) | (audio) | (audio)
SVS: Singer-1 | (audio) | (audio) | (audio) | (audio) | (audio)
TTS: Singer-2 | (audio) | (audio) | (audio) | (audio) | (audio)
SVS: Singer-2 | (audio) | (audio) | (audio) | (audio) | (audio)
TTS: Speaker-1 | (audio) | (audio) | (audio) | (audio) | (audio)
SVS: Speaker-1 | (audio) | (audio) | (audio) | (audio) | (audio)
TTS: Speaker-2 | (audio) | (audio) | (audio) | (audio) | (audio)
SVS: Speaker-2 | (audio) | (audio) | (audio) | (audio) | (audio)
Short summary: The ablation results indicate that removing either strategy leads to performance degradation to different degrees. Specifically, removing the speaker timbre perturbation noticeably degrades speaker similarity for both TTS and SVS. We also find that the GVAE matters more for the naturalness of the TTS task, although it is beneficial for both tasks and all metrics.
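For reference, the timbre perturbation removed in "-pert" is trained with a Wasserstein distance constraint (see the abstract). The paper's exact formulation is not reproduced on this page; a minimal sketch, assuming the constraint is the closed-form 2-Wasserstein distance between two diagonal Gaussian posteriors (e.g., the timbre posteriors before and after perturbation), is shown below.

```python
# Squared 2-Wasserstein distance between diagonal Gaussians (standard
# closed form); pairing it with perturbed timbre posteriors is our
# assumption about how such a constraint could be applied, not a
# confirmed detail of UniSyn.
import torch


def gaussian_w2_sq(mu1: torch.Tensor, logvar1: torch.Tensor,
                   mu2: torch.Tensor, logvar2: torch.Tensor) -> torch.Tensor:
    """W2^2(N(mu1, sigma1^2), N(mu2, sigma2^2))
       = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2 for diagonal covariances."""
    sigma1 = torch.exp(0.5 * logvar1)
    sigma2 = torch.exp(0.5 * logvar2)
    return ((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2).sum(dim=-1)
```

Unlike the KL divergence, this distance stays well behaved when the two distributions barely overlap, which is one common reason to prefer it as a regularizer between original and perturbed posteriors.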
References
[1] Kim, J.; Kong, J.; and Son, J. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of International Conference on Machine Learning, 5530–5540. PMLR.
[2] Xue, H.; Wang, X.; Zhang, Y.; Xie, L.; Zhu, P.; and Bi, M. 2022. Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher. In Proc. Interspeech 2022, 4267–4271.