UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Yi Lei1, Shan Yang2, Xinsheng Wang1, Qicong Xie1, Jixun Yao1, Lei Xie1, Dan Su2
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
http://www.npu-aslp.org
2 Tencent AI Lab, China
AAAI 2023


Abstract

Text-to-speech (TTS) and singing voice synthesis (SVS) aim to generate high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial for applications that require both. Existing methods usually suffer from limitations: they rely either on both singing and speaking data from the same person or on cascaded models for the two tasks. To address these problems, this paper proposes UniSyn, a simplified, elegant framework for TTS and SVS. It is an end-to-end unified model that can make a voice speak and sing given only singing or speaking data from that person. Specifically, UniSyn adopts a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with speaker- and style-related (i.e., speaking or singing) conditions for flexible control. Moreover, a supervised guided-VAE and timbre perturbation with a Wasserstein distance constraint are adopted to further disentangle speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voices without corresponding training data. The proposed approach outperforms state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
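The demo page does not reproduce the paper's exact training objective. As a rough, illustrative sketch only (function names and the diagonal-Gaussian assumption are ours, not from the paper), the VAE reparameterization used to sample from a latent sub-space, and a squared 2-Wasserstein distance between two diagonal Gaussians of the kind used as a disentanglement constraint, can be written as:

```python
import math

def reparameterize(mu, logvar, eps):
    """VAE reparameterization trick: z = mu + sigma * eps,
    where sigma = exp(0.5 * logvar) and eps ~ N(0, I).
    In an MC-VAE-style model, each latent sub-space (e.g. speaker-
    and style-conditioned) would be sampled this way independently."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

def w2_sq_diag_gaussians(mu1, logvar1, mu2, logvar2):
    """Squared 2-Wasserstein distance between diagonal Gaussians:
    W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2.
    Such a term can constrain the distribution of timbre-perturbed
    latents to stay close to the original timbre distribution."""
    d_mu = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    d_sig = sum((math.exp(0.5 * a) - math.exp(0.5 * b)) ** 2
                for a, b in zip(logvar1, logvar2))
    return d_mu + d_sig
```

This is a minimal sketch of the standard closed form for Gaussians, not the paper's implementation; in practice these operations would run on tensors inside the training loop.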




1. Examples of training data


Target Timbre | Ground-truth speech recording | Ground-truth singing recording
Singer-1 | [audio] | [audio]
Singer-2 | [audio] | [audio]
Speaker-1 | [audio] | [audio]
Speaker-2 | [audio] | [audio]
(Audio players are embedded on the original demo page.)


2. Demos -- Synthetic samples by UniSyn

Target timbre | Ground-truth timbre | TTS | SVS
Singer-1 | [audio] | [audio] | [audio]
Singer-2 | [audio] | [audio] | [audio]
Speaker-1 | [audio] | [audio] | [audio]
Speaker-2 | [audio] | [audio] | [audio]


3. Demos -- Comparison of various unified models on TTS for speakers and singers.

We compare the synthetic speech of target speakers and singers generated from different systems.

Compared models: VITS-tts, VITS-unify, UniSyn-tts, and UniSyn-unify (proposed).

3.1 Synthetic speech of target singers, who only have singing training data.

Target speaker | Ground-truth timbre | VITS-tts | VITS-unify | UniSyn-tts | UniSyn-unify (proposed)
Singer-1 | [audio] | [audio] | [audio] | [audio] | [audio]
Singer-2 | [audio] | [audio] | [audio] | [audio] | [audio]
(Four utterances per target are provided for each system on the demo page.)

Short summary: These samples show the superiority of the proposed method in producing speaking voices for target singers who have no speech training data.


3.2 Synthetic speech of target speakers.


Target speaker | Ground-truth timbre | VITS-tts | VITS-unify | UniSyn-tts | UniSyn-unify (proposed)
Speaker-1 | [audio] | [audio] | [audio] | [audio] | [audio]
Speaker-2 | [audio] | [audio] | [audio] | [audio] | [audio]

Short summary: The results demonstrate that the proposed UniSyn, with its interpretable latent distribution, matches VITS in speech generation ability.


4. Demos -- Comparison of various unified models on SVS for speakers and singers.

We compare the synthetic singing voices of target speakers and singers from different systems.

Compared models: Learn2Sing, VITS-svs, VITS-unify, UniSyn-svs, and UniSyn-unify (proposed).

4.1 Synthetic singing voices of target speakers, who only have speaking training data.

Target speaker | Ground-truth timbre | Learn2Sing | VITS-svs | VITS-unify | UniSyn-svs | UniSyn-unify (proposed)
Speaker-1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Speaker-2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
(Four utterances per target are provided for each system on the demo page.)

Short summary: The results demonstrate the effectiveness of the proposed model in generating natural singing voices for speakers who have no singing training data, while preserving the target timbre.

4.2 Synthetic singing voices of target singers.

Target speaker | Ground-truth timbre | Learn2Sing | VITS-svs | VITS-unify | UniSyn-svs | UniSyn-unify (proposed)
Singer-1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Singer-2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]

Short summary: The proposed model achieves performance comparable to the SOTA voice generation system VITS in generating natural singing voices of singers.


5. Demos -- Ablation studies

We conduct ablation studies to evaluate the proposed components in UniSyn, where "-pert" and "-GVAE" denote removing the speaker timbre perturbation and the supervised guided-VAE from the proposed method, respectively.


Target Timbre | Ground-truth recording | UniSyn-unify (proposed) | -pert | -GVAE | -pert -GVAE
TTS: Singer-1 | [audio] | [audio] | [audio] | [audio] | [audio]
SVS: Singer-1 | [audio] | [audio] | [audio] | [audio] | [audio]
TTS: Singer-2 | [audio] | [audio] | [audio] | [audio] | [audio]
SVS: Singer-2 | [audio] | [audio] | [audio] | [audio] | [audio]
TTS: Speaker-1 | [audio] | [audio] | [audio] | [audio] | [audio]
SVS: Speaker-1 | [audio] | [audio] | [audio] | [audio] | [audio]
TTS: Speaker-2 | [audio] | [audio] | [audio] | [audio] | [audio]
SVS: Speaker-2 | [audio] | [audio] | [audio] | [audio] | [audio]

Short summary: The ablation results indicate that removing either strategy leads to some degree of performance degradation. Specifically, removing the speaker timbre perturbation noticeably degrades speaker similarity for both TTS and SVS. We also find that the GVAE matters more to the naturalness of the TTS task, although it is necessary for both tasks and metrics.




References

[1] Kim, J.; Kong, J.; and Son, J. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of International Conference on Machine Learning, 5530–5540. PMLR.

[2] Xue, H.; Wang, X.; Zhang, Y.; Xie, L.; Zhu, P.; and Bi, M. 2022. Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher. In Proc. Interspeech 2022, 4267–4271.