MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Yi Lei1, Shan Yang2, Xinsheng Wang3, Lei Xie1
1 Northwestern Polytechnical University, China
2 Tencent AI Lab, China
3 Xi’an Jiaotong University, China

0. Contents

  1. Abstract
  2. Demos -- Emotional speech synthesis by transferring the emotion from reference audio
  3. Demos -- Emotional speech synthesis by predicting the emotion from input text
  4. Demos -- Emotional speech synthesis by manual control
  5. Demos -- Component analysis - Global-level emotion presenting module (GM)
  6. Demos -- Component analysis - Utterance-level emotion presenting module (UM)


1. Abstract

Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules: the global-level emotion presenting module (GM), the utterance-level emotion presenting module (UM), and the local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows us to synthesize emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference audio-based and text-based emotional speech synthesis methods on emotion transfer and text-based emotion prediction, respectively. The experiments also show that the proposed method can control emotion expression flexibly. Detailed analysis shows the effectiveness of each module and of the overall design.



2. Demos -- Emotional speech synthesis by transferring the emotion from reference audio

Corresponding to Section 5.1 in our paper, the samples below were synthesized for the evaluation of the emotion transfer task. We compared MsEmoTTS (proposed) with the GST model on both parallel and non-parallel emotion transfer.

Parallel emotion transfer: the input text is the same as the transcription of reference audio.

Emotion Reference GST MsEmoTTS (Proposed)
Happiness Text: 我要写一篇下周一要交的论文。(English: I have to write a paper that is due next Monday.)
Anger Text: 就因为我们反对李超那害群新法。(English: Just because we oppose Li Chao's new law.)
Sadness Text: 注定我要在思念的路上走一段吧。(English: I'm destined to walk the road of longing for a while.)
Surprise Text: 众人的冷嘲热讽使他如芒刺背,坐立不安。(English: The crowd's ridicule and sarcasm made him restless, as if he were sitting on thorns.)
Fear Text: 大灰狼露出了它锋利的牙齿。(English: The big bad wolf showed its sharp teeth.)
Disgust Text: 还真的觉得自己美如天仙啊。(English: You really think you are as beautiful as a fairy.)

Non-parallel emotion transfer: the input text is different from the transcription of reference audio.

Emotion Reference GST MsEmoTTS (Proposed)
Happiness Text: 日子够苦的了,我们得互相体谅。(English: Days are hard enough; we should understand each other.)
Anger Text: 单身狗我在家,防止受到一万点伤害。(English: As a single man, I'm staying at home to avoid serious injury.)
Sadness Text: 第一次登上领奖台,我又紧张又兴奋。(English: For the first time on the podium, I was nervous and excited.)
Surprise Text: 保罗沃克去世了,一个时代结束了。(English: Paul Walker died, and an era was over.)
Fear Text: 区区一小鬼,也敢在我面前大言不惭!(English: You, just a kid, dare to brag unblushingly in my presence!)
Disgust Text: 这不公平,怎么能这样对我,好气呀!(English: It's not fair! How can you treat me this way? I'm so angry!)

Short summary: Experimental results for both parallel and non-parallel emotion transfer demonstrate that the proposed method conveys emotion information from the reference audio effectively and outperforms the GST model on reference audio-based emotional speech synthesis.



3. Demos -- Emotional speech synthesis by predicting the emotion from input text

Corresponding to Section 5.2 in our paper, the samples below were synthesized for the evaluation of emotion prediction from input text only. We compared MsEmoTTS (proposed) with the TPSE-GST model.

Emotion TPSE-GST Proposed
Happiness Text: 老师今天好开心呐!(English: The teacher is so happy today!)
Anger Text: 我那个妻子气得我简直要暴跳如雷了。(English: My wife makes me so angry that I'm about to fly into a rage.)
Sadness Text: 看你伤心,我都伤心了。(English: Seeing you sad makes me sad too.)
Surprise Text: 啊?什么行不行的?(English: Huh? What do you mean, OK or not?)
Fear Text: 太可怕了,我看到小绵羊被马桶吃掉了!(English: It's terrible! I saw the little sheep get eaten by the toilet!)
Disgust Text: 他这背信的人,令人厌恶。(English: He is treacherous and disgusting.)

Short summary: MsEmoTTS obtains higher emotion naturalness MOS scores than TPSE-GST, which demonstrates that MsEmoTTS is suitable for emotional speech synthesis by predicting the emotion from input text only.
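For readers unfamiliar with MOS reporting, the following is a minimal sketch of how a mean opinion score and a 95% confidence interval are commonly computed from listener ratings. The function name `mos_with_ci`, the normal-approximation interval, and the example ratings are all illustrative assumptions; they are not taken from our evaluation.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 listener ratings.

    Assumption: a normal-approximation confidence interval,
    mean +/- z * s / sqrt(n), with z = 1.96 for roughly 95% coverage.
    """
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        # A single rating carries no spread information.
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Made-up ratings for illustration only (not results from the paper).
example_ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, ci = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Two systems are usually compared by checking whether their intervals overlap, in addition to comparing the means.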



4. Demos -- Emotional speech synthesis by manual control

Corresponding to Section 5.3 in our paper, the samples below were synthesized to demonstrate emotional speech synthesis by manual control. For the same input text, different global emotion categories and local emotion strengths are used to synthesize emotional speech with different expressions. "All strengths 0" and "All strengths 1" mean that the manually assigned strength of every syllable in the utterance is set to 0 or 1, respectively. "Strength increasing" and "Strength decreasing" mean that the local emotion strength gradually increases or decreases within the utterance.
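The four control settings can be pictured as per-syllable strength sequences. The sketch below is only an illustration: the function name `strength_pattern`, the mode names, and the choice of a linear ramp for "gradually" increasing or decreasing strength are assumptions, not the exact values used to produce the samples.

```python
def strength_pattern(n_syllables, mode):
    """Return one emotion strength in [0, 1] per syllable.

    Modes (hypothetical names): "all_0", "all_1", "increasing", "decreasing".
    """
    if n_syllables < 1:
        raise ValueError("need at least one syllable")
    if mode == "all_0":
        return [0.0] * n_syllables
    if mode == "all_1":
        return [1.0] * n_syllables
    # Assumption: "gradually" is modeled as a linear ramp from 0 to 1.
    steps = max(n_syllables - 1, 1)
    ramp = [i / steps for i in range(n_syllables)]
    if mode == "increasing":
        return ramp
    if mode == "decreasing":
        return ramp[::-1]
    raise ValueError(f"unknown mode: {mode}")


# The demo sentence below has 12 syllables (one per Chinese character).
for mode in ("all_0", "all_1", "increasing", "decreasing"):
    print(mode, [round(s, 2) for s in strength_pattern(12, mode)])
```

Such a sequence, together with a chosen global emotion category, would be the manual input that replaces the predicted or transferred strengths.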

Text: 舒家姐妹现在的人气很高啊。(English: The Shu sisters are very popular now.)

Emotion All strengths 0 All strengths 1 F0 curves
Happiness
Anger
Sadness
Surprise
Fear
Disgust

Emotion Strength increasing Strength decreasing F0 curves
Happiness
Anger
Sadness
Surprise
Fear
Disgust

Short summary: The results for all emotions indicate that the local emotion strength and global emotion category of synthetic speech can be manually controlled successfully.



5. Demos -- Component analysis - Global-level emotion presenting module (GM)

Corresponding to Section 6.1 in our paper, the samples below were synthesized for the evaluation of GM. We compared how consistent the generated emotion expression is with the input text when using our proposed "soft" emotion embeddings versus conventional "hard" emotion embeddings in emotion prediction. "P-h" denotes a model identical to our proposed model except that the global emotion is represented by conventional "hard" emotion embeddings.
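To make the hard/soft distinction concrete, here is a minimal sketch under a stated assumption: a "hard" embedding selects the single embedding of the most probable emotion category, while a "soft" embedding is taken to be the probability-weighted sum of all category embeddings. This page does not spell out the exact formulation, so the weighting scheme, the tiny 3-dimensional table `EMBEDDINGS`, and the function names are hypothetical; see the paper for the actual definition.

```python
EMOTIONS = ["happiness", "anger", "sadness", "surprise", "fear", "disgust"]

# Hypothetical 3-dimensional embedding table, one row per emotion category.
EMBEDDINGS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]


def hard_embedding(probs):
    """Pick the embedding of the single most probable emotion."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return list(EMBEDDINGS[best])


def soft_embedding(probs):
    """Assumed soft variant: weight every embedding by its probability."""
    dim = len(EMBEDDINGS[0])
    return [
        sum(p * row[d] for p, row in zip(probs, EMBEDDINGS))
        for d in range(dim)
    ]


# Made-up classifier output: mostly "happiness" with some "surprise".
probs = [0.6, 0.0, 0.0, 0.4, 0.0, 0.0]
print("hard:", hard_embedding(probs))
print("soft:", soft_embedding(probs))
```

Under this assumption, the soft variant keeps information about secondary emotions that the hard variant discards, and the two coincide when the classifier output is one-hot.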

Emotion P-h MsEmoTTS (Proposed)
Happiness Text: 老师今天好开心呐!(English: The teacher is so happy today!)
Anger Text: 我那个妻子气得我简直要暴跳如雷了。(English: My wife makes me so angry that I'm about to fly into a rage.)
Sadness Text: 看你伤心,我都伤心了。(English: Seeing you sad makes me sad too.)
Surprise Text: 啊?什么行不行的?(English: Huh? What do you mean, OK or not?)
Fear Text: 太可怕了,我看到小绵羊被马桶吃掉了!(English: It's terrible! I saw the little sheep get eaten by the toilet!)
Disgust Text: 他这背信的人,令人厌恶。(English: He is treacherous and disgusting.)

Short summary: The emotion naturalness MOS scores indicate the effectiveness of the proposed soft emotion embedding method.



6. Demos -- Component analysis - Utterance-level emotion presenting module (UM)

Corresponding to Section 6.2 in our paper, the samples below were synthesized to evaluate the effectiveness of UM. "w/o" means without and "w" means with. We compare the similarity of the generated emotion expression to that of the reference audio on the emotion transfer task.

Parallel emotion transfer

Emotion Reference w/o UM w UM
Happiness Text: 舒家姐妹现在的人气很高啊。(English: The Shu sisters are very popular now.)
Anger Text: 就因为我们反对李超那害群新法。(English: Just because we oppose Li Chao's new law.)
Sadness Text: 当爱情来临,当然也是快乐的。(English: Certainly you feel happy when love comes.)
Surprise Text: 每一次他要张嘴说话,他的外交顾问竟然都会紧张。(English: Every time he opened his mouth to speak, his diplomatic advisers would be nervous.)
Fear Text: 快跑啊,海盗们都要冲上来了。(English: Run! The pirates are coming up.)
Disgust Text: 这姑娘的婚事是令人憎恶的,现在能得以幸免是天大的喜事。(English: The girl's marriage was abominable, and it's a great joy to be spared now.)

Non-parallel emotion transfer

Emotion Reference w/o UM w UM
Happiness Text: 这不公平,怎么能这样对我,好气呀!(English: It's not fair! How can you treat me this way? I'm so angry!)
Anger Text: 单身狗我在家,防止受到一万点伤害。(English: As a single man, I'm at home to prevent the serious injury.)
Sadness Text: 交到知心朋友,是多么快乐的事啊。(English: How happy it is to make close friends.)
Surprise Text: 你会说话就多说几句,没人拦你。(English: Say more if you are good at speaking, and no one will stop you.)
Fear Text: 他们出卖国家,遭到人民的唾弃。(English: They betrayed their country and were despised by the people.)
Disgust Text: 交到知心朋友,是多么快乐的事啊。(English: How happy it is to make close friends.)

Short summary: The results of the CMOS and A/B preference tests indicate the effectiveness of modeling utterance-level variation for emotional speech synthesis, and also demonstrate the good design of the proposed UM.
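As a rough illustration of how these two listening tests are typically summarized, the sketch below averages comparative scores into a CMOS value and turns A/B votes into preference shares. The function names, the vote labels, and the example data are assumptions made for illustration; the score range (commonly -3 to 3) is likewise an assumption and not a statement about our test setup.

```python
from collections import Counter


def cmos(scores):
    """Average comparative score; positive values favor the second system."""
    return sum(scores) / len(scores)


def preference_shares(votes):
    """Fraction of votes for "A", "B", and "no preference"."""
    counts = Counter(votes)
    total = len(votes)
    return {
        choice: counts[choice] / total
        for choice in ("A", "B", "no preference")
    }


# Made-up listener responses for illustration only.
example_scores = [1, 2, 0, -1, 2, 1]
example_votes = ["B", "B", "A", "no preference", "B", "B"]
print("CMOS:", round(cmos(example_scores), 2))
print("Preference:", preference_shares(example_votes))
```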