Cross-speaker Emotion Transfer for Expressive Speech Synthesis through Information Perturbation

Yi Lei, Shan Yang, Xinfa Zhu, Qicong Xie, Lei Xie, Dan Su
Northwestern Polytechnical University
Tencent AI Lab

0. Contents

  1. Abstract
  2. Examples of information perturbation
  3. Demos -- Cross-speaker emotion transfer
  4. Demos -- Generated audio examples from different branches


1. Abstract

Cross-speaker emotion transfer is an effective way to produce expressive speech for neutral target speakers, as it does not require emotional training data from the target speakers. Since the emotion and timbre of the source speaker are heavily entangled, existing approaches often struggle to balance speaker similarity against emotional expressiveness. In this paper, we propose to disentangle the timbre and the emotion of speech through information perturbation in order to conduct cross-speaker emotion transfer, which effectively learns the emotion expressions of the source speaker while maintaining the timbre of the neutral target speakers. Specifically, we perturb the timbre and emotion information (e.g., formant and pitch) of the source speech separately to obtain and model the emotion- and timbre-independent signals, based on which the proposed model can further produce emotional speech in the timbre of the target speakers. Experimental results demonstrate that the proposed approach significantly outperforms the baseline models in terms of naturalness and similarity, indicating the effectiveness of information perturbation for cross-speaker emotion transfer.



2. Examples of information perturbation

Speaker | Original recording | Perturbation on speaker | Perturbation on emotion
Source speaker - Neutral
Source speaker - Happy
Source speaker - Angry
Source speaker - Sad
Source speaker - Surprise
Source speaker - Disgust
M1
F1
F2
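
To make the two perturbation columns above more concrete, the following is a minimal sketch of what pitch and formant perturbation could look like on frame-level features. It is not the implementation used in the paper: the function names (perturb_pitch, perturb_formants, sample_perturbation), the list-based feature representation, and all sampling ranges are assumptions chosen for illustration. Following the abstract, the pitch contour is treated here as an example of emotion-related information and the formant positions as an example of timbre-related information.

```python
import math
import random


def perturb_pitch(f0, shift_semitones, range_scale):
    """Shift and rescale a frame-level F0 contour given in Hz.

    Frames with f0 <= 0 are treated as unvoiced and returned as 0.0.
    The contour is scaled around its mean in log-frequency, then shifted.
    """
    voiced_log = [math.log2(f) for f in f0 if f > 0]
    if not voiced_log:
        return [0.0 for _ in f0]
    mean = sum(voiced_log) / len(voiced_log)
    out = []
    for f in f0:
        if f <= 0:
            out.append(0.0)
        else:
            log_f = mean + range_scale * (math.log2(f) - mean)
            out.append(2.0 ** (log_f + shift_semitones / 12.0))
    return out


def perturb_formants(envelope, ratio):
    """Warp a spectral envelope along the frequency axis.

    A peak at bin p moves to roughly bin p * ratio (linear interpolation).
    Positions that fall past the last bin reuse the last envelope value.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    n = len(envelope)
    out = []
    for k in range(n):
        src = k / ratio  # input position that output bin k reads from
        if src >= n - 1:
            out.append(envelope[-1])
        else:
            lo = int(src)
            frac = src - lo
            out.append((1.0 - frac) * envelope[lo] + frac * envelope[lo + 1])
    return out


def sample_perturbation(rng):
    """Draw random perturbation parameters (all ranges are assumed values)."""
    ratio = rng.uniform(1.0, 1.4)
    if rng.random() < 0.5:
        ratio = 1.0 / ratio  # shift formants down instead of up
    return {
        "shift_semitones": rng.uniform(-4.0, 4.0),
        "range_scale": rng.uniform(0.5, 1.5),
        "formant_ratio": ratio,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    params = sample_perturbation(rng)
    f0 = [0.0, 180.0, 220.0, 260.0, 0.0]          # toy F0 contour in Hz
    envelope = [0.1, 0.3, 1.0, 0.4, 0.2, 0.6, 0.2, 0.1]  # toy spectral envelope
    print(perturb_pitch(f0, params["shift_semitones"], params["range_scale"]))
    print(perturb_formants(envelope, params["formant_ratio"]))
```

In this sketch, changing only the pitch parameters alters the intonation while leaving the envelope untouched, and changing only the formant ratio does the opposite, which mirrors the idea of perturbing the two kinds of information separately.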


3. Demos -- Cross-speaker emotion transfer

Transfer the emotion expressions of the source speaker to the neutral target speakers, for whom no emotional training data is available.

Target speaker: M1

Emotion | Target emotion example | Target speaker example | FS2-GST | FS2-VAE | FS2-BN | Proposed
Neutral Text: 我很明白你的意思。(English: I know exactly what you mean.)
Happy Text: 太棒了,风浩这种做法果然是有效。(English: Great! Feng Hao's practice is really effective.)
Angry Text: 很快穆先生就由恐惧滋生成了愤怒的情绪。(English: Mr. Mu became angry from fear very soon.)
Sad Text: 心痛到濒临疯狂,是谁告诉我要学会遗忘。(English: Heartache to the brink of madness, who told me to learn to forget.)
Surprise Text: 我的手指头断了,竟然一点儿也没有感觉。(English: I broke my finger, but I didn't feel it at all.)
Disgust Text: 心动不如行动,我不太擅长卖萌。(English: Action is better than thought. I'm not very good at being cute.)

Short summary: When the target neutral speaker is male, FS2-GST and FS2-VAE cannot produce the correct timbre, although the generated speech is emotional. FS2-BN also sometimes fails to generate a male voice. The proposed model successfully maintains the target timbre when generating emotional speech.

Target speaker: F1

Emotion | Target emotion example | Target speaker example | FS2-GST | FS2-VAE | FS2-BN | Proposed
Neutral Text: 他希望能找个办法既能让这家伙吃苦头又不连累自己。(English: He hoped to find a way to make this guy suffer without harming himself.)
Happy Text: 戴老师他不是很高兴嘛?(English: Mr. Dai is very happy, isn't he?)
Angry Text: 第一次打昏的时候,我是最愤怒的!(English: The first time I fainted, I was the angriest!)
Sad Text: 求您给我一个改过自新的机会,以后我再也不去行骗了。(English: Please give me a chance to reform. I will never cheat again in the future.)
Surprise Text: 我的手指头断了,竟然一点儿也没有感觉。(English: I broke my finger, but I didn't feel it at all.)
Disgust Text: 你是存在于世界的另一个我,如此憎恨却又无法摆脱。(English: You are another me in this world, whom I hate so much yet cannot get rid of.)

Target speaker: F2

Emotion | Target emotion example | Target speaker example | FS2-GST | FS2-VAE | FS2-BN | Proposed
Neutral Text: 预计今天下午到前半夜我市部分地区有雷阵雨。(English: Thunderstorms are expected in some areas of our city from this afternoon through the first half of the night.)
Happy Text: 又有新发现了,真是令人惊喜万分。(English: It's amazing to find something new.)
Angry Text: 住口,不要喊我父亲,我没有你这样的女儿!(English: Shut up, don't call me father. I don't have a daughter like you!)
Sad Text: 求您给我一个改过自新的机会,以后我再也不去行骗了。(English: Please give me a chance to reform. I will never cheat again in the future.)
Surprise Text: 啊,是的是的。真是耳闻不如目见。天使呀,您来有什么话对我说?(English: Ah, yes, yes. Seeing is better than hearing. Angel, what do you want to say to me?)
Disgust Text: 翁倩玉在赈灾晚会上演唱。(English: Weng Qianyu sang at the disaster relief party.)

Short summary: The results indicate that the proposed method successfully transfers the source emotion to the target speaker while maintaining the target speaker's timbre.



4. Demos -- Generated audio examples from different branches

Speaker | Emotion | Speaker-mel generator | Emotion-mel generator | Final output mel
F1 | Neutral
F2 | Happy
F2 | Angry
F1 | Sad
Source speaker | Surprise
M1 | Disgust

Short summary: Given different speaker and emotion embeddings during inference, the Speaker-mel generator provides emotionless speech with a specific timbre, while the output of the Emotion-mel generator contains the emotion variations with a random or averaged timbre. Based on these two outputs, the final generated speech has the specified emotion expression and timbre.