Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder

ISCA Workshop on MLSLP (2018)


Voice conversion is the task of synthesizing an utterance in a target speaker’s voice while preserving the linguistic content of the source utterance. Although a speaker can read a single script with many different intonations, conventional voice conversion models produce only one output per source input. To overcome this limitation, we propose a novel approach to voice conversion with diverse intonations using a conditional variational autoencoder (CVAE). Experiments show that a speaker’s style features can be mapped into a latent space with a Gaussian distribution. We also obtain more diverse intonation by making the posterior over the latent space more expressive with inverse autoregressive flow (IAF). As a result, the converted voice not only exhibits diverse intonation but also has better sound quality than that of a model without the CVAE.


서수빈 (Seoul National University), 안다비 (Kakao Brain), 박희웅 (Seoul National University), 박종헌 (Seoul National University)



Publication date