UniTTS Demo Samples

Visualization of Unified Embedding Space

These visualization results illustrate the latent space of style attributes modeled by UniTTS.
We used EmotionTTS dataset for visualization. EmotionTTS dataset consists of 15 speakers and 4 emotions(neutral, happy, sad and angry).

Initial phoneme embeddings that does not contain style information


(a) Initial phoneme embeddings colored by phoneme type	(b) Initial phoneme embeddings colored by speaker label	(c) Initial phoneme embeddings colored by emotion label

These figures are the t-SNE visualization of the distribution of the unstyled phoneme embeddings extracted from the locations marked as A in Figure 3 (please refer the paper).
(a) shows that the unstyled phoneme embedding represents phoneme types, while (b)and (c) show that it does not contain speaker or emotion information.

Residual embeddings of style attributes (speaker ID, emotion, pitch and energy)


(a) t-SNE visualization of speaker embeddings (B-A) colored by speaker label	(b) t-SNE visualization of emotion embeddings (C-B) colored by emotion label	(c) PCA visualization of pitch embeddings (E-D) colored by predicted pitch value	(d) PCA visualization of energy embeddings (F-E) colored by predicted energy value

These figures are the distribution of the residual embeddings of speaker, emotion, pitch, and energy. (Please note that A, B, C, D, E and F are denoted in paper-figure 3.)
The uppercase letters indicate the locations in Fig. 3 where the embeddings were extracted.
These figures show that the residual embeddings are effective in representing the style attributes.

Full style embeddings that jointly represent all attributes (speaker ID, emotion, unlabeled local prosody, pitch, and energy)


(a) Full style embeddings(F-A) colored by speaker label	(b) Full style embeddings(F-A) colored by emotion label	(c) Full style embeddings normalized by speaker embedding (F-B) colored by speaker label	(d) Full style embeddings normalized by speaker embedding (F-B) colored by emotion label

These figures are the t-SNE visualization of the distribution of the full style embeddings that incorporate all style attributes.
The uppercase letters indicate the locations in Fig. 3 where the embeddings were extracted.
(a) and (b)show that the full style embedding contains both speaker and emotion information.
(c) shows that the full style embedding normalized by the means of the speaker embeddings does not contain speaker information.
(d) shows that the variance in emotion is dominant after normalizing the full style embedding by the means of speaker embeddings.

Energy and Pitch Control (with and without Transform-aware Data Augmentation)

Energy control

The following samples were synthesized by UniTTS with increased, the original, and decreased energy values
     - Row 1: the ground truth samples and augmented samples whose energy values were increased or decreased using the SOX toolkit.
     - Row 2: the audio samples synthesized by UniTTS applying data augmentation
     - Row 3: the audio samples synthesized by UniTTS not applying data augmentation

Without data augmentation, UniTTS produced speech samples with deteriorated quality when the energy value was increased or decreased. Particularly, when the energy value was decreased, it produced severly broken and distorted samples.
However, when applying data augmentation, it produced clean samples even with increased or decreased energy values.

	increased energy values	the original energy values	decreased energy values
GT mel
GT wavs
w/ data aug. mel
w/ data aug. wavs
w/o data aug. mel
w/o data aug. wavs

Pitch control

The following samples were synthesized by UniTTS with increased, the original, and decreased pitch values
Please note that adjusting the pitch of the voice using the SOX toolkit has a side-effect that changes the timbre as shown in the first row.
     - Row 1: the ground truth samples and augmented samples whose pitch values were increased or decreased using the SOX toolkit.
     - Row 2: the audio samples synthesized by UniTTS applying data augmentation
     - Row 3: the audio samples synthesized by UniTTS not applying data augmentation

Without data augmentation, UniTTS shows limited ability to control pitch, as shown more clearly in the spectrograms.
When applying data augmentation, it controlled pitch more effectively but changed timbre, because it was trained with the augmented samples whose timbre was changed due to the side-effect of the SOX toolkit.
We ask the listener to compare the samples focusing on the ability to control pitch.

	increased pitch values	the original pitch values	decreased pitch values
GT mel
GT wavs
w/ data aug. mel
w/ data aug. wavs
w/o data aug. mel
w/o data aug. wavs

Pitch and energy control

The following samples were synthesized by UniTTS controling both pitch and energy.
Applying data augmenation, UniTTS can effectively control pitch and energy.

	the original pitch and energy values	pitch +, energy +	pitch +, energy -	pitch -, energy +	pitch -, energy -
GT mel
GT wavs
w/ data aug. mel
w/ data aug. wavs
w/o data aug. mel
w/o data aug. wavs

Style Mixing

Speaker identity transfer

The first two columns show the synthesized samples with different speaker and emotion IDs.
We extracted the speaker embedding used to synthesize the first samples and other propody embeddings used to synthesize the second samples.
Then, we combined the embeddings to synthesize the third samples.
The third samples have the timbre of the first samples and the style of the second samples.

	The sources of speaker embedding		The sources of other style embeddings		The samples synthesized with the combined style embedding
emh speaker + emg(angry)'s prosody mel
emh speaker + emg(angry)'s prosody wav
emh speaker + emb(happy)'s prosody mel
emh speaker + emb(happy)'s prosody wav
emb speaker + emg(angry)'s prosody mel
emb speaker + emg(angry)'s prosody wav
emb speaker + ema(sad)'s prosody mel
emb speaker + ema(sad)'s prosody wav

Emotion representation transfer

The first two columns show the synthesized samples with different speaker and emotion IDs.
We extracted the emotion embedding used to synthesize the first samples and other propody embeddings used to synthesize the second samples.
Then, we combined the embeddings to synthesize the third samples.
The third samples have the emotion of the first samples and the style of the second samples.

	The sources of emotion embedding		The sources of other style embeddings		The samples synthesized with the combined style embedding
emb's neutral emotion + emb(sad)'s other prosodies mel
emb's neutral emotion + emb(sad)'s other prosodies wav
emh's happy emotion + emh(angry)'s other prosodies mel
emh's happy emotion + emh(angry)'s other prosodies wav
emg's sad emotion + emg(happy)'s other prosodies mel
emg's sad emotion + emg(happy)'s other prosodies wav
emb's angry emotion + emb(happy)'s other prosodies mel
emb's angry emotion + emb(happy)'s other prosodies wav

Transfer of emotion, duration, pitch and energy from ETOD samples to KSS speaker

The KSS dataset contains 12,853 speech samples WITHOUT emotion label spoken by a SINGLE female speaker. (Please note that no emotion label exists in KSS dataset)
The ETOD dataset contains 6,000 samples with 4 emotion types spoken by 15 speakers.

We transferred the style of the samples in the ETOD dataset to the KSS speaker.
We extracted the speaker embedding from the KSS samples, that do not have emotion labels, and the other style embeddings from the samples of the ETOD dataset.
Then, we synthesized speech using the combined style embedding.

The first and second columns show the samples of the KSS dataset and the ETOD dataset, respectively.
The third column shows the syntesized samples using the combined style embeddings.

	The sources of speaker embedding. (KSS dataset)		The sources of other style embeddings (ETOD dataset)		The samples synthesized with the combined style embedding
KSS + emh speaker(angry)'s prosodies mel
KSS + emh speaker(angry)'s prosodies wav
KSS + emg speaker(angry)'s prosodies mel
KSS + emg speaker(angry)'s prosodies wav
KSS + emb speaker(neutral)'s prosodies mel
KSS + emb speaker(neutral)'s prosodies wav
KSS + ema speaker(sad)'s prosodies mel
KSS + ema speaker(sad)'s prosodies wav

Transfer of duration, pitch, and energy from ETOD samples to KES speaker

We transferred the duration, pitch, energy of the samples in the ETOD dataset to the KES speaker.
We extracted the speaker and emotion embeddings from the KES samples spoken by a single speaker and the duration, pitch, and energy embeddings from the samples in the ETOD dataset.
Then, we synthesized speech using the combined style embedding.

The first and second columns show the samples of the KES dataset and the ETOD dataset, respectively.
The third column shows the syntesized samples using the combined style embeddings.

	The sources of speaker and emotion embeddings. (KES dataset)		The sources of other style embeddings (ETOD dataset)		The samples synthesized with the combined style embedding.
KES(disgusting) + emh speaker(angry)'s prosodies mel
KES(disgusting) + emh speaker(angry)'s prosodies wav
KES(surprise) + emb speaker(sad)'s prosodies mel
KES(surprise) + emb speaker(sad)'s prosodies wav
KES(fear) + emf speaker(happy)'s prosodies mel
KES(fear) + emf speaker(happy)'s prosodies wav

Comparison to other methods

We compared samples from UniTTS(w.o. aug) with other method such as separate embedding and gradient reversal.
UniTTS produces the most natural and similar samples compared to separate embedding and gradient reversal.
(Please note that most of the samples are from MOS test.)

Overall, UniTTS (w.o. aug) produces natural samples while keeping speaker id and emotion similarity.
UniTTS(w.o. aug) produces best result with respect to natural duration, pitch, energy and intonation, compared to other methods.
In addition, UniTTS (w.o. aug) can preserve speaker id, even on unknown combination of speaker and emotion. (e.g. speaker ema's identity and kss's curious emotion)