| ![]() | ![]() | ![]() |
|---|---|---|
| (a) Initial phoneme embeddings colored by phoneme type | (b) Initial phoneme embeddings colored by speaker label | (c) Initial phoneme embeddings colored by emotion label |
| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
| (a) t-SNE visualization of speaker embeddings (B-A) colored by speaker label | (b) t-SNE visualization of emotion embeddings (C-B) colored by emotion label | (c) PCA visualization of pitch embeddings (E-D) colored by predicted pitch value | (d) PCA visualization of energy embeddings (F-E) colored by predicted energy value |
| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
| (a) Full style embeddings (F-A) colored by speaker label | (b) Full style embeddings (F-A) colored by emotion label | (c) Full style embeddings normalized by speaker embedding (F-B) colored by speaker label | (d) Full style embeddings normalized by speaker embedding (F-B) colored by emotion label |
| | neutral | happy | sad | angry |
|---|---|---|---|---|
| nea speaker mel | ![]() | ![]() | ![]() | ![]() |
| nea speaker wav | | | | |
| nem speaker mel | ![]() | ![]() | ![]() | ![]() |
| nem speaker wav | | | | |
| nec speaker mel | ![]() | ![]() | ![]() | ![]() |
| nec speaker wav | | | | |
| neo speaker mel | ![]() | ![]() | ![]() | ![]() |
| neo speaker wav | | | | |
| | increased energy values | the original energy values | decreased energy values |
|---|---|---|---|
| GT mel | ![]() | ![]() | ![]() |
| GT wavs | | | |
| w/ data aug. mel | ![]() | ![]() | ![]() |
| w/ data aug. wavs | | | |
| w/o data aug. mel | ![]() | ![]() | ![]() |
| w/o data aug. wavs | | | |
| | increased pitch values | the original pitch values | decreased pitch values |
|---|---|---|---|
| GT mel | ![]() | ![]() | ![]() |
| GT wavs | | | |
| w/ data aug. mel | ![]() | ![]() | ![]() |
| w/ data aug. wavs | | | |
| w/o data aug. mel | ![]() | ![]() | ![]() |
| w/o data aug. wavs | | | |
| | the original pitch and energy values | pitch +, energy + | pitch +, energy - | pitch -, energy + | pitch -, energy - |
|---|---|---|---|---|---|
| GT mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| GT wavs | | | | | |
| w/ data aug. mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| w/ data aug. wavs | | | | | |
| w/o data aug. mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| w/o data aug. wavs | | | | | |
Column groups, left to right: the sources of speaker embedding; the sources of other style embeddings; the samples synthesized with the combined style embedding.

| | | | | | |
|---|---|---|---|---|---|
| emh speaker + emg(angry)'s prosody mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emh speaker + emg(angry)'s prosody wav | | | | | |
| emh speaker + emb(happy)'s prosody mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emh speaker + emb(happy)'s prosody wav | | | | | |
| emb speaker + emg(angry)'s prosody mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emb speaker + emg(angry)'s prosody wav | | | | | |
| emb speaker + ema(sad)'s prosody mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emb speaker + ema(sad)'s prosody wav | | | | | |
Column groups, left to right: the sources of emotion embedding; the sources of other style embeddings; the samples synthesized with the combined style embedding.

| | | | | | |
|---|---|---|---|---|---|
| emb's neutral emotion + emb(sad)'s other prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emb's neutral emotion + emb(sad)'s other prosodies wav | | | | | |
| emh's happy emotion + emh(angry)'s other prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emh's happy emotion + emh(angry)'s other prosodies wav | | | | | |
| emg's sad emotion + emg(happy)'s other prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emg's sad emotion + emg(happy)'s other prosodies wav | | | | | |
| emb's angry emotion + emb(happy)'s other prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emb's angry emotion + emb(happy)'s other prosodies wav | | | | | |
Column groups, left to right: the sources of speaker embedding (KSS dataset); the sources of other style embeddings (ETOD dataset); the samples synthesized with the combined style embedding.

| | | | | | |
|---|---|---|---|---|---|
| KSS + emh speaker(angry)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KSS + emh speaker(angry)'s prosodies wav | | | | | |
| KSS + emg speaker(angry)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KSS + emg speaker(angry)'s prosodies wav | | | | | |
| KSS + emb speaker(neutral)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KSS + emb speaker(neutral)'s prosodies wav | | | | | |
| KSS + ema speaker(sad)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KSS + ema speaker(sad)'s prosodies wav | | | | | |
Column groups, left to right: the sources of speaker and emotion embeddings (KES dataset); the sources of other style embeddings (ETOD dataset); the samples synthesized with the combined style embedding.

| | | | | | |
|---|---|---|---|---|---|
| KES(disgusting) + emh speaker(angry)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KES(disgusting) + emh speaker(angry)'s prosodies wav | | | | | |
| KES(surprise) + emb speaker(sad)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KES(surprise) + emb speaker(sad)'s prosodies wav | | | | | |
| KES(fear) + emf speaker(happy)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KES(fear) + emf speaker(happy)'s prosodies wav | | | | | |
| | Reference sample (GT mel) | Separate embedding | Gradient reversal | UniTTS (w/o aug) | |
|---|---|---|---|---|---|
| neo - angry | | | | | |
| emb - sad | | | | | |
| ema - happy | | | | | |
| emh - angry | | | | | |
| nea - angry | | | | | |
| unseen* | | | | | |
| | Reference sample (GT mel) | UniTTS learned from scratch | UniTTS (w/o aug) | UniTTS |
|---|---|---|---|---|
| ema - happy | | | | |
| ema - angry | | | | |
| emg - happy | | | | |
| ned - sad | | | | |