| (a) | (b) | (c) |
|---|---|---|
| Initial phoneme embeddings colored by phoneme type | Initial phoneme embeddings colored by speaker label | Initial phoneme embeddings colored by emotion label |

| (a) | (b) | (c) | (d) |
|---|---|---|---|
| t-SNE visualization of speaker embeddings (B-A) colored by speaker label | t-SNE visualization of emotion embeddings (C-B) colored by emotion label | PCA visualization of pitch embeddings (E-D) colored by predicted pitch value | PCA visualization of energy embeddings (F-E) colored by predicted energy value |

| (a) | (b) | (c) | (d) |
|---|---|---|---|
| Full style embeddings (F-A) colored by speaker label | Full style embeddings (F-A) colored by emotion label | Full style embeddings normalized by speaker embedding (F-B) colored by speaker label | Full style embeddings normalized by speaker embedding (F-B) colored by emotion label |

| | neutral | happy | sad | angry |
|---|---|---|---|---|
| nea speaker mel | ![]() | ![]() | ![]() | ![]() |
| nea speaker wav | | | | |
| nem speaker mel | ![]() | ![]() | ![]() | ![]() |
| nem speaker wav | | | | |
| nec speaker mel | ![]() | ![]() | ![]() | ![]() |
| nec speaker wav | | | | |
| neo speaker mel | ![]() | ![]() | ![]() | ![]() |
| neo speaker wav | | | | |

| | increased energy values | the original energy values | decreased energy values |
|---|---|---|---|
| GT mel | ![]() | ![]() | ![]() |
| GT wavs | | | |
| w/ data aug. mel | ![]() | ![]() | ![]() |
| w/ data aug. wavs | | | |
| w/o data aug. mel | ![]() | ![]() | ![]() |
| w/o data aug. wavs | | | |

| | increased pitch values | the original pitch values | decreased pitch values |
|---|---|---|---|
| GT mel | ![]() | ![]() | ![]() |
| GT wavs | | | |
| w/ data aug. mel | ![]() | ![]() | ![]() |
| w/ data aug. wavs | | | |
| w/o data aug. mel | ![]() | ![]() | ![]() |
| w/o data aug. wavs | | | |

| | the original pitch and energy values | pitch +, energy + | pitch +, energy - | pitch -, energy + | pitch -, energy - |
|---|---|---|---|---|---|
| GT mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| GT wavs | | | | | |
| w/ data aug. mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| w/ data aug. wavs | | | | | |
| w/o data aug. mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| w/o data aug. wavs | | | | | |

| | The sources of speaker embedding | The sources of other style embeddings | The samples synthesized with the combined style embedding | | |
|---|---|---|---|---|---|
| emh speaker + emg(angry)'s prosody mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emh speaker + emg(angry)'s prosody wav | | | | | |
| emh speaker + emb(happy)'s prosody mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emh speaker + emb(happy)'s prosody wav | | | | | |
| emb speaker + emg(angry)'s prosody mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emb speaker + emg(angry)'s prosody wav | | | | | |
| emb speaker + ema(sad)'s prosody mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emb speaker + ema(sad)'s prosody wav | | | | | |

| | The sources of emotion embedding | The sources of other style embeddings | The samples synthesized with the combined style embedding | | |
|---|---|---|---|---|---|
| emb's neutral emotion + emb(sad)'s other prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emb's neutral emotion + emb(sad)'s other prosodies wav | | | | | |
| emh's happy emotion + emh(angry)'s other prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emh's happy emotion + emh(angry)'s other prosodies wav | | | | | |
| emg's sad emotion + emg(happy)'s other prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emg's sad emotion + emg(happy)'s other prosodies wav | | | | | |
| emb's angry emotion + emb(happy)'s other prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| emb's angry emotion + emb(happy)'s other prosodies wav | | | | | |

| | The sources of speaker embedding (KSS dataset) | The sources of other style embeddings (ETOD dataset) | The samples synthesized with the combined style embedding | | |
|---|---|---|---|---|---|
| KSS + emh speaker(angry)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KSS + emh speaker(angry)'s prosodies wav | | | | | |
| KSS + emg speaker(angry)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KSS + emg speaker(angry)'s prosodies wav | | | | | |
| KSS + emb speaker(neutral)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KSS + emb speaker(neutral)'s prosodies wav | | | | | |
| KSS + ema speaker(sad)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KSS + ema speaker(sad)'s prosodies wav | | | | | |

| | The sources of speaker and emotion embeddings (KES dataset) | The sources of other style embeddings (ETOD dataset) | The samples synthesized with the combined style embedding | | |
|---|---|---|---|---|---|
| KES(disgusting) + emh speaker(angry)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KES(disgusting) + emh speaker(angry)'s prosodies wav | | | | | |
| KES(surprise) + emb speaker(sad)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KES(surprise) + emb speaker(sad)'s prosodies wav | | | | | |
| KES(fear) + emf speaker(happy)'s prosodies mel | ![]() | ![]() | ![]() | ![]() | ![]() |
| KES(fear) + emf speaker(happy)'s prosodies wav | | | | | |

| | Reference sample (GT mel) | Separate embedding | Gradient reversal | UniTTS (w/o aug) | |
|---|---|---|---|---|---|
| neo - angry | | | | | |
| emb - sad | | | | | |
| ema - happy | | | | | |
| emh - angry | | | | | |
| nea - angry | | | | | |
| unseen* | | | | | |

| | Reference sample (GT mel) | UniTTS learned from scratch | UniTTS (w/o aug) | UniTTS |
|---|---|---|---|---|
| ema - happy | | | | |
| ema - angry | | | | |
| emg - happy | | | | |
| ned - sad | | | | |