Abstract

Unsupervised text-to-speech (TTS) aims to train TTS models for a specific language without any paired speech-text training data in that language. Existing methods either train on speech paired with pseudo text generated by an unsupervised automatic speech recognition (ASR) model, or employ the back-translation technique. Though effective, they suffer from low robustness to low-quality data and depend heavily on a lexicon of the language, which is sometimes unavailable; this makes convergence difficult, especially in low-resource language scenarios. In this work, we introduce a bag of tricks to enable effective unsupervised TTS. Specifically, 1) we carefully design a voice conversion model to normalize the variable and noisy information in the low-quality speech data while preserving the pronunciation information; 2) we employ a non-autoregressive TTS model to overcome the robustness issue; and 3) we explore several tricks applied in back-translation, including curriculum learning, length augmentation, and an auxiliary supervised loss, to stabilize back-translation and improve its effectiveness. Experiments demonstrate that our method achieves better intelligibility and audio quality than all previous methods, and that these tricks are essential to the performance gain.

Main Performance (Tab. 2, English)

All audio samples are converted to LJSpeech’s voice.

  1. the worst, which perhaps was the english, was a terrible falling off from the work of the earlier presses ;
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  2. neild gives some figures which well illustrate this.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  3. the squalor and uncleanness of the debtors side was intensified by constant overcrowding.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  4. the most trifling acts were magnified into offenses.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  5. no prisoners should in future be ironed, except in cases of misconduct,
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  6. it was frequently stated in evidence that the jail of the borough was in so unfit a state for the reception of prisoners,
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  7. the first named had long been an active philanthropist, devoting himself more particularly to the reformation of juvenile criminals.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  8. numbers of ladies were present, although the public feeling was much against their attendance.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)

Main Performance (Tab. 2, Indonesian)

All audio samples are converted to LJSpeech’s voice.

  1. Terdakwa korupsi itu dituntut hukuman seumur hidup.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  2. Empat kata per-baris.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  3. Ini adalah sushi yang enak.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  4. Ini adalah selai buatan rumah.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  5. Kamu bodoh.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  6. dua
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  7. Uang kami habis.
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)
  8. Berapa kali sehari kamu bercermin?
    Supervised Ours Ren et al. (2019b) Xu et al. (2020) Liu et al. (2022b) & Ni et al. (2022)

Ablation Study (Tab. 3)

All audio samples are converted to LJSpeech’s voice.

  1. the worst, which perhaps was the english, was a terrible falling off from the work of the earlier presses ;
    Ours w/o. Norm w/o. CL w/o. BT w/o. Aug
    w/o. Aux w/o. NAR
  2. neild gives some figures which well illustrate this.
    Ours w/o. Norm w/o. CL w/o. BT w/o. Aug
    w/o. Aux w/o. NAR
  3. the squalor and uncleanness of the debtors side was intensified by constant overcrowding.
    Ours w/o. Norm w/o. CL w/o. BT w/o. Aug
    w/o. Aux w/o. NAR
  4. the most trifling acts were magnified into offenses.
    Ours w/o. Norm w/o. CL w/o. BT w/o. Aug
    w/o. Aux w/o. NAR
  5. no prisoners should in future be ironed, except in cases of misconduct,
    Ours w/o. Norm w/o. CL w/o. BT w/o. Aug
    w/o. Aux w/o. NAR
  6. it was frequently stated in evidence that the jail of the borough was in so unfit a state for the reception of prisoners,
    Ours w/o. Norm w/o. CL w/o. BT w/o. Aug
    w/o. Aux w/o. NAR
  7. the first named had long been an active philanthropist, devoting himself more particularly to the reformation of juvenile criminals.
    Ours w/o. Norm w/o. CL w/o. BT w/o. Aug
    w/o. Aux w/o. NAR
  8. numbers of ladies were present, although the public feeling was much against their attendance.
    Ours w/o. Norm w/o. CL w/o. BT w/o. Aug
    w/o. Aux w/o. NAR

Analyses on VC (Tab. 4)

All audio samples are converted to LJSpeech’s voice.

  1. the worst, which perhaps was the english, was a terrible falling off from the work of the earlier presses ;
    Ours w/o. Var. Enc. Chn=8 Chn=32 Chn=128
  2. neild gives some figures which well illustrate this.
    Ours w/o. Var. Enc. Chn=8 Chn=32 Chn=128
  3. the squalor and uncleanness of the debtors side was intensified by constant overcrowding.
    Ours w/o. Var. Enc. Chn=8 Chn=32 Chn=128
  4. the most trifling acts were magnified into offenses.
    Ours w/o. Var. Enc. Chn=8 Chn=32 Chn=128
  5. no prisoners should in future be ironed, except in cases of misconduct,
    Ours w/o. Var. Enc. Chn=8 Chn=32 Chn=128
  6. it was frequently stated in evidence that the jail of the borough was in so unfit a state for the reception of prisoners,
    Ours w/o. Var. Enc. Chn=8 Chn=32 Chn=128
  7. the first named had long been an active philanthropist, devoting himself more particularly to the reformation of juvenile criminals.
    Ours w/o. Var. Enc. Chn=8 Chn=32 Chn=128
  8. numbers of ladies were present, although the public feeling was much against their attendance.
    Ours w/o. Var. Enc. Chn=8 Chn=32 Chn=128

Voice Conversion Audio Samples

English

  1. perosi hailed from an extremely musical and religious family
    Before VC After VC
  2. the farmer works the soil and produces grain
    Before VC After VC
  3. double seaming uses rollers to shape the can, lid and the final double seam
    Before VC After VC

French

  1. au hameau de vaux fevroux se trouvent des maisons typiques de la région
    Before VC After VC
  2. ils sont représentés sur les temples hindous à partir de l’époque médiévale
    Before VC After VC
  3. d’autres voitures anciennes peuvent être aperçues durant le film
    Before VC After VC

Indonesian

  1. pak tanaka tinggal di rumah besar
    Before VC After VC
  2. terdakwa korupsi itu dituntut hukuman seumur hidup
    Before VC After VC
  3. dia masih di bawah umur
    Before VC After VC

Using Other Rich-Resource Languages (Tab. 5 in Appendix)

All audio samples are converted to LJSpeech’s voice.

  1. the worst, which perhaps was the english, was a terrible falling off from the work of the earlier presses ;
    All Languages French German Dutch Spanish
    Portuguese
  2. neild gives some figures which well illustrate this.
    All Languages French German Dutch Spanish
    Portuguese
  3. the squalor and uncleanness of the debtors side was intensified by constant overcrowding.
    All Languages French German Dutch Spanish
    Portuguese
  4. the most trifling acts were magnified into offenses.
    All Languages French German Dutch Spanish
    Portuguese
  5. no prisoners should in future be ironed, except in cases of misconduct,
    All Languages French German Dutch Spanish
    Portuguese
  6. it was frequently stated in evidence that the jail of the borough was in so unfit a state for the reception of prisoners,
    All Languages French German Dutch Spanish
    Portuguese
  7. the first named had long been an active philanthropist, devoting himself more particularly to the reformation of juvenile criminals.
    All Languages French German Dutch Spanish
    Portuguese
  8. numbers of ladies were present, although the public feeling was much against their attendance.
    All Languages French German Dutch Spanish
    Portuguese