eSpeak

eSpeak is a free and open-source, cross-platform, compact, software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG (Next Generation) is a continuation of the original developer's project with more feedback from native speakers.

eSpeakNG
Original author(s): Jonathan Duddington
Developer(s): Reece Dunn
Initial release: February 2006
Stable release: 1.51 / 2 April 2022
Repository: github.com/espeak-ng/espeak-ng/
Written in: C
Operating system: Linux, Windows, macOS, FreeBSD
Type: Speech synthesizer
License: GPLv3
Website: github.com/espeak-ng/espeak-ng/

Because of its small size and wide language coverage, eSpeakNG is included in the NVDA[1] open-source screen reader for Windows, as well as in Android,[2] Ubuntu[3] and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016[4] and was used by Google Translate for 27 languages in 2010;[5] 17 of these were subsequently replaced by proprietary voices.[6]

The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia.[7] Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.

History

Isotype for ESpeak.

In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers supporting British English.[8] On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007.[9] Development on Speak continued until version 1.14, when it was renamed to eSpeak.

Development of eSpeak continued from 1.16 (there was no 1.15 release)[9] with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. Version 1.24.02 was the first version of eSpeak to be version controlled using Subversion,[10] with separate source and binary downloads made available on SourceForge.[9] From version 1.27, eSpeak was released under the GPLv3 license.[11] The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS.[12] The last development release of eSpeak was 1.48.15 on 16 April 2015.[13]

eSpeak uses the Usenet scheme to represent phonemes with ASCII characters.[14]

eSpeak NG

On 25 June 2010,[15] Reece Dunn started a fork of eSpeak on GitHub using the 1.43.46 release. This started off as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.

On 4 October 2015 (6 months after the 1.48.15 release of eSpeak), this fork started diverging more significantly from the original eSpeak.[16][17]

On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington in the 8 months since the last eSpeak development release. This evolved into discussions about continuing development of eSpeak in Jonathan's absence.[18][19] The result was the creation of the espeak-ng (Next Generation) fork, using the GitHub version of eSpeak as the basis for future development.

On 11 December 2015, the espeak-ng fork was started.[20] The first release of espeak-ng was 1.49.0 on 10 September 2016,[21] containing significant code cleanup, bug fixes, and language updates.

Features

eSpeakNG can be used as a command-line program, or as a shared library.

It supports Speech Synthesis Markup Language (SSML).
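
As a minimal sketch of what SSML input can look like, the Python below assembles a small document from escaped text plus the standard SSML `<prosody>` and `<break>` elements. The helper names `ssml_phrase` and `ssml_document` are hypothetical, and which elements and attribute values eSpeakNG actually honours is an assumption to verify against its documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_phrase(text, rate=None, pause_ms=0):
    """Return one SSML fragment: escaped text, optional <prosody>, optional <break>."""
    fragment = escape(text)  # turns &, < and > into XML entities
    if rate is not None:
        fragment = f"<prosody rate={quoteattr(rate)}>{fragment}</prosody>"
    if pause_ms > 0:
        fragment += f'<break time="{int(pause_ms)}ms"/>'
    return fragment


def ssml_document(fragments, lang="en"):
    """Wrap the fragments in the <speak> root element used by SSML."""
    body = "".join(fragments)
    return f'<speak version="1.0" xml:lang={quoteattr(lang)}>{body}</speak>'


# Hypothetical example content, for illustration only.
doc = ssml_document([
    ssml_phrase("Tom & Jerry", pause_ms=500),
    ssml_phrase("spoken slowly", rate="slow"),
])
print(doc)
```

On the command line, markup of this kind is typically passed together with the `-m` option, which asks the program to interpret SSML tags rather than read them aloud.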

Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.
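
The "language+variant" naming described above can be sketched in Python as follows. This is an illustration only: `VoiceSpec`, `parse_voice` and `format_voice` are hypothetical helpers, not part of eSpeakNG, and the sketch assumes at most one variant per voice name.

```python
from typing import NamedTuple, Optional


class VoiceSpec(NamedTuple):
    language: str            # language code, e.g. "af"
    variant: Optional[str]   # voice variant, e.g. "f2", or None


def parse_voice(name: str) -> VoiceSpec:
    """Split a voice name such as "af+f2" into language and variant."""
    language, sep, variant = name.partition("+")
    if not language:
        raise ValueError(f"missing language code in {name!r}")
    if sep and not variant:
        raise ValueError(f"missing variant after '+' in {name!r}")
    return VoiceSpec(language, variant or None)


def format_voice(spec: VoiceSpec) -> str:
    """Rebuild the voice name from its parts."""
    if spec.variant is None:
        return spec.language
    return f"{spec.language}+{spec.variant}"


print(parse_voice("af+f2"))
print(format_voice(VoiceSpec("af", None)))
```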

eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet system.

Phonetic representations can be included in text input by enclosing them in double square brackets. For example: espeak-ng -v en "Hello [[w3:ld]]" will say Hello world in English.
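
The command above can also be assembled programmatically. The following Python sketch only builds the argument list and does not run the program; `phoneme_markup` and `build_command` are hypothetical helper names, and the only option used is the `-v` option shown in the example.

```python
import shlex


def phoneme_markup(phonemes: str) -> str:
    """Wrap a phoneme string in double square brackets."""
    if "]]" in phonemes:
        raise ValueError("phoneme string must not contain ']]'")
    return f"[[{phonemes}]]"


def build_command(text: str, voice: str = "en") -> list:
    """Return the argument list for: espeak-ng -v VOICE TEXT."""
    return ["espeak-ng", "-v", voice, text]


cmd = build_command("Hello " + phoneme_markup("w3:ld"))
print(shlex.join(cmd))  # shell-quoted form of the same command
```

With eSpeakNG installed, the list could be passed to `subprocess.run(cmd, check=True)` to speak the text.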

Synthesis method

eSpeakNG can be used as a text-to-speech translator in different ways, depending on which step of the text-to-speech translation the user wants to use.

Step 1: text-to-phoneme translation

There are many languages (notably English) which do not have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.

  1. Input text is translated into pronunciation phonemes (e.g. the input text xerox is translated into zi@r0ks).
  2. The pronunciation phonemes are synthesized into sound (e.g. zi@r0ks is voiced in a monotone).

To add intonation to speech, prosody data are necessary (e.g. syllable stress, falling or rising pitch of the fundamental frequency, pauses), along with other information that allows more human-sounding, non-monotonous speech to be synthesized. For example, in the eSpeakNG format a stressed syllable is marked with an apostrophe: z'i@r0ks, which produces more natural speech with intonation.

For comparison, two samples with and without prosody data:

  1. [[DIs Iz m0noUntoUn spi:tS]] is spoken in a monotone.
  2. [[DIs Iz 'Int@n,eItI2d sp'i:tS]] is spoken with intonation.
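
To make the role of the stress marks concrete, the Python sketch below counts and strips them from a phoneme string. It assumes that the apostrophe marks primary stress (as described above) and that the comma marks secondary stress; the latter is an assumption about the notation, and `count_stress` and `strip_stress` are hypothetical helpers.

```python
# Assumed meaning of the two marks that appear in the samples above.
STRESS_MARKS = {"'": "primary", ",": "secondary"}


def count_stress(phonemes: str) -> dict:
    """Count each kind of stress mark in a phoneme string."""
    counts = {name: 0 for name in STRESS_MARKS.values()}
    for ch in phonemes:
        if ch in STRESS_MARKS:
            counts[STRESS_MARKS[ch]] += 1
    return counts


def strip_stress(phonemes: str) -> str:
    """Remove stress marks, leaving only the phoneme symbols."""
    return "".join(ch for ch in phonemes if ch not in STRESS_MARKS)


sample = "DIs Iz 'Int@n,eItI2d sp'i:tS"
print(count_stress(sample))
print(strip_stress(sample))
```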

If eSpeakNG is used only to generate prosody data, that data can be used as input for MBROLA diphone voices.

Step 2: sound synthesis from prosody data

eSpeakNG provides two types of formant speech synthesis, each using a different approach: its own eSpeakNG synthesizer and a Klatt synthesizer:[22]

  1. The eSpeakNG synthesizer creates voiced speech sounds, such as vowels and sonorant consonants, by additive synthesis: sine waves are added together to make the total sound. Unvoiced consonants such as /s/ are made by playing recorded sounds,[23] because they are rich in harmonics, which makes additive synthesis less effective. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded sample of unvoiced sound.
  2. The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer, but it also produces sounds by subtractive synthesis: it starts with generated noise, which is rich in harmonics, and then applies digital filters and enveloping to shape the frequency spectrum and sound envelope needed for a particular consonant (s, t, k) or sonorant (l, m, n) sound.
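
The two approaches can be contrasted with a small, self-contained Python sketch. It is not eSpeakNG's code: the sample rate, the formant values and the bell-shaped spectral envelope are illustrative assumptions, and the resonator uses the two-pole filter coefficients commonly given for Klatt-style formant synthesizers.

```python
import math
import random

SAMPLE_RATE = 22050  # Hz; an assumed value for this sketch


def formant_gain(freq, formants):
    """Spectral envelope: one bell-shaped peak per (centre, bandwidth, amplitude)."""
    return sum(amp / (1.0 + ((freq - centre) / bandwidth) ** 2)
               for centre, bandwidth, amp in formants)


def additive_voiced(f0, formants, duration):
    """Additive synthesis: sum sine-wave harmonics of f0, weighted by the envelope."""
    harmonics = []
    k = 1
    while k * f0 < SAMPLE_RATE / 2:  # stay below the Nyquist frequency
        harmonics.append((k * f0, formant_gain(k * f0, formants)))
        k += 1
    samples = []
    for i in range(int(SAMPLE_RATE * duration)):
        t = i / SAMPLE_RATE
        samples.append(sum(g * math.sin(2 * math.pi * f * t) for f, g in harmonics))
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]  # normalise to the range -1..1


def resonator(signal, centre, bandwidth):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2 * math.pi * bandwidth * t)
    b = 2 * math.exp(-math.pi * bandwidth * t) * math.cos(2 * math.pi * centre * t)
    a = 1.0 - b - c  # gives unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def subtractive_fricative(centre, bandwidth, duration, seed=0):
    """Subtractive synthesis: start from white noise, then filter it."""
    rng = random.Random(seed)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(int(SAMPLE_RATE * duration))]
    return resonator(noise, centre, bandwidth)


# Illustrative values only: two formants for a vowel-like sound, one noise band.
vowel = additive_voiced(120.0, [(700.0, 130.0, 1.0), (1220.0, 70.0, 0.5)], 0.02)
hiss = subtractive_fricative(4000.0, 500.0, 0.02)
print(len(vowel), len(hiss))
```

The additive path builds the spectrum up from individual sine waves, while the subtractive path starts with a full spectrum and removes what the filter does not pass.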

For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours. It passes these to the MBROLA program in the PHO file format and captures the audio that MBROLA outputs. That audio is then handled by eSpeakNG.
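
As a rough illustration of the data handed to MBROLA, the Python sketch below formats phonemes and pitch points as PHO-style lines: a phoneme name, a duration in milliseconds, and optional pairs of position (percent of the duration) and pitch (Hz). The helper names and the sample phoneme names, durations and pitch values are hypothetical; real phoneme names depend on the MBROLA voice database in use.

```python
def pho_line(phoneme, duration_ms, pitch_points=()):
    """Format one line: PHONEME DURATION [POSITION% PITCH_HZ]..."""
    fields = [phoneme, str(duration_ms)]
    for position_pct, pitch_hz in pitch_points:
        if not 0 <= position_pct <= 100:
            raise ValueError("pitch point position must be within 0..100 percent")
        fields += [str(position_pct), str(pitch_hz)]
    return " ".join(fields)


def pho_document(segments):
    """Join segments into PHO text, one phoneme per line."""
    return "\n".join(pho_line(*segment) for segment in segments) + "\n"


# Hypothetical segments: a pause, then a vowel with a falling pitch contour.
segments = [
    ("_", 100),
    ("a", 120, [(0, 110), (100, 95)]),
    ("_", 100),
]
print(pho_document(segments), end="")
```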

Languages

eSpeakNG performs text-to-speech synthesis for the following languages:[24][25]

  1. Abaza
  2. Aburlin
  3. Abenaki
  4. Achinese
  5. Adyghe
  6. Afar
  7. Afrikaans[26]
  8. Albanian[27]
  9. Amharic
  10. Apache
  11. Arabela
  12. Ancient Greek
  13. Arabic[note 1]
  14. Aragonese[28]
  15. Arapaho
  16. Armenian (Eastern Armenian)
  17. Armenian (Western Armenian)
  18. Aromanian
  19. Assamese
  20. Assiniboine
  21. Avaric
  22. Awadhi
  23. Aymara
  24. Azerbaijani
  25. Bambara
  26. Bashkir
  27. Basque
  28. Basic English
  29. Belarusian
  30. Bengali
  31. Bhojpuri
  32. Bicolano
  33. Bodo
  34. Bishnupriya Manipuri
  35. Bosnian
  36. Bulgarian[28]
  37. Breton
  38. Burmese
  39. Buryat
  40. Caddo
  41. Cahuilla
  42. Cantonese[28]
  43. Carrier
  44. Catalan[28]
  45. Catawba
  46. Cayuga
  47. Cebuano
  48. Chamorro
  49. Chechen
  50. Cherokee
  51. Cheyenne
  52. Chhattisgarhi
  53. Chichewa
  54. Chickasaw
  55. Chinese (Mandarin)
  56. Chipewyan
  57. Chippewa
  58. Chitonga
  59. Chittagonian
  60. Choctaw
  61. Conestoga
  62. Corsican
  63. Croatian[28]
  64. Crow
  65. Czech
  66. Chuvash
  67. Church Slavonic
  68. Crimean Tatar
  69. Dakota
  70. Danish[28]
  71. Dari
  72. Divehi
  73. Dogri
  74. Dogrib
  75. Dutch[28]
  76. Dzongkha
  77. Edo
  78. English (American)[28]
  79. English (British)
  80. English (Caribbean)
  81. English (Lancastrian)
  82. English (Received Pronunciation)
  83. English (Scottish)
  84. English (West Midlands)
  85. Esperanto[28]
  86. Estonian[28]
  87. Ewe
  88. Eyak
  89. Finnish[28]
  90. Filipino
  91. Fon
  92. Fox
  93. French (Belgian)[28]
  94. French (Canada)
  95. French (France)
  96. French (Swiss)
  97. Frisian
  98. Gagauz
  99. Galician
  100. Garhwali
  101. Garifuna
  102. Garo
  103. Georgian[28]
  104. German[28]
  105. Greek (Modern)[28]
  106. Greenlandic
  107. Guarani
  108. Gujarati
  109. Gwichin
  110. Hadza
  111. Haida
  112. Haisla
  113. Hakka Chinese[note 3]
  114. Haitian Creole
  115. Hän
  116. Haryanvi
  117. Hausa
  118. Hawaiian
  119. Hebrew
  120. Hidatsa
  121. High Valyrian
  122. Hiligaynon
  123. Hindi[28]
  124. Hmong
  125. Ho-Chunk
  126. Hopi
  127. Hungarian[28]
  128. Hunsrik
  129. Iban
  130. Ibibio
  131. Icelandic[28]
  132. Igbo
  133. Iloko
  134. Indonesian[28]
  135. Ido
  136. Interlingua
  137. Interlingue
  138. Irish[28]
  139. Italian[28]
  140. Itelmen
  141. Japanese[note 4][29]
  142. Javanese
  143. Judaeo-Spanish
  144. Kannada[28]
  145. Kansa
  146. Kashmiri
  147. Kazakh
  148. Khakas
  149. Khmer
  150. Klingon
  151. Kʼicheʼ
  152. Kirundi
  153. Kikuyu
  154. Kinyarwanda
  155. Konkani[30]
  156. Korean
  157. Krio
  158. Kumyk
  159. Kurdish[28]
  160. Kyrgyz
  161. Quechua
  162. Ladakhi
  163. Lakota
  164. Lao
  165. Latin
  166. Latgalian
  167. Latvian[28]
  168. Lang Belta
  169. Lingua Franca Nova
  170. Lepcha
  171. Lezgi
  172. Limbu
  173. Limburgish
  174. Lingala
  175. Lithuanian
  176. Lojban[28]
  177. Luganda
  178. Luxembourgish
  179. Masai
  180. Macedonian
  181. Madurese
  182. Magahi
  183. Maithili
  184. Makassarese
  185. Malagasy
  186. Malay[28]
  187. Malayalam[28]
  188. Maltese
  189. Mandan
  190. Manipuri
  191. Māori
  192. Marathi[28]
  193. Mohawk
  194. Moldovan
  195. Mon
  196. Mongolian
  197. Nahuatl (Classical)
  198. Navajo
  199. Nepali[28]
  200. Norwegian (Bokmål)[28]
  201. Northern Sotho
  202. Novial
  203. Nogai
  204. Nomatsiguenga
  205. Nottoway
  206. Old English
  207. Odia
  208. Omaha-Ponca
  209. Oneida
  210. Onondaga
  211. Oromo
  212. Occitan
  213. Papiamento
  214. Palauan
  215. Pashto
  216. Pawnee
  217. Persian[28]
  218. Persian (Latin alphabet)[note 2]
  219. Polish[28]
  220. Portuguese (Brazilian)[28]
  221. Portuguese (Portugal)
  222. Punjabi[31]
  223. Pyash (a constructed language)
  224. Quapaw
  225. Romanian[28]
  226. Raramuri
  227. Russian[28]
  228. Russian (Latvia)
  229. Sadri
  230. Salar
  231. Samoan
  232. Sanskrit
  233. Santali
  234. Scottish Gaelic
  235. Seneca
  236. Serbian[28]
  237. Shan (Tai Yai)
  238. Sharda
  239. Sesotho
  240. Shipibo
  241. Shona
  242. Sindhi
  243. Sinhala
  244. Slovak[28]
  245. Slovenian
  246. Somali
  247. Spanish (Spain)[28]
  248. Spanish (Latin American)
  249. Spanish (United States)
  250. Stoney
  251. Sundanese
  252. Swahili[26]
  253. Swedish[28]
  254. Sylheti
  255. Tajik
  256. Tamil[28]
  257. Tatar
  258. Tetum
  259. Telugu
  260. Tibetan
  261. Tswana
  262. Thai
  263. Tuvan
  264. Tuamotuan
  265. Turkmen
  266. Turkish[28]
  267. Tatar
  268. Uyghur
  269. Ukrainian
  270. Urarina
  271. Urdu
  272. Uzbek
  273. Vietnamese (Central Vietnamese)[28]
  274. Vietnamese (Northern Vietnamese)
  275. Vietnamese (Southern Vietnamese)
  276. Volapük
  277. Wayuu
  278. Welsh
  279. Wolof
  280. Wyandot
  281. Xavante
  282. Xhosa
  283. Yiddish
  284. Yoruba
  285. Yucateco
  286. Zulu
  287. Zuni

Notes

  1. Currently, only fully diacritized Arabic is supported.
  2. Persian written using English (Latin) characters.
  3. Currently, only Pha̍k-fa-sṳ is supported.
  4. Currently, only Hiragana and Katakana are supported.

References

  1. Switch to eSpeak NG in NVDA distribution #5651
  2. eSpeak TTS for Android
  3. espeak-ng package in Ubuntu
  4. "Download voices for Immersive Reader, Read Mode, and Read Aloud".
  5. Google blog, Giving a voice to more languages on Google Translate, May 2010
  6. Google blog, Listen to us now, December 2010.
  7. eSpeak Speech Synthesizer 3. LANGUAGES
  8. http://espeak.sourceforge.net/
  9. "ESpeak: Speech synthesis - Browse /Espeak at SourceForge.net".
  10. Subversion history (revision 1)
  11. Subversion history (revision 56)
  12. "Espeak: Downloads".
  13. http://espeak.sourceforge.net/test/latest.html
  14. van Leussen, Jan-Wilem; Tromp, Maarten (26 July 2007). "Latin to Speech": 6. CiteSeerX 10.1.1.396.7811.
  15. "Build: Allow portaudio 18 and 19 to be switched easily. · rhdunn/Espeak@63daaec". GitHub.
  16. "Espeakedit: Fix argument processing for unicode argv types · rhdunn/Espeak@61522a1". GitHub.
  17. "Switch to eSpeak NG in NVDA distribution · Issue #5651 · nvaccess/Nvda". GitHub.
  18. Taking ownership of the eSpeak project and its future
  19. Vote for new main eSpeak developer
  20. Rebrand the espeak program to espeak-ng.
  21. espeak-ng 1.49.0
  22. Klatt, Dennis H. (1979). "Software for a cascade/parallel formant synthesizer" (PDF). J. Acoustical Society of America, 67(3) March 1980.
  23. List of recorded fricatives in eSpeakNG
  24. "ESpeak NG Text-to-Speech". GitHub. 13 February 2022.
  25. "ESpeak NG Text-to-Speech". GitHub. 22 October 2021.
  26. Butgereit, L., & Botha, A. (2009, May). Hadeda: The noisy way to practice spelling vocabulary using a cell phone. In The IST-Africa 2009 Conference, Kampala, Uganda.
  27. Hamiti, M., & Kastrati, R. (2014). Adapting eSpeak for converting text into speech in Albanian. International Journal of Computer Science Issues (IJCSI), 11(4), 21.
  28. Kayte, S., & Gawali, D. B. (2015). Marathi Speech Synthesis: A review. International Journal on Recent and Innovation Trends in Computing and Communication, 3(6), 3708-3711.
  29. Pronk, R. (2013). Adding Japanese language synthesis support to the eSpeak system. University of Amsterdam.
  30. Mohanan, S., Salkar, S., Naik, G., Dessai, N. F., & Naik, S. (2012). Text Reader for Konkani Language. Automation and Autonomous System, 4(8), 409-414.
  31. Kaur, R., & Sharma, D. (2016). An Improved System for Converting Text into Speech for Punjabi Language using eSpeak. International Research Journal of Engineering and Technology, 3(4), 500-504.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.