k uh l”). This is done by special grapheme-to-phoneme algorithms. For languages like Spanish, some relatively straightforward rules can be applied. But for others, like English, pronunciation differs significantly from the written form. Statistical methods are then employed along with databases for known words. After that, additional post-lexical processing is needed, because the pronunciation of words can change when combined in a sentence.
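To make the idea concrete, here's a minimal C# sketch of the lookup-plus-rules approach: consult a pronunciation database first, and fall back to naive letter rules for unknown words. The lexicon entries and the fallback rule are hypothetical illustrations, not the output of a real grapheme-to-phoneme engine:

  using System;
  using System.Collections.Generic;

  class NaiveG2P
  {
    // Hypothetical lexicon: irregular words and their phoneme strings.
    static readonly Dictionary<string, string> Lexicon =
      new Dictionary<string, string>
      {
        { "circle", "s er k uh l" },
        { "one",    "w ah n" }
      };

    static string ToPhonemes(string word)
    {
      string known;
      if (Lexicon.TryGetValue(word, out known))
        return known;  // Database hit for a known word.
      // Fallback: naive one-letter-one-sound rule. Real systems use
      // statistical models trained on pronunciation dictionaries here.
      return string.Join(" ", word.ToLowerInvariant().ToCharArray());
    }

    static void Main()
    {
      Console.WriteLine(ToPhonemes("circle"));  // s er k uh l
      Console.WriteLine(ToPhonemes("cat"));     // c a t
    }
  }

A real engine would then run post-lexical rules over the whole phoneme sequence, because, as noted, pronunciation changes when words are combined in a sentence.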
While parsers try to extract all possible information from the text, there's something so elusive that it can't be extracted: prosody, or intonation. While speaking, we use prosody to emphasize certain words, to convey emotion, and to distinguish affirmative sentences, commands and questions. But written text doesn't have symbols to indicate prosody. Sure, punctuation offers some context: A comma means a slight pause, a period a longer one, and a question mark means you raise your intonation toward the end of a sentence. But if you've ever read your children a bedtime story, you know how far these rules are from real reading.
Moreover, two different people often read the same text differently (ask your children who is better at reading bedtime stories, you or your spouse). Because of this, you can't reliably apply statistical methods, as different experts will produce different labels for supervised learning. The problem is complex and, despite intensive research, far from solved. The best programmers can do is use SSML, which offers a few tags for controlling prosody.
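As a small illustration, here's how prosody hints look in practice with the System.Speech API discussed earlier in the article; the SpeakSsml method accepts SSML directly. The specific tag values below are just examples:

  using System.Speech.Synthesis;

  class SsmlDemo
  {
    static void Main()
    {
      // <emphasis> stresses a word; <prosody> adjusts rate and pitch.
      string ssml =
        "<speak version=\"1.0\" " +
        "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">" +
        "Are you <emphasis level=\"strong\">sure</emphasis>? " +
        "<prosody rate=\"slow\" pitch=\"low\">Once upon a time...</prosody>" +
        "</speak>";

      using (var synth = new SpeechSynthesizer())
      {
        synth.SpeakSsml(ssml);  // Plays synchronously on the default device.
      }
    }
  }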
Speech Generation
Now that we have the tree with metadata, we turn to speech generation. Early TTS systems tried to synthesize signals by combining sinusoids. Another interesting approach was constructing a system of differential equations describing the human vocal tract as several connected tubes of different diameters and lengths. Such solutions are very compact but, unfortunately, sound quite mechanical. So, as with musical synthesizers, the focus gradually shifted to solutions based on samples, which require significant space but sound essentially natural.
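To get a feel for what “combining sinusoids” means, here's a toy C# sketch that sums a fundamental and two formant-like overtones into a PCM buffer. The frequencies and amplitudes are arbitrary placeholders, not real formant data, and there's no vocal-tract modeling; it merely hints at why such output sounds mechanical:

  using System;

  class SinusoidSketch
  {
    static short[] Synthesize(double seconds, int sampleRate = 16000)
    {
      // Hypothetical fundamental plus two formant-like overtones.
      double[] freqs = { 120, 700, 1200 };
      double[] amps  = { 0.5, 0.3, 0.2 };
      var buffer = new short[(int)(seconds * sampleRate)];
      for (int i = 0; i < buffer.Length; i++)
      {
        double t = (double)i / sampleRate, sample = 0;
        for (int k = 0; k < freqs.Length; k++)
          sample += amps[k] * Math.Sin(2 * Math.PI * freqs[k] * t);
        buffer[i] = (short)(sample * short.MaxValue * 0.8);  // Leave headroom.
      }
      return buffer;
    }

    static void Main() =>
      Console.WriteLine(Synthesize(0.5).Length + " samples");
  }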
To build such a system, you have to have many hours of high-quality recordings of a professional actor reading specially constructed text. This text is split into units, labeled and stored in a database. Speech generation becomes a task of selecting proper units and gluing them together.
Because you're not synthesizing speech, you can't significantly adjust parameters at runtime. If you need both male and female voices, or must provide regional accents (say, Scottish or Irish), they have to be recorded separately. The text must be constructed to cover all the sound units you'll need, and the actors must read in a neutral tone to make concatenation easier.
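With the .NET API covered earlier, this limitation is visible directly: you don't tweak one voice at runtime, you pick among separately recorded, installed voices. A brief sketch using the System.Speech voice-hint mechanism:

  using System;
  using System.Speech.Synthesis;

  class VoiceSelection
  {
    static void Main()
    {
      using (var synth = new SpeechSynthesizer())
      {
        // Each installed voice is a separate recorded database.
        foreach (var voice in synth.GetInstalledVoices())
          Console.WriteLine(voice.VoiceInfo.Name);

        // Hints can only choose among voices already installed.
        synth.SelectVoiceByHints(VoiceGender.Female, VoiceAge.Adult);
        synth.Speak("This voice was recorded separately from the male one.");
      }
    }
  }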
Splitting and labeling are also non-trivial tasks. This used to be done manually, taking weeks of tedious work; thankfully, machine learning is now being applied to it.
Unit size is probably the most important parameter for a TTS system. Obviously, by using whole sentences, we could make the most natural sounds, even with correct prosody, but recording and storing that much data is impossible. Can we split it into words? Probably, but how long will it take for an actor to read an entire dictionary? And what database size limitations are we facing? On the other hand, we can't just record the alphabet; that's sufficient only for a spelling bee contest. So units are usually selected as two- or three-letter groups. They're not necessarily syllables, as groups spanning syllable borders can be glued together much better.
Now for the last step. Given a database of speech units, we need to deal with concatenation. Alas, no matter how neutral the intonation was in the original recording, connecting units still requires adjustments to avoid jumps in volume, frequency and phase. This is done with digital signal processing (DSP). It can also be used to add some intonation to phrases, like raising or lowering the generated voice for assertions or questions.
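Here's a minimal sketch of one such adjustment: a linear crossfade over a short overlap between two units, which avoids amplitude jumps at the seam. Real engines also align frequency and phase; this illustrates only the volume-smoothing part:

  using System;

  class Concatenator
  {
    // Joins two unit buffers, fading the first out and the second in
    // across an overlap of the given number of samples.
    static float[] Crossfade(float[] a, float[] b, int overlap)
    {
      var result = new float[a.Length + b.Length - overlap];
      Array.Copy(a, result, a.Length - overlap);
      for (int i = 0; i < overlap; i++)
      {
        float t = (float)i / overlap;  // Ramps from 0 to 1.
        result[a.Length - overlap + i] =
          a[a.Length - overlap + i] * (1 - t) + b[i] * t;
      }
      Array.Copy(b, overlap, result, a.Length, b.Length - overlap);
      return result;
    }

    static void Main()
    {
      var left  = new float[] { 0f, 0.2f, 0.4f, 0.6f };
      var right = new float[] { 0.6f, 0.4f, 0.2f, 0f };
      Console.WriteLine(string.Join(", ", Crossfade(left, right, 2)));
    }
  }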
Wrapping Up
In this article I covered only the .NET API. Other platforms provide similar functionality. macOS has NSSpeechSynthesizer in Cocoa with comparable features, and most Linux distributions include the eSpeak engine. All of these APIs are accessible through native code, so you have to use a language such as C#, C++ or Swift. For cross-platform ecosystems like Python, there are bridges like Pyttsx, but they usually have certain limitations.
Cloud vendors, on the other hand, target wide audiences and offer services for most popular languages and platforms. While functionality is comparable across vendors, support for SSML tags can differ, so check the documentation before choosing a solution.
Microsoft offers a Text-to-Speech service as part of Cognitive Services (bit.ly/2XWorku). It not only gives you 75 voices in 45 languages, but also allows you to create your own voices. For that, the service needs audio files with a corresponding transcript. You can write your text first, then have someone read it, or take an existing recording and write its transcript. After uploading these datasets to Azure, a machine learning algorithm trains a model for your own unique “voice font.” A good step-by-step guide can be found at bit.ly/2VE8th4.
A very convenient way to access Cognitive Speech Services is by using the Speech Software Development Kit (bit.ly/2DDTh9I). It supports both speech recognition and speech synthesis, and is available for all major desktop and mobile platforms and most popular languages. It’s well documented and there are numerous code samples on GitHub.
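For a quick taste, here's a minimal sketch of cloud synthesis with the SDK's C# API (the Microsoft.CognitiveServices.Speech NuGet package); the subscription key and region are placeholders you must replace with your own:

  using System;
  using System.Threading.Tasks;
  using Microsoft.CognitiveServices.Speech;

  class CloudTts
  {
    static async Task Main()
    {
      var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
      using (var synthesizer = new SpeechSynthesizer(config))
      {
        // Sends the text to the service and plays the returned audio.
        var result = await synthesizer.SpeakTextAsync("Hello from the cloud!");
        Console.WriteLine(result.Reason);
      }
    }
  }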
TTS continues to be a tremendous help to people with special needs. For example, check out linka.su, a Web site created by a talented programmer with cerebral palsy to help people with speech and musculoskeletal disorders, autism, or those recovering from a stroke. Knowing from personal experience what limitations they face, the author created a range of applications for people who can't type on a regular keyboard, can only select one letter at a time, or can just touch a picture on a tablet. Thanks to TTS, he literally gives a voice to those who don't have one. I wish that we all, as programmers, could be that useful to others.
Ilia Smirnov has more than 20 years of experience developing enterprise applications on major platforms, primarily in Java and .NET. For the last decade, he has specialized in the simulation of financial risks. He holds three master's degrees, the FRM designation and other professional certifications.
Thanks to the following Microsoft technical expert for reviewing this article: Sheng Zhao