Page 44 - MSDN Magazine, June 2019
SPEECH
Text-to-Speech Synthesis in .NET
Ilia Smirnov
I often fly to Finland to see my mom. Every time the plane lands in Vantaa airport, I’m surprised at how few passengers head for the airport exit. The vast majority set off for connecting flights to destinations spanning all of Central and Eastern Europe. It’s no wonder, then, that when the plane begins its descent, there’s a barrage of announcements about connecting flights. “If your destination is Tallinn, look for gate 123,” “For flight XYZ to Saint Petersburg, proceed to gate 234,” and so on. Of course, flight attendants don’t typically speak a dozen languages, so they use English, which is not the native language of most passengers. Considering the quality of the public announcement (PA) systems on the airliners, plus engine noise, crying babies and other disturbances, how can any information be effectively conveyed?
Well, each seat is equipped with headphones. Many, if not all, long-distance planes have individual screens today (and local ones have at least different audio channels). What if a passenger could choose the language for announcements and an onboard computer system allowed flight attendants to create and send dynamic (that is, not pre-recorded) voice messages? The key challenge here is the dynamic nature of the messages. It’s easy to pre-record safety instructions, catering options and so on, because they’re rarely updated. But we need to create messages literally on the fly.
Fortunately, there’s a mature technology that can help: text-to-speech synthesis (TTS). We rarely notice such systems, but they’re ubiquitous: public announcements, prompts in call centers, navigation devices, games, smart devices and other applications are all examples where pre-recorded prompts aren’t sufficient or using a digitized waveform is proscribed due to memory limitations (a text read by a TTS engine is much smaller to store than a digitized waveform).
Computer-based speech synthesis is hardly new. Telecom companies invested in TTS to overcome the limitations of pre-recorded messages, and military researchers have experimented with voice prompts and alerts to simplify complex control interfaces. Portable synthesizers have likewise been developed for people with disabilities. For an idea of what such devices were capable of 25 years ago, listen to the track “Keep Talking” on the 1994 Pink Floyd album “The Division Bell,” where Stephen Hawking says his famous line: “All we need to do is to make sure we keep talking.”
TTS APIs are often provided along with their “opposite,” speech recognition. While you need both for effective human-computer interaction, this exploration is focused specifically on speech synthesis. I’ll use the Microsoft .NET TTS API to build a prototype of an airliner PA system. I’ll also look under the hood to understand the basics of the “unit selection” approach to TTS. And while I’ll be walking through the construction of a desktop application, the principles here apply directly to cloud-based solutions.
Roll Your Own Speech System
Before prototyping the in-flight announcement system, let’s explore the API with a simple program. Start Visual Studio and create a console application. Add a reference to System.Speech and implement the method in Figure 1.
Now compile and run. Just a few lines of code and you’ve replicated the famous Hawking phrase.
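Figure 1 is not reproduced in this section. As a rough sketch of what such a method might look like, the console program below uses the SpeechSynthesizer class from the System.Speech.Synthesis namespace to speak the Hawking line through the default audio device. The namespace name KeepTalking is a hypothetical choice for illustration, and the using block is my own addition for cleanup rather than something the text prescribes.

```csharp
using System.Speech.Synthesis;

namespace KeepTalking // hypothetical project name, chosen for illustration
{
    class Program
    {
        static void Main()
        {
            // SpeechSynthesizer is disposable, so release it when done.
            using (var synthesizer = new SpeechSynthesizer())
            {
                // Route the audio to the default playback device.
                synthesizer.SetOutputToDefaultAudioDevice();

                // Speak is synchronous: it returns once the phrase has been spoken.
                synthesizer.Speak("All we need to do is to make sure we keep talking.");
            }
        }
    }
}
```

This sketch assumes a Windows machine with at least one TTS voice installed, because System.Speech relies on the speech engine that ships with Windows. The phrase is read with whatever voice is currently set as the system default.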
This article discusses:
• Computer-based speech synthesis
• Concatenative unit selection text-to-speech (TTS) systems
• Machine learning in TTS processing
Technologies discussed:
.NET Speech API, .NET Speech SDK, Microsoft Cognitive Services, Speech Synthesis Markup Language (SSML)