There will be error handling and so on, but the point here is to illustrate the core functionality.
Deconstructing Speech
So far we’ve achieved our objective with a surprisingly small codebase. Let’s take the opportunity to look under the hood and better understand how TTS engines work.
There are many approaches to constructing a TTS system. Historically, researchers have tried to discover a set of pronunciation rules on which to build algorithms. If you’ve ever studied a foreign language, you’re familiar with rules like “Letter ‘c’ before ‘e,’ ‘i,’ ‘y’ is pronounced as ‘s’ as in ‘city,’ but before ‘a,’ ‘o,’ ‘u’ as ‘k’ as in ‘cat.’” Alas, there are so many exceptions and special cases (like pronunciation changes in consecutive words) that constructing a comprehensive set of rules is difficult. Moreover, most such systems tend to produce a distinct “machine” voice; imagine a beginner in a foreign language pronouncing a word letter by letter.
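To see why the rule-based approach becomes unwieldy, consider what even that single rule looks like in code. The sketch below is purely illustrative (the helper name and its simplistic signature are mine, not part of any TTS library), and it already ignores exceptions such as “cello” or “ocean,” each of which would demand yet another special case:

// A single letter-to-sound rule: 'c' sounds like /s/ before e, i, y, and like /k/ otherwise.
// Hypothetical helper for illustration only; real rule sets need hundreds of rules
// plus exception dictionaries.
static char PronounceC(string word, int index)
{
  char next = index + 1 < word.Length ? char.ToLowerInvariant(word[index + 1]) : '\0';
  return (next == 'e' || next == 'i' || next == 'y') ? 's' : 'k';
}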
For more natural-sounding speech, research has shifted toward systems based on large databases of recorded speech fragments, and these engines now dominate the market. Commonly known as concatenative unit-selection TTS, these engines select speech samples (units) that match the input text and concatenate them into phrases. Usually, engines use two-stage processing closely resembling compilers: first, parse the input into an internal list- or tree-like structure with phonetic transcription and additional metadata; then, synthesize sound based on this structure.

Figure 4 The XAML Code
Because we’re dealing with natural languages, these parsers are more sophisticated than those for programming languages. Beyond tokenization (finding the boundaries of sentences and words), the parser must correct typos, identify parts of speech, analyze punctuation, and decode abbreviations, contractions and special symbols. Parser output is typically split into phrases or sentences and formed into collections of word descriptions, each carrying metadata such as part of speech, pronunciation, stress and so on.
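One way to picture this intermediate representation is sketched below. The types are hypothetical (no specific engine exposes exactly this API); they simply show the kind of per-word metadata the first stage produces for the second stage to consume:

using System.Collections.Generic;

// Illustrative front-end output: one annotated record per word, grouped by sentence.
public enum PartOfSpeech { Noun, Verb, Adjective, Preposition, Other }

public sealed class WordUnit
{
  public string Text { get; set; }          // Original token, e.g. "project"
  public PartOfSpeech Pos { get; set; }     // Disambiguated part of speech
  public string[] Phonemes { get; set; }    // Phonetic transcription
  public int StressedSyllable { get; set; } // Which syllable carries the stress
}

public sealed class Sentence
{
  public List<WordUnit> Words { get; } = new List<WordUnit>();
}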
Parsers are responsible for resolving ambiguities in the input. For example, what is “Dr.”? Is it “doctor” as in “Dr. Smith,” or “drive” as in “Privet Drive”? And is “Dr.” a sentence of its own because it starts with an uppercase letter and ends with a period? Is “project” a noun or a verb? This matters because the stress falls on a different syllable in each case.
These questions are not always easy to answer and many TTS systems have separate parsers for specific domains: numerals, dates, abbreviations, acronyms, geographic names, special forms of text like URLs and so on. They’re also language- and region-specific. Luckily, such problems have been studied for a long time and we have well-developed frameworks and libraries to lean on.
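As a rough illustration of such domain-specific handling, an abbreviation expander might resolve “Dr.” by looking at neighboring tokens. This is a toy sketch of my own, not code from any library; real normalizers rely on far richer context and dictionaries:

// Toy normalizer for the "Dr." ambiguity: "Dr. Smith" -> "doctor", "Privet Dr." -> "drive".
static string ExpandDr(string previousToken, string nextToken)
{
  bool capitalizedWordFollows = !string.IsNullOrEmpty(nextToken) && char.IsUpper(nextToken[0]);
  bool capitalizedWordBefore = !string.IsNullOrEmpty(previousToken) && char.IsUpper(previousToken[0]);

  if (capitalizedWordFollows) return "doctor"; // "Dr. Smith"
  if (capitalizedWordBefore) return "drive";   // "Privet Dr."
  return "doctor";                             // Fall back to the more common reading
}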
The next step is generating pronunciation forms, such as tagging the tree with sound symbols (like transforming “school” into a sequence of phoneme symbols).
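The simplest form of this grapheme-to-phoneme step is a pronunciation-dictionary lookup with a rule-based fallback for out-of-vocabulary words. Here’s a minimal sketch; the dictionary entries use ARPAbet-style symbols and the class and method names are hypothetical:

using System;
using System.Collections.Generic;

static class Phonetizer
{
  // Tiny pronunciation lexicon; real systems ship dictionaries with hundreds of thousands of entries.
  static readonly Dictionary<string, string[]> Lexicon =
    new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
    {
      ["school"] = new[] { "S", "K", "UW", "L" },
      ["city"]   = new[] { "S", "IH", "T", "IY" },
      ["cat"]    = new[] { "K", "AE", "T" }
    };

  public static string[] ToPhonemes(string word)
  {
    if (Lexicon.TryGetValue(word, out var phonemes))
      return phonemes;

    // Out-of-vocabulary word: fall back to letter-to-sound rules
    // (or, in modern systems, a trained grapheme-to-phoneme model).
    return ApplyLetterToSoundRules(word);
  }

  static string[] ApplyLetterToSoundRules(string word) =>
    throw new NotImplementedException("Placeholder for rule-based or learned G2P.");
}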
For reference, here is the complete code-behind for the sample application:

using System.Collections.Generic;
using System.Globalization;
using System.Speech.Synthesis;
using System.Windows;

namespace GuiTTS
{
  public partial class MainWindow : Window
  {
    private const string en = "en-US";
    private const string ru = "ru-RU";
    private readonly IDictionary<string, string> _messagesByCulture =
      new Dictionary<string, string>();

    public MainWindow()
    {
      InitializeComponent();
      PopulateMessages();
    }

    private void PromptInEnglish(object sender, RoutedEventArgs e) { DoPrompt(en); }

    private void PromptInRussian(object sender, RoutedEventArgs e) { DoPrompt(ru); }

    private void DoPrompt(string culture)
    {
      var synthesizer = new SpeechSynthesizer();
      synthesizer.SetOutputToDefaultAudioDevice();
      var builder = new PromptBuilder();
      builder.StartVoice(new CultureInfo(culture));
      builder.AppendText(_messagesByCulture[culture]);
      builder.EndVoice();
      synthesizer.Speak(builder);
    }

    private void PopulateMessages()
    {
      _messagesByCulture[en] =
        "For the connection flight 123 to Saint Petersburg, please, proceed to gate A1";
      // In Russian: "For the connecting flight 123 to Saint Petersburg, please proceed to gate A1"
      _messagesByCulture[ru] =
        "Для пересадки на рейс 123 в Санкт-Петербург, пожалуйста, пройдите к выходу A1";
    }
  }
}

Neural Networks in TTS
Statistical or machine learning methods have for years been applied in all stages of TTS processing. For example, Hidden Markov Models are used to create parsers producing the most likely parse, or to perform labeling for speech sample databases. Decision trees are used in unit selection or in grapheme-to-phoneme algorithms, while neural networks and deep learning have emerged at the bleeding edge of TTS research.
We can consider an audio recording as a time series of waveform samples. By creating an auto-regressive model, it’s possible to predict the next sample from the previous ones. As a result, the model generates speech-like babbling, like a baby learning to talk by imitating sounds.
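Conceptually, the generation loop is as simple as the sketch below. The predictNextSample delegate stands in for a trained network (a WaveNet-style model, for instance) and isn’t a real API; without any text conditioning, this loop is exactly what produces the “babbling”:

using System;

static class AutoregressiveDemo
{
  // Each new sample is predicted from the window of previously generated samples.
  static float[] GenerateWaveform(Func<float[], float> predictNextSample,
                                  int sampleCount, int contextSize)
  {
    var audio = new float[sampleCount];
    for (int i = contextSize; i < sampleCount; i++)
    {
      // Feed the model its own recent output as context.
      var context = new float[contextSize];
      Array.Copy(audio, i - contextSize, context, 0, contextSize);
      audio[i] = predictNextSample(context);
    }
    return audio;
  }
}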
If we further condition this model on the audio transcript, or on the pre-processing output from an existing TTS system, we get a parameterized model of speech. The model’s output describes a spectrogram, which a vocoder turns into actual waveforms. Because this process is generative and doesn’t rely on a database of recorded samples, the model has a small memory footprint and allows its parameters to be adjusted.
Because the model is trained on natural speech, the output retains all of its characteristics, including breathing, stresses and intonation (so neural networks can potentially solve the prosody problem). It’s also possible to adjust the pitch, create a completely different voice and even imitate singing.
At the time of this writing, Microsoft is offering its preview version of a neural network TTS (bit.ly/2PAYXWN). It provides four voices with enhanced quality and near-instantaneous performance.
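The preview is exposed through the Speech service in Azure Cognitive Services rather than through System.Speech. Below is a minimal sketch using the Speech SDK (the Microsoft.CognitiveServices.Speech NuGet package); the key, region and voice name are placeholders, and you should check the service documentation for the currently available neural voices:

using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

class NeuralTtsDemo
{
  static async Task Main()
  {
    // Placeholders: supply your own Speech resource key, region and neural voice name.
    var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
    config.SpeechSynthesisVoiceName = "en-US-GuyNeural";

    // By default the synthesizer plays the result through the default speaker.
    using (var synthesizer = new SpeechSynthesizer(config))
    {
      var result = await synthesizer.SpeakTextAsync(
        "For the connecting flight 123 to Saint Petersburg, please proceed to gate A1");
      Console.WriteLine($"Synthesis finished: {result.Reason}");
    }
  }
}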