Chapter 5. Speech

Speech recognition has long been one of the more complex computer science problems, but years of research and recent breakthroughs with deep learning neural networks have turned it from a research problem into a set of easy-to-use services. The first successful use of deep learning in place of traditional speech recognition algorithms was funded by Microsoft Research. In 2017, a system built by Microsoft researchers outperformed not only individual human transcribers but a more accurate multiple-transcriber process at transcribing recorded phone conversations.

The Cognitive Services Speech Services are built upon these innovations: they provide a set of pretrained speech APIs that work across multiple speakers and many different languages. Add them to your code and you’re using the same engines that power Microsoft’s own services, from Skype’s real-time translation tools to PowerPoint’s live captioning.

The Speech Services include speech-to-text, text-to-speech, voice identification, and real-time translation capabilities. Combined, these features make it easy to add natural interaction to your apps and let your users communicate in whatever way they find convenient.

The services are available through the Speech SDK, the Speech Devices SDK, or REST APIs. These cloud APIs enable speech recognition and translation in just a few lines of code, making it economical to add these capabilities to applications where client-side speech processing would have been considered too expensive.

Speech to Text

It used to take hours of work and specialized equipment for a trained human to transcribe speech into text; transcribers often used stenography systems that were more like drawing gestures than normal typing. It was expensive, and even commercial services didn’t always reach high enough accuracy.

The Speech to Text tool works with real-time streamed audio data or prerecorded audio files. It’s the same underlying technology as that used in Cortana, so it’s been proven in a wide range of conditions, with many accents and in multiple languages. The list of supported languages is long and continues to grow, covering most European languages, Arabic, Thai, Chinese, and Japanese. Not all languages offer the same level of customization, however.

Speech to Text is available through a set of SDKs and REST APIs. As the service is primarily intended to be used with streamed data, it’s easiest to use the SDKs. These libraries give you direct access to audio streams, including device microphones and local audio recording files. The REST APIs are useful for quick speech commands (say, for adding speech controls to mobile apps or websites). If you’ve built custom language understanding models in LUIS, you can use these in conjunction with the Speech Services to extract the speaker’s intent, making it easier to deliver what your user is asking for.

.NET exposes all interactions with the service through the Microsoft.CognitiveServices.Speech namespace. The key base class is Recognizer, which controls the connection to the service, sends speech data, and detects start and end events. Calls to the SpeechRecognizer are asynchronous; the SDK handles the connection to your microphone and keeps recognizing audio until a preset length of silence is detected. Calls to the service can include either short speech snippets or long utterances for recognition, and there is also a continuous recognition model. The SDK returns recognized speech as a string, with error handling for failed recognitions.

Here’s a snippet of what this looks like using version 1 of the C# SDK. First, configure your Speech credentials by filling in your own subscription key and service region (e.g., "westus"):

var config = SpeechConfig.FromSubscription(
    "<Your Subscription Key>", "<Your Service Region>"
);

Next, create a SpeechRecognizer:

var recognizer = new SpeechRecognizer(config);

This calls the API to capture short utterances and convert them to text—for long-running multiutterance recognition, use StartContinuousRecognitionAsync instead:

var result = await recognizer.RecognizeOnceAsync();

Now, check to see if the speech was recognized (the other possible reasons are NoMatch and Canceled, with error information available from the cancellation details):

if (result.Reason == ResultReason.RecognizedSpeech)
{
   Console.WriteLine($"We recognized: {result.Text}");
}
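
If you switch to the continuous recognition model mentioned earlier, results are raised as events rather than returned from a single call. Here’s a minimal sketch, assuming you simply want to print each recognized phrase as it arrives:

recognizer.Recognized += (s, e) =>
{
   if (e.Result.Reason == ResultReason.RecognizedSpeech)
   {
      Console.WriteLine($"We recognized: {e.Result.Text}");
   }
};

// Recognition continues until you explicitly stop it
await recognizer.StartContinuousRecognitionAsync();
Console.WriteLine("Listening... press Enter to stop.");
Console.ReadLine();
await recognizer.StopContinuousRecognitionAsync();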

One of the difficulties with speech recognition is the many ways people speak. Speech styles, prosody, accents, and vocabulary vary considerably. Your business also likely has unique names and jargon that are not found in general dictionaries. To address these challenges, you can customize the speech models to understand accents or work with specific vocabularies. Like the Custom Vision service, these customizations build on top of the existing trained models and allow you to create use case-specific models without the burdensome data requirements of creating your own from scratch.

The places in which speech is being recorded can pose challenges too. For example, the background noise at a drive-through or the acoustics of a mall are both very different from someone speaking into their phone in a quiet room. You can add acoustic models to account for the complexities of varied environments where accurate recognition is essential: in vehicles, on the factory floor, or out in the field. Adding a custom acoustic model will be necessary if you’re building code for use in a predictably noisy environment.

To get started building your own custom models, you will need samples recorded in the same conditions in which your application will be recognizing speech. That means people talking in the environment or into the device you plan to use. You can also use this method to tune speech recognition to a single voice. This technique is useful for transcribing podcasts or other audio sources.

Data needs to be supplied as mono WAV files, sampled at 8 kHz or 16 kHz. Split the audio into 10- to 12-second chunks for the best results, starting and finishing with silence. Each file needs a unique name and should contain a single utterance: a query, a name, or a short sentence. Package the files in a single zipped folder that’s less than 2 GB, and upload that to the Custom Speech website. Each file needs to be accompanied by a transcription in the correct format: a single line of text that starts with the audio file’s name, then a tab, then the text. You will then need to walk through the process of training the model on the website.
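
As an illustration of the transcription format (the file names and sentences here are invented), each line pairs an audio file name with its text, separated by a tab:

speech01.wav	what time does the store open on saturday
speech02.wav	book a table for two at seven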

Building custom language models, either for a specific technical vocabulary or to improve recognition of accented speech, also requires labeled data. However, this data consists of a list of sentences and phrases as text rather than voice recordings. For best results, include text that uses your specific vocabulary in different sentences and contexts that cover the ways you expect the terms to be used. You can provide up to 1.5 GB of raw text data. The service’s website provides a walkthrough on how to create these custom language models.
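
For example, a language data file for a hypothetical product vocabulary (the product name here is invented) is just plain text, with one sentence or phrase per line:

schedule a service visit for the contoso flowmaster pump
order three replacement flowmaster filter cartridges
what is the warranty period on the flowmaster 200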

Text to Speech

We can’t always be looking at screens. In many cases this can be a dangerous distraction, diverting attention away from hazardous environments or expensive equipment. When people need to interact with devices in such environments, one option is to use speech synthesis, perhaps paired with speech recognition for input. Speech synthesis can also provide accessibility tooling for the visually impaired or be used to deliver information in an augmented reality tool.

The following code snippet will take text from the console input and play the resulting speech through your default audio device.

First, configure your Speech credentials by filling in your own subscription key and service region (e.g., "westus"):

var config = SpeechConfig.FromSubscription(
    "<Your Subscription Key>", "<Your Service Region>"
);

Then, create a SpeechSynthesizer that uses your device speaker:

var synthesizer = new SpeechSynthesizer(config);

Here we’re going to take the text to be spoken from console input:

string text = Console.ReadLine();

Call the API and synthesize the text to audio:

var result = await synthesizer.SpeakTextAsync(text);

Then check to see if the speech was synthesized (the main alternative reason is Canceled, with error information available from the cancellation details):

if (result.Reason == ResultReason.SynthesizingAudioCompleted)
{
   Console.WriteLine(
       $"Speech synthesized to speaker for text [{text}]"
   );
}
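
If you want to write the audio to a file instead of playing it through the default speaker, you can pass an audio configuration when you create the synthesizer. Here’s a minimal sketch, assuming the SDK’s AudioConfig class from the Microsoft.CognitiveServices.Speech.Audio namespace and an illustrative file name:

// Write synthesized audio to a WAV file rather than the default speaker
var fileOutput = AudioConfig.FromWavFileOutput("speech.wav");
var fileSynthesizer = new SpeechSynthesizer(config, fileOutput);
var fileResult = await fileSynthesizer.SpeakTextAsync(text);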

Neural and Custom Voices

The Text to Speech service converts text into synthesized speech that sounds natural and nearly human. You can pick from a set of standard and higher-quality “neural” voices, or if you want to express your brand’s personality, you can create your own voices.

Currently five neural voices are available in English, German, Italian, and Chinese. You can also choose from more than 75 standard voices in over 45 languages and locales. The standard voices are created using statistical parametric synthesis and/or concatenation synthesis techniques.

Neural text to speech is a powerful new improvement over standard speech synthesis, offering human-sounding inflection and articulation. The result is computer-generated speech that is less tiring to listen to. It’s ideal if you’re using speech to deliver long-form content, for example when narrating scenes for the visually impaired or generating audiobooks from web content. It’s also a useful tool when you’re expecting a lot of human interaction, such as for high-end chatbots or virtual assistants.

Standard speech synthesis supports many more languages, but it’s clearly artificial. You can experiment to find the right set of parameters to give it the feel you want, tuning speed, pitch, and other settings—including adding pauses.

To generate your own custom voices, known as voice fonts, you need studio recordings, preferably made by a professional voice actor, and a set of scripts to create the training data. It is possible to use public recordings, but they will require significant editing to remove filler sounds and ambient noise. The best results come when you use an expressive voice at a consistent volume, speaking rate, and pitch. Voice fonts can only be single language: either English (US), Chinese (Mainland), French, German, or Italian.

Custom voices can be configured using the service’s website. Audio files for creating a voice need to be WAV files with a sample rate of at least 16 kHz, in 16-bit PCM, bundled into a ZIP file less than 200 MB in size. Files need to be single utterances: either a single sentence or a section of any dialog you wish to construct, with a maximum length of 15 seconds. As with custom recognition, you also need a script file that ties the voice to text. You can upload multiple speech files, with free users limited to 2 GB and subscription users to 5 GB.

To turn your uploaded voice data set into a voice font, you need to set the locale and gender for the voice to match the data set. Training can take a significant amount of time, depending on the volume of data (from 30 minutes to 40 hours). Once the voice font has been trained, you can try it out from the portal.

Text sent via the REST API must use the Speech Synthesis Markup Language (SSML), which controls how voices operate. Start by setting the voice you’re using, then add text in an XML format. You can add breaks, using time in milliseconds, and change the speaking rate (defined as prosody, and increased and decreased using percentages). You can also change how a voice pronounces a word, altering the phonemes used. Other options let you control volume and pitch, and even switch between voices. Constructing speech as SSML can take time, but it gives you a wide range of options and helps deliver a more natural experience.

Here’s the SSML for an enthusiastic rather than neutral response:

<speak version="1.0"
       xmlns="https://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts"
       xml:lang="en-US">
  <voice name="en-US-JessaNeural">
    <mstts:express-as type="cheerful">
      That'd be just amazing!
    </mstts:express-as>
  </voice>
</speak>
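
If you’re using the C# SDK rather than calling the REST API directly, you can send the same markup with the SpeakSsmlAsync method on the SpeechSynthesizer created earlier. A minimal sketch, assuming the SSML above has been saved to an illustrative file named cheerful.xml:

// Load the SSML document and hand it to the synthesizer as-is
string ssml = File.ReadAllText("cheerful.xml");
var ssmlResult = await synthesizer.SpeakSsmlAsync(ssml);

if (ssmlResult.Reason == ResultReason.SynthesizingAudioCompleted)
{
   Console.WriteLine("Synthesized the cheerful response to the speaker.");
}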

Translation and Unified Speech

Real-time speech translation was one of the first deep learning speech services Microsoft showcased, with a 2012 demonstration showing an English speaker communicating with a Chinese speaker. In just a few years, those translation services have gone from research to product to service. Using neural machine translation techniques, rather than the traditional statistical approach, allows them to deliver higher-quality translations. (Not all language pairs in Azure Speech use neural translation; some still depend on the statistical operations.)

The Speech Translation tool uses a four-step process, starting with speech recognition to convert spoken words into text. The transcribed text is then passed through a TrueText engine to normalize the speech and make it more suitable for translation. Next, the text is passed through the machine translation tools and converted to the target language. Finally, the translated text is sent through the Text to Speech service to produce the final audio.

Speech Translation works in a similar fashion to the standard speech recognition tools, using a TranslationRecognizer object to work with audio data. By default it uses the local microphone, though you can configure it to use alternative audio sources. To make a translation, you set both the source and target languages, using the standard Windows language types (even if your app doesn’t run on Windows).

Translations are delivered as events, so your code needs to subscribe to the response stream. The streamed data can be displayed as text in real time, or you can use it to produce a synthesized translation, using neural speech if available. Custom Translator lets you extend the default translation models to cover industry-specific terms or language that’s essential to your business. We go into more depth on Custom Translator in the next chapter.
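
Here’s a minimal sketch of that event-driven translation flow, assuming English speech translated into German and using the SDK’s Microsoft.CognitiveServices.Speech.Translation namespace (subscription key and region as before):

var translationConfig = SpeechTranslationConfig.FromSubscription(
    "<Your Subscription Key>", "<Your Service Region>"
);
translationConfig.SpeechRecognitionLanguage = "en-US";
translationConfig.AddTargetLanguage("de");

var translator = new TranslationRecognizer(translationConfig);

// Each recognized utterance arrives as an event with its translations
translator.Recognized += (s, e) =>
{
   if (e.Result.Reason == ResultReason.TranslatedSpeech)
   {
      Console.WriteLine($"Heard: {e.Result.Text}");
      Console.WriteLine($"German: {e.Result.Translations["de"]}");
   }
};

await translator.StartContinuousRecognitionAsync();
Console.ReadLine();
await translator.StopContinuousRecognitionAsync();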

Speaker Verification and Identification

In the 1992 movie Sneakers, the villain Cosmo protects his computer systems with a speech-controlled door lock using the passphrase “My voice is my passport. Verify me.” The Speaker Recognition API promises to turn that fiction into fact. While still in preview, the service makes it easy to identify and even verify the person speaking.

Speaker verification uses the unique characteristics of a person’s voice, much like a fingerprint, to confirm their identity. Speaker identification compares a sample of audio against a set of enrolled voices, using speech patterns to determine who in the speech database is speaking.

Speaker Verification

We all speak differently; our voices have unique characteristics that make them suitable for use as biometric identifiers. Like face recognition, voice recognition is a quick and easy alternative to the traditional password, simplifying login and access. You enroll users with a spoken passphrase that they will use when they want to be verified. Once they’re enrolled with at least three samples, the passphrase can be processed to build a voice signature.

When a user speaks their passphrase for verification, it’s processed to create a voice signature, which is then compared with the stored phrase. If the phrase and voice match, the user is verified.

You need to create a profile for each user, including their language; currently only US English is supported. When the profile is created through the REST API, it returns a GUID that you use to enroll the user. While the service is in preview, you can only register a thousand profiles per subscription.

There’s a preset list of supported passphrases (which is quite short for the preview), and each user will need to choose one; passphrases don’t need to be unique to each user, though. You can retrieve the list through the API and either present one randomly or let the users pick from a list. Each user then provides three samples of themselves saying the selected phrase, recorded in mono WAV format at 16 kHz in 16-bit PCM. Each sample is uploaded separately, along with a count and the phrase being used. Once all three have been uploaded, the profile is trained, and when training is complete, it’s ready for use.

A single REST API call handles verifying a user, but you still need to handle usernames and give users a way to enter these in your application so you can retrieve the appropriate GUID from a local store. They then speak their passphrase, which is sent to the speaker verification service. Three fields are returned: the result of the verification (accept or reject), the confidence level of the verification (low, normal, or high), and the recognized passphrase.
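
As a rough sketch only, you can make that call with HttpClient from System.Net.Http; the endpoint path and response fields shown here are based on the preview documentation and may change, so treat them as assumptions and check the current Speaker Recognition reference before relying on them:

// Preview endpoint and field names are assumptions; verify against the current docs
var client = new HttpClient();
client.DefaultRequestHeaders.Add(
    "Ocp-Apim-Subscription-Key", "<Your Subscription Key>");

var profileId = "<Enrolled Profile GUID>";
var audio = new ByteArrayContent(File.ReadAllBytes("passphrase.wav"));

var response = await client.PostAsync(
    "https://<Your Service Region>.api.cognitive.microsoft.com" +
    $"/spid/v1.0/verify?verificationProfileId={profileId}",
    audio);

// The JSON body contains the result (Accept/Reject), the confidence
// (Low/Normal/High), and the recognized passphrase
Console.WriteLine(await response.Content.ReadAsStringAsync());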

If a user is recognized with high confidence, you can let them into your application. If confidence is low, you may want to ask them to use an alternative authentication method. You may also want to use a similar fallback approach if they are rejected with low confidence rather than locking them out.

Speaker Identification

Speaker identification lets you automatically identify the person speaking in an audio file from a given group of potential speakers. This functionality is useful in dictation applications and other scenarios where you need to differentiate between speakers. You can use the speaker identification service in parallel with the other Speech Services to build an annotated transcript of a conversation, something that’s hard to do using conventional speech recognition products.

Because you’re not converting speech to text, all you need are speech samples to train the identification service. Think of this as the voice equivalent of music recognition services like Shazam. The service identifies the “voice fingerprint” of an individual and then compares this fingerprint with new recordings to determine if the same person is speaking. Multiple recordings can be compared with the same voice fingerprint, or different voices extracted from the same file.

Speaker identification uses the same underlying approach as speaker verification. You first create a speaker identification profile that can then be enrolled in the service. Enrollment for identification is different from verification because you’re uploading a single file, which can be up to five minutes long. The minimum recommended length is 30 seconds, though you have the option of uploading a shorter file. Again, you need to use mono WAV files, recorded at 16 kHz in 16-bit PCM.

To identify a voice, you send a request to the Speaker Identification API with the speech you wish to identify, along with a list of up to 10 candidate profiles to compare it against. The service runs asynchronously and returns a URI that can be queried to get the status. A separate API returns the identified voice’s GUID, along with a confidence level. If no one is identified, an all-zeros dummy GUID is returned.

Both services are still in preview and are currently suited to trial applications rather than production use; you can test them now and adopt them in production once the service moves to general availability.