Using the Web Speech API
The Web Speech API provides two distinct areas of functionality — speech recognition, and speech synthesis (also known as text to speech, or TTS) — which open up interesting new possibilities for accessibility, and control mechanisms. This article provides an introduction to both areas, along with demos.
Speech recognition
Speech recognition involves receiving speech through a device's microphone (or an audio track), which is then checked by a speech recognition service. When a word or phrase is successfully recognized, it is returned as a result (or list of results) in the form of a text string, and further actions can then be initiated.
The Web Speech API has a main controller interface for this — SpeechRecognition
— plus several related interfaces for representing results, etc.
Generally, the device's default speech recognition system will be used for the speech recognition. Most modern OSes include such a system for issuing voice commands, for example Dictation on macOS or Copilot on Windows.
By default, using Speech Recognition on a web page usually involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline.
To improve privacy and performance, it is possible to specify that you want the speech recognition performed on-device, thereby ensuring that neither audio nor transcribed speech are sent to a third-party service for processing. We specifically cover the on-device functionality in on-device speech recognition.
Demo
To demonstrate usage of Web speech recognition, we've written a demo called Speech color changer. When the button is pressed, you can say an HTML color keyword, and the app's background color will change to that color.
To run the demo, navigate to the live demo URL in a supporting browser (such as Chrome).
HTML and CSS
The HTML and CSS for the app are trivial. We have a title, an instructions paragraph (<p>), a control <button>, and an output paragraph into which we output diagnostic messages, including the words that are recognized.
<h1>Speech color changer</h1>
<p class="hints"></p>
<button>Start recognition</button>
<p class="output"><em>...diagnostic messages</em></p>
The CSS provides basic responsive styling so that the app looks OK across devices.
JavaScript
Let's look at the JavaScript in a bit more detail.
Prefixed properties
Some browsers currently support speech recognition with prefixed properties. Therefore, at the start of our code, we include these lines to allow for both prefixed properties and unprefixed versions:
const SpeechRecognition =
window.SpeechRecognition || window.webkitSpeechRecognition;
const SpeechRecognitionEvent =
window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;
Color list
The next part of our code represents sample colors that we will print to the UI to give users an idea of what to say:
const colors = [
"aqua",
"azure",
"beige",
"bisque",
"black",
"blue",
"brown",
"chocolate",
"coral",
// …
];
Creating a speech recognition instance
The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition()
constructor.
const recognition = new SpeechRecognition();
We then set a few other properties of the recognition instance before we move on:
- SpeechRecognition.continuous: Controls whether continuous results are captured (true), or just a single result each time recognition is started (false).
- SpeechRecognition.lang: Sets the language of the recognition. Setting this is good practice, and therefore recommended.
- SpeechRecognition.interimResults: Defines whether the speech recognition system should return interim results, or just final results. Final results are good enough for this demo.
- SpeechRecognition.maxAlternatives: Sets the number of alternative potential matches that should be returned per result. This can sometimes be useful, say if a result is not completely clear and you want to display a list of alternatives for the user to choose the correct one from. It is not needed for this demo, so we just specify one (which is actually the default anyway).
recognition.continuous = false;
recognition.lang = "en-US";
recognition.interimResults = false;
recognition.maxAlternatives = 1;
Starting the speech recognition
After grabbing references to the output paragraph, the <html>
element (so we can output diagnostic messages and update the app background color later on), the instructions paragraph, and the <button>
, we implement an onclick
handler so that when it is pressed, the speech recognition service will start. This is achieved by calling SpeechRecognition.start()
. A forEach() method is also used to output colored indicators showing what colors to try saying.
const diagnostic = document.querySelector(".output");
const bg = document.querySelector("html");
const hints = document.querySelector(".hints");
const startBtn = document.querySelector("button");
let colorHTML = "";
colors.forEach(function (v, i, a) {
console.log(v, i);
colorHTML += '<span style="background-color:' + v + ';"> ' + v + " </span>";
});
hints.innerHTML =
"Press the button then say a color to change the background color of the app. Try " +
colorHTML +
".";
startBtn.onclick = function () {
recognition.start();
console.log("Ready to receive a color command.");
};
Receiving and handling results
Once the speech recognition is started, there are many event handlers that can be used to retrieve results and other pieces of surrounding information (see the SpeechRecognition
events.) The most common one you'll probably use is the result
event, which is fired once a successful result is received:
recognition.onresult = (event) => {
const color = event.results[0][0].transcript;
diagnostic.textContent = `Result received: ${color}.`;
bg.style.backgroundColor = color;
console.log(`Confidence: ${event.results[0][0].confidence}`);
};
The second line here is a bit complex, so let's explain it step by step:
- The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0.
- Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays — the second [0] therefore returns the SpeechRecognitionAlternative at position 0.
- We then grab its transcript property to get a string containing the individual recognized result, set the background color to that color, and report the recognized color as a diagnostic message in the UI (see the sketch after this list for a more verbose version of this lookup).
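That sketch is purely illustrative and not part of the demo code. It performs the same lookup as above as a standalone handler; if maxAlternatives were set higher than 1, each result would contain several alternatives that you could loop over in the same way:
recognition.onresult = (event) => {
  const firstResult = event.results[0]; // a SpeechRecognitionResult
  // Loop over the alternatives it contains (only one in this demo,
  // because maxAlternatives is set to 1)
  for (let i = 0; i < firstResult.length; i++) {
    const alternative = firstResult[i]; // a SpeechRecognitionAlternative
    console.log(alternative.transcript, alternative.confidence);
  }
};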
We also use the speechend
event to stop the speech recognition service from running (using SpeechRecognition.stop()
) once a single word has been recognized and it has finished being spoken:
recognition.onspeechend = () => {
recognition.stop();
};
Handling errors and unrecognized speech
The last two handlers handle cases where the spoken term wasn't recognized, or an error occurred with the recognition. The nomatch
event is supposed to handle the first case mentioned, although note that in most cases the recognition engine will return something, even if it is unintelligible:
recognition.onnomatch = (event) => {
diagnostic.textContent = "I didn't recognize that color.";
};
The error
event handles cases where there is an actual error with the recognition — the SpeechRecognitionErrorEvent.error
property contains the actual error returned:
recognition.onerror = (event) => {
diagnostic.textContent = `Error occurred in recognition: ${event.error}`;
};
On-device speech recognition
In many cases, the default is to perform the speech recognition using an online service. This means that an audio recording or transcription of the speech will be sent to a server for processing, and the results returned to the browser. This has a couple of problems:
- Privacy: Many users are not comfortable with their speech being sent off to a server.
- Performance: Having to send data to a server for every bit of recognition can cause a performance hit in more intensive applications, and your apps won't work offline.
To mitigate these problems, the Web Speech API provides functionality to specify that you want your speech recognition handled on-device by the browser. There is a one-time language pack download for each language you want to recognize, but after that the functionality will be available offline.
This section explains how to use on-device speech recognition.
Demo
To demonstrate usage of on-device speech recognition, we've written a demo called On-device speech color changer (run the demo live).
This demo works in a very similar fashion to the online speech color changer demo discussed earlier, with the differences noted below.
Specifying on-device recognition
To specify that you want to use the browser's own on-device processing, you need to set the SpeechRecognition.processLocally
property to true
before starting any speech recognition (the default value is false
):
recognition.processLocally = true;
Checking availability and installing language packs
For on-device speech recognition to occur, the browser needs to have a language pack installed for the language you are trying to recognize. If you run the start()
method after specifying processLocally = true
and you haven't got the correct language pack installed, it will fail with a language-not-supported
error.
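If you want to handle this case gracefully, you could check for that specific error value in the demo's existing error handler. The following is just a sketch of one possible approach, reusing the diagnostic element from earlier; it is not part of the demo code:
recognition.onerror = (event) => {
  if (event.error === "language-not-supported") {
    // The required language pack isn't installed on this device yet
    diagnostic.textContent =
      "On-device recognition isn't available for this language yet.";
  } else {
    diagnostic.textContent = `Error occurred in recognition: ${event.error}`;
  }
};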
To get the correct language pack installed, there are two steps to follow.
- You need to check whether the language pack is available on the user's computer. This is handled using the SpeechRecognition.available() static method.
- You need to install the language pack if it isn't available. This is handled using the SpeechRecognition.install() static method.
Both of the above steps are handled using the following modified click
event handler on the app's control <button>
:
startBtn.addEventListener("click", () => {
// check availability of target language
SpeechRecognition.available({ langs: ["en-US"], processLocally: true }).then(
(result) => {
if (result === "unavailable") {
diagnostic.textContent = `en-US not available to download at this time. Sorry!`;
} else if (result === "available") {
recognition.start();
console.log("Ready to receive a color command.");
} else {
diagnostic.textContent = `en-US language pack downloading`;
SpeechRecognition.install({
langs: ["en-US"],
processLocally: true,
}).then((result) => {
if (result) {
diagnostic.textContent = `en-US language pack downloaded. Try again.`;
} else {
diagnostic.textContent = `en-US language pack failed to download. Try again later.`;
}
});
}
},
);
});
The available()
method takes an options object containing two properties:
- A langs array containing the languages to check availability for.
- A processLocally boolean specifying whether to check availability of the language for on-device recognition only (true), or for either on-device or server-based recognition (false, which is the default).
When run, this method returns a Promise that resolves with an enumerated value indicating the availability of the specified languages for recognition. In our case, we test for three conditions:
- If the resulting value is unavailable, it means that the language is not available, and a suitable language pack is not available to download, so we print an appropriate message to the output.
- If the resulting value is available, it means that the language pack is available locally, so recognition can begin. In this case, we run start() and log a message to the console when the app is ready to receive speech.
- If the value is something else (downloadable or downloading), we print a diagnostic message to inform the user that a language pack download is commencing, then run the install() method to handle the download.
The install()
method works in a similar way to the available()
method, except that its options object only takes the langs
array. When run, it starts downloading all the language packs for the languages indicated in langs
and returns a Promise
that resolves with a boolean indicating whether the specified language packs were downloaded and installed successfully (true
) or not (false
).
In this basic demo, we just write a diagnostic message to indicate the success and failure cases; in a more substantial demo you'd probably want to disable the controls while the download process occurs, and enable them again when the promise resolves.
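As a rough sketch of that idea (not part of the demo code), you could disable the startBtn before calling install(), and re-enable it when the returned promise settles:
startBtn.disabled = true;
SpeechRecognition.install({ langs: ["en-US"], processLocally: true })
  .then((success) => {
    diagnostic.textContent = success
      ? "en-US language pack downloaded. Try again."
      : "en-US language pack failed to download. Try again later.";
  })
  .finally(() => {
    // Re-enable the control whether or not the download succeeded
    startBtn.disabled = false;
  });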
Permissions-policy integration
Usage of the available()
and install()
methods is controlled by the on-device-speech-recognition
Permissions-Policy
directive. Specifically, where a defined policy blocks usage, any attempts to call these methods will fail.
The default allowlist value for on-device-speech-recognition
is self
, so you don't need to worry about this unless you are attempting to use these features in embedded cross-origin documents, or specifically want to disable their usage.
Unprefixed Web Speech API
In the original speech color changer demo, we include lines to handle the issue that some browsers still support the Web Speech API with prefixed properties (see the Prefixed properties section for more details).
In the on-device version, we don't include the prefix-handling code because the new implementations that support this functionality do so without prefixes.
Speech recognition contextual biasing
There are times where a speech recognition service will fail to correctly recognize a specific word or phrase. This is most commonly the case for domain-specific terminology (for example medical or scientific terms), proper nouns, uncommon phrases, or words that are similar to other words and so may be misrecognized.
For example, during testing, we found that our on-device speech color changer had trouble recognizing the color azure
— it kept returning results like "as you". Other colors that were frequently misrecognized included khaki
("car key"), tan
, and thistle
("this all").
To mitigate such problems, the Web Speech API can provide hints to the recognition engine to highlight phrases that are more likely to be spoken and which the engine should therefore be biased towards. The end result is that these phrases are more likely to be recognized correctly.
This is done by setting an array of SpeechRecognitionPhrase
objects as the value of the SpeechRecognition.phrases
property. Each SpeechRecognitionPhrase
object contains:
- A phrase property, which is a string containing a word or phrase you want boosted in terms of the engine's bias towards it.
- A boost property, which is a floating-point number between 0.0 and 10.0 (inclusive) representing the amount of boost you want to give that phrase. Higher values make the phrase more likely to be recognized.
In our on-device speech color changer
demo, we handle this by first creating an array containing the phrases to boost and their boost values:
const phraseData = [
{ phrase: "azure", boost: 5.0 },
{ phrase: "khaki", boost: 3.0 },
{ phrase: "tan", boost: 2.0 },
];
These need to be represented as an ObservableArray
of SpeechRecognitionPhrase
objects. We can handle this by mapping the original array, converting each array element into a SpeechRecognitionPhrase
object using the SpeechRecognitionPhrase()
constructor:
const phraseObjects = phraseData.map(
(p) => new SpeechRecognitionPhrase(p.phrase, p.boost),
);
After we've created our SpeechRecognition
instance, we can then plug our contextual biasing phrases into it by setting the phraseObjects
array as the value of the SpeechRecognition.phrases
property:
recognition.phrases = phraseObjects;
The phrases array can be modified just like a normal JavaScript array, for example by pushing new phrases to it dynamically:
recognition.phrases.push(new SpeechRecognitionPhrase("thistle", 5.0));
After adding in this code, we found that the problematic color keywords were recognized more accurately than before.
Speech synthesis
Speech synthesis (also known as text-to-speech, or TTS) involves taking text contained within an app and synthesizing it into speech, which is then played out of a device's speaker or audio output connection.
The Web Speech API has a main controller interface for this — SpeechSynthesis
— plus a number of closely-related interfaces for representing text to be synthesized (known as utterances), voices to be used for the utterance, etc. Again, most OSes have some kind of speech synthesis system, which will be used by the API for this task as available.
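Before looking at the full demo, it is worth noting that the simplest possible use of speech synthesis only takes a couple of lines, for example:
// Minimal example: create an utterance and hand it to the speech
// synthesis controller to be spoken with the default voice
const greeting = new SpeechSynthesisUtterance("Hello from the Web Speech API!");
window.speechSynthesis.speak(greeting);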
Demo
To show usage of Web speech synthesis, we've provided a demo called Speak easy synthesis. This includes a set of form controls for entering text to be synthesized, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter/Return to hear it spoken.
To run the demo, navigate to the live demo URL in a supporting browser.
HTML and CSS
The HTML and CSS are again pretty trivial, containing a title, some instructions for use, and a form with some basic controls. The <select>
element is initially empty, but is populated with <option>
s via JavaScript (see later on.)
<h1>Speech synthesizer</h1>
<p>
Enter some text in the input below and press return to hear it. Change voices
using the dropdown menu.
</p>
<form>
<input type="text" class="txt" />
<div>
<label for="rate">Rate</label
><input type="range" min="0.5" max="2" value="1" step="0.1" id="rate" />
<div class="rate-value">1</div>
<div class="clearfix"></div>
</div>
<div>
<label for="pitch">Pitch</label
><input type="range" min="0" max="2" value="1" step="0.1" id="pitch" />
<div class="pitch-value">1</div>
<div class="clearfix"></div>
</div>
<select></select>
</form>
JavaScript
Let's investigate the JavaScript that powers this app.
Setting variables
First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture a reference to Window.speechSynthesis
. This is the API's entry point — it returns an instance of SpeechSynthesis
, the controller interface for web speech synthesis.
const synth = window.speechSynthesis;
const inputForm = document.querySelector("form");
const inputTxt = document.querySelector(".txt");
const voiceSelect = document.querySelector("select");
const pitch = document.querySelector("#pitch");
const pitchValue = document.querySelector(".pitch-value");
const rate = document.querySelector("#rate");
const rateValue = document.querySelector(".rate-value");
let voices = [];
Populating the select element
To populate the <select>
element with the different voice options the device has available, we've written a populateVoiceList()
function. We first invoke SpeechSynthesis.getVoices()
, which returns a list of all the available voices, represented by SpeechSynthesisVoice
objects. We then loop through this list — for each voice we create an <option>
element, set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name
), the language of the voice (grabbed from SpeechSynthesisVoice.lang
), and -- DEFAULT
if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default
returns true
.)
We also create data-
attributes for each option, containing the name and language of the associated voice, so we can grab them easily later on, and then append the options as children of the select.
function populateVoiceList() {
voices = synth.getVoices();
for (const voice of voices) {
const option = document.createElement("option");
option.textContent = `${voice.name} (${voice.lang})`;
if (voice.default) {
option.textContent += " — DEFAULT";
}
option.setAttribute("data-lang", voice.lang);
option.setAttribute("data-name", voice.name);
voiceSelect.appendChild(option);
}
}
Older browsers don't support the voiceschanged event, and just return a list of voices when SpeechSynthesis.getVoices() is called. In other browsers, such as Chrome, you have to wait for the event to fire before populating the list. To allow for both cases, we run the function as shown below:
populateVoiceList();
if (speechSynthesis.onvoiceschanged !== undefined) {
speechSynthesis.onvoiceschanged = populateVoiceList;
}
Speaking the entered text
Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter/Return is pressed. We first create a new SpeechSynthesisUtterance()
instance using its constructor — this is passed the text input's value as a parameter.
Next, we need to figure out which voice to use. We use the HTMLSelectElement
selectedOptions
property to return the currently selected <option>
element. We then use this element's data-name
attribute, finding the SpeechSynthesisVoice
object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice
property.
Finally, we set the SpeechSynthesisUtterance.pitch
and SpeechSynthesisUtterance.rate
to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak()
, passing it the SpeechSynthesisUtterance
instance as a parameter.
inputForm.onsubmit = (event) => {
event.preventDefault();
const utterThis = new SpeechSynthesisUtterance(inputTxt.value);
const selectedOption =
voiceSelect.selectedOptions[0].getAttribute("data-name");
for (const voice of voices) {
if (voice.name === selectedOption) {
utterThis.voice = voice;
}
}
utterThis.pitch = pitch.value;
utterThis.rate = rate.value;
synth.speak(utterThis);
utterThis.onpause = (event) => {
const char = event.utterance.text.charAt(event.charIndex);
console.log(
`Speech paused at character ${event.charIndex} of "${event.utterance.text}", which is "${char}".`,
);
};
inputTxt.blur();
};
In the final part of the handler, we include a pause event handler to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, this logs a message reporting the character index at which the speech was paused, along with the character itself.
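The demo doesn't wire up any pause controls itself, but as a rough sketch, calling the controller's pause() and resume() methods (for example, from extra buttons you might add) is what would trigger that handler:
// Pausing fires the pause event on the utterance currently being spoken
synth.pause();
// Resuming continues the speech from where it was paused
synth.resume();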
Finally, we call blur()
on the text input. This is mainly to hide the keyboard on Firefox OS.
Updating the displayed pitch and rate values
The last part of the code updates the pitch
/rate
values displayed in the UI, each time the slider positions are moved.
pitch.onchange = () => {
pitchValue.textContent = pitch.value;
};
rate.onchange = () => {
rateValue.textContent = rate.value;
};
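Note that the change event only fires once a slider is released. If you wanted the displayed values to update live while dragging, you could listen for the input event instead; this is a small variation, not part of the original demo:
pitch.addEventListener("input", () => {
  pitchValue.textContent = pitch.value;
});
rate.addEventListener("input", () => {
  rateValue.textContent = rate.value;
});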