Using the Web Speech API
The Web Speech API provides two distinct areas of functionality — speech recognition, and speech synthesis (also known as text to speech, or TTS) — which open up interesting new possibilities for accessibility, and control mechanisms. This article provides an introduction to both areas, along with demos.
Speech recognition
Speech recognition involves receiving speech through a device's microphone (or an audio track), which is then checked by a speech recognition service. When a word or phrase is successfully recognized, it is returned as a result (or list of results) in the form of a text string, and further actions can then be initiated.
The Web Speech API has a main controller interface for this — SpeechRecognition
— plus several related interfaces for representing results, etc.
Generally, the device's default speech recognition system will be used for the speech recognition. Most modern OSes include such a system for issuing voice commands, for example Dictation on macOS or Copilot on Windows.
By default, using Speech Recognition on a web page usually involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline.
To improve privacy and performance, it is possible to specify that you want the speech recognition performed on-device, thereby ensuring that neither audio nor transcribed speech are sent to a third-party service for processing. We specifically cover the on-device functionality in on-device speech recognition.
Demo
To demonstrate usage of Web speech recognition, we've written a demo called Speech color changer. When the button is pressed, you can say an HTML color keyword, and the app's background color will change to that color.
To run the demo, navigate to the live demo URL in a supporting browser (such as Chrome).
HTML and CSS
The HTML and CSS for the app are trivial. We have a title, an instructions paragraph (<p>), a control <button>, and an output paragraph into which we output diagnostic messages, including the words that are recognized.
<h1>Speech color changer</h1>
<p class="hints"></p>
<button>Start recognition</button>
<p class="output"><em>...diagnostic messages</em></p>
The CSS provides basic responsive styling so that the app looks OK across devices.
JavaScript
Let's look at the JavaScript in a bit more detail.
Prefixed properties
Some browsers currently support speech recognition with prefixed properties. Therefore, at the start of our code, we include these lines to allow for both prefixed properties and unprefixed versions:
const SpeechRecognition =
window.SpeechRecognition || window.webkitSpeechRecognition;
const SpeechRecognitionEvent =
window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;
Color list
The next part of our code represents sample colors that we will print to the UI to give users an idea of what to say:
const colors = [
"aqua",
"azure",
"beige",
"bisque",
"black",
"blue",
"brown",
"chocolate",
"coral",
// …
];
Creating a speech recognition instance
The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition()
constructor.
const recognition = new SpeechRecognition();
We then set a few other properties of the recognition instance before we move on:
- SpeechRecognition.continuous: Controls whether continuous results are captured (true), or just a single result each time recognition is started (false).
- SpeechRecognition.lang: Sets the language of the recognition. Setting this is good practice, and therefore recommended.
- SpeechRecognition.interimResults: Defines whether the speech recognition system should return interim results, or just final results. Final results are good enough for this demo.
- SpeechRecognition.maxAlternatives: Sets the number of alternative potential matches that should be returned per result. This can sometimes be useful, say if a result is not completely clear and you want to display a list of alternatives for the user to choose the correct one from. It is not needed for this demo, so we just specify one (which is actually the default anyway).
recognition.continuous = false;
recognition.lang = "en-US";
recognition.interimResults = false;
recognition.maxAlternatives = 1;
Starting the speech recognition
After grabbing references to the output paragraph, the <html>
element (so we can output diagnostic messages and update the app background color later on), the instructions paragraph, and the <button>
, we implement an onclick
handler so that when it is pressed, the speech recognition service will start. This is achieved by calling SpeechRecognition.start()
. A forEach() method is also used to output colored indicators showing what colors to try saying.
const diagnostic = document.querySelector(".output");
const bg = document.querySelector("html");
const hints = document.querySelector(".hints");
const startBtn = document.querySelector("button");
let colorHTML = "";
colors.forEach(function (v, i, a) {
console.log(v, i);
colorHTML += '<span style="background-color:' + v + ';"> ' + v + " </span>";
});
hints.innerHTML =
"Press the button then say a color to change the background color of the app. Try " +
colorHTML +
".";
startBtn.onclick = function () {
recognition.start();
console.log("Ready to receive a color command.");
};
Receiving and handling results
Once the speech recognition is started, there are many event handlers that can be used to retrieve results and other pieces of surrounding information (see the SpeechRecognition
events.) The most common one you'll probably use is the result
event, which is fired once a successful result is received:
recognition.onresult = (event) => {
const color = event.results[0][0].transcript;
diagnostic.textContent = `Result received: ${color}.`;
bg.style.backgroundColor = color;
console.log(`Confidence: ${event.results[0][0].confidence}`);
};
The second line here is a bit complex, so let's explain it step by step:
- The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0.
- Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays — the second [0] therefore returns the SpeechRecognitionAlternative at position 0.
- We then grab its transcript property to get a string containing the individual recognized result, set the background color to that color, and report the recognized color as a diagnostic message in the UI (see the sketch after this list for a more verbose version of this lookup).
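That sketch is purely illustrative and not part of the demo code. It performs the same lookup as above as a standalone handler; if maxAlternatives were set higher than 1, each result would contain several alternatives that you could loop over in the same way:
recognition.onresult = (event) => {
  const firstResult = event.results[0]; // a SpeechRecognitionResult
  // Loop over the alternatives it contains (only one in this demo,
  // because maxAlternatives is set to 1)
  for (let i = 0; i < firstResult.length; i++) {
    const alternative = firstResult[i]; // a SpeechRecognitionAlternative
    console.log(alternative.transcript, alternative.confidence);
  }
};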
We also use the speechend
event to stop the speech recognition service from running (using SpeechRecognition.stop()
) once a single word has been recognized and it has finished being spoken:
recognition.onspeechend = () => {
recognition.stop();
};
Handling errors and unrecognized speech
The last two handlers handle cases where the spoken term wasn't recognized, or an error occurred with the recognition. The nomatch
event is supposed to handle the first case mentioned, although note that in most cases the recognition engine will return something, even if it is unintelligible:
recognition.onnomatch = (event) => {
diagnostic.textContent = "I didn't recognize that color.";
};
The error
event handles cases where there is an actual error with the recognition — the SpeechRecognitionErrorEvent.error
property contains the actual error returned:
recognition.onerror = (event) => {
diagnostic.textContent = `Error occurred in recognition: ${event.error}`;
};
On-device speech recognition
In many cases, the default is to perform the speech recognition using an online service. This means that an audio recording or transcription of the speech will be sent to a server for processing, and the results returned to the browser. This has a couple of problems:
- Privacy: Many users are not comfortable with their speech being sent off to a server.
- Performance: Having to send data to a server for every bit of recognition can cause a performance hit in more intensive applications, and your apps won't work offline.
To mitigate these problems, the Web Speech API provides functionality to specify that you want your speech recognition handled on-device by the browser. There is a one-time language pack download for each language you want to recognize, but after that the functionality will be available offline.
This section explains how to use on-device speech recognition.
Demo
To demonstrate usage of on-device speech recognition, we've written a demo called On-device speech color changer (run the demo live).
This demo works in a very similar fashion to the online speech color changer demo discussed earlier, with the differences noted below.
Specifying on-device recognition
To specify that you want to use the browser's own on-device processing, you need to set the SpeechRecognition.processLocally
property to true
before starting any speech recognition (the default value is false
):
recognition.processLocally = true;
Checking availability and installing language packs
For on-device speech recognition to occur, the browser needs to have a language pack installed for the language you are trying to recognize. If you run the start()
method after specifying processLocally = true
and you haven't got the correct language pack installed, it will fail with a language-not-supported
error.
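If you want to handle this case gracefully, you could check for that specific error value in the demo's existing error handler. The following is just a sketch of one possible approach, reusing the diagnostic element from earlier; it is not part of the demo code:
recognition.onerror = (event) => {
  if (event.error === "language-not-supported") {
    // The required language pack isn't installed on this device yet
    diagnostic.textContent =
      "On-device recognition isn't available for this language yet.";
  } else {
    diagnostic.textContent = `Error occurred in recognition: ${event.error}`;
  }
};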
To get the correct language pack installed, there are two steps to follow.
- You need to check whether the language pack is available on the user's computer. This is handled using the SpeechRecognition.available() static method.
- You need to install the language pack if it isn't available. This is handled using the SpeechRecognition.install() static method.
Both of the above steps are handled using the following modified click
event handler on the app's control <button>
:
startBtn.addEventListener("click", () => {
// check availability of target language
SpeechRecognition.available({ langs: ["en-US"], processLocally: true }).then(
(result) => {
if (result === "unavailable") {
diagnostic.textContent = `en-US not available to download at this time. Sorry!`;
} else if (result === "available") {
recognition.start();
console.log("Ready to receive a color command.");
} else {
diagnostic.textContent = `en-US language pack downloading`;
SpeechRecognition.install({
langs: ["en-US"],
processLocally: true,
}).then((result) => {
if (result) {
diagnostic.textContent = `en-US language pack downloaded. Try again.`;
} else {
diagnostic.textContent = `en-US language pack failed to download. Try again later.`;
}
});
}
},
);
});
The available()
method takes an options object containing two properties:
- A langs array containing the languages to check availability for.
- A processLocally boolean specifying whether to check availability of the language for on-device recognition only (true), or for either on-device or server-based recognition (false, which is the default).
When run, this method returns a Promise that resolves with an enumerated value indicating the availability of the specified languages for recognition. In our case, we test for three conditions:
- If the resulting value is unavailable, it means that the language is not available, and a suitable language pack is not available to download, so we print an appropriate message to the output.
- If the resulting value is available, it means that the language pack is available locally, so recognition can begin. In this case, we run start() and log a message to the console when the app is ready to receive speech.
- If the value is something else (downloadable or downloading), we print a diagnostic message to inform the user that a language pack download is commencing, then run the install() method to handle the download.
The install()
method works in a similar way to the available()
method, except that its options object only takes the langs
array. When run, it starts downloading all the language packs for the languages indicated in langs
and returns a Promise
that resolves with a boolean indicating whether the specified language packs were downloaded and installed successfully (true
) or not (false
).
In this basic demo, we just write a diagnostic message to indicate the success and failure cases; in a more substantial demo you'd probably want to disable the controls while the download process occurs, and enable them again when the promise resolves.
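As a rough sketch of that idea (not part of the demo code), you could disable the startBtn before calling install(), and re-enable it when the returned promise settles:
startBtn.disabled = true;
SpeechRecognition.install({ langs: ["en-US"], processLocally: true })
  .then((success) => {
    diagnostic.textContent = success
      ? "en-US language pack downloaded. Try again."
      : "en-US language pack failed to download. Try again later.";
  })
  .finally(() => {
    // Re-enable the control whether or not the download succeeded
    startBtn.disabled = false;
  });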
Permissions-policy integration
Usage of the available()
and install()
methods is controlled by the on-device-speech-recognition
Permissions-Policy
directive. Specifically, where a defined policy blocks usage, any attempts to call these methods will fail.
The default allowlist value for on-device-speech-recognition
is self
, so you don't need to worry about this unless you are attempting to use these features in embedded cross-origin documents, or specifically want to disable their usage.
Unprefixed Web Speech API
In the original speech color changer demo, we include lines to handle the issue that some browsers still support the Web Speech API with prefixed properties (see the Prefixed properties section for more details).
In the on-device version, we don't include the prefix-handling code because the new implementations that support this functionality do so without prefixes.
Speech recognition contextual biasing
There are times where a speech recognition service will fail to correctly recognize a specific word or phrase. This is most commonly the case for domain-specific terminology (for example medical or scientific terms), proper nouns, uncommon phrases, or words that are similar to other words and so may be misrecognized.
For example, during testing, we found that our on-device speech color changer had trouble recognizing the color azure
— it kept returning results like "as you". Other colors that were frequently misrecognized included khaki
("car key"), tan
, and thistle
("this all").
To mitigate such problems, the Web Speech API can provide hints to the recognition engine to highlight phrases that are more likely to be spoken and which the engine should therefore be biased towards. The end result is that these phrases are more likely to be recognized correctly.
This is done by setting an array of SpeechRecognitionPhrase
objects as the value of the SpeechRecognition.phrases
property. Each SpeechRecognitionPhrase
object contains:
- A phrase property, which is a string containing a word or phrase you want boosted in terms of the engine's bias towards it.
- A boost property, which is a floating-point number between 0.0 and 10.0 (inclusive) representing the amount of boost you want to give that phrase. Higher values make the phrase more likely to be recognized.
In our on-device speech color changer
demo, we handle this by first creating an array containing the phrases to boost and their boost values:
const phraseData = [
{ phrase: "azure", boost: 5.0 },
{ phrase: "khaki", boost: 3.0 },
{ phrase: "tan", boost: 2.0 },
];
These need to be represented as an ObservableArray
of SpeechRecognitionPhrase
objects. We can handle this by mapping the original array, converting each array element into a SpeechRecognitionPhrase
object using the SpeechRecognitionPhrase()
constructor:
const phraseObjects = phraseData.map(
(p) => new SpeechRecognitionPhrase(p.phrase, p.boost),
);
After we've created our SpeechRecognition
instance, we can then plug our contextual biasing phrases into it by setting the phraseObjects
array as the value of the SpeechRecognition.phrases
property:
recognition.phrases = phraseObjects;
The phrases array can be modified just like a normal JavaScript array, for example by pushing new phrases to it dynamically:
recognition.phrases.push(new SpeechRecognitionPhrase("thistle", 5.0));
After adding in this code, we found that the problematic color keywords were recognized more accurately than before.
Speech synthesis
Speech synthesis (also known as text-to-speech, or TTS) involves taking text contained within an app and synthesizing it into speech, which is then played out of a device's speaker or audio output connection.
The Web Speech API has a main controller interface for this — SpeechSynthesis
— plus a number of closely-related interfaces for representing text to be synthesized (known as utterances), voices to be used for the utterance, etc. Again, most OSes have some kind of speech synthesis system, which will be used by the API for this task as available.
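Before looking at the full demo, it is worth noting that the simplest possible use of speech synthesis only takes a couple of lines, for example:
// Minimal example: create an utterance and hand it to the speech
// synthesis controller to be spoken with the default voice
const greeting = new SpeechSynthesisUtterance("Hello from the Web Speech API!");
window.speechSynthesis.speak(greeting);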
Demo
To show usage of Web speech synthesis, we've provided a demo called Speak easy synthesis. This includes a set of form controls for entering text to be synthesized, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter/Return to hear it spoken.
To run the demo, navigate to the live demo URL in a supporting browser.
HTML and CSS
The HTML and CSS are again pretty trivial, containing a title, some instructions for use, and a form with some basic controls. The <select>
element is initially empty, but is populated with <option>
s via JavaScript (see later on.)
<h1>Speech synthesizer</h1>
<p>
Enter some text in the input below and press return to hear it. Change voices
using the dropdown menu.
</p>
<form>
<input type="text" class="txt" />
<div>
<label for="rate">Rate</label
><input type="range" min="0.5" max="2" value="1" step="0.1" id="rate" />
<div class="rate-value">1</div>
<div class="clearfix"></div>
</div>
<div>
<label for="pitch">Pitch</label
><input type="range" min="0" max="2" value="1" step="0.1" id="pitch" />
<div class="pitch-value">1</div>
<div class="clearfix"></div>
</div>
<select></select>
</form>
JavaScript
Let's investigate the JavaScript that powers this app.
Setting variables
First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture a reference to Window.speechSynthesis
. This is the API's entry point — it returns an instance of SpeechSynthesis
, the controller interface for web speech synthesis.
const synth = window.speechSynthesis;
const inputForm = document.querySelector("form");
const inputTxt = document.querySelector(".txt");
const voiceSelect = document.querySelector("select");
const pitch = document.querySelector("#pitch");
const pitchValue = document.querySelector(".pitch-value");
const rate = document.querySelector("#rate");
const rateValue = document.querySelector(".rate-value");
let voices = [];
Populating the select element
To populate the <select>
element with the different voice options the device has available, we've written a populateVoiceList()
function. We first invoke SpeechSynthesis.getVoices()
, which returns a list of all the available voices, represented by SpeechSynthesisVoice
objects. We then loop through this list — for each voice we create an <option>
element, set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name
), the language of the voice (grabbed from SpeechSynthesisVoice.lang
), and -- DEFAULT
if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default
returns true
.)
We also create data-
attributes for each option, containing the name and language of the associated voice, so we can grab them easily later on, and then append the options as children of the select.
function populateVoiceList() {
voices = synth.getVoices();
for (const voice of voices) {
const option = document.createElement("option");
option.textContent = `${voice.name} (${voice.lang})`;
if (voice.default) {
option.textContent += " — DEFAULT";
}
option.setAttribute("data-lang", voice.lang);
option.setAttribute("data-name", voice.name);
voiceSelect.appendChild(option);
}
}
Older browsers don't support the voiceschanged event, and just return a list of voices when SpeechSynthesis.getVoices() is called. In other browsers, such as Chrome, you have to wait for the event to fire before populating the list. To allow for both cases, we run the function as shown below:
populateVoiceList();
if (speechSynthesis.onvoiceschanged !== undefined) {
speechSynthesis.onvoiceschanged = populateVoiceList;
}
Speaking the entered text
Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter/Return is pressed. We first create a new SpeechSynthesisUtterance()
instance using its constructor — this is passed the text input's value as a parameter.
Next, we need to figure out which voice to use. We use the HTMLSelectElement
selectedOptions
property to return the currently selected <option>
element. We then use this element's data-name
attribute, finding the SpeechSynthesisVoice
object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice
property.
Finally, we set the SpeechSynthesisUtterance.pitch
and SpeechSynthesisUtterance.rate
to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak()
, passing it the SpeechSynthesisUtterance
instance as a parameter.
inputForm.onsubmit = (event) => {
event.preventDefault();
const utterThis = new SpeechSynthesisUtterance(inputTxt.value);
const selectedOption =
voiceSelect.selectedOptions[0].getAttribute("data-name");
for (const voice of voices) {
if (voice.name === selectedOption) {
utterThis.voice = voice;
}
}
utterThis.pitch = pitch.value;
utterThis.rate = rate.value;
synth.speak(utterThis);
utterThis.onpause = (event) => {
const char = event.utterance.text.charAt(event.charIndex);
console.log(
`Speech paused at character ${event.charIndex} of "${event.utterance.text}", which is "${char}".`,
);
};
inputTxt.blur();
};
In the final part of the handler, we include a pause event handler to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, this logs a message reporting the character index at which the speech was paused, along with the character itself.
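The demo doesn't wire up any pause controls itself, but as a rough sketch, calling the controller's pause() and resume() methods (for example, from extra buttons you might add) is what would trigger that handler:
// Pausing fires the pause event on the utterance currently being spoken
synth.pause();
// Resuming continues the speech from where it was paused
synth.resume();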
Finally, we call blur()
on the text input. This is mainly to hide the keyboard on Firefox OS.
Updating the displayed pitch and rate values
The last part of the code updates the pitch
/rate
values displayed in the UI, each time the slider positions are moved.
pitch.onchange = () => {
pitchValue.textContent = pitch.value;
};
rate.onchange = () => {
rateValue.textContent = rate.value;
};
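Note that the change event only fires once a slider is released. If you wanted the displayed values to update live while dragging, you could listen for the input event instead; this is a small variation, not part of the original demo:
pitch.addEventListener("input", () => {
  pitchValue.textContent = pitch.value;
});
rate.addEventListener("input", () => {
  rateValue.textContent = rate.value;
});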