Play it, Sam. Play “Speech Synthesis”

This content originally appeared on dbushell.com (blog) and was authored by dbushell.com (blog)

Terence Eden’s Numbers Station is a fun experiment using the Web Speech API to synthesise text-to-speech (TTS). Modern web browsers have TTS built-in. Learning of this powerful API gave me an idea.

I’m a big podcast and audiobook listener. I appreciate when blogs provide an alternative audio version. Citation Needed and Pivot to AI are two excellent examples. I’ve always been tempted to narrate my own articles, but unlike Molly and David, I have a voice best suited for the silent movies. I’m also building an RSS reader and TTS would be a perfect feature.

Speech Synthesis

The Web Speech API saves me from the sound of my own voice. Reading an entire post can be as simple as three lines of code.

const $post = document.querySelector(".Main > .Prose");
const utterance = new SpeechSynthesisUtterance($post.innerText);
globalThis.speechSynthesis.speak(utterance);

The global speechSynthesis object has speak, pause, and resume methods. The utterance instance will emit events like pause and end. These primitives are enough to build basic playback controls.
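As a rough sketch of what those playback controls could look like, the helper below wraps a speechSynthesis-like object in three actions. The name createControls is hypothetical, and the synth is passed in as a parameter (an assumption made purely so the logic can be exercised outside a browser). It relies only on the speak, pause, resume, and cancel methods and the paused and speaking properties of speechSynthesis.

```javascript
// Sketch: wrap a speechSynthesis-like object in three playback actions.
// `createControls` is a hypothetical helper, not part of the Web Speech API.
const createControls = (synth = globalThis.speechSynthesis) => ({
  // Drop anything already queued, then start the new utterance.
  play(utterance) {
    synth.cancel();
    synth.speak(utterance);
  },
  // Pause if speaking, resume if paused; do nothing when idle.
  toggle() {
    if (synth.paused) {
      synth.resume();
    } else if (synth.speaking) {
      synth.pause();
    }
  },
  // cancel() empties the queue and stops the current utterance.
  stop() {
    synth.cancel();
  },
});
```

In a page this might be wired to buttons, for example calling controls.toggle() from a click listener. Browsers can differ in how promptly they update paused, so treat this as a starting point.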

Highlighting Speech

What I want is visual tracking of playback state. Is it possible to highlight specific words as they’re spoken? Yes! With a lot more effort.

The utterance instance also fires a boundary event.

Fired when the spoken utterance reaches a word or sentence boundary.
Web Speech API Draft Specification

The boundary event includes two properties:

charIndex — starting index of the next character
charLength — length of the next word to be spoken
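To make those two numbers concrete, here is a small sketch that converts a boundary event into start and end offsets within the spoken text. The helper name boundaryOffsets is hypothetical. The whitespace fallback is an assumption on my part, added in case an engine reports no usable charLength.

```javascript
// Sketch: turn a boundary event into [start, end) offsets within the spoken text.
// `boundaryOffsets` is a hypothetical helper name.
const boundaryOffsets = (text, { charIndex, charLength }) => {
  // Keep the start inside the string.
  const start = Math.min(Math.max(charIndex, 0), text.length);
  let end;
  if (Number.isFinite(charLength) && charLength > 0) {
    end = start + charLength;
  } else {
    // Assumed fallback: scan forward to the next whitespace character.
    const rest = text.slice(start).search(/\s/);
    end = rest === -1 ? text.length : start + rest;
  }
  // Never run past the end of the string.
  return [start, Math.min(end, text.length)];
};

const text = "Play it, Sam.";
const [start, end] = boundaryOffsets(text, { charIndex: 5, charLength: 2 });
console.log(text.slice(start, end)); // "it"
```

Clamping both offsets matters later because Range.setStart and Range.setEnd throw when given an offset beyond the length of a text node.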

This is very promising! The CSS Highlight API works with Range objects, each defined by a start and an end offset. The next problem is that the speech synthesiser is only given one chunk of text. Mapping those numbers back to individual DOM nodes accurately isn’t possible.

The solution I’ve come up with is to collect an array of all text nodes.

const nodeList = [];
const collectNodes = ($parent) => {
  for (const $child of $parent.childNodes) {
    if ($child.nodeType === Node.TEXT_NODE) {
      if ($child.textContent.trim() !== "") {
        nodeList.push($child);
      }
    } else if ($child.nodeType === Node.ELEMENT_NODE) {
      collectNodes($child);
    }
  }
};
const $post = document.querySelector(".Main > .Prose");
collectNodes($post);
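To show what the flat array ends up containing, here is a sketch of the same traversal that returns a new array instead of filling a shared one. The name collectTextNodes is hypothetical, and plain objects stand in for DOM nodes so the example runs outside a browser. The constants 1 and 3 are the numeric values of Node.ELEMENT_NODE and Node.TEXT_NODE.

```javascript
// Numeric values of Node.ELEMENT_NODE and Node.TEXT_NODE in the DOM.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Sketch: the same recursive traversal, but returning the array it builds.
// `collectTextNodes` is a hypothetical name.
const collectTextNodes = ($parent, found = []) => {
  for (const $child of $parent.childNodes) {
    if ($child.nodeType === TEXT_NODE) {
      // Skip whitespace-only nodes such as indentation between tags.
      if ($child.textContent.trim() !== "") found.push($child);
    } else if ($child.nodeType === ELEMENT_NODE) {
      collectTextNodes($child, found);
    }
  }
  return found;
};

// Plain objects standing in for <p>Hello <em>big</em> world</p>.
const textNode = (textContent) => ({ nodeType: TEXT_NODE, textContent });
const $p = {
  nodeType: ELEMENT_NODE,
  childNodes: [
    textNode("Hello "),
    { nodeType: ELEMENT_NODE, childNodes: [textNode("big")] },
    textNode(" world"),
    textNode("\n  "),
  ],
};
// Logs the three non-empty strings in document order: "Hello ", "big", " world".
console.log(collectTextNodes($p).map((n) => n.textContent));
```

With real DOM nodes the function can be called as collectTextNodes($post) and the result is in document order, which is the order the text should be spoken.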

I use a recursive function to create a flat array of all text nodes from my blog post. Next I can iterate the array and speak each node one-by-one.

const nextWord = () => {
  if (nodeList.length === 0) {
    return;
  }
  const $text = nodeList.shift();
  const utterance = new SpeechSynthesisUtterance($text.textContent);
  utterance.addEventListener("end", () => nextWord());
  globalThis.speechSynthesis.speak(utterance);
};
nextWord();

This function works by removing the first text node from the front of the list and speaking it. Using the end event, it repeats until every node has been spoken.
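The chaining pattern can also be sketched as a reusable function that copies the list and reports when it has finished. The names speakAll, makeUtterance, and onDone are hypothetical, and passing the synthesiser and utterance factory as options is an assumption made so the sequencing can be tested without a browser.

```javascript
// Sketch: speak a list of nodes one after another, chained on the "end" event.
// `speakAll`, `makeUtterance` and `onDone` are hypothetical names.
const speakAll = (
  nodes,
  {
    synth = globalThis.speechSynthesis,
    makeUtterance = (text) => new SpeechSynthesisUtterance(text),
    onDone = () => {},
  } = {},
) => {
  // Copy the list so the caller's array is left untouched.
  const queue = [...nodes];
  const next = () => {
    const node = queue.shift();
    if (node === undefined) {
      onDone(); // nothing left to speak
      return;
    }
    const utterance = makeUtterance(node.textContent);
    // When this utterance finishes, move on to the next node.
    utterance.addEventListener("end", next, { once: true });
    synth.speak(utterance);
  };
  next();
};
```

In a browser, speakAll(nodeList) would use the defaults. Queueing one utterance at a time keeps a reference to the node currently being spoken, which is what the highlighting step needs.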

Now when I add the boundary event listener I have a reference to the text node being spoken. I can use this for the CSS highlight range.

const highlight = new Highlight();
CSS.highlights.set("speech-synth", highlight);

const nextWord = () => {
  if (nodeList.length === 0) {
    return;
  }
  const $text = nodeList.shift();
  const utterance = new SpeechSynthesisUtterance($text.textContent);
  utterance.addEventListener("end", () => nextWord());
  utterance.addEventListener("boundary", (ev) => {
    highlight.clear();
    const range = new Range();
    range.setStart($text, ev.charIndex);
    range.setEnd($text, ev.charIndex + ev.charLength);
    highlight.add(range);
  });
  globalThis.speechSynthesis.speak(utterance);
};
nextWord();

CSS has a special named highlight selector.

::highlight(speech-synth) {
  background: green;
}

And with that I’m able to highlight each word as it’s spoken. That’s neat! To keep the highlighted word visible I scroll the parent element into view.

$text.parentNode.scrollIntoView({
  behavior: "auto",
  block: "nearest",
});

Improvements

Some elements, like images and videos, have no text nodes for this technique to collect. For those I’ve added extra conditions. First I create a map (I’ll explain later).

const nodeParent = new WeakMap();

Then within collectNodes I add the edge cases. For images I generate a text node from the alt attribute prefixed with “image:” for context when spoken.

const tagName = $child.nodeName.toLowerCase();
if (tagName === "img") {
  const $text = document.createTextNode(`image: ${$child.alt}`);
  nodeParent.set($text, $child);
  nodeList.push($text);
  // `next` has to be a label on the for...of loop in collectNodes (next: for ...)
  continue next;
}

Within the boundary event listener, and before I apply the highlight range, I first check the weak map. If a parent is mapped I apply a different style.

if (nodeParent.has($text)) {
  const $parent = nodeParent.get($text);
  $parent.dataset.speechSynthHighlight = "true";
  return;
}

These nodes can’t be highlighted so instead I apply an outline.

[data-speech-synth-highlight] {
  outline: 10px solid green;
}

Later I remove the data attribute alongside clearing any highlights (code not shown).
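Since that cleanup code isn’t shown, here is one way it might look. This is a guess at the shape rather than the actual source. The names outlined, markOutlined, and clearAll are hypothetical. It uses Highlight.clear() and the fact that deleting a dataset property removes the matching data attribute.

```javascript
// Sketch: undo both kinds of highlight in one place.
// `outlined`, `markOutlined` and `clearAll` are hypothetical names.
const outlined = new Set();

// Outline an element and remember it so it can be reset later.
const markOutlined = ($el) => {
  $el.dataset.speechSynthHighlight = "true";
  outlined.add($el);
};

const clearAll = (highlight) => {
  // Remove every Range from the Highlight object.
  highlight.clear();
  for (const $el of outlined) {
    // Deleting the property removes data-speech-synth-highlight.
    delete $el.dataset.speechSynthHighlight;
  }
  outlined.clear();
};
```

Calling clearAll(highlight) at the start of each boundary listener, and again on the end event, would keep at most one word or element marked at a time.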

I do the same thing for videos and code examples. Should I be descending into code blocks and reading syntax verbatim? I’ve opted not to because I think that’d be a worse experience. This is not intended to replace a proper screen reader.

Browser Support

The Web Speech API is well supported. The CSS Highlight API is less so. The latest Chromium and WebKit browsers I use work well. My version of “Firefox” (Mullvad; ESR 128) doesn’t work. (I’ll start caring about Firefox again when Mozilla do.)
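Given the uneven support, a feature check before enabling either part seems sensible. The sketch below is an assumption about how that could be done. The names supportsSpeech and supportsHighlight are hypothetical, and the global object is taken as a parameter only so the checks can be tested.

```javascript
// Sketch: feature checks, with the global object passed in so they can be tested.
// `supportsSpeech` and `supportsHighlight` are hypothetical names.
const supportsSpeech = (g = globalThis) =>
  "speechSynthesis" in g && typeof g.SpeechSynthesisUtterance === "function";

// The CSS Highlight API needs both the Highlight constructor and CSS.highlights.
const supportsHighlight = (g = globalThis) =>
  typeof g.Highlight === "function" && g.CSS?.highlights !== undefined;
```

With checks like these, speech could still play in a browser that lacks the CSS Highlight API, just without the word tracking.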

The synthetic voice on macOS is good enough. It sounds robotic. It makes some grammatical mistakes. But it’s usable! Presumably Windows and Linux have similar voices.

Source Code

You can view my JavaScript source file for the full code. It’s a bit messy right now! I’ve implemented this as a <speech-synth> custom element. I’ve added playback controls to an additional <dialog> to pause, resume, and end speech.

There is a “Play Synthesised Audio” button at the top of my blog posts and individual note pages. I hope someone finds it useful! I’m going to improve this next week.

Bookmark croissantrss.com if you enjoy reading RSS feeds.

Croissant will be launching soon!
