An Interesting HTML Parser Conundrum



This content originally appeared on dbushell.com and was authored by dbushell.com

Against my better judgement, I decided to code a basic HTML parser. It doesn’t cover the full HTML spec, but it’s enough to create a tree of nodes and attributes. I’ve already written a streamable XML parser that has been working for my podcast web app.

Parsing (most) HTML isn’t as complicated as it sounds. Look for a less-than sign < and see if a valid tag like <div> follows. If that node is a void element or self-closing element, it gets appended to the current parent. If it’s an opening tag, it is appended too and then becomes the current parent until a matching close tag is found.
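The loop described above can be sketched roughly as follows. This is a minimal sketch under my own assumptions, not the parser from this post: parseTree, voidTags, and tagPattern are hypothetical names, attributes are kept as a raw string, and a regular expression stands in for a real tokenizer.

```javascript
// Hypothetical sketch: names and structure are illustrative only.
const voidTags = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr'
]);

// Matches an opening, closing, or self-closing tag.
// Assumption: attribute values never contain ">".
const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)([^>]*?)(\/?)>/g;

function parseTree(html) {
  const root = {tag: '#root', attributes: '', children: []};
  const stack = [root];
  let lastIndex = 0;
  for (const match of html.matchAll(tagPattern)) {
    const [raw, closing, name, attributes, selfClosing] = match;
    const parent = stack[stack.length - 1];
    // Text between the previous tag and this one becomes a text node
    if (match.index > lastIndex) {
      parent.children.push({text: html.slice(lastIndex, match.index)});
    }
    lastIndex = match.index + raw.length;
    const tag = name.toLowerCase();
    if (closing) {
      // Pop back to the matching open tag; stray close tags are ignored
      const index = stack.map((node) => node.tag).lastIndexOf(tag);
      if (index > 0) stack.length = index;
      continue;
    }
    const node = {tag, attributes: attributes.trim(), children: []};
    parent.children.push(node);
    // Void and self-closing elements never become the current parent
    if (!voidTags.has(tag) && !selfClosing) stack.push(node);
  }
  // Any trailing text after the last tag
  if (lastIndex < html.length) {
    stack[stack.length - 1].children.push({text: html.slice(lastIndex)});
  }
  return root;
}

const tree = parseTree('<div><p>Hi<br>there</p></div>');
// The <p> holds a text node, the <br>, and another text node
console.log(tree.children[0].children[0].children.length); // 3
```

A stack of open elements keeps the "current parent" explicit: the last item is always where new nodes are appended.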

There are several HTML elements that I consider opaque and will skip parsing inside.

So far that list includes:

export const opaqueTags = new Set([
  'code', 'iframe', 'math', 'noscript',
  'object', 'pre', 'script', 'style',
  'svg', 'template', 'textarea'
]);

For these elements I just want to gather the raw text and avoid creating a node tree. This is where I became confused.

The Conundrum

I started thinking about inline <script> and <style> tags. The contents of said elements are not HTML but could look like HTML.

What happens when I parse this inline script:

<script>
  console.log('</script>');
</script>

Or similar:

<script>
  /* </script> */
</script>

In these two examples the JavaScript text includes </script> inside a string literal and a comment respectively. How do HTML parsers know that is not real HTML? They’re not JavaScript parsers; they’re not aware of the string or comment context.

Investigation

I tested two popular Node.js libraries: htmlparser2 and parse5. Both libraries failed, or so I thought, by ending the <script> node early.

Nodes are created something like this:

  1. <script> opening tag
  2. console.log(' child text node
  3. </script> closing tag
  4. '); adjacent text node

The final </script> has no open element left to match, so it gets thrown away as a stray closing tag.

I wasn’t satisfied! At this point I remembered that the best HTML parsers are web browsers, not Node packages. Surely a web browser can parse this correctly? Nope. Well… actually yes, once I realised my assumptions were wrong. Web browsers behave exactly the same.

See this CodePen for proof.

The same behaviour happens with an inline <style>:

<style>
  /* </style> */
  html {
    background: red;
  }
</style>

Everything from */ html { onwards is rendered as a text node and the “real” closing </style> tag is thrown away.

I did not expect this behaviour, but oh boy am I relieved! Can you imagine how difficult it would be to parse HTML otherwise?
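With this behaviour, gathering the raw text of an opaque element reduces to finding the first matching close tag. Here is a minimal sketch, assuming a hypothetical helper named readRawText (not taken from my parser). It ignores the extra rules the HTML spec has for <!-- inside scripts, and it assumes the tag name contains no regular expression special characters.

```javascript
// Hypothetical helper: find where the raw text of an opaque element ends.
// `start` is the index just after the ">" of the opening tag.
function readRawText(html, start, tag) {
  // First "</tag" followed by whitespace, "/", or ">" (case-insensitive).
  // No JavaScript or CSS context is considered at all.
  const closePattern = new RegExp(`</${tag}(?=[\\s/>])`, 'gi');
  closePattern.lastIndex = start;
  const match = closePattern.exec(html);
  // No close tag found: treat the rest of the input as raw text
  const end = match ? match.index : html.length;
  return {text: html.slice(start, end), end};
}

const html = "<script>\n  console.log('</script>');\n</script>";
const {text, end} = readRawText(html, '<script>'.length, 'script');
console.log(JSON.stringify(text)); // "\n  console.log('"
console.log(JSON.stringify(html.slice(end))); // "</script>');\n</script>"
```

The second log line shows what is left for the regular parser: the first close tag, the ');  text, and the stray final close tag.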

My HTML parsing efforts currently reside in my Hyperless GitHub repo, an assortment of experimental JavaScript + HTML utilities. I’m not sure of my final plans; I’m just coding for fun right now. Originally I was planning to make a reference implementation in JavaScript and then reimplement it in Rust or Zig. I just need more free time!

