This content originally appeared on dbushell.com and was authored by dbushell.com
Against my better judgement I decided to code a basic HTML parser. Not the full HTML spec but enough to create a tree of nodes and attributes. I’ve already written a streamable XML parser that has been working for my podcast web app.
Parsing (most) HTML isn’t as complicated as it sounds. Look for a less-than sign < and see if a valid tag like <div> follows. If that node is a void element or self-closing element it gets appended to the current parent. If it’s an opening tag it becomes the current parent until a matching close tag is found.
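As a rough illustration of that loop (not the actual Hyperless code), here is a minimal sketch. The names parseTree, parseAttributes, and voidTags are hypothetical, the tag pattern only handles double-quoted attributes, and comments, doctypes, and opaque elements are ignored entirely.

```javascript
// Hypothetical sketch: a tiny stack-based tree builder, not a spec-compliant parser.
const voidTags = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr'
]);

// Collect name="value" pairs; attributes without a value become empty strings.
function parseAttributes(text) {
  const attributes = {};
  for (const [, key, value] of text.matchAll(/([^\s=>\/]+)(?:="([^"]*)")?/g)) {
    attributes[key] = value ?? '';
  }
  return attributes;
}

function parseTree(html) {
  const root = { tag: '#root', attributes: {}, children: [] };
  const stack = [root]; // the last item is the current parent
  // Matches an opening, closing, or self-closing tag (simplified assumption).
  const tagPattern =
    /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)((?:\s+[^\s=>\/]+(?:="[^"]*")?)*)\s*(\/?)>/g;
  let lastIndex = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const [raw, closing, name, attrText, selfClosing] = match;
    const parent = stack[stack.length - 1];
    // Any text between the previous tag and this one becomes a text node.
    if (match.index > lastIndex) {
      parent.children.push({ text: html.slice(lastIndex, match.index) });
    }
    lastIndex = match.index + raw.length;
    const tag = name.toLowerCase();
    if (closing) {
      // Pop back to the matching open tag; stray close tags are ignored.
      const index = stack.map((node) => node.tag).lastIndexOf(tag);
      if (index > 0) stack.length = index;
      continue;
    }
    const node = { tag, attributes: parseAttributes(attrText), children: [] };
    parent.children.push(node);
    // Only a real opening tag becomes the new current parent.
    if (!selfClosing && !voidTags.has(tag)) stack.push(node);
  }
  // Trailing text after the last tag.
  if (lastIndex < html.length) {
    stack[stack.length - 1].children.push({ text: html.slice(lastIndex) });
  }
  return root;
}
```

Calling parseTree('<div id="a"><img src="x.png"><p>Hi</p></div>') returns a root whose only child is the div, which in turn holds the img and p nodes.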
There are several HTML elements that I consider opaque and will skip parsing inside.
So far that list includes:
export const opaqueTags = new Set([
'code', 'iframe', 'math', 'noscript',
'object', 'pre', 'script', 'style',
'svg', 'template', 'textarea'
]);
For these elements I just want to gather the raw text and avoid creating a node tree. This is where I became confused.
The Conundrum
I started thinking about inline <script> and <style> tags. The contents of said elements are not HTML but could look like HTML.
What happens when I parse this inline script:
<script> console.log('</script>'); </script>
Or similar:
<script> /* </script> */ </script>
In these two examples the JavaScript text includes </script> inside a string literal and comment. How do HTML parsers know that is not real HTML? They’re not JavaScript parsers; they’re not aware of the string or comment context.
Investigation
I tested two popular Node.js libraries: htmlparser2 and parse5. Both libraries failed, or so I thought, by ending the <script> node early.
Nodes are created something like this:
opening tag: <script>
child text node: console.log('
closing tag: </script>
adjacent text node: ');
The final </script> gets thrown away as a stray error.
I wasn’t satisfied! At this point I remembered that the best HTML parsers are web browsers, not Node packages. Surely a web browser can parse this correctly? Nope. Well… actually yes, once I realised my assumptions were wrong. Web browsers behave exactly the same.
The same behaviour happens with an inline <style>:
<style> /* </style> */ html { background: red; } </style>
Everything from */ html { onwards is rendered as a text node and the “real” closing </style> tag is thrown away.
I did not expect this behaviour, but oh boy am I relieved! Can you imagine how difficult it would be to parse HTML otherwise?
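Given that behaviour, gathering the raw text of an opaque element can be reduced to a search for the first matching close tag. The sketch below is only an illustration under that assumption: readOpaque is a hypothetical helper, and its close-tag pattern is simpler than the tokenizer rules in the HTML spec.

```javascript
// Hypothetical helper: split an opaque element at its FIRST closing tag.
// No attempt is made to understand strings or comments inside the element.
function readOpaque(html, tag) {
  // The input is assumed to start with the element's opening tag.
  const openMatch = new RegExp(`^<${tag}\\b[^>]*>`, 'i').exec(html);
  if (!openMatch) return null;
  const start = openMatch[0].length;
  const body = html.slice(start);
  // Simplified close tag: </tag> with optional whitespace before ">".
  const closeMatch = new RegExp(`</${tag}\\s*>`, 'i').exec(body);
  if (!closeMatch) {
    // Unclosed element: everything that remains is raw text.
    return { rawText: body, rest: '' };
  }
  return {
    rawText: body.slice(0, closeMatch.index),
    rest: body.slice(closeMatch.index + closeMatch[0].length)
  };
}

const scriptResult = readOpaque("<script> console.log('</script>'); </script>", 'script');
console.log(JSON.stringify(scriptResult.rawText)); // " console.log('"
console.log(JSON.stringify(scriptResult.rest)); // "'); </script>"
```

The rest value is what a full parser would carry on treating as ordinary HTML, which is why the leftover ends up as a text node followed by a stray closing tag.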
My HTML parsing efforts currently reside in my Hyperless GitHub repo, an assortment of experimental JavaScript + HTML utilities. I’m not sure of my final plans; I’m just coding for fun right now. Originally I was planning to make a reference in JavaScript and then reimplement it in Rust or Zig. I just need more free time!