How I Built a Chrome Extension That Parses Any Job Site Without Scraping



This content originally appeared on DEV Community and was authored by Galih

Nine months ago, I got laid off. During my brief spell looking for a job (before deciding to go indie), I wanted, as always, to do it in an organized manner.

I used to rely on an Airtable base for collecting all the job postings I found online, then take notes, track activities, and so on, but this time I decided to look for a ready-made solution. I found some cool products, but they were too expensive for my liking, costing between $20 and $40 a month!

That is how the idea of HuntingPad was born, and the core feature seemed simple: clip job posts with one click from anywhere in the browser!

Spoiler of what I ended up achieving

What I didn’t expect was that this “simple” feature would lead me through a maze of technical decisions – from expensive webscraping APIs to LLM token optimization.

What I wanted to have as the UX is something like this:

Context menu option to clip job post in Google Chrome

Whenever the user is viewing a job posting, one click should be enough to save the job post immediately and have the important info parsed and structured nicely by AI, without manual intervention.

There are a couple of questions/challenges about this:

  1. What APIs available to a Chrome extension enable this flow? I know at least one, but what are the options?
  2. Once we’re able to trigger the clipping, how do I go about getting the job posting content?

I decided to study question #2 first since that seemed more critical.

✂ Getting the content of the job posting

Here are the alternatives that I was considering:

  1. Build my own webscraping system.
  2. Use a webscraper API → feed the content to an LLM → structured data.
  3. Use regex to parse job data, skipping the webscraper API → feed the content to an LLM → structured data.
  4. Use an LLM directly to fetch, parse, and structure the job posting in one or two calls.

I wanted to enable users on a free plan to use the extension because I believe it should be useful for job seekers from all economic backgrounds. This makes cost the most important criterion in choosing a solution; we need to come up with an economically viable one.

Option #1 was immediately out of the question—too large in scope for a project with no guarantee of ROI. It would be fun, don’t get me wrong, but fun pays no bills unfortunately :).

A webscraper API initially seemed like the obvious choice, but after researching some providers, I discovered that most of them have pricing that gets quite steep. This option was not the most viable either.

Next, I considered writing regex patterns to parse job data from different sites. LinkedIn has one structure, Indeed another, and company career pages vary wildly. After thorough discussions with my friend Claude, I realized this approach was doomed.

This left me with just one alternative from my original list: having the LLM alone handle scraping, parsing, and structuring the job posting. My instinct said it would be prohibitively expensive. I proceeded to test it on three job postings—one from LinkedIn, one from Indeed, and another from a company career page.

The results confirmed my suspicion: too expensive.

This forced me to reconsider the webscraping API option, so I decided to go with it (though this wasn’t the end of the story).

▶ The Trigger

With fetching the job posting content settled, I researched how to actually trigger the clipping flow. From my experience with other extensions, I knew we could click on the extension icon in the browser bar, but I wondered what other options were available.

Then I found chrome.contextMenus.

Chrome context menu API

This is exactly what I wanted: the user, on any job posting page, just right-clicks and selects an option like “Clip job post with HuntingPad” from the menu, and it just works. I was sold, and glad this was a much easier decision to reach.
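If you haven’t used this API before, here is a minimal sketch of what the wiring can look like in a Manifest V3 background service worker. The menu id, message name, and the sendMessage hand-off are my illustration, not HuntingPad’s actual code:

// background.ts (MV3 service worker); requires the "contextMenus" permission in the manifest
chrome.runtime.onInstalled.addListener(() => {
  chrome.contextMenus.create({
    id: 'clip-job-post',
    title: 'Clip job post with HuntingPad',
    contexts: ['page', 'selection'],
  });
});

chrome.contextMenus.onClicked.addListener((info, tab) => {
  if (info.menuItemId === 'clip-job-post' && tab?.id !== undefined) {
    // Hand the click over to the page so the clipping flow can start there
    chrome.tabs.sendMessage(tab.id, { type: 'CLIP_JOB_POST' });
  }
});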

↩ Back to the idea of scraping

Remember how I settled on the webscraping API earlier? Having researched the context menu, I thought of something.

What if we let users “scrape” the job posting themselves? We could ask them to highlight the text they want to clip before right-clicking and selecting “Clip job post with HuntingPad.” The extension’s background script would detect this selection and extract the highlighted HTML tags before sending them to the backend for processing.
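To make that concrete, here is a rough sketch of how the highlighted selection could be picked up once the menu item fires. I’m showing it as a content script because only code running in the page can call window.getSelection(); the message name matches the background sketch above, and the exact HuntingPad setup may differ:

// content-script.ts: runs in the page, where window.getSelection() is available
chrome.runtime.onMessage.addListener((message) => {
  if (message.type !== 'CLIP_JOB_POST') return;

  const selection = window.getSelection();
  if (!selection || selection.rangeCount === 0) return;

  const range = selection.getRangeAt(0);
  // Range boundaries are usually text nodes, so walk up to their parent elements
  const startEl =
    range.startContainer instanceof HTMLElement
      ? range.startContainer
      : range.startContainer.parentElement;
  const endEl =
    range.endContainer instanceof HTMLElement
      ? range.endContainer
      : range.endContainer.parentElement;
  if (!startEl || !endEl) return;

  // From these two elements we still need to find the chunk of the page that
  // actually contains the whole job posting; that is the challenge described below.
});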

While this approach deviates from my ideal UX flow, it offers significant benefits: eliminating the cost of a scraping API and removing a potential source of latency and errors from integrating with another API provider.

(Let’s be honest—it was mainly about the cost 😅).

But just when I thought we were in the clear, I discovered another challenge that threatened to derail our progress!

🚧 The challenge in user-led highlighting

In our background script, we need to determine both the start and the end of the HTML we want to pick up. In other words, we need to find the smallest/closest common ancestor of the start element (usually the HTML tag of the job title heading) and the end element (the last sentence of the job description content).

function findCommonParent(elementA: HTMLElement, elementB: HTMLElement): HTMLElement | null {
  // Get all parents of element A
  const parentsA: HTMLElement[] = [];
  let currentA: HTMLElement | null = elementA;

  while (currentA && currentA.tagName !== 'BODY' && currentA.tagName !== 'HTML') {
    parentsA.push(currentA);
    currentA = currentA.parentElement;
  }

  // Check if any parent of element B matches a parent of element A
  let currentB: HTMLElement | null = elementB;
  while (currentB && currentB.tagName !== 'BODY' && currentB.tagName !== 'HTML') {
    // Check if this element is in the parents of A
    const index = parentsA.indexOf(currentB);
    if (index >= 0) {
      // We found a common parent, now check if it's a block element
      const commonParent = parentsA[index];

      // Is it already a block-level element?
      const style = window.getComputedStyle(commonParent);
      if (style.display === 'block' || style.display === 'flex' || style.display === 'grid') {
        return commonParent;
      }

      // If not a block element, try to find its closest block parent
      const blockParent = findClosestBlockParent(commonParent);
      if (blockParent) {
        return blockParent;
      }

      // If no block parent found, just return the common parent
      return commonParent;
    }

    currentB = currentB.parentElement;
  }

  // No common parent found (shouldn't happen as body/html would be common)
  return null;
}
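The function above calls a findClosestBlockParent helper that isn’t shown in this post. A minimal reconstruction (my assumption of what it does, not necessarily the original) would simply walk up until it hits a block-level ancestor:

function findClosestBlockParent(element: HTMLElement): HTMLElement | null {
  let current: HTMLElement | null = element.parentElement;

  while (current && current.tagName !== 'BODY' && current.tagName !== 'HTML') {
    const display = window.getComputedStyle(current).display;
    if (display === 'block' || display === 'flex' || display === 'grid') {
      return current;
    }
    current = current.parentElement;
  }

  // No block-level ancestor found below <body>
  return null;
}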

I knew that job boards do not share a common job posting page structure, but what I did not expect was how varied they are: depending on how a page is structured, highlighting the job posting text from the title to the end can end up fetching almost 90 percent of the page content.

Illustration:

<body>
  <h1 id="job-title">Senior Backend Engineer at Anthropic</h1>
  <!-- buttons, filters, and so on -->

  <!-- left sidebar -->

  <!-- scripts (yes, scripts) -->

  <!-- code comments (expected, we're all humans) -->

  <div id="job-description">
    Start of job description details
    <!-- right sidebar -->

    <!-- more job recommendations -->

    <!-- random empty html tags (p, div, hr, br) -->

    End of relevant job details
  </div>
</body>

So when the user highlights the job posting from top to bottom, as we expect them to, the background script will pick up too many irrelevant items. Trimming is needed.

This is what I ended up building:

function pruneJobAdHtml(container: HTMLElement): HTMLElement {
  // Create a deep clone of the element to avoid modifying the original DOM
  const clone = container.cloneNode(true) as HTMLElement;

  // 1. [Easy Part] Remove unwanted tags and all their descendants,
  // iterating over a TAGS_TO_REMOVE constant (an array of tag selectors);
  // this also includes stripping Tailwind and other styling classes
  clone.querySelectorAll(TAGS_TO_REMOVE.join(',')).forEach((el) => el.remove());

  // 2. Process images - replace them with text nodes
  // (I decided not to pick up any images for now)

  // 3. Remove HTML comments, no need to explain

  // 4. Remove empty elements and attributes

  // 5. Flatten redundant containers
  // 6. Convert to semantic HTML
  // 7. Merge adjacent text nodes

  return clone;
}

The trimming steps #5, #6, and #7 were added later as refinements to further shrink the “scraped” content and reduce the LLM’s input token consumption.
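To give an idea of what step #5 might look like, here is a simplified sketch (not the actual HuntingPad code) that collapses containers wrapping a single child element:

// A container is "redundant" when it wraps exactly one child element and adds no text of its own.
// Simplified sketch; a real version would also need to preserve meaningful attributes.
function flattenRedundantContainers(root: HTMLElement): void {
  let changed = true;
  while (changed) {
    changed = false;
    for (const el of Array.from(root.querySelectorAll('div, span, section'))) {
      const onlyChild = el.children.length === 1 ? el.children[0] : null;
      const hasOwnText = Array.from(el.childNodes).some(
        (n) => n.nodeType === Node.TEXT_NODE && (n.textContent ?? '').trim() !== ''
      );
      if (onlyChild && !hasOwnText && el.parentElement) {
        // Replace <div><p>text</p></div> with just <p>text</p>
        el.parentElement.replaceChild(onlyChild, el);
        changed = true;
      }
    }
  }
}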

From my observations when testing the effectiveness of the clipping, using both the selection function (findCommonParent) and the pruning above can reduce the content fed to the LLM by up to 80%.

Illustration of the full flow:

Flow of the job clipping feature in HuntingPad
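The backend side isn’t shown in this post, but the last step of that flow is essentially one LLM call that turns the pruned HTML into structured fields. A minimal sketch, assuming the OpenAI Node SDK and hypothetical field names (the actual HuntingPad prompt, provider, and schema may differ):

import OpenAI from 'openai';

const openai = new OpenAI();

// Hypothetical field names; the real HuntingPad schema isn't shown in the post
async function parseJobHtml(prunedHtml: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content:
          'Extract the job posting into JSON with keys: title, company, location, salary, employment_type, description. Use null for missing fields.',
      },
      { role: 'user', content: prunedHtml },
    ],
  });

  return JSON.parse(response.choices[0].message.content ?? '{}');
}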

⛳ Closing

This solution isn’t perfect. Users have to select text instead of just clicking a button. But it taught me something valuable: sometimes the best technical decision isn’t the most elegant one – it’s the one that ships.

By requiring users to select the job posting text, we eliminated webscraping costs entirely and reduced LLM token usage by 80%. The extension stays free for job seekers who need it most, and the parsing is actually more accurate because users naturally select only the relevant content.

HuntingPad sidepanel in Google Chrome

Are you building a Chrome extension right now? Or thinking of building one and don’t know where to start? What walls have you hit building browser extensions? Feel free to drop a comment.

PS. If any of you want to try the Pro version of HuntingPad, let me know in the comments and I will happily give you a discount.

