This content originally appeared on DEV Community and was authored by Sebastián Aliaga
When working with Sitecore Search, the foundation of a good experience is how well your crawler is configured. If the crawler doesn’t discover or properly extract your content, the rest of your search implementation (facets, sorting, widgets, SDK calls) will fall apart.
In this article, we’ll go through the Advanced Web Crawler step by step:
- Configure a Sitemap trigger to start crawling.
- Use Document extractors (XPath, CSS, JS, JSON-LD) to pull structured data from each page.
- Map those fields into Attributes that will power search, filtering, and sorting.
Step 1: Configure a Sitemap Trigger
A trigger is how Sitecore Search discovers new content. The Sitemap trigger is the most common because it gives the crawler a structured list of URLs to fetch.
- Go to Sources → [Your Source] → Triggers → Add Trigger.
- Select Sitemap as the trigger type.
- Add your sitemap URL (e.g., https://www.example.com/sitemap.xml).
- (Optional) Adjust include/exclude rules if your sitemap covers areas you don’t want indexed.
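Under the hood, the trigger simply walks the sitemap's <loc> entries to build its URL list. As a rough illustration of what the crawler sees (plain JavaScript, not Sitecore code; the function name and sample sitemap are made up for this sketch):

```javascript
// Sketch only: pulls <loc> URLs out of a sitemap XML string,
// mimicking the list the Sitemap trigger hands to the crawler.
function extractSitemapUrls(sitemapXml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match;
  while ((match = locPattern.exec(sitemapXml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products/shoes</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sample));
// → ['https://www.example.com/', 'https://www.example.com/products/shoes']
```

A real sitemap parser should use an XML library rather than a regex, but the shape of the data is the same: one URL per <loc> entry.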
Step 2: Define Attributes
Attributes are the fields Sitecore Search will index and make available for search, filters, facets, and sorting. They act as the schema of your search index.
- Go to Sources → [Your Source] → Attributes.
- Click Add Attribute.
- For each attribute, define:
- Display name (friendly name in the UI).
- Attribute name (the key you’ll map in extractors and use in queries).
- Data type (string, boolean, array, date, etc.).
- Whether it should:
- Return in API response
- Be available for filtering or facets
- Be sortable
Example attributes might include:
- title → searchable
- description → searchable
- date → sortable
- category → facetable
- language → filterable
This is where you define the “contract” for what data should be stored and how it will be used in queries and widgets.
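That “contract” idea can be sketched as data. The schema object below is purely illustrative (it mirrors the example attributes above; it is not a Sitecore Search API object), and a small check verifies that an extracted document honors it:

```javascript
// Illustrative only: a hand-written mirror of the attribute schema.
const attributeSchema = [
  { name: 'title',       type: 'string', searchable: true },
  { name: 'description', type: 'string', searchable: true },
  { name: 'date',        type: 'string', sortable: true },
  { name: 'category',    type: 'array',  facetable: true },
  { name: 'language',    type: 'string', filterable: true },
];

// Returns a list of type mismatches between a document and the schema.
function validateDocument(doc, schema) {
  const errors = [];
  for (const attr of schema) {
    const value = doc[attr.name];
    if (value === undefined) continue; // missing attributes are allowed here
    if (attr.type === 'array' && !Array.isArray(value)) {
      errors.push(`${attr.name} should be an array`);
    } else if (attr.type === 'string' && typeof value !== 'string') {
      errors.push(`${attr.name} should be a string`);
    }
  }
  return errors;
}

const doc = { title: 'Running Shoes', category: ['shoes', 'sale'], date: '2024-05-01' };
console.log(validateDocument(doc, attributeSchema)); // → []
```

Keeping a checked copy of the schema like this makes it obvious when an extractor drifts from what the index expects, e.g. returning a pipe-delimited string where a facetable array is needed.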
Step 3: Define Document Extractors
Now that the crawler knows what pages to fetch, we need to tell it what to extract from each page.
Types of Extractors
Sitecore Search supports different types of document extractors depending on how your content is structured:
- XPath selectors → great for structured HTML elements like <title> or <meta> tags.
- CSS selectors → useful for targeting content by class names or IDs.
- JavaScript extractors → advanced; they let you run custom logic to decide what to index.
- JSON-LD extractors → pull structured data already embedded in the page as JSON-LD.
Example: JavaScript Extractor
Here’s a real-world example of a JavaScript extractor. This script checks whether a page is marked as searchable via a <meta> tag, then collects metadata, product/event/article details, and page content into fields that will be indexed:
```javascript
function extract(request, response) {
    $ = response.body;

    // Only index pages that are marked as searchable
    const isSearchable = $('meta[name="searchable"]').attr('content') === 'true';
    if (!isSearchable) {
        return []; // Skip this page
    }

    return [{
        // Primary fields
        name: $('meta[name="name"]').attr('content'),
        navigation_title: $('meta[name="navigation_title"]').attr('content'),
        browser_title: $('meta[name="browser_title"]').attr('content') || $('title').text(),
        breadcrumbs_title: $('meta[name="breadcrumbs_title"]').attr('content'),
        keywords: $('meta[name="keywords"]')?.attr('content')?.split('|'),
        teaser: $('meta[name="teaser"]').attr('content') || $('meta[name="description"]').attr('content'),
        template: $('meta[name="template"]').attr('content'),
        image: $('meta[name="image"]').attr('content'),
        lead_in: $('meta[name="lead_in"]').attr('content'),
        masthead_description: $('meta[name="masthead_description"]').attr('content'),

        // Product details
        pdp_product_name: $('meta[name="pdp_product_name"]').attr('content'),
        pdp_product_style: $('meta[name="pdp_product_style"]').attr('content'),
        pdp_product_description: $('meta[name="pdp_product_description"]').attr('content'),
        product_category: $('meta[name="product_category"]')?.attr('content')?.split('|'),
        product_type: $('meta[name="product_type"]')?.attr('content')?.split('|'),
        product_collection: $('meta[name="product_collection"]')?.attr('content')?.split('|'),

        // Event details
        event_full_date: $('meta[name="event_full_date"]').attr('content'),
        event_date: $('meta[name="event_date"]').attr('content'),
        event_location: $('meta[name="event_location"]').attr('content'),

        // Article details
        article_full_date: $('meta[name="article_full_date"]').attr('content'),
        article_topic: $('meta[name="article_topic"]').attr('content'),
        article_date: $('meta[name="article_date"]').attr('content'),

        // Additional metadata
        language: $('meta[name="language"]').attr('content'),
        url: request.url || $('meta[property="og:url"]').attr('content'),

        // Page content
        page_full_content: $('p').text(),
        page_headers_h1: $('h1').text(),
        page_headers_h2: $('h2').text(),
        page_headers_h3: $('h3').text(),
    }];
}
```
This pattern gives you fine-grained control:
- Exclude pages dynamically.
- Extract different metadata depending on page type (product, event, article).
- Return arrays for multi-valued fields like categories or tags.
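Extractor logic like this can also be smoke-tested outside Sitecore by stubbing the $ selector. The stub below is hypothetical (in Sitecore Search, response.body is a Cheerio-like object with a richer API), and the extractor is trimmed to the gating and multi-value logic:

```javascript
// Minimal stand-in for the Cheerio-like `$`, backed by a plain
// map of selector → content. For offline smoke tests only.
function makeStub(metaValues) {
  return (selector) => ({
    attr: () => metaValues[selector],
    text: () => metaValues[selector] || '',
  });
}

// Trimmed-down version of the extractor: searchable gate + pipe-split field.
function extractWith($) {
  const isSearchable = $('meta[name="searchable"]').attr('content') === 'true';
  if (!isSearchable) return [];
  return [{
    name: $('meta[name="name"]').attr('content'),
    keywords: $('meta[name="keywords"]')?.attr('content')?.split('|'),
  }];
}

const $ = makeStub({
  'meta[name="searchable"]': 'true',
  'meta[name="name"]': 'Trail Shoes',
  'meta[name="keywords"]': 'shoes|trail|outdoor',
});
console.log(extractWith($));
// → [{ name: 'Trail Shoes', keywords: ['shoes', 'trail', 'outdoor'] }]
```

Running the same function against a stub with no searchable meta tag returns an empty array, which is exactly how the real extractor skips pages.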
Validation: Crawl & Verify
- Run your crawler manually the first time.
- Go to Content Collections.
- Open a few sample documents and check that the attributes are populated correctly.
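When spot-checking documents in Content Collections, it can help to codify which attributes must be populated. A small sketch (the attribute names mirror the extractor above; the helper itself is hypothetical, not a Sitecore API):

```javascript
// Returns the names of required attributes that are empty or missing
// on a document copied out of Content Collections.
function missingAttributes(doc, required) {
  return required.filter((name) => {
    const value = doc[name];
    if (value === undefined || value === null) return true;
    if (typeof value === 'string') return value.trim() === '';
    if (Array.isArray(value)) return value.length === 0;
    return false;
  });
}

const sampleDoc = { name: 'Trail Shoes', teaser: '', language: 'en' };
console.log(missingAttributes(sampleDoc, ['name', 'teaser', 'language', 'url']));
// → ['teaser', 'url']
```

An empty result means the document satisfies the contract you defined in Step 2; anything listed points back to a selector in the extractor that needs attention.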
Key Takeaways
- The Sitemap trigger ensures you have a complete list of URLs.
- Document extractors pull structured fields like title, body, tags, and publish date.
- Attributes map those fields into Sitecore Search so they can power search, filtering, and sorting.
- A clean extractor + attribute setup = a solid foundation for widgets and SDK queries later.
Next up in this series: Crawling large sites with Sitemap Index triggers and rules.