JavaScript Document Extractors with Sitecore Search
Recently I was setting up an API crawler in Sitecore Search against an existing API. Everything looked right and the endpoint worked, but the crawler simply refused to pull in any data.
Turns out, the problem wasn’t with authentication or the field mapping configuration. It was just that the API didn’t return the top-level array that Sitecore Search expects.
What Sitecore Search Expects
The standard extractor is designed for simple responses like this:
[
  {
    "id": 1,
    "title": "Item One"
  },
  {
    "id": 2,
    "title": "Item Two"
  }
]
The default document extractor doesn’t know how to traverse deeper paths or unwrap objects. That’s fine if your API is flat, but most modern APIs wrap data for pagination, metadata, or GraphQL-style responses.
Arrays in Objects
Sitecore Search makes it simple to crawl data from an API: you point it at an endpoint, define your extractor, and let it build out your index. It’s great for structured data, especially if your response is a clean JSON array.
But if your API response looks like this:
{
  "data": {
    "items": [
      {
        "id": 1,
        "title": "Item One"
      },
      {
        "id": 2,
        "title": "Item Two"
      }
    ]
  }
}
where the array you want is nested inside a JSON object, it won’t work. The standard document extractor expects an array at the root of the response, not one buried inside another object like data.items.
Enter the JavaScript Document Extractor
If your response isn’t structured as a top-level array, switch your crawler’s document extractor type to JavaScript. This gives you the flexibility to reshape the data however you need before it’s indexed.
Here’s an example extractor that handles a nested array:
function extractDocuments(response) {
  // Parse the raw response body into JSON
  const json = JSON.parse(response.text());

  // Dig into the nested data.items array, falling back to an empty array if it's missing
  const items = json.data?.items || [];

  // Map each item to the document shape the index expects
  return items.map((item) => ({
    id: item.id,
    title: item.title,
    // Build a URL-friendly slug from the title
    url: item.title
      .toLowerCase()
      .trim()
      .replace(/[^a-z0-9]+/g, "-")
      .replace(/^-+|-+$/g, ""),
  }));
}
This script parses the response, digs into data.items, and returns a flattened list of documents ready for indexing. You can adjust the mapping logic for your specific fields or even merge multiple values if your data model needs it.
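For instance, if your items carried both a description and a summary field (hypothetical names here, not part of the API above), the mapping step could combine them into a single indexed value. A minimal sketch:

// Sketch only: "description" and "summary" are assumed field names
return items.map((item) => ({
  id: item.id,
  title: item.title,
  // Merge two source fields into one value, skipping whichever is missing
  content: [item.description, item.summary].filter(Boolean).join(" "),
}));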
If your API uses pagination, you can also handle that within the crawler or in your script by checking for a next link or cursor.
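The exact wiring depends on how your crawler requests subsequent pages, but the detection part is plain JavaScript. As a rough sketch, assuming the API returned a hypothetical data.nextCursor value alongside the items:

// Sketch only: "nextCursor" is an assumed field name, not part of the API above
function extractDocuments(response) {
  const json = JSON.parse(response.text());
  const items = json.data?.items || [];

  // A missing cursor indicates the last page; a present one means the next
  // request should be configured to ask for the page this cursor points to
  const nextCursor = json.data?.nextCursor || null;

  return items.map((item) => ({
    id: item.id,
    title: item.title,
  }));
}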
The default extractor is great for straightforward feeds, but the JavaScript extractor gives you the control you need for real-world data structures.