About two weeks ago, we announced the launch of crawling inside the Spidra Playground. Yesterday, we shipped automatic captcha solving and stealth mode to make crawling work reliably on protected and JavaScript-heavy sites.
Today, we are happy to announce a comprehensive set of crawling endpoints in the Spidra API, enabling you to run, track, and manage crawl jobs directly from your own code.
If you prefer automation over UI, this release is for you.
Note: To use the Spidra API, you need a valid API key. You can create and manage your API keys from your dashboard. All API requests are authenticated using the x-api-key header.
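If you plan to script against the API, it can help to attach that header once to a reusable client. Here is a minimal Python sketch, assuming the requests library is installed; the curl examples below pass the same header explicitly on each call.
import requests

API_KEY = "YOUR_API_KEY"  # create and manage keys from your dashboard

# Attach the x-api-key header once; every request made with this session is authenticated.
session = requests.Session()
session.headers.update({"x-api-key": API_KEY})

# The session can now call any endpoint under https://api.spidra.io/api.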
Submitting a crawl job
Crawling begins by submitting a job.
This endpoint mirrors what happens in the Playground. You provide a starting URL, describe which pages should be discovered, and define how content should be extracted from each page. Spidra handles discovery, navigation, captcha solving, and transformation automatically.
curl --request POST \
--url https://api.spidra.io/api/crawl \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY' \
--data '{
"baseUrl": "https://example.com/blog",
"crawlInstruction": "Crawl all blog post pages",
"transformInstruction": "Extract title, author, date, and content from each post",
"maxPages": 5,
"useProxy": false
}'
The response confirms the job is queued and returns a job ID.
{
"status": "queued",
"jobId": "abc-123",
"message": "Crawl job queued. Poll /api/crawl/abc-123 for results."
}
From this point on, the crawl runs asynchronously. You can submit the job and come back later to inspect progress or results.
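The same call works from code. Here is a minimal Python sketch of the submit request, assuming the requests library; the payload mirrors the curl example above.
import requests

API_KEY = "YOUR_API_KEY"

# Submit the crawl job; requests serializes the JSON body and sets Content-Type for us.
response = requests.post(
    "https://api.spidra.io/api/crawl",
    headers={"x-api-key": API_KEY},
    json={
        "baseUrl": "https://example.com/blog",
        "crawlInstruction": "Crawl all blog post pages",
        "transformInstruction": "Extract title, author, date, and content from each post",
        "maxPages": 10,
        "useProxy": False,
    },
)

# The returned jobId is what you poll later for progress and results.
job_id = response.json()["jobId"]
print(job_id)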
Full parameter details and edge cases are documented in the crawling section of the API reference, where you can also test requests using our interactive docs.
Checking crawl job status
Once a crawl is running, the job status endpoint becomes the primary way to understand what is happening.
curl --request GET \
--url https://api.spidra.io/api/crawl/abc-123 \
--header 'x-api-key: YOUR_API_KEY'
This endpoint serves two purposes. While the crawl is active, it reports progress such as how many pages have been visited so far. When the crawl completes, it returns the extracted results for each page.
{
"status": "completed",
"progress": {
"message": "Crawl complete",
"pagesCrawled": 5,
"maxPages": 10
},
"result": [
{
"url": "https://example.com/blog/post-1",
"title": "First Post",
"status": "success",
"data": {
"title": "First Post",
"author": "John",
"date": "2025-01-01"
}
}
],
"error": null
}
Each page is returned as a separate object. This makes it easy to reason about partial successes and failures without rerunning the entire crawl.
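Because crawls run asynchronously, a common pattern is to poll this endpoint until the job settles. A minimal Python sketch, assuming the requests library; the response shape follows the example above, and the five-second interval is an arbitrary choice.
import time
import requests

API_KEY = "YOUR_API_KEY"
JOB_ID = "abc-123"

while True:
    job = requests.get(
        f"https://api.spidra.io/api/crawl/{JOB_ID}",
        headers={"x-api-key": API_KEY},
    ).json()

    if job["status"] == "completed":
        # One object per crawled page, as in the example response above.
        for page in job["result"]:
            print(page["url"], page["status"])
        break

    if job.get("error"):
        # A populated error field indicates the job will not finish normally.
        print("Crawl failed:", job["error"])
        break

    # Still queued or running: report progress, then wait before polling again.
    progress = job.get("progress") or {}
    print(progress.get("message"), progress.get("pagesCrawled"))
    time.sleep(5)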
Getting crawled pages
For workflows that need more control, crawled pages can also be fetched independently.
curl --request GET \
--url https://api.spidra.io/api/crawl/abc-123/pages \
--header 'x-api-key: YOUR_API_KEY'
This endpoint is useful when you want to process results incrementally, retry individual pages, or store page-level data in your own systems. It exposes the same page objects used internally by the job status endpoint, without coupling them to job progress.
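For example, a short Python sketch, assuming the requests library, that pulls the page objects and separates successful extractions from pages you might want to retry; field names follow the example response below.
import requests

API_KEY = "YOUR_API_KEY"
JOB_ID = "abc-123"

pages = requests.get(
    f"https://api.spidra.io/api/crawl/{JOB_ID}/pages",
    headers={"x-api-key": API_KEY},
).json()["pages"]

# Keep the extracted data from successful pages and collect failures for a targeted retry.
extracted = [page["data"] for page in pages if page["status"] == "success"]
failed = [page["url"] for page in pages if page["status"] != "success"]

print(f"{len(extracted)} pages extracted, {len(failed)} to retry")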
Example response:
{
"pages": [
{
"id": "page-1",
"url": "https://example.com/blog/post-1",
"title": "First Post",
"status": "success",
"data": {
"title": "First Post",
"author": "John",
"date": "2025-01-01"
},
"error_message": null,
"created_at": "2025-12-17T15:00:00Z"
}
]
}
Getting crawl job details
Sometimes you need to understand how a crawl was configured rather than what it returned.
The job details endpoint exposes the original instructions, limits, and usage information associated with a crawl job.
curl --request GET \
--url https://api.spidra.io/api/crawl/job/abc-123 \
--header 'x-api-key: YOUR_API_KEY'
Example response:
{
"id": "abc-123",
"base_url": "https://example.com",
"crawl_instruction": "Crawl all blog post pages",
"transform_instruction": "Extract title, author, date, and content",
"max_pages": 10,
"pages_crawled": 5,
"status": "completed",
"created_at": "2025-12-17T15:00:00Z",
"updated_at": "2025-12-17T15:05:00Z",
"credits_used": 25
}
This is helpful for auditing, debugging, and tracking how crawling fits into your overall usage.
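As a sketch, an audit step in Python (assuming the requests library) might pull the details and log how the crawl was configured and what it cost.
import requests

API_KEY = "YOUR_API_KEY"
JOB_ID = "abc-123"

job = requests.get(
    f"https://api.spidra.io/api/crawl/job/{JOB_ID}",
    headers={"x-api-key": API_KEY},
).json()

# Record the original configuration alongside what the crawl actually used.
print("Instruction:", job["crawl_instruction"])
print("Pages crawled:", job["pages_crawled"], "of", job["max_pages"])
print("Credits used:", job["credits_used"])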
Listing crawl history
To support longer-running or repeated workflows, we also added an endpoint for listing crawl history.
curl --request GET \
--url 'https://api.spidra.io/api/crawl/history?page=1&limit=10' \
--header 'x-api-key: YOUR_API_KEY'
This endpoint returns a paginated view of past crawl jobs along with basic status and usage information.
{
"jobs": [
{
"id": "abc-123",
"base_url": "https://example.com",
"status": "completed",
"max_pages": 10,
"pages_crawled": 8,
"created_at": "2025-12-17T15:00:00Z",
"credits_used": 25
}
],
"total": 15,
"page": 1,
"totalPages": 2
}
It is useful for monitoring activity, understanding credit usage, and building simple reporting around your crawls.
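For example, a Python sketch, assuming the requests library, that walks the paginated history and totals credit usage across all jobs.
import requests

API_KEY = "YOUR_API_KEY"

total_credits = 0
page = 1
while True:
    history = requests.get(
        "https://api.spidra.io/api/crawl/history",
        params={"page": page, "limit": 10},
        headers={"x-api-key": API_KEY},
    ).json()

    for job in history["jobs"]:
        total_credits += job["credits_used"]

    # totalPages tells us when the last page of history has been read.
    if page >= history["totalPages"]:
        break
    page += 1

print("Credits used across", history["total"], "crawl jobs:", total_credits)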
Additional query parameters are available for pagination and filtering. You can find the full list of supported parameters in the API documentation, where you can also test different combinations using the interactive docs.
Same engine as the Playground
The API uses the same crawling engine as the Playground. That includes automatic captcha handling, optional stealth mode with proxy rotation, and retry logic for failed pages.
Check out the Spidra API today.
