May 15, 2026 · 8 min read

How I built a crypto data pipeline for 200+ tokens using Spidra

Joel Olawanle

Comprehensive crypto data sounds like a solved problem until you actually need it.

CoinGecko and CoinMarketCap cover the major tokens well. But the moment you go beyond the top 100, coverage of smaller altcoins, newer listings, and niche DEX tokens gets patchy fast. Historical yield data, staking records, and token distribution events are either paywalled, incomplete, or simply not aggregated anywhere.

I needed a clean historical dataset covering yield and distribution history for 200+ tokens across multiple chains. No single API had it. The data existed across various public crypto information sites, structured and visible in any browser, but pulling it programmatically at scale meant either maintaining a brittle Puppeteer scraper, fighting bot detection, or writing a custom parser for every site.

I used Spidra instead. This is a full breakdown of how the pipeline came together.

Why Spidra

Most scraping tools return raw HTML and leave all the parsing to you. Spidra works differently. You provide a JSON Schema describing the exact output shape you want, point it at a URL, and it returns structured, typed data that matches that schema, with AI-powered extraction under the hood.

For crypto data specifically, this matters. Distribution and yield tables across crypto information sites are inconsistent. Dates appear in multiple formats. Some rows say "N/A", others say "-", and others are simply empty. Some sites show amounts with token symbols, others with USD equivalents, others with both. A traditional scraper needs custom parsing logic for every variation. With Spidra, I define what I want, and the AI handles the rest, regardless of how the source page is formatted.

The other capability I relied on heavily was Spidra's Batch Scrape API. A single POST /api/batch/scrape request accepts up to 50 URLs, processes them in parallel on Spidra's infrastructure, and returns a single batchId to poll for results. For a pipeline covering 200+ tokens, this was exactly the right primitive.
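To make the 50-URL limit concrete, here is a minimal sketch of splitting a URL list into batch-sized groups. The helper name `chunk` and the example.com URLs are illustrative assumptions, not part of Spidra's API; the count of 218 is the token total reported in the results.

```javascript
// Maximum URLs per batch request, as stated above.
const MAX_URLS_PER_BATCH = 50;

// Split an array into consecutive groups of at most `size` items.
function chunk(items, size = MAX_URLS_PER_BATCH) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Hypothetical URLs, for illustration only.
const urls = Array.from({ length: 218 }, (_, i) => `https://example.com/token/${i}`);
const batches = chunk(urls);

console.log(batches.length);                     // 5
console.log(batches[batches.length - 1].length); // 18
```

Four full groups of 50 plus one group of 18 gives five batch requests for 218 tokens.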

Defining the data model

Before writing any pipeline code, I defined exactly what a token distribution record should look like:

const DISTRIBUTION_SCHEMA = {
  type: "object",
  properties: {
    distributions: {
      type: "array",
      items: {
        type: "object",
        properties: {
          ex_date:          { type: ["string", "null"] },
          amount:           { type: ["string", "null"] },
          type:             { type: ["string", "null"] },
          payment_date:     { type: ["string", "null"] },
          yield:            { type: ["string", "null"] },
        },
        required: ["ex_date", "amount", "type", "payment_date", "yield"],
      },
    },
    token_symbol: { type: ["string", "null"] },
  },
  required: ["distributions", "token_symbol"],
};

I kept all fields as string | null rather than typed numbers or dates. Spidra extracts exactly what the page shows, such as "0.0042 ETH", "May 14, 2026", or "N/A", and I handle my own normalization downstream. Asking the AI to coerce types when source data is messy risks silent failures. Better to receive the raw string and parse it yourself.

Because I pass this schema on every request, the output shape is guaranteed. Fields listed as required are always present, returning null when a value cannot be found rather than being omitted. No defensive parsing logic, no guessing what shape came back.
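If you would rather verify that guarantee in a test than take it on trust, a small shape check is enough. This is a sketch only: `checkDistributionPayload` is a hypothetical helper, not something Spidra provides, and the sample payload is made up for illustration.

```javascript
// Field names copied from the schema above.
const RECORD_FIELDS = ["ex_date", "amount", "type", "payment_date", "yield"];

// The schema allows only strings or null for every leaf field.
function isStringOrNull(value) {
  return value === null || typeof value === "string";
}

// Returns a list of problems; an empty list means the payload has the expected shape.
function checkDistributionPayload(payload) {
  if (payload === null || typeof payload !== "object") {
    return ["payload is not an object"];
  }
  const problems = [];
  if (!("token_symbol" in payload) || !isStringOrNull(payload.token_symbol)) {
    problems.push("token_symbol must be a string or null");
  }
  if (!Array.isArray(payload.distributions)) {
    problems.push("distributions must be an array");
    return problems;
  }
  payload.distributions.forEach((record, index) => {
    if (record === null || typeof record !== "object") {
      problems.push(`distributions[${index}] is not an object`);
      return;
    }
    for (const field of RECORD_FIELDS) {
      if (!(field in record) || !isStringOrNull(record[field])) {
        problems.push(`distributions[${index}].${field} must be a string or null`);
      }
    }
  });
  return problems;
}

// Made-up payload for illustration; not real extraction output.
const sample = {
  token_symbol: "ABC",
  distributions: [
    { ex_date: "May 14, 2026", amount: "0.0042 ETH", type: null, payment_date: "N/A", yield: "-" },
  ],
};
console.log(checkDistributionPayload(sample)); // []
```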

Submitting a batch

With 200+ tokens to process, the Batch Scrape API is the right tool. One request, up to 50 URLs, all processed in parallel:

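// HTTP client used for all Spidra requests (CommonJS import assumed).
const axios = require("axios");

const SPIDRA_BASE = process.env.SPIDRA_BASE_URL || "https://api.spidra.io/api";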

async function submitBatch(urls) {
  const res = await axios.post(
    `${SPIDRA_BASE}/batch/scrape`,
    {
      urls,
      output: "json",
      schema: DISTRIBUTION_SCHEMA,
      extractContentOnly: true,
    },
    {
      headers: {
        "x-api-key": process.env.SPIDRA_API_KEY,
        "Content-Type": "application/json",
      },
      timeout: 30_000,
    }
  );
  return res.data.batchId;
}

The extractContentOnly: true flag tells Spidra to strip navigation, ads, and boilerplate before extraction, so only the main content reaches the AI. Crypto information pages tend to be heavy on sidebars, banners, and price tickers, and this flag made a noticeable difference in extraction accuracy.

Polling for results

Batch jobs are asynchronous. After submitting, I poll GET /api/batch/scrape/{batchId} until all items finish:

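const POLL_INTERVAL_MS = 20_000; // 20 seconds between polls
const POLL_TIMEOUT_MS  = 300_000; // 5 minutes max wait

// Promise-based delay used by the polling and retry loops.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));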

async function pollBatch(batchId) {
  const deadline = Date.now() + POLL_TIMEOUT_MS;

  while (Date.now() < deadline) {
    await sleep(POLL_INTERVAL_MS);

    const res = await axios.get(`${SPIDRA_BASE}/batch/scrape/${batchId}`, {
      headers: { "x-api-key": process.env.SPIDRA_API_KEY },
      timeout: 15_000,
    });

    const { status, items } = res.data;

    if (status === "completed") return { failed: false, items };
    if (status === "failed")    return { failed: true, reason: "batch_failed" };

    // still processing -- keep polling
  }

  return { failed: true, reason: "timeout" };
}

A 20-second poll interval worked well here. Because Spidra processes all URLs in parallel on its own infrastructure, polling more frequently while jobs are still running generates unnecessary requests without improving throughput. With the 5-minute timeout, each batch makes at most 15 status requests.
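The loop is easier to test when the status request and the delay are passed in, so a fake clock can stand in for real waiting. The sketch below assumes that shape: `pollUntilDone`, `fetchStatus`, and the "processing" status string are hypothetical names, and only "completed" and "failed" come from the code above.

```javascript
// Generic poll loop. `fetchStatus` resolves to { status, items }, mirroring the
// response fields used above; `sleep` resolves after the given milliseconds.
async function pollUntilDone(fetchStatus, { intervalMs, timeoutMs, sleep, now = Date.now }) {
  const deadline = now() + timeoutMs;
  while (now() < deadline) {
    await sleep(intervalMs);
    const { status, items } = await fetchStatus();
    if (status === "completed") return { failed: false, items };
    if (status === "failed") return { failed: true, reason: "batch_failed" };
    // Any other status is treated as "still running".
  }
  return { failed: true, reason: "timeout" };
}

// Demo with a fake clock: no real waiting and no network calls.
(async () => {
  let clock = 0;
  const statuses = ["processing", "processing", "completed"]; // assumed status names
  const result = await pollUntilDone(
    async () => ({ status: statuses.shift(), items: [] }),
    {
      intervalMs: 20_000,
      timeoutMs: 300_000,
      sleep: async (ms) => { clock += ms; },
      now: () => clock,
    }
  );
  console.log(result.failed, clock); // false 60000
})();
```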

Running the full pipeline

With the Batch API doing the heavy lifting, the orchestration logic stayed simple:

const BATCH_SIZE = 50;

async function run(tokens) {
  for (let i = 0; i < tokens.length; i += BATCH_SIZE) {
    const batch = tokens.slice(i, i + BATCH_SIZE);
    const urls  = batch.map(t => buildUrl(t.slug));

    const batchId = await submitBatchWithRetry(urls);
    const result  = await pollBatch(batchId);

    if (result.failed) {
      console.warn(`Batch failed: ${result.reason}`);
      continue;
    }

    for (const item of result.items) {
      const data = item.result?.content;
      if (!data || data.ai_extraction_failed) continue;
      await normaliseAndUpsert(data);
    }

    if (i + BATCH_SIZE < tokens.length) {
      await sleep(5_000);
    }
  }
}

The ai_extraction_failed check is worth calling out. When Spidra cannot extract structured data matching the schema (for example, because a page has no distribution history table at all), it sets that flag rather than returning empty or malformed data. That distinction matters: a token with no distribution history is different from a failed extraction, and the pipeline handles them differently. Failed extractions are skipped, while an empty history is still passed through to the database step.
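One way to make that distinction explicit is to classify each item before acting on it. This is a sketch, not the pipeline's actual code: `classifyItem`, `summarise`, and the three labels are hypothetical, and the item shape (item.result.content, ai_extraction_failed, distributions) is taken from the loop and schema above.

```javascript
// Classify one batch item as a failed extraction, an empty history, or real data.
function classifyItem(item) {
  const data = item?.result?.content;
  if (!data || data.ai_extraction_failed) return "extraction_failed";
  if (!Array.isArray(data.distributions) || data.distributions.length === 0) {
    return "no_history";
  }
  return "has_history";
}

// Tally a batch so failed extractions and empty histories are reported separately.
function summarise(items) {
  const counts = { extraction_failed: 0, no_history: 0, has_history: 0 };
  for (const item of items) counts[classifyItem(item)] += 1;
  return counts;
}

// Made-up items for illustration only.
const exampleItems = [
  { result: { content: { ai_extraction_failed: true } } },
  { result: { content: { token_symbol: "ABC", distributions: [] } } },
  { result: { content: { token_symbol: "XYZ", distributions: [{ ex_date: "2026-05-14" }] } } },
  {},
];
console.log(summarise(exampleItems));
// { extraction_failed: 2, no_history: 1, has_history: 1 }
```

Logging the tally per batch makes it easy to see which tokens deserve a second look in a recovery pass.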

I also added a thin retry wrapper for the submission step to handle 429 responses gracefully:

async function submitBatchWithRetry(urls, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await submitBatch(urls);
    } catch (err) {
      const is429 = err?.response?.status === 429;
      if (is429 && attempt < maxAttempts) {
        console.warn(`Rate limited -- waiting 60s (attempt ${attempt}/${maxAttempts})...`);
        await sleep(60_000);
      } else {
        throw err;
      }
    }
  }
}
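The wrapper above waits a fixed 60 seconds after every 429. A common alternative, which is not what this pipeline used, is exponential backoff with a cap. The sketch below shows only the delay calculation; `backoffDelayMs`, `withJitter`, and the 5-second base are illustrative choices.

```javascript
// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped at maxMs.
function backoffDelayMs(attempt, baseMs = 5_000, maxMs = 60_000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

// Optional "full jitter": pick a random delay between 0 and the computed ceiling,
// so several clients that were limited together do not retry in lockstep.
function withJitter(delayMs, random = Math.random) {
  return Math.floor(random() * delayMs);
}

for (let attempt = 1; attempt <= 5; attempt++) {
  console.log(attempt, backoffDelayMs(attempt));
}
// 1 5000
// 2 10000
// 3 20000
// 4 40000
// 5 60000
```

To try it, the fixed `sleep(60_000)` in the wrapper would become `sleep(backoffDelayMs(attempt))`. Whether shorter early waits are appropriate depends on how Spidra's rate limit window behaves, which I have not verified.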

Making the pipeline idempotent

Any data pipeline running at this scale needs to be safe to re-run. I built two layers of idempotency into this one.

The first is a --resume flag that skips tokens already present in the database:

async function getAlreadyScraped() {
  // Column name matches the INSERT statement used for upserts (symbol).
  const rows = await db.query("SELECT DISTINCT symbol FROM distribution_history");
  return new Set(rows.map(r => r.symbol));
}

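if (args.resume) {
  const done = await getAlreadyScraped();
  const before = tokens.length;
  tokens = tokens.filter(t => !done.has(t.symbol));
  // Report how many tokens were actually filtered out, not how many are in the database.
  console.log(`Skipping ${before - tokens.length} already-scraped tokens`);
}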
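The snippet above reads `args.resume` without showing where `args` comes from. One way to produce it is Node's built-in `util.parseArgs` (available in Node 18.3 and later). This is a sketch; the wrapper name `parseCliArgs` is hypothetical.

```javascript
const { parseArgs } = require("node:util");

// Parse CLI flags. `argv` defaults to the arguments after `node script.js`.
function parseCliArgs(argv = process.argv.slice(2)) {
  const { values } = parseArgs({
    args: argv,
    options: {
      resume: { type: "boolean" }, // --resume: skip tokens already in the database
    },
  });
  return { resume: values.resume === true };
}

console.log(parseCliArgs(["--resume"])); // { resume: true }
console.log(parseCliArgs([]));           // { resume: false }
```

By default `parseArgs` runs in strict mode, so an unrecognised flag such as `--resum` throws instead of being silently ignored.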

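The second is ON DUPLICATE KEY UPDATE (MySQL syntax) at the database level, so running the same data through twice never creates duplicates. This only works if the table has a primary or unique key over the columns that identify a record, for example (token_id, ex_date); without one, every insert adds a new row: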

INSERT INTO distribution_history
  (token_id, symbol, ex_date, amount, type, payment_date, yield)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
  amount       = VALUES(amount),
  type         = VALUES(type),
  payment_date = VALUES(payment_date),
  yield        = VALUES(yield),
  updated_at   = CURRENT_TIMESTAMP
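The effect of that statement can be modelled in a few lines with an in-memory map. This is an illustration of upsert semantics only, not database code. It assumes (token_id, ex_date) is the unique key, which the article does not state, and the rows are made up.

```javascript
// In-memory stand-in for the table, keyed the way a unique index would be.
// ASSUMPTION: (token_id, ex_date) uniquely identifies a distribution record.
function upsert(table, row) {
  const key = JSON.stringify([row.token_id, row.ex_date]);
  const existing = table.get(key);
  // Insert when the key is new; otherwise overwrite the non-key columns.
  table.set(key, { ...existing, ...row });
  return table;
}

const table = new Map();
const row = { token_id: 1, symbol: "ABC", ex_date: "2026-05-14", amount: 0.0042 };

upsert(table, row);
upsert(table, row);                       // same data again: no new row
upsert(table, { ...row, amount: 0.005 }); // same key, new amount: row is updated

console.log(table.size);                  // 1
console.log([...table.values()][0].amount); // 0.005
```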

Spidra also provides a scrape-logs API that gives a server-side audit trail of every completed job. If a client-side poll window closes before a batch finishes, the logs API lets you retrieve those results without re-submitting the job and spending additional credits:

GET /api/scrape-logs?status=success&dateStart=YYYY-MM-DD&limit=100
GET /api/scrape-logs/{uuid}
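A small helper keeps the list query readable. This sketch uses only the query parameters shown above (status, dateStart, limit). The function name `buildScrapeLogsUrl` is hypothetical, the date is an example value, and the response format is not covered here.

```javascript
// Build the scrape-logs list URL from the query parameters shown above.
function buildScrapeLogsUrl(baseUrl, { status, dateStart, limit } = {}) {
  const params = new URLSearchParams();
  if (status) params.set("status", status);
  if (dateStart) params.set("dateStart", dateStart);
  if (limit !== undefined) params.set("limit", String(limit));
  const query = params.toString();
  return query ? `${baseUrl}/scrape-logs?${query}` : `${baseUrl}/scrape-logs`;
}

console.log(
  buildScrapeLogsUrl("https://api.spidra.io/api", {
    status: "success",
    dateStart: "2026-05-14", // example date
    limit: 100,
  })
);
// https://api.spidra.io/api/scrape-logs?status=success&dateStart=2026-05-14&limit=100
```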

I ran a recovery pass after the main scrape to catch any gaps. The combination of --resume, idempotent upserts, and the logs API means the pipeline can be stopped and restarted at any point without losing data or duplicating work.

Parsing raw strings

Spidra returns values exactly as they appear on the page. Dates come in formats like "May 14, 2026", "14/05/2026", or "2026-05-14". Amounts come with token symbols, commas, or USD equivalents. A simple parser handles all the known variations:

const MONTH_MAP = {
  Jan: 1, Feb: 2, Mar: 3, Apr: 4, May: 5,  Jun: 6,
  Jul: 7, Aug: 8, Sep: 9, Oct: 10, Nov: 11, Dec: 12,
};

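// Placeholders the source sites use for "no value"; non-strings count as missing too.
function isMissing(str) {
  if (typeof str !== "string") return true;
  const s = str.trim().toLowerCase();
  return s === "" || s === "-" || s === "n/a";
}

function parseDate(str) {
  if (isMissing(str)) return null;
  str = str.trim();

  // ISO: 2026-05-14
  if (/^\d{4}-\d{2}-\d{2}$/.test(str)) return str;

  // "May 14, 2026", any letter case, also full names such as "September 14, 2026"
  const m1 = str.match(/\b([A-Za-z]{3})[A-Za-z]*\.?\s+(\d{1,2}),?\s+(\d{4})/);
  if (m1) {
    const key = m1[1][0].toUpperCase() + m1[1].slice(1).toLowerCase();
    const month = MONTH_MAP[key];
    if (month) return `${m1[3]}-${String(month).padStart(2, "0")}-${m1[2].padStart(2, "0")}`;
  }

  // DD/MM/YYYY (day first; a US-style MM/DD/YYYY source would need its own branch)
  const m2 = str.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
  if (m2) return `${m2[3]}-${m2[2].padStart(2, "0")}-${m2[1].padStart(2, "0")}`;

  return null;
}

function parseDecimal(str) {
  if (isMissing(str)) return null;
  // Take the first number only, so "0.0042 ETH ($12.50)" yields 0.0042
  // rather than the digits of both amounts run together.
  const m = str.match(/-?\d[\d,]*(?:\.\d+)?|-?\.\d+/);
  if (!m) return null;
  const val = parseFloat(m[0].replace(/,/g, ""));
  return Number.isFinite(val) ? val : null;
}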

The rule throughout: always return null for missing or unparseable values, never throw. A bad value in one row should not abort processing the remaining records for that token.
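That rule can be enforced in one place by wrapping every parser call. The sketch below assumes the parsers are passed in, so it stays self-contained; `safely` and `normaliseRecord` are hypothetical names, and the stub parsers in the usage example exist only to demonstrate the behaviour.

```javascript
// Run a parser and convert any thrown error or undefined result into null.
function safely(parser, value) {
  try {
    const result = parser(value);
    return result === undefined ? null : result;
  } catch {
    return null;
  }
}

// Convert one raw record (strings or nulls) into typed columns.
// `parseDate` and `parseDecimal` are injected rather than imported.
function normaliseRecord(raw, { parseDate, parseDecimal }) {
  const type = typeof raw.type === "string" ? raw.type.trim() : "";
  return {
    ex_date: safely(parseDate, raw.ex_date),
    amount: safely(parseDecimal, raw.amount),
    type: type === "" ? null : type,
    payment_date: safely(parseDate, raw.payment_date),
    yield: safely(parseDecimal, raw.yield),
  };
}

// Stub parsers for illustration: one well-behaved, one that throws on bad input.
const stubParsers = {
  parseDate: (s) => (/^\d{4}-\d{2}-\d{2}$/.test(s) ? s : null),
  parseDecimal: (s) => {
    const n = Number(s);
    if (Number.isNaN(n)) throw new Error("not a number");
    return n;
  },
};

console.log(
  normaliseRecord(
    { ex_date: "2026-05-14", amount: "oops", type: " staking ", payment_date: null, yield: "4.2" },
    stubParsers
  )
);
// { ex_date: '2026-05-14', amount: null, type: 'staking', payment_date: null, yield: 4.2 }
```

The bad amount becomes null instead of aborting the record, and the remaining fields are still parsed.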

Results

After the main run plus one recovery pass:

Metric                                 Value
Tokens processed                       218
Total distribution records             2,300+
Tokens with no distribution history    21
Errors                                 0
Total Spidra batch jobs                5
Time to complete                       ~2 hours

The 21 tokens with no records are not failures. They genuinely have no distribution history on the source sites. The pipeline correctly returns distributions: [] for them rather than erroring.

What made this work

  • Schema-first extraction. Defining the output shape before writing any pipeline code meant there was no downstream conditional parsing logic. Every response conforms to the same schema. One normalisation pass works uniformly across all 218 tokens.
  • Batch API for scale. Using POST /api/batch/scrape collapsed what would otherwise be a complex concurrency problem (tracking dozens of individual job IDs, handling partial failures, coordinating timeouts) into a single submission and a single poll loop per batch.
  • extractContentOnly on data-heavy pages. Enabling this flag strips navigation, banners, and price tickers before extraction. On crypto information sites with dense page layouts, it makes a meaningful difference to what the AI focuses on.
  • Null over throw. Every parser returns null on failure rather than throwing. This keeps bad rows from interrupting the broader run and makes the output predictable for anything consuming the database.

Get started free at spidra.io.
