Lead generation isn't going anywhere. Every business, at some point, needs a pipeline of new customers, and building that list manually is the part nobody wants to do.
In this guide, we'll build a TypeScript pipeline that pulls business listings from Yellow Pages across multiple pages, writes them to a CSV file, and attempts to enrich each lead by visiting their website to find an email address or contact name.
We'll show you exactly what we got when we ran this against the live site, why the numbers look the way they do, and how to handle the mess gracefully.
Here's what the output actually looked like when we ran it against New York plumbers:
name,phone,address,website,email,contactPerson
Aladdin Plumbing Corp.,(347) 395-4715,"55 Garnet Street, Kings County, NY 11231",aladdinplumbingcorp.com,,Gerald "Jerry" Gitli
All Pro Cleaning & Restoration,(844) 574-1739,"13 Haven St, Elmsford, NY 10523",allprorestoration.com,[email protected],
Fred Smith Plumbing & Heating Co.,(212) 744-1300,"1674 1st Ave, New York, NY 10128",fredsmithplumbing.com,,
Lopez Plumbing & Heating Mechanical,(646) 765-6232,"530 W 144th St, New York, NY 10031",lopezplumbing.com,,
Getting started
To get started, we use the Spidra Node SDK. Install it with:
npm install spidra
Then grab an API key from the Spidra dashboard under Settings > API Keys and export it:
export SPIDRA_API_KEY=spd_your_key_here
You can set up your project however you like, but here's what mine looks like:
lead-gen/
├── package.json
├── scrape.ts ← pulls listings from Yellow Pages
├── enrich.ts ← visits each website looking for email/contact
└── leads/
    ├── raw.csv
    └── enriched.csv
Scraping the directory
Yellow Pages serves different content depending on the source of the request. If you hit it from outside the US or from a data center IP, you get a bot-protection page instead of listings.
The fix is routing through a US residential proxy, which the Spidra SDK handles with two options:
useProxy: true,
proxyCountry: "us",proxyCountry: "us" is what matters here. Yellow Pages checks the origin, and if it's not a US IP, you get the verification wall. Without useProxy, you're sending the request from a data center IP, which most large directories flag by default.
Defining what we want
Rather than asking the AI to extract whatever it thinks is useful, we define exactly what we want using a JSON Schema. When you pass a schema, Spidra enforces the structure, and missing fields return as null rather than letting the model guess or skip them entirely.
interface Lead {
name: string;
phone: string | null;
address: string | null;
website: string | null;
rating: number | null;
reviewCount: number | null;
yearsInBusiness: string | null;
category: string | null;
}
const leadSchema = {
type: "object",
properties: {
businesses: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
phone: { type: ["string", "null"] },
address: { type: ["string", "null"] },
website: { type: ["string", "null"] },
rating: { type: ["number", "null"] },
reviewCount: { type: ["number", "null"] },
yearsInBusiness: { type: ["string", "null"] },
category: { type: ["string", "null"] },
},
required: ["name"],
},
},
},
};
The required: ["name"] part is worth noting. Directory pages sometimes include sponsored placeholders or half-rendered ad slots that appear as real listings in the HTML. Without that requirement, you can end up with empty objects in your results. Requiring name means anything without one gets dropped automatically.
Running the scrape
import { SpidraClient } from "spidra";
import { writeFileSync, mkdirSync } from "fs";
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! });
const job = await spidra.scrape.run(
{
urls: [{ url: "https://www.yellowpages.com/new-york-ny/plumbers" }],
prompt: PROMPT,
output: "json",
schema: leadSchema,
useProxy: true,
proxyCountry: "us",
},
{
timeout: 120_000,
pollInterval: 5_000,
}
);
const leads: Lead[] = (job.result.content as any)?.businesses ?? [];
console.log(`Got ${leads.length} leads -- ${job.result.stats.totalTokens} tokens`);
Page 1 returned 30 businesses in about 20 seconds and used around 3,200 tokens. Here's a sample of what came back:
[1] Aladdin Plumbing Corp.
Phone: (347) 395-4715
Address: 55 Garnet Street, Kings County, NY 11231
Website: https://www.aladdinplumbingcorp.com/
50 Years in business
[2] Fred Smith Plumbing & Heating Co., Inc.
Phone: (212) 744-1300
Address: 1674 1st Ave, New York, NY 10128
Website: http://www.fredsmithplumbing.com
112 Years in business
When you look through the full results, you will notice a lot of Roto-Rooter entries. Roto-Rooter is a national franchise with a separate Yellow Pages listing for each NYC borough — Manhattan, Brooklyn, Bronx, Queens, Staten Island. Each has a different phone number, but they all point to subdomain pages on rotorooter.com.
Writing to CSV
function escapeCsv(val: unknown): string {
if (val == null) return "";
const s = String(val);
return s.includes(",") || s.includes('"') || s.includes("\n")
? `"${s.replace(/"/g, '""')}"`
: s;
}
function toCSV(leads: any[]): string {
const headers = [
"name", "phone", "address", "website",
"rating", "reviewCount", "yearsInBusiness", "category",
"email", "contactPerson", "linkedIn", "facebook",
];
const rows = leads.map(l => headers.map(h => escapeCsv(l[h])).join(","));
return [headers.join(","), ...rows].join("\n");
}
mkdirSync("leads", { recursive: true });
writeFileSync("leads/raw.csv", toCSV(leads), "utf8");Notice the extra columns at the end — email, contactPerson, linkedIn, facebook. They're empty at this stage. The enrichment script fills them in later. Writing them now keeps both CSVs in the same shape, so you can open either one without reformatting.
Scraping multiple pages
One page gives us 30 leads. For anything useful, we need more. Yellow Pages paginates via query string — ?page=2, ?page=3 — so we can build the URLs upfront and submit them as a batch. Spidra processes them in parallel, so 5 pages take roughly the same time as 1.
function buildPageUrls(base: string, count: number): string[] {
const urls = [base];
for (let i = 2; i <= count; i++) urls.push(`${base}?page=${i}`);
return urls;
}
One thing worth explaining here — the Spidra SDK has a forEach action with built-in pagination support where you give it a CSS selector for the "next page" button, and it clicks through automatically. For most sites, that's the cleaner approach.
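For reference, a click-through version would look roughly like this. We didn't run it for this project, so treat the forEach option names below as assumptions about the SDK's shape rather than exact syntax:
// Hypothetical sketch -- the forEach option names are assumptions, not verbatim
// SDK syntax. The idea: hand Spidra the first page plus a "next page" selector
// and let it click through up to a page limit.
const job = await spidra.scrape.run(
  {
    urls: [{ url: "https://example-directory.com/search?q=plumbers" }],
    prompt: PROMPT,
    output: "json",
    schema: leadSchema,
    forEach: {
      nextSelector: "a.next", // CSS selector for the "next page" button
      maxPages: 5,            // stop after a fixed number of pages
    },
    useProxy: true,
  },
  { timeout: 300_000, pollInterval: 5_000 }
);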
Yellow Pages is a specific exception because its pagination is AJAX-based. Clicking "next" swaps the content in place without navigating to a new URL, so the browser's navigation signal never fires.
With URL batching, each page is a completely separate browser session with its own proxy IP, so none of that applies.
The batch call looks like this:
const NICHE = process.env.NICHE ?? "plumbers";
const LOCATION = process.env.LOCATION ?? "new-york-ny";
const PAGES = Math.min(parseInt(process.env.PAGES ?? "3"), 10);
const BASE = `https://www.yellowpages.com/${LOCATION}/${NICHE}`;
const urls = buildPageUrls(BASE, PAGES);
const batch = await spidra.batch.run(
{
urls,
prompt: PROMPT,
output: "json",
schema: leadSchema,
useProxy: true,
proxyCountry: "us",
},
{ timeout: 300_000, pollInterval: 5_000 }
);
Collecting results and deduplicating
const all: Lead[] = [];
for (const item of batch.items) {
if (item.status !== "completed" || !item.result) {
console.warn(`SKIP ${item.url} -- ${item.status}`);
continue;
}
const businesses: Lead[] = (item.result as any)?.businesses ?? [];
console.log(` ${businesses.length} -- ${item.url}`);
all.push(...businesses);
}
function dedup(leads: Lead[]): Lead[] {
const seen = new Set<string>();
return leads.filter(l => {
const key = `${l.name.toLowerCase().trim()}|${l.phone ?? ""}`;
if (seen.has(key)) return false;
seen.add(key);
return true;
});
}
const unique = dedup(all);
console.log(`${all.length} raw leads → ${unique.length} after dedup`);
From our 3-page run:
3/3 pages OK
30 -- .../plumbers
30 -- .../plumbers?page=2
30 -- .../plumbers?page=3
90 raw leads → 72 after dedup
The 18 duplicates are almost entirely Roto-Rooter franchise entries that appear across multiple pages. Deduplication by name|phone handles it — two genuinely different franchise locations have different phone numbers and survive correctly.
The full scrape.ts
import { SpidraClient } from "spidra";
import { writeFileSync, mkdirSync } from "fs";
const API_KEY = process.env.SPIDRA_API_KEY!;
const NICHE = process.env.NICHE ?? "plumbers";
const LOCATION = process.env.LOCATION ?? "new-york-ny";
const PAGES = Math.min(parseInt(process.env.PAGES ?? "3"), 10);
interface Lead {
name: string;
phone: string | null;
address: string | null;
website: string | null;
rating: number | null;
reviewCount: number | null;
yearsInBusiness: string | null;
category: string | null;
}
const leadSchema = {
type: "object",
properties: {
businesses: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
phone: { type: ["string", "null"] },
address: { type: ["string", "null"] },
website: { type: ["string", "null"] },
rating: { type: ["number", "null"] },
reviewCount: { type: ["number", "null"] },
yearsInBusiness: { type: ["string", "null"] },
category: { type: ["string", "null"] },
},
required: ["name"],
},
},
},
};
const PROMPT = `Extract every business listing on this page.
For each business include: name, phone number, street address,
website URL, star rating, review count, years in business, and service category.
Skip the sponsored listings at the very top -- focus on the organic results.`;
function buildPageUrls(base: string, count: number): string[] {
const urls = [base];
for (let i = 2; i <= count; i++) urls.push(`${base}?page=${i}`);
return urls;
}
function escapeCsv(val: unknown): string {
if (val == null) return "";
const s = String(val);
return s.includes(",") || s.includes('"') || s.includes("\n")
? `"${s.replace(/"/g, '""')}"`
: s;
}
function toCSV(leads: any[]): string {
const headers = [
"name", "phone", "address", "website",
"rating", "reviewCount", "yearsInBusiness", "category",
"email", "contactPerson", "linkedIn", "facebook",
];
const rows = leads.map(l => headers.map(h => escapeCsv(l[h])).join(","));
return [headers.join(","), ...rows].join("\n");
}
function dedup(leads: Lead[]): Lead[] {
const seen = new Set<string>();
return leads.filter(l => {
const key = `${l.name.toLowerCase().trim()}|${l.phone ?? ""}`;
if (seen.has(key)) return false;
seen.add(key);
return true;
});
}
async function main() {
const spidra = new SpidraClient({ apiKey: API_KEY });
const BASE = `https://www.yellowpages.com/${LOCATION}/${NICHE}`;
const urls = buildPageUrls(BASE, PAGES);
console.log(`${NICHE} in ${LOCATION} -- ${PAGES} page(s)`);
const start = Date.now();
const batch = await spidra.batch.run(
{ urls, prompt: PROMPT, output: "json", schema: leadSchema, useProxy: true, proxyCountry: "us" },
{ timeout: 300_000, pollInterval: 5_000 }
);
console.log(`${((Date.now() - start) / 1000).toFixed(1)}s -- ${batch.completedCount}/${batch.totalUrls} OK`);
const all: Lead[] = [];
for (const item of batch.items) {
if (item.status !== "completed" || !item.result) { console.warn(`SKIP ${item.url}`); continue; }
const businesses: Lead[] = (item.result as any)?.businesses ?? [];
console.log(` ${businesses.length} -- ${item.url}`);
all.push(...businesses);
}
const unique = dedup(all);
console.log(`${all.length} raw → ${unique.length} unique`);
mkdirSync("leads", { recursive: true });
writeFileSync("leads/raw.csv", toCSV(unique), "utf8");
console.log("Saved → leads/raw.csv");
}
main().catch(err => { console.error(err.message); process.exit(1); });
Enriching leads
Yellow Pages doesn't expose email addresses — neither does any directory, really. If you want them, you have to visit each business's website. Here's what that actually looks like.
Of the 72 leads from our scrape:
- 20 had no website at all — just a phone number in the listing
- 15 had a website field pointing to a directory link or franchise page (rotorooter.com, a YP redirect, etc.)
- 15 had a real website but no email anywhere on it — contact forms only, or just a phone number
- 10 had a real website with a findable email address
- 12 had something else useful — a contact person's name, a LinkedIn, a Facebook page
Final count: 9 email addresses, 11 contact names, 6 social profiles from 72 leads. About 12% email discovery rate.
Why is that number so low? Small service businesses, such as plumbers, electricians, and cleaners, often treat their websites as digital business cards. A phone number, a list of services, maybe some photos. The person who set it up never added an email address because they didn't want spam, or because whoever built the site just didn't think to add one.
That said, 9 real email addresses plus 11 named contacts from a few-minute run is a decent start for a targeted outreach list. Calling a plumber and asking for Gerald by name is a different conversation from a cold call.
Filtering before you scrape
Before hitting any URLs, filter out the ones that aren't worth scraping:
const SKIP_DOMAINS = [
"yellowpages.com", "yelp.com", "rotorooter.com",
"homedepot.com", "angi.com", "wixsite.com", "wordpress.com",
];
function isScrapeable(url: string | null | undefined): boolean {
if (!url) return false;
try {
const host = new URL(url).hostname.toLowerCase();
return !SKIP_DOMAINS.some(d => host.includes(d));
} catch {
return false;
}
}
Scraping the national Roto-Rooter site for information about a Brooklyn franchise location wastes a credit and yields no useful information. The same goes for YP redirect pages that require a login. Filter those out before the batch runs.
Schema and prompt
const enrichSchema = {
type: "object",
properties: {
email: { type: ["string", "null"] },
contactPerson: { type: ["string", "null"] },
linkedIn: { type: ["string", "null"] },
instagram: { type: ["string", "null"] },
facebook: { type: ["string", "null"] },
},
};
const ENRICH_PROMPT = `Look for:
1. Any email address on this page (contact@, info@, or any mailto: link)
2. The owner, founder, or main contact person's name -- not the company name
3. Links to LinkedIn, Instagram, or Facebook
Return null for anything not found. Do not guess or infer values.`;That last line — "do not guess" — is doing real work. Without it, the model will sometimes infer a plausible email from the domain name ([email protected]) even when no email is on the page. You'd be sending cold emails to nonexistent addresses. We only want what's actually visible.
The enrich.ts script
import { SpidraClient } from "spidra";
import { readFileSync, writeFileSync } from "fs";
const API_KEY = process.env.SPIDRA_API_KEY!;
const INPUT = process.env.INPUT ?? "leads/raw.csv";
const OUTPUT = process.env.OUTPUT ?? "leads/enriched.csv";
async function main() {
const spidra = new SpidraClient({ apiKey: API_KEY });
const leads = parseCSV(readFileSync(INPUT, "utf8"));
const enrichable = leads.filter(l => isScrapeable(l.website));
console.log(`${leads.length} total -- ${enrichable.length} have scrapeable websites`);
// If the same domain appears multiple times (franchise locations),
// scrape it once and reuse the result
const seenDomains = new Set<string>();
const uniqueSites = enrichable.filter(l => {
const domain = new URL(l.website).hostname.toLowerCase();
if (seenDomains.has(domain)) return false;
seenDomains.add(domain);
return true;
});
console.log(`${uniqueSites.length} unique domains`);
const enrichmentMap: Record<string, any> = {};
for (let i = 0; i < uniqueSites.length; i += 50) {
const chunk = uniqueSites.slice(i, i + 50);
const batch = await spidra.batch.run(
{
urls: chunk.map(l => l.website),
prompt: ENRICH_PROMPT,
output: "json",
schema: enrichSchema,
extractContentOnly: true,
useProxy: true,
proxyCountry: "us",
},
{ timeout: 300_000, pollInterval: 5_000 }
);
for (const item of batch.items) {
if (item.status === "completed" && item.result) {
const domain = new URL(item.url).hostname.toLowerCase();
enrichmentMap[domain] = item.result;
}
}
}
let emailsFound = 0;
const enriched = leads.map(lead => {
if (!isScrapeable(lead.website)) return lead;
const domain = new URL(lead.website).hostname.toLowerCase();
const data = enrichmentMap[domain];
if (!data) return lead;
if (data.email) emailsFound++;
return { ...lead, ...data };
});
console.log(`Emails found: ${emailsFound} / ${enrichable.length}`);
writeFileSync(OUTPUT, toCSV(enriched), "utf8");
console.log(`Saved → ${OUTPUT}`);
}
main().catch(err => { console.error(err.message); process.exit(1); });
extractContentOnly: true strips navigation, footers, and cookie banners before the AI sees the page. Those elements add token usage but contain nothing relevant to what we're looking for.
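One gap worth flagging: the script calls parseCSV, but that helper never appears in the article, and isScrapeable and toCSV need to be copied in from the earlier snippets. Here's a minimal parseCSV sketch that understands the quoting escapeCsv produces; it's an illustration, not the exact helper from our run:
// Minimal CSV parser matching escapeCsv's output: fields are quoted only when
// they contain commas, quotes, or newlines, and embedded quotes are doubled.
// It splits on newlines first, so it won't handle fields with embedded line
// breaks -- none of our fields have them.
function parseCSV(text: string): Record<string, string>[] {
  const parseLine = (line: string): string[] => {
    const fields: string[] = [];
    let cur = "";
    let inQuotes = false;
    for (let i = 0; i < line.length; i++) {
      const ch = line[i];
      if (inQuotes) {
        if (ch === '"' && line[i + 1] === '"') { cur += '"'; i++; }
        else if (ch === '"') inQuotes = false;
        else cur += ch;
      } else if (ch === '"') inQuotes = true;
      else if (ch === ",") { fields.push(cur); cur = ""; }
      else cur += ch;
    }
    fields.push(cur);
    return fields;
  };
  const [headerLine, ...rows] = text.trim().split("\n");
  const headers = parseLine(headerLine);
  return rows.map(row => {
    const values = parseLine(row);
    return Object.fromEntries(headers.map((h, i) => [h, values[i] ?? ""]));
  });
}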
Terminal output from our actual run:
72 total -- 38 have scrapeable websites
31 unique domains
Emails found: 9 / 38
Saved → leads/enriched.csv
And a sample of what the enriched rows looked like for leads that returned something:
Aladdin Plumbing Corp.
contactPerson: Gerald "Jerry" Gitli
(no email on site)
All Pro Cleaning & Restoration
email: [email protected]
facebook: facebook.com/pages/All-Pro-Cleaning-and-Restoration/...
Fred Smith Plumbing & Heating
linkedIn: linkedin.com/company/376318
(no email on site -- 112-year-old company, they take calls)
A few things worth knowing
Some businesses don't list their email on the homepage but do on a contact page. If enrichment returns no results, a second targeted scrape of ${website}/contact can improve hit rates for high-value leads.
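Here's roughly what that fallback pass might look like, reusing enrichSchema, ENRICH_PROMPT, and isScrapeable from above. The /contact path is a guess, so some of these URLs will 404 and simply come back with nothing useful:
// Second pass for leads that still have no email: try each site's /contact page.
// Assumes website values are absolute URLs and that `enriched` is the array
// built in enrich.ts's main().
const missing = enriched.filter(l => isScrapeable(l.website) && !l.email);
const contactUrls = missing.map(l => new URL("/contact", l.website).toString());

const fallback = await spidra.batch.run(
  {
    urls: contactUrls,
    prompt: ENRICH_PROMPT,
    output: "json",
    schema: enrichSchema,
    extractContentOnly: true,
    useProxy: true,
    proxyCountry: "us",
  },
  { timeout: 300_000, pollInterval: 5_000 }
);

for (const item of fallback.items) {
  // Failed fetches land here; 404 pages usually just come back with nulls.
  if (item.status !== "completed" || !item.result) continue;
  const domain = new URL(item.url).hostname.toLowerCase();
  const lead = missing.find(l => new URL(l.website).hostname.toLowerCase() === domain);
  if (lead && (item.result as any).email) lead.email = (item.result as any).email;
}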
Also, nothing in this pipeline prevents you from scraping the same leads twice on the next run. A simple approach: after writing the CSV, store the phone numbers or business names in a JSON file and filter against it at the start of the next run.
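A minimal version of that, tracked by phone number in a leads/seen.json file (the filename is our choice; anything works):
import { readFileSync, writeFileSync, existsSync } from "fs";

const SEEN_FILE = "leads/seen.json";

// Phone numbers exported on previous runs.
function loadSeen(): Set<string> {
  if (!existsSync(SEEN_FILE)) return new Set();
  return new Set(JSON.parse(readFileSync(SEEN_FILE, "utf8")) as string[]);
}

// Record this run's phone numbers so the next run can skip them.
function saveSeen(seen: Set<string>, leads: Lead[]): void {
  for (const l of leads) if (l.phone) seen.add(l.phone);
  writeFileSync(SEEN_FILE, JSON.stringify([...seen], null, 2), "utf8");
}

// In scrape.ts's main(), before writing raw.csv:
//   const seen = loadSeen();
//   const fresh = unique.filter(l => !l.phone || !seen.has(l.phone));
//   writeFileSync("leads/raw.csv", toCSV(fresh), "utf8");
//   saveSeen(seen, fresh);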
Finally, most searches max out around page 10 regardless of what you request. If you set PAGES=20 on a less competitive niche, pages past the real maximum will return the same final page over and over. Keep PAGES at 5–7 unless you've manually checked that the search actually has that many pages of organic results.
