Does robots.txt have legal force?

Not directly. Robots.txt is a voluntary convention, not a legally binding document in most jurisdictions. However, ignoring it signals bad faith, demonstrates disregard for clearly communicated access preferences, and courts have noted it as a relevant factor in disputes. Following robots.txt is both good practice and smart risk management.

Can I scrape data that is behind a login?

Scraping data that requires authentication to access is in significantly different legal territory. The CFAA's "unauthorized access" language applies to authenticated systems. Scraping logged-in data also typically requires accepting terms of service that prohibit automated access. The safe and defensible approach is to limit scraping to publicly accessible pages that do not require credentials.

Does GDPR apply to me if I am outside the EU?

Yes. GDPR applies to the processing of personal data of individuals in the EU and EEA regardless of where the processing organization is located. If you scrape personal data about EU residents, GDPR obligations apply to your organization wherever it is based.

Can terms of service alone stop me from scraping?

If you have not accepted the terms, meaning no account created and no explicit agreement clicked, browsewrap terms posted in a website footer carry limited legal weight against scrapers. Courts have generally been reluctant to enforce them absent clear assent. If you have created an account and clicked through an agreement that prohibits scraping, those terms are enforceable as a contract.

What is the safest type of data to scrape?

Publicly available, non-personal factual data is the clearest low-risk category. Product prices, business listings, job postings, public company information, publicly traded financial data, and similar factual content about businesses rather than individuals sit well within the established legal framework.

June 27, 2026 · 16 min read

Is web scraping legal? Laws and best practices guide for 2026

Spidra Team

Web scraping is one of the most powerful tools available for collecting public data at scale. Price monitoring, market research, competitive intelligence, academic study, financial analysis, and real estate tracking depend on the ability to read and collect what is publicly visible on the web.

The question of whether this is legal comes up constantly, and the honest answer is: yes, with a clear understanding of where the boundaries are.

This guide walks through the legal landscape in 2026: the key court cases that shaped the current framework, the laws that apply, how different jurisdictions handle it, and the practical best practices that keep scraping operations on solid ground. It is written for developers, data teams, and businesses who want to understand the rules and work within them.

A note before we begin: This article is for informational purposes and does not constitute legal advice. For questions about a specific project, consult a qualified attorney in your jurisdiction.

TL;DR

Scraping publicly accessible web data is generally legal in the United States and most major jurisdictions. Courts have consistently held that accessing information that is visible to any anonymous visitor, without bypassing any authentication barrier, does not constitute unauthorized computer access. The landmark cases that established this are covered in detail below.

The legality of any specific scraping project depends on four things:

Whether the data is publicly accessible or gated behind a login
Whether the data includes personal information subject to privacy laws
Whether scraping the content could implicate copyright
Whether the scraper has agreed to terms that prohibit automated access

Get these four things right and web scraping is a fully legitimate way to collect publicly available information.

The cases that shaped the legal framework

Let's explore three of them.

hiQ Labs v. LinkedIn (Ninth Circuit, 2022)

This is the most cited case in web scraping law. hiQ Labs was a small data analytics company that scraped public LinkedIn profiles to provide workforce analytics to employers. LinkedIn sent a cease-and-desist in 2017, claiming the scraping violated the Computer Fraud and Abuse Act (CFAA), a 1986 anti-hacking statute.

hiQ sued for the right to continue. The case worked its way through the courts over five years. In April 2022, the Ninth Circuit reaffirmed its earlier ruling: scraping publicly available data does not violate the CFAA. A publicly available website has no access permissions or gates to get past, so accessing its data cannot be access "without authorization" under the statute.

The court drew a clear line. If data is accessible to any anonymous visitor without a password or login, accessing it programmatically is not "unauthorized access" under the CFAA. The statute was written to target hackers who break into protected systems. It does not extend to reading what any member of the public can already see.

The trend initiated by the Ninth Circuit's hiQ ruling in favor of open access to public data remains intact. The ruling has been cited in over 50 subsequent cases.

The full story has an important nuance: hiQ won the CFAA battle but ultimately lost the war. After prevailing on the headline issue, it lost on breach of contract. The court found that hiQ had agreed to LinkedIn's User Agreement (which prohibits scraping), and the case ended in a consent judgment under which hiQ agreed to a monetary judgment and a permanent injunction.

The practical lesson: the CFAA no longer protects platform owners from public scraping, but contract claims survive. If you create an account, accept terms that prohibit scraping, and then scrape, you may face breach of contract liability. The CFAA issue and the contract issue are independent.

Van Buren v. United States (Supreme Court, 2021)

One year before the final hiQ ruling, the Supreme Court narrowed the CFAA in a case involving a police officer who accessed a law enforcement database for an improper purpose.

In Van Buren, the Court ruled that the "exceeds authorized access" clause of the CFAA applies only when an individual has valid access to a system but enters parts of it that are off limits to them, not when they misuse access they legitimately have. The Supreme Court then vacated the Ninth Circuit's earlier hiQ decision and remanded it for review under Van Buren.

The effect on scraping: misusing access you legitimately hold (for example, scraping a site you are logged into against its terms) is generally not, on its own, a CFAA violation under Van Buren, although it can still be a breach of contract. The CFAA is about entering systems or areas you have no right to access, not about how you use data you are already permitted to see. This ruling made the CFAA even less useful as a tool against public data scraping.

Meta v. Bright Data (Northern District of California, January 2024)

This case reinforced the hiQ framework and extended it to a new question: can a platform's terms of service reach scraping by someone who is not actively logged in?

Bright Data, one of the world's largest scraping infrastructure companies, scraped publicly available data from Facebook and Instagram. Meta sued, arguing the scraping violated its terms of service. A summary judgment opinion by U.S. District Judge Edward Chen confirmed: "The Facebook and Instagram Terms do not bar logged-off scraping of public data; perforce it does not prohibit the sale of such public data."

The court found Bright Data could only violate Meta's terms of service if Meta proved that Bright Data scraped data while logged into a Meta account. "Meta has not presented evidence sufficient to raise a reasonable inference that Bright Data scraped data while logged into an account and thereby accessed nonpublic data," Chen wrote.

The ruling sharpened the distinction between logged-in and logged-out scraping. A platform's terms of service bind users of the platform, not outside visitors observing publicly available content. Logged-out scraping of public data is the defensible position.

The laws you need to understand

The Computer Fraud and Abuse Act (United States)

The CFAA is the primary federal law that platforms have historically used against scrapers. After hiQ, Van Buren, and Meta v. Bright Data, the CFAA is no longer an effective tool against scraping of publicly accessible data.

Scraping data visible to any anonymous user, such as public search results, product prices on an e-commerce site, and news articles, generally does not violate the CFAA. Authentication is the barrier. The moment you need to log in, enter a password, or use credentials to see data, you are in unauthorized access territory if you are scraping without consent.

What remains off limits under the CFAA: bypassing technical access controls, scraping data that requires authentication, using fake credentials to access systems, and accessing data you have been specifically blocked from after receiving legal notice to stop.

GDPR (European Union and EEA)

The General Data Protection Regulation applies to the personal data of individuals in the EU and EEA, regardless of where the scraping company is located. If you scrape data about EU residents, GDPR applies to you even if your company is based in the United States, Australia, or anywhere else.

The critical point is that GDPR applies to personal data regardless of whether that data is publicly available. A person's name, email address, phone number, or photograph on a public website does not cease to be personal data because it is publicly visible. Privacy laws do not prohibit web scraping outright. They regulate what types of data you can collect, what you can do with it, and what obligations you have toward the individuals whose data you scrape.

For most commercial scraping operations, the lawful basis most likely to apply is "legitimate interest" under Article 6(1)(f) of the GDPR. This requires a documented Legitimate Interest Assessment balancing the scraper's interests against the data subject's rights. Consent is not a practical basis for large-scale scraping because you cannot obtain prior consent from the individuals whose data you are collecting.

The Clearview AI cases illustrate the stakes. Clearview built a facial recognition database by scraping billions of public photos from social media. Even though the photos were publicly available, regulators enforcing GDPR and equivalent laws imposed fines totalling over €91 million across 15 jurisdictions by 2025. The public visibility of the data was not a defence under GDPR.

For businesses whose scraping focuses on non-personal data such as product prices, business listings, job postings, and financial data, GDPR compliance is straightforward. The regulation's requirements are most material when personal information is involved.

CCPA and US state privacy laws

The California Consumer Privacy Act applies to businesses that collect personal information of California residents above certain revenue or data volume thresholds. Like GDPR, it does not prohibit scraping but imposes obligations around personal data collection and consumer rights. The CCPA requires disclosure of data collection practices and honoring opt-out requests. "Personal information" under the CCPA is broadly defined as information that identifies, relates to, describes, or can be reasonably linked to a California consumer or household.

Several other US states have enacted comparable privacy legislation, including Virginia, Colorado, Connecticut, and Texas. The practical approach: treat personal data of US residents with the same care as GDPR-regulated data, and limit personal data collection to what is genuinely necessary for your use case.

Copyright law

Scraping content and copyright are two separate questions. The act of scraping is generally not a copyright issue. What you do with the content after scraping is where copyright applies.

Facts are not copyrightable. Product names, prices, addresses, stock counts, financial figures, and similar factual data can be collected and used freely. Creative works such as articles, blog posts, product descriptions written with literary care, photographs, and videos are protected by copyright.

The key principle for most scraping use cases: collecting factual data for internal analysis, price monitoring, or market research poses minimal copyright risk. Republishing substantial portions of copyrighted content, building a competing product that reproduces protected creative works, or scraping content to train AI models on protected expression each carries copyright exposure.

The fair use doctrine in the United States provides some protection for transformative uses of copyrighted material, including research, criticism, commentary, and uses that create new value rather than substituting for the original. Courts evaluate fair use on a case-by-case basis, weighing the purpose of the use, the nature of the work, the amount copied, and the effect on the original work's market.

Terms of service

Most major websites prohibit automated access in their terms of service. The legal effect of these terms depends on whether the scraper has actually agreed to them. Courts distinguish between two kinds of agreements:

Clickwrap terms require an active affirmative action, specifically clicking "I agree" during account creation. These create enforceable contracts. Scraping in violation of clickwrap terms you have accepted carries breach of contract risk.
Browsewrap terms are buried in a footer or notice that users never explicitly acknowledge. Courts have generally been reluctant to enforce these against scrapers because there is no clear acceptance. The Meta v. Bright Data ruling followed this logic: Bright Data was not "using" Facebook as a user when it scraped logged-out public pages, so Meta's user-facing terms could not bind it.

The practical guidance: if you have not created an account or clicked through an agreement, browsewrap terms carry limited legal weight against you. If you have created an account and accepted terms that prohibit scraping, scraping in violation of those terms creates contract liability even if the CFAA does not apply.

How different jurisdictions handle web scraping

United States

The US has the most scraper-friendly legal framework for public business data, anchored by hiQ and Meta v. Bright Data. Public product, price, and business data collected without bypassing authentication is the clearest defensible category. Personal data is covered by CCPA and state-level laws. Copyright applies to creative content.

European Union

The EU framework is more restrictive, particularly after GDPR enforcement intensified and the EU AI Act began taking effect. GDPR applies to any personal data of EU residents regardless of public visibility. The EU Copyright Directive's text and data mining exceptions cover mining by research organizations for scientific research and, for other purposes, mining of lawfully accessible content unless the rightholder has reserved its rights through a machine-readable opt-out. Scraping for AI training in the EU requires compliance with the AI Act's transparency and copyright provisions.

United Kingdom

Post-Brexit, the UK retained the EU database rights framework and has its own UK GDPR, which operates alongside the Data Protection Act 2018. The legal position is similar to the EU. The Computer Misuse Act 1990 covers unauthorized access to computer systems.

Canada

Canada's PIPEDA (Personal Information Protection and Electronic Documents Act) covers personal data with a broadly similar framework to GDPR. Scraping publicly available business information generally falls outside PIPEDA's scope. Scraping personal information of individuals requires a lawful basis.

Australia

The Privacy Act covers personal information of Australian individuals. The Australian Competition and Consumer Commission has pursued cases involving misleading data practices. Public factual data scraping is generally permitted.

Brazil

Brazil's LGPD (Lei Geral de Proteção de Dados) mirrors GDPR's structure. Enforcement has been less aggressive than in Europe to date, but the regulator ANPD has signaled increased activity.

The five factors that determine your risk level

Understanding the framework is one thing. Assessing whether any specific scraping project sits in the low-risk or high-risk category comes down to five questions.

Is the data publicly accessible without authentication? Data visible to any anonymous visitor without login, payment, or credentials is in the defensible zone. Data behind a login is not.
Does the data include personal information? Product prices, business listings, stock counts, job titles, company names, and factual data about businesses sit in the low-risk category. Names, email addresses, personal phone numbers, and profile data tied to identifiable individuals require compliance consideration under GDPR, CCPA, or equivalent laws.
Have you agreed to terms that prohibit scraping? If you created an account and accepted terms that prohibit automated access, scraping creates contract risk. If you have not created an account or accepted any terms, browsewrap-only sites carry limited contract exposure.
Are you bypassing technical access controls? Circumventing CAPTCHAs, authentication systems, IP blocks applied after legal notice, or other technical barriers moves the activity toward the CFAA's prohibited zone regardless of whether the underlying data is public.
What are you doing with the data? Internal analysis, price monitoring, market research, and competitive intelligence are the most defensible purposes. Republishing copyrighted content, building competing products that substitute for the original, and scraping for purposes the site explicitly prohibits in commercially significant terms carry more exposure.
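
The five questions above can be turned into a simple pre-project checklist. The sketch below is only an illustration of that idea: the class name, field names, and flag messages are hypothetical, and the output is a prompt for further review, not a legal determination.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScrapeProject:
    """Answers to the five questions for one project (hypothetical structure)."""
    requires_login: bool
    includes_personal_data: bool
    accepted_terms_prohibiting_scraping: bool
    bypasses_access_controls: bool
    republishes_copyrighted_content: bool


def risk_flags(project: ScrapeProject) -> List[str]:
    """Return one message per question that points toward higher risk."""
    checks = [
        (project.requires_login, "data sits behind authentication"),
        (project.includes_personal_data, "personal data: privacy laws apply"),
        (project.accepted_terms_prohibiting_scraping, "accepted terms prohibit scraping"),
        (project.bypasses_access_controls, "technical access controls are bypassed"),
        (project.republishes_copyrighted_content, "copyrighted content is republished"),
    ]
    return [message for triggered, message in checks if triggered]


# Example: public price monitoring with no account and no personal data.
price_monitor = ScrapeProject(False, False, False, False, False)
flags = risk_flags(price_monitor)  # an empty list means no question raised a flag
```

An empty list only means none of the five questions was answered "yes". Any flag is a reason to pause and review that factor before scraping.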

Best practices for responsible web scraping

Working within the framework above is both legally sound and operationally smart. Scrapers that follow these practices run more reliably, get blocked less often, and have a defensible position if questions arise.

Read robots.txt

Every scraping project should start at the target site's /robots.txt file. This file, a convention established in 1994, tells automated agents which paths are permitted and which are not, and may specify crawl delay requirements. It is not legally binding in most jurisdictions, but ignoring it signals bad faith and courts have noted robots.txt compliance as a factor in disputes. Follow the Disallow directives and honor any Crawl-delay specified.
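
As a minimal sketch of that check, Python's standard library includes urllib.robotparser. The bot name, the sample robots.txt body, and the example.com URLs below are assumptions for illustration. In a real project you would download the target site's own /robots.txt first and pass its text in.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name and robots.txt body, used only for illustration.
USER_AGENT = "YourCompanyBot"
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = is_allowed(parser, "https://example.com/products/1")
blocked = is_allowed(parser, "https://example.com/private/report")
delay = parser.crawl_delay(USER_AGENT)  # None when no Crawl-delay is given
```

Checking each URL before requesting it, and using the Crawl-delay value as a floor for your request spacing, keeps the scraper aligned with what the site operator has published.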

Respect rate limits

Sending more requests than a server can handle creates a burden on site infrastructure that courts have treated as actionable trespass to chattels, even where the underlying data is public. Implement reasonable rate limiting, with at least 1-2 seconds between requests, avoid degrading site performance, and back off when you see rate-limit responses. If you see 429 errors or increasing response times, slow down.
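
One way to sketch this behavior is a small delay calculator plus a retry loop. Everything named here is an assumption made for illustration: the constants, the function names, and the shape of the fetch callable, which is expected to return a status code, an optional Retry-After value in seconds, and the response body. It is not tied to any particular HTTP library.

```python
import time
from typing import Callable, Optional, Tuple

MIN_DELAY_SECONDS = 1.5   # within the 1-2 second range suggested above
MAX_DELAY_SECONDS = 60.0  # assumed upper bound for backoff
MAX_ATTEMPTS = 5          # assumed retry budget per URL

# (status code, Retry-After in seconds or None, response body)
FetchResult = Tuple[int, Optional[float], bytes]


def next_delay(current: float, status: int, retry_after: Optional[float] = None) -> float:
    """Return how long to wait before sending the next request."""
    if status == 429:
        # Prefer the server's Retry-After hint; otherwise double the delay.
        wanted = retry_after if retry_after is not None else current * 2
        return min(max(wanted, MIN_DELAY_SECONDS), MAX_DELAY_SECONDS)
    return MIN_DELAY_SECONDS


def polite_fetch(
    url: str,
    fetch: Callable[[str], FetchResult],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Fetch a URL, backing off on 429 and pausing after every request."""
    delay = MIN_DELAY_SECONDS
    for _ in range(MAX_ATTEMPTS):
        status, retry_after, body = fetch(url)
        delay = next_delay(delay, status, retry_after)
        sleep(delay)  # also spaces out the caller's next request
        if status != 429:
            return status, body
    return 429, b""  # gave up after repeated rate-limit responses
```

Passing the sleep function in as a parameter keeps the loop easy to test without waiting. In production the default time.sleep applies.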

Use clear User-Agent identification

Identify your bot honestly in the User-Agent string. A format like YourCompanyBot/1.0 (+https://yourcompany.com/bot) tells site operators who you are and gives them a way to contact you. Impersonating a regular browser to avoid detection is a form of deception that can undermine your legal position. Good-faith scraping does not need to hide.
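
As a small illustration, the identification string above can be attached to every request with Python's standard library. The bot name and URLs are the placeholder values from the paragraph above, not real endpoints.

```python
from urllib.request import Request

# Placeholder identity taken from the format suggested above.
BOT_USER_AGENT = "YourCompanyBot/1.0 (+https://yourcompany.com/bot)"


def build_request(url: str) -> Request:
    """Create a request that identifies the bot and links to a contact page."""
    return Request(url, headers={"User-Agent": BOT_USER_AGENT})


request = build_request("https://example.com/products")
# Pass the request to urllib.request.urlopen(request) to actually send it.
```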

Minimize personal data collection

Collect only the fields you need for your use case. If you need product prices and titles, do not collect reviewer names and email addresses alongside them. Data minimization is both a GDPR principle and sound risk management. Hash or anonymize any personal data you do collect, set retention periods, and delete data when the purpose has been served.
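
A minimal sketch of field-level minimization might look like the following. The field names, the allowlist, and the secret key are hypothetical. Note that a keyed hash of an identifier is usually treated as pseudonymization rather than full anonymization, so the hashed value may still count as personal data.

```python
import hashlib
import hmac

# Hypothetical schema: fields the use case needs, and identifiers to hash.
ALLOWED_FIELDS = {"title", "price", "currency"}
PSEUDONYMIZE_FIELDS = {"seller_id"}
SECRET_KEY = b"replace-with-a-secret-key"  # assumption: stored outside the code


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed SHA-256 digest."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record: dict) -> dict:
    """Keep allowlisted fields, hash listed identifiers, drop everything else."""
    cleaned = {}
    for field, value in record.items():
        if field in ALLOWED_FIELDS:
            cleaned[field] = value
        elif field in PSEUDONYMIZE_FIELDS:
            cleaned[field] = pseudonymize(str(value))
    return cleaned


raw = {
    "title": "Desk lamp",
    "price": 19.99,
    "seller_id": "u-123",
    "reviewer_email": "a@example.com",  # not needed, so it is never stored
}
clean = minimize(raw)
```

An allowlist fails safe: a new field that appears on the page is dropped unless someone deliberately adds it to the schema.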

Review terms of service before creating accounts

If a site requires you to create an account to access data, read the terms before accepting. If the terms explicitly prohibit automated access and you accept them, scraping in violation of those terms creates contract exposure. Evaluate whether the data is accessible without an account, whether an API or data license is available, or whether the terms have exceptions for your use case.

Document your compliance decisions

Keep records of your robots.txt checks, your rate limiting configuration, your data minimization decisions, and any legal review you conduct. Documentation creates an audit trail that demonstrates good faith if a dispute arises. For organizations operating under GDPR, a documented Legitimate Interest Assessment is a compliance requirement for processing personal data.
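
One lightweight way to keep such records is an append-only log with one JSON object per line. The record fields and function names below are assumptions chosen to mirror the items listed above. Adapt them to whatever your legal review actually needs.

```python
import json
from datetime import datetime, timezone
from typing import List


def compliance_record(
    target: str,
    robots_txt_checked: bool,
    delay_seconds: float,
    fields_collected: List[str],
    notes: str = "",
) -> dict:
    """Build one audit entry describing how a scraping job was configured."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "target": target,
        "robots_txt_checked": robots_txt_checked,
        "delay_seconds": delay_seconds,
        "fields_collected": fields_collected,
        "notes": notes,
    }


def append_record(path: str, record: dict) -> None:
    """Append the entry as a single JSON line so earlier entries are untouched."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, sort_keys=True) + "\n")
```

Calling append_record once per job run builds a dated trail of what was checked and collected.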

Use official APIs where available

Many platforms offer official APIs that provide structured data access within defined terms. Using the official API where one exists is the clearest path to compliance and typically provides more reliable data access than scraping. Where an API is too restrictive for your use case or does not cover the data you need, scraping public pages remains an option within the framework above.

Specific use cases and where they stand

Price monitoring and competitive intelligence: Scraping publicly visible product prices, availability, and business listings is one of the clearest low-risk use cases. Prices are factual data and not copyrightable. Major retailers publish this information to the public and it drives legitimate market activity.
Market research and trend analysis: Collecting publicly available data about market trends, job postings, public company filings, and industry developments is well-established practice. Keep the focus on aggregate, non-personal data.
Real estate data: Public property listings, sale prices, and comparable market data are routinely scraped by real estate platforms and research tools. Verify that data being accessed is the genuinely public listing rather than aggregator-only data.
Academic and journalistic research: Courts and regulators have consistently given more latitude to scraping for academic research, journalism, and public interest purposes. This does not create a blanket exemption but does strengthen a fair use or legitimate interest argument.
Lead generation and contact data: This is the highest-risk category. Scraping names, email addresses, phone numbers, and personal contact details for marketing purposes is the clearest path to GDPR, CCPA, and equivalent compliance obligations. Most enforcement actions have targeted exactly this use case.
AI training data: This is an actively evolving area. Many platforms including Reddit, Getty Images, and major news publishers have added explicit AI training prohibition clauses to their terms of service. In the EU, the AI Act requires transparency about training data sources. Scraping public factual data for AI training in the US is generally supported by fair use for transformative purposes. Scraping copyrighted creative works at scale for AI training faces increasing legal challenge.

Scraping responsibly with Spidra

Responsible scraping is built into how Spidra works. The Spidra API routes requests through a network of residential proxies that operate within published terms, respects rate limits automatically, and provides structured AI extraction that collects the fields you define and nothing more. The schema-based approach to data collection supports data minimization by design: you specify exactly what fields you need and only those fields are returned.

For teams building price monitoring pipelines, competitive intelligence tools, market research datasets, or any other data collection application on public web data, Spidra provides the infrastructure to do it at scale within the framework the courts and regulators have established.

Get started free at app.spidra.io.

Frequently asked questions

Is web scraping legal?

Yes. Scraping publicly accessible data that does not require authentication is generally legal in the United States and most major jurisdictions. The Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn confirmed that scraping public pages does not violate the CFAA. The January 2024 ruling in Meta v. Bright Data confirmed that logged-out scraping of public data is not a violation of platform terms of service. Legality depends on what data you collect, whether it includes personal information, whether you have agreed to terms prohibiting scraping, and what you do with the data.

Can I scrape data for AI training?

In the US, scraping publicly available factual data for AI training is generally supported by the fair use doctrine where the use is transformative. Scraping copyrighted creative works at scale for AI training faces increasing litigation. In the EU, the AI Act imposes transparency requirements on general-purpose AI providers about training data sources. Review site-specific terms for explicit AI training prohibitions before scraping at scale.

Why did hiQ win on the CFAA but lose on breach of contract?

The two results address different legal questions. The CFAA issue was whether scraping public pages constitutes unauthorized computer access. The court said no. The contract issue was whether hiQ had agreed to LinkedIn's terms that prohibited scraping. It had, by creating accounts. The CFAA ruling protects public scraping from criminal or federal hacking liability. It does not protect against breach of contract claims if you have accepted terms that prohibit scraping.
