
Complete Guide to robots.txt for Web Scrapers (2025)

Joel Faure

If you have ever tried to extract data from a website, you have probably encountered the term "robots.txt." This small text file, sitting quietly at the root of every major website, acts as a communication channel between site owners and web crawlers. Understanding it is foundational to ethical, sustainable web scraping.

Yet for many people getting started with data extraction, robots.txt remains mysterious. What do those cryptic directives actually mean? Is ignoring them illegal? How do modern AI crawlers fit into the picture?

This guide breaks down everything you need to know about robots.txt from a web scraper's perspective. Whether you are building automated data pipelines, conducting market research, or just trying to understand best practices, this comprehensive reference will help you navigate the rules that govern web crawling in 2025.

Google's robots.txt file showing various User-agent directives and Disallow rules

What is robots.txt?

The robots.txt file is part of the Robots Exclusion Protocol (REP), a standard created in 1994 to help website owners communicate their crawling preferences to automated bots. It is a plain text file located at the root of a website's domain.

To find any website's robots.txt file, simply append /robots.txt to the site's root URL. For example:

  • https://google.com/robots.txt
  • https://amazon.com/robots.txt
  • https://linkedin.com/robots.txt

If you receive a 404 error, the website does not have a robots.txt file, which typically means no explicit crawling restrictions are in place. However, this does not mean you should scrape without consideration—responsible practices still apply.
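If you want to check for the file programmatically, the sketch below uses only Python's standard library; example.com is a placeholder domain and fetch_robots_txt is a hypothetical helper name.

import urllib.request
import urllib.error

def fetch_robots_txt(domain):
    """Fetch a site's robots.txt, returning None if the file does not exist (404)."""
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None  # no robots.txt: no explicit restrictions, but scrape responsibly
        raise

rules = fetch_robots_txt("example.com")
print(rules if rules is not None else "No robots.txt found")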

What robots.txt Is NOT

Before diving into the syntax, let us clear up common misconceptions:

It is not a security measure. The robots.txt file is a request, not a command. It tells crawlers "please don't access this," but does not technically prevent anything. A malicious bot can ignore it entirely. If a website truly needs to protect content, it uses authentication, paywalls, or server-side access controls.

It is not legally binding on its own. While disregarding robots.txt can factor into legal disputes (especially when combined with Terms of Service violations), the file itself is not a contract. Courts look at the totality of circumstances, not just robots.txt compliance.

It is not universal. Different crawlers interpret directives differently. Googlebot might handle certain syntax that other crawlers ignore. Always test your assumptions.

The Anatomy of a robots.txt File

A robots.txt file consists of one or more rule sets, each targeting specific crawlers (User-agents) with specific instructions (Allow, Disallow, Crawl-delay, etc.).

Here is a typical example:

User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
Disallow: /search

Sitemap: https://example.com/sitemap.xml

Let us break down each directive.

User-agent

The User-agent line specifies which crawler the following rules apply to. A User-agent is the identifier string that a bot sends to a server to identify itself.

  • * (wildcard): applies to all crawlers
  • Googlebot: Google's main search crawler
  • Bingbot: Microsoft Bing's crawler
  • GPTBot: OpenAI's crawler for training models
  • CCBot: Common Crawl's data collection bot
  • anthropic-ai: Anthropic's AI training crawler
  • ClaudeBot: Anthropic's Claude-related crawler
  • Applebot: Apple's web crawler for Siri and Spotlight
  • facebookexternalhit: Facebook's link preview crawler
  • Twitterbot: Twitter/X's link preview crawler

When a crawler visits a site, it checks the robots.txt for its specific User-agent first. If a match exists, those rules take precedence. If not, it falls back to the wildcard (*) rules.

Practical example: If you build a custom scraper, it likely does not have a named entry in most robots.txt files. Your scraper would follow the User-agent: * rules by default.
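To see this fallback in action, here is a minimal sketch using Python's built-in urllib.robotparser against a hypothetical robots.txt string:

import urllib.robotparser

# A hypothetical robots.txt with a specific group for Googlebot and a wildcard group
robots_txt = """
User-agent: Googlebot
Disallow: /search

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own group, so only /search is off-limits for it
print(rp.can_fetch("Googlebot", "/private/data"))   # True
print(rp.can_fetch("Googlebot", "/search?q=test"))  # False

# An unnamed custom scraper falls back to the wildcard (*) rules
print(rp.can_fetch("MyCustomScraper/1.0", "/private/data"))   # False
print(rp.can_fetch("MyCustomScraper/1.0", "/search?q=test"))  # True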

Disallow

The Disallow directive tells crawlers not to access specific paths or directories.

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /checkout/
Disallow: /user/profile/

Key syntax rules:

  • Disallow: / blocks the entire site
  • Disallow: /directory/ blocks everything under that directory
  • Disallow: /page.html blocks that specific page
  • Disallow: (empty value) means nothing is disallowed

Pattern matching: Most modern crawlers support wildcards (a short matching sketch follows this list):

  • Disallow: /private/* blocks all content under /private/
  • Disallow: /*.pdf$ blocks all PDF files
  • Disallow: /search?*sort= blocks URLs with specific parameters
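Note that Python's built-in urllib.robotparser treats these patterns literally rather than expanding the wildcards; crawlers that support them typically translate the patterns into regular expressions. Here is a rough sketch of that translation (robots_pattern_to_regex is a hypothetical helper, and real crawlers differ in edge cases):

import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a regex (sketch)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

print(bool(robots_pattern_to_regex("/private/*").match("/private/docs/report")))  # True
print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/manual.pdf")))        # True
print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/manual.pdf?v=2")))    # False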

Allow

The Allow directive creates exceptions within Disallow rules. It permits access to specific paths that would otherwise be blocked.

User-agent: *
Disallow: /private/
Allow: /private/public-readme.html

In this example, all content under /private/ is blocked, except for public-readme.html.

The order matters less than specificity. More specific rules (longer paths) take precedence over less specific ones. But when paths are equally specific, Allow typically takes precedence over Disallow.
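As a minimal illustration of that precedence logic (ignoring wildcards), is_allowed below is a hypothetical helper that picks the longest matching rule and lets Allow win ties:

def is_allowed(path, rules):
    """Resolve Allow/Disallow precedence: longest matching rule wins, Allow wins ties."""
    best_length = -1
    allowed = True  # if nothing matches, the path is allowed by default
    for directive, rule_path in rules:
        if path.startswith(rule_path):
            matches_better = len(rule_path) > best_length
            ties_as_allow = len(rule_path) == best_length and directive == "Allow"
            if matches_better or ties_as_allow:
                best_length = len(rule_path)
                allowed = (directive == "Allow")
    return allowed

rules = [("Disallow", "/private/"), ("Allow", "/private/public-readme.html")]
print(is_allowed("/private/secret.html", rules))         # False
print(is_allowed("/private/public-readme.html", rules))  # True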

Crawl-delay

The Crawl-delay directive specifies how many seconds a crawler should wait between requests. This helps protect servers from being overwhelmed.

User-agent: *
Crawl-delay: 10

This tells crawlers to wait 10 seconds between each page request.

Important caveat: Googlebot does not respect the Crawl-delay directive. Google expects webmasters to manage crawl rates through Google Search Console instead. However, many other crawlers (Bingbot, Yandex, and custom scrapers) do respect it.

For ethical web scraping, even if a site does not specify a Crawl-delay, implementing your own delay (1-5 seconds minimum) is considered best practice.
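One simple way to honor the declared delay (or fall back to your own) with Python's standard library, assuming example.com as a placeholder target:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Use the declared Crawl-delay if present, otherwise a polite default of 2 seconds
delay = rp.crawl_delay("*") or 2

for path in ["/page-1", "/page-2", "/page-3"]:
    if rp.can_fetch("*", path):
        # ... fetch and process the page here ...
        time.sleep(delay)  # pause between requests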

Sitemap

The Sitemap directive points to the location of the site's XML sitemap, which lists all publicly accessible URLs.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml

For web scrapers, sitemaps are invaluable. Instead of guessing what pages exist or relying on link discovery, you can parse the sitemap for a structured list of URLs. This is often faster and more complete than crawling.

Lection can leverage sitemap information to build more efficient extraction workflows, ensuring you capture all relevant pages without missing hidden content.
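As a rough sketch of how you might pull URLs from a sitemap with Python's standard library (the URL is a placeholder, and sitemap index files, which nest further sitemaps, are not handled here):

import urllib.request
import xml.etree.ElementTree as ET

def sitemap_urls(sitemap_url):
    """Return the <loc> entries from a standard XML sitemap."""
    with urllib.request.urlopen(sitemap_url, timeout=10) as response:
        tree = ET.parse(response)
    # Sitemap files use the sitemaps.org namespace on every element
    ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    return [loc.text.strip() for loc in tree.iter(ns + "loc") if loc.text]

urls = sitemap_urls("https://example.com/sitemap.xml")
print(f"{len(urls)} URLs found")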

Request-rate

Some robots.txt files include the Request-rate directive, which is more specific than Crawl-delay:

Request-rate: 5/60

This asks crawlers to make at most 5 requests per 60 seconds, i.e. to average at least 12 seconds between requests. It is less commonly used but appears on some high-traffic sites.
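Interpreting the value in code is straightforward; request_rate_delay below is a hypothetical helper that converts it into a per-request wait:

def request_rate_delay(value):
    """Convert a Request-rate value like '5/60' into seconds to wait between requests."""
    requests, seconds = value.split("/")
    return float(seconds) / float(requests)

print(request_rate_delay("5/60"))  # 12.0 seconds between requests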

Visit-time

Rarely seen, the Visit-time directive restricts crawling to specific hours:

Visit-time: 0200-0600

This requests crawlers only access the site between 2 AM and 6 AM. Support for this directive is inconsistent across crawlers.
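If you choose to honor it, a small check like the one below works, assuming the window is expressed in UTC and does not cross midnight (within_visit_time is a hypothetical helper):

from datetime import datetime, timezone

def within_visit_time(window):
    """Return True if the current UTC time falls inside a window like '0200-0600'."""
    start, end = window.split("-")
    now = datetime.now(timezone.utc).strftime("%H%M")
    return start <= now <= end

print(within_visit_time("0200-0600"))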

Real-World Examples: How Major Sites Use robots.txt

Understanding theory is useful, but seeing how major platforms structure their robots.txt files provides practical insight.

Google (google.com/robots.txt)

Google's robots.txt is extensive, listing dozens of specific disallowed paths for different crawlers. It blocks internal tools, search result pages, and administrative sections while allowing main content.

Notable patterns:

  • Blocks /search paths (you cannot crawl their search results)
  • Allows Googlebot access to most public content
  • Specifies sitemaps for main site and news content

Amazon (amazon.com/robots.txt)

Amazon blocks numerous paths related to cart management, checkout, wishlists, and internal APIs. Product pages are generally accessible, which is why e-commerce data extraction is possible.

Key takeaway: Amazon's robots.txt allows you to access product listings but blocks user-specific and transactional pages. Tools like Lection that focus on visible product data work within these boundaries.

LinkedIn (linkedin.com/robots.txt)

LinkedIn has one of the most restrictive robots.txt files among major platforms. They block almost everything for generic crawlers:

User-agent: *
Disallow: /

However, they allow access for specific crawlers like Googlebot and Bingbot to index public profiles for search.

This restrictiveness is part of why LinkedIn data extraction requires browser-based approaches that work with your authenticated session, rather than traditional server-side scraping.

Lection dashboard showing your scraping projects and recent extractions

AI Crawlers: The New Frontier

2024 and 2025 brought a wave of new crawlers designed to gather training data for large language models. Site owners are updating their robots.txt files to address these specifically.

Common AI Crawler User-agents

  • GPTBot (OpenAI): training data for GPT models
  • ChatGPT-User (OpenAI): ChatGPT browsing feature
  • CCBot (Common Crawl): nonprofit data archive
  • anthropic-ai (Anthropic): Claude training data
  • ClaudeBot (Anthropic): Claude-related crawling
  • Google-Extended (Google): Gemini/Bard training data
  • Bytespider (TikTok/ByteDance): AI training data
  • cohere-ai (Cohere): AI training data

Many publishers now include explicit blocks for these crawlers:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

If you are building a web scraper for business purposes (not AI training), this is good news. Sites are becoming more explicit about what they block and why, making it easier to understand their preferences.
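A quick way to see which of these crawlers a given site blocks at the root, again using Python's standard library (example.com is a placeholder):

import urllib.robotparser

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai",
               "ClaudeBot", "Google-Extended", "Bytespider", "cohere-ai"]

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for agent in AI_CRAWLERS:
    status = "allowed" if rp.can_fetch(agent, "/") else "blocked"
    print(f"{agent}: {status} at the site root")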

What This Means for Scrapers

The distinction between "data for AI training" and "data for business analysis" is becoming clearer in robots.txt conventions. Sites that block GPTBot may still allow general crawlers because they differentiate between:

  1. AI training (using content to build models that compete with the original)
  2. Indexing (helping users find the content)
  3. Research/analysis (extracting structured data for business purposes)

This is where browser-based, user-session-driven tools like Lection maintain an advantage. Operating within your authenticated browser session, extracting data you can already see, is fundamentally different from mass crawling for model training.

Robots.txt in Court Cases

Several landmark cases have referenced robots.txt compliance:

hiQ Labs v. LinkedIn (2022): The Ninth Circuit ruled that scraping publicly accessible data does not violate the CFAA (Computer Fraud and Abuse Act), even when the company sends cease-and-desist letters. However, the court noted that this applies to genuinely public data, not authenticated or restricted content.

Ryanair v. Opodo (EU, 2015): European courts upheld that ignoring robots.txt can factor into unfair competition claims, especially when combined with aggressive commercial scraping.

The takeaway: robots.txt compliance alone does not make scraping legal or illegal, but it is one factor courts consider. Following robots.txt demonstrates good faith.

Building a Compliance Framework

For ethical, sustainable web scraping, consider this checklist:

Before scraping:

  • Check robots.txt for your target site
  • Review the site's Terms of Service
  • Assess whether the data is genuinely public
  • Plan appropriate request delays

During scraping:

  • Respect Disallow directives
  • Implement delays between requests (minimum 1-2 seconds)
  • Use descriptive User-agent strings when appropriate
  • Monitor for rate limiting or blocking signals (a minimal sketch of these practices follows this list)
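As a minimal sketch of those habits in Python (the User-agent string and polite_fetch helper are hypothetical, and the back-off policy is only an illustration):

import time
import urllib.error
import urllib.request

def polite_fetch(url, delay=2.0):
    """Fetch a URL with a descriptive User-agent, a delay, and basic 429 handling."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "ExampleResearchBot/1.0 (contact@example.com)"},  # hypothetical identifier
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read()
    except urllib.error.HTTPError as e:
        if e.code == 429:  # rate limited: back off before retrying later
            time.sleep(delay * 10)
            return None
        raise
    time.sleep(delay)  # pause between requests
    return body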

After scraping:

  • Store data securely
  • Use data only for stated purposes
  • Consider data freshness and periodic re-extraction needs
  • Document your compliance decisions

How to Parse robots.txt Programmatically

If you are building custom scraping tools, parsing robots.txt accurately matters.

Python Example

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check if a URL is allowed
can_fetch = rp.can_fetch("*", "/products/item-123")
print(f"Can fetch: {can_fetch}")

# Get crawl delay
crawl_delay = rp.crawl_delay("*")
print(f"Crawl delay: {crawl_delay}")

JavaScript Example

const robotsParser = require('robots-txt-parser');

const parser = robotsParser({
  userAgent: 'MyBot',
  allowOnNeutral: true
});

// Wrap in an async function: top-level await is not available in CommonJS modules
(async () => {
  await parser.useRobotsFor('https://example.com');
  const isAllowed = await parser.canCrawl('/products/item-123');
  console.log('Can crawl:', isAllowed);
})();

Using No-Code Tools

If you are using no-code scraping tools like Lection, you do not need to parse robots.txt manually. Browser-based extractors work within your normal browsing session, accessing only content that is already visible to you as a user. This sidesteps many robots.txt concerns because you are not making automated server requests—you are extracting from an already-rendered page.

Lection's cloud scraping options for automated extractions

Common robots.txt Patterns and Their Meanings

Pattern 1: "Block everything for everyone"

User-agent: *
Disallow: /

The site does not want any crawlers. This is common for authenticated applications, internal tools, or sites with legal restrictions on content redistribution.

Pattern 2: "Block everything except search engines"

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

The site wants search engine indexing but blocks other crawlers. This is typical for content publishers who want organic traffic but not data extraction.

Pattern 3: "Block specific sections"

User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /my-account/

The site allows most crawling but protects user-specific and administrative areas. E-commerce sites commonly use this pattern.

Pattern 4: "Rate-limited access"

User-agent: *
Crawl-delay: 30
Disallow: /api/

The site allows crawling but asks for slow, cautious access. Common for smaller sites with limited infrastructure.

Pattern 5: "Block AI trainers specifically"

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/

The site blocks AI training crawlers specifically while allowing general access. This pattern became common in 2024-2025.

Troubleshooting: When robots.txt Seems Confusing

Conflicting rules

If you see conflicting Allow and Disallow rules, remember that more specific paths take precedence. If still unclear, choose the more restrictive interpretation—it is safer.

No robots.txt file

A missing robots.txt file (404 response) technically means no restrictions, but practice ethical crawling anyway. Implement reasonable delays, respect implicit rate limits, and focus on publicly accessible content.

Robots.txt changes frequently

Some sites update their robots.txt regularly. If you are running scheduled scrapes, periodically re-check the file to ensure your policies are still compliant.

The site also blocks via other means

Even if robots.txt allows access, websites can block scrapers through:

  • Rate limiting (returning 429 errors)
  • CAPTCHA challenges
  • JavaScript rendering requirements
  • IP blocking

For resilient data extraction, tools like Lection that operate within your browser and mimic natural human behavior are more reliable than server-side HTTP scrapers that bypass normal rendering.

Best Practices Summary

  1. Always check robots.txt first. It takes seconds and demonstrates good faith.

  2. Implement crawl delays. Even if not specified, 1-5 seconds between requests is respectful.

  3. Respect Disallow directives. They represent the site owner's stated preferences.

  4. Use descriptive User-agent strings. If you are building custom tools, identify yourself so webmasters know who is accessing their site.

  5. Combine with Terms of Service review. robots.txt is one signal; also review the site's legal policies.

  6. Consider browser-based extraction. Tools like Lection that extract from already-rendered pages sidestep many automated crawling concerns.

  7. Document your decisions. Keep records of why you made specific compliance choices, especially for business-critical data pipelines.

  8. Monitor for changes. Websites update their robots.txt periodically. Build checking into your workflow.

Conclusion: robots.txt as a Communication Tool

The robots.txt file is not a locked door—it is more like a "please knock first" sign. It represents website owners' preferences communicated in a standardized format that the web scraping community has respected for more than 30 years.

Understanding robots.txt is foundational to ethical, sustainable data extraction. While it does not solve all legal and ethical questions around web scraping, respecting these signals demonstrates good faith and reduces friction with site owners.

For most business use cases—market research, competitive analysis, lead generation, price monitoring—working within robots.txt constraints is both possible and advisable. The sites that block scraping entirely often have alternative data sources (APIs, data partnerships) worth exploring.

Ready to extract data the right way? Install Lection and start building ethical, efficient data workflows today.

