IAM Dawn · Deep Research

RFC 9309 & Creator Rights
in the Age of AI Training

Gaby · Vanderbilt Owen MBA · AI-Accelerated Entrepreneurship Practicum · Spring 2026

Executive Summary

The Robots Exclusion Protocol (RFC 9309, formalized 2022) was never designed as an enforcement mechanism — it is a voluntary preference-signaling system. In the era of large-scale AI training data scraping, this distinction has become critically consequential for creators. AI training bots show only 60–70% compliance with robots.txt, unidentified scrapers ignore it entirely, and the protocol cannot retroactively remove content already in training datasets. The legal battleground spans US fair use doctrine (producing contradictory 2025 decisions) and EU regulatory frameworks that are actively transforming preference signaling into enforceable legal obligation. Creators must deploy layered defense strategies combining robots.txt, TDMRep, on-chain provenance, and collective enforcement.

🔵 Knowledge Graph — WHAT

Hard Data & Evidence

  • GPTBot blocked in only 7.8% of top-10K domain robots.txt files (Cloudflare)
  • anthropic-ai blocked in fewer than 5% of major domain robots.txt files
  • Overall AI bot robots.txt compliance: 60–70%; identified bots ~80%
  • Anthropic settled for $1.5B (~$3,000/book) after fair use ruling in its favor
  • Hamburg court (Dec 2025): plain-language TOS copyright statements are insufficient — machine-readable protocols required under EU law
  • EU AI Act (2025) requires GPAI providers to publish training content summaries and comply with machine-readable rights reservations

🟢 Generative Graph — WHAT IF

Opportunities & Structural Asymmetries

  • On-chain provenance (Artiquity model) addresses the enforcement gap robots.txt can never close
  • EU's machine-readable opt-out mandate creates regulatory demand for cryptographic rights infrastructure
  • TDM·AI Protocol's content-derived ISCC identifiers survive metadata stripping — the key failure mode of robots.txt
  • Creators who implement layered opt-outs NOW protect against retroactive dataset inclusion
  • Licensing market for AI training data is forming — well-documented provenance = monetizable asset
  • AI companies settling despite fair use wins signals economic rationality of licensing over litigation exposure

What RFC 9309 Actually Does

Core Mechanism

RFC 9309 defines robots.txt as a plaintext file at a website's root directory. It uses User-Agent lines to identify crawlers and Allow/Disallow directives to specify accessible paths. A longest-path-match principle ensures specific Allow rules override broader Disallow rules. Special characters (* for wildcards, $ for end-of-URL) allow complex patterns without full regular expressions.
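The longest-path-match principle can be illustrated with a deliberately minimal sketch (illustrative only, not a full RFC 9309 parser: wildcard * and end-anchor $ handling are omitted, and rules are assumed pre-parsed into pairs for a single user-agent group):

```python
# Minimal sketch of RFC 9309 rule evaluation for one user-agent group.
# Assumption: rules are (directive, path_prefix) pairs; wildcards omitted.

def is_allowed(rules, url_path):
    """Longest-path-match: the matching rule with the longest prefix wins;
    on a tie, Allow wins; if no rule matches, access is allowed."""
    best = None  # (prefix_length, directive)
    for directive, prefix in rules:
        if url_path.startswith(prefix):
            candidate = (len(prefix), directive)
            if (best is None or candidate[0] > best[0]
                    or (candidate[0] == best[0] and directive == "allow")):
                best = candidate
    return best is None or best[1] == "allow"

rules = [
    ("disallow", "/archive/"),       # block the archive...
    ("allow", "/archive/public/"),   # ...except its public subtree
]

is_allowed(rules, "/archive/2024/post.html")   # → False (disallow matches)
is_allowed(rules, "/archive/public/faq.html")  # → True (longer Allow wins)
is_allowed(rules, "/blog/")                    # → True (no rule matches)
```

This shows why a specific Allow can carve an exception out of a broader Disallow: the longer matching prefix takes precedence.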

Critical Design Fact: RFC 9309 explicitly states crawlers are "requested to honor" the rules. Compliance is entirely voluntary. There is no authentication, no verification, no enforcement mechanism. Robots.txt was designed to prevent server overload — not to protect intellectual property rights.

Avenues for Creators Within RFC 9309

  • User-Agent Blocking: Add known AI training bot identifiers (GPTBot, CCBot, anthropic-ai, ClaudeBot, Bytespider) to Disallow lists
  • Path-Level Granularity: Allow general indexing while blocking specific directories (e.g., /premium/, /archive/) from AI bots
  • Full-Site Blocking: Disallow: / under a specific AI user-agent group blocks that crawler from the entire site
  • Wildcard User-Agent (*): Block all unrecognized crawlers by default, then allow desired ones in their own named groups (under RFC 9309, a crawler that matches a named group ignores the * group)
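These avenues combine naturally in a single file. A sketch (bot names from the list above; the paths and the choice of search crawler are illustrative):

```
# Block known AI training crawlers entirely
User-agent: GPTBot
User-agent: CCBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Bytespider
Disallow: /

# Let a search crawler index the site, but not premium content
User-agent: Googlebot
Disallow: /premium/
Disallow: /archive/

# Default-deny all unrecognized crawlers
User-agent: *
Disallow: /
```

Note that the named groups are what make this work: a crawler that matches GPTBot or Googlebot follows only its own group and never reads the * group.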

Where Robots.txt Fails Against AI Scrapers

| Failure Mode | Explanation |
| --- | --- |
| No authentication | Scrapers can spoof any user-agent string — robots.txt cannot verify identity |
| No enforcement | Violations are not actionable under robots.txt alone; legal liability requires separate legal theories (CFAA, copyright) |
| No nuance | Cannot express conditional rules such as "allow search indexing but not AI training" |
| No retroactivity | Cannot remove content already ingested into training datasets |
| Streisand effect | Disallow entries in robots.txt publicly advertise what is being protected |
| Residential proxy evasion | Unidentified bots route through residential IPs, mimicking human traffic — robots.txt is never even read |
| AI-specific directives excluded | The IETF working group explicitly declined to add AI-training directives to RFC 9309 |
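The "no authentication" failure is trivial to demonstrate: a user-agent string is just a self-declared HTTP header. A sketch using Python's standard library (the URL and browser string are placeholders):

```python
import urllib.request

# Any client can claim to be any crawler — or a regular browser.
# robots.txt has no way to verify this self-declared identity.
req = urllib.request.Request(
    "https://example.com/premium/article.html",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)

# The request now presents as ordinary browser traffic, even if the real
# client is a scraper whose true name is blocked in robots.txt.
print(req.get_header("User-agent"))
```

This is why defenses that rely on user-agent matching (robots.txt, UA-based firewall rules) must be paired with behavioral detection or authentication at the server level.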

Legal Landscape: US vs. EU

US Fair Use Battleground

  • Thomson Reuters v. ROSS Intelligence (Feb 2025): First major ruling — AI training on Westlaw headnotes NOT fair use. Source
  • Anthropic (June 2025): Training on copyrighted books found "spectacularly transformative" — fair use upheld for training. Storage of pirated materials reserved for trial. Anthropic settled for $1.5B. Source
  • Meta/Llama (June 2025): Similar fair use finding — but judge noted better records on market effects could change outcome. Source
  • US Copyright Office (May 2025): AI training may constitute prima facie infringement; fair use applies on a spectrum from noncommercial research to commercial use of pirated sources. Source

EU Rights Reservation Framework

  • EU Copyright Directive (2019/790): TDM permitted on lawfully accessible content UNLESS rightsholder expressly reserves rights through "machine-readable means"
  • EU AI Act (2025): GPAI providers must implement policies identifying and complying with rights reservations; must publish training content summaries
  • Hamburg Court (Dec 2025): Plain-language TOS copyright statements INSUFFICIENT — machine-readable protocols (robots.txt, metadata, HTTP headers) required. Source
  • General-Purpose AI Code of Practice explicitly names RFC 9309 as a primary compliance mechanism

Emerging Mechanisms Beyond Robots.txt

NEW TDMRep Protocol (W3C)

Enables rightsholders to declare TDM choices through tdmrep.json in /.well-known directory. Supports granular rules (research vs. commercial, archival vs. AI training). Values: "reserved," "unreserved," "unset." W3C Spec →
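A sketch of what such a file might contain (illustrative; field names follow the W3C TDMRep draft as I understand it — the location pattern, reservation flag, and policy URL are hypothetical values, so verify against the current spec before deploying):

```json
[
  {
    "location": "/premium/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/policies/tdm-licensing.json"
  },
  {
    "location": "/blog/*",
    "tdm-reservation": 0
  }
]
```

The optional policy URL is what enables the licensing-market opportunity noted above: a reservation can point directly at machine-readable licensing terms.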

NEW TDM·AI Protocol

Registry-based protocol using International Standard Content Code (ISCC) — a content-derived identifier robust to metadata stripping. Cryptographically verifiable opt-out declarations that persist even when metadata is removed during data processing. This addresses the core failure mode of all tag-based approaches. Docs →
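The core idea — an identifier derived from the content itself rather than from attached metadata — can be shown with a deliberately simplified sketch (real ISCC codes use similarity-preserving hashes over normalized content; this toy version uses a plain SHA-256 over the payload, and all field names are hypothetical):

```python
import hashlib

def content_id(payload: bytes) -> str:
    """Toy content-derived identifier: hash the creative payload only.
    (Real ISCC codes are similarity-preserving; this is an illustration.)"""
    return hashlib.sha256(payload).hexdigest()[:16]

# A "file" modeled as a creative payload plus detachable metadata.
original = {"payload": b"the creative work itself",
            "metadata": {"rights": "reserved"}}

# A scraper strips the metadata during data processing.
stripped = {"payload": original["payload"], "metadata": {}}

# A tag-based opt-out dies with the metadata...
print(stripped["metadata"].get("rights"))  # → None: signal lost

# ...but a content-derived ID still matches a registry entry.
print(content_id(original["payload"]) == content_id(stripped["payload"]))  # → True
```

Because the identifier is recomputable from the work itself, an opt-out registered against it survives any pipeline step that discards metadata.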

NEW ai.txt & Do Not Train Registry

ai.txt: An emerging standard dedicated to AI training controls, proposed by The Guardian and Spawning AI. Separates search indexing from AI training extraction. Do Not Train Registry (Spawning): Stability AI and HuggingFace have committed to honoring opt-outs registered in this system.

IPTC Layered Opt-Out Best Practices

Recommended layered approach: plain-language rights declarations + HTML metadata tags + embedded image metadata (XMP packets) + HTTP headers. Increases probability that at least some opt-out signals survive content processing. PDF →
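For the HTML-metadata layer, the TDMRep draft defines meta tags that can sit alongside IPTC declarations (shown as I understand the draft — the policy URL is a placeholder; verify tag names against the current spec):

```html
<!-- Per-page rights reservation in each document's <head> -->
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://example.com/policies/tdm-licensing.json">
```

The same reservation can be repeated as an HTTP response header at the server level, so the signal survives even when the HTML itself is transformed.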

Implications for Artiquity

The Enforcement Gap Artiquity Closes

RFC 9309 and its successors establish the machine-readable preference signaling layer — but preference signaling without enforcement is insufficient. Artiquity's Consent Layer and on-chain minting address the enforcement gap directly: by anchoring creative identity and provenance in immutable on-chain records, Artiquity creates a verifiable rights record that transcends the voluntary-compliance limitations of robots.txt.


The TDM·AI Protocol's content-derived ISCC identifier approach and Artiquity's Creative Capsule architecture point in the same direction — cryptographically verifiable, persistent rights declarations that survive metadata stripping. As the EU AI Act increasingly requires AI developers to identify and comply with machine-readable rights reservations, Artiquity's infrastructure positions creators ahead of the compliance curve.

Cited Sources