FSF Intervenes in Anthropic Copyright Case, Escalating AI Training Data Dispute

The “Constitutional Moment” of Open-Source Ideology: FSF Intervenes in Bartz v. Anthropic, Elevating the AI Training Data Ownership Dispute to a Foundational Principle
In early 2025, the U.S. District Court for the Northern District of California reached a turning point in Bartz v. Anthropic: the Free Software Foundation (FSF) issued a rare formal statement explicitly supporting plaintiff and open-source developer Matthew Bartz. Bartz alleges that Anthropic used his GPLv3-licensed code—publicly hosted on platforms such as GitHub—without permission to train its Claude series of large language models, thereby committing copyright infringement. This move is far more than routine industry solidarity. As the spiritual center of the GNU Project and the free software movement, the FSF has, since its founding in 1985, seldom publicly weighed in on specific commercial litigation. Its exceptional intervention signals that the legitimacy of AI training data has now escalated beyond technical compliance into a fundamental contest over open-source philosophy, the survival of the digital public domain, and the very architecture of knowledge production power.
Three Fault Lines Underlying the FSF’s Statement: License Enforceability, the Erosion of Fair Use, and the Illusion of the “Public Web”
The FSF’s statement directly confronts three prevailing misconceptions embedded in current AI industry practice:
First, license enforceability cannot be nullified by the “training-as-use” logic.
Anthropic contends that a model’s “learning” from code constitutes non-expressive use, and thus does not trigger the copyleft provisions of the GPL. The FSF counters decisively: Section 0 of GPLv3 explicitly defines “running a program” to include “any kind of use of the program’s functionality.” When a model reproduces the logical structure of GPL-licensed code during inference, such as specific algorithmic patterns or API call sequences, it engages in a derivative use of protected expression. Training is not a disembodied mathematical operation performed in a vacuum; it is a systematic ingestion and reconstruction of the original text’s semantics, structure, and intent.
Second, the “fair use” defense collapses under the weight of industrial-scale commercial deployment.
Citing Authors Guild v. Google, Anthropic argues that model training qualifies as highly “transformative use.” The FSF, however, invokes the 2024 Second Circuit ruling in NYT v. OpenAI: when training datasets span trillions of tokens and model outputs directly displace markets for original works—e.g., generating production-ready code, legal documents, or news summaries—the veneer of “transformative purpose” gives way to demonstrable, substantial harm to the original authors’ potential markets. What makes Bartz distinctive is that the plaintiff’s code itself serves clear commercial purposes (e.g., as an embedded-systems development library), and Claude-generated equivalents have already entered internal toolchains at multiple technology firms.
Third, the industry’s tacit assumption—that “publicly accessible web = public domain”—is legally bankrupt.
Anthropic has implied that publicly available code on platforms like GitHub is “de facto open for training.” The FSF sharply rebuts this conflation of accessibility with licensing rights. GPLv3 mandates that any act of distribution—including distributing training outcomes in the form of model weights—must be accompanied by provision of complete corresponding source code. Yet every major closed-source LLM today refuses to fulfill this obligation. More profoundly, the case challenges whether the “robots.txt” protocol—a norm born in the Web 1.0 era—retains any legal relevance in the age of AI-driven knowledge extraction. When fitness-app data can pinpoint a French aircraft carrier in real time (Le Monde report), and when HP’s customer service enforces a mandatory 15-minute wait (Hacker News discussion), technological capability has long since shattered inherited frameworks of implied rights.
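The robots.txt convention at issue here is advisory by design, which is exactly why its legal weight is in question. A minimal sketch using only Python’s standard library (the bot name is hypothetical) shows that the protocol merely reports the site owner’s stated preference; nothing technically prevents a crawler from ignoring the answer:

```python
# Demonstrates that robots.txt is a voluntary signal, not an access control.
# "AITrainingBot" is a made-up user agent for illustration.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The parser only answers "may I?" -- enforcement is left to the crawler.
print(parser.can_fetch("AITrainingBot", "/private/data.json"))  # False
print(parser.can_fetch("AITrainingBot", "/public/readme.txt"))  # True
```

Whether such a purely voluntary signal can carry implied-license weight against industrial-scale ingestion is precisely what the case puts in doubt.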
Fractures Within the Open-Source Community: The Tension Between Pragmatic Compromise and Principled Resistance
The FSF’s intervention has not unified the open-source ecosystem. Key institutions have adopted cautious or divergent stances: the Linux Foundation has remained silent; the Apache Software Foundation reiterated its longstanding position that “licenses do not govern training data”; and Microsoft—the owner of GitHub—has attempted to walk a tightrope between compliance and commerce via Copilot’s “code suggestion filter” and opt-out mechanisms. This fragmentation reveals a deeper schism: the open-source movement is undergoing a paradigm shift—from collaborative development tool to foundational infrastructure for the AI era.
Pragmatists warn that overemphasizing license restrictions will stifle innovation. They cite startups like Sitefire (YC W26), discussed on Hacker News, whose core product—automating SEO visibility for AI tools—depends fundamentally on rapid, iterative training over publicly available data. Requiring individual authorization for every training instance, they argue, would effectively exclude small and midsize developers entirely. Principles-based advocates retort that this is precisely a prelude to the “tragedy of the commons”: if every model company defaults to predatory use of GPL-licensed code, high-quality open-source projects will wither as maintainers lose economic incentive. Bartz himself filed suit after discovering that Claude-generated code matched his GPLv3 project’s syntactic structure at 97% similarity—verified by third-party code-fingerprinting tools. His demand goes beyond damages: he seeks to establish the legal precedent that “training equals distribution.”
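The article does not name the fingerprinting tool Bartz used, but a common family of techniques compares token k-gram “shingles” after normalizing away identifier names, so that renaming variables does not defeat the match. A toy sketch of that idea (real tools, such as MOSS-style winnowing, add fingerprint sampling and far stronger normalization):

```python
# Toy code fingerprinting: token shingles + Jaccard similarity.
# Identifier names are collapsed to "NAME" so structural overlap
# survives superficial renames.
import io
import tokenize

def shingles(source: str, k: int = 4) -> set:
    toks = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            toks.append("NAME")  # normalize all identifiers/keywords
        elif tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.ENDMARKER):
            continue  # ignore layout tokens
        else:
            toks.append(tok.string)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def similarity(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

original  = "def add(a, b):\n    return a + b\n"
generated = "def plus(x, y):\n    return x + y\n"  # same structure, new names
print(round(similarity(original, generated), 2))  # 1.0 after normalization
```

A high score from this kind of measure indicates shared structure, which is why such reports figure in infringement claims even when the surface text differs.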
The Global Regulatory Chessboard: The EU’s DSA and DGA vs. U.S. Copyright Office Guidance
The outcome of Bartz will reverberate across international AI legislation. The EU’s Digital Services Act (DSA) already requires large platforms to disclose their training data sources, while the forthcoming Artificial Intelligence Act (AI Act) explicitly lists training data transparency as a compliance “red line” for high-risk AI systems (Annex III). Even more disruptive is the draft Data Governance Act (DGA), which proposes a “data trust” mechanism: enabling open-source communities to collectively license their data assets for public-interest AI training while prohibiting commercial extraction. A Bartz victory could accelerate DGA implementation, establishing a novel paradigm of “open-source data sovereignty.”
In the United States, the U.S. Copyright Office’s January 2025 update to its Guidelines for Registration of Works Containing AI-Generated Content acknowledges the unsettled legal status of training data—but hints that “massive copying combined with commercial output” may exceed fair use boundaries. Notably, the presiding judge in Bartz previously presided over NYT v. OpenAI, where he posed a pointed question: “When a model can perfectly replicate The New York Times’s writing style and generate content indistinguishable from its paid subscription offerings—does ‘fair use’ become little more than a get-out-of-jail-free card for tech giants?” That very question lies at the heart of Bartz’s judicial inquiry.
Beyond Litigation: Three Pathways Toward a Sustainable AI Data Ecosystem
Regardless of Bartz’s outcome, the open-source community cannot retreat to the old order. Three viable pathways are emerging:
First, license evolution.
The FSF is collaborating with the Open Source Initiative (OSI) to draft an “AI-Ready GPL” revision, adding a new Section 12: explicitly defining model weights as “object code,” and requiring commercial models—upon release—to publish cryptographic hashes of their training datasets alongside verifiable channels to obtain corresponding source code.
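The disclosure obligation described above can be pictured as a hash manifest over the training corpus: one digest per file, plus a root digest that third parties can verify. The format below is purely illustrative, not the draft revision’s actual wording:

```python
# Illustrative dataset manifest: SHA-256 per training file plus a single
# root hash over the canonical manifest, so a published digest can be
# checked against a claimed corpus. The structure is hypothetical.
import hashlib
import json

def manifest(files: dict) -> dict:
    entries = {name: hashlib.sha256(data).hexdigest()
               for name, data in sorted(files.items())}
    # Hash the canonical JSON of the entries to get one verifiable digest.
    root = hashlib.sha256(
        json.dumps(entries, sort_keys=True).encode()).hexdigest()
    return {"files": entries, "root": root}

corpus = {"util.c": b"int add(int a, int b) { return a + b; }\n"}
m = manifest(corpus)
print(m["root"])  # a single 64-hex-character commitment to the corpus
```

Because the construction is deterministic, anyone holding the same files can recompute and confirm the published root hash.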
Second, technical countermeasures.
Echoing Hacker News discussions of crypto’s failed Illinois primary funding, which highlighted the “enforcement failure” problem, the open-source community is developing “license-aware crawlers.” These automated tools scan repositories to identify and tag GPL-licensed code, enabling model companies to curate whitelisted, compliant training sets.
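In miniature, such a license-aware scan might look like the sketch below. Production scanners match full license texts and SPDX metadata across a repository; this regex pass over file headers only illustrates the classification step:

```python
# Hedged sketch: flag files whose headers mention a GPL license so they can
# be excluded from (or specially handled in) a training set. The repo
# contents are made up for illustration.
import re

GPL_PATTERN = re.compile(
    r"SPDX-License-Identifier:\s*GPL|GNU General Public License", re.I)

def classify(files: dict) -> dict:
    # Only scan the first ~2000 chars, where license headers normally live.
    return {name: ("copyleft" if GPL_PATTERN.search(text[:2000]) else "unknown")
            for name, text in files.items()}

repo = {
    "core.c":  "/* SPDX-License-Identifier: GPL-3.0-or-later */ ...",
    "util.py": "# helper functions, no license header\n...",
}
print(classify(repo))  # {'core.c': 'copyleft', 'util.py': 'unknown'}
```

Files the scanner cannot classify would fall back to manual review rather than being silently swept into a training set.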
Third, economic rebalancing.
Drawing inspiration from DeSci (Decentralized Science), initiatives are emerging to tokenize open-source code via NFT-based provenance platforms. Developers could then charge micro-royalties (e.g., $0.01 per million tokens) for training usage, automatically settled via smart contracts—transforming “free” from a condition of exploitation into a genuine, optional right.
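The quoted rate makes the economics easy to sanity-check. A back-of-envelope sketch, with the ingestion size purely hypothetical and on-chain settlement omitted:

```python
# Micro-royalty arithmetic at the article's illustrative rate of
# $0.01 per million training tokens. The 50-billion-token corpus size
# is a hypothetical example, not a figure from the case.
RATE_USD_PER_MILLION_TOKENS = 0.01

def royalty(tokens_used: int) -> float:
    return tokens_used / 1_000_000 * RATE_USD_PER_MILLION_TOKENS

# Ingesting a hypothetical 50-billion-token corpus from one author:
print(f"${royalty(50_000_000_000):,.2f}")  # $500.00
```

At these rates the per-author sums are small, which is why proponents frame the scheme as restoring optionality and provenance rather than as a revenue stream.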
Even as the ink dries on the FSF’s statement, Anthropic has announced a pause on ingesting certain GitHub data. This lawsuit is no longer merely about two parties’ win-loss record. It is a constitutional debate for digital civilization itself: Do we want an AI hegemony powered by invisible appropriation—or an intelligent future that honors the dignity of every line of code? The answer is being written not only in court dockets—but in every commit message a developer submits today.