Bartz v. Anthropic: The First U.S. Lawsuit Challenging AI Training on Open-Source Code

TubeX AI Editor
3/20/2026, 3:06:28 PM

AI Copyright Litigation Escalates: The Bartz v. Anthropic Case Ignites Controversy over Open-Source Community Rights and the Legality of Training Data for Large Language Models

In June 2024, plaintiff Sarah Bartz and five other open-source developers filed a class-action lawsuit against AI company Anthropic in the U.S. District Court for the Northern District of California (Case No. 3:24-cv-03417). While the complaint superficially follows the "training-data copyright infringement" framework established in earlier cases, such as Doe v. GitHub (concerning GitHub Copilot) and Kadrey v. Meta (concerning Meta's Llama models), it advances a groundbreaking legal theory. Plaintiffs explicitly allege that Anthropic incorporated millions of lines of copyrighted open-source code, including foundational infrastructure components such as the Linux kernel, the GCC compiler, and VS Code extensions, into the pretraining corpus for its Claude series of large language models: without authorization, without attribution, and in violation of strong copyleft license terms such as the GNU General Public License version 3.0 (GPL-3.0). This conduct, plaintiffs argue, constitutes a dual violation of Section 106 of the U.S. Copyright Act and of binding contractual obligations under open-source licenses. Widely dubbed the "first case concerning AI training on open-source code," Bartz v. Anthropic could fundamentally reshape the global legal foundations of data compliance for large language models.

The Legal Force of Open-Source Licenses: From Moral Consensus to Judicially Enforceable Contracts

Past disputes over AI training data have largely centered on the "fair use" defense: the argument that model training constitutes non-expressive, transformative use that does not harm the market value of the original works. Bartz, however, shifts the legal battlefield decisively to the contractual dimension of software licenses. Plaintiffs invoke the long-standing legal position championed by the Free Software Foundation (FSF): copyleft licenses such as the GPL are not unilateral permissions but conditional legal contracts. When users distribute derivative works (here, closed-source, commercially deployed AI models), they trigger the GPL's core obligation to make corresponding source code available under the same license terms. Anthropic has neither disclosed the composition of Claude's training dataset nor confirmed whether it contains GPL-licensed code, and it has released neither the model weights nor any corresponding source. Such omissions, plaintiffs contend, constitute material breaches of GPL-3.0 Sections 5 ("Conveying Modified Source Versions") and 6 ("Conveying Non-Source Forms").

Notably, on July 12, 2024, the FSF issued a rare Formal Statement Regarding the Bartz v. Anthropic Litigation, declaring unequivocally:

“When an AI system learns from GPL-licensed code and generates functionally equivalent code, the AI model itself constitutes a ‘modified version’ under the GPL… Refusing to comply with GPL obligations is tantamount to forfeiting the right to use that code.”

This statement marks a pivotal evolution: the free software movement has formally transitioned from a moral-ethical advocate into a key judicial actor in AI governance. The FSF’s intervention not only strengthens plaintiffs’ legal arguments but also sends an unambiguous industry-wide signal: open-source licenses are no longer “gentlemen’s agreements.” Their terms carry enforceable contractual force.

The Tension Between Technical Reality and Legal Fiction: Is an AI Model a “Derivative Work”?

Anthropic's likely defense will challenge the very legal fiction that "a model is a derivative work." Technically, large language models do not store verbatim copies of source code; instead, they learn statistical patterns via gradient descent, and their generated code results from probabilistic sampling, not mechanical reproduction of training data. Anthropic's supporters often cite the Supreme Court's decision in Google LLC v. Oracle America, Inc., which weighed the "functional" against the "expressive" aspects of APIs, to argue that model weights embody unprotected "ideas" rather than copyrightable "expression."

Yet plaintiffs' technical evidence directly undermines this argument. Across multiple programming benchmarks (HumanEval, MBPP), Claude 3 demonstrates high-fidelity replication of project-specific expressive choices found in GPL-licensed software, including GNU Coreutils' distinctive function-naming conventions and error-handling logic. Moreover, generated C code snippets match a specific Linux kernel driver module verbatim over runs of more than 12 consecutive characters, overlaps that cannot be attributed to generic syntactic structures. This suggests the model may have internalized not just abstract ideas, but the expressive selections unique to particular projects. Should courts adopt this view, the legal standard for "derivative work" would shift from "physical copying" to "reproduction of functional expression," significantly expanding the scope of copyright law in the AI era.
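The kind of verbatim-overlap evidence described above can be approximated with a simple longest-common-substring check. The sketch below is purely illustrative, assuming a character-run threshold mirroring the figure cited in the article; it is not the plaintiffs' actual methodology.

```python
# Hypothetical sketch: flagging verbatim overlap between model-generated code
# and a licensed source file via longest common substring (dynamic programming).

def longest_common_run(generated: str, source: str) -> str:
    """Return the longest substring shared verbatim by both texts."""
    best_end, best_len = 0, 0
    prev = [0] * (len(source) + 1)
    for i, g_ch in enumerate(generated, start=1):
        curr = [0] * (len(source) + 1)
        for j, s_ch in enumerate(source, start=1):
            if g_ch == s_ch:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return generated[best_end - best_len:best_end]

# Threshold mirrors the "more than 12 consecutive characters" figure above;
# a real audit would use far longer runs and normalize whitespace first.
MATCH_THRESHOLD = 12

def flags_verbatim_copy(generated: str, source: str) -> bool:
    """True when the shared run exceeds the illustrative threshold."""
    return len(longest_common_run(generated, source)) > MATCH_THRESHOLD
```

In practice such screening would run against a normalized corpus (comments and whitespace stripped) to avoid both false positives from boilerplate and false negatives from trivial reformatting.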

Collective Action by the Open-Source Community: From Passive Defense to Active Regulation

Underpinning Bartz is the open-source ecosystem’s growing capacity for organized, strategic enforcement. Plaintiffs are backed legally by the Software Freedom Conservancy (SFC), an organization that maintains an open-source license compliance audit database capable of tracing contributor licensing chains. Simultaneously, SFC has joined forces with platforms including GitHub and GitLab to draft the “Training Data Provenance Tag” standard—a proposed technical specification requiring model publishers to disclose whether their training sets include GPL-licensed code and how compliance was ensured. This dual-track strategy—combining technical standards with legal advocacy—is transforming the open-source community from a “silent supplier” in the AI data supply chain into an active, rule-making “regulatory stakeholder.”
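The article does not reproduce the draft "Training Data Provenance Tag" specification. As a purely hypothetical illustration (every field name below is invented here, not taken from the actual SFC/GitHub proposal), such a tag might serialize like this:

```python
import json

# Invented example of a machine-readable provenance disclosure of the kind
# the draft standard described above might require from model publishers.
provenance_tag = {
    "model": "example-model-v1",                      # hypothetical model name
    "contains_gpl_code": True,
    "copyleft_licenses_present": ["GPL-3.0-only", "LGPL-2.1-or-later"],
    "compliance_measures": [
        "license-aware corpus filtering",
        "output similarity screening against copyleft sources",
    ],
    "audit_reference": "https://example.org/audits/example-model-v1",
}

print(json.dumps(provenance_tag, indent=2))
```

The point of such a format is auditability: a regulator or downstream deployer could verify the declared licenses against the publisher's compliance measures without access to the raw corpus.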

The broader business implications are profound. Today's leading open-weight AI models, including Llama and Mistral, tout "openness" while strategically avoiding copyleft terms: bespoke licenses such as the Llama 2 Community License keep the weights visible while restricting commercial reuse, and permissive licenses such as Apache 2.0 carry no copyleft obligations at all. A plaintiff victory in Bartz would compel enterprises to implement rigorous, license-aware training protocols ("License-Aware Training," or LAT). It could even catalyze a new market for "compliant datasets": curated, FSF-certified training corpora rigorously audited to exclude GPL-licensed code, potentially a mandatory prerequisite for large-model deployment.
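A license-aware training pipeline of the kind gestured at above would, at minimum, filter the corpus by detected license before ingestion. The sketch below is an illustrative assumption, not an established implementation: real pipelines would use a dedicated scanner (e.g., ScanCode) rather than the toy SPDX-header check stubbed in here.

```python
# Minimal sketch of "License-Aware Training" (LAT) style corpus filtering.
# Only files with a detected, non-copyleft license survive; unknown licenses
# are excluded conservatively.

COPYLEFT_LICENSES = {"GPL-2.0", "GPL-3.0", "AGPL-3.0", "LGPL-3.0"}

def detect_license(file_text: str) -> str:
    """Toy detector: look for an SPDX identifier in the file's first lines."""
    for line in file_text.splitlines()[:10]:
        if "SPDX-License-Identifier:" in line:
            return line.split("SPDX-License-Identifier:")[1].strip()
    return "UNKNOWN"

def filter_corpus(files: dict[str, str]) -> dict[str, str]:
    """Keep only files whose detected license is neither copyleft nor unknown."""
    kept = {}
    for path, text in files.items():
        license_id = detect_license(text)
        if license_id not in COPYLEFT_LICENSES and license_id != "UNKNOWN":
            kept[path] = text
    return kept
```

Excluding `UNKNOWN` files by default reflects the compliance posture the article describes: in a post-Bartz regime, the burden would fall on the trainer to prove provenance, not on the rights-holder to prove inclusion.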

Global Regulatory Resonance: Governance Spillover Beyond U.S. Courts

Although filed in a U.S. federal court, the case's regulatory ripple effects are already evident worldwide. The EU's Artificial Intelligence Act (AI Act) imposes dedicated obligations on providers of general-purpose AI models, including publishing a sufficiently detailed summary of training content and maintaining a policy to comply with EU copyright law. France's data protection authority, CNIL, has recently launched a dedicated investigation into the legality of local AI startups' training-data sources. Likewise, China's Interim Measures for the Administration of Generative Artificial Intelligence Services expressly require service providers to "respect intellectual property rights" and to take effective measures to prevent intellectual property infringement. The judicial reasoning developed in Bartz could become a critical reference point for these emerging regulatory frameworks.

Yet caution is warranted: overbroad copyright expansion risks stifling innovation. As one developer observed on Hacker News:

“If reviewing a public GitHub repository now requires legal counsel, the collaborative spirit of open source will vanish.”

A sustainable balance lies in nuanced, context-sensitive rulemaking—for instance, distinguishing between “non-commercial research use” and “commercial model training,” or establishing a “limited fair-use exception” for models meeting stringent transparency standards. Only by anchoring AI data governance at the fulcrum between creator rights and the knowledge commons can it achieve genuine long-term viability.

The Bartz v. Anthropic litigation is far more than a copyright dispute. It is a prism refracting a fundamental rupture in the digital age’s knowledge-production paradigm: when code functions simultaneously as tool and creative work; when models operate both as products and authors; when open source evolves from idealistic ethos into a potent legal instrument. What we need is not a reflexive return to outdated doctrines—but a new, digitally native intellectual property order: one that respects technological reality, responds authentically to community needs, and sustains the incentives for innovation. The gavel’s first strike in this trial may well be the clearest prelude to that new order’s birth.


Tags

AI copyright
Open-source licenses
LLM training data
lang:en
translation-of:7673bf98-642a-4958-b2c6-61c2064ff8ad
