A Stepwise Approach to Copyright and Generative Artificial Intelligence

Katherine Lee, A. Feder Cooper, & James Grimmelmann, Talkin’ ’Bout AI Generation: Copyright and the Generative-AI Supply Chain, __ J. Copyright Soc’y U.S.A. __ (forthcoming, 2024), available at SSRN (July 27, 2023).

Michael W. Carroll

In order to understand whether generative AI may infringe copyrights, one must first have a sound grounding in the technical complexities of the “generative AI supply chain.” This Article not only explains the technology in terms accessible to a legal audience, but also explores the doctrinal complexities of how generative AI maps onto existing copyright law. The authors do an admirable job in accomplishing both goals.

I. Understanding Generative AI

This jot highlights four key technical points made in the article about generative AI that a copyright-interested legal reader needs to understand.

1. It is all math. Even though the outputs of generative AI models may be expressed in text or images, the process by which the models are trained, and by which they operate, relies on converting the expressive inputs, including works of authorship, into numeric values in structured data sets. Those values can be used in various mathematical operations to produce numeric outputs that are translated back into text and images. The math used in the operation of these models is fourth-grade math carried out by the machine-equivalent of an army of billions or trillions of fourth graders.¹

2. Generative AI models are capable of discerning patterns in the entire corpus of works in the training data, not just in individual works on which the models were trained. Works of authorship used in training are treated as sources for pattern analysis of the constituent elements of textual or visual works, so “an item like a painting or a book is not itself data; rather it can be processed computationally to be converted into data to be used in machine-learning applications.” (P. 10.) If a generative AI model produces an output that looks like a copy of a work in the training data, that is not because the training process retained any sense of those patterns as part of an individual work.

3. Generative AI technologies are prediction engines, not knowledge bases. A generative AI model or system is generally designed to predict outputs that a user would find to be responsive to a prompt. The article explains that some relatively recent technical advances in transformer architecture and diffusion techniques have improved these systems’ predictive power. Once granular patterns in the training data set are in place, the model or system is trained to generate an output at a similar level of granularity.

4. Scale matters. The amount of data required to train today’s most advanced generative AI models is unprecedented in scale. The greatly enlarged scale of training data is primarily responsible for the increasing utility of these technologies and their ability to produce results that surprise even those who built them.
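
The first and third points above can be illustrated with a deliberately tiny sketch. It is not the transformer or diffusion technology the article describes; it assumes a much simpler word-pair counting technique, and the three-sentence corpus and all function names are hypothetical. It shows only the general shape of the process: text becomes numbers, simple arithmetic (counting and division) is applied across the whole corpus, and the numeric result is translated back into a predicted word.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; real training sets are vastly larger.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat sat on the rug",
]

# Step 1: convert text to numbers. Each distinct word gets an integer ID.
vocab = sorted({word for line in corpus for word in line.split()})
word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}


def encode(text):
    """Translate a string into a list of integer IDs."""
    return [word_to_id[w] for w in text.split()]


# Step 2: "train" by counting which ID follows which. The counts are pooled
# across all sentences; the table does not record which sentence
# contributed which count.
follow_counts = defaultdict(Counter)
for line in corpus:
    ids = encode(line)
    for current, following in zip(ids, ids[1:]):
        follow_counts[current][following] += 1


def next_word_probabilities(word):
    """Turn raw counts into probabilities using only addition and division."""
    counts = follow_counts[word_to_id[word]]
    total = sum(counts.values())
    return {id_to_word[i]: n / total for i, n in counts.items()}


def predict_next(word):
    """Step 3: predict the most likely next word and translate it back to text."""
    probs = next_word_probabilities(word)
    return max(probs, key=probs.get) if probs else None


print(encode("the cat"))
print(next_word_probabilities("the"))
print(predict_next("sat"))
```

Asking this toy for the word after “sat” yields a prediction drawn from statistics over every sentence at once rather than a lookup of any one sentence. Production systems differ enormously in scale and technique, so the sketch should be read as an intuition aid and not as a description of how any actual model is built.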

II. Copyright and the Generative AI Supply Chain

The article breaks the preparation and operation of a generative AI system into a “supply chain” composed of distinct stages: creation of expressive works; dataset collection and curation; model (pre-)training (creation of a base model); fine-tuning; and system deployment for output generation, which may be followed by model alignment. The value in this decomposition lies in showing that different actors can play different roles at different stages of the generative AI supply chain. This has implications for who might have copyright liability at what stages of that supply chain.

With respect to authorship/ownership, the article identifies a range of potential authors in the supply chain. In addition to copyrightable source works, compiling training datasets may sometimes exhibit sufficient originality to give rise to at least thin copyrights. In many cases, creating a base model may not be sufficiently original, but the article identifies some instances in which the model could be considered a work of authorship. Fine-tuning a model raises similar issues, but the work of fine-tuning lends itself to a potentially wider array of creative choices that underpin copyrightability claims. Are the generated outputs copyrightable, and if so, who is the author? The article works through four possibilities: (1) authors of works in the training data, (2) some entity in the generative-AI supply chain (e.g., the model trainer, model fine-tuner, or application developer), (3) the user who prompted the system, or (4) no one.

The article organizes its discussion of potential liability around each exclusive right’s application to each stage of the supply chain. The liability analysis covers reproductions in compiling and structuring the training datasets as well as in generating outputs.

After providing some interesting examples of “memorized” outputs (that is, outputs that are substantially similar to particular inputs) and techniques by which determined users can circumvent internal controls designed to prevent memorized outputs from being generated, the article usefully lays out the liability matrix for direct and indirect liability. The article also discusses why a provider of a deployed generative AI service may not be able to assert the Digital Millennium Copyright Act’s safe harbor protection.²

III. Concerns About the Fair Use Discussion

While the article generally does a good job of educating the reader about relevant copyright issues about which reasonable minds could disagree, it makes some assertions about fair use law that are reasonably contestable. Having previously acknowledged space for reasonable disagreement about whether models should be deemed to contain copies of individual works, the article should carry that point over to the fair use discussion. It should also remind the reader that fair use applies only if the elements of a prima facie case have first been met.

In the fair use discussion, I regard the case for treating the use of works in generative AI training as a transformative use as much stronger than the article suggests.

In addition, the article should distinguish between existing or emerging licensing markets for access and for uses. (See pp. 947-949 (elaborating on this point in the text and data mining context).) The market for access licensing, in my view, has no bearing on whether under the fourth fair use factor training on accessible works interferes with a market for use licenses.

IV. Conclusion

Despite these points of disagreement, this article makes a very valuable contribution to the literature. Its careful explanation of the technology and its readable application of relevant doctrine are impressive. As the law and technology continue to evolve, this article will stand out as a marker of where we were in the early days of copyright law’s application to generative artificial intelligence.

¹ See, e.g., pijipvideo, Copyright and Generative AI – Prof Michael Carroll and Assistant Prof Charles Duan, YouTube (Nov. 5, 2023).
² See 17 U.S.C. §§ 512(c)-(d).

Cite as: Michael W. Carroll, A Stepwise Approach to Copyright and Generative Artificial Intelligence, JOTWELL (January 14, 2025) (reviewing Katherine Lee, A. Feder Cooper, & James Grimmelmann, Talkin’ ’Bout AI Generation: Copyright and the Generative-AI Supply Chain, __ J. Copyright Soc’y U.S.A. __ (forthcoming, 2024), available at SSRN (July 27, 2023)), https://ip.jotwell.com/a-stepwise-approach-to-copyright-and-generative-artificial-intelligence/.
