Generative AI copyright litigation has produced some of the most technically complex expert witness questions in recent memory. The central dispute in most of these cases is whether training a large language model or image generation model on copyrighted works constitutes infringement, and whether the model's outputs infringe the copyrights in the training data. Both questions require technical analysis that goes well beyond what the existing copyright framework was designed to address.
How Generative Models Are Trained
Large language models and image generation models are trained on datasets containing billions of examples of text, images, or other media. During training, the model adjusts its internal parameters to minimize the difference between its predictions and the actual content in the training data. The result is a model that has, in some sense, learned from the training data, but the relationship between the training data and the model's parameters is not a simple copy or storage operation.
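The parameter-adjustment process described above can be illustrated with a deliberately tiny sketch. It is a toy under stated assumptions, not a description of how any production model is trained: the "model" is a hypothetical table of scores that predicts the next character from the previous one, and the function name `train_bigram`, the learning rate, and the step count are all chosen for illustration only.

```python
import math

def train_bigram(text, steps=200, lr=0.5):
    """Toy next-character model: fit a table of scores by gradient descent."""
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    n = len(vocab)
    logits = [[0.0] * n for _ in range(n)]  # the model's numerical parameters
    pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]
    losses = []
    for _ in range(steps):
        grads = [[0.0] * n for _ in range(n)]
        total = 0.0
        for prev, nxt in pairs:
            row = logits[prev]
            top = max(row)
            exps = [math.exp(v - top) for v in row]
            z = sum(exps)
            probs = [e / z for e in exps]
            # Cross-entropy: penalize low probability on the actual next character.
            total += -math.log(probs[nxt])
            for j in range(n):
                grads[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
        # Nudge every parameter in the direction that reduces the average loss.
        for i in range(n):
            for j in range(n):
                logits[i][j] -= lr * grads[i][j] / len(pairs)
        losses.append(total / len(pairs))
    return vocab, logits, losses

vocab, logits, losses = train_bigram("the cat sat on the mat")
print(f"loss before: {losses[0]:.3f}, loss after: {losses[-1]:.3f}")
```

What training leaves behind here is the `logits` table of numbers rather than a copy of the input string, although in a model this small the numbers track the input very closely.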
The model does not store the training data as a collection of retrievable copies. Instead, the training process produces a set of numerical parameters that encode statistical patterns learned from the data. These parameters are not themselves the training data, and in most cases it is not possible to reconstruct a given training example from the parameters alone, although some examples can be memorized and reproduced. This technical fact is central to the copyright analysis: the question is not simply whether the model stores the training data but whether the training process itself constitutes copying and whether the model's outputs are substantially similar to the training data.
The Training Data Infringement Question
The question of whether training on copyrighted data constitutes infringement turns primarily on the fair use analysis. The four fair use factors are the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the original work.
The purpose and character factor has been the most contested in generative AI cases. Defendants argue that training is a transformative use because the purpose is to extract statistical patterns, not to reproduce the expressive content of the works. Plaintiffs argue that training is not transformative because the model learns from and reproduces the expressive elements of the training data in its outputs.
The market effect factor is also significant. Plaintiffs argue that generative AI models compete with the market for the original works by producing substitutes. Defendants argue that the outputs of generative models are not substitutes for the original works because they are generated, not reproduced.
Key Technical Issues in Generative AI Copyright Cases
| Issue | Technical Question | Legal Relevance |
|---|---|---|
| Training data composition | What copyrighted works were included in the training dataset? | Establishes the scope of alleged infringement |
| Memorization | Does the model reproduce verbatim or near-verbatim content from training data? | Evidence of copying; relevant to substantial similarity |
| Extraction | Can training data be extracted from the model through prompting? | Evidence that the model stores and reproduces protected expression |
| Substantial similarity | Are the model's outputs substantially similar to specific training works? | Central to output infringement analysis |
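The training data composition question in the table can be illustrated with a minimal sketch, assuming the text of the dataset is available for inspection. The function names `fingerprint` and `find_works_in_dataset` are hypothetical, and the approach shown (exact matching on a hash of whitespace-normalized, lowercased text) is only one simple option: it would miss excerpts, reformatted copies, and partial inclusions.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the text after collapsing whitespace and lowercasing (an assumed normalization)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_works_in_dataset(works, dataset_docs):
    """Return the titles of works whose full text appears as a document in the dataset.

    works: dict mapping a title to the full text of the work.
    dataset_docs: iterable of document strings from the training dataset.
    """
    dataset_hashes = {fingerprint(doc) for doc in dataset_docs}
    return [title for title, text in works.items()
            if fingerprint(text) in dataset_hashes]
```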
Memorization and Extraction Analysis
One of the most technically significant issues in generative AI copyright litigation is memorization: the phenomenon by which a model reproduces verbatim or near-verbatim content from its training data in response to certain prompts. Memorization has been documented in large language models and image generation models, and it is directly relevant to the copyright analysis because it demonstrates that the model has, in some sense, stored and can reproduce protected expression.
Extraction analysis is the technical process of systematically probing a model to identify memorized content. Researchers have developed methods for extracting training data from language models by constructing prompts that cause the model to reproduce memorized sequences. The rate of memorization varies across models and depends on factors including model size, training data composition, and the number of times specific examples appeared in the training data.
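A simplified version of this kind of probing can be sketched as follows. It is a minimal illustration under stated assumptions, not a reproduction of any published extraction method: `generate` stands in for whatever interface the model under examination exposes, `make_toy_model` is a hypothetical stand-in that has "memorized" a fixed set of documents, and the prefix and match lengths are arbitrary.

```python
def probe_memorization(generate, documents, prefix_len=50, match_len=50):
    """Prompt with the start of each document and flag verbatim continuations.

    generate: callable taking a prompt string and returning the model's continuation.
    documents: candidate texts suspected of being in the training data.
    """
    hits = []
    for doc in documents:
        if len(doc) < prefix_len + match_len:
            continue  # too short to test at these settings
        prefix = doc[:prefix_len]
        expected = doc[prefix_len:prefix_len + match_len]
        continuation = generate(prefix)
        if continuation[:match_len] == expected:
            hits.append(doc)
    return hits

def make_toy_model(memorized):
    """Hypothetical stand-in for a model: completes only the texts it has memorized."""
    def generate(prefix):
        for doc in memorized:
            if doc.startswith(prefix):
                return doc[len(prefix):]
        return ""
    return generate
```

A hit under this test shows that the model completed the passage verbatim when prompted with its opening. A miss does not show the absence of memorization, since other prompts or decoding settings might still elicit the text.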
For copyright plaintiffs, memorization analysis can provide direct evidence that the model reproduces protected expression. For defendants, the analysis can be used to demonstrate that memorization is limited and does not extend to the specific works at issue. Expert testimony on memorization and extraction requires deep technical expertise in large language model architecture and training dynamics.
Output Infringement Analysis
A separate but related question is whether the outputs of generative AI models infringe the copyrights in the training data. This question requires a substantial similarity analysis: are the model's outputs substantially similar to specific copyrighted works in the training data?
For text models, substantial similarity analysis requires comparing the model's outputs to specific training works and assessing whether the similarities are in the protected expression or in unprotectable elements such as ideas, facts, or style. For image models, the analysis requires comparing the visual characteristics of the model's outputs to specific training images.
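For text, the comparison step might begin with simple overlap measurements such as the ones sketched below. This is an illustrative sketch only: the function names are hypothetical, and measures like these quantify shared wording, not the legal test of substantial similarity. They cannot by themselves separate protected expression from unprotectable ideas, facts, or style.

```python
from difflib import SequenceMatcher

def word_ngrams(text, n):
    """Set of all runs of n consecutive words, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(output, work, n=5):
    """Fraction of the output's word n-grams that also occur in the work."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & word_ngrams(work, n)) / len(out_grams)

def longest_shared_run(output, work):
    """Length, in words, of the longest word sequence common to both texts."""
    a = output.lower().split()
    b = work.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size
```

A high n-gram overlap or a long shared run flags passages for closer review. Whether the shared material is protectable remains a separate question for the legal analysis.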
The substantial similarity analysis for generative AI outputs is complicated by the fact that generation is typically probabilistic: when outputs are sampled, the same prompt can produce different outputs on different runs, and the relationship between a specific output and a specific training work may be indirect. Expert testimony on output infringement requires both technical expertise in generative model architecture and familiarity with the substantial similarity framework.
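The run-to-run variation described above can be illustrated with a minimal sketch of temperature sampling over a single generation step. The scores (`logits`), the temperature values, and the function names are assumptions made for illustration. Real systems apply a step like this repeatedly, once per token, and may use additional decoding settings not shown here.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick one token index from hypothetical model scores (logits)."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, so the result is repeatable.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [v / temperature for v in logits]
    top = max(scaled)
    weights = [math.exp(v - top) for v in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def output_frequencies(logits, temperature, runs, seed=0):
    """Share of runs in which each token was chosen for the same prompt."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(runs):
        counts[sample_token(logits, temperature, rng)] += 1
    return [c / runs for c in counts]
```

Because a single run shows only one draw from the distribution, an analysis of whether a prompt yields a particular output may need to report how often it does so across many runs and under which decoding settings.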
The Northern District of California
The Northern District of California has been the primary venue for generative AI copyright litigation, in part because many of the major AI companies are headquartered in the district. Several significant cases are currently pending in the district, including cases involving large language models, image generation models, and code generation models.
The district's courts have begun to develop a body of case law on the technical and legal issues in generative AI copyright cases, including rulings on the scope of discovery for training data documentation, the admissibility of expert testimony on memorization and extraction, and the application of the fair use factors to AI training.
AI Expert Witness Services provides technical expert support for attorneys handling generative AI copyright litigation, including training data analysis, memorization and extraction analysis, and output infringement analysis.