In the past year, we witnessed a true generative AI revolution. New, user-friendly technology took the world by storm, leaving no turning back. Ground-breaking and revolutionary, it challenged not only the way business operations are run, but also the legal landscape, posing (seemingly) novel questions on copyright, data privacy, bias, and liability attribution.
In this article, we will examine how the current U.S. copyright law might treat machine learning on copyrighted data.
Disclaimer: AI and machine learning technology are not one-size-fits-all and have diverse structures and algorithms specific to the tasks they are programmed to solve. So, any discussion of the legal implications of artificial intelligence must be made on a case-by-case basis, considering the technology in question and its treatment of copyrighted materials.
Introduction to AI and Machine Learning
Artificial intelligence (“AI”) is a computer system designed to make predictions or decisions (almost) independently of a human coder. It carries out a variety of functions, including the generation of images, computer code, text, sound, video, and even music (“generative AI”).
To make the predictions, AI must go through machine learning, which involves processing enormous amounts of training data to identify patterns. As a rule, the more training data a model is given, the more precise and valuable its output tends to be. For example, the LAION-5B dataset consists of 5.85 billion image-text pairs. Often, those massive datasets contain copyrighted materials – photos, paintings, books, or computer source code.
Copyright Implications of Machine Learning
In general, copyright grants its holder the exclusive right to do the following with the copyrighted material:
- Make copies of the work.
- Prepare derivative works (create new matter based on the original copyrighted work).
- Distribute copies of the work to the public.
- Perform or display the work publicly.
Machine learning is likely to implicate the first two: the right to make copies of the work and the right to create derivative works. Since creating or using a dataset may technically involve making copies of copyrighted material, it may implicate the “reproduction of copies” aspect of the copyright. If the output data produced by the AI closely resembles one or several of the copyrighted materials in the training dataset (by incorporating them in some concrete form), that may implicate the right to create derivative works or make copies. Unless there is an applicable exception (such as the “fair use” doctrine), those are acts of copyright infringement.
There appears to be no case law in the U.S. yet that would directly address the use of copyrighted materials in machine learning.
Doe 1 v. GitHub, Inc. is probably the first in a string of copyright infringement class action lawsuits related to generative AI and is currently pending. This case raised an important issue: whether developing code-generating AI violates the DMCA and the open-source licenses attached to the code on which it was trained. Many open-source licenses condition the right to use the code on crediting the underlying software’s authors, and some also require making the resulting code publicly available for free. One of the problems the plaintiffs raised is that GitHub Copilot, the code-suggesting AI in question, makes it impossible for end users to learn which open-source license attaches to the code it was allegedly trained on, so compliance with the license terms becomes impossible.
Since then, lawsuits have been filed over generative AI trained on copyrighted literature, graphic art, and photographs, with little to no guidance available from the legislatures.
Generative AI has transformed the world economy, and one should not underestimate the far-reaching ramifications of the burgeoning AI-specific copyright case law. Although authors’ consent and compensation are paramount for the lawful use of their works in most cases, there is an exception to copyright law widely cited in the tech industry: training AI models on copyrighted data may constitute fair use.
In this article, we will examine in which cases AI developers may invoke fair use when training models on copyrighted works.
Fair Use – in General
The purpose of the fair use doctrine is to balance the protections copyright grants its owners with the greater social good and to promote creativity, education, and free speech. Fair use is an exception to copyright that allows the use of copyrighted materials without the owner’s consent for purposes such as criticism, comment, news reporting, teaching, scholarship, or research. Fair use is a mixed question of law and fact, which means that the finding of whether something constitutes fair use is case-specific. There are no areas where fair use is presumed. Procedurally, fair use is an affirmative defense (meaning that it is raised by the defendant in a copyright infringement suit), and the burden of proving fair use is on the defendant.
Fair Use Criteria
In deciding fair use cases, courts must consider the following four factors and weigh them together in light of the purposes of copyright:
- the purpose and character of the use, including whether it is commercial, transformative, and non-expressive;
- the nature of the copyrighted work;
- the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
- the effect of the use upon the potential market for or value of the copyrighted work.
1. Purpose and Character of Use
Commercial use, as opposed to not-for-profit, weighs against the finding of fair use. Courts presume commercial use if the purported infringer profits from exploiting the copyrighted material without paying the customary price to copyright owners. Hypothetically, this can occur if the AI owners charge end users money, host ads on the AI website/app, or otherwise profit from the AI (for example, by collecting and selling user data).
On the other hand, transformative use favors fair use. Use is transformative if it transforms the original work in some way, altering the original with new expression, meaning, or message. Transformative use may occur if it has a different purpose than the original work or constitutes copying for analysis or reverse engineering (“intermediate” copying). For example, it is fair use to copy a competitor’s computer program code to understand its unprotected functional elements and ensure compatibility of one’s new program with the competitor’s gaming console (Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992)). Importantly, the elements of the competitor’s code used in the new program must be functional and not creative.
Transformative use may occur if its result and the original copyrighted work serve different market functions (Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 591 (1994)). A transformative use offers something new and different from the original or expands the original’s utility, thus serving copyright’s overall objective of contributing to public knowledge (Authors Guild v. Google, Inc., 804 F.3d 202, 214 (2d Cir. 2015)). Search engines’ production of thumbnails or snippets of copyrighted books or images was found to be transformative fair use because it served a different function than the underlying creative content. For example, by providing image thumbnails in the image search feature, search engines did not engage in artistic expression, which is the prerogative of the underlying copyrighted content, but rather improved access to information on the Internet (Kelly v. Arriba Soft Corp., 336 F.3d 811, 819 (9th Cir. 2003)).
In cases where the end goal of machine learning is new functionality, the use is likely transformative. Some examples could be using the learned ability to recognize faces or types of objects in the pictures for purposes other than generating images, like narrating the surroundings for the blind. Another example could be analyzing text to learn to find and correct grammatical mistakes.
In its comments to the USPTO, OpenAI contended that including copyrighted material in datasets for machine learning is fair use because it is transformative “non-expressive intermediate copying.” According to OpenAI, unlike the original works’ “human entertainment” purpose, machine training has the purpose of learning “patterns inherent in human-generated media”.
However, it appears that not all machine learning on copyrighted data is inherently transformative for fair use purposes. Training AI on copyrighted works to create output that serves the same (aesthetic/expressive/entertaining) purpose as the training data is likely not transformative. Arguably, all the functional transformation in most generative AI stays “inside” the AI model and goes unnoticed by the end user (for example, images go into the training dataset – images come out). End users employ generative AI to produce content (art, computer code, prose, videos, music) that may or may not serve a purpose similar to that of the training data.
2. Amount and Substantiality of Portion Used in Relation to Full Copyrighted Work
For there to be a finding of fair use, the amount and substantiality of the portion used in relation to the copyrighted work as a whole should be reasonable in relation to the copying’s purpose. It weighs against fair use if the defendant used so much of the original copyrighted work that the result effectively made a “competing substitute” available to the public (Authors Guild v. Google, Inc., 804 F.3d 202, 214 (2d Cir. 2015)). An important factor is not just whether a lot was copied from a copyrighted work but also whether much of the resulting product consists only of the copied material (Campbell v. Acuff-Rose Music, Inc.).
In the case of machine learning, training datasets involve full works (the whole book, the whole image). Superficially, this may count against the finding of fair use. However, using full corpora of copyrighted works in the training datasets is likely reasonable in relation to the purpose of machine learning, which requires the analysis of whole works to learn the targeted patterns. A crucial factual factor to consider is the amount of a single piece of training data made available verbatim in the output (see our analysis of copyright implications of generative AIs producing recognizable portions of training data in another article).
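As a rough illustration of that factual question, the sketch below measures how much of a single training text reappears verbatim in a model output. It is a minimal example under stated assumptions, not a legal test: the function names `longest_verbatim_run` and `verbatim_share` are hypothetical, the comparison is character-by-character on plain text, and no court has adopted any particular numeric threshold.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(training_text: str, output_text: str) -> str:
    """Return the longest contiguous character run found in both texts."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long inputs, so every character is compared.
    matcher = SequenceMatcher(None, training_text, output_text, autojunk=False)
    match = matcher.find_longest_match(0, len(training_text), 0, len(output_text))
    return training_text[match.a : match.a + match.size]


def verbatim_share(training_text: str, output_text: str) -> float:
    """Fraction of the training text reproduced as one contiguous run."""
    if not training_text:
        return 0.0
    run = longest_verbatim_run(training_text, output_text)
    return len(run) / len(training_text)


if __name__ == "__main__":
    # Illustrative strings only; not taken from any real dataset or model.
    source = "the quick brown fox jumps over the lazy dog"
    generated = "a story: brown fox jumps over a fence"
    print(repr(longest_verbatim_run(source, generated)))
    print(round(verbatim_share(source, generated), 2))
```

A measurement like this only captures literal copying of text. It says nothing about paraphrase, style, or images, which would need different techniques.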
3. Effect of Use Upon Potential Market for or Value of Copyrighted Work
An essential prong in finding fair use is the effect of the use on (1) the potential market for or (2) the value of the copyrighted work used in machine learning and its derivative works. This means that fair use is unlikely if the AI output may “substitute” for the original work and compete in its market, harming the copyright owner’s ability to sell or license the original work.
The goal is to strike a balance between the benefit gained by the copyright owner when the copying is found to be an “unfair use” and the benefit gained by the public when the use is held to be fair.
With some companies already outsourcing their graphic art needs to AI, it is a reality that some generative AI competes with human authors, potentially negatively affecting the value of the copyrighted work on which it is trained. The factual inquiry will thus be into the relationship between the output data and the copyrighted training data (does training on a text result in a new text or in a function of correcting grammatical mistakes?).
One of the lawyers in Doe 1 v. GitHub, Inc., Matthew Butterick, argued that code-writing AI will potentially “starve” the open-source communities. According to him, it will remove the incentive for developers to discover and contribute to “traditional open-source communities” that made the creation and constant development of the open-source code possible, stifling the growth and development of open-source software. This is a potential argument for the finding of a negative market effect.
Conclusion?
(Un)fortunately for all stakeholders, there may not be a definitive answer as to whether machine learning on copyrighted material is generally fair use. Courts will need to consider the factors outlined in this article in deciding on each individual case.
Although the following measures may not absolve generative AI developers of copyright infringement claims, developers can consider structuring the machine learning process so that it does not make tangible copies of training data and analyzes only the non-expressive structural elements (pixels, parts of speech) directly from the source. Datasets could be designed to contain links to the training data and not reproductions of the copyrighted material. Solving a different problem or serving a different purpose than the copyrighted training material can make the finding of fair use more likely. Finally, putting controls in place to ensure the output does not include recognizable portions of expressive/creative elements of the training data may decrease the likelihood of copyright infringement.
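To make the idea of a link-based dataset concrete, here is a minimal sketch of a manifest that records where each work can be found rather than storing the work itself. Everything in it is an assumption for illustration: the `TrainingRef` record, its fields (including the `license_label` field), and the `to_manifest` helper are hypothetical names, and the example URL is a placeholder. Whether such a design changes the fair use analysis remains a case-specific question.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class TrainingRef:
    """A pointer to a work; the work's own bytes are never stored here."""

    url: str            # location of the work on its original host
    caption: str        # descriptive text paired with the work
    license_label: str  # hypothetical label, e.g. "CC-BY-4.0" or "unknown"


def to_manifest(refs: Iterable[TrainingRef]) -> str:
    """Serialize the references as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(ref), sort_keys=True) for ref in refs)


if __name__ == "__main__":
    refs = [
        TrainingRef(
            url="https://example.com/images/cat.jpg",  # placeholder URL
            caption="a cat sitting on a windowsill",
            license_label="CC-BY-4.0",
        )
    ]
    print(to_manifest(refs))
```

Keeping a license label next to each link also preserves the attribution information whose loss the plaintiffs in Doe 1 v. GitHub, Inc. complained about, although a label alone does not establish compliance with any license.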
The CommLaw Group Can Help!
Whether you developed AI without clarity on training data rights and need to assess your legal exposure or are considering investments in AI models or startups and require due diligence analysis on associated legal risks, our proficient and versatile team is here to assist. Reach out to us for comprehensive, client-focused solutions!
Attorneys:
Jonathan Marashlian – Tel: 703-714-1313 / E-mail: jsm@CommLawGroup.com
Michael Donahue – Tel: 703-714-1319 / E-mail: mpd@CommLawGroup.com
Linda McReynolds – Tel: 703-714-1318 / E-mail: lgm@commlawgroup.com
Ronald Quirk – Tel: 703-714-1305 / E-mail: req@CommLawGroup.com
Diana Bikbaeva – Tel: 703-663-6757 / E-mail: dab@commlawgroup.com