Copyright vs. Generative Artificial Intelligence: Questions and Answers? More like Just Questions.
The relationship between copyright and Generative AI (GAI) has been a hot topic since the technology's rise to prominence. The question has many aspects, from whether big tech companies have used copyrighted content to train their models without paying for it (spoiler: they apparently have), to who will be liable if the content generated by the models is infringing. Is it really the user's fault if they unknowingly generate infringing content, especially with a machine-learned model whose exact operation is sometimes difficult even for experts to understand?
Source: DALL·E 3.
The copyright situation for GAI applications is currently such a wild west that it is not uncommon for developer companies to make conflicting statements on the subject within a few days. According to recent reports, Google, OpenAI, and Microsoft, for example, take the position that if the output of their GAI models infringes copyright, the user is responsible. The reasoning, broadly, is that in such cases the infringing content is generated based on a prompt given by the user. This is consistent with the fact that OpenAI, for example, expressly waives any copyright claim of its own to content generated through its API.
In stark contrast to the above, Sam Altman recently announced at the OpenAI DevDay conference in San Francisco that the company is introducing an initiative called Copyright Shield. This is a provision in the terms of use for ChatGPT Enterprise and the OpenAI API, under which the company assumes all liability in the event of copyright infringement by content generated with those tools.
The interesting thing is that the companies above have themselves been repeatedly accused of using copyrighted content as training data for their models, all without any compensation. Just think of the reports that the New York Times was considering legal action against OpenAI for unauthorized use of the newspaper's content. From a moral point of view, it is highly questionable whether, against such an unclear background, blaming users is the best course of action in the copyright chaos surrounding GAI today.
However, there is another, more technical and relatively rarely discussed practical problem with this approach. We have already covered prompt engineering in a previous article. The essence of the technique is that the output of a model can be tailored to our own needs: for example, by specifying the expected style of a text (colloquial, literary, or press language) or other specific requirements. One of the reasons the method was developed is to make the most efficient practical use of the capabilities of models such as Large Language Models (LLMs).
However, it has another significance as well. Generative models are statistical in nature: their output is the result of a series of probabilities, so from the user's point of view it is not always possible to predict exactly what the response to a given request will be. To illustrate this, I asked ChatGPT to summarize the first 9 sentences of a recent BBC news report. The prompt contained some very specific instructions:
“Summarize the text enclosed with “”” punctuation marks in at most 2 sentences, like you were searching for similar results in a search engine! Do not just list search terms! Avoid mentioning specific persons, organizations, names of products, or geographic names!”
The prompt simulates a situation where you want to find past events very similar to a given event using a search engine. In this case, the point is not the specific details in the text, but a higher-level summary of the events. This is why it is necessary to produce a concise summary that does not focus on the specifics. For instance, in the article mentioned above, the most important event is the temporary outage of online services of a large tech company and the handling of the resulting complaints.
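To make the setup concrete, the experiment above could be reproduced with, e.g., the OpenAI Python client. The sketch below only shows how such a prompt might be packaged into a chat request; the model name, helper function, and straight-quote delimiters are illustrative assumptions, not details from the original test.

```python
# Minimal sketch of packaging the summarization prompt for a
# chat-completions-style API. Names and delimiters are illustrative.

INSTRUCTIONS = (
    'Summarize the text enclosed with """ punctuation marks in at most '
    "2 sentences, like you were searching for similar results in a search "
    "engine! Do not just list search terms! Avoid mentioning specific "
    "persons, organizations, names of products, or geographic names!"
)

def build_messages(article_text: str) -> list[dict]:
    """Combine the instructions and the article into one user message."""
    return [
        {"role": "user", "content": f'{INSTRUCTIONS}\n"""{article_text}"""'}
    ]

# The actual call would look roughly like this (requires an API key,
# and the model name here is a placeholder):
#
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4",
#     messages=build_messages(article_text),
# )
# summary = response.choices[0].message.content
```

Note that nothing in the request can *enforce* the constraints; they are just text the model is statistically inclined, but not guaranteed, to follow.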
In fact, the resulting summary consistently violated the explicit criteria set in the prompt:
“Over 1,500 users in the UK faced issues with Teams and Xbox Live, impacting some who couldn’t make payments for the latest Call of Duty game. Microsoft addressed the problem, attributing it to an artificial surge in synthetic network traffic, and after implementing configuration changes, confirmed the resolution following service monitoring.”
It is undisputed that the summary captures the essence of the article, but it names both specific services and the exact geographical location. The interesting point is not really that ChatGPT’s answer is unsuitable for the task for which it was generated. The example highlights that, as a user, I received a response that was in some ways unexpected, since part of it contradicted the instructions I had given.
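This unpredictability is a direct consequence of how decoding works: at each step, the model produces a probability distribution over its vocabulary and samples a token from it. The following self-contained sketch uses a toy vocabulary and made-up logits (not a real model) to show why the same prompt can yield different outputs on different runs.

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float,
                 rng: random.Random) -> str:
    """Sample one token from a softmax distribution over toy logits.

    Low temperatures sharpen the distribution toward the most likely
    token; higher temperatures make unlikely continuations more probable.
    """
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    r = rng.random() * total
    for tok, e in exps.items():
        r -= e
        if r <= 0:
            return tok
    return tok  # fallback for floating-point edge cases

# Made-up next-token logits after a prefix like "Over 1,500 users in the ..."
logits = {"UK": 3.0, "region": 2.5, "country": 2.0}

rng = random.Random(42)
# At a near-zero temperature the choice is effectively deterministic:
greedy = sample_token(logits, temperature=0.01, rng=rng)
# At temperature 1.0, repeated sampling yields different tokens, which is
# why identical prompts do not always produce identical responses:
samples = {sample_token(logits, temperature=1.0, rng=rng) for _ in range(50)}
```

Even with a deterministic seed, nothing in the sampling step "understands" the instruction to avoid geographic names; "UK" is simply a high-probability continuation, and no prompt wording can reduce its probability to exactly zero.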
This is related to copyright in the sense that, if big tech companies want to hold users liable for generating infringing content, it would be reasonable to expect that users can reliably avoid generating such content inadvertently. However, today’s models are nowhere near that predictable. This is illustrated by the fact that even Altman’s announcement above does not claim to prevent the creation of such content, only that the company will sort out any legal problems that may arise.
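If liability is to rest with users, one defensive option (not something the vendors currently offer, to my knowledge) is to check generated text against the prompt's constraints before using it. A minimal sketch, assuming a hand-maintained list of forbidden terms; in practice, the list would come from a named-entity recognizer or a curated blocklist, and naive substring matching would need refinement.

```python
def violates_constraints(summary: str, banned_terms: list[str]) -> list[str]:
    """Return the banned terms that appear in the summary.

    Uses naive case-insensitive substring matching, so short terms
    (e.g. "UK") can false-positive inside longer words; a production
    check would match on token boundaries instead.
    """
    lower = summary.lower()
    return [term for term in banned_terms if term.lower() in lower]

# Terms the prompt forbade but the generated summary quoted above contained:
banned = ["UK", "Teams", "Xbox Live", "Call of Duty", "Microsoft"]

summary = (
    "Over 1,500 users in the UK faced issues with Teams and Xbox Live, "
    "impacting some who couldn't make payments for the latest Call of Duty game."
)
hits = violates_constraints(summary, banned)  # non-empty: the check fails
```

Of course, such a filter only shifts the problem: it can flag known names after the fact, but it cannot tell whether a passage reproduces copyrighted text it has never seen.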
It is also worth noting that if GPT models have received copyrighted texts as training data, it is practically impossible to remove that data afterwards. The only full remedy would be to retrain the entire model from scratch, which according to earlier statements could cost up to $100 million; this is therefore unlikely to happen in practice.
Ultimately, the question is: who should be held liable in such a case? In my opinion, blaming users is unethical and highly problematic. It is true that malicious or abusive use is always to be expected, but it is not the average user who should be punished for it. Rather, the solution is that the almost limitless data needs of generative AI should only be met with training data whose legal status is settled. At the time of writing, a precedent-setting decision that could resolve such situations is still awaited. What is certain, however, is that the need to resolve these problems is becoming more urgent as GAI becomes more and more pervasive.
István ÜVEGES is a researcher in Computer Linguistics at MONTANA Knowledge Management Ltd. and a researcher at the Centre for Social Sciences, Political and Legal Text Mining and Artificial Intelligence Laboratory (poltextLAB). His main interests include practical applications of Automation, Artificial Intelligence (Machine Learning), Legal Language (legalese) studies and the Plain Language Movement.