István ÜVEGES: Watermarking AI-generated Content; Solution or Pseudo-solution?

2023.10.13.

The rise of Generative Artificial Intelligence (GAI) inevitably leads to the proliferation of misleading content created with it. Largely in reaction to this, the explicit labeling of artificially generated content (text, video, audio) is increasingly becoming a matter of public discourse. Enforcing the regulations now in the pipeline will be a major challenge, as will identifying content generated through misuse of the technology.

Why does it all matter?

There is now a wide range of content that can be created using GAI services. Just think of chatbots that can imitate human language with deceptive accuracy (e.g., ChatGPT) or tools that can generate deepfake videos in hours. The content generated by such tools (especially text) is now fully capable of misleading users. On the one hand, it can pass for something created by a human being; on the other hand, the facts or images it contains can be taken as real.

This situation also provides opportunities for abuse, such as launching targeted disinformation campaigns or undermining the credibility of public figures. However, the detection of synthetic content is nowhere near as advanced as its generation. At the same time, thanks to the democratization of AI, more and more people can use GAI tools that are open source and, in many cases, easily usable even at a no-code level. In practice, a growing number of people can produce synthetic content with increasing ease (even with malicious intent), and the result will be indistinguishable from “original” content.

For someone to create misleading content, it is not even necessary to use GAI tools with malicious intent. Staying with the example of artificial text generation, one side effect of how today’s large language models (LLMs) operate is that they often hallucinate. This means that their output may be nonsensical, factually incorrect, or out-of-context text. The situation is complicated by the fact that while anyone can recognize, for instance, a grammatically incorrect sentence, incorrect information in a text written in perfect English is much harder to spot. This (sometimes erroneous) operation is a consequence of statistical modeling, whereby language models give the most likely word sequences for the given context as a response, regardless of their truthfulness or correctness in the given situation.
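The statistical behavior described above can be illustrated with a deliberately tiny sketch. It is not how a real LLM works internally: it is a toy bigram model, and the corpus, function names, and example sentences are all hypothetical. It only shows that choosing the most frequent continuation says nothing about whether that continuation is true.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: the false statement occurs more often than the true one.
CORPUS = [
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is canberra",
]


def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


MODEL = train_bigrams(CORPUS)
# The model prefers the more frequent continuation, even though it is factually wrong.
print(most_likely_next(MODEL, "is"))  # prints "sydney"
```

The output is fluent and statistically well founded given the training data, yet incorrect, which is the pattern the paragraph above calls hallucination.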

Why is this a problem?

The Interim Administrative Measures for Generative Artificial Intelligence Services, which recently entered into force in China, set requirements for all service providers that deliver GAI-generated content within the country. Under the regulation, specific standards have been drawn up which, in the case of images, prescribe, for example, the informative text to be displayed, its size, and its positioning. It is also important that this information be clear and unambiguous, like “Generated by AI”. In addition to such explicit watermarks, the use of so-called implicit watermarks is also mentioned. These are invisible to the human eye but can be easily detected by software.
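To make the idea of an implicit watermark more concrete, here is a minimal sketch of one very simple technique: hiding a short marker in the least significant bit of each pixel value. This is an illustrative assumption, not the method prescribed by the regulation or used by any particular provider; the function names, the "AI" marker, and the 16-pixel grayscale "image" are all hypothetical.

```python
def text_to_bits(text):
    """Convert ASCII text to a list of bits, most significant bit first."""
    return [(byte >> shift) & 1
            for byte in text.encode("ascii")
            for shift in range(7, -1, -1)]


def bits_to_text(bits):
    """Convert a list of bits (length divisible by 8) back to ASCII text."""
    values = []
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        values.append(byte)
    return bytes(values).decode("ascii")


def embed_watermark(pixels, bits):
    """Return a copy of `pixels` with `bits` written into the lowest bit of each value."""
    if len(bits) > len(pixels):
        raise ValueError("not enough pixels to hold the watermark")
    marked = list(pixels)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit  # clear the lowest bit, then set it
    return marked


def extract_watermark(pixels, length):
    """Read `length` bits back from the lowest bit of the first pixels."""
    return [p & 1 for p in pixels[:length]]


MARK = "AI"  # hypothetical marker
IMAGE = [200, 201, 57, 58, 120, 121, 33, 34, 90, 91, 14, 15, 250, 251, 76, 77]

MARKED = embed_watermark(IMAGE, text_to_bits(MARK))
RECOVERED = bits_to_text(extract_watermark(MARKED, 8 * len(MARK)))
MAX_CHANGE = max(abs(a - b) for a, b in zip(IMAGE, MARKED))

print(RECOVERED)   # the marker read back by software
print(MAX_CHANGE)  # largest brightness change of any pixel
```

Each pixel changes by at most one brightness level out of 256, which is why such a mark is invisible to the eye while trivial for software to read. A naive scheme like this does not survive recompression, resizing, or deliberate removal, so practical watermarks would need to be considerably more robust.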

Similar initiatives can be seen from the leading US AI companies, and a similar regulation is emerging in the European Union. While these expectations sound good on paper, they will not be easy to implement in practice. Products from big tech companies are usually in the spotlight, so it is hard to imagine them getting around the regulation. However, the situation is completely different for individual users.

As briefly mentioned above, a recent trend in AI development is the increasing popularity of open-source solutions. These are primarily aimed at increasing transparency and confidence in the technology. However, such solutions are (by their very nature) available to anyone. Let’s assume that synthetic content tagging will be mandatory in a few years in all countries where current leading AI developments are taking place[1]. The outputs of models created from then on will inevitably carry all the identifiers required by the regulation. However, content produced with earlier GAI solutions, which remain freely available, will stay untagged.

And in cases where GAI is used with malicious intent (e.g., automatically generated comments on social media to spread propaganda), we cannot in principle expect the creators to reveal themselves voluntarily.

What can we do?

Based on current trends, it will soon be even harder to tell whether what we see, read, or hear is real or fake, man-made or machine-made, and whether it is intended to mislead or inform. If we are to keep finding our way in such a world, regulating companies that operate within the legal framework is only one side of the coin. It is equally important to be prepared to identify synthetic content whose creator’s intention is precisely to avoid this identification.

Experiments are already underway to see how efficiently content produced by LLMs can be automatically identified. Worryingly, the results for artificially generated texts, for example, suggest that they cannot yet be reliably identified.

In the case of the above-mentioned implicit watermarks, it is also unclear exactly “who” will be able to credibly verify their presence. From the point of view of the average user, tools such as a plug-in running in a browser would be needed to automatically check all content displayed and raise an alert when synthetic text, video, or audio is detected. However, this is currently pure science fiction. But even if such a system were to be created, the proliferation of synthetic content to this extent raises entirely new social issues. For instance, how does the proliferation of watermarked content affect the perception of “original” content? It is easy to imagine a scenario in which the presence of large amounts of synthetic content erodes trust in any information available online.

Looking at the role of GAI in society today, it is easy to see that we are witnessing far-reaching technological and social processes for which we are not at all prepared. Synthetic content that is indistinguishable from the original has the potential to fundamentally challenge our perception of reality. If AI is to play the role it is meant to play, such as improving our lives or increasing economic productivity, we must prepare now for the upcoming challenges. To prevent GAI from becoming an uncontrollable runaway train, legislators and technology developers must find solutions together.

[1] This is mainly defined as the China – US – EU axis.

István ÜVEGES is a researcher in Computer Linguistics at MONTANA Knowledge Management Ltd. and a researcher at the Centre for Social Sciences, Political and Legal Text Mining and Artificial Intelligence Laboratory (poltextLAB). His main interests include practical applications of Automation, Artificial Intelligence (Machine Learning), Legal Language (legalese) studies, and the Plain Language Movement.
