
What Should We Do with Synthetic Data? A Necessary Tool with Risky Side Effects

As AI systems grow more powerful, the role of synthetic data is coming under increasing scrutiny. Once seen as a niche solution, it is now a central tool in addressing data scarcity, privacy concerns, and training efficiency. But with its growing influence comes a set of risks that demand critical attention.

One reason synthetic data is receiving growing attention in 2025 is that generative models (e.g., ChatGPT or Gemini) now require such vast amounts of training data that it can no longer be sourced exclusively from human-created content. Several industry analyses, including the State of AI Report, have highlighted a growing concern: publicly available, human-authored text is becoming scarce. At the same time, the internet is being flooded with AI-generated content, which inevitably leads to training datasets containing a rising proportion of synthetic material. This raises the risk of a feedback loop in which AI no longer learns from the real world but instead repeats and reinforces its own past outputs.

Interest in synthetic data is not only a response to the depletion of human-generated data sources but also reflects a practical need. In many fields, the amount of real data available is limited from the outset, or the data lacks sufficient diversity. This latter challenge aligns with the feedback-loop distortion discussed above: whereas that problem involves the reprocessing of existing data, here the key issue is a lack of foundational data, which makes synthetic data generation necessary. This holds true especially in fields such as rare disease research, the prediction of security incidents, or the development of language technologies for low-resource language communities. In such cases, synthetic data offers an opportunity to expand datasets and may also be more favorable from a privacy and ethical perspective than directly processing sensitive personal information, such as health or financial data.

The quality and usability of synthetic data largely depend on how that data is generated. We can distinguish between two fundamental types. One category includes data that is entirely newly created by generative AI. The other involves data produced by modifying existing material, for example by replacing words with synonyms, inserting new words, or changing the order of sentences. This approach is known as data augmentation; a simple and widely used recipe for it is EDA (Easy Data Augmentation). It is important to recognize that the concept of synthetic data covers both types.
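As an illustration, the word-level operations just mentioned can be sketched in a few lines of Python. The synonym table and the example sentence below are toy stand-ins invented for this sketch; a real EDA pipeline would draw synonyms from a lexical resource such as WordNet:

```python
import random

# Toy synonym table; a real EDA pipeline would use a lexical resource such as WordNet.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "pleased"],
    "said": ["stated", "remarked"],
}

def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_swap(words, n, rng):
    """Swap two randomly chosen word positions, n times."""
    out = list(words)
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def eda_variant(sentence, rng):
    """Produce one augmented variant of a sentence."""
    words = sentence.split()
    words = synonym_replacement(words, n=1, rng=rng)
    words = random_swap(words, n=1, rng=rng)
    return " ".join(words)

rng = random.Random(42)
print(eda_variant("she said she was happy with the quick reply", rng))
```

Each call produces a slightly different variant of the same sentence, which is exactly how such techniques multiply a small labeled dataset.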

AI tools are now capable of generating texts, images, videos, or audio recordings that are not the product of human authorship. These can be found, for example, in the responses of customer service chatbots, automatically generated comments, profile pictures used in advertisements, or on social media platforms where entire profiles are artificially operated. Synthetic content like this is appearing in more and more everyday situations. It is important to note, however, that not all such data is automatically fed back into machine learning pipelines. At the same time, with current technological capabilities, the reliable identification of synthetically generated content is still in its infancy. As a result, this data may enter training datasets, whether intentionally or unintentionally. This becomes a serious issue when synthetic data is not properly labeled, or when no sufficiently accurate automated filtering solutions are in place. In such cases, machine learning models may begin learning from patterns that no longer reflect reality, but rather a simplified and distorted version of it. This can skew the learning process, reduce the model’s ability to generalize, and degrade the accuracy of applications, especially in tasks that rely on sensitive linguistic or contextual understanding.

All this highlights that synthetic data not only creates new types of content, but also actively shapes how AI perceives and represents the world. When a significant portion of training data is artificially generated, newer-generation models may develop a fragmented and unpredictable relationship with the reality they are meant to model.

In addition, generated data is often more homogeneous and formulaic than real-world linguistic patterns. This manifests in several ways: the vocabulary tends to be repetitive, sentence structures are less varied, and the data lacks linguistic creativity. Such data may replicate the surface features of human language use, but fails to capture its pragmatic, stylistic, or emotional richness. This can distort the learning process, as models are not exposed to the full range of variation that characterizes real-life linguistic interaction. It is like someone trying to learn a foreign language exclusively through textbook example sentences: they may learn the rules, but they will not understand how those rules function in real-life communication. The dynamics of natural speech, shifts in register, stylistic nuance, and cultural references may all be missing from the learning experience.
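The repetitiveness described above can even be measured crudely. The sketch below uses the type-token ratio, i.e., the share of distinct words among all words, on two invented example sentences; templated, formulaic text scores visibly lower than varied text:

```python
def type_token_ratio(text):
    """Distinct words divided by total words: a crude proxy for lexical diversity."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented examples: one varied sentence, one repetitive templated one.
varied = "the rain hammered the tin roof while gulls wheeled over the grey harbour"
templated = "the output is good the output is fine the output is good the output is fine"

print(round(type_token_ratio(varied), 2))     # higher: most words appear once
print(round(type_token_ratio(templated), 2))  # lower: the same phrases recur
```

Real diversity audits use richer measures (n-gram statistics, syntactic variety), but even this toy metric separates the two styles.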

The use of synthetic data is also significant from a privacy perspective, especially when sensitive information such as health or financial data needs to be processed. The “privacy by design” approach is becoming increasingly widespread, with more organizations choosing to use synthetic data as a default in their development processes, turning to real data only when necessary. This approach not only meets privacy expectations but is also aligned with the requirements of the GDPR and the AI Act. At this point, a core question surrounding synthetic data reemerges: how can we balance data privacy with the usability of data for analytical purposes? This challenge is commonly referred to as the privacy-utility trade-off. In essence, the more protective measures we apply to data, such as statistical distortion or anonymization, the less useful that data is likely to become for practical analysis. At the same time, this approach illustrates that, under controlled conditions, synthetic data can offer a real alternative, especially when the goal is to safeguard confidential information.
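The privacy-utility trade-off can be made concrete with a toy sketch. Below, a mean is released with Laplace noise in the style of differential privacy; the income figures, the epsilon values, and the assumed value range are all invented for illustration. A smaller epsilon means stronger privacy protection but a noisier, less useful estimate:

```python
import math
import random
import statistics

def noisy_mean(values, epsilon, value_range, rng):
    """Release a mean with Laplace noise: smaller epsilon = stronger privacy, larger error."""
    true_mean = statistics.fmean(values)
    sensitivity = value_range / len(values)   # how much one record can shift the mean
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                    # Laplace sample via inverse CDF
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_mean + noise

rng = random.Random(7)
incomes = [30_000, 42_000, 55_000, 61_000, 78_000]   # invented data
for eps in (0.1, 1.0, 10.0):
    est = noisy_mean(incomes, eps, value_range=100_000, rng=rng)
    print(f"epsilon={eps:>4}: estimated mean = {est:,.0f}")
```

Running this shows the estimate wandering far from the true mean at low epsilon and converging toward it as the privacy guarantee is relaxed, which is the trade-off in miniature.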

Generated data is only as good as the underlying models and the quality and diversity of the data used to train them. Poor input data can lead to flawed synthetic outputs. This is why the generation of synthetic data always requires careful preparation, thorough testing, and rigorous validation.
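One simple form such validation can take is comparing summary statistics of the real and synthetic data, feature by feature. The numbers and the 10% tolerance below are invented for this sketch; real validation would also examine correlations, rare categories, and downstream model performance:

```python
import statistics

def drift_check(real, synthetic, tolerance=0.10):
    """Flag whether synthetic values preserve the mean and spread of the real ones."""
    report = {}
    for name, fn in (("mean", statistics.fmean), ("stdev", statistics.stdev)):
        r, s = fn(real), fn(synthetic)
        rel_error = abs(r - s) / abs(r) if r else abs(s)
        report[name] = {"real": round(r, 2), "synthetic": round(s, 2),
                        "ok": rel_error <= tolerance}
    return report

real_ages = [23, 31, 35, 42, 48, 57, 63]        # invented "real" sample
synthetic_ages = [25, 29, 36, 44, 46, 55, 66]   # invented generated sample
for stat, row in drift_check(real_ages, synthetic_ages).items():
    print(stat, row)
```

A generator whose output fails such checks should be retrained or its output filtered before the data reaches any downstream model.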

When used properly, synthetic data can offer real benefits: it can reduce privacy risks, supplement incomplete datasets, and support more efficient model training. This is especially relevant in scenarios where real-world data is scarce or inaccessible due to ethical considerations. However, if it is generated or used without adequate oversight, it can easily distort the worldview learned by machine learning systems.

Synthetic data is therefore not just a technical tool. It also plays a role in shaping the substantive and methodological quality of machine learning. To be truly useful, it must be applied within frameworks that are deliberate, transparent, and grounded in sound professional practice.


István ÜVEGES, PhD is a Computational Linguist researcher and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment- and emotion analysis.