Mónika MERCZ: Unlocking the key to safer AI – the necessary data criteria in light of the draft AIA

Nowadays we hear numerous conversations about Artificial Intelligence, with a specific focus on the European Union's draft AI Act. However, when we discuss this issue, it is not just the technology itself that presents challenges. To my mind, setting up several categories of AI technologies based on their risk factor – as the current form of the aforementioned Act does – is simply not enough. The central problem is left open and needs further investigation: namely, the quality of the data used to train deep learning algorithms and to guide programmers toward a high level of safety, transparency and accuracy when creating an AI that is appropriate in relation to the draft AIA.

This aspect of AI is crucial for the goals of the digital plan on the future of AI (and, dare I say, for the future of the EU as well, which seems deeply intertwined with technological advances). Companies would also be wise to take these new requirements into account in their plans: OpenAI, the creator of the infamous ChatGPT, has already come under fire for training its chatbot on data scraped from the web. Critics argue that this practice violated the rights of millions of internet users, whose data was taken without consent and from which the company made unbelievable amounts of money, while the data subjects were left without any compensation – or any choice in how their data was used. Of course, when it comes to big data, there will always be concerns – but the EU is seemingly trying its best to tackle them.

The EU aims to regulate these huge companies and to become a leader in AI, though with a different approach from countries such as the US and China. With the Act on Artificial Intelligence specifically, the focus must be placed heavily on data quality, because that is the key to ensuring that an AI system is indeed safe: that it was trained in a manner which did not violate the rights of data subjects, and that the technology we will gradually use in every aspect of our lives is indeed trustworthy.

But what is data quality? It means measuring how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose, and it is critical to all data governance initiatives within an organization. It must be investigated how reliable a particular set of data is and whether it will be good enough for a user to employ in decision-making. The reason behind its crucial nature is not just AI governance and the goals of the EU: poor data quality costs organizations an average of USD 12.9 million each year and, over the long term, increases the complexity of data ecosystems and leads to poor decision-making. As companies integrate artificial intelligence and automation technologies into their workflows, the effectiveness of these tools will depend largely on high-quality data. Therefore, companies must improve their data quality if they wish to be successful in the EU market in the future.
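To make the dimensions listed above more concrete, they can be expressed as simple programmatic checks over a dataset. The following is a minimal sketch, not a reference to any tool named in the draft AIA; the records, field names and thresholds are entirely hypothetical.

```python
from datetime import datetime

# Hypothetical records; one dict per data subject.
records = [
    {"id": 1, "email": "a@example.com", "age": 34, "updated": "2024-01-10"},
    {"id": 2, "email": "b@example.com", "age": 29, "updated": "2023-11-02"},
    {"id": 2, "email": "b@example.com", "age": 29, "updated": "2023-11-02"},  # duplicate key
    {"id": 3, "email": None,            "age": 151, "updated": "2019-05-20"},  # missing email, implausible age
]

def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    return sum(1 for r in rows if r.get(field) not in (None, "")) / len(rows)

def validity(rows, field, predicate):
    """Share of rows whose value passes a domain rule (e.g. a plausible age)."""
    return sum(1 for r in rows if r.get(field) is not None and predicate(r[field])) / len(rows)

def uniqueness(rows, field):
    """Share of distinct key values among all rows (1.0 means no duplicates)."""
    return len({r[field] for r in rows}) / len(rows)

def timeliness(rows, field, cutoff):
    """Share of rows last updated on or after a cutoff date."""
    return sum(1 for r in rows if datetime.fromisoformat(r[field]) >= cutoff) / len(rows)

report = {
    "completeness(email)":  completeness(records, "email"),
    "validity(age 0-120)":  validity(records, "age", lambda a: 0 <= a <= 120),
    "uniqueness(id)":       uniqueness(records, "id"),
    "timeliness(since 2023)": timeliness(records, "updated", datetime(2023, 1, 1)),
}
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Each score falls between 0 and 1; an organization would then decide, per dimension, what threshold counts as "fit for purpose" for its own data governance programme.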

Recital (44) of the draft AI Act sets out the following requirements with regard to this issue: “High data quality is essential for the performance of many AI systems, especially when techniques involving the training of models are used, with a view to ensure that the high-risk AI system performs as intended and safely and it does not become the source of discrimination prohibited by Union law. High quality training, validation and testing data sets require the implementation of appropriate data governance and management practices. Training, validation and testing data sets should be sufficiently relevant, representative and free of errors and complete in view of the intended purpose of the system. They should also have the appropriate statistical properties, including as regards the persons or groups of persons on which the high-risk AI system is intended to be used. In particular, training, validation and testing data sets should take into account, to the extent required in the light of their intended purpose, the features, characteristics or elements that are particular to the specific geographical, behavioural or functional setting or context within which the AI system is intended to be used. In order to protect the rights of others from the discrimination that might result from the bias in AI systems, the providers should be able to process also special categories of personal data, as a matter of substantial public interest, in order to ensure the bias monitoring, detection and correction in relation to high-risk AI systems.”

This particular provision explains what needs to be done to satisfy the legal requirements for AI systems employed by companies – and it will mean a great deal of work for them, possibly even hampering their competitiveness given such high requirements. I firmly believe that data governance issues will be crucial, especially as companies attempt to comply with the new legislation. Article 10(2) of the draft AI Act elaborates on this issue when it states that “Training, validation and testing data sets shall be subject to appropriate data governance and management practices. Those practices shall concern in particular, (a) the relevant design choices; (b) data collection; (c) relevant data preparation processing operations, such as annotation, labelling, cleaning, enrichment and aggregation; (d) the formulation of relevant assumptions, notably with respect to the information that the data are supposed to measure and represent; (e) a prior assessment of the availability, quantity and suitability of the data sets that are needed; (f) examination in view of possible biases; (g) the identification of any possible data gaps or shortcomings, and how those gaps and shortcomings can be addressed.” Additionally, the Act states that “Training, validation and testing data sets shall be relevant, representative, free of errors and complete. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on which the high-risk AI system is intended to be used.” (Draft AI Act, Article 10(3)) These are indeed high expectations of companies, and only time will tell whether the future supervisory authorities will take these requirements seriously or be lenient.

In addition to the ever-rising significance of data governance, a few other key concepts should be noted in this discussion. While data quality, as mentioned above, is a broad category of criteria that organizations use to evaluate their data for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose, data integrity focuses only on accuracy, consistency, and completeness, and on implementing safeguards to prevent data corruption by malicious actors. Data profiling, on the other hand, is the process of reviewing and cleansing data to maintain data quality standards within an organization; the term can also encompass the technology that supports these processes.
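The review-and-cleanse cycle of data profiling can be sketched in a few lines. Again, this is only an illustration under assumed inputs – the column values and normalisation rules below are hypothetical, not drawn from any standard or from the draft AIA.

```python
from collections import Counter

# Hypothetical raw column, as a profiling tool might receive it.
column = ["alice@example.com", "BOB@EXAMPLE.COM ", None, "alice@example.com", ""]

def profile(values):
    """Summarise a column: row count, missing values, distinct values, most common value."""
    non_missing = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "missing": len(values) - len(non_missing),
        "distinct": len(set(non_missing)),
        "top": Counter(non_missing).most_common(1),
    }

def cleanse(values):
    """Normalise obvious defects: trim whitespace, lowercase, drop empty entries."""
    return [v.strip().lower() for v in values if v not in (None, "")]

print(profile(column))           # the review step: surfaces 2 missing values
print(profile(cleanse(column)))  # after cleansing: duplicates collapse to 2 distinct addresses
```

Profiling before and after cleansing, as done here, is what lets an organization show that a quality standard is actually being maintained rather than merely declared.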

These processes will hopefully contribute to the accurate processing of data, reliable decision-making, and an overall lower risk of using AI, not just for business purposes but in general. It will also be particularly intriguing to see how the GDPR’s value increases in the coming years, given that datasets are the backbone of AI technologies. To my mind, data quality should be put first especially because, due to the black-box nature of AI technologies, we have no real insight into how an AI develops its goals. Models sometimes pursue goals their designers did not intend, which creates an alignment problem, where advanced deep learning models could pursue dangerous goals. While there are currently several ideas on how this problem could be solved, the easiest and most reliable thing we can do to avoid misaligned goals is to train models on high-quality data sets. It might not solve every problem, but it would undoubtedly help to reduce the risks raised both in the draft AI Act and by scientists.

The draft AI Act contains several principles that we can interpret in favour of this goal, including the aim to minimise the risk of algorithmic discrimination, in particular in relation to the design and the quality of data sets used for the development of AI systems, complemented by obligations for testing, risk management, documentation and human oversight throughout the AI systems’ lifecycle. (Explanatory memorandum, 1.2.) Under the requirement of proportionality, for high-risk AI systems the requirements of high-quality data, documentation and traceability, transparency, human oversight, accuracy and robustness are strictly necessary to mitigate the risks to fundamental rights and safety posed by AI. (Draft AI Act, 2.3, Legal basis, subsidiarity and proportionality) How well these regulations will be implemented in practice remains to be seen – but seeing that humanity’s safety is at stake, I am hopeful that the long-term risks may be mitigated if most countries gradually start to regulate AI as well.

Ultimately, I must say that categorising AI on a risk-based approach was necessary to avoid outcomes such as digital dictatorship and the complete erasure of privacy. However, as this technology could very easily have disastrous consequences for humanity, data quality and the other high expectations that come with enforcing the Act must come first – even at the cost of competitiveness.

Mónika MERCZ, JD, specialized in English legal translation, Professional Coordinator at the Public Law Center of Mathias Corvinus Collegium Foundation while completing a PhD in Law and Political Sciences at the Károli Gáspár University of the Reformed Church in Budapest, Hungary. She is an editor of Constitutional Discourse. Mónika’s past and present research focuses on constitutional identity in EU member states, data protection aspects of DNA testing, environment protection, children’s rights and Artificial Intelligence.

