“Blind” Models, Invisible Biases: The Limits of Algorithmic Fairness
Modern machine learning systems have become part of our social infrastructure, which means that the biases they transmit are not just technical glitches but real legal and ethical risks. In practice, bias often persists even when protected attributes are formally removed from models, because information related to those attributes survives through proxy variables and through the way the data are generated. The discussion below examines how this gives rise to hidden, hard-to-detect forms of bias and why the fairness through unawareness approach does not offer a real solution.
By 2025, AI-based systems not only serve as decision-support tools in credit scoring, applicant screening, or criminal justice, but also form the core infrastructure of online platforms, recommendation systems, content moderation, and large-language-model-based assistants. These models are embedded in complex organizational and social processes, and their decisions directly affect rights, financial situations, access to information, and, more broadly, people’s social opportunities.
In general, algorithmic bias is present when a system consistently produces worse outcomes for some groups than for others, and these differences cannot be explained by relevant, professionally justified distinctions between the groups. The key issue is the existence of a systematic pattern. Any model can make mistakes, and this is not surprising. It becomes problematic when these errors persistently affect the same groups and, as a result, the system reinforces existing social inequalities. In such cases, we are not dealing with mere random errors but with a form of structural bias, which is the real concern.
This kind of bias can enter the system at several points: when data are generated, during labeling, in the training of the model, and when decisions are applied in practice. Input data may come from an environment that is already biased, labeling may reflect human assumptions, and the choice of objective function also encodes an implicit value judgment about what we consider the primary performance criterion to optimize. It is not the same whether we maximize overall accuracy, penalize false negatives or false positives more heavily, or include explicit fairness regularization terms in the loss function. These choices determine whose risks, burdens, and benefits we consider acceptable, and they express normative commitments that are hidden behind technical parameters.
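As a rough illustration, the difference between these objectives can be made explicit in code. The sketch below is a minimal example written for this article; the function and variable names are assumptions, not taken from any specific system or library. The same classifier can be trained against a plain accuracy-oriented loss, a loss that weights false negatives or false positives differently, or a loss with an added fairness penalty.

```python
import numpy as np

# Minimal sketch of three objective choices for a binary classifier.
# y: true labels (0/1), p: predicted probabilities, a: group membership (0/1).

def weighted_log_loss(y, p, fn_weight=1.0, fp_weight=1.0):
    """Accuracy-oriented log loss; fn_weight > fp_weight penalizes missed
    positives (false negatives) more heavily, and vice versa."""
    eps = 1e-12
    return -np.mean(fn_weight * y * np.log(p + eps)
                    + fp_weight * (1 - y) * np.log(1 - p + eps))

def demographic_parity_gap(p, a):
    """Absolute difference in average predicted score between the two groups."""
    return abs(p[a == 1].mean() - p[a == 0].mean())

def fairness_regularized_loss(y, p, a, lam=1.0):
    """Log loss plus an explicit fairness penalty; lam encodes how much
    predictive performance we are willing to trade for parity."""
    return weighted_log_loss(y, p) + lam * demographic_parity_gap(p, a)
```

Which of these is the “right” objective is not a purely technical question: each choice distributes errors, and therefore burdens, differently across groups.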
The following discussion focuses on a narrower issue: the still widespread idea that it is sufficient to simply remove sensitive attributes (such as gender, ethnicity, or age) from the model’s inputs in order to make the algorithm “blind” to these protected characteristics. The approach known as fairness through unawareness may be intuitively appealing, but on its own it is not suitable for preventing discriminatory outcomes.
The reason is that models learn from many features that are interrelated and that often reflect underlying social structures. If members of a particular ethnic group live in higher proportions in certain regions, then an address can in itself be a strong signal about ethnicity. If the labor market is segregated, occupation or industry may indirectly reveal a person’s gender. If wages are strongly correlated with ethnicity or gender, income variables will likewise carry the imprint of protected characteristics. In such cases, simply removing the protected attribute does not eliminate the associated information. The model can still learn patterns that effectively relate to protected characteristics, based on other features that are closely correlated with them.
In economics and statistics, features of this kind are called proxy variables: they do not directly measure the attribute of interest, but they move closely together with it. From a machine learning perspective, this means that a model can infer information about protected attributes through proxies, even if it never receives those attributes explicitly as inputs. A classic example is the ZIP code. On its own it does not record anyone’s ethnicity, yet in many countries it is a good approximation of the social and ethnic composition of a neighborhood. If, for instance, a credit scoring model learns that certain ZIP codes are consistently associated with higher risk, it may in practice end up differentiating between applicants along ethnic lines, even without any explicit protected attribute in the data.
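The mechanism is easy to reproduce on synthetic data. The sketch below is a hypothetical example; the variable names and the strength of the correlations are assumptions made for illustration, not real credit data. It trains a probe model that tries to reconstruct the protected attribute from the “neutral” features a scoring model would actually see. If the probe performs well above chance, the protected information is still present in the feature set even though the attribute itself was removed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: group membership (the protected attribute) is never given
# to the scoring model, but residential area and income correlate with it.
group = rng.integers(0, 2, n)                    # protected attribute
area = (group + (rng.random(n) < 0.2)) % 2       # ZIP-like proxy, 20% "noise"
income = rng.normal(50 + 10 * group, 8, n)       # income shifted by group

X = np.column_stack([area, income])              # features the model actually sees

# Leakage test: how well can the protected attribute be reconstructed
# from the supposedly neutral features alone?
X_tr, X_te, g_tr, g_te = train_test_split(X, group, random_state=0)
probe = LogisticRegression().fit(X_tr, g_tr)
auc = roc_auc_score(g_te, probe.predict_proba(X_te)[:, 1])
print(f"Protected attribute recoverable with AUC of about {auc:.2f}")
```

An AUC well above 0.5 means that dropping the attribute from the inputs has not removed the information it carries.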
This implies that bias in a model is not tied to a single parameter or feature that can be easily “switched off.” Instead, it is distributed across the entire data-generating process and the relationships among variables. In practice, many systems show good overall accuracy, and their decision rules do not explicitly refer to protected attributes, yet bias is still present and only becomes visible when we examine how the data were generated and how errors are distributed across groups. It is precisely this hidden, distributed nature that makes algorithmic bias particularly difficult to detect.
Risk assessment algorithms used in criminal justice provide a clear example. In such systems, bias often relates to the construction of the ground truth, that is, to what is treated as the factual reference point. Arrest and conviction data also reflect where police presence is more intensive, which offenses are more likely to be reported, and where investigations are more successful. The widely discussed COMPAS case shows that a tool for predicting recidivism risk can produce race-based disparities even if the algorithm itself does not contain race as an input variable. The system was trained on historical data that already embodied selective patterns of policing, so differences in its outputs cannot be clearly separated from the unequal distribution of surveillance and enforcement.
Similar dynamics can be observed in credit scoring. If a bank has been less willing to extend credit to certain groups over a long period of time, this can lead to persistent differences in wealth, housing, and employment. A later model may formally rely only on “objective” features such as income, occupation, and address, but these variables strongly encode the consequences of past decisions. As a result, the system can indirectly reproduce historically entrenched disparities, while to the user it appears to be “simply following the numbers.”
In facial recognition systems, bias often appears already at the data collection stage. If white male faces are overrepresented in the training set and other groups are underrepresented, the model will primarily recognize the former with high accuracy. Overall accuracy can still be high, so the system may appear reliable. The hidden bias only becomes visible when we examine error rates by group and find that members of certain groups are misidentified far more often, or not recognized at all.
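A deliberately simplified numerical illustration (the figures below are invented for this sketch, not measurements from any real system) shows how a high overall accuracy can coexist with a large gap between groups:

```python
import pandas as pd

# Hypothetical evaluation log: 900 records from group A, 100 from group B.
df = pd.DataFrame({
    "group":   ["A"] * 900 + ["B"] * 100,
    "correct": [1] * 882 + [0] * 18 + [1] * 65 + [0] * 35,
})

print("Overall accuracy:", df["correct"].mean())   # 0.947 -> looks reliable
print(df.groupby("group")["correct"].mean())       # A: 0.98, B: 0.65
```

A single aggregate figure of 94.7% says nothing about the fact that the smaller group faces an error rate many times higher than the dominant one.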
All of this suggests that algorithmic bias is the digital imprint of social structures and institutional practices, rather than a mere technical side effect. If fairness is taken seriously, it is not enough to state that the model does not use gender or ethnicity as inputs. It is necessary to analyze how the data are generated, how different groups are represented, and how error patterns are distributed, and it is equally important to make explicit which notion of fairness the system is intended to reflect.
Today there are many, partly incompatible, formal definitions of fairness. Group-based criteria such as demographic parity or equalized odds compare the distribution of decisions or errors across protected groups, while individual fairness approaches require that similar individuals be treated in similar ways. Several impossibility results show that certain of these criteria cannot be satisfied simultaneously; for example, when base rates differ across groups, a classifier cannot be both well calibrated and have equal error rates for all groups. This means that it is always a matter of value judgment which definition we prioritize and what trade-offs we are willing to accept between accuracy and equity.
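The group-based criteria mentioned above can be stated very compactly. The helper functions below are generic illustrations written for this article, not taken from a particular fairness library; they compute demographic parity and equalized-odds gaps for a set of binary decisions. On real data with differing base rates, driving one of these gaps to zero typically widens the other, which is the practical face of the impossibility results.

```python
import numpy as np

def demographic_parity_diff(y_pred, a):
    """Gap in positive-decision rates between the two groups."""
    return abs(y_pred[a == 1].mean() - y_pred[a == 0].mean())

def equalized_odds_diff(y_true, y_pred, a):
    """Largest gap in true-positive or false-positive rates between groups."""
    gaps = []
    for label in (1, 0):                      # TPR gap, then FPR gap
        mask = y_true == label
        gaps.append(abs(y_pred[(a == 1) & mask].mean()
                        - y_pred[(a == 0) & mask].mean()))
    return max(gaps)
```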
To detect hidden bias, the model’s performance has to be examined separately for different groups; measuring only overall accuracy is nowhere near sufficient. It is also necessary to understand the data-generating process: where the data come from, which selection effects shape them, and which institutional practices influence what is recorded and what is not. Without this, it is difficult to determine what the model is actually learning. In recent years, methods based on counterfactual reasoning have emerged. These approaches ask how a decision would change if a fairness-critical attribute were hypothetically altered while everything else remained the same. If the decision is highly sensitive to such changes, this may indicate that the system is implicitly encoding protected characteristics and relying on them in its decisions.
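Such counterfactual probes can be prototyped in a few lines. The sketch below is a strongly simplified version written for illustration (the function name and the assumption of a binary attribute are mine): it flips only the fairness-critical attribute itself and leaves all other features untouched, whereas full counterfactual-fairness analyses would also propagate the change through causally downstream variables.

```python
import numpy as np

def counterfactual_flip_rate(model, X, attr_index):
    """Flip a binary fairness-critical attribute for every record, keep all
    other features fixed, and measure how often the model's decision changes.
    A high rate suggests the decision is sensitive to that attribute."""
    X_cf = X.copy()
    X_cf[:, attr_index] = 1 - X_cf[:, attr_index]
    return float(np.mean(model.predict(X) != model.predict(X_cf)))
```

A near-zero flip rate is not proof of fairness, however, since proxies untouched by the flip can still carry the protected information.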
One practical difficulty in identifying hidden bias is that doing so often requires access to exactly the kind of information that the logic of fairness through unawareness would remove. Without knowing protected attributes, it is impossible to analyze performance differences between groups, so the bias remains invisible. Legal and ethical analyses have also drawn attention to this tension. There can be a conflict between a general prohibition on processing sensitive data and the obligation to prevent algorithmic discrimination, which is why narrowly tailored exceptions for audit purposes, backed by strong safeguards, may be justified.
This insight is now reflected in the regulatory landscape as well. The European Union’s AI Act sets out detailed data-governance requirements for the training data of high-risk systems, including the obligation to examine the data for possible biases and to take appropriate measures to detect, prevent, and mitigate them. Implicitly, this means that issues related to protected characteristics cannot be resolved simply by pruning the feature list; instead, a risk assessment, documentation, and monitoring framework is needed that covers the entire lifecycle of the system.
The critique of fairness through unawareness therefore goes beyond a narrow technical observation. It highlights that justice cannot be achieved through deliberate blindness. On the contrary, the more we want to avoid discriminatory outcomes, the more we need to make visible how social patterns are embedded in our data and models, and which tools we can use to measure, constrain, and correct these effects.
István ÜVEGES, PhD is a computational linguist, researcher, and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of legal language (legalese), the Plain Language Movement, and sentiment and emotion analysis.