Perhaps one of the best-known concepts in artificial intelligence research is the Turing test. Its purpose is to determine whether a system has human-like intelligence, and to decide this it relies mainly on the system's linguistic capabilities. Now that many systems already display a human-like language ‘skill’, the question rightly arises: how adequate are the original test criteria? Don’t we need something more reliable? A new proposal accordingly puts the assessment of ‘intelligent behavior’ in a completely different context.
In a world where artificial intelligence-based systems are gradually infiltrating everyday life, it is crucial to be aware of their true capabilities. While the study of intelligence in human terms is an important issue for psychology, neuroscience, computer science, and many other disciplines, the question also has a strong ethical side. In everyday life we still mostly encounter applications designed to make our tasks easier. However, this does not mean that systems capable of imitating human language cannot also be used with malicious intent. While most of us automatically maintain a certain distrust of machines, we extend far more trust to other human beings. If an application can make users believe that they are talking to a real person, it can easily lead to abuse, data theft, or political manipulation, for instance by quickly spreading propaganda messages. It is therefore particularly important to have an objective benchmark that helps determine the level of sophistication of such applications. This will also make it easier to prepare for their effects.
The Turing test is designed to infer the intelligence of a system from the quality of its imitation of human language. Originally, the test runs roughly as follows. The experiment involves three participants: a human judge conducts written conversations with the other two, one of whom is a machine, and must decide which interlocutor is the human. In a simpler variant, the judge converses with a single partner and must decide whether that partner is human or machine. When the Turing test was first proposed in 1950, imitating human language proved to be such a difficult problem that no system was able to meet its requirements until the 2010s.
The first breakthrough was a program called Eugene Goostman, which passed the test in 2014, although it convinced only 33% of the judges, just above the commonly used 30% threshold. There was a twist in the result, as the chatbot was presented as a 13-year-old Ukrainian boy from Odessa. This backstory could have greatly lowered expectations about its language use, since it offered a plausible explanation for mistakes that a native speaker would not have made. Nonetheless, the result was a good indication that the development of AI-based applications is approaching a critical stage, one at which we need to fundamentally reassess what exactly we consider an authentic sign of intelligence.
The validity of the Turing test was not unanimously accepted from the beginning. One of the main reasons is that, by its very nature, it can judge the intelligence of a system only by its linguistic capabilities. At the time the test was created, artificially simulating human language use was considered a problem that only a truly intelligent system could solve.
This idea is not, of course, independent of the context in which it originated. In the 1950s, there was no such thing as natural language processing (NLP) or computational linguistics as we know them today, i.e., the branch of artificial intelligence concerned with human language. Machine translation was the first area to come into the spotlight (as early as the 1940s), but it took decades to create systems that could be used effectively in practice. It was perhaps only from the 2000s onwards that statistical language modeling and, later, language models based on neural networks made a real breakthrough in the field.
We have already mentioned how difficult it was, at the time the Turing test was proposed, to analyze languages, discover their regularities, and reconstruct them artificially. This is important because an unspoken axiom pervades the approach to artificial intelligence and the study of intelligence in general: whatever machines cannot yet do is assumed to require intelligence, and once machines master it, it no longer counts. The concept of intelligence is thus a moving target, whose measurement and definition vary from age to age and from discipline to discipline. Before the invention of the calculator, most people believed that dealing with abstractions such as numbers, or rather performing operations on them, was impossible without intelligence. The situation was similar in chess, where winning a game requires strategic planning, modeling possible future outcomes, and simulating the opponent’s moves. The creation of programs in the 1960s that could play entire games quickly overturned the idea that chess could only be played by truly intelligent beings.
The situation is similar regarding human language. Today, there are many language models whose fine-tuned chatbot versions can mimic human language use with deceptive accuracy. Despite this, there is a relative consensus among those working on the topic that these models can be considered neither intelligent nor a path to artificial general intelligence, the name given to human-level, creative, task-independent artificial intelligence.
The question arises, however: if human-level language use can no longer be considered a measure of intelligence, then what can? In fact, language acquisition is only a foundation stone on which future AI solutions can be built.
The situation is complicated by the fact that there is still no complete agreement on the set of skills or abilities that clearly and unmistakably distinguish humans from all other creatures on Earth. In psychology, for example, Howard Gardner’s theory of multiple intelligences holds that humans are characterized not by one but by eight different types of intelligence. These include linguistic intelligence (which forms the basis of the Turing test), logical-mathematical intelligence, musical intelligence, and intra- and interpersonal intelligence.
A new approach, which has recently entered the public consciousness under the name AI Classification Framework (ACF), aims to capture and precisely measure these aspects. The framework attempts to assess a system’s development in all the types of intelligence described above, not only linguistic competence.
Another idea is to test existing language models in a more fine-grained way, so that their real strengths and weaknesses can be identified more clearly. FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets) aims to address this by testing:
- the logical reasoning of the model,
- its ability to construct arguments based on common sense and background knowledge,
- the model’s ability to solve problems, and
- the correspondence of the generated answers to user preferences (conciseness, easily understandable wording).
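The idea of skill-based evaluation can be pictured with a toy sketch. The skill names below loosely mirror the four areas listed above, but the names, the 1–5 scale, and the aggregation are illustrative assumptions, not FLASK’s actual methodology:

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative skill axes; these identifiers are assumptions made for
# this sketch, not the framework's real categories or API.
SKILLS = ("logical_reasoning", "commonsense", "problem_solving", "user_alignment")

@dataclass
class SkillScores:
    """Per-skill ratings (1 = poor, 5 = excellent) for one model answer."""
    scores: dict

    def overall(self) -> float:
        # A single aggregate number, like a classic benchmark score.
        return mean(self.scores[s] for s in SKILLS)

    def weakest(self) -> str:
        # The fine-grained view: which specific skill drags the answer down.
        return min(SKILLS, key=lambda s: self.scores[s])

# Example: an answer that reads pleasantly but reasons poorly.
rated = SkillScores({"logical_reasoning": 2,
                     "commonsense": 4,
                     "problem_solving": 3,
                     "user_alignment": 5})
print(rated.overall())   # → 3.5
print(rated.weakest())   # → logical_reasoning
```

The point of the sketch is that a single aggregate score (3.5) hides exactly what a per-skill breakdown reveals: here, that the model’s real weakness is logical reasoning, not fluency.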
Such a test could help to detect, for example, the phenomenon of hallucination, one of the most pressing problems with LLMs today. Hallucination occurs when an answer generated by the model, although well-formed, is factually incorrect, for instance because the information it contains is unrelated to the context in which the question was asked. The phenomenon is rooted in the way language models generate responses. The model has no human-like knowledge of the world, nor any self-reflection on its answers; it simply produces the sequence of tokens that is most likely in the given context, based on its training data.
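This mechanism can be illustrated with a deliberately tiny sketch: a toy next-word “model” whose only knowledge is a table of invented bigram statistics. It generates a perfectly fluent sentence, but nothing in the procedure ever checks a fact:

```python
# Toy "language model": invented conditional next-word probabilities,
# standing in for statistics estimated from a training corpus. The model
# stores no facts to check an answer against, only word co-occurrence.
NEXT_WORD = {
    "the":       {"capital": 0.9, "weather": 0.1},
    "capital":   {"of": 1.0},
    "of":        {"australia": 0.6, "france": 0.4},
    "australia": {"is": 1.0},
    "france":    {"is": 1.0},
    # In this invented corpus "paris" dominates after "is",
    # regardless of which country the sentence is about:
    "is":        {"paris": 0.7, "canberra": 0.3},
}

def generate(prompt: str, steps: int = 5) -> str:
    """Greedy decoding: always append the single most probable next word."""
    words = prompt.lower().split()
    for _ in range(steps):
        candidates = NEXT_WORD.get(words[-1])
        if not candidates:
            break
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

# Fluency is guaranteed by the statistics; factual accuracy is not:
print(generate("the"))   # → the capital of australia is paris
```

The output is grammatical and confident, and wrong, because each word was chosen only for its probability in context. Real LLMs are vastly more sophisticated, but the same purely statistical objective is what makes hallucination possible.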
Taking a different approach again, Mustafa Suleyman, co-founder of the AI lab DeepMind, believes that a radically rethought version of the Turing test could be used. In his interpretation, the goal is to find out what the model actually understands from the data it stores, how capable it is of planning for the future, and whether it can conduct complex ‘internal monologues’. The key here is again to infer the presence of capabilities that are (currently, at least) considered intrinsic to humans.
According to his idea, the task in this new Turing test could be to build a business: starting from an initial product idea, the model would have to draw up a commercial business plan, find potential vendors, and organize sales. In his view, this would make the model’s ability to set goals, plan, and perform complex tasks autonomously more verifiable.
The idea is, of course, not unrelated to the nature of the tasks typically encountered in the entrepreneurial sphere, from which Suleyman drew up his list. One potential problem with the method is that, although it is designed to test human-like creativity and planning skills, it does so with a task that many humans could not perform to a high standard, whether for lack of professional knowledge or of other intrapersonal characteristics.
Some of these new approaches may seem highly utopian. However, we should not forget that even 10 years ago, passing the Turing test (without facilitation) might have seemed like pure science fiction. Extrapolating from today’s pace of development, it is easy to imagine that in another 10 years, these will be the de facto standards that an AI capable of operating in human language will have to meet.
We still have only theories about the exact nature of human intelligence. That is why it is difficult to draw a sharp line between a highly sophisticated but automatic function and the traces of a creative mind. One thing is certain, however: in today’s technologically advanced world, Alan Turing’s test, a yardstick for decades, is increasingly unable to perform its original function. The new ideas for testing AI clearly indicate that the boundary between artificially reconstructable behavior that merely appears intelligent and the intelligence that defines humans is becoming more and more blurred.
István ÜVEGES is a researcher in Computer Linguistics at MONTANA Knowledge Management Ltd. and a researcher at the Centre for Social Sciences, Political and Legal Text Mining and Artificial Intelligence Laboratory (poltextLAB). His main interests include practical applications of Automation, Artificial Intelligence (Machine Learning), Legal Language (legalese) studies and the Plain Language Movement.