From Text to Action: World Models in Practice (Part I.)
In recent years, Large Language Models (LLMs) and their multimodal counterparts have advanced considerably, yet their ability to understand the physical world and predict consequences remains limited. World Models (WMs) aim to fill this gap by building internal, time-evolving representations from multiple sensory inputs, enabling systems not only to describe but also to anticipate and act. This post explains the conceptual shift behind WMs, places them within the current landscape of generative AI (LLMs, multimodal models, RAG), and reviews when, and to what extent, it is reasonable to adopt them today.
In recent years, the development of generative AI has focused mainly on LLMs and conversational systems. These are trained on vast amounts of text and excel at dialogue and natural (human) language tasks, but on their own they lack sensory or physical grounding. They cannot consistently reason about images, sounds, or the dynamics of the physical world. The emergence of multimodal models marked a significant shift. Such systems can process not only text but also images, video, audio, and sensor data, providing richer context for understanding. Yet in many cases they still lack a unified, abstract “map” of the environment and are not specifically designed to predict real-world consequences.
A world model can be defined as a system that not only integrates multiple signal sources (such as images, sounds, or texts) but also builds an internal representation of the environment and of how it changes over time. In this way, the machine can estimate the likely outcome of actions, reason about causal relations, and on that basis interact autonomously with the physical world. An LLM is text-centered and strong in language; a multimodal model integrates and responds to multiple signals; a WM adds to these an internal map of the environment within which it can simulate, predict, and act in anticipation of future states. This has immediate practical significance in domains such as robotics and autonomous systems, where AI must make near real-time decisions in real environments. The aim of developing WMs is not only to describe and interpret sensory inputs (for example, images, videos, or sensor readings) but also to understand the structures and relationships behind them, and to use this “knowledge” to predict and influence future events or state changes.
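To make the distinction concrete, the sketch below shows the loop a world model adds on top of perception: encode what is observed into an internal state, predict how that state changes under a candidate action, and choose the action whose predicted outcome best matches a goal. This is a minimal conceptual illustration in Python; every class, function, and parameter name is invented for this post and does not correspond to any particular system.

```python
# Minimal conceptual sketch of a world-model loop (illustrative only;
# the names below are hypothetical, not any vendor's API).
import numpy as np

class ToyWorldModel:
    """Encodes observations into an internal state and predicts how that
    state changes when an action is applied."""

    def encode(self, observation: np.ndarray) -> np.ndarray:
        # Stand-in for a learned encoder (e.g. a video/sensor network).
        return observation.astype(np.float32)

    def predict_next(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Stand-in for a learned transition (dynamics) model.
        return state + 0.1 * action

def plan(model, observation, candidate_actions, goal):
    """Simulate each candidate action in the model's internal state space
    and pick the one whose predicted outcome lands closest to the goal."""
    state = model.encode(observation)
    scores = [np.linalg.norm(model.predict_next(state, a) - goal)
              for a in candidate_actions]
    return candidate_actions[int(np.argmin(scores))]

# Tiny usage example: the planner prefers the action predicted to move
# the state toward the goal, without ever acting in the real world first.
model = ToyWorldModel()
obs = np.zeros(3)
goal = np.array([1.0, 0.0, 0.0])
actions = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
print(plan(model, obs, actions, goal))
```

In a real system, everything interesting happens inside `encode` and `predict_next`, which are learned from large amounts of video, sensor, and interaction data; the surrounding decision pattern, however, is what separates a world model from a purely descriptive one.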
The industry-wide shift toward these more complex and more versatile models has emerged because the practical limitations of models trained only on text are becoming increasingly clear. Text is powerful for description, but it does not capture every nuance of space, motion, and material properties. A vehicle avoids a collision not because it can articulate what the obstacle is, but because it chooses an effective evasive path in that situation. To do so, the system must grasp how the environment changes over time and what consequences each possible action entails.
If we take this requirement seriously (that the system must understand spatial and temporal changes and the consequences of actions), a practical question arises: how can a model learn all this safely and quickly enough? In practice, a combination of two settings has proved effective. The first is simulation, a safe virtual training ground where the AI can make mistakes and try again without real-world consequences. Its role is to accelerate experience: a warehouse robot, for example, can practice turning and lifting pallets thousands of times while the system learns which movements lead to a stable grip. The second setting is the real world, where sensor data corrects and refines the patterns learned in simulation. Several established methods help bridge the gap between the two. One is domain randomization, the systematic variation of simulated worlds so that the model does not adapt to a single idealized environment. Another is synthetic data augmentation, which deliberately generates rare but important cases. Finally, regular feedback from real measurements allows the model to recalibrate its assumptions.
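As a rough illustration of the first of these methods, the snippet below randomizes a handful of simulation parameters before each training episode, so the policy never sees the same idealized warehouse twice. The parameter names and ranges are invented for this example and are not taken from any specific simulator.

```python
# Toy illustration of domain randomization: each training episode gets a
# differently perturbed simulated world, so the model cannot overfit to a
# single idealized environment. Parameters and ranges are made up.
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    friction: float      # pallet/floor friction coefficient
    payload_kg: float    # mass of the pallet being lifted
    sensor_noise: float  # std-dev of simulated sensor noise
    light_level: float   # relative scene brightness

def sample_randomized_params() -> SimParams:
    """Draw one randomized configuration of the simulated warehouse."""
    return SimParams(
        friction=random.uniform(0.3, 0.9),
        payload_kg=random.uniform(50.0, 400.0),
        sensor_noise=random.uniform(0.0, 0.05),
        light_level=random.uniform(0.4, 1.0),
    )

# One randomized configuration per training rollout:
for episode in range(3):
    print(f"episode {episode}: {sample_randomized_params()}")
```

In practice, the randomized quantities typically include visual properties (textures, lighting, camera pose) as well as physical ones, and the ranges are chosen so that the real environment falls comfortably inside the distribution the model was trained on.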
In recent months, several well-known players have introduced developments that move in this direction. Google DeepMind’s Genie 3, for example, generates interactive video environments that can be explored in real time: the user is not only a spectator but can intervene, and the system updates the state of the world accordingly. Meta’s V-JEPA research line trains on raw video and develops the ability to estimate the likely outcome of movements and actions. Within NVIDIA’s ecosystem, Omniverse and Isaac Sim have become the basis for professional simulation and synthetic data generation, Jetson provides the embedded runtime environment for enterprise and industrial edge devices, and Cosmos offers a developer toolkit created specifically for building world models.
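To give a flavour of the V-JEPA idea, the toy snippet below sketches a joint-embedding predictive objective: instead of reconstructing raw pixels, a predictor is trained to output the latent representation of a masked part of the input from the visible context. This is a heavily simplified sketch in PyTorch; the actual V-JEPA architecture, masking strategy, and loss differ from what is shown here.

```python
# Schematic of a JEPA-style objective: predict the *representation* of a
# masked part of the input from the visible context, not the pixels.
# Layers and shapes are illustrative, not Meta's implementation.
import torch
import torch.nn as nn

dim = 64
context_encoder = nn.Linear(dim, dim)   # encodes the visible context
target_encoder = nn.Linear(dim, dim)    # EMA copy in practice; frozen here
predictor = nn.Linear(dim, dim)         # predicts the target's latent

for p in target_encoder.parameters():
    p.requires_grad_(False)             # targets provide no gradient

context_patch = torch.randn(8, dim)     # stand-in for visible video patches
target_patch = torch.randn(8, dim)      # stand-in for masked patches

pred = predictor(context_encoder(context_patch))
with torch.no_grad():
    target = target_encoder(target_patch)

loss = nn.functional.mse_loss(pred, target)   # distance in latent space
loss.backward()
```

Working in latent space rather than pixel space is what lets such models concentrate on predictable structure (objects, motion, interactions) instead of spending capacity on visual detail that is irrelevant to the outcome of an action.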
Given these examples, the essential question follows: why is an LLM not enough for this task? The name alone indicates that these systems are built for language. They are strong in tasks that can be solved through language use alone, but text by itself is far from sufficient for perception, for reasoning about temporal consequences, or for acting in the physical world. Language models operate on statistical associations and, as has become widely recognized, are prone to producing confident but inaccurate answers. They also lack the ability to model the physical layer of a given situation. Anyone who has tried to loosen a rusted bolt knows how much depends on friction, leverage, angle, and the condition of the surface. These outcomes are not decided in the world of words but in reality. The advantage of a WM is that it interprets the task based on perceived signals and prior experience, then predicts what will happen if we try with more force, at a different angle, or in a different sequence. Language still plays a role by helping to set goals and explain actions, but for execution it is concrete physical intuition that matters most.
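The bolt example can be restated as a tiny decision problem: given a model that predicts the outcome of applying a certain torque at a certain angle, the system scores the candidates before touching anything and starts with the most promising one. The toy code below illustrates only this decision pattern; the “physics” inside `predicted_success` is a deliberately fake stand-in for what a trained world model would supply.

```python
# Toy illustration of "try with more force or a different angle":
# a world model lets the system score candidate actions *before* acting.
from itertools import product

def predicted_success(force_nm: float, angle_deg: float) -> float:
    """Stand-in for a learned dynamics model: estimate the chance the bolt
    loosens without the tool slipping, given torque and approach angle."""
    grip = max(0.0, 1.0 - abs(angle_deg) / 45.0)   # off-angle approaches slip more
    torque = min(force_nm / 80.0, 1.0)             # more torque helps, up to a limit
    return grip * torque

candidates = product([20.0, 50.0, 80.0], [-30.0, 0.0, 30.0])  # (force, angle) options
best = max(candidates, key=lambda fa: predicted_success(*fa))
print(f"best candidate before touching the bolt: force={best[0]} Nm, angle={best[1]} deg")
```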
István ÜVEGES, PhD is a computational linguist, researcher, and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment and emotion analysis.