
From Text to Action: World Models in Practice (Part II.)

When the practical applications of World Models (WMs) come up, industrial robotics is usually mentioned first, and for good reason. A robot that relies on a world model does not merely repeat a pre-programmed sequence of movements. It notices if the workpiece has shifted, if a shelf has bent slightly, or if someone has entered its workspace, and it adjusts its actions accordingly. The same logic applies in logistics, where machines can move more independently and safely among people, and in autonomous driving, where decisions are based not on snapshots but on continuous predictions, such as the expected trajectories of other vehicles. In healthcare, the benefits are less visible but equally important. Hospital logistics, the autonomous movement of instruments, or the customization of rehabilitation devices are all tasks where the environment constantly changes. In these cases, the world model does not replace the specialist’s judgment but makes daily operations smoother and safer.

The entertainment and gaming industries are adopting these methods particularly quickly. World models that learn interactively do not provide fixed, pre-written paths but create worlds that adapt to the player’s actions. This is both an engaging experience and a testing ground. From a broader perspective, the same algorithms that generate realistic traffic in a virtual city could also be applied in urban planning or in the development of autonomous driving. This is where the concept of the digital twin enters the picture. A digital twin is a virtual counterpart of a physical device, process, or facility that is continuously synchronized with real-time data. A pairing of a world model and a digital twin could, at least in theory, explore multiple possible futures starting from the present state and recommend the most promising one.
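To make the idea tangible, here is a minimal, purely illustrative Python sketch of that "explore multiple futures" loop: a toy world model steps candidate action sequences forward, and the planner recommends whichever sequence leads to the best predicted outcome. The dynamics, cost function, and all parameters are assumptions chosen for readability, not any particular product’s method.

```python
# Hedged sketch of planning via world-model rollouts (random shooting).
# The dynamics, cost, horizon, and target are illustrative assumptions.
import random

def world_model(state: float, action: float) -> float:
    """Toy one-step predictor: next state under an action."""
    return state + action - 0.1 * state  # assumed dynamics with mild decay

def rollout_cost(state: float, actions: list[float], target: float = 10.0) -> float:
    """Simulate a whole action sequence and score the predicted end state."""
    for a in actions:
        state = world_model(state, a)
    return abs(target - state)  # distance from the desired outcome

def plan(state: float, horizon: int = 5, candidates: int = 200) -> list[float]:
    """Sample candidate futures, simulate each, recommend the best one."""
    best, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(state, seq)
        if cost < best_cost:
            best, best_cost = seq, cost
    return best

print(plan(state=0.0))  # recommended action sequence under the toy model
```

Real systems replace the random sampling with learned proposal distributions and far richer state representations, but the shape of the loop, simulate many futures and pick the most promising, is the same.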

Before turning to how these components can be combined, it is useful to recall what RAG means, how it developed, and why it matters. Retrieval-augmented generation (RAG) became a widely referenced approach around 2020, when the first broadly cited descriptions of systems combining language models with external knowledge sources were published. The initial problem was that the knowledge stored in the parameters of large models was expensive, difficult to update, and opaque. RAG addresses this by allowing the generative model, before producing an answer, to retrieve relevant texts from an external, non-parametric source such as a knowledge base or a company document repository, and then use them together with its internal knowledge. Its advantages include reducing errors caused by outdated information, mitigating hallucinations, enabling references and source citations, and in many cases avoiding costly retraining, since new information comes from the retrieved material.
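In code, the core loop is small. The sketch below is a deliberately minimal, hypothetical illustration: a toy lexical retriever stands in for a real vector store, and generate_answer merely assembles the prompt a generative model would receive. None of these names come from an actual RAG library.

```python
# Minimal sketch of the retrieve-then-generate loop behind RAG.
# SimpleRetriever, generate_answer, and the toy corpus are illustrative
# assumptions, not part of any specific framework.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

class SimpleRetriever:
    """Toy lexical retriever: ranks documents by word overlap with the query."""
    def __init__(self, corpus: list[Document]):
        self.corpus = corpus

    def retrieve(self, query: str, k: int = 2) -> list[Document]:
        q_words = set(query.lower().split())
        ranked = sorted(
            self.corpus,
            key=lambda d: len(q_words & set(d.text.lower().split())),
            reverse=True,
        )
        return ranked[:k]

def generate_answer(query: str, context: list[Document]) -> str:
    # Stand-in for an LLM call: a real system would pass this prompt to a
    # generative model, so answers are grounded in retrieved, citable sources.
    sources = "\n".join(f"[{d.doc_id}] {d.text}" for d in context)
    return f"Question: {query}\nRetrieved context:\n{sources}"

corpus = [
    Document("maint-017", "Bolt M12 on pump housing: apply penetrating oil, wait 10 min."),
    Document("hr-003", "Annual leave requests are submitted via the portal."),
]
retriever = SimpleRetriever(corpus)
query = "how to loosen a rusted bolt"
print(generate_answer(query, retriever.retrieve(query)))
```

A production system would swap the toy retriever for dense embeddings and a vector index, and the final string for an actual model call; the retrieve-then-generate order is the essential part.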

On its own, this does not yet provide physical context. However, if RAG can also draw on the state of a digital twin and on sensor data, an entirely new quality emerges. Consider a maintenance scenario where the system first retrieves the appropriate procedure, then tests in simulation how to reach a rusted bolt within the given spatial constraints. The recommendation is no longer a general instruction, but a solution tailored to the specific environment and moment. The data generated during execution is then fed back into the system, improving the model further and updating the documentation. In this way, knowledge is not lost but circulates.
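What this grounding might look like can be hinted at with a hedged sketch: a generic retrieved procedure is specialized using a digital-twin snapshot and a live sensor reading. Every interface here (TwinState, the thresholds, the field names) is a hypothetical stand-in, not an existing API.

```python
# Hedged sketch of "physically grounded" retrieval: a recommendation assembled
# from three assumed sources - a document store, a digital-twin snapshot, and
# a live sensor reading. All names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class TwinState:
    reachable_clearance_mm: float   # free space around the target bolt
    bolt_torque_estimate_nm: float  # from the twin's corrosion model

def plan_maintenance(procedure_text: str, twin: TwinState, sensor_temp_c: float) -> str:
    """Specialize a generic retrieved procedure to the actual machine state."""
    steps = [procedure_text]
    if twin.reachable_clearance_mm < 40:
        steps.append("Use the low-profile ratchet; the standard tool does not fit.")
    if twin.bolt_torque_estimate_nm > 80 or sensor_temp_c < 5:
        steps.append("Apply penetrating oil and wait before attempting removal.")
    return " ".join(steps)

# The retrieved text stays generic; twin state and sensors make it specific.
twin = TwinState(reachable_clearance_mm=32.0, bolt_torque_estimate_nm=95.0)
print(plan_maintenance("Loosen bolt M12 counter-clockwise.", twin, sensor_temp_c=3.0))
```

The point is not the toy thresholds but the structure: the retrieved document remains general, while the twin state and sensor data tailor the output to the machine in front of the technician.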

Along with all this come the ethical, legal, and safety issues that affect every AI system, and for which we do not yet have fully reassuring answers, even in the case of LLMs that have a far smaller impact on their environment. Who is responsible if a system using a world model makes a mistake and causes harm? How should sensitive sensor data be handled, especially in domestic or healthcare settings? What counts as safe enough for a system to operate among people? In real-world deployments, safety is the result of multiple overlapping layers and formalized risk management. A structural framework is provided by the NIST AI RMF, which sets out practices from risk identification through testing and evaluation to continuous monitoring. Sector-specific standards define concrete requirements, such as ISO 26262 for functional safety in the automotive industry, ISO/TS 15066 for collaborative robotics (speed and separation monitoring, power and force limiting), and UL 4600 for building a complete safety case for autonomous systems.

On the technical level, a guardrail is never just a single rule but a toolkit: runtime safety filters and monitors, restrictions on the operational domain, safety PLCs and emergency stops, as well as runtime assurance architectures that can fall back to a simple and proven safe controller when needed. At the same time, safety filters that correct control commands online are spreading, for example those based on control barrier functions or Hamilton-Jacobi reachability, which ensure that collisions and boundary violations are avoided. Formal verification in this context focuses on critical properties and the safety case, while deployment is gradual, within well-defined operating domains, under close supervision and with continuous feedback.
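As a toy illustration of what such a filter does, consider a one-dimensional sketch in the spirit of control barrier functions. Everything here (the dynamics, the barrier, the gain) is an assumption chosen for readability; real CBF filters solve a constrained optimization over the certified dynamics of the actual system.

```python
# Minimal sketch of a runtime safety filter in the spirit of control barrier
# functions, reduced to one dimension. The dynamics x' = u, the barrier
# h(x) = x_limit - x, and the gain alpha are illustrative assumptions.
def safety_filter(x: float, u_nominal: float,
                  x_limit: float = 1.0, alpha: float = 5.0) -> float:
    """Clamp the nominal command so the barrier h(x) stays non-negative.

    The CBF condition dh/dt >= -alpha * h(x) gives, for x' = u,
    -u >= -alpha * (x_limit - x), i.e. u <= alpha * (x_limit - x).
    """
    h = x_limit - x
    return min(u_nominal, alpha * h)  # intervene only when needed

# The nominal controller pushes hard toward the boundary; the filter
# overrides it only near the limit, which is never crossed.
x, dt = 0.0, 0.01
for _ in range(500):
    x += safety_filter(x, u_nominal=2.0) * dt
print(f"final position: {x:.3f} (limit 1.0, never crossed)")
```

Production-grade filters evaluate this kind of constraint at every control step over the full system model; the clamp above is the one-dimensional shadow of that idea, shown only to make the mechanism visible.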

Industry examples suggest that the toolchains needed for world modeling are already available, but at varying levels of maturity. Google DeepMind’s Genie 3 offers interactive generative environments for research purposes; Meta’s V-JEPA explores video-based representations; NVIDIA’s Omniverse and Isaac Sim are used for industrial simulation and synthetic data generation, its Jetson line serves as an edge deployment platform, and its Cosmos platform provides foundation models for world modeling. These components by themselves do not constitute a complete solution. Their successful adoption depends on data quality, validation, costs, and compliance with relevant standards. Nonetheless, it can be observed that the transition from research prototypes to targeted, verifiable operational use is gradually becoming easier.

Given the mixed maturity of the toolchains and deployment conditions outlined so far, the key question is what practical lessons can realistically be drawn. The first step is to identify processes where physical context genuinely influences outcomes and where sensors are either already available or can be reasonably installed. This can be followed by building and validating a digital twin tailored to a specific task, regularly calibrated with real data. If training examples are scarce, synthetic data can be generated through controlled simulation, but verification with real measurements should remain mandatory. The choice of execution environment depends on the task: edge deployment is often justified for reasons of privacy and latency, while in other cases central resources are more suitable. Existing RAG systems become truly useful when they integrate not only textual sources but also the state of simulations and sensor networks. In practice, value usually does not come from a single breakthrough but from measurement-based iteration: recording and analyzing failure cases retrains the model and refines the documentation.
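As one concrete example of that measurement-based loop, the hypothetical snippet below implements the simplest possible calibration check: compare the twin’s predictions with logged measurements and flag drift. The tolerance, field names, and data are invented for illustration.

```python
# Hedged sketch of the "calibrate the twin against real data" step: compare
# twin predictions with logged measurements and flag drift for recalibration.
# The tolerance and the example values are illustrative assumptions.
import statistics

def twin_drift(predicted: list[float], measured: list[float]) -> float:
    """Mean absolute error between twin predictions and real measurements."""
    return statistics.mean(abs(p - m) for p, m in zip(predicted, measured))

def needs_recalibration(predicted: list[float], measured: list[float],
                        tolerance: float = 0.05) -> bool:
    """Flag the twin for recalibration when drift exceeds the tolerance."""
    return twin_drift(predicted, measured) > tolerance

# Example: conveyor speeds (m/s) predicted by the twin vs. sensor logs.
pred = [1.20, 1.22, 1.19, 1.25]
real = [1.11, 1.13, 1.10, 1.16]
print(needs_recalibration(pred, real))  # True -> schedule recalibration
```

Trivial as it is, running a check like this on a schedule is often what keeps the gap between simulation and reality from quietly widening.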

World models do not replace LLMs and multimodal systems but extend them. By managing physical dynamics and causality, they provide measurable benefits in situations where states, movements, and risks evolve over time. For purely textual or image-based tasks, they are often more machinery than the problem requires. Today, adoption is determined by the gap between simulation and reality, the reliability of evaluation, data quality and costs, as well as compliance and accountability. In the short term, hybrid architectures appear most promising, where world models, language components, and retrieval work together within clearly defined operating domains and measurable goals. The conclusion is straightforward: instead of general promises, it is worth pursuing targeted, auditable, and sufficiently good solutions where physical context truly matters.


István ÜVEGES, PhD is a Computational Linguist researcher and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment and emotion analysis.