Beyond the Hype: Building AI That Actually Runs (Part II.)

What Makes AI Production-Ready?

Once a system is expected to operate over millions of documents, technical sophistication alone is not enough. The decisive question becomes whether the system can be trusted in practice. That trust arises from the discipline with which the system is built, evaluated, and maintained.

This is where many discussions of AI remain misleading. In professional settings, success is rarely a matter of selecting the “best” model in the abstract. What matters more is the surrounding process: how performance is defined, how quality is measured, and what operational standards are enforced. In legal and other high-stakes domains, a plausible answer is not necessarily a usable one. What matters is whether it is accurate, reproducible, and fit for professional use. Domain-specific truth is therefore often more important than general-purpose fluency.

Systems of this kind perform best when they are closely aligned with the rules, vocabulary, and standards of the domain in which they operate. For that reason, expert involvement cannot stop once the first version of the system has been built. Domain experts are also needed to define, test, and periodically revise what counts as a satisfactory result. This has become especially important with the spread of generative models, whose greatest weakness in professional environments is not linguistic fluency but factual unreliability. Retrieval-Augmented Generation (RAG) has emerged in response to precisely this problem and is increasingly becoming standard practice in high-stakes use cases. We discuss the method and our own implementation experience in greater detail in our whitepaper. Its basic logic is that the system first retrieves relevant information from trusted sources, and only then uses the model to synthesize that material into a concise answer. For legal use, this is the difference between a model improvising a constitutional citation and a system that can point the user back to the specific source text on which its answer rests.
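The retrieve-then-synthesize logic described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Passage` type, the word-overlap scoring, the sample corpus, and the source identifiers are hypothetical, and the generation step is reduced to building the prompt that a model would receive. A production retriever would use a proper search index or embeddings rather than word overlap.

```python
from dataclasses import dataclass

PUNCTUATION = ".,;:()?!\"'"


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier pointing back to a trusted source
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real retriever's text analysis."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return {w for w in words if w}


def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Rank passages by word overlap with the query; keep the top k that overlap at all."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p.text)), p) for p in corpus]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Assemble the text a generative model would be asked to synthesize from."""
    sources = "\n".join(f"[{p.source_id}] {p.text}" for p in passages)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{sources}\nQuestion: {query}"
    )


def prepare_answer(query: str, corpus: list[Passage], k: int = 2) -> dict:
    """Retrieve first; only build a prompt when there is source text to rest on."""
    passages = retrieve(query, corpus, k)
    if not passages:
        # No trusted source found: decline rather than let a model improvise.
        return {"prompt": None, "sources": []}
    return {
        "prompt": build_prompt(query, passages),
        "sources": [p.source_id for p in passages],
    }


# Invented sample corpus, for illustration only.
CORPUS = [
    Passage("act-1/s3", "The court shall notify the parties within fifteen days."),
    Passage("act-1/s4", "Appeals must be filed within thirty days of the decision."),
    Passage("act-2/s1", "The registry keeps records of all filings."),
]

if __name__ == "__main__":
    print(prepare_answer("When must appeals be filed?", CORPUS))
```

The design point is that the list of source identifiers travels with the answer, so the user can always be pointed back to the text on which it rests, and that a query with no supporting source produces no prompt at all.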

Deploying such a system in live operation also requires a level of data work that is often underestimated at the planning stage. A frequent misunderstanding is that the annotation of training data is merely a technical support task. In fact, quality is largely determined much earlier, when the team decides which categories matter, how edge cases should be handled, and which examples are genuinely representative. During annotation, domain experts mark relevant passages in documents, and those decisions become the basis of the system’s later ability to recognize similar patterns. If the work is imprecise at this stage, the resulting errors are costly to diagnose and correct once the system is already in use. In practice, poor-quality training data often produces failures that remain hidden in testing. These failures often become visible only when professionals begin relying on the system in daily work and discover that it does not support their due diligence reliably. In a legal context, that might mean inconsistent recognition of statutory references, misclassification of procedural documents, or the omission of legally salient passages. A system might, for example, fail to identify the controlling statutory provision in a filing or miss a passage that materially changes the meaning of a document. For that reason, data curation is not a peripheral activity. It is one of the central determinants of whether the system will behave within professionally acceptable boundaries.

Documentation is less glamorous than model design, but in long-lived systems it is just as important. Teams change, infrastructure evolves, and earlier design choices quickly lose their context unless they are recorded. When that happens, institutions do not merely lose efficiency; they lose memory. In practice, good documentation allows a later team to understand why a retrieval pipeline was structured in a particular way, why one data source was preferred over another, or why a known limitation was accepted. In systems intended for enduring professional use, especially in public institutions and large legal organizations, that continuity is indispensable.

A recurring difficulty in AI development is that changes which appear minor often have consequences elsewhere in the system. A different chunking strategy, that is, a different way of splitting documents into smaller units (e.g. sentences or paragraphs) for retrieval, may improve retrieval in one class of documents while degrading it in another. A prompt adjustment may make answers read better while weakening their factual grounding. Without continuous evaluation, those shifts can remain invisible until users begin to notice them in practice.
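To make the effect of a chunking choice concrete, the following sketch splits the same text by paragraph and by sentence. It is an illustration under stated assumptions: the sample text is invented, the function names are hypothetical, and the regular-expression sentence splitter is deliberately naive (it would break on abbreviations such as "Art. 5", which real legal text contains in abundance).

```python
import re


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines; each paragraph becomes one retrieval unit."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_by_sentence(text: str) -> list[str]:
    """Split each paragraph after sentence-ending punctuation (naive rule)."""
    chunks: list[str] = []
    for paragraph in chunk_by_paragraph(text):
        flat = " ".join(paragraph.split())  # normalize internal whitespace
        chunks.extend(s for s in re.split(r"(?<=[.!?])\s+", flat) if s)
    return chunks


# Invented sample: an obligation followed by the exception that limits it.
SAMPLE = (
    "The tenant shall pay rent monthly. "
    "This duty is suspended while the premises are unusable."
    "\n\n"
    "The landlord shall maintain the premises."
)

if __name__ == "__main__":
    for name, chunker in [("paragraph", chunk_by_paragraph), ("sentence", chunk_by_sentence)]:
        print(name, chunker(SAMPLE))
```

With sentence-level chunks, the obligation and its exception land in separate units, so a retriever may return one without the other. Paragraph-level chunks keep them together at the price of coarser, noisier matches. Neither choice is correct in general, which is why the effect has to be measured per document class.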

For that reason, evaluation cannot be limited to occasional spot checks or anecdotal success cases. It requires stable metrics that capture both substantive quality and operational performance: whether relevant authorities are retrieved consistently, whether answers remain grounded in source material, and whether response times stay within acceptable bounds. Measurement is what keeps development answerable to evidence rather than intuition.
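As an illustration of what such stable metrics might look like, the sketch below computes one number for each of the three questions: recall at k for retrieval, a simple citation check for grounding, and a nearest-rank percentile for latency. The `EvalCase` structure, the field names, and the idea of approximating grounding by checking cited source ids against retrieved ones are assumptions made for this example, not a description of any particular evaluation suite.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalCase:
    retrieved: list[str]  # source ids returned by the retriever, best first
    relevant: set[str]    # source ids that domain experts marked as relevant
    cited: list[str]      # source ids the generated answer claims to rest on
    latency_ms: float     # end-to-end response time for this query


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of expert-marked sources found among the top k results."""
    if not relevant:
        raise ValueError("recall is undefined without relevant sources")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def citation_grounding(cited: list[str], retrieved: list[str]) -> float:
    """Share of cited sources that were actually retrieved (0.0 if nothing is cited)."""
    if not cited:
        return 0.0
    available = set(retrieved)
    return sum(1 for c in cited if c in available) / len(cited)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(cases: list[EvalCase], k: int = 3) -> dict:
    """Aggregate the three metrics over a fixed, versioned set of test cases."""
    n = len(cases)
    return {
        "mean_recall_at_k": sum(recall_at_k(c.retrieved, c.relevant, k) for c in cases) / n,
        "mean_grounding": sum(citation_grounding(c.cited, c.retrieved) for c in cases) / n,
        "p95_latency_ms": percentile([c.latency_ms for c in cases], 95),
    }
```

Running `evaluate` on the same case set before and after every change (a new chunking strategy, a prompt adjustment) turns "it seems better" into a comparison of numbers. The citation check is only a proxy: it catches answers that cite sources the system never retrieved, but it cannot tell whether a correctly cited passage actually supports the claim.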

Evaluation tells us whether the system is getting better. Observability tells us what the system is doing while it is running. Once a system enters live operation, that distinction becomes crucial. Contemporary AI systems generate continuous, usage-dependent costs, and they introduce failure modes that are not always visible from the outside. If an institution cannot see where latency is rising, where retrieval quality is falling, or which component is driving cost, it cannot govern the system responsibly. Observability is therefore not merely a technical convenience. It is a condition of oversight: the organization can exercise meaningful control only over a system whose behavior and costs are visible.
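A minimal sketch of per-component visibility, assuming a pipeline with named stages such as "retrieval" and "generation". The `Monitor` class, its method names, and the cost figures are hypothetical; a real deployment would more likely export such measurements to an existing metrics or tracing system than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Monitor:
    """Accumulates call counts, latency, and cost per pipeline component."""

    def __init__(self) -> None:
        self.calls: dict[str, int] = defaultdict(int)
        self.seconds: dict[str, float] = defaultdict(float)
        self.cost: dict[str, float] = defaultdict(float)

    def record(self, component: str, seconds: float, cost: float = 0.0) -> None:
        self.calls[component] += 1
        self.seconds[component] += seconds
        self.cost[component] += cost

    @contextmanager
    def track(self, component: str, cost: float = 0.0):
        """Time the enclosed block and record it, even if the block raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(component, time.perf_counter() - start, cost)

    def report(self) -> dict:
        """Per-component summary: number of calls, average latency, total cost."""
        return {
            name: {
                "calls": self.calls[name],
                "avg_seconds": self.seconds[name] / self.calls[name],
                "total_cost": self.cost[name],
            }
            for name in sorted(self.calls)
        }

    def costliest(self) -> str:
        """Name of the component with the highest accumulated cost."""
        return max(self.cost, key=self.cost.get)


if __name__ == "__main__":
    monitor = Monitor()
    with monitor.track("retrieval"):
        time.sleep(0.01)  # stand-in for a search call
    with monitor.track("generation", cost=0.002):  # hypothetical per-call cost
        time.sleep(0.02)  # stand-in for a model call
    print(monitor.report())
    print("cost driver:", monitor.costliest())
```

Even a summary this small answers the three questions raised above in rough form: where latency is accumulating, which component drives cost, and, once quality scores are recorded alongside, where retrieval is degrading.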

Legal texts are unusually sensitive to context. A sentence that appears decisive when isolated may look quite different once the surrounding provision, exception, or definition is restored. For example, a sentence may seem to impose a clear legal obligation on its own, while the immediately following subsection limits that obligation to a specific procedural situation or creates an exception that changes its practical meaning. For that reason, research advances that preserve contextual coherence during retrieval can have immediate practical value.

One example is late chunking in text processing, which aims to preserve more of a document’s broader context during retrieval. At first sight, it appeared particularly well suited to legal materials, and in our own work we expected it to improve performance. Yet our experience showed the opposite. In sentence-level labeling of Hungarian court decisions, late chunking introduced more noise than useful context and reduced performance rather than improving it. Strong engineering teams follow developments of this kind closely, not because novelty is valuable in itself, but because each method has to be validated against the actual task, the actual data, and the actual failure modes of professional use.

The broader conclusion is straightforward. Artificial Intelligence is not a self-sufficient instrument that can simply be acquired and “applied.” It is a system that must be designed, integrated, tested, operated, and continuously improved. The decisive advances in professional use do not come from downloading a single model, but from combining domain expertise, careful data work, clear documentation, and continuous evaluation within a coherent engineering framework. The objective, in each case, is not to produce an impressive demonstration, but to build a system that remains reliable, explainable, and improvable over time. In domains such as law, that standard is not optional. It is the precondition for professional trust. The systems that will matter most in the coming years are therefore unlikely to be merely more fluent or more impressive in isolation. They will be the systems that are more deeply integrated into institutional practice, and more readily subject to verification.


Daniel Nagy is Interim Director of AI Enablement and Head of Docutent Division at GriffSoft Zrt. Previously, he led software development at MONTANA Knowledge Management. His main interests include AI-powered knowledge management, natural language processing, semantic search, and the development of production-ready AI products and document intelligence systems.

Constitutional Discourse