
Knowledge Graphs and LLMs: How Can We Move from Structured Knowledge to AI-Generated Answers? – Part I.

With the development of Artificial Intelligence, there is a growing need for efficient processing and use of data. Knowledge Graphs are a well-established solution for representing structured knowledge, while Large Language Models (LLMs) generate natural language responses. But do these two technologies compete or complement each other? In this post, we explore this question, showing how Knowledge Graphs can help LLMs generate more accurate and reliable answers, and how automated graph building can lead to a new, more efficient approach to information management.

Since the very beginning of the development of Artificial Intelligence, a core question has been how to represent knowledge about the world in a way that can be understood and effectively used by information processing systems. In the early days, researchers used rule-based approaches and semantic networks. These methods attempted to capture the structure and organization patterns of human knowledge through logical rules and web-like relationships. From this approach grew the concepts of ontologies and the Semantic Web, and by the 2000s standards such as RDF (Resource Description Framework) and OWL (Web Ontology Language) had emerged. The goal all along has been to make information on the web – and in the digital space – not just searchable, but truly interpretable and contextual.

The big breakthrough, according to many, came with the launch of the Google Knowledge Graph in 2012, which showed that a semantic approach can be extremely effective in business and everyday life. The term “Knowledge Graph” came to the fore, signaling how much knowledge a graph representation of the connections and relations between data points can carry. A Knowledge Graph represents entities such as people, places, concepts and events as nodes, and the relationships between them as edges. Edges can express abstract concepts, ownership relationships, acquaintances (in the case of social media profiles), the number of personal interactions, and so on. With this approach, search results are no longer displayed as a simple list, but are embedded in context, linking connected concepts.

A simple example: people and the number of messages exchanged displayed in a graph.
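To make the example above concrete, here is a minimal sketch of such a graph in Python, using the networkx library (the names and message counts are, of course, purely illustrative):

```python
import networkx as nx

g = nx.Graph()

# People become nodes of the graph.
g.add_nodes_from(["Alice", "Bob", "Carol"])

# The number of messages exchanged becomes an attribute on the edges.
g.add_edge("Alice", "Bob", messages=42)
g.add_edge("Bob", "Carol", messages=7)

# Querying the structure: with whom has Alice exchanged messages, and how many?
for neighbor in g.neighbors("Alice"):
    print(neighbor, g["Alice"][neighbor]["messages"])
```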

In the business and technology world today, Knowledge Graphs seem to be enjoying a second renaissance. For a long time, however, their creation seemed too time-consuming and complex a process for widespread practical application. Manual graph construction – where one tries to extract entities (e.g. persons, locations, etc.) and the relationships between them from documents by hand – is extremely laborious, and many projects fail at this point. Many people have found that, even if they learn to use Neo4j or similar graph databases, the biggest challenge is extracting data automatically and reliably from various PDF or Word documents or web pages. The core of the problem is that older, simple text extraction and keyword search techniques can only pull raw text from documents, without recognizing the important entities and the relationships between them. Information is thus available, but not in the structured form that is essential for building Knowledge Graphs. Without it, a lot of manual post-processing or a much more complex Artificial Intelligence-based solution is required to produce a truly usable data structure with meaningful relations.

In recent years, however, a very different approach has become increasingly dominant. Large Language Models (LLMs), such as GPT-4 or LLaMA, do not rely on pre-structured relations. Instead, they learn from a huge corpus of texts and predict the next words in a text from statistical patterns. Their “knowledge”, however, does not take the form of explicit graph relations; it lies in the model’s internal weights and representations. Although this method yields impressive results, these models are prone, for example, to hallucinations, i.e. to producing convincing-sounding but incorrect or unverifiable statements.
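What “predicting the next word from statistical patterns” means can be illustrated with the freely available GPT-2 model via the Hugging Face transformers library. The following short sketch prints the five tokens the model considers most probable after a prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the whole vocabulary for the next token.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p:.3f}")
```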

Retrieval-Augmented Generation (RAG) and the role of Knowledge Graphs

The need for reliability and source attribution gave rise to Retrieval-Augmented Generation (RAG) methods. The idea is that, before answering a given question, the generative model retrieves information from external data sources – such as Wikipedia, a corporate database, or a domain-specific Knowledge Graph – and “embeds” the resulting material into the answer. This greatly reduces the model’s tendency to hallucinate and allows for accurate source attribution.
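A minimal RAG loop can be sketched in a few lines. Here search_documents and call_llm are hypothetical placeholders standing in for any retriever (web search, vector database, Knowledge Graph) and any LLM API:

```python
def answer_with_rag(question: str, search_documents, call_llm, k: int = 3) -> str:
    # 1. Retrieve the k passages most relevant to the question.
    passages = search_documents(question, top_k=k)

    # 2. Embed the retrieved material into the prompt, asking the model
    #    to ground its answer in the numbered sources.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the sources below, "
        "and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Generate the answer from the augmented prompt.
    return call_llm(prompt)
```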

However, it is important to note that RAG systems do not necessarily require Knowledge Graphs or even traditional databases. In essence, they can be used wherever a large amount of information is available and its relevant parts can be extracted and passed to the LLM. For example, a web search can serve as the retrieval mechanism of a RAG system, and so can a database built for semantic search (e.g. one storing vector representations of texts). If a RAG system uses a graph database to retrieve contextual information, we speak of GraphRAG. This can complement traditional semantic (“vector-based”) search by exploring more complex relationships between data and supporting more efficient queries.
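As a rough illustration of the GraphRAG idea, the sketch below retrieves facts from a Neo4j database using the official Python driver; the connection details and the (:Person)-[:CEO_OF]->(:Company) schema are assumptions made up for this example:

```python
from neo4j import GraphDatabase

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def graph_context(company_name: str) -> list[str]:
    query = (
        "MATCH (p:Person)-[r]->(c:Company {name: $name}) "
        "RETURN p.name AS person, type(r) AS relation, c.name AS company"
    )
    with driver.session() as session:
        records = session.run(query, name=company_name)
        # Turn each graph fact into a sentence an LLM can consume as context.
        return [f"{r['person']} {r['relation']} {r['company']}" for r in records]

print(graph_context("Tesla"))
```

The returned sentences can then be embedded in the prompt exactly as in the RAG sketch above.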

Vector databases work on the basis of semantic similarity. To make text content searchable, each text is first transformed into a sequence of numbers (a vector) using an embedding model. These vectors live in a space of hundreds or even thousands of dimensions, where texts with similar meanings are placed close to each other. At query time, the question is also vectorized, and the documents whose vectors lie closest to it are retrieved. In practice, this is often called semantic search. This way of operating allows the system not only to find exact keyword matches, as in older search systems, but also to give preference to content that is genuinely similar in meaning.

For example, if you ask, “Who is the current head of Tesla?”, the system will first convert this question into a vector and then find the most similar vectors in the database. The original texts associated with the resulting vectors are likely to contain the answer. If a document states, “Elon Musk is the CEO of Tesla”, the system will easily recognize this, even if the exact words of the question do not appear in the text.
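The same idea in code, using the sentence-transformers library over a toy document set (“all-MiniLM-L6-v2” is just one common choice of embedding model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Elon Musk is the CEO of Tesla.",
    "Paris is the capital of France.",
    "Tesla produces electric vehicles.",
]
# Embed the documents once; a vector database would store these vectors.
doc_vectors = model.encode(documents, convert_to_tensor=True)

# Vectorize the question and rank documents by cosine similarity.
question = "Who is the current head of Tesla?"
q_vector = model.encode(question, convert_to_tensor=True)
scores = util.cos_sim(q_vector, doc_vectors)[0]

best = scores.argmax().item()
print(documents[best])  # -> "Elon Musk is the CEO of Tesla."
```

Note that the best match contains none of the question’s key words (“head”, “current”), yet its vector lies closest in meaning.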

Although vector-based search is highly effective in detecting semantic similarities and providing clear fact-based answers, it also has serious limitations. One major issue arises when a particular question cannot be answered by a single document but requires linking related information from multiple sources. This is where Knowledge Graphs come into the picture, storing not only individual entities but also the relationships between them, allowing more complex conclusions to be drawn.
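A small sketch shows the difference. No single “document” below states which company the founder of SpaceX leads, but two graph edges combined answer the question (the facts are, again, purely illustrative):

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Elon Musk", "SpaceX", relation="FOUNDED")
kg.add_edge("Elon Musk", "Tesla", relation="CEO_OF")

# Hop 1: who founded SpaceX?
founders = [u for u, v, d in kg.edges(data=True)
            if v == "SpaceX" and d["relation"] == "FOUNDED"]

# Hop 2: which companies does that person lead?
for person in founders:
    companies = [v for _, v, d in kg.out_edges(person, data=True)
                 if d["relation"] == "CEO_OF"]
    print(person, "leads", ", ".join(companies))  # -> Elon Musk leads Tesla
```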


István ÜVEGES, PhD is a Computational Linguist researcher and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment and emotion analysis.