From APIs and Relational Data to Knowledge Graphs
In our previous post, we explored DataPoints—the fundamental building blocks of cognee's knowledge graphs. We discussed how they process ingested information to extract entities and establish relationships between them, infusing unstructured data with meaning.
Although cognee automatically generates DataPoints, in this post we'll create them manually to better understand their features and purpose. Throughout this process, we'll implicitly construct an ontology-like framework that shapes how knowledge is stored and interconnected, and introduce you to some of cognee's tasks, pipelines, and the indispensable add_data_points function.
Once we've covered these basics, we'll walk through an example in which we'll link two DataPoints together and add them to a graph. Then, we'll take on a more ambitious challenge of pulling an array of structured data from an API, organizing it with dlt, and transforming it into a fully connected knowledge graph system. Along the way, we'll also be taking a peek under the hood to get an idea of how this all actually happens.
Ready to learn? Let's get started.
The Tasks and Pipelines Powering Graph Creation
We've seen how DataPoints define entities and relationships in a structured way. But how do you transform these isolated building blocks into a fully connected, searchable knowledge graph? Cognee makes this possible through pipelines—modular sequences of tasks that break down complex processes into manageable steps. Although pipelines can seem a bit daunting at first, we're keeping things straightforward today.
Each task in a pipeline wraps a function, ensuring it runs with the proper inputs and configurations. Tasks are connected in sequence so that the output of one task becomes the input for the next, creating a structured data flow. Cognee handles this recursively, ensuring that DataPoints and relationships propagate efficiently. Pipelines also bring practical benefits such as batching for improved performance and the ability to scale to larger, more intricate workflows.
For this post, we'll focus on a single-task pipeline to illustrate how DataPoints evolve into a graph. We'll use the built-in add_data_points function, which extracts nodes and edges, deduplicates them, and integrates them into the knowledge graph. It simultaneously writes data into multiple stores—vectors, metadata, and the graph database—enabling cognee to seamlessly interact with it across the system.
Let's jump into the examples.
A Simple Sample: Building a Two-Node Knowledge Graph
Before tackling more complex scenarios, we'll start with a straightforward example: creating a small knowledge graph with just two DataPoints and linking them together. The goal here is to demonstrate how pipelines and add_data_points work in practice.
Defining a Basic DataPoint
We begin by defining a simple DataPoint called TestDataPoint. Each instance will have a single field, testfield, along with an optional list of connections to other DataPoints.
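A minimal sketch of such a class might look like the following. The exact DataPoint import path varies between cognee versions, and the name of the connections field (here, connections) is illustrative:

```python
from typing import Optional

# Import path may differ slightly between cognee versions.
from cognee.infrastructure.engine import DataPoint


class TestDataPoint(DataPoint):
    # The single piece of information this node carries.
    testfield: str
    # Optional links to other DataPoints; cognee turns these into graph edges.
    connections: Optional[list["TestDataPoint"]] = None
    # Tell cognee which field(s) to embed and index for search.
    metadata: dict = {"index_fields": ["testfield"]}
```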
This is the smallest unit of structured knowledge we can store. Now, let's create two instances and connect them:
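Continuing the sketch above, we instantiate the class twice and link the instances through the (illustrative) connections field:

```python
# Two DataPoints, with the first pointing at the second.
second_point = TestDataPoint(testfield="second node")
first_point = TestDataPoint(testfield="first node", connections=[second_point])

data_points = [first_point, second_point]
```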
Running a Single-Step Pipeline
Now we'll insert these DataPoints into the graph using a one-step pipeline with add_data_points:
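A minimal version of that pipeline might look like the sketch below. Import locations and the exact run_tasks signature have shifted between cognee releases, so treat the specifics as assumptions rather than a copy-paste recipe:

```python
import asyncio

# Import locations may vary by cognee version.
from cognee.modules.pipelines import Task, run_tasks
from cognee.tasks.storage import add_data_points


async def build_graph(data_points):
    # Wrap the built-in add_data_points function in a Task.
    tasks = [Task(add_data_points)]

    # run_tasks yields status updates; iterating the generator drives the
    # pipeline to completion.
    pipeline = run_tasks(tasks, data=data_points)
    async for status in pipeline:
        print(status)


asyncio.run(build_graph(data_points))
```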
What Just Happened?
- We created two DataPoints and linked them.
- We defined a Task within the tasks variable:
  - A Task is an abstraction that wraps around a function, ensuring it gets executed with the necessary arguments.
  - Tasks are executed sequentially within a pipeline.
  - Here, we wrapped the built-in add_data_points function inside a Task.
  - The add_data_points function processes a list of DataPoints and integrates them into a knowledge graph.
- We ran a single-task pipeline using run_tasks:
  - run_tasks is one of the ways to execute a pipeline in cognee.
  - We passed:
    - The tasks list, which contained only the add_data_points task.
    - The data argument, which held our two DataPoints.
  - Internally, run_tasks:
    - Fed the data to the first (and only) task in the pipeline.
    - Executed the add_data_points function asynchronously.
- The pipeline completed execution, processing and storing the DataPoints in the knowledge graph.
You can find an example here.
This was just a warm-up. Let's now explore a more advanced use case.
A Complex Challenge: Pokémon Knowledge Graph
Now we're taking things up a notch. Instead of just a couple of DataPoints, we'll fetch structured Pokémon data from an external API and turn it into a comprehensive knowledge graph. We'll collect data with dlt, which enables us to extract it from various sources—including REST APIs, SQL databases, and cloud storage—and transform it into well-structured datasets. We will load the collected data into DataPoints, and finally run a cognee pipeline to add it all to our graph. We'll even run a query to show how it all comes together.
Data Models: Defining Our Entities
We set up DataPoint classes for different Pokémon aspects. This gives our data structure a built-in ontology: Pokémon have abilities, and these abilities belong to specific Pokémon types.
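For instance, two such classes might look like this. The field names and relationship layout are illustrative, and the DataPoint import path may vary by cognee version:

```python
from typing import Optional

from cognee.infrastructure.engine import DataPoint


class PokemonAbility(DataPoint):
    # A single ability, e.g. "overgrow" or "static".
    name: str
    is_hidden: bool = False
    metadata: dict = {"index_fields": ["name"]}


class Pokemon(DataPoint):
    # Core attributes pulled from the API.
    name: str
    height: int
    weight: int
    # Edges to the abilities this Pokémon has.
    abilities: Optional[list[PokemonAbility]] = None
    metadata: dict = {"index_fields": ["name"]}
```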
Data Collection: Using dlt to Fetch Pokémon Data
We use dlt to pull raw data from the Pokémon API. One function retrieves a list of Pokémon, while another fetches detailed information for each one. Dlt writes this data as JSONL files for us to process later.
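A sketch of that extraction step is shown below. The PokéAPI endpoints are real, but the resource layout, pipeline name, and output folder are assumptions made for illustration:

```python
import dlt
from dlt.sources.helpers import requests

POKEMON_API = "https://pokeapi.co/api/v2"


@dlt.resource(write_disposition="replace")
def pokemon_list(limit: int = 50):
    # Fetch the list of Pokémon (name plus detail URL for each).
    response = requests.get(f"{POKEMON_API}/pokemon?limit={limit}")
    response.raise_for_status()
    yield response.json()["results"]


@dlt.transformer(data_from=pokemon_list)
def pokemon_details(pokemons):
    # For every Pokémon in the list, fetch its full detail record.
    for pokemon in pokemons:
        yield requests.get(pokemon["url"]).json()


# The filesystem destination with loader_file_format="jsonl" writes local JSONL
# files; use an absolute path or file:// URL if your dlt version requires it.
pipeline = dlt.pipeline(
    pipeline_name="pokemon_pipeline",
    destination=dlt.destinations.filesystem(bucket_url="pokemon_data"),
    dataset_name="pokemon",
)
```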
Data Loading: Converting JSONL Files to DataPoints
After dlt has stored our raw data, we load the JSONL files and convert the data into structured DataPoints. One function processes abilities data, and another matches the abilities with their Pokémon:
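Here is a rough sketch of that loading step, using the DataPoint classes defined above. It assumes dlt's default normalization, which splits the nested abilities into a child table whose rows carry a flattened ability__name column and a _dlt_parent_id pointing back to their Pokémon; the helper names are my own:

```python
import json
import os


def load_jsonl_records(folder: str) -> list[dict]:
    # Read every row from the (uncompressed) JSONL files dlt produced.
    records = []
    for root, _dirs, files in os.walk(folder):
        for file_name in files:
            if not file_name.endswith(".jsonl"):
                continue
            with open(os.path.join(root, file_name)) as jsonl_file:
                records.extend(json.loads(line) for line in jsonl_file if line.strip())
    return records


def load_abilities(ability_records: list[dict]) -> dict[str, list[PokemonAbility]]:
    # Group ability rows by the dlt id of their parent Pokémon record.
    abilities_by_parent: dict[str, list[PokemonAbility]] = {}
    for record in ability_records:
        ability = PokemonAbility(
            name=record["ability__name"],
            is_hidden=record.get("is_hidden", False),
        )
        abilities_by_parent.setdefault(record["_dlt_parent_id"], []).append(ability)
    return abilities_by_parent


def load_pokemon(detail_records: list[dict], abilities_by_parent: dict) -> list[Pokemon]:
    # Match each Pokémon with its abilities and build connected DataPoints.
    return [
        Pokemon(
            name=record["name"],
            height=record["height"],
            weight=record["weight"],
            abilities=abilities_by_parent.get(record["_dlt_id"], []),
        )
        for record in detail_records
    ]
```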
Setting Up and Processing the Data
In this step, we configure our system by setting the directories for data and system files. Then, we run the dlt pipeline to fetch Pokémon data from the API and store it as JSONL files. Finally, we load these files and convert them into structured DataPoints.
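Pulled together, the setup step might look like the sketch below, building on the dlt pipeline and loaders defined above. The config helper names come from cognee's public API but may differ across versions:

```python
import pathlib

import cognee


def setup_and_process():
    # Point cognee at local directories for its data and system files.
    base_dir = pathlib.Path(__file__).parent
    cognee.config.data_root_directory(str(base_dir / ".data_storage"))
    cognee.config.system_root_directory(str(base_dir / ".cognee_system"))

    # Run the dlt pipeline: fetch from the API and write JSONL files.
    load_info = pipeline.run(
        [pokemon_list, pokemon_details], loader_file_format="jsonl"
    )
    print(load_info)

    # Convert the stored JSONL files into connected DataPoints
    # ("pokemon_data" matches the dlt bucket_url used above).
    records = load_jsonl_records("pokemon_data")
    ability_rows = [r for r in records if "ability__name" in r]
    detail_rows = [r for r in records if "height" in r]
    return load_pokemon(detail_rows, load_abilities(ability_rows))
```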
Integrating Data with Cognee and Querying the Graph
We can now use what we learned in our simple example. We start with a clean slate by pruning any existing data. Then, we initialize cognee and run the single-task pipeline with add_data_points to add our processed DataPoints to the knowledge graph. Once the data is integrated, we perform a sample search query to verify that the graph is both built and searchable.
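Below is a hedged sketch of that integration step, reusing the single-task pipeline from the first example. The prune, search, and SearchType calls reflect cognee's public API, but their exact import paths and signatures vary between releases:

```python
import cognee
from cognee.api.v1.search import SearchType  # import path may differ by version
from cognee.modules.pipelines import Task, run_tasks
from cognee.tasks.storage import add_data_points


async def integrate_and_query(data_points):
    # Start from a clean slate: remove previously stored data and system state.
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    # Run the same single-task pipeline as in the simple example.
    pipeline_run = run_tasks([Task(add_data_points)], data=data_points)
    async for status in pipeline_run:
        print(status)

    # Verify the graph is built and searchable with a sample query.
    results = await cognee.search(query_type=SearchType.INSIGHTS, query_text="pikachu")
    for result in results:
        print(result)
```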
Orchestrating the Workflow
Our main function simply calls the previous two steps in sequence. First, it sets up and processes the data, then it integrates the data and runs the query. Easy peasy.
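For completeness, the entry point under these assumptions is just a few lines:

```python
import asyncio


async def main():
    data_points = setup_and_process()
    await integrate_and_query(data_points)


if __name__ == "__main__":
    asyncio.run(main())
```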
You can also access the visualization here.
Under the Hood: What Happens in add_data_points?
Let's take a quick look at how the add_data_points function transforms structured data into a fully indexed, queryable knowledge graph:
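The sketch below paraphrases that flow in simplified form, using the helper names described in the steps that follow; it is illustrative rather than the function's actual source, and the internal import paths are indicative only:

```python
import asyncio

# Indicative imports only: these are cognee internals and their paths may differ.
from cognee.infrastructure.databases.graph import get_graph_engine
from cognee.modules.graph.utils import deduplicate_nodes_and_edges, get_graph_from_model
from cognee.tasks.storage.index_data_points import index_data_points
from cognee.tasks.storage.index_graph_edges import index_graph_edges


async def add_data_points(data_points):
    # 1. Recursively extract nodes and edges from every DataPoint, in parallel.
    nodes = []
    edges = []
    results = await asyncio.gather(
        *[get_graph_from_model(data_point) for data_point in data_points]
    )
    for graph_nodes, graph_edges in results:
        nodes.extend(graph_nodes)
        edges.extend(graph_edges)

    # 2. Drop duplicate nodes and edges so the graph stays lean.
    nodes, edges = deduplicate_nodes_and_edges(nodes, edges)

    # 3. Hand the cleaned-up graph to the graph engine for persistent storage.
    graph_engine = await get_graph_engine()
    await graph_engine.add_nodes(nodes)
    await graph_engine.add_edges(edges)

    # 4. Index the new nodes and edges for fast lookups and traversal.
    await index_data_points(data_points)
    await index_graph_edges()

    return data_points
```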
Here's what happens inside add_data_points:
- Recursive Extraction (get_graph_from_model): This function traverses all connected DataPoints, extracting nodes (entities) and edges (relationships). It takes advantage of the ontology-like structure we created when defining the DataPoints. Since it's recursive, it ensures that even deeply nested connections are mapped properly. The results are gathered asynchronously, processing multiple DataPoints at once.
- Deduplication (deduplicate_nodes_and_edges): Once nodes and edges are extracted, this step removes duplicates, ensuring that we don't store redundant data. This keeps the graph lean and prevents unnecessary clutter.
- Graph Storage (get_graph_engine().add_nodes() & get_graph_engine().add_edges()): The cleaned-up nodes and edges are then handed off to cognee's graph engine, making them a persistent part of the knowledge graph.
- Indexing for Fast Queries (index_data_points() & index_graph_edges()): The final step indexes both the new nodes and edges, enabling fast lookups and efficient graph traversal. This is what makes querying smooth and scalable.
Once add_data_points completes, the DataPoints are deeply integrated into cognee's knowledge graph, optimizing both storage and retrieval performance.
Intelligent Insights from Smarter Data Connections
Building robust knowledge graphs isn't merely a technical exercise—it's about unlocking the hidden potential within your data ecosystem. Taking a look at the steps behind cognee's graph creation shows us that even simple DataPoints, when thoughtfully structured and interconnected, can transform raw information into powerful insights.
This approach not only enhances personalized, context-aware responses but also redefines how we harness data to drive intelligent decision-making. As AI continues to evolve, the ability to extract and relate meaningful data will be a key competitive advantage, ushering in ever smarter and more responsive systems.
While cognee just works straight out of the box, this blog series is meant to provide you with a deeper understanding of the proverbial nuts & bolts behind its processes. As such, in the next post we'll dive deeper into advanced pipeline configurations and explore even more practical applications of knowledge graphs.
In the meantime, if you have any questions or ideas, don't hesitate to join our Discord community and share your thoughts. There's always exciting work happening here at cognee and we'd love to discuss what we're working on with you!
Special thanks to the amazing Hiba Jamal from dlt for inspiring the Pokémon example used in this post. Her thoughtful conceptualization and collaboration helped showcase cognee's capabilities in a fun way.
To try this example yourself, check out the notebook or the example script in cognee's repo.