From APIs and Relational Data to Knowledge Graphs
In our previous post, we explored DataPoints—the fundamental building blocks of cognee's knowledge graphs. We discussed how they process ingested information to extract entities and establish relationships between them, infusing unstructured data with meaning.
Although cognee automatically generates DataPoints, in this post we'll create them manually to better understand their features and purpose. Throughout this process, we'll implicitly construct an ontology-like framework that shapes how knowledge is stored and interconnected, and introduce you to some of cognee's tasks, pipelines, and the indispensable add_data_points function.
Once we've covered these basics, we'll walk through an example in which we'll link two DataPoints together and add them to a graph. Then, we'll take on a more ambitious challenge of pulling an array of structured data from an API, organizing it with dlt, and transforming it into a fully connected knowledge graph system. Along the way, we'll also be taking a peek under the hood to get an idea of how this all actually happens.
Ready to learn? Let's get started.
The Tasks and Pipelines Powering Graph Creation
We've seen how DataPoints define entities and relationships in a structured way. But how do you transform these isolated building blocks into a fully connected, searchable knowledge graph? Cognee makes this possible through pipelines—modular sequences of tasks that break down complex processes into manageable steps. Although pipelines can seem a bit daunting at first, we're keeping things straightforward today.
Each task in a pipeline wraps a function, ensuring it runs with the proper inputs and configurations. Tasks are connected in sequence so that the output of one task becomes the input for the next, creating a structured data flow. Cognee handles this recursively, ensuring that DataPoints and relationships propagate efficiently. Pipelines also bring practical benefits such as batching for improved performance and the ability to scale to larger, more intricate workflows.
For this post, we'll focus on a single-task pipeline to illustrate how DataPoints evolve into a graph. We'll use the built-in add_data_points function, which extracts nodes and edges, deduplicates them, and integrates them into the knowledge graph. It simultaneously writes data into multiple stores—vectors, metadata, and the graph database—enabling cognee to seamlessly interact with it across the system.
Let's jump into the examples.
A Simple Sample: Building a Two-Node Knowledge Graph
Before tackling more complex scenarios, we'll start with a straightforward example: creating a small knowledge graph with just two DataPoints and linking them together. The goal here is to demonstrate how pipelines and add_data_points work in practice.
Defining a Basic DataPoint
We begin by defining a simple DataPoint called TestDataPoint. Each instance will have a single field, testfield, along with an optional list of connections to other DataPoints.
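A minimal sketch of such a class might look like the following. The exact DataPoint import path varies between cognee versions, and the name of the connections field (here, connections) is illustrative:

```python
from typing import Optional

# Import path may differ slightly between cognee versions.
from cognee.infrastructure.engine import DataPoint


class TestDataPoint(DataPoint):
    # The single piece of information this node carries.
    testfield: str
    # Optional links to other DataPoints; cognee turns these into graph edges.
    connections: Optional[list["TestDataPoint"]] = None
    # Tell cognee which field(s) to embed and index for search.
    metadata: dict = {"index_fields": ["testfield"]}
```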
This is the smallest unit of structured knowledge we can store. Now, let's create two instances and connect them:
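Continuing the sketch above, we instantiate the class twice and link the instances through the (illustrative) connections field:

```python
# Two DataPoints, with the first pointing at the second.
second_point = TestDataPoint(testfield="second node")
first_point = TestDataPoint(testfield="first node", connections=[second_point])

data_points = [first_point, second_point]
```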
Running a Single-Step Pipeline
Now we'll insert these DataPoints into the graph using a one-step pipeline with add_data_points:
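A minimal version of that pipeline might look like the sketch below. Import locations and the exact run_tasks signature have shifted between cognee releases, so treat the specifics as assumptions rather than a copy-paste recipe:

```python
import asyncio

# Import locations may vary by cognee version.
from cognee.modules.pipelines import Task, run_tasks
from cognee.tasks.storage import add_data_points


async def build_graph(data_points):
    # Wrap the built-in add_data_points function in a Task.
    tasks = [Task(add_data_points)]

    # run_tasks yields status updates; iterating the generator drives the
    # pipeline to completion.
    pipeline = run_tasks(tasks, data=data_points)
    async for status in pipeline:
        print(status)


asyncio.run(build_graph(data_points))
```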
What Just Happened?
- We created two DataPoints and linked them.
- We defined a Task within the tasks variable:
  - A Task is an abstraction that wraps around a function, ensuring it gets executed with the necessary arguments.
  - Tasks are executed sequentially within a pipeline.
  - Here, we wrapped the built-in add_data_points function inside a Task.
  - The add_data_points function processes a list of DataPoints and integrates them into a knowledge graph.
- We ran a single-task pipeline using run_tasks:
  - run_tasks is one of the ways to execute a pipeline in cognee.
  - We passed:
    - The tasks list, which contained only the add_data_points task.
    - The data argument, which held our two DataPoints.
  - Internally, run_tasks:
    - Fed the data to the first (and only) task in the pipeline.
    - Executed the add_data_points function asynchronously.
- The pipeline completed execution, processing and storing the DataPoints in the knowledge graph.
You can find an example here.
This was just a warm-up. Let's now explore a more advanced use case.
A Complex Challenge: Pokémon Knowledge Graph
Now we're taking things up a notch. Instead of just a couple of DataPoints, we'll fetch structured Pokémon data from an external API and turn it into a comprehensive knowledge graph. We'll collect data with dlt, which enables us to extract it from various sources—including REST APIs, SQL databases, and cloud storage—and transform it into well-structured datasets. We will load the collected data into DataPoints, and finally run a cognee pipeline to add it all to our graph. We'll even run a query to show how it all comes together.
Data Models: Defining Our Entities
We set up DataPoint classes for different Pokémon aspects. This gives our data structure a built-in ontology: Pokémon have abilities, and these abilities belong to specific Pokémon types.
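For instance, two such classes might look like this. The field names and relationship layout are illustrative, and the DataPoint import path may vary by cognee version:

```python
from typing import Optional

from cognee.infrastructure.engine import DataPoint


class PokemonAbility(DataPoint):
    # A single ability, e.g. "overgrow" or "static".
    name: str
    is_hidden: bool = False
    metadata: dict = {"index_fields": ["name"]}


class Pokemon(DataPoint):
    # Core attributes pulled from the API.
    name: str
    height: int
    weight: int
    # Edges to the abilities this Pokémon has.
    abilities: Optional[list[PokemonAbility]] = None
    metadata: dict = {"index_fields": ["name"]}
```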
Data Collection: Using dlt to Fetch Pokémon Data
We use dlt to pull raw data from the Pokémon API. One function retrieves a list of Pokémon, while another fetches detailed information for each one. Dlt writes this data as JSONL files for us to process later.
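A sketch of that extraction step is shown below. The PokéAPI endpoints are real, but the resource layout, pipeline name, and output folder are assumptions made for illustration:

```python
import dlt
from dlt.sources.helpers import requests

POKEMON_API = "https://pokeapi.co/api/v2"


@dlt.resource(write_disposition="replace")
def pokemon_list(limit: int = 50):
    # Fetch the list of Pokémon (name plus detail URL for each).
    response = requests.get(f"{POKEMON_API}/pokemon?limit={limit}")
    response.raise_for_status()
    yield response.json()["results"]


@dlt.transformer(data_from=pokemon_list)
def pokemon_details(pokemons):
    # For every Pokémon in the list, fetch its full detail record.
    for pokemon in pokemons:
        yield requests.get(pokemon["url"]).json()


# The filesystem destination with loader_file_format="jsonl" writes local JSONL
# files; use an absolute path or file:// URL if your dlt version requires it.
pipeline = dlt.pipeline(
    pipeline_name="pokemon_pipeline",
    destination=dlt.destinations.filesystem(bucket_url="pokemon_data"),
    dataset_name="pokemon",
)
```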
Data Loading: Converting JSONL Files to DataPoints
After dlt has stored our raw data, we load the JSONL files and convert the data into structured DataPoints. One function processes abilities data, and another matches the abilities with their Pokémon:
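Here is a rough sketch of that loading step, using the DataPoint classes defined above. It assumes dlt's default normalization, which splits the nested abilities into a child table whose rows carry a flattened ability__name column and a _dlt_parent_id pointing back to their Pokémon; the helper names are my own:

```python
import json
import os


def load_jsonl_records(folder: str) -> list[dict]:
    # Read every row from the (uncompressed) JSONL files dlt produced.
    records = []
    for root, _dirs, files in os.walk(folder):
        for file_name in files:
            if not file_name.endswith(".jsonl"):
                continue
            with open(os.path.join(root, file_name)) as jsonl_file:
                records.extend(json.loads(line) for line in jsonl_file if line.strip())
    return records


def load_abilities(ability_records: list[dict]) -> dict[str, list[PokemonAbility]]:
    # Group ability rows by the dlt id of their parent Pokémon record.
    abilities_by_parent: dict[str, list[PokemonAbility]] = {}
    for record in ability_records:
        ability = PokemonAbility(
            name=record["ability__name"],
            is_hidden=record.get("is_hidden", False),
        )
        abilities_by_parent.setdefault(record["_dlt_parent_id"], []).append(ability)
    return abilities_by_parent


def load_pokemon(detail_records: list[dict], abilities_by_parent: dict) -> list[Pokemon]:
    # Match each Pokémon with its abilities and build connected DataPoints.
    return [
        Pokemon(
            name=record["name"],
            height=record["height"],
            weight=record["weight"],
            abilities=abilities_by_parent.get(record["_dlt_id"], []),
        )
        for record in detail_records
    ]
```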
Setting Up and Processing the Data
In this step, we configure our system by setting the directories for data and system files. Then, we run the dlt pipeline to fetch Pokémon data from the API and store it as JSONL files. Finally, we load these files and convert them into structured DataPoints.
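Pulled together, the setup step might look like the sketch below, building on the dlt pipeline and loaders defined above. The config helper names come from cognee's public API but may differ across versions:

```python
import pathlib

import cognee


def setup_and_process():
    # Point cognee at local directories for its data and system files.
    base_dir = pathlib.Path(__file__).parent
    cognee.config.data_root_directory(str(base_dir / ".data_storage"))
    cognee.config.system_root_directory(str(base_dir / ".cognee_system"))

    # Run the dlt pipeline: fetch from the API and write JSONL files.
    load_info = pipeline.run(
        [pokemon_list, pokemon_details], loader_file_format="jsonl"
    )
    print(load_info)

    # Convert the stored JSONL files into connected DataPoints
    # ("pokemon_data" matches the dlt bucket_url used above).
    records = load_jsonl_records("pokemon_data")
    ability_rows = [r for r in records if "ability__name" in r]
    detail_rows = [r for r in records if "height" in r]
    return load_pokemon(detail_rows, load_abilities(ability_rows))
```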
Integrating Data with Cognee and Querying the Graph
We can now use what we learned in our simple example. We start with a clean slate by pruning any existing data. Then, we initialize cognee and run the single-task pipeline with add_data_points to add our processed DataPoints to the knowledge graph. Once the data is integrated, we perform a sample search query to verify that the graph is both built and searchable.
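Below is a hedged sketch of that integration step, reusing the single-task pipeline from the first example. The prune, search, and SearchType calls reflect cognee's public API, but their exact import paths and signatures vary between releases:

```python
import cognee
from cognee.api.v1.search import SearchType  # import path may differ by version
from cognee.modules.pipelines import Task, run_tasks
from cognee.tasks.storage import add_data_points


async def integrate_and_query(data_points):
    # Start from a clean slate: remove previously stored data and system state.
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    # Run the same single-task pipeline as in the simple example.
    pipeline_run = run_tasks([Task(add_data_points)], data=data_points)
    async for status in pipeline_run:
        print(status)

    # Verify the graph is built and searchable with a sample query.
    results = await cognee.search(query_type=SearchType.INSIGHTS, query_text="pikachu")
    for result in results:
        print(result)
```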
Orchestrating the Workflow
Our main function simply calls the previous two steps in sequence. First, it sets up and processes the data, then it integrates the data and runs the query. Easy peasy.
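For completeness, the entry point under these assumptions is just a few lines:

```python
import asyncio


async def main():
    data_points = setup_and_process()
    await integrate_and_query(data_points)


if __name__ == "__main__":
    asyncio.run(main())
```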
You can also access the visualization here.
Under the Hood: What Happens in add_data_points?
Let's take a quick look at how the add_data_points function transforms structured data into a fully indexed, queryable knowledge graph:
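The sketch below paraphrases that flow in simplified form, using the helper names described in the steps that follow; it is illustrative rather than the function's actual source, and the internal import paths are indicative only:

```python
import asyncio

# Indicative imports only: these are cognee internals and their paths may differ.
from cognee.infrastructure.databases.graph import get_graph_engine
from cognee.modules.graph.utils import deduplicate_nodes_and_edges, get_graph_from_model
from cognee.tasks.storage.index_data_points import index_data_points
from cognee.tasks.storage.index_graph_edges import index_graph_edges


async def add_data_points(data_points):
    # 1. Recursively extract nodes and edges from every DataPoint, in parallel.
    nodes = []
    edges = []
    results = await asyncio.gather(
        *[get_graph_from_model(data_point) for data_point in data_points]
    )
    for graph_nodes, graph_edges in results:
        nodes.extend(graph_nodes)
        edges.extend(graph_edges)

    # 2. Drop duplicate nodes and edges so the graph stays lean.
    nodes, edges = deduplicate_nodes_and_edges(nodes, edges)

    # 3. Hand the cleaned-up graph to the graph engine for persistent storage.
    graph_engine = await get_graph_engine()
    await graph_engine.add_nodes(nodes)
    await graph_engine.add_edges(edges)

    # 4. Index the new nodes and edges for fast lookups and traversal.
    await index_data_points(data_points)
    await index_graph_edges()

    return data_points
```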
Here's what happens inside add_data_points:
- Recursive Extraction (get_graph_from_model): This function traverses all connected DataPoints, extracting nodes (entities) and edges (relationships). It takes advantage of the ontology-like structure we created when defining the DataPoints. Since it's recursive, it ensures that even deeply nested connections are mapped properly. The results are gathered asynchronously, processing multiple DataPoints at once.
- Deduplication (deduplicate_nodes_and_edges): Once nodes and edges are extracted, this step removes duplicates, ensuring that we don't store redundant data. This keeps the graph lean and prevents unnecessary clutter.
- Graph Storage (get_graph_engine().add_nodes() & get_graph_engine().add_edges()): The cleaned-up nodes and edges are then handed off to cognee's graph engine, making them a persistent part of the knowledge graph.
- Indexing for Fast Queries (index_data_points() & index_graph_edges()): The final step indexes both the new nodes and edges, enabling fast lookups and efficient graph traversal. This is what makes querying smooth and scalable.
Once add_data_points completes, the DataPoints are deeply integrated into cognee's knowledge graph, optimizing both storage and retrieval performance.
Intelligent Insights from Smarter Data Connections
Building robust knowledge graphs isn't merely a technical exercise—it's about unlocking the hidden potential within your data ecosystem. Taking a look at the steps behind cognee's graph creation shows us that even simple DataPoints, when thoughtfully structured and interconnected, can transform raw information into powerful insights.
This approach not only enhances personalized, context-aware responses but also redefines how we harness data to drive intelligent decision-making. As AI continues to evolve, the ability to extract and relate meaningful data will be a key competitive advantage, ushering in ever smarter and more responsive systems.
While cognee just works straight out of the box, this blog series is meant to provide you with a deeper understanding of the proverbial nuts & bolts behind its processes. As such, in the next post we'll dive deeper into advanced pipeline configurations and explore even more practical applications of knowledge graphs.
In the meantime, if you have any questions or ideas, don't hesitate to join our Discord community and share your thoughts. There's always exciting work happening here at cognee and we'd love to discuss what we're working on with you!
Special thanks to the amazing Hiba Jamal from dlt for inspiring the Pokémon example used in this post. Her thoughtful conceptualization and collaboration helped showcase cognee's capabilities in a fun way.
To try this example yourself, check out the notebook or the example script in cognee's repo.