
Repo to Knowledge Graph: A Minimal Toy Example

If you want to build a personal coding assistant, one of the first things you need to figure out is how to give it the right context, so that the results it returns are relevant to the user.

A common strategy might involve parsing individual files or modules, analyzing their structure, and providing isolated snippets of information. However, this approach often misses the bigger semantic picture: the relationships and dependencies between different entities of the codebase.

This is where knowledge graphs come in. A knowledge graph is a powerful data structure that captures entities (code elements like files, modules, and functions) and the relationships between them.

Knowledge graphs let us query and navigate the connections between data points. For a coding assistant, this means access not just to the structure of a single file but also to its interactions with the rest of the repository, which is particularly useful in “dead code” analysis. By building a knowledge graph of an entire codebase, you can provide the agent with rich, precise context it can use during inference.

In this article, we’ll explore a simplified, toy approach to this process. We’ll look at how to create a useful dependency graph (almost) from scratch. Then, we’ll explore how to transform that dependency graph into a queryable knowledge graph using cognee, a powerful tool for unleashing the full potential of LLMs by building and querying knowledge graphs.

Repo to Dependency Graph

When given a Python repository, our first task is to build a dependency graph. It will be a directed graph representing module dependencies in a static way, based solely on the code. It should capture all the direct dependencies... and a bit more.

A note on terminology:

Typically, “direct dependencies” refers to modules that are explicitly imported. That definition is too narrow for a coding assistant.

Since we also want our assistant to be aware of code that is referenced from other files, a more apt term would be “direct invocation dependency”, a concept somewhat related to call graphs.

To keep things simple, we will keep the term “direct dependency” and live with the slight semantic abuse.

Let’s look at two examples that illustrate why this is exactly what we need.

Example 1: Chained Access Direct Dependency

image1

Here, we want to capture that a.py depends directly on b.py, located in the foo subfolder of the repository. A standard “direct dependency” would not capture this.
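As a rough illustration of this case (the file contents below are assumed for this sketch and are not taken from the figure), a.py might import only the package foo and then reach b through attribute access. Listing the import statements with the standard-library ast module shows that b never appears in them:

```python
import ast

# Hypothetical contents of a.py; the names foo, b, and greet are illustrative.
A_PY_SOURCE = """\
import foo

foo.b.greet()
"""


def imported_module_names(source: str) -> set[str]:
    """Return the module names that appear in import statements."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


imports = imported_module_names(A_PY_SOURCE)
print(imports)
```

An import-level analysis of this file only sees foo, even though the code that actually runs lives in foo/b.py.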

Example 2: Init-Mediated Direct Dependency

image2

In this slightly more subtle example, a.py depends directly on b.py, which is now located in foo/bar.
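One way such a situation can arise (again with assumed file contents, since the figure is not reproduced here) is a foo/__init__.py that re-exports a name defined in foo/bar/b.py. The sketch below builds that hypothetical layout in a temporary directory, performs the import that a.py would perform, and asks the standard-library inspect module where the imported name is really defined:

```python
import importlib
import inspect
import sys
import tempfile
from pathlib import Path

# Hypothetical repository layout; file contents are assumed for illustration.
FILES = {
    "foo/__init__.py": "from .bar.b import greet\n",
    "foo/bar/__init__.py": "",
    "foo/bar/b.py": "def greet():\n    return 'hello from b'\n",
}

with tempfile.TemporaryDirectory() as repo_dir:
    for relative_path, content in FILES.items():
        target = Path(repo_dir) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

    sys.path.insert(0, repo_dir)
    try:
        # What a.py would do: `from foo import greet`.
        foo = importlib.import_module("foo")
        defining_file = Path(inspect.getsourcefile(foo.greet)).resolve()
        relative_definition = defining_file.relative_to(
            Path(repo_dir).resolve()
        ).as_posix()
    finally:
        # Undo the changes made to the interpreter state for this demo.
        sys.path.remove(repo_dir)
        for name in [m for m in sys.modules if m == "foo" or m.startswith("foo.")]:
            del sys.modules[name]

print(relative_definition)
```

The import statement mentions only foo, yet the definition resolves to foo/bar/b.py, which is the file the assistant would need as context.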

In both examples, it would be important for the coding assistant to have the code of b as context for a. So, let’s build a graph that enables that.

Existing Tools or a Tailored Solution?

There are many tools available that are relevant for what we are trying to build. However, as is often the case, none of them are a perfect fit for our exact use case. So, following Einstein’s advice that things should be made as simple as possible—but not simpler—we will create our own direct approach.

Our solution will aim for the sweet spot: higher-level than writing and coordinating AST parsers by hand, but lower-level than a single command that generates the dependency graph we need.

Building a Direct Approach

We can combine Parso and Jedi to identify the dependencies we’re interested in.

Parso is a lightweight Python library for parsing and analyzing Python code, providing a tree structure that represents the source code. Jedi is a powerful tool for code analysis that can resolve definitions: given a file and the location of a name within it, Jedi can locate the file containing its definition. This is precisely the functionality we need.

The process boils down to using Parso to build a code tree from a module, which will help us identify all the names for which it makes sense to look up definitions. Then, we will use Jedi to resolve those definitions.
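A minimal sketch of these two steps might look like the following. It is not cognee's implementation: the function name find_direct_dependencies, the filtering rules, and the tiny demo repository are assumptions made for illustration. It relies on parso's Module.get_used_names() to list name occurrences and on jedi's Script.goto() with follow_imports=True to resolve them.

```python
import tempfile
from pathlib import Path

import jedi   # third-party: pip install jedi
import parso  # third-party: installed as a dependency of jedi


def find_direct_dependencies(file_path: Path, repo_path: Path) -> set[str]:
    """Return repo-relative paths of other files that define names used in file_path."""
    repo_root = repo_path.resolve()
    file_path = file_path.resolve()
    source = file_path.read_text()

    # Step 1: Parso builds the tree and lists every name occurrence in the module.
    module = parso.parse(source)
    name_nodes = [node for nodes in module.get_used_names().values() for node in nodes]

    # Step 2: Jedi resolves each occurrence to the file that contains its definition.
    script = jedi.Script(
        code=source, path=str(file_path), project=jedi.Project(path=str(repo_root))
    )
    dependencies = set()
    for node in name_nodes:
        line, column = node.start_pos
        for definition in script.goto(line, column, follow_imports=True):
            if definition.module_path is None:  # e.g. builtins without a file
                continue
            defined_in = Path(definition.module_path).resolve()
            # Keep only definitions found in other files of the same repository.
            if defined_in != file_path and repo_root in defined_in.parents:
                dependencies.add(defined_in.relative_to(repo_root).as_posix())
    return dependencies


# Hypothetical demo repository, created on the fly so the sketch is self-contained.
with tempfile.TemporaryDirectory() as repo_dir:
    repo = Path(repo_dir)
    (repo / "foo").mkdir()
    (repo / "foo" / "__init__.py").write_text("")
    (repo / "foo" / "b.py").write_text("def greet():\n    return 'hello'\n")
    (repo / "a.py").write_text("from foo import b\n\nb.greet()\n")
    dependencies = find_direct_dependencies(repo / "a.py", repo)

print(sorted(dependencies))
```

Resolving every name occurrence is wasteful but keeps the sketch short; a real implementation would likely restrict the lookup to names that can plausibly refer to code in other files.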

The full solution is implemented in cognee’s repo. While we won’t cover all of it, here’s a brief walkthrough:

image3

image4

image5

image6

From here, we already have the information we need: each file’s dependencies can be extracted in a loop, giving us all the details about the nodes and edges.
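The loop itself can be sketched as follows. The resolver is passed in as a function, and the hard-coded mapping that stands in for it here is purely hypothetical, so that the example runs without any repository on disk:

```python
from typing import Callable, Iterable


def build_dependency_graph(
    file_paths: Iterable[str],
    get_dependencies: Callable[[str], Iterable[str]],
) -> tuple[set[str], set[tuple[str, str]]]:
    """Collect nodes (files) and directed edges (file -> file it depends on)."""
    nodes: set[str] = set()
    edges: set[tuple[str, str]] = set()
    for file_path in file_paths:
        nodes.add(file_path)
        for dependency in get_dependencies(file_path):
            nodes.add(dependency)
            edges.add((file_path, dependency))
    return nodes, edges


# Hypothetical resolver output, hard-coded so the sketch runs on its own.
EXAMPLE_DEPENDENCIES = {"a.py": ["foo/b.py"], "foo/b.py": []}

nodes, edges = build_dependency_graph(
    EXAMPLE_DEPENDENCIES, lambda path: EXAMPLE_DEPENDENCIES[path]
)
print(sorted(nodes))
print(sorted(edges))
```

Plain sets of paths and path pairs are enough at this stage, since the structure is only an intermediate step on the way to the knowledge graph.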

This is, in itself, a graph—though we aren’t using any graph-specific libraries or data structures in Python. Instead, we can directly proceed to building a knowledge graph, as we’ll do in the following section.

Dependency Graph to Knowledge Graph

cognee Knowledge Graphs

Creating a custom knowledge graph in cognee is straightforward. It involves three main steps:

  1. Define DataPoints, which are pydantic data structures that describe a single knowledge graph node, including its connections to other nodes.
  2. Create a function to fill the DataPoints with relevant information.
  3. Use cognee's built-in functionality to generate the knowledge graph.

Once these steps are done, cognee automatically creates a relational database for metadata, a graph database for nodes and edges, and a vector database for embeddings.

Users can choose the technologies they prefer at each step. Notably, cognee now supports FalkorDB, a blazing-fast graph database designed specifically for AI applications.

Defining DataPoints

We are ready to define our CodeFile class, which inherits from DataPoint and represents nodes in our knowledge graph.
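To convey the idea without depending on cognee, here is a rough stand-in that uses a standard-library dataclass instead of cognee's pydantic-based DataPoint. The field names path, source_code, and depends_on are assumptions for this sketch and may not match the real class:

```python
from dataclasses import dataclass, field


@dataclass
class CodeFile:
    """Stand-in for a DataPoint subclass: one graph node per source file."""

    path: str  # repo-relative file path, used here as the node identity
    source_code: str = ""  # file contents that the assistant can use as context
    # Outgoing edges: the files this file depends on.
    depends_on: list["CodeFile"] = field(default_factory=list)


b_file = CodeFile(path="foo/b.py", source_code="def greet(): ...")
a_file = CodeFile(path="a.py", source_code="import foo", depends_on=[b_file])
print([dependency.path for dependency in a_file.depends_on])
```

The key point is that a node carries its own connections: the depends_on field holds other CodeFile objects rather than bare strings, so the edges travel with the node.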

image7

Let’s break it down:

Creating CodeFile Data Points

Let’s say we’ve walked through the repository directory and stored all the file paths in a code_file_paths list. Next, we’ll create a function that returns a list of all CodeFile objects.
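One possible shape for such a function is sketched below, using a plain dataclass as a stand-in for the cognee DataPoint and a hypothetical get_dependencies callable in place of the real resolver. It works in two passes so that every edge points at a shared node object:

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class CodeFile:  # dataclass stand-in for a cognee DataPoint, not the real class
    path: str
    depends_on: list["CodeFile"] = field(default_factory=list)


def create_code_files(
    code_file_paths: Iterable[str],
    get_dependencies: Callable[[str], Iterable[str]],
) -> list[CodeFile]:
    """Build one CodeFile per path and link each to the files it depends on."""
    # First pass: create every node, so edges can reference existing objects.
    code_files = {path: CodeFile(path=path) for path in code_file_paths}
    # Second pass: attach edges, skipping dependencies outside the file list.
    for path, code_file in code_files.items():
        for dependency_path in get_dependencies(path):
            if dependency_path in code_files:
                code_file.depends_on.append(code_files[dependency_path])
    return list(code_files.values())


# Hypothetical inputs, hard-coded so the sketch runs without a repository.
code_file_paths = ["a.py", "foo/b.py"]
example_dependencies = {"a.py": ["foo/b.py"], "foo/b.py": []}

code_files = create_code_files(code_file_paths, lambda path: example_dependencies[path])
print([(f.path, [d.path for d in f.depends_on]) for f in code_files])
```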

image8

Here’s what’s happening:

Creating a cognee Knowledge Graph

To create a knowledge graph, we’ll execute the following asynchronous function. Let’s dive straight into the code:

image9

Here’s how it works:

After running this function, the repository will be processed, and the knowledge graph will be created in a graph database like FalkorDB or Neo4j. You can then interact with the graph using cognee's powerful functionality, detailed in the documentation.

Conclusion

This post illustrated how easily you can build custom knowledge graphs tailored to your needs using cognee. By capturing relationships in your codebase and leveraging tools like FalkorDB, you can create AI-optimized solutions with minimal effort. We hope this example sparks ideas for how you can leverage knowledge graphs in your projects.