Repo to Knowledge Graph: A Minimal Toy Example
If you want to build a personal coding assistant, one of the first things you need to figure out is how to supply it with exactly the context it needs to produce relevant results for the user.
A common strategy might involve parsing individual files or modules, analyzing their structure, and providing isolated snippets of information. However, this approach often misses the bigger semantic picture: the relationships and dependencies between different entities of the codebase.
This is where knowledge graphs come in. A knowledge graph is a powerful data structure that captures entities (code elements like files, modules, and functions) and the relationships between them.
Knowledge graphs enable us to query and navigate the connections between data points with ease. For a coding assistant, this means having access to not just the structure of a single file but also its interactions with the rest of the repository, which is particularly useful in tasks like dead-code analysis. By building a knowledge graph of an entire codebase, you can provide the agent with rich, precise context it can use during inference.
In this article, we’ll explore a simplified, toy approach to this process. We’ll look at how to create a useful dependency graph (almost) from scratch. Then, we’ll explore how to transform that dependency graph into a queryable knowledge graph using cognee, a powerful tool for unleashing the full potential of LLMs by building and querying knowledge graphs.
Repo to Dependency Graph
When given a Python repository, our first task is to build a dependency graph. It will be a directed graph representing module dependencies in a static way, based solely on the code. It should capture all the direct dependencies... and a bit more.
A note on terminology:
Typically, “direct dependencies” refer to modules that are explicitly imported. Explicit imports alone, however, are not enough for our code assistant ambitions.
As we want our assistant to also be aware of the code being referenced from other files, a more apt term would be "direct invocation dependency", a concept somewhat related to call graphs.
To keep things simple, we will keep the term “direct dependency” and live with the slight semantic abuse.
Let’s look at two examples that illustrate why this is exactly what we need.
Example 1: Chained Access Direct Dependency
Here, we want to capture that `a.py` depends directly on `b.py`, located in the `foo` subfolder of the repository. A standard “direct dependency” would not capture this.
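The original snippets are not reproduced here, but a minimal reconstruction of the situation might look like the following (the file layout and the `helper` function are illustrative assumptions):

```python
# foo/b.py
def helper():
    return 42

# foo/__init__.py
from . import b

# a.py
import foo                   # the only explicit import in a.py

result = foo.b.helper()      # chained access: the invoked code lives in foo/b.py
```

An import-statement analysis would record a dependency of `a.py` on `foo/__init__.py` only; it is the chained attribute access `foo.b.helper` that actually pulls in `foo/b.py`.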
Example 2: Init-Mediated Direct Dependency
In this slightly more subtle example, `a.py` depends directly on `b.py`, which is now located in `foo/bar`.
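Again as an illustrative reconstruction, the re-export through the package’s `__init__.py` might look like this:

```python
# foo/bar/b.py
def helper():
    return 42

# foo/__init__.py
from .bar.b import helper    # re-export: __init__.py mediates the dependency

# a.py
from foo import helper       # the import statement mentions only the package foo

result = helper()            # the definition actually lives in foo/bar/b.py
```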
In both examples, it would be important for the coding assistant to have the code of `b.py` as context for `a.py`. So, let’s build a graph that enables that.
Existing Tools or a Tailored Solution?
There are many tools available that are relevant for what we are trying to build. However, as is often the case, none of them are a perfect fit for our exact use case. So, following Einstein’s advice that things should be made as simple as possible—but not simpler—we will create our own direct approach.
Our solution will aim for the sweet spot: a higher level than designing and coordinating AST parsers, but still a lower level than having a single command to generate the dependency graph we need.
Building a Direct Approach
We can combine Parso and Jedi to identify the dependencies we’re interested in.
Parso is a lightweight Python library for parsing and analyzing Python code, providing a tree structure that represents the source code. Jedi is a powerful tool for code analysis that can resolve definitions: given a file and the location of a name within it, Jedi can locate the file containing its definition. This is precisely the functionality we need.
The process boils down to using Parso to build a code tree from a module, which will help us identify all the names for which it makes sense to look up definitions. Then, we will use Jedi to resolve those definitions.
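To make the division of labor concrete, here is a minimal sketch of Jedi resolving a single name. The file and the (line, column) coordinates refer to the hypothetical `a.py` from Example 1 above:

```python
import jedi

source = open("a.py").read()
script = jedi.Script(code=source, path="a.py")

# Resolve the name at line 3, column 15 — the `helper` in
# `result = foo.b.helper()` — to its definition site.
for definition in script.goto(line=3, column=15, follow_imports=True):
    print(definition.module_path)   # e.g. /path/to/repo/foo/b.py
```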
The full solution is implemented in cognee’s repo. While we won’t cover all of it, here’s a brief walkthrough:
- Given a file, we first create a Parso tree from its source code (step 1 in the sketch after this list).
- The tree is recursively processed to extract the names of all entities for which definitions can be looked up (see the custom `_get_code_entities` function). In particular, for each entity, we also get its line and column in the module (the variables `entity_line` and `entity_column` below).
- For a given entity (or rather its `entity_line` and `entity_column`), we use Jedi to look up potential definitions via Jedi’s powerful (albeit occasionally glitchy) `Script.goto()` function.
- Definitions are resolved optimistically: anything Jedi cannot resolve is skipped rather than treated as an error.
- Finally, we gather all the dependencies, avoiding duplicates.
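For reference, here is a condensed sketch of the walkthrough above. It is a simplification of what cognee’s actual implementation does, not a copy of it; `repo_root` is assumed to be the repository’s root path, used to filter out definitions from outside the repo:

```python
import parso
import jedi


def _get_code_entities(node, entities):
    """Recursively collect the name leaves worth resolving.

    A simplified stand-in for cognee's helper of the same name.
    """
    if node.type == "name":
        entities.append(
            {
                "name": node.value,
                "entity_line": node.start_pos[0],
                "entity_column": node.start_pos[1],
            }
        )
    for child in getattr(node, "children", []):
        _get_code_entities(child, entities)
    return entities


def get_dependencies(file_path, repo_root):
    """Return the repo-internal files that `file_path` directly depends on."""
    source = open(file_path).read()

    # Step 1: build the Parso code tree and collect candidate entities.
    tree = parso.parse(source)
    entities = _get_code_entities(tree, [])

    # Step 2: resolve each entity with Jedi's Script.goto().
    script = jedi.Script(code=source, path=file_path)
    dependencies = set()
    for entity in entities:
        try:
            definitions = script.goto(
                entity["entity_line"],
                entity["entity_column"],
                follow_imports=True,
            )
        except Exception:
            continue  # optimistic resolution: skip what Jedi can't handle
        for definition in definitions:
            module_path = definition.module_path
            # Keep only definitions that live in *other* files of the repo.
            if (
                module_path
                and str(module_path).startswith(str(repo_root))
                and str(module_path) != str(file_path)
            ):
                dependencies.add(str(module_path))

    return sorted(dependencies)  # a duplicate-free edge list
```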
From here, we already have the information we need: each file’s dependencies can be extracted in a loop, giving us all the details about the nodes and edges.
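Assuming the `get_dependencies` sketch above and a `code_file_paths` list gathered by walking the repo, that loop might be as simple as:

```python
# Hypothetical driver loop: files are the nodes, and each resolved
# definition path contributes a directed edge.
edges = {path: get_dependencies(path, repo_root) for path in code_file_paths}
```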
This is, in itself, a graph—though we aren’t using any graph-specific libraries or data structures in Python. Instead, we can directly proceed to building a knowledge graph, as we’ll do in the following section.
Dependency Graph to Knowledge Graph
cognee Knowledge Graphs
Creating a custom knowledge graph in cognee is straightforward. It involves three main steps:
- Define `DataPoints`, which are pydantic data structures that describe a single knowledge graph node, including its connections to other nodes.
- Create a function to fill the `DataPoints` with relevant information.
- Use cognee's built-in functionality to generate the knowledge graph.
Once these steps are done, cognee automatically creates a relational database for metadata, a graph database for nodes and edges, and a vector database for embeddings.
Users can choose the technologies they prefer at each step. Notably, cognee now supports FalkorDB, a blazingly fast graph database designed specifically for AI applications.
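Provider selection typically happens through cognee’s configuration. As an indicative example (the exact variable names may differ between cognee versions, so check the configuration docs for yours):

```python
import os

# Indicative settings; cognee reads its configuration from environment
# variables (names may vary by version).
os.environ["GRAPH_DATABASE_PROVIDER"] = "falkordb"  # or "neo4j", "networkx", ...
os.environ["VECTOR_DB_PROVIDER"] = "lancedb"
```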
Defining DataPoints
We are ready to define our `CodeFile` class, which inherits from `DataPoint` and represents nodes in our knowledge graph.
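The original listing isn’t reproduced here; reconstructed from the breakdown below, it might look roughly like this (the `DataPoint` import path varies between cognee versions):

```python
from typing import List, Optional

from cognee.infrastructure.engine import DataPoint  # import path may vary by version


class CodeFile(DataPoint):
    __tablename__ = "codefile"

    path: str                                       # the file path
    source_code: Optional[str] = None               # the file's source code
    depends_on: Optional[List["CodeFile"]] = None   # direct dependencies

    # Ask cognee to embed `source_code` and store it in the vector database.
    _metadata: dict = {"index_fields": ["source_code"]}
```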
Let’s break it down:
- `__tablename__ = "codefile"` defines the table in the relational database that will store metadata about this `DataPoint` type.
- `path: str` stores the file path.
- `source_code: Optional[str]` stores the source code of the file.
- `depends_on: Optional[List["CodeFile"]]` contains a list of other `CodeFile` instances that the file depends on.
- The `_metadata` dictionary tells cognee to embed the content of `source_code` and store it in a vector database.
Creating CodeFile Data Points
Let’s say we’ve walked through the repository directory and stored all the file paths in a `code_file_paths` list. Next, we’ll create a function that returns a list of all `CodeFile` objects.
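A sketch of such a function, reusing the `get_dependencies` helper from the dependency-graph section (with `repo_root` assumed to be in scope):

```python
from typing import List


def get_code_files(code_file_paths: List[str]) -> List[CodeFile]:
    # First pass: one CodeFile per path, keyed by path, so that dependency
    # edges can later be wired up by dictionary lookup.
    code_files = {
        path: CodeFile(path=path, source_code=open(path).read())
        for path in code_file_paths
    }

    # Second pass: resolve each file's dependency paths to CodeFile objects.
    for path, code_file in code_files.items():
        code_file.depends_on = [
            code_files[dependency_path]
            for dependency_path in get_dependencies(path, repo_root)
            if dependency_path in code_files
        ]

    return list(code_files.values())
```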
Here’s what’s happening:
- We first create all `CodeFile` instances and store them in a dictionary, using the file paths as keys.
- This dictionary allows us to easily set the `depends_on` values in the second loop by referencing other `CodeFile` objects.
- The function then returns all the `CodeFile` objects as a list.
Creating a cognee Knowledge Graph
To create a knowledge graph, we’ll execute the following asynchronous function. Let’s dive straight into the code:
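A sketch of that function is below. `Task`, `add_data_points`, and `run_tasks` are real cognee building blocks, but their exact module paths and signatures differ between cognee versions, so treat the details as indicative; `collect_python_files` stands in for the omitted repo-walking code:

```python
import cognee
from cognee.infrastructure.databases.relational import create_db_and_tables
from cognee.modules.pipelines import run_tasks
from cognee.modules.pipelines.tasks.Task import Task
from cognee.tasks.storage import add_data_points


async def build_code_knowledge_graph(repo_path: str):
    # Setup: start from a clean slate, then create databases and tables.
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    await create_db_and_tables()

    # Repo walking (omitted): collect the .py file paths.
    code_file_paths = collect_python_files(repo_path)  # hypothetical helper

    # The pipeline: the output of get_code_files feeds into add_data_points.
    tasks = [Task(get_code_files), Task(add_data_points)]

    # Execute the pipeline and build the knowledge graph.
    pipeline = run_tasks(tasks, data=code_file_paths)
    async for status in pipeline:
        print(status)
```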
Here’s how it works:
- Setup: The first three `await` calls prepare the system by pruning data, clearing system metadata, and creating the necessary databases and tables.
- Repo Walking: We omit the code for walking the repo to collect `.py` file paths, assuming this part is straightforward.
- Task Wrapping: `Task(get_code_files)` wraps the `get_code_files` function into a `Task`, enabling cognee to integrate the `CodeFile` objects with the system.
- Add Data Points: `Task(add_data_points)` is a built-in task that handles creating databases, graphs, and embeddings behind the scenes. cognee can seamlessly integrate with vector databases to store embeddings, enabling fast and intelligent searches.
- Pipeline Definition: The list of tasks defines a cognee pipeline, where outputs from one task feed into the next.
- Pipeline Execution: `run_tasks` is a built-in function that executes the pipeline and generates the knowledge graph.
After running this function, the repository will be processed, and the knowledge graph will be created in a graph database like FalkorDB or Neo4j. You can then interact with the graph using cognee's powerful functionality, detailed in the documentation.
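For instance, once the graph exists, a semantic query over it might look like this (the `SearchType` import path is version-dependent, and the call must run inside an async function):

```python
import cognee
from cognee.modules.search.types import SearchType  # import path may vary by version

# Inside an async function / event loop:
results = await cognee.search(
    query_type=SearchType.INSIGHTS,
    query_text="Which files does a.py depend on?",
)
print(results)
```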
Conclusion
This post illustrated how easily you can build custom knowledge graphs tailored to your needs using cognee. By capturing relationships in your codebase and leveraging tools like FalkorDB, you can create AI-optimized solutions with minimal effort. We hope this example sparks ideas for how you can leverage knowledge graphs in your projects.