Cognee Benchmark Results

Production-ready AI memory, tested and measured.

Benchmark Overview

We benchmarked Cognee against leading memory frameworks, including Mem0, Graphiti, and LightRAG, using a subset of 24 HotpotQA multi-hop questions designed to test complex reasoning and factual consistency. Benchmarks were executed on Modal Cloud with 45 repeated runs per system to improve reproducibility and reduce the noise caused by LLM-based evaluation variance.
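To show why repeated runs matter, here is a minimal sketch of how per-run scores could be summarized into a mean and a standard error. The helper `summarize_runs` and the example scores are hypothetical placeholders for illustration. They are not taken from the benchmark code or its results.

```python
import math
import statistics


def summarize_runs(scores):
    """Return run count, mean, sample standard deviation, and standard error."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    stderr = stdev / math.sqrt(len(scores))  # shrinks as the run count grows
    return {"runs": len(scores), "mean": mean, "stdev": stdev, "stderr": stderr}


# Placeholder scores for illustration only (not benchmark data).
example_scores = [0.90, 0.94, 0.92, 0.95, 0.91]
summary = summarize_runs(example_scores)
print(summary)
```

Because the standard error falls with the square root of the number of runs, averaging many runs gives a steadier estimate than any single LLM-judged run.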

All code, configs, and datasets are open-sourced. You can reproduce every step yourself.

Access the Benchmark Code: available on GitHub to replicate and validate our evaluations independently.

Benchmarks

Key performance metrics

Results for Cognee

Human-like correctness: 0.93

DeepEval F1: 0.84

DeepEval correctness: 0.85

DeepEval EM: 0.69
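For readers unfamiliar with EM and F1, the sketch below shows how exact match and token-level F1 are commonly computed for question answering, using the usual normalization (lowercasing, then stripping punctuation and articles). This is an illustrative assumption about the metric definitions. DeepEval's own implementation may differ in its details, and the function names here are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is the stricter of the two: an answer such as "Paris, France" against the reference "Paris" gets partial F1 credit but no exact-match credit.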

Real-World Evaluation

Unlike typical QA tests that reward surface-level matches, our benchmark measures information correctness, reasoning depth, and faithfulness. We evaluate answers not just by what they say, but by whether they actually make sense.

[Chart: Cognee, LightRAG, Graphiti (Previous Result), and Mem0 compared on Human-like Correctness, DeepEval Correctness, DeepEval F1, and DeepEval EM]

Optimized Cognee configurations

[Chart: Graph Completion CoT, Graph Completion Context Extension, Graph Completion, and Cognee (Non-Optimized Evaluation) compared on Human-like Correctness, DeepEval Correctness, DeepEval F1, and DeepEval EM]

Optimizations

Chain-of-Thought gains

Cognee Graph Completion with Chain-of-Thought (CoT) shows significant performance improvements over the previous non-optimized version:

Human-like Correctness (0.738 → 0.925)

DeepEval Correctness (0.569 → 0.846)

DeepEval F1 (0.203 → 0.841)

DeepEval EM (0.04 → 0.687)
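The size of each gain is easier to compare when written as an absolute and a relative change. The short sketch below does that arithmetic on the before and after values listed above. The `gains` helper is a hypothetical name used only for this illustration.

```python
# (before, after) pairs copied from the Chain-of-Thought results above.
cot_results = {
    "Human-like Correctness": (0.738, 0.925),
    "DeepEval Correctness": (0.569, 0.846),
    "DeepEval F1": (0.203, 0.841),
    "DeepEval EM": (0.04, 0.687),
}


def gains(before, after):
    """Return (absolute change, relative change as a fraction of the baseline)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


for metric, (before, after) in cot_results.items():
    absolute, relative = gains(before, after)
    print(f"{metric}: {absolute:+.3f} absolute, {relative:+.1%} relative")
```

The relative change for DeepEval EM is by far the largest because its baseline of 0.04 is close to zero, so the absolute changes are the fairer way to compare across metrics.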

Comprehensive Metrics Comparison

[Chart: Cognee, LightRAG, Graphiti (Previous Result), and Mem0 compared on Human-like Correctness, DeepEval Correctness, DeepEval F1, and DeepEval EM]

Key takeaways

Human-like accuracy

Cognee delivers the most contextually accurate and human-like answers across all evaluated systems.

True comprehension

Its hybrid graph + vector memory produces responses that reflect true comprehension, not just keyword overlap.

Scales effortlessly

Cognee’s architecture runs on Modal’s distributed infrastructure. It easily scales from single-instance tests to multi-node workloads.

Stronger retrievers

Its graph-completion retrievers consistently outperform simpler retrievers in both correctness and performance.

Evaluation

Memory that holds up over long conversations

We evaluate Cognee on BEAM, a long-horizon memory benchmark where relevant evidence is scattered across many turns. The entire workflow is built from open-source Cognee components, not a benchmark-specific system.

BEAM stresses memory across growing context windows.

What BEAM tests

Long, topically diverse conversations

BEAM probes long-term memory across coherent dialogues where the topic shifts and the evidence needed for an answer is spread across many turns.

Scattered evidence, recalled on demand

Success depends on storing information well and making it usable again later, not on remembering the last few messages.

Memory fundamentals, not a benchmark hack

Cognee was not built for BEAM. BEAM is a benchmark where structured memory, retrieval strategy, and session continuity happen to matter.

Research threads

Well-structured input

Conversations are ingested into structured memory so stored facts stay coherent and addressable across the full history.

Graph memory

A graph representation links entities and facts across turns, so related context can be traversed rather than re-searched from scratch.
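As a rough illustration of the idea, the sketch below stores facts as edges between entities and walks the graph breadth-first to collect related context. `GraphMemory`, its methods, and the example facts are hypothetical and written for this page. They are not Cognee's API or data model.

```python
from collections import defaultdict, deque


class GraphMemory:
    """Toy entity graph: nodes are entities, edges carry (subject, relation, object, turn)."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add_fact(self, subject, relation, obj, turn):
        fact = (subject, relation, obj, turn)
        # Index the fact under both entities so traversal can start from either.
        self._edges[subject].append((obj, fact))
        self._edges[obj].append((subject, fact))

    def related(self, start, max_hops=2):
        """Collect facts reachable from `start` within `max_hops` edges."""
        seen_nodes = {start}
        facts = []
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand past the hop limit
            for neighbor, fact in self._edges.get(node, []):
                if fact not in facts:
                    facts.append(fact)
                if neighbor not in seen_nodes:
                    seen_nodes.add(neighbor)
                    queue.append((neighbor, depth + 1))
        return facts


memory = GraphMemory()
memory.add_fact("Alice", "works_at", "Acme", turn=3)
memory.add_fact("Acme", "based_in", "Berlin", turn=41)
print(memory.related("Alice", max_hops=2))
```

The two facts come from turns far apart, yet a two-hop walk from "Alice" reaches both without running a second search.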

Multiple retrieval strategies

Query decomposition and hybrid graph-plus-chunk retrieval target the right evidence depending on how the question is shaped.
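One common way to merge ranked results from two retrievers is reciprocal rank fusion (RRF), sketched below. This page does not say how Cognee combines graph and chunk results, so treat RRF as a stand-in technique. The function name, the item identifiers, and the constant `k=60` (a conventional default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: each item scores the sum of 1 / (k + rank) over all lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result identifiers from a graph retriever and a chunk retriever.
graph_hits = ["fact:alice-acme", "fact:acme-berlin"]
chunk_hits = ["chunk:17", "fact:acme-berlin"]
fused = reciprocal_rank_fusion([graph_hits, chunk_hits])
print(fused)
```

Items that appear in both lists rise to the top, which is the behavior a hybrid retriever typically wants.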

Agentic retrieval

When a single lookup is not enough, retrieval becomes iterative, gathering and refining context until the answer is supported.
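The loop below is a minimal sketch of iterative retrieval: retrieve, check whether the gathered context supports an answer, refine the query, and repeat up to a step limit. The `retrieve`, `is_supported`, and `refine` callables are hypothetical stand-ins. In a real system they would be backed by a retriever and an LLM, and nothing here reflects Cognee's actual interfaces.

```python
def iterative_retrieve(question, retrieve, is_supported, refine, max_steps=3):
    """Gather context over several retrieval rounds; return (context, steps used)."""
    context = []
    query = question
    for step in range(1, max_steps + 1):
        for item in retrieve(query):
            if item not in context:
                context.append(item)
        if is_supported(context):
            return context, step
        query = refine(query, context)  # ask a follow-up based on what was found
    return context, max_steps


# Toy stand-ins for illustration only.
TOY_INDEX = {
    "alice": ["Alice works at Acme."],
    "acme": ["Acme is based in Berlin."],
}


def toy_retrieve(query):
    words = query.lower().replace("?", "").split()
    return [fact for key, facts in TOY_INDEX.items() if key in words for fact in facts]


def toy_is_supported(context):
    return any("Berlin" in fact for fact in context)


def toy_refine(query, context):
    # Follow the entity mentioned in the evidence gathered so far.
    return "acme" if any("Acme" in fact for fact in context) else query


context, steps = iterative_retrieve(
    "Where does Alice work from?", toy_retrieve, toy_is_supported, toy_refine
)
print(steps, context)
```

The step limit keeps the loop bounded when no amount of refinement produces supporting evidence.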

Session-aware memory and feedback

Memory tracks session continuity and incorporates feedback, so context carries forward across a long-running conversation.

Reproducible local and Modal pipeline

An end-to-end pipeline covers ingestion, retrieval, answer generation, and scoring. Run it locally for inspection, or on Modal for larger sweeps.

Get the eval pack

Open Evaluation

We invite the community to explore the data, re-run the experiments and contribute improvements.

Custom deployment

Looking for a custom deployment?
Chat with our engineers!