Cognee Benchmark Results
Production-ready AI Memory, tested and measured.
Benchmark Overview
We benchmarked Cognee against leading memory frameworks, including Mem0, Graphiti, and LightRAG, on a subset of 24 HotPotQA multi-hop questions designed to test complex reasoning and factual consistency. Benchmarks ran on Modal Cloud with 45 repeated runs per system to improve reproducibility and reduce the noise caused by LLM-based evaluation variance.
All code, configs, and datasets are open-sourced. You can reproduce every step yourself.
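To illustrate why repeated runs help, here is a minimal sketch of how per-run scores could be summarized so that run-to-run variance is visible next to the mean. The function name `summarize_runs` and the three toy scores are assumptions made for illustration only. They are not taken from the Cognee benchmark code or its results.

```python
import statistics

def summarize_runs(scores):
    """Summarize per-run scores: mean, sample standard deviation, standard error."""
    mean = statistics.mean(scores)
    # Sample standard deviation needs at least two runs.
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    # The standard error of the mean shrinks as the number of runs grows.
    stderr = stdev / len(scores) ** 0.5
    return {"runs": len(scores), "mean": mean, "stdev": stdev, "stderr": stderr}

# Toy, made-up scores for three runs (NOT benchmark results).
toy_scores = [0.5, 0.75, 1.0]
summary = summarize_runs(toy_scores)
print(summary)
```

With 45 runs instead of 3, the same standard deviation would give a much smaller standard error, which is the sense in which repetition reduces evaluation noise.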
Key performance metrics
Results for Cognee
Real-World Evaluation
Unlike typical QA tests that reward surface-level matches, our benchmark measures information correctness, reasoning depth, and faithfulness. We evaluate answers not just by what they say, but by whether they actually make sense.
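For context, the surface-level metrics commonly reported on extractive QA datasets such as HotPotQA are exact match and token-level F1. The sketch below is a generic version of those two metrics, not Cognee's evaluation code. The example strings are made up to show how token overlap can give credit to an answer that is actually wrong.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A negated (wrong) answer still earns partial credit from token overlap.
print(exact_match("not Paris", "Paris"))  # 0.0
print(token_f1("not Paris", "Paris"))     # about 0.67
```

This is the failure mode that correctness- and faithfulness-oriented evaluation is meant to catch: the overlap score is high even though the meaning is reversed.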
Optimized Cognee configurations
Chain-of-Thought gains
Cognee Graph Completion with Chain-of-Thought (CoT) shows significant performance improvements over the previous non-optimized version:
Comprehensive Metrics Comparison
Human-like accuracy
Cognee delivers the most contextually accurate and human-like answers across all evaluated systems.
True comprehension
Cognee's hybrid graph + vector memory produces responses that reflect true comprehension, not just keyword overlap.
Scales effortlessly
Cognee’s architecture runs on Modal’s distributed infrastructure. It easily scales from single-instance tests to multi-node workloads.
Stronger retrievers
Cognee's graph-completion retrievers consistently outperform simpler retrievers in both correctness and performance.
Memory that holds up over long conversations
We evaluate Cognee on BEAM, a long-horizon memory benchmark where relevant evidence is scattered across many turns. The entire workflow is built from open-source Cognee components, not a benchmark-specific system.
What BEAM tests
Long, topically diverse conversations
BEAM probes long-term memory across coherent dialogues where the topic shifts and the evidence needed for an answer is spread across many turns.
Scattered evidence, recalled on demand
Success depends on storing information well and making it usable again later, not on remembering the last few messages.
Memory fundamentals, not a benchmark hack
Cognee was not built for BEAM. BEAM is a benchmark where structured memory, retrieval strategy, and session continuity happen to matter.
Research threads
Well-structured input
Conversations are ingested into structured memory so stored facts stay coherent and addressable across the full history.
Graph memory
A graph representation links entities and facts across turns, so related context can be traversed rather than re-searched from scratch.
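The idea of traversing linked facts can be sketched with a toy in-memory graph. `TinyGraphMemory`, its methods, and the example facts are hypothetical names invented for this illustration. They are not Cognee's API or data model.

```python
from collections import defaultdict, deque

class TinyGraphMemory:
    """Toy graph memory: entities are nodes, facts are edges tagged with a turn."""

    def __init__(self):
        # entity -> list of (neighbor, fact) pairs; edges are stored both ways
        self.adjacent = defaultdict(list)

    def add_fact(self, subject, relation, obj, turn):
        fact = (subject, relation, obj, turn)
        self.adjacent[subject].append((obj, fact))
        self.adjacent[obj].append((subject, fact))

    def related_facts(self, start, max_hops=2):
        """Breadth-first traversal that collects facts within max_hops of start."""
        seen = {start}
        queue = deque([(start, 0)])
        facts = []
        while queue:
            entity, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbor, fact in self.adjacent.get(entity, []):
                if fact not in facts:
                    facts.append(fact)
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
        return facts

memory = TinyGraphMemory()
memory.add_fact("Alice", "works_at", "Acme", turn=3)
memory.add_fact("Acme", "based_in", "Berlin", turn=41)

# Two hops from "Alice" reach a fact stated 38 turns later.
print(memory.related_facts("Alice", max_hops=2))
```

A question such as "Where is Alice's employer based?" needs both facts, even though they were mentioned far apart. Following edges from "Alice" finds them without a fresh search over the whole history.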
Multiple retrieval strategies
Query decomposition and hybrid graph-plus-chunk retrieval target the right evidence depending on how the question is shaped.
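One common way to combine results from two retrievers is reciprocal rank fusion (RRF), sketched below. This is a generic technique used here for illustration. The source does not state how Cognee merges graph and chunk results, and the hit identifiers are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; each item scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical hits from a graph retriever and a chunk (vector) retriever.
graph_hits = ["fact:acme-berlin", "fact:alice-acme"]
chunk_hits = ["chunk:17", "fact:alice-acme", "chunk:4"]

fused = reciprocal_rank_fusion([graph_hits, chunk_hits])
print(fused)
```

Items that both retrievers surface, such as `fact:alice-acme` here, rise to the top even when neither retriever ranked them first.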
Agentic retrieval
When a single lookup is not enough, retrieval becomes iterative, gathering and refining context until the answer is supported.
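The iterative pattern described above can be sketched as a bounded loop: retrieve, check whether the context supports an answer, and reformulate the query if it does not. All names here (`iterative_retrieve`, `retrieve`, `next_query`, `is_sufficient`) and the toy index are hypothetical. In a real system the last two would typically be LLM calls.

```python
def iterative_retrieve(question, retrieve, next_query, is_sufficient, max_rounds=3):
    """Gather context over several rounds until it is judged sufficient."""
    context = []
    query = question
    for round_number in range(1, max_rounds + 1):
        for item in retrieve(query):
            if item not in context:
                context.append(item)  # keep order, skip duplicates
        if is_sufficient(question, context):
            return context, round_number
        # Refine the query using what has been found so far.
        query = next_query(question, context)
    return context, max_rounds

# Toy stand-ins for a retriever and the two decision functions.
toy_index = {
    "Where is Alice's employer based?": ["Alice works at Acme"],
    "Acme location": ["Acme is based in Berlin"],
}

def retrieve(query):
    return toy_index.get(query, [])

def next_query(question, context):
    return "Acme location"

def is_sufficient(question, context):
    return any("based in" in item for item in context)

context, rounds = iterative_retrieve(
    "Where is Alice's employer based?", retrieve, next_query, is_sufficient
)
print(rounds, context)
```

The `max_rounds` cap is a design choice in this sketch: it bounds cost and latency when the evidence simply is not in memory.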
Session-aware memory and feedback
Memory tracks session continuity and incorporates feedback, so context carries forward across a long-running conversation.
Reproducible local and Modal pipeline
An end-to-end pipeline covers ingestion, retrieval, answer generation, and scoring. Run it locally for inspection, or on Modal for larger sweeps.