## Evaluation
Vector Graph RAG is evaluated on three standard multi-hop QA benchmarks.
### Datasets
| Dataset | Description | Hop Count |
|---|---|---|
| MuSiQue | Multi-hop questions requiring 2–4 reasoning steps | 2–4 hops |
| HotpotQA | Wikipedia-based multi-hop QA | 2 hops |
| 2WikiMultiHopQA | Cross-document reasoning over Wikipedia | 2 hops |
Metric: Recall@5 — whether the ground-truth supporting passages appear within the top-5 retrieved results.
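
For reference, the sketch below shows one way to compute this metric from ranked retrieval output, under the per-question reading above (a question scores 1.0 only if every gold supporting passage is in the top 5). The function names and aggregation are illustrative; the exact implementation lives in the evaluation code.

```python
from typing import Dict, List

def recall_at_5(retrieved: List[str], gold: List[str], k: int = 5) -> float:
    """Score one question: 1.0 if every ground-truth supporting passage
    appears among the top-k retrieved passages, else 0.0."""
    top_k = set(retrieved[:k])
    return float(all(passage in top_k for passage in gold))

def benchmark_recall_at_5(retrieved_by_qid: Dict[str, List[str]],
                          gold_by_qid: Dict[str, List[str]]) -> float:
    """Average the per-question scores over all questions in a benchmark."""
    scores = [recall_at_5(retrieved_by_qid[qid], gold_by_qid[qid])
              for qid in gold_by_qid]
    return sum(scores) / len(scores)
```
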
### Results
#### Recall@5 vs. Naive RAG
| Method | MuSiQue | HotpotQA | 2WikiMultiHopQA | Average |
|---|---|---|---|---|
| Naive RAG | 55.6% | 90.8% | 73.7% | 73.4% |
| Vector Graph RAG | 73.0% | 96.3% | 94.1% | 87.8% |
| Improvement | +31.4% | +6.1% | +27.7% | +19.6% |
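
Improvement figures are relative to the Naive RAG baseline, i.e. (Vector Graph RAG - Naive RAG) / Naive RAG; the MuSiQue entry, for example, corresponds to (73.0 - 55.6) / 55.6 ≈ 31%.
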
#### Comparison with State-of-the-Art
| Method | MuSiQue | HotpotQA | 2WikiMultiHopQA | Average |
|---|---|---|---|---|
| HippoRAG (ColBERTv2)¹ | 51.9% | 77.7% | 89.1% | 72.9% |
| IRCoT + HippoRAG¹ | 57.6% | 83.0% | 93.9% | 78.2% |
| NV-Embed-v2² | 69.7% | 94.5% | 76.5% | 80.2% |
| HippoRAG 2² | 74.7% | 96.3% | 90.4% | 87.1% |
| Vector Graph RAG | 73.0% | 96.3% | 94.1% | 87.8% |
¹ HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models (NeurIPS 2024)

² From RAG to Memory: Non-Parametric Continual Learning for Large Language Models (2025)
### Methodology
For a fair comparison with HippoRAG, we use the same pre-extracted triplets from HippoRAG's repository rather than re-extracting them. This ensures the evaluation isolates improvements in the retrieval algorithm rather than differences in triplet extraction quality.
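
A minimal sketch of this setup is below. The file path and JSON layout are illustrative assumptions; the actual pre-extracted triplet files come from HippoRAG's repository and may use different filenames and field names.

```python
import json
from pathlib import Path
from typing import List, Tuple

# Hypothetical location of a pre-extracted triplet dump (adjust to the files
# distributed with HippoRAG's repository).
TRIPLET_FILE = Path("data/hipporag_openie/musique_triplets.json")

def load_pre_extracted_triplets(path: Path) -> List[Tuple[str, str, str]]:
    """Load (subject, relation, object) triplets produced by HippoRAG's
    OpenIE step, so both systems index exactly the same triplets."""
    with path.open() as f:
        records = json.load(f)
    triplets: List[Tuple[str, str, str]] = []
    for record in records:
        # Assumed schema: each record carries a "triples" list of 3-tuples.
        for subj, rel, obj in record.get("triples", []):
            triplets.append((str(subj), str(rel), str(obj)))
    return triplets

if __name__ == "__main__":
    triplets = load_pre_extracted_triplets(TRIPLET_FILE)
    print(f"Loaded {len(triplets)} pre-extracted triplets")
```
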
### Reproduction
See evaluation/README.md for full reproduction steps.