
Evaluation

Vector Graph RAG is evaluated on three standard multi-hop QA benchmarks.

Datasets

| Dataset | Description | Hops |
|---|---|---|
| MuSiQue | Multi-hop questions requiring 2–4 reasoning steps | 2–4 |
| HotpotQA | Wikipedia-based multi-hop QA | 2 |
| 2WikiMultiHopQA | Cross-document reasoning over Wikipedia | 2 |

Metric: Recall@5, the proportion of the ground-truth supporting passages that appear among the top-5 retrieved results for each question, averaged over the benchmark.
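For reference, here is a minimal sketch of the metric in Python. It assumes Recall@5 is computed per question as the fraction of gold supporting passages among the top 5 results and then averaged; the function and variable names are illustrative, not taken from the codebase.

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: list[str], k: int = 5) -> float:
    """Per-question recall: fraction of gold supporting passages in the top-k retrieved."""
    top_k = set(retrieved_ids[:k])
    return sum(pid in top_k for pid in gold_ids) / len(gold_ids)


def mean_recall_at_k(questions: list[tuple[list[str], list[str]]], k: int = 5) -> float:
    """Benchmark score: per-question Recall@k averaged over all questions."""
    return sum(recall_at_k(r, g, k) for r, g in questions) / len(questions)
```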

Results

Recall@5 vs. Naive RAG

| Method | MuSiQue | HotpotQA | 2WikiMultiHopQA | Average |
|---|---|---|---|---|
| Naive RAG | 55.6% | 90.8% | 73.7% | 73.4% |
| Vector Graph RAG | 73.0% | 96.3% | 94.1% | 87.8% |
| Relative improvement | +31.4% | +6.1% | +27.7% | +19.6% |

Comparison with State-of-the-Art

| Method | MuSiQue | HotpotQA | 2WikiMultiHopQA | Average |
|---|---|---|---|---|
| HippoRAG (ColBERTv2)¹ | 51.9% | 77.7% | 89.1% | 72.9% |
| IRCoT + HippoRAG¹ | 57.6% | 83.0% | 93.9% | 78.2% |
| NV-Embed-v2² | 69.7% | 94.5% | 76.5% | 80.2% |
| HippoRAG 2² | 74.7% | 96.3% | 90.4% | 87.1% |
| Vector Graph RAG | 73.0% | 96.3% | 94.1% | 87.8% |

¹ HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models (NeurIPS 2024).
² From RAG to Memory: Non-Parametric Continual Learning for Large Language Models (2025).

Methodology

For a fair comparison with HippoRAG, we use the same pre-extracted triplets from HippoRAG's repository rather than re-extracting them. Both systems therefore index identical triplets, so the evaluation isolates improvements in the retrieval algorithm rather than differences in triplet-extraction quality; a sketch of this setup follows.
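For concreteness, the sketch below shows what reusing a shared triplet set might look like. The file name and JSON layout are assumptions for illustration only and are not taken from HippoRAG's repository.

```python
import json


def load_pre_extracted_triplets(path: str) -> dict[str, list[tuple[str, str, str]]]:
    """Load pre-extracted (subject, relation, object) triplets.

    Hypothetical format: a JSON object mapping passage IDs to triplet lists;
    the actual layout in HippoRAG's repository may differ.
    """
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {pid: [tuple(t) for t in triples] for pid, triples in raw.items()}


if __name__ == "__main__":
    # Hypothetical filename. Both systems index the same triplets, so any
    # Recall@5 difference reflects retrieval, not extraction quality.
    triplets = load_pre_extracted_triplets("musique_triplets.json")
    print(f"Loaded triplets for {len(triplets)} passages")
```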

Reproduction

See evaluation/README.md for full reproduction steps.