Evaluation¶
Vector Graph RAG is evaluated on three standard multi-hop QA benchmarks used in the HippoRAG papers.
Datasets¶
| Dataset | Description | Hop Count | Source |
|---|---|---|---|
| MuSiQue | Multi-hop questions requiring 2–4 reasoning steps | 2–4 hops | Paper |
| HotpotQA | Wikipedia-based multi-hop QA | 2 hops | Paper |
| 2WikiMultiHopQA | Cross-document reasoning over Wikipedia | 2 hops | Paper |
Evaluation Metric
Recall@5 — the fraction of ground-truth supporting passages that appear within the top-5 retrieved results, averaged over all questions. This measures retrieval quality independently of the answer-generation step.
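As a concrete illustration, a minimal sketch of the metric (assuming the per-question formulation above; `recall_at_k` and its arguments are illustrative names, not part of the repository's API):

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold supporting passages found in the top-k retrieved results.

    retrieved: ranked list of passage IDs returned by the retriever
    gold:      list of ground-truth supporting passage IDs
    """
    top_k = set(retrieved[:k])
    return sum(1 for p in gold if p in top_k) / len(gold)


# Example: two gold passages, one appears in the top-5 → Recall@5 = 0.5
score = recall_at_k(["a", "b", "c", "d", "e", "f"], ["a", "f"])
```

The benchmark score is then the mean of this value over all questions in the dataset.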
Results¶
Recall@5 vs. Naive RAG¶
| Method | MuSiQue | HotpotQA | 2WikiMultiHopQA | Average |
|---|---|---|---|---|
| Naive RAG | 55.6% | 90.8% | 73.7% | 73.4% |
| Vector Graph RAG | 73.0% | 96.3% | 94.1% | 87.8% |
| Relative improvement | +31.4% | +6.1% | +27.7% | +19.6% |
Key Takeaway
Vector Graph RAG improves over Naive RAG by +19.6% on average (relative), with the largest gains on datasets requiring cross-document reasoning (MuSiQue +31.4%, 2WikiMultiHopQA +27.7%).
Comparison with State-of-the-Art¶
| Method | MuSiQue | HotpotQA | 2WikiMultiHopQA | Average |
|---|---|---|---|---|
| HippoRAG (ColBERTv2)¹ | 51.9% | 77.7% | 89.1% | 72.9% |
| IRCoT + HippoRAG¹ | 57.6% | 83.0% | 93.9% | 78.2% |
| NV-Embed-v2² | 69.7% | 94.5% | 76.5% | 80.2% |
| HippoRAG 2² | 74.7% | 96.3% | 90.4% | 87.1% |
| Vector Graph RAG | 73.0% | 96.3% | 94.1% | 87.8% |
Analysis
- Best average performance (87.8%) among all compared methods
- Ties HippoRAG 2 on HotpotQA (96.3%) — the most popular multi-hop benchmark
- Leads on 2WikiMultiHopQA (94.1%) — +3.7 percentage points over HippoRAG 2, showing stronger cross-document reasoning
- Slightly behind HippoRAG 2 on MuSiQue (73.0% vs 74.7%) — the hardest benchmark, with questions requiring 2–4 reasoning hops
Methodology¶
Fair Comparison
For fair comparison with HippoRAG, we use the same pre-extracted triplets from HippoRAG's repository rather than re-extracting them. This ensures the evaluation isolates the retrieval algorithm improvements without interference from triplet extraction quality differences.
Evaluation Setup¶
```mermaid
flowchart LR
    T["HippoRAG's\npre-extracted\ntriplets"] --> I["Index into\nMilvus"]
    I --> Q["Run benchmark\nqueries"]
    Q --> R["Check if gold\npassages in top-5"]
    R --> M["Compute\nRecall@5"]
```
- **Triplets**: Use HippoRAG's pre-extracted `(subject, predicate, object)` triplets from each benchmark dataset
- **Indexing**: Build the vector knowledge graph in Milvus using these triplets
- **Querying**: Run all benchmark questions through the query pipeline
- **Scoring**: Check whether the ground-truth supporting passages appear in the top-5 retrieved results
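The query-and-score steps can be sketched as a small harness. The `ToyIndex` class below is a stand-in for the Milvus-backed graph index (its `search` interface is an assumption for illustration, not the project's actual API):

```python
class ToyIndex:
    """Stand-in for the Milvus-backed vector graph index (hypothetical interface)."""

    def __init__(self, results_by_question):
        # Maps each question to a pre-ranked list of passage IDs.
        self._results = results_by_question

    def search(self, question, top_k=5):
        """Return the top-k ranked passage IDs for a question."""
        return self._results.get(question, [])[:top_k]


def evaluate_recall_at_k(benchmark, index, k=5):
    """Average per-question Recall@k over (question, gold_passages) pairs."""
    scores = []
    for question, gold in benchmark:
        top_k = set(index.search(question, top_k=k))
        scores.append(len(top_k & set(gold)) / len(gold))
    return sum(scores) / len(scores)
```

A toy run: if one question's gold passages are fully retrieved and another's are entirely missed, the benchmark score is the mean of 1.0 and 0.0, i.e. 0.5.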
Reproduction¶
Full reproduction steps are available in the evaluation directory:
```shell
# Clone the repository
git clone https://github.com/zilliztech/vector-graph-rag.git
cd vector-graph-rag

# See evaluation instructions
cat evaluation/README.md
```
See evaluation/README.md for detailed instructions.