RAG faithfulness over an insurance knowledge base
A RAG bot must answer only from the retrieved docs. One ungrounded sentence should block the deploy.
The suite runs five questions against a five-document Acme Auto insurance KB. Four answers are fully grounded; the fifth invents a $75/day rental reimbursement limit that appears nowhere in the docs.
Faithfulness extracts every factual claim from each answer and verifies it against the retrieved context using an Anthropic claude-haiku-4-5 judge (QAG decomposition). Relevance scores whether the answer addresses the question at all.
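The claim-by-claim check described above can be sketched in a few lines. This is a minimal illustration of the idea, not the multivon_eval implementation: the function names (`faithfulness_score`, `split_sentences`, `substring_judge`) are hypothetical, the sentence splitter and substring matcher are toy stand-ins for the LLM judge, and the strings are made-up data rather than text from the Acme KB. It also assumes the score is the fraction of supported claims, which the source does not state.

```python
from typing import Callable, List


def faithfulness_score(
    answer: str,
    context: str,
    extract_claims: Callable[[str], List[str]],
    is_supported: Callable[[str, str], bool],
) -> float:
    """Return the fraction of extracted claims that the context supports."""
    claims = extract_claims(answer)
    if not claims:
        # Assumption: an answer with no factual claims counts as faithful.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


# Toy stand-ins for the judge (hypothetical; a real judge would be an LLM call).
def split_sentences(answer: str) -> List[str]:
    """Treat each sentence as one claim."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def substring_judge(claim: str, context: str) -> bool:
    """Call a claim supported only if it appears verbatim in the context."""
    return claim.lower() in context.lower()


# Made-up toy data, not taken from the Acme KB.
toy_context = "Claims can be filed online. An adjuster reviews each claim."
grounded = "Claims can be filed online."
half_grounded = "Claims can be filed online. Payouts arrive the same day."
ungrounded = "Payouts arrive the same day."

grounded_score = faithfulness_score(grounded, toy_context, split_sentences, substring_judge)
half_score = faithfulness_score(half_grounded, toy_context, split_sentences, substring_judge)
ungrounded_score = faithfulness_score(ungrounded, toy_context, split_sentences, substring_judge)
print(grounded_score, half_score, ungrounded_score)
```

Passing the extractor and the verifier in as callables keeps the scoring arithmetic separate from the judge, so the toy functions could be swapped for model calls without touching `faithfulness_score`.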
from multivon_eval import EvalCase, EvalSuite, Faithfulness, Relevance
from multivon_eval import JudgeConfig, configure
configure(JudgeConfig(provider="anthropic", model="claude-haiku-4-5"))
CASES = [
    # 4 grounded answers + 1 deliberately ungrounded:
    {"question": "What is the rental car reimbursement limit on a standard policy?",
     "answer": "Acme reimburses rental cars at up to $75 per day for a maximum "
               "of 30 days. Premium policyholders receive unlimited rental "
               "reimbursement."},  # ← not in the KB at all
    # ...
]
cases = [EvalCase(input=c["question"], context=FULL_CONTEXT) for c in CASES]
suite = EvalSuite("RAG Faithfulness — Insurance KB")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness(), Relevance())
report = suite.run(lambda q: next(c["answer"] for c in CASES if c["question"] == q))

─────────── RAG Faithfulness — Insurance KB ───────────
Model: precomputed-answers
#  Input                              Score  Status
1  What is the standard deductible…   0.75   PASS
2  How do I file a claim and how lo…  1.00   PASS
3  What discounts are available for…  0.88   PASS
4  Does my auto policy cover items …  0.88   PASS
5  What is the rental car reimburs…   0.38   FAIL
By Evaluator
Evaluator     Avg Score  Pass Rate
faithfulness  0.80       80%
relevance     0.75       100%
Summary
Total: 5 Passed: 4 Failed: 1
Pass Rate: 80.0% [38%–96% 95% CI]
Avg Score: 0.78 [0.57–0.93]
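The pass-rate interval in the summary is wide because the suite has only five cases. The reported 38%–96% is consistent with a Wilson score interval for 4 passes out of 5 at 95% confidence. Whether the library actually computes it this way is an assumption; the sketch below only shows that the arithmetic matches. The helper name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# 4 of 5 cases passed in the report above.
low, high = wilson_interval(4, 5)
print(f"{low:.0%} to {high:.0%}")  # prints "38% to 96%"
```

With so few cases, a single extra failure would move the pass rate by 20 points, so the interval says more about sample size than about the bot.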
=== Faithfulness verdict per case ===
[PASS] faithfulness=1.00 Q: What is the standard deductible for collision coverage?
[PASS] faithfulness=1.00 Q: How do I file a claim and how long does it take?
[PASS] faithfulness=1.00 Q: What discounts are available for bundling and safe driving?
[PASS] faithfulness=1.00 Q: Does my auto policy cover items stolen from inside my car?
[FAIL] faithfulness=0.00 Q: What is the rental car reimbursement limit on a standard pol...
Result: FAIL — 1/5 case(s) ungrounded.

What this caught
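The judge flagged every claim in the fabricated rental-reimbursement answer as unsupported by the context: faithfulness is 0.00 on case 5, while every grounded answer scored 1.00. Relevance alone would not have caught it, since its pass rate is 100%; the fabricated answer addresses the question but is not grounded in the docs.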
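The rule that one ungrounded answer should block the deploy can be sketched as a small gate over the per-case faithfulness scores. This is a hedged illustration on plain data: it does not use the report object returned by `suite.run`, whose attributes are not shown here, and the helper `ungrounded_cases` and the 1.0 threshold are assumptions. The scores are copied from the verdict listing above.

```python
from typing import Dict, List


def ungrounded_cases(scores: Dict[str, float], threshold: float = 1.0) -> List[str]:
    """Return the questions whose faithfulness score falls below the threshold."""
    return [question for question, score in scores.items() if score < threshold]


# Per-case faithfulness scores from the verdict listing above.
faithfulness_by_question = {
    "What is the standard deductible for collision coverage?": 1.00,
    "How do I file a claim and how long does it take?": 1.00,
    "What discounts are available for bundling and safe driving?": 1.00,
    "Does my auto policy cover items stolen from inside my car?": 1.00,
    "What is the rental car reimbursement limit on a standard policy?": 0.00,
}

failed = ungrounded_cases(faithfulness_by_question)
exit_code = 1 if failed else 0  # a CI job would pass this to sys.exit()
print(f"{len(failed)}/{len(faithfulness_by_question)} case(s) ungrounded, exit code {exit_code}")
```

Gating on the per-case minimum rather than the 0.80 average is deliberate: an average can stay high while one answer is entirely fabricated.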