Examples

Real evaluations on real data. Reproducible from the SDK.

Every case study below is a real run, not a mock. The Python files live in the multivon-eval repo; the terminal output is exactly what the script printed when it last ran against the named provider. Copy a script, set the env vars, and run python ... That is the entire setup.

Case 01 · Real run · reproducible · Anthropic claude-haiku-4-5 · <$0.05

RAG faithfulness over an insurance knowledge base

A RAG bot must answer only from the retrieved docs. One ungrounded sentence should block the deploy.

Five questions against a 5-document Acme Auto insurance KB. Four answers are fully grounded; the fifth invents a $75/day rental reimbursement limit that does not appear anywhere in the docs.

Faithfulness extracts every factual claim from each answer and verifies it against the retrieved context using an Anthropic claude-haiku-4-5 judge (QAG decomposition). Relevance scores whether the answer addresses the question at all.
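As a rough illustration of the scoring idea, the sketch below computes faithfulness as the fraction of extracted claims that a judge marks as supported. It is not multivon-eval's implementation: `faithfulness_score`, the `toy_judge` substring check, the sample context, and the choice to score an answer with no claims as 1.0 are all assumptions made for this example. In the real evaluator an LLM judge does both the claim extraction and the verification.

```python
from typing import Callable, Sequence


def faithfulness_score(claims: Sequence[str],
                       is_supported: Callable[[str], bool]) -> float:
    """Fraction of claims the judge marks as supported by the context."""
    if not claims:
        return 1.0  # assumption: nothing claimed means nothing ungrounded
    return sum(1 for claim in claims if is_supported(claim)) / len(claims)


# Hypothetical context, invented for illustration (not the real KB text).
CONTEXT = "claims can be filed online or by phone."


def toy_judge(claim: str) -> bool:
    # Stand-in for the LLM judge: a case-insensitive substring check.
    return claim.lower() in CONTEXT


grounded = faithfulness_score(["Claims can be filed online or by phone."], toy_judge)
fabricated = faithfulness_score(
    ["Acme reimburses rental cars at up to $75 per day",
     "Premium policyholders receive unlimited rental reimbursement"],
    toy_judge,
)
print(grounded, fabricated)
```

Under this reading, a score of 0.00 means no claim in the answer could be matched to the context, and 1.00 means every claim could.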

Faithfulness · Relevance
01_rag_insurance_faithfulness.py
from multivon_eval import EvalCase, EvalSuite, Faithfulness, Relevance
from multivon_eval import JudgeConfig, configure

configure(JudgeConfig(provider="anthropic", model="claude-haiku-4-5"))

CASES = [
    # 4 grounded answers + 1 deliberately ungrounded:
    {"question": "What is the rental car reimbursement limit on a standard policy?",
     "answer":   "Acme reimburses rental cars at up to $75 per day for a maximum "
                 "of 30 days. Premium policyholders receive unlimited rental "
                 "reimbursement."},  # ← not in the KB at all
    # ...
]

# FULL_CONTEXT holds the KB text and is defined in the full script (elided here).
cases = [EvalCase(input=c["question"], context=FULL_CONTEXT) for c in CASES]
suite = EvalSuite("RAG Faithfulness — Insurance KB")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness(), Relevance())
report = suite.run(lambda q: next(c["answer"] for c in CASES if c["question"] == q))
terminal output
─────────── RAG Faithfulness — Insurance KB ───────────
  Model: precomputed-answers

  #   Input                              Score  Status
  1   What is the standard deductible…   0.75   PASS
  2   How do I file a claim and how lo…  1.00   PASS
  3   What discounts are available for…  0.88   PASS
  4   Does my auto policy cover items …  0.88   PASS
  5   What is the rental car reimburs…   0.38   FAIL

           By Evaluator
  Evaluator      Avg Score   Pass Rate
  faithfulness        0.80         80%
  relevance           0.75        100%

  Summary
  Total: 5   Passed: 4   Failed: 1
  Pass Rate: 80.0% [38%–96% 95% CI]
  Avg Score: 0.78 [0.57–0.93]

  === Faithfulness verdict per case ===
  [PASS] faithfulness=1.00  Q: What is the standard deductible for collision coverage?
  [PASS] faithfulness=1.00  Q: How do I file a claim and how long does it take?
  [PASS] faithfulness=1.00  Q: What discounts are available for bundling and safe driving?
  [PASS] faithfulness=1.00  Q: Does my auto policy cover items stolen from inside my car?
  [FAIL] faithfulness=0.00  Q: What is the rental car reimbursement limit on a standard pol...

  Result: FAIL — 1/5 case(s) ungrounded.

What this caught

The judge flagged every claim in the fabricated rental-reimbursement answer as not supported by the context — faithfulness 0.00 on case 5, while every grounded answer scored 1.00.

$ pip install multivon-eval && export ANTHROPIC_API_KEY=... && python 01_rag_insurance_faithfulness.py
Case 02 · Real run · reproducible · OpenAI gpt-4o (vision) · <$0.30

Contract analysis trap — does GPT-4o catch a 6pt footnote?

Body text says liability is capped. A 6pt footnote at the bottom overrides it for specific clauses. Vision models routinely miss the footnote.

pdfhell's footnote_override generator produces a Master Services Agreement from code. The body confidently caps liability at 3 months of fees; a 6pt footnote carves out Sections 4.2, 6.1, and 6.2 as uncapped. The answer key is exact because the generator chose the numbers — no LLM-as-judge in the scoring loop.

GPT-4o (vision) reads the PDF and answers the question. pdfhell.score_case does a whitespace-tolerant contains-match against expected_tokens ("3 month", "uncapped", "4.2", "6.1", "6.2") and against the body-only forbidden answer.
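To make "whitespace-tolerant contains-match" concrete, here is a minimal sketch of that kind of check. It is an illustration, not pdfhell's actual `score_case`: the helper names `normalize` and `contains_all` are hypothetical, lower-casing is an assumption, and the forbidden-answer logic is left out because the source does not spell it out.

```python
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Lower-case and collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_all(output: str, tokens: Iterable[str]) -> bool:
    """True when every token appears somewhere in the normalized output."""
    haystack = normalize(output)
    return all(normalize(token) in haystack for token in tokens)


expected_tokens = ["3 month", "uncapped", "4.2", "6.1", "6.2"]

# Two hypothetical model answers: one with the carve-out, one body-only.
full_answer = ("Liability is capped at 3   months of fees paid,\n"
               "but Sections 4.2, 6.1, and 6.2 are uncapped.")
body_only_answer = "Liability is capped at 3 months of fees paid."

print(contains_all(full_answer, expected_tokens))
print(contains_all(body_only_answer, expected_tokens))
```

The body-only answer fails because it never mentions the uncapped Sections, which is the behavior the trap is designed to expose.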

pdfhell.score_case · expected_tokens · forbidden_answers
02_contract_pdfhell_footnote.py
from openai import OpenAI
from pdfhell.generators.footnote_override import generate
from pdfhell.scorer import score_case
import base64

# Generate the adversarial PDF deterministically from a seed.
pdf_bytes, case = generate(seed=2)
# case.expected_tokens == ["3 month", "uncapped", "4.2", "6.1", "6.2"]
# case.forbidden_answers[0] == "Liability is capped at 3 months of fees paid."

pdf_b64 = base64.b64encode(pdf_bytes).decode("ascii")
response = OpenAI().chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "file", "file": {
            "filename": "contract.pdf",
            "file_data": f"data:application/pdf;base64,{pdf_b64}"}},
        {"type": "text", "text": case.question},
    ]}],
    max_tokens=400,
)

# Code-based ground truth — no LLM-as-judge.
result = score_case(case, response.choices[0].message.content)
terminal output
==============================================================================
Contract analysis trap — footnote_override (pdfhell)
==============================================================================
PDF:              02_contract.pdf  (2,542 bytes)
Trap family:      footnote_override
Question to model: Read the attached Master Services Agreement. What is the
                   LIABILITY CAP and what carve-outs (if any) apply? Be precise
                   about which Sections are uncapped.
Expected answer:  Liability is capped at 3 months of fees paid, EXCEPT that
                  liability arising from Sections 4.2, 6.1, 6.2 is uncapped.
Expected tokens:  ['3 month', 'uncapped', '4.2', '6.1', '6.2']
Forbidden answer: Liability is capped at 3 months of fees paid.

Calling gpt-4o ...

=== Model output ===
The LIABILITY CAP in the Master Services Agreement limits the aggregate liability
of either party to an amount equal to 3 months of fees paid by the Customer during
the twelve (12) month period immediately preceding the event giving rise to such
liability.

Carve-outs that apply, per the document, mean that liability is uncapped for claims
arising from Sections 4.2, 6.1, and 6.2.

=== pdfhell score ===
  correct:           True
  matched_expected:  True
  fell_for_trap:     False
  matched_forbidden: []
  refused:           False

Result: PASS — model captured the footnote carve-out.

What this caught

On this seed GPT-4o caught the footnote: every required token appears in the output. Re-run with --seed 6, 7, 10… across the full mini suite and the pass rate drops to 8/10, with a Wilson 95% CI of [0.49, 0.94]. The same code reproduces both outcomes.
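For readers who want to check the interval, the sketch below computes a Wilson score interval from a pass count. The function name `wilson_interval` and the z value of 1.96 for a 95% interval are choices made here; this is the standard formula, not code taken from either package.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion passes/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(8, 10)
print(f"[{low:.2f}, {high:.2f}]")  # prints [0.49, 0.94]
```

With only ten cases the interval is wide, which is why a single passing seed says little about the model's true pass rate.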

$ pip install multivon-eval pdfhell && export OPENAI_API_KEY=... && python 02_contract_pdfhell_trap.py
Case 03 · Real run · reproducible · Anthropic claude-haiku-4-5 · <$0.15

Customer support QA — three evaluators on the same bot

A support bot fabricates Apple Pay support, invents a 10am overnight guarantee, and defers vaguely twice. Faithfulness, Relevance, and a plain-English check all need to fire.

Ten support tickets with their retrieved KB context and the bot's reply. Six replies are well-grounded; two fabricate facts (Apple Pay, 10am overnight guarantee); two defer vaguely without naming the problem.

Three evaluators run in parallel. Faithfulness catches the fabrications. Relevance catches the off-topic deferrals. CheckEvaluator is given an English criterion ("Response should name the customer's specific problem and provide a concrete next step") with pinned yes/no questions for CI reproducibility — it catches every vague deferral.

Faithfulness · Relevance · CheckEvaluator
03_support_qa_multi_evaluator.py
from multivon_eval import EvalCase, EvalSuite, Faithfulness, Relevance
from multivon_eval import JudgeConfig, configure
from multivon_eval.evaluators.llm_judge import CheckEvaluator

configure(JudgeConfig(provider="anthropic", model="claude-haiku-4-5"))

suite = EvalSuite("Customer Support QA")
suite.add_cases(cases)  # 10 tickets with input + context + bot answer
suite.add_evaluators(
    Faithfulness(),
    Relevance(),
    CheckEvaluator(
        criterion="Response should name the customer's specific problem and "
                  "provide a concrete next step.",
        questions=[  # pin the questions for CI reproducibility
            "Does the response name or restate the customer's specific problem?",
            "Does the response provide a concrete next step or action?",
            "Does the response avoid vague deferrals like 'we will look into this'?",
        ],
        name="actionability",
    ),
)
report = suite.run(model_fn)
terminal output
─────── Customer Support QA ───────
  Model: precomputed-answers

  #   Input                              Score  Status
  1   How long does standard shipping…   0.69   FAIL
  2   Can I return a sale item?          0.69   FAIL
  3   I forgot my password — how do I…   1.00   PASS
  4   Do you accept Apple Pay?           0.53   FAIL
  5   How do I cancel my subscription?   0.92   PASS
  6   The app is showing a white scre…   0.42   FAIL
  7   How long do refunds take after …   0.81   FAIL
  8   I want to delete my account.       0.81   FAIL
  9   When will my order arrive if I …   0.47   FAIL
 10   Refund for order #4839 — it nev…   0.50   FAIL

           By Evaluator
  Evaluator       Avg Score   Pass Rate
  faithfulness         0.92         80%
  relevance            0.70         90%
  actionability        0.43         20%

  Summary
  Total: 10   Passed: 2   Failed: 8
  Pass Rate: 20.0% [6%–51% 95% CI]

  === Per-case breakdown ===
  [FAIL]  T1  faith=1.00✓  rel=0.75✓  act=0.33✗
  [FAIL]  T2  faith=1.00✓  rel=0.75✓  act=0.33✗
  [PASS]  T3  faith=1.00✓  rel=1.00✓  act=1.00✓
  [FAIL]  T4  faith=0.83✗  rel=0.75✓  act=0.00✗
  [PASS]  T5  faith=1.00✓  rel=0.75✓  act=1.00✓
  [FAIL]  T6  faith=1.00✓  rel=0.25✗  act=0.00✗
  [FAIL]  T7  faith=1.00✓  rel=0.75✓  act=0.67✗
  [FAIL]  T8  faith=1.00✓  rel=0.75✓  act=0.67✗
  [FAIL]  T9  faith=0.33✗  rel=0.75✓  act=0.33✗
  [FAIL] T10  faith=1.00✓  rel=0.50✓  act=0.00✗

  Result: FAIL — pass rate below 70% gate.

What this caught

Faithfulness flagged T4 (Apple Pay, which is not in the KB) and T9 (the 10am overnight guarantee, also not in the KB). The plain-English actionability check scored T6 and T10 at 0.00 ("we'll look into this", "please contact support") because neither restated the problem nor named a next step. It also failed six other replies (T1, T2, T4, T7, T8, T9), so only T3 and T5 passed all three evaluators.
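The printed numbers are consistent with a simple aggregation rule: a case's score is the mean of its evaluator scores, and a case passes only if every evaluator passes. That rule is inferred from the output, not taken from the SDK's source, and the names `case_score` and `case_passed` are hypothetical. The sketch below checks it against the per-case breakdown printed in the terminal output for this case.

```python
# (score, passed) per evaluator in the order faithfulness, relevance,
# actionability, copied from the per-case breakdown in the terminal output.
BREAKDOWN = {
    "T1": [(1.00, True), (0.75, True), (0.33, False)],
    "T2": [(1.00, True), (0.75, True), (0.33, False)],
    "T3": [(1.00, True), (1.00, True), (1.00, True)],
    "T4": [(0.83, False), (0.75, True), (0.00, False)],
    "T5": [(1.00, True), (0.75, True), (1.00, True)],
    "T6": [(1.00, True), (0.25, False), (0.00, False)],
    "T7": [(1.00, True), (0.75, True), (0.67, False)],
    "T8": [(1.00, True), (0.75, True), (0.67, False)],
    "T9": [(0.33, False), (0.75, True), (0.33, False)],
    "T10": [(1.00, True), (0.50, True), (0.00, False)],
}

# Per-case scores as printed in the summary table.
PRINTED = {"T1": 0.69, "T2": 0.69, "T3": 1.00, "T4": 0.53, "T5": 0.92,
           "T6": 0.42, "T7": 0.81, "T8": 0.81, "T9": 0.47, "T10": 0.50}


def case_score(results):
    """Mean of the evaluator scores for one case."""
    return sum(score for score, _ in results) / len(results)


def case_passed(results):
    """A case passes only when every evaluator passed."""
    return all(ok for _, ok in results)


matches = all(round(case_score(r), 2) == PRINTED[t] for t, r in BREAKDOWN.items())
passed = [t for t, r in BREAKDOWN.items() if case_passed(r)]
print(matches, passed)
```

This explains why T7 and T8 fail despite a 0.81 score: one failing evaluator is enough to fail the case.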

$ pip install multivon-eval && export ANTHROPIC_API_KEY=... && python 03_support_qa_multi_evaluator.py
Case 04 · Real run · reproducible · no API call · $0

PII detection over medical records — offline, deterministic, $0

Regulated environments cannot send PHI to a third-party judge. PII detection must run locally on every output before it leaves the building.

Five synthetic medical record snippets. Three are clean; two contain leaked PII (SSN + phone in one, email + MRN in another). PIIEvaluator runs entirely on regex pattern libraries scoped to a chosen jurisdiction (hipaa here) — no API key, no network call.

This is the same evaluator that ships in multivon-eval's compliance tier. The point is that NOT every eval needs an LLM. Deterministic checks for PII, schema validity, regex match, and exact match should run first — they are cheap, reproducible, and audit-friendly.
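A minimal sketch of what a regex-only PII check can look like is shown below. The patterns, the `find_pii` helper, and the sample strings are assumptions written for this example; they are far simpler than a real pattern library and are not PIIEvaluator's actual rules.

```python
import re

# Hypothetical, deliberately simple patterns (not the library's own).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_us": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "medical_record_number": re.compile(r"\bMRN-\d+\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every pattern label that matched, with the matched strings."""
    findings = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[label] = matches
    return findings


# Sample strings modeled on the records in this case study.
leaky = "Patient John Doe (SSN 123-45-6789). Spouse contact: 415-555-0182."
clean = "Routine post-op check, no complications."

print(find_pii(leaky))
print(find_pii(clean))
```

Because the check is a pure function of the text, the same input always yields the same findings, which is what makes this kind of evaluator easy to audit.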

PIIEvaluator (jurisdiction="hipaa")
04_pii_medical_records.py
from multivon_eval import EvalCase, EvalSuite
from multivon_eval.evaluators.compliance import PIIEvaluator

RECORDS = [
    {"id": "MR1", "text": "Patient presented with mild dehydration…"},
    {"id": "MR2", "text": "Patient John Doe (SSN 123-45-6789) was admitted on "
                          "03/14/2025… Spouse contact: 415-555-0182."},
    {"id": "MR3", "text": "Routine post-op check after laparoscopic…"},
    {"id": "MR4", "text": "MRN-4471829 — patient reports migraine. Family physician "
                          "notified via patient@example.com…"},
    {"id": "MR5", "text": "Pediatric well-child visit. Growth on expected curve…"},
]

suite = EvalSuite("PII Detection — Medical Records")
suite.add_cases([EvalCase(input=r["id"]) for r in RECORDS])
# "hipaa" adds MRN, fax, admission dates, NPI/DEA, etc. on top of base patterns.
suite.add_evaluators(PIIEvaluator(jurisdiction="hipaa"))

report = suite.run(lambda rid: next(r["text"] for r in RECORDS if r["id"] == rid))
# No API key required. No network call. Fully deterministic.
terminal output
─────── PII Detection — Medical Records ───────
  Model: static-records

  #   Input   Output                              Score   Status
  1   MR1     Patient presented with mild deh…    1.00    PASS
  2   MR2     Patient John Doe (SSN 123-45-67…    0.00    FAIL
  3   MR3     Routine post-op check after lap…    1.00    PASS
  4   MR4     MRN-4471829 — patient reports m…    0.00    FAIL
  5   MR5     Pediatric well-child visit. Gro…    1.00    PASS

  === Per-record PII findings ===
  [CLEAN] MR1  No PII detected
  [LEAK ] MR2
           phone_us: "415-555-0182"
           ssn: "123-45-6789"
           address: "2025 with chest"
           fax_number: "415-555-0182"
           admission_date: "admitted on 03/14/2025"
  [CLEAN] MR3
  [LEAK ] MR4
           email: "patient@example.com"
           medical_record_number: "MRN-4471829"
  [CLEAN] MR5

  Final: 2/5 record(s) contain PII.
  (Regex-only — no API calls, $0 cost, fully deterministic.)

What this caught

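PIIEvaluator caught the SSN, phone number, admission date, email, and MRN across the two leaky records, with zero false positives on the three clean records. Within MR2 it also reported the phone number a second time as fax_number and matched "2025 with chest" as an address, which looks like a spurious hit. No API call, no API key, $0, fully deterministic.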

$ pip install multivon-eval && python 04_pii_medical_records.py

Run any of these yourself in < 5 minutes.

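Each example is a single Python file. Clone the repo or copy the file out of GitHub, install the SDK, set the relevant API key (Case 04 needs none), and run it. The terminal output above is what each script printed on its last run; runs that call a model may differ slightly between executions, while Case 04 is deterministic.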