Persona simulator — multivon-eval simulate
Shipped. Persona-driven adaptive multi-turn eval: a simulator LLM plays a configured user persona against your bot, adapting each turn to the previous reply, under a hard budget ceiling. An honesty contract is built in: every transcript carries provenance and is labeled simulated, not real traffic, so simulated results never blend into production metrics.
What you’d see
multivon-eval simulate runs a persona suite against your endpoint and emits scored multi-turn transcripts with provenance. Shipped in 0.12.0 (2026-06-12).
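The loop described above (a persona that adapts to each reply, a hard budget ceiling, and a provenance label on every transcript) can be sketched roughly as follows. This is an illustrative sketch only, not the multivon-eval implementation: `Transcript`, `simulate`, the stub callables, and the whitespace-token cost proxy are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Transcript:
    """Hypothetical transcript record; not the library's real schema."""
    persona: str
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (role, text)
    provenance: str = "simulated"  # set once, never "production"
    stopped_by: str = ""


def simulate(persona: str,
             user_llm: Callable[[str, list], str],
             bot: Callable[[str], str],
             max_turns: int,
             budget_tokens: int) -> Transcript:
    """Run persona vs. bot until max_turns or the simulator budget is hit."""
    transcript = Transcript(persona=persona)
    spent = 0
    for _ in range(max_turns):
        # The simulator sees all prior turns, so it can adapt to the last reply.
        user_msg = user_llm(persona, transcript.turns)
        cost = len(user_msg.split())  # crude cost proxy (assumption)
        if spent + cost > budget_tokens:
            # Hard ceiling: stop before overspending, never after.
            transcript.stopped_by = "budget"
            return transcript
        spent += cost
        transcript.turns.append(("user", user_msg))
        transcript.turns.append(("bot", bot(user_msg)))
    transcript.stopped_by = "max_turns"
    return transcript


def stub_user(persona: str, turns: list) -> str:
    """Stand-in for the simulator LLM (a real one would call a model)."""
    if not turns:
        return f"[{persona}] hello"
    return f"[{persona}] following up on: {turns[-1][1]}"


def echo_bot(message: str) -> str:
    """Stand-in for the endpoint under test."""
    return "ok"


result = simulate("impatient customer", stub_user, echo_bot,
                  max_turns=10, budget_tokens=12)
print(result.stopped_by, len(result.turns), result.provenance)
```

Checking the budget before appending the turn is what makes the ceiling hard in this sketch: a turn that would exceed the budget is never recorded.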
Scaled + gated case generation
Shipped. bootstrap --n-seed-cases scales seed-case generation to 500 behind duplicate and hardness gates, with a generation report that states exactly what survived: "generated 500, accepted 431 — no silent caps." Rejected candidates are counted and named, never silently dropped.
What you’d see
bootstrap --n-seed-cases 500 emits the gated suite plus the acceptance report. Shipped in 0.12.0 (2026-06-12).
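One way to picture "gated, with nothing silently dropped" is a filter that records a named reason for every rejection, so that accepted plus rejected always equals generated. The sketch below is a hypothetical illustration, not the bootstrap code: `gate_candidates`, `generation_report`, the whitespace-normalized duplicate key, and the word-count hardness score are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple


def gate_candidates(candidates: List[str],
                    hardness: Callable[[str], float],
                    min_hardness: float = 0.5) -> Tuple[List[str], Counter]:
    """Apply a duplicate gate, then a hardness gate, counting every rejection."""
    accepted: List[str] = []
    rejected: Counter = Counter()
    seen = set()
    for text in candidates:
        key = " ".join(text.lower().split())  # case/whitespace-insensitive key
        if key in seen:
            rejected["duplicate"] += 1
            continue
        seen.add(key)
        if hardness(text) < min_hardness:
            rejected["too_easy"] += 1
            continue
        accepted.append(text)
    return accepted, rejected


def generation_report(n_generated: int, accepted: List[str],
                      rejected: Counter) -> str:
    """Summarize the run, naming each rejection reason with its count."""
    reasons = ", ".join(f"{k}={v}" for k, v in sorted(rejected.items()))
    return (f"generated {n_generated}, accepted {len(accepted)}; "
            f"rejected: {reasons or 'none'}")


def toy_hardness(text: str) -> float:
    """Placeholder hardness score: longer questions count as harder."""
    return min(1.0, len(text.split()) / 10)


cases = [
    "What is 2+2?",
    "what is  2+2?",
    "Explain why the refund policy differs for annual plans bought during a trial",
]
kept, dropped = gate_candidates(cases, toy_hardness)
summary = generation_report(len(cases), kept, dropped)
print(summary)
```

The invariant worth checking in any such report is that the counts reconcile: every generated candidate is either accepted or attributed to a named gate.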
Scanner v4 + UNSCANNABLE tier
Shipped. Robustness hardening on the staleness scanner: call sites the scanner cannot read surface as an explicit UNSCANNABLE tier instead of disappearing from the report. Honest UNKNOWN over confident wrong, the same principle the determinacy gate was built on.
What you’d see
Staleness reports name what the scanner couldn't read instead of silently omitting it. Shipped in 0.11.1.
Prompt-drift staleness detection
Shipped. multivon-eval staleness diffs a live scan of every prompt call site against a committed prompt_baseline.json and classifies each site as CHANGED (with before/after fingerprints, bound cases, and a git diff pointer), REMOVED (always with the three-way caveat: feature removed / renamed+edited / moved beyond static reach), ADDED (new prompts with no covering cases), or UNKNOWN (dynamic prompts, never guessed at). staleness baseline writes the snapshot; staleness stamp binds hand-written cases to the prompt call sites they exercise; bootstrap stamps generated cases and writes the baseline automatically. Matching is content-first: line numbers and git SHAs are display-only, so a whitespace refactor or rebase produces zero false staleness. Every report opens with a determinacy headline and closes with a blind-spots footer. Deliberately deferred: propose-and-review case refresh (sync) and the eval-action CI gate, both tracked below on the epic.
What you’d see
Run multivon-eval staleness . in a bootstrapped repo and see exactly which prompts changed since your cases were authored. --fail-on changed,removed turns it into a CI gate; --format markdown drops into $GITHUB_STEP_SUMMARY. Shipped in 0.10.0 with 51 new tests.
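The content-first matching idea can be illustrated with a whitespace-normalized fingerprint: reformatting a prompt leaves its fingerprint unchanged, while an edit to the wording does not. The following is a minimal sketch under stated assumptions, not the scanner's actual algorithm. The `fingerprint` and `classify` helpers, the call-site ids, and the use of `None` to mark a dynamic prompt are all hypothetical.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(prompt: str) -> str:
    """Hash the prompt after collapsing whitespace, so layout-only edits match."""
    normalized = " ".join(prompt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def classify(baseline: Dict[str, Optional[str]],
             live: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Compare baseline and live prompts keyed by a hypothetical call-site id.

    A value of None stands for a dynamic prompt that could not be resolved.
    """
    statuses = {}
    for site in sorted(set(baseline) | set(live)):
        if site not in live:
            statuses[site] = "REMOVED"
        elif site not in baseline:
            statuses[site] = "ADDED"
        elif baseline[site] is None or live[site] is None:
            statuses[site] = "UNKNOWN"  # never guess at dynamic prompts
        elif fingerprint(baseline[site]) != fingerprint(live[site]):
            statuses[site] = "CHANGED"
        else:
            statuses[site] = "UNCHANGED"
    return statuses


baseline = {
    "summarize": "Summarize:\n    {text}",
    "classify": "Label the ticket as bug or feature.",
    "route": None,
    "legacy": "Old prompt",
}
live = {
    "summarize": "Summarize: {text}",  # whitespace-only refactor
    "classify": "Label the ticket as bug, feature, or question.",
    "route": None,
    "greet": "Say hello.",
}
statuses = classify(baseline, live)
for site, status in statuses.items():
    print(f"{site}: {status}")
```

Note that line numbers and commit SHAs do not appear anywhere in the comparison, which is why a rebase or reindentation cannot produce a CHANGED result in this sketch.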
Scanner v3 + the determinacy gate — run, failed, published
Shipped. Before claiming drift coverage, we measured how much real-world prompt traffic static analysis can actually read. Scanner v2's first pass reported zero call sites on 4 of 5 real repos: it was blind to aliased litellm imports (pr-agent), **kwargs-unpacked calls (aider), and messages=<variable>. v3 detects all three; what it still can't read surfaces as honest UNKNOWN records instead of vanishing. Re-measured: 278 call sites across aider, gpt-researcher, open-interpreter, letta, and pr-agent, of which 20.9% are statically resolvable, below the 50% gate. The gate failed, the result is published with the per-repo table on the epic, and the runtime recorder was promoted to priority.
What you’d see
Every staleness report opens with the determinacy headline — your repo's exact static-resolvability ratio. Baselines written by v2 print a rescan warning instead of fake drift. Shipped in 0.10.1.
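The gate arithmetic is simple enough to write down. In the sketch below, the count of 58 resolvable sites is inferred rather than quoted: it is the only whole number out of 278 that rounds to the published 20.9%. The `determinacy` function is a hypothetical helper, not part of the package.

```python
from typing import Tuple


def determinacy(resolved: int, total: int, gate: float = 0.50) -> Tuple[float, bool]:
    """Return the static-resolvability ratio and whether it clears the gate."""
    if total <= 0:
        raise ValueError("total call sites must be positive")
    ratio = resolved / total
    return ratio, ratio >= gate


# 58 is inferred from the published 20.9% of 278; it is not stated in the text.
ratio, passed = determinacy(resolved=58, total=278)
headline = f"{ratio:.1%} statically resolvable, gate {'passed' if passed else 'failed'}"
print(headline)
```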
install-skills CLI
Shipped. One-command installer for the three bundled Claude Code skills. Symlinks eval-bootstrap, eval-audit, and eval-explain from the wheel into ~/.claude/skills/, so pip install -U multivon-eval propagates SKILL.md edits without re-running install.
What you’d see
Run multivon-eval install-skills once. The three skills become callable in any Claude Code session as /eval-bootstrap, /eval-audit, /eval-explain. Shipped in 0.9.8.
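Why symlinks make upgrades propagate can be shown with a small sketch: the link points at the installed directory, so any edit to the file behind it is visible through the link. This is not the installer's source. `link_skills` and the temporary directories standing in for the wheel and for ~/.claude/skills/ are assumptions made for the demonstration.

```python
import tempfile
from pathlib import Path
from typing import List

SKILLS = ("eval-bootstrap", "eval-audit", "eval-explain")


def link_skills(source_root: Path, target_root: Path) -> List[Path]:
    """Symlink each skill directory from source_root into target_root."""
    target_root.mkdir(parents=True, exist_ok=True)
    linked = []
    for name in SKILLS:
        dst = target_root / name
        if dst.is_symlink():
            dst.unlink()  # refresh a link left by an earlier run
        elif dst.exists():
            # Refuse to clobber a real directory the user created by hand.
            raise FileExistsError(f"{dst} exists and is not a symlink")
        dst.symlink_to(source_root / name, target_is_directory=True)
        linked.append(dst)
    return linked


# Demo in a scratch directory; nothing under the real home directory is touched.
root = Path(tempfile.mkdtemp())
wheel_skills = root / "wheel_skills"
for skill in SKILLS:
    (wheel_skills / skill).mkdir(parents=True)
    (wheel_skills / skill / "SKILL.md").write_text("v1")

links = link_skills(wheel_skills, root / "claude" / "skills")

# Simulate an upgrade rewriting the packaged file in place.
(wheel_skills / "eval-audit" / "SKILL.md").write_text("v2")
seen_through_link = (root / "claude" / "skills" / "eval-audit" / "SKILL.md").read_text()
print(seen_through_link)
```

A copy-based installer would need to be re-run after every upgrade; the trade-off is that symlinks break if the package moves to a different site-packages path.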
Cross-distribution held-out F1 (Benchmark 4)
Shipped. Hallucination evaluator calibrated on HaluEval-QA (threshold 0.55, explicit JudgeConfig), tested without re-tuning on HaluEval-Sum. The calibration set is strictly disjoint from the test set. Result: F1 0.830 [0.70–0.92] on n=60.
What you’d see
/eval shows F1 0.830 [0.70–0.92] with the calibration provenance disclosed. Reproducer is benchmarks/run_truly_held_out.py. Shipped in 0.9.5 → 0.9.7.
Wilson + bootstrap CIs on every published number
Shipped. benchmarks/_add_cis.py walks every results JSON and writes Wilson CIs on precision/recall plus bootstrap CIs (1000 resamples, seed 20260603) on F1. Idempotent. Closes the "framework preaches CIs but doesn't ship them on its own numbers" dogfood violation.
What you’d see
Every F1 on the leaderboard, /eval tile, and benchmarks/README carries its 95% bootstrap CI. Shipped in 0.9.4.
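For readers who want to see the two interval types side by side, here is a compact sketch: the Wilson score interval for a proportion such as precision or recall, and a bootstrap interval for F1 using the stated 1000 resamples and seed 20260603. It is not the contents of benchmarks/_add_cis.py. The function names are hypothetical, and the percentile variant of the bootstrap is an assumption, since the text does not say which variant is used.

```python
import math
import random
from typing import List, Tuple


def wilson_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def f1_score(pairs: List[Tuple[int, int]]) -> float:
    """F1 over (label, prediction) pairs; returns 0.0 when undefined."""
    tp = sum(1 for y, y_hat in pairs if y and y_hat)
    fp = sum(1 for y, y_hat in pairs if not y and y_hat)
    fn = sum(1 for y, y_hat in pairs if y and not y_hat)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(pairs: List[Tuple[int, int]], resamples: int = 1000,
                    seed: int = 20260603, alpha: float = 0.05) -> Tuple[float, float]:
    """Percentile bootstrap CI for F1 (variant assumed, see lead-in)."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(pairs)
    scores = sorted(
        f1_score([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(resamples)
    )
    k = int(resamples * alpha / 2)
    return scores[k], scores[resamples - 1 - k]


# Toy data: 8 TP, 2 FN, 2 FP, 8 TN. These are illustrative, not benchmark results.
toy = [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 8
print(f1_score(toy), bootstrap_f1_ci(toy), wilson_ci(8, 10))
```

Seeding the resampler is what makes a script like this idempotent: re-running it over the same results file reproduces the same interval instead of drifting.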
Calibration provenance — zero null F1
Shipped. 18 calibration entries across 6 judges × 3 evaluators in _calibration_data/v2.json. Six previously-null F1 cells (opus, gpt-4o, gpt-5.5 across faithfulness/hallucination/relevance) filled via a $15–20 sweep on real held-out data.
What you’d see
No more silent 0.7 fallback. calibrated_threshold(evaluator, judge) returns a calibrated value with a recorded F1 for every shipped (judge × evaluator) pair. Shipped in 0.9.4.
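The difference between a silent fallback and an explicit failure can be sketched as a lookup that raises when a pair is missing or has no recorded F1, rather than quietly returning a default. This is a hypothetical illustration and does not reproduce calibrated_threshold or the layout of _calibration_data/v2.json. The `lookup_threshold` name, the table shape, the judge name, and the numbers in it are placeholders.

```python
from typing import Dict, Optional, Tuple


class MissingCalibration(KeyError):
    """Raised instead of falling back to a default threshold."""


Entry = Dict[str, Optional[float]]


def lookup_threshold(table: Dict[Tuple[str, str], Entry],
                     evaluator: str, judge: str) -> Tuple[float, float]:
    """Return (threshold, f1) for a pair, or raise if either is unavailable."""
    entry = table.get((evaluator, judge))
    if entry is None:
        raise MissingCalibration(f"no calibration entry for {evaluator} x {judge}")
    if entry.get("threshold") is None or entry.get("f1") is None:
        raise MissingCalibration(f"incomplete calibration for {evaluator} x {judge}")
    return entry["threshold"], entry["f1"]


# Placeholder values for illustration only; not real calibration data.
DEMO_TABLE = {
    ("hallucination", "demo-judge"): {"threshold": 0.6, "f1": 0.8},
    ("relevance", "demo-judge"): {"threshold": 0.5, "f1": None},
}

threshold, recorded_f1 = lookup_threshold(DEMO_TABLE, "hallucination", "demo-judge")
print(threshold, recorded_f1)
```

Treating a null F1 the same as a missing entry keeps an uncalibrated threshold from being reported as if it had been validated.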
Three Claude Code skills bundled in the wheel
Shipped. eval-bootstrap (cold-start eval generator), eval-audit (suite review), and eval-explain (judge-output interpreter). SKILL.md files ship inside the wheel under multivon_eval/_skills/. No separate marketplace install needed.
What you’d see
After pip install multivon-eval + multivon-eval install-skills, three new slash commands work in Claude Code immediately. Shipped in 0.9.4.
Self-correction audit trail (0.9.4 → 0.9.7)
Shipped. Four same-day PyPI releases responding to public peer review: a contamination fix on the headline held-out claim, a runtime bug in the generated bootstrap template, and a threshold-vs-default mismatch in the held-out reproducer. All four releases remain published as the audit trail.
What you’d see
pypi.org/project/multivon-eval/ shows the release sequence. CHANGELOG documents what each release fixed and which reviewer flagged it.
Phase 1 prompt attribution — descriptive diff
Shipped. AST-aware scan of prompt sources in a repo, structured diff between two refs, markdown rendering. Public API: multivon_eval.attribution.scan(repo_root), diff_records(base, head), render_markdown(diffs). Descriptive only; causal attribution is intentionally deferred to Phase 2. Scanner v2 (0.10.0) added one-hop module-level constant resolution and loose fingerprints; this scan is now the substrate the staleness drift report runs on.
What you’d see
multivon-eval attribution scan <repo> and multivon-eval attribution diff <base> <head> work today. JSON, text, and markdown output. Commit b43b98c on multivon-eval main.