multivon-eval v0.12.0 on PyPI · Full changelog →

Roadmap

What we’ve shipped, what we’re building next, and what we are deliberately not building. No dates more specific than quarters, no features that aren’t on someone’s branch.

Framing principles

  • Honesty over hype — no promised launch dates more specific than quarters, no aspirational features that aren't actually being built.
  • Every shipped item is verifiable — links to the CHANGELOG entry, commit SHA, or published PyPI artifact.
  • Every planned item names a specific user-facing outcome you would see, not a vague capability area.
  • Open source, Apache-2.0, runs on your machine. No managed-service tier. No hosted eval-database. No fine-tuning-as-a-service. Those are explicitly not in the cards.
  • The roadmap is readable as steps toward one thesis: an autonomous swarm of eval agents that lives inside your system. Every item below is groundwork for that future, not random feature accretion.
  • If a feature is hard to commit to honestly, it goes under "Exploring" — not "Q3". We'd rather under-promise.

Shipped in the last 30 days

May–June 2026

Persona simulator — multivon-eval simulate

Shipped

Persona-driven adaptive multi-turn eval: a simulator LLM plays a configured user persona against your bot, adapting each turn to the previous reply, under a hard budget ceiling. Honesty contract built in — every transcript carries provenance and is labeled simulated, not real traffic, so simulated results never blend into production metrics.

What you’d see

multivon-eval simulate runs a persona suite against your endpoint and emits scored multi-turn transcripts with provenance. Shipped in 0.12.0 (2026-06-12).

Scaled + gated case generation

Shipped

bootstrap --n-seed-cases scales seed-case generation to 500 behind duplicate and hardness gates, with a generation report that states exactly what survived — "generated 500, accepted 431 — no silent caps." Rejected candidates are counted and named, never silently dropped.

What you’d see

bootstrap --n-seed-cases 500 emits the gated suite plus the acceptance report. Shipped in 0.12.0 (2026-06-12).
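
As a rough illustration of the "no silent caps" idea, the sketch below runs candidates through a duplicate gate and a hardness gate and counts every rejection by reason. It is not the multivon-eval implementation: `gate_cases`, `GateReport`, the `hardness` callable, and the 0.5 cutoff are all hypothetical names and values chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GateReport:
    """Tallies what was generated, what survived, and why the rest was rejected."""
    generated: int = 0
    accepted: int = 0
    rejected: Counter = field(default_factory=Counter)

    def summary(self) -> str:
        reasons = ", ".join(f"{n} {reason}" for reason, n in sorted(self.rejected.items()))
        return f"generated {self.generated}, accepted {self.accepted}; rejected: {reasons or 'none'}"


def normalize(text: str) -> str:
    """Collapse case and whitespace so trivially different duplicates collide."""
    return " ".join(text.lower().split())


def gate_cases(candidates, hardness, min_hardness=0.5):
    """Keep candidates that are neither duplicates nor too easy.

    `hardness` is a hypothetical callable returning a score in [0, 1].
    """
    report = GateReport()
    seen = set()
    kept = []
    for case in candidates:
        report.generated += 1
        key = normalize(case)
        if key in seen:
            report.rejected["duplicate"] += 1
            continue
        seen.add(key)
        if hardness(case) < min_hardness:
            report.rejected["too_easy"] += 1
            continue
        kept.append(case)
        report.accepted += 1
    return kept, report


candidates = [
    "Refund an order placed 45 days ago",
    "refund an order  placed 45 days ago",
    "Hi",
    "Cancel a subscription mid-billing-cycle with a pending credit",
]
# Toy hardness proxy for the demo only: longer prompts count as harder.
kept, report = gate_cases(candidates, hardness=lambda c: min(len(c) / 40, 1.0))
print(report.summary())
```

The property worth keeping from this sketch is the invariant that generated always equals accepted plus the sum of the named rejections, so nothing can disappear from the report.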

Scanner v4 + UNSCANNABLE tier

Shipped

Robustness hardening on the staleness scanner: call sites the scanner cannot read surface as an explicit UNSCANNABLE tier instead of disappearing from the report — honest UNKNOWN over confident wrong, the same principle the determinacy gate was built on.

What you’d see

Staleness reports name what the scanner couldn't read instead of silently omitting it. Shipped in 0.11.1.

Prompt-drift staleness detection

Shipped

multivon-eval staleness diffs a live scan of every prompt call site against a committed prompt_baseline.json: CHANGED (with before/after fingerprints, bound cases, and a git diff pointer), REMOVED (always with the three-way caveat: feature removed / renamed+edited / moved beyond static reach), ADDED (new prompts with no covering cases), UNKNOWN (dynamic prompts — never guessed at). staleness baseline writes the snapshot; staleness stamp binds hand-written cases to the prompt call sites they exercise; bootstrap stamps generated cases and writes the baseline automatically. Matching is content-first — line numbers and git SHAs are display-only, so a whitespace refactor or rebase produces zero false staleness. Every report opens with a determinacy headline and closes with a blind-spots footer. Deliberately deferred: propose-and-review case refresh (sync) and the eval-action CI gate — tracked below on the epic.

What you’d see

Run multivon-eval staleness . in a bootstrapped repo and see exactly which prompts changed since your cases were authored. --fail-on changed,removed turns it into a CI gate; --format markdown drops into $GITHUB_STEP_SUMMARY. Shipped in 0.10.0 with 51 new tests.
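
To make content-first matching concrete, here is a minimal sketch that fingerprints whitespace-normalized prompt text and classifies call sites against a baseline. It assumes a simple mapping from a site name to a fingerprint; the real prompt_baseline.json schema, the fingerprint algorithm, and the matching rules are not shown in this roadmap, so `fingerprint`, `diff_against_baseline`, and the site names are hypothetical.

```python
import hashlib


def fingerprint(prompt_text: str) -> str:
    """Hash whitespace-normalized content so reformatting does not change the result."""
    normalized = " ".join(prompt_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def diff_against_baseline(baseline: dict, live: dict) -> dict:
    """Classify call sites by comparing fingerprints keyed by a site name.

    A live value of None stands for a dynamic prompt that could not be resolved.
    """
    result = {"CHANGED": [], "REMOVED": [], "ADDED": [], "UNKNOWN": []}
    for site, old_fp in baseline.items():
        if site not in live:
            result["REMOVED"].append(site)
        elif live[site] is None:
            result["UNKNOWN"].append(site)
        elif live[site] != old_fp:
            result["CHANGED"].append(site)
    for site, new_fp in live.items():
        if site not in baseline:
            result["UNKNOWN" if new_fp is None else "ADDED"].append(site)
    return result


baseline = {
    "support.reply": fingerprint("You are a helpful agent.\nAnswer briefly."),
    "billing.summary": fingerprint("Summarize the invoice."),
    "legacy.faq": fingerprint("Answer from the FAQ only."),
}
live = {
    # Whitespace-only edit: same fingerprint, so no drift is reported.
    "support.reply": fingerprint("You are a helpful agent.    Answer briefly."),
    # Wording edit: fingerprint differs, so this site is CHANGED.
    "billing.summary": fingerprint("Summarize the invoice in two sentences."),
    "onboarding.greet": fingerprint("Welcome the user."),
    "search.rerank": None,  # built dynamically, never guessed at
}
drift = diff_against_baseline(baseline, live)
print(drift)
```

Because the hash is taken over normalized content rather than file position, moving or re-indenting a prompt leaves its fingerprint untouched, which is the behavior the report relies on to avoid false staleness.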

Scanner v3 + the determinacy gate — run, failed, published

Shipped

Before claiming drift coverage, we measured how much real-world prompt traffic static analysis can actually read. Scanner v2's first pass reported zero call sites on 4 of 5 real repos — blind to aliased litellm imports (pr-agent), **kwargs-unpacked calls (aider), and messages=<variable>. v3 detects all three; what it still can't read surfaces as honest UNKNOWN records instead of vanishing. Re-measured: 278 call sites across aider, gpt-researcher, open-interpreter, letta, and pr-agent — 20.9% statically resolvable, below the 50% gate. The gate failed, the result is published with the per-repo table on the epic, and the runtime recorder was promoted to priority.

What you’d see

Every staleness report opens with the determinacy headline — your repo's exact static-resolvability ratio. Baselines written by v2 print a rescan warning instead of fake drift. Shipped in 0.10.1.
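
The gate arithmetic itself is small enough to sketch. The function below is an illustration, not the shipped check: `determinacy` is a hypothetical name, the assumption is that the gate passes when the ratio reaches the threshold, and the count of 58 resolvable sites is back-computed from the published 20.9% of 278 rather than quoted from the per-repo table.

```python
def determinacy(resolvable: int, total: int, gate: float = 0.50):
    """Return (ratio, passed) for statically resolvable call sites against a gate."""
    if total <= 0:
        raise ValueError("total call sites must be positive")
    if not 0 <= resolvable <= total:
        raise ValueError("resolvable must be between 0 and total")
    ratio = resolvable / total
    return ratio, ratio >= gate


# 58 is inferred from 20.9% of 278; it is not a figure stated in the roadmap.
ratio, passed = determinacy(resolvable=58, total=278)
print(f"{ratio:.1%} statically resolvable, gate {'passed' if passed else 'failed'}")
```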

install-skills CLI

Shipped

One-command installer for the three bundled Claude Code skills. Symlinks eval-bootstrap, eval-audit, and eval-explain from the wheel into ~/.claude/skills/, so pip install -U multivon-eval propagates SKILL.md edits without re-running install.

What you’d see

Run multivon-eval install-skills once. The three skills become callable in any Claude Code session as /eval-bootstrap, /eval-audit, /eval-explain. Shipped in 0.9.8.

Cross-distribution held-out F1 (Benchmark 4)

Shipped

Hallucination evaluator calibrated on HaluEval-QA (threshold 0.55, explicit JudgeConfig), tested without re-tuning on HaluEval-Sum. Calibration set strictly disjoint from test set. Result: F1 0.830 [0.70–0.92] on n=60.

What you’d see

/eval shows F1 0.830 [0.70–0.92] with the calibration provenance disclosed. Reproducer is benchmarks/run_truly_held_out.py. Shipped in 0.9.5 → 0.9.7.

Wilson + bootstrap CIs on every published number

Shipped

benchmarks/_add_cis.py walks every results JSON and writes Wilson CIs on precision/recall plus bootstrap CIs (1000 resamples, seed 20260603) on F1. Idempotent. Closes the "framework preaches CIs but doesn't ship them on its own numbers" dogfood violation.

What you’d see

Every F1 on the leaderboard, /eval tile, and benchmarks/README carries its 95% bootstrap CI. Shipped in 0.9.4.
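
For readers who want to see what the two interval types involve, here is a self-contained sketch of a Wilson score interval and a seeded percentile bootstrap on F1. It is not a copy of benchmarks/_add_cis.py: the function names are hypothetical, the percentile method is an assumption (the roadmap only states 1000 resamples and seed 20260603), and the confusion counts are invented for the demo.

```python
import math
import random


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def f1_score(pairs):
    """F1 over (label, prediction) booleans; returns 0.0 when undefined."""
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(pairs, resamples=1000, seed=20260603, alpha=0.05):
    """Percentile bootstrap CI on F1 with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    n = len(pairs)
    scores = sorted(
        f1_score([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(resamples)
    )
    lo = scores[int((alpha / 2) * resamples)]
    hi = scores[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return lo, hi


# Invented confusion counts for illustration: 25 TP, 5 FN, 5 FP, 25 TN (n=60).
pairs = (
    [(True, True)] * 25 + [(True, False)] * 5
    + [(False, True)] * 5 + [(False, False)] * 25
)
point_f1 = f1_score(pairs)
precision_ci = wilson_interval(successes=25, n=30)  # 25 correct of 30 predicted positives
f1_ci = bootstrap_f1_ci(pairs)
print(f"F1 {point_f1:.3f}, bootstrap CI {f1_ci}, precision Wilson CI {precision_ci}")
```

Fixing the seed is what makes a script like this idempotent: re-running it over the same results yields the same interval.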

Calibration provenance — zero null F1

Shipped

18 calibration entries across 6 judges × 3 evaluators in _calibration_data/v2.json. Six previously-null F1 cells (opus, gpt-4o, gpt-5.5 across faithfulness/hallucination/relevance) filled via a $15–20 sweep on real held-out data.

What you’d see

No more silent 0.7 fallback. calibrated_threshold(evaluator, judge) returns a calibrated value with a recorded F1 for every shipped (judge × evaluator) pair. Shipped in 0.9.4.

Three Claude Code skills bundled in the wheel

Shipped

eval-bootstrap (cold-start eval generator), eval-audit (suite review), and eval-explain (judge-output interpreter). SKILL.md files ship inside the wheel under multivon_eval/_skills/. No separate marketplace install needed.

What you’d see

After pip install multivon-eval + multivon-eval install-skills, three new slash commands work in Claude Code immediately. Shipped in 0.9.4.

Self-correction audit trail (0.9.4 → 0.9.7)

Shipped

Four same-day PyPI releases responding to public peer review: contamination fix on the headline held-out claim, runtime bug in the generated bootstrap template, threshold-vs-default mismatch in the held-out reproducer. All four releases left published as the audit trail.

What you’d see

pypi.org/project/multivon-eval/ shows the release sequence. CHANGELOG documents what each release fixed and which reviewer flagged it.

Phase 1 prompt attribution — descriptive diff

Shipped

AST-aware scan of prompt sources in a repo, structured diff between two refs, markdown rendering. Public API: multivon_eval.attribution.scan(repo_root), diff_records(base, head), render_markdown(diffs). Descriptive only — causal attribution intentionally deferred to Phase 2. Scanner v2 (0.10.0) added one-hop module-level constant resolution and loose fingerprints; this scan is now the substrate the staleness drift report runs on.

What you’d see

multivon-eval attribution scan <repo> and multivon-eval attribution diff <base> <head> work today. JSON, text, and markdown output. Commit b43b98c on multivon-eval main.

Next 2 weeks

June 2026 — in-progress

Runtime prompt recorder

Shipped

The 2026-06-11 determinacy gate measured 20.9% static resolvability across 278 call sites on five real repos — most real-world prompt traffic is dynamic construction, beyond what static analysis can read. The recorder is the honest path past that ceiling: opt-in, local-only capture of prompt fingerprints at call time (pytest --record-prompts), runtime fingerprints labeled separately from static ones, and case binding via observation — never fabricated. Promoted from deferred to priority by the failed gate; shipped in 0.11.0 the same week.

What you’d see

Shipped in 0.11.0. Call sites the static scanner reports as UNKNOWN gain runtime-observed fingerprints; the staleness report renders them as a distinct OBSERVED tier in k-of-N language (recordings prove the renderings observed, not all renderings).

multivon-eval watch <dir> daemon

Planned

Long-running process that re-runs the configured suite on file or git changes in a watched directory. Debounced. Reuses the existing EvalSuite and CostTracker. First step toward continuous eval that doesn't require a CI trigger. Re-scoped after 0.10.0: the watcher will run the staleness report on file change, so prompt drift surfaces in your terminal before any suite re-run spends judge tokens.

What you’d see

Edit a prompt file, save it, see the suite re-run in your terminal within a second. No CI round-trip.
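
"Debounced" here means a burst of save events should trigger one re-run, not one per event. The watch daemon is not shipped yet, so the following is only a sketch of the technique under assumed names (`Debouncer`, `notify`, `ready`), driven by a fake clock so it runs instantly.

```python
import time


class Debouncer:
    """Fire only after events have been quiet for `delay` seconds."""

    def __init__(self, delay: float, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self._last_event = None

    def notify(self) -> None:
        """Record a change event; repeated events push the deadline back."""
        self._last_event = self.clock()

    def ready(self) -> bool:
        """Return True once, when the quiet period has elapsed since the last event."""
        if self._last_event is None:
            return False
        if self.clock() - self._last_event >= self.delay:
            self._last_event = None
            return True
        return False


# Drive the debouncer with a fake clock instead of sleeping.
now = [0.0]
debouncer = Debouncer(delay=0.5, clock=lambda: now[0])

debouncer.notify()             # first save at t=0.0
now[0] = 0.25
debouncer.notify()             # second save at t=0.25 resets the quiet period
now[0] = 0.5
fired_early = debouncer.ready()   # only 0.25s of quiet so far
now[0] = 1.0
fired_late = debouncer.ready()    # 0.75s of quiet, so one run is due
fired_again = debouncer.ready()   # the pending event was consumed
print(fired_early, fired_late, fired_again)
```

A real watcher would call notify() from its file-system event handler and poll ready() in its main loop before launching the staleness report or suite.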

sync — propose-and-review case refresh

Planned

Consumes the staleness JSON report and proposes case-file diffs for human review when bound prompts change. Never auto-commits — confidently-wrong refreshes would poison the suite, so every proposed diff requires explicit approval. The deliberate other half of the 0.10.0 read-only report.

What you’d see

multivon-eval sync proposes updated cases for every CHANGED prompt; you review and apply, or reject. Nothing is rewritten without your sign-off.

Staleness gate in eval-action

Planned

Surfaces the staleness report in the eval-action PR workflow — warn-only $GITHUB_STEP_SUMMARY line first (works today as a documented one-liner with --format markdown), per-category fail-on as an action input later. Gate-by-default is deliberately avoided: failing PRs on ADDED prompts punishes adoption.

What you’d see

Every PR shows a staleness summary in the Actions run page without any gating; teams opt into --fail-on per category when ready.

MetricsSink interface (file / stdout / Datadog)

In progress

Pluggable sink so every EvalReport can be streamed to a destination of your choice. Three built-in adapters: file (JSONL append), stdout (rich-rendered), Datadog (gauge per evaluator + count per status). No hosted sink — bring your own observability.

What you’d see

suite.run(sink=DatadogSink(api_key=...)) lights up your existing Datadog dashboards with calibrated eval metrics. No multivon-hosted middleware.
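
The interface is still in progress, so its final shape is not known from this page. As a sketch of what a pluggable sink could look like, the example below defines a one-method protocol and two adapters. The `emit` method, the class names, the `publish` helper, and the report fields are assumptions made for illustration; the stdout adapter prints plain JSON rather than rich-rendered output, and the Datadog adapter is left out because it would need network access.

```python
import json
import sys
import tempfile
from pathlib import Path
from typing import Protocol


class MetricsSink(Protocol):
    """Anything with an emit(report) method can act as a sink."""

    def emit(self, report: dict) -> None: ...


class JsonlFileSink:
    """Appends one JSON object per line to a local file."""

    def __init__(self, path):
        self.path = Path(path)

    def emit(self, report: dict) -> None:
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(report, sort_keys=True) + "\n")


class StdoutSink:
    """Writes each report as a JSON line to standard output."""

    def emit(self, report: dict) -> None:
        sys.stdout.write(json.dumps(report, sort_keys=True) + "\n")


def publish(report: dict, sinks: list[MetricsSink]) -> None:
    """Fan a single report out to every configured sink."""
    for sink in sinks:
        sink.emit(report)


with tempfile.TemporaryDirectory() as tmp:
    sink_path = Path(tmp) / "eval_metrics.jsonl"
    sinks = [JsonlFileSink(sink_path), StdoutSink()]
    # Illustrative values only; these are not real benchmark results.
    publish({"evaluator": "faithfulness", "score": 0.9, "status": "pass"}, sinks)
    publish({"evaluator": "relevance", "score": 0.4, "status": "fail"}, sinks)
    lines = sink_path.read_text(encoding="utf-8").splitlines()
```

Keeping the protocol to a single method is one way to let a third-party adapter be written without importing anything from the library.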

Cross-source ingest adapters

In progress

First-class adapters for LangSmith, Sentry, and S3 so traces stored across systems can be loaded into a single eval run without you writing custom ETL. Extends the existing load_traces aliasing work (LangFuse, Phoenix already covered).

What you’d see

load_traces('langsmith://project/foo') and load_traces('sentry://issue/1234') return EvalCase lists ready to feed a suite.

eval-watch Claude Code skill

Planned

Fourth bundled skill wrapping the watch daemon. Slash command that starts the daemon scoped to the current repo, surfaces failures in chat, and proposes fixes when a regression appears. Sibling to /eval-bootstrap, /eval-audit, /eval-explain.

What you’d see

/eval-watch in Claude Code keeps your eval running while you code; agent narrates regressions as they happen.

Swarm coordination skeleton — multivon-eval swarm <repo>

Planned

Proof-of-concept that coordinates three named subagents (auditor, attributor, explainer) against a repo and surfaces a unified report. Skeleton only — the real swarm is the long-term thesis below; this is the first walking version.

What you’d see

multivon-eval swarm <repo> produces a markdown report combining a suite audit, a prompt-change attribution, and per-failure explanations from three independent agents.

Q3 2026 target

Q3 2026 — planned

Truly cross-corpus held-out evaluation (TruthfulQA + FaithBench)

Planned

HaluEval-QA and HaluEval-Sum share CNN/DailyMail structure. Adding TruthfulQA and FaithBench as held-out targets gives a genuinely cross-corpus generalization figure. Will also re-run at n=500–1000 to tighten CIs (the interval at the current held-out n=60 is honestly wide).

What you’d see

/eval gains a second held-out tile: F1 with a 95% CI on a corpus the calibration set has never touched. Plus narrower CIs on the existing held-out claim.

Phase 2 prompt attribution — sidecar + majority voting

Planned

Adds a non-prompt-change signal (git diff classifier over pyproject, model configs, dependencies) ANDed against Haiku majority-voted attribution. When the sidecar fires `present`, attribution refuses rather than confidently misattributing a mixed-cause regression. Design doc: feature_prompt_attribution_phase2_sidecar_design.md.

What you’d see

On any PR, the eval-action comment shows per-case attribution with HIGH/MED/LOW confidence — or skips and tells you both prompts AND non-prompt code changed and the data doesn't support causal attribution.
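
The refusal logic can be sketched in a few lines. This is a reading of the description above, not the Phase 2 design: `majority_vote`, `attribute`, the vote labels, and the share cutoffs for HIGH/MED/LOW are all hypothetical, and the judge votes are passed in as plain strings rather than produced by a model.

```python
from collections import Counter


def majority_vote(votes):
    """Return (label, share) for the most common vote, or (None, 0.0) on a tie or no votes."""
    if not votes:
        return None, 0.0
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None, 0.0
    label, n = counts[0]
    return label, n / len(votes)


def attribute(votes, sidecar_present: bool):
    """Refuse when non-prompt changes are present; otherwise report the voted cause."""
    if sidecar_present:
        return {"status": "refused", "reason": "prompt and non-prompt changes both present"}
    label, share = majority_vote(votes)
    if label is None:
        return {"status": "refused", "reason": "no majority among votes"}
    # Hypothetical cutoffs for the confidence tiers.
    confidence = "HIGH" if share >= 0.8 else "MED" if share >= 0.6 else "LOW"
    return {"status": "attributed", "cause": label, "confidence": confidence}


unanimous = ["system_prompt"] * 5
split = ["system_prompt"] * 3 + ["few_shot_examples"] * 2

clean_pr = attribute(unanimous, sidecar_present=False)
mixed_pr = attribute(unanimous, sidecar_present=True)
weak_pr = attribute(split, sidecar_present=False)
print(clean_pr, mixed_pr, weak_pr, sep="\n")
```

Checking the sidecar first means even a unanimous vote cannot produce an attribution when dependencies or model configs changed in the same diff.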

Async / batch QAG

Planned

QAG evaluators (Faithfulness, Hallucination, ContextPrecision) currently don't override aevaluate, so async paths fall back to sync. Closes a known limitation flagged in iter-2 peer review. Pairs with batch judging via the existing OpenAI/Anthropic batch APIs for 50% cost reduction on long runs.

What you’d see

suite.run_async(workers=32) actually parallelizes QAG judge calls. Long benchmark runs (n=1000+) finish 5–10x faster at half the API cost.
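
One common way to parallelize blocking judge calls is to push them onto worker threads behind a concurrency limit. The sketch below shows that pattern with the standard library only. It is not how multivon-eval implements or will implement aevaluate: `SyncOnlyEvaluator` is a toy stand-in, and `run_async` here is a free function invented for the example, not the suite method.

```python
import asyncio


class SyncOnlyEvaluator:
    """Stand-in for an evaluator that only implements a blocking evaluate()."""

    def evaluate(self, case: str) -> float:
        return float(len(case) % 2)  # toy score, not a real judge call


async def aevaluate(evaluator, case, limiter: asyncio.Semaphore) -> float:
    """Run a blocking evaluate() in a worker thread, bounded by the semaphore."""
    async with limiter:
        return await asyncio.to_thread(evaluator.evaluate, case)


async def run_async(evaluator, cases, workers: int = 32):
    """Evaluate all cases concurrently with at most `workers` in flight."""
    limiter = asyncio.Semaphore(workers)
    return await asyncio.gather(*(aevaluate(evaluator, c, limiter) for c in cases))


scores = asyncio.run(run_async(SyncOnlyEvaluator(), ["a", "bb", "ccc"], workers=2))
print(scores)
```

asyncio.gather returns results in the order the cases were submitted, so scores can be paired back to cases by index even though the calls finish out of order.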

Supply-chain hardening for eval-action

Planned

Cosign signature on the published container image, SBOM (CycloneDX) attached to each tagged release, immutable @v1.x.y tags alongside the mutable @v1. Closes r/MLOps procurement-tier concerns from launch review.

What you’d see

cosign verify succeeds against the eval-action image. Procurement teams get an SBOM file in the release assets. Pinning to a specific SHA stays stable.

Pluggable baseline source

Planned

Today multivon-eval compare pairs cases by sequential index within an input. Adding a pluggable baseline source (git ref, JSON file, S3 bucket, MetricsSink read-back) lets compare answer "did my prompt change help vs the version on main 30 days ago" without manual artifact wrangling.

What you’d see

multivon-eval compare --baseline git:origin/main proposal.json --markdown produces a PR-ready diff against whatever main was when the run happened.

Long-term vision

12–18 months

Autonomous eval swarm

Exploring

The underlying thesis. Every roadmap item above is a step toward one thing: a swarm of eval agents that goes into your system, reads your code and your data and your logs wherever they're stored, and continuously updates eval metrics without a human writing or maintaining the eval suite. The watch daemon is the trigger. MetricsSink is the egress. Ingest adapters are how the swarm sees your traces. The Phase 2 attribution sidecar is how it stays honest about causality. Swarm coordination is the orchestration layer. Once those pieces are mature, the eval suite stops being a thing you write — it becomes a thing your repo grows.

What you’d see

You install one process. It discovers your prompts, your traces, your code-change history, and your existing observability stack. It proposes evals, calibrates them, runs them continuously, attributes regressions, and ships a weekly health report. You never write a suite by hand.

What we are deliberately NOT building

Exploring

Honesty principle: a managed-SaaS eval dashboard is not on the roadmap. A hosted eval-database is not on the roadmap. Fine-tuning-as-a-service is not on the roadmap. multivon-eval is and will remain Apache-2.0, runs on your machine, sinks metrics to observability tools you already own. If the right answer for you is a hosted dashboard, the answer is Datadog or Phoenix or your own — not us. The commercial surface (if any) will be support and integration consulting, not a tier above the OSS.

What you’d see

Forever-free local install. No required cloud account. No data-egress to multivon servers. Your eval results stay in your infrastructure.

Want something on this list sooner?

File an issue on multivon-eval or email us. Real demand is how we re-prioritise — every “planned” item is shipped in dependency order, not alphabetical order.