Changelog

Recent multivon-eval releases. Full notes on GitHub.

v0.8.2·May 19, 2026

v0.8.2 — ContextRecall skip semantics

Second dogfood-driven patch. Fixes a UX footgun in EvalSuite.for_rag().
🐛 Fixed
ContextRecall now skips cleanly when expected_output is missing instead of returning a confusing 0.0 quality failure.
# Before (0.8.1 and earlier):
suite = EvalSuite.for_rag()  # auto-includes ContextRecall
suite.add_cases([EvalCase(input="Q", context="...")])  # no expected_output
report = suite.run(model)
# context_recall: 0.0  reason: "Requires both case.context and case.expected_output"
# pass_rate dragged down by a metric that couldn't actually evaluate

# After (0.8.2):
# context_recall: 1.0 (passed=True)  reason: "[skipped] Requires both ..."
# metadata.skipped = True so users can filter
# pass_rate reflects actual quality, not missing ground truth
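A minimal sketch of how a caller might filter skipped results when summarising a run. The `Result` dataclass is a hypothetical stand-in, not the library's own result type; only the `metadata["skipped"]` flag is taken from the note above.

```python
from dataclasses import dataclass, field


@dataclass
class Result:  # hypothetical stand-in for an evaluator result
    evaluator: str
    score: float
    passed: bool
    metadata: dict = field(default_factory=dict)


def quality_results(results):
    """Keep only results that actually evaluated something."""
    return [r for r in results if not r.metadata.get("skipped", False)]


results = [
    Result("faithfulness", 0.9, True),
    Result("context_recall", 1.0, True, {"skipped": True}),  # no expected_output
    Result("relevance", 0.4, False),
]
evaluated = quality_results(results)
pass_rate = sum(r.passed for r in evaluated) / len(evaluated)
print(f"{len(evaluated)} evaluated, pass rate {pass_rate:.2f}")
```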
Known issue (will be fixed in 0.9.0)
A similar "returns 0.0 when input shape doesn't match" pattern exists in ~20 other evaluators (AnswerAccuracy, ExactMatch, Contains, BLEU, ROUGE, agent evaluators when no agent_trace, conversation evaluators when no conversation). These will all get the same skip-semantics treatment in 0.9.0. For now, only ContextRecall is fixed because it's the only metric auto-included by a factory suite and was therefore the most-visible footgun in the headline RAG workflow.
🧪 Tests
3 new tests in tests/test_context_recall_skip.py
Full suite: 835 passed, 13 skipped (was 832/13 at 0.8.1).
Upgrade
pip install --upgrade multivon-eval
Full release notes →
v0.8.1·May 19, 2026

v0.8.1 — Context injection fix for RAG one-liners

Critical fix for the 0.8.0 RAG one-liner UX. Surfaced by a comprehensive dogfood pass and verified end-to-end before HN launch.
🐛 Fixed
run_with_anthropic / run_with_openai / run_with_litellm now auto-inject EvalCase.context into the system prompt.
Pre-0.8.1, every RAG case run via these one-line helpers silently dropped its context — Claude/GPT got the question with no grounding, faithfulness/hallucination evaluators scored 0/N against the empty-context reality, and users had no signal the helper wasn't doing what its name implied.
# This now works as you'd expect — Claude gets the context.
suite = EvalSuite.for_rag()
suite.add_cases([
    EvalCase(
        input="What is the refund window?",
        context="Refunds within 30 days of purchase.",
    ),
])
report = suite.run_with_anthropic("claude-haiku-4-5-20251001")
# faithfulness now scores 1.0 — pre-0.8.1 it scored 0.0
How the fix works
The adapter contract is extended with an optional _call_with_case(case) method. The suite uses it when present and falls back to the existing model_fn(case.input) path for plain callables, so existing custom adapters are unaffected. The built-in AnthropicAdapter / OpenAIAdapter / LiteLLMAdapter now implement _call_with_case and auto-inject case.context into the system prompt, together with a "Use ONLY this context to answer" RAG prefix.
List-valued context (multiple retrieved chunks) gets [chunk i] markers so the model can see the boundaries between sources.
Same fix applied to the async path via _acall_with_case (used by run_async).
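The routing described above can be sketched roughly as follows. Everything here is illustrative: `Case`, `ContextAwareAdapter`, `format_context_block`, and `call_model` are hypothetical stand-ins, and the 1-based chunk numbering is an assumption. Only the `_call_with_case` hook name, the `[chunk i]` markers, and the plain-callable fallback come from the notes.

```python
def format_context_block(context):
    """Render a string as-is; render a list of chunks with [chunk i] markers."""
    if isinstance(context, str):
        return context
    # Assumption: chunks are numbered from 1.
    return "\n".join(f"[chunk {i}] {c}" for i, c in enumerate(context, start=1))


class Case:  # hypothetical stand-in for EvalCase
    def __init__(self, input, context=None):
        self.input = input
        self.context = context


class ContextAwareAdapter:  # hypothetical adapter implementing the optional hook
    def __call__(self, prompt):
        return f"answer({prompt})"

    def _call_with_case(self, case):
        system = "Use ONLY this context to answer.\n" + format_context_block(case.context)
        return f"answer({case.input} | {system})"


def call_model(model_fn, case):
    """Prefer the case-aware hook; fall back to model_fn(case.input)."""
    hook = getattr(model_fn, "_call_with_case", None)
    if callable(hook):
        return hook(case)
    return model_fn(case.input)
```

A plain function or lambda has no `_call_with_case` attribute, so it keeps receiving only `case.input`, which is why existing custom adapters would be unaffected.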
🧪 Tests
14 new tests in tests/test_adapter_context_injection.py cover: _format_context_block helper, AnthropicAdapter / OpenAIAdapter context injection, system-prompt composition with both user-supplied and RAG prefixes, list-valued context, suite routing to _call_with_case when available + fallback for plain callables.
Full suite: 832 passed, 13 skipped (was 818/13 at 0.8.0).
Upgrade
pip install --upgrade multivon-eval
0.8.0 users with RAG cases using run_with_anthropic / run_with_openai should upgrade — the score difference is dramatic and the API is unchanged.
v0.8.0·May 19, 2026

v0.8.0 — Intelligent-eval bootstrap + auto module

The intelligent-eval release. Two new public surfaces address the "I don't know what to eval" cold-start bottleneck for teams shipping LLM products.
multivon-eval bootstrap CLI
Takes a product description + sample traces, emits a tuned EvalSuite in <60 seconds.
multivon-eval bootstrap \
  --product product.md \
  --traces traces.jsonl \
  --output ./eval-bootstrap/
Four artifacts land in the output dir:
eval_suite.py — runnable suite with 4-6 evaluators picked for your product shape
seed_cases.jsonl — 30 adversarial seed cases targeting the primary failure mode
thresholds.yaml — calibrated from your traces at p25 of baseline scores
DISCOVERY_REPORT.md — a forwardable eval design review for your team
Single Claude Haiku call for metric proposal; deterministic auto_evaluators heuristic runs as the safety net; threshold calibration on a capped trace sample. Cost ≈$0.12 per bootstrap, hard ceiling $0.15.
PII redaction (high-confidence local scan for AWS / Anthropic / OpenAI / GitHub keys, SSN, Luhn-valid credit cards, email) runs before any LLM call. Three policies via --pii-policy: redact (default), strict (abort on detection), allow (raw, with explicit confirmation).
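For readers unfamiliar with the "Luhn-valid" qualifier, here is a minimal sketch of the standard Luhn checksum. It is the textbook algorithm, not the scanner's actual code; `luhn_valid` is a hypothetical name.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Requiring a valid checksum before flagging a 16-digit run as a card number cuts down on false positives from order IDs and similar numeric strings.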
🧠 multivon_eval.auto module
auto_evaluators(case) — pure-heuristic evaluator suggester. Pass an EvalCase, get back a ranked evaluator list with primary / secondary / guardrail tiers and confidence. Zero-cost.
generate_adversarial_cases — LLM-generates cases for 10 failure modes: ungrounded_claim, jailbreak, prompt_injection_direct/indirect, tool_injection, pii_leakage_invitation, tool_misuse, numeric_edge, off_topic, format_violation.
generate_unicode_obfuscation_cases — deterministic homoglyph / zero-width / RTL-override transforms. No LLM call.
validate_adversarial_cases — N-shot judge-noise filter. Validated +0.80 mean failure-rate separation between weak (always-confabulate) and strong (always-refuse) baselines on ungrounded_claim cases with real Claude Haiku judge — judge noise correctly filtered out at the per-shot level.
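As an illustration of what deterministic Unicode obfuscation can look like, here is a small sketch of the three transform families named above. The function names and the homoglyph table are assumptions made for this example; they are not the module's API.

```python
ZERO_WIDTH_SPACE = "\u200b"
RTL_OVERRIDE = "\u202e"
# Latin letters mapped to visually similar Cyrillic letters (illustrative subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}


def insert_zero_width(text: str) -> str:
    """Put a zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def swap_homoglyphs(text: str) -> str:
    """Replace selected Latin letters with look-alike Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def rtl_wrap(text: str) -> str:
    """Prefix the text with a right-to-left override character."""
    return RTL_OVERRIDE + text
```

Because the transforms are pure string functions, the same input always yields the same case, with no LLM call involved.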
🧪 Tests
33 new tests for the bootstrap pipeline (test_discover.py)
19 tests for N-shot validation (test_auto_validate_adversarial.py)
19 tests for auto_evaluators heuristic (test_auto_evaluators.py)
7 tests for Unicode obfuscation (test_auto_unicode_obfuscation.py)
Full suite: 818 passed, 13 skipped (was 745/12 at 0.7.3)
🐛 Fixes
rag_eval.ipynb: corrected stale Experiment.add_run → exp.record(report, run_id=...) and suite.prepare() → per-evaluator ev.prepare(). Switched private accessors to public to match the quickstart notebook style.
📚 Full changelog
See [CHANGELOG.md](https://github.com/multivon-ai/multivon-eval/blob/main/CHANGELOG.md) for the complete entry.
v0.7.8·May 19, 2026

v0.7.8 — Critical fix: 0.7.7 CLI was broken

Fixed
Orphan sys.exit(1) at the end of cli.py broke the 0.7.7 CLI entirely. A leftover top-level sys.exit(1) from a refactor made every CLI invocation in 0.7.7 exit with status 1, even after the command itself had succeeded. 0.7.7 users should upgrade immediately.
v0.7.7·May 19, 2026

v0.7.7 — multivon-eval discover (capability catalog)

Added
multivon-eval discover — emit a machine-readable capability catalog (evaluators, jurisdictions, judges, factory suites) as JSON to stdout. Same shape as multivon-mcp's eval_discover tool, surfaced for agents that don't speak MCP (or shell scripts, or CI gates).
multivon-eval discover | jq '.evaluators[] | select(.category == "rag")'
Pair with --compact to flatten to a single line for piping into jq -c.
v0.7.6·May 19, 2026

v0.7.6 — RAG starter judge-noise tolerance

Fixed
The init -t rag starter now passes its happy path with the calibration warnings introduced in 0.7.3. The default thresholds were tuned slightly to absorb the small amount of judge noise on the canned cases.
v0.7.5·May 19, 2026

v0.7.5 — save_json fix + scaffolder dedup

Fixed
EvalReport.save_json() auto-creates the parent directory. Previously failed with FileNotFoundError if the target directory didn't exist; now mkdir(parents=True, exist_ok=True) is called first.
Drop duplicate template writes in the init scaffolder — the same starter file was being written twice on some templates.
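The save_json fix follows a common pathlib pattern. A minimal sketch, using a hypothetical free function rather than the real `EvalReport.save_json()` method:

```python
import json
import tempfile
from pathlib import Path


def save_json(payload: dict, path) -> Path:
    """Write payload as JSON, creating missing parent directories first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target


# Usage: the nested reports/nightly directory does not exist beforehand.
with tempfile.TemporaryDirectory() as tmp:
    out = save_json({"pass_rate": 1.0}, Path(tmp) / "reports" / "nightly" / "report.json")
    saved = json.loads(out.read_text(encoding="utf-8"))
```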
v0.7.4·May 19, 2026

v0.7.4 — DPDP (India) compliance support

Added
DPDP (India) compliance support. The Digital Personal Data Protection Act (DPDP) jurisdiction is now first-class in PIIEvaluator(jurisdiction="dpdp"). Adds Aadhaar (12-digit + Verhoeff checksum), PAN (5+4+1 alpha-numeric), and India-specific patterns to the detection rules.
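For context on the Verhoeff checksum mentioned above, here is a sketch of the standard algorithm with its published lookup tables. The function names are hypothetical and the real detection rules may apply further constraints; this only shows the checksum step on a 12-digit candidate.

```python
# Standard Verhoeff tables: dihedral-group multiplication, position permutation, inverse.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_valid(number: str) -> bool:
    """True if the digit string (check digit last) passes the Verhoeff check."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> int:
    """Compute the check digit to append to `payload`."""
    c = 0
    for i, ch in enumerate(reversed(payload), start=1):
        c = _D[c][_P[i % 8][int(ch)]]
    return _INV[c]


def looks_like_aadhaar(text: str) -> bool:
    """Hypothetical helper: 12 digits (spaces allowed) with a valid checksum."""
    digits = text.replace(" ", "")
    return len(digits) == 12 and digits.isdigit() and verhoeff_valid(digits)
```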
v0.7.3·May 19, 2026

v0.7.3 — Loud calibration fallback + multimodal evaluators

The trust release: the single most-cited bug across a 26-voice strategy deliberation was the silent calibration fallback. Fixed, with a public escape hatch for legacy callers who depended on it. Two experimental multimodal evaluators land as the seed for a forthcoming document-AI benchmark (see [pdfhell](https://github.com/multivon-ai/pdfhell)).
Fixed
Silent calibration fallback is now loud. calibrated_threshold(evaluator, judge) previously fell back to 0.7 silently when the (evaluator, judge_model) pair was missing from _calibration_data/v2.json — the strategy deliberation flagged this as the single most-cited trust bug (5+ persona voices including a Series-A CTO who called it "deceitful code"). The default behaviour is now "warn": a UserWarning fires once per pair, then the call returns 0.7. Pre-0.7.3 silent behaviour is opt-in via set_calibration_fallback_policy("silent") for back-compat. For procurement/audit deployments call set_calibration_fallback_policy("strict") (raises CalibrationMissing). The MULTIVON_CALIBRATION_FALLBACK={silent,warn,strict} env var overrides at process start.
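The three policies can be pictured with a small sketch. This is not the library's implementation: `threshold_for`, the in-memory `table`, and the local `CalibrationMissing` class are stand-ins. Only the policy names, the 0.7 fallback, and the once-per-pair warning come from the note above.

```python
import warnings

DEFAULT_THRESHOLD = 0.7
_warned_pairs = set()


class CalibrationMissing(LookupError):
    """Stand-in for the library exception of the same name."""


def threshold_for(table, evaluator, judge, policy="warn"):
    """Look up a calibrated threshold, applying the fallback policy on a miss."""
    key = (evaluator, judge)
    if key in table:
        return table[key]
    if policy == "strict":
        raise CalibrationMissing(f"no calibration row for {key}")
    if policy == "warn" and key not in _warned_pairs:
        _warned_pairs.add(key)  # warn only once per (evaluator, judge) pair
        warnings.warn(f"no calibration row for {key}; using {DEFAULT_THRESHOLD}",
                      UserWarning, stacklevel=2)
    return DEFAULT_THRESHOLD


# Usage: two lookups of the same missing pair produce a single warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    first = threshold_for({}, "faithfulness", "some-judge")
    second = threshold_for({}, "faithfulness", "some-judge")
warning_count = len(caught)
```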
Added
set_calibration_fallback_policy(policy) exported at top level. Module-level switch; per-call strict=True still wins.
MULTIVON_CALIBRATION_FALLBACK env var — set to silent, warn, or strict to override the in-process default without code changes.
Multimodal evaluators (experimental) — first multimodal capabilities shipped:
VQAFaithfulness — image-grounded faithfulness. Generates 3 QAG claims about an image, verifies each. Reads image from case.metadata['image_url' | 'image_path' | 'images'].
DocumentGrounding — multi-page document-agent grounding. Three QAG questions per case: claim support, entity invention, exception handling. Seed evaluator for the Document Agent Acceptance Protocol v0.1.
Vision dispatch wired for anthropic (Claude 3.5+ + 4.x), openai (GPT-4o+), google (Gemini 1.5+). Raises JudgeUnavailable with a friendlier hint when a text-only judge is mis-wired.
Both classes flagged experimental: no calibration rows shipped yet (so the new "warn" default fires on first use until thresholds are calibrated).
Tests
21 new tests in tests/test_multimodal.py exercise the public surface (image-metadata parsing, error paths, parse helpers) without provider API calls.
Full suite: 745 passed, 12 skipped (was 724/12).
v0.7.2·May 19, 2026

v0.7.2 — Gemini judge + multi-judge agreement benchmark

Gemini lands as a first-class judge provider. The 5-judge multi-judge agreement benchmark re-ran with 250 LLM calls and ships on the website.
Added
provider="google" for JudgeConfig — backed by the official google-genai SDK. Default model: gemini-2.5-flash. Install with pip install 'multivon-eval[google]' (the extra pulls google-genai>=1.0.0). Sync + async paths wired. Auth via GOOGLE_API_KEY (matches Google's own docs); the standard "missing key" JudgeUnavailable setup hint now mentions where to get one.
Pricing for Gemini in _cost_models.py: gemini-2.5-flash $0.075/$0.30 per million in/out, gemini-2.5-flash-lite $0.0375/$0.15, gemini-2.5-pro $1.25/$5.00, plus 1.5-series. Per-token usage is recorded the same way as Anthropic and OpenAI, so report.costs.total_cost_usd is correct out of the box.
Multi-judge benchmark refreshed with 5 judges. benchmarks/results/multi_judge_agreement.json now reports pairwise Cohen's κ 0.60–0.80 (substantial agreement on most pairs) and per-judge accuracy/precision/F1 on HaluEval QA, N=50. gemini-2.5-flash leads on every metric in this run (accuracy 0.860, precision 0.950, F1 0.844). Numbers surface on the website and in the new "Why multivon-eval" docs page.
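For reference, pairwise Cohen's κ compares observed agreement with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch for two judges' verdict lists follows; `cohen_kappa` is a hypothetical helper, not the benchmark script itself.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each rater's marginal rate, summed per category.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Usage: 1 = "hallucinated", 0 = "grounded" for four cases.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```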
v0.7.1·May 19, 2026

v0.7.1 — Compliance reporter + HTML compliance report

Pre-public-launch hardening: two real bugs the new benchmarks surfaced, plus calibration around the numbers we put on the website.
Fixed
workers > 1 lost every CostTracker record. _run_parallel submitted work via ThreadPoolExecutor without copying the caller's contextvars, so each worker started in an empty context and the active CostTracker was invisible. report.costs.total_cost_usd came back $0.00 whenever you parallelised. Wrapped each submit() call in contextvars.copy_context().run(...). Verified — workers=4 on 4 Hallucination calls now reports $0.00022 instead of $0.
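The contextvars pattern behind this fix can be sketched in isolation. The names below (`active_tracker`, `record_cost`, `run_parallel`) are hypothetical; only the `contextvars.copy_context().run(...)` wrapping of each submit is taken from the note.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Holds the tracker for the current run; a plain list stands in for a CostTracker.
active_tracker = contextvars.ContextVar("active_tracker", default=None)


def record_cost(amount):
    """Record a cost on the active tracker; report whether one was visible."""
    tracker = active_tracker.get()
    if tracker is None:
        return False
    tracker.append(amount)  # list.append is thread-safe in CPython
    return True


def run_parallel(amounts, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task gets its own copy of the caller's context, taken at submit time.
        futures = [
            pool.submit(contextvars.copy_context().run, record_cost, amount)
            for amount in amounts
        ]
        return [f.result() for f in futures]


tracker = []
token = active_tracker.set(tracker)
try:
    seen = run_parallel([0.00005, 0.00006, 0.00005, 0.00006])
finally:
    active_tracker.reset(token)
```

A fresh copy per task matters: a single `Context` object cannot be entered by two threads at the same time.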
set_cache(JudgeCache(...)) was a silent no-op. The cache only activated when JudgeConfig(cache=True) was passed explicitly to each evaluator. Installing a cache globally — the most natural way to opt in — did nothing because JudgeConfig.cache defaults to False. Added cache_is_user_opted_in(); set_cache(non_none) now flips it; JudgeConfig.resolve() honors it. Verified — sequential rep1→rep2 cache hit goes from 2.9s/4-calls to 0ms/0-calls (~2,271× faster) without any per-evaluator config.
EvalSuite.run(save_json=..., save_junit_xml=...) was already available in 0.7.0; this release documents it. Both keyword args write the report BEFORE fail_threshold raises EvalGateFailure, so a failing gate still leaves an artifact for multivon-eval view / compare.
Added (benchmarks)
benchmarks/run_multi_judge_agreement_benchmark.py — pairwise Cohen's κ across claude-haiku-4-5, claude-sonnet-4-6, gpt-4o-mini, gpt-4o on the same hallucination cases. Output at benchmarks/results/multi_judge_agreement.json. Numbers ship on the website.
benchmarks/run_cost_latency_benchmark.py — 50 HaluEval cases × 4 LLM-judge evaluators with workers=1, real Anthropic billing. cost_latency.json reports $0.00127/case, 17 judge calls/case, and a linear $6.35 extrapolation to 5,000 cases.
benchmarks/run_reproducibility_benchmark.py — 10 cases × 10 reps, cache on/off. Surfaces (a) the cache-miss bug fixed in this release and (b) ~3% irreducible stdev at temperature=0 across reps of claude-haiku-4-5. The cache fix turned this into the 2,271× speedup published on the site.
docs/sample-audit-package.zip (5.5 KB) — a real audit-package zip from the regulated template. Linked from the website's Compliance Bundle CTA so buyers can see what an auditor actually receives.
Changed (docs / README)
README hero example: your_llm.generate placeholder replaced with a real anthropic.Anthropic() snippet that runs after pip install.
README + docs claims aligned to reality: calibration F1 range corrected from "0.76–0.98" to the actual shipped 0.66–1.00; the 2.9× run_async and 4,700× cache website claims (the latter conservatively true now but unsupported in the repo) are replaced with the linkable benchmark numbers above.
New docs page: [Why multivon-eval](https://evaldocs.multivon.ai/why-multivon-eval) with head-to-head benchmark tables.
v0.7.0·May 19, 2026

v0.7.0 — Status enum + JUnit + view + init + budgets + audit-pack

The trust release: explicit error classification so a transient judge outage no longer masquerades as a model regression, plus the first major batch of community-facing usability work — JUnit CI integration, a local HTML report viewer, classical similarity metrics, repaired examples and notebooks.
Fixed (pre-release audit, 0.7.0)
Headline trust feature now actually works. Every LLM-judge evaluator (Faithfulness, Hallucination, Relevance, ContextPrecision, CustomRubric, GEval, CheckEvaluator) plus the agent evaluators (ToolArgumentAccuracy, ToolCallNecessity, TaskCompletion, StepFaithfulness) and SelfConsistency had bare except Exception: blocks that silently swallowed JudgeUnavailable and re-classified the case as a quality failure (score=0.0). This defeated the entire CaseResult.status distinction the release advertises. Each judge call now re-raises JudgeUnavailable so suite.run() routes the case to EvalStatus.JUDGE_ERROR and pass_rate excludes it correctly.
fail_threshold no longer reports "Eval failed: pass rate 0.0%" when every case errored. When evaluated == 0 and errors > 0, suite.run() raises EvalGateFailure with the underlying error message (e.g. "Missing credentials … export OPENAI_API_KEY=sk-…") instead of a misleading quality gate failure.
rag init template no longer hangs ~45s when no API key is set and Ollama isn't running. The template now probes Ollama with a 0.5s timeout (matching the regulated template) and falls back to a JudgeConfig whose suite.run() call surfaces an actionable setup hint at first use.
__all__ re-exports. CaseResult, EvalResult, EvalReport, EvalStatus, EVALUATION_STATUSES, ERROR_STATUSES, Costs, CostTracker, ProviderUsage, ModelPricing, register_pricing, SuiteLock, EvaluatorFingerprint, LockMismatch, build_suite_lock, fingerprint_evaluator, verify_suite_against_lock, build_audit_package, assert_evaluators, EvaluatorFailure were imported at module top-level but missing from __all__. from multivon_eval import * and IDE introspection now see them.
README PII example had four missing commas between add_evaluators(...) arguments — pasted-as-is was a SyntaxError. Fixed.
docs/evaluators/deterministic.mdx: the StartsWith("```json") example used a literal triple-backtick inside a triple-backtick fence, closing the outer code block early in Mintlify. Switched the outer fence to four backticks.
Added
#### Foundation primitives
CaseResult.status — new property returning an EvalStatus enum (passed, failed_quality, model_error, judge_error, evaluator_error, timeout, skipped). Surfaces *what kind* of outcome the case had, not just pass/fail. Status fields (judge_error, evaluator_error, skipped, agent_trace) added to CaseResult directly.
EvalReport.evaluated, .errors, .errors_by_kind, .skipped — counts that distinguish quality outcomes from infrastructure failures.
Per-evaluator error isolation — when one evaluator raises JudgeUnavailable, the rest of the case's evaluators still run. The failing evaluator's result records a clear "judge unavailable" reason in EvalResult.metadata["error_kind"], and the case is tagged EvalStatus.JUDGE_ERROR. A non-JudgeUnavailable exception in an evaluator is tagged EvalStatus.EVALUATOR_ERROR (distinct, so retry logic can target judge outages without masking real bugs). Both sync and async (run_async) paths honor this.
CaseResult.agent_trace — captured agent traces now surface on the result (not only on the input case), so notebooks can iterate steps from the report without reaching back into the suite.
Multi-run aggregation propagates error fields — when runs > 1 and any run errors, the aggregate CaseResult keeps the first error of each kind. pass_count uses cr.passed (status-aware), so SPRT early-stop and flaky-detection don't count error runs as successes.
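Per-evaluator error isolation, as described above, amounts to catching failures per evaluator rather than per case. A rough sketch with stand-in names (`run_evaluators` and the local `JudgeUnavailable` class are hypothetical; the two error kinds mirror the statuses in the notes):

```python
class JudgeUnavailable(RuntimeError):
    """Stand-in for the library exception of the same name."""


def run_evaluators(evaluators, case):
    """Run every evaluator; one failure never prevents the others from running."""
    results = {}
    for name, evaluate in evaluators.items():
        try:
            results[name] = {"score": evaluate(case)}
        except JudgeUnavailable as exc:
            # Infrastructure problem: safe to retry later.
            results[name] = {"error_kind": "judge_error", "reason": str(exc)}
        except Exception as exc:
            # Likely a real bug in the evaluator: keep it distinct.
            results[name] = {"error_kind": "evaluator_error", "reason": repr(exc)}
    return results


def flaky_judge(case):
    raise JudgeUnavailable("judge unavailable: connection refused")


def buggy(case):
    return 1 / 0


outcome = run_evaluators(
    {"length": lambda c: float(len(c) > 0), "faithfulness": flaky_judge, "custom": buggy},
    "some model output",
)
```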
#### CI integration
EvalReport.to_junit_xml() + .save_junit_xml(path) — render the report as JUnit XML. GitHub Actions, GitLab CI, CircleCI, Jenkins all render JUnit XML natively in their PR/job summary UI. Quality failures emit <failure>, plumbing failures emit <error> (distinct so CI can route them differently), skipped cases emit <skipped>. XML 1.0-invalid control characters are stripped at the serialization boundary so strict CI consumers accept the document.
multivon-eval report results.json --junit out.xml flag.
multivon-eval view <report.json> CLI — local HTTP server with the HTML dashboard. --port, --no-browser flags. TemporaryDirectory + SIGTERM handler so the temp dir is cleaned up on Ctrl-C, docker stop, or exception. Port collision produces a clean error, not a traceback.
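To illustrate the failure / error / skipped distinction, here is a small sketch that builds JUnit-style XML with the standard library. It is not `EvalReport.to_junit_xml()`; the tuple input format and the function names are assumptions for the example.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 does not allow (tab, LF, CR are fine).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
_TAG_FOR_STATUS = {"failed": "failure", "error": "error", "skipped": "skipped"}


def clean(text: str) -> str:
    return _CONTROL_CHARS.sub("", text)


def to_junit_xml(cases) -> str:
    """cases: iterable of (name, status, message); status 'passed' adds no child."""
    cases = list(cases)
    counts = {s: sum(1 for _, status, _ in cases if status == s) for s in _TAG_FOR_STATUS}
    suite = ET.Element(
        "testsuite", name="eval", tests=str(len(cases)),
        failures=str(counts["failed"]), errors=str(counts["error"]),
        skipped=str(counts["skipped"]),
    )
    for name, status, message in cases:
        testcase = ET.SubElement(suite, "testcase", name=clean(name))
        tag = _TAG_FOR_STATUS.get(status)
        if tag is not None:
            ET.SubElement(testcase, tag, message=clean(message))
    return ET.tostring(suite, encoding="unicode")


xml_text = to_junit_xml([
    ("refund window", "passed", ""),
    ("late fee", "failed", "faithfulness 0.42 < 0.70"),
    ("shipping", "error", "judge unavailable\x07"),
    ("warranty", "skipped", "no expected_output"),
])
```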
#### Public API surface
Top-level imports: CaseResult, EvalReport, EvalResult, EvalStatus, EVALUATION_STATUSES, ERROR_STATUSES are now importable from multivon_eval directly. Saves users from reaching into multivon_eval.result.
#### Evaluators
Levenshtein — character edit-distance similarity. Score = 1 − dist / max(len). Pure-Python (no extra deps). threshold, case_sensitive kwargs.
ChrfScore — character n-gram F-beta (Popović 2015), standard sacreBLEU aggregation: average precision per order, average recall per order, then F-beta on the averages. Defaults match sacreBLEU's chrF (max_n=6, beta=2, whitespace stripped). include_whitespace=True for the count-spaces variant.
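The Levenshtein score formula above, 1 − dist / max(len), can be sketched in pure Python. The function names are hypothetical, and returning 1.0 for two empty strings is an assumption of this sketch rather than documented behaviour.

```python
def levenshtein(a: str, b: str) -> int:
    """Character edit distance using a rolling row of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str, case_sensitive: bool = True) -> float:
    """Score = 1 - dist / max(len)."""
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest
```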
#### Onboarding (from 0.6.2, surfaced here)
multivon-eval init — scaffold a starter project in under 5 minutes. Templates: quickstart (offline, no API key), rag, agent, regulated. --ci github generates a GitHub Actions workflow. --force to overwrite a non-empty target.
EvalReport.assert_budget(...) — opt-in cost / token / latency gate. Raises EvalGateFailure on violation. All thresholds opt-in; missing pricing data surfaces a clear actionable error.
#### CI hardening
.github/workflows/test.yml — pytest matrix on Python 3.10/3.11/3.12, every PR.
.github/workflows/install-smoke.yml — builds the wheel, installs in a clean venv WITHOUT the dev extras or pytest, verifies bare import, verifies the public API, runs the quickstart notebook headlessly with a placeholder API key (auth errors are expected; AttributeError/TypeError from API mismatches → regression). The project shipped 0.6.0 with no CI at all; both workflows close that gap.
#### Enterprise / compliance (later 0.7.0 additions)
Immutable audit-record provenance — every ComplianceReporter.record() row now carries a provenance block with package_version, package_git_sha + package_git_dirty (when running from a git workspace), host (python/platform/machine — no PII), full suite_lock (evaluator + judge + calibration + per-evaluator config fingerprint + cases hash), and a suite_lock_status field that distinguishes "absent" (synthetic report) from "ok" and "serialization_failed". The block is part of the SHA-256 hash chain, so tampering with provenance is detected by reporter.verify(). Marcus persona's compliance-grade blocker.
Evaluator config in the fingerprint. SuiteLock.evaluators[].extra.config now captures the JSON-safe public attributes (WordCount.min_words, Contains.substrings, RegexMatch.pattern, etc.) so two suites with the same evaluator name + threshold but different config produce different suite_hash values. diff() surfaces config-level changes.
Calibration version pinning. load_calibration(version="v1"), calibrated_threshold(..., version=), and threshold_table(version=) take an explicit version label. MULTIVON_CALIBRATION_VERSION env var pins globally for CI. calibration_versions() lists shipped labels. Unknown versions raise FileNotFoundError loudly, because a silent fallback would defeat the purpose of pinning for reproducibility. Sarah persona ask.
HTML report status badges: seven pill variants surface the 0.7.0 EvalStatus enum: PASS, FAIL, FLAKY, MODEL ERR, JUDGE ERR, EVAL ERR, SKIPPED. Distinct colors (green/red/yellow/orange/slate) so a judge outage isn't visually confused with a model regression. Each error pill carries a tooltip explaining which subsystem to investigate. Errors and Skipped counts surface as summary cards when present. Priya persona ask.
Conversation template for multivon-eval init — fifth template (init -t conversation) demonstrating multi-turn dialogue eval with ConversationRelevance + KnowledgeRetention + TurnConsistency. Closes the gap noted in the examples audit (no template demoed the conversation API).
Calibration version pinned through audit-package replay. SuiteLock gains a top-level calibration_version field populated unconditionally from effective_calibration_version() at lock-build time. The label flows through suite lock → audit log provenance → build_audit_package(), which now extracts the version from the FIRST log record and bundles the matching calibration_v{label}.json. Manifest gains calibration_version + calibration_source ("logged" vs "default"). An unshipped pin (MULTIVON_CALIBRATION_VERSION=v_doesnotexist) raises FileNotFoundError at suite.run time instead of silently writing suite_lock=None and defeating the pin. Fixes a real Marcus-persona replay-fidelity bug: previously a v1-pinned audit packaged on a v2-default install would silently bundle v2.
Per-case retry on transient judge errors — new JudgeRetry policy + suite.run(..., judge_retry=JudgeRetry(...)) opt-in. Cases whose status is in policy.retry_on (default: judge_error) are re-evaluated up to max_attempts times with exponential backoff (base_backoff * factor ** (attempt - 2)), symmetric jitter, and a max_backoff cap. Quality failures, model errors, and evaluator bugs are NOT retried — those are signal. CaseResult gains retry_attempts (count of retries actually performed) and retry_errors (the error per failed attempt that prompted a retry; len == retry_attempts). Sync, async (run_async — uses asyncio.sleep), and parallel-workers paths all honor the policy. JSON round-trip preserves retry history. Sarah persona ask — a 10k-case weekend cron no longer needs Monday triage when one 429 trips one case.
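The backoff schedule above can be sketched as a small function. The exponent `attempt - 2` comes from the note (attempt 1 is the initial try, so the first retry waits `base_backoff`). The parameter defaults, the `jitter` parameter name, and applying the cap after the jitter are assumptions of this sketch.

```python
import random


def backoff_delay(attempt, base_backoff=1.0, factor=2.0, max_backoff=30.0,
                  jitter=0.1, rng=random):
    """Seconds to wait before `attempt` (attempt 2 is the first retry)."""
    if attempt < 2:
        raise ValueError("attempt 1 is the initial try; retries start at 2")
    raw = base_backoff * factor ** (attempt - 2)
    # Symmetric jitter: scale by a factor drawn from [1 - jitter, 1 + jitter].
    jittered = raw * (1.0 + rng.uniform(-jitter, jitter))
    return min(max_backoff, jittered)


# With jitter disabled the schedule is deterministic and capped at max_backoff.
delays = [backoff_delay(n, jitter=0.0) for n in range(2, 8)]
```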
Native agent framework integrations (D16) — two new templates with real-framework tracers:
multivon-eval init -t agent-langgraph: StateGraph + MessagesState + ToolNode + tools_condition, instrumented via the new LangGraphTracer. Uses run_id-keyed metadata + langgraph_node + graph:step:N tags. One AgentStep per LLM turn (not per graph node) so a ReAct's tools node aggregates with its preceding decision. Parallel tools within one node are correctly attributed; subgraph metadata is preserved.
multivon-eval init -t agent-openai-sdk — real Agent + function_tool + Runner.run_sync, instrumented via the new OpenAIAgentsTracer. Two integration paths: post-hoc tracer.capture(result) parses RunResult.new_items (default, no global state); live tracer.run_hooks() + tracer.merge(hooks) uses isolated RunHooksBase buffers (no leakage across concurrent runs). Idempotent merge. Known SDK item types (CompactionItem, ToolApprovalItem, MCP / ComputerCall / CodeInterpreter / ToolSearch items) preserved as visible markers rather than silently dropped.
Both templates ship 5 cases including negative trajectories (already-refunded, not-found, processing). New ToolCallAccuracy(penalize_unexpected=True) makes the negatives actually fail when the agent over-calls.
Pyproject extras: [langgraph], [openai-agents]. README "Pick your path" table extended.
Beginner-friendly onboarding pass (D15 from OSS-adoption audit):
README quickstart flipped to init -t quickstart (offline, no API key) instead of init -t rag (needed key). New "Pick your path" table makes the right entry obvious.
Agent template (init -t agent) now runs OFFLINE by default with deterministic ToolCallAccuracy. LLM-judge evaluators (ToolArgumentAccuracy, TrajectoryEfficiency, TaskCompletion) auto-activate when ANTHROPIC_API_KEY / OPENAI_API_KEY / local Ollama is detected. Previously: silent 0-scores when no key → looked like the agent failed.
JudgeUnavailable carries a setup hint when the underlying exception is auth- or connection-shaped. Concrete next steps: export ANTHROPIC_API_KEY=..., ollama pull, or init -t quickstart. Generic API errors (BadRequest, APIError, prompt-too-long, invalid model id) get clean messages without the hint — real bugs aren't drowned in setup advice.
AgentTracer.format_trace() + print_trace() for agent debugging: pretty-print a captured list[AgentStep] from a notebook or CLI without reaching into the dataclasses.
Public accessors: EvalSuite.evaluators, EvalSuite.cases, CheckEvaluator.criterion. Notebooks no longer teach _evaluators / _criterion private internals.
Local Ollama probe added to _auto_judge() in the agent, regulated, and conversation templates so the README's "no API key needed (Ollama works)" claim is honored everywhere.
Quickstart notebook version pin bumped to >=0.7.0 (was stale >=0.6.1).
multivon-eval compare baseline.json proposal.json — answer "did my prompt change help?" in one command. Pairs cases by case_input (sequential within duplicates), reports pass-rate / avg-score / errors / flaky deltas, per-case regressions and improvements, and a McNemar p-value over paired cases (None when no valid pairs). SKIPPED on either side is excluded from direction + McNemar so a not-evaluated case isn't falsely scored as a regression. CLI: --regressions-only, --markdown (PR-comment format), --json, --fail-on-regression (CI gate). Python: compare_reports(), EvalReport.compare(other), ReportDiff, CaseDiff.
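The notes do not say which McNemar variant compare uses. As one possibility, here is a sketch of the exact (binomial) two-sided form over paired cases; `mcnemar_exact` is a hypothetical name, and returning None when no pair changed direction is this sketch's own convention.

```python
from math import comb


def mcnemar_exact(regressions: int, improvements: int):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    regressions: cases that passed in the baseline but fail in the proposal.
    improvements: cases that failed in the baseline but pass in the proposal.
    """
    n = regressions + improvements
    if n == 0:
        return None  # nothing changed, so there is nothing to test
    k = min(regressions, improvements)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Cases that pass in both runs or fail in both runs do not enter the statistic; only the pairs that flipped carry information about whether the change helped.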
Changed (BREAKING — minor version bump)
EvalReport.pass_rate excludes error cases from the denominator. A run with 2 passed + 3 judge-error cases now reports pass_rate = 1.0 (2/2 evaluated), not 0.4 (2/5 total). Use EvalReport.errors to surface infrastructure problems independently. This is the headline behavior change.
EvalReport.avg_score excludes error cases from the average.
EvalReport.failed counts *quality failures only* (cases with EvalStatus.FAILED_QUALITY). Use EvalReport.errors for the rest.
EvalReport.pass_rate_ci() uses evaluated as the denominator to match pass_rate. Pre-0.7.0 callers reading RunRecord.total for the z-test denominator should now read evaluated (legacy records default to total for backward compatibility).
CaseResult.passed is defined as status == EvalStatus.PASSED, so a case with no evaluator results or in any error state returns False even if individual EvalResult.passed values were True.
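The new denominator can be illustrated with a stand-in enum. `Status` and `summarize` are hypothetical and cover only three of the statuses; returning None when nothing was evaluated is an assumption of this sketch.

```python
from enum import Enum


class Status(Enum):  # stand-in mirroring a few EvalStatus values
    PASSED = "passed"
    FAILED_QUALITY = "failed_quality"
    JUDGE_ERROR = "judge_error"


def summarize(statuses):
    """Compute pass_rate over evaluated cases only; count errors separately."""
    evaluated = [s for s in statuses if s in (Status.PASSED, Status.FAILED_QUALITY)]
    errors = sum(1 for s in statuses if s is Status.JUDGE_ERROR)
    passed = sum(1 for s in evaluated if s is Status.PASSED)
    pass_rate = passed / len(evaluated) if evaluated else None
    return {"evaluated": len(evaluated), "errors": errors, "pass_rate": pass_rate}


# The example from the note: 2 passed + 3 judge errors.
summary = summarize([Status.PASSED] * 2 + [Status.JUDGE_ERROR] * 3)
```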
Fixed
Carry-over from the 0.6.1 + 0.6.2 patch series (which never reached PyPI; all changes are part of 0.7.0):
import multivon_eval no longer requires pytest. The pytest plugin import is guarded; users who don't have pytest installed get a clear ImportError only when they actually call assert_evaluators().
All 4 QAG-based agent evaluators (PlanQuality, TaskCompletion, TrajectoryEfficiency, AgentMemoryEval) now pass judge to _qag_eval. Previously raised TypeError on every real invocation.
All 4 conversation evaluators — same _qag_eval fix.
Contains.match_any — added as a keyword-only argument so Contains([...], False, 0.75) keeps 0.75 as threshold.
WordCount(min=, max=) alias kwargs.
audit-package CLI bundles the calibration version actually in use (v2.json preferred over v1.json).
Notebook auto-detects judge from ANTHROPIC_API_KEY / OPENAI_API_KEY env vars rather than hard-coding local Ollama. Colab now works without setup.
TrajectoryEfficiency recovery scoring uses the per-evaluator judge instead of the global default.
run_on_cases() applies the same per-evaluator isolation as the live run path.
Calibration reconciliation. v2.json extends v1.json with new judges (gpt-5.5) but preserves v1 thresholds for every existing judge × evaluator combination. Eliminates the silent threshold drift between 0.5.x and 0.6.0.
Examples + notebooks repaired
examples/ci_eval.py — removed dead post-fail_threshold code, added JUnit XML output, distinct exit code 2 for infrastructure errors.
examples/basic_eval.py — simplified evaluator setup; added Levenshtein for short-string similarity.
examples/eu_ai_act_eval.py — tamper-detect demo now asserts the verifier raises (the contract); previously silently succeeded.
All examples/*.py — added if __name__ == "__main__" guards so importing them doesn't auto-run an LLM eval.
notebooks/agent_eval.ipynb — fixed cells 7 and 10 that referenced cr.trace.steps (never existed); now use cr.agent_trace directly.
All notebooks: install pins bumped to multivon-eval>=0.7.0.
Migration notes
Most callers don't need any code changes for 0.7.0. The behavior change is concentrated in EvalReport.pass_rate and .avg_score:
CI thresholds that gate on pass_rate become more lenient toward infrastructure failures: error cases no longer drag the metric down. What used to be a 60% pass rate (6 pass / 4 errors out of 10) is now pass_rate = 1.0 with errors = 4. If you want CI to fail on errors too, check report.errors == 0 explicitly.
Old:
import sys

report = suite.run(fn)
if report.pass_rate < 0.8:
    sys.exit(1)
New (recommended):
report = suite.run(fn)
if report.errors > 0:
    sys.exit(2)   # infrastructure problem — caller should retry
if report.pass_rate < 0.8:
    sys.exit(1)   # quality regression
The shipped multivon-eval init --template rag template uses this pattern.
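The arithmetic behind the 6 pass / 4 errors example can be reproduced with a few lines of plain Python. This sketch does not call the library: summarize and the outcome strings are hypothetical, and returning 0.0 when every case errors is an assumption rather than documented behavior.

```python
def summarize(outcomes):
    """outcomes: one of "pass", "fail", or "error" per case."""
    errors = outcomes.count("error")
    scored = [o for o in outcomes if o != "error"]
    passed = scored.count("pass")
    # Pre-0.7.0 style: errors stay in the denominator.
    old_rate = passed / len(outcomes) if outcomes else 0.0
    # 0.7.0 style: errors are counted separately and left out of the rate.
    new_rate = passed / len(scored) if scored else 0.0
    return {"old_pass_rate": old_rate, "pass_rate": new_rate, "errors": errors}


summary = summarize(["pass"] * 6 + ["error"] * 4)
print(summary)  # {'old_pass_rate': 0.6, 'pass_rate': 1.0, 'errors': 4}
```

A gate that only checks pass_rate < 0.8 would fail this run under the old calculation and pass it under the new one, which is why the errors check needs to be explicit.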
[0.6.x] — never published to PyPI
The 0.6.1 and 0.6.2 wheels were built but not published; their contents (bug fixes, init scaffolder, budget gates) ship as part of 0.7.0 above.
Full release notes →
v0.6.0·May 14, 2026

v0.6.0 — Enterprise wave + cost + pytest + HIPAA + audit-package + suite.lock

The big enterprise release. Closes every tier-1 readiness gap from the May 2026 strategy roadmap and ships three artifacts demanded by 8+ of 12 voices in the cross-model persona deliberation.
What's new since 0.5.0
Async + structured concurrency. Evaluator.aevaluate() + EvalSuite.run_async() gathers evaluators in parallel per case. Live OpenAI smoke: 2.9× faster on a 2-case suite, scales with N.
Judge-result cache. SQLite-backed, content-hashed on (provider, model, base_url, prompt, temperature, max_tokens, extra). Off by default; opt-in via JudgeConfig(cache=True) or MULTIVON_JUDGE_CACHE=1. Live smoke: 4,700× faster on cache hit. Graceful degradation if the cache file is corrupt or unwritable — falls through to the live judge with a one-time stderr warning.
Per-run cost + token accounting. Every judge call now reports tokens via a contextvar tracker. EvalReport.costs carries totals + per-(provider, model) breakdown + USD estimate from the shipped pricing catalog (Anthropic + OpenAI + on-prem free). Override per model with register_pricing(). Async-safe via contextvars.
Pytest plugin. from multivon_eval.pytest_plugin import assert_evaluators raises a rich EvaluatorFailure (subclass of AssertionError) on any evaluator fail. Adds --multivon-runs=N for flaky-test investigation and a per-session cost summary line. Auto-loaded via pytest entry-point.
multivon-eval audit-package CLI. Bundles audit log + calibration data + coverage report + README + verify.py + signed manifest into a single ZIP an auditor can drop into a SOC 2 / ISO 42001 / EU AI Act evidence package. The bundled verifier recomputes every SHA-256 and walks the hash chain end-to-end.
HIPAA framework + EvalSuite.hipaa_safe_harbor() factory. Maps PIIEvaluator(jurisdiction="hipaa") to 45 CFR §164.514(b)(2) Safe Harbor and §164.312(a) access control; quality evaluators to §164.312(b) audit controls; structure evaluators to §164.312(c) integrity. Administrative + physical safeguards + BAA listed as process controls.
SuiteLock content-addressed fingerprint. suite.lock() / suite.write_lock(path) / suite.verify_lock(saved). Hashes library version, every evaluator (class, threshold, prompt, judge config, calibration provenance), case count, case content. LockMismatch carries a structured .differences list for CI tooling. Catches the silent-prompt-drift problem that 11 of 12 personas flagged as the load-bearing audit hole.
Secrets resolver. Pluggable SecretsResolver protocol — EnvResolver default, ChainedResolver, StaticResolver. Plug Vault / AWS SM / GCP SM / Azure KV without taking a hard dep.
Calibration provenance. Hardcoded threshold dict → versioned _calibration_data/v1.json shipped as package data. Every entry: dataset hash, N, F1/P/R, judge model id, measurement date. Loadable via calibration_provenance() for audit reports.
On-prem judge path. JudgeConfig(provider="litellm", model="bedrock/...") for Bedrock/Vertex/Azure/Ollama via LiteLLM. The existing provider="openai" + base_url covers vLLM / TGI / LM Studio / Ollama OpenAI-compatible endpoints.
Per-case audit records. reporter.record(report, mode="case") writes one chained NDJSON record per case for EU AI Act Art. 12 decision-level logging.
External anchor callback. anchor_fn: Callable[[str], None] parameter on ComplianceReporter. Built-in github_actions_anchor ships the chain's tip hash to $GITHUB_OUTPUT so it can't be silently rewritten.
HTML compliance rollup. ComplianceHtmlReporter produces a self-contained HTML report — coverage table + chain integrity + per-evaluator breakdown + per-case detail. Auditor-attachable to evidence packages.
Structured exception hierarchy. MultivonError root + JudgeUnavailable, CalibrationMissing, EvaluatorPrereqMissing, CacheError, SecretsError, ComplianceError. CI can branch on exception class instead of string-matching.
Paragraph-accurate EU AI Act mappings. Previously everything pointed at Article 9. Now: Art. 9(2)(b) foreseeable misuse → Toxicity; Art. 10(2)(f-g) bias → Bias; Art. 10(5) personal data → PII; Art. 15(1) accuracy → Faithfulness/Hallucination/Relevance; Art. 15(2) robustness → NotEmpty/Schema/SelfConsistency. Process controls (Art. 11/12/13/14/15(4-5)) surfaced separately.
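The judge-result cache entry in the list above describes a SQLite-backed store keyed on a content hash of (provider, model, base_url, prompt, temperature, max_tokens, extra). A minimal sketch of that idea follows. The table layout, the JudgeCacheSketch class, and the fake judge are assumptions for illustration; the library's real schema and key encoding may differ.

```python
import hashlib
import json
import sqlite3


def cache_key(provider, model, base_url, prompt, temperature, max_tokens, extra=None):
    """Hash every field that can change the judge's answer."""
    payload = json.dumps(
        {
            "provider": provider, "model": model, "base_url": base_url,
            "prompt": prompt, "temperature": temperature,
            "max_tokens": max_tokens, "extra": extra or {},
        },
        sort_keys=True, separators=(",", ":"),  # canonical form, stable across runs
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCacheSketch:
    """Hypothetical cache wrapper; not the library's implementation."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS judge_cache (key TEXT PRIMARY KEY, response TEXT)"
        )

    def get_or_call(self, call_judge, **request):
        key = cache_key(**request)
        try:
            row = self.db.execute(
                "SELECT response FROM judge_cache WHERE key = ?", (key,)
            ).fetchone()
        except sqlite3.Error:
            row = None  # unreadable cache: fall through to the live judge
        if row is not None:
            return row[0]
        response = call_judge(**request)
        try:
            self.db.execute(
                "INSERT OR REPLACE INTO judge_cache (key, response) VALUES (?, ?)",
                (key, response),
            )
            self.db.commit()
        except sqlite3.Error:
            pass  # unwritable cache: keep the live result anyway
        return response


calls = []

def fake_judge(**request):
    # Stand-in for a real judge call; records each live invocation.
    calls.append(request["prompt"])
    return "score: 0.9"

cache = JudgeCacheSketch()
request = dict(provider="openai", model="judge-model", base_url=None,
               prompt="Is the answer grounded?", temperature=0.0, max_tokens=64)
first = cache.get_or_call(fake_judge, **request)
second = cache.get_or_call(fake_judge, **request)  # served from SQLite
```

Because temperature and max_tokens are part of the key, changing either one produces a cache miss instead of returning a stale judgment.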
Upgrade
pip install -U multivon-eval
import asyncio

from multivon_eval import EvalSuite, ComplianceReporter, JudgeConfig


async def main(cases, model_fn):
    # v0.6: async + cost-tracked + locked + audit-ready
    suite = EvalSuite.hipaa_safe_harbor()
    suite.add_cases(cases)
    suite.write_lock("suite.lock")  # for CI drift detection

    reporter = ComplianceReporter("./audit-logs", framework="hipaa")
    print(reporter.coverage(suite))  # gap report

    # run_async is awaited, so it has to run inside an async function
    report = await suite.run_async(model_fn, runs=5, concurrency=5)
    print(report.costs)              # total tokens + USD
    reporter.record(report, mode="case")  # Art. 12 / §164.312(b) per-case audit


# cases and model_fn are your own EvalCase list and model callable
asyncio.run(main(cases, model_fn))

# Build the auditor-attachable ZIP
# (or: multivon-eval audit-package --logs ./audit-logs --suite ... --framework hipaa --out evidence.zip)
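The write_lock call in the snippet above exists for CI drift detection. The general technique, a content-addressed fingerprint plus a structured list of differences, can be sketched without the library as follows. The functions build_lock and diff_locks, the lock layout, and the evaluator dictionaries are hypothetical; the actual SuiteLock format is not shown in these notes.

```python
import hashlib
import json


def fingerprint(obj):
    """SHA-256 over a canonical JSON encoding of obj."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_lock(library_version, evaluators, cases):
    return {
        "library_version": library_version,
        # One hash per evaluator so a mismatch can name the culprit.
        "evaluators": {e["name"]: fingerprint(e) for e in evaluators},
        "case_count": len(cases),
        "cases": fingerprint(cases),
    }


def diff_locks(saved, current):
    """Return the names of every field that changed between two locks."""
    differences = []
    for field in ("library_version", "case_count", "cases"):
        if saved[field] != current[field]:
            differences.append(field)
    names = sorted(set(saved["evaluators"]) | set(current["evaluators"]))
    for name in names:
        if saved["evaluators"].get(name) != current["evaluators"].get(name):
            differences.append(f"evaluators.{name}")
    return differences


cases = [{"input": "Q", "context": "C"}]
evaluators = [{"name": "faithfulness", "threshold": 0.7,
               "prompt": "Is the answer grounded in the context?"}]
saved = build_lock("0.6.0", evaluators, cases)

# Someone edits the judge prompt without touching anything else.
edited = [{**evaluators[0], "prompt": "Is the answer mostly grounded?"}]
current = build_lock("0.6.0", edited, cases)

print(diff_locks(saved, current))  # ['evaluators.faithfulness']
```

Hashing the prompt text along with the threshold is what lets a check like this catch prompt drift that a version pin alone would miss.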
Test counts
405 tests passing (up from 261 at v0.5.0). 21 new tests for compliance reporter v2, 16 for cost accounting, 7 for pytest plugin, 21 for HIPAA + audit-package, 20 for suite.lock, plus the v0.6 enterprise wave from earlier.
Codex review
codex review --base v0.5.0 on the final diff: GATE: PASS (zero P1 critical findings; two P2 findings caught + fixed before release; one P2 round-2 finding caught + fixed). Independent second-opinion review.
Breaking changes
None intentional. The audit record JSON gained chain_version and prev_hash, and the per-evaluator controls field is now a list instead of a string. Old records continue to verify via the legacy code path.
Docs
README: https://github.com/multivon-ai/multivon-eval#readme
Compliance guide: https://evaldocs.multivon.ai/guides/compliance
Benchmark: https://github.com/multivon-ai/eval-framework-benchmark
Full release notes →
v0.5.0·May 12, 2026

v0.5.0 — Compliance Pack v1

Compliance Pack v1. Hash-chained audit log, paragraph-accurate EU AI Act mappings, eu_ai_act_high_risk factory, and a coverage gap report. All local, all open-source.
What's new
EvalSuite.eu_ai_act_high_risk(...) factory — wires the standard high-risk evaluator set (NotEmpty, Faithfulness, Hallucination, Relevance, Toxicity, Bias, PIIEvaluator + optional SchemaEvaluator) with calibrated thresholds.
Hash-chained ComplianceReporter — every record links to the previous record's record_hash via prev_hash. Mid-log deletion is now detectable (the failure mode the previous per-record hashing silently allowed). verify() reports OK / TAMPERED / CHAIN BROKEN.
Paragraph-accurate EU AI Act mappings. Earlier versions pinned every measurable control to Article 9. The correct attribution:
toxicity → Art. 9(2)(b) — Foreseeable misuse risk identification
bias → Art. 10(2)(f-g) — Examination and mitigation of possible biases
pii_detection → Art. 10(5) — Processing of personal data
Accuracy evaluators (faithfulness, hallucination, relevance, answer_accuracy, …) → Art. 15(1)
Robustness evaluators (not_empty, schema_compliance, self_consistency, latency, …) → Art. 15(2)
Process controls (Art. 11/12/13/14/15(4-5)) surfaced separately — they require organizational measures beyond evaluation.
NIST AI RMF subcategory mapping. Performance → MEASURE 2.3; robustness → MEASURE 2.5; safety → MEASURE 2.6; privacy → MEASURE 2.10; fairness → MEASURE 2.11.
reporter.coverage(suite) -> CoverageReport. Pre-flight gap analysis: which Articles your suite exercises, which it doesn't, which are process controls that can't be satisfied by evaluators alone. Returns a structured object with a pretty __str__ for printing.
21 new tests covering hash-chain integrity, in-place edit detection, mid-log deletion detection, article mapping correctness, coverage report, factory wiring.
Offline-runnable example at examples/eu_ai_act_eval.py.
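The hash-chained reporter entry above explains why linking each record to its predecessor makes mid-log deletion detectable. The sketch below shows the general technique using the field names prev_hash and record_hash from these notes. The record layout, the all-zero genesis value, and the helper functions are assumptions, not the ComplianceReporter implementation.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first record's prev_hash


def record_hash(body):
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append(log, payload):
    prev = log[-1]["record_hash"] if log else GENESIS
    body = {"payload": payload, "prev_hash": prev}
    log.append({**body, "record_hash": record_hash(body)})


def verify(log):
    prev = GENESIS
    for rec in log:
        body = {"payload": rec["payload"], "prev_hash": rec["prev_hash"]}
        if record_hash(body) != rec["record_hash"]:
            return "TAMPERED"       # record contents no longer match their hash
        if rec["prev_hash"] != prev:
            return "CHAIN BROKEN"   # a predecessor was removed or reordered
        prev = rec["record_hash"]
    return "OK"


log = []
for run in range(3):
    append(log, {"run": run, "pass_rate": 1.0})

deleted = [log[0], log[2]]                  # drop the middle record
edited = [dict(rec) for rec in log]
edited[1]["payload"] = {"run": 1, "pass_rate": 0.2}   # in-place edit

print(verify(log))      # OK
print(verify(deleted))  # CHAIN BROKEN
print(verify(edited))   # TAMPERED
```

With per-record hashing alone, the deleted log would still verify, because each surviving record is internally consistent. Note that this sketch by itself does not detect truncation of the newest records at the tail.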
Upgrade
pip install -U multivon-eval
from multivon_eval import EvalSuite, ComplianceReporter

suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr")
suite.add_cases(cases)

reporter = ComplianceReporter("./audit-logs", framework="eu-ai-act")
print(reporter.coverage(suite))      # gap report
report = suite.run(model_fn, runs=5)
reporter.record(report, tags={"system": "triage-bot"})
reporter.verify(suite.name)          # walks the hash chain
Breaking changes
None. The audit record JSON schema gained chain_version and prev_hash fields; the control string is now a controls list. Records written by previous versions continue to verify under their original (unchained) algorithm with an OK (legacy) status.
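Tooling that reads audit records written both before and after this release may want to normalize the two shapes. The following is a minimal sketch under assumptions: it treats control and controls as top-level keys of a record dictionary, and the example records (including the chain_version value of 1) are invented for illustration. The real record layout may nest these fields differently.

```python
def normalize_controls(record):
    """Return the controls as a list regardless of which schema wrote the record."""
    if "controls" in record:
        return list(record["controls"])
    if "control" in record:
        return [record["control"]]   # legacy single-string form
    return []


def is_chained(record):
    """Chained records carry both new fields; legacy records carry neither."""
    return "chain_version" in record and "prev_hash" in record


# Hypothetical example records, not real log output.
legacy_record = {"evaluator": "toxicity", "control": "Art. 9"}
chained_record = {
    "evaluator": "toxicity",
    "controls": ["Art. 9(2)(b)"],
    "chain_version": 1,
    "prev_hash": "0" * 64,
}

for rec in (legacy_record, chained_record):
    print(is_chained(rec), normalize_controls(rec))
```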
Docs
Guide: https://evaldocs.multivon.ai/guides/compliance
README: https://github.com/multivon-ai/multivon-eval#readme
Full release notes →
Install with pip install multivon-eval or read the docs.