Give your AI coding agent direct access to evaluation tools.

multivon-mcp is an MCP server that exposes 22 evaluation tools: pdfhell runs, QAG-graded faithfulness, RAG retrieval quality, compliance and safety guardrails, multimodal scoring, custom rubrics, and agent-workflow utilities such as comparing two runs or synthesising eval cases from text. Drop it into Claude Desktop, Claude Code, Cursor, Cline, or OpenCode and the agent can call evals directly while it helps you build an LLM product.

Install

A bare install pulls in multivon-eval, pdfhell, the MCP SDK, and the three frontier provider SDKs. Supply your own API keys through environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY).
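As a quick pre-flight check, a sketch like the following reports which judge providers have a key set. The environment variable names come from the text above; the `configured_providers` helper and the provider labels are illustrative assumptions and not part of multivon-mcp.

```python
import os
from typing import Mapping

# Environment variable names as listed above; the provider labels are illustrative.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def configured_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the providers whose API key is present and non-empty in env."""
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    found = configured_providers()
    print("Providers with a key set:", ", ".join(found) or "none")
```

Tools marked "no" in the "Needs key" column of the table below should not depend on this check, since they are described as running without an LLM judge.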

The 22 tools

Curated for the agent context window. The full 44-evaluator catalog is still available via eval_discover; the 22 tools below are the ones AI agents actually call mid-edit, grouped by category: faithfulness, RAG, agent, compliance, safety, multimodal, custom. All of them ship in the current multivon-mcp release.

Tool	Category	What it does	Needs key
eval_discover	Discovery	Returns the full machine-readable capability catalog (44 evaluators, 17 trap families, 6 suites with version hashes, calibration data, package versions). Call this first.	no
pdfhell_make	PDF Hell	Generate one adversarial PDF + its answer key. Useful when the agent wants to inspect what a specific trap looks like before evaluating.	no
pdfhell_run		Run the pdfhell adversarial-PDF benchmark against a vision model. Returns pass rate, Wilson CI, per-trap breakdown, suite hash.	yes
eval_audit_pack		Build a hash-chained audit ZIP from a pdfhell run — manifest with SHA-256s, every PDF, every answer key, run JSON, JUnit XML, README.	no
eval_faithfulness	Faithfulness	QAG-graded faithfulness — is a RAG output grounded in the retrieved context? Returns score + threshold + reason.	yes
eval_hallucination		QAG-graded hallucination detection — does an output contain content NOT in the context? Score 1.0 = no hallucination.	yes
eval_relevance		QAG-graded relevance — does the output actually address the user's question, or does it ramble?	yes
eval_answer_accuracy		QAG-graded semantic equivalence vs ground truth. Useful when string match is too strict (paraphrased correct answers).	yes
eval_context_precision	RAG retrieval	QAG-graded retrieval quality — is the retrieved context relevant to the question, or padded with noise? Returns 0-1 plus per-chunk breakdown.	yes
eval_context_recall		QAG-graded retrieval completeness — does the retrieved context contain everything needed to answer? Useful for finding silent retrieval gaps.	yes
eval_tool_call_accuracy	Agent	Deterministic — checks whether an agent called the right tool with the right arguments. No LLM judge needed; works without API keys.	no
eval_pii_detection	Compliance	Detect PII leakage in model output — emails, phones, SSNs, credit cards, HIPAA identifiers. Regex-only, no LLM judge, runs offline.	no
eval_schema_compliance		Validate structured output against a JSON Schema. Catches missing fields, wrong types, extra keys. Useful for tool-calling agents.	no
eval_toxicity	Safety	QAG-graded toxicity check — does the output contain harmful, hateful, or threatening content? Score 1.0 = clean.	yes
eval_bias		QAG-graded bias detection — does the output reflect gender, racial, age, or other demographic bias? Score 1.0 = no detected bias.	yes
eval_vqa_faithfulness	Multimodal	Visual QA faithfulness — given an image (path/URL/base64) and an answer, is the answer grounded in the image? Image-grounded QAG decomposition.	yes
eval_document_grounding		Document-grounded QA scoring — multi-page image-grounded faithfulness for document AI workflows. Pairs naturally with pdfhell.	yes
eval_g_eval	Custom rubrics	G-Eval — chain-of-thought rubric scoring against an arbitrary criterion string. The judge generates its own scoring steps before scoring.	yes
eval_custom_rubric		Run a list of yes/no questions defined by you against the output, pinned for reproducibility. Useful when you want full control over what gets checked.	yes
eval_compare_runs	Agent workflows	Diff two saved eval-report JSONs — pass-rate delta, per-case regressions and improvements, McNemar p-value over paired cases. The tool an agent calls when it just shipped a fix and wants to know if anything got worse.	no
eval_generate_cases		Synthesise n test cases from arbitrary text (FAQs, docs, transcripts). Returns input/expected_output/context tuples ready for an EvalSuite. Useful when an agent is bootstrapping a fresh eval suite from a knowledge base.	yes
eval_ingest_trace		Parse a runtime agent trace (LangGraph, OpenAI Agents SDK, or canonical manual format) into an EvalCase. Lets an agent score its own trajectory after a real execution.	no
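Two statistics named in the table, the Wilson CI returned by pdfhell_run and the McNemar p-value returned by eval_compare_runs, can be sketched in a few lines. This is a minimal illustration of the textbook formulas, not multivon's implementation: the function names are hypothetical, the 95% level (z = 1.96) is an assumption, and the source does not say whether the tool uses the exact binomial form of McNemar's test (shown here) or the chi-square approximation.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def mcnemar_exact(regressions: int, improvements: int) -> float:
    """Two-sided exact McNemar p-value over paired cases.

    Only discordant pairs matter: cases that passed before and fail now
    (regressions) and cases that failed before and pass now (improvements).
    """
    n = regressions + improvements
    if n == 0:
        return 1.0  # no case changed outcome, so there is no evidence of a difference
    k = min(regressions, improvements)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    low, high = wilson_interval(passes=8, total=10)
    print(f"pass rate 0.80, Wilson interval [{low:.3f}, {high:.3f}]")
    print(f"McNemar p-value: {mcnemar_exact(regressions=1, improvements=5):.4f}")
```

The Wilson interval behaves better than the naive normal approximation on small suites and on pass rates near 0 or 1, which is the usual reason to report it for benchmark runs with a few dozen cases.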

Provider compatibility

What works with what today. PII detection is regex-only and runs everywhere. LLM-judge tools (faithfulness, hallucination, relevance, etc.) support the three frontier providers plus OpenRouter and local Ollama as first-class judges (JudgeConfig(base_url=…)). The pdfhell vision benchmark still ships only the three frontier vision adapters — open-weights vision is queued for the next release.

Provider	Example models	LLM judge	PII	PDF Hell	Multimodal	Notes
Anthropic	Claude Sonnet 4.6, Haiku 4.5	✓	✓	✓	✓	Default judge
OpenAI	GPT-4o, GPT-5.x	✓	✓	✓	✓	GPT-5.x uses reasoning tokens
Google	Gemini 2.5/3.x, Flash + Pro	✓	✓	✓	✓	Cheapest judge ($0.001/case on Flash)
OpenRouter	Llama, Qwen, Mistral via OAI-compat	✓	✓	○	○	LLM-judge via OAI-compat; vision adapter still pending
Local (Ollama)	Llama 3.2 Vision, Qwen2-VL	✓	✓	○	○	Ollama as a first-class judge — JudgeConfig(base_url=…); vision adapter pending

✓ Wired and tested
○ Queued for next release
— Not applicable

Example session

After wiring up multivon-mcp in Claude Desktop / Cursor / Cline, this is what a real interaction looks like:

User:    I just shipped a RAG endpoint. Can you check it for hallucinations?

Agent:   I'll use multivon to evaluate it.
         [calls eval_discover to see what's available]
         [calls eval_faithfulness with your input/context/output]

         → score: 0.667 (passed: False), threshold: 0.9
           reason: 2/3 claims grounded
             ✓ "annual renewal" — supported by context
             ✓ "30-day notice" — supported by context
             ✗ "automatic upgrade" — NOT in context

Agent:   Your RAG hallucinated the "automatic upgrade" detail. The context
         doesn't mention upgrades. I'd add a Hallucination evaluator to your
         CI gate, threshold ≥0.85, and re-prompt with explicit "only use facts
         from context" instructions.
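The score in the session above is consistent with a simple ratio of supported claims to total claims (2/3 ≈ 0.667), compared against the threshold. The sketch below reproduces that arithmetic. It assumes this aggregation rule; the `ClaimVerdict` type and the `gate` function are hypothetical names for illustration, and the real tool relies on an LLM judge to produce the per-claim verdicts.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # in the real tool this verdict would come from the judge


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of claims that are supported by the context."""
    if not verdicts:
        raise ValueError("at least one claim is required")
    return sum(v.supported for v in verdicts) / len(verdicts)


def gate(verdicts: list[ClaimVerdict], threshold: float = 0.9) -> dict:
    """Summarise a run: rounded score, pass/fail, and the unsupported claims."""
    score = faithfulness_score(verdicts)
    return {
        "score": round(score, 3),
        "passed": score >= threshold,
        "unsupported": [v.claim for v in verdicts if not v.supported],
    }


if __name__ == "__main__":
    session = [
        ClaimVerdict("annual renewal", True),
        ClaimVerdict("30-day notice", True),
        ClaimVerdict("automatic upgrade", False),
    ]
    print(gate(session))
```

With the three claims from the session, this yields a score of 0.667 and a failed gate at threshold 0.9, with "automatic upgrade" listed as the unsupported claim.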

Configure your agent

Three of the most common AI coding agents are shown below. Anything that speaks MCP works: point it at the multivon-mcp binary.

Claude Desktop

~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "multivon": {
      "command": "multivon-mcp",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-proj-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}
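If the config file already lists other MCP servers, the multivon entry needs to be merged in rather than pasted over the whole file. A minimal sketch of that merge follows; the helper names are hypothetical, and only the "mcpServers" / "command" / "env" shape is taken from the JSON above.

```python
import json
from copy import deepcopy
from pathlib import Path
from typing import Optional


def with_mcp_server(config: dict, name: str, command: str,
                    env: Optional[dict] = None) -> dict:
    """Return a copy of config with one server added under "mcpServers".

    Existing servers are preserved; an entry with the same name is replaced.
    """
    updated = deepcopy(config)
    entry: dict = {"command": command}
    if env:
        entry["env"] = dict(env)
    updated.setdefault("mcpServers", {})[name] = entry
    return updated


def update_config_file(path: Path, name: str, command: str,
                       env: Optional[dict] = None) -> None:
    """Read the JSON config at path (if present), merge the entry, write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    merged = with_mcp_server(config, name, command, env)
    path.write_text(json.dumps(merged, indent=2) + "\n")
```

A call such as `update_config_file(Path("~/Library/Application Support/Claude/claude_desktop_config.json").expanduser(), "multivon", "multivon-mcp", {"ANTHROPIC_API_KEY": "sk-ant-..."})` would produce the shape shown above while leaving other servers in place. Back up the file first, since the sketch rewrites it in full.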

Cursor

Settings → MCP, or .cursor/mcp.json

{
  "mcpServers": {
    "multivon": { "command": "multivon-mcp" }
  }
}

Cline / OpenCode / any MCP-compatible agent

Same JSON shape: point it at the multivon-mcp console script.

{ "mcpServers": { "multivon": { "command": "multivon-mcp" } } }

Roadmap

The current 22 tools cover the common agent-eval moments — RAG checks, agent traces, adversarial PDFs, audit packs, compliance/safety guardrails, multimodal grounding. Next, in priority order:

● Adversarial prompt generation as an MCP tool: eval.adversarial_prompts(target=...), exposing the SDK’s generate_adversarial_cases failure-mode targeting to agents. (Case synthesis and trace ingestion have already shipped as eval_generate_cases and eval_ingest_trace in the table above.)
● Continuous production evals, which sample and score live production traffic, surface regressions, and file Linear / GitHub issues.
● Multi-agent simulation: pit two agents against each other and have evaluators score the conversation.
● Runtime loading of domain-specific pdfhell trap families (custom-trap plugins), so an agent can add a new failure mode without forking the repo.

File a GitHub issue at multivon-ai/multivon-mcp if you need any of these sooner — we’ll prioritize based on real demand.
