For AI coding agents
Give your AI coding agent direct access to evaluation tools.
multivon-mcp is an MCP server that exposes 22 evaluation tools: pdfhell runs, QAG-graded faithfulness, RAG retrieval quality, compliance and safety guardrails, multimodal scoring, custom rubrics, and agent-workflow utilities such as comparing two runs or synthesising eval cases from text. Drop it into Claude Desktop, Claude Code, Cursor, Cline, or OpenCode, and the agent calls evals directly while it helps you build an LLM product.
Install
A bare install pulls in multivon-eval, pdfhell, the MCP SDK, and the three frontier provider SDKs. Supply your own API keys as environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY).
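Before launching the server, it can help to check which of those keys are actually set. The snippet below is a minimal preflight sketch, not part of multivon-mcp: `PROVIDER_KEYS` and `configured_providers` are hypothetical names, and it assumes a single configured key is enough for the LLM-judge tools, which the text implies but does not state.

```python
import os
from typing import List, Mapping

# Env var names are the ones listed above; the provider labels are display names only.
PROVIDER_KEYS = {
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GOOGLE_API_KEY",
}


def configured_providers(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the providers whose API key is set to a non-empty value."""
    return [
        name
        for name, var in PROVIDER_KEYS.items()
        if environ.get(var, "").strip()
    ]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) or "none (only key-free tools will work)")
```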
The 22 tools
Curated for the agent context window. The full 44-evaluator catalog is still available via eval_discover; the 22 tools below are the ones AI agents actually call mid-edit, grouped by category — faithfulness, RAG, agent, compliance, safety, multimodal, custom. Shipped in multivon-mcp 0.2.0.
| Tool | What it does | Needs key |
|---|---|---|
| eval_discover | Returns the full machine-readable capability catalog (44 evaluators, 3 trap families, 2 suites with version hashes, calibration data, package versions). Call this first. | no |
| pdfhell_make | Generate one adversarial PDF + its answer key. Useful when the agent wants to inspect what a specific trap looks like before evaluating. | no |
| pdfhell_run | Run the pdfhell adversarial-PDF benchmark against a vision model. Returns pass rate, Wilson CI, per-trap breakdown, suite hash. | yes |
| eval_audit_pack | Build a hash-chained audit ZIP from a pdfhell run — manifest with SHA-256s, every PDF, every answer key, run JSON, JUnit XML, README. | no |
| eval_faithfulness | QAG-graded faithfulness — is a RAG output grounded in the retrieved context? Returns score + threshold + reason. | yes |
| eval_hallucination | QAG-graded hallucination detection — does an output contain content NOT in the context? Score 1.0 = no hallucination. | yes |
| eval_relevance | QAG-graded relevance — does the output actually address the user's question, or does it ramble? | yes |
| eval_answer_accuracy | QAG-graded semantic equivalence vs ground truth. Useful when string match is too strict (paraphrased correct answers). | yes |
| eval_context_precision | QAG-graded retrieval quality — is the retrieved context relevant to the question, or padded with noise? Returns 0-1 plus per-chunk breakdown. | yes |
| eval_context_recall | QAG-graded retrieval completeness — does the retrieved context contain everything needed to answer? Useful for finding silent retrieval gaps. | yes |
| eval_tool_call_accuracy | Deterministic — checks whether an agent called the right tool with the right arguments. No LLM judge needed; works without API keys. | no |
| eval_pii_detection | Detect PII leakage in model output — emails, phones, SSNs, credit cards, HIPAA identifiers. Regex-only, no LLM judge, runs offline. | no |
| eval_schema_compliance | Validate structured output against a JSON Schema. Catches missing fields, wrong types, extra keys. Useful for tool-calling agents. | no |
| eval_toxicity | QAG-graded toxicity check — does the output contain harmful, hateful, or threatening content? Score 1.0 = clean. | yes |
| eval_bias | QAG-graded bias detection — does the output reflect gender, racial, age, or other demographic bias? Score 1.0 = no detected bias. | yes |
| eval_vqa_faithfulness | Visual QA faithfulness — given an image (path/URL/base64) and an answer, is the answer grounded in the image? Image-grounded QAG decomposition. | yes |
| eval_document_grounding | Document-grounded QA scoring — multi-page image-grounded faithfulness for document AI workflows. Pairs naturally with pdfhell. | yes |
| eval_g_eval | G-Eval — chain-of-thought rubric scoring against an arbitrary criterion string. The judge generates its own scoring steps before scoring. | yes |
| eval_custom_rubric | Run a list of yes/no questions defined by you against the output, pinned for reproducibility. Useful when you want full control over what gets checked. | yes |
| eval_compare_runs | Diff two saved eval-report JSONs — pass-rate delta, per-case regressions and improvements, McNemar p-value over paired cases. The tool an agent calls when it just shipped a fix and wants to know if anything got worse. | no |
| eval_generate_cases | Synthesise n test cases from arbitrary text (FAQs, docs, transcripts). Returns input/expected_output/context tuples ready for an EvalSuite. Useful when an agent is bootstrapping a fresh eval suite from a knowledge base. | yes |
| eval_ingest_trace | Parse a runtime agent trace (LangGraph, OpenAI Agents SDK, or canonical manual format) into an EvalCase. Lets an agent score its own trajectory after a real execution. | no |
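Two statistics named in the table, the Wilson CI returned by pdfhell_run and the McNemar p-value returned by eval_compare_runs, are easy to reproduce by hand when you want to sanity-check a report. The sketch below uses the textbook formulas. It assumes a 95% interval (z = 1.96) and the exact binomial form of McNemar's test; the page does not say which variants multivon uses, so treat the function names and defaults as illustrative.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate of passes/n (z=1.96 is about 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def mcnemar_exact_p(fixed: int, broken: int) -> float:
    """Two-sided exact McNemar p-value over paired cases.

    fixed:  cases that failed in run A and passed in run B
    broken: cases that passed in run A and failed in run B
    Cases with the same outcome in both runs do not enter the test.
    """
    n = fixed + broken
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(fixed, broken)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(wilson_interval(8, 10))   # roughly (0.49, 0.94)
    print(mcnemar_exact_p(6, 1))    # 0.125
```

With 8 of 10 cases passing the interval is wide, and a 6-to-1 split of discordant cases is not significant at the usual 0.05 level, which is why small suites rarely settle whether a fix helped.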
Provider compatibility
What works with what today (✓ supported, ○ planned). PII detection is regex-only and runs everywhere. LLM-judge tools (faithfulness, hallucination, relevance, etc.) and the pdfhell vision benchmark currently support three frontier providers natively; OpenRouter and local Ollama adapters are queued for the next release.
| Provider | LLM judge | PII | PDF Hell | Multimodal |
|---|---|---|---|---|
| Anthropic | ✓ | ✓ | ✓ | ✓ |
| OpenAI | ✓ | ✓ | ✓ | ✓ |
| Google | ✓ | ✓ | ✓ | ✓ |
| OpenRouter | ○ | ✓ | ○ | ○ |
| Local (Ollama) | ○ | ✓ | ○ | ○ |
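The PII column is supported everywhere because regex matching needs no model call. The sketch below illustrates that idea only: the three patterns and the `find_pii` helper are hypothetical and deliberately simplistic, not the patterns eval_pii_detection ships, and they cover far fewer identifier types than the tool description lists.

```python
import re
from typing import Dict, List

# Illustrative patterns only; NOT the ones multivon ships.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return all matches per category; an empty dict means nothing matched."""
    hits = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


if __name__ == "__main__":
    sample = "Reach Jane at jane.doe@example.com or 555-867-5309; SSN 123-45-6789."
    for category, matches in find_pii(sample).items():
        print(category, matches)
```

Because everything here is local pattern matching, the same check runs offline and gives identical results regardless of which provider key, if any, is configured.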
Example session
After wiring up multivon-mcp in Claude Desktop / Cursor / Cline, this is what a real interaction looks like:
User: I just shipped a RAG endpoint. Can you check it for hallucinations?
Agent: I'll use multivon to evaluate it.
[calls eval_discover to see what's available]
[calls eval_faithfulness with your input/context/output]
→ score: 0.667 (passed: False), threshold: 0.9
reason: 2/3 claims grounded
✓ "annual renewal" — supported by context
✓ "30-day notice" — supported by context
✗ "automatic upgrade" — NOT in context
Agent: Your RAG hallucinated the "automatic upgrade" detail. The context
doesn't mention upgrades. I'd add a Hallucination evaluator to your
CI gate, threshold ≥0.85, and re-prompt with explicit "only use facts
from context" instructions.
Configure your agent
Configuration for three of the most common AI coding agents is shown below. Anything that speaks MCP works: point it at the multivon-mcp binary.
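Every client in this section uses the same mcpServers JSON shape, so the entry can also be added with a short script instead of hand-editing. This is a minimal sketch under stated assumptions: `add_multivon_server` is a hypothetical helper, not something multivon-mcp provides, and it assumes the target file is plain JSON with a top-level mcpServers object, as in the configs on this page.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def add_multivon_server(config_path: Path, env: Optional[Dict[str, str]] = None) -> dict:
    """Add a "multivon" entry under mcpServers, keeping any other servers."""
    text = config_path.read_text() if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}

    entry: dict = {"command": "multivon-mcp"}
    if env:
        entry["env"] = dict(env)  # e.g. {"ANTHROPIC_API_KEY": "..."}

    # setdefault keeps servers that are already configured in the file.
    config.setdefault("mcpServers", {})["multivon"] = entry

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    # Project-level Cursor config; adjust the path for other clients.
    print(add_multivon_server(Path(".cursor/mcp.json")))
```

The helper overwrites an existing "multivon" entry but leaves every other server untouched, which is usually what you want when re-running it after rotating a key.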
Claude Desktop
~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "multivon": {
      "command": "multivon-mcp",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-proj-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}
Cursor
Settings → MCP, or .cursor/mcp.json
{
  "mcpServers": {
    "multivon": { "command": "multivon-mcp" }
  }
}
Cline / OpenCode / any MCP-compatible agent
Same JSON shape: point at the multivon-mcp console script.
{ "mcpServers": { "multivon": { "command": "multivon-mcp" } } }
Roadmap
The current 22 tools cover the common agent-eval moments — RAG checks, agent traces, adversarial PDFs, audit packs, compliance/safety guardrails, multimodal grounding. Next, in priority order:
- Synthetic eval generation as MCP tools: eval.generate_cases(from_docs=...), eval.adversarial_prompts(target=...). The agent generates its own evals on the fly.
- Runtime trace ingestion: eval.ingest_trace(trajectory=...) for agents using LangGraph / OpenAI Agents SDK / Anthropic Tools.
- Continuous production evals: sample and score live production traffic, surface regressions, file Linear / GitHub issues.
- Multi-agent simulation: pit two agents against each other and have evaluators score the conversation.
- Custom-trap plugin loading: load domain-specific pdfhell trap families at runtime so an agent can add a new failure mode without forking the repo.
File a GitHub issue at multivon-ai/multivon-mcp if you need any of these sooner — we’ll prioritize based on real demand.