For AI coding agents

Give your AI coding agent direct access to evaluation tools.

multivon-mcp is an MCP server that exposes 22 evaluation tools: pdfhell runs, QAG-graded faithfulness, RAG retrieval quality, compliance and safety guardrails, multimodal scoring, custom rubrics, and agent-workflow utilities such as comparing two runs or synthesising eval cases from text. Drop it into Claude Desktop, Claude Code, Cursor, Cline, or OpenCode and the agent can call evals directly while it helps you build an LLM product.

Install

A bare install pulls in multivon-eval, pdfhell, the MCP SDK, and the three frontier provider SDKs. Supply your own API keys as environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY).
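Before starting the server, it can help to confirm which provider keys are actually visible in the environment. The following is a minimal sketch: the three variable names come from the note above, while the `PROVIDER_KEYS` mapping and the `configured_providers` helper are hypothetical and not part of multivon-mcp.

```python
import os

# Variable names are taken from the install note; the mapping and the
# helper below are illustrative only, not part of multivon-mcp.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key is present and non-blank."""
    env = os.environ if env is None else env
    return [
        name
        for name, var in PROVIDER_KEYS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    found = configured_providers()
    print("configured providers:", ", ".join(found) or "none")
```

Tools marked as not needing a key should be unaffected by an empty result; only the LLM-judge and vision tools depend on one of these variables being set.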

The 22 tools

Curated for the agent context window. The full 44-evaluator catalog is still available via eval_discover; the 22 tools below are the ones AI agents actually call mid-edit, grouped by category — faithfulness, RAG, agent, compliance, safety, multimodal, custom. Shipped in multivon-mcp 0.2.0.

| Tool | What it does | Needs key |
| --- | --- | --- |
| eval_discover | Returns the full machine-readable capability catalog (44 evaluators, 3 trap families, 2 suites with version hashes, calibration data, package versions). Call this first. | no |
| pdfhell_make | Generate one adversarial PDF + its answer key. Useful when the agent wants to inspect what a specific trap looks like before evaluating. | no |
| pdfhell_run | Run the pdfhell adversarial-PDF benchmark against a vision model. Returns pass rate, Wilson CI, per-trap breakdown, suite hash. | yes |
| eval_audit_pack | Build a hash-chained audit ZIP from a pdfhell run: manifest with SHA-256s, every PDF, every answer key, run JSON, JUnit XML, README. | no |
| eval_faithfulness | QAG-graded faithfulness: is a RAG output grounded in the retrieved context? Returns score + threshold + reason. | yes |
| eval_hallucination | QAG-graded hallucination detection: does an output contain content NOT in the context? Score 1.0 = no hallucination. | yes |
| eval_relevance | QAG-graded relevance: does the output actually address the user's question, or does it ramble? | yes |
| eval_answer_accuracy | QAG-graded semantic equivalence vs ground truth. Useful when string match is too strict (paraphrased correct answers). | yes |
| eval_context_precision | QAG-graded retrieval quality: is the retrieved context relevant to the question, or padded with noise? Returns 0-1 plus per-chunk breakdown. | yes |
| eval_context_recall | QAG-graded retrieval completeness: does the retrieved context contain everything needed to answer? Useful for finding silent retrieval gaps. | yes |
| eval_tool_call_accuracy | Deterministic: checks whether an agent called the right tool with the right arguments. No LLM judge needed; works without API keys. | no |
| eval_pii_detection | Detect PII leakage in model output: emails, phones, SSNs, credit cards, HIPAA identifiers. Regex-only, no LLM judge, runs offline. | no |
| eval_schema_compliance | Validate structured output against a JSON Schema. Catches missing fields, wrong types, extra keys. Useful for tool-calling agents. | no |
| eval_toxicity | QAG-graded toxicity check: does the output contain harmful, hateful, or threatening content? Score 1.0 = clean. | yes |
| eval_bias | QAG-graded bias detection: does the output reflect gender, racial, age, or other demographic bias? Score 1.0 = no detected bias. | yes |
| eval_vqa_faithfulness | Visual QA faithfulness: given an image (path/URL/base64) and an answer, is the answer grounded in the image? Image-grounded QAG decomposition. | yes |
| eval_document_grounding | Document-grounded QA scoring: multi-page image-grounded faithfulness for document AI workflows. Pairs naturally with pdfhell. | yes |
| eval_g_eval | G-Eval: chain-of-thought rubric scoring against an arbitrary criterion string. The judge generates its own scoring steps before scoring. | yes |
| eval_custom_rubric | Run a list of yes/no questions defined by you against the output, pinned for reproducibility. Useful when you want full control over what gets checked. | yes |
| eval_compare_runs | Diff two saved eval-report JSONs: pass-rate delta, per-case regressions and improvements, McNemar p-value over paired cases. The tool an agent calls when it just shipped a fix and wants to know if anything got worse. | no |
| eval_generate_cases | Synthesise n test cases from arbitrary text (FAQs, docs, transcripts). Returns input/expected_output/context tuples ready for an EvalSuite. Useful when an agent is bootstrapping a fresh eval suite from a knowledge base. | yes |
| eval_ingest_trace | Parse a runtime agent trace (LangGraph, OpenAI Agents SDK, or canonical manual format) into an EvalCase. Lets an agent score its own trajectory after a real execution. | no |
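Two of the statistics named in the table, the Wilson confidence interval reported by pdfhell_run and the McNemar p-value reported by eval_compare_runs, are standard and easy to reproduce by hand. The sketch below uses the textbook formulas; it is an assumption that the tools compute them this way (for instance, the source does not say whether the McNemar test is the exact binomial form used here or the chi-squared approximation), and both function names are hypothetical.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def mcnemar_exact_p(regressions, improvements):
    """Two-sided exact McNemar p-value from discordant pair counts.

    regressions:  cases that passed in run A but failed in run B
    improvements: cases that failed in run A but passed in run B
    Cases with the same outcome in both runs do not enter the test.
    """
    n = regressions + improvements
    if n == 0:
        return 1.0  # the runs agree on every paired case
    k = min(regressions, improvements)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    low, high = wilson_interval(8, 10)
    print(f"8/10 passed, Wilson 95% CI: [{low:.3f}, {high:.3f}]")
    print(f"McNemar p (1 regression, 9 improvements): {mcnemar_exact_p(1, 9):.4f}")
```

The Wilson interval is a sensible choice for small suites because, unlike the plain normal approximation, it stays inside [0, 1] and does not collapse to zero width when every case passes.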

Provider compatibility

What works with what today. PII detection is regex-only and runs everywhere. LLM-judge tools (faithfulness, hallucination, relevance, etc.) and the pdfhell vision benchmark currently support three frontier providers natively; OpenRouter + local Ollama adapters are queued for the next release.

Provider | LLM judge | PII | PDF Hell | Multimodal
Anthropic
OpenAI
Google
OpenRouter
Local (Ollama)
Legend: Wired and tested | Queued for next release | Not applicable
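PII detection runs everywhere because a regex pass needs no provider at all. The sketch below shows the general idea only: the three patterns are deliberately simplified illustrations, not the patterns eval_pii_detection ships with, and `find_pii` is a hypothetical helper.

```python
import re

# Simplified, illustrative patterns. A production detector would need
# broader coverage (credit cards, international phone formats, and so on).
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
}


def find_pii(text):
    """Return {kind: [matches]} for every pattern that hits in text."""
    hits = {}
    for kind, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[kind] = found
    return hits


if __name__ == "__main__":
    sample = "Contact jane@example.com or call 555-123-4567."
    print(find_pii(sample))
```

Because nothing here calls a model, a check of this kind is deterministic and can run offline in CI without any API key.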

Example session

After wiring up multivon-mcp in Claude Desktop / Cursor / Cline, this is what a real interaction looks like:

User:    I just shipped a RAG endpoint. Can you check it for hallucinations?

Agent:   I'll use multivon to evaluate it.
         [calls eval_discover to see what's available]
         [calls eval_faithfulness with your input/context/output]

         → score: 0.667 (passed: False), threshold: 0.9
           reason: 2/3 claims grounded
             ✓ "annual renewal" — supported by context
             ✓ "30-day notice" — supported by context
             ✗ "automatic upgrade" — NOT in context

Agent:   Your RAG hallucinated the "automatic upgrade" detail. The context
         doesn't mention upgrades. I'd add a Hallucination evaluator to your
         CI gate, threshold ≥0.85, and re-prompt with explicit "only use facts
         from context" instructions.
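The numbers in the session follow from simple arithmetic: two of three extracted claims are grounded, so the score is 2/3, which is below the 0.9 threshold. A minimal sketch of that aggregation step is shown here, assuming the score is the plain fraction of supported claims; the `ClaimVerdict` type and `qag_score` function are hypothetical, and the claim extraction and per-claim judging (the part that needs an LLM) are not shown.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True if the judge found the claim in the context


def qag_score(verdicts, threshold=0.9):
    """Aggregate per-claim verdicts into a score, pass flag, and reason."""
    if not verdicts:
        raise ValueError("need at least one claim to score")
    supported = sum(1 for v in verdicts if v.supported)
    score = supported / len(verdicts)
    return {
        "score": round(score, 3),
        "passed": score >= threshold,
        "threshold": threshold,
        "reason": f"{supported}/{len(verdicts)} claims grounded",
    }


if __name__ == "__main__":
    # Verdicts copied from the example session above.
    verdicts = [
        ClaimVerdict("annual renewal", True),
        ClaimVerdict("30-day notice", True),
        ClaimVerdict("automatic upgrade", False),
    ]
    print(qag_score(verdicts))
```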

Configure your agent

Three of the most common AI coding agents are shown below. Anything that speaks MCP works: point it at the multivon-mcp binary.

Claude Desktop

~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "multivon": {
      "command": "multivon-mcp",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-proj-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}

Cursor

Settings → MCP, or .cursor/mcp.json
{
  "mcpServers": {
    "multivon": { "command": "multivon-mcp" }
  }
}

Cline / OpenCode / any MCP-compatible agent

Same JSON shape — point at the multivon-mcp console script
{ "mcpServers": { "multivon": { "command": "multivon-mcp" } } }
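Since every client above uses the same `mcpServers` shape, the entry can also be added with a short script rather than by hand. This is a sketch under stated assumptions: `add_multivon_server` is a hypothetical helper, it assumes any existing config file contains valid JSON, and it replaces an existing "multivon" entry while leaving other servers untouched.

```python
import json
from pathlib import Path


def add_multivon_server(config_path, env=None):
    """Merge a "multivon" entry into an MCP client config file.

    Other entries under "mcpServers" are preserved; an existing
    "multivon" entry is overwritten. Returns the resulting config dict.
    """
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    entry = {"command": "multivon-mcp"}
    if env:
        entry["env"] = dict(env)  # e.g. provider API keys
    config.setdefault("mcpServers", {})["multivon"] = entry
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    # Project-local Cursor config, as named in the Cursor section above.
    add_multivon_server(".cursor/mcp.json")
```

Passing `env` mirrors the Claude Desktop example, where the keys live in the config file; omitting it mirrors the Cursor example, where the server inherits keys from the surrounding environment.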

Roadmap

The current 22 tools cover the common agent-eval moments — RAG checks, agent traces, adversarial PDFs, audit packs, compliance/safety guardrails, multimodal grounding. Next, in priority order:

  • Synthetic eval generation as MCP tools: eval.generate_cases(from_docs=...), eval.adversarial_prompts(target=...). The agent generates its own evals on the fly.
  • Runtime trace ingestion: eval.ingest_trace(trajectory=...) for agents using LangGraph / OpenAI Agents SDK / Anthropic Tools.
  • Continuous production evals — sample + score live production traffic, surface regressions, file Linear / GitHub issues.
  • Multi-agent simulation — pit two agents against each other and have evaluators score the conversation.
  • Custom-trap plugin loading — load domain-specific pdfhell trap families at runtime so an agent can add a new failure mode without forking the repo.

File a GitHub issue at multivon-ai/multivon-mcp if you need any of these sooner — we’ll prioritize based on real demand.