Insights
Practical thinking on AI evaluation, model quality, and shipping with confidence.
v2 benchmark: calibration transfers across datasets. multivon-eval F1 went up, not down.
Re-ran the head-to-head on RAGTruth, a dataset multivon-eval's calibration has never seen. At our HaluEval-derived threshold, F1 hit 0.787, above the in-distribution 0.63. That removes the v1 circularity caveat with real numbers.
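If you want to run the same check yourself, here is a minimal sketch of the fixed-threshold transfer test: binarize the judge's scores at a frozen, previously calibrated threshold and compute F1 on the new dataset. The arrays and the 0.5 threshold are illustrative placeholders, not the actual HaluEval-derived values.

```python
# Sketch: evaluate a frozen, previously calibrated threshold on an
# out-of-distribution dataset. Values below are toy placeholders.
from sklearn.metrics import f1_score

CALIBRATED_THRESHOLD = 0.5  # placeholder; the real value comes from HaluEval calibration

def f1_at_threshold(scores, labels, threshold=CALIBRATED_THRESHOLD):
    """Binarize continuous hallucination scores at a fixed threshold,
    then compute F1 against gold labels (1 = hallucination)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return f1_score(labels, preds)

# ood_scores / ood_labels would come from running the judge on RAGTruth.
ood_scores = [0.91, 0.12, 0.67, 0.33]
ood_labels = [1, 0, 1, 0]
print(f"OOD F1 at frozen threshold: {f1_at_threshold(ood_scores, ood_labels):.3f}")
```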
Three open-source eval frameworks disagree on 56% of cases. Here's what we found running our own benchmark.
Same judge, same dataset, same seed. multivon-eval, DeepEval, and RAGAS produce different verdicts on more than half the cases. The detection-prompt gap is small; the shipped-calibration gap is huge. Full repo, methodology, circularity disclosure included.
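For a sense of how a disagreement gets counted, here is one reasonable reading of the setup as a sketch: a case counts as a disagreement unless all three frameworks return the same verdict. The verdict lists below are toy placeholders, not the benchmark data.

```python
# Sketch: count a case as a disagreement unless all frameworks
# return the same verdict on it.
def disagreement_rate(*verdict_lists):
    """Fraction of cases where the frameworks do not unanimously agree."""
    cases = list(zip(*verdict_lists))
    disagreements = sum(1 for verdicts in cases if len(set(verdicts)) > 1)
    return disagreements / len(cases)

multivon = ["pass", "fail", "pass", "fail"]
deepeval = ["pass", "pass", "pass", "fail"]
ragas    = ["fail", "pass", "pass", "fail"]
print(f"disagreement: {disagreement_rate(multivon, deepeval, ragas):.0%}")
```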
Compliance-First LLM Evaluation: EU AI Act, NIST AI RMF, and Zero-Egress PII Detection
If you're evaluating LLMs in healthcare, finance, or legal, your eval pipeline is part of your compliance documentation. Here's what that actually means in practice.
DeepEval vs RAGAS vs multivon-eval: How They Actually Differ
Three open-source LLM evaluation frameworks, each with a genuinely different philosophy. What they're actually good at, where each falls short, and how to pick one.
Why LLM Evals Fail in Production (And What To Do About It)
Teams spend weeks tuning prompts, get green scores in the playground, then watch things fall apart in production. Here's why it keeps happening.
QAG vs LLM-as-Judge: Why We Score With Questions, Not Numbers
Asking a model to rate an output from 1 to 10 introduces its own hallucination risk. There's a more reliable way: generate yes/no questions and score by the fraction answered correctly.
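A minimal sketch of the QAG idea, assuming a hypothetical `ask(question, context) -> bool` judge call; in practice the question set is generated by a model from the reference answer, and `ask` is wired to your LLM client.

```python
# QAG sketch: instead of asking the judge for a numeric score, ask it a set
# of closed yes/no questions and score by the fraction answered as expected.
from typing import Callable

def qag_score(output: str,
              questions: list[tuple[str, bool]],
              ask: Callable[[str, str], bool]) -> float:
    """questions: (yes/no question, expected answer) pairs derived from the
    reference. Returns the fraction the judge answers as expected."""
    if not questions:
        return 0.0
    correct = sum(1 for q, expected in questions if ask(q, output) == expected)
    return correct / len(questions)

# Toy stand-in for a real judge call, for illustration only.
def keyword_ask(question: str, context: str) -> bool:
    return "Paris" in context

qs = [("Does the answer name Paris as the capital?", True)]
print(qag_score("The capital of France is Paris.", qs, keyword_ask))  # 1.0
```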
Evaluating Multimodal AI: Text Is Just the Beginning
Most eval frameworks were built for text. But production AI systems generate images, process documents, and understand vision. The tooling hasn't caught up — yet.
How to Evaluate AI Agents Without Getting Fooled
Agent evaluation has a non-determinism problem that's worse than in text evaluation. Here are the three gaps that show up most often, and how to close them.
The Structured Extraction Trap: When Your LLM Returns Garbage and Your Eval Doesn't Notice
Models fail to return parseable output on 10–15% of prompts. Here's why standard evals miss it, and how to catch format failures before they hit production.
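The cheapest defense is to treat "did the output parse?" as a first-class metric rather than letting unparseable responses score silently downstream. A minimal sketch, where `call_model` is a placeholder for your actual client:

```python
# Sketch: measure the format-failure rate directly instead of letting
# unparseable responses vanish into downstream scoring.
import json

def parse_failure_rate(prompts, call_model):
    """Fraction of prompts whose raw response is not valid JSON."""
    failures = 0
    for prompt in prompts:
        raw = call_model(prompt)
        try:
            json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            failures += 1
    return failures / len(prompts)

# Usage sketch:
# rate = parse_failure_rate(eval_prompts, client.complete)
# assert rate < 0.02, f"format failure rate {rate:.1%} exceeds budget"
```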
Why Your LLM Eval Results Are Probably Wrong
Single-run evaluation scores are so noisy they routinely misrank models and miss regressions. Here's the research behind it, and what to do instead.
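The fix is structurally simple: run the same eval several times and report the mean alongside a spread estimate, then only trust model comparisons whose gap clears the run-to-run noise. A sketch, where `run_eval` is a placeholder returning a single score:

```python
# Sketch: repeat the eval N times and report mean plus spread,
# rather than trusting a single run.
import statistics

def repeated_eval(run_eval, n_runs=5):
    scores = [run_eval() for _ in range(n_runs)]
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if n_runs > 1 else 0.0
    return mean, stdev, scores

# A model comparison is only meaningful if the gap clears the noise:
# mean_a, sd_a, _ = repeated_eval(lambda: evaluate(model_a))
# mean_b, sd_b, _ = repeated_eval(lambda: evaluate(model_b))
# if abs(mean_a - mean_b) < 2 * max(sd_a, sd_b): treat it as a tie.
```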