Insights
Practical thinking on AI evaluation, model quality, and shipping with confidence.
v2 benchmark: calibration transfers across datasets. multivon-eval F1 went up, not down.
Re-ran the head-to-head on RAGTruth, a dataset multivon-eval's calibration has never seen. At our HaluEval-derived threshold, F1 hit 0.787, above the in-distribution 0.63. That removes the v1 circularity caveat with real numbers.
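If you want to run the same check yourself, here is a minimal sketch of the fixed-threshold transfer test: binarize the judge's scores at a frozen, previously calibrated threshold and compute F1 on the new dataset. The arrays and the 0.5 threshold are illustrative placeholders, not the actual HaluEval-derived values.

```python
# Sketch: evaluate a frozen, previously calibrated threshold on an
# out-of-distribution dataset. Values below are toy placeholders.
from sklearn.metrics import f1_score

CALIBRATED_THRESHOLD = 0.5  # placeholder; the real value comes from HaluEval calibration

def f1_at_threshold(scores, labels, threshold=CALIBRATED_THRESHOLD):
    """Binarize continuous hallucination scores at a fixed threshold,
    then compute F1 against gold labels (1 = hallucination)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return f1_score(labels, preds)

# ood_scores / ood_labels would come from running the judge on RAGTruth.
ood_scores = [0.91, 0.12, 0.67, 0.33]
ood_labels = [1, 0, 1, 0]
print(f"OOD F1 at frozen threshold: {f1_at_threshold(ood_scores, ood_labels):.3f}")
```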
Three open-source eval frameworks disagree on 56% of cases. Here's what we found running our own benchmark.
Same judge, same dataset, same seed. multivon-eval, DeepEval, and RAGAS produce different verdicts on more than half the cases. The detection-prompt gap is small; the shipped-calibration gap is huge. Full repo, methodology, circularity disclosure included.
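For a sense of how a disagreement gets counted, here is one reasonable reading of the setup as a sketch: a case counts as a disagreement unless all three frameworks return the same verdict. The verdict lists below are toy placeholders, not the benchmark data.

```python
# Sketch: count a case as a disagreement unless all frameworks
# return the same verdict on it.
def disagreement_rate(*verdict_lists):
    """Fraction of cases where the frameworks do not unanimously agree."""
    cases = list(zip(*verdict_lists))
    disagreements = sum(1 for verdicts in cases if len(set(verdicts)) > 1)
    return disagreements / len(cases)

multivon = ["pass", "fail", "pass", "fail"]
deepeval = ["pass", "pass", "pass", "fail"]
ragas    = ["fail", "pass", "pass", "fail"]
print(f"disagreement: {disagreement_rate(multivon, deepeval, ragas):.0%}")
```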
Compliance-First LLM Evaluation: EU AI Act, NIST AI RMF, and Zero-Egress PII Detection
If you're evaluating LLMs in healthcare, finance, or legal, your eval pipeline is part of your compliance documentation. Here's what that actually means in practice.
DeepEval vs RAGAS vs multivon-eval: How They Actually Differ
Three open-source LLM evaluation frameworks, each with a genuinely different philosophy. What they're actually good at, where each falls short, and how to pick one.
Why LLM Evals Fail in Production (And What To Do About It)
Teams spend weeks tuning prompts, get green scores in the playground, then watch things fall apart in production. Here's why it keeps happening.
QAG vs LLM-as-Judge: Why We Score With Questions, Not Numbers
Asking a model to rate an output from 1 to 10 introduces its own hallucination risk. There's a more reliable way: generate yes/no questions and score by the fraction answered correctly.
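A minimal sketch of the QAG idea, assuming a hypothetical `ask(question, context) -> bool` judge call; in practice the question set is generated by a model from the reference answer, and `ask` is wired to your LLM client.

```python
# QAG sketch: instead of asking the judge for a numeric score, ask it a set
# of closed yes/no questions and score by the fraction answered as expected.
from typing import Callable

def qag_score(output: str,
              questions: list[tuple[str, bool]],
              ask: Callable[[str, str], bool]) -> float:
    """questions: (yes/no question, expected answer) pairs derived from the
    reference. Returns the fraction the judge answers as expected."""
    if not questions:
        return 0.0
    correct = sum(1 for q, expected in questions if ask(q, output) == expected)
    return correct / len(questions)

# Toy stand-in for a real judge call, for illustration only.
def keyword_ask(question: str, context: str) -> bool:
    return "Paris" in context

qs = [("Does the answer name Paris as the capital?", True)]
print(qag_score("The capital of France is Paris.", qs, keyword_ask))  # 1.0
```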
Evaluating Multimodal AI: Text Is Just the Beginning
Most eval frameworks were built for text. But production AI systems generate images, process documents, and understand vision. The tooling hasn't caught up — yet.
How to Evaluate AI Agents Without Getting Fooled
Agent evaluation has a non-determinism problem that's worse than in text evaluation. Here are the three gaps that show up most often, and how to close them.
The Structured Extraction Trap: When Your LLM Returns Garbage and Your Eval Doesn't Notice
Models fail to return parseable output on 10–15% of prompts. Here's why standard evals miss it, and how to catch format failures before they hit production.
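The cheapest defense is to treat "did the output parse?" as a first-class metric rather than letting unparseable responses score silently downstream. A minimal sketch, where `call_model` is a placeholder for your actual client:

```python
# Sketch: measure the format-failure rate directly instead of letting
# unparseable responses vanish into downstream scoring.
import json

def parse_failure_rate(prompts, call_model):
    """Fraction of prompts whose raw response is not valid JSON."""
    failures = 0
    for prompt in prompts:
        raw = call_model(prompt)
        try:
            json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            failures += 1
    return failures / len(prompts)

# Usage sketch:
# rate = parse_failure_rate(eval_prompts, client.complete)
# assert rate < 0.02, f"format failure rate {rate:.1%} exceeds budget"
```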
Why Your LLM Eval Results Are Probably Wrong
Single-run evaluation scores are so noisy they routinely misrank models and miss regressions. Here's the research behind it, and what to do instead.
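The fix is structurally simple: run the same eval several times and report the mean alongside a spread estimate, then only trust model comparisons whose gap clears the run-to-run noise. A sketch, where `run_eval` is a placeholder returning a single score:

```python
# Sketch: repeat the eval N times and report mean plus spread,
# rather than trusting a single run.
import statistics

def repeated_eval(run_eval, n_runs=5):
    scores = [run_eval() for _ in range(n_runs)]
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if n_runs > 1 else 0.0
    return mean, stdev, scores

# A model comparison is only meaningful if the gap clears the noise:
# mean_a, sd_a, _ = repeated_eval(lambda: evaluate(model_a))
# mean_b, sd_b, _ = repeated_eval(lambda: evaluate(model_b))
# if abs(mean_a - mean_b) < 2 * max(sd_a, sd_b): treat it as a tie.
```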