The EU AI Act is in force. NIST AI RMF is the de facto standard for US federal AI deployments. If you're evaluating LLMs in a regulated industry — healthcare, finance, legal, public sector — your eval pipeline is part of your compliance documentation, not just your QA workflow.
This post covers what compliance-relevant LLM evaluation looks like in practice, and why local-first evaluation is a hard requirement for regulated data environments.
The Compliance Problem With Cloud-Based Eval
Most LLM evaluation frameworks are API-first. To run an evaluation, you call:
- Your model's API (OpenAI, Anthropic, Google, etc.)
- An LLM judge API (often a different model, also cloud)
- Sometimes an eval-specific SaaS platform (third API)
For a regulated healthcare application, the test cases you're evaluating might contain:
- Patient queries referencing health conditions
- EHR summaries
- Clinical decision support outputs
Sending any of these to a third-party API for evaluation is a potential HIPAA violation. Even if the text is anonymized, the evaluation platform's data handling policies and BAA status need to be verified — and most eval SaaS platforms don't offer BAAs.
The same logic applies to legal (client privilege), finance (NPI data), and any public sector application subject to FedRAMP or sovereign cloud requirements.
The lowest-friction safe option for regulated data is evaluation that runs entirely within your environment, with no third-party data transfers. Cloud-based eval is possible with the right data handling agreements in place, but as noted above, those agreements are rare among eval platforms, which makes local execution the practical default for regulated teams.
EU AI Act Risk Management Requirements for Evals
A common mistake — including in the first version of multivon-eval's own docs — is to pin every compliance claim to Article 9. Article 9 is the risk management system obligation. The specific measurable controls are distributed across several Articles, and getting the references right matters when an auditor reads them.
Here is the paragraph-accurate picture for high-risk AI systems (Regulation (EU) 2024/1689):
- Art. 9(2)(b) — Identification of reasonably foreseeable misuse (covered by toxicity / safety evaluators)
- Art. 10(2)(f-g) — Examination of training/test data for possible biases and appropriate mitigation (covered by bias evaluators)
- Art. 10(5) — Processing of personal data, including special categories (covered by PII detection)
- Art. 15(1) — Accuracy (covered by faithfulness, hallucination, relevance, answer accuracy, task completion, …)
- Art. 15(2) — Robustness (covered by schema validation, not-empty, self-consistency, latency, …)
- Art. 12 — Record-keeping (satisfied by using a reporter like ComplianceReporter)
- Art. 11 / 13 / 14 / 15(4-5) — Technical documentation, transparency, human oversight, cybersecurity. These are process controls — they cannot be satisfied by model evaluation alone and require organizational measures.
The phrase "documented procedures" in the Act is load-bearing. It means you need an audit trail — not just that you ran evals, but what you ran, when, on what data, with what results, and that those results haven't been tampered with.
This is what ComplianceReporter is for. As of Compliance Pack v1, audit records are linked into a SHA-256 hash chain so the log is tamper-evident end-to-end — deleting a record from the middle is detected, not just in-place edits:
from multivon_eval import EvalSuite, ComplianceReporter
suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr")
suite.add_cases(cases)
reporter = ComplianceReporter(
output_dir="/audit/evals",
framework="eu-ai-act",
)
# Pre-flight: which Articles does this suite actually exercise?
print(reporter.coverage(suite))
report = suite.run(my_pipeline, runs=5)
reporter.record(report, tags={"system": "triage-bot", "version": "2.4.1"})
The EvalSuite.eu_ai_act_high_risk(...) factory wires the standard measurable controls — Faithfulness, Hallucination, Relevance, Toxicity, Bias, PII, NotEmpty (plus an optional Pydantic/JSON Schema) — with calibrated thresholds per judge model.
reporter.coverage(suite) returns a coverage object; printing it gives a gap report:
eu-ai-act coverage for suite 'EU AI Act High-Risk Eval'
───────────────────────────────────────────────────────
[x] Art. 9(2)(b) Foreseeable misuse — covered by: toxicity
[x] Art. 10(2)(f-g) Bias examination — covered by: bias
[x] Art. 10(5) Personal data processing — covered by: pii_detection
[x] Art. 15(1) Accuracy — covered by: faithfulness, hallucination, relevance
[x] Art. 15(2) Robustness — covered by: not_empty
Process controls (not satisfiable by evaluators alone):
Art. 11 Technical documentation
Art. 12 Record-keeping (satisfied by this reporter)
Art. 13 Transparency and information to deployers
Art. 14 Human oversight
Art. 15(4-5) Cybersecurity and resilience
Coverage: 5/5 measurable controls exercised.
Each reporter.record() writes one NDJSON line. The line stores the previous record's hash as prev_hash, so the log is a chain — not a set of independently-hashed records:
{
"record_id": "a3f9b2c1ef20",
"suite_name": "EU AI Act High-Risk Eval",
"model_id": "claude-sonnet-4-6",
"timestamp": "2026-05-13T14:32:17.821Z",
"framework": "eu-ai-act",
"chain_version": 1,
"prev_hash": "0000000000000000000000000000000000000000000000000000000000000000",
"summary": {
"total": 50,
"passed": 46,
"pass_rate": 0.92,
"tags": {"system": "triage-bot", "version": "2.4.1"}
},
"evaluator_results": [
{
"evaluator": "faithfulness",
"avg_score": 0.89,
"pass_rate": 0.88,
"controls": [{"id": "Art. 15(1)", "description": "Accuracy"}]
},
{
"evaluator": "pii_detection",
"avg_score": 1.0,
"pass_rate": 1.0,
"controls": [{"id": "Art. 10(5)", "description": "Processing of personal data"}]
}
],
"record_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
reporter.verify(suite_name) walks the chain. An in-place edit produces TAMPERED on the modified record. A mid-log deletion — the failure mode that per-record hashing silently allows — produces CHAIN BROKEN on the next record, because its prev_hash no longer matches.
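The chain check itself is simple enough to sketch. This is not ComplianceReporter's internal code, just a minimal illustration (using the record_hash / prev_hash fields shown above) of why a linked hash chain catches mid-log deletions that independently hashed records miss:

import hashlib
import json

def compute_hash(record: dict) -> str:
    # Hash everything except the record's own hash field, with stable key ordering.
    body = {k: v for k, v in record.items() if k != "record_hash"}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def verify_chain(records: list[dict]) -> str:
    prev = "0" * 64  # genesis value, as in the first record above
    for i, rec in enumerate(records):
        if compute_hash(rec) != rec["record_hash"]:
            return f"TAMPERED at record {i}"      # in-place edit detected
        if rec["prev_hash"] != prev:
            return f"CHAIN BROKEN at record {i}"  # insertion or mid-log deletion detected
        prev = rec["record_hash"]
    return "OK"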
PII Detection: Local Regex, No API Calls
For many regulated applications, a core requirement is that LLM outputs don't leak PII. This sounds simple but has a subtle failure mode: if you use an LLM judge to check for PII, you've sent the potentially-PII-containing output to another LLM API.
PIIEvaluator avoids this entirely. It uses local regex patterns — no API calls:
from multivon_eval import PIIEvaluator
# HIPAA jurisdiction — 13 of 18 PHI identifiers detectable via regex
suite.add_evaluators(PIIEvaluator(jurisdiction="hipaa"))
# GDPR jurisdiction adds EU VAT numbers
suite.add_evaluators(PIIEvaluator(jurisdiction="gdpr"))
# Redact in reports (replaces PII with [REDACTED-TYPE])
suite.add_evaluators(PIIEvaluator(redact=True))
# Add custom patterns for domain-specific identifiers
suite.add_evaluators(PIIEvaluator(patterns={"employee_id": r"EMP-\d{6}"}))
Detected types (base, all jurisdictions):
- Email addresses
- US and international phone numbers
- US Social Security Numbers
- Credit card numbers (Luhn algorithm)
- IBAN (international bank account numbers)
- IP addresses (IPv4 and IPv6)
- Dates of birth
- Passport numbers
- NHS numbers (UK)
- Street addresses
HIPAA adds: medical record numbers (MRN), health plan numbers, VINs, NPI/DEA license numbers, device identifiers, account numbers, and admission/discharge dates.
When PII is detected, the evaluator fails with a breakdown by type:
PII detected (2 type(s)):
email: "patient@hospital.org"
phone_us: "555-123-4567"
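Mechanically, this class of check needs nothing beyond the standard library, which is what makes it safe for regulated text. A standalone sketch of the idea (illustrative patterns, not PIIEvaluator's actual ones), including a Luhn check so random digit runs aren't flagged as card numbers:

import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn_us": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_us": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_ok(candidate: str) -> bool:
    # Luhn checksum: double every second digit from the right, sum the digit sums.
    digits = [int(c) for c in candidate if c.isdigit()][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def find_pii(text: str) -> dict[str, list[str]]:
    hits: dict[str, list[str]] = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if name == "credit_card":
            matches = [m for m in matches if luhn_ok(m)]
        if matches:
            hits[name] = matches
    return hits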
Important limit: 5 of the 18 HIPAA Safe Harbor PHI identifiers — names, geographic subdivisions below state level, full-face photos, biometric identifiers, and any unique number not covered by the patterns above — cannot be reliably detected via regex in text output. These require upstream de-identification before the content reaches your LLM.
Schema Validation for Regulated Outputs
A common regulated use case: the LLM must produce a structured output that feeds a downstream clinical or financial workflow. If the output fails to parse, the downstream system silently receives no input, or worse, receives malformed data.
SchemaEvaluator catches this at eval time rather than in production:
from pydantic import BaseModel
from multivon_eval import SchemaEvaluator
class ClinicalSummary(BaseModel):
    patient_id: str
    diagnosis_codes: list[str]
    risk_level: str  # "low" | "medium" | "high"
    referral_required: bool
suite.add_evaluators(SchemaEvaluator(ClinicalSummary, strict=True))
strict=True rejects outputs with extra fields — important for systems where extra data in a downstream record could cause errors or compliance issues.
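The rejection behavior is standard Pydantic extra-field handling. A quick illustration of the concept (a sketch of the underlying behavior, not SchemaEvaluator's implementation): extra="forbid" rejects unknown fields in the same spirit as strict=True above:

from pydantic import BaseModel, ConfigDict, ValidationError

class ClinicalSummaryStrict(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject unknown fields
    patient_id: str
    diagnosis_codes: list[str]
    risk_level: str
    referral_required: bool

raw = '{"patient_id": "P-102", "diagnosis_codes": ["E11.9"], "risk_level": "low", "referral_required": false, "notes": "free text"}'

try:
    ClinicalSummaryStrict.model_validate_json(raw)
except ValidationError as exc:
    print(exc)  # the unexpected "notes" field fails validation: "Extra inputs are not permitted"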
NIST AI RMF Alignment
The NIST AI Risk Management Framework organizes AI risk into four functions: GOVERN, MAP, MEASURE, MANAGE. For the MEASURE function (which covers testing, evaluation, and monitoring), multivon-eval evaluators map to specific control categories:
| Evaluator | NIST AI RMF Subcategory |
|---|---|
| Faithfulness, Hallucination, Relevance, AnswerAccuracy, TaskCompletion, ToolCallAccuracy, … | MEASURE 2.3 — AI system performance evaluation |
| NotEmpty, SchemaEvaluator, SelfConsistency, Latency, … | MEASURE 2.5 — AI system robustness |
| Toxicity | MEASURE 2.6 — AI system safety |
| PIIEvaluator | MEASURE 2.10 — Privacy risk |
| Bias | MEASURE 2.11 — Fairness and harmful bias |
When you use ComplianceReporter(framework="nist-ai-rmf"), each evaluator result in the audit record is annotated with the relevant subcategory. Process controls (GOVERN 1.1 policies, MEASURE 2.7 security, MEASURE 2.9 explainability, MANAGE 4.1 post-deployment monitoring) are surfaced separately in the coverage report — they require organizational measures, not evaluator output.
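Switching frameworks only changes the framework argument and the control annotations; the suite itself is unchanged. A sketch reusing the suite and my_pipeline from the EU AI Act example earlier:

from multivon_eval import ComplianceReporter

# Same evaluators, NIST-annotated audit records instead of EU AI Act Articles
reporter = ComplianceReporter(
    output_dir="/audit/evals",
    framework="nist-ai-rmf",
)
print(reporter.coverage(suite))  # MEASURE subcategories exercised vs. process controls

report = suite.run(my_pipeline, runs=5)
reporter.record(report, tags={"system": "triage-bot", "version": "2.4.1"})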
CI/CD Integration for Regulated Deployments
A compliant eval pipeline in CI:
# .github/workflows/regulated-eval.yml
name: Compliance Eval
on: [push]
jobs:
  eval:
    runs-on: self-hosted  # on-prem runner — no data leaves your environment
    steps:
      - uses: actions/checkout@v4
      - name: Install multivon-eval
        run: pip install multivon-eval
      - name: Run compliance eval
        run: python eval/run_compliance_eval.py
      - name: Verify audit trail integrity
        run: |
          python -c "
          from multivon_eval import ComplianceReporter
          reporter = ComplianceReporter('/audit/evals', framework='eu-ai-act')
          ok = reporter.verify('My Suite')
          if not ok:
              raise SystemExit('AUDIT TRAIL COMPROMISED — chain broken or record tampered')
          "
      - name: Archive audit records
        uses: actions/upload-artifact@v4
        with:
          name: eval-audit-${{ github.sha }}
          path: /audit/evals/
          retention-days: 400  # GitHub's maximum for private repos; mirror /audit/evals to long-term storage (Art. 18 requires documentation be kept for 10 years)
The self-hosted runner is important: model calls, judge calls, and test data all stay in your environment rather than on GitHub-hosted infrastructure. One caveat: the artifact upload step does place the audit records themselves in GitHub's storage. As shown above, those records contain summary statistics and hashes rather than raw test content, but if even that is unacceptable, drop the upload step and archive /audit/evals to storage you control.
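The eval/run_compliance_eval.py script referenced in the workflow can be as small as the setup shown in the next section plus a hard gate on the result. A sketch, assuming a load_cases() helper and my_pipeline of your own, with a 90% acceptance threshold chosen purely for illustration:

# eval/run_compliance_eval.py (sketch)
from multivon_eval import EvalSuite, ComplianceReporter
from my_project.evals import load_cases, my_pipeline  # hypothetical project-specific helpers

suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr")
suite.add_cases(load_cases())

report = suite.run(my_pipeline, runs=5)

reporter = ComplianceReporter("/audit/evals", framework="eu-ai-act")
reporter.record(report, tags={"system": "triage-bot", "env": "ci"})

# Fail the job (and block the merge) if the pass rate drops below the acceptance criterion.
if report.pass_rate < 0.90:
    raise SystemExit(f"Pass rate {report.pass_rate:.0%} is below the 90% threshold")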
What Auditors Actually Check
When a regulated AI system is audited under EU AI Act Article 9 or a NIST-aligned framework, auditors look for:
1. Evidence that testing occurred: timestamp, dataset description, which evaluators ran
2. Evidence of pass/fail thresholds: what was the acceptance criterion?
3. Evidence that failures were tracked: what happened when an eval failed?
4. Tamper evidence: have the records been modified since they were written?
5. Continuity: are evals being run regularly, not just once at release?
ComplianceReporter addresses 1, 4, and 5 directly. Thresholds and failure handling (2, 3) are set in your eval configuration and show up in the audit record via the pass_rate and tags fields.
The one thing auditors consistently flag in early-stage AI deployments: single-run evals with no statistical rigor. A pass rate of 92% from a single run of 50 cases (46 passing) has a 95% Wilson confidence interval of roughly [81%, 97%]. An auditor who understands statistics will note that you can't make claims about 92% reliability from 50 data points. Use wilson_interval() and report confidence intervals, not point estimates.
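The Wilson score interval needs nothing beyond the pass count and the sample size, so there is little excuse to skip it. For reference, a standalone computation in plain Python (this is what the wilson_interval() helper reports):

import math

def wilson(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for a pass rate of passed/n.
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

print(wilson(46, 50))    # ≈ (0.81, 0.97): wide, because n is small
print(wilson(460, 500))  # ≈ (0.89, 0.94): same rate, much tighter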
Practical Setup for Regulated Deployments
If you're in a regulated industry and evaluating LLMs for the first time, this is the minimum viable compliance-aware eval setup:
from multivon_eval import (
EvalSuite, EvalCase,
Faithfulness, Relevance,
PIIEvaluator, SchemaEvaluator,
ComplianceReporter,
wilson_interval,
)
# Your schema
from pydantic import BaseModel
class MyOutput(BaseModel):
    answer: str
    confidence: float
    sources: list[str]
# Build the suite
suite = EvalSuite("Production Eval")
suite.add_cases(cases)
suite.add_evaluators(
SchemaEvaluator(MyOutput), # structure first
PIIEvaluator(jurisdiction="hipaa", redact=True),
Faithfulness(),
Relevance(),
)
# Run with statistical rigor
report = suite.run(my_pipeline, runs=5)
# Report CIs, not just point estimates
lo, hi = wilson_interval(
pass_count=int(report.pass_rate * len(cases)),
n=len(cases)
)
print(f"Pass rate: {report.pass_rate:.0%} [{lo:.0%}–{hi:.0%}]")
# Record to audit trail
reporter = ComplianceReporter("/audit/evals", framework="eu-ai-act")
reporter.record(report, tags={"version": "2.4.1", "env": "production"})
Or use the auditor-ready factory shortcut:
from multivon_eval import EvalSuite, ComplianceReporter
# Full EU AI Act high-risk evaluator set, calibrated thresholds
suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr", schema=MyOutput)
suite.add_cases(cases)
reporter = ComplianceReporter("/audit/evals", framework="eu-ai-act")
# Pre-flight coverage check — fail the build if controls are missing
cov = reporter.coverage(suite)
if cov.missing:
    raise SystemExit(f"Missing controls: {[c.id for c in cov.missing]}")
report = suite.run(my_pipeline, runs=5)
reporter.record(report, tags={"system": "triage-bot", "version": "2.4.1"})
For HIPAA-jurisdiction PII patterns, EvalSuite.for_medical(jurisdiction="hipaa") keeps the medical-specific evaluator mix; eu_ai_act_high_risk is calibrated against the obligations that attach to Annex III high-risk systems.
Further Reading
- EU AI Act (full text): EUR-Lex — Regulation 2024/1689 — Article 9 is the risk management section. Annex IV covers technical documentation requirements.
- NIST AI RMF: airc.nist.gov/RMF — MEASURE function is pages 35–42 of the core document.
- ICO Guidance on AI and data protection: ico.org.uk — UK GDPR perspective on LLM output evaluation.
- multivon-eval compliance guide: evaldocs.multivon.ai/guides/compliance — full setup guide including ComplianceReporter, PIIEvaluator, and CI/CD integration.
- multivon-eval: pip install multivon-eval — all compliance evaluators and ComplianceReporter are available as of v0.3.0.
This is the fourth post in a series on production LLM evaluation. Previous posts: Why Your LLM Eval Results Are Probably Wrong · The Structured Extraction Trap · How to Evaluate AI Agents Without Getting Fooled