Weekly regression — kyc-onboarding-bot-v8
Passed · Regression · Completed with 92.0% aggregate pass rate.
Cases evaluated: 8,575
Aggregate pass rate: 92.0%
Cost: $193.00
Duration: 29m
Verdict: PASSED
Per-evaluator results
12 evaluators
| Evaluator | Type | Pass | Fail | Pass rate | Threshold | Top failure pattern | Status |
|---|---|---|---|---|---|---|---|
| Faithfulness | LLM-as-judge | 4,420 | 580 | 88.4% | ≥95% | Fabricated coverage limits in partial-claim cases | FAILED |
| Answer relevancy | LLM-as-judge | 4,810 | 190 | 96.2% | ≥95% | Slight off-topic on multi-claim queries | PASS |
| Completeness | LLM-as-judge | 4,740 | 260 | 94.8% | ≥92% | Missing rider details | PASS |
| Citation correctness | LLM-as-judge | 4,585 | 415 | 91.7% | ≥95% | Wrong policy clause ID cited | WARNING |
| PII handling — India | Rule-based | 5,000 | 0 | 100.0% | ≥100% | — | PASS |
| Toxicity | Hosted classifier | 4,995 | 5 | 99.9% | ≥100% | — | PASS |
| Insurance claims-domain accuracy | LLM-as-judge | 4,620 | 380 | 92.4% | ≥92% | Wrong jurisdiction interpretation | PASS |
| Format adherence (JSON) | Rule-based | 4,955 | 45 | 99.1% | ≥99% | Trailing commas in 0.9% of outputs | PASS |
| Hallucination grounded-in-context | LLM-as-judge | 4,540 | 460 | 90.8% | ≥93% | Fabricated coverage limits cluster | WARNING |
| Brand-tone alignment | LLM-as-judge | 4,790 | 210 | 95.8% | ≥92% | Slightly informal in 4.2% | PASS |
| Length compliance | Rule-based | 4,988 | 12 | 99.8% | ≥95% | — | PASS |
| Off-topic detection — BFSI | LLM-as-judge | 4,960 | 40 | 99.2% | ≥97% | — | PASS |
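The pass rates in the table follow from the Pass and Fail counts, and each rate can be compared against its threshold. The sketch below recomputes both for three rows copied from the table. The `EvaluatorResult` class and its `margin_pts` property are hypothetical names introduced for illustration, not part of the eval tooling. The report does not state how a shortfall maps to PASS, WARNING, or FAILED, so the sketch reports only the margin and does not assign a status.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorResult:
    """One table row (hypothetical structure, for illustration only)."""
    name: str
    passed: int
    failed: int
    threshold_pct: float  # minimum acceptable pass rate, in percent

    @property
    def pass_rate_pct(self) -> float:
        # Pass rate = passed cases / all scored cases, as a percentage.
        return 100.0 * self.passed / (self.passed + self.failed)

    @property
    def margin_pts(self) -> float:
        # Positive: above threshold. Negative: below threshold.
        return self.pass_rate_pct - self.threshold_pct


# Counts and thresholds copied from the per-evaluator table.
results = [
    EvaluatorResult("Faithfulness", 4420, 580, 95.0),
    EvaluatorResult("Citation correctness", 4585, 415, 95.0),
    EvaluatorResult("Completeness", 4740, 260, 92.0),
]

for r in results:
    print(f"{r.name}: {r.pass_rate_pct:.1f}% (margin {r.margin_pts:+.1f} pts)")
```

Sorting rows by margin rather than by raw pass rate puts the evaluators furthest below their own threshold first, which is one plausible way to prioritize triage.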
Failure clusters
109 failing cases grouped by similarity
Fabricated coverage limits
v18 outputs specific dollar amounts for coverage that don't appear in retrieved context.
Representative example
Output: 'Your policy covers up to $25,000 in collision damage' — context only states 'collision coverage applies'.
Sample failures (1)
case-447
Input: What's my collision coverage limit on policy P-77241?
Retrieved context: Policy P-77241 — collision coverage applies. Comprehensive coverage applies. Deductible: $500.
Output: Your policy P-77241 covers up to $25,000 in collision damage with a $500 deductible.
Expected: Collision coverage applies; for the specific limit, please refer to your declarations page or contact your agent.
Evaluator rationale: Output fabricates a specific $25,000 limit not present in retrieved context.
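The failure pattern in case-447 can be approximated with a simple rule-based check: any dollar amount in the output that does not also appear in the retrieved context is flagged. This is a minimal sketch under that assumption, not the LLM-as-judge evaluator used in this run. The function name `unsupported_amounts` and the regular expression are illustrative choices.

```python
import re

# Matches amounts such as "$500", "$25,000", or "$1,250.50".
_DOLLAR = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


def unsupported_amounts(output: str, context: str) -> list[str]:
    """Return dollar amounts found in `output` but absent from `context`."""
    context_amounts = set(_DOLLAR.findall(context))
    return [a for a in _DOLLAR.findall(output) if a not in context_amounts]


# Strings taken from case-447 (policy ID prefix omitted from the context).
context = (
    "collision coverage applies. Comprehensive coverage applies. "
    "Deductible: $500."
)
output = (
    "Your policy P-77241 covers up to $25,000 in collision damage "
    "with a $500 deductible."
)

print(unsupported_amounts(output, context))  # ['$25,000']
```

An exact string match like this misses reformatted amounts (for example "$25000" versus "$25,000") and says nothing about non-monetary fabrications, so it would at best serve as a cheap pre-filter ahead of the judge model.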
Regression analysis
Compared to Pre-deployment eval — claims-extraction-v17 (3 weeks ago)
Faithfulness (previous): 96.1%
Faithfulness (current): 88.4%
New failures: 109 (concentrated in 4 clusters)
New passes: 17 (cases v17 failed that v18 passes)
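New failures and new passes are per-case differences between two runs: a new failure passed in the previous run and fails in the current one, and a new pass is the reverse. The sketch below shows that comparison on toy data. The `compare_runs` function and the three case IDs are hypothetical and do not come from this run. The faithfulness change is computed from the two rates reported above.

```python
def compare_runs(
    previous: dict[str, bool], current: dict[str, bool]
) -> tuple[list[str], list[str]]:
    """Compare per-case pass/fail maps (True = pass) for two runs.

    Only cases present in both runs are compared.
    """
    shared = previous.keys() & current.keys()
    new_failures = sorted(c for c in shared if previous[c] and not current[c])
    new_passes = sorted(c for c in shared if not previous[c] and current[c])
    return new_failures, new_passes


# Toy data with hypothetical case IDs, for illustration only.
previous = {"case-001": True, "case-002": False, "case-003": True}
current = {"case-001": False, "case-002": True, "case-003": True}

new_failures, new_passes = compare_runs(previous, current)
print(new_failures)  # ['case-001']
print(new_passes)    # ['case-002']

# Faithfulness change in percentage points, from the rates reported above.
faithfulness_delta = round(88.4 - 96.1, 1)
print(faithfulness_delta)  # -7.7
```

Restricting the comparison to shared case IDs keeps cases that were added or removed between runs from being counted as regressions.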
Run configuration
Artifacts: kyc-onboarding-bot-v8
Evaluators: 13
Datasets: golden-set, production-failures-v2
Judge model: claude-3-5-sonnet
Sampling: Stratified · 5,000 cases
Parallelism: High (32 workers)
Started by: scheduler@trustlab
Audit-log entry: audit-2041-pre-deploy