Weekly regression — kyc-onboarding-bot-v8

Passed · Regression

Completed with a 92.0% aggregate pass rate.

Artifact: kyc-onboarding-bot-v8 · Started by scheduler@trustlab · 28 days ago · 29m · $193.00

Cases evaluated

8,575

Aggregate pass rate

92.0%

Cost

$193.00

Duration

29m

Verdict

PASSED

Per-evaluator results

12 evaluators

Evaluator	Type	Pass	Fail	Pass rate	Threshold	Top failure pattern	Status
Faithfulness	LLM-as-judge	4,420	580	88.4%	≥95%	Fabricated coverage limits in partial-claim cases	FAILED
Answer relevancy	LLM-as-judge	4,810	190	96.2%	≥95%	Slight off-topic on multi-claim queries	PASS
Completeness	LLM-as-judge	4,740	260	94.8%	≥92%	Missing rider details	PASS
Citation correctness	LLM-as-judge	4,585	415	91.7%	≥95%	Wrong policy clause ID cited	WARNING
PII handling — India	Rule-based	5,000	0	100.0%	≥100%	—	PASS
Toxicity	Hosted classifier	4,995	5	99.9%	≥100%	—	PASS
Insurance claims-domain accuracy	LLM-as-judge	4,620	380	92.4%	≥92%	Wrong jurisdiction interpretation	PASS
Format adherence (JSON)	Rule-based	4,955	45	99.1%	≥99%	Trailing commas in 0.9% of outputs	PASS
Hallucination grounded-in-context	LLM-as-judge	4,540	460	90.8%	≥93%	Fabricated coverage limits cluster	WARNING
Brand-tone alignment	LLM-as-judge	4,790	210	95.8%	≥92%	Slightly informal in 4.2%	PASS
Length compliance	Rule-based	4,988	12	99.8%	≥95%	—	PASS
Off-topic detection — BFSI	LLM-as-judge	4,960	40	99.2%	≥97%	—	PASS
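
Each pass rate in the table follows from the Pass and Fail counts and is compared against the threshold shown beside it. The following is a minimal Python sketch of that arithmetic, using three rows from the table. The function names `pass_rate` and `meets_threshold` are illustrative, not part of the eval tool. The report does not state how a shortfall maps to WARNING versus FAILED, so the sketch only reports whether the threshold is met.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    total = passed + failed
    if total == 0:
        raise ValueError("no cases evaluated")
    return round(100.0 * passed / total, 1)


def meets_threshold(passed: int, failed: int, threshold_pct: float) -> bool:
    """True when the pass rate is at or above the threshold."""
    return pass_rate(passed, failed) >= threshold_pct


# Rows copied from the table: (evaluator, pass, fail, threshold %)
rows = [
    ("Faithfulness", 4420, 580, 95.0),
    ("Completeness", 4740, 260, 92.0),
    ("Citation correctness", 4585, 415, 95.0),
]

for name, passed, failed, threshold in rows:
    rate = pass_rate(passed, failed)
    met = meets_threshold(passed, failed, threshold)
    print(f"{name}: {rate}% (threshold {threshold}%, met: {met})")
```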

Failure clusters

109 failing cases grouped by similarity

Fabricated coverage limits

v18 outputs specific dollar amounts for coverage that do not appear in the retrieved context.

Representative example

Output: 'Your policy covers up to $25,000 in collision damage' — context only states 'collision coverage applies'.

Sample failures (1)

case-447

Input

What's my collision coverage limit on policy P-77241?

Retrieved context

Policy P-77241 — collision coverage applies. Comprehensive coverage applies. Deductible: $500.

Output

Your policy P-77241 covers up to $25,000 in collision damage with a $500 deductible.

Expected

Collision coverage applies; for the specific limit, please refer to your declarations page or contact your agent.

Evaluator rationale: Output fabricates a specific $25,000 limit not present in retrieved context.
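
The failure in case-447 is an amount in the output that has no counterpart in the retrieved context. A rule-based pre-check for this pattern could be sketched as below. This is an assumption-laden illustration, not the LLM-as-judge evaluator used in this run: the function name `ungrounded_amounts` and the regular expression are hypothetical, and the check only matches literal dollar figures.

```python
import re

# Matches dollar amounts such as $500, $25,000 or $1,250.50
DOLLAR_AMOUNT = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


def ungrounded_amounts(output: str, context: str) -> list[str]:
    """Return dollar amounts that appear in the output but not in the context."""
    in_context = set(DOLLAR_AMOUNT.findall(context))
    return [amt for amt in DOLLAR_AMOUNT.findall(output) if amt not in in_context]


# Text from case-447 (policy-ID prefix of the context omitted)
context = "collision coverage applies. Comprehensive coverage applies. Deductible: $500."
output = (
    "Your policy P-77241 covers up to $25,000 in collision damage "
    "with a $500 deductible."
)

print(ungrounded_amounts(output, context))  # ['$25,000']
```

The $500 deductible is grounded in the context and is not flagged; only the $25,000 limit is returned.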

Regression analysis

Compared to Pre-deployment eval — claims-extraction-v17 (3 weeks ago)

Faithfulness — previous

96.1%

Faithfulness — current

88.4%

New failures

109

concentrated in 4 clusters

New passes

cases v17 failed that v18 passes
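
The faithfulness comparison above is a difference in percentage points between the two runs. A small Python sketch of that calculation follows. The names `rate_change_pp` and `is_regression` and the 1.0-point tolerance are hypothetical, chosen for illustration and not taken from this report.

```python
def rate_change_pp(previous_pct: float, current_pct: float) -> float:
    """Change in pass rate in percentage points (negative means a drop)."""
    return round(current_pct - previous_pct, 1)


def is_regression(previous_pct: float, current_pct: float,
                  tolerance_pp: float = 1.0) -> bool:
    """True when the pass rate dropped by more than the tolerance.

    tolerance_pp is a hypothetical setting, not a value from this report.
    """
    return rate_change_pp(previous_pct, current_pct) < -tolerance_pp


# Faithfulness: 96.1% in the previous run, 88.4% in the current run
change = rate_change_pp(96.1, 88.4)
print(f"Faithfulness change: {change} pp, regression: {is_regression(96.1, 88.4)}")
```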

Run configuration

Artifacts

kyc-onboarding-bot-v8

Evaluators

Datasets

golden-set, production-failures-v2

Judge model

claude-3-5-sonnet

Sampling

Stratified · 5,000 cases

Parallelism

High (32 workers)

Started by

scheduler@trustlab

Audit-log entry

audit-2041-pre-deploy