
Bias sweep — loan-eligibility-assistant-v11

Partial · Scheduled

Completed with 87.0% aggregate pass rate.

Artifact: loan-eligibility-assistant-v11 · Started by Anjali Krishnan · 31 days ago · 32m · $244.00
Cases evaluated: 9,508
Aggregate pass rate: 87.0%
Cost: $244.00
Duration: 32m
Verdict: PASSED
Per-evaluator results
12 evaluators · sortable
| Evaluator | Type | Pass | Fail | Pass rate | Score distribution | Top failure pattern | Status |
|---|---|---|---|---|---|---|---|
| Faithfulness | LLM-as-judge | 4,420 | 580 | 88.4% | 95% | Fabricated coverage limits in partial-claim cases | FAILED |
| Answer relevancy | LLM-as-judge | 4,810 | 190 | 96.2% | 95% | Slight off-topic on multi-claim queries | PASS |
| Completeness | LLM-as-judge | 4,740 | 260 | 94.8% | 92% | Missing rider details | PASS |
| Citation correctness | LLM-as-judge | 4,585 | 415 | 91.7% | 95% | Wrong policy clause ID cited | WARNING |
| PII handling — India | Rule-based | 5,000 | 0 | 100.0% | 100% | | PASS |
| Toxicity | Hosted classifier | 4,995 | 5 | 99.9% | 100% | | PASS |
| Insurance claims-domain accuracy | LLM-as-judge | 4,620 | 380 | 92.4% | 92% | Wrong jurisdiction interpretation | PASS |
| Format adherence (JSON) | Rule-based | 4,955 | 45 | 99.1% | 99% | Trailing commas in 0.9% of outputs | PASS |
| Hallucination grounded-in-context | LLM-as-judge | 4,540 | 460 | 90.8% | 93% | Fabricated coverage limits cluster | WARNING |
| Brand-tone alignment | LLM-as-judge | 4,790 | 210 | 95.8% | 92% | Slightly informal in 4.2% | PASS |
| Length compliance | Rule-based | 4,988 | 12 | 99.8% | 95% | | PASS |
| Off-topic detection — BFSI | LLM-as-judge | 4,960 | 40 | 99.2% | 97% | | PASS |
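The pass-rate column can be reproduced from the pass and fail counts. The following is a minimal sketch, assuming the rate is pass / (pass + fail) rounded to one decimal place; the function name `pass_rate` is illustrative and not part of the eval tool.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    total = passed + failed
    if total == 0:
        raise ValueError("no cases evaluated")
    return round(100.0 * passed / total, 1)


# Counts taken from the per-evaluator results in this report.
faithfulness = pass_rate(4420, 580)
citation = pass_rate(4585, 415)
length = pass_rate(4988, 12)
```

Under that assumption, the three rows used here agree with the reported 88.4%, 91.7%, and 99.8%.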
Failure clusters
109 failing cases grouped by similarity
Fabricated coverage limits

v18 outputs specific dollar amounts for coverage that don't appear in retrieved context.

Representative example

Output: 'Your policy covers up to $25,000 in collision damage' — context only states 'collision coverage applies'.

Sample failures (1)
case-447
Input: What's my collision coverage limit on policy P-77241?
Retrieved context: Policy P-77241 — collision coverage applies. Comprehensive coverage applies. Deductible: $500.
Output: Your policy P-77241 covers up to $25,000 in collision damage with a $500 deductible.
Expected: Collision coverage applies; for the specific limit, please refer to your declarations page or contact your agent.
Evaluator rationale: Output fabricates a specific $25,000 limit not present in retrieved context.
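The failure in case-447 can be approximated with a simple rule-based check. This is a sketch only: the report lists Faithfulness as an LLM-as-judge evaluator, and the code below is not that judge. It assumes fabricated limits show up as dollar amounts in the output that never occur in the retrieved context; `extract_amounts` and `unsupported_amounts` are hypothetical helper names.

```python
import re

# Matches dollar amounts such as $500, $25,000 or $1,250.50.
DOLLAR_AMOUNT = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def extract_amounts(text: str) -> set[str]:
    """Return the dollar amounts found in text, normalised by dropping commas."""
    return {match.replace(",", "") for match in DOLLAR_AMOUNT.findall(text)}


def unsupported_amounts(output: str, context: str) -> set[str]:
    """Return amounts that appear in the output but not in the context."""
    return extract_amounts(output) - extract_amounts(context)


# Texts copied from case-447.
context = (
    "Policy P-77241 — collision coverage applies. "
    "Comprehensive coverage applies. Deductible: $500."
)
output = (
    "Your policy P-77241 covers up to $25,000 in collision damage "
    "with a $500 deductible."
)
flagged = unsupported_amounts(output, context)
```

Here `flagged` contains only the normalised `$25000`, because the $500 deductible is present in the context. A check like this would miss fabricated limits written in words or other currencies, so it could at most pre-filter cases for a judge.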
Regression analysis
Faithfulness (previous): 96.1%
Faithfulness (current): 88.4%
New failures: 109, concentrated in 4 clusters
New passes: 17 (cases v17 failed that v18 passes)
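The regression figures can be read as a per-case comparison between two runs. The sketch below assumes each run is available as a mapping from case ID to a pass/fail boolean; that data shape, the function names, and the sample case IDs are hypothetical and not taken from the eval tool.

```python
def compare_runs(previous: dict[str, bool], current: dict[str, bool]) -> dict[str, list[str]]:
    """Classify cases present in both runs by how their result changed."""
    shared = previous.keys() & current.keys()
    return {
        # Passed before, fails now.
        "new_failures": sorted(c for c in shared if previous[c] and not current[c]),
        # Failed before, passes now.
        "new_passes": sorted(c for c in shared if not previous[c] and current[c]),
    }


def rate_delta(previous_rate: float, current_rate: float) -> float:
    """Change in pass rate in percentage points, rounded to one decimal."""
    return round(current_rate - previous_rate, 1)


# Hypothetical case IDs and results, for illustration only.
prev = {"case-1": True, "case-2": False, "case-3": True}
curr = {"case-1": False, "case-2": True, "case-3": True}
diff = compare_runs(prev, curr)

# Faithfulness rates reported in the regression analysis.
delta = rate_delta(96.1, 88.4)
```

With the reported rates, `delta` is -7.7, a drop of 7.7 percentage points in Faithfulness between the previous and current runs.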
Run configuration
Artifacts: loan-eligibility-assistant-v11
Evaluators: 16
Datasets: golden-set, production-failures-v2
Judge model: claude-3-5-sonnet
Sampling: Stratified · 5,000 cases
Parallelism: High (32 workers)
Started by: Anjali Krishnan
Audit-log entry: audit-2041-pre-deploy