186 datasets · 0.3M test cases · 23 imported this month
Golden reference outputs for insurance claim field extraction across 14 carriers.
Reference outputs for mortgage disclosure summarization and material-fact preservation.
Mixed-script KYC docs (Aadhaar, PAN, voter ID) with structured ground truth.
Real-time forex quote validation cases with stale-data and slippage scenarios.
Authentic Hindi-language customer support conversations across retail banking products.
Code-mixed Hinglish queries about credit cards, EMIs, FDs, and UPI failures.
Tamil-language banking queries, including transliterated and Tanglish forms.
Bengali claims-intake conversations with regional vocabulary variations.
Mid-sentence Hindi↔English code-switching across customer-support flows.
Imported AI4Bharat IndicGLUE eval suites across 11 Indian languages.
Pariksha multilingual evaluation suite covering 10 Indic languages and 23 task types.
Counterfactual pairs across gender, religion, caste, and region for BFSI decisions.
Loan-eligibility prompts probing surname- and caste-correlated bias.
Customer-support scenarios designed to expose religion-correlated treatment differences.
Aadhaar/PAN/IFSC/CKYC extraction probes for output-side PII leakage.
Toxicity probes spanning English, Hindi, Hinglish, Tamil, and Bengali.
847 production failure cases curated from Operations Platform traces where claims-copilot-v3 produced low-faithfulness outputs.
Curated production edge cases where the support RAG returned hallucinated policy text.
Curated SAR-narrative cases with tricky entity disambiguation.
MMLU 57-subject knowledge benchmark, full set.
BIG-Bench Hard 23-task reasoning benchmark.
HumanEval Python coding benchmark, 164 tasks.
Grade-school math word problem benchmark.
TruthfulQA: tests whether the model avoids repeating common false beliefs.
HarmBench standardized harm evaluation benchmark.
Indian Supreme Court & High Court citation accuracy with Manupatra cross-reference.
Promotional vs. balanced language tests aligned with US FDA OPDP and CDSCO rules.
Adherence to NICE and ESMO clinical guidelines across 24 oncology pathways.
FinanceBench: open-book financial QA across SEC filings.
Edge-case suite for statement summarizer covering known regression patterns.
Calibration suite for card dispute agent covering known regression patterns.
Long-tail suite for KYC onboarding covering known regression patterns.
Regression suite for loan underwriting covering known regression patterns.
Evaluation suite for wealth advisor covering known regression patterns.
Cohort suite for treasury quote covering known regression patterns.
Smoke suite for fraud narrative covering known regression patterns.
Stress suite for CLM redline covering known regression patterns.
Edge-case suite for claims triage covering known regression patterns.
Calibration suite for policy doc QA covering known regression patterns.
Long-tail suite for mortgage eligibility covering known regression patterns.
Regression suite for ESG summarizer covering known regression patterns.
Evaluation suite for P&C renewal covering known regression patterns.
Cohort suite for merchant onboarding covering known regression patterns.
Smoke suite for private banking RAG covering known regression patterns.
Stress suite for statement summarizer covering known regression patterns.
Edge-case suite for card dispute agent covering known regression patterns.
Calibration suite for KYC onboarding covering known regression patterns.
Long-tail suite for loan underwriting covering known regression patterns.
Regression suite for wealth advisor covering known regression patterns.
Evaluation suite for treasury quote covering known regression patterns.
Cohort suite for fraud narrative covering known regression patterns.
Smoke suite for CLM redline covering known regression patterns.
Stress suite for claims triage covering known regression patterns.
Edge-case suite for policy doc QA covering known regression patterns.
Calibration suite for mortgage eligibility covering known regression patterns.
Long-tail suite for ESG summarizer covering known regression patterns.
Regression suite for P&C renewal covering known regression patterns.
Evaluation suite for merchant onboarding covering known regression patterns.
Cohort suite for private banking RAG covering known regression patterns.
Smoke suite for statement summarizer covering known regression patterns.