108 evaluators across 11 categories · 87 used in active runs · 32 custom-authored
Output is grounded in retrieved context with no fabrication.
Faithfulness judged against code-switched Hindi-English context.
Output addresses the user question directly.
Output covers all required aspects of the prompt.
Output is internally consistent and well-structured.
Output is factually correct against world knowledge.
Citations match the claims they support.
Output validates against declared JSON schema.
Output matches a declared regular expression.
Output length within declared min/max bounds.
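The three format checks above (schema, regex, length) can be sketched with stdlib-only Python. The schema step here is deliberately simplified to required-key/type checks rather than full JSON Schema validation, and all names are illustrative:

```python
import json
import re

def check_json_schema(output: str, required: dict) -> bool:
    """Simplified schema check: output parses as JSON and each
    required key is present with the expected Python type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in required.items())

def check_regex(output: str, pattern: str) -> bool:
    """Output matches the declared regular expression in full."""
    return re.fullmatch(pattern, output) is not None

def check_length(output: str, min_len: int, max_len: int) -> bool:
    """Output length within declared min/max bounds (characters)."""
    return min_len <= len(output) <= max_len
```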
Detects toxic, abusive, or harassing language.
Toxicity detection across 12 Indian languages and code-switched variants.
Categorizes harmful content into self-harm, violence, sexual, and hate.
Profanity lexicon match across 24 languages.
Output aligns with Meridian brand voice and tone.
Detects off-topic or out-of-scope responses for BFSI agents.
Differential treatment based on gender.
Differential treatment based on religion.
India-specific caste bias detection across surnames and contexts.
Differential treatment based on race.
Differential treatment based on age cohort.
Differential outcomes across income/occupation segments.
Differential treatment based on disability status.
Region-of-origin bias across Indian states and dialects.
Statistical parity across protected attributes.
Equal true-positive rates across protected attributes.
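The two group-fairness metrics closing this list have standard definitions (statistical parity difference and the equal-opportunity TPR gap); a minimal sketch over binary outcomes and a two-group protected attribute, with illustrative names:

```python
def statistical_parity_diff(outcomes, groups):
    """Difference in positive-outcome rate between the two
    groups of a protected attribute (0 means parity)."""
    def rate(g):
        members = [o for o, grp in zip(outcomes, groups) if grp == g]
        return sum(members) / max(1, len(members))
    a, b = sorted(set(groups))
    return rate(a) - rate(b)

def tpr_gap(preds, labels, groups):
    """Equal opportunity: difference in true-positive rate
    across groups, computed on positive-label examples only."""
    def tpr(g):
        pos = [p for p, l, grp in zip(preds, labels, groups)
               if grp == g and l == 1]
        return sum(pos) / max(1, len(pos))
    a, b = sorted(set(groups))
    return tpr(a) - tpr(b)
```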
Detects and validates handling of common PII (email, SSN, phone).
Aadhaar, PAN, IFSC, GST detection, redaction, and policy compliance.
National IDs, IBAN, GDPR-special-category PII detection.
Detects verbatim regurgitation suggestive of training-data leakage.
Detects loss-curve patterns indicating membership-inference vulnerability.
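The Indian-ID detectors above typically start from surface pattern matching; a simplified sketch of detection and redaction. Real deployments add context and checksum validation (e.g. Aadhaar's Verhoeff check digit), which is omitted here:

```python
import re

# Simplified surface patterns; checksum and context validation
# are needed in practice to cut false positives.
PII_PATTERNS = {
    "email":   re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),  # 12 digits, often 4-4-4
    "pan":     re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),     # e.g. ABCDE1234F
    "ifsc":    re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),   # bank code + '0' + branch
}

def detect_pii(text: str) -> dict:
    """Return each PII type found with its matched spans."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

def redact_pii(text: str) -> str:
    """Replace every detected PII span with a type tag."""
    for name, pat in PII_PATTERNS.items():
        text = pat.sub(f"[{name.upper()}]", text)
    return text
```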
Retrieved context is relevant to the query.
Retrieved context contains all info needed to answer.
Answer can be derived from retrieved context.
Recall of gold passages at k.
Precision of gold passages at k.
Mean reciprocal rank of first relevant passage.
Normalized discounted cumulative gain (NDCG) of the retrieved ranking.
Fraction of output sentences attributable to a chunk.
Penalizes claims not derivable from retrieved context.
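The four ranking metrics above (recall@k, precision@k, MRR, NDCG) have standard formulas; a minimal sketch over ranked passage IDs and a gold set, assuming binary relevance for the NDCG variant:

```python
import math

def recall_at_k(ranked, gold, k):
    """Fraction of gold passages appearing in the top k."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)

def precision_at_k(ranked, gold, k):
    """Fraction of the top k that are gold passages."""
    return len(set(ranked[:k]) & set(gold)) / k

def mrr(ranked, gold):
    """Reciprocal rank of the first relevant passage (0 if none)."""
    for i, pid in enumerate(ranked, start=1):
        if pid in gold:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, gold, k):
    """Binary-relevance NDCG: DCG of the ranking over the DCG of
    an ideal ranking placing all gold passages first."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, pid in enumerate(ranked[:k], start=1) if pid in gold)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0
```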
Agent picks the correct tool for the goal.
Tool call arguments validate against schema.
Fraction of tool calls returning non-error.
Tools are called in a valid order to reach the goal.
Penalizes wasted tool calls and over-spend.
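The tool-call metrics above (success fraction, ordering validity, efficiency) reduce to simple aggregations over a call trace; a sketch where the trace shape and constraint format are illustrative assumptions:

```python
def tool_success_rate(calls):
    """Fraction of tool calls returning non-error."""
    return sum(1 for c in calls if not c.get("error")) / len(calls)

def order_valid(calls, must_precede):
    """Check each (before, after) constraint: the 'before' tool's
    first use occurs before the 'after' tool's first use."""
    first = {}
    for i, c in enumerate(calls):
        first.setdefault(c["tool"], i)
    return all(a in first and b in first and first[a] < first[b]
               for a, b in must_precede)

def efficiency(calls, minimal_calls):
    """Penalize over-spend: ratio of the minimal call count needed
    for the goal to the calls actually made, capped at 1.0."""
    return min(1.0, minimal_calls / len(calls))
```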
Generated image matches the prompt intent.
Image free of NSFW, violent, hateful content.
Image follows brand color, logo, and style guidelines.
Word error rate (WER) against the gold transcript.
Text references image content correctly.
Extracted fields match source document.
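The WER metric in the transcription evaluator above is word-level Levenshtein distance (substitutions + insertions + deletions) normalized by reference length; a minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(1, len(ref))
```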
Native-speaker-quality Hindi response judge.
Code-switched Hindi-English response judge.
Native-speaker-quality Tamil response judge.
Native-speaker-quality Telugu response judge.
Native-speaker-quality Bengali response judge.
Native-speaker-quality Marathi response judge.
Detects awkward script/language switches mid-utterance.
Validates Devanagari, Tamil, Bengali, Telugu script encoding.
Penalizes culturally insensitive outputs in Indian context.
Validates correct transliteration of Indic names.
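The script-encoding check above reduces to Unicode-block membership: a response declared as one Indic script should contain letters only from that script's block. A sketch using the standard Unicode ranges for the four scripts (the purity-score framing is an illustrative simplification):

```python
# Standard Unicode block ranges for the four supported scripts.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "bengali":    (0x0980, 0x09FF),
    "tamil":      (0x0B80, 0x0BFF),
    "telugu":     (0x0C00, 0x0C7F),
}

def script_purity(text: str, script: str) -> float:
    """Fraction of letters falling inside the declared script's
    Unicode block; digits, punctuation, and spaces are ignored."""
    lo, hi = SCRIPT_RANGES[script]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(lo <= ord(ch) <= hi for ch in letters) / len(letters)
```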
Chain-of-thought is internally coherent.
Each reasoning step is logically valid.
Multiple samples converge to the same answer.
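Convergence across samples in the last check above is commonly scored by majority vote over the sampled final answers; a minimal sketch:

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer across samples and the fraction
    of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)
```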
Agent completes the stated user goal.
Coordinator agent correctly delegates and aggregates.
Agent retrieves the right memory at the right turn.
Executed steps follow the agent's stated plan.
Domain-tuned judge for insurance claims correctness.
TILA/RESPA disclosure language compliance.
Field-level precision on PAN/Aadhaar/passport extraction.
Validates Hindi banking terminology vs RBI glossary.
FX quotes within market spread tolerance.
Case citations resolve and support the claim.
FDA/EMA labeling language compliance.
Output adheres to NICE/AHA/WHO guidelines.
Internal evaluator authored by the Meridian Responsible AI team.