Hypothesis
Sarvam-1, a Hindi-native model, will produce equal-or-better quality on Hindi customer voice tasks than claude-3-5-sonnet, at substantially lower cost and latency.
Methodology
- Datasets: hindi-customer-voice (n=2,847), hinglish-banking-queries (n=2,847)
- Evaluators: g-eval-hindi-quality, g-eval-hinglish, code-switching-coherence, cultural-context-bias, banking-domain-accuracy
- Sample size: 2,847 cases per arm, powered to detect a 2pp difference at α=0.05, β=0.2 (80% power)
- Judge model: Bilingual Hindi+English judge (κ=0.81 inter-rater)
- Statistical tests: Two-sided t-test on per-case scores; bootstrap CIs at 95%
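
The statistical tests listed above could be sketched roughly as follows. This is a minimal illustration, not the study's actual analysis code. The function names `two_sided_test` and `bootstrap_ci` are hypothetical, the two arms are treated as independent samples (a paired analysis may be more appropriate if both models score the same cases), and the p-value uses a normal approximation to the t distribution, which is reasonable only because n is in the thousands.

```python
import random
from statistics import NormalDist, mean, stdev


def two_sided_test(a, b):
    """Two-sided, unequal-variance test on per-case scores.

    Assumption: the p-value uses a normal approximation to the t
    distribution, which is close for samples in the thousands.
    """
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement, keeping its original size.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]
```

Calling `two_sided_test(sarvam_scores, claude_scores)` and `bootstrap_ci(sarvam_scores, claude_scores)` once per evaluator would give one test statistic, p-value, and interval per dimension; the two score lists are placeholders for the per-case evaluator outputs.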
Pre-registered analyses
- Per-dimension quality comparison (5 dimensions)
- Cost ($/1K cases), latency (p50/p95/p99)
- Severity-weighted red-team finding count
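
As a rough sketch of the cost and latency analyses above, the helpers below compute p50/p95/p99 from per-case latencies and normalize total spend to $/1K cases. The names `latency_percentiles` and `cost_per_1k` are hypothetical, and the percentile interpolation method is an assumption, since the plan does not specify one.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms):
    """Return p50/p95/p99 for a list of per-case latencies.

    Assumption: linear interpolation over the observed range
    (method="inclusive"); the plan does not name a method.
    """
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def cost_per_1k(total_cost_usd, n_cases):
    """Normalize total spend for one arm to dollars per 1,000 cases."""
    return 1000 * total_cost_usd / n_cases
```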
Decision criteria
Adopt Sarvam-1 as primary if quality is non-inferior (Sarvam-1 minus claude-3-5-sonnet ≥ −1pp on each dimension) AND cost is below 50% AND latency is below 75% of claude-3-5-sonnet's. Otherwise retain claude-3-5-sonnet as primary.
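
The decision rule can be written as a small function, sketched here under stated assumptions. The name `decide` and its parameters are hypothetical. Cost and latency are expressed as Sarvam-1 / claude-3-5-sonnet ratios. The plan does not say whether the −1pp margin applies to the point estimate or to the lower CI bound, nor which latency percentile is compared, so the caller chooses which values to pass in.

```python
def decide(quality_delta_pp, cost_ratio, latency_ratio):
    """Apply the adoption rule.

    quality_delta_pp: dict of dimension -> (Sarvam-1 minus claude-3-5-sonnet),
        in percentage points.
    cost_ratio, latency_ratio: Sarvam-1 value divided by the
        claude-3-5-sonnet value (assumed interpretation of the thresholds).
    """
    non_inferior = all(delta >= -1.0 for delta in quality_delta_pp.values())
    if non_inferior and cost_ratio < 0.50 and latency_ratio < 0.75:
        return "Sarvam-1"
    return "claude-3-5-sonnet"
```

All three conditions must hold at once, so a single dimension falling more than 1pp below the baseline is enough to retain claude-3-5-sonnet, regardless of cost or latency savings.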