Voyage-3 vs voyage-2 embeddings on customer-support-rag retrieval

Conclusion: voyage-3 +6.4pp recall@10, +2.1pp Hinglish. Adopted.

Complete · published Mar 28 · Comparison · Owner: Meera Pillai · Project: customer-support-rag-hindi

Hypothesis

voyage-3 improves recall@10 by ≥5pp over voyage-2.
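
For reference, recall@10 is commonly computed per query as the fraction of that query's relevant documents that appear in the top 10 retrieved results, then averaged over cases. The sketch below assumes that definition (some teams use a binary hit rate instead) and assumes each case provides a ranked list of retrieved IDs plus a set of relevant IDs. All function names are hypothetical and not taken from this experiment's code.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(cases, k=10):
    """Average recall@k over an iterable of (retrieved, relevant) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in cases]
    return sum(scores) / len(scores)


def pp_delta(candidate_rate, baseline_rate):
    """Difference between two rates in [0, 1], in percentage points."""
    return 100.0 * (candidate_rate - baseline_rate)
```

Under this definition, the hypothesis holds when pp_delta(mean recall@10 for voyage-3, mean recall@10 for voyage-2) is at least 5.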

Methodology

  • Datasets: TBD
  • Evaluators: TBD
  • Sample size: 2,847 cases per arm, powered to detect a 2pp difference at α=0.05, β=0.2 (80% power)
  • Judge model: Bilingual Hindi+English judge (κ=0.81 inter-rater)
  • Statistical tests: Two-sided t-test on per-case scores; bootstrap CIs at 95%
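
The sample-size and statistical-test items above can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the analysis code used for this report: it treats the two arms as independent samples (Welch's form of the t-test), uses a normal approximation for the p-value (reasonable at thousands of cases per arm), and uses a percentile bootstrap. The per-case standard deviation `sigma` is not given in the report and must be supplied by the caller. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def required_n_per_arm(delta, sigma, alpha=0.05, beta=0.2):
    """Normal-approximation sample size per arm for a two-sided, two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def welch_test(a, b):
    """Welch t statistic and two-sided p-value (normal approximation, large n)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


def bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each arm."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]
```

If both arms were scored on the same cases, a paired test on per-case differences would be the tighter choice; the report does not say which design was used.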

Pre-registered analyses

  • Per-dimension quality comparison (5 dimensions)
  • Cost ($/1K cases), latency (p50/p95/p99)
  • Severity-weighted red-team finding count
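
The cost and latency metrics above could be computed along these lines. This is a minimal sketch assuming per-request latencies in milliseconds and a total spend in USD per arm; the function names are hypothetical, and the interpolation method is one choice among several.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 from a list of per-request latencies."""
    # n=100 yields 99 cut points; cuts[i] is the (i + 1)th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def cost_per_1k(total_cost_usd, n_cases):
    """Cost in USD per 1,000 evaluated cases."""
    return 1000.0 * total_cost_usd / n_cases
```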

Decision criteria

Adopt Sarvam-1 as primary if quality is non-inferior (≥ −1pp on each dimension) AND cost < 50% AND latency < 75%. Otherwise, retain claude-3-5-sonnet as primary.
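
The rule above can be written as a small predicate. The sketch assumes the cost and latency thresholds are ratios of the candidate to the incumbent, which the criteria do not state explicitly, and that quality deltas are candidate minus incumbent in percentage points. The function name, parameter names, and dimension keys are hypothetical.

```python
def decide(quality_delta_pp, cost_ratio, latency_ratio,
           margin_pp=-1.0, max_cost_ratio=0.50, max_latency_ratio=0.75):
    """Return 'adopt' if every criterion passes, otherwise 'retain'.

    quality_delta_pp: dict of per-dimension deltas (candidate - incumbent), in pp.
    cost_ratio, latency_ratio: candidate divided by incumbent (assumed meaning).
    """
    non_inferior = all(d >= margin_pp for d in quality_delta_pp.values())
    cheaper = cost_ratio < max_cost_ratio
    faster = latency_ratio < max_latency_ratio
    return "adopt" if (non_inferior and cheaper and faster) else "retain"
```

Because the criteria are joined with AND, a single dimension below the −1pp margin is enough to retain the incumbent.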