Experiments

312 experiments · 47 currently running · 84 with published conclusions

claude-3-7-sonnet vs claude-3-5-sonnet on claims-extraction
Running
Comparison · Saanvi Nair · started Apr 28
Hypothesis: claude-3-7-sonnet will improve faithfulness on claims-extraction at an acceptable cost increase.
RAG chunk size 512 vs 1024 vs 2048 on retrieval quality
Complete
Ablation · Published · Vikram Shetty · started Mar 12
Hypothesis: 1024 chunks optimize precision/recall tradeoff.
Conclusion: 1024 wins on F1; 2048 wins on long-context QA. Recommend hybrid retrieval.
Hindi-only vs Hindi+English judge model on hinglish-banking-queries
Complete
A/B · Published · Anjali Krishnan · started Feb 24
Hypothesis: Bilingual judge improves Hinglish score reliability.
Conclusion: Bilingual judge κ=0.81 vs Hindi-only 0.62. Adopted as default.
voyage-3 vs voyage-2 embeddings on customer-support-rag retrieval
Complete
Comparison · Published · Meera Pillai · started Mar 1
Hypothesis: voyage-3 improves recall@10 by ≥5pp.
Conclusion: voyage-3 +6.4pp recall@10, +2.1pp Hinglish. Adopted.
Indirect-injection guardrail FP rate study
Running
Stress test · Fatima Khan · started Apr 18
Hypothesis: Tightened guardrail keeps recall while FP < 1.5%.
Sarvam-1 vs claude-3-5-sonnet on hindi-customer-voice
Complete
Vendor evaluation · Published · Anjali Krishnan · started Feb 14
Hypothesis: Sarvam-1, a Hindi-native model, will produce equal-or-better quality on Hindi customer voice tasks than claude-3-5-sonnet, at substantially lower cost and latency.
Conclusion: Sarvam-1 delivers quality comparable to claude-3-5-sonnet on Hindi customer voice tasks at 3.2x lower cost and 2.4x lower latency. Recommended for primary Hindi-customer-voice production traffic.
Caste-bias remediation prompt-engineering ablation
Running
Bias study · Fatima Khan · started Apr 22
Hypothesis: Explicit fairness reminder + counterfactual probe lowers caste-skewed denials.
Multi-step agent vs single-shot prompt on loan-eligibility
Complete
Comparison · Published · Vikram Shetty · started Mar 6
Hypothesis: Multi-step lifts approval-quality at acceptable latency.
Conclusion: Multi-step +9.1pp quality, 2.3x latency. Adopted with streaming.
Crescendo multi-turn defense prompt-engineering
Running
Adversarial study · Arjun Iyer · started Apr 25
Hypothesis: Per-turn refusal calibration reduces Crescendo escalation success.
Frontier model evaluation: GPT-5 access pre-approval
Planning
Frontier model assessment · Saanvi Nair · started May 2
Hypothesis: GPT-5 enters approved-vendor list for restricted internal use.
internal study #10 — embedding rotation impact
Complete
A/B · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #11 — embedding rotation impact
Analyzing
Stress test · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #12 — embedding rotation impact
Running
Ablation · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #13 — embedding rotation impact
Complete
Comparison · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #14 — embedding rotation impact
Analyzing
A/B · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #15 — embedding rotation impact
Running
Stress test · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #16 — embedding rotation impact
Complete
Ablation · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #17 — embedding rotation impact
Analyzing
Comparison · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #18 — embedding rotation impact
Running
A/B · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #19 — embedding rotation impact
Complete
Stress test · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #20 — embedding rotation impact
Analyzing
Ablation · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #21 — embedding rotation impact
Running
Comparison · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #22 — embedding rotation impact
Complete
A/B · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #23 — embedding rotation impact
Analyzing
Stress test · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #24 — embedding rotation impact
Running
Ablation · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #25 — embedding rotation impact
Complete
Comparison · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #26 — embedding rotation impact
Analyzing
A/B · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #27 — embedding rotation impact
Running
Stress test · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #28 — embedding rotation impact
Complete
Ablation · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #29 — embedding rotation impact
Analyzing
Comparison · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #30 — embedding rotation impact
Running
A/B · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.
internal study #31 — embedding rotation impact
Complete
Stress test · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
internal study #32 — embedding rotation impact
Analyzing
Ablation · Vikram Shetty · started Apr 1
Hypothesis: Rotation cadence does not regress retrieval.
Conclusion: Cadence acceptable.