Experiments

312 experiments · 47 currently running · 84 with published conclusions

claude-3-7-sonnet vs claude-3-5-sonnet on claims-extraction

Running

Comparison · Saanvi Nair · started Apr 28

Hypothesis: claude-3-7-sonnet will improve faithfulness on claims-extraction at an acceptable cost increase.

RAG chunk size 512 vs 1024 vs 2048 on retrieval quality

Complete

Ablation · Published · Vikram Shetty · started Mar 12

Hypothesis: Chunk size 1024 optimizes the precision/recall tradeoff.

Conclusion: 1024 wins on F1; 2048 wins on long-context QA. Recommend hybrid retrieval.

Hindi-only vs Hindi+English judge model on hinglish-banking-queries

Complete

A/B · Published · Anjali Krishnan · started Feb 24

Hypothesis: Bilingual judge improves Hinglish score reliability.

Conclusion: Bilingual judge κ=0.81 vs Hindi-only 0.62. Adopted as default.
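For readers unfamiliar with the κ figures above, here is a minimal sketch of how an agreement score of this kind can be computed. It assumes κ denotes Cohen's kappa between a judge model's labels and a reference rater's labels on the same items; the listing does not state which kappa variant or which reference was used, and the function name and sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: this is the agreement statistic behind the reported
    kappa values; the experiment listing does not confirm it.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: judge model vs. human reviewer on four items.
judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge, human)
```

Kappa corrects raw agreement for chance, which is why it is a more conservative reliability measure than simple percent agreement.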

voyage-3 vs voyage-2 embeddings on customer-support-rag retrieval

Complete

Comparison · Published · Meera Pillai · started Mar 1

Hypothesis: voyage-3 improves recall@10 by ≥5pp.

Conclusion: voyage-3 +6.4pp recall@10, +2.1pp Hinglish. Adopted.
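As an illustration of the metric in this hypothesis, the sketch below shows one common way to compute recall@10 per query and express the difference between two systems in percentage points (pp). The function names and document IDs are hypothetical, and the experiment's actual evaluation harness is not described in the listing.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("each query needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


def mean_recall_at_k(runs, k=10):
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


def pp_delta(new_score, old_score):
    """Difference between two rates, in percentage points."""
    return (new_score - old_score) * 100


# Hypothetical single-query example with k=2.
example = recall_at_k(["d1", "d2", "d3"], ["d1", "d4"], k=2)
delta = pp_delta(0.75, 0.5)
```

A percentage-point delta is an absolute difference between two rates, so a threshold such as ≥5pp does not depend on the baseline's size the way a relative improvement would.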

Indirect-injection guardrail FP rate study

Running

Stress test · Fatima Khan · started Apr 18

Hypothesis: Tightened guardrail keeps recall while FP < 1.5%.

Sarvam-1 vs claude-3-5-sonnet on hindi-customer-voice

Complete

Vendor evaluation · Published · Anjali Krishnan · started Feb 14

Hypothesis: Sarvam-1, a Hindi-native model, will produce equal-or-better quality on Hindi customer voice tasks than claude-3-5-sonnet, at substantially lower cost and latency.

Conclusion: Sarvam-1 delivers quality comparable to claude-3-5-sonnet on Hindi customer voice tasks at 3.2x lower cost and 2.4x lower latency. Recommended for primary Hindi-customer-voice production traffic.

Caste-bias remediation prompt-engineering ablation

Running

Bias study · Fatima Khan · started Apr 22

Hypothesis: Explicit fairness reminder + counterfactual probe lowers caste-skewed denials.

Multi-step agent vs single-shot prompt on loan-eligibility

Complete

Comparison · Published · Vikram Shetty · started Mar 6

Hypothesis: Multi-step agent improves approval quality at acceptable latency.

Conclusion: Multi-step +9.1pp quality, 2.3x latency. Adopted with streaming.

Crescendo multi-turn defense prompt-engineering

Running

Adversarial study · Arjun Iyer · started Apr 25

Hypothesis: Per-turn refusal calibration reduces Crescendo escalation success.

Frontier model evaluation: GPT-5 access pre-approval

Planning

Frontier model assessment · Saanvi Nair · started May 2

Hypothesis: GPT-5 enters approved-vendor list for restricted internal use.

internal study #10 — embedding rotation impact

Complete

A/B · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #11 — embedding rotation impact

Analyzing

Stress test · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #12 — embedding rotation impact

Running

Ablation · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #13 — embedding rotation impact

Complete

Comparison · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #14 — embedding rotation impact

Analyzing

A/B · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #15 — embedding rotation impact

Running

Stress test · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #16 — embedding rotation impact

Complete

Ablation · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #17 — embedding rotation impact

Analyzing

Comparison · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #18 — embedding rotation impact

Running

A/B · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #19 — embedding rotation impact

Complete

Stress test · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #20 — embedding rotation impact

Analyzing

Ablation · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #21 — embedding rotation impact

Running

Comparison · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #22 — embedding rotation impact

Complete

A/B · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #23 — embedding rotation impact

Analyzing

Stress test · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #24 — embedding rotation impact

Running

Ablation · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #25 — embedding rotation impact

Complete

Comparison · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #26 — embedding rotation impact

Analyzing

A/B · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #27 — embedding rotation impact

Running

Stress test · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #28 — embedding rotation impact

Complete

Ablation · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #29 — embedding rotation impact

Analyzing

Comparison · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #30 — embedding rotation impact

Running

A/B · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.

internal study #31 — embedding rotation impact

Complete

Stress test · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

internal study #32 — embedding rotation impact

Analyzing

Ablation · Vikram Shetty · started Apr 1

Hypothesis: Rotation cadence does not regress retrieval.

Conclusion: Cadence acceptable.