Benchmarks
Run public benchmarks on your artifacts and compare the results to published numbers.
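One lightweight way to do that comparison is to diff your measured scores against the published reference numbers and flag anything that regresses by more than a tolerance. A minimal sketch — benchmark names and scores below are illustrative placeholders, not real results:

```python
def compare_to_published(measured, published, tolerance=0.02):
    """Return {benchmark: (ours, reference, delta)} for every benchmark
    whose measured score falls more than `tolerance` below the reference."""
    gaps = {}
    for name, ref in published.items():
        ours = measured.get(name)
        if ours is not None and ref - ours > tolerance:
            gaps[name] = (ours, ref, round(ours - ref, 4))
    return gaps

# Placeholder local runs vs. placeholder reported scores:
measured  = {"mmlu": 0.64, "gsm8k": 0.71}
published = {"mmlu": 0.68, "gsm8k": 0.72}
print(compare_to_published(measured, published))
```

Here only the first benchmark is flagged: it trails the reference by 0.04, while the second is within the 0.02 tolerance.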
Massive multitask language understanding (57 subjects)
Big-Bench Hard reasoning subset
Commonsense NLI
AI2 Reasoning Challenge
Python code generation pass@1
Mostly Basic Python Problems
Grade-school math word problems
Competition mathematics
Resistance to common misconceptions
Commonsense reasoning
Passage ranking
Heterogeneous retrieval benchmark (18 tasks)
Massive text embedding benchmark
Refusal rate against 510 harmful behaviors
Adversarial harmful instructions
Real-world toxicity in conversation
Toxic continuation likelihood
Bias in open-ended generation
Realistic web-task automation
Multi-environment agent benchmark
Real GitHub issue resolution
Tool use in realistic customer-service scenarios
GSM8K translated to 10 languages
Cross-lingual NLI (15 languages)
Translation across 200 languages
Comprehensive Indic NLU/NLG benchmark
Nine tasks across 11 Indic languages
Massive Indic Language Understanding (11 languages)
Human and automatic evaluation of Indic LLMs
Safety across 12 Indic languages
Open-book QA over 10-K/10-Q filings
USMLE-style medical QA
162 legal reasoning tasks
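The pass@1 figure for code generation above is conventionally computed with the unbiased pass@k estimator, pass@k = 1 - C(n-c, k)/C(n, k), where n generations are sampled per problem and c of them pass the unit tests. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c of which are
    correct, passes. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem and 3 passing, pass@1 reduces to c/n:
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

Averaging this estimate over all problems in the suite gives the reported score; pass@1 with a single greedy sample per problem is simply the fraction of problems solved.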
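Refusal-rate metrics like the one above are often scored with a simple string heuristic: count responses that open with a stock refusal phrase. A rough sketch, assuming a keyword-based classifier (the marker list here is illustrative; production evaluations typically use a judge model instead):

```python
# Illustrative refusal prefixes; real evaluations use longer lists
# or an LLM judge.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def refusal_rate(responses):
    """Fraction of responses that begin with a common refusal phrase."""
    refused = sum(
        1 for r in responses
        if r.strip().lower().startswith(REFUSAL_MARKERS)
    )
    return refused / len(responses)

print(refusal_rate([
    "I'm sorry, but I can't help with that.",
    "Sure, here is how you do it...",
]))  # 0.5
```

Prefix matching is crude — it misses partial compliance and soft refusals — which is why scores from this heuristic and from model-judged evaluations can diverge.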