Benchmarks

Run public benchmarks on your artifacts and compare the scores with published results.

MMLU
General quality

Massive multitask language understanding (57 subjects)

Top public
GPT-4o 88.7
Claude 3.5 Sonnet 88.3
Gemini 1.5 Pro 85.9
6 of your artifacts benchmarked 3 days ago
BBH
General quality

Big-Bench Hard reasoning subset

Top public
Claude 3.5 Sonnet 93.1
GPT-4o 89.6
4 of your artifacts benchmarked 1 week ago
HellaSwag
General quality

Commonsense NLI

Top public
GPT-4o 95.3
3 of your artifacts benchmarked 12 days ago
ARC
General quality

AI2 Reasoning Challenge

Top public
Claude 3.5 Sonnet 96.4
3 of your artifacts benchmarked 12 days ago
HumanEval
General quality

Python code generation (pass@1)

Top public
Claude 3.5 Sonnet 92.0
GPT-4o 90.2
2 of your artifacts benchmarked 5 days ago
MBPP
General quality

Mostly Basic Python Problems

Top public
Claude 3.5 Sonnet 91.6
2 of your artifacts benchmarked 5 days ago
GSM8K
General quality

Grade-school math word problems

Top public
Claude 3.5 Sonnet 96.4
GPT-4o 95.8
5 of your artifacts benchmarked 2 days ago
MATH
General quality

Competition mathematics

Top public
GPT-4o 76.6
3 of your artifacts benchmarked 8 days ago
TruthfulQA
General quality

Resistance to common misconceptions

Top public
Claude 3.5 Sonnet 71.2
8 of your artifacts benchmarked 1 day ago
WinoGrande
General quality

Commonsense reasoning

Top public
GPT-4o 87.5
2 of your artifacts benchmarked 20 days ago
MS MARCO
RAG and retrieval

Passage ranking

Top public
BGE-M3 41.2
4 of your artifacts benchmarked 6 days ago
BEIR
RAG and retrieval

Heterogeneous retrieval benchmark (18 tasks)

Top public
BGE-M3 54.1
4 of your artifacts benchmarked 6 days ago
MTEB
RAG and retrieval

Massive text embedding benchmark

Top public
NV-Embed-v2 72.3
6 of your artifacts benchmarked 9 days ago
HarmBench
Safety

Refusal rate against 510 harmful behaviors

Top public
Claude 3.5 Sonnet 98.4
GPT-4o 96.8
9 of your artifacts benchmarked 1 day ago
AdvBench
Safety

Adversarial harmful instructions

Top public
Claude 3.5 Sonnet 99.1
7 of your artifacts benchmarked 1 day ago
ToxicChat
Safety

Real-world toxicity in conversation

Top public
Llama-Guard-3 87.4
5 of your artifacts benchmarked 4 days ago
RealToxicityPrompts
Safety

Toxic continuation likelihood

Top public
Claude 3.5 Sonnet 96.2
4 of your artifacts benchmarked 8 days ago
BOLD
Safety

Bias in open-ended generation

Top public
Claude 3.5 Sonnet 91.0
4 of your artifacts benchmarked 12 days ago
WebArena
Agent capability

Realistic web-task automation

Top public
Claude 3.5 Sonnet 36.2
GPT-4o 14.4
3 of your artifacts benchmarked 2 days ago
AgentBench
Agent capability

Multi-environment agent benchmark

Top public
GPT-4o 4.4
2 of your artifacts benchmarked 9 days ago
SWE-bench
Agent capability

Real GitHub issue resolution

Top public
Claude 3.5 Sonnet (agent) 49.0
1 of your artifacts benchmarked 16 days ago
TauBench
Agent capability

Tool use in realistic customer-service scenarios

Top public
Claude 3.5 Sonnet 55.8
GPT-4o 49.1
4 of your artifacts benchmarked 3 days ago
MGSM
Multilingual

GSM8K translated to 10 languages

Top public
GPT-4o 90.5
4 of your artifacts benchmarked 11 days ago
XNLI
Multilingual

Cross-lingual NLI (15 languages)

Top public
XLM-RoBERTa-XL 84.6
3 of your artifacts benchmarked 14 days ago
FLORES-200
Multilingual

Translation across 200 languages

Top public
NLLB-200 44.3
3 of your artifacts benchmarked 14 days ago
AI4Bharat IndicEval
Indic

Comprehensive Indic NLU/NLG benchmark

Top public
GPT-4o 71.4
Airavata-7B 67.2
6 of your artifacts benchmarked 2 days ago
IndicXTREME
Indic

11 Indic languages, 9 tasks

Top public
MuRIL 72.1
4 of your artifacts benchmarked 5 days ago
MILU
Indic

Massive Indic Language Understanding (11 languages)

Top public
GPT-4o 72.0
Llama-3.1-70B 65.4
5 of your artifacts benchmarked 3 days ago
Pariksha
Indic

Human and automatic evaluation of Indic LLMs

Top public
GPT-4-Turbo 68.1
3 of your artifacts benchmarked 8 days ago
IndicSafe
Indic

Safety across 12 Indic languages

Top public
Claude 3.5 Sonnet 91.2
6 of your artifacts benchmarked 1 day ago
FinanceBench
Domain

Open-book QA over 10-K/10-Q filings

Top public
Claude 3.5 Sonnet 81.0
GPT-4o 77.3
Gemini 1.5 Pro 71.8
5 of your artifacts benchmarked 1 day ago
MedQA
Domain

USMLE-style medical QA

Top public
GPT-4o 89.6
Med-PaLM 2 86.5
2 of your artifacts benchmarked 20 days ago
LegalBench
Domain

162 legal reasoning tasks

Top public
Claude 3.5 Sonnet 81.2
GPT-4o 78.4
3 of your artifacts benchmarked 9 days ago
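
Note on pass@1

The HumanEval card reports pass@1. As a rough illustration of what that number means, the sketch below implements the commonly used unbiased pass@k estimate, 1 - C(n - c, k) / C(n, k), where n completions are sampled per problem and c of them pass the unit tests. Whether this page computes its scores this way is an assumption, and the function name pass_at_k and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: completions sampled for the problem (hypothetical count)
    c: completions that passed every unit test
    k: attempt budget being scored (k=1 for pass@1)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1. comb() returns 0 when fewer than k samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over (n, c) pairs, scaled to 0-100 like the cards."""
    scores = [pass_at_k(n, c, k) for n, c in per_problem]
    return 100.0 * sum(scores) / len(scores)
```

With k = 1 the estimate reduces to c / n, the fraction of sampled completions that pass, so a benchmark-level pass@1 is the mean of that fraction over all problems.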