Benchmarks
Run public benchmarks on your artifacts and compare the results to published numbers.
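One lightweight way to do that comparison is to diff your measured scores against the published reference numbers and flag anything that regresses by more than a tolerance. A minimal sketch — benchmark names and scores below are illustrative placeholders, not real results:

```python
def compare_to_published(measured, published, tolerance=0.02):
    """Return {benchmark: (ours, reference, delta)} for every benchmark
    whose measured score falls more than `tolerance` below the reference."""
    gaps = {}
    for name, ref in published.items():
        ours = measured.get(name)
        if ours is not None and ref - ours > tolerance:
            gaps[name] = (ours, ref, round(ours - ref, 4))
    return gaps

# Placeholder local runs vs. placeholder reported scores:
measured  = {"mmlu": 0.64, "gsm8k": 0.71}
published = {"mmlu": 0.68, "gsm8k": 0.72}
print(compare_to_published(measured, published))
```

Here only the first benchmark is flagged: it trails the reference by 0.04, while the second is within the 0.02 tolerance.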
Massive multitask language understanding (57 subjects)
Big-Bench Hard reasoning subset
Commonsense NLI
AI2 Reasoning Challenge
Python code generation pass@1
Mostly Basic Python Problems
Grade-school math word problems
Competition mathematics
Resistance to common misconceptions
Commonsense reasoning
Passage ranking
Heterogeneous retrieval benchmark (18 tasks)
Massive text embedding benchmark
Refusal rate against 510 harmful behaviors
Adversarial harmful instructions
Real-world toxicity in conversation
Toxic continuation likelihood
Bias in open-ended generation
Realistic web-task automation
Multi-environment agent benchmark
Real GitHub issue resolution
Tool use in realistic customer-service scenarios
GSM8K translated to 10 languages
Cross-lingual NLI (15 languages)
Translation across 200 languages
Comprehensive Indic NLU/NLG benchmark
Nine tasks across 11 Indic languages
Massive Indic Language Understanding (11 languages)
Human and automatic evaluation of Indic LLMs
Safety across 12 Indic languages
Open-book QA over 10-K/10-Q filings
USMLE-style medical QA
162 legal reasoning tasks
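The pass@1 figure for code generation above is conventionally computed with the unbiased pass@k estimator, pass@k = 1 - C(n-c, k)/C(n, k), where n generations are sampled per problem and c of them pass the unit tests. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c of which are
    correct, passes. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem and 3 passing, pass@1 reduces to c/n:
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

Averaging this estimate over all problems in the suite gives the reported score; pass@1 with a single greedy sample per problem is simply the fraction of problems solved.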
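Refusal-rate metrics like the one above are often scored with a simple string heuristic: count responses that open with a stock refusal phrase. A rough sketch, assuming a keyword-based classifier (the marker list here is illustrative; production evaluations typically use a judge model instead):

```python
# Illustrative refusal prefixes; real evaluations use longer lists
# or an LLM judge.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def refusal_rate(responses):
    """Fraction of responses that begin with a common refusal phrase."""
    refused = sum(
        1 for r in responses
        if r.strip().lower().startswith(REFUSAL_MARKERS)
    )
    return refused / len(responses)

print(refusal_rate([
    "I'm sorry, but I can't help with that.",
    "Sure, here is how you do it...",
]))  # 0.5
```

Prefix matching is crude — it misses partial compliance and soft refusals — which is why scores from this heuristic and from model-judged evaluations can diverge.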