HellaSwag

General quality

Commonsense NLI

Public leaderboard
#   Model    Score
1   GPT-4o   95.3
Your artifacts on this benchmark
Ranked among public results
Artifact                              Score   Rank   Δ
claims-copilot-v3                     79.8    #4     -1.2
wealth-advisor-rag-v5                 82.1    #1     +1.1
trade-finance-helper-v3               76.4    #6     -0.9
claude-3-5-sonnet (vendor baseline)   81.0    #2     0.0
gpt-4o (vendor baseline)              77.3    #5     0.0
Latest result card — wealth-advisor-rag-v5 on HellaSwag
Overall score: 82.1
Public rank: #1
Cases evaluated: 10,231
Cost: $184.20
Category breakdown
Category                    Score
Open-book QA (10-K)         86.0%
Open-book QA (10-Q)         83.0%
Earnings transcripts        79.0%
Cross-document reasoning    74.0%
Numerical reasoning         81.0%