
FinanceBench

Domain: Open-book QA over 10-K/10-Q filings

Public leaderboard
#    Model               Score
1    Claude 3.5 Sonnet   81.0
2    GPT-4o              77.3
3    Gemini 1.5 Pro      71.8
4    Llama-3.1-405B      68.4
5    Mistral-Large-2     64.2
6    Claude 3 Opus       62.9
7    GPT-4 Turbo         60.1
8    Llama-3.1-70B       58.7
9    Qwen2.5-72B         56.3
10   Mixtral-8x22B       51.4
Your artifacts on this benchmark
Ranked among public results
Artifact                              Score   Rank   Δ
claims-copilot-v3                     79.8    #4     -1.2
wealth-advisor-rag-v5                 82.1    #1     +1.1
trade-finance-helper-v3               76.4    #6     -0.9
claude-3-5-sonnet (vendor baseline)   81.0    #2     0.0
gpt-4o (vendor baseline)              77.3    #5     0.0
Latest result card — wealth-advisor-rag-v5 on FinanceBench
Overall score: 82.1
Public rank: #1
Cases evaluated: 10,231
Cost: $184.20
Category breakdown
Open-book QA — 10-K: 86.0%
Open-book QA — 10-Q: 83.0%
Earnings transcripts: 79.0%
Cross-document reasoning: 74.0%
Numerical reasoning: 81.0%
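
The result card figures support two quick derived numbers: the cost per evaluated case and the plain (unweighted) average of the category scores. The sketch below is illustrative only. The function names `cost_per_case` and `unweighted_mean` are hypothetical helpers and not part of any benchmark tooling, and the page does not say how the overall score is aggregated from the categories.

```python
# Figures copied from the wealth-advisor-rag-v5 result card.
# The helpers below are hypothetical and shown for illustration only.
CASES_EVALUATED = 10_231
TOTAL_COST_USD = 184.20

CATEGORY_SCORES = {
    "Open-book QA — 10-K": 86.0,
    "Open-book QA — 10-Q": 83.0,
    "Earnings transcripts": 79.0,
    "Cross-document reasoning": 74.0,
    "Numerical reasoning": 81.0,
}


def cost_per_case(total_cost: float, cases: int) -> float:
    """Average spend per evaluated case, in the same currency as total_cost."""
    return total_cost / cases


def unweighted_mean(scores: dict[str, float]) -> float:
    """Simple average of category scores; every category counts equally."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    print(f"Cost per case: ${cost_per_case(TOTAL_COST_USD, CASES_EVALUATED):.4f}")
    print(f"Unweighted category mean: {unweighted_mean(CATEGORY_SCORES):.1f}")
```

The unweighted mean comes out at 80.6, below the reported overall score of 82.1. One possible explanation is that the overall score weights categories by case count, but the card does not list per-category counts, so this remains an assumption.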