Compare the top AI models by performance and cost. Last updated: May 30, 2026. Sources: LMSYS Chatbot Arena, HuggingFace Open LLM Leaderboard, official benchmarks.
| # | Model (provider) | MMLU | HumanEval | Arena Elo | Price / 1M input tokens | Category |
|---|---|---|---|---|---|---|
| 1 | o3 (OpenAI) | 96.7% | 97.9% | 1350 | $10 | reasoning |
| 2 | Gemini 2.5 Pro (Google) | 91.0% | 84.1% | 1320 | $1.25 | general, vision, reasoning |
| 3 | o4-mini (OpenAI) | 93.3% | 92.4% | 1310 | $1.1 | reasoning, coding |
| 4 | Claude Opus 4.6 (Anthropic) | 90.4% | 84.9% | 1290 | $15 | general, reasoning, vision |
| 5 | GPT-4o (OpenAI) | 88.7% | 90.2% | 1285 | $2.5 | general, vision, coding |
| 6 | Claude Sonnet 4.6 (Anthropic) | 88.7% | 73.0% | 1258 | $3 | general, coding |
| 7 | Gemini 2.5 Flash (Google) | 86.2% | 74.3% | 1240 | $0.075 | general, coding |
| 8 | DeepSeek V3 (DeepSeek, Open) | 87.1% | 89.1% | 1230 | $0.27 | general, coding |
| 9 | Llama 3.3 70B (Meta, Open) | 86.0% | 88.4% | 1220 | Free | general, coding |
| 10 | Qwen2.5 72B (Alibaba, Open) | 86.7% | 86.7% | 1210 | Free | general, coding |
| 11 | GPT-4o mini (OpenAI) | 82.0% | 87.2% | 1200 | $0.15 | general, coding |
| 12 | Mistral Large 2 (Mistral) | 84.0% | 92.1% | 1195 | $2 | general, coding |
| 13 | Claude Haiku 4.5 (Anthropic) | 75.2% | 60.0% | 1180 | $0.8 | general, coding |
| 14 | Llama 4 Scout (Meta, Open) | 79.6% | 72.6% | 1180 | Free | general |
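The price column can be turned into a rough per-request estimate. The sketch below is illustrative only: the function name `input_cost_usd` and the 250,000-token workload are assumptions, the prices are copied from a few rows of the table, and output-token pricing (which the table does not list) is ignored.

```python
def input_cost_usd(price_per_million: float, input_tokens: int) -> float:
    """Cost in USD of sending `input_tokens` at a given price per 1M input tokens."""
    return price_per_million * input_tokens / 1_000_000


# Input prices copied from the table (USD per 1M input tokens).
PRICES = {
    "o3": 10.0,
    "Gemini 2.5 Pro": 1.25,
    "GPT-4o": 2.5,
    "Gemini 2.5 Flash": 0.075,
}

# Hypothetical workload: 250,000 input tokens.
for model, price in PRICES.items():
    print(f"{model}: ${input_cost_usd(price, 250_000):.4f}")
```

Because only input tokens are counted, the result is a lower bound on the real bill for any model that also charges for output.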
Metrics Glossary
- MMLU: Massive Multitask Language Understanding. Measures general knowledge and problem-solving across 57 subjects.
- HumanEval: Coding benchmark measuring functional correctness for synthesizing programs (pass@1).
- Arena Elo: LMSYS Chatbot Arena Elo score based on crowdsourced, blind human preference testing.
- Pricing: Prices shown are in USD per 1 million input tokens. Models with no per-token charge are listed as Free.
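Two of the metrics above have simple closed forms that help when reading the numbers. The sketch below uses the classical Elo expected-score formula and the standard unbiased pass@k estimator. Both function names are illustrative, and the Arena's own rating procedure may differ from classical Elo, so treat the win probability as an interpretation aid rather than the leaderboard's method.

```python
from math import comb


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Classical Elo: expected score of A against B (about the chance A is preferred)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per problem, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Ratings taken from the table: o3 (1350) vs GPT-4o (1285).
print(f"{elo_expected_score(1350, 1285):.2f}")

# Hypothetical run: 10 samples for one problem, 3 pass; pass@1 reduces to c / n.
print(f"{pass_at_k(10, 3, 1):.2f}")
```

A 65-point Elo gap corresponds to roughly a 59% expected preference rate under this formula, so small rating differences near the top of the table imply fairly close head-to-head results.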