
LLM / Coding

Compare model performance on coding benchmark tasks.

Updated: Mar 26, 2026, 3:00 PM
Method: Published benchmark snapshot
| Rank | Model | Provider | Usage | Benchmark | Score | Model slug | Context | Summary |
|---|---|---|---|---|---|---|---|---|
| #1 | GPT-5.4 | OpenAI | 57.3 | coding | 57.3 | openai/gpt-5.4 | 1.1M | GPT-5.4 is OpenAI’s latest frontier model, unifying the Codex and GPT lines into a single system. It features a 1M+ token context window… |
| #2 | Gemini 3.1 Pro Preview | Google | 55.5 | coding | 55.5 | google/gemini-3.1-pro-preview | 1.0M | Gemini 3.1 Pro Preview is Google’s frontier reasoning model, delivering enhanced software engineering performance, improved agentic relia… |
| #3 | GPT-5.3-Codex | OpenAI | 53.1 | coding | 53.1 | openai/gpt-5.3-codex | 400K | GPT-5.3-Codex is OpenAI’s most advanced agentic coding model, combining the frontier software engineering performance of GPT-5.2-Codex wi… |
| #4 | GPT-5.4 Mini | OpenAI | 51.5 | coding | 51.5 | openai/gpt-5.4-mini | 400K | GPT-5.4 mini brings the core capabilities of GPT-5.4 to a faster, more efficient model optimized for high-throughput workloads. It suppor… |
| #5 | Claude Sonnet 4.6 | Anthropic | 50.9 | coding | 50.9 | anthropic/claude-sonnet-4.6 | 1.0M | Sonnet 4.6 is Anthropic's most capable Sonnet-class model yet, with frontier performance across coding, agents, and professional work. It… |
| #6 | GPT-5.2 | OpenAI | 48.7 | coding | 48.7 | openai/gpt-5.2 | 400K | GPT-5.2 is the latest frontier-grade model in the GPT-5 series, offering stronger agentic and long-context performance compared to GPT-5.1… |
| #7 | Claude Opus 4.6 | Anthropic | 48.1 | coding | 48.1 | anthropic/claude-opus-4.6 | 1.0M | Opus 4.6 is Anthropic’s strongest model for coding and long-running professional tasks. It is built for agents that operate across entire… |
| #8 | Claude Opus 4.5 | Anthropic | 47.8 | coding | 47.8 | anthropic/claude-opus-4.5 | 200K | Claude Opus 4.5 is Anthropic’s frontier reasoning model optimized for complex software engineering, agentic workflows, and long-horizon c… |
| #9 | Gemini 2.5 Pro | Google | 46.7 | coding | 46.7 | google/gemini-2.5-pro-exp-03-25 | 1.0M | Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It emplo… |
| #10 | Gemini 3 Pro Preview | Google | 46.5 | coding | 46.5 | google/gemini-3-pro-preview | - | Coding benchmark score. |
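The following rough sketch shows one way to compare the entries above. It transcribes the snapshot's slugs, scores, and context sizes into a Python list. It then computes how far each model trails the leader and which model scores highest for each provider. The `ROWS` list and the helper names are hypothetical. Context sizes are approximate conversions of the rounded display values (for example, 1.1M becomes 1,100,000). The entry listed without a context size is stored as `None`.

```python
# Hypothetical transcription of the Mar 26, 2026 coding snapshot.
# Each row is (model slug, coding score, approximate context window in tokens).
ROWS = [
    ("openai/gpt-5.4", 57.3, 1_100_000),
    ("google/gemini-3.1-pro-preview", 55.5, 1_000_000),
    ("openai/gpt-5.3-codex", 53.1, 400_000),
    ("openai/gpt-5.4-mini", 51.5, 400_000),
    ("anthropic/claude-sonnet-4.6", 50.9, 1_000_000),
    ("openai/gpt-5.2", 48.7, 400_000),
    ("anthropic/claude-opus-4.6", 48.1, 1_000_000),
    ("anthropic/claude-opus-4.5", 47.8, 200_000),
    ("google/gemini-2.5-pro-exp-03-25", 46.7, 1_000_000),
    ("google/gemini-3-pro-preview", 46.5, None),  # no context size listed
]


def gaps_to_leader(rows):
    """Return (slug, points behind the top score) for every row, best first."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    top = ranked[0][1]
    # Round to one decimal place to match the precision of the published scores.
    return [(slug, round(top - score, 1)) for slug, score, _ in ranked]


def best_per_provider(rows):
    """Map each provider (the slug prefix before '/') to its highest-scoring (slug, score)."""
    best = {}
    for slug, score, _ in rows:
        provider = slug.split("/", 1)[0]
        if provider not in best or score > best[provider][1]:
            best[provider] = (slug, score)
    return best


if __name__ == "__main__":
    for slug, gap in gaps_to_leader(ROWS):
        print(f"{slug:<35} {gap:.1f} pts behind")
    for provider, (slug, score) in best_per_provider(ROWS).items():
        print(f"{provider:<10} {slug} ({score})")
```

In this snapshot, the per-provider view puts Claude Sonnet 4.6 (50.9) ahead of both Opus entries on the coding benchmark.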