LLM / Coding

Compare model performance on coding benchmark tasks.

Updated: Mar 26, 2026, 3:00 PM

Method: Published benchmark snapshot

Rank	Model (provider)	Usage	Benchmark	Score	Model slug	Context	Summary
#1	GPT-5.4 OpenAI	57.3	coding	57.3	openai/gpt-5.4	1.1M	GPT-5.4 is OpenAI’s latest frontier model, unifying the Codex and GPT lines into a single system. It features a 1M+ token context window…
#2	Gemini 3.1 Pro Preview Google	55.5	coding	55.5	google/gemini-3.1-pro-preview	1.0M	Gemini 3.1 Pro Preview is Google’s frontier reasoning model, delivering enhanced software engineering performance, improved agentic relia…
#3	GPT-5.3-Codex OpenAI	53.1	coding	53.1	openai/gpt-5.3-codex	400K	GPT-5.3-Codex is OpenAI’s most advanced agentic coding model, combining the frontier software engineering performance of GPT-5.2-Codex wi…
#4	GPT-5.4 Mini OpenAI	51.5	coding	51.5	openai/gpt-5.4-mini	400K	GPT-5.4 mini brings the core capabilities of GPT-5.4 to a faster, more efficient model optimized for high-throughput workloads. It suppor…
#5	Claude Sonnet 4.6 Anthropic	50.9	coding	50.9	anthropic/claude-sonnet-4.6	1.0M	Sonnet 4.6 is Anthropic's most capable Sonnet-class model yet, with frontier performance across coding, agents, and professional work. It…
#6	GPT-5.2 OpenAI	48.7	coding	48.7	openai/gpt-5.2	400K	GPT-5.2 is the latest frontier-grade model in the GPT-5 series, offering stronger agentic and long-context performance compared to GPT-5.1…
#7	Claude Opus 4.6 Anthropic	48.1	coding	48.1	anthropic/claude-opus-4.6	1.0M	Opus 4.6 is Anthropic’s strongest model for coding and long-running professional tasks. It is built for agents that operate across entire…
#8	Claude Opus 4.5 Anthropic	47.8	coding	47.8	anthropic/claude-opus-4.5	200K	Claude Opus 4.5 is Anthropic’s frontier reasoning model optimized for complex software engineering, agentic workflows, and long-horizon c…
#9	Gemini 2.5 Pro Google	46.7	coding	46.7	google/gemini-2.5-pro-exp-03-25	1.0M	Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It emplo…
#10	Gemini 3 Pro Preview (high) Google	46.5	coding	46.5	google/gemini-3-pro-preview	-	Coding benchmark score.
