
LLM / Agentic

Compare model performance on agentic benchmark tasks.

Updated: Mar 26, 2026, 3:00 PM

Method: Published benchmark snapshot

Rank	Model	Provider	Usage	Benchmark	Score	Model slug	Context	Summary
#1	GPT-5.4	OpenAI	69.4	agentic	69.4	openai/gpt-5.4	1.1M	GPT-5.4 is OpenAI’s latest frontier model, unifying the Codex and GPT lines into a single system. It features a 1M+ token context window…
#2	Claude Opus 4.6	Anthropic	67.6	agentic	67.6	anthropic/claude-opus-4.6	1.0M	Opus 4.6 is Anthropic’s strongest model for coding and long-running professional tasks. It is built for agents that operate across entire…
#3	GLM 5 Turbo	Z.ai	66.1	agentic	66.1	z-ai/glm-5-turbo	203K	GLM-5 Turbo is a new model from Z.ai designed for fast inference and strong performance in agent-driven environments such as OpenClaw sce…
#4	GLM 5	Z.ai	63.1	agentic	63.1	z-ai/glm-5	80K	GLM-5 is Z.ai’s flagship open-source foundation model engineered for complex systems design and long-horizon agent workflows. Built for e…
#5	Claude Sonnet 4.6	Anthropic	63	agentic	63	anthropic/claude-sonnet-4.6	1.0M	Sonnet 4.6 is Anthropic's most capable Sonnet-class model yet, with frontier performance across coding, agents, and professional work. It…
#6	MiMo-V2-Pro	Xiaomi	62.8	agentic	62.8	xiaomi/mimo-v2-pro	1.0M	MiMo-V2-Pro is Xiaomi's flagship foundation model, featuring over 1T total parameters and a 1M context length, deeply optimized for agent…
#7	GPT-5.3-Codex	OpenAI	62.2	agentic	62.2	openai/gpt-5.3-codex	400K	GPT-5.3-Codex is OpenAI’s most advanced agentic coding model, combining the frontier software engineering performance of GPT-5.2-Codex wi…
#8	MiniMax M2.7	MiniMax	61.5	agentic	61.5	minimax/minimax-m2.7	205K	MiniMax-M2.7 is a next-generation large language model designed for autonomous, real-world productivity and continuous improvement. Built…
#9	GPT-5.2	OpenAI	60.2	agentic	60.2	openai/gpt-5.2	400K	GPT-5.2 is the latest frontier-grade model in the GPT-5 series, offering stronger agentic and long-context performance compared to GPT-5.1…
#10	Claude Opus 4.5	Anthropic	59.6	agentic	59.6	anthropic/claude-opus-4.5	200K	Claude Opus 4.5 is Anthropic’s frontier reasoning model optimized for complex software engineering, agentic workflows, and long-horizon c…
