
LLM / Agentic

Compare model performance on agentic benchmark tasks.

Updated: Mar 26, 2026, 3:00 PM
Method: Published benchmark snapshot
| Rank | Model | Usage | Benchmark | Score | Model slug | Context | Summary |
| --- | --- | --- | --- | --- | --- | --- | --- |
| #1 | GPT-5.4 (OpenAI) | 69.4 | agentic | 69.4 | openai/gpt-5.4 | 1.1M | GPT-5.4 is OpenAI’s latest frontier model, unifying the Codex and GPT lines into a single system. It features a 1M+ token context window… |
| #2 | Claude Opus 4.6 (Anthropic) | 67.6 | agentic | 67.6 | anthropic/claude-opus-4.6 | 1.0M | Opus 4.6 is Anthropic’s strongest model for coding and long-running professional tasks. It is built for agents that operate across entire… |
| #3 | GLM-5 Turbo (Z.ai) | 66.1 | agentic | 66.1 | z-ai/glm-5-turbo | 203K | GLM-5 Turbo is a new model from Z.ai designed for fast inference and strong performance in agent-driven environments such as OpenClaw sce… |
| #4 | GLM 5 (Z.ai) | 63.1 | agentic | 63.1 | z-ai/glm-5 | 80K | GLM-5 is Z.ai’s flagship open-source foundation model engineered for complex systems design and long-horizon agent workflows. Built for e… |
| #5 | Claude Sonnet 4.6 (Anthropic) | 63 | agentic | 63 | anthropic/claude-sonnet-4.6 | 1.0M | Sonnet 4.6 is Anthropic's most capable Sonnet-class model yet, with frontier performance across coding, agents, and professional work. It… |
| #6 | MiMo-V2-Pro (Xiaomi) | 62.8 | agentic | 62.8 | xiaomi/mimo-v2-pro | 1.0M | MiMo-V2-Pro is Xiaomi's flagship foundation model, featuring over 1T total parameters and a 1M context length, deeply optimized for agent… |
| #7 | GPT-5.3-Codex (OpenAI) | 62.2 | agentic | 62.2 | openai/gpt-5.3-codex | 400K | GPT-5.3-Codex is OpenAI’s most advanced agentic coding model, combining the frontier software engineering performance of GPT-5.2-Codex wi… |
| #8 | MiniMax M2.7 (MiniMax) | 61.5 | agentic | 61.5 | minimax/minimax-m2.7 | 205K | MiniMax-M2.7 is a next-generation large language model designed for autonomous, real-world productivity and continuous improvement. Built… |
| #9 | GPT-5.2 (OpenAI) | 60.2 | agentic | 60.2 | openai/gpt-5.2 | 400K | GPT-5.2 is the latest frontier-grade model in the GPT-5 series, offering stronger agentic and long context performance compared to GPT-5.1… |
| #10 | Claude Opus 4.5 (Anthropic) | 59.6 | agentic | 59.6 | anthropic/claude-opus-4.5 | 200K | Claude Opus 4.5 is Anthropic’s frontier reasoning model optimized for complex software engineering, agentic workflows, and long-horizon c… |