benchmark evidence
Terminal-Bench 2.0
Agentic terminal task completion benchmark. Pass@1 success rate across shell tasks; published rows combine an agent scaffold with an underlying model.
winner on Terminal-Bench 2.0
direct benchmark result, not a broad vertical composite | source row dated 2026-05-15 | agent: vix
scored on 2026-05-15
latest mapped results | top 20
| # | Model | Score | Evidence | Tested |
|---|---|---|---|---|
| 1 | Anthropic: Claude Opus 4.7 | 90.2 | agent-dependent independent_benchmark | vix | 2026-05-15 |
| 2 | OpenAI: GPT-5.5 | 84.7 | agent-dependent independent_benchmark | NexAU-AHE | 2026-05-14 |
| 3 | Anthropic: Claude Opus 4.6 | 76.4 | agent-dependent independent_benchmark | Meta-Harness | 2026-05-14 |
| 4 | OpenAI: GPT-5.2 | 64.9 | agent-dependent independent_benchmark | Droid | 2025-12-24 |
| 5 | Anthropic: Claude Opus 4.5 | 63.1 | agent-dependent independent_benchmark | Droid | 2025-12-11 |
| 6 | Anthropic: Claude Sonnet 4.6 | 53.4 | agent-dependent independent_benchmark | Simplai Agent | 2026-05-14 |
| 7 | Z.ai: GLM 5 | 52.4 | agent-dependent independent_benchmark | Terminus 2 | 2026-02-23 |
| 8 | OpenAI: GPT-5 | 49.6 | agent-dependent independent_benchmark | Codex CLI | 2025-11-04 |
| 9 | OpenAI: GPT-5.1 | 47.6 | agent-dependent independent_benchmark | Terminus 2 | 2025-11-16 |
| 10 | Anthropic: Claude Sonnet 4.5 | 42.7 | agent-dependent independent_benchmark | MAYA-V2 | 2026-01-04 |
| 11 | DeepSeek: DeepSeek V3.2 | 39.6 | agent-dependent independent_benchmark | Terminus 2 | 2026-02-10 |
| 12 | MoonshotAI: Kimi K2 Thinking | 35.7 | agent-dependent independent_benchmark | Terminus 2 | 2025-11-11 |
| 13 | Anthropic: Claude Haiku 4.5 | 35.5 | agent-dependent independent_benchmark | Goose | 2025-12-11 |
| 14 | Google: Gemini 2.5 Pro | 32.6 | agent-dependent independent_benchmark | Terminus 2 | 2025-10-31 |
| 15 | Google: Gemini 2.5 Flash | 17.1 | agent-dependent independent_benchmark | Mini-SWE-Agent | 2025-11-03 |
what this result means
Agentic terminal task completion benchmark. Pass@1 success rate across shell tasks; published rows combine an agent scaffold with an underlying model.
This benchmark contributes direct public evidence. Read its scope before generalizing the result.
A win here is a win on Terminal-Bench 2.0. Broad task pages require independent corroboration before naming a general winner.
source record
category: coding
metric: accuracy
matched models: 15
latest source date: 2026-05-15
direction: higher is better