live
weekly refresh
basedagi.org
benchmark evidence

Terminal-Bench 2.0

Agentic terminal task completion benchmark. Pass@1 success rate across shell tasks; published rows combine an agent scaffold with an underlying model.

winner on Terminal-Bench 2.0
direct benchmark result, not a broad vertical composite | source row dated 2026-05-15 | agent: vix
scored on 2026-05-15
latest mapped results | top 20
#ModelScoreEvidenceTested
1Anthropic: Claude Opus 4.7
Anthropic
90.2
agent-dependent
independent_benchmark | vix
2026-05-15
2OpenAI: GPT-5.5
Openai
84.7
agent-dependent
independent_benchmark | NexAU-AHE
2026-05-14
3Anthropic: Claude Opus 4.6
Anthropic
76.4
agent-dependent
independent_benchmark | Meta-Harness
2026-05-14
4OpenAI: GPT-5.2
Openai
64.9
agent-dependent
independent_benchmark | Droid
2025-12-24
5Anthropic: Claude Opus 4.5
Anthropic
63.1
agent-dependent
independent_benchmark | Droid
2025-12-11
6Anthropic: Claude Sonnet 4.6
Anthropic
53.4
agent-dependent
independent_benchmark | Simplai Agent
2026-05-14
7Z.ai: GLM 5
Z Ai
52.4
agent-dependent
independent_benchmark | Terminus 2
2026-02-23
8OpenAI: GPT-5
Openai
49.6
agent-dependent
independent_benchmark | Codex CLI
2025-11-04
9OpenAI: GPT-5.1
Openai
47.6
agent-dependent
independent_benchmark | Terminus 2
2025-11-16
10Anthropic: Claude Sonnet 4.5
Anthropic
42.7
agent-dependent
independent_benchmark | MAYA-V2
2026-01-04
11DeepSeek: DeepSeek V3.2
Deepseek
39.6
agent-dependent
independent_benchmark | Terminus 2
2026-02-10
12MoonshotAI: Kimi K2 Thinking
Moonshotai
35.7
agent-dependent
independent_benchmark | Terminus 2
2025-11-11
13Anthropic: Claude Haiku 4.5
Anthropic
35.5
agent-dependent
independent_benchmark | Goose
2025-12-11
14Google: Gemini 2.5 Pro
Google
32.6
agent-dependent
independent_benchmark | Terminus 2
2025-10-31
15Google: Gemini 2.5 Flash
Google
17.1
agent-dependent
independent_benchmark | Mini-SWE-Agent
2025-11-03
what this result means

Agentic terminal task completion benchmark. Pass@1 success rate across shell tasks; published rows combine an agent scaffold with an underlying model.

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on Terminal-Bench 2.0. Broad task pages require independent corroboration before naming a general winner.

source record
category: coding
metric: accuracy
matched models: 15
latest source date: 2026-05-15
direction: higher is better
inspect upstream source ->