BasedAGI
Rankings live

developer_tools

Autonomous Coding Agent

End-to-end autonomous software engineering: reading issues, writing code, running tests, submitting PRs.

#1 Recommendation

Kimi K2 Thinking

Strong on the SWE-bench Verified Leaderboard (swe_verified_resolved_pct, 80.2%) and the Sonar Java Quality Leaderboard (functional_skill_pct, 88.4%).

external/kimi/kimi-k2-thinking

Score: 16.8%

Confidence: 42.9%

Limited benchmark evidence for this use case.

25 ranked models with average evidence of 17.2 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 25

Evidence Quality: 82%

Scoring: Benchmark-backed

Top Signal: SWE-bench Verified Leaderboard: swe_verified_resolved_pct

All Ranked Models

Rank  Model                            Score
#8    Kimi K2 Thinking                 16.8%
#9    GLM-5                            16.8%
#10   anthropic/claude-sonnet-4.6      16.6%
#13   gemini-3-pro-preview             15.2%
#15   gemini-2.5-pro                   14.3%
#16   openai/gpt-4.1                   14.1%
#17   kimi/kimi-k2.5-thinking          13.9%
#18   gpt-4.1-20250414                 13.5%
#19   claude-opus-4-5-20251101         13.4%
#21   gpt-5.2-2025-12-11               12.8%
#24   minimax/minimax-m2.1             11.2%
#25   gpt-4o                           11.1%
#26   deepseek/deepseek-r1             10.6%
#28   o3-20250416                      10.1%
#30   Grok-4-0709                      9.2%
#31   claude-sonnet-4-20250514         9.1%
#32   gpt-4.1-mini-20250414            8.9%
#33   gpt-4o-20241120                  8.8%
#34   z-ai/glm-4.7                     8.7%
#35   Kimi-K2-Instruct                 8.6%
#36   gpt-4o-2024-05-13                8.4%
#37   gpt-4o-2024-08-06                8.2%
#39   o4-mini-20250416                 7.6%
#40   GLM-4.7                          7.1%
#48   openai/gpt-4o-mini-2024-07-18    2.6%

Compare Models

Model A leads by +0.0%

Model A

Kimi K2 Thinking

external/kimi/kimi-k2-thinking

16.8%

Rank #8

Confidence 42.9% · 26 evidence pts

SWE-bench Verified Leaderboard: swe_verified_resolved_pct

Value 80.2% · Conf 100.0% · Weight 4.2%

swebench_verified_official.swe_verified_resolved_pct (Mar 17, 2026)

Sonar Java Quality Leaderboard: functional_skill_pct

Value 88.4% · Conf 100.0% · Weight 1.8%

sonar_java_quality.functional_skill_pct (Mar 17, 2026)

Vals SWE-bench: overall_accuracy_pct

Value 63.5% · Conf 100.0% · Weight 0.7%

vals_swebench.overall_accuracy_pct (Mar 17, 2026)

Sonar Java Quality Leaderboard: issue_density_error_per_kloc

Value 66.6% · Conf 100.0% · Weight 0.7%

sonar_java_quality.issue_density_error_per_kloc (Mar 17, 2026)

Model B

GLM-5

zai-org/GLM-5

16.8%

Rank #9

Confidence 29.8% · 17 evidence pts

OpenHands Issue Resolution: issue_resolution_score_pct

Value 59.0% · Conf 100.0% · Weight 2.4%

openhands_issue_resolution.issue_resolution_score_pct (Mar 17, 2026)

Sonar Java Quality Leaderboard: functional_skill_pct

Value 91.6% · Conf 100.0% · Weight 1.8%

sonar_java_quality.functional_skill_pct (Mar 17, 2026)

OpenHands Index: average_score_pct

Value 36.5% · Conf 100.0% · Weight 1.4%

openhands_index.average_score_pct (Mar 17, 2026)

Sonar Java Quality Leaderboard: issue_density_error_per_kloc

Value 100.0% · Conf 100.0% · Weight 1.1%

sonar_java_quality.issue_density_error_per_kloc (Mar 17, 2026)
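
Each signal above lists a value, a confidence, and a weight, but the page does not state how these combine into the 16.8% score. The sketch below assumes, purely for illustration, that a signal contributes value × confidence × weight; the `Signal` class and the `contribution` and `partial_score` helpers are hypothetical names, not part of the site. The numbers are copied from the two comparison panels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One benchmark signal as shown in a comparison panel."""
    name: str
    value_pct: float
    conf_pct: float
    weight_pct: float


def contribution(signal: Signal) -> float:
    # ASSUMPTION: contribution = value * confidence * weight.
    # The page does not document its aggregation formula.
    return signal.value_pct * (signal.conf_pct / 100) * (signal.weight_pct / 100)


def partial_score(signals: list[Signal]) -> float:
    """Sum of assumed contributions over the listed signals only."""
    return sum(contribution(s) for s in signals)


KIMI_K2_THINKING = [
    Signal("swebench_verified_official.swe_verified_resolved_pct", 80.2, 100.0, 4.2),
    Signal("sonar_java_quality.functional_skill_pct", 88.4, 100.0, 1.8),
    Signal("vals_swebench.overall_accuracy_pct", 63.5, 100.0, 0.7),
    Signal("sonar_java_quality.issue_density_error_per_kloc", 66.6, 100.0, 0.7),
]

GLM_5 = [
    Signal("openhands_issue_resolution.issue_resolution_score_pct", 59.0, 100.0, 2.4),
    Signal("sonar_java_quality.functional_skill_pct", 91.6, 100.0, 1.8),
    Signal("openhands_index.average_score_pct", 36.5, 100.0, 1.4),
    Signal("sonar_java_quality.issue_density_error_per_kloc", 100.0, 100.0, 1.1),
]

if __name__ == "__main__":
    for label, signals in (("Kimi K2 Thinking", KIMI_K2_THINKING), ("GLM-5", GLM_5)):
        print(f"{label}: {partial_score(signals):.2f} points from the listed signals")
```

Under this assumption the four listed signals sum to roughly 5.9 points for Kimi K2 Thinking and 4.7 for GLM-5, both well below the displayed 16.8%. The remaining evidence points (26 and 17 in total) or a different aggregation rule would have to account for the difference.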

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 25

Sources: 8

Quality: Insufficient

Source                          ID                          Rows  Avg lift
Vals LiveCodeBench              vals_lcb                    16    0.8%
SWE-bench Verified Leaderboard  swebench_verified_official  14    3.3%
Vals SWE-bench                  vals_swebench               14    0.8%
Vals Legal Bench                vals_legal_bench            13    0.2%
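
To put the per-source lift figures on one scale, the sketch below computes a row-weighted mean over the four sources listed here. This is only an illustration: the page does not define "avg lift", only four of the eight sources are shown, and `row_weighted_lift` and `top_source` are hypothetical helper names.

```python
# (source id, rows, average lift in percentage points), copied from the listing above.
SOURCE_LIFT = [
    ("vals_lcb", 16, 0.8),
    ("swebench_verified_official", 14, 3.3),
    ("vals_swebench", 14, 0.8),
    ("vals_legal_bench", 13, 0.2),
]


def row_weighted_lift(sources):
    """Mean lift across sources, weighting each source by its row count."""
    total_rows = sum(rows for _, rows, _ in sources)
    if total_rows == 0:
        raise ValueError("no rows to average over")
    return sum(rows * lift for _, rows, lift in sources) / total_rows


def top_source(sources):
    """Source id with the highest average lift."""
    return max(sources, key=lambda s: s[2])[0]


if __name__ == "__main__":
    print(f"row-weighted lift: {row_weighted_lift(SOURCE_LIFT):.2f}%")
    print(f"highest-lift source: {top_source(SOURCE_LIFT)}")
```

The highest-lift source in this listing is swebench_verified_official, which matches the Top Signal reported for this use case.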

Missing Strong Models

Model                          ID                                      Rank  Score  Note
gpt-5-2025-08-07               external/openai/gpt-5-2025-08-07        #6    19.2%  Thin evidence after weighting
gpt-5-mini-2025-08-07          external/openai/gpt-5-mini-2025-08-07   #7    19.1%  Thin evidence after weighting
google/gemini-3.1-pro-preview  external/google/gemini-3-1-pro-preview  #8    18.6%  Thin evidence after weighting
openai/gpt-5.4-2026-03-05      external/openai/gpt-5-4-2026-03-05      #9    18.3%  Thin evidence after weighting

Taxonomy Details

Core Tasks

task.agentic_multi_step_completion
task.code_generation

Required Modes

mode.tool_calling

Domains

domain.software_engineering
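
The taxonomy tags for this use case can be held in a small record and used to check whether a model supports the required modes. This is a minimal sketch: the `UseCase` class, the `missing_modes` helper, and the idea of a per-model set of mode tags are all assumptions for illustration, not something the page describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """Taxonomy tags for one use case, as listed under Taxonomy Details."""
    name: str
    core_tasks: frozenset
    required_modes: frozenset
    domains: frozenset


AUTONOMOUS_CODING_AGENT = UseCase(
    name="Autonomous Coding Agent",
    core_tasks=frozenset({"task.agentic_multi_step_completion", "task.code_generation"}),
    required_modes=frozenset({"mode.tool_calling"}),
    domains=frozenset({"domain.software_engineering"}),
)


def missing_modes(use_case: UseCase, model_modes) -> frozenset:
    """Required modes that the model does not declare (hypothetical check)."""
    return use_case.required_modes - frozenset(model_modes)


if __name__ == "__main__":
    # "mode.example_other" is a made-up tag used only to exercise the check.
    print(missing_modes(AUTONOMOUS_CODING_AGENT, {"mode.tool_calling", "mode.example_other"}))
    print(missing_modes(AUTONOMOUS_CODING_AGENT, set()))
```

An empty result means every required mode is covered; a non-empty result names the modes a model would still need.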