BasedAGI

Best LLM for Autonomous Coding

Benchmark-backed ranking of models for end-to-end autonomous software engineering and issue resolution.

#1 Recommendation

gpt-5-2025-08-07

Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct

external/openai/gpt-5-2025-08-07

Score: 22.9%
Confidence: 26.7%
Evidence: 33
Ranked Models: 30
Evidence Quality: 92%
Evidence Points: 33
Top Signal: SWE-bench Verified Leaderboard (swe_verified_resolved_pct)
Benchmark Sources: 55
Last Updated: 5h ago

All Ranked Models

30 of 30 models
| Rank | Model | Score | Strong on |
|---|---|---|---|
| 🥉 | gpt-5-2025-08-07 | 22.9% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #4 | claude-opus-4.7 | 21.2% | OpenHands Index average_score_pct and OpenHands Issue Resolution issue_resolution_score_pct |
| #5 | claude-sonnet-4 | 20.3% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #8 | gemini-3-pro-preview | 18.8% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #10 | Kimi K2 Thinking | 18.6% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #12 | GLM-5 | 17.6% | SWE-bench Leaderboard verified_resolved_pct and OpenHands Issue Resolution issue_resolution_score_pct |
| #13 | o3-20250416 | 17.6% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Aider Polyglot Leaderboard percent_correct_pct |
| #14 | kimi-k2.5-thinking | 16.6% | SWE-bench Leaderboard verified_resolved_pct and OpenHands Issue Resolution issue_resolution_score_pct |
| #15 | gpt-5.2-2025-12-11 | 16.3% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #16 | o4-mini | 15.7% | SWE-bench Leaderboard verified_resolved_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct |
| #19 | claude-opus-4-5-20251101 | 15.2% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #21 | gpt-4.1-20250414 | 14.6% | Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct |
| #23 | gemini-2.5-pro | 14.1% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #25 | claude-sonnet-4.6 | 13.5% | OpenHands Issue Resolution issue_resolution_score_pct and OpenHands Index greenfield_score_pct |
| #26 | qwen-2.5-coder32b-instruct | 13.5% | BigCode Models Leaderboard average_score and BigCodeBench Official bigcodebench_complete_pct |
| #28 | Kimi-K2-Instruct | 12.9% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #29 | GLM-4.6 | 12.7% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and Berkeley Function Calling Leaderboard (Overall) Overall Acc |
| #31 | GLM-5.1 | 12.6% | OpenHands Issue Resolution issue_resolution_score_pct and OpenHands Index average_score_pct |
| #33 | gpt-4o | 12.0% | TestEval Leaderboard overall_average_coverage_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct |
| #36 | qwen-2.5-72b-instruct | 11.9% | Galileo Agent Leaderboard v2 Avg AC and Open LLM Leaderboard MMLU-Pro mmlu_pro_accuracy_pct |
| #37 | gpt-5-mini-2025-08-07 | 11.7% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #39 | minimax-m2.1 | 10.9% | OpenHands Issue Resolution issue_resolution_score_pct and Sonar Java Quality Leaderboard functional_skill_pct |
| #40 | gemini-2.5-flash | 10.2% | Berkeley Function Calling Leaderboard (Overall) Overall Acc and SWE-bench Leaderboard verified_resolved_pct |
| #42 | Grok-4-0709 | 10.1% | Berkeley Function Calling Leaderboard (Overall) Overall Acc and Galileo Agent Leaderboard v2 Avg AC |
| #43 | gemini-3-flash-preview | 9.7% | SWE-bench Leaderboard verified_resolved_pct and Vals LiveCodeBench overall_accuracy_pct |
| #50 | deepseek-r1 | 8.8% | Aider Polyglot Leaderboard percent_correct_pct and Sonar Java Quality Leaderboard functional_skill_pct |
| #52 | gpt-5.1-2025-11-13 | 8.5% | SWE-bench Leaderboard verified_resolved_pct and Vals LiveCodeBench overall_accuracy_pct |
| #55 | gpt-4o-2024-08-06 | 8.0% | SWE-bench Leaderboard verified_resolved_pct and BigCodeBench Official bigcodebench_hard_complete_pct |
| #58 | GLM-4.5 | 7.8% | SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct |
| #60 | gpt-4.1-mini-20250414 | 7.5% | Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct |

Head-to-Head: #1 vs #2

#3 gpt-5-2025-08-07 (Top Pick)

Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct

Score: 22.9%, Confidence: 26.7%

#4 anthropic/claude-opus-4.7

Strong on OpenHands Index average_score_pct and OpenHands Issue Resolution issue_resolution_score_pct

Score: 21.2%, Confidence: 22.7%
