Best LLM for Autonomous Coding
Benchmark-backed ranking of models for end-to-end autonomous software engineering and issue resolution.
#1 Recommendation: gpt-5-2025-08-07 (external/openai/gpt-5-2025-08-07)
Strong on SWE-bench Verified Leaderboard (swe_verified_resolved_pct) and SWE-bench Leaderboard (verified_resolved_pct).

- Score: 22.9%
- Confidence: 26.7%
- Evidence Points: 33
- Ranked Models: 30
- Evidence Quality: 92%
- Top Signal: SWE-bench Verified Leaderboard (swe_verified_resolved_pct)
- Benchmark Sources: 55
- Last Updated: 5h ago
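The composite score and confidence above are aggregates of per-benchmark evidence. The page does not state its exact method, so the following is a minimal sketch of one plausible aggregation, assuming a weighted mean over normalized benchmark scores; the `composite_score` function, benchmark values, and weights are all illustrative, not the site's actual pipeline.

```python
def composite_score(evidence: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-benchmark scores (0-100 scale).

    Benchmarks absent from `weights` default to weight 1.0.
    """
    total_w = sum(weights.get(name, 1.0) for name in evidence)
    return sum(score * weights.get(name, 1.0)
               for name, score in evidence.items()) / total_w

# Illustrative inputs: metric names mirror the signals in the table below,
# but the raw values and weights are made up for this example.
evidence = {
    "swe_verified_resolved_pct": 74.9,
    "verified_resolved_pct": 71.2,
}
weights = {"swe_verified_resolved_pct": 2.0, "verified_resolved_pct": 1.0}
print(round(composite_score(evidence, weights), 1))  # → 73.7
```

A real aggregator would also normalize across benchmarks with different scales and discount stale or low-quality evidence, which is presumably what the "Evidence Quality" figure reflects.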
All Ranked Models
| Rank | Model | Strong On | Score |
|---|---|---|---|
| #3 🥉 | gpt-5-2025-08-07 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 22.9% |
| #4 | claude-opus-4.7 | OpenHands Index (average_score_pct); OpenHands Issue Resolution (issue_resolution_score_pct) | 21.2% |
| #5 | claude-sonnet-4 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 20.3% |
| #8 | gemini-3-pro-preview | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 18.8% |
| #10 | Kimi K2 Thinking | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 18.6% |
| #12 | GLM-5 | SWE-bench Leaderboard (verified_resolved_pct); OpenHands Issue Resolution (issue_resolution_score_pct) | 17.6% |
| #13 | o3-20250416 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Aider Polyglot Leaderboard (percent_correct_pct) | 17.6% |
| #14 | kimi-k2.5-thinking | SWE-bench Leaderboard (verified_resolved_pct); OpenHands Issue Resolution (issue_resolution_score_pct) | 16.6% |
| #15 | gpt-5.2-2025-12-11 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 16.3% |
| #16 | o4-mini | SWE-bench Leaderboard (verified_resolved_pct); SWE-bench Verified Leaderboard (swe_verified_resolved_pct) | 15.7% |
| #19 | claude-opus-4-5-20251101 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 15.2% |
| #21 | gpt-4.1-20250414 | Galileo Agent Leaderboard v2 (Avg AC); SWE-bench Verified Leaderboard (swe_verified_resolved_pct) | 14.6% |
| #23 | gemini-2.5-pro | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 14.1% |
| #25 | claude-sonnet-4.6 | OpenHands Issue Resolution (issue_resolution_score_pct); OpenHands Index (greenfield_score_pct) | 13.5% |
| #26 | qwen-2.5-coder32b-instruct | BigCode Models Leaderboard (average_score); BigCodeBench Official (bigcodebench_complete_pct) | 13.5% |
| #28 | Kimi-K2-Instruct | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 12.9% |
| #29 | GLM-4.6 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); Berkeley Function Calling Leaderboard Overall (Overall Acc) | 12.7% |
| #31 | GLM-5.1 | OpenHands Issue Resolution (issue_resolution_score_pct); OpenHands Index (average_score_pct) | 12.6% |
| #33 | gpt-4o | TestEval Leaderboard (overall_average_coverage_pct); SWE-bench Verified Leaderboard (swe_verified_resolved_pct) | 12.0% |
| #36 | qwen-2.5-72b-instruct | Galileo Agent Leaderboard v2 (Avg AC); Open LLM Leaderboard MMLU-Pro (mmlu_pro_accuracy_pct) | 11.9% |
| #37 | gpt-5-mini-2025-08-07 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 11.7% |
| #39 | minimax-m2.1 | OpenHands Issue Resolution (issue_resolution_score_pct); Sonar Java Quality Leaderboard (functional_skill_pct) | 10.9% |
| #40 | gemini-2.5-flash | Berkeley Function Calling Leaderboard Overall (Overall Acc); SWE-bench Leaderboard (verified_resolved_pct) | 10.2% |
| #42 | Grok-4-0709 | Berkeley Function Calling Leaderboard Overall (Overall Acc); Galileo Agent Leaderboard v2 (Avg AC) | 10.1% |
| #43 | gemini-3-flash-preview | SWE-bench Leaderboard (verified_resolved_pct); Vals LiveCodeBench (overall_accuracy_pct) | 9.7% |
| #50 | deepseek-r1 | Aider Polyglot Leaderboard (percent_correct_pct); Sonar Java Quality Leaderboard (functional_skill_pct) | 8.8% |
| #52 | gpt-5.1-2025-11-13 | SWE-bench Leaderboard (verified_resolved_pct); Vals LiveCodeBench (overall_accuracy_pct) | 8.5% |
| #55 | gpt-4o-2024-08-06 | SWE-bench Leaderboard (verified_resolved_pct); BigCodeBench Official (bigcodebench_hard_complete_pct) | 8.0% |
| #58 | GLM-4.5 | SWE-bench Verified Leaderboard (swe_verified_resolved_pct); SWE-bench Leaderboard (verified_resolved_pct) | 7.8% |
| #60 | gpt-4.1-mini-20250414 | Galileo Agent Leaderboard v2 (Avg AC); SWE-bench Verified Leaderboard (swe_verified_resolved_pct) | 7.5% |
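When working with a ranking like the one above programmatically, a common first step is shortlisting models that clear a score threshold. A minimal sketch, assuming the (rank, model, score) tuples are copied by hand from the table; the `shortlist` helper is hypothetical, not an API this page provides.

```python
# Top rows of the table above, copied as (rank, model, score) tuples.
ranked = [
    (3, "gpt-5-2025-08-07", 22.9),
    (4, "claude-opus-4.7", 21.2),
    (5, "claude-sonnet-4", 20.3),
    (8, "gemini-3-pro-preview", 18.8),
    (10, "Kimi K2 Thinking", 18.6),
]

def shortlist(rows, min_score: float) -> list[str]:
    """Return model names meeting the score threshold, best first."""
    return [name
            for _, name, score in sorted(rows, key=lambda r: -r[2])
            if score >= min_score]

print(shortlist(ranked, 20.0))
# → ['gpt-5-2025-08-07', 'claude-opus-4.7', 'claude-sonnet-4']
```

Sorting before filtering keeps the output stable even if the source rows are pasted out of order.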
Head-to-Head: #3 vs #4

- #3 (Top Pick): gpt-5-2025-08-07. Strong on SWE-bench Verified Leaderboard (swe_verified_resolved_pct) and SWE-bench Leaderboard (verified_resolved_pct). Confidence: 26.7%
- #4: anthropic/claude-opus-4.7. Strong on OpenHands Index (average_score_pct) and OpenHands Issue Resolution (issue_resolution_score_pct). Confidence: 22.7%
Related Lookups
- Best LLM for Code Generation: Benchmark-backed ranking of models for generating correct, secure code from requirements.
- Best LLM for Debugging: Find the top-ranked models for localizing bugs and proposing fixes with explanations.
- Best LLM for Unit Test Generation: Ranked models for generating meaningful unit tests and edge cases from code.
- Best LLM for Code Review: Compare models for automated PR review covering correctness, security, and maintainability.
- Best LLM for Function Calling: Compare models for reliable tool use, function selection, and multi-step API orchestration.
- Best LLM for Refactoring: Ranked models for safely refactoring code while preserving behavior and improving clarity.