Task-based recommendation
Best LLM for Coding in 2025
We compared AI models on SWE-Bench Verified, HumanEval, MBPP, and real-world coding tasks. Here are the best models for software development, ranked by coding capability, cost, speed, and reliability.
Last updated: May 2025 · Methodology
Our Pick
Claude Sonnet 4.6 — Best LLM for Coding
Claude Sonnet 4.6 leads coding benchmarks with strong performance on SWE-Bench Verified, excellent frontend/React generation, and good debugging. At $3/$15 per million tokens (input/output), it offers the best combination of quality and cost for most developers.
Who this is for
Developers
Code generation, refactoring, debugging, and pair programming in your IDE.
Technical founders
Rapid prototyping, API design, and full-stack development on a budget.
Agencies & teams
Frontend generation, backend boilerplate, and consistent code across projects.
Top Coding Models Ranked
| Rank | Model | Provider | Coding Score | Notes |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 91 | Best overall coding model. Strong SWE-Bench and frontend generation. |
| 2 | GPT-4.1 | OpenAI | 88 | Excellent coding with strong instruction following. Good API reliability. |
| 3 | Claude Opus 4.7 | Anthropic | 83 | Strongest for complex multi-file refactors and architecture. |
| 4 | DeepSeek V3 | DeepSeek | 82 | Best open-weight coding model. Competes with GPT-4o at much lower cost. |
| 5 | Gemini 2.5 Pro | Google | 79 | Huge context window useful for multi-file codebase analysis. |
| 6 | Llama 4 Maverick | Meta | 76 | Strong open model for coding with good value. |
| 7 | Codestral 2 | Mistral | 74 | Specialized code generation model. Fast and focused. |
| 8 | Mistral Large 3 | Mistral | 72 | Solid coding with good multilingual support. |
How we evaluate coding models
Benchmarks we use
- SWE-Bench Verified — Real GitHub issue resolution (primary)
- HumanEval — Function synthesis from docstrings
- MBPP — Python programming benchmarks
- BFCL — Function calling and tool use accuracy
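Synthesis benchmarks like HumanEval and MBPP are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. As a minimal sketch, here is the standard unbiased estimator introduced with HumanEval, assuming you generate n samples per problem and c of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: samples generated per problem
    c: samples that passed the unit tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 37 passing -> pass@1 estimate of 0.185
print(pass_at_k(n=200, c=37, k=1))
```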
Weighting
- Coding benchmarks: 45%
- Reasoning: 15%
- Structured output (JSON/tool calling): 15%
- Value (quality per dollar): 10%
- Speed: 10%
- Context window: 5%
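To make the weighting concrete, here is a minimal sketch of how a composite coding score could be computed from per-dimension scores. The dimension keys and example values are invented for illustration; this is not our exact pipeline.

```python
# Weights from the list above; they sum to 100%.
WEIGHTS = {
    "coding": 0.45,
    "reasoning": 0.15,
    "structured_output": 0.15,
    "value": 0.10,
    "speed": 0.10,
    "context": 0.05,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each normalized to 0-100."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 100%
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Made-up dimension scores for a hypothetical model
print(composite_score({
    "coding": 93, "reasoning": 90, "structured_output": 92,
    "value": 70, "speed": 85, "context": 80,
}))  # ~88.65
```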
Recommendations by persona
For professional developers
Claude Sonnet 4.6 or GPT-4.1. Both provide excellent code generation, debugging, and multi-file reasoning. Sonnet scores higher on frontend; GPT-4.1 has a larger ecosystem.
On a budget
DeepSeek V3 (~$0.27/M in, ~$0.40/M out) delivers coding quality competitive with GPT-4o at a fraction of the cost. GPT-4o Mini is another good cheap option.
For local / private coding
DeepSeek V3 (a 671B-parameter MoE, so self-hosting requires multi-GPU server hardware), Llama 4 Maverick, or Codestral 2. All have open weights and can be self-hosted.
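Whichever model you self-host, most serving stacks (vLLM, llama.cpp's server, and similar) expose an OpenAI-compatible endpoint, so client code stays the same. A minimal sketch; the base URL, API key, and model id below are placeholders for whatever your own server registers:

```python
from openai import OpenAI

# Point the standard OpenAI client at a self-hosted endpoint.
# All three values below are placeholders for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",  # use the id your server exposes
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a singly linked list."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```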
Fastest coding
DeepSeek V3, Claude Haiku 4.5, and Gemini 2.5 Flash offer the best combination of coding quality and low latency.
SWE-Bench Verified — Real Coding Benchmark
BasedAGI now uses real SWE-Bench Verified data — a 500-instance benchmark of real-world GitHub issue resolution tasks. This is the first real quality signal powering our coding rankings.
What it measures
- Real GitHub issues from popular Python repos
- Human-filtered Verified subset (500 instances)
- Resolved rate: % of issues correctly patched
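The headline number is simple: resolved rate is the share of instances where the model's patch made the issue's tests pass. A toy sketch (the result format is invented for illustration):

```python
def resolved_rate(results: list[dict]) -> float:
    """Percentage of SWE-bench instances marked resolved
    (patch applied and the issue's tests passed)."""
    if not results:
        return 0.0
    return 100.0 * sum(r["resolved"] for r in results) / len(results)

# Toy example: 3 of 5 instances resolved -> 60.0
print(resolved_rate([{"resolved": True}, {"resolved": False},
                     {"resolved": True}, {"resolved": True},
                     {"resolved": False}]))
```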
Important caveats
- Many results use agent/scaffold setups (mini-swe-agent, SWE-agent, OpenHands)
- Agent-assisted scores may exceed pure model capability
- Coding coverage is partial — other benchmarks remain mock/sample
- Only 6 models currently have real SWE-Bench data
Coding model pricing at a glance
Approximate monthly cost, from light usage (~1M input + 200K output tokens per month) up to roughly five times that volume. Use our pricing calculator for exact estimates.
| Model | Input $/1M | Output $/1M | Est. Monthly (light to heavy) |
|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | $3.60–$18.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $6.00–$30.00 |
| Claude Opus 4.7 | $15.00 | $75.00 | $30.00–$150.00 |
| DeepSeek V3 | $0.27 | $0.40 | $0.35–$1.75 |
| Mistral Large 3 | $4.00 | $12.00 | $6.40–$32.00 |
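The estimates above are straightforward to reproduce: multiply usage in millions of tokens by the per-million price for each direction. A minimal sketch using the table's prices:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_price: float, out_price: float) -> float:
    """USD per month: usage in millions of tokens times $/1M prices."""
    return input_mtok * in_price + output_mtok * out_price

# Light usage from the table: ~1M input + 200K output tokens/month
print(round(monthly_cost(1.0, 0.2, 3.00, 15.00), 2))  # Claude Sonnet 4.6 -> 6.0
print(round(monthly_cost(1.0, 0.2, 0.27, 0.40), 2))   # DeepSeek V3 -> 0.35
```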
Methodology note
Scores are normalized composites from public benchmarks (SWE-Bench, HumanEval, MBPP, BFCL). We weight by recency, source confidence, and task relevance. Scores marked with low confidence indicate limited benchmark coverage. Models within 2 points are considered effectively tied. We do not rely on vendor-reported scores alone. Full methodology.
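Two pieces of that note are easy to make precise: normalization (raw benchmark numbers rescaled onto a common 0-100 scale across the model pool) and the 2-point tie rule. A minimal sketch, with min-max chosen as an illustrative normalization; the exact scheme we use may differ:

```python
def min_max_normalize(raw: dict[str, float]) -> dict[str, float]:
    """Rescale raw benchmark scores to 0-100 across the model pool."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {model: 100.0 for model in raw}
    return {model: 100.0 * (v - lo) / (hi - lo) for model, v in raw.items()}

def effectively_tied(a: float, b: float, margin: float = 2.0) -> bool:
    """Scores within 2 points count as a tie under our methodology."""
    return abs(a - b) <= margin

print(effectively_tied(91, 88))  # False: a 3-point gap is a real difference
print(effectively_tied(91, 90))  # True: within the 2-point margin
```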
Frequently Asked Questions
Which LLM is best for coding in 2025?
Claude Sonnet 4.6 currently leads coding benchmarks including SWE-Bench Verified. However, GPT-4.1 is nearly tied. Your choice depends on whether you prioritize Sonnet's stronger frontend generation or GPT-4.1's API reliability and ecosystem. DeepSeek V3 is the best open-weight option at significantly lower cost.
Is an open-weight LLM good enough for coding?
Yes. DeepSeek V3, Llama 4 Maverick, and Mistral Large 3 are all strong for coding. They are slightly behind the best proprietary models on SWE-Bench but offer much lower cost and the ability to self-host.
What benchmarks matter for coding?
SWE-Bench Verified (real GitHub issue resolution), HumanEval (function synthesis), MBPP (Python programming), and the Berkeley Function Calling Leaderboard (tool use). We weight these based on how well they predict real-world coding performance.
How much does it cost to use an LLM for coding?
For a developer using ~1M input + 200K output tokens per month: GPT-4.1 costs ~$3.60/mo, Claude Sonnet 4.6 ~$6/mo, and DeepSeek V3 ~$0.35/mo at the rates in our pricing table. Prices vary by provider and change frequently. Check our pricing calculator for current rates.
Which model is fastest for coding?
DeepSeek V3 and Claude Haiku 4.5 are among the fastest. GPT-4o Mini and Gemini 2.5 Flash are also fast options with good coding capabilities. See our speed leaderboard for current latency data.