Task-based recommendation
Best LLM for Coding in 2025
We compared AI models on SWE-Bench Verified, HumanEval, MBPP, and real-world coding tasks. Here are the best models for software development, ranked by coding capability, cost, speed, and reliability.
Last updated: May 2025 · Methodology
Our Pick
Claude Sonnet 4.6 — Best LLM for Coding
Claude Sonnet 4.6 leads coding benchmarks with strong performance on SWE-Bench Verified, excellent frontend/React generation, and good debugging. At $3/$15 per million tokens (input/output), it offers the best combination of quality and cost for most developers.
Who this is for
Developers
Code generation, refactoring, debugging, and pair programming in your IDE.
Technical founders
Rapid prototyping, API design, and full-stack development on a budget.
Agencies & teams
Frontend generation, backend boilerplate, and consistent code across projects.
Top Coding Models Ranked
| Rank | Model | Provider | Coding Score | Notes |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 91 | Best overall coding model. Strong SWE-Bench and frontend generation. |
| 2 | GPT-4.1 | OpenAI | 88 | Excellent coding with strong instruction following. Good API reliability. |
| 3 | Claude Opus 4.7 | Anthropic | 83 | Strongest for complex multi-file refactors and architecture. |
| 4 | DeepSeek V3 | DeepSeek | 82 | Best open-weight coding model. Competes with GPT-4o at much lower cost. |
| 5 | Gemini 2.5 Pro | Google | 79 | Huge context window useful for multi-file codebase analysis. |
| 6 | Llama 4 Maverick | Meta | 76 | Strong open model for coding with good value. |
| 7 | Codestral 2 | Mistral | 74 | Specialized code generation model. Fast and focused. |
| 8 | Mistral Large 3 | Mistral | 72 | Solid coding with good multilingual support. |
How we evaluate coding models
Benchmarks we use
- SWE-Bench Verified — Real GitHub issue resolution (primary)
- HumanEval — Function synthesis from docstrings
- MBPP — Python programming benchmarks
- BFCL — Function calling and tool use accuracy
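Synthesis benchmarks like HumanEval and MBPP are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. As a minimal sketch, here is the standard unbiased estimator introduced with HumanEval, assuming you generate n samples per problem and c of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: samples generated per problem
    c: samples that passed the unit tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 37 passing -> pass@1 estimate of 0.185
print(pass_at_k(n=200, c=37, k=1))
```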
Weighting
- Coding benchmarks: 45%
- Reasoning: 15%
- Structured output (JSON/tool calling): 15%
- Value (quality per dollar): 10%
- Speed: 10%
- Context window: 5%
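To make the weighting concrete, here is a minimal sketch of how a composite coding score could be computed from per-dimension scores. The dimension keys and example values are invented for illustration; this is not our exact pipeline.

```python
# Weights from the list above; they sum to 100%.
WEIGHTS = {
    "coding": 0.45,
    "reasoning": 0.15,
    "structured_output": 0.15,
    "value": 0.10,
    "speed": 0.10,
    "context": 0.05,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each normalized to 0-100."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 100%
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Made-up dimension scores for a hypothetical model
print(composite_score({
    "coding": 93, "reasoning": 90, "structured_output": 92,
    "value": 70, "speed": 85, "context": 80,
}))  # ~88.65
```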
Recommendations by persona
For professional developers
Claude Sonnet 4.6 or GPT-4.1. Both provide excellent code generation, debugging, and multi-file reasoning. Sonnet scores higher on frontend; GPT-4.1 has a larger ecosystem.
On a budget
DeepSeek V3 (~$0.27/M in, ~$0.40/M out) delivers coding quality competitive with GPT-4o at a fraction of the cost. GPT-4o Mini is another good cheap option.
For local / private coding
DeepSeek V3 (a 671B-parameter MoE, so self-hosting requires multi-GPU server hardware), Llama 4 Maverick, or Codestral 2. All have open weights and can be self-hosted.
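Whichever model you self-host, most serving stacks (vLLM, llama.cpp's server, and similar) expose an OpenAI-compatible endpoint, so client code stays the same. A minimal sketch; the base URL, API key, and model id below are placeholders for whatever your own server registers:

```python
from openai import OpenAI

# Point the standard OpenAI client at a self-hosted endpoint.
# All three values below are placeholders for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",  # use the id your server exposes
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a singly linked list."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```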
Fastest coding
DeepSeek V3, Claude Haiku 4.5, and Gemini 2.5 Flash offer the best combination of coding quality and low latency.
SWE-Bench Verified — Real Coding Benchmark
BasedAGI now uses real SWE-Bench Verified data — a 500-instance benchmark of real-world GitHub issue resolution tasks. This is the first real quality signal powering our coding rankings.
What it measures
- Real GitHub issues from popular Python repos
- Human-filtered Verified subset (500 instances)
- Resolved rate: % of issues correctly patched
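The headline number is simple: resolved rate is the share of instances where the model's patch made the issue's tests pass. A toy sketch (the result format is invented for illustration):

```python
def resolved_rate(results: list[dict]) -> float:
    """Percentage of SWE-bench instances marked resolved
    (patch applied and the issue's tests passed)."""
    if not results:
        return 0.0
    return 100.0 * sum(r["resolved"] for r in results) / len(results)

# Toy example: 3 of 5 instances resolved -> 60.0
print(resolved_rate([{"resolved": True}, {"resolved": False},
                     {"resolved": True}, {"resolved": True},
                     {"resolved": False}]))
```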
Important caveats
- Many results use agent/scaffold setups (mini-swe-agent, SWE-agent, OpenHands)
- Agent-assisted scores may exceed pure model capability
- Coding coverage is partial — other benchmarks remain mock/sample
- Only 6 models currently have real SWE-Bench data
Coding model pricing at a glance
Approximate monthly cost, from light usage (~1M input + 200K output tokens per month) up to roughly five times that volume. Use our pricing calculator for exact estimates.
| Model | Input $/1M | Output $/1M | Est. Monthly (light to heavy) |
|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | $3.60–$18.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $6.00–$30.00 |
| Claude Opus 4.7 | $15.00 | $75.00 | $30.00–$150.00 |
| DeepSeek V3 | $0.27 | $0.40 | $0.35–$1.75 |
| Mistral Large 3 | $4.00 | $12.00 | $6.40–$32.00 |
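The estimates above are straightforward to reproduce: multiply usage in millions of tokens by the per-million price for each direction. A minimal sketch using the table's prices:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_price: float, out_price: float) -> float:
    """USD per month: usage in millions of tokens times $/1M prices."""
    return input_mtok * in_price + output_mtok * out_price

# Light usage from the table: ~1M input + 200K output tokens/month
print(round(monthly_cost(1.0, 0.2, 3.00, 15.00), 2))  # Claude Sonnet 4.6 -> 6.0
print(round(monthly_cost(1.0, 0.2, 0.27, 0.40), 2))   # DeepSeek V3 -> 0.35
```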
Methodology note
Scores are normalized composites from public benchmarks (SWE-Bench, HumanEval, MBPP, BFCL). We weight by recency, source confidence, and task relevance. Scores marked with low confidence indicate limited benchmark coverage. Models within 2 points are considered effectively tied. We do not rely on vendor-reported scores alone. Full methodology.
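Two pieces of that note are easy to make precise: normalization (raw benchmark numbers rescaled onto a common 0-100 scale across the model pool) and the 2-point tie rule. A minimal sketch, with min-max chosen as an illustrative normalization; the exact scheme we use may differ:

```python
def min_max_normalize(raw: dict[str, float]) -> dict[str, float]:
    """Rescale raw benchmark scores to 0-100 across the model pool."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {model: 100.0 for model in raw}
    return {model: 100.0 * (v - lo) / (hi - lo) for model, v in raw.items()}

def effectively_tied(a: float, b: float, margin: float = 2.0) -> bool:
    """Scores within 2 points count as a tie under our methodology."""
    return abs(a - b) <= margin

print(effectively_tied(91, 88))  # False: a 3-point gap is a real difference
print(effectively_tied(91, 90))  # True: within the 2-point margin
```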
Frequently Asked Questions
Which LLM is best for coding in 2025?
Claude Sonnet 4.6 currently leads coding benchmarks including SWE-Bench Verified. However, GPT-4.1 is nearly tied. Your choice depends on whether you prioritize Sonnet's stronger frontend generation or GPT-4.1's API reliability and ecosystem. DeepSeek V3 is the best open-weight option at significantly lower cost.
Is an open-weight LLM good enough for coding?
Yes. DeepSeek V3, Llama 4 Maverick, and Mistral Large 3 are all strong for coding. They are slightly behind the best proprietary models on SWE-Bench but offer much lower cost and the ability to self-host.
What benchmarks matter for coding?
SWE-Bench Verified (real GitHub issue resolution), HumanEval (function synthesis), MBPP (Python programming), and the Berkeley Function Calling Leaderboard (tool use). We weight these based on how well they predict real-world coding performance.
How much does it cost to use an LLM for coding?
For a developer using ~1M input + 200K output tokens per month: GPT-4.1 costs ~$3.60/mo, Claude Sonnet 4.6 ~$6/mo, and DeepSeek V3 ~$0.35/mo at the rates in our pricing table. Prices vary by provider and change frequently. Check our pricing calculator for current rates.
Which model is fastest for coding?
DeepSeek V3 and Claude Haiku 4.5 are among the fastest. GPT-4o Mini and Gemini 2.5 Flash are also fast options with good coding capabilities. See our speed leaderboard for current latency data.