MVP Preview — Rankings use sample/mock data for development and are not yet based on real-world benchmarks. Learn more

Task-based recommendation

Best LLM for Coding in 2025

We compared AI models on SWE-Bench Verified, HumanEval, MBPP, and real-world coding tasks. Here are the best models for software development, ranked by coding capability, cost, speed, and reliability.

Last updated: May 2025 · Methodology

Our Pick

Claude Sonnet 4.6 — Best LLM for Coding

Claude Sonnet 4.6 leads coding benchmarks with strong performance on SWE-Bench Verified, excellent frontend/React generation, and good debugging. At $3/$15 per million tokens (input/output), it offers the best combination of quality and cost for most developers.

Who this is for

Developers

Code generation, refactoring, debugging, and pair programming in your IDE.

Technical founders

Rapid prototyping, API design, and full-stack development on a budget.

Agencies & teams

Frontend generation, backend boilerplate, and consistent code across projects.

Top Coding Models Ranked

| Rank | Model | Provider | Coding Score | Notes |
|------|-------|----------|--------------|-------|
| 1 | Claude Sonnet 4.6 | Anthropic | 91 | Best overall coding model. Strong SWE-Bench and frontend generation. |
| 2 | GPT-4.1 | OpenAI | 88 | Excellent coding with strong instruction following. Good API reliability. |
| 3 | Claude Opus 4.7 | Anthropic | 83 | Strongest for complex multi-file refactors and architecture. |
| 4 | DeepSeek V3 | DeepSeek | 82 | Best open-weight coding model. Competes with GPT-4o at much lower cost. |
| 5 | Gemini 2.5 Pro | Google | 79 | Huge context window useful for multi-file codebase analysis. |
| 6 | Llama 4 Maverick | Meta | 76 | Strong open model for coding with good value. |
| 7 | Codestral 2 | Mistral | 74 | Specialized code generation model. Fast and focused. |
| 8 | Mistral Large 3 | Mistral | 72 | Solid coding with good multilingual support. |

How we evaluate coding models

Benchmarks we use

  • SWE-Bench Verified — Real GitHub issue resolution (primary)
  • HumanEval — Function synthesis from docstrings
  • MBPP — Python programming benchmarks
  • BFCL — Function calling and tool use accuracy

Weighting

  • Coding benchmarks: 45%
  • Reasoning: 15%
  • Structured output (JSON/tool calling): 15%
  • Value (quality per dollar): 10%
  • Speed: 10%
  • Context window: 5%
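
To make the blend concrete, here is a minimal sketch of how category scores could combine into one composite under these weights. The per-category scores in the example are hypothetical placeholders, not real benchmark results, and the sketch is an illustration rather than our production scoring code.

```python
# Minimal sketch: blend normalized 0-100 category scores into one
# composite using the weights listed above. All example scores are
# hypothetical placeholders.

WEIGHTS = {
    "coding": 0.45,
    "reasoning": 0.15,
    "structured_output": 0.15,
    "value": 0.10,
    "speed": 0.10,
    "context_window": 0.05,
}

def composite_score(category_scores: dict[str, float]) -> float:
    """Weighted average of 0-100 category scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[cat] * category_scores[cat] for cat in WEIGHTS)

example = {
    "coding": 91, "reasoning": 85, "structured_output": 88,
    "value": 70, "speed": 75, "context_window": 80,
}
print(round(composite_score(example), 1))  # -> 85.4
```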

Recommendations by persona

For professional developers

Claude Sonnet 4.6 or GPT-4.1. Both provide excellent code generation, debugging, and multi-file reasoning. Sonnet scores higher on frontend; GPT-4.1 has a larger ecosystem.

On a budget

DeepSeek V3 (~$0.27/M in, ~$0.40/M out) delivers coding quality competitive with GPT-4o at a fraction of the cost. GPT-4o Mini is another good cheap option.

For local / private coding

DeepSeek V3 (671B, can run on high-end hardware), Llama 4 Maverick, or Codestral 2. All have open weights and can be self-hosted.

Fastest coding

DeepSeek V3, Claude Haiku 4.5, and Gemini 2.5 Flash offer the best combination of coding quality and low latency.

SWE-bench Verified — Real Coding Benchmark

BasedAGI now uses real SWE-bench Verified data — a 500-instance benchmark of real-world GitHub issue resolution tasks. This is the first real quality signal powering our coding rankings.

What it measures

  • Real GitHub issues from popular Python repos
  • Human-filtered Verified subset (500 instances)
  • Resolved rate: % of issues correctly patched
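
The resolved rate is simply the fraction of instances whose generated patch passes the issue's tests. A toy calculation with made-up counts:

```python
# Toy resolved-rate calculation; the resolved count is hypothetical.
resolved = 312   # instances where the generated patch passed the tests
total = 500      # SWE-bench Verified instance count
print(f"{100 * resolved / total:.1f}% resolved")  # -> 62.4% resolved
```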

Important caveats

  • Many results use agent/scaffold setups (mini-swe-agent, SWE-agent, OpenHands)
  • Agent-assisted scores may exceed pure model capability
  • Coding coverage is partial — other benchmarks remain mock/sample
  • Only 6 models currently have real SWE-bench data

Coding model pricing at a glance

Approximate cost for a developer using ~1M input + 200K output tokens per month. Use our pricing calculator for exact estimates.

| Model | Input $/1M | Output $/1M | Est. Monthly |
|-------|------------|-------------|--------------|
| GPT-4.1 | $2.00 | $8.00 | $3.60–$18.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $6.00–$30.00 |
| Claude Opus 4.7 | $15.00 | $75.00 | $30.00–$150.00 |
| DeepSeek V3 | $0.27 | $0.40 | $0.35–$1.75 |
| Mistral Large 3 | $4.00 | $12.00 | $6.40–$32.00 |
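
The baseline figures above follow directly from the rates: token volume in millions times the per-million price. A quick sketch using the table's prices (the upper end of each range reflects heavier usage; the exact multiplier behind it is not specified here):

```python
# Back-of-envelope monthly cost from the table's per-million rates,
# at the stated baseline of ~1M input + 200K output tokens per month.

def monthly_cost(input_per_1m: float, output_per_1m: float,
                 input_mtok: float = 1.0, output_mtok: float = 0.2) -> float:
    return input_mtok * input_per_1m + output_mtok * output_per_1m

for name, cin, cout in [("GPT-4.1", 2.00, 8.00),
                        ("Claude Sonnet 4.6", 3.00, 15.00),
                        ("DeepSeek V3", 0.27, 0.40)]:
    print(f"{name}: ${monthly_cost(cin, cout):.2f}/mo")
# GPT-4.1: $3.60/mo
# Claude Sonnet 4.6: $6.00/mo
# DeepSeek V3: $0.35/mo
```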

Methodology note

Scores are normalized composites from public benchmarks (SWE-Bench, HumanEval, MBPP, BFCL). We weight by recency, source confidence, and task relevance. Scores marked with low confidence indicate limited benchmark coverage. Models within 2 points are considered effectively tied. We do not rely on vendor-reported scores alone. Full methodology.
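
As a rough illustration of that aggregation, the sketch below weights each benchmark result by recency, source confidence, and task relevance, then applies the 2-point tie rule. The weight structure and all numbers are simplified assumptions; the full methodology documents the actual pipeline.

```python
# Simplified illustration of score aggregation. Each benchmark result
# carries a weight combining recency, source confidence, and task
# relevance; all values here are placeholders.

from dataclasses import dataclass

@dataclass
class Result:
    score: float       # benchmark score, already normalized to 0-100
    recency: float     # e.g. 1.0 for fresh results, decaying with age
    confidence: float  # source confidence, 0-1
    relevance: float   # how well the benchmark predicts real coding, 0-1

def aggregate(results: list[Result]) -> float:
    weights = [r.recency * r.confidence * r.relevance for r in results]
    return sum(w * r.score for w, r in zip(weights, results)) / sum(weights)

def effectively_tied(a: float, b: float, margin: float = 2.0) -> bool:
    """Scores within `margin` points are treated as a tie."""
    return abs(a - b) <= margin

print(effectively_tied(91, 88))  # -> False: a 3-point gap is a real gap
```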

Frequently Asked Questions

Which LLM is best for coding in 2025?

Claude Sonnet 4.6 currently leads coding benchmarks including SWE-Bench Verified. However, GPT-4.1 is nearly tied. Your choice depends on whether you prioritize Sonnet's stronger frontend generation or GPT-4.1's API reliability and ecosystem. DeepSeek V3 is the best open-weight option at significantly lower cost.

Is an open-weight LLM good enough for coding?

Yes. DeepSeek V3, Llama 4 Maverick, and Mistral Large 3 are all strong for coding. They are slightly behind the best proprietary models on SWE-Bench but offer much lower cost and the ability to self-host.

What benchmarks matter for coding?

SWE-Bench Verified (real GitHub issue resolution), HumanEval (function synthesis), MBPP (Python programming), and the Berkeley Function Calling Leaderboard (tool use). We weight these based on how well they predict real-world coding performance.

How much does it cost to use an LLM for coding?

For a developer using ~1M input + 200K output tokens/month at base rates: GPT-4.1 costs ~$3.60/mo, Claude Sonnet 4.6 ~$6/mo, and DeepSeek V3 ~$0.35/mo. Prices vary by provider and change frequently. Check our pricing calculator for current rates.

Which model is fastest for coding?

DeepSeek V3 and Claude Haiku 4.5 are among the fastest. GPT-4o Mini and Gemini 2.5 Flash are also fast options with good coding capabilities. See our speed leaderboard for current latency data.
