BasedAGI

developer_tools

Code generation

Generate correct, secure code from requirements.

#1 Recommendation

anthropic/claude-sonnet-4.6

Strong on OpenHands Issue Resolution issue_resolution_score_pct (72%) and OpenHands Index issue_resolution_score_pct (72%)

external/anthropic/claude-sonnet-4-6

Score: 19.7%

Confidence: 33.8%

Limited benchmark evidence for this use case.

18 ranked models with average evidence of 17.8 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 18

Evidence Quality: 82%

Scoring: Benchmark-backed

Top Signal: OpenHands Issue Resolution: issue_resolution_score_pct

All Ranked Models

Rank  Model                          Score
#7    anthropic/claude-sonnet-4.6    19.7%
#10   Kimi K2 Thinking               16.3%
#11   minimax/minimax-m2.1           15.8%
#12   kimi/kimi-k2.5-thinking        14.1%
#13   deepseek/deepseek-r1           14.1%
#14   z-ai/glm-4.7                   13.6%
#17   gemini-3-pro-preview           11.3%
#18   GLM-5                          11.1%
#24   gpt-4.1-20250414               9.7%
#25   gpt-4o-2024-08-06              9.3%
#26   gpt-4o                         9.1%
#29   Grok-4-0709                    9.0%
#31   gpt-4o-2024-05-13              8.9%
#35   claude-sonnet-4-20250514       8.7%
#43   gpt-4o-20241120                8.2%
#50   GLM-4.7                        7.6%
#58   gemini-2.5-pro                 6.7%
#76   openai/gpt-4o-mini-2024-07-18  3.1%

Compare Models

Model A leads by +3.4%

Model A

anthropic/claude-sonnet-4.6

external/anthropic/claude-sonnet-4-6

19.7%

Rank #7

Confidence 33.8% · 26 evidence pts

OpenHands Issue Resolution: issue_resolution_score_pct

Value 71.8% · Conf 100.0% · Weight 2.4%

openhands_issue_resolution.issue_resolution_score_pct (Mar 12, 2026)

OpenHands Index: issue_resolution_score_pct

Value 71.8% · Conf 100.0% · Weight 2.0%

openhands_index.issue_resolution_score_pct (Mar 12, 2026)

OpenHands Index: greenfield_score_pct

Value 75.2% · Conf 100.0% · Weight 1.4%

openhands_index.greenfield_score_pct (Mar 12, 2026)

Vals SWE-bench: overall_accuracy_pct

Value 95.1% · Conf 100.0% · Weight 1.3%

vals_swebench.overall_accuracy_pct (Mar 12, 2026)

Model B

Kimi K2 Thinking

external/kimi/kimi-k2-thinking

16.3%

Rank #10

Confidence 43.5% · 26 evidence pts

Sonar Java Quality Leaderboard: functional_skill_pct

Value 88.4% · Conf 100.0% · Weight 2.8%

sonar_java_quality.functional_skill_pct (Mar 12, 2026)

Sonar Java Quality Leaderboard: issue_density_error_per_kloc

Value 66.6% · Conf 100.0% · Weight 1.5%

sonar_java_quality.issue_density_error_per_kloc (Mar 12, 2026)

Sonar Java Quality Leaderboard: vulnerability_density_error_per_kloc

Value 61.4% · Conf 100.0% · Weight 1.0%

sonar_java_quality.vulnerability_density_error_per_kloc (Mar 12, 2026)

Vals SWE-bench: overall_accuracy_pct

Value 63.5% · Conf 100.0% · Weight 0.9%

vals_swebench.overall_accuracy_pct (Mar 12, 2026)
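
The "+3.4%" lead shown for Model A matches the difference between the two overall scores listed above (19.7% and 16.3%). The sketch below reproduces that arithmetic. It assumes the page computes the lead as a plain subtraction of scores, which the page does not state, and `score_lead` is a hypothetical helper name, not part of any BasedAGI API.

```python
from decimal import Decimal

# Overall scores as displayed on the page, in percent.
# Decimal avoids binary floating-point noise such as 3.3999999999999986.
scores = {
    "anthropic/claude-sonnet-4.6": Decimal("19.7"),
    "Kimi K2 Thinking": Decimal("16.3"),
}


def score_lead(scores, model_a, model_b):
    """Return model_a's score minus model_b's score (hypothetical helper).

    Assumption: the lead is a simple difference of the displayed scores.
    """
    return scores[model_a] - scores[model_b]


diff = score_lead(scores, "anthropic/claude-sonnet-4.6", "Kimi K2 Thinking")
print(f"Model A leads by {diff:+.1f}%")
```

Because both scores are percentages, the difference is most naturally read as percentage points rather than a relative improvement.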

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 18 · Sources: 8 · Quality: Insufficient

Vals LiveCodeBench (vals_lcb) · 11 rows · 0.9% avg lift

Vals SWE-bench (vals_swebench) · 9 rows · 1.0% avg lift

Vals Terminal-Bench 2 (vals_terminal_bench_2) · 9 rows · 0.7% avg lift

Vals Legal Bench (vals_legal_bench) · 8 rows · 0.2% avg lift

Missing Strong Models

gpt-5-mini-2025-08-07 (external/openai/gpt-5-mini-2025-08-07)
Rank #7 · 19.6% · Thin evidence after weighting

google/gemini-3.1-pro-preview (external/google/gemini-3-1-pro-preview)
Rank #8 · 19.3% · Thin evidence after weighting

gpt-5-2025-08-07 (external/openai/gpt-5-2025-08-07)
Rank #9 · 19.2% · Thin evidence after weighting

openai/gpt-5.4-2026-03-05 (external/openai/gpt-5-4-2026-03-05)
Rank #10 · 18.9% · Thin evidence after weighting

Taxonomy Details

Core Tasks

task.code_generation, task.api_usage_correctness

Required Modes

none

Domains

domain.software_engineering

Related Use Cases