developer_tools
anthropic/claude-sonnet-4 vs gpt-4o-2024-05-13
Benchmark coverage is still limited for this use case, so this comparison is directional rather than definitive.
Model A leads so far by +2.6%
Rank #1
Confidence 41.7% · Evidence 27 pts
SWE-bench Verified Leaderboard: swe_verified_resolved_pct
Value 81.7% · Conf 100.0% · Weight 2.9%
swebench_verified_official.swe_verified_resolved_pct (Apr 1, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 84.8% · Conf 100.0% · Weight 1.8%
galileo_agent_v2.avg_ac (Apr 1, 2026)
Sonar Java Quality Leaderboard: functional_skill_pct
Value 79.5% · Conf 100.0% · Weight 1.8%
sonar_java_quality.functional_skill_pct (Apr 1, 2026)
Aider Polyglot Leaderboard: percent_correct_pct
Value 67.9% · Conf 100.0% · Weight 1.2%
aider_polyglot.percent_correct_pct (Apr 1, 2026)
Sonar Java Quality Leaderboard: issue_density_error_per_kloc
Value 58.5% · Conf 100.0% · Weight 1.0%
sonar_java_quality.issue_density_error_per_kloc (Apr 1, 2026)
Rank #2
Confidence 35.5% · Evidence 13 pts
RepoQA Official Results: overall_average_pass_at_1_pct
Value 99.3% · Conf 100.0% · Weight 4.6%
repoqa_leaderboard.overall_average_pass_at_1_pct (Apr 1, 2026)
SWE-bench Verified Leaderboard: swe_verified_resolved_pct
Value 48.2% · Conf 100.0% · Weight 1.7%
swebench_verified_official.swe_verified_resolved_pct (Apr 1, 2026)
RepoQA Official Results: all_average_pass_at_1_pct
Value 99.3% · Conf 100.0% · Weight 1.6%
repoqa_leaderboard.all_average_pass_at_1_pct (Apr 1, 2026)
Aider Code Editing Leaderboard: percent_correct_pct
Value 82.3% · Conf 100.0% · Weight 1.3%
aider_code_editing.percent_correct_pct (Apr 1, 2026)
BigCodeBench Official: bigcodebench_complete_pct
Value 97.6% · Conf 100.0% · Weight 1.0%
bigcodebench_official.bigcodebench_complete_pct (Apr 1, 2026)