developer_tools
anthropic/claude-sonnet-4 vs gpt-5-2025-08-07
Benchmark coverage is still limited for this use case, so this comparison is directional rather than definitive.
Model A leads so far by +3.9%
Rank #1
Confidence: 43.8%
Evidence: 27 pts
SWE-bench Verified Leaderboard: swe_verified_resolved_pct
Value 81.7% · Conf 100.0% · Weight 3.0%
swebench_verified_official.swe_verified_resolved_pct (Apr 1, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 84.8% · Conf 100.0% · Weight 1.9%
galileo_agent_v2.avg_ac (Apr 1, 2026)
Sonar Java Quality Leaderboard: functional_skill_pct
Value 79.5% · Conf 100.0% · Weight 1.8%
sonar_java_quality.functional_skill_pct (Apr 1, 2026)
Aider Polyglot Leaderboard: percent_correct_pct
Value 67.9% · Conf 100.0% · Weight 1.3%
aider_polyglot.percent_correct_pct (Apr 1, 2026)
Sonar Java Quality Leaderboard: issue_density_error_per_kloc
Value 58.5% · Conf 100.0% · Weight 1.1%
sonar_java_quality.issue_density_error_per_kloc (Apr 1, 2026)
Rank #2
Confidence: 32.9%
Evidence: 27 pts
SWE-bench Verified Leaderboard: swe_verified_resolved_pct
Value 93.8% · Conf 100.0% · Weight 3.5%
swebench_verified_official.swe_verified_resolved_pct (Apr 1, 2026)
Aider Polyglot Leaderboard: percent_correct_pct
Value 100.0% · Conf 100.0% · Weight 1.9%
aider_polyglot.percent_correct_pct (Apr 1, 2026)
Vals LiveCodeBench: overall_accuracy_pct
Value 96.5% · Conf 100.0% · Weight 1.3%
vals_lcb.overall_accuracy_pct (Mar 31, 2026)
Vals SWE-bench: overall_accuracy_pct
Value 78.3% · Conf 100.0% · Weight 1.2%
vals_swebench.overall_accuracy_pct (Mar 31, 2026)
Vals Terminal-Bench 2: overall_accuracy_pct
Value 53.4% · Conf 100.0% · Weight 0.7%
vals_terminal_bench_2.overall_accuracy_pct (Mar 31, 2026)