developer_tools
Kimi K2 Thinking vs deepseek/deepseek-r1
Model A wins by +0.9%
Rank #2
Confidence: 31.2%
Evidence: 16 pts
Sonar Java Quality Leaderboard: functional_skill_pct
Value 88.4% · Conf 100.0% · Weight 3.2%
sonar_java_quality.functional_skill_pct (Mar 17, 2026)
SWE-bench Verified Leaderboard: swe_verified_resolved_pct
Value 80.2% · Conf 100.0% · Weight 2.4%
swebench_verified_official.swe_verified_resolved_pct (Mar 17, 2026)
Sonar Java Quality Leaderboard: issue_density_error_per_kloc
Value 66.6% · Conf 100.0% · Weight 1.7%
sonar_java_quality.issue_density_error_per_kloc (Mar 17, 2026)
Vals SWE-bench: overall_accuracy_pct
Value 63.5% · Conf 100.0% · Weight 0.8%
vals_swebench.overall_accuracy_pct (Mar 17, 2026)
Vals LiveCodeBench: overall_accuracy_pct
Value 65.1% · Conf 100.0% · Weight 0.7%
vals_lcb.overall_accuracy_pct (Mar 17, 2026)
Rank #3
Confidence: 33.1%
Evidence: 20 pts
Sonar Java Quality Leaderboard: functional_skill_pct
Value 82.8% · Conf 100.0% · Weight 3.0%
sonar_java_quality.functional_skill_pct (Mar 17, 2026)
Sonar Java Quality Leaderboard: issue_density_error_per_kloc
Value 59.0% · Conf 100.0% · Weight 1.5%
sonar_java_quality.issue_density_error_per_kloc (Mar 17, 2026)
Aider Polyglot Leaderboard: percent_correct_pct
Value 80.0% · Conf 100.0% · Weight 1.2%
aider_polyglot.percent_correct_pct (Mar 17, 2026)
LEXam Leaderboard: average_score_pct
Value 73.2% · Conf 100.0% · Weight 1.0%
lexam_leaderboard.average_score_pct (Mar 17, 2026)
ContractEval Leaderboard: contract_adherence_csr_pct
Value 50.0% · Conf 100.0% · Weight 0.8%
contracteval_leaderboard.contract_adherence_csr_pct (Mar 17, 2026)