BasedAGI

developer_tools

Code Review Assistant

Reviewing pull requests, identifying bugs, security issues, and style violations.

#1 Recommendation

Kimi K2 Thinking

Strong on Sonar Java Quality Leaderboard functional_skill_pct (88%) and SWE-bench Verified Leaderboard swe_verified_resolved_pct (80%)

external/kimi/kimi-k2-thinking

22.9%

Score

31.2%

Confidence

Limited benchmark evidence for this use case.

48 ranked models with average evidence of 13.5 points. Rankings may shift as more benchmark data is ingested.

Ranked Models

30

Evidence Quality

80%

Scoring

Benchmark-backed

Top Signal

Sonar Java Quality Leaderboard: functional_skill_pct

All Ranked Models

30 of 30
Rank  Model                                          Score
#1    Kimi K2 Thinking                               22.9%
      Strong on Sonar Java Quality Leaderboard functional_skill_pct (88%) and SWE-bench Verified Leaderboard swe_verified_resolved_pct (80%)
#2    deepseek/deepseek-r1                           22.0%
      Strong on Sonar Java Quality Leaderboard functional_skill_pct (83%) and Sonar Java Quality Leaderboard issue_density_error_per_kloc (59%)
#3    gemini-3-pro-preview                           19.7%
      Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct (88%) and Vals SWE-bench overall_accuracy_pct (88%)
#4    gemini-2.5-pro                                 19.4%
#5    gpt-4o                                         18.2%
#7    z-ai/glm-4.7                                   18.0%
#9    gpt-4o-2024-05-13                              16.6%
#10   minimax/minimax-m2.1                           16.4%
#11   openai/gpt-4.1                                 16.3%
#13   claude-opus-4-5-20251101                       15.7%
#14   gpt-4.1-20250414                               15.1%
#15   GLM-5                                          15.1%
#16   gpt-5.2-2025-12-11                             15.1%
#31   o3-20250416                                    11.7%
#32   Grok-4-0709                                    11.7%
#37   google/gemini-3.1-pro-preview                  11.4%
#46   openai/gpt-5.4-2026-03-05                      10.7%
#48   claude-sonnet-4-20250514                       10.6%
#49   anthropic/claude-sonnet-4.6                    10.5%
#50   gpt-5-2025-08-07                               10.5%
#51   gpt-4o-20241120                                10.4%
#52   anthropic/claude-opus-4-6-thinking             10.3%
#55   gemini-3-flash-preview                         10.1%
#57   gpt-5.1-2025-11-13                             10.0%
#60   anthropic/claude-opus-4-5-20251101-thinking    9.9%
#67   gpt-4.1-mini-20250414                          9.4%
#72   kimi/kimi-k2.5-thinking                        8.9%
#74   gpt-4o-2024-08-06                              8.8%
#77   o4-mini-20250416                               8.7%
#78   anthropic/claude-sonnet-4-5-20250929-thinking  8.7%

Compare Models

Model A leads by +0.9%

Model A

Kimi K2 Thinking

external/kimi/kimi-k2-thinking

22.9%

Rank #1

Confidence 31.2% · 16 evidence pts

Sonar Java Quality Leaderboard: functional_skill_pct

Value 88.4% · Conf 100.0% · Weight 3.2%

sonar_java_quality.functional_skill_pct (Mar 17, 2026)

SWE-bench Verified Leaderboard: swe_verified_resolved_pct

Value 80.2% · Conf 100.0% · Weight 2.4%

swebench_verified_official.swe_verified_resolved_pct (Mar 17, 2026)

Sonar Java Quality Leaderboard: issue_density_error_per_kloc

Value 66.6% · Conf 100.0% · Weight 1.7%

sonar_java_quality.issue_density_error_per_kloc (Mar 17, 2026)

Vals SWE-bench: overall_accuracy_pct

Value 63.5% · Conf 100.0% · Weight 0.8%

vals_swebench.overall_accuracy_pct (Mar 17, 2026)

Model B

deepseek/deepseek-r1

external/deepseek/deepseek-r1

22.0%

Rank #2

Confidence 33.1% · 20 evidence pts

Sonar Java Quality Leaderboard: functional_skill_pct

Value 82.8% · Conf 100.0% · Weight 3.0%

sonar_java_quality.functional_skill_pct (Mar 17, 2026)

Sonar Java Quality Leaderboard: issue_density_error_per_kloc

Value 59.0% · Conf 100.0% · Weight 1.5%

sonar_java_quality.issue_density_error_per_kloc (Mar 17, 2026)

Aider Polyglot Leaderboard: percent_correct_pct

Value 80.0% · Conf 100.0% · Weight 1.2%

aider_polyglot.percent_correct_pct (Mar 17, 2026)

LEXam Leaderboard: average_score_pct

Value 73.2% · Conf 100.0% · Weight 1.0%

lexam_leaderboard.average_score_pct (Mar 17, 2026)
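
The comparison above lists a Value, Conf, and Weight for each signal but does not state how they combine into the overall score. As a rough sketch only, the Python below assumes each signal contributes value × confidence × weight (in percentage points) and sums the four signals listed for each model. The `Signal` class, the `contribution` formula, and the variable names are hypothetical and are not taken from the site. Only the top four signals per model are shown on the page (out of 16 and 20 evidence points), so these partial sums are not expected to reproduce the 22.9% and 22.0% headline scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One benchmark signal as displayed in the comparison cards."""
    name: str
    value_pct: float
    conf_pct: float
    weight_pct: float

    def contribution(self) -> float:
        # ASSUMPTION: value x confidence x weight, in percentage points.
        # The page does not document its actual scoring formula.
        return self.value_pct * (self.conf_pct / 100) * (self.weight_pct / 100)


# Values copied from the Model A card (Kimi K2 Thinking).
MODEL_A = [
    Signal("sonar_java_quality.functional_skill_pct", 88.4, 100.0, 3.2),
    Signal("swebench_verified_official.swe_verified_resolved_pct", 80.2, 100.0, 2.4),
    Signal("sonar_java_quality.issue_density_error_per_kloc", 66.6, 100.0, 1.7),
    Signal("vals_swebench.overall_accuracy_pct", 63.5, 100.0, 0.8),
]

# Values copied from the Model B card (deepseek/deepseek-r1).
MODEL_B = [
    Signal("sonar_java_quality.functional_skill_pct", 82.8, 100.0, 3.0),
    Signal("sonar_java_quality.issue_density_error_per_kloc", 59.0, 100.0, 1.5),
    Signal("aider_polyglot.percent_correct_pct", 80.0, 100.0, 1.2),
    Signal("lexam_leaderboard.average_score_pct", 73.2, 100.0, 1.0),
]


def listed_contribution(signals: list[Signal]) -> float:
    """Sum the assumed contributions of the signals shown on the page."""
    return sum(s.contribution() for s in signals)


partial_a = listed_contribution(MODEL_A)
partial_b = listed_contribution(MODEL_B)

# The headline lead is the difference of the two published scores.
score_gap = round(22.9 - 22.0, 1)

print(f"Model A listed signals: {partial_a:.2f} pts")
print(f"Model B listed signals: {partial_b:.2f} pts")
print(f"Published score gap:    +{score_gap}%")
```

Under this assumed formula the listed signals account for about 6.39 points for Model A and 5.06 points for Model B, while the published gap of +0.9% follows directly from 22.9% minus 22.0%.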

Ranking Diagnostics & Missing Models

Source Lift

Ranked

48

Sources

8

Quality

Insufficient

Vals LiveCodeBench

vals_lcb

34 rows

0.9% avg lift

Vals Legal Bench

vals_legal_bench

34 rows

0.2% avg lift

Vals SWE-bench

vals_swebench

33 rows

0.9% avg lift

Vals MedQA

vals_medqa

32 rows

0.2% avg lift

Missing Strong Models

gemini-2.5-flash

external/google/gemini-2-5-flash

Rank #10

17.6%

Thin evidence after weighting

anthropic/claude-opus-4-1-20250805

external/anthropic/claude-opus-4-1-20250805

Rank #46

10.6%

Thin evidence after weighting

google/gemini-2.0-flash-001

external/google/gemini-2-0-flash-001

Rank #48

10.2%

Thin evidence after weighting

qwen/qwen3-max

external/qwen/qwen3-max

Rank #51

10.0%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.code_review
task.risk_assessment

Required Modes

none

Domains

domain.software_engineering

Related Use Cases