BasedAGI
Developer

Code Review Assistant

Reviewing pull requests, identifying bugs, security issues, and style violations.

task.code_review · task.risk_assessment

Best for this use case

claude-sonnet-4

Strong on Sonar Java Quality Leaderboard functional_skill_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct

23.2%

Best benchmark score

32.6%

Confidence

Ranked Models: 30

Evidence Quality: 90%

Evidence Points: 29

Top Signal: Sonar Java Quality Leaderboard: functional_skill_pct

Benchmark Sources: 45

Last Updated: 6h ago

Benchmark Sources (7)

Sonar Java Quality Leaderboard: functional_skill_pct · May 1, 2026 · 3% weight
SWE-bench Verified Leaderboard: swe_verified_resolved_pct · May 1, 2026 · 3% weight
SWE-bench Leaderboard: verified_resolved_pct · May 1, 2026 · 3% weight
Galileo Agent Leaderboard v2: Avg AC · May 1, 2026 · 2% weight
BaxBench Leaderboard: average_secure_pass_1_pct · May 1, 2026 · 2% weight
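
The per-source weights above suggest that each model's score combines several benchmark metrics, but the page does not state the formula. As a purely illustrative sketch, assuming a weighted mean over whichever sources report a value for a model, with weights renormalized across the sources present, the calculation could look like this. The function name `weighted_score`, the source keys, and the metric values are hypothetical; only the 3% and 2% weights come from the list above.

```python
from typing import Mapping, Optional


def weighted_score(
    metrics: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted mean of the metrics that are present.

    Assumption: sources with no value for a model are skipped and the
    remaining weights are renormalized. The real aggregation may differ.
    """
    total = 0.0
    weight_sum = 0.0
    for source, weight in weights.items():
        value = metrics.get(source)
        if value is None:
            continue  # this source has no row for the model
        total += weight * value
        weight_sum += weight
    if weight_sum == 0:
        return None  # no overlapping evidence at all
    return total / weight_sum


# Weights taken from the list above; metric values are made up for illustration.
weights = {"sonar_java": 0.03, "swe_bench_verified": 0.03, "baxbench": 0.02}
example_metrics = {"sonar_java": 50.0, "swe_bench_verified": 100.0}
print(weighted_score(example_metrics, weights))
```

Renormalizing over the available sources keeps a model from being penalized purely for missing rows, though a real ranking might instead discount thin evidence.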

All Ranked Models

30 of 30 models
Rank · Model · Score

🥇 · claude-sonnet-4 · 23.2%
Strong on Sonar Java Quality Leaderboard functional_skill_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct

🥈 · gpt-5-2025-08-07 · 21.9%
Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct

🥉 · gemini-3-pro-preview · 21.8%
Strong on Berkeley Function Calling Leaderboard (Overall) Overall Acc and SWE-bench Leaderboard verified_resolved_pct

#4 · Kimi K2 Thinking · 19.1%
Strong on Sonar Java Quality Leaderboard functional_skill_pct and SWE-bench Leaderboard verified_resolved_pct

#5 · o3-20250416 · 18.2%
Strong on SWE-bench Leaderboard verified_resolved_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct

#6 · gpt-5.2-2025-12-11 · 18.1%
Strong on SWE-bench Leaderboard verified_resolved_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct

#9 · gemini-2.5-pro · 17.1%
Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct

#11 · claude-opus-4-5-20251101 · 16.3%
Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and SWE-bench Leaderboard verified_resolved_pct

#12 · deepseek-r1 · 16.2%
Strong on Sonar Java Quality Leaderboard functional_skill_pct and Sonar Java Quality Leaderboard issue_density_error_per_kloc

#13 · gpt-4o · 16.0%
Strong on TestEval Leaderboard overall_average_coverage_pct and Sonar Java Quality Leaderboard functional_skill_pct

#14 · gpt-4.1-20250414 · 15.7%
Strong on Galileo Agent Leaderboard v2 Avg AC and SWE-bench Leaderboard verified_resolved_pct

#15 · qwen-2.5-coder32b-instruct · 15.2%
Strong on BigCode Models Leaderboard average_score and BigCodeBench Official bigcodebench_complete_pct

#16 · o4-mini · 15.2%
Strong on SWE-bench Leaderboard verified_resolved_pct and Berkeley Function Calling Leaderboard (Overall) Overall Acc

#17 · gpt-5-mini-2025-08-07 · 15.0%
Strong on SWE-bench Leaderboard verified_resolved_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct

#18 · GLM-5 · 14.8%
Strong on Sonar Java Quality Leaderboard functional_skill_pct and SWE-bench Leaderboard verified_resolved_pct

#19 · claude-opus-4-6-thinking · 14.6%
Strong on Sonar Java Quality Leaderboard functional_skill_pct and Sonar Java Quality Leaderboard issue_density_error_per_kloc

#20 · qwen-2.5-72b-instruct · 14.0%
Strong on Open LLM Leaderboard MMLU-Pro mmlu_pro_accuracy_pct and BigCodeBench Official bigcodebench_complete_pct

#25 · GLM-4.6 · 12.9%
Strong on Berkeley Function Calling Leaderboard (Overall) Overall Acc and SWE-bench Verified Leaderboard swe_verified_resolved_pct

#28 · Kimi-K2-Instruct · 12.5%
Strong on Berkeley Function Calling Leaderboard (Overall) Overall Acc and SWE-bench Leaderboard verified_resolved_pct

#29 · gemini-3-flash-preview · 12.3%
Strong on SWE-bench Leaderboard verified_resolved_pct and Vals LiveCodeBench overall_accuracy_pct

#30 · gemini-2.5-flash · 12.2%
Strong on Berkeley Function Calling Leaderboard (Overall) Overall Acc and OSS-Bench Leaderboard average_score_pct

#32 · Phi-3-medium-128k-instruct · 11.7%
Strong on RepoQA Official Results overall_average_pass_at_1_pct and Open LLM Leaderboard MMLU-Pro mmlu_pro_accuracy_pct

#33 · Grok-4-0709 · 11.7%
Strong on Berkeley Function Calling Leaderboard (Overall) Overall Acc and Vals LiveCodeBench overall_accuracy_pct

#34 · glm-4.7 · 11.5%
Strong on Sonar Java Quality Leaderboard functional_skill_pct and Sonar Java Quality Leaderboard issue_density_error_per_kloc

#40 · minimax-m2.1 · 10.7%
Strong on Sonar Java Quality Leaderboard functional_skill_pct and Sonar Java Quality Leaderboard issue_density_error_per_kloc

#41 · gpt-5.1-2025-11-13 · 10.7%
Strong on SWE-bench Leaderboard verified_resolved_pct and Vals LiveCodeBench overall_accuracy_pct

#42 · Mixtral-8x22B-Instruct-v0.1 · 10.5%
Strong on RepoQA Official Results overall_average_pass_at_1_pct and Open LLM Leaderboard GPQA gpqa

#44 · kimi-k2.5-thinking · 10.4%
Strong on SWE-bench Leaderboard verified_resolved_pct and Vals LiveCodeBench overall_accuracy_pct

#51 · Qwen/Qwen1.5-32B-Chat · 9.5%
Strong on RepoQA Official Results overall_average_pass_at_1_pct and Open LLM Leaderboard MMLU-Pro mmlu_pro_accuracy_pct

#52 · grok-4-1-fast-reasoning · 9.4%
Strong on Berkeley Function Calling Leaderboard (Overall) Overall Acc and Vals LiveCodeBench overall_accuracy_pct

Ranking diagnostics & missing models

Source lift

Ranked: 84

Sources: 8

Quality: Good

Open LLM Leaderboard MMLU-Pro: 53 rows · 0.9% avg lift

Open LLM Leaderboard GPQA: 53 rows · 0.6% avg lift

Open LLM Leaderboard BBH: 52 rows · 0.1% avg lift

Open LLM Leaderboard Results: 51 rows · 0.1% avg lift

Missing frontier models

claude-sonnet-4.6 · Thin evidence after weighting · Rank #11 · 17.9%

gemini-3.1-pro-preview · Thin evidence after weighting · Rank #17 · 20.8%

grok-4-fast-reasoning · Thin evidence after weighting · Rank #19 · 14.0%

Taxonomy & task details

Core tasks

task.code_review · task.risk_assessment

Required modes

none

Domains

domain.software_engineering
