Developer

Codebase onboarding brief

Summarize a repository's architecture, modules, and conventions.

task.multi_doc_synthesis, task.code_explanation

Evidence quality is currently limited for this use case. The rankings below are useful for exploration, not as evidence of a clear winner.

Provisional leader

gpt-5-2025-08-07

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

27.3%

Best benchmark score

35.2%

Confidence

All ranked models — top 3

🥇

gpt-5-2025-08-07

27.3%

🥈

gemini-3.1-pro-preview

27.0%

🥉

gemini-3-pro-preview

23.6%

Ranked Models

Evidence Quality

83%

Evidence Points

Top Signal

Aider Polyglot Leaderboard: percent_correct_pct

All Ranked Models

30 of 30 models

Rank	Model	Strengths	Score	Confidence	Price / 1M	Evidence sources
🥇	gpt-5-2025-08-07	Strong on Aider Polyglot Leaderboard percent_correct_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct	27.3%	35%	—	Aider Polyglot Leaderboard, SWE-bench Verified Leaderboard
🥈	gemini-3.1-pro-preview	Strong on Vals SWE-bench overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	27.0%	30%	$4.50	Vals SWE-bench, Vals Finance Agent
🥉	gemini-3-pro-preview	Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals SWE-bench overall_accuracy_pct	23.6%	31%	$4.50	SWE-bench Verified Leaderboard, Vals SWE-bench
#4	gemini-2.5-pro	Strong on FACTS Benchmark Suite facts_grounding_score_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct	23.6%	35%	$3.44	FACTS Benchmark Suite, SWE-bench Verified Leaderboard
#5	gpt-5.2-2025-12-11	Strong on FACTS Benchmark Suite facts_grounding_score_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct	22.9%	27%	—	FACTS Benchmark Suite, SWE-bench Verified Leaderboard
#6	claude-sonnet-4	Strong on Galileo Agent Leaderboard v2 Avg AC and SWE-bench Verified Leaderboard swe_verified_resolved_pct	22.7%	35%	$6.00	Galileo Agent Leaderboard v2, SWE-bench Verified Leaderboard
#7	gpt-5-mini-2025-08-07	Strong on Vals LiveCodeBench overall_accuracy_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct	22.5%	35%	—	Vals LiveCodeBench, SWE-bench Verified Leaderboard
#8	gemini-3-flash-preview	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct	21.2%	28%	$1.13	Vals CorpFin v2, Vals SWE-bench
#9	Grok-4-0709	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	20.6%	31%	—	Vals CorpFin v2, Vals Finance Agent
#10	gpt-4.1-20250414	Strong on Galileo Agent Leaderboard v2 Avg AC and Vectara HHEM Leaderboard overall_hallucination_error_pct	20.0%	32%	—	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#11	claude-sonnet-4.6	Strong on Vals Finance Agent overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct	19.9%	24%	$6.00	Vals Finance Agent, Vals SWE-bench
#12	claude-opus-4-5-20251101	Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and FACTS Benchmark Suite facts_grounding_score_pct	19.3%	27%	—	SWE-bench Verified Leaderboard, FACTS Benchmark Suite
#13	gpt-5.4-2026-03-05	Strong on Vals SWE-bench overall_accuracy_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	19.1%	23%	—	Vals SWE-bench, Vectara HHEM Leaderboard
#14	o3-20250416	Strong on Aider Polyglot Leaderboard percent_correct_pct and SWE-bench Verified Leaderboard swe_verified_resolved_pct	18.9%	28%	$3.50	Aider Polyglot Leaderboard, SWE-bench Verified Leaderboard
#15	gemini-3.1-flash-lite-preview	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	18.6%	27%	$0.56	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#17	gpt-5.1-2025-11-13	Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	17.0%	25%	—	Vals Finance Agent, Vals CorpFin v2
#18	gpt-4o-2024-05-13	Strong on RepoQA Official Results overall_average_pass_at_1_pct and RepoQA Official Results all_average_pass_at_1_pct	17.0%	23%	—	RepoQA Official Results, RepoQA Official Results
#20	grok-4-fast-reasoning	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct	15.4%	30%	$0.28	Vals CorpFin v2, Vals LiveCodeBench
#21	claude-opus-4-6-thinking	Strong on Vals SWE-bench overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	15.3%	17%	—	Vals SWE-bench, Vals CorpFin v2
#22	Kimi K2 Thinking	Strong on SWE-bench Verified Leaderboard swe_verified_resolved_pct and Vals CorpFin v2 overall_accuracy_pct	15.2%	24%	$1.07	SWE-bench Verified Leaderboard, Vals CorpFin v2
#23	claude-opus-4-5-20251101-thinking	Strong on Vals SWE-bench overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	14.3%	17%	—	Vals SWE-bench, Vals Finance Agent
#24	kimi-k2.5-thinking	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct	14.0%	19%	—	Vals CorpFin v2, Vals LiveCodeBench
#25	grok-4-1-fast-reasoning	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	13.9%	22%	$0.28	Vals CorpFin v2, Vals Finance Agent
#26	o4-mini	Strong on Aider Polyglot Leaderboard percent_correct_pct and Vals LiveCodeBench overall_accuracy_pct	13.6%	27%	$1.93	Aider Polyglot Leaderboard, Vals LiveCodeBench
#27	glm-5-thinking	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	13.1%	18%	—	Vals CorpFin v2, Vals Finance Agent
#28	glm-4.7	Strong on Vals LiveCodeBench overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct	13.1%	20%	—	Vals LiveCodeBench, Vals SWE-bench
#29	minimax-m2.1	Strong on Vals SWE-bench overall_accuracy_pct and Vals LiveCodeBench overall_accuracy_pct	13.1%	20%	$0.53	Vals SWE-bench, Vals LiveCodeBench
#30	gpt-4o-20241120	Strong on Aider Code Editing Leaderboard percent_correct_pct and BigCodeBench Official bigcodebench_complete_pct	13.0%	23%	—	Aider Code Editing Leaderboard, BigCodeBench Official
#32	grok-4.20-0309-reasoning	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals SWE-bench overall_accuracy_pct	13.0%	17%	—	Vals CorpFin v2, Vals SWE-bench
#33	claude-sonnet-4-5-20250929-thinking	Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	12.9%	17%	—	Vals Finance Agent, Vals CorpFin v2
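
Because the page describes these rankings as directional, it can help to read the table in groups of near-ties rather than as a strict order. The sketch below is one way to do that, not the site's own method. It uses the top five rows of the table, and the 0.5-point tie margin and the function name `group_near_ties` are assumptions made for illustration only.

```python
# Rows copied from the top of the table: (model, score %, confidence %).
ROWS = [
    ("gpt-5-2025-08-07", 27.3, 35),
    ("gemini-3.1-pro-preview", 27.0, 30),
    ("gemini-3-pro-preview", 23.6, 31),
    ("gemini-2.5-pro", 23.6, 35),
    ("gpt-5.2-2025-12-11", 22.9, 27),
]

# Hypothetical threshold (not from the source): score gaps of at most
# this many percentage points are treated as ties.
TIE_MARGIN = 0.5


def group_near_ties(rows, margin=TIE_MARGIN):
    """Group models whose score is within `margin` points of their group's top score."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    groups = []
    for row in ordered:
        # Compare against the first (highest-scoring) row of the current group.
        if groups and groups[-1][0][1] - row[1] <= margin:
            groups[-1].append(row)
        else:
            groups.append([row])
    return groups


if __name__ == "__main__":
    for tier, group in enumerate(group_near_ties(ROWS), start=1):
        names = ", ".join(model for model, _, _ in group)
        print(f"Tier {tier}: {names}")
```

With this margin, the 0.3-point gap between the first two models places them in the same tier, which matches the page's caution against treating the leader as a clear winner.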

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals CorpFin v2

43 rows · 0.9% avg lift

Vals LiveCodeBench

43 rows · 0.9% avg lift

Vals Finance Agent

30 rows · 0.8% avg lift

Vals Terminal-Bench 2

30 rows · 0.7% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.multi_doc_synthesis, task.code_explanation

Required modes

mode.long_context

Domains

domain.software_engineering

Related in Developer

Autonomous Coding Agent

End-to-end autonomous software engineering: reading issues, writing code, running tests, submitting PRs.

Code generation

Generate correct, secure code from requirements.

Refactoring assistant

Refactor code safely while preserving behavior and improving clarity.

IDE code completion

Fast local-context code completion and small snippet generation.