Best LLM for Unit Test Generation

Ranked models for generating meaningful unit tests and edge cases from code.

#1 Recommendation

gpt-4o-2024-05-13

Strong on RepoQA Official Results overall_average_pass_at_1_pct (99%) and RepoQA Official Results all_average_pass_at_1_pct (99%)

external/openai/gpt-4o-2024-05-13

Score: 19.2%

Confidence: 23.9%

Evidence

Runners-up: #2 gpt-4.1-20250414 (17.4%), #3 gemini-3-pro-preview (16.7%), #4 google/gemini-3.1-pro-preview (16.3%)

Ranked Models

Evidence Quality: 80%

Scoring: Benchmark-backed

Top Signal: RepoQA Official Results: overall_average_pass_at_1_pct

All Ranked Models

Showing 30 of 30 models

Rank	Model	Score	Confidence	Evidence	Top Benchmarks
#2	gpt-4o-2024-05-13	19.2%	23.9%	9	RepoQA Official Results overall_average_pass_at_1_pct (Mar 12, 2026) RepoQA Official Results all_average_pass_at_1_pct (Mar 12, 2026)
#5	gpt-4.1-20250414	17.4%	26.4%	18	Galileo Agent Leaderboard v2 Avg AC (Mar 12, 2026) MMLongBench-Doc Leaderboard acc_score_pct (Mar 12, 2026)
#6	gemini-3-pro-preview	16.7%	20.4%	21	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#7	google/gemini-3.1-pro-preview	16.3%	17.9%	16	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Vals Terminal-Bench 2 overall_accuracy_pct (Mar 12, 2026)
#8	Grok-4-0709	16.1%	24.1%	18	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Galileo Agent Leaderboard v2 Avg AC (Mar 12, 2026)
#10	openai/gpt-5.4-2026-03-05	15.3%	17.3%	15	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#12	claude-sonnet-4-20250514	15.1%	21.0%	17	Galileo Agent Leaderboard v2 Avg AC (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#13	anthropic/claude-sonnet-4.6	14.9%	17.3%	15	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#14	z-ai/glm-4.7	14.9%	21.4%	15	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#15	anthropic/claude-opus-4-6-thinking	14.7%	16.0%	13	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#17	claude-opus-4-5-20251101	14.5%	17.7%	16	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals Terminal-Bench 2 overall_accuracy_pct (Mar 12, 2026)
#18	gemini-3-flash-preview	14.3%	17.0%	15	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#19	gpt-5-2025-08-07	14.3%	17.7%	16	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#20	minimax/minimax-m2.1	14.2%	21.4%	15	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#21	gpt-5.2-2025-12-11	14.2%	16.0%	13	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#22	gpt-5.1-2025-11-13	14.2%	17.7%	16	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#23	anthropic/claude-opus-4-5-20251101-thinking	14.1%	16.0%	13	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#24	Kimi K2 Thinking	14.0%	20.8%	14	Sonar Java Quality Leaderboard functional_skill_pct (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#27	Meta-Llama-3-70B-Instruct	12.8%	14.8%	4	RepoQA Official Results overall_average_pass_at_1_pct (Mar 12, 2026) RepoQA Official Results all_average_pass_at_1_pct (Mar 12, 2026)
#28	kimi/kimi-k2.5-thinking	12.8%	17.0%	15	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#29	gpt-4o-20241120	12.7%	21.6%	13	Aider Code Editing Leaderboard percent_correct_pct (Mar 12, 2026) MMLongBench-Doc Leaderboard acc_score_pct (Mar 12, 2026)
#31	anthropic/claude-sonnet-4-5-20250929-thinking	12.4%	16.0%	13	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#34	gpt-4o-2024-08-06	12.0%	23.9%	15	Aider Code Editing Leaderboard percent_correct_pct (Mar 12, 2026) Aider Code Editing Leaderboard correct_edit_format_pct (Mar 12, 2026)
#35	gemini-2.5-pro	12.0%	19.3%	21	Galileo Agent Leaderboard v2 Avg AC (Mar 12, 2026) Vals Terminal-Bench 2 overall_accuracy_pct (Mar 12, 2026)
#36	deepseek/deepseek-r1	11.6%	17.0%	16	Aider Polyglot Leaderboard percent_correct_pct (Mar 12, 2026) Sonar Java Quality Leaderboard functional_skill_pct (Mar 12, 2026)
#38	zai/glm-5-thinking	11.5%	14.8%	11	Vals SWE-bench overall_accuracy_pct (Mar 12, 2026) Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026)
#39	qwen-2.5-72b-instruct	11.5%	16.2%	10	Galileo Agent Leaderboard v2 Avg AC (Mar 12, 2026) Aider Code Editing Leaderboard percent_correct_pct (Mar 12, 2026)
#40	xai-org/grok-4-fast-reasoning	11.2%	17.3%	15	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#41	google/gemini-3.1-flash-lite-preview	11.1%	16.6%	14	Vals LiveCodeBench overall_accuracy_pct (Mar 12, 2026) Vals SWE-bench overall_accuracy_pct (Mar 12, 2026)
#43	Meta-Llama-3-8B-Instruct	11.1%	18.1%	6	RepoQA Official Results overall_average_pass_at_1_pct (Mar 12, 2026) RepoQA Official Results all_average_pass_at_1_pct (Mar 12, 2026)

Head-to-Head: #1 vs #2

Top Pick

gpt-4o-2024-05-13

Strong on RepoQA Official Results overall_average_pass_at_1_pct (99%) and RepoQA Official Results all_average_pass_at_1_pct (99%)

Score: 19.2%

Confidence: 23.9%

gpt-4.1-20250414

Strong on Galileo Agent Leaderboard v2 Avg AC (100%) and MMLongBench-Doc Leaderboard acc_score_pct (75%)

Score: 17.4%

Confidence: 26.4%

Related Lookups

Best LLM for Code Generation

Benchmark-backed ranking of models for generating correct, secure code from requirements.

Best LLM for Debugging

Find the top-ranked models for localizing bugs and proposing fixes with explanations.

Best LLM for Code Review

Compare models for automated PR review covering correctness, security, and maintainability.

Best LLM for Refactoring

Ranked models for safely refactoring code while preserving behavior and improving clarity.

Best LLM for IDE Code Completion

Compare models for fast, accurate local-context code completion and snippet generation.

Best LLM for Documentation from Code

Ranked models for generating docstrings and technical docs that match code behavior.