BasedAGI

engineering

Simulation setup assistant

Turn design requirements into simulation setup checklists and boundary notes.

#1 Recommendation

gemini-3-pro-preview

Strong on Vals SWE-bench overall_accuracy_pct (88%) and Vals LiveCodeBench overall_accuracy_pct (97%)

external/google/gemini-3-pro-preview

Score

28.2%

Confidence

34.4%

Ranked Models

30

Evidence Quality

85%

Scoring

Benchmark-backed

Top Signal

Vals SWE-bench: overall_accuracy_pct

All Ranked Models

30 of 30
Rank  Model  Score
#1  gemini-3-pro-preview  28.2%
    Strong on Vals SWE-bench overall_accuracy_pct (88%) and Vals LiveCodeBench overall_accuracy_pct (97%)
#2  google/gemini-3.1-pro-preview  27.5%
    Strong on Vals LiveCodeBench overall_accuracy_pct (100%) and Vals Terminal-Bench 2 overall_accuracy_pct (100%)
#3  Grok-4-0709  25.9%
    Strong on Vals LiveCodeBench overall_accuracy_pct (93%) and Vals SWE-bench overall_accuracy_pct (66%)
#4  openai/gpt-5.4-2026-03-05  25.8%
#5  anthropic/claude-sonnet-4.6  25.2%
#6  anthropic/claude-opus-4-6-thinking  24.9%
#7  claude-opus-4-5-20251101  24.5%
#8  gemini-3-flash-preview  24.2%
#9  gpt-4.1-20250414  24.1%
#10  gpt-5-2025-08-07  24.1%
#11  gpt-5.2-2025-12-11  24.0%
#12  gpt-5.1-2025-11-13  23.9%
#13  anthropic/claude-opus-4-5-20251101-thinking  23.7%
#14  claude-sonnet-4-20250514  23.6%
#16  kimi/kimi-k2.5-thinking  21.6%
#17  anthropic/claude-sonnet-4-5-20250929-thinking  20.9%
#19  zai/glm-5-thinking  19.4%
#20  gemini-2.5-pro  19.0%
#21  z-ai/glm-4.7  18.9%
#22  xai-org/grok-4-fast-reasoning  18.9%
#23  google/gemini-3.1-flash-lite-preview  18.8%
#24  minimax/minimax-m2.1  18.6%
#25  gpt-5-mini-2025-08-07  18.3%
#27  xai-org/grok-4-1-fast-reasoning  18.0%
#28  alibaba/qwen3.5-flash  17.6%
#29  Kimi K2 Thinking  16.8%
#30  o3-20250416  16.8%
#31  anthropic/claude-haiku-4-5-20251001-thinking  16.8%
#32  gpt-4.1-mini-20250414  16.5%
#35  qwen/qwen3-max  15.6%

Compare Models

Model A leads by +0.7%

Model A

gemini-3-pro-preview

external/google/gemini-3-pro-preview

28.2%

Rank #1

Confidence 34.4% · 21 evidence pts

Vals SWE-bench: overall_accuracy_pct

Value 87.5% · Conf 100.0% · Weight 2.6%

vals_swebench.overall_accuracy_pct (Mar 12, 2026)

Vals LiveCodeBench: overall_accuracy_pct

Value 97.1% · Conf 100.0% · Weight 2.5%

vals_lcb.overall_accuracy_pct (Mar 12, 2026)

Vals Terminal-Bench 2: overall_accuracy_pct

Value 81.0% · Conf 100.0% · Weight 2.1%

vals_terminal_bench_2.overall_accuracy_pct (Mar 12, 2026)

FACTS Benchmark Suite: average_score_pct

Value 100.0% · Conf 100.0% · Weight 0.5%

facts_benchmark_suite.average_score_pct (Mar 12, 2026)

Model B

google/gemini-3.1-pro-preview

external/google/gemini-3-1-pro-preview

27.5%

Rank #2

Confidence 30.2% · 16 evidence pts

Vals LiveCodeBench: overall_accuracy_pct

Value 100.0% · Conf 100.0% · Weight 2.6%

vals_lcb.overall_accuracy_pct (Mar 12, 2026)

Vals Terminal-Bench 2: overall_accuracy_pct

Value 100.0% · Conf 100.0% · Weight 2.6%

vals_terminal_bench_2.overall_accuracy_pct (Mar 12, 2026)

Vals SWE-bench: overall_accuracy_pct

Value 85.2% · Conf 100.0% · Weight 2.5%

vals_swebench.overall_accuracy_pct (Mar 12, 2026)

Vals Mortgage Tax: overall_accuracy_pct

Value 100.0% · Conf 100.0% · Weight 0.5%

vals_mortgage_tax.overall_accuracy_pct (Mar 12, 2026)

Ranking Diagnostics & Missing Models

Source Lift

Ranked

50

Sources

8

Quality

Sufficient

Vals CorpFin v2

vals_corp_fin_v2

42 rows

0.4% avg lift

Vals LiveCodeBench

vals_lcb

41 rows

1.9% avg lift

Vals Legal Bench

vals_legal_bench

41 rows

0.5% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

41 rows

0.4% avg lift

Missing Strong Models

No obvious gaps right now.

Taxonomy Details

Core Tasks

task.planning_task_breakdown, task.write_memo_brief

Required Modes

none

Domains

domain.mechanical_engineering, domain.civil_engineering

Related Use Cases