BasedAGI
Rankings live

finance

Thesis red teaming

Stress-test an investment thesis with counterarguments and risk.

#1 Recommendation

gemini-3-pro-preview

Strong on Vals Finance Agent overall_accuracy_pct (87%) and Vals CorpFin v2 overall_accuracy_pct (87%)

external/google/gemini-3-pro-preview

47.2%

Score

60.7%

Confidence

Ranked Models

30

Evidence Quality

91%

Scoring

Benchmark-backed

Top Signal

Vals Finance Agent: overall_accuracy_pct

All Ranked Models

30 of 30
Rank  Model  Score
#1  gemini-3-pro-preview  47.2%
    Strong on Vals Finance Agent overall_accuracy_pct (87%) and Vals CorpFin v2 overall_accuracy_pct (87%)
#2  gemini-2.5-pro  42.4%
    Strong on FACTS Benchmark Suite facts_grounding_score_pct (100%) and Vals CorpFin v2 overall_accuracy_pct (78%)
#3  anthropic/claude-sonnet-4.6  41.1%
    Strong on Vals Finance Agent overall_accuracy_pct (100%) and Vals CorpFin v2 overall_accuracy_pct (91%)
#4  Grok-4-0709  40.6%
#5  gpt-5-mini-2025-08-07  39.4%
#6  gpt-5-2025-08-07  37.9%
#7  openai/gpt-5.4-2026-03-05  37.8%
#8  google/gemini-3.1-pro-preview  37.7%
#9  gpt-4.1-20250414  35.1%
#10  gpt-5.1-2025-11-13  33.6%
#11  gpt-5.2-2025-12-11  33.1%
#12  anthropic/claude-opus-4-6-thinking  32.4%
#13  xai-org/grok-4-fast-reasoning  32.3%
#14  xai-org/grok-4-1-fast-reasoning  31.7%
#15  gemini-3-flash-preview  31.4%
#16  google/gemini-3.1-flash-lite-preview  31.4%
#17  claude-sonnet-4-20250514  30.9%
#18  anthropic/claude-opus-4-5-20251101-thinking  30.8%
#19  kimi/kimi-k2.5-thinking  30.3%
#20  claude-opus-4-5-20251101  29.2%
#21  anthropic/claude-sonnet-4-5-20250929-thinking  28.7%
#23  alibaba/qwen3.5-flash  26.4%
#24  zai/glm-5-thinking  26.4%
#25  anthropic/claude-haiku-4-5-20251001-thinking  25.5%
#26  mistralai/mistral-large-2512  22.6%
#27  xai-org/grok-4-1-fast-non-reasoning  22.5%
#28  z-ai/glm-4.7  21.8%
#29  qwen/qwen3-max  21.6%
#30  Kimi K2 Thinking  21.3%
#31  gpt-4.1-mini-20250414  21.1%

Compare Models

Model A leads by +4.8 percentage points

Model A

gemini-3-pro-preview

external/google/gemini-3-pro-preview

47.2%

Rank #1

Confidence 60.7% · 29 evidence pts

Vals Finance Agent: overall_accuracy_pct

Value 87.0% · Conf 100.0% · Weight 3.7%

vals_finance_agent.overall_accuracy_pct (Mar 12, 2026)

Vals CorpFin v2: overall_accuracy_pct

Value 86.7% · Conf 100.0% · Weight 3.5%

vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)

FACTS Benchmark Suite: facts_grounding_score_pct

Value 88.3% · Conf 100.0% · Weight 3.0%

facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)

FACTS Benchmark Suite: facts_search_score_pct

Value 100.0% · Conf 100.0% · Weight 2.6%

facts_benchmark_suite.facts_search_score_pct (Mar 12, 2026)

Model B

gemini-2.5-pro

external/google/gemini-2-5-pro

42.4%

Rank #2

Confidence 61.8% · 32 evidence pts

FACTS Benchmark Suite: facts_grounding_score_pct

Value 100.0% · Conf 100.0% · Weight 3.4%

facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)

Vals CorpFin v2: overall_accuracy_pct

Value 78.4% · Conf 100.0% · Weight 3.2%

vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)

Vals Finance Agent: overall_accuracy_pct

Value 65.5% · Conf 100.0% · Weight 2.8%

vals_finance_agent.overall_accuracy_pct (Mar 12, 2026)

FACTS Benchmark Suite: average_score_pct

Value 78.3% · Conf 100.0% · Weight 2.0%

facts_benchmark_suite.average_score_pct (Mar 12, 2026)
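
The head-to-head figures in the Compare Models panel can be cross-checked with a few lines of Python. This is only a sketch: the numbers are copied from the panel, while the dictionary layout and the helper names `score_lead` and `shared_signal_gaps` are illustrative assumptions and not part of the site. It does not attempt to reproduce how the overall score is aggregated, since the page does not state that formula.

```python
# Values copied from the Compare Models panel.
# The data layout and function names are hypothetical, for illustration only.
MODEL_A = {
    "name": "gemini-3-pro-preview",
    "score": 47.2,
    "signals": {
        "vals_finance_agent.overall_accuracy_pct": 87.0,
        "vals_corp_fin_v2.overall_accuracy_pct": 86.7,
        "facts_benchmark_suite.facts_grounding_score_pct": 88.3,
        "facts_benchmark_suite.facts_search_score_pct": 100.0,
    },
}

MODEL_B = {
    "name": "gemini-2.5-pro",
    "score": 42.4,
    "signals": {
        "facts_benchmark_suite.facts_grounding_score_pct": 100.0,
        "vals_corp_fin_v2.overall_accuracy_pct": 78.4,
        "vals_finance_agent.overall_accuracy_pct": 65.5,
        "facts_benchmark_suite.average_score_pct": 78.3,
    },
}


def score_lead(a, b):
    """Difference in overall score, in percentage points (a minus b)."""
    return round(a["score"] - b["score"], 1)


def shared_signal_gaps(a, b):
    """Per-signal difference (a minus b) for signals listed for both models."""
    shared = a["signals"].keys() & b["signals"].keys()
    return {
        name: round(a["signals"][name] - b["signals"][name], 1)
        for name in sorted(shared)
    }


if __name__ == "__main__":
    print(f"{MODEL_A['name']} leads by {score_lead(MODEL_A, MODEL_B):+.1f} points")
    for name, gap in shared_signal_gaps(MODEL_A, MODEL_B).items():
        print(f"{name}: {gap:+.1f}")
```

On the three signals shown for both models, Model A is ahead on the two Vals finance benchmarks and behind on FACTS grounding, which is consistent with a modest overall lead.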

Ranking Diagnostics & Missing Models

Source Lift

Ranked

50

Sources

8

Quality

Sufficient

Vals CorpFin v2

vals_corp_fin_v2

42 rows

1.9% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

42 rows

1.9% avg lift

Vals GPQA

vals_gpqa

36 rows

0.8% avg lift

Vals Mortgage Tax

vals_mortgage_tax

30 rows

1.5% avg lift

Missing Strong Models

gpt-4o

external/openai/gpt-4o

Rank #22

15.2%

Thin evidence after weighting

gpt-4o-2024-05-13

external/openai/gpt-4o-2024-05-13

Rank #51

10.5%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.tradeoff_analysis, task.risk_assessment

Required Modes

none

Domains

domain.finance_equity_research

Related Use Cases