BasedAGI
Rankings live

finance

AML alert triage

Triage AML alerts into severity, rationale, and next actions.

#1 Recommendation

gemini-3-pro-preview

Strong on Vals Finance Agent overall_accuracy_pct (87%) and Vals CorpFin v2 overall_accuracy_pct (87%)

external/google/gemini-3-pro-preview

39.8%

Score

51.2%

Confidence

Ranked Models

30

Evidence Quality

89%

Scoring

Benchmark-backed

Top Signal

Vals Finance Agent: overall_accuracy_pct

All Ranked Models

30 of 30
Rank · Model · Score
#1 · gemini-3-pro-preview · 39.8%
Strong on Vals Finance Agent overall_accuracy_pct (87%) and Vals CorpFin v2 overall_accuracy_pct (87%)
#2 · gemini-2.5-pro · 35.7%
Strong on FACTS Benchmark Suite facts_grounding_score_pct (100%) and Vals CorpFin v2 overall_accuracy_pct (78%)
#3 · anthropic/claude-sonnet-4.6 · 34.6%
Strong on Vals Finance Agent overall_accuracy_pct (100%) and Vals CorpFin v2 overall_accuracy_pct (91%)
#4 · Grok-4-0709 · 34.2%
#5 · gpt-5-mini-2025-08-07 · 33.2%
#6 · gpt-5-2025-08-07 · 31.9%
#7 · openai/gpt-5.4-2026-03-05 · 31.8%
#8 · google/gemini-3.1-pro-preview · 31.7%
#9 · gpt-4.1-20250414 · 30.7%
#10 · gpt-5.1-2025-11-13 · 28.3%
#11 · gpt-5.2-2025-12-11 · 27.9%
#12 · anthropic/claude-opus-4-6-thinking · 27.3%
#13 · xai-org/grok-4-fast-reasoning · 27.2%
#14 · xai-org/grok-4-1-fast-reasoning · 26.7%
#15 · gemini-3-flash-preview · 26.5%
#16 · google/gemini-3.1-flash-lite-preview · 26.5%
#17 · claude-sonnet-4-20250514 · 26.1%
#18 · anthropic/claude-opus-4-5-20251101-thinking · 25.9%
#19 · kimi/kimi-k2.5-thinking · 25.5%
#20 · claude-opus-4-5-20251101 · 24.6%
#21 · anthropic/claude-sonnet-4-5-20250929-thinking · 24.2%
#23 · alibaba/qwen3.5-flash · 22.2%
#24 · zai/glm-5-thinking · 22.2%
#25 · anthropic/claude-haiku-4-5-20251001-thinking · 21.5%
#26 · mistralai/mistral-large-2512 · 19.0%
#27 · xai-org/grok-4-1-fast-non-reasoning · 18.9%
#28 · z-ai/glm-4.7 · 18.4%
#29 · qwen/qwen3-max · 18.2%
#30 · Kimi K2 Thinking · 17.9%
#31 · gpt-4.1-mini-20250414 · 17.8%

Compare Models

Model A leads by +4.1%


Model A

gemini-3-pro-preview

external/google/gemini-3-pro-preview

39.8%

Rank #1

Confidence 51.2% · 29 evidence pts

Vals Finance Agent: overall_accuracy_pct

Value 87.0% · Conf 100.0% · Weight 3.3%

vals_finance_agent.overall_accuracy_pct (Mar 12, 2026)

Vals CorpFin v2: overall_accuracy_pct

Value 86.7% · Conf 100.0% · Weight 3.1%

vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)

FACTS Benchmark Suite: facts_grounding_score_pct

Value 88.3% · Conf 100.0% · Weight 2.6%

facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)

FACTS Benchmark Suite: facts_search_score_pct

Value 100.0% · Conf 100.0% · Weight 2.3%

facts_benchmark_suite.facts_search_score_pct (Mar 12, 2026)

Model B

gemini-2.5-pro

external/google/gemini-2-5-pro

35.7%

Rank #2

Confidence 52.0% · 32 evidence pts

FACTS Benchmark Suite: facts_grounding_score_pct

Value 100.0% · Conf 100.0% · Weight 3.0%

facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)

Vals CorpFin v2: overall_accuracy_pct

Value 78.4% · Conf 100.0% · Weight 2.8%

vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)

Vals Finance Agent: overall_accuracy_pct

Value 65.5% · Conf 100.0% · Weight 2.4%

vals_finance_agent.overall_accuracy_pct (Mar 12, 2026)

FACTS Benchmark Suite: average_score_pct

Value 78.3% · Conf 100.0% · Weight 1.7%

facts_benchmark_suite.average_score_pct (Mar 12, 2026)
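
The lead shown in the comparison header appears to be the simple difference between the two models' overall scores, in percentage points (39.8% minus 35.7%). A minimal sketch of that arithmetic follows; the function name `lead_summary` and the tie handling are illustrative assumptions, not part of the site.

```python
def lead_summary(score_a: float, score_b: float) -> str:
    """Describe which model leads, given two overall scores in percent.

    Hypothetical helper: the page only shows the resulting label,
    not how it is computed.
    """
    # Round to one decimal to match the precision of the displayed scores.
    diff = round(score_a - score_b, 1)
    if diff == 0:
        return "Tied"
    leader = "Model A" if diff > 0 else "Model B"
    return f"{leader} leads by {abs(diff):+.1f}%"


# Scores for gemini-3-pro-preview (Model A) and gemini-2.5-pro (Model B).
print(lead_summary(39.8, 35.7))
```

Rounding before formatting avoids floating-point noise in the subtraction.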

Ranking Diagnostics & Missing Models

Source Lift

Ranked

46

Sources

8

Quality

Sufficient

Vals CorpFin v2

vals_corp_fin_v2

42 rows

1.7% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

42 rows

1.7% avg lift

Vals GPQA

vals_gpqa

36 rows

0.7% avg lift

Vals Mortgage Tax

vals_mortgage_tax

30 rows

1.3% avg lift

Missing Strong Models

gpt-4o

external/openai/gpt-4o

Rank #22

15.2%

Thin evidence after weighting

gpt-4o-2024-05-13

external/openai/gpt-4o-2024-05-13

Rank #51

10.5%

Thin evidence after weighting

deepseek/deepseek-r1

external/deepseek/deepseek-r1

Rank #54

10.5%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.priority_routing
task.risk_assessment

Required Modes

mode.json_schema
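
The `mode.json_schema` requirement suggests that a model's triage reply is expected to be structured JSON. As a rough sketch of what that could look like for the three outputs named in the use case (severity, rationale, next actions), the parser below rejects off-schema replies. The field names, the severity levels, and the `parse_triage` helper are all assumptions for illustration; the page does not publish a schema.

```python
import json
from dataclasses import dataclass

# Hypothetical severity levels; the page does not define them.
SEVERITIES = ("low", "medium", "high", "critical")


@dataclass(frozen=True)
class TriageResult:
    severity: str
    rationale: str
    next_actions: tuple


def parse_triage(raw: str) -> TriageResult:
    """Parse a model's JSON reply and raise ValueError if it is off-schema."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    expected = {"severity", "rationale", "next_actions"}
    if set(data) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")

    if data["severity"] not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}")

    rationale = data["rationale"]
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")

    actions = data["next_actions"]
    if (
        not isinstance(actions, list)
        or not actions
        or not all(isinstance(a, str) and a.strip() for a in actions)
    ):
        raise ValueError("next_actions must be a non-empty list of strings")

    return TriageResult(data["severity"], rationale.strip(), tuple(actions))
```

A reply such as `{"severity": "high", "rationale": "Repeated cash deposits just under the reporting threshold", "next_actions": ["Escalate to analyst"]}` would parse, while a reply with an unknown severity or a missing field would be rejected before it reaches downstream routing.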

Domains

domain.finance_compliance_aml

Related Use Cases