BasedAGI

data_analytics

Executive brief from metrics

Summarize key metric changes into an executive-ready brief.

#1 Recommendation

qwen-2.5-72b-instruct

Strong on DuckDB NSQL Leaderboard all_execution_accuracy (83%) and JSONSchemaBench Leaderboard medium_schema_compliance_pct (90%)

external/qwen/qwen-2-5-72b-instruct

Score: 21.2%

Confidence: 32.1%

Limited benchmark evidence for this use case.

48 ranked models with average evidence of 11.1 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 30

Evidence Quality: 79%

Scoring: Benchmark-backed

Top Signal: DuckDB NSQL Leaderboard: all_execution_accuracy

All Ranked Models

Rank  Model                           Score
#1    qwen-2.5-72b-instruct           21.2%
      Strong on DuckDB NSQL Leaderboard all_execution_accuracy (83%) and JSONSchemaBench Leaderboard medium_schema_compliance_pct (90%)
#2    gpt-4o                          19.0%
      Strong on DuckDB NSQL Leaderboard all_execution_accuracy (77%) and JSONSchemaBench Leaderboard medium_schema_compliance_pct (100%)
#3    gpt-4o-20241120                 18.3%
      Strong on DuckDB NSQL Leaderboard all_execution_accuracy (96%) and DuckDB NSQL Leaderboard hard_execution_accuracy (75%)
#5    deepseek/deepseek-r1            17.6%
#9    openai/gpt-4o-mini-2024-07-18   14.3%
#11   gpt-4o-2024-08-06               13.5%
#12   gemini-3-pro-preview            13.3%
#14   gemini-2.5-pro                  12.5%
#17   google/gemini-2.0-flash-001     11.5%
#18   gpt-4.1-20250414                11.4%
#21   Grok-4-0709                     11.1%
#24   google/gemini-3.1-pro-preview   10.7%
#25   claude-sonnet-4-20250514        10.7%
#29   Llama-3.3-70B-Instruct          10.4%
#30   gemini-2.5-flash                10.1%
#33   gpt-5-2025-08-07                9.8%
#35   openai/gpt-5.4-2026-03-05       9.6%
#37   gemma-2-27b-it                  9.5%
#38   gpt-5.1-2025-11-13              9.4%
#41   anthropic/claude-sonnet-4.6     9.3%
#42   phi-4                           9.2%
#43   claude-opus-4-5-20251101        9.2%
#44   gpt-5-mini-2025-08-07           9.0%
#46   Qwen3-30B-A3B                   8.9%
#47   gemini-3-flash-preview          8.8%
#48   Qwen2.5-Coder-7B                8.7%
#52   Qwen3-32B                       8.4%
#56   kimi/kimi-k2.5-thinking         7.9%
#57   xai-org/grok-4-fast-reasoning   7.7%
#58   QwQ-32B-Preview                 7.7%

Compare Models

Model A leads by 2.2 percentage points (21.2% vs. 19.0%)


Model A

qwen-2.5-72b-instruct

external/qwen/qwen-2-5-72b-instruct

21.2%

Rank #1

Confidence 32.1% · 12 evidence pts

DuckDB NSQL Leaderboard: all_execution_accuracy

Value 82.7% · Conf 100.0% · Weight 6.1%

duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)

JSONSchemaBench Leaderboard: medium_schema_compliance_pct

Value 90.1% · Conf 100.0% · Weight 2.5%

jsonschemabench_leaderboard.medium_schema_compliance_pct (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 76.1% · Conf 100.0% · Weight 1.5%

galileo_agent_v2.avg_ac (Mar 12, 2026)

JSONSchemaBench Leaderboard: hard_schema_compliance_pct

Value 74.4% · Conf 100.0% · Weight 1.4%

jsonschemabench_leaderboard.hard_schema_compliance_pct (Mar 12, 2026)

Model B

gpt-4o

external/openai/gpt-4o

19.0%

Rank #2

Confidence 32.3% · 14 evidence pts

DuckDB NSQL Leaderboard: all_execution_accuracy

Value 76.9% · Conf 100.0% · Weight 5.7%

duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)

JSONSchemaBench Leaderboard: medium_schema_compliance_pct

Value 100.0% · Conf 100.0% · Weight 2.7%

jsonschemabench_leaderboard.medium_schema_compliance_pct (Mar 12, 2026)

JSONSchemaBench Leaderboard: hard_schema_compliance_pct

Value 100.0% · Conf 100.0% · Weight 1.9%

jsonschemabench_leaderboard.hard_schema_compliance_pct (Mar 12, 2026)

DuckDB NSQL Leaderboard: hard_execution_accuracy

Value 50.0% · Conf 100.0% · Weight 1.4%

duckdb_nsql_leaderboard.hard_execution_accuracy (Mar 12, 2026)
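
The comparison above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it copies the displayed scores and evidence values into plain dictionaries, then computes the overall lead and the per-benchmark gap on the signals both models share. The dictionary layout and the helper names `score_lead` and `shared_deltas` are assumptions made for this example, not part of the site. It does not attempt to reproduce how the site derives the overall score from the weights, which the page does not state.

```python
# Values copied from the Compare Models panel; the structure is hypothetical.
model_a = {
    "name": "qwen-2.5-72b-instruct",
    "score": 21.2,
    "evidence": {
        "duckdb_nsql_leaderboard.all_execution_accuracy": 82.7,
        "jsonschemabench_leaderboard.medium_schema_compliance_pct": 90.1,
        "galileo_agent_v2.avg_ac": 76.1,
        "jsonschemabench_leaderboard.hard_schema_compliance_pct": 74.4,
    },
}
model_b = {
    "name": "gpt-4o",
    "score": 19.0,
    "evidence": {
        "duckdb_nsql_leaderboard.all_execution_accuracy": 76.9,
        "jsonschemabench_leaderboard.medium_schema_compliance_pct": 100.0,
        "jsonschemabench_leaderboard.hard_schema_compliance_pct": 100.0,
        "duckdb_nsql_leaderboard.hard_execution_accuracy": 50.0,
    },
}


def score_lead(a, b):
    """Overall score of a minus b, in percentage points."""
    return round(a["score"] - b["score"], 1)


def shared_deltas(a, b):
    """Per-benchmark value of a minus b, restricted to signals both models have."""
    shared = a["evidence"].keys() & b["evidence"].keys()
    return {k: round(a["evidence"][k] - b["evidence"][k], 1) for k in sorted(shared)}


if __name__ == "__main__":
    print(f"{model_a['name']} leads by {score_lead(model_a, model_b)} points")
    for signal, delta in shared_deltas(model_a, model_b).items():
        print(f"{signal}: {delta:+.1f}")
```

Read this way, Model A's overall lead coexists with lower values on both shared JSONSchemaBench signals; its advantage among the shared signals comes from DuckDB NSQL all_execution_accuracy, which is also the signal shown with the largest weight for both models.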

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 48

Sources: 8

Quality: Insufficient

Source              ID                Rows  Avg lift
Vals Legal Bench    vals_legal_bench  25    0.5%
Vals Tax Eval v2    vals_tax_eval_v2  25    0.5%
Vals CorpFin v2     vals_corp_fin_v2  25    0.4%
Vals LiveCodeBench  vals_lcb          24    0.5%

Missing Strong Models

gpt-5.2-2025-12-11

external/openai/gpt-5-2-2025-12-11

Rank #16

16.2%

Thin evidence after weighting

anthropic/claude-opus-4-6-thinking

external/anthropic/claude-opus-4-6-thinking

Rank #17

16.1%

Thin evidence after weighting

anthropic/claude-opus-4-5-20251101-thinking

external/anthropic/claude-opus-4-5-20251101-thinking

Rank #21

15.2%

Thin evidence after weighting

anthropic/claude-sonnet-4-5-20250929-thinking

external/anthropic/claude-sonnet-4-5-20250929-thinking

Rank #28

14.1%

Thin evidence after weighting

Taxonomy Details

Core Tasks

task.dashboard_narrative
task.write_memo_brief

Required Modes

none

Domains

domain.data_analytics_bi

Related Use Cases