BasedAGI · Rankings live

data_analytics

Metric definition workshop

Turn ambiguous KPI definitions into precise, measurable specs.
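The benchmark signals behind this use case lean on SQL execution accuracy and JSON-schema compliance, which implies the deliverable is a structured, executable metric spec rather than prose. As a purely hypothetical sketch (the MetricSpec fields, table names, and event types below are illustrative assumptions, not part of BasedAGI or any standard), an ambiguous KPI like "weekly active users" made measurable could look like this:

```python
# Hypothetical illustration only: the spec format and field names are assumptions,
# chosen to show what "precise, measurable" can mean in practice.
from dataclasses import dataclass, field


@dataclass
class MetricSpec:
    """A KPI pinned down enough that two analysts compute the same number."""
    name: str
    grain: str                      # unit of analysis, e.g. one row per user per week
    numerator_sql: str              # exact aggregation, not a prose description
    denominator_sql: str | None = None
    filters: list[str] = field(default_factory=list)
    owner: str = "unassigned"


# Ambiguous ask: "weekly active users". Precise spec: which events count,
# which timezone, and whether internal or bot accounts are excluded.
weekly_active_users = MetricSpec(
    name="weekly_active_users",
    grain="one row per user per ISO week (UTC)",
    numerator_sql=(
        "SELECT COUNT(DISTINCT user_id) "
        "FROM events "
        "WHERE event_type IN ('session_start', 'api_call') "
        "AND event_ts >= DATE_TRUNC('week', CURRENT_DATE)"
    ),
    filters=["is_internal = FALSE", "is_bot = FALSE"],
    owner="analytics",
)

print(weekly_active_users)
```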

#1 Recommendation

gpt-4o

Strong on DuckDB NSQL Leaderboard all_execution_accuracy (77%) and JSONSchemaBench Leaderboard medium_schema_compliance_pct (100%)

external/openai/gpt-4o

Score: 26.6%

Confidence: 46.0%

Limited benchmark evidence for this use case.

41 ranked models with average evidence of 7.8 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 30
Evidence Quality: 81%
Scoring: Benchmark-backed
Top Signal: DuckDB NSQL Leaderboard · all_execution_accuracy

All Ranked Models

Filters: Max params · Min confidence (showing 30 of 30)

Rank · Model · Score
#1 gpt-4o · 26.6%
Strong on DuckDB NSQL Leaderboard all_execution_accuracy (77%) and JSONSchemaBench Leaderboard medium_schema_compliance_pct (100%)

#2 qwen-2.5-72b-instruct · 25.1%
Strong on DuckDB NSQL Leaderboard all_execution_accuracy (83%) and JSONSchemaBench Leaderboard medium_schema_compliance_pct (90%)

#3 gpt-4o-20241120 · 23.8%
Strong on DuckDB NSQL Leaderboard all_execution_accuracy (96%) and DuckDB NSQL Leaderboard hard_execution_accuracy (75%)

#5 deepseek/deepseek-r1 · 21.5%
#10 openai/gpt-4o-mini-2024-07-18 · 17.0%
#13 gpt-4o-2024-08-06 · 16.0%
#20 Llama-3.3-70B-Instruct · 13.8%
#25 google/gemini-2.0-flash-001 · 13.2%
#28 Qwen3-30B-A3B · 12.6%
#29 Qwen2.5-Coder-7B · 12.5%
#30 gemma-2-27b-it · 12.1%
#32 phi-4 · 11.7%
#35 Qwen3-32B · 11.1%
#39 QwQ-32B-Preview · 10.8%
#41 Qwen2.5-32B-Instruct · 10.5%
#42 Phi-3-medium-128k-instruct · 10.5%
#45 gemini-3-pro-preview · 9.2%
#46 Meta-Llama-3.1-8B · 9.1%
#49 gpt-4.1-20250414 · 8.9%
#50 Llama-3.1-70B-Instruct · 8.8%
#52 Phi-3-mini-128k-instruct · 8.7%
#53 gemini-2.5-pro · 8.7%
#54 Grok-4-0709 · 8.6%
#56 claude-sonnet-4-20250514 · 8.3%
#58 Qwen2.5-14B-Instruct · 8.1%
#62 gemini-2.5-flash · 7.3%
#64 Meta-Llama-3-8B-Instruct · 7.2%
#65 Llama-3.1-8B-Instruct · 7.1%
#67 deepseek-v3 · 6.8%
#71 Qwen2.5-Coder-3B-Instruct · 6.6%
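To make the "Max params" and "Min confidence" filters above concrete, here is a minimal sketch, assuming each ranked entry carries a parameter count and a per-use-case confidence value. The RankedModel fields and the 75B / 30% thresholds are illustrative assumptions, not BasedAGI's data model:

```python
# Illustrative only: fields and thresholds are assumptions, mirroring the
# "Max params" / "Min confidence" filters on the ranked list above.
from dataclasses import dataclass


@dataclass
class RankedModel:
    rank: int
    name: str
    score_pct: float         # use-case score, e.g. 26.6
    confidence_pct: float    # e.g. 46.0
    params_b: float | None   # parameter count in billions, None if undisclosed


def apply_filters(models, max_params_b=None, min_confidence_pct=None):
    """Keep models under a parameter budget and above a confidence floor."""
    kept = []
    for m in models:
        if max_params_b is not None and (m.params_b is None or m.params_b > max_params_b):
            continue  # unknown or over-budget parameter counts are dropped
        if min_confidence_pct is not None and m.confidence_pct < min_confidence_pct:
            continue
        kept.append(m)
    return kept


models = [
    RankedModel(1, "gpt-4o", 26.6, 46.0, None),               # params undisclosed
    RankedModel(2, "qwen-2.5-72b-instruct", 25.1, 35.2, 72.0),
]

# e.g. only models at or under 75B params with at least 30% confidence
print(apply_filters(models, max_params_b=75, min_confidence_pct=30))
```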

Compare Models

Model A leads by +1.5%


Model A

gpt-4o

external/openai/gpt-4o

Score: 26.6%

Rank #1

Confidence: 46.0% · 14 evidence pts

DuckDB NSQL Leaderboard: all_execution_accuracy

Value 76.9% · Conf 100.0% · Weight 8.1%

duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)

JSONSchemaBench Leaderboard: medium_schema_compliance_pct

Value 100.0% · Conf 100.0% · Weight 5.0%

jsonschemabench_leaderboard.medium_schema_compliance_pct (Mar 12, 2026)

JSONSchemaBench Leaderboard: hard_schema_compliance_pct

Value 100.0% · Conf 100.0% · Weight 3.4%

jsonschemabench_leaderboard.hard_schema_compliance_pct (Mar 12, 2026)

DuckDB NSQL Leaderboard: hard_execution_accuracy

Value 50.0% · Conf 100.0% · Weight 2.2%

duckdb_nsql_leaderboard.hard_execution_accuracy (Mar 12, 2026)

Model B

qwen-2.5-72b-instruct

external/qwen/qwen-2-5-72b-instruct

Score: 25.1%

Rank #2

Confidence: 35.2% · 11 evidence pts

DuckDB NSQL Leaderboard: all_execution_accuracy

Value 82.7% · Conf 100.0% · Weight 8.8%

duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)

JSONSchemaBench Leaderboard: medium_schema_compliance_pct

Value 90.1% · Conf 100.0% · Weight 4.5%

jsonschemabench_leaderboard.medium_schema_compliance_pct (Mar 12, 2026)

JSONSchemaBench Leaderboard: hard_schema_compliance_pct

Value 74.4% · Conf 100.0% · Weight 2.5%

jsonschemabench_leaderboard.hard_schema_compliance_pct (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 76.1% · Conf 100.0% · Weight 1.2%

galileo_agent_v2.avg_ac (Mar 12, 2026)
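The per-evidence Value, Conf, and Weight fields above suggest a confidence-weighted aggregation, but the actual scoring formula is not published on this page, and the four evidence points shown per model are only the top slice of the 14 (Model A) and 11 (Model B) points counted. The sketch below is one plausible scheme under those assumptions; it does not reproduce the 26.6% score:

```python
# Assumption, not BasedAGI's published formula: this only shows how per-evidence
# Value / Conf / Weight fields *could* combine into a single use-case score.
# Only the top 4 of gpt-4o's 14 evidence points appear in this excerpt, so the
# real total cannot be reconstructed here.
from dataclasses import dataclass


@dataclass
class Evidence:
    metric: str
    value: float   # benchmark result, 0..1
    conf: float    # confidence in the measurement, 0..1
    weight: float  # relevance weight for this use case, 0..1


def weighted_score(evidence: list[Evidence]) -> float:
    """Confidence-weighted average of benchmark values (one plausible scheme)."""
    total = sum(e.value * e.conf * e.weight for e in evidence)
    norm = sum(e.conf * e.weight for e in evidence)
    return total / norm if norm else 0.0


gpt_4o_top_evidence = [
    Evidence("duckdb_nsql_leaderboard.all_execution_accuracy", 0.769, 1.0, 0.081),
    Evidence("jsonschemabench_leaderboard.medium_schema_compliance_pct", 1.0, 1.0, 0.050),
    Evidence("jsonschemabench_leaderboard.hard_schema_compliance_pct", 1.0, 1.0, 0.034),
    Evidence("duckdb_nsql_leaderboard.hard_execution_accuracy", 0.50, 1.0, 0.022),
]

print(f"{weighted_score(gpt_4o_top_evidence):.1%}")  # partial evidence only
```

Normalizing by the summed weights keeps partial evidence comparable across models with different benchmark coverage, which is one way a model with fewer but highly relevant results could still rank well.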

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 41
Sources: 8
Quality: Insufficient

DuckDB NSQL Leaderboard (duckdb_nsql_leaderboard): 23 rows · 4.0% avg lift
JSONSchemaBench Leaderboard (jsonschemabench_leaderboard): 12 rows · 2.4% avg lift
EQ-Bench Leaderboard (eq_bench): 10 rows · 0.4% avg lift
Vals Legal Bench (vals_legal_bench): 10 rows · 0.4% avg lift

Missing Strong Models

anthropic/claude-sonnet-4.6 (external/anthropic/claude-sonnet-4-6) · Rank #4 · 21.1% · Thin evidence after weighting
gpt-5-mini-2025-08-07 (external/openai/gpt-5-mini-2025-08-07) · Rank #7 · 19.6% · Thin evidence after weighting
google/gemini-3.1-pro-preview (external/google/gemini-3-1-pro-preview) · Rank #8 · 19.3% · Thin evidence after weighting
gpt-5-2025-08-07 (external/openai/gpt-5-2025-08-07) · Rank #9 · 19.2% · Thin evidence after weighting

Taxonomy Details

Core Tasks

task.metric_definition_clarification · task.decision_recommendation

Required Modes

none

Domains

domain.data_analytics_bi

Related Use Cases