BasedAGI

data_analytics

SQL debugging

Diagnose and fix SQL queries for correctness and performance.

#1 Recommendation

gpt-4o-20241120

Strong on DuckDB NSQL Leaderboard all_execution_accuracy (96%) and DuckDB NSQL Leaderboard hard_execution_accuracy (75%)

external/openai/gpt-4o-20241120

Score: 24.5%

Confidence: 44.7%

Limited benchmark evidence for this use case.

30 ranked models with average evidence of 9.1 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 30

Evidence Quality: 82%

Scoring: Benchmark-backed

Top Signal: DuckDB NSQL Leaderboard: all_execution_accuracy

All Ranked Models

Rank  Model                              Score
#1    gpt-4o-20241120                    24.5%
      Strong on DuckDB NSQL Leaderboard all_execution_accuracy (96%) and DuckDB NSQL Leaderboard hard_execution_accuracy (75%)
#3    gpt-4o                             20.3%
      Strong on DuckDB NSQL Leaderboard all_execution_accuracy (77%) and JSONSchemaBench Leaderboard medium_schema_compliance_pct (100%)
#4    deepseek/deepseek-r1               19.6%
#5    qwen-2.5-72b-instruct              18.7%
#11   openai/gpt-4o-mini-2024-07-18      14.8%
#15   gpt-4o-2024-08-06                  13.2%
#20   google/gemini-2.0-flash-001        11.9%
#23   Llama-3.3-70B-Instruct             11.3%
#24   Qwen3-30B-A3B                      11.1%
#26   Qwen2.5-Coder-7B                   11.0%
#33   gemma-2-27b-it                     10.1%
#35   phi-4                              9.8%
#37   Phi-3-medium-128k-instruct         9.5%
#38   Qwen3-32B                          9.3%
#41   gpt-4.1-20250414                   9.1%
#42   QwQ-32B-Preview                    9.0%
#44   Meta-Llama-3.1-8B                  8.5%
#47   gemini-3-pro-preview               7.8%
#48   deepseek-v3                        7.7%
#50   Llama-3.1-70B-Instruct             7.7%
#54   gemini-2.5-pro                     7.4%
#55   Grok-4-0709                        7.4%
#56   Phi-3-mini-128k-instruct           7.4%
#58   claude-sonnet-4-20250514           7.1%
#68   Meta-Llama-3-8B-Instruct           5.7%
#69   Qwen2.5-Coder-1.5B-Instruct        5.6%
#70   DeepSeek-Coder-V2-Lite-Instruct    5.3%
#75   minimax/minimax-m2.1               4.3%
#77   gemma-2                            4.0%
#82   starcoder2-15b                     1.7%

Compare Models

Model A leads by +4.2%


Model A

gpt-4o-20241120

external/openai/gpt-4o-20241120

24.5%

Rank #1

Confidence 44.7% · 15 evidence pts

DuckDB NSQL Leaderboard: all_execution_accuracy

Value 96.2% · Conf 100.0% · Weight 7.6%

duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)

DuckDB NSQL Leaderboard: hard_execution_accuracy

Value 75.0% · Conf 100.0% · Weight 4.3%

duckdb_nsql_leaderboard.hard_execution_accuracy (Mar 12, 2026)

BIRD-CRITIC: success_rate_open_pct

Value 55.6% · Conf 100.0% · Weight 2.5%

bird_critic.success_rate_open_pct (Mar 12, 2026)

MMLongBench-Doc Leaderboard: acc_score_pct

Value 62.7% · Conf 100.0% · Weight 1.1%

mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)

Model B

gpt-4o

external/openai/gpt-4o

20.3%

Rank #3

Confidence 41.9% · 14 evidence pts

DuckDB NSQL Leaderboard: all_execution_accuracy

Value 76.9% · Conf 100.0% · Weight 6.0%

duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)

JSONSchemaBench Leaderboard: medium_schema_compliance_pct

Value 100.0% · Conf 100.0% · Weight 2.9%

jsonschemabench_leaderboard.medium_schema_compliance_pct (Mar 12, 2026)

DuckDB NSQL Leaderboard: hard_execution_accuracy

Value 50.0% · Conf 100.0% · Weight 2.9%

duckdb_nsql_leaderboard.hard_execution_accuracy (Mar 12, 2026)

JSONSchemaBench Leaderboard: hard_schema_compliance_pct

Value 100.0% · Conf 100.0% · Weight 2.1%

jsonschemabench_leaderboard.hard_schema_compliance_pct (Mar 12, 2026)
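
The comparison figures above can be checked with a little arithmetic. The sketch below recomputes the "+4.2%" lead from the two scores and sums value × weight over the signals listed for each model. The page does not state how scores are aggregated, so the value × weight product is an assumption, and the helper names are hypothetical. Only four of each model's evidence points (15 for Model A, 14 for Model B) are shown, so the partial sums are not expected to reproduce the 24.5% and 20.3% scores.

```python
# Figures copied from the Model A / Model B panels above.
# Helper names are hypothetical; value * weight is an ASSUMED contribution
# formula, since the page does not document how the score is aggregated.

MODEL_A = {
    "score": 24.5,
    "signals": [  # (signal id, value %, weight %)
        ("duckdb_nsql_leaderboard.all_execution_accuracy", 96.2, 7.6),
        ("duckdb_nsql_leaderboard.hard_execution_accuracy", 75.0, 4.3),
        ("bird_critic.success_rate_open_pct", 55.6, 2.5),
        ("mmlongbench_doc_leaderboard.acc_score_pct", 62.7, 1.1),
    ],
}

MODEL_B = {
    "score": 20.3,
    "signals": [
        ("duckdb_nsql_leaderboard.all_execution_accuracy", 76.9, 6.0),
        ("jsonschemabench_leaderboard.medium_schema_compliance_pct", 100.0, 2.9),
        ("duckdb_nsql_leaderboard.hard_execution_accuracy", 50.0, 2.9),
        ("jsonschemabench_leaderboard.hard_schema_compliance_pct", 100.0, 2.1),
    ],
}


def lead(score_a, score_b):
    """Difference between two scores in percentage points, to one decimal."""
    return round(score_a - score_b, 1)


def partial_weighted_sum(signals):
    """Sum of value * weight over the listed signals (assumed formula)."""
    return sum(value * weight / 100.0 for _, value, weight in signals)


if __name__ == "__main__":
    print("Model A leads by", lead(MODEL_A["score"], MODEL_B["score"]))
    print("Model A, listed signals only:", round(partial_weighted_sum(MODEL_A["signals"]), 2))
    print("Model B, listed signals only:", round(partial_weighted_sum(MODEL_B["signals"]), 2))
```

Under this assumption the four listed signals account for roughly half of each score, which fits with the remaining evidence points carrying the rest.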

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 30

Sources: 8

Quality: Insufficient

Source                    ID                        Rows   Avg lift
DuckDB NSQL Leaderboard   duckdb_nsql_leaderboard   23     3.4%
Vals Legal Bench          vals_legal_bench          10     0.3%
Vals Tax Eval v2          vals_tax_eval_v2          9      0.3%
Vals CorpFin v2           vals_corp_fin_v2          9      0.2%

Missing Strong Models

Model                            ID                                       Rank   Score   Reason
anthropic/claude-sonnet-4.6      external/anthropic/claude-sonnet-4-6     #4     21.1%   Thin evidence after weighting
gpt-5-mini-2025-08-07            external/openai/gpt-5-mini-2025-08-07    #7     19.6%   Thin evidence after weighting
google/gemini-3.1-pro-preview    external/google/gemini-3-1-pro-preview   #8     19.3%   Thin evidence after weighting
gpt-5-2025-08-07                 external/openai/gpt-5-2025-08-07         #9     19.2%   Thin evidence after weighting

Taxonomy Details

Core Tasks: task.sql_debugging

Required Modes: none

Domains: domain.data_analytics_bi
