data_analytics

gpt-4o-20241120 vs gpt-4o

For SQL debugging

Model A wins by +4.2%

Model A

Winner

gpt-4o-20241120

external/openai/gpt-4o-20241120

24.5%

Rank #1

Confidence

44.7%

Evidence

15 pts

DuckDB NSQL Leaderboard: all_execution_accuracy

Value 96.2% · Conf 100.0% · Weight 7.6%

duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)

DuckDB NSQL Leaderboard: hard_execution_accuracy

Value 75.0% · Conf 100.0% · Weight 4.3%

duckdb_nsql_leaderboard.hard_execution_accuracy (Mar 12, 2026)

BIRD-CRITIC: success_rate_open_pct

Value 55.6% · Conf 100.0% · Weight 2.5%

bird_critic.success_rate_open_pct (Mar 12, 2026)

MMLongBench-Doc Leaderboard: acc_score_pct

Value 62.7% · Conf 100.0% · Weight 1.1%

mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)

Spider2.0 Snow Text-to-SQL: snow_text_to_sql_score_pct

Value 13.5% · Conf 100.0% · Weight 0.9%

spider2_snow_text_to_sql.snow_text_to_sql_score_pct (Mar 12, 2026)

Model B

gpt-4o

external/openai/gpt-4o

20.3%

Rank #3

Confidence

41.9%

Evidence

14 pts

DuckDB NSQL Leaderboard: all_execution_accuracy

Value 76.9% · Conf 100.0% · Weight 6.0%

duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)

JSONSchemaBench Leaderboard: medium_schema_compliance_pct

Value 100.0% · Conf 100.0% · Weight 2.9%

jsonschemabench_leaderboard.medium_schema_compliance_pct (Mar 12, 2026)

DuckDB NSQL Leaderboard: hard_execution_accuracy

Value 50.0% · Conf 100.0% · Weight 2.9%

duckdb_nsql_leaderboard.hard_execution_accuracy (Mar 12, 2026)

JSONSchemaBench Leaderboard: hard_schema_compliance_pct

Value 100.0% · Conf 100.0% · Weight 2.1%

jsonschemabench_leaderboard.hard_schema_compliance_pct (Mar 12, 2026)

MEGA-Bench: overall_score

Value 92.8% · Conf 100.0% · Weight 0.5%

mega_bench.overall_score (Mar 12, 2026)
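
The headline margin and the per-benchmark gaps can be checked from the figures listed above. The sketch below is only an illustration: it assumes the "+4.2%" margin is the plain difference between the two overall scores (24.5% and 20.3%), which matches the numbers shown but is not confirmed by the page. The variable names are hypothetical, and the way the page combines values and weights into the overall score is not reproduced here.

```python
# Overall scores as displayed on the page, in percent.
score_a = 24.5  # gpt-4o-20241120
score_b = 20.3  # gpt-4o

# Assumption: the headline margin is a percentage-point difference.
margin_points = round(score_a - score_b, 1)

# For contrast, the same gap expressed relative to Model B's score.
relative_gain = round((score_a - score_b) / score_b * 100, 1)

# Benchmarks listed for both models: (Model A value, Model B value).
shared = {
    "all_execution_accuracy": (96.2, 76.9),
    "hard_execution_accuracy": (75.0, 50.0),
}
gaps = {name: round(a - b, 1) for name, (a, b) in shared.items()}

print(f"margin: +{margin_points} points")   # margin: +4.2 points
print(f"relative: +{relative_gain}%")       # relative: +20.7%
for name, gap in gaps.items():
    print(f"{name}: +{gap} points")
```

Only the two DuckDB NSQL metrics appear in both evidence lists, so they are the only like-for-like comparison available from the items shown here.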
