| #1 | gpt-4o-20241120: strong on DuckDB NSQL Leaderboard all_execution_accuracy (96%) and DuckDB NSQL Leaderboard hard_execution_accuracy (75%) | | 44.7% | 15 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #3 | gpt-4o: strong on DuckDB NSQL Leaderboard all_execution_accuracy (77%) and JSONSchemaBench Leaderboard medium_schema_compliance_pct (100%) | | 41.9% | 14 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); JSONSchemaBench Leaderboard medium_schema_compliance_pct (Mar 12, 2026) |
| #4 | deepseek/deepseek-r1 | | 37.4% | 17 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #5 | qwen-2.5-72b-instruct | | 28.7% | 11 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); JSONSchemaBench Leaderboard medium_schema_compliance_pct (Mar 12, 2026) |
| #11 | openai/gpt-4o-mini-2024-07-18 | | 24.9% | 12 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #15 | gpt-4o-2024-08-06 | | 23.5% | 14 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #20 | google/gemini-2.0-flash-001 | | 22.4% | 12 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #23 | Llama-3.3-70B-Instruct | | 18.5% | 4 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #24 | Qwen3-30B-A3B | | 18.9% | 5 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #26 | Qwen2.5-Coder-7B | | 17.4% | 2 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #33 | gemma-2-27b-it | | 19.4% | 6 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #35 | phi-4 | | 19.5% | 6 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #37 | Phi-3-medium-128k-instruct | | 17.7% | 3 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #38 | Qwen3-32B | | 18.4% | 4 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #41 | gpt-4.1-20250414 | | 12.5% | 18 | MMLongBench-Doc Leaderboard acc_score_pct (Mar 16, 2026); Galileo Agent Leaderboard v2 Avg AC (Mar 16, 2026) |
| #42 | QwQ-32B-Preview | | 17.4% | 2 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #44 | Meta-Llama-3.1-8B | | 17.4% | 2 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #47 | gemini-3-pro-preview | | 10.0% | 21 | FACTS Benchmark Suite average_score_pct (Mar 16, 2026); Vals Mortgage Tax overall_accuracy_pct (Mar 16, 2026) |
| #48 | deepseek-v3 | | 24.5% | 8 | BIRD-CRITIC success_rate_open_pct (Mar 16, 2026); Spider2.0 Lite Text-to-SQL lite_text_to_sql_score_pct (Mar 16, 2026) |
| #53 | gemini-2.5-pro | | 11.9% | 22 | Galileo Agent Leaderboard v2 Avg AC (Mar 16, 2026); Galileo Agent Leaderboard v2 Avg TSQ (Mar 16, 2026) |
| #54 | Grok-4-0709 | | 10.7% | 18 | Galileo Agent Leaderboard v2 Avg TSQ (Mar 16, 2026); Galileo Agent Leaderboard v2 Avg AC (Mar 16, 2026) |
| #55 | Phi-3-mini-128k-instruct | | 17.7% | 3 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #57 | claude-sonnet-4-20250514 | | 10.2% | 17 | Galileo Agent Leaderboard v2 Avg AC (Mar 16, 2026); Galileo Agent Leaderboard v2 Avg TSQ (Mar 16, 2026) |
| #59 | Llama-3.1-70B-Instruct | | 18.9% | 5 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #68 | Meta-Llama-3-8B-Instruct | | 21.5% | 6 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); LLM Trustworthy Leaderboard fairness (Mar 16, 2026) |
| #69 | Qwen2.5-Coder-1.5B-Instruct | | 17.9% | 3 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #70 | DeepSeek-Coder-V2-Lite-Instruct | | 17.4% | 2 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #75 | minimax/minimax-m2.1 | | 13.5% | 14 | Vals LiveCodeBench overall_accuracy_pct (Mar 16, 2026); Vals MedQA overall_accuracy_pct (Mar 16, 2026) |
| #77 | gemma-2 | | 17.4% | 2 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |
| #82 | starcoder2-15b | | 17.4% | 2 | DuckDB NSQL Leaderboard all_execution_accuracy (Mar 16, 2026); DuckDB NSQL Leaderboard hard_execution_accuracy (Mar 16, 2026) |