Best LLM for Function Calling & Tool Use 2026
Ranked by BFCL (Berkeley Function-Calling Leaderboard) composite scores.
Provider and weight class do not decide this ranking; BFCL scores do. Start from the current winner and its score gap above, then compare price among the models that fall within your acceptable accuracy range.
For structured output pipelines (RAG, database queries, form filling), explicitly enable JSON mode or tool-use APIs. Models that are great at free-text reasoning may not honor schemas without this constraint.
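Even with JSON mode or a tool-use API enabled, it can help to check the returned arguments before acting on them. The following is a minimal sketch using only the Python standard library. The schema format, the `GET_WEATHER_SCHEMA` tool, and the `validate_tool_arguments` helper are hypothetical names introduced for illustration, not part of any provider's API.

```python
import json

# Hypothetical tool schema (an assumption for this sketch):
# parameter name -> (expected Python type, required?)
GET_WEATHER_SCHEMA = {
    "city": (str, True),
    "units": (str, False),
}


def validate_tool_arguments(raw: str, schema: dict) -> tuple[bool, list[str]]:
    """Return (ok, errors) for a JSON string of tool-call arguments."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Free-text answers ("Sure! The city is ...") end up here.
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return False, ["arguments must be a JSON object"]

    errors = []
    # Check required parameters and types in schema order.
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                errors.append(f"missing required parameter: {name}")
            continue
        if not isinstance(args[name], expected_type):
            errors.append(f"wrong type for {name}")
    # Flag parameters the schema does not define.
    for name in args:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")
    return not errors, errors


ok, errors = validate_tool_arguments('{"city": "Paris"}', GET_WEATHER_SCHEMA)
print(ok, errors)
```

A check like this does not replace schema-constrained output. It only catches the cases where a model ignores the schema, so that the pipeline can retry or fail cleanly.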
In multi-step agent chains, per-call latency compounds. A model 10% less accurate but 2× faster may yield better end-to-end agent performance. Benchmark your full chain, not just isolated calls.
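To see how latency and accuracy both compound, here is a rough back-of-the-envelope sketch. It assumes the calls run sequentially, that errors are independent, and that any single failed call fails the whole chain. The accuracy and latency figures are made up for illustration and are not BFCL results.

```python
def chain_estimate(steps: int, per_call_accuracy: float, per_call_latency_s: float):
    """Estimate end-to-end success probability and latency for a sequential chain.

    Assumes independent errors and no retries (both simplifications).
    """
    success_probability = per_call_accuracy ** steps
    total_latency_s = per_call_latency_s * steps
    return success_probability, total_latency_s


# Hypothetical models: B is 10% less accurate per call but 2x faster.
model_a = chain_estimate(steps=10, per_call_accuracy=0.95, per_call_latency_s=2.0)
model_b = chain_estimate(steps=10, per_call_accuracy=0.855, per_call_latency_s=1.0)

for name, (success, latency) in {"A": model_a, "B": model_b}.items():
    print(f"model {name}: success={success:.3f}, latency={latency:.1f}s")
```

Under these strict assumptions the accuracy gap compounds as quickly as the latency gap, so the faster model only wins if failed steps can be retried cheaply or errors do not propagate. That is why measuring the full chain matters more than comparing isolated calls.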
Agentic workflows often make 5–20 LLM calls per user request. At scale, a model costing $5 per million tokens may be preferable to one at $15 per million tokens even with slightly lower accuracy, unless that accuracy gap causes downstream failures.
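The trade-off can be put into numbers with a small sketch. It assumes a single blended price per million tokens (real pricing usually separates input and output tokens), a fixed token count per call, and that a failed request is retried from scratch with independent outcomes. The call count, token count, and success rates below are hypothetical.

```python
def cost_per_request(calls: int, tokens_per_call: int, price_per_million_usd: float) -> float:
    """Cost in USD of one user request that makes `calls` LLM calls."""
    return calls * tokens_per_call * price_per_million_usd / 1_000_000


def cost_per_success(request_cost_usd: float, success_rate: float) -> float:
    """Expected cost per successful request when failures are retried from scratch.

    With independent attempts, the expected number of attempts is 1 / success_rate.
    """
    return request_cost_usd / success_rate


# Hypothetical workload: 10 calls per request, 2,000 tokens per call.
cheap = cost_per_request(calls=10, tokens_per_call=2_000, price_per_million_usd=5.0)
pricey = cost_per_request(calls=10, tokens_per_call=2_000, price_per_million_usd=15.0)

# Hypothetical end-to-end success rates for the two models.
print(f"cheap model:  {cost_per_success(cheap, success_rate=0.80):.4f} USD per success")
print(f"pricey model: {cost_per_success(pricey, success_rate=0.95):.4f} USD per success")
```

In this sketch the cheaper model stays ahead as long as its success rate is more than one third of the pricier model's, because the price ratio is 3×. The picture changes if a failure has costs beyond the retry itself, such as a wrong database write.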
Which LLM is best for function calling?
The current function-calling ranking is generated from BFCL benchmark results when sufficient matching data is available.
What is BFCL?
BFCL (Berkeley Function-Calling Leaderboard) is a standardized benchmark for evaluating LLM function-calling accuracy across multiple categories including simple, multiple, and parallel function calls, as well as SQL and live function-calling scenarios.
Does model size affect function calling accuracy?
The ranking is based on BFCL accuracy, not model size. Compare the currently measured results rather than inferring function-calling quality from parameter count.