basedagi.org
▸ guide

Best LLM for Function Calling & Tool Use 2026

Ranked by BFCL (Berkeley Function-Calling Leaderboard) composite scores.

▸ what matters
BFCL accuracy: Covers simple, multiple, and parallel call scenarios. The highest-signal metric for production tool use.
Parallel function calls: Critical for agents. Can the model call multiple tools in one turn?
JSON schema adherence: Whether strict-mode output conforms to the declared schema. Matters for typed APIs and validation pipelines.
Latency: For real-time agents, time to first token (TTFT) and output speed matter as much as accuracy.
Context window: Long tool schemas and conversation history require at least 32K tokens of context.
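
To make the "parallel function calls" criterion concrete, here is a minimal sketch of what an agent has to do when a model returns several tool calls in one turn: run them concurrently and return the results in the original order. The tool names, the call format (`id`, `name`, `arguments` as a JSON string), and the placeholder return values are all assumptions for illustration, not any provider's actual API.

```python
import asyncio
import json

# Hypothetical tools; real ones would call external services.
async def get_weather(city: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return {"city": city, "forecast": "placeholder"}

async def get_time(timezone: str) -> dict:
    await asyncio.sleep(0.01)
    return {"timezone": timezone, "time": "placeholder"}

TOOLS = {"get_weather": get_weather, "get_time": get_time}

async def dispatch(tool_calls: list[dict]) -> list[dict]:
    """Run every tool call from one model turn concurrently, preserving order."""

    async def run_one(call: dict) -> dict:
        fn = TOOLS.get(call["name"])
        if fn is None:
            return {"id": call["id"], "error": f"unknown tool: {call['name']}"}
        try:
            args = json.loads(call["arguments"])
            return {"id": call["id"], "result": await fn(**args)}
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or wrong argument names become an error result
            # instead of crashing the whole turn.
            return {"id": call["id"], "error": str(exc)}

    return await asyncio.gather(*(run_one(c) for c in tool_calls))
```

A model that cannot emit parallel calls forces this loop to run one tool per turn, which adds a full model round trip for each tool.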
▸ benchmarks used
BFCL Overall: Berkeley aggregate function-calling accuracy
BFCL Non-Live: Static tool definitions and invocation accuracy
BFCL Live: Live function-calling scenarios
BFCL Multi-Turn: Tool use across multiple conversation turns
▸ analysis
Read the measured table

Provider and weight class do not decide this ranking; BFCL rows do. Start from the current leader and the score gap shown above, then compare price among the models that fall within your acceptable accuracy range.
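
The selection rule described here (filter by accuracy first, then sort by price) can be sketched as follows. The `Row` type, the `max_gap` parameter, and the rows in `EXAMPLE_ROWS` are hypothetical; the scores and prices are placeholders, not measured BFCL results.

```python
from dataclasses import dataclass

@dataclass
class Row:
    model: str
    bfcl_overall: float  # BFCL Overall score, in percent
    price_per_m: float   # USD per million tokens

def shortlist(rows: list[Row], max_gap: float) -> list[Row]:
    """Keep models within max_gap points of the leader, cheapest first."""
    best = max(r.bfcl_overall for r in rows)
    in_range = [r for r in rows if best - r.bfcl_overall <= max_gap]
    # Ties on price go to the more accurate model.
    return sorted(in_range, key=lambda r: (r.price_per_m, -r.bfcl_overall))

# Placeholder rows for illustration only (not real leaderboard data).
EXAMPLE_ROWS = [
    Row("model-a", 70.0, 15.0),
    Row("model-b", 68.5, 5.0),
    Row("model-c", 60.0, 1.0),
]
```

With `max_gap=2.0`, `model-c` drops out on accuracy and the cheaper `model-b` ranks ahead of the leader `model-a`.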

When to use strict JSON mode

For structured-output pipelines (RAG, database queries, form filling), explicitly enable JSON mode or use the provider's tool-use API. Models that are great at free-text reasoning may not honor schemas without this constraint.
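
Even with strict mode enabled, a validation pipeline usually checks arguments before executing a tool. The sketch below covers only a small subset of JSON Schema (`required`, flat `properties` with a `type`, and `additionalProperties: false`); it is an illustration of the idea, and a production pipeline would more likely use a full JSON Schema validator. The function name `check_arguments` is hypothetical.

```python
import json

# Maps a subset of JSON Schema type names to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def check_arguments(raw: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments passed."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["top-level value must be an object"]

    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"unexpected field: {name}")
            continue
        expected = TYPE_MAP.get(props[name].get("type"))
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if not isinstance(value, expected) or is_stray_bool:
            problems.append(f"wrong type for {name}: expected {props[name]['type']}")
    return problems
```

Returning a list of problems rather than raising lets the agent feed every error back to the model in a single retry prompt.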

Agent frameworks and overhead

In multi-step agent chains, per-call latency compounds. A model 10% less accurate but 2× faster may yield better end-to-end agent performance. Benchmark your full chain, not just isolated calls.
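
A rough way to reason about this trade-off is sketched below. It rests on simplifying assumptions that are not from the benchmark: every step succeeds independently with the same probability, a failure is only detected at the end, and a failed chain is retried from the start. The function names and the figures in the usage note are illustrative.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    return p_step ** n_steps

def expected_seconds_to_success(
    p_step: float, n_steps: int, seconds_per_call: float
) -> float:
    """Expected wall time until one full chain succeeds.

    Each attempt costs n_steps * seconds_per_call, and the expected number
    of attempts is 1 / chain_success (geometric distribution).
    """
    return n_steps * seconds_per_call / chain_success(p_step, n_steps)
```

Taking a model with 0.95 per-step accuracy at 2 s per call against one with 0.855 (10% lower) at 1 s per call, the faster model comes out ahead on a 5-step chain, but the more accurate one wins on a 10-step chain. This is why the full chain, not the isolated call, is the unit to benchmark.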

Cost at scale

Agentic workflows often make 5–20 LLM calls per user request. At scale, a model costing $5 per million tokens may be preferable to one at $15 per million even with slightly lower accuracy, unless that accuracy gap causes downstream failures.
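
The arithmetic behind this comparison can be sketched in a few lines. The tokens-per-call figure and the success rates in the usage note are assumptions for illustration; the $5 and $15 prices are the ones quoted above.

```python
def cost_per_request(calls: int, tokens_per_call: int, usd_per_million: float) -> float:
    """USD cost of one user request that fans out into several LLM calls."""
    return calls * tokens_per_call * usd_per_million / 1_000_000

def cost_per_success(
    calls: int, tokens_per_call: int, usd_per_million: float, success_rate: float
) -> float:
    """Cost per successful request, assuming a failed request is retried in full."""
    return cost_per_request(calls, tokens_per_call, usd_per_million) / success_rate
```

For example, at 10 calls of 2,000 tokens each, a request costs $0.10 at $5 per million and $0.30 at $15 per million. Under this retry model the cheaper model stays ahead until its success rate falls below one third of the expensive model's, which is the point where the accuracy gap starts to dominate the price gap.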

▸ frequently asked

Which LLM is best for function calling?

The current function-calling ranking is generated from BFCL benchmark results when sufficient matching data is available.

What is BFCL?

BFCL (Berkeley Function-Calling Leaderboard) is a standardized benchmark for evaluating LLM function-calling accuracy across multiple categories including simple, multiple, and parallel function calls, as well as SQL and live function-calling scenarios.

Does model size affect function calling accuracy?

Model size is not the ranking rule. BFCL accuracy is. Compare current measured rows rather than inferring function-calling quality from parameter count.