▸ guide

Best LLM for Function Calling & Tool Use 2026

Ranked by BFCL (Berkeley Function-Calling Leaderboard) composite scores.


▸ SOTA for function calling

Leader: MoonshotAI: Kimi K2 Thinking · 64.3

Closest: OpenAI: GPT-5.2 · 54.2 (−10.1 pts)

score date unavailable

▸ current rankings · BFCL composite

#	Model	BFCL	Overall	Price/M tokens
1	MoonshotAI: Kimi K2 Thinking	64.3	76.4	$0.60
2	OpenAI: GPT-5.2	54.2	80.8	$1.75
3	Google: Gemini 2.5 Flash	53.4	74.6	$0.30
4	OpenAI: o4 Mini	50.6	69.5	$1.10
5	Meta: Llama 4 Maverick	48.0	76.6	$0.20
6	OpenAI: GPT-4.1	47.3	77.7	$2.00
7	DeepSeek: DeepSeek V3.2	47.0	80.5	$0.21
8	Meta: Llama 3.3 70B Instruct	46.7	59.8	$0.10
9	Anthropic: Claude Opus 4.5	46.1	83.7	$5.00
10	Mistral: Mistral Small 4	45.9	60.6	$0.15


▸ what matters

BFCL accuracy: Simple, multiple, and parallel call scenarios. Highest signal for production tool use.

Parallel function calls: Critical for agents. Can the model call multiple tools in one turn?

JSON schema adherence: Strict-mode output. Matters for typed APIs and validation pipelines.

Latency: For real-time agents, time to first token (TTFT) and output speed matter as much as accuracy.

Context window: Long tool schemas and conversation history require at least 32K tokens of context.

▸ benchmarks used

BFCL Overall: Berkeley aggregate function-calling accuracy

BFCL Non-Live: Static tool definitions and invocation accuracy

BFCL Live: Live function-calling scenarios

BFCL Multi-Turn: Tool use across multiple conversation turns

▸ analysis

Read the measured table

Provider and weight class do not decide this ranking; the measured BFCL scores do. Start from the current leader and the gap to the runner-up shown above, decide how far below the leader you can afford to go, then compare price among the models that clear that bar.
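The selection step described above can be sketched in a few lines. The rows are copied from the rankings table; the 50.0 accuracy floor and the `shortlist` helper are illustrative assumptions, not part of BFCL or the leaderboard.

```python
# Rows copied from the rankings table:
# (model, BFCL composite, price in $ per million tokens).
ROWS = [
    ("MoonshotAI: Kimi K2 Thinking", 64.3, 0.60),
    ("OpenAI: GPT-5.2", 54.2, 1.75),
    ("Google: Gemini 2.5 Flash", 53.4, 0.30),
    ("OpenAI: o4 Mini", 50.6, 1.10),
    ("Meta: Llama 4 Maverick", 48.0, 0.20),
    ("OpenAI: GPT-4.1", 47.3, 2.00),
    ("DeepSeek: DeepSeek V3.2", 47.0, 0.21),
    ("Meta: Llama 3.3 70B Instruct", 46.7, 0.10),
    ("Anthropic: Claude Opus 4.5", 46.1, 5.00),
    ("Mistral: Mistral Small 4", 45.9, 0.15),
]


def shortlist(rows, min_bfcl):
    """Keep models at or above the accuracy floor, cheapest first."""
    eligible = [row for row in rows if row[1] >= min_bfcl]
    return sorted(eligible, key=lambda row: row[2])


# Hypothetical floor: only consider models scoring at least 50.0 on BFCL.
for name, bfcl, price in shortlist(ROWS, min_bfcl=50.0):
    print(f"{name}: BFCL {bfcl}, ${price:.2f}/M")
```

With this floor, four models remain and Gemini 2.5 Flash is the cheapest of them. A different floor gives a different shortlist, so pick it from your own tolerance for failed tool calls.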

When to use strict JSON mode

For structured output pipelines (RAG, database queries, form filling), explicitly enable JSON mode or tool-use APIs. Models that are great at free-text reasoning may not honor schemas without this constraint.
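Even with strict mode enabled, it may be worth checking tool-call arguments before acting on them. Below is a minimal sketch using only the standard library. The `WEATHER_SCHEMA` tool and the `validate_arguments` helper are hypothetical, and the check covers only a small subset of JSON Schema (required keys, primitive types, unexpected keys); a full validator library would cover more.

```python
import json

# Hypothetical tool schema, for illustration only.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "days": {"type": "integer"},
    },
    "required": ["city"],
    "additionalProperties": False,
}

# Mapping from JSON Schema primitive type names to Python types.
_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(raw: str, schema: dict) -> list:
    """Return a list of problems; an empty list means the arguments pass."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required key: {key}")
    for key, value in args.items():
        if key not in props:
            if schema.get("additionalProperties", True) is False:
                problems.append(f"unexpected key: {key}")
            continue
        type_name = props[key]["type"]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if is_stray_bool or not isinstance(value, _PY_TYPES[type_name]):
            problems.append(f"wrong type for {key}: expected {type_name}")
    return problems


print(validate_arguments('{"city": "Paris", "days": 3}', WEATHER_SCHEMA))
print(validate_arguments('{"days": "3"}', WEATHER_SCHEMA))
```

The first call returns an empty list. The second reports a missing `city` and a string where an integer was expected, which is the kind of drift a free-text model can produce when no schema constraint is enforced.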

Agent frameworks and overhead

In multi-step agent chains, per-call latency compounds. A model 10% less accurate but 2× faster may yield better end-to-end agent performance. Benchmark your full chain, not just isolated calls.
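A rough way to reason about this trade-off is to model a chain of sequential calls. The sketch below assumes every step succeeds independently with the same probability, that a failure is only detected at the end of the chain, and that a failed run is retried from scratch. The accuracy and latency figures are made up for illustration and are not measurements of any model in the table.

```python
def chain_outcome(step_accuracy, step_latency_s, steps):
    """End-to-end success probability and latency for one pass through the chain."""
    success = step_accuracy ** steps      # every step must succeed
    latency = step_latency_s * steps      # steps run one after another
    return success, latency


def expected_time_to_success(step_accuracy, step_latency_s, steps):
    """Mean wall-clock time when failed runs are retried from the start.

    The number of attempts is geometric with mean 1 / success, and each
    attempt costs the full chain latency.
    """
    success, latency = chain_outcome(step_accuracy, step_latency_s, steps)
    return latency / success


# Hypothetical models: one slower and more accurate, one 2x faster and less accurate.
slow = dict(step_accuracy=0.95, step_latency_s=2.0, steps=5)
fast = dict(step_accuracy=0.85, step_latency_s=1.0, steps=5)

for label, model in (("slow", slow), ("fast", fast)):
    success, latency = chain_outcome(**model)
    total = expected_time_to_success(**model)
    print(f"{label}: success {success:.3f}, one pass {latency:.1f}s, expected {total:.1f}s")
```

Under these assumptions the faster model completes only about 44% of runs on the first pass, against about 77% for the slower one, yet its expected time to a successful run is still lower (about 11.3 s versus 12.9 s). If failures cannot be detected and retried, the comparison flips, which is why the full chain needs its own benchmark.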

Cost at scale

Agentic workflows often make 5–20 LLM calls per user request, so per-token price is multiplied by every call. At scale, a model costing $5 per million tokens may be preferable to one at $15 per million even with slightly lower accuracy, unless that accuracy gap causes downstream failures.
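The arithmetic is simple enough to sketch. The example below uses the $5/M and $15/M prices and the upper end of the 5–20 call range from the paragraph above; the 2,000 tokens per call and the one million requests per month are assumptions chosen for illustration, and a single blended price is used even though real pricing usually separates input and output tokens.

```python
def cost_per_request(calls, tokens_per_call, price_per_million):
    """Dollar cost of one user request under a single blended token price."""
    return calls * tokens_per_call * price_per_million / 1_000_000


CALLS = 20                    # upper end of the 5-20 range
TOKENS_PER_CALL = 2_000       # assumed average, input plus output
REQUESTS_PER_MONTH = 1_000_000  # assumed traffic

for price in (5.00, 15.00):
    per_request = cost_per_request(CALLS, TOKENS_PER_CALL, price)
    monthly = per_request * REQUESTS_PER_MONTH
    print(f"${price:.2f}/M: ${per_request:.2f} per request, ${monthly:,.0f} per month")
```

With these inputs the cheaper model costs about $0.20 per request and the more expensive one about $0.60, a gap of roughly $400,000 per month at the assumed traffic. Whether that saving survives depends on how many retries or failed requests the lower accuracy causes.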

▸ frequently asked

Which LLM is best for function calling?

MoonshotAI: Kimi K2 Thinking leads the current function-calling composite at 64.3. OpenAI: GPT-5.2 trails by 10.1 points.

What is BFCL?

BFCL (Berkeley Function-Calling Leaderboard) is a standardized benchmark for evaluating LLM function-calling accuracy across multiple categories including simple, multiple, and parallel function calls, as well as SQL and live function-calling scenarios.

Does model size affect function calling accuracy?

Model size is not the ranking rule. BFCL accuracy is. Compare current measured rows rather than inferring function-calling quality from parameter count.
