benchmark evidence

BFCL Overall

Overall function-calling accuracy on the Berkeley Function Calling Leaderboard (BFCL).

winner on BFCL Overall: MoonshotAI: Kimi K2 Thinking (59.1)

direct benchmark result, not a broad vertical composite | source row dated 2026-05-15

scored on 2026-05-15 · stale source data (60d)

latest mapped results | top 20 (17 models matched)

#	Model	Provider	Score	Evidence	Tested
1	MoonshotAI: Kimi K2 Thinking	Moonshotai	59.1	model-only independent_benchmark	2026-05-15
2	DeepSeek: DeepSeek V3.2	Deepseek	54.1	model-only independent_benchmark	2026-05-15
3	Google: Gemini 2.5 Flash	Google	50.9	model-only independent_benchmark	2026-05-15
4	OpenAI: o4 Mini	Openai	50.3	model-only independent_benchmark	2026-05-15
5	OpenAI: o3	Openai	48.6	model-only independent_benchmark	2026-05-15
6	OpenAI: GPT-5.2	Openai	45.3	model-only independent_benchmark	2026-05-15
7	OpenAI: GPT-4.1	Openai	39.4	model-only independent_benchmark	2026-05-15
8	Meta: Llama 4 Maverick	Meta Llama	37.3	model-only independent_benchmark	2026-05-15
9	Anthropic: Claude Opus 4.5	Anthropic	33.5	model-only independent_benchmark	2026-05-15
10	Mistral: Mistral Small 4	Mistralai	32.4	model-only independent_benchmark	2026-05-15
11	Meta: Llama 3.3 70B Instruct	Meta Llama	31.9	model-only independent_benchmark	2026-05-15
12	Mistral Large	Mistralai	31.8	model-only independent_benchmark	2026-05-15
13	Google: Gemma 3 27B	Google	29.5	model-only independent_benchmark	2026-05-15
14	Meta: Llama 4 Scout	Meta Llama	28.1	model-only independent_benchmark	2026-05-15
15	Anthropic: Claude Haiku 4.5	Anthropic	25.3	model-only independent_benchmark	2026-05-15
16	Amazon: Nova Pro 1.0	Amazon	25.0	model-only independent_benchmark	2026-05-15
17	Anthropic: Claude Sonnet 4.5	Anthropic	24.9	model-only independent_benchmark	2026-05-15

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on BFCL Overall. Broad task pages require independent corroboration before naming a general winner.

source record

category: structured_output

metric: accuracy

matched models: 17

latest source date: 2026-05-15

direction: higher is better
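
The source record's direction field determines how scores are ordered. As a minimal sketch, assuming the results are held as simple (model, score) records, the ranking in the table above could be reproduced as follows. The `Result` type and `rank` function are hypothetical names introduced for illustration and are not part of any BFCL tooling; only three rows from the table are included.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    score: float


def rank(results: list[Result], higher_is_better: bool = True) -> list[Result]:
    """Return results sorted best-first according to the metric's direction."""
    return sorted(results, key=lambda r: r.score, reverse=higher_is_better)


# Three rows copied from the table above, deliberately out of order.
rows = [
    Result("OpenAI: o3", 48.6),
    Result("MoonshotAI: Kimi K2 Thinking", 59.1),
    Result("DeepSeek: DeepSeek V3.2", 54.1),
]

# direction: higher is better, so the largest score ranks first.
ranked = rank(rows, higher_is_better=True)
for position, result in enumerate(ranked, start=1):
    print(f"{position}\t{result.model}\t{result.score}")
```

For a metric where lower is better (an error rate, for example), passing `higher_is_better=False` would reverse the order.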