benchmark evidence

BFCL Non-Live

Function-calling accuracy on the non-live subset of BFCL (the Berkeley Function Calling Leaderboard).

winner on BFCL Non-Live

direct benchmark result, not a broad vertical composite | source row dated 2026-05-15

scored on 2026-05-15 · stale source data (60d)

latest mapped results | top 20 (17 models matched)

#	Model	Score	Evidence	Tested
1	Mistral: Mistral Small 4 Mistralai	89.7	model-only independent_benchmark	2026-05-15
2	Anthropic: Claude Opus 4.5 Anthropic	89.7	model-only independent_benchmark	2026-05-15
3	Meta: Llama 4 Scout Meta Llama	89.4	model-only independent_benchmark	2026-05-15
4	OpenAI: GPT-4.1 Openai	88.7	model-only independent_benchmark	2026-05-15
5	Meta: Llama 4 Maverick Meta Llama	88.7	model-only independent_benchmark	2026-05-15
6	Google: Gemini 2.5 Flash Google	88.1	model-only independent_benchmark	2026-05-15
7	Meta: Llama 3.3 70B Instruct Meta Llama	88.0	model-only independent_benchmark	2026-05-15
8	Google: Gemma 3 27B Google	87.2	model-only independent_benchmark	2026-05-15
9	Amazon: Nova Pro 1.0 Amazon	86.6	model-only independent_benchmark	2026-05-15
10	Mistral Large Mistralai	83.0	model-only independent_benchmark	2026-05-15
11	MoonshotAI: Kimi K2 Thinking Moonshotai	81.6	model-only independent_benchmark	2026-05-15
12	OpenAI: o4 Mini Openai	81.3	model-only independent_benchmark	2026-05-15
13	OpenAI: GPT-5.2 Openai	78.3	model-only independent_benchmark	2026-05-15
14	Anthropic: Claude Sonnet 4.5 Anthropic	59.8	model-only independent_benchmark	2026-05-15
15	Anthropic: Claude Haiku 4.5 Anthropic	55.4	model-only independent_benchmark	2026-05-15
16	OpenAI: o3 Openai	40.4	model-only independent_benchmark	2026-05-15
17	DeepSeek: DeepSeek V3.2 Deepseek	34.9	model-only independent_benchmark	2026-05-15
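
Two pairs of rows in the table share a score (89.7 at positions 1 and 2, 88.7 at positions 4 and 5), so the listed position alone does not separate them. The sketch below is one way to assign shared ranks when scores tie, assuming the "higher is better" direction from the source record. The names `Row`, `competition_ranks`, and `leaders` are hypothetical and are not part of the source site; the data is the first five rows of the table, copied as-is.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    model: str
    score: float


# First five rows of the table above, copied verbatim (model cell and score).
ROWS = [
    Row("Mistral: Mistral Small 4 Mistralai", 89.7),
    Row("Anthropic: Claude Opus 4.5 Anthropic", 89.7),
    Row("Meta: Llama 4 Scout Meta Llama", 89.4),
    Row("OpenAI: GPT-4.1 Openai", 88.7),
    Row("Meta: Llama 4 Maverick Meta Llama", 88.7),
]


def competition_ranks(rows: List[Row]) -> List[Tuple[int, Row]]:
    """Return (rank, row) pairs where equal scores share a rank."""
    # Higher is better, so sort descending. sorted() is stable, which keeps
    # tied rows in their original listed order.
    ordered = sorted(rows, key=lambda r: r.score, reverse=True)
    ranked: List[Tuple[int, Row]] = []
    for position, row in enumerate(ordered, start=1):
        if ranked and ranked[-1][1].score == row.score:
            rank = ranked[-1][0]  # tie: reuse the previous row's rank
        else:
            rank = position       # no tie: rank equals 1-based position
        ranked.append((rank, row))
    return ranked


def leaders(rows: List[Row]) -> List[str]:
    """Return every model that holds rank 1, which may be more than one."""
    return [row.model for rank, row in competition_ranks(rows) if rank == 1]


if __name__ == "__main__":
    for rank, row in competition_ranks(ROWS):
        print(rank, row.model, row.score)
    print("leaders:", leaders(ROWS))
```

Under this ranking scheme the five rows get ranks 1, 1, 3, 4, 4, and `leaders` returns both 89.7 models. Whether the source site breaks ties some other way is not stated on the page.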

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on BFCL Non-Live only. Broad task pages require independent corroboration before naming a general winner.

source record

category: structured_output

metric: accuracy

matched models: 17

latest source date: 2026-05-15

direction: higher is better