benchmark evidence

AI Language Proficiency Monitor Mandarin Chinese

Mean ARC, MGSM, and MMLU accuracy for Mandarin Chinese, from the AI Language Proficiency Monitor.


winner on AI Language Proficiency Monitor Mandarin Chinese

direct benchmark result, not a broad vertical composite | source row dated 2026-06-10

scored on 2026-06-10 · source data is stale (34 days old)

latest mapped results | top 20

#	Model	Score	Evidence	Tested
1	OpenAI: GPT-5.5	100.0	model-only independent_benchmark	2026-06-10
2	DeepSeek: DeepSeek V4 Flash	100.0	model-only independent_benchmark	2026-06-10
3	Anthropic: Claude Sonnet 4.6	100.0	model-only independent_benchmark	2026-06-10
4	Anthropic: Claude Opus 4.6	100.0	model-only independent_benchmark	2026-06-05
5	OpenAI: GPT-5.2	100.0	model-only independent_benchmark	2026-06-10
6	Anthropic: Claude Sonnet 4.5	100.0	model-only independent_benchmark	2026-06-10
7	OpenAI: GPT-5	100.0	model-only independent_benchmark	2026-06-10
8	Anthropic: Claude Sonnet 4	100.0	model-only independent_benchmark	2026-06-10
9	OpenAI: GPT-5.1	96.7	model-only independent_benchmark	2026-06-10
10	Google: Gemini 2.5 Pro	96.7	model-only independent_benchmark	2026-06-10
11	Anthropic: Claude Opus 4.7	96.7	model-only independent_benchmark	2026-06-10
12	OpenAI: GPT-5.4	96.7	model-only independent_benchmark	2026-06-10
13	OpenAI: GPT-4.1	96.7	model-only independent_benchmark	2026-06-10
14	Anthropic: Claude Opus 4.5	96.7	model-only independent_benchmark	2026-06-05
15	Anthropic: Claude Haiku 4.5	93.3	model-only independent_benchmark	2026-06-10
16	Meta: Llama 3.3 70B Instruct	93.3	model-only independent_benchmark	2026-06-10
17	Qwen: Qwen3 30B A3B	92.6	model-only independent_benchmark	2026-06-10
18	Meta: Llama 4 Maverick	90.0	model-only independent_benchmark	2026-06-10
19	Amazon: Nova Pro 1.0	90.0	model-only independent_benchmark	2026-06-10
20	Google: Gemma 3 27B	80.0	model-only independent_benchmark	2026-06-10

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on AI Language Proficiency Monitor Mandarin Chinese only. Broad task pages require independent corroboration before naming a general winner.
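For readers who want to see how a "mean ARC, MGSM, and MMLU accuracy" score could be formed, here is a minimal sketch. It assumes an unweighted mean of three per-task accuracies on a 0-100 scale, rounded to one decimal; the upstream source may weight or aggregate differently. The function name and the input values are hypothetical and are not taken from the source data.

```python
def mean_accuracy(arc: float, mgsm: float, mmlu: float) -> float:
    """Unweighted mean of three per-task accuracies (0-100), rounded to 1 decimal.

    Assumption: equal weighting across tasks. The upstream monitor may differ.
    """
    scores = (arc, mgsm, mmlu)
    for s in scores:
        if not 0.0 <= s <= 100.0:
            raise ValueError(f"accuracy out of range: {s}")
    return round(sum(scores) / len(scores), 1)


# Hypothetical per-task accuracies, for illustration only.
example = mean_accuracy(arc=100.0, mgsm=90.0, mmlu=100.0)
print(example)
```

Under this assumption a single weaker task pulls the combined score down, which is one reason to read the per-task scope before generalizing.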

source record

category: multilingual

metric: accuracy

matched models: 22

latest source date: 2026-06-10

direction: higher is better
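
The staleness figure shown near the top of the page (34d) is consistent with counting days from the latest source date to the day the page was generated. The sketch below illustrates that arithmetic. The check date of 2026-07-14 and the 30-day threshold are assumptions inferred for illustration; the page states neither.

```python
from datetime import date


def staleness_days(source_date: date, checked_on: date) -> int:
    """Whole days elapsed between the source row date and the check date."""
    return (checked_on - source_date).days


def is_stale(source_date: date, checked_on: date, threshold_days: int = 30) -> bool:
    """True when the source is older than the threshold (hypothetical cutoff)."""
    return staleness_days(source_date, checked_on) > threshold_days


latest_source = date(2026, 6, 10)   # from the source record
assumed_check = date(2026, 7, 14)   # assumption: not stated on the page

age = staleness_days(latest_source, assumed_check)
print(age, is_stale(latest_source, assumed_check))
```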
