benchmark evidence
AI Language Proficiency Monitor Mandarin Chinese
Mean ARC, MGSM, and MMLU accuracy for Mandarin Chinese, from the AI Language Proficiency Monitor.
winner on AI Language Proficiency Monitor Mandarin Chinese: Anthropic: Claude Sonnet 4.6 (100.0, tied with five other models at the top score)
direct benchmark result, not a broad vertical composite | source row dated 2026-05-19
scored on 2026-05-19
latest mapped results | top 20 (18 models matched)
| # | Model | Score (accuracy, %) | Evidence | Tested |
|---|---|---|---|---|
| 1 | Anthropic: Claude Sonnet 4.6 | 100.0 | model-only independent_benchmark | 2026-05-19 |
| 2 | Anthropic: Claude Sonnet 4.5 | 100.0 | model-only independent_benchmark | 2026-05-19 |
| 3 | OpenAI: GPT-5 | 100.0 | model-only independent_benchmark | 2026-05-19 |
| 4 | Anthropic: Claude Opus 4.6 | 100.0 | model-only independent_benchmark | 2026-05-19 |
| 5 | Anthropic: Claude Sonnet 4 | 100.0 | model-only independent_benchmark | 2026-05-19 |
| 6 | OpenAI: GPT-5.2 | 100.0 | model-only independent_benchmark | 2026-05-19 |
| 7 | Google: Gemini 2.5 Pro | 96.7 | model-only independent_benchmark | 2026-05-19 |
| 8 | OpenAI: GPT-4.1 | 96.7 | model-only independent_benchmark | 2026-05-19 |
| 9 | Anthropic: Claude Opus 4.5 | 96.7 | model-only independent_benchmark | 2026-05-19 |
| 10 | OpenAI: GPT-5.1 | 96.7 | model-only independent_benchmark | 2026-05-19 |
| 11 | OpenAI: GPT-5.4 | 96.7 | model-only independent_benchmark | 2026-05-19 |
| 12 | Meta: Llama 3.3 70B Instruct | 93.3 | model-only independent_benchmark | 2026-05-19 |
| 13 | Anthropic: Claude Haiku 4.5 | 93.3 | model-only independent_benchmark | 2026-05-19 |
| 14 | Qwen: Qwen3 30B A3B | 92.6 | model-only independent_benchmark | 2026-05-19 |
| 15 | Meta: Llama 4 Maverick | 90.0 | model-only independent_benchmark | 2026-05-19 |
| 16 | Amazon: Nova Pro 1.0 | 90.0 | model-only independent_benchmark | 2026-05-19 |
| 17 | Mistral: Mistral Nemo | 53.3 | model-only independent_benchmark | 2026-05-19 |
| 18 | Google: Gemini 2.5 Flash | 43.3 | model-only independent_benchmark | 2026-05-19 |
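Several rows in the table share a score, so the ordinal position in the `#` column does not by itself separate them. The following is a minimal sketch of competition-style ranking, in which tied scores share the best available rank. The function name `competition_ranks` is hypothetical and not part of the source site; the scores are copied from the table above in row order.

```python
def competition_ranks(scores):
    """Return 1-based ranks where tied scores share a rank (1, 1, 1, 4, ...)."""
    ordered = sorted(scores, reverse=True)  # higher is better
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Record only the first (best) position at which each score appears.
        first_position.setdefault(score, position)
    return [first_position[s] for s in scores]


# Scores from the table above, rows 1 through 18.
table_scores = (
    [100.0] * 6
    + [96.7] * 5
    + [93.3, 93.3, 92.6, 90.0, 90.0, 53.3, 43.3]
)

for row, rank in enumerate(competition_ranks(table_scores), start=1):
    print(f"row {row:>2}: tie-aware rank {rank}")
```

Under this scheme rows 1 through 6 all receive rank 1 and rows 7 through 11 all receive rank 7, which is one way to read the table without treating the row order among tied models as meaningful.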
what this result means
This benchmark contributes direct public evidence. Read its scope before generalizing the result.
A win here is a win on AI Language Proficiency Monitor Mandarin Chinese only; it does not establish a general winner. Broad task pages require independent corroboration before naming one.
source record
category: multilingual
metric: accuracy
matched models: 18
latest source date: 2026-05-19
direction: higher is better
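The score is described as the mean of ARC, MGSM, and MMLU accuracy. A minimal sketch of that calculation follows, assuming an unweighted mean of three per-task accuracies on a 0 to 100 scale rounded to one decimal place; the source does not state the weighting or rounding. The function name `composite_score` and the per-task values are illustrative only and are not taken from the source.

```python
from statistics import mean

REQUIRED_TASKS = ("ARC", "MGSM", "MMLU")


def composite_score(task_accuracy):
    """Unweighted mean of per-task accuracies (0-100), rounded to one decimal.

    Assumption: equal weighting across the three tasks.
    """
    missing = [t for t in REQUIRED_TASKS if t not in task_accuracy]
    if missing:
        raise ValueError(f"missing task accuracies: {missing}")
    return round(mean(task_accuracy[t] for t in REQUIRED_TASKS), 1)


# Made-up per-task accuracies for illustration, NOT values from the source.
example = {"ARC": 100.0, "MGSM": 90.0, "MMLU": 100.0}
print(composite_score(example))  # 96.7
```

Because the three tasks are averaged, a single headline number can hide a weak result on one task, which is one reason to read the benchmark's scope before generalizing.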