benchmark evidence
AI Language Proficiency Monitor Arabic
Mean ARC, MGSM, and MMLU accuracy for Arabic, from the AI Language Proficiency Monitor.
winner on AI Language Proficiency Monitor Arabic: Qwen: Qwen3 30B A3B (90.5)
direct benchmark result, not a broad vertical composite | source row dated 2026-05-19
scored on 2026-05-19
latest mapped results | top 20 (18 models matched)
| # | Model | Score (accuracy) | Evidence | Tested |
|---|---|---|---|---|
| 1 | Qwen: Qwen3 30B A3B | 90.5 | model-only independent_benchmark | 2026-05-19 |
| 2 | Anthropic: Claude Opus 4.6 | 90.0 | model-only independent_benchmark | 2026-05-19 |
| 3 | Anthropic: Claude Opus 4.5 | 90.0 | model-only independent_benchmark | 2026-05-19 |
| 4 | Anthropic: Claude Sonnet 4.6 | 88.9 | model-only independent_benchmark | 2026-05-19 |
| 5 | OpenAI: GPT-5 | 88.9 | model-only independent_benchmark | 2026-05-19 |
| 6 | Anthropic: Claude Haiku 4.5 | 86.7 | model-only independent_benchmark | 2026-05-19 |
| 7 | OpenAI: GPT-5.2 | 86.7 | model-only independent_benchmark | 2026-05-19 |
| 8 | OpenAI: GPT-5.1 | 86.7 | model-only independent_benchmark | 2026-05-19 |
| 9 | Anthropic: Claude Sonnet 4.5 | 86.7 | model-only independent_benchmark | 2026-05-19 |
| 10 | Google: Gemini 2.5 Pro | 86.7 | model-only independent_benchmark | 2026-05-19 |
| 11 | Anthropic: Claude Sonnet 4 | 83.3 | model-only independent_benchmark | 2026-05-19 |
| 12 | OpenAI: GPT-5.4 | 83.0 | model-only independent_benchmark | 2026-05-19 |
| 13 | Amazon: Nova Pro 1.0 | 80.0 | model-only independent_benchmark | 2026-05-19 |
| 14 | OpenAI: GPT-4.1 | 76.7 | model-only independent_benchmark | 2026-05-19 |
| 15 | Meta: Llama 3.3 70B Instruct | 76.7 | model-only independent_benchmark | 2026-05-19 |
| 16 | Meta: Llama 4 Maverick | 73.3 | model-only independent_benchmark | 2026-05-19 |
| 17 | Google: Gemini 2.5 Flash | 43.3 | model-only independent_benchmark | 2026-05-19 |
| 18 | Mistral: Mistral Nemo | 26.7 | model-only independent_benchmark | 2026-05-19 |
what this result means
This benchmark contributes direct public evidence. Read its scope before generalizing the result.
A win here is a win on AI Language Proficiency Monitor Arabic. Broad task pages require independent corroboration before naming a general winner.
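The score is described as the mean of ARC, MGSM, and MMLU accuracy for Arabic. A minimal sketch of that calculation follows, assuming an unweighted arithmetic mean rounded to one decimal place. The source does not state the weighting or rounding, the function name `composite_score` is hypothetical, and the per-task values are made up for illustration, not taken from the source.

```python
from statistics import mean


def composite_score(task_accuracy: dict[str, float]) -> float:
    """Unweighted mean of per-task accuracies, rounded to one decimal.

    Assumption: equal weight per task. The source only says "mean".
    """
    return round(mean(task_accuracy.values()), 1)


# Hypothetical per-task accuracies (illustrative only, not from the source).
example = {"ARC": 92.0, "MGSM": 88.0, "MMLU": 90.0}
print(composite_score(example))
```

Because the score averages three tasks, two models with the same score can have different per-task profiles, which is one reason to read the benchmark's scope before generalizing.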
source record
category: multilingual
metric: accuracy
matched models: 18
latest source date: 2026-05-19
direction: higher is better
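The `direction` field determines how rows are ordered. The sketch below sorts the top four rows from the table by score and assigns ranks, assuming that tied scores share a rank (competition ranking). That tie rule is an illustrative choice, not the page's method: the table above numbers tied rows sequentially. The function name `rank_rows` is hypothetical.

```python
# Scores copied from the top four rows of the table above.
rows = [
    ("Qwen: Qwen3 30B A3B", 90.5),
    ("Anthropic: Claude Opus 4.6", 90.0),
    ("Anthropic: Claude Opus 4.5", 90.0),
    ("Anthropic: Claude Sonnet 4.6", 88.9),
]


def rank_rows(rows, higher_is_better=True):
    """Return (rank, model, score) tuples; tied scores share a rank."""
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda r: r[1], reverse=higher_is_better)
    ranked = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous row: reuse its rank
        else:
            rank = position
        ranked.append((rank, model, score))
    return ranked


for rank, model, score in rank_rows(rows):
    print(rank, model, score)
```

With shared ranks, the two models at 90.0 both rank 2 and the next model ranks 4, which makes ties visible instead of implying an order between equal scores.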