creative
SFW roleplay and simulation
Roleplay/simulations for learning or entertainment with state tracking.
#1 Recommendation
gemini-2.5-pro
Strong on UGI Leaderboard Writing ✍️ (96%) and MWS Vision Bench validation_overall_score (93%)
external/google/gemini-2-5-pro
Score: 20.1%
Confidence: 27.3%
Limited benchmark evidence for this use case.
50 ranked models with average evidence of 12.7 points. Rankings may shift as more benchmark data is ingested.
Ranked Models: 30
Evidence Quality: 79%
Scoring: Benchmark-backed
Top Signal: UGI Leaderboard: Writing ✍️
All Ranked Models
| Rank | Model | Score |
|---|---|---|
| #16 | gemini-2.5-pro | 20.1% |
| #27 | Grok-4-0709 | 18.6% |
| #32 | gpt-4.1-20250414 | 18.1% |
| #35 | Arch-Agent-32B | 17.9% |
| #36 | qwen-2.5-72b-instruct | 17.5% |
| #48 | gpt-4o | 15.4% |
| #69 | xai-org/grok-4-fast-reasoning | 13.2% |
| #73 | Arch-Agent-3B | 12.6% |
| #76 | xai-org/grok-4-1-fast-reasoning | 12.5% |
| #77 | gemini-3-pro-preview | 12.4% |
| #80 | Arch-Agent-1.5B | 12.1% |
| #83 | gemini-3-flash-preview | 11.7% |
| #84 | x-ai/grok-3 | 11.5% |
| #86 | google/gemini-3.1-pro-preview | 11.3% |
| #87 | claude-sonnet-4-20250514 | 11.3% |
| #89 | gemini-2.5-flash | 11.1% |
| #96 | gpt-5-2025-08-07 | 10.4% |
| #98 | openai/gpt-5.4-2026-03-05 | 10.2% |
| #100 | gemma-2-27b-it | 10.1% |
| #101 | gpt-5.1-2025-11-13 | 9.9% |
| #105 | anthropic/claude-sonnet-4.6 | 9.8% |
| #107 | claude-opus-4-5-20251101 | 9.8% |
| #110 | gpt-5-mini-2025-08-07 | 9.6% |
| #112 | xai-org/grok-4-1-fast-non-reasoning | 9.5% |
| #113 | Kimi-K2-Instruct | 9.5% |
| #116 | anthropic/claude-opus-4-6-thinking | 9.3% |
| #117 | gpt-5.2-2025-12-11 | 9.2% |
| #118 | gpt-4o-2024-05-13 | 9.1% |
| #121 | anthropic/claude-opus-4-5-20251101-thinking | 9.0% |
| #124 | xai-org/grok-4-fast-non-reasoning | 8.9% |
Compare Models
Model A leads by +1.5%
Model A
gemini-2.5-pro
external/google/gemini-2-5-pro
Rank #16
UGI Leaderboard: Writing ✍️
Value 96.3% · Conf 100.0% · Weight 2.8%
ugi_main.writing (Mar 12, 2026)
MWS Vision Bench: validation_overall_score
Value 93.5% · Conf 100.0% · Weight 2.3%
mws_vision_bench.validation_overall_score (Mar 12, 2026)
UGI Leaderboard: Entertainment
Value 73.3% · Conf 100.0% · Weight 1.8%
ugi_main.entertainment (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 58.7% · Conf 100.0% · Weight 1.2%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Model B
Grok-4-0709
external/xai/grok-4-0709
Rank #27
UGI Leaderboard: Writing ✍️
Value 99.2% · Conf 100.0% · Weight 2.9%
ugi_main.writing (Mar 12, 2026)
UGI Leaderboard: Entertainment
Value 100.0% · Conf 100.0% · Weight 2.5%
ugi_main.entertainment (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg TSQ
Value 84.6% · Conf 100.0% · Weight 1.2%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 56.5% · Conf 100.0% · Weight 1.2%
galileo_agent_v2.avg_ac (Mar 12, 2026)
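The comparison figures above can be checked with a little arithmetic. The sketch below recomputes the reported lead from the two table scores and, as a rough illustration, sums the listed signals for each model. The page does not state how signals are aggregated into a score, so the formula value × confidence × weight is an assumption, and the names `scores`, `signals`, `lead`, and `listed_contribution` are hypothetical.

```python
# Scores (percent) copied from the "All Ranked Models" table.
scores = {"gemini-2.5-pro": 20.1, "Grok-4-0709": 18.6}

# Listed signals as (value %, confidence %, weight %), copied from the comparison.
signals = {
    "gemini-2.5-pro": [
        (96.3, 100.0, 2.8),  # UGI Leaderboard: Writing
        (93.5, 100.0, 2.3),  # MWS Vision Bench: validation_overall_score
        (73.3, 100.0, 1.8),  # UGI Leaderboard: Entertainment
        (58.7, 100.0, 1.2),  # Galileo Agent Leaderboard v2: Avg AC
    ],
    "Grok-4-0709": [
        (99.2, 100.0, 2.9),   # UGI Leaderboard: Writing
        (100.0, 100.0, 2.5),  # UGI Leaderboard: Entertainment
        (84.6, 100.0, 1.2),   # Galileo Agent Leaderboard v2: Avg TSQ
        (56.5, 100.0, 1.2),   # Galileo Agent Leaderboard v2: Avg AC
    ],
}


def lead(model_a: str, model_b: str) -> float:
    """Score difference in percentage points, rounded to one decimal."""
    return round(scores[model_a] - scores[model_b], 1)


def listed_contribution(model: str) -> float:
    """ASSUMPTION: each signal contributes value * confidence * weight.

    Returns the sum over the listed signals only, in percentage points.
    The site's actual aggregation is not documented on this page.
    """
    return sum(v / 100 * c / 100 * w for v, c, w in signals[model])


if __name__ == "__main__":
    print("lead:", lead("gemini-2.5-pro", "Grok-4-0709"))
    for name in signals:
        print(name, round(listed_contribution(name), 2))
```

Under this assumed formula, the four listed signals account for only part of each model's score, and they do not by themselves reproduce the ranking, so the displayed score presumably draws on signals that are not shown here.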
Ranking Diagnostics & Missing Models
Source Lift
Ranked: 50
Sources: 8
Quality: Insufficient
| Source | ID | Rows | Avg lift |
|---|---|---|---|
| Vals Legal Bench | vals_legal_bench | 35 | 0.5% |
| Vals CorpFin v2 | vals_corp_fin_v2 | 35 | 0.5% |
| Vals MedQA | vals_medqa | 34 | 0.5% |
| Vals Tax Eval v2 | vals_tax_eval_v2 | 34 | 0.5% |
Missing Strong Models
| Model | ID | Rank | Score |
|---|---|---|---|
| zai/glm-5-thinking | external/zai/glm-5-thinking | #32 | 13.0% |
| alibaba/qwen3.5-flash | external/alibaba/qwen3-5-flash | #33 | 12.3% |
| Kimi K2 Thinking | external/kimi/kimi-k2-thinking | #34 | 12.3% |
| gpt-4o-20241120 | external/openai/gpt-4o-20241120 | #49 | 10.7% |
Taxonomy Details
Core Tasks
Required Modes
Domains
Related Use Cases
| Category | Use case | Description | Top model |
|---|---|---|---|
| creative | Poetry and lyrics | Generate poems and lyrics with style control and variation. | qwen-2.5-72b-instruct |
| creative | Screenplay scene writing | Write screenplay scenes with formatting, pacing, and strong dialogue. | qwen-2.5-72b-instruct |
| creative | Interactive fiction / DM | Run interactive fiction with state tracking and user agency. | qwen-2.5-72b-instruct |
| creative | NPC dialogue | Low-latency in-character dialogue suitable for games. | qwen-2.5-72b-instruct |