BasedAGI
Rankings live

creative

SFW roleplay and simulation

Roleplay/simulations for learning or entertainment with state tracking.

#1 Recommendation

gemini-2.5-pro

Strong on UGI Leaderboard Writing ✍️ (96%) and MWS Vision Bench validation_overall_score (93%)

external/google/gemini-2-5-pro

20.1%

Score

27.3%

Confidence

Limited benchmark evidence for this use case.

50 ranked models with average evidence of 12.7 points. Rankings may shift as more benchmark data is ingested.

Ranked Models

30

Evidence Quality

79%

Scoring

Benchmark-backed

Top Signal

UGI Leaderboard: Writing ✍️

All Ranked Models

30 of 30
Rank  Model                                        Score
#16   gemini-2.5-pro                               20.1%
#27   Grok-4-0709                                  18.6%
#32   gpt-4.1-20250414                             18.1%
#35   Arch-Agent-32B                               17.9%
#36   qwen-2.5-72b-instruct                        17.5%
#48   gpt-4o                                       15.4%
#69   xai-org/grok-4-fast-reasoning                13.2%
#73   Arch-Agent-3B                                12.6%
#76   xai-org/grok-4-1-fast-reasoning              12.5%
#77   gemini-3-pro-preview                         12.4%
#80   Arch-Agent-1.5B                              12.1%
#83   gemini-3-flash-preview                       11.7%
#84   x-ai/grok-3                                  11.5%
#86   google/gemini-3.1-pro-preview                11.3%
#87   claude-sonnet-4-20250514                     11.3%
#89   gemini-2.5-flash                             11.1%
#96   gpt-5-2025-08-07                             10.4%
#98   openai/gpt-5.4-2026-03-05                    10.2%
#100  gemma-2-27b-it                               10.1%
#101  gpt-5.1-2025-11-13                           9.9%
#105  anthropic/claude-sonnet-4.6                  9.8%
#107  claude-opus-4-5-20251101                     9.8%
#110  gpt-5-mini-2025-08-07                        9.6%
#112  xai-org/grok-4-1-fast-non-reasoning          9.5%
#113  Kimi-K2-Instruct                             9.5%
#116  anthropic/claude-opus-4-6-thinking           9.3%
#117  gpt-5.2-2025-12-11                           9.2%
#118  gpt-4o-2024-05-13                            9.1%
#121  anthropic/claude-opus-4-5-20251101-thinking  9.0%
#124  xai-org/grok-4-fast-non-reasoning            8.9%

Compare Models

Model A leads by +1.5%

Model A

gemini-2.5-pro

external/google/gemini-2-5-pro

20.1%

Rank #16

Confidence 27.3% · 23 evidence pts

UGI Leaderboard: Writing ✍️

Value 96.3% · Conf 100.0% · Weight 2.8%

ugi_main.writing (Mar 12, 2026)

MWS Vision Bench: validation_overall_score

Value 93.5% · Conf 100.0% · Weight 2.3%

mws_vision_bench.validation_overall_score (Mar 12, 2026)

UGI Leaderboard: Entertainment

Value 73.3% · Conf 100.0% · Weight 1.8%

ugi_main.entertainment (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 58.7% · Conf 100.0% · Weight 1.2%

galileo_agent_v2.avg_ac (Mar 12, 2026)

Model B

Grok-4-0709

external/xai/grok-4-0709

18.6%

Rank #27

Confidence 23.9% · 20 evidence pts

UGI Leaderboard: Writing ✍️

Value 99.2% · Conf 100.0% · Weight 2.9%

ugi_main.writing (Mar 12, 2026)

UGI Leaderboard: Entertainment

Value 100.0% · Conf 100.0% · Weight 2.5%

ugi_main.entertainment (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg TSQ

Value 84.6% · Conf 100.0% · Weight 1.2%

galileo_agent_v2.avg_tsq (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 56.5% · Conf 100.0% · Weight 1.2%

galileo_agent_v2.avg_ac (Mar 12, 2026)
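
The comparison above lists a value, a confidence, and a weight for each signal, but the page does not state how these combine into a score. The sketch below is illustrative only: it assumes, purely as a hypothesis, that each signal contributes value × confidence × weight, and the `Signal` class and `partial_sum` helper are names invented for this example. It also checks the reported "+1.5%" lead against the two displayed scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One benchmark signal as displayed in the comparison panel."""
    name: str
    value_pct: float   # "Value" column, in percent
    conf_pct: float    # "Conf" column, in percent
    weight_pct: float  # "Weight" column, in percent

    def contribution(self) -> float:
        # ASSUMPTION: contribution = value x confidence x weight.
        # The page does not document its scoring formula.
        return self.value_pct * (self.conf_pct / 100) * (self.weight_pct / 100)


# Signals copied from the Model A (gemini-2.5-pro) panel.
MODEL_A = [
    Signal("ugi_main.writing", 96.3, 100.0, 2.8),
    Signal("mws_vision_bench.validation_overall_score", 93.5, 100.0, 2.3),
    Signal("ugi_main.entertainment", 73.3, 100.0, 1.8),
    Signal("galileo_agent_v2.avg_ac", 58.7, 100.0, 1.2),
]

# Signals copied from the Model B (Grok-4-0709) panel.
MODEL_B = [
    Signal("ugi_main.writing", 99.2, 100.0, 2.9),
    Signal("ugi_main.entertainment", 100.0, 100.0, 2.5),
    Signal("galileo_agent_v2.avg_tsq", 84.6, 100.0, 1.2),
    Signal("galileo_agent_v2.avg_ac", 56.5, 100.0, 1.2),
]


def partial_sum(signals: list[Signal]) -> float:
    """Sum the assumed contributions of the listed signals only."""
    return sum(s.contribution() for s in signals)


# Displayed scores: 20.1% for Model A, 18.6% for Model B.
score_gap = 20.1 - 18.6
partial_a = partial_sum(MODEL_A)
partial_b = partial_sum(MODEL_B)

print(f"Score gap (A - B): {score_gap:+.1f} points")
print(f"Listed-signal partial sum, Model A: {partial_a:.2f}")
print(f"Listed-signal partial sum, Model B: {partial_b:.2f}")
```

Only four signals are listed per model, out of 23 and 20 evidence points respectively, so these partial sums are not expected to reproduce the displayed scores. Under this assumed formula the listed signals alone would favor Model B, which suggests the unlisted evidence, or a different aggregation, accounts for Model A's lead.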

Ranking Diagnostics & Missing Models

Source Lift

Ranked

50

Sources

8

Quality

Insufficient

Vals Legal Bench

vals_legal_bench

35 rows

0.5% avg lift

Vals CorpFin v2

vals_corp_fin_v2

35 rows

0.5% avg lift

Vals MedQA

vals_medqa

34 rows

0.5% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

34 rows

0.5% avg lift

Missing Strong Models

zai/glm-5-thinking

external/zai/glm-5-thinking

Rank #32

13.0%

Thin evidence after weighting

alibaba/qwen3.5-flash

external/alibaba/qwen3-5-flash

Rank #33

12.3%

Thin evidence after weighting

Kimi K2 Thinking

external/kimi/kimi-k2-thinking

Rank #34

12.3%

Thin evidence after weighting

gpt-4o-20241120

external/openai/gpt-4o-20241120

Rank #49

10.7%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.roleplay_simulation_sfw
task.persona_consistency

Required Modes

mode.persona_memory

Domains

domain.creative_writing

Related Use Cases