Creative

SFW roleplay and simulation

Roleplay/simulations for learning or entertainment with state tracking.

task.roleplay_simulation_sfw

task.persona_consistency

Best for this use case

gemini-3-pro-preview

Strong on BFCL Memory Official: Memory Acc and BFCL Multi-turn Official: Multi Turn Acc

Best benchmark score: 45.9%

Confidence: 55.2%

All ranked models — top 3

🥇 gemini-3-pro-preview: 45.9%

🥈 Grok-4-0709: 45.8%

🥉 grok-4-1-fast-reasoning: 41.5%

Ranked Models

Evidence Quality: 85%

Evidence Points

Top Signal

BFCL Memory Official: Memory Acc

All Ranked Models

30 of 30 models

Rank	Model	Strong on	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-3-pro-preview	BFCL Memory Official: Memory Acc; BFCL Multi-turn Official: Multi Turn Acc	45.9%	55%	$4.50	BFCL Memory Official, BFCL Multi-turn Official
🥈	Grok-4-0709	BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection	45.8%	59%	—	BFCL Memory Official, BFCL Relevance Detection Official
🥉	grok-4-1-fast-reasoning	BFCL Memory Official: Memory Acc; BFCL Multi-turn Official: Multi Turn Acc	41.5%	53%	$0.28	BFCL Memory Official, BFCL Multi-turn Official
#4	o3-20250416	BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection	38.4%	56%	$3.50	BFCL Memory Official, BFCL Relevance Detection Official
#5	GLM-4.6	BFCL Memory Official: Memory Acc; BFCL Multi-turn Official: Multi Turn Acc	37.8%	44%	—	BFCL Memory Official, BFCL Multi-turn Official
#6	gpt-4.1-20250414	BFCL Relevance Detection Official: Relevance Detection; BFCL Memory Official: Memory Acc	33.3%	57%	—	BFCL Relevance Detection Official, BFCL Memory Official
#7	Kimi-K2-Instruct	BFCL Multi-turn Official: Multi Turn Acc; BFCL Memory Official: Memory Acc	31.9%	47%	—	BFCL Multi-turn Official, BFCL Memory Official
#8	o4-mini	BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection	30.9%	54%	$1.93	BFCL Memory Official, BFCL Relevance Detection Official
#9	gemini-2.5-flash	BFCL Memory Official: Memory Acc; BFCL Relevance Detection Official: Relevance Detection	30.0%	51%	$0.17	BFCL Memory Official, BFCL Relevance Detection Official
#10	grok-4-1-fast-non-reasoning	BFCL Multi-turn Official: Multi Turn Acc; BFCL Memory Official: Memory Acc	29.8%	52%	$0.28	BFCL Multi-turn Official, BFCL Memory Official
#11	gpt-5.2-2025-12-11	BFCL Multi-turn Official: Multi Turn Acc; BFCL Relevance Detection Official: Relevance Detection	29.7%	54%	—	BFCL Multi-turn Official, BFCL Relevance Detection Official
#15	claude-opus-4-5-20251101	BFCL Relevance Detection Official: Relevance Detection; UGI Leaderboard: Writing ✍️	24.2%	53%	—	BFCL Relevance Detection Official, UGI Leaderboard
#18	gemini-2.5-pro	UGI Leaderboard: Writing ✍️; MWS Vision Bench: validation_overall_score	22.0%	31%	$3.44	UGI Leaderboard, MWS Vision Bench
#24	gpt-5-2025-08-07	UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment	20.2%	26%	—	UGI Leaderboard
#26	claude-sonnet-4	UGI Leaderboard: Writing ✍️; Galileo Agent Leaderboard v2: Avg AC	19.4%	27%	$6.00	UGI Leaderboard, Galileo Agent Leaderboard v2
#27	gemini-3.1-pro-preview	UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment	19.1%	22%	$4.50	UGI Leaderboard
#28	Arch-Agent-32B	BFCL Multi-turn Official: Multi Turn Acc; BFCL Relevance Detection Official: Relevance Detection	18.9%	34%	—	BFCL Multi-turn Official, BFCL Relevance Detection Official
#30	Llama 3.3 70B Instruct	BFCL Relevance Detection Official: Relevance Detection; BFCL Multi-turn Official: Multi Turn Acc	18.6%	52%	—	BFCL Relevance Detection Official, BFCL Multi-turn Official
#31	gemini-3-flash-preview	UGI Leaderboard: Writing ✍️; MWS Vision Bench: validation_overall_score	18.4%	24%	$1.13	UGI Leaderboard, MWS Vision Bench
#39	gpt-5.4-2026-03-05	UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment	16.3%	19%	—	UGI Leaderboard
#43	claude-sonnet-4.6	UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment	16.0%	19%	$6.00	UGI Leaderboard
#49	Llama-4-Scout-17B-16E-Instruct	BFCL Relevance Detection Official: Relevance Detection; BFCL Memory Official: Memory Acc	15.2%	42%	—	BFCL Relevance Detection Official, BFCL Memory Official
#51	kimi-k2.5-thinking	UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment	15.0%	19%	—	UGI Leaderboard
#53	gemini-2.5-flash-lite	BFCL Relevance Detection Official: Relevance Detection; BFCL Memory Official: Memory Acc	14.9%	42%	$0.17	BFCL Relevance Detection Official, BFCL Memory Official
#55	qwen-2.5-72b-instruct	EQ-Bench Leaderboard: judgemark_score; Galileo Agent Leaderboard v2: Avg AC	14.4%	28%	—	EQ-Bench Leaderboard, Galileo Agent Leaderboard v2
#56	gpt-5.1-2025-11-13	UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment	14.3%	20%	—	UGI Leaderboard
#61	gpt-5-mini-2025-08-07	MWS Vision Bench: validation_overall_score; Vals MedQA: overall_accuracy_pct	14.0%	21%	—	MWS Vision Bench, Vals MedQA
#69	grok-4-fast-reasoning	UGI Leaderboard: Writing ✍️; UGI Leaderboard: Entertainment	13.3%	21%	$0.28	UGI Leaderboard
#70	Arch-Agent-3B	BFCL Multi-turn Official: Multi Turn Acc; BFCL Relevance Detection Official: Relevance Detection	13.1%	34%	—	BFCL Multi-turn Official, BFCL Relevance Detection Official
#71	Arch-Agent-1.5B	BFCL Relevance Detection Official: Relevance Detection; BFCL Multi-turn Official: Multi Turn Acc	12.7%	34%	—	BFCL Relevance Detection Official, BFCL Multi-turn Official
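
The table lists score, confidence, and price side by side but does not combine them. As a rough illustration only, the sketch below ranks a handful of rows copied from the table by score per dollar, skipping models whose price is shown as "—". The `score_per_dollar` ratio, the `min_confidence` cutoff, and the `Row` type are assumptions made for this example; the page itself does not define any such metric.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    model: str
    score: float            # Score column, in percent
    confidence: float       # Confidence column, in percent
    price: Optional[float]  # Price / 1M in USD; None where the table shows "—"


# A few rows copied from the table above (not the full list).
ROWS = [
    Row("gemini-3-pro-preview", 45.9, 55, 4.50),
    Row("Grok-4-0709", 45.8, 59, None),
    Row("grok-4-1-fast-reasoning", 41.5, 53, 0.28),
    Row("o3-20250416", 38.4, 56, 3.50),
    Row("o4-mini", 30.9, 54, 1.93),
    Row("gemini-2.5-flash", 30.0, 51, 0.17),
]


def score_per_dollar(row: Row) -> Optional[float]:
    """Hypothetical value metric: score points per USD of listed price."""
    if row.price is None:
        return None
    return row.score / row.price


def rank_by_value(rows: List[Row], min_confidence: float = 50.0) -> List[Row]:
    """Keep priced rows at or above the confidence cutoff, best value first."""
    eligible = [
        r for r in rows
        if r.price is not None and r.confidence >= min_confidence
    ]
    return sorted(eligible, key=lambda r: r.score / r.price, reverse=True)


if __name__ == "__main__":
    for r in rank_by_value(ROWS):
        print(f"{r.model}: {score_per_dollar(r):.1f} score points per USD")
```

On these rows, the cheaper models rank above the top-scoring one, so a ratio like this is better read as a tie-breaker than as a replacement for the Score column.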