Arch-Agent-32B vs Grok-4-0709
Model A wins by +5.1%
Rank #30 · Confidence 38.9% · Evidence 4 pts
BFCL Multi-turn Official: Multi Turn Acc
Value 70.1% · Conf 100.0% · Weight 6.8%
bfcl_multiturn_official.multi_turn_acc (Mar 12, 2026)
BFCL Relevance Detection Official: Relevance Detection
Value 81.3% · Conf 100.0% · Weight 6.1%
bfcl_relevance_detection_official.relevance_detection (Mar 12, 2026)
BFCL Relevance Detection Official: Irrelevance Detection
Value 81.0% · Conf 100.0% · Weight 2.4%
bfcl_relevance_detection_official.irrelevance_detection (Mar 12, 2026)
BFCL Memory Official: Memory Acc
Value 19.8% · Conf 100.0% · Weight 2.3%
bfcl_memory_official.memory_acc (Mar 12, 2026)
Rank #59 · Confidence 21.1% · Evidence 20 pts
UGI Leaderboard: Entertainment
Value 100.0% · Conf 100.0% · Weight 2.6%
ugi_main.entertainment (Mar 12, 2026)
UGI Leaderboard: Writing ✍️
Value 99.2% · Conf 100.0% · Weight 2.6%
ugi_main.writing (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg TSQ
Value 84.6% · Conf 100.0% · Weight 1.1%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 56.5% · Conf 100.0% · Weight 1.1%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Vals CorpFin v2: overall_accuracy_pct
Value 93.6% · Conf 100.0% · Weight 0.5%
vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)