BasedAGI

devops_sre

Runbook step assistant

Suggest safe runbook steps and escalation points grounded in docs.

#1 Recommendation

gemini-3-pro-preview

Strong on FACTS Benchmark Suite facts_grounding_score_pct (88%) and FACTS Benchmark Suite facts_search_score_pct (100%)

external/google/gemini-3-pro-preview

Score: 35.1%
Confidence: 45.7%
Ranked Models: 30
Evidence Quality: 85%
Scoring: Benchmark-backed
Top Signal: FACTS Benchmark Suite: facts_grounding_score_pct

All Ranked Models

30 of 30
Rank · Model · Score
#1 · gemini-3-pro-preview · 35.1%
Strong on FACTS Benchmark Suite facts_grounding_score_pct (88%) and FACTS Benchmark Suite facts_search_score_pct (100%)
#2 · gemini-2.5-pro · 32.8%
Strong on FACTS Benchmark Suite facts_grounding_score_pct (100%) and Vectara HHEM Leaderboard overall_hallucination_error_pct (76%)
#3 · gpt-4.1-20250414 · 27.2%
Strong on Galileo Agent Leaderboard v2 Avg AC (100%) and Vectara HHEM Leaderboard overall_hallucination_error_pct (82%)
#4 · Grok-4-0709 · 26.9%
#5 · anthropic/claude-sonnet-4.6 · 26.3%
#6 · gpt-5-mini-2025-08-07 · 25.8%
#7 · gpt-5-2025-08-07 · 25.4%
#8 · google/gemini-3.1-pro-preview · 24.2%
#9 · claude-sonnet-4-20250514 · 24.0%
#10 · openai/gpt-5.4-2026-03-05 · 23.7%
#11 · claude-opus-4-5-20251101 · 23.0%
#12 · gpt-5.1-2025-11-13 · 21.1%
#13 · kimi/kimi-k2.5-thinking · 20.5%
#14 · gemini-3-flash-preview · 20.5%
#15 · gemini-2.5-flash · 19.7%
#16 · google/gemini-3.1-flash-lite-preview · 19.7%
#17 · xai-org/grok-4-fast-reasoning · 19.4%
#18 · xai-org/grok-4-1-fast-reasoning · 18.5%
#19 · anthropic/claude-opus-4-6-thinking · 18.4%
#20 · gpt-5.2-2025-12-11 · 18.4%
#21 · anthropic/claude-opus-4-5-20251101-thinking · 17.0%
#22 · anthropic/claude-sonnet-4-5-20250929-thinking · 16.0%
#24 · zai/glm-5-thinking · 15.5%
#25 · x-ai/grok-3 · 15.4%
#26 · o3-20250416 · 14.5%
#27 · mistralai/mistral-large-2512 · 14.4%
#29 · anthropic/claude-haiku-4-5-20251001-thinking · 14.2%
#30 · alibaba/qwen3.5-flash · 14.0%
#31 · xai-org/grok-4-1-fast-non-reasoning · 13.9%
#32 · gpt-4.1-mini-20250414 · 13.7%

Compare Models

Model A leads by +2.3%

Model A

gemini-3-pro-preview

external/google/gemini-3-pro-preview

35.1%

Rank #1

Confidence 45.7% · 23 evidence pts

FACTS Benchmark Suite: facts_grounding_score_pct

Value 88.3% · Conf 100.0% · Weight 3.1%

facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)

FACTS Benchmark Suite: facts_search_score_pct

Value 100.0% · Conf 100.0% · Weight 2.7%

facts_benchmark_suite.facts_search_score_pct (Mar 12, 2026)

FACTS Benchmark Suite: average_score_pct

Value 100.0% · Conf 100.0% · Weight 2.5%

facts_benchmark_suite.average_score_pct (Mar 12, 2026)

Vals Finance Agent: overall_accuracy_pct

Value 87.0% · Conf 100.0% · Weight 2.5%

vals_finance_agent.overall_accuracy_pct (Mar 12, 2026)

Model B

gemini-2.5-pro

external/google/gemini-2-5-pro

32.8%

Rank #2

Confidence 48.1% · 23 evidence pts

FACTS Benchmark Suite: facts_grounding_score_pct

Value 100.0% · Conf 100.0% · Weight 3.5%

facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)

Vectara HHEM Leaderboard: overall_hallucination_error_pct

Value 76.0% · Conf 100.0% · Weight 2.8%

vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 58.7% · Conf 100.0% · Weight 2.6%

galileo_agent_v2.avg_ac (Mar 12, 2026)

Vals CorpFin v2: overall_accuracy_pct

Value 78.4% · Conf 100.0% · Weight 2.2%

vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)
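
Each evidence line in the comparison above follows a "Value · Conf · Weight" pattern, and the headline lead is the difference between the two overall scores. The following is a minimal sketch of reading those lines and reproducing the "+2.3%" figure, assuming the lead is a plain subtraction of the displayed scores. The names `Evidence`, `parse_evidence`, and `lead` are hypothetical and not part of the site. The sketch does not attempt to reproduce how the overall score is derived from the individual signals, since the page does not state that.

```python
import re
from dataclasses import dataclass

# Matches evidence lines such as "Value 88.3% · Conf 100.0% · Weight 3.1%".
EVIDENCE_RE = re.compile(
    r"Value\s+([\d.]+)%\s*·\s*Conf\s+([\d.]+)%\s*·\s*Weight\s+([\d.]+)%"
)


@dataclass(frozen=True)
class Evidence:
    """One evidence signal as displayed on the page (hypothetical container)."""
    value_pct: float
    conf_pct: float
    weight_pct: float


def parse_evidence(line: str) -> Evidence:
    """Extract the three percentages from a single evidence line."""
    match = EVIDENCE_RE.search(line)
    if match is None:
        raise ValueError(f"not an evidence line: {line!r}")
    return Evidence(*(float(group) for group in match.groups()))


def lead(score_a_pct: float, score_b_pct: float) -> float:
    """Signed lead of model A over model B, rounded to one decimal place.

    Assumption: the displayed lead is a simple difference of overall scores.
    """
    return round(score_a_pct - score_b_pct, 1)


if __name__ == "__main__":
    print(parse_evidence("Value 88.3% · Conf 100.0% · Weight 3.1%"))
    print(f"Model A leads by {lead(35.1, 32.8):+.1f}%")
```

With the scores shown for gemini-3-pro-preview (35.1%) and gemini-2.5-pro (32.8%), `lead` returns 2.3, which matches the figure on the page.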

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 48
Sources: 8
Quality: Sufficient

Vals CorpFin v2

vals_corp_fin_v2

42 rows

1.7% avg lift

Vals Legal Bench

vals_legal_bench

39 rows

0.4% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

37 rows

0.4% avg lift

Vals MedQA

vals_medqa

36 rows

0.4% avg lift

Missing Strong Models

gpt-4o

external/openai/gpt-4o

Rank #22

15.2%

Thin evidence after weighting

deepseek/deepseek-r1

external/deepseek/deepseek-r1

Rank #54

10.5%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.runbook_step_suggestion
task.rag_answer_no_citations

Required Modes

none

Domains

domain.devops_sre

Related Use Cases