devops_sre
Runbook step assistant
Suggest safe runbook steps and escalation points grounded in docs.
#1 Recommendation
gemini-3-pro-preview
Strong on the FACTS Benchmark Suite: facts_grounding_score_pct (88%) and facts_search_score_pct (100%)
external/google/gemini-3-pro-preview
Score: 35.1%
Confidence: 45.7%
Ranked Models: 30
Evidence Quality: 85%
Scoring: Benchmark-backed
Top Signal: FACTS Benchmark Suite: facts_grounding_score_pct
All Ranked Models
Compare Models
Model A leads by +2.3%
Model A
gemini-3-pro-preview
external/google/gemini-3-pro-preview
Rank #1
FACTS Benchmark Suite: facts_grounding_score_pct
Value 88.3% · Conf 100.0% · Weight 3.1%
facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)
FACTS Benchmark Suite: facts_search_score_pct
Value 100.0% · Conf 100.0% · Weight 2.7%
facts_benchmark_suite.facts_search_score_pct (Mar 12, 2026)
FACTS Benchmark Suite: average_score_pct
Value 100.0% · Conf 100.0% · Weight 2.5%
facts_benchmark_suite.average_score_pct (Mar 12, 2026)
Vals Finance Agent: overall_accuracy_pct
Value 87.0% · Conf 100.0% · Weight 2.5%
vals_finance_agent.overall_accuracy_pct (Mar 12, 2026)
Model B
gemini-2.5-pro
external/google/gemini-2-5-pro
Rank #2
FACTS Benchmark Suite: facts_grounding_score_pct
Value 100.0% · Conf 100.0% · Weight 3.5%
facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)
Vectara HHEM Leaderboard: overall_hallucination_error_pct
Value 76.0% · Conf 100.0% · Weight 2.8%
vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 58.7% · Conf 100.0% · Weight 2.6%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Vals CorpFin v2: overall_accuracy_pct
Value 78.4% · Conf 100.0% · Weight 2.2%
vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)
Ranking Diagnostics & Missing Models
Source Lift
Ranked: 48
Sources: 8
Quality: Sufficient
Vals CorpFin v2 (vals_corp_fin_v2) · 42 rows · 1.7% avg lift
Vals Legal Bench (vals_legal_bench) · 39 rows · 0.4% avg lift
Vals Tax Eval v2 (vals_tax_eval_v2) · 37 rows · 0.4% avg lift
Vals MedQA (vals_medqa) · 36 rows · 0.4% avg lift
Missing Strong Models
gpt-4o
external/openai/gpt-4o
Rank #22
15.2%
deepseek/deepseek-r1
external/deepseek/deepseek-r1
Rank #54
10.5%
Taxonomy Details
Core Tasks
Required Modes
Domains
Related Use Cases
devops_sre
Log triage
Interpret logs and propose safe diagnostic steps.
Top: gemini-3-pro-preview
devops_sre
Config debugging
Diagnose and patch YAML/JSON/TOML configs with minimal diffs.
Top: gpt-4.1-20250414
devops_sre
Kubernetes manifest generation
Generate K8s manifests with safe defaults and probes.
Top: gpt-4.1-20250414
devops_sre
Terraform generation
Generate Terraform IaC with correct resources and safe defaults.
Top: gpt-4.1-20250414