Category: DevOps / SRE
Best LLM for Terraform
Ranked models for generating Terraform IaC with correct resources and safe defaults.
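To make "correct resources and safe defaults" concrete, here is a minimal sketch of the kind of output the ranking rewards: a private S3 bucket where public access is blocked and encryption and versioning are on by default. The bucket name and resource labels are hypothetical, not taken from the benchmark data.

```hcl
# Illustrative only: a private S3 bucket with the safeguards a model
# should emit by default. Names are placeholders.
resource "aws_s3_bucket" "logs" {
  bucket = "example-terraform-eval-logs" # hypothetical name
}

# Safe default: block all forms of public access.
resource "aws_s3_bucket_public_access_block" "logs" {
  bucket                  = aws_s3_bucket.logs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Safe default: keep object history for recovery.
resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Safe default: encrypt objects at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```

A model that emits the bucket without the accompanying public-access block or encryption resources would score as "correct resources" but fail "safe defaults".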
Full analysis available: benchmark methodology, patterns in the data, and deployment notes.
#1 Recommendation: anthropic/claude-sonnet-4
Strong on Galileo Agent Leaderboard v2 (Avg AC) and SWE-bench Verified (% resolved).

| Score | Confidence | Evidence | Price |
|---|---|---|---|
| 25.7% | 35.0% | 23 points | $6.00 per 1M tokens |
- Ranked models: 30
- Evidence quality: 96%
- Evidence points: 23
- Top signal: Galileo Agent Leaderboard v2 (Avg AC)
- Benchmark sources: 33
- Last updated: 19h ago
All Ranked Models
| Rank | Model | Strong On | Score |
|---|---|---|---|
| 🥇 | claude-sonnet-4 | Galileo Agent Leaderboard v2 (Avg AC); SWE-bench Verified (% resolved) | 25.7% |
| 🥈 | qwen-2.5-72b-instruct | Open LLM Leaderboard MMLU-Pro (accuracy); Galileo Agent Leaderboard v2 (Avg AC) | 24.0% |
| 🥉 | gemini-2.5-pro | SWE-bench Verified (% resolved); Galileo Agent Leaderboard v2 (Avg AC) | 22.5% |
| #4 | gpt-4.1-20250414 | Galileo Agent Leaderboard v2 (Avg AC); SWE-bench Verified (% resolved) | 22.2% |
| #5 | gemini-3-pro-preview | Berkeley Function Calling Leaderboard (overall accuracy); SWE-bench Verified (% resolved) | 21.1% |
| #6 | o3-20250416 | SWE-bench Verified (% resolved); Berkeley Function Calling Leaderboard (overall accuracy) | 20.4% |
| #7 | gpt-5-2025-08-07 | SWE-bench Verified (% resolved); Aider Polyglot Leaderboard (% correct) | 20.4% |
| #8 | Grok-4-0709 | Berkeley Function Calling Leaderboard (overall accuracy); Galileo Agent Leaderboard v2 (Avg AC) | 20.4% |
| #9 | gpt-5.2-2025-12-11 | SWE-bench Verified (% resolved); Berkeley Function Calling Leaderboard (overall accuracy) | 19.2% |
| #10 | Steelskull/L3.3-MS-Nevoria-70b | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 17.8% |
| #11 | MaziyarPanahi/calme-3.2-instruct-78b | Open LLM Leaderboard MMLU-Pro (accuracy); Open LLM Leaderboard GPQA | 17.7% |
| #12 | Steelskull/L3.3-Nevoria-R1-70b | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 17.5% |
| #13 | Mistral-Large-Instruct-2411 | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 17.5% |
| #14 | claude-opus-4-5-20251101 | SWE-bench Verified (% resolved); Berkeley Function Calling Leaderboard (overall accuracy) | 17.3% |
| #15 | MaziyarPanahi/calme-2.4-rys-78b | Open LLM Leaderboard MMLU-Pro (accuracy); Open LLM Leaderboard GPQA | 17.2% |
| #16 | MaziyarPanahi/calme-3.1-instruct-78b | Open LLM Leaderboard MMLU-Pro (accuracy); Open LLM Leaderboard GPQA | 17.2% |
| #17 | Tarek07/Progenitor-V1.1-LLaMa-70B | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 17.1% |
| #18 | CalmeRys-78B-Orpo-v0.1 | Open LLM Leaderboard MMLU-Pro (accuracy); Open LLM Leaderboard GPQA | 17.1% |
| #19 | phi-4 | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 16.7% |
| #20 | Apollo-70B | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 16.6% |
| #21 | Triangle104/Set-70b | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 16.6% |
| #22 | Sao10K/70B-L3.3-Cirrus-x1 | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 16.6% |
| #23 | gpt-5-mini-2025-08-07 | SWE-bench Verified (% resolved); Vals MedQA (overall accuracy) | 16.6% |
| #24 | Homer-v1.0-Qwen2.5-72B | Open LLM Leaderboard MMLU-Pro (accuracy); Open LLM Leaderboard GPQA | 16.5% |
| #25 | Tarek07/Thalassic-Alpha-LLaMa-70B | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 16.5% |
| #26 | Sakalti/ultiima-72B-v1.5 | Open LLM Leaderboard MMLU-Pro (accuracy); Open LLM Leaderboard GPQA | 16.2% |
| #27 | T3Q-qwen2.5-14b-v1.0-e3 | Open LLM Leaderboard MMLU-Pro (accuracy); Open LLM Leaderboard GPQA | 16.1% |
| #28 | JungZoona/T3Q-Qwen2.5-14B-Instruct-1M-e3 | Open LLM Leaderboard MMLU-Pro (accuracy); Open LLM Leaderboard GPQA | 16.1% |
| #29 | gemini-2.5-flash | Berkeley Function Calling Leaderboard (overall accuracy); Galileo Agent Leaderboard v2 (Avg AC) | 16.1% |
| #30 | Llama3.3-70B-CogniLink | Open LLM Leaderboard GPQA; Open LLM Leaderboard MMLU-Pro (accuracy) | 16.1% |
Head-to-Head: #1 vs #2
| Rank | Model | Strong On | Confidence |
|---|---|---|---|
| #1 (Top Pick) | anthropic/claude-sonnet-4 | Galileo Agent Leaderboard v2 (Avg AC); SWE-bench Verified (% resolved) | 35.0% |
| #2 | qwen-2.5-72b-instruct | Open LLM Leaderboard MMLU-Pro (accuracy); Galileo Agent Leaderboard v2 (Avg AC) | 35.6% |
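If you adopt the top pick, the main lever you control is the prompt. Below is a minimal sketch, assuming a Messages-API-style request shape; the system prompt, model id string, and `max_tokens` value are illustrative assumptions, not part of the ranking data.

```python
def build_terraform_request(task: str, model: str = "claude-sonnet-4") -> dict:
    """Build a Messages-API-style payload asking a model for Terraform.

    The system prompt bakes the 'safe defaults' requirement into every
    request rather than relying on the model to infer it.
    """
    system = (
        "You are a Terraform author. Emit valid HCL only. "
        "Default to least privilege, encryption at rest, and no public access."
    )
    return {
        "model": model,
        "max_tokens": 2048,
        "system": system,
        "messages": [{"role": "user", "content": task}],
    }

payload = build_terraform_request("Create a private S3 bucket for CI logs.")
```

Pinning the safety constraints in the system prompt keeps them stable across tasks, so per-request prompts can stay short.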
Related Lookups
- Best LLM for Code Generation: Benchmark-backed ranking of models for generating correct, secure code from requirements.
- Best LLM for Debugging: Find the top-ranked models for localizing bugs and proposing fixes with explanations.
- Best LLM for Unit Test Generation: Ranked models for generating meaningful unit tests and edge cases from code.
- Best LLM for Code Review: Compare models for automated PR review covering correctness, security, and maintainability.
- Best LLM for Autonomous Coding: Benchmark-backed ranking of models for end-to-end autonomous software engineering and issue resolution.
- Best LLM for Function Calling: Compare models for reliable tool use, function selection, and multi-step API orchestration.