hr_recruiting
Job description drafting
Draft job descriptions that match role requirements and tone.
#1 Recommendation
gpt-4.1-20250414
Strong on Galileo Agent Leaderboard v2 Avg TSQ (64%) and MMLongBench-Doc Leaderboard acc_score_pct (75%)
external/openai/gpt-4-1-20250414
Score: 23.7%
Confidence: 36.3%
Limited benchmark evidence for this use case.
31 ranked models with average evidence of 13.0 points. Rankings may shift as more benchmark data is ingested.
Ranked Models: 30
Evidence Quality: 79%
Scoring: Benchmark-backed
Top Signal: Galileo Agent Leaderboard v2: Avg TSQ
All Ranked Models
| Rank | Model | Score | Strengths |
|---|---|---|---|
| #1 | gpt-4.1-20250414 | 23.7% | Strong on Galileo Agent Leaderboard v2 Avg TSQ (64%) and MMLongBench-Doc Leaderboard acc_score_pct (75%) |
| #2 | gemini-2.5-flash | 17.7% | Strong on Galileo Agent Leaderboard v2 Avg TSQ (100%) and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct (100%) |
| #3 | gpt-4.1-mini-20250414 | 17.5% | Strong on Galileo Agent Leaderboard v2 Avg TSQ (62%) and OpenVLM OCRBench Official ocrbench_score_pct (88%) |
| #5 | gemini-2.5-pro | 15.8% | |
| #6 | gpt-4o | 15.0% | |
| #12 | Grok-4-0709 | 12.6% | |
| #13 | claude-sonnet-4-20250514 | 12.6% | |
| #14 | qwen-2.5-72b-instruct | 12.6% | |
| #20 | gpt-5-2025-08-07 | 11.5% | |
| #23 | google/gemini-2.0-flash-001 | 11.0% | |
| #25 | gpt-5-mini-2025-08-07 | 10.9% | |
| #29 | gemini-3-pro-preview | 10.6% | |
| #55 | google/gemini-3.1-pro-preview | 9.6% | |
| #68 | Llama-2-7b-chat-hf | 9.0% | |
| #87 | openai/gpt-5.4-2026-03-05 | 8.7% | |
| #100 | gpt-5.1-2025-11-13 | 8.4% | |
| #111 | anthropic/claude-sonnet-4.6 | 8.4% | |
| #113 | claude-opus-4-5-20251101 | 8.3% | |
| #117 | Qwen3-Embedding-4B | 8.2% | |
| #120 | GPT-4.1-nano-2025-04-14 | 8.1% | |
| #127 | gemma-7b-it | 7.9% | |
| #144 | Qwen-VL-Chat | 7.6% | |
| #150 | Llama-3.1-70B-Instruct | 7.4% | |
| #161 | gemma-2b-it | 7.2% | |
| #178 | xai-org/grok-4-fast-reasoning | 6.9% | |
| #179 | gpt-4o-20241120 | 6.9% | |
| #211 | xai-org/grok-4-1-fast-reasoning | 6.5% | |
| #219 | deepseek/deepseek-r1 | 6.5% | |
| #260 | openai/gpt-4o-mini-2024-07-18 | 5.9% | |
| #288 | phi-4 | 5.5% | |
Compare Models
Model A leads by 6.0 percentage points
Model A
gpt-4.1-20250414
external/openai/gpt-4-1-20250414
Rank #1
Galileo Agent Leaderboard v2: Avg TSQ
Value 64.1% · Conf 100.0% · Weight 2.5%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
MMLongBench-Doc Leaderboard: acc_score_pct
Value 74.6% · Conf 100.0% · Weight 2.5%
mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)
OpenVLM OCRBench Official: ocrbench_score_pct
Value 87.7% · Conf 100.0% · Weight 2.3%
openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)
OpenVLM TextVQA Official: textvqa_score_pct
Value 76.8% · Conf 100.0% · Weight 2.0%
openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)
Model B
gemini-2.5-flash
external/google/gemini-2-5-flash
Rank #2
Galileo Agent Leaderboard v2: Avg TSQ
Value 100.0% · Conf 100.0% · Weight 4.0%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct
Value 100.0% · Conf 100.0% · Weight 2.3%
languagebench_grammar_clarity_official.grammar_clarity_score_pct (Mar 12, 2026)
LanguageBench Translation Official (Split): translation_to:bleu
Value 92.0% · Conf 100.0% · Weight 2.1%
languagebench_translation_official.translation_to_bleu (Mar 12, 2026)
LanguageBench: overall:mean
Value 100.0% · Conf 100.0% · Weight 2.1%
languagebench.overall_mean (Mar 12, 2026)
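The comparison figures can be checked with a short script. The sketch below recomputes the lead from the two displayed scores and, as an illustration only, tallies the four listed signals per model. The formula `value × confidence × weight` and the names `SIGNALS`, `SCORES`, `lead_in_points`, and `listed_contribution` are assumptions made for this example; the page does not state how the score is actually computed, and only the top four signals per model are listed, so these partial sums are not expected to reproduce the displayed scores.

```python
# Top signals listed for each model: (metric key, value %, confidence %, weight %).
SIGNALS = {
    "gpt-4.1-20250414": [
        ("galileo_agent_v2.avg_tsq", 64.1, 100.0, 2.5),
        ("mmlongbench_doc_leaderboard.acc_score_pct", 74.6, 100.0, 2.5),
        ("openvlm_ocrbench_official.ocrbench_score_pct", 87.7, 100.0, 2.3),
        ("openvlm_textvqa_official.textvqa_score_pct", 76.8, 100.0, 2.0),
    ],
    "gemini-2.5-flash": [
        ("galileo_agent_v2.avg_tsq", 100.0, 100.0, 4.0),
        ("languagebench_grammar_clarity_official.grammar_clarity_score_pct", 100.0, 100.0, 2.3),
        ("languagebench_translation_official.translation_to_bleu", 92.0, 100.0, 2.1),
        ("languagebench.overall_mean", 100.0, 100.0, 2.1),
    ],
}

# Overall scores as displayed in the ranking table (percent).
SCORES = {"gpt-4.1-20250414": 23.7, "gemini-2.5-flash": 17.7}


def lead_in_points(model_a: str, model_b: str) -> float:
    """Difference between two displayed scores, in percentage points."""
    return round(SCORES[model_a] - SCORES[model_b], 1)


def listed_contribution(model: str) -> float:
    """Sum of value * confidence * weight over the listed signals.

    ASSUMPTION: this weighting formula is hypothetical; the page does not
    document how signals are combined into the final score.
    """
    return sum(v / 100 * c / 100 * w for _, v, c, w in SIGNALS[model])


if __name__ == "__main__":
    print(lead_in_points("gpt-4.1-20250414", "gemini-2.5-flash"))
    for name in SIGNALS:
        print(name, round(listed_contribution(name), 3))
```

The lead works out to 6.0 points, matching the figure shown for the comparison.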
Ranking Diagnostics & Missing Models
Ranked: 31
Sources: 8
Quality: Insufficient
Source Lift
| Source | Key | Rows | Avg lift |
|---|---|---|---|
| Vals Legal Bench | vals_legal_bench | 18 | 0.5% |
| Vals Tax Eval v2 | vals_tax_eval_v2 | 18 | 0.5% |
| Vals CorpFin v2 | vals_corp_fin_v2 | 18 | 0.4% |
| Vals MedQA | vals_medqa | 17 | 0.5% |
Missing Strong Models
| Model | Identifier | Rank | Score |
|---|---|---|---|
| gemini-3-flash-preview | external/google/gemini-3-flash-preview | #15 | 16.2% |
| gpt-5.2-2025-12-11 | external/openai/gpt-5-2-2025-12-11 | #16 | 16.2% |
| anthropic/claude-opus-4-6-thinking | external/anthropic/claude-opus-4-6-thinking | #17 | 16.1% |
| google/gemini-3.1-flash-lite-preview | external/google/gemini-3-1-flash-lite-preview | #19 | 15.6% |
Related Use Cases
| Category | Use case | Description | Top model |
|---|---|---|---|
| hr_recruiting | HR policy Q&A | Answer HR policy questions grounded in authoritative text. | gemini-3-pro-preview |
| hr_recruiting | Candidate summary memo | Summarize a candidate profile into strengths, gaps, and questions. | gpt-4.1-20250414 |
| hr_recruiting | Interview question bank | Generate structured interview questions and rubrics for a role. | gpt-4.1-20250414 |
| hr_recruiting | Resume structuring | Extract structured candidate profiles from resumes. | gpt-4.1-20250414 |