BasedAGI

hr_recruiting

Job description drafting

Draft job descriptions that match role requirements and tone.

#1 Recommendation

gpt-4.1-20250414

Strong on Galileo Agent Leaderboard v2 Avg TSQ (64%) and MMLongBench-Doc Leaderboard acc_score_pct (75%)

external/openai/gpt-4-1-20250414

Score: 23.7%

Confidence: 36.3%

Limited benchmark evidence for this use case.

31 ranked models with average evidence of 13.0 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 30

Evidence Quality: 79%

Scoring: Benchmark-backed

Top Signal: Galileo Agent Leaderboard v2: Avg TSQ

All Ranked Models

Showing 30 of 30 models
Rank  Model                            Score
#1    gpt-4.1-20250414                 23.7%
      Strong on Galileo Agent Leaderboard v2 Avg TSQ (64%) and MMLongBench-Doc Leaderboard acc_score_pct (75%)
#2    gemini-2.5-flash                 17.7%
      Strong on Galileo Agent Leaderboard v2 Avg TSQ (100%) and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct (100%)
#3    gpt-4.1-mini-20250414            17.5%
      Strong on Galileo Agent Leaderboard v2 Avg TSQ (62%) and OpenVLM OCRBench Official ocrbench_score_pct (88%)
#5    gemini-2.5-pro                   15.8%
#6    gpt-4o                           15.0%
#12   Grok-4-0709                      12.6%
#13   claude-sonnet-4-20250514         12.6%
#14   qwen-2.5-72b-instruct            12.6%
#20   gpt-5-2025-08-07                 11.5%
#23   google/gemini-2.0-flash-001      11.0%
#25   gpt-5-mini-2025-08-07            10.9%
#29   gemini-3-pro-preview             10.6%
#55   google/gemini-3.1-pro-preview    9.6%
#68   Llama-2-7b-chat-hf               9.0%
#87   openai/gpt-5.4-2026-03-05        8.7%
#100  gpt-5.1-2025-11-13               8.4%
#111  anthropic/claude-sonnet-4.6      8.4%
#113  claude-opus-4-5-20251101         8.3%
#117  Qwen3-Embedding-4B               8.2%
#120  GPT-4.1-nano-2025-04-14          8.1%
#127  gemma-7b-it                      7.9%
#144  Qwen-VL-Chat                     7.6%
#150  Llama-3.1-70B-Instruct           7.4%
#161  gemma-2b-it                      7.2%
#178  xai-org/grok-4-fast-reasoning    6.9%
#179  gpt-4o-20241120                  6.9%
#211  xai-org/grok-4-1-fast-reasoning  6.5%
#219  deepseek/deepseek-r1             6.5%
#260  openai/gpt-4o-mini-2024-07-18    5.9%
#288  phi-4                            5.5%

Compare Models

Model A leads by +6.0%


Model A

gpt-4.1-20250414

external/openai/gpt-4-1-20250414

23.7%

Rank #1

Confidence 36.3% · 24 evidence pts

Galileo Agent Leaderboard v2: Avg TSQ

Value 64.1% · Conf 100.0% · Weight 2.5%

galileo_agent_v2.avg_tsq (Mar 12, 2026)

MMLongBench-Doc Leaderboard: acc_score_pct

Value 74.6% · Conf 100.0% · Weight 2.5%

mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)

OpenVLM OCRBench Official: ocrbench_score_pct

Value 87.7% · Conf 100.0% · Weight 2.3%

openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)

OpenVLM TextVQA Official: textvqa_score_pct

Value 76.8% · Conf 100.0% · Weight 2.0%

openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)
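
Each signal line above has the shape "Value X% · Conf Y% · Weight Z%". A minimal Python sketch for reading such a line into a structured record follows. The names `Signal`, `parse_signal`, and `contribution` are hypothetical, and the product value × confidence × weight is only an illustrative assumption; the page does not state how the overall score is derived from these fields.

```python
import re
from dataclasses import dataclass


@dataclass
class Signal:
    """One benchmark signal, with all fields stored as fractions (0.641 for 64.1%)."""
    value: float
    conf: float
    weight: float

    def contribution(self) -> float:
        # ASSUMPTION: a simple product, used for illustration only.
        # The page does not document its actual scoring formula.
        return self.value * self.conf * self.weight


# Matches lines such as "Value 64.1% · Conf 100.0% · Weight 2.5%".
_SIGNAL_RE = re.compile(
    r"Value\s+([\d.]+)%\s*·\s*Conf\s+([\d.]+)%\s*·\s*Weight\s+([\d.]+)%"
)


def parse_signal(line: str) -> Signal:
    """Parse a 'Value · Conf · Weight' line into a Signal."""
    match = _SIGNAL_RE.search(line)
    if match is None:
        raise ValueError(f"not a signal line: {line!r}")
    value, conf, weight = (float(g) / 100.0 for g in match.groups())
    return Signal(value=value, conf=conf, weight=weight)


if __name__ == "__main__":
    sig = parse_signal("Value 64.1% · Conf 100.0% · Weight 2.5%")
    print(sig, sig.contribution())
```

Storing the fields as fractions rather than percentages keeps any later arithmetic free of stray factors of 100.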

Model B

gemini-2.5-flash

external/google/gemini-2-5-flash

17.7%

Rank #2

Confidence 21.2% · 16 evidence pts

Galileo Agent Leaderboard v2: Avg TSQ

Value 100.0% · Conf 100.0% · Weight 4.0%

galileo_agent_v2.avg_tsq (Mar 12, 2026)

LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct

Value 100.0% · Conf 100.0% · Weight 2.3%

languagebench_grammar_clarity_official.grammar_clarity_score_pct (Mar 12, 2026)

LanguageBench Translation Official (Split): translation_to:bleu

Value 92.0% · Conf 100.0% · Weight 2.1%

languagebench_translation_official.translation_to_bleu (Mar 12, 2026)

LanguageBench: overall:mean

Value 100.0% · Conf 100.0% · Weight 2.1%

languagebench.overall_mean (Mar 12, 2026)
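
The "Model A leads by +6.0%" headline in this comparison is consistent with a plain difference of the two scores (23.7% minus 17.7%), that is, a gap in percentage points. The sketch below reproduces that arithmetic in Python; `lead_margin` is a hypothetical helper, and the assumption that the site computes the lead this way is inferred from the numbers, not stated on the page.

```python
def lead_margin(score_a: float, score_b: float) -> str:
    """Describe which model leads, given two scores in percent.

    ASSUMPTION: the lead is the simple difference of the two scores,
    rounded to one decimal place.
    """
    diff = round(score_a - score_b, 1)
    if diff > 0:
        return f"Model A leads by +{diff:.1f}%"
    if diff < 0:
        return f"Model B leads by +{-diff:.1f}%"
    return "Models are tied"


if __name__ == "__main__":
    # Scores taken from the comparison: gpt-4.1-20250414 vs gemini-2.5-flash.
    print(lead_margin(23.7, 17.7))
```

Rounding before the comparison avoids floating-point noise such as 5.999999999999998 being reported in place of 6.0.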

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 31

Sources: 8

Quality: Insufficient

Source            ID                Rows  Avg lift
Vals Legal Bench  vals_legal_bench  18    0.5%
Vals Tax Eval v2  vals_tax_eval_v2  18    0.5%
Vals CorpFin v2   vals_corp_fin_v2  18    0.4%
Vals MedQA        vals_medqa        17    0.5%

Missing Strong Models

gemini-3-flash-preview

external/google/gemini-3-flash-preview

Rank #15

16.2%

Thin evidence after weighting

gpt-5.2-2025-12-11

external/openai/gpt-5-2-2025-12-11

Rank #16

16.2%

Thin evidence after weighting

anthropic/claude-opus-4-6-thinking

external/anthropic/claude-opus-4-6-thinking

Rank #17

16.1%

Thin evidence after weighting

google/gemini-3.1-flash-lite-preview

external/google/gemini-3-1-flash-lite-preview

Rank #19

15.6%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.write_report, task.rewrite_tone_style

Required Modes

none

Domains

domain.hr_recruiting

Related Use Cases