Model Profile

gpt-4o-2024-05-13

Rating: 2.7 (137 reviews)
Author: openai

External Benchmark Shadow · external_benchmark_shadow · public

4,096 ctx

Use this page to decide where this model is a strong fit. The use-case rankings below are backed by benchmark results, and each one lists its confidence, evidence count, and top contributing metric.

Identity

ID: external/openai/gpt-4o-2024-05-13

Author: openai

Origin: external_benchmark_shadow

Arch: unknown

Benchmark Coverage

Scored use cases: 12

Avg confidence: 34.0%

Evidence points: 137

Raw rows: 191

Weighted rows: 21

Catalog Metadata

Parameters: unknown

Context window: 4096

Downloads: 0

Intelligence Profile

Dimension Breakdown

IQ (3 benchmarks): 58.9%*

EQ (0 benchmarks): Insufficient data (no EQ benchmarks found)

Accuracy (1 benchmark): 48.2%*

Creativity (2 benchmarks): 89.0%*

Based (2 benchmarks): 46.8%*

* Low confidence — limited benchmark evidence for this dimension

4/5 dimensions scored · Last updated Apr 25, 2026

Benchmark Signals

LLM Trustworthy Leaderboard

privacy

4.7%

Normalized value 99.3% · confidence 100.0%

Strongest impact in Jailbreak resistance (eval)

llm_trustworthy_leaderboard.privacy · Mar 31, 2026

RepoQA Official Results

overall_average_pass_at_1_pct

4.6%

Normalized value 99.3% · confidence 100.0%

Strongest impact in Debugging assistant

repoqa_leaderboard.overall_average_pass_at_1_pct · Apr 1, 2026

LLM Trustworthy Leaderboard

adv

2.6%

Normalized value 60.7% · confidence 100.0%

Strongest impact in Jailbreak resistance (eval)

llm_trustworthy_leaderboard.adv · Mar 31, 2026

SWE-bench Verified Leaderboard

swe_verified_resolved_pct

2.6%

Normalized value 48.2% · confidence 100.0%

Strongest impact in Verilog/VHDL generation

swebench_verified_official.swe_verified_resolved_pct · Apr 1, 2026

BigCodeBench Official

bigcodebench_complete_pct

2.1%

Normalized value 97.6% · confidence 100.0%

Strongest impact in Verilog/VHDL generation

bigcodebench_official.bigcodebench_complete_pct · Apr 1, 2026

Aider Code Editing Leaderboard

percent_correct_pct

1.9%

Normalized value 82.3% · confidence 100.0%

Strongest impact in Simulation setup assistant

aider_code_editing.percent_correct_pct · Apr 1, 2026

Coverage Diagnostics

actively scored

Use-Case Scores: 113

Total Measurements: 191

Weighted Measurements

Weighted Sources

Raw Source Coverage

repoqa_leaderboard 74 · ugi_main 57 · llm_aggrefact_leaderboard 12 · vals_gpqa 12 · llm_trustworthy_leaderboard 8 · icelandic_llm_leaderboard 7

Weighted Source Coverage

llm_trustworthy_leaderboard 5 · bigcodebench_official 3 · ugi_main 3 · aider_code_editing 2 · llm_aggrefact_leaderboard 2 · repoqa_leaderboard 2

Best Use Cases for This Model

Use Case	Use Case ID	Vertical	Score	Confidence	Evidence	Top Contributor
Debugging assistant	use_case.dev.debugging	developer_tools	27.2%	35.5%	13	RepoQA Official Results: overall_average_pass_at_1_pct
Unit test generation	use_case.dev.test_generation	developer_tools	24.2%	30.8%	13	RepoQA Official Results: overall_average_pass_at_1_pct
Refactoring assistant	use_case.dev.refactoring	developer_tools	23.6%	32.3%	13	RepoQA Official Results: overall_average_pass_at_1_pct
Integration test generation	use_case.dev.integration_tests	developer_tools	23.0%	30.9%	13	RepoQA Official Results: overall_average_pass_at_1_pct
Verilog/VHDL generation	use_case.eda.verilog_generation	engineering	22.7%	32.6%	11	SWE-bench Verified Leaderboard: swe_verified_resolved_pct
Code Review Assistant	use_case.dev.code_review_assistant	developer_tools	22.6%	29.3%	13	RepoQA Official Results: overall_average_pass_at_1_pct
Jailbreak resistance (eval)	use_case.security.jailbreak_resistance_eval	risk_eval	19.4%	37.5%	10	LLM Trustworthy Leaderboard: privacy
Scam and social engineering resistance (eval)	use_case.security.scam_social_engineering_resistance_eval	risk_eval	19.4%	37.5%	10	LLM Trustworthy Leaderboard: privacy
Refusal profile (eval)	use_case.security.refusal_profile_eval	risk_eval	19.4%	37.5%	10	LLM Trustworthy Leaderboard: privacy
Overrefusal (eval)	use_case.security.overrefusal_eval	risk_eval	19.4%	37.5%	10	LLM Trustworthy Leaderboard: privacy
Crisis escalation protocol (eval)	use_case.safety.crisis_escalation_protocol	risk_eval	19.4%	37.5%	10	LLM Trustworthy Leaderboard: privacy
Simulation setup assistant	use_case.eng.simulation_setup_assistant	engineering	18.3%	28.2%	11	Aider Code Editing Leaderboard: percent_correct_pct
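
Because every score in the table carries its own confidence value, one way to read it is to drop rows below a chosen confidence floor and rank what remains by score. The Python sketch below illustrates that idea on five rows copied from the table. The `UseCaseScore` class, the `shortlist` helper, and the 32.0% threshold are hypothetical names and values chosen for illustration; they are not part of this page or of any published API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCaseScore:
    """One table row. Illustrative type, not defined by the source page."""
    name: str
    use_case_id: str
    vertical: str
    score: float        # percent, as shown in the Score column
    confidence: float   # percent, as shown in the Confidence column
    evidence: int       # evidence count


# Five rows copied from the table above (values unchanged).
ROWS = [
    UseCaseScore("Debugging assistant", "use_case.dev.debugging",
                 "developer_tools", 27.2, 35.5, 13),
    UseCaseScore("Unit test generation", "use_case.dev.test_generation",
                 "developer_tools", 24.2, 30.8, 13),
    UseCaseScore("Verilog/VHDL generation", "use_case.eda.verilog_generation",
                 "engineering", 22.7, 32.6, 11),
    UseCaseScore("Jailbreak resistance (eval)",
                 "use_case.security.jailbreak_resistance_eval",
                 "risk_eval", 19.4, 37.5, 10),
    UseCaseScore("Simulation setup assistant",
                 "use_case.eng.simulation_setup_assistant",
                 "engineering", 18.3, 28.2, 11),
]


def shortlist(rows, min_confidence):
    """Keep rows at or above min_confidence, ordered by score (highest first)."""
    kept = [r for r in rows if r.confidence >= min_confidence]
    return sorted(kept, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    # The 32.0 threshold is an arbitrary example value, not a recommendation.
    for row in shortlist(ROWS, min_confidence=32.0):
        print(f"{row.name}: score {row.score}%, confidence {row.confidence}%")
```

Filtering before ranking keeps a high score with thin evidence from outranking a slightly lower score with better support. The right threshold depends on how much uncertainty a given decision can tolerate.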