The term "open-source LLM" gets used loosely — and the distinction matters more than ever. Some models are fully open: weights, training code, and data. Others release weights with usage restrictions. Still others are closed entirely, available only through an inference API. This analysis focuses on open-weight models: those where you can download and run the weights yourself, regardless of licensing details.
Why does open-weight matter? Because it changes the entire deployment calculus. No per-token API fees: marginal inference cost at scale is just your own compute. Code and data that never leave your infrastructure. The ability to fine-tune on proprietary data. Direct control over latency, availability, and costs. For the right teams, these advantages outweigh the capability delta between open and closed models.
The question this report answers: which open-weight models are actually good enough to use?
The Open-Weight Landscape in 2026
The open-weight space has matured dramatically since 2023. A few structural facts about where it stands now:
Meta's Llama family dominates mindshare. Llama 3.x has become the reference point — the model everyone compares against, the checkpoint most often fine-tuned, and the architecture most often deployed in production. Not because it's necessarily the best at every task, but because it hit a quality threshold that made it practical while being freely available.
Mistral proved that efficiency matters. Mistral's models consistently punch above their weight class. The Mixtral architecture (mixture-of-experts) showed that you don't need to activate all parameters for all tokens — a design choice that other labs have since adopted. Mistral's open releases have been among the most downloaded on Hugging Face for two years running.
Qwen and DeepSeek have emerged as serious alternatives. The Chinese lab models — particularly Qwen 2.5 and DeepSeek-V3 and its variants — have reached quality levels that are genuinely competitive with the frontier. DeepSeek's reasoning model in particular surprised the industry with performance approaching o1-level at a fraction of the infrastructure cost.
Community fine-tunes are a separate tier. The base releases from these labs are the floor, not the ceiling. Community fine-tunes — applying RLHF, instruction tuning, and domain-specific training on top of open weights — have produced models that outperform the base releases on specific tasks. Evaluating open-weight options means evaluating both the base releases and the fine-tuned derivatives.
BGI Leaderboard: Open-Weight Models
The leaderboard below includes all models in the system — open and closed. Open-weight models that make the overall top-20 represent genuine frontier parity for their intended use cases.
Tier Breakdown
Tier 1: Frontier-Competitive
Open-weight models in this tier are genuinely competitive with the best closed-source alternatives on most practical tasks. The capability gap that existed in 2023-2024 has largely closed for general-purpose use.
What puts a model in this tier: strong across multiple benchmarks (not just one), consistent performance on real-world tasks (not just synthetic evals), and evidence of performance in agentic contexts where the model must chain reasoning across multiple steps.
Tier 2: Strong General Purpose
Models in this tier are excellent for most enterprise use cases — document analysis, question answering, summarization, basic coding, structured extraction. They represent meaningful capability at a much lower inference cost than Tier 1 models.
The practical sweet spot for many teams: Tier 2 open-weight models running on-premise handle 90% of tasks, with Tier 1 models (open or closed) reserved for the hardest 10%.
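This routing pattern can be sketched as a simple dispatcher. The model names, task categories, and the difficulty heuristic below are illustrative placeholders, not recommendations:

```python
# Hypothetical tier router: send routine tasks to a self-hosted Tier 2
# model and escalate the hard ~10% to a Tier 1 model (open or closed).
TIER2_MODEL = "llama-3.1-70b-instruct"  # placeholder self-hosted model
TIER1_MODEL = "frontier-model"          # placeholder escalation target

# Task types treated as "hard" for illustration only.
HARD_TASK_TYPES = {"multi_step_agent", "olympiad_math", "advanced_proof"}

def route(task_type: str, context_tokens: int) -> str:
    """Pick a model for a request; escalate hard or very long-context tasks."""
    if task_type in HARD_TASK_TYPES or context_tokens > 100_000:
        return TIER1_MODEL
    return TIER2_MODEL
```

In practice the heuristic would be replaced by a classifier or confidence signal, but the economics are the same: most traffic never touches the expensive tier.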
Tier 3: Efficient / Specialized
Smaller models optimized for specific tasks or hardware constraints. The best models at the sub-7B parameter tier can run on a single consumer GPU and perform surprisingly well on focused tasks. These are the right choice when latency, hardware constraints, or cost at extreme volume dominate other considerations.
The question is rarely "open vs. closed" in the abstract. It's "which open-weight model gets closest to the closed model I'd otherwise use, for this specific task, given my infrastructure constraints?" The answer is increasingly: close enough to use.
By Use Case: Where Open Models Excel vs. Struggle
Open-weight models are not uniformly competitive across all tasks. The capability gaps that remain are uneven:
Strong open-weight performance:
- Code generation and review (several open models match or beat GPT-4-class models on coding benchmarks)
- Text extraction and structured output (instruction-tuned models handle this reliably)
- Summarization and classification (well-studied, well-benchmarked tasks)
- RAG and document Q&A (retrieval quality matters more than model quality here)
- Translation and multilingual tasks (particularly Qwen models)
Remaining gaps:
- Very long context tasks (>100K tokens) — closed models have invested more in this
- Complex multi-step agentic tasks — the frontier is still closed-source
- Emotional nuance and tone calibration — an area where RLHF on larger closed models still shows
- Cutting-edge reasoning (math olympiad difficulty, advanced proof writing)
For most enterprise workloads, tasks in the "remaining gaps" category are a small minority. The large majority of real-world work is well-served by the best open-weight options.
Practical Deployment Considerations
Self-Hosted vs. Managed API
Open-weight models can be deployed several ways:
Self-hosted on your own GPU infrastructure — Maximum privacy, maximum control, requires infrastructure investment. Makes economic sense at high volume. Key components: inference server (vLLM, Ollama, TGI), load balancing, monitoring.
Managed open-weight APIs — Services like Together AI, Fireworks AI, and Groq run the same open-weight models on their infrastructure. You get the open-weight models via API without running the hardware. Lower privacy than true self-hosted, but dramatically simpler operations.
Hybrid — Use managed APIs for development and low-volume workloads, self-hosted for production high-volume. Common pattern in cost-optimized deployments.
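One reason the hybrid pattern is low-friction: the managed providers named above expose OpenAI-compatible chat endpoints, so switching between them (or to a self-hosted vLLM server) is mostly a base-URL change. A minimal sketch using only the standard library; the base URL and model name in the comment are illustrative, so check each provider's docs:

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Build an OpenAI-compatible chat completion request (not yet sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Illustrative usage against a hypothetical endpoint:
# req = build_chat_request("https://api.example-provider.com/v1",
#                          "YOUR_KEY", "some-open-weight-model", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```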
Hardware Requirements
Parameter count is the main driver of GPU memory requirements, but it's not the only factor. Quantization can reduce memory footprint significantly at moderate quality cost:
- 70B parameter models: Require ~140GB VRAM at full precision (fp16). With 4-bit quantization (~35GB), runnable on a pair of 24GB GPUs.
- 30-40B models: ~60-80GB fp16, ~15-20GB 4-bit. Single-GPU at 4-bit on 24GB cards.
- 7-13B models: Run on single consumer GPUs. The sweet spot for teams without GPU clusters.
Quality at quantization: 4-bit quantization typically costs 1-3 points on most benchmarks — meaningful but not catastrophic. 8-bit costs less than 1 point for most models.
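The memory figures above follow from a simple rule of thumb: parameter count times bytes per parameter. A sketch of that arithmetic, which deliberately ignores KV cache and activation memory (real deployments need headroom on top of this):

```python
def vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only memory estimate in GB: params × (bits / 8) bytes.
    Excludes KV cache and activations, which add real-world overhead."""
    return params_billion * bits_per_param / 8

# Matches the figures above:
# vram_gb(70, 16) -> 140.0   fp16 70B
# vram_gb(70, 4)  -> 35.0    4-bit 70B, fits on two 24GB GPUs
# vram_gb(7, 4)   -> 3.5     4-bit 7B, comfortable on one consumer GPU
```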
Licensing
"Open-weight" covers a range of licensing:
- Truly open (Apache 2.0, MIT): Mistral 7B, many Qwen variants — commercial use, modification, redistribution all permitted
- Community license (Meta Llama license): Free for most commercial use up to a user count threshold, requires attribution, some restrictions on derivatives
- Research only: Some models released for academic use only — check before commercial deployment
When "open-source LLM" appears in vendor marketing materials, verify which license applies. The distinction between Apache 2.0 and a restrictive research license is significant.
Provider-Specific Notes
Meta Llama
The most widely deployed open-weight family. Llama 3.x represents a genuine step-change from earlier versions — capable instruction following, strong coding performance, and a large ecosystem of fine-tunes and tooling built around it. The official instruction-tuned releases are solid; the community fine-tunes range from excellent to unreliable.
For enterprise deployments, the official Meta releases are the safest starting point. Evaluate community fine-tunes against your specific use cases before adopting them in production.
Mistral
Efficient and predictable. Mistral models tend to perform more consistently than their raw parameter count suggests. The Mixtral MoE architecture activates only a subset of parameters per token, reducing inference cost while maintaining quality. Strong technical performance; weaker on emotional nuance.
Mistral's commercial terms are clean, and being EU-based makes them appealing for European deployments with data residency requirements.
Qwen (Alibaba)
The surprise of the last 18 months. Qwen 2.5 and its variants have reached genuinely competitive quality on benchmarks, with particular strength in multilingual tasks and coding. The MoE variants offer the efficiency advantages of that architecture.
Licensing varies across Qwen releases — some are Apache 2.0, some have restrictions. Check per-release.
DeepSeek
The highest-profile surprise in recent open-weight releases. DeepSeek-V3 and the DeepSeek-R1 reasoning model attracted significant attention by matching or approaching closed-source frontier performance at dramatically lower reported training and inference costs.
DeepSeek-R1 in particular is worth evaluating for reasoning-heavy tasks. The distilled variants (smaller models trained to mimic the larger reasoning model) are among the best reasoning models at their size tier.
DeepSeek models are developed by a Chinese company and have raised questions about data handling and training practices. Organizations with sensitive data should evaluate the privacy implications of using any cloud-hosted inference service for these models, regardless of the weights being publicly available.
The Fine-Tuning Advantage
The strongest argument for open-weight models isn't the base performance — it's the ability to fine-tune. Closed models are frozen; you prompt them but can't change them. Open-weight models can be adapted to your specific domain with a relatively small amount of labeled data.
Fine-tuning delivers the most value when:
- Your domain has specialized terminology or formats (legal, medical, financial)
- You have high-volume, narrow tasks where a small model can match a large one after training
- Your task requires a specific output structure or style that's hard to achieve via prompting
- You have proprietary knowledge that you want the model to have without building a RAG system
The infrastructure required for fine-tuning has also become accessible. Full fine-tuning of 7B models is achievable on a single A100. LoRA/QLoRA adapters enable parameter-efficient fine-tuning at even lower resource requirements.
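Why LoRA-style adapters are so cheap follows from the parameter arithmetic: each adapted weight matrix gets two low-rank factors instead of full gradients. The hidden size, layer count, and rank below are illustrative of a Llama-style 7B model, not exact figures for any specific release:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """LoRA adds factors A (rank × d_in) and B (d_out × rank) per adapted
    weight matrix, i.e. rank * (d_in + d_out) trainable params each."""
    return n_matrices * rank * (d_in + d_out)

# Illustrative: hidden size 4096, adapting the q and v projections
# (2 matrices) in each of 32 layers, at rank 16.
hidden = 4096
trainable = lora_trainable_params(hidden, hidden, rank=16, n_matrices=2 * 32)
# ~8.4M trainable parameters against ~7B total — roughly 0.1% of the model.
```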
Choosing an Open-Weight Model
Rather than a single recommendation, the right choice depends on your constraints:
If performance is the top priority → Evaluate the Tier 1 models in the leaderboard above. Use the use-cases browser to check rankings for your specific task.
If infrastructure simplicity is the priority → Consider managed open-weight APIs (Together, Fireworks) running the same models. Same quality, no ops burden.
If you need to fine-tune → Llama 3.x has the largest ecosystem of tooling and community resources for fine-tuning. Start there unless you have specific reasons to prefer another base.
If cost at extreme scale is the priority → Evaluate smaller models with quantization. A 7B model at 4-bit running on-premise can cost two orders of magnitude less than API calls at high volume.
If you need multilingual support → Qwen variants consistently outperform other open-weight options on non-English tasks.
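The two-orders-of-magnitude cost claim can be checked with back-of-envelope arithmetic. Every price below is an illustrative placeholder, not a quote from any provider:

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Per-token API spend over a month of traffic."""
    return tokens_per_month / 1e6 * usd_per_million_tokens

def monthly_gpu_cost(gpu_usd_per_hour: float, n_gpus: int = 1) -> float:
    """Rented-GPU spend, assuming 24/7 operation over a 30-day month."""
    return gpu_usd_per_hour * n_gpus * 24 * 30

# Illustrative: 10B tokens/month at a frontier-API rate of $5/M tokens,
# versus one rented 24GB GPU at $0.50/hr serving a quantized 7B model.
api = monthly_api_cost(10e9, 5.00)  # 50,000.0
gpu = monthly_gpu_cost(0.50)        # 360.0 — roughly two orders of magnitude less
```

The comparison only holds when the small model's quality is acceptable for the workload, which is exactly the evaluation step the tier breakdown above is meant to support.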
For closed vs. open model comparison across providers, see the provider comparison report.
Full benchmark methodology at /methodology.