BasedAGI
Use Case Report

Best LLMs for Summarization

Summarization sounds simple — take long text, make it shorter. In practice it's one of the harder tasks to do well, because "good summary" is surprisingly hard to define and even harder to measure. A summary can be:

  • Faithful but uninformative (it only says things that were in the source, but misses what mattered)
  • Informative but unfaithful (it captures the key points but adds things that weren't there)
  • Concise but shallow (short, but the important nuances got lost)
  • Comprehensive but bloated (technically accurate, but barely shorter than the original)

The best summarization models navigate these tradeoffs well across different summarization types. The worst ones hallucinate confidently, miss the main point, or produce word salad that feels like a summary but communicates nothing.

Types of Summarization

"Summarization" covers meaningfully different tasks:

Extractive summarization — Select and present the most important sentences from the source. Lower risk of hallucination because the output is drawn from the source text. Less useful when the key insights require synthesis rather than extraction.

Abstractive summarization — Generate new sentences that capture the meaning of the source. Higher quality ceiling (can synthesize, reframe, and restructure) but also higher risk of hallucination and faithfulness failures.

Meeting and conversation summarization — Condense multi-party dialogue into action items, decisions, and key discussion points. Requires understanding conversational structure, speaker roles, and implicit decisions.

Research paper summarization — Compress technical papers into accessible summaries. Requires domain knowledge to identify what's novel, what the methodology is, and what the actual findings mean.

Document hierarchy summarization — Summarize long documents (50-100+ pages) while maintaining the structure and relationships between sections. This is where context window length and long-context handling matter most.
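The extractive approach above can be illustrated with a toy frequency-based sentence scorer that needs no model at all. This is a sketch of "select and present", not a production method; it assumes sentences in the passage are distinct:

```python
# Toy extractive summarization: score sentences by average word
# frequency and keep the top-k, presented in original source order.
import re
from collections import Counter

def extractive_summary(text: str, k: int = 3) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = set(sorted(sentences, key=score, reverse=True)[:k])
    # Re-emit selected sentences in source order so the summary reads
    # in the same sequence as the document.
    return " ".join(s for s in sentences if s in top)
```

Because the output is copied verbatim from the source, this kind of method cannot hallucinate, which is exactly the tradeoff the paragraph above describes.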


Faithfulness: The Critical Failure Mode

The most important quality dimension in summarization is faithfulness — does the summary only say things that are supported by the source? Hallucination in summarization is particularly insidious because:

  1. The output looks plausible — it's in the right register and style
  2. The reader assumes it reflects the source
  3. Downstream decisions get made based on information that wasn't there

Faithfulness failures happen in two ways:

Intrinsic hallucinations — The summary misrepresents the source (distorts, inverts, or misstates information that is in the document).

Extrinsic hallucinations — The summary adds information from outside the source that sounds related but wasn't in the document (a model drawing on its parametric knowledge to "fill in" what it thinks should be there).

Both are common. Extrinsic hallucinations are harder to catch because the added information might even be true — just not from the source being summarized.
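A crude automated screen for the bluntest extrinsic hallucinations is to flag summary sentences whose content words never appear in the source. The sketch below uses a bag-of-words heuristic with an assumed 30% novelty threshold; both the stopword list and the threshold are illustrative, and real faithfulness checking needs an entailment (NLI) model rather than word overlap:

```python
# Flag summary sentences with a high share of content words that are
# absent from the source vocabulary. Catches only blunt extrinsic
# hallucinations; paraphrases and intrinsic errors slip through.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "was", "it", "that", "this", "on", "for"}

def unsupported_sentences(source: str, summary: str) -> list[str]:
    src_vocab = set(re.findall(r"[a-z']+", source.lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", summary.strip()):
        content = [w for w in re.findall(r"[a-z']+", sent.lower())
                   if w not in STOPWORDS]
        novel = [w for w in content if w not in src_vocab]
        # Flag when a meaningful share of content words are novel.
        if content and len(novel) / len(content) > 0.3:
            flagged.append(sent)
    return flagged
```

Note the asymmetry: this check can only surface sentences that add vocabulary, which is why intrinsic hallucinations (wrong claims built from the source's own words) are the harder class to detect.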

Models with high Accuracy scores on the BGI dimension system tend to have lower hallucination rates in summarization. The correlation isn't perfect, but if you're choosing a model specifically for faithfulness-critical summarization, the Accuracy dimension rankings are your best leading indicator.

Long-Context Performance

Summarizing long documents (legal contracts, research papers, financial filings, book-length content) requires models that handle long context well. This is not just about context window size — it's about what the model actually does with that context.

Several known failure modes in long-context summarization:

Lost-in-the-middle — Models attend poorly to content in the middle of very long contexts. Information at the beginning and end of a document tends to be retained better. A document where the key finding is on page 47 of 100 may get missed.

Recency bias — Models summarizing long conversations or documents often over-represent what came at the end, at the expense of important earlier content.

Structural confusion — Long documents with complex hierarchies (chapters, sections, subsections) can confuse models about what level of abstraction to operate at.

For long-document summarization specifically, models have improved significantly in the last year — but evaluating on your specific document length and structure before choosing a model is still warranted.
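One common mitigation for lost-in-the-middle is map-then-reduce: summarize fixed-size chunks independently, then summarize the concatenated partial summaries, so mid-document content gets the same attention as the beginning and end. A minimal sketch, where `summarize` is a stand-in for whatever model call you use and the naive word-count chunker is an assumption (a real pipeline would chunk on section boundaries):

```python
# Map-then-reduce summarization over fixed-size chunks.
from typing import Callable

def chunked_summary(document: str,
                    summarize: Callable[[str], str],
                    chunk_words: int = 2000) -> str:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partials = [summarize(c) for c in chunks]   # map: one pass per chunk
    return summarize("\n\n".join(partials))     # reduce: summarize the summaries
```

The tradeoff is that the reduce step can lose cross-chunk relationships, which is one reason section-aware chunking tends to beat fixed-size splits on structured documents.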

When Summarization Quality Actually Matters

The value of better summarization varies enormously by context:

High impact:

  • Legal and compliance documents where missed provisions have real consequences
  • Research and due diligence where a missed study finding affects a decision
  • Executive briefings where the summary is all that gets read
  • Medical records where accuracy is safety-critical

Lower impact:

  • Content you'll read anyway (where the summary is just orientation)
  • Internal notes where approximate accuracy is sufficient
  • Repetitive structured documents where even basic summarization captures what matters

For high-stakes summarization, the quality gap between best and worst models is large enough to matter. For lower-stakes cases, mid-tier models are often sufficient and substantially cheaper.

Summarization Length and Format

Model choice interacts with prompt design. A few guidelines that hold across models:

Specify the target length. "Summarize in 3 bullet points" and "Summarize in 2 paragraphs" produce markedly different outputs than a bare "summarize this." Models calibrate detail and compression differently when given explicit length constraints.

Specify what matters. "Summarize focusing on action items and decisions" produces better meeting notes than "summarize this conversation." The more specific the instruction, the better the focus.

Ask for structure explicitly. For long documents, asking for section-by-section summaries followed by an overall summary often produces better results than asking for a single top-level summary, because it forces the model to process each section before synthesizing.
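The three guidelines above can be folded into a small prompt builder. The wording and parameter names here are illustrative defaults, not a canonical prompt:

```python
# Build a summarization prompt with explicit length, focus, and
# structure instructions, per the guidelines above.
def build_summary_prompt(text: str,
                         length: str = "3 bullet points",
                         focus: str = "key decisions and action items",
                         per_section: bool = True) -> str:
    structure = ("Summarize each section in turn, then give an "
                 "overall summary.\n" if per_section else "")
    return (f"Summarize the following in {length}, "
            f"focusing on {focus}.\n{structure}\n---\n{text}")
```

Keeping the length, focus, and structure knobs as parameters also makes it easy to A/B different instructions against the same documents when evaluating models.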

Chain-of-custody matters in high-stakes summarization. Knowing which model version summarized a document, when, and with what prompt is important for auditability. This is often overlooked in production deployments until there's an incident.
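A minimal custody record might capture the model identifier, the exact prompt, content hashes of the document and summary, and a timestamp. The field names below are a sketch, not a standard; adapt them to your own audit requirements:

```python
# Minimal chain-of-custody record for one summarization call.
import hashlib
import json
from datetime import datetime, timezone

def custody_record(model: str, prompt: str,
                   document: str, summary: str) -> str:
    return json.dumps({
        "model": model,
        "prompt": prompt,
        # Hashes let you prove later which exact texts were involved
        # without storing potentially sensitive content in the log.
        "doc_sha256": hashlib.sha256(document.encode()).hexdigest(),
        "summary_sha256": hashlib.sha256(summary.encode()).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
```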

Models with Specific Summarization Strengths

Based on benchmark evidence and production patterns:

Long-document summarization — Claude-class models have consistently shown strong performance on long-context tasks, particularly for documents requiring sustained attention to structure across 50K+ tokens.

Faithful, conservative summarization — Models with strong Accuracy dimension scores tend toward faithfulness over creativity — they're less likely to embellish, which is what you want for legal and compliance contexts.

Abstractive quality with high compression — GPT-4 class models tend to produce fluent, well-structured abstracts that read well even at high compression ratios.

Cost-efficient summarization at scale — For high-volume, lower-stakes summarization (news, internal documents, customer feedback), smaller open-weight models with good instruction tuning can handle 90% of the work at a fraction of the cost.

Related Use Cases

  • Contract review — High-stakes document comprehension with faithfulness requirements
  • Medical coding — Structured extraction from clinical documents
  • Provider comparison — How the major providers compare across dimensions including long-context handling

Full rankings at /use-cases. Methodology at /methodology.
