Summarization: a practical guide

Coordination intelligence by AstraNL. Addresses a need seen in 34 real open-source requests.

Summarization is one of the most-requested NLP features in open source. This guide covers the core approaches, implementation patterns, and gotchas you'll hit in production.

Core Approaches

Extractive summarization: Select and concatenate existing sentences from the source text. Fast, preserves original language, but limited to source content.

Abstractive summarization: Generate new sentences that capture meaning. More flexible and readable, but requires neural models and careful output validation.

Hybrid approach: Extract key sentences, then refine or compress them. Good middle ground for most use cases.

Implementation Steps

Step 1: Choose Your Model Baseline

Start with one of these proven patterns:

For documents under 2000 tokens: Use an instruction-tuned LLM with a simple prompt. Requires no fine-tuning.
For domain-specific content: Fine-tune a smaller model (e.g., BART, T5) on 500+ labeled examples.
For cost/latency constraints: Go extractive. Score sentences by TF-IDF or BM25 and keep the top k.
For real-time streams: Use an extractive sliding-window approach and update the summary incrementally.
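
As a minimal sketch of the extractive option above, the following scores each sentence by a smoothed TF-IDF sum and keeps the top k in source order. It treats each sentence as its own document for the IDF statistics. The names split_sentences and extractive_summary, the regex-based sentence splitter, and the length normalization are illustrative assumptions, not a reference implementation.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def extractive_summary(text, k=3):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    if len(sentences) <= k:
        return sentences
    tokenized = [re.findall(r'[a-z0-9]+', s.lower()) for s in sentences]
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(sentences)

    def score(tokens):
        if not tokens:
            return 0.0
        tf = Counter(tokens)
        # Smoothed IDF; dividing by length avoids favoring long sentences.
        total = sum(c * math.log((1 + n) / (1 + df[t])) for t, c in tf.items())
        return total / len(tokens)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Because it uses only the standard library and no model calls, a function like this can also serve as a cheap baseline to compare neural summaries against.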

Step 2: Set Up Length Control

This is critical. Token limits prevent runaway outputs:

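# text and compression_ratio are assumed to be defined by the caller
source_length = len(text.split())  # source length in words
target_length = int(source_length * compression_ratio)
# compression_ratio: 0.3 for aggressive, 0.5 for moderate

# For LLM prompts:
prompt = f"Summarize in {target_length} words: {text}"

# For token-budget models:
# rough word-to-token conversion, assuming about 4 tokens per 3 English words
max_tokens = min(int(target_length * 4 / 3), 512)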

Step 3: Implement Input Validation

Garbage in, garbage out is real:

Reject inputs shorter than 50 tokens (too little to summarize meaningfully).
Truncate inputs longer than the model's context window, or chunk them.
Strip HTML, normalize whitespace, remove control characters.
Check the language if your input is multilingual; many models degrade badly on mixed-language input.
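
The cleaning, minimum-length, and truncation rules above might be combined as in the sketch below. It approximates tokens with whitespace-separated words and strips tags with a crude regex instead of a real HTML parser; both are simplifying assumptions, and clean_text, validate_input, and the max_tokens default are hypothetical names and values.

```python
import html
import re

MIN_TOKENS = 50  # minimum input size from the list above


def clean_text(raw):
    """Strip HTML tags, decode entities, drop control chars, normalize whitespace."""
    text = re.sub(r'<[^>]+>', ' ', raw)  # crude tag stripper, not a full parser
    text = html.unescape(text)
    # Remove control characters except tab, newline, and carriage return.
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


def validate_input(raw, max_tokens=2000):
    """Return cleaned text truncated to max_tokens words.

    Raises ValueError when the cleaned input is shorter than MIN_TOKENS.
    """
    tokens = clean_text(raw).split()
    if len(tokens) < MIN_TOKENS:
        raise ValueError(f"input has {len(tokens)} tokens; need at least {MIN_TOKENS}")
    return ' '.join(tokens[:max_tokens])
```

In a real pipeline you would count tokens with the tokenizer of the model you call, since word counts can underestimate token counts.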

Step 4: Add Output Validation

Don't ship broken summaries:

Coherence: Check for repeated sentences or fragments.
Length: Reject summaries longer than 150% of the target (a sign of failed truncation).
Extractiveness: If you want abstractive output, detect copy-paste from source and flag it.
Fluency (optional): Run a perplexity check or test for known garbled patterns.

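import re

# Quick coherence check: flags text in which any run of `window`
# consecutive sentences appears more than once.
def has_excessive_repetition(text, window=3):
    # Split after sentence-ending punctuation; normalize case and whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text)
    sentences = [s.strip().lower() for s in parts if s.strip()]
    seen = set()
    for i in range(len(sentences) - window + 1):
        run = tuple(sentences[i:i + window])
        if run in seen:
            return True
        seen.add(run)
    return False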
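
The length and extractiveness checks from the list above could be sketched along the same lines. Measuring copying as the share of the summary's word 4-grams that also occur in the source is one possible heuristic, assumed here for illustration; copied_fraction, length_ok, and the defaults are hypothetical.

```python
def word_ngrams(text, n):
    """Return the list of lowercase word n-grams in text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def copied_fraction(summary, source, n=4):
    """Fraction of the summary's word n-grams that also appear in the source."""
    grams = word_ngrams(summary, n)
    if not grams:
        return 0.0
    source_grams = set(word_ngrams(source, n))
    return sum(g in source_grams for g in grams) / len(grams)


def length_ok(summary, target_words, tolerance=1.5):
    """True if the summary stays within tolerance * target_words (150% by default)."""
    return len(summary.split()) <= target_words * tolerance
```

A value near 1.0 from copied_fraction suggests the output is mostly lifted from the source; the cutoff at which to flag it depends on your domain and is left to the caller.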

Step 5: Handle Chunking for Long Texts

If the source exceeds the model's limits, chunk it:

Naive chunking: Split into equal-sized pieces, summarize each, summarize the summaries (loses context between chunks).
Sliding window: Overlap chunks by 20–30%. Better coherence, higher cost.
Section-aware: Split on headers/paragraphs when possible. Requires preprocessing.
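
A sliding-window chunker with a summarize-the-summaries pass might look like the sketch below. The summarize argument is a placeholder for whatever model call you use, and chunk_tokens, summarize_long, and the defaults are hypothetical. Tokens are approximated by whitespace-separated words.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=0.25):
    """Split tokens into windows of chunk_size that overlap by the given fraction."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


def summarize_long(text, summarize, chunk_size=1000, overlap=0.25):
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    words = text.split()
    if len(words) <= chunk_size:
        return summarize(text)
    partials = [summarize(' '.join(c)) for c in chunk_tokens(words, chunk_size, overlap)]
    return summarize(' '.join(partials))
```

This does a single reduce pass, so it assumes the combined partial summaries fit in the model's window; very long sources may need the reduce step repeated.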

Common Pitfalls

1. No maximum length enforcement
Without a cap, models can produce summaries nearly as long as the source. Always set max_tokens or word limits in your prompt.
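
Prompt-level limits are not always obeyed, so a hard cap applied after generation can act as a backstop. The sketch below keeps whole sentences while they fit and falls back to a plain word cut; enforce_word_limit is a hypothetical helper, and the regex sentence splitter is a simplifying assumption.

```python
import re


def enforce_word_limit(summary, max_words):
    """Trim summary to at most max_words, cutting at sentence boundaries if possible."""
    sentences = re.split(r'(?<=[.!?])\s+', summary.strip())
    kept, count = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if count + n > max_words:
            break
        kept.append(sentence)
        count += n
    if kept:
        return ' '.join(kept)
    # Even the first sentence is too long: fall back to a hard word cut.
    return ' '.join(summary.split()[:max_words])
```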

2. Testing only on news datasets
Summarization models trained on news often fail on technical docs, emails, or transcripts. Test on your actual domain.

3. Treating all texts the same
A 500-token article and a 50,000-token book need different compression ratios. Adapt target length to input size.
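
One way to adapt the target to input size is a ratio with a floor and a ceiling, as sketched below. The function name and the numbers (0.3 ratio, 50-word floor, 500-word ceiling) are illustrative assumptions to tune per domain, not recommendations from measured results.

```python
def adaptive_target_length(source_words, base_ratio=0.3, min_words=50, max_words=500):
    """Scale the target with input size, clamped to [min_words, max_words].

    The floor never exceeds the source length, so short inputs are not
    asked for a summary longer than themselves.
    """
    target = round(source_words * base_ratio)
    floor = min(min_words, source_words)
    return max(min(target, max_words), floor)
```

With these defaults a 500-word article gets a 150-word target, while a 50,000-word book is capped at 500 words, which amounts to a far more aggressive effective ratio.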

4. Ignoring prompt engineering
For LLM-based summarization, prompt quality matters as much as the model. Include examples or style instructions if consistency matters.
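
A small prompt builder makes the style instructions and examples explicit and repeatable. The template wording, the build_prompt name, and the default style string below are assumptions for illustration; adapt them to the model you use.

```python
def build_prompt(text, target_words, examples=(), style="neutral, third person"):
    """Assemble a summarization prompt with optional (source, summary) examples."""
    parts = [f"Summarize the text in at most {target_words} words. Style: {style}."]
    for source, summary in examples:
        # Few-shot examples show the model the expected length and tone.
        parts.append(f"Text: {source}\nSummary: {summary}")
    parts.append(f"Text: {text}\nSummary:")
    return "\n\n".join(parts)
```

Keeping the template in one function also makes it easier to version prompts and compare their outputs over time.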

5. No fallback for API failures
If you rely on external APIs, cache summaries and have a deterministic backup (extractive fallback).
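
The cache-plus-fallback pattern can be sketched as a thin wrapper. Here primary and fallback are placeholders for your API call and your extractive backup, and the in-memory dictionary stands in for whatever cache store you use; all names are hypothetical.

```python
import hashlib

_cache = {}  # in-memory stand-in for a real cache store


def summarize_with_fallback(text, primary, fallback):
    """Return a cached or fresh primary summary; use fallback if primary raises."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]
    try:
        summary = primary(text)
    except Exception:
        # Fallback results are not cached, so the primary is retried next time.
        return fallback(text)
    _cache[key] = summary
    return summary
```

Catching every exception is deliberately broad for a sketch; in production you would likely catch only the client's network and rate-limit errors and log the failure.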

6. Mixing languages without testing
Many models handle only English well. If you need multi-language support, test each language (and each source/output language pair, if you summarize across languages) before production.
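
One cheap pre-check before per-language testing is to flag inputs that mix writing systems. The sketch below groups letters by the first word of their Unicode character name (for example LATIN or CYRILLIC). That grouping is an assumption of this example and only a proxy for script: it cannot tell apart languages that share a script, such as English and French. The function names and the 0.2 threshold are hypothetical.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters per script, using the first word of each Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


def is_mixed_script(text, threshold=0.2):
    """True if more than `threshold` of the letters fall outside the dominant script."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False
    dominant = counts.most_common(1)[0][1]
    return (total - dominant) / total > threshold
```

For actual language identification you would need a dedicated language-detection model or library; this check only catches the obvious mixed-script cases.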

Quick Checklist

☐ Chose extractive, abstractive, or hybrid; justified the choice for your use case.
☐ Set compression ratio based on domain n
Published by AstraNL coordination infrastructure. AI-assisted drafting (EU AI Act Art. 50).