Summarization Runbook

Purpose: Produce fair-use, non-substitutive summaries without drift.


First Diagnostic Question

Is the output text wrong? (hallucination, tone drift, over-quoting)

If yes, halt summarization before adjusting prompts.


Guardrails

  • Summarize delta only (not full page content)
  • No long quotes; paraphrase
  • Match tone to URL type
  • Respect fair-use constraints at all times

Diagnostic Decision Tree

Summarization issue suspected
    │
    ├─ Hallucination? (summary contains invented information)
    │   ├─ Content not in delta → Model confabulation, tighten prompt
    │   ├─ Misinterpretation → Delta context insufficient
    │   └─ Pattern-matching error → Model saw similar content elsewhere
    │
    ├─ Over-quoting? (too much verbatim content)
    │   ├─ Exceeds length caps → Enforce truncation
    │   ├─ Feels like replacement → Fair-use violation, suppress
    │   └─ Quote boundaries unclear → Improve paraphrase instructions
    │
    ├─ Tone issues? (urgency, style mismatch)
    │   ├─ Overstated urgency → Calibrate confidence language
    │   ├─ Wrong register → URL type detection issue
    │   └─ Inconsistent across runs → Model temperature too high
    │
    ├─ Missing key changes?
    │   ├─ Delta correct but summary incomplete → Prompt issue
    │   ├─ Changes below significance threshold → Intentional suppression
    │   └─ Model truncated output → Token limit hit
    │
    └─ Confidence wrong?
        ├─ High confidence on ambiguous change → Tighten confidence criteria
        ├─ Low confidence on clear change → Confidence logic bug
        └─ Confidence not matching delta quality → Miscalibration

Common Failure Scenarios

Hallucination

Symptoms:

  • Summary mentions changes not in delta
  • Invented statistics or dates
  • Confusion with similar pages

Actions:

  1. Immediately suppress affected summaries
  2. Review delta to confirm content mismatch
  3. Tighten prompt to restrict to delta content only
  4. Add explicit "only summarize provided delta" instruction
  5. Reduce model temperature if it is non-zero

Priority: Treat as SEV-1 if widespread.
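
One way to support step 2 (confirming a content mismatch) is to flag numbers and dates that appear in a summary but nowhere in the delta, which targets the "invented statistics or dates" symptom. The sketch below is a heuristic for routing summaries to human review, not a definitive detector; the function name `unsupported_numbers` and the sample strings are hypothetical.

```python
import re

# Matches integers, decimals, percentages, and simple numeric dates such as 2024-05-01.
NUMBER_PATTERN = re.compile(r"\d+(?:[.,:/-]\d+)*%?")


def unsupported_numbers(summary: str, delta: str) -> list[str]:
    """Return numeric tokens that appear in the summary but not in the delta."""
    delta_numbers = set(NUMBER_PATTERN.findall(delta))
    seen = set()
    missing = []
    for token in NUMBER_PATTERN.findall(summary):
        # Report each unsupported token once, in order of first appearance.
        if token not in delta_numbers and token not in seen:
            seen.add(token)
            missing.append(token)
    return missing


# Hypothetical sample data for illustration only.
delta = "Price changed from $19.99 to $24.99, effective 2024-05-01."
good = "The price rose from $19.99 to $24.99 on 2024-05-01."
bad = "The price rose 25% to $24.99, affecting 3,000 customers."

print(unsupported_numbers(good, delta))  # []
print(unsupported_numbers(bad, delta))   # ['25%', '3,000']
```

A figure the model derived correctly (for example, a percentage computed from two prices in the delta) is also flagged, so a hit means "review this summary", not "this summary is wrong".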

Fair-Use Violations (Over-Quoting)

Symptoms:

  • Summaries contain excessive verbatim quotation
  • Output feels like page replacement
  • Length exceeds expected bounds

Actions:

  1. Immediately halt affected summaries
  2. Review summarization prompts for quoting instructions
  3. Reduce excerpt limits
  4. Force paraphrase-only mode
  5. Add post-processing to detect and truncate long verbatim quotes

Priority: Always treat as SEV-1.
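
The detection half of step 5 can be sketched as a quote-ratio check: the share of summary words that sit inside long verbatim runs copied from the source. This is a minimal sketch using Python's standard `difflib`; the function name, the 5-word run length, and the sample strings are assumptions. The 0.3 cutoff mirrors the threshold in the quote-ratio query later in this runbook, though the runbook does not say how the stored `quote_ratio` column is computed.

```python
from difflib import SequenceMatcher


def quote_ratio(summary: str, source: str, min_run: int = 5) -> float:
    """Fraction of summary words inside verbatim runs of at least min_run words."""
    summary_words = summary.lower().split()
    source_words = source.lower().split()
    if not summary_words:
        return 0.0
    # autojunk=False keeps frequent words from being ignored on long inputs.
    matcher = SequenceMatcher(None, summary_words, source_words, autojunk=False)
    quoted = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_run
    )
    return quoted / len(summary_words)


# Hypothetical sample data for illustration only.
source = "The quick brown fox jumps over the lazy dog near the river bank today"
copied = "Report says the quick brown fox jumps over the lazy dog now"
paraphrased = "A fox jumped over a dog by the river"

QUOTE_RATIO_LIMIT = 0.3  # assumed cutoff, matching the query threshold
for text in (copied, paraphrased):
    ratio = quote_ratio(text, source)
    print(f"{ratio:.2f}", "suppress" if ratio > QUOTE_RATIO_LIMIT else "ok")
```

`SequenceMatcher` aligns the two texts in order, so quotes that the summary reorders relative to the source can be undercounted. Treat the ratio as a lower bound.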

Tone Drift

Symptoms:

  • Overstated urgency ("CRITICAL!", "BREAKING!")
  • Inconsistent formality
  • Emotional language inappropriate for content

Actions:

  1. Review URL type detection: is the URL being classified correctly?
  2. Adjust tone guidelines in prompt
  3. Add examples of appropriate vs inappropriate tone
  4. Consider lowering temperature for more consistent output
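
The overstated-urgency symptom can be screened with a simple lexical check before a human looks at tone. This is a rough sketch: "CRITICAL" and "BREAKING" come from the symptoms above, while the other marker words, the flag names, and the function name are assumptions to be tuned per URL type.

```python
import re

# "critical" and "breaking" come from the symptoms list; "urgent" and "alert" are assumed additions.
URGENCY_WORDS = re.compile(r"\b(critical|breaking|urgent|alert)\b", re.IGNORECASE)
# Four or more consecutive capital letters, e.g. "CRITICAL".
SHOUTED_WORD = re.compile(r"\b[A-Z]{4,}\b")


def tone_flags(summary: str) -> list[str]:
    """Return the names of the tone heuristics the summary trips."""
    flags = []
    if URGENCY_WORDS.search(summary):
        flags.append("urgency_word")
    if SHOUTED_WORD.search(summary):
        flags.append("all_caps")
    if "!" in summary:
        flags.append("exclamation")
    return flags


print(tone_flags("CRITICAL! Pricing page updated."))
print(tone_flags("The pricing page added an annual plan."))
```

Legitimate acronyms of four or more letters (for example "HTTP") also trip the `all_caps` flag, so an allow-list is likely needed for technical pages.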

Repeated Phrasing

Symptoms:

  • Same phrases appearing across different summaries
  • Formulaic structure becoming stale
  • Model "habits" emerging

Actions:

  1. Review prompt for unintentional anchoring
  2. Vary prompt structure slightly
  3. Add diversity instructions
  4. Monitor for improvement
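
Step 4 (monitoring) needs a way to measure repetition. One possible measure is the set of word n-grams that recur across a large share of recent summaries. The sketch below assumes 3-word phrases and a 50% share; both values, the function name, and the sample summaries are hypothetical.

```python
from collections import Counter


def repeated_phrases(summaries: list[str], n: int = 3, min_share: float = 0.5) -> list[str]:
    """Return word n-grams that occur in at least min_share of the summaries."""
    if not summaries:
        return []
    doc_counts = Counter()
    for text in summaries:
        words = text.lower().split()
        # Use a set so each phrase counts once per summary.
        ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_counts.update(ngrams)
    threshold = min_share * len(summaries)
    return sorted(p for p, count in doc_counts.items() if count >= threshold)


# Hypothetical sample data for illustration only.
recent = [
    "This update introduces a new pricing tier",
    "This update introduces revised refund terms",
    "The contact page lists a new address",
    "This update introduces two API endpoints",
]
print(repeated_phrases(recent))  # ['this update introduces']
```

Running this before and after a prompt change gives a simple before/after comparison for the anchoring the actions above try to remove.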

Missing Key Information

Symptoms:

  • Important changes not mentioned in summary
  • Summary too brief given delta size
  • User reports summary missed something

Actions:

  1. Verify delta contains the expected changes
  2. Check if model output was truncated
  3. Review prompt for explicit inclusion requirements
  4. Consider multi-pass summarization for complex deltas
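
For step 2, a cheap first pass is to check whether the summary ends mid-sentence. The sketch below is a heuristic with an assumed function name and an assumed set of sentence terminators. If the model API in use reports why generation stopped, that field is likely a more reliable signal than this check.

```python
# Assumed set of endings that indicate a finished sentence.
TERMINATORS = (".", "!", "?", '."', '!"', '?"', ".)")


def looks_truncated(summary: str) -> bool:
    """Return True if a non-empty summary does not end with sentence punctuation."""
    text = summary.rstrip()
    return bool(text) and not text.endswith(TERMINATORS)


print(looks_truncated("The page added a refund policy and updated the"))
print(looks_truncated("The page added a refund policy."))
```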

Stop Conditions

Hard Stop

Trigger immediately if any are true:

  • Summaries violate fair-use constraints at scale
  • Output risks replacing the source page
  • Hallucinations affecting multiple URLs

Action: Suppress summaries and halt summarization jobs.
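
The three hard-stop triggers can be evaluated mechanically once the inputs are collected. The sketch below is illustrative: the runbook does not quantify "at scale", so `violation_limit=5` is an assumed placeholder, and `WindowStats` and `hard_stop_reasons` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class WindowStats:
    """Counts gathered over a monitoring window (hypothetical structure)."""
    fair_use_violations: int = 0          # summaries flagged for over-quoting
    replacement_risk: bool = False        # reviewer judged output could replace the page
    hallucinated_url_ids: set = field(default_factory=set)


def hard_stop_reasons(stats: WindowStats, violation_limit: int = 5) -> list[str]:
    """Return the hard-stop triggers that currently hold; empty means no hard stop."""
    reasons = []
    if stats.fair_use_violations >= violation_limit:  # assumed meaning of "at scale"
        reasons.append("fair-use violations at scale")
    if stats.replacement_risk:
        reasons.append("output risks replacing the source page")
    if len(stats.hallucinated_url_ids) > 1:  # "multiple URLs" read as two or more
        reasons.append("hallucinations affecting multiple URLs")
    return reasons


stats = WindowStats(hallucinated_url_ids={"url-17", "url-42"})
reasons = hard_stop_reasons(stats)
if reasons:
    print("HARD STOP:", "; ".join(reasons))
```

Returning the list of reasons, not just a boolean, keeps the incident record explicit about which trigger fired.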

Degrade Mode

  • Emit "no meaningful change" or "delta unavailable" messages
  • Force paraphrase-only mode
  • Reduce summary length limits

Resume after manual spot-checks confirm safety.


Signals to Watch

Signal                            Indicates
Repeated phrasing across days     Model anchoring, needs prompt refresh
Overstated urgency                Tone calibration issue
Quote ratio increasing            Drift toward fair-use violation
User complaints about accuracy    Hallucination or omission
Summary length variance           Inconsistent generation

Quality Checks

Sample Summary Review

For any suspected issue, manually review:

  1. The delta provided to the model
  2. The prompt used
  3. The generated summary
  4. How the summary compares to what a human would write

Automated Checks

Consider implementing:

  • Quote ratio monitoring (verbatim % of output)
  • Length consistency checks
  • Confidence score distribution
  • A/B testing of prompt changes
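
As one possible shape for the length consistency check, the sketch below flags summaries whose length is far from the batch mean. The z-score limit of 2.0, the function name, and the sample lengths are assumptions.

```python
from statistics import mean, pstdev


def length_outliers(lengths: list[int], z_limit: float = 2.0) -> list[int]:
    """Return indexes of lengths more than z_limit standard deviations from the mean."""
    if len(lengths) < 2:
        return []
    mu = mean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        return []  # all summaries have the same length
    return [i for i, n in enumerate(lengths) if abs(n - mu) / sigma > z_limit]


# Hypothetical summary lengths in characters.
lengths = [200, 210, 190, 205, 195, 200, 210, 190, 900]
print(length_outliers(lengths))  # [8]
```

A single extreme value inflates the standard deviation, so in small batches a median-based measure may be more robust than this mean-based one.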

Database Queries

Find summaries with high quote ratios

-- Top 20 summaries from the last 24 hours with a quote ratio above 0.3.
SELECT id, url_id, summary_text, quote_ratio, created_at
FROM summaries
WHERE quote_ratio > 0.3
  AND created_at > NOW() - INTERVAL '24 hours'
ORDER BY quote_ratio DESC
LIMIT 20;

Check summary generation patterns

-- Daily volume, average length, and average confidence over the last 7 days.
SELECT DATE(created_at) AS day,
       COUNT(*) AS total,
       AVG(LENGTH(summary_text)) AS avg_length,
       AVG(confidence_score) AS avg_confidence
FROM summaries
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY day;

Find URLs with summary issues

-- URLs with more than two low-confidence (< 0.5) summaries in the last 7 days.
SELECT url_id, COUNT(*) AS summary_count,
       AVG(confidence_score) AS avg_confidence,
       COUNT(*) FILTER (WHERE confidence_score < 0.5) AS low_confidence_count
FROM summaries
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY url_id
HAVING COUNT(*) FILTER (WHERE confidence_score < 0.5) > 2
ORDER BY low_confidence_count DESC;