**The Core Problem:**
Context windows aren't infinite. Claude 3.5 gives you 200K tokens, but if you stuff it with:
- Full conversation history
- Massive reference documents
- Multiple system prompts
- Example interactions
you're left with maybe 5K tokens for the actual response. The model suffocates in verbosity.
**Three Practical Fixes:**
- **Hierarchical Summarization** – Don't pass raw docs. Create executive summaries with priority markers ("CRITICAL", "CONTEXT ONLY", "EXAMPLE"). The markers cue the model to weight sections differently (sketch below).
- **Rolling Context** – Keep only the last 5 interactions, not the entire chat. Counterintuitive, but it cuts noise: newer context is usually more relevant (sketch below).
- **Explicit Token Budgets** – Add this to your system prompt: "You have 4000 tokens remaining. Structure responses accordingly." It forces the model to be strategic (sketch below).
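A minimal sketch of the marker idea in Python. The marker names and the `build_context` helper are my own convention, not anything Claude treats specially; the point is that the structure survives summarization:

```python
# Sketch: wrap pre-summarized chunks in priority markers so the model can
# tell load-bearing facts from background. Marker names are my convention,
# nothing Claude treats specially, but models follow them reliably when
# the system prompt explains the scheme.

def tag(label: str, text: str) -> str:
    return f"[{label}]\n{text.strip()}\n[/{label}]"

def build_context(critical: str, background: str, example: str) -> str:
    return "\n\n".join([
        tag("CRITICAL", critical),        # must be reflected in the answer
        tag("CONTEXT ONLY", background),  # read, but don't restate
        tag("EXAMPLE", example),          # format/style reference only
    ])
```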
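For rolling context, a `deque` with `maxlen` does all the work. Assuming one "interaction" means one user message plus one assistant reply:

```python
from collections import deque

class RollingHistory:
    """Keeps only the most recent turns; older ones fall off automatically."""

    def __init__(self, max_turns: int = 5):  # the "last 5 interactions" heuristic
        self.turns = deque(maxlen=max_turns)

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))

    def as_messages(self) -> list[dict]:
        # Flatten to the role/content shape chat APIs expect.
        msgs = []
        for user_msg, assistant_msg in self.turns:
            msgs.append({"role": "user", "content": user_msg})
            msgs.append({"role": "assistant", "content": assistant_msg})
        return msgs
```

On each request you send `history.as_messages()` instead of the full transcript.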
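And a sketch of generating the budget line dynamically, assuming a crude 4-characters-per-token estimate (swap in a real tokenizer if you need accuracy):

```python
CONTEXT_LIMIT = 200_000       # Claude 3.5's window
RESERVED_FOR_OUTPUT = 4_000   # tokens we want left for the reply

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic, not a real tokenizer

def budget_line(context_so_far: str) -> str:
    used = estimate_tokens(context_so_far)
    remaining = max(CONTEXT_LIMIT - used - RESERVED_FOR_OUTPUT, 0)
    return (
        f"You have roughly {remaining} tokens of context remaining. "
        "Structure responses accordingly: lead with the answer, cut preamble."
    )
```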
**Real Example:**
I was passing in a 50-page research paper for analysis. First try: 80K tokens wasted on reading, 5K on actual analysis.
Second try: Extracted abstract + 3 key sections. 15K tokens total. Better output quality.
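The extraction step was nothing fancy, roughly this (the section names and markdown-style headings are illustrative; a real PDF needs text extraction first):

```python
import re

# Sketch: keep only the sections worth sending from a plain-text paper.
# Assumes "## Heading" section breaks; adjust the pattern to your source.
KEEP = {"abstract", "methods", "results", "discussion"}

def extract_sections(paper_text: str) -> str:
    sections = re.split(r"^##\s+", paper_text, flags=re.MULTILINE)
    kept = []
    for sec in sections:
        title = sec.split("\n", 1)[0].strip().lower()
        if title in KEEP:
            kept.append("## " + sec.strip())
    return "\n\n".join(kept)
```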
What's your use case? Token budget constraints feel different by domain (research vs coding vs creative writing). Curious what patterns you're hitting.