How We Cut Our AI Agent’s Token Usage by 65% Without Sacrificing Quality
How to Reduce AI Token Usage Without Losing What Makes Your Agent Useful
If your organization has deployed an AI agent — for donor research, case summarization, client intake, or internal Q&A — you’ve probably noticed the bills climbing faster than expected. Token costs are the quiet budget leak in most AI implementations, and the common advice to “just use a smaller model” often means trading accuracy for savings in ways that matter. Here’s how we actually reduced AI token usage by 65% on a client deployment without degrading the quality of outputs that staff depended on.
Why Token Bloat Happens in the First Place
Most token waste isn’t in the answers — it’s in the questions. When developers and non-technical staff first build out AI agents, system prompts tend to grow by accretion. Every time the agent says something slightly off, someone adds another paragraph of instruction. After a few months, you can easily end up with a 2,000-token system prompt that repeats itself, contradicts itself, and contains context that only applied to a pilot use case from six months ago.
The second major source of bloat is context stuffing. In retrieval-augmented generation (RAG) setups — which most nonprofit and professional services deployments use for document Q&A — it’s tempting to retrieve five or six chunks of source material and drop them all into the prompt. Sometimes that’s necessary. More often, two well-chosen chunks outperform five mediocre ones, at less than half the token cost.
Understanding where your tokens are actually going is the prerequisite for fixing anything. Before making a single change, log a representative sample of full prompt-plus-completion payloads and break down the token count by component: system prompt, retrieved context, conversation history, user message, and response. The breakdown is almost always surprising.
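As a rough illustration of that breakdown, here is a minimal sketch. The payload keys and the `approx_tokens` heuristic (about four characters per token) are assumptions made for the example; for real numbers, pass your provider's tokenizer as `count_tokens`.

```python
from collections import Counter

# Component names mirror the breakdown in the text; the payload keys are assumed.
COMPONENTS = ("system_prompt", "retrieved_context", "history", "user_message", "response")


def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token."""
    return (len(text) + 3) // 4


def token_breakdown(payloads, count_tokens=approx_tokens):
    """Return each component's share of all tokens across the logged payloads."""
    totals = Counter()
    for payload in payloads:
        for name in COMPONENTS:
            totals[name] += count_tokens(payload.get(name, ""))
    grand_total = sum(totals.values()) or 1  # guard against an empty log
    return {name: totals[name] / grand_total for name in COMPONENTS}


# Made-up payload, sized only to illustrate the output shape.
sample_log = [{
    "system_prompt": "s" * 4000,
    "retrieved_context": "c" * 2000,
    "history": "h" * 1200,
    "user_message": "u" * 80,
    "response": "r" * 720,
}]

for name, share in token_breakdown(sample_log).items():
    print(f"{name:<18} {share:.0%}")
```

Reporting shares rather than raw counts makes it easier to see which component deserves attention first.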
Audit and Compress Your System Prompt
The system prompt is the highest-leverage place to start. On the deployment we’re describing — an internal knowledge assistant for a mid-sized membership association — the system prompt had grown to 1,840 tokens over eight months of incremental edits. After a structured audit, we reduced it to 510 tokens with no measurable change in output quality on our evaluation set.
The process: print the prompt, read it aloud, and ask three questions about every sentence. Does this instruction change what the model would do by default? Is it already implied by another instruction? Does it still apply to current usage? Any sentence that changes nothing, duplicates another instruction, or no longer applies gets cut. Instructions that survive get rewritten for density: say it once, say it clearly. Redundant safety language, verbose persona descriptions, and legacy instructions for deprecated features are the usual casualties.
One practical note: after compressing, run your evaluation set before deploying. Occasionally a sentence you thought was redundant was doing real work. The eval catches it.
Get Smarter About What Context You Retrieve
RAG systems often retrieve by similarity score alone, which means you get the chunks most semantically close to the query, but not necessarily the chunks that are most useful for answering it. If your retrieval step is pulling in three to five chunks as a default, start by measuring how often the correct answer actually appears in the top two chunks versus chunks three through five. In our audit, the correct answer appeared in the top two chunks 81% of the time. Chunks three through five contributed meaningful additional context in fewer than 15% of queries.
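One way to take that measurement, sketched under assumptions: for each query in a labeled evaluation set, someone has recorded the rank of the first retrieved chunk that contains the correct answer. The `ranks` list below is made-up illustration data, not the audit's results.

```python
def hit_rate_at_k(answer_ranks, k):
    """Fraction of queries whose first answer-bearing chunk is ranked within the top k.

    `answer_ranks` holds, per query, the 1-based rank of the first retrieved
    chunk containing the correct answer, or None if no retrieved chunk has it.
    """
    if not answer_ranks:
        return 0.0
    hits = sum(1 for rank in answer_ranks if rank is not None and rank <= k)
    return hits / len(answer_ranks)


# Made-up labels for ten queries, for illustration only.
ranks = [1, 1, 2, 4, None, 1, 3, 2, 1, 5]

for k in (1, 2, 5):
    print(f"hit rate @ {k}: {hit_rate_at_k(ranks, k):.0%}")
```

If the hit rate barely moves between k=2 and k=5, the extra chunks are mostly cost.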
That data justified a default retrieval of two chunks instead of five, with a fallback to four for queries the system classifies as complex. That single change cut retrieved-context tokens by roughly 40% on average. Fewer tokens, faster responses, and in several cases better answers — because a shorter, more focused context is easier for the model to reason over than a long, partially-relevant one.
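The two-chunk default with a four-chunk fallback could look something like the sketch below. The `is_complex` heuristic is purely hypothetical; the text does not say how the system classifies a query as complex, and a small classifier model could fill the same role.

```python
DEFAULT_K = 2  # default number of chunks, per the text
COMPLEX_K = 4  # fallback for queries classified as complex


def is_complex(query: str) -> bool:
    """Hypothetical heuristic: long or multi-part questions count as complex."""
    markers = (" and ", " compare ", " versus ", " vs ")
    padded = f" {query.lower()} "
    return len(query.split()) > 25 or any(marker in padded for marker in markers)


def choose_k(query: str) -> int:
    """Pick how many chunks to place in the prompt for this query."""
    return COMPLEX_K if is_complex(query) else DEFAULT_K


def select_chunks(query: str, ranked_chunks: list) -> list:
    """Trim an already-ranked chunk list to the budget for this query."""
    return ranked_chunks[:choose_k(query)]


chunks = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"]
print(select_chunks("What is the membership renewal deadline?", chunks))
print(select_chunks("Compare the 2022 and 2023 dues policies", chunks))
```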
If you’re not already using a reranker between your vector search and your prompt assembly, it’s worth evaluating. A small cross-encoder reranker — which can run locally and adds minimal latency — significantly improves which chunks actually make it into the prompt, letting you retrieve fewer without sacrificing coverage.
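The shape of a rerank step is simple enough to sketch. In the example below, `overlap_score` is a toy word-overlap scorer used only so the code runs on its own; it is not a cross-encoder. In a real pipeline you would pass a function that wraps your reranker model's scoring call as `score_fn`.

```python
def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in scorer: share of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank(query, candidates, score_fn=overlap_score, keep=2):
    """Re-score vector-search candidates against the query and keep the best few."""
    ordered = sorted(candidates, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ordered[:keep]


# Made-up candidates, in the order a vector search might have returned them.
candidates = [
    "Board meeting minutes from March",
    "The membership renewal deadline is June 30",
    "Renewal notices go out in May for membership",
]
print(rerank("membership renewal deadline", candidates))
```

The design point is that vector search can over-fetch cheaply (say, ten candidates) while only the top few reranked chunks ever reach the prompt.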
Manage Conversation History Deliberately
For any agent that maintains multi-turn conversations, history management is where token costs can spiral most unpredictably. The naive implementation passes the full conversation history on every turn. On a long session, that can mean repeating 3,000 tokens of prior exchanges to establish context for a single new message.
There are a few approaches worth considering, and the right one depends on your use case. The simplest is a rolling window — keep only the last N turns of history. This works adequately for task-focused agents where each exchange is relatively self-contained. For agents where early context genuinely matters throughout a session, a summarization step is more appropriate: after every few turns, compress prior history into a short summary and carry that forward instead of the raw transcript.
On our association client’s deployment, we implemented a hybrid: a four-turn rolling window plus a running summary that was regenerated every four turns. History token costs dropped by 58% on sessions longer than ten turns, which represented about a quarter of all usage volume.
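A hybrid of this kind might be structured as in the sketch below. It is one possible reading of the approach, not the deployed code: the class name, the `summarize` callable (which in practice would be a call to a small model), and the choice to fold older turns into the summary incrementally are all assumptions.

```python
class HybridHistory:
    """Keeps the last `window` turns verbatim and folds older turns into a summary."""

    def __init__(self, summarize, window=4, refresh_every=4):
        self.summarize = summarize          # callable(previous_summary, turns) -> str
        self.window = window
        self.refresh_every = refresh_every
        self.summary = ""
        self.recent = []                    # turns kept verbatim
        self.pending = []                   # aged-out turns not yet summarized
        self.turn_count = 0

    def add_turn(self, user_msg, assistant_msg):
        self.recent.append((user_msg, assistant_msg))
        self.turn_count += 1
        # Move turns that fall outside the rolling window into the pending pile.
        while len(self.recent) > self.window:
            self.pending.append(self.recent.pop(0))
        # Every `refresh_every` turns, fold pending turns into the running summary.
        if self.turn_count % self.refresh_every == 0 and self.pending:
            self.summary = self.summarize(self.summary, self.pending)
            self.pending = []

    def context(self):
        """History to send with the next request: summary first, then raw turns."""
        lines = []
        if self.summary:
            lines.append(f"Summary of earlier conversation: {self.summary}")
        for user_msg, assistant_msg in self.pending + self.recent:
            lines.append(f"User: {user_msg}")
            lines.append(f"Assistant: {assistant_msg}")
        return "\n".join(lines)


def stub_summarize(previous, turns):
    """Placeholder for a model call; just records how many turns were folded in."""
    return f"{previous} +{len(turns)}".strip()


history = HybridHistory(stub_summarize)
for i in range(1, 11):
    history.add_turn(f"q{i:02d}", f"a{i:02d}")
print(history.context())
```

Turns that have left the window but have not yet been summarized stay in the context verbatim, so nothing is dropped between summary refreshes.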
Choose Model Tiers by Task, Not by Default
Not every task your AI agent performs needs your most capable — and most expensive — model. This is an underused lever, particularly in organizations that set up one model configuration and leave it in place indefinitely.
Consider mapping your agent’s tasks by two dimensions: complexity and visibility. High-complexity, high-visibility tasks — generating a grant summary for an executive director, drafting a client-facing memo — warrant the strongest model available. Low-complexity, low-visibility tasks — classifying an inbound query, extracting a date from a document, deciding which retrieval path to use — often perform just as well with a smaller, faster, cheaper model.
In practice, this means a routing layer that classifies the incoming request and dispatches it to the appropriate model. The engineering overhead is modest, and the cost reduction can be substantial. On workloads with a high proportion of classification and extraction tasks, routing alone can reduce model costs by 30 to 50% without any change to output quality on the tasks that actually matter to users.
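A routing layer along these lines can be very small. In the sketch below, the task labels, the model names `small-model` and `large-model`, and the `clients` mapping are all placeholders; how a request gets its task label (rules, or a cheap classifier model) is left open.

```python
from dataclasses import dataclass


@dataclass
class Request:
    task: str            # hypothetical labels, e.g. "classify", "extract", "draft"
    client_facing: bool  # stands in for the "visibility" dimension
    text: str = ""


# Assumed set of low-complexity task labels.
LOW_COMPLEXITY_TASKS = {"classify", "extract", "route"}


def pick_model(request: Request) -> str:
    """Send low-complexity, low-visibility work to the cheaper tier."""
    if request.task in LOW_COMPLEXITY_TASKS and not request.client_facing:
        return "small-model"
    return "large-model"


def dispatch(request: Request, clients: dict):
    """Call whichever model client the router selects."""
    return clients[pick_model(request)](request)


# Stub clients standing in for real model API calls.
clients = {
    "small-model": lambda req: f"[small] {req.task}",
    "large-model": lambda req: f"[large] {req.task}",
}

print(dispatch(Request("extract", client_facing=False), clients))
print(dispatch(Request("draft", client_facing=True), clients))
```

Defaulting to the stronger model whenever a request is unrecognized or client-facing keeps routing mistakes from landing on the outputs users care about most.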
Measure Before and After — and Keep Measuring
The optimizations above are not a one-time project. Token usage patterns drift as staff find new ways to use a tool, as document libraries grow, and as someone inevitably starts editing the system prompt again. A lightweight monitoring setup — tracking average tokens per request, cost per query type, and a periodic sample review of full payloads — takes a few hours to build and pays for itself quickly.
Set a threshold that triggers a review. If average tokens per request climbs more than 20% over a rolling 30-day baseline, that’s a signal to audit. The cause is almost always one of four things: prompt drift, retrieval expansion, history accumulation, or a new use case that got bolted on without a cost assessment.
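The threshold check itself is a few lines. This sketch assumes you already log one average-tokens-per-request figure per day; comparing the most recent seven days against the 30 days before them is an assumption about what "rolling baseline" means here, and the sample series are made up.

```python
from statistics import mean


def needs_review(daily_avg_tokens, baseline_days=30, recent_days=7, threshold=0.20):
    """Flag when the recent average exceeds the prior baseline by more than `threshold`.

    `daily_avg_tokens` is a chronological list of daily average tokens per request.
    """
    if len(daily_avg_tokens) < baseline_days + recent_days:
        return False  # not enough data to compare yet
    recent = daily_avg_tokens[-recent_days:]
    baseline = daily_avg_tokens[-(baseline_days + recent_days):-recent_days]
    return mean(recent) > mean(baseline) * (1 + threshold)


# Made-up series: a stable month, then a week at a higher level.
steady = [1000] * 30 + [1100] * 7
drifting = [1000] * 30 + [1300] * 7

print(needs_review(steady))    # 10% above baseline
print(needs_review(drifting))  # 30% above baseline
```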
Reducing AI token usage isn’t about making your system do less. It’s about making it do the same things with less waste — which, in resource-constrained nonprofit and professional services environments, is exactly the kind of efficiency that justifies continued investment in these tools.
Book a consultation to talk through where your current AI deployment may be leaking tokens and what a structured optimization engagement would look like for your organization.