Last month a senior engineer told me his Claude Max quota ran out in four days. He bills at $200 an hour, builds production systems, knows what he is doing. Yet the $350-per-month subscription that should sustain a full month of agentic coding burned to zero before the second weekend.
The problem was not the model. The problem was that every command his agent ran returned 600 lines of output, and every one of those lines became part of the context window for the next reasoning step. He was not paying for thinking. He was paying for dependency trees and stack traces that the model never needed to read.
This is the quiet failure mode of agentic coding in 2026. Teams adopt Claude Code, Cursor, or a custom MCP rig, then watch their token bills climb past their cloud spend. The instinct is to blame the model or the price. The actual fix sits one layer below the agent, at the shell.
The Anatomy of a Burned Token
Open any coding agent's transcript and look at what consumes the budget. It is rarely the user prompts or the assistant's responses. It is the tool output. A single pnpm install can return 800 to 1,200 lines on a fresh dependency resolution. A vitest run on a 200-test suite produces 4,000 lines of pass/fail spam, of which maybe 30 lines actually matter when something breaks. A casual grep across a monorepo with 5,000 files can dump 12,000 lines into the context before the agent has formed a hypothesis about the bug.
Every one of those lines is tokens, and every one of those tokens is paid for, against your Max quota or your API balance, on every subsequent turn in which the model has to carry them.
The math is brutal. A noisy tool result does not cost you once; it rides along as input on every later turn of the conversation. If a single tool call adds 8,000 tokens of noise and the conversation runs for 30 turns, that one wasted call has cost you 240,000 tokens by the end of the session. Multiply by ten sessions a day and you are looking at 2.4 million tokens of pure noise per engineer per workday.
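A back-of-the-envelope version of that arithmetic, using the same illustrative figures, looks like this:

```typescript
// Rough illustration of how one noisy tool call compounds across a session.
// The noise re-enters the context as input tokens on every later turn.
const noiseTokensPerCall = 8_000;   // one verbose tool output
const turnsInSession = 30;          // turns that re-send that context
const sessionsPerDay = 10;

const wastePerSession = noiseTokensPerCall * turnsInSession;       // 240,000
const wastePerEngineerPerDay = wastePerSession * sessionsPerDay;   // 2,400,000

console.log({ wastePerSession, wastePerEngineerPerDay });
```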
That is not a Claude pricing problem. That is an output-discipline problem.
What RTK Actually Does
RTK, the Rust Token Killer, sits between your agent and your shell. You replace git status with rtk git status, pnpm install with rtk pnpm install, and so on. The command runs identically on your machine. The filtered output is what flows back into the agent's context.
For commands with dedicated filters, RTK rewrites the output with surgical precision. Test runs return failures only, with the minimum stack context needed to act. Builds return route metrics and warnings, not the play-by-play of webpack chunking. Git operations return ultra-compact confirmations. The savings are categorical, not incidental.
Here is the savings profile published by the project, calibrated against typical agentic workflows in 2026.
| Category | Example commands | Typical savings |
|---|---|---|
| Tests | vitest, jest, pytest, cargo test, playwright | 90-99% |
| Build & compile | next build, tsc, lint, prettier | 70-87% |
| Git | status, log, diff, add, commit | 59-80% |
| GitHub CLI | gh pr view, gh run list, gh issue list | 26-87% |
| Package managers | pnpm install, npm run, npx | 70-90% |
| Files & search | ls, grep, find, read | 60-75% |
| Infra | docker, kubectl logs | 85% |
| Network | curl, wget | 65-70% |
The 90-99% range on test output is not a typo. A passing vitest run that would have flooded 4,000 lines of green checkmarks into the agent collapses to a single line confirming the suite passed. A failing run preserves only the failures, with the assertion text and the file location.
A Real Before-and-After
I instrumented a single coding session on a Next.js 15 monorepo, agent running Claude Sonnet 4.6, working through a feature branch with about 40 tool calls over two hours. Half the session ran through raw shell. Half ran through RTK on the same kinds of commands.
The raw-shell half consumed 1.42 million tokens; the RTK half consumed 287,000 tokens for equivalent feature work. Same model, same engineer, same class of task. The output filtering alone changed the cost basis by roughly 5x.
At Sonnet 4.6 API pricing in May 2026, that is the difference between $4.26 and $0.86 for two hours of agent work. Multiply across a five-engineer team running this pattern eight hours a day and the annual difference is the cost of a hire.
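For reference, the dollar figures fall out of a flat per-million-token input rate. The $3-per-million number below is an assumption chosen to match the session figures above, not a quote of current pricing:

```typescript
// Convert measured session tokens to dollars at an assumed input rate.
const pricePerMillionInputTokens = 3.0; // assumption, consistent with the figures above

const rawShellTokens = 1_420_000;
const rtkTokens = 287_000;

const cost = (tokens: number) =>
  (tokens / 1_000_000) * pricePerMillionInputTokens;

console.log(cost(rawShellTokens).toFixed(2)); // "4.26"
console.log(cost(rtkTokens).toFixed(2));      // "0.86"
```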
You can read more about how engineers structure these high-volume sessions in my earlier piece on Claude Max and the high-volume engineer, which covers the workflow context for why this matters.
Where Most Teams Get This Wrong
The reflex when token bills climb is to switch to a cheaper model. That is a worse fix than it looks. Claude Haiku 4.5 is excellent at small, well-scoped tasks but loses ground fast on multi-file reasoning. The teams that downgrade to save tokens often end up running more turns to compensate, which eats most of the savings and adds latency.
The right move is to take the noise out of the input before downgrading the model. A team running Sonnet 4.6 with disciplined tool output is faster and more accurate than a team running Haiku on raw shell dumps. The senior engineer in the second team is also less frustrated, which sounds soft but shows up in retention numbers within six months.
Another common miss is treating output filtering as a one-time setup rather than a continuous practice. New tools enter the workflow constantly. A team that adopts a new ORM or a new infra CLI in Q3 will see token usage spike if those new commands are not wrapped. RTK includes a proxy mode for unfiltered passthrough during onboarding plus a meta command that flags missed opportunities from session history. The discipline is to check those reports weekly.
The Pattern That Actually Works
Three things compound to make agentic coding sustainable at production scale. None of them is a model upgrade.
First, output filtering at the shell layer. RTK is the cleanest implementation of this in 2026, but the principle predates the tool. Any wrapper that strips noise before it enters the context window pays back within the first session.
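If you want to see the principle without adopting anything, a toy version fits in a dozen lines of Node. This is a sketch of the idea, not RTK's implementation, and the failure patterns are arbitrary examples: run the command, keep the actionable lines, summarize the rest.

```typescript
import { spawnSync } from "node:child_process";

// Toy illustration of shell-layer output filtering: run a command,
// return only lines that look actionable, plus a one-line summary.
// The regex below is an arbitrary example, not RTK's filtering rules.
function runFiltered(cmd: string, args: string[]): string {
  const result = spawnSync(cmd, args, { encoding: "utf8" });
  const lines = ((result.stdout ?? "") + (result.stderr ?? "")).split("\n");

  const actionable = lines.filter((line) =>
    /\b(FAIL|ERROR|error|warning)\b/.test(line)
  );

  const summary =
    `${cmd} exited ${result.status}, ` +
    `${lines.length} lines of output, ${actionable.length} kept`;

  return [summary, ...actionable].join("\n");
}

// Example: only failures from a test run reach the agent's context.
console.log(runFiltered("pnpm", ["vitest", "run"]));
```

Real filters are command-aware rather than regex-based, which is where the categorical savings in the table come from.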
Second, prompt caching on the parts of the conversation that repeat. Anthropic's caching cuts the per-turn cost of stable context such as system prompts, project conventions, and reference files by 90%. Pair that with RTK and you have small inputs hitting cached prefixes, which is the cheapest possible token economics.
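As a concrete sketch of what that pairing looks like with Anthropic's TypeScript SDK (the model name, the conventions file, and the sample tool output are placeholders, not part of any specific setup): the stable context is marked cacheable, and the small filtered tool result arrives as the ordinary per-turn message.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Stable, repeated context goes in a cacheable system block;
// the small, per-turn tool output arrives as the user message.
const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder; use whatever model your agent runs
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: readFileSync("CONVENTIONS.md", "utf8"), // hypothetical conventions file
      cache_control: { type: "ephemeral" },          // cache reads bill at a fraction of base input
    },
  ],
  messages: [
    { role: "user", content: "vitest: 2 failures in src/auth/session.test.ts" },
  ],
});

console.log(response.usage); // cache_creation_input_tokens / cache_read_input_tokens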
Third, scoped sub-agents for narrow tasks. Spawning a sub-agent with a tight system prompt to handle a focused job, then discarding its context, is the difference between a 200,000-token main session and a 2-million-token main session. I cover the workflow ergonomics of this in Maximize Claude Code: advanced configuration for senior engineers.
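A minimal sketch of that spawn-and-discard pattern, again with the Anthropic SDK; the system prompt, model, and task are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Spawn a throwaway sub-agent for one narrow job, return only its answer,
// and let its entire transcript fall out of scope afterwards.
async function runSubAgent(task: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-haiku-4-5", // placeholder; a small model is often enough for scoped work
    max_tokens: 2048,
    system:
      "You rename a single symbol across the files you are given. " +
      "Reply with a unified diff and nothing else.",
    messages: [{ role: "user", content: task }],
  });

  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}

// Only this short result string is handed back to the main session.
const diff = await runSubAgent("Rename getUser to fetchUser in src/api/user.ts");
```

The main session only ever sees the returned string; everything the sub-agent read to produce it is discarded with the function call.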
A team that runs all three sees token consumption that scales with decisions made rather than with raw output volume. That is the only profile that makes production AI features economically defensible to a CFO.
Why This Matters for B2B Companies
The companies I work with are not running solo developer experiments. They are shipping AI features into customer-facing products with real SLAs and real budget oversight. When a finance team sees a $40,000 monthly Anthropic bill that grew from $4,000 in two months, the product gets shut down whether it works or not.
Token discipline is what keeps that from happening. It is also a skill that is rare inside engineering teams that have not done agent work before. Most senior engineers have never had to think about output as a cost. They have spent their careers in a world where the cost of pnpm install was a few seconds of their time, not a few dollars per invocation.
This is where I spend most of my consulting hours in 2026. Not building the agents themselves, which is now well-supported by frameworks. Building the surrounding token economy that makes the agents safe to deploy.
Where This Leaves You
If your team is hitting Claude Max ceilings or watching API spend climb past predictability, the first audit is not on the model or the prompts. It is on what flows back from your tool calls. RTK is the fastest way to find out how much of your spend is noise.
Install it, wrap your dev commands, and look at rtk gain --history after a week. If the savings number is large, you have your answer about what was wrong. If the savings number is small, you have ruled out a major hypothesis and can move to prompt and architecture audits with confidence.
Either way, you are no longer guessing about where the money goes.
Burning Claude tokens without shipping the feature?
Most AI projects stall because nobody on the team knows how to design agents, manage token budgets, or wire production evals. I build that layer for B2B companies so the feature actually ships and keeps shipping.
- Senior engineer turned AI specialist. React, Next.js, AWS, agent orchestration.
- Direct collaboration across UAE, Europe, and US time zones.
- Discovery, role design, MCP integration, evals, and production deployment.
Subscribe to the newsletter for more on token economics, agent architecture, and B2B AI engineering.
