A founder told me last week that her team had spent three months on an AI feature and it still produced wrong answers two times out of ten. She had read six articles arguing that AI is overhyped. She was halfway convinced the technology itself was the problem.
I asked her three questions about the project. Was there an eval harness running on every commit? No. Was the agent's context windowed by retrieval, or stuffed with the full system prompt every turn? Stuffed. Did anyone on the team know the difference between Sonnet 4.6 and Opus 4.6 for the kind of reasoning her feature needed? They had picked Sonnet because it was cheaper.
The feature did not have an AI problem. The team had a skills problem. Within two weeks of installing eval discipline, retrieval-bounded context, and a model-routing layer, the same code produced correct answers 97% of the time on the same test set. The model never changed.
This pattern repeats across the dozens of B2B engineering teams I have seen in 2026. The complaint that AI does not code well is a real complaint. The diagnosis that AI is the problem is wrong almost every time.
The Frustration Is Real. The Cause Is Not What People Think.
Listen carefully to the engineers complaining about AI in 2026 and you hear specific failure stories that all map to known fixes.
"It hallucinates." Most often this means the model was asked for a fact it had no grounded source for, with no MCP retrieval connected, no fallback to "I do not know," and no constraint on the output format. The fix is not a better model. The fix is retrieval, format constraints, and an explicit acknowledgement path for unknowns.
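The acknowledgement path is easy to make concrete. Here is a minimal sketch of the pattern, assuming a hypothetical snippet shape for whatever retrieval server or tool your stack actually connects:

```typescript
// Hypothetical stand-in for the records an MCP retrieval server returns.
type Snippet = { source: string; text: string };

// Constrained output: every answer carries sources, or the system says so.
type Grounded =
  | { kind: "answer"; text: string; sources: string[] }
  | { kind: "unknown"; reason: string };

function answerFromSnippets(question: string, snippets: Snippet[]): Grounded {
  // No grounded source retrieved: refuse instead of letting the model guess.
  if (snippets.length === 0) {
    return { kind: "unknown", reason: `no retrieved source for: ${question}` };
  }
  // In a real system the model synthesizes from the snippets here; the key
  // constraint is that the answer can never exist without its sources.
  return {
    kind: "answer",
    text: snippets.map((s) => s.text).join(" "),
    sources: snippets.map((s) => s.source),
  };
}
```

The point of the union type is that downstream code is forced to handle the unknown case; there is no shape in which an ungrounded answer can flow through.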
"It burns tokens." Most often this means tool output is flowing unfiltered into the context, system prompts are uncached, and there is no sub-agent boundary on tangential research. The fix is output filtering, prompt caching, and scoped sub-agents. None of which require changing the model.
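The output-filtering half of that fix can start as a one-function budget cap before tool output enters the context. The 4-characters-per-token ratio below is a rough heuristic, not a real tokenizer:

```typescript
// Rough heuristic: most English text averages around 4 chars per token.
const CHARS_PER_TOKEN = 4;

function capToolOutput(output: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (output.length <= maxChars) return output;
  // Keep head and tail; the interesting part of a long log usually
  // lives at one end, and the middle is repetitive noise.
  const half = Math.floor(maxChars / 2);
  return (
    output.slice(0, half) +
    `\n… [${output.length - maxChars} chars elided] …\n` +
    output.slice(-half)
  );
}
```

A proxy like RTK does this more intelligently, but even this crude cap stops a single noisy test run from eating a six-figure-token context window.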
"It doesn't actually ship anything." Most often this means the team is using AI for code generation but skipped the eval harness, so they have no signal about whether the generated code is correct beyond a manual spot check. The fix is treating evals as the primary deliverable of any AI feature, not an afterthought.
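An eval harness does not need a framework to start. A minimal sketch, where `feature` is a placeholder for whatever the AI feature actually does:

```typescript
type EvalCase = { input: string; expected: string };

function runEvals(
  feature: (input: string) => string,
  cases: EvalCase[],
  threshold = 0.97
): { passRate: number; gatePassed: boolean } {
  const passed = cases.filter((c) => feature(c.input) === c.expected).length;
  const passRate = passed / cases.length;
  // The gate fails the deploy, not just a dashboard, when the rate drops.
  return { passRate, gatePassed: passRate >= threshold };
}
```

Exact-match comparison is the crudest possible grader; real harnesses grade with rubrics or model judges. But a crude harness on every commit beats a sophisticated one that never gets built.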
"It works in the demo and breaks in production." Most often this means the demo had clean inputs and the production has noisy inputs, and the team never built the input-validation and graceful-degradation layer. The fix is treating the AI as one component in a system that includes input sanitization, rate limiting, fallbacks, and observability.
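The validation-and-degradation layer is ordinary defensive code. A sketch with illustrative names, not a real API:

```typescript
// Reject the noisy-input cases that never appeared in the demo:
// control characters, empty strings, and inputs past a length budget.
function sanitizeInput(raw: string, maxLength = 4000): string | null {
  const cleaned = raw.replace(/\p{C}/gu, "").trim();
  if (cleaned.length === 0 || cleaned.length > maxLength) return null;
  return cleaned;
}

// Degrade gracefully: a safe canned answer beats a 500 or a hallucination.
async function withFallback(
  call: () => Promise<string>,
  fallback: string
): Promise<string> {
  try {
    return await call();
  } catch {
    return fallback;
  }
}
```

Rate limiting and observability hooks belong in the same layer; the principle is that the model call is wrapped, never naked.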
In every case the actual problem is one layer underneath the AI. In every case the fix is a senior engineering practice that has existed for years and now applies to AI.
Ten Mistakes I See Constantly
These are the patterns that produce the most frustration per session in B2B AI work in 2026. Each has a one-line fix. Each fix is something a working AI engineer does without thinking about it.
- Pasting entire files when 40 lines would suffice. Fix: extract the relevant slice with `grep -A 20` or your editor's symbol search before sending.
- Letting tool output flood the context. Fix: wrap dev commands with a token-aware proxy such as RTK, covered in Stop Burning Claude Tokens.
- Asking the model to recall facts instead of fetching them. Fix: connect an MCP retrieval server for the fact source, or use a tool call. Stop expecting the model to know your private data.
- Running every reasoning step on the most expensive model. Fix: route by complexity. Haiku 4.5 for classification and extraction, Sonnet 4.6 for general reasoning, Opus 4.6 for architectural decisions.
- No eval harness. Fix: write 30 input/output pairs that represent the actual task before you write any prompts. Run them on every change.
- One giant system prompt that contains everything the agent might need. Fix: split into a stable cached prefix and a per-task instruction layer. The cache hit rate is the metric.
- Long conversation histories that drift off-task. Fix: end the session and start fresh when the goal shifts. Conversational continuity has a token cost and a cognitive cost.
- Assuming the model knows the codebase. Fix: provide a CLAUDE.md, a project map, or a tool that lets the agent enumerate structure. The model is not psychic about your repo conventions.
- No structured output for downstream code. Fix: use Zod schemas or JSON Schema constraints when the output feeds another system. Free-form prose is fine for humans, never for pipelines.
- Treating prompt engineering as the whole job. Fix: agent design, retrieval, evals, observability, and token economics matter more than prompt phrasing in 2026. Anyone selling you the opposite is selling a 2023 product.
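The routing rule from that list can be as small as a lookup. The model identifier strings below are illustrative, not official provider API ids:

```typescript
type TaskKind = "classification" | "extraction" | "reasoning" | "architecture";

function routeModel(task: TaskKind): string {
  switch (task) {
    case "classification":
    case "extraction":
      return "haiku-4.5"; // cheap and fast; good enough for narrow tasks
    case "reasoning":
      return "sonnet-4.6"; // the general-purpose default
    case "architecture":
      return "opus-4.6"; // reserve the expensive model for hard decisions
  }
}
```

A production router adds fallback rules and cost attribution, but even this three-tier switch routinely cuts spend by a large multiple versus running everything on the top model.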
The list is unglamorous. None of these are clever tricks. They are the same kind of practices that separate a senior engineer from a junior on any non-AI project. The work has shifted from prompt phrasing to system design.
The Engineering Tax of Not Knowing This
A B2B company that does not have these practices internalized will pay for the gap in three ways.
The first cost is direct, in the form of token spend that grows faster than feature value. A team without context discipline routinely spends 4-6x more per feature than a disciplined team. At enterprise scale that gap is six figures per month. The CFO sees the bill before the head of engineering sees the productivity gain, and the project gets cut.
The second cost is reliability. Features that work in QA and break in production erode customer trust faster than missing features do. A B2B customer paying $50,000 a year is patient with a missing feature. They are not patient with a feature that gives wrong answers 8% of the time. The reputational damage from shipping unreliable AI is significantly worse than the reputational damage from shipping nothing.
The third cost is hiring drift. Companies that struggle with AI tend to hire prompt engineers, then realize prompt engineering is a small subset of the actual work, then re-hire AI engineers, then realize the original team cannot collaborate effectively with the new specialists. A clear architectural plan from the start avoids the back-and-forth and the morale damage that comes with it.
What an AI Engineer Actually Does
The job in 2026 is closer to senior platform engineering than to anything that existed in 2023. The deliverables look like this.
A working eval harness with at least 50 task-representative examples, run on every commit, gating production deployment. A token-economics audit baseline plus a monthly review of where the budget actually goes. A model-routing layer with explicit fallback rules and cost attribution. A prompt-caching strategy with measured hit rates. A retrieval and tool-use architecture that grounds the model in current facts. An observability stack that surfaces hallucination rate, latency, and cost per request. A runbook for what happens when the model upgrades or a provider has an outage.
Those are not optional add-ons. They are the architecture. The prompts are the easy part.
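The cost-attribution deliverable, for instance, can start as a per-request log and two functions. The per-million-token prices below are placeholders; plug in your provider's actual rates:

```typescript
type RequestLog = { model: string; inputTokens: number; outputTokens: number };

// Illustrative $/1M-token rates, not real pricing.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  "haiku-4.5": { input: 1, output: 5 },
  "sonnet-4.6": { input: 3, output: 15 },
  "opus-4.6": { input: 15, output: 75 },
};

function costUsd(log: RequestLog): number {
  const p = PRICE_PER_MTOK[log.model];
  return (log.inputTokens * p.input + log.outputTokens * p.output) / 1_000_000;
}

// Attribute spend by model so the monthly review starts from numbers,
// not from guesses about where the budget went.
function costByModel(logs: RequestLog[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const l of logs) totals[l.model] = (totals[l.model] ?? 0) + costUsd(l);
  return totals;
}
```

Once every request emits a `RequestLog`, cost per request, cost per feature, and cache savings all fall out of the same data.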
When a senior AI engineer is missing from a B2B AI project, those layers do not get built, and the team is left arguing about prompt wording in Slack while their costs climb.
Where This Leaves You
If your team has been working on an AI feature for months without a clean path to shipping, the model is almost never the problem. The skills, the architecture, and the operating practices are usually where the gap sits.
The fix is straightforward but it is not fast. A team that takes the practices seriously becomes productive in 60-90 days. A team that keeps blaming the model stays frustrated until they replace the engineering leadership or hire someone who has been through this loop before.
I would rather see B2B engineering leaders treat AI as a discipline that takes a quarter to learn than as a feature that should just work. The teams that frame it the first way ship. The teams that frame it the second way write articles complaining about AI.
Stuck on an AI project that hallucinates and won't ship?
Most AI projects stall because nobody on the team knows how to design agents, manage token budgets, or wire production evals. I build that layer for B2B companies so the feature actually ships and keeps shipping.
Senior engineer turned AI specialist. React, Next.js, AWS, agent orchestration.
Direct collaboration across UAE, Europe, and US time zones.
Discovery, role design, MCP integration, evals, and production deployment.
Subscribe to the newsletter for working patterns from real B2B AI engineering work.
