A founder told me last week that her team had spent three months on an AI feature and it still produced wrong answers two times out of ten. She had read six articles arguing that AI is overhyped. She was halfway convinced the technology itself was the problem.
I asked her three questions about the project. Was there an eval harness running on every commit? No. Was the agent's context windowed by retrieval, or stuffed with the full system prompt every turn? Stuffed. Did anyone on the team know the difference between Sonnet 4.6 and Opus 4.6 for the kind of reasoning her feature needed? They had picked Sonnet because it was cheaper.
The feature did not have an AI problem. The team had a skills problem. Within two weeks of installing eval discipline, retrieval-bounded context, and a model-routing layer, the same code produced correct answers 97% of the time on the same test set. The model never changed.
This pattern repeats across the dozens of B2B engineering teams I have seen in 2026. The complaint that AI does not code well is a real complaint. The diagnosis that AI is the problem is wrong almost every time.
The Frustration Is Real. The Cause Is Not What People Think.
Listen carefully to the engineers complaining about AI in 2026 and you hear specific failure stories that all map to known fixes.
"It hallucinates." Most often this means the model was asked for a fact it had no grounded source for, with no MCP retrieval connected, no fallback to "I do not know," and no constraint on the output format. The fix is not a better model. The fix is retrieval, format constraints, and an explicit acknowledgement path for unknowns.
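The acknowledgement path is easy to make concrete. Here is a minimal sketch of the pattern, assuming a hypothetical snippet shape for whatever retrieval server or tool your stack actually connects:

```typescript
// Hypothetical stand-in for the records an MCP retrieval server returns.
type Snippet = { source: string; text: string };

// Constrained output: every answer carries sources, or the system says so.
type Grounded =
  | { kind: "answer"; text: string; sources: string[] }
  | { kind: "unknown"; reason: string };

function answerFromSnippets(question: string, snippets: Snippet[]): Grounded {
  // No grounded source retrieved: refuse instead of letting the model guess.
  if (snippets.length === 0) {
    return { kind: "unknown", reason: `no retrieved source for: ${question}` };
  }
  // In a real system the model synthesizes from the snippets here; the key
  // constraint is that the answer can never exist without its sources.
  return {
    kind: "answer",
    text: snippets.map((s) => s.text).join(" "),
    sources: snippets.map((s) => s.source),
  };
}
```

The point of the union type is that downstream code is forced to handle the unknown case; there is no shape in which an ungrounded answer can flow through.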
"It burns tokens." Most often this means tool output is flowing unfiltered into the context, system prompts are uncached, and there is no sub-agent boundary on tangential research. The fix is output filtering, prompt caching, and scoped sub-agents. None of which require changing the model.
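The output-filtering half of that fix can start as a one-function budget cap before tool output enters the context. The 4-characters-per-token ratio below is a rough heuristic, not a real tokenizer:

```typescript
// Rough heuristic: most English text averages around 4 chars per token.
const CHARS_PER_TOKEN = 4;

function capToolOutput(output: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (output.length <= maxChars) return output;
  // Keep head and tail; the interesting part of a long log usually
  // lives at one end, and the middle is repetitive noise.
  const half = Math.floor(maxChars / 2);
  return (
    output.slice(0, half) +
    `\n… [${output.length - maxChars} chars elided] …\n` +
    output.slice(-half)
  );
}
```

A proxy like RTK does this more intelligently, but even this crude cap stops a single noisy test run from eating a six-figure-token context window.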
"It doesn't actually ship anything." Most often this means the team is using AI for code generation but skipped the eval harness, so they have no signal about whether the generated code is correct beyond a manual spot check. The fix is treating evals as the primary deliverable of any AI feature, not an afterthought.
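An eval harness does not need a framework to start. A minimal sketch, where `feature` is a placeholder for whatever the AI feature actually does:

```typescript
type EvalCase = { input: string; expected: string };

function runEvals(
  feature: (input: string) => string,
  cases: EvalCase[],
  threshold = 0.97
): { passRate: number; gatePassed: boolean } {
  const passed = cases.filter((c) => feature(c.input) === c.expected).length;
  const passRate = passed / cases.length;
  // The gate fails the deploy, not just a dashboard, when the rate drops.
  return { passRate, gatePassed: passRate >= threshold };
}
```

Exact-match comparison is the crudest possible grader; real harnesses grade with rubrics or model judges. But a crude harness on every commit beats a sophisticated one that never gets built.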
"It works in the demo and breaks in production." Most often this means the demo had clean inputs and the production has noisy inputs, and the team never built the input-validation and graceful-degradation layer. The fix is treating the AI as one component in a system that includes input sanitization, rate limiting, fallbacks, and observability.
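The validation-and-degradation layer is ordinary defensive code. A sketch with illustrative names, not a real API:

```typescript
// Reject the noisy-input cases that never appeared in the demo:
// control characters, empty strings, and inputs past a length budget.
function sanitizeInput(raw: string, maxLength = 4000): string | null {
  const cleaned = raw.replace(/\p{C}/gu, "").trim();
  if (cleaned.length === 0 || cleaned.length > maxLength) return null;
  return cleaned;
}

// Degrade gracefully: a safe canned answer beats a 500 or a hallucination.
async function withFallback(
  call: () => Promise<string>,
  fallback: string
): Promise<string> {
  try {
    return await call();
  } catch {
    return fallback;
  }
}
```

Rate limiting and observability hooks belong in the same layer; the principle is that the model call is wrapped, never naked.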
In every case the actual problem is one layer underneath the AI. In every case the fix is a senior engineering practice that has existed for years and now applies to AI.
Ten Mistakes I See Constantly
These are the patterns that produce the most frustration per session in B2B AI work in 2026. Each has a one-line fix. Each fix is something a working AI engineer does without thinking about it.
- Pasting entire files when 40 lines would suffice. Fix: extract the relevant slice with `grep -A 20` or your editor's symbol search before sending.
- Letting tool output flood the context. Fix: wrap dev commands with a token-aware proxy such as RTK, covered in Stop Burning Claude Tokens.
- Asking the model to recall facts instead of fetching them. Fix: connect an MCP retrieval server for the fact source, or use a tool call. Stop expecting the model to know your private data.
- Running every reasoning step on the most expensive model. Fix: route by complexity. Haiku 4.5 for classification and extraction, Sonnet 4.6 for general reasoning, Opus 4.6 for architectural decisions.
- No eval harness. Fix: write 30 input/output pairs that represent the actual task before you write any prompts. Run them on every change.
- One giant system prompt that contains everything the agent might need. Fix: split into a stable cached prefix and a per-task instruction layer. The cache hit rate is the metric.
- Long conversation histories that drift off-task. Fix: end the session and start fresh when the goal shifts. Conversational continuity has a token cost and a cognitive cost.
- Assuming the model knows the codebase. Fix: provide a CLAUDE.md, a project map, or a tool that lets the agent enumerate structure. The model is not psychic about your repo conventions.
- No structured output for downstream code. Fix: use Zod schemas or JSON Schema constraints when the output feeds another system. Free-form prose is fine for humans, never for pipelines.
- Treating prompt engineering as the whole job. Fix: agent design, retrieval, evals, observability, and token economics matter more than prompt phrasing in 2026. Anyone selling you the opposite is selling a 2023 product.
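The routing rule from that list can be as small as a lookup. The model identifier strings below are illustrative, not official provider API ids:

```typescript
type TaskKind = "classification" | "extraction" | "reasoning" | "architecture";

function routeModel(task: TaskKind): string {
  switch (task) {
    case "classification":
    case "extraction":
      return "haiku-4.5"; // cheap and fast; good enough for narrow tasks
    case "reasoning":
      return "sonnet-4.6"; // the general-purpose default
    case "architecture":
      return "opus-4.6"; // reserve the expensive model for hard decisions
  }
}
```

A production router adds fallback rules and cost attribution, but even this three-tier switch routinely cuts spend by a large multiple versus running everything on the top model.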
The list is unglamorous. None of these are clever tricks. They are the same kind of practices that separate a senior engineer from a junior on any non-AI project. The work has shifted from prompt phrasing to system design.
The Engineering Tax of Not Knowing This
A B2B company that does not have these practices internalized will pay for the gap in three ways.
The first cost is direct, in the form of token spend that grows faster than feature value. A team without context discipline routinely spends 4-6x more per feature than a disciplined team. At enterprise scale that gap is six figures per month. The CFO sees the bill before the head of engineering sees the productivity gain, and the project gets cut.
The second cost is reliability. Features that work in QA and break in production erode customer trust faster than missing features do. A B2B customer paying $50,000 a year is patient with a missing feature. They are not patient with a feature that gives wrong answers 8% of the time. The reputational damage from shipping unreliable AI is significantly worse than the reputational damage from shipping nothing.
The third cost is hiring drift. Companies that struggle with AI tend to hire prompt engineers, then realize prompt engineering is a small subset of the actual work, then re-hire AI engineers, then realize the original team cannot collaborate effectively with the new specialists. A clear architectural plan from the start avoids the back-and-forth and the morale damage that comes with it.
What an AI Engineer Actually Does
The job in 2026 is closer to senior platform engineering than to anything that existed in 2023. The deliverables look like this.
A working eval harness with at least 50 task-representative examples, run on every commit, gating production deployment. A token-economics audit baseline plus a monthly review of where the budget actually goes. A model-routing layer with explicit fallback rules and cost attribution. A prompt-caching strategy with measured hit rates. A retrieval and tool-use architecture that grounds the model in current facts. An observability stack that surfaces hallucination rate, latency, and cost per request. A runbook for what happens when the model upgrades or a provider has an outage.
Those are not optional add-ons. They are the architecture. The prompts are the easy part.
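The cost-attribution deliverable, for instance, can start as a per-request log and two functions. The per-million-token prices below are placeholders; plug in your provider's actual rates:

```typescript
type RequestLog = { model: string; inputTokens: number; outputTokens: number };

// Illustrative $/1M-token rates, not real pricing.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  "haiku-4.5": { input: 1, output: 5 },
  "sonnet-4.6": { input: 3, output: 15 },
  "opus-4.6": { input: 15, output: 75 },
};

function costUsd(log: RequestLog): number {
  const p = PRICE_PER_MTOK[log.model];
  return (log.inputTokens * p.input + log.outputTokens * p.output) / 1_000_000;
}

// Attribute spend by model so the monthly review starts from numbers,
// not from guesses about where the budget went.
function costByModel(logs: RequestLog[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const l of logs) totals[l.model] = (totals[l.model] ?? 0) + costUsd(l);
  return totals;
}
```

Once every request emits a `RequestLog`, cost per request, cost per feature, and cache savings all fall out of the same data.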
When a senior AI engineer is missing from a B2B AI project, those layers do not get built, and the team is left arguing about prompt wording in Slack while their costs climb.
Where This Leaves You
If your team has been working on an AI feature for months without a clean path to shipping, the model is almost never the problem. The skills, the architecture, and the operating practices are usually where the gap sits.
The fix is straightforward but it is not fast. A team that takes the practices seriously becomes productive in 60-90 days. A team that keeps blaming the model stays frustrated until they replace the engineering leadership or hire someone who has been through this loop before.
I would rather see B2B engineering leaders treat AI as a discipline that takes a quarter to learn than as a feature that should just work. The teams that frame it the first way ship. The teams that frame it the second way write articles complaining about AI.
Stuck on an AI project that hallucinates and won't ship?
Most AI projects stall because nobody on the team knows how to design agents, manage token budgets, or wire production evals. I build that layer for B2B companies so the feature actually ships and keeps shipping.
Senior engineer turned AI specialist. React, Next.js, AWS, agent orchestration.
Direct collaboration across UAE, Europe, and US time zones.
Discovery, role design, MCP integration, evals, and production deployment.
Subscribe to the newsletter for working patterns from real B2B AI engineering work.
