We are living through one of the fastest paradigm shifts in software engineering history, and most developers are so deep inside it they haven't noticed the shape of the thing. Step back far enough and a clear progression emerges — five distinct eras in how programmers have related to AI, each one rendering the previous era's hard-won expertise slightly embarrassing in retrospect.
This is the arc: from prompt whispering, through vibe coding, into context engineering, and now arriving at the frontier of harness engineering. Every step on that path tells you something about where the real leverage in AI-assisted development actually lives.
// Era 01 — The Prompt Whisperers (2022–2023)
In the beginning, getting useful output from a large language model felt like a dark art. There were tricks — specific incantations that, when applied correctly, dramatically improved what you got back. The community catalogued them obsessively: chain-of-thought prompting ("think step by step"), role assignment ("you are a senior software engineer with 20 years of experience"), few-shot examples, self-consistency sampling, and the infamous cargo-cult additions like "take a deep breath before answering" or — the meme that launched a thousand LinkedIn posts — appending "make no mistakes" to the end of your request.
That last one wasn't entirely wrong. It worked, a little, for the same reason all of these tricks worked: models are next-token predictors, and the tokens immediately preceding the output shape its character. Priming the model with "expert, careful, no mistakes" language genuinely nudged outputs in that direction because that's the kind of text that follows such language in training data.
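Stacked together, a typical 2023-era prompt looked something like this. A reconstruction only: the helper and the example task are invented for illustration, not any tool's API.

```python
def build_2023_prompt(task: str, examples: list[tuple[str, str]]) -> str:
    """Compose the era's standard tricks into one prompt string."""
    parts = [
        # Role assignment: prime the model with "expert" tokens.
        "You are a senior software engineer with 20 years of experience.",
    ]
    # Few-shot examples: demonstrate the input/output shape you want.
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {task}")
    # Chain-of-thought, plus the cargo-cult closers that sort of worked.
    parts.append("Think step by step. Take a deep breath. Make no mistakes.")
    return "\n\n".join(parts)

prompt = build_2023_prompt(
    "Reverse a linked list in Python.",
    [("Sum a list of numbers.", "def total(xs): return sum(xs)")],
)
```

Every line of that closer nudges the token distribution the same way: toward the kind of careful, expert text that follows such phrases in training data.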
The "take a deep breath" trick, hilariously, had real research behind it. Google DeepMind's 2023 OPRO paper used LLMs to automatically optimize prompts for other LLMs, and discovered the single most effective phrase was literally "Take a deep breath and work on this problem step by step." Applied to PaLM 2, that prompt achieved 80.2% accuracy on test problems — versus 34% with no prompt guidance at all. Every dev who mocked the phrase was technically wrong.
This era had a real skill ceiling. The best prompt engineers were genuinely valuable — they understood context windows, knew how to structure information for model attention, and could coax reliable behavior out of systems that were, at base, fundamentally unreliable. Dedicated "Prompt Engineer" job listings appeared on every major job board. On Indeed, searches for the role spiked from 2 per million in January 2023 to 144 per million by April 2023.
Then the models got smarter, the tricks stopped mattering, and those job listings quietly vanished. By 2025 a Microsoft survey ranked "Prompt Engineer" second-to-last among roles companies were considering adding. IEEE Spectrum published a piece bluntly titled "AI Prompt Engineering Is Dead."
// Era 01.5 — The One Trick That Survived
Not every prompt trick became obsolete. One in particular has a real mechanistic basis that persists regardless of model capability: restating your goal at the end of a long prompt.
The reason it works is rooted in how transformer attention actually functions. A 2023 paper titled "Lost in the Middle: How Language Models Use Long Contexts" (arXiv: 2307.03172) measured this directly: LLM performance follows a U-shaped curve based on where information sits in the context. Relevant information at the very beginning or very end gets the highest attention weight. Bury it in the middle and performance can drop by 30% or more. One architectural contributor is Rotary Position Embedding (RoPE), which introduces a long-term decay effect: the model structurally downweights tokens far from its current position. This is not a quirk that better models fully escape; it persists at scale, even in models with 200K+ token windows.
The practical consequence: if you spend five paragraphs setting up a complex coding task and then state your actual objective once at the top and never again, by the time the model is generating its response the original goal may be getting outcompeted by all the middle content. Restating the core objective at the end of the prompt ensures it sits in the recency-favored position during generation — essentially anchoring the model's attention to what you actually want right before it starts writing.
This isn't superstition. It's exploiting a real property of the architecture. Among all the prompt engineering folklore, this one earned its place.
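The fix is mechanical enough to automate. A minimal sketch, assuming nothing beyond string concatenation; the helper name is mine, not any tool's API:

```python
def anchor_objective(objective: str, context_sections: list[str]) -> str:
    """Place the objective at both attention-favored positions:
    the start of the prompt (primacy) and the end (recency)."""
    parts = [f"Objective: {objective}"]       # primacy slot
    parts.extend(context_sections)            # the lost-in-the-middle zone
    parts.append(f"To restate the objective: {objective}")  # recency slot
    return "\n\n".join(parts)

prompt = anchor_objective(
    "Add retry logic to the payment client without changing its public API.",
    ["<architecture notes>", "<style conventions>", "<five paragraphs of setup>"],
)
```

However long the middle grows, the goal still occupies both ends of the U-shaped attention curve.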
// Era 02 — Vibe Coding (Early 2025)
On February 2, 2025, Andrej Karpathy — co-founder of OpenAI, former head of AI at Tesla — posted something that ricocheted around the internet: "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
The post was viewed over 4.5 million times. The New York Times, Ars Technica, and The Guardian wrote about it. Merriam-Webster listed it as a slang entry by March 2025. Collins English Dictionary named it Word of the Year.
Karpathy's description was precise: accept all AI-generated code changes without reviewing diffs, paste error messages back to the model verbatim, let the codebase grow organically beyond your comprehension. His own caveat ("not too bad for throwaway weekend projects, but still quite amusing") got lost in the retelling.
Vibe coding captured a real moment: models had become good enough that you could describe software in plain English and get functional results, fast. For prototyping, for personal projects, for throwaway experiments, it genuinely worked. The meme landed because it described something true about the current state of the tools.
But serious engineers ran into the wall quickly. Accepting every diff without reading it means inheriting every bug, security hole, and bad pattern the model decides to introduce. Vibe coding at production scale is how you build a system nobody on the team understands — including the AI, which has no persistent memory of the choices it made two sessions ago. The vibes, it turned out, did not scale.
A randomized controlled trial (METR, 2025) studying experienced developers working on real open-source repositories found that using AI tools made them 19% slower on non-trivial tasks, not faster. The cognitive overhead of reviewing and verifying AI output erased the generation speed gains. By September 2025, Fast Company was reporting a "vibe coding hangover," with senior engineers describing development hell when inheriting AI-generated codebases.
Karpathy, for his part, later called the original post a "shower-thoughts throwaway tweet that I just fired off" that "minted a fitting name at the right moment." He clarified that the throwaway-weekend-project framing had always been the point, a nuance that got lost somewhere between 4.5 million views.
// Era 03 — Engineers Reclaim the Wheel
The backlash was quiet but firm. Senior engineers started posting the same observation in different words: this is a tool, and I'm still the engineer. They weren't rejecting AI-assisted development — they were rejecting the abdication of engineering judgment it seemed to require.
What emerged was a more disciplined posture: use the AI to generate, but review everything it produces. Check that the code actually does what it claims. Enforce style conventions. Run the tests. Treat the model as a capable but junior collaborator who occasionally hallucinates with complete confidence.
Andrew Ng's research from this period illustrated the underlying dynamic well. GPT-3.5 with a single prompt achieved 48.1% accuracy on the HumanEval coding benchmark. GPT-4 with a single prompt reached 67%. But GPT-3.5 wrapped in an agentic workflow — where the model iterates, checks its own output, and self-corrects — hit 95.1%. The model mattered less than the process wrapped around it.
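The workflow behind those numbers can be sketched as a generate-verify-correct loop. This is a sketch only: `generate` stands in for whatever LLM call you use, and `run_tests` for your verification step.

```python
def agentic_solve(task, generate, run_tests, max_iters=5):
    """Wrap a model in an iterate / self-check / correct loop.

    generate(task, previous, feedback) -> candidate code (stand-in LLM call)
    run_tests(code) -> (passed: bool, feedback: str)
    """
    code, feedback = None, None
    for _ in range(max_iters):
        code = generate(task, previous=code, feedback=feedback)
        passed, feedback = run_tests(code)
        if passed:            # verified: stop iterating
            return code
    return code               # best effort after max_iters
```

The point of the benchmark gap is that this loop, not the model inside it, supplied most of the accuracy.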
This was the first sign that the real work wasn't in better prompts. It was in better systems.
// Era 04 — Context Engineering (2025)
The term crystallized in June 2025, when two people posted about it within days of each other. Shopify CEO Tobi Lütke on June 19th: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM." Then Karpathy on June 25th, adding his weight: context engineering is "the delicate art and science of filling the context window with just the right information for the next step." By July, Gartner was stating explicitly: "Context engineering is in, and prompt engineering is out."
The framing shift matters. Prompt engineering asks: what do I say? Context engineering asks: what does the model need to see?
In practice, this meant building structured, persistent context that AI coding tools could consume at the start of each session. The canonical artifact is the CLAUDE.md file — a project-level manifest checked into your repo that tells the AI your architecture decisions, coding conventions, known gotchas, and anything else it would need to onboard like a new team member. (Similar files exist for other tools: .cursorrules, AGENTS.md; there are active efforts to standardize the format.)
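A minimal skeleton gives the flavor. Every project detail below is invented for illustration; real manifests are specific to their codebase.

```markdown
# CLAUDE.md

## Architecture
- Monorepo: `api/` (FastAPI) and `web/` (React); never import across the boundary.

## Conventions
- Python is typed and linted (`ruff`, `mypy --strict`); tests live beside the module they cover.

## Known gotchas
- The legacy `billing/` module stores amounts in integer cents; everything else uses `Decimal`.

## Commands
- `make test` runs the full suite; `make check` is the pre-commit gate.
```

The test of a good manifest is the same as for human onboarding docs: could a capable stranger make a safe change on day one with nothing else?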
The research backed the approach up hard. Projects with well-maintained context manifests showed a 29% reduction in agent task runtime. Simple optimizations to a CLAUDE.md alone produced roughly a 10% accuracy improvement on coding tasks. Context accumulation without management caused a 30% accuracy drop — models buried under irrelevant conversation history performed measurably worse.
As Anthropic put it plainly: "Claude is already smart enough — intelligence is not the bottleneck. Context is."
Context engineering turned out to have real depth. Early adopters moved from a single CLAUDE.md to tiered architectures — some projects reaching 26,000 lines of structured context. Path-scoped rules loaded only when relevant files were touched. Skills systems enabled on-demand loading of domain-specific context. The discipline matured fast.
// Era 05 — Harness Engineering (2026)
Context engineering solves the "what does the model see" problem. Harness engineering asks a bigger question: what is the system around the model that makes it reliably produce correct outcomes, at scale, without constant human supervision?
An agent harness is the infrastructure wrapped around an AI model — the steering and brakes, not the engine. It manages the agent's lifecycle, constrains what it can do, feeds it context, verifies its outputs, and corrects it when it drifts. The model provides the intelligence; the harness provides the reliability.
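In code terms, the smallest possible harness is a propose/verify/correct loop around the model. A sketch under invented names, not any vendor's API; real harnesses add sandboxing, lifecycle management, and persistent memory on top.

```python
def run_with_harness(task, agent_step, checks, max_attempts=3):
    """Minimal harness: the agent proposes, deterministic checks verify,
    and failures are fed back as correction context.

    agent_step(task, corrections) -> candidate output (stand-in agent call)
    each check(output) -> error message string, or "" when the check passes
    """
    corrections = []
    for _ in range(max_attempts):
        output = agent_step(task, corrections)              # the engine
        failures = [m for c in checks if (m := c(output))]  # the brakes
        if not failures:
            return output                                   # verified outcome
        corrections.extend(failures)                        # steer the next attempt
    raise RuntimeError(f"agent failed verification: {failures}")
```

Notice that the intelligence lives entirely inside `agent_step`; everything the harness contributes is deterministic.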
The proof of concept came from inside OpenAI itself. Over five months, a small team of engineers used Codex agents to build and ship a product containing roughly one million lines of code — with no manually written source code. Three engineers drove approximately 1,500 pull requests through CI over that period: 3.5 PRs per engineer per day, a throughput that increased as the team grew to seven. The code included application logic, documentation, CI configuration, observability setup, and tooling.
What made it possible wasn't the model alone. It was the harness: architectural constraints enforced by deterministic linters, "garbage collection" agents that ran periodically to detect documentation drift and structural violations, a continuously maintained knowledge base that served as the project's long-term memory. When agents faithfully reproduced suboptimal patterns from existing code — what the team called "AI slop" — the answer wasn't better prompts. It was golden principles: opinionated, mechanical rules baked directly into the repository and enforced by the harness.
The epistemological shift is profound, and it has been stated about as crisply as it can be:

"The most significant change in an agent-first workflow is not technical — it's epistemological. The engineer's job shifts from producing correct code to producing an environment in which an agent reliably produces correct code."

Producing correct code and producing an environment that reliably produces correct code are fundamentally different problems, and they require fundamentally different skills.
LangChain demonstrated the stakes clearly: their coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 by changing nothing about the underlying model. They only changed the harness. Same model, better environment, dramatically better results.
Anthropic's own Nicholas Carlini ran an experiment on the other end of the complexity spectrum: orchestrating 16 parallel Claude agents across 2,000 sessions to build a 100,000-line C compiler from scratch. The compiler successfully built PostgreSQL, Redis, FFmpeg, CPython, and the Linux 6.9 kernel. Carlini's reflection: "Most of my effort went into designing the environment around Claude — the tests, the feedback loops — so that it could orient itself without me." That's the harness. The model was capable from day one; the harness was what made it useful.
// The Divergence: Claude Code vs. Codex
The two most prominent AI coding tools of 2026 — Anthropic's Claude Code and OpenAI's Codex — illustrate how harness engineering is already fracturing into distinct philosophies.
Claude Code is terminal-native and local-first. It lives in your shell, reads your entire codebase, and works interactively — showing its reasoning and pausing at decision points. Its context system is deep: CLAUDE.md files, path-scoped rules, a Skills system for on-demand domain loading, MCP integrations, and Agent Teams that spawn coordinated sub-agents sharing a task list with dependency tracking. The philosophy is: the engineer stays in the loop, the agent is a powerful collaborator, and the harness is built into the project itself.
OpenAI Codex is cloud-first and asynchronous. You delegate a task, it spins up an isolated sandboxed environment, writes the code, runs the tests, and hands you a pull request. Its agent isolation model — each task in its own container — optimizes for parallel throughput rather than coordinated depth. The philosophy is closer to delegation: hand off work, review results.
Neither approach is universally correct. For complex refactors where subtasks have deep dependencies, Claude Code's coordinated agent model wins. For independent, parallelizable tasks — "migrate 200 React components from Class to Hooks" — Codex's isolation model is more efficient. The interesting question isn't which tool is better; it's what the divergence reveals about where the industry is heading.
The answer: harness engineering is becoming a moat. You can fine-tune a competitive model in weeks. Building production-ready harnesses takes months or years. The teams building sophisticated harnesses now are accumulating advantages that compound — not in the model, but in the system of constraints, context, verification, and correction wrapped around it.
// Where This Ends Up
The trajectory from prompt whispering to harness engineering is, in retrospect, inevitable. Every iteration moved the leverage point further from the model and further into the system. Better prompts → better context → better infrastructure. The model got smarter; the discipline around it got more sophisticated.
Karpathy — who, to his credit, has named at least two of these eras himself — now calls the current moment "agentic engineering": "'agentic' because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do — 'engineering' to emphasize that there is an art and science and expertise to it."
The art and science of it is the harness. The engineers who figure that out first aren't just going to ship faster — they're going to build systems that sustain quality at scale in ways that no prompt trick, no matter how clever, ever could.
The cursor still blinks. But now it's waiting on an agent, and the agent is waiting on its harness.