The shapeshifting engineer
This post is for the mids, seniors, and principals trying to work out what our jobs look like now that the machine writes the code. If you’re early in your career and wondering whether the whole thing is still worth pursuing, that’s a different conversation and I’ll write it up separately soon.
I’ve been building software for over a decade, and these days I use AI coding tools every working day. The velocity is genuinely something. I’ve shipped features in an afternoon that would have taken me most of a week, and cleared backlog items that sat untouched for a year because the effort never justified the payoff.
It’s fun and it’s fast, and it also keeps poking at a question I find harder to wave away every month: if the AI writes the code, what is it that we actually do?
After a year living in these tools, my answer is that the job is changing shape rather than going away. The parts that are growing are the ones that were always underneath the typing: working out what’s actually worth building, and judging whether what the machine hands back is any good. The rest of this is where I think that leaves you, whatever level you’re at, and what I’d do about it now.
The frontier split into two tiers
The standout AI moment of 2026 for me wasn’t one of the launches we all got to play with. In April, Anthropic disclosed a model, codenamed Mythos, so good at finding and exploiting software bugs that they held it back from release. It found fresh zero-days in every major operating system and browser, and dug up a 27-year-old hole in OpenBSD, an OS people pick precisely because it’s so hardened. It did eventually ship, gated, as Claude Fable 5, and then on 12 June the US government ordered it pulled worldwide along with Mythos, over a jailbreak Anthropic says only exposed minor, already-known bugs.
That’s the actual story: the frontier isn’t one thing anymore. There’s the tier you can pick up and use today (Opus 4.8, GPT-5.5, Gemini 3.1), and a tier so capable it ships gated, if it ships at all. Fable made it out behind access controls and got pulled within weeks, and Anthropic reckons other labs will field Fable-class models with no safeguards inside 6 to 12 months. It’s worth keeping some scepticism, though. The firm Aisle reportedly reproduced some of those old bugs with small open-source models, and Cloudflare found the model great at finding bugs but weak at fixing them. The capability leap is real, but the disclosure also landed right before Anthropic’s confidential S-1, so some of the story around it is commercial.
Here’s why it matters for the rest of us, beyond the security teams. A model that surfaces ten thousand bugs but safely patches almost none of them doesn’t take the engineer out of the loop. It hands them a much bigger backlog. Anthropic’s own Glasswing consortium has already reported over 10,000 high-severity bugs, only around 14% of them patched so far. Finding the problems is the part that got automated, while the judgement to work out which ones matter and fix them without breaking three other things stayed human. The better these models get at the flashy part, the more downstream work they pile onto whoever has to make the output safe.
The gap between writing code and engineering software
AI got dramatically better at writing code, but it didn’t get dramatically better at engineering software, and the gap between those two things is where our jobs now live.
METR ran a randomised controlled trial with experienced open-source developers working on their own mature codebases. The developers came out 19% slower with AI tools, while believing they’d been 20% faster. The tools generate beautifully, but they introduce review overhead that experienced engineers actually feel, because experienced engineers actually read the output.
Roychoudhury and Zeller make the underlying point in their paper: at least half the effort in software engineering goes into understanding software that already exists, and that understanding needs domain and program-specific knowledge you can’t get from processing syntax. Their closing question is the one engineering leaders should sit with. If something catastrophic happens with a codebase produced this way, who’s left who understands it well enough to fix it?
Amodei baked the same distinction into his own forecast when he predicted AI writing 90% of code. The programmer, he said, still specifies the overall app and the overall design decisions. Six months on, Redwood Research went and checked the numbers and reckoned the company-wide average for merged lines is more like 50%, a long way from 90%. Even inside Anthropic the headline claim doesn’t quite survive contact with the data.
Kent Beck said it best: 90% of his skills went to zero dollars, and 10% went up a thousandfold. He’s honest that he doesn’t yet know exactly which skills landed in which bucket, and the only way to find out is to try a lot of ideas and see what still pays.
Sam Altman framed the 2026 reality as developers spending 30% of their time writing code and 70% on architecture, design, and review, roughly inverted from where it sat two years ago. Simon Willison laid out what that 70% is: researching approaches, deciding on architecture, writing specs, defining success criteria, designing agentic loops, planning QA. If you read that list and thought “those are all things a senior engineer already does”, that’s the point.
So an experienced engineer’s job doesn’t shrink, it shifts: less of it is producing code, more is the judgement around what gets built and why.
The data on what happens when you crank up production without a matching increase in judgement is genuinely alarming. Faros AI tracked more than 22,000 developers and found PR merge rate up 16% while review time climbed 441%. Incidents per pull request rose 242%, and almost a third more PRs merged with no review at all. They call it Acceleration Whiplash.
It’s Jevons paradox showing up in our own work: when something gets cheaper to make, we make far more of it, not less. Cheaper code means more code, and that means more of it to review and keep working, which was always the expensive part of the job. And the backlog never empties, it just grows, because the work that was never worth the effort before is suddenly cheap enough to be worth doing.
The bottleneck has moved off writing the code and onto checking it. Anthropic themselves point out the obvious version of this: as they push more and more code through the org, human review is what everything else now waits on. What people do about that is where it gets interesting, because plenty of teams are trying to take the human out of that step rather than add more reviewers, whether that’s one AI reviewing another, agents left running in a loop to fix and re-run on their own, or leaning hard on automated tests so the test suite is the gate instead of a person. So the skill that lasts is a step up from reviewing every change yourself: it’s owning how the checking works at all, what you trust to run on its own and where a human still has to put their name on the result. That call is the part that stays yours.
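To make "owning how the checking works" a bit more concrete, here is a minimal sketch of a merge-gate policy written down as code. Everything in it is an assumption for illustration: the `Change` fields, the 200-line threshold, and the three outcomes are hypothetical, not taken from any real tool or from how any team mentioned here works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A proposed change, reduced to signals a gate can check cheaply.

    All fields are hypothetical examples of what a team might track.
    """
    tests_passed: bool              # result of the automated test suite
    touches_auth_or_payments: bool  # stand-in for "high blast radius" code
    has_migration: bool             # includes a schema or data migration
    lines_changed: int


def required_gate(change: Change, auto_merge_limit: int = 200) -> str:
    """Return the gate a change must clear: 'blocked', 'human', or 'auto'."""
    if not change.tests_passed:
        return "blocked"  # nothing merges on a red suite
    if change.touches_auth_or_payments or change.has_migration:
        return "human"    # a person puts their name on risky changes
    if change.lines_changed > auto_merge_limit:
        return "human"    # too large to trust the suite alone
    return "auto"         # small, green, low-risk: the suite is the gate


if __name__ == "__main__":
    small_fix = Change(tests_passed=True, touches_auth_or_payments=False,
                       has_migration=False, lines_changed=40)
    schema_change = Change(tests_passed=True, touches_auth_or_payments=False,
                           has_migration=True, lines_changed=40)
    print(required_gate(small_fix))      # auto
    print(required_gate(schema_change))  # human
```

The rules themselves matter less than the fact that they are explicit, reviewable, and versioned, so the decision about where a human signs off is made once and on purpose rather than per pull request.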
When an agent can grind through the build over a weekend, what separates a good engineer from a great one is more and more the spec they hand it. Not a vague requirements doc, but a tight set of rules: what must not change, what has to hold when things break, and what “done” looks like once it’s live. The spec becomes the thing you’re really building. Holding a messy, contradictory system in your head and bringing some order to it is what good engineers have always done; it just matters far more now.
The world model problem
Yann LeCun has been raising a deeper question loudly for years. He left Meta in late 2025, founded AMI Labs in Paris, and raised $1.03 billion, which was the largest seed round in European startup history. His thesis is that predicting the next word one at a time, which is roughly all an LLM does, is a dead end for human-level intelligence. In a November 2025 lecture he put it more bluntly: “the path to superintelligence via LLMs is complete bullshit. It’s just never going to work.”
The argument behind the swearing is precise. LLMs generate one token at a time, each one conditioned on everything before it. Over a long reasoning chain the errors compound, and there’s no mechanism to backtrack or to simulate several possible futures and pick one. They work in words and symbols, which suits language and falls apart for the messy, physical real world. They have no world model, so an LLM can recite gravity from a textbook without grasping in any useful sense that a glass pushed off a table will smash.
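The compounding claim is easy to put numbers on with a deliberately crude model. Assume, and this is my simplification rather than anything taken from the lecture, that every step in a chain is correct independently with the same probability and that nothing ever goes back to fix a mistake. The chance of an error-free chain is then that probability raised to the number of steps.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes steps succeed independently with the same accuracy and that
    errors are never corrected. Both are simplifying assumptions.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    for steps in (10, 100, 1000):
        p = chain_success_probability(0.99, steps)
        print(f"{steps:>5} steps at 99% per step: {p:.1%} chance of no error")
```

At 99% per step that is roughly 90% for 10 steps and about 37% for 100. Real systems are not independent coin flips, and tests or retries can catch errors along the way, so treat this as the shape of the argument rather than a measurement.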
His alternative is JEPA, the Joint Embedding Predictive Architecture. Rather than predict the next token, it predicts in an abstract representation space, learning the patterns that matter and throwing away noise it can’t predict. A toddler learns gravity by watching the world, long before any physics lesson. JEPA is an attempt to formalise that.
It’s still early research rather than product, but it’s moving: Meta’s V-JEPA 2 trained on a million hours of video and leads on motion understanding and action anticipation, with follow-up work firming up the theory and shrinking the models to a single GPU.
This matters to us because it names a real limitation in the tools we lean on every day. LLMs are pattern matchers over text, and software engineering happens out in the world, in rooms with other people, in conversations with stakeholders who can’t quite say what they want, in systems wired into physical infrastructure and human behaviour. The models are brilliant at the text part, and everything that isn’t text is where they wobble.
LeCun might be wrong. He’s spent three years calling LLMs a dead end while they got steadily, dramatically more capable, and even his own people admit world models could be years away from anything commercial. But a Turing Award winner has just put a billion dollars behind a falsifiable architectural bet against the direction the whole field is running in. Even if you think he’s mistaken, his claim about what LLMs can’t model, the physics and causality and continuous reality of it, describes almost exactly the territory where experienced engineers earn their keep.
AI works best when the environment is machine-controlled
A pattern I keep tripping over ties a lot of this together: AI performs best when the whole environment is itself machine-controlled, and it degrades the moment it hits a human-built edge.
Self-driving cars are happiest in instrumented, geofenced zones where every vehicle can talk to every other vehicle. The hard part was never the driving, it’s the humans, and autonomous vehicles get rear-ended because they follow the rules strictly while human drivers misjudge them. The academic work is blunt about it: if every vehicle on the road were autonomous, the coordination problem would mostly vanish, and what’s left unsolved is legacy cars and road users who aren’t connected to anything.
Warehouses tell the same story. Gartner predicts half of new warehouses in developed markets will be built for robots first by 2030, with people optional, and the word doing the work is “new”, because designing fresh for robots beats retrofitting a space made for people. The same robots that hit millimetre precision in a controlled cell still fail against clutter, slippage, and plain uncertainty.
You see exactly the same thing in code. Agents shine on a clean, fresh project where the whole thing fits in view, and they struggle on an existing system with years of history behind it: live dependencies, undocumented rules, decisions nobody remembers making. One team pointed Claude at a big, mature Django app and got tidy, confident-looking changes that quietly broke its links to outside services, because the agent only sees fragments and misses the shape of the whole thing. Old systems are full of invisible contracts: assumptions about how data moves, integration quirks, rules baked into code long after the people who wrote them have gone.
The pattern is that AI is brilliant inside neat, closed worlds it can fully see, and it breaks at the human-shaped edges, and most of the real world is human-shaped: the software we maintain, the roads we drive on, the warehouses that already exist, the organisations we work inside.
The industry’s response tells you something. We’re busy rebuilding our tools so machines can read them: context files that brief an agent, standard ways for agents to call other services, little manifests that explain how to use a system. Karpathy’s Software 3.0 talk argues our human-facing setup is unreadable to AI and needs rebuilding for machines, and UC Berkeley’s Dawn Song describes an agentic web made for AI agents rather than human browsers.
We are, quite literally, reshaping the environment to suit the AI. The engineers who understand both sides of that line, the tidy world the AI runs in and the messy human one it’s meant to serve, are the ones with the leverage.
When AI builds itself
In June 2026, Anthropic put out an essay that read differently from the usual AI hype. “When AI builds itself” laid out their internal numbers. As of May 2026, Claude authored over 80% of the code merged into Anthropic’s own codebase. Engineers merge something like 8x as much code per day as they did in 2024. Open-ended task success hit 76%, up 50 points in six months. A kernel-optimisation research loop went from a 3x speedup with Opus 4 to 52x with the Mythos Preview model, against roughly 4x for a skilled human given 4 to 8 hours.
The essay’s central claim is that recursive self-improvement, AI autonomously improving AI, isn’t here yet but is plausible within a few years. Jack Clark puts it at a 60% chance by the end of 2028.
The timeline debate isn’t the interesting bit for us; the three scenarios they sketch are.
The first is that the trend stalls, where today’s capabilities spread widely but don’t compound. Anthropic lists it for completeness and clearly doesn’t believe it.
The second is compounding efficiency, and it’s the one they think most likely. Humans still set direction and judge results while AI does the doing, and a 100-person company starts doing the work of a 10,000-person one. In that world the experienced engineer steering a fleet of agents gets more valuable, not less, because the rare skill becomes spotting where the work is stuck and clearing it.
The third is full recursive self-improvement, where even taste and judgement get automated and humans drift toward oversight and verification of an AI-run virtual lab. Amodei captured the bear case with his islands metaphor: human expertise survives in islands for a while, and then the tide comes for the islands one by one.
My read is that the second scenario is the more likely one on a 3-to-7-year horizon, but the third isn’t a rounding error and its probability is creeping up.
Even the most aggressive timeline quietly concedes the point. The AI 2027 scenario walks month by month to superintelligence by late 2027, and even its fictional lab keeps human engineers on the payroll, because research taste in particular proved hard to train, which I think is a lovely way to describe what experienced engineers do all day.
The reassuring part is that the second and third scenarios reward the same skills: verification, safety, architecting AI systems, writing specs, and being able to look at an output and say “no, this is wrong”, and then explain exactly why.
The comforting read is that Anthropic’s own internal assessment of the model that became Fable reportedly rated its weaknesses as handling week-long open-ended tasks on its own, reading what an organisation actually cares about, taste, checking its own work, following instructions, and knowing what it really knows. Read that list back and it’s more or less the job description for a senior engineer.
The obvious objection is that those are just today’s weaknesses, and I’ve spent this whole post watching that curve blow past limits people swore were permanent. So why would taste and judgement be the parts that hold? The best answer I’ve got is that they’re the parts of the job least like text. Writing a spec that survives contact with production, or knowing which failure actually matters to the business, leans on the messy human and physical context the models keep struggling with, which is the same limit those earlier sections on world models and machine-controlled environments kept pointing at. If that’s wrong, if judgement turns out to be just another capability on the same curve, then the moat closes and a fair bit of this post goes with it. That’s the bet I’m making, and I’d rather name it out loud than pretend it’s a sure thing.
One person still can’t run everything through agents
Say we get AGI tomorrow. Does the CEO fire everyone and run the whole operation through a chat box? No.
Someone has to understand what the business actually needs, and the business is usually terrible at saying what that is. Someone has to turn vague requirements into technical decisions. Someone has to look at an AI-generated architecture and say it’ll fall over the first time real traffic hits it. Someone has to make the call on build versus buy, on which debt is worth taking on, on how to structure systems for problems that nobody has fully defined yet.
Even with a fleet of agents helping, every one of them needs pointing somewhere and every output needs judging, and that judging takes context that lives in people’s heads and in the genuinely messy reality of how an organisation works.
Annie Vella put words to the emotional side of this in her essay on the identity crisis. We’re shifting from building things to overseeing the agents that build them, which lands us somewhere that looks suspiciously like management. That’s uncomfortable and I don’t always love it, but it’s where the value is going.
What to do about it
If you’re mid-level, the pressure lands hardest on you, because your main output is solid, competent code, and that’s exactly what the AI now turns out by the yard. The way through is to stop being only a coder. Go deep in one area until you understand the business behind it, not just the code. Get properly fluent with the tools. And build the kind of judgement that lets you look at an AI’s design and see it’ll fall over the first time real traffic hits it. Roles that ask for two or more AI skills already pay around 43% more, so none of this is abstract.
If you’re senior, your work is shifting from writing the code to setting things up so the AI writes it well. That means the scaffolding around the work: clear docs, clean interfaces, the context files and tool setups that let an agent move through your systems without flailing. In my experience that’s the highest-leverage thing you can do to make AI coding pay off on a real, lived-in codebase, instead of getting confident changes that quietly break in production.
If you’re principal, the same move scales up, and there’s a shift that’s easy to miss. Your main tool across team boundaries used to be influence: write a proposal, hope it lands on someone’s roadmap. Now you can turn up with a working prototype attached to the argument. Point an agent at a scoped cross-team problem, harden what it gives you, and arrive with a pull request instead of a ticket. Getting a cross-team change off the ground used to be most of the fight, and that’s fallen away faster than anything else I’ve seen this year.
Whatever your level, move your main craft from the code to the spec. Get good at writing down exactly what you want: what must not change, what has to hold when things break, what “done” looks like once it’s live, with monitoring and rollback wired in. Keep those specs in version control next to the code. Once an agent can grind out the build overnight, the spec is the real thing you’re working on.
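One way to keep a spec honest is to give it a fixed shape and refuse to hand it to an agent while sections are empty. The sketch below is one possible shape, not a standard: the `Spec` fields mirror the list above, and the names, the example feature, and the `missing_sections` helper are all hypothetical. It needs Python 3.9 or newer for the `list[str]` annotations.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """A spec with one slot per question it has to answer (hypothetical shape)."""
    goal: str
    must_not_change: list[str] = field(default_factory=list)   # invariants
    holds_on_failure: list[str] = field(default_factory=list)  # behaviour when things break
    done_when: list[str] = field(default_factory=list)         # what "done" means in production
    monitoring: str = ""                                       # how you will see it misbehave
    rollback: str = ""                                         # how you back it out


def missing_sections(spec: Spec) -> list[str]:
    """Return the names of sections that are still empty, in a fixed order."""
    sections = {
        "must_not_change": spec.must_not_change,
        "holds_on_failure": spec.holds_on_failure,
        "done_when": spec.done_when,
        "monitoring": spec.monitoring,
        "rollback": spec.rollback,
    }
    return [name for name, value in sections.items() if not value]


if __name__ == "__main__":
    draft = Spec(
        goal="Add CSV export to the reports page",
        must_not_change=["existing JSON export format"],
        done_when=["an export of 100k rows finishes without timing out"],
    )
    print(missing_sections(draft))
    # ['holds_on_failure', 'monitoring', 'rollback']
```

A check like this could run in CI next to the code, so an incomplete spec fails as visibly as a broken test. It only proves the sections exist, though; whether what is written in them is any good is still a judgement call.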
Treat reviewing and checking the AI’s output as a core skill you practise on purpose, not a chore to rush. Nearly half of AI-written code ships with a security hole in it, and the person who reliably catches that is worth more as more code gets written this way, not less.
Mentor your juniors anyway. It’s tempting to stop hiring them once an AI does the work they used to cut their teeth on, but Stanford already sees a 20% drop in employment for developers aged 22 to 25. If that holds, the seniors you’ll want to hire in 2031 are the juniors nobody is training today, and this is one we can actually help head off.
Kent Beck has the clearest take I’ve seen on what happened to our skills:
| Worth little now | Worth far more | Brand new |
|---|---|---|
| Syntax by heart | Vision | Writing specs |
| Memorising APIs | Architecture | Checking AI output |
| Raw typing speed | A feel for quality | Debugging code an agent wrote |
If most of your day is still in that first column, take it as your signal to move.
Where this ends
Nobody knows. LeCun thinks LLMs are a dead end and has put a billion dollars behind world models. The AI 2027 authors think superhuman coding shows up around 2030. Hinton thinks a lot of jobs disappear. Ng says there’s no jobpocalypse coming. The discourse is this polarised because the situation is genuinely uncertain, and I’d be sceptical of anyone who tells you they know how it lands.
What I’m fairly sure of is narrower. The tools keep getting better, but the open question is whether that curve bends toward the judgement work or just keeps making the typing cheaper, and so far it’s overwhelmingly the typing. The role keeps sliding from authorship toward orchestration, and the engineers who do well will be the ones standing on the boundary between the tidy machine world and the messy human one, translating between the two all day.
The job isn’t dying so much as shapeshifting, and if you’ve been at this long enough to have real judgement and depth in a domain, you are pretty much exactly what the market is short of right now.
So build with the tools, sharpen the judgement they can’t reproduce, and maybe stop reading the LinkedIn doomer posts.