The shapeshifting engineer

17 June 2026 · 17 min read

This post is for the mids, seniors, and principals trying to work out what our jobs look like now that the machine writes the code. If you’re early in your career and wondering whether the whole thing is still worth pursuing, that’s a different conversation and I’ll write it up separately soon.

I’ve been building software for over a decade, and these days I use AI coding tools every working day. The velocity is real. I’ve shipped features in an afternoon that would have taken me most of a week, and cleared backlog items that sat untouched for a year because the effort never justified the payoff.

It’s fun and it’s fast, and it also keeps poking at a question I find harder to wave away every month: if the AI writes the code, what is it that we actually do?

After a year living in these tools, I don’t think the job is going away, but it is changing shape. More of my time goes into deciding what’s worth building and judging whether the machine’s work is any good. Those jobs were always there underneath the typing. Now they’re much harder to ignore.

The frontier split into two tiers

The standout AI moment of 2026 for me wasn’t one of the launches we all got to play with. In April, Anthropic disclosed a model, codenamed Mythos, so good at finding and exploiting software bugs that they held it back from release. It found fresh zero-days in every major operating system and browser, and dug up a 27-year-old hole in OpenBSD, an OS people pick precisely because it’s so hardened. It did eventually ship, gated, as Claude Fable 5, and then on 12 June the US government ordered it pulled worldwide along with Mythos, over a jailbreak Anthropic says only exposed minor, already-known bugs.

That’s the actual story: the frontier isn’t one thing anymore. There’s the tier you can pick up and use today (Opus 4.8, GPT-5.5, Gemini 3.1), and a tier so capable it ships gated, if it ships at all. Fable made it out behind access controls and got pulled within weeks, and Anthropic reckons other labs will field Fable-class models with no safeguards inside 6 to 12 months. It’s worth keeping some scepticism, though. The firm Aisle reportedly reproduced some of those old bugs with small open-source models, and Cloudflare found the model great at finding bugs but weak at fixing them. The capability leap is real, but the disclosure also landed right before Anthropic’s confidential S-1, so some of the story around it is commercial.

A model that surfaces ten thousand bugs but safely patches almost none of them doesn’t take the engineer out of the loop. It hands them a much bigger backlog. Anthropic’s own Glasswing consortium has already reported over 10,000 high-severity bugs, only around 14% of them patched so far. Finding the problems is the part that got automated, while the judgement to work out which ones matter and fix them without breaking three other things stayed human. The better these models get at the flashy part, the more downstream work they pile onto whoever has to make the output safe.

The gap between writing code and engineering software

AI got dramatically better at writing code, but it didn’t get dramatically better at engineering software, and the gap between those two things is where our jobs now live.

METR ran a randomised controlled trial with experienced open-source developers working on their own mature codebases. The developers came out 19% slower with AI tools, while believing they’d been 20% faster. The tools generate beautifully, but they introduce review overhead that experienced engineers actually feel, because experienced engineers actually read the output.

Roychoudhury and Zeller make the underlying point in their paper: at least half the effort in software engineering goes into understanding software that already exists, and that understanding needs domain and program-specific knowledge you can’t get from processing syntax. Their closing question is the one engineering leaders should sit with. If something catastrophic happens with a codebase produced this way, who’s left who understands it well enough to fix it?

Amodei baked the same distinction into his own forecast when he predicted AI writing 90% of code. The programmer, he said, still specifies the overall app and the overall design decisions. Six months on, Redwood Research went and checked the numbers and reckoned the company-wide average for merged lines is more like 50%, a long way from 90%. Even inside Anthropic the headline claim doesn’t quite survive contact with the data.

Kent Beck said it best: 90% of his skills went to zero dollars, and 10% went up a thousandfold. He’s honest that he doesn’t yet know exactly which skills landed in which bucket, and the only way to find out is to try a lot of ideas and see what still pays.

Sam Altman framed the 2026 reality as developers spending 30% of their time writing code and 70% on architecture, design, and review, roughly inverted from where it sat two years ago. Simon Willison laid out what that 70% is: researching approaches, deciding on architecture, writing specs, defining success criteria, designing agentic loops, planning QA. In other words, the work already filling a senior engineer’s day.

The data on what happens when you crank up production without a matching increase in judgement is alarming. Faros AI tracked more than 22,000 developers and found PR merge rate up 16% while review time climbed 441%. Incidents per pull request rose 242%, and almost a third more PRs merged with no review at all. They call it Acceleration Whiplash.
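A back-of-envelope sketch shows how those figures combine. It assumes “merge rate up 16%” means 16% more PRs merged over the same period, and that the percentages can simply be multiplied together. Both are my assumptions for illustration, not something the Faros figures state.

```python
def multiplier(pct_increase: float) -> float:
    """Convert a percentage increase (16 means +16%) into a multiplier."""
    return 1 + pct_increase / 100


# Figures quoted above; treating them as independent multipliers is an assumption.
merged_prs = multiplier(16)          # PRs merged, relative to before
incidents_per_pr = multiplier(242)   # incidents per PR, relative to before
review_time = multiplier(441)        # review time, relative to before

# More PRs, each more likely to cause an incident, compounds into total incidents.
total_incidents = merged_prs * incidents_per_pr

print(f"Total incidents: {total_incidents:.2f}x")  # prints 3.97x
print(f"Review time:     {review_time:.2f}x")      # prints 5.41x
```

On those assumptions, a 16% gain in merged PRs comes with roughly four times as many incidents overall, which is why the headline throughput number flatters the picture.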

It’s Jevons paradox showing up in our own work: make code cheaper and we produce more of it, including backlog jobs that never used to justify the effort. The bottleneck moves to checking. Anthropic says human review is already what its growing volume of code waits on, so teams are trying AI reviewers, self-correcting agent loops and stronger automated tests instead of adding people at the same rate. The lasting skill isn’t manually reviewing every line. It’s deciding what can run alone and where a human still has to sign off.
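One way to make that decision explicit is to write it down as a policy rather than leave it to habit. The sketch below is hypothetical: the `Change` fields, the sensitive path prefixes and the 200-line threshold are all made up for illustration, and a real team would pick its own.

```python
from dataclasses import dataclass

# Hypothetical high-risk areas; every team would choose its own list.
SENSITIVE_PREFIXES = ("auth/", "billing/", "migrations/")


@dataclass(frozen=True)
class Change:
    paths: tuple[str, ...]   # files touched by the change
    lines_changed: int
    tests_passed: bool       # automated test suite result
    ai_review_passed: bool   # verdict from an automated reviewer


def needs_human_signoff(change: Change, max_unreviewed_lines: int = 200) -> bool:
    """Return True when a person must approve the change before merge."""
    # Anything the automated checks rejected always goes to a human.
    if not (change.tests_passed and change.ai_review_passed):
        return True
    # Sensitive areas always get a human, however small the change.
    if any(path.startswith(SENSITIVE_PREFIXES) for path in change.paths):
        return True
    # Otherwise only large changes need sign-off.
    return change.lines_changed > max_unreviewed_lines


docs_fix = Change(("docs/setup.md",), 12, True, True)
billing_fix = Change(("billing/invoice.py",), 12, True, True)
print(needs_human_signoff(docs_fix))     # False
print(needs_human_signoff(billing_fix))  # True
```

The rules themselves matter less than the fact that they are written down, reviewable and versioned, so the boundary between “runs alone” and “needs a person” is a deliberate choice.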

When an agent can grind through the build over a weekend, the quality of the spec starts to matter more than typing speed. It needs to say what must not change, what has to hold when things break and what “done” looks like once the feature is live. Good engineers have always held messy, contradictory systems in their heads and brought some order to them. Now the agent makes the cost of a vague spec obvious very quickly.

The world model problem

Yann LeCun has been raising a deeper question loudly for years. He left Meta in late 2025, founded AMI Labs in Paris, and raised $1.03 billion, which was the largest seed round in European startup history. His thesis is that predicting the next word one at a time, which is roughly all an LLM does, is a dead end for human-level intelligence. In a November 2025 lecture he put it more bluntly: “the path to superintelligence via LLMs is complete bullshit. It’s just never going to work.”

The argument behind the swearing is precise. LLMs generate one token at a time, each one conditioned on everything before it. Over a long reasoning chain the errors compound, and there’s no mechanism to backtrack or to simulate several possible futures and pick one. They work in words and symbols, which suits language and falls apart for the messy, physical real world. They have no world model, so an LLM can recite gravity from a textbook without grasping in any useful sense that a glass pushed off a table will smash.
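The compounding part of that argument can be put in a few lines. This is a toy model, not a measurement: it assumes every step has the same independent chance of going wrong and that nothing ever recovers from a mistake, which is the simplified form of the claim. The function name and the 1% error rate are mine, chosen for illustration.

```python
def chain_success_probability(per_step_error: float, steps: int) -> float:
    """Chance that every step in a chain is right, assuming independent errors.

    With per-step error rate e and n steps, this is (1 - e) ** n.
    """
    if not 0 <= per_step_error <= 1:
        raise ValueError("per_step_error must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1 - per_step_error) ** steps


# Illustrative numbers only: a 1% chance of error at each step.
for steps in (10, 100, 1000):
    p = chain_success_probability(0.01, steps)
    print(f"{steps:>5} steps: {p:.4%} chance of an error-free chain")
```

At a 1% per-step error rate, a 100-step chain comes out clean about 37% of the time, and a 1,000-step chain well under 0.01% of the time. Whether real models behave like this is exactly what the disagreement is about.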

His alternative is JEPA, the Joint Embedding Predictive Architecture. Rather than predict the next token, it predicts in an abstract representation space, learning the patterns that matter and throwing away noise it can’t predict. A toddler learns gravity by watching the world, long before any physics lesson. JEPA is an attempt to formalise that.

It’s still early research rather than product, but it’s moving: Meta’s V-JEPA 2 trained on a million hours of video and leads on motion understanding and action anticipation, with follow-up work firming up the theory and shrinking the models to a single GPU.

This matters to us because it names a real limitation in the tools we lean on every day. LLMs are pattern matchers over text, and software engineering happens out in the world, in rooms with other people, in conversations with stakeholders who can’t quite say what they want, in systems wired into physical infrastructure and human behaviour. The models are brilliant at the text part, and everything that isn’t text is where they wobble.

LeCun might be wrong. He’s spent three years calling LLMs a dead end while they got steadily, dramatically more capable, and even his own people admit world models could be years away from anything commercial. But a Turing Award winner has just put a billion dollars behind a falsifiable architectural bet against the direction the whole field is running in. Even if you think he’s mistaken, his claim about what LLMs can’t model, the physics and causality and continuous reality of it, describes almost exactly the territory where experienced engineers earn their keep.

AI works best when the environment is machine-controlled

A pattern I keep tripping over ties a lot of this together: AI performs best when the whole environment is itself machine-controlled, and it degrades the moment it hits a human-built edge.

Self-driving cars are happiest in instrumented, geofenced zones where every vehicle can talk to every other vehicle. Human drivers are the awkward variable. Autonomous vehicles get rear-ended because they follow the rules strictly while people misjudge them. The academic work is blunt: if every vehicle were autonomous, most of the coordination problem would disappear. Legacy cars and unconnected road users are what keep it difficult.

Warehouses tell the same story. Gartner predicts half of new warehouses in developed markets will be built for robots first by 2030, with people optional. The word “new” matters there: designing for robots from day one is far easier than retrofitting a space made for people. The same robots that hit millimetre precision in a controlled cell still fail against clutter, slippage and plain uncertainty.

You see the same thing in code. Agents shine on a clean, fresh project where the whole thing fits in view, and they struggle on an existing system with years of history behind it: live dependencies, undocumented rules, decisions nobody remembers making. One team pointed Claude at a big, mature Django app and got tidy, confident-looking changes that broke its links to outside services, because the agent only sees fragments and misses the shape of the whole thing. Old systems are full of invisible contracts: assumptions about how data moves, integration quirks, rules baked into code long after the people who wrote them have gone.

The industry’s response tells you something. We’re busy rebuilding our tools so machines can read them: context files that brief an agent, standard ways for agents to call other services, little manifests that explain how to use a system. Karpathy’s Software 3.0 talk argues our human-facing setup is unreadable to AI and needs rebuilding for machines, and UC Berkeley’s Dawn Song describes an agentic web made for AI agents rather than human browsers.

We’re reshaping the environment to suit the AI because neat, closed systems are where it works best. Engineers still have to connect that machine-readable world to the old software, physical infrastructure and organisations that don’t fit neatly inside a context window.

When AI builds itself

In June 2026, Anthropic put out an essay that read differently from the usual AI hype. “When AI builds itself” laid out their internal numbers. As of May 2026, Claude authored over 80% of the code merged into Anthropic’s own codebase. Engineers merge something like 8x as much code per day as they did in 2024. Open-ended task success hit 76%, up 50 points in six months. A kernel-optimisation research loop went from a 3x speedup with Opus 4 to 52x with the Mythos Preview model, against roughly 4x for a skilled human given 4 to 8 hours.

The essay’s central claim is that recursive self-improvement, AI autonomously improving AI, isn’t here yet but is plausible within a few years. Jack Clark puts it at a 60% chance by the end of 2028.

Anthropic sketches three outcomes. Progress could stall, which it clearly doesn’t expect. More likely, efficiency keeps compounding: humans set direction and judge results while a small company does work that once needed thousands of people. The aggressive case automates taste and judgement too, leaving people to oversee an AI-run lab. Amodei’s islands metaphor captures that last one: human expertise survives on islands until the tide reaches each of them.

I think compounding efficiency is the more likely outcome over the next three to seven years, but the odds of full recursive improvement aren’t a rounding error. Even the fictional lab in AI 2027 keeps human engineers because research taste proved hard to train.

For now, both scenarios reward the same skills: verification, safety, architecting AI systems, writing specs, and being able to look at an output and explain exactly why it’s wrong.

Anthropic’s own internal assessment of the model that became Fable reportedly rated its weaknesses as handling week-long open-ended tasks on its own, reading what an organisation actually cares about, taste, checking its own work, following instructions, and knowing what it really knows. Read that list back and it’s more or less the job description for a senior engineer.

The obvious objection is that those are just today’s weaknesses, and I’ve spent this whole post describing a curve that keeps blowing past limits people swore were permanent. So why would taste and judgement be the parts that hold? The best answer I’ve got is that they’re the parts of the job least like text. Writing a spec that survives contact with production, or knowing which failure actually matters to the business, leans on the messy human and physical context the models keep struggling with. That is the same limit the earlier sections on world models and machine-controlled environments kept pointing at. If that’s wrong, if judgement turns out to be just another capability on the same curve, then the moat closes and a fair bit of this post goes with it. That’s the bet I’m making, and I’d rather name it out loud than pretend it’s a sure thing.

One person still can’t run everything through agents

Even if AGI arrived tomorrow, a CEO couldn’t run the company through a chat box. Businesses are bad at stating what they need. Someone still has to turn vague wants into technical decisions, choose which debt is worth taking and reject the architecture that falls over under real traffic. A fleet of agents can do the build, but it needs direction and its output needs judging against context that lives in people’s heads.

Annie Vella put words to the emotional side of this in her essay on the identity crisis. We’re shifting from building things to overseeing the agents that build them, which lands us somewhere that looks suspiciously like management. That’s uncomfortable and I don’t always love it, but it’s where the value is going.

What to do about it

If you’re mid-level, the pressure lands hardest on you, because your main output is solid, competent code, and that’s exactly what the AI now turns out by the yard. The way through is to stop being only a coder. Go deep in one area until you understand the business behind it, not just the code. Get properly fluent with the tools. And build the kind of judgement that lets you look at an AI’s design and see it’ll fall over the first time real traffic hits it. Roles that ask for two or more AI skills already pay around 43% more, so none of this is abstract.

If you’re senior, your work is shifting from writing the code to setting things up so the AI writes it well. That means the scaffolding around the work: clear docs, clean interfaces, the context files and tool setups that let an agent move through your systems without flailing. In my experience that’s the highest-leverage thing you can do to make AI coding pay off on a real, lived-in codebase, instead of getting confident changes that break in production.

If you’re principal, the same move scales up, and there’s a shift that’s easy to miss. Your main tool across team boundaries used to be influence: write a proposal, hope it lands on someone’s roadmap. Now you can turn up with a working prototype attached to the argument. Point an agent at a scoped cross-team problem, harden what it gives you, and arrive with a pull request instead of a ticket. Getting a cross-team change off the ground used to be most of the fight, and that’s fallen away faster than anything else I’ve seen this year.

Whatever your level, move your main craft from the code to the spec. Get good at writing down exactly what you want: what must not change, what has to hold when things break, what “done” looks like once it’s live, with monitoring and rollback wired in. Keep those specs in version control next to the code. Once an agent can grind out the build overnight, the spec is the real thing you’re working on.
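A spec like that can be treated as a small piece of structured data and checked before an agent ever starts. What follows is a minimal sketch under my own assumptions: the `Spec` class, its field names and the example entries are hypothetical, and plenty of teams would keep this as a plain document instead.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Spec:
    """Hypothetical spec shape covering the sections described above."""
    title: str
    must_not_change: list[str] = field(default_factory=list)
    must_hold_on_failure: list[str] = field(default_factory=list)
    done_when_live: list[str] = field(default_factory=list)
    monitoring: list[str] = field(default_factory=list)
    rollback: list[str] = field(default_factory=list)


def missing_sections(spec: Spec) -> list[str]:
    """Name every section that is still empty, in declaration order."""
    return [
        f.name
        for f in fields(spec)
        if f.name != "title" and not getattr(spec, f.name)
    ]


# Illustrative draft: two sections filled in, three still blank.
draft = Spec(
    title="Add CSV export to reports",
    must_not_change=["Existing PDF export output"],
    done_when_live=["Export completes for the largest current report"],
)
print(missing_sections(draft))
# ['must_hold_on_failure', 'monitoring', 'rollback']
```

A check like this could run in CI next to the code, so a build doesn’t start from a spec that never says what happens when things break.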

Treat reviewing and checking the AI’s output as a core skill you practise on purpose, not a chore to rush. Nearly half of AI-written code ships with a security hole in it, and the person who reliably catches that is worth more as more code gets written this way, not less.

Mentor your juniors anyway. It’s tempting to stop hiring them once an AI does the work they used to cut their teeth on, but Stanford already sees a 20% drop in employment for developers aged 22 to 25. If that holds, the seniors you’ll want to hire in 2031 are the juniors nobody is training today, and this is one we can actually help head off.

Kent Beck has the clearest take I’ve seen on what happened to our skills:

Worth little now	Worth far more	Brand new
Syntax by heart	Vision	Writing specs
Memorising APIs	Architecture	Checking AI output
Raw typing speed	A feel for quality	Debugging code an agent wrote

If most of your day is still in that first column, take it as your signal to move.

Where this ends

Nobody knows. LeCun thinks LLMs are a dead end and has put a billion dollars behind world models. The AI 2027 authors think superhuman coding shows up around 2030. Hinton thinks a lot of jobs disappear. Ng says there’s no jobpocalypse coming. The discourse is this polarised because the situation is uncertain, and I’d be sceptical of anyone who tells you they know how it lands.

What I’m fairly sure of is narrower. The tools keep getting better, but so far they’ve made typing cheaper much faster than they’ve replaced engineering judgement. The role keeps sliding from authorship towards orchestration. Engineers who can translate between a tidy machine-readable plan and the messy organisation it has to serve will do well out of that shift.

The job is shapeshifting. If you’ve been at this long enough to build real judgement and depth in a domain, the market still has plenty of use for you.

So build with the tools and sharpen the judgement they can’t reproduce, and maybe stop reading the LinkedIn doomer posts.

Update: I wrote the companion piece for people earlier in their careers, on whether it’s still worth starting in software in 2026 and how to break in now that the entry-level market has narrowed: Breaking In.