Following the release of ChatGPT, a group of people emerged calling themselves AI Engineers. They specialised in building applications around LLMs, often discovering performance-improving techniques by probing undocumented behaviour. Today's state-of-the-art models perform most of these techniques natively, with no external scaffolding required. This realisation struck me when I noticed that my job over the past six months largely involved removing components we had previously developed. Let's delve into how AI Engineering became the first job to be replaced by AI.
For our first example, let's wind back the clock to when function calling hadn't yet emerged as the standard for structured communication between applications and LLMs. AI Engineers post-processed each LLM response to parse out XML or JSON; advanced implementations supported streaming, fuzzy parsing and enforced structured outputs. However, once the AI Labs introduced native function calling support, this entire processing step could be eliminated from the application layer.
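To make the contrast concrete, here's a minimal sketch using the OpenAI Python SDK. The model name, schema and clean-up heuristics are illustrative, not a prescription:

```python
import json
from openai import OpenAI

client = OpenAI()

# The old way: beg for JSON in the prompt, then fuzzily parse the reply.
resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{
        "role": "user",
        "content": 'Extract the city from "Flights to Paris are cheap". '
                   'Reply ONLY with JSON like {"city": ...}',
    }],
)
raw = resp.choices[0].message.content
# Hope the model didn't wrap the JSON in markdown fences or prose...
cleaned = raw.strip().strip("`").removeprefix("json").strip()
city = json.loads(cleaned)["city"]

# The new way: declare a schema and let the API guarantee the shape.
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Flights to Paris are cheap"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_city",
            "description": "Record the city mentioned in the text.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice={"type": "function", "function": {"name": "extract_city"}},
)
city = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)["city"]
```

Everything between the two calls, the fence-stripping, the retries when parsing failed, the streaming-aware parsers, simply stopped being the application's problem.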
Then came prompt engineering, which at its peak was filled with weird hacks like emotional pleas ("my dog will die if you don't find the bug in my code") and chain-of-thought (CoT) prompting. These concepts are now dead thanks to LLMs that support native reasoning, which outperforms manual CoT prompting. The shift represents a direct transfer of complexity from AI Engineers to the AI Labs, and of development cost to inference cost.
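For illustration, the before and after might look like this with Anthropic's Python SDK. The `thinking` parameter follows Anthropic's extended-thinking API; model names and token budgets are illustrative:

```python
from anthropic import Anthropic

client = Anthropic()
code_snippet = "def add(a, b):\n    return a - b"  # toy example

# The old hack: smuggle reasoning into the prompt, then parse it back out.
legacy = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Find the bug in this function. Think step by step inside "
                   "<thinking> tags before answering.\n\n" + code_snippet,
    }],
)

# The new way: native reasoning, no prompt wizardry and no tag parsing.
native = client.messages.create(
    model="claude-sonnet-4-0",  # illustrative model name
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{
        "role": "user",
        "content": "Find the bug in this function.\n\n" + code_snippet,
    }],
)
```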
Now let's jump to what I consider the most significant recent development: the huge leap in LLM accuracy for completing higher-level objectives. The current leader on SWE-bench (a benchmark measuring realistic software engineering tasks) isn't Cursor, Windsurf, Devin, or any other AI Application. Instead, it's Anthropic's Claude 4, operating simply within a loop and utilising a few basic functions.
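That architecture is simple enough to sketch in full. Below is a hypothetical version of such a loop using Anthropic's Python SDK, with a single `bash` function; `run_bash` is the sort of executor sketched a couple of paragraphs further down:

```python
from anthropic import Anthropic

client = Anthropic()

# A few basic functions; here, just one.
TOOLS = [{
    "name": "bash",
    "description": "Run a shell command and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

messages = [{"role": "user", "content": "Fix the failing tests in this repo."}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-0",  # illustrative model name
        max_tokens=4096,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # no more function calls: the model considers the task done

    # Execute each requested function and feed the results back.
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_bash(block.input["command"]),  # defined below
            }
            for block in response.content
            if block.type == "tool_use"
        ],
    })
```

That's the whole "agent". No routing between modes, no per-task prompts: just a conversation, some functions, and a while loop.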
Previously, tasks like generating code diffs, running unit tests or debugging failing tests each required specialised prompts and separate validation methods. AI applications essentially had to chain together these distinct "modes." Today, a single conversational thread with an LLM can naturally handle all these tasks, performing each step in the appropriate order.
LLMs have also notably improved in their ability to correct their own mistakes. A major component of AI applications was verifying the correctness of an LLM's output, but this is becoming increasingly unnecessary. With access to a bash terminal, an LLM can now independently execute test scripts, interpret the results from failing tests and apply further changes.
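The `run_bash` implementation such a loop dispatches to can be remarkably small. In this sketch the names and limits are my own choices, not any Lab's spec; the point is that it returns exactly what the model needs to judge success:

```python
import subprocess

def run_bash(command: str, timeout: int = 120) -> str:
    """Execute a shell command and return what the model needs to see
    to judge success: the exit code, stdout and stderr."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"command timed out after {timeout}s"
    output = proc.stdout + proc.stderr
    # Truncate so a noisy test run doesn't flood the context window.
    if len(output) > 10_000:
        output = output[:10_000] + "\n... [truncated]"
    return f"exit code: {proc.returncode}\n{output}"

# e.g. the model calls run_bash("pytest -x"), reads the failing assertion,
# edits the file, then calls run_bash("pytest -x") again.
```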
Elements of this behaviour have always existed in LLMs. Even the original ChatGPT model would have attempted code changes and verified them by running tests if instructed, though its accuracy would have been quite limited. To improve multi-turn performance, AI Labs have published their own function definitions (e.g. Claude's text_editor) that developers and CLIs (e.g. Claude Code and Codex) can implement. This ensures consistency between inference-time behaviour and the function definitions used during the LLM training phase.
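As a sketch of what implementing a Lab-defined function looks like, here's a client handling two of the text editor commands Anthropic documents. The exact `type` and `name` strings have changed across model versions, so treat these as illustrative:

```python
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-0",  # illustrative model name
    max_tokens=4096,
    # A Lab-defined tool: no schema to write, because the model was
    # trained against this exact interface.
    tools=[{"type": "text_editor_20250429", "name": "str_replace_based_edit_tool"}],
    messages=[{"role": "user", "content": "Fix the off-by-one error in utils.py"}],
)

# The application implements the commands; the interface itself is fixed.
for block in response.content:
    if block.type == "tool_use" and block.name == "str_replace_based_edit_tool":
        command, path = block.input["command"], Path(block.input["path"])
        if command == "view":
            file_contents = path.read_text()  # sent back as a tool_result
        elif command == "str_replace":
            path.write_text(
                path.read_text().replace(
                    block.input["old_str"], block.input["new_str"], 1
                )
            )
```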
It's clear that AI Labs envision their role extending beyond just supplying the LLM. They aim to provide the general-purpose functions and the multi-turn conversational infrastructure required to execute complex tasks. And not just in coding: both OpenAI and Anthropic have released high-level web search APIs that orchestrate multi-turn conversations and function calls behind the scenes. The era of chaining together chat completion requests could soon be over. We're moving into a new phase, one where you simply send your high-level goal to a CLI or API provided by an AI Lab, and it handles the necessary steps seamlessly.
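Anthropic's web search tool is a good example of the shift: a single request, with the entire search-read-respond loop handled server-side. The `type` string below is from Anthropic's docs at the time of writing and may change:

```python
from anthropic import Anthropic

client = Anthropic()

# One request; the search/browse/cite loop runs entirely on Anthropic's side.
response = client.messages.create(
    model="claude-sonnet-4-0",  # illustrative model name
    max_tokens=2048,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 5}],
    messages=[{"role": "user", "content": "Summarise this week's LLM releases."}],
)

# The response interleaves search calls, results and prose; the final
# block is the model's answer.
print(response.content[-1].text)
```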
This leaves AI Engineers with ever fewer specialised tasks, or at least, the nature of their work today is fundamentally different from what it once was. There's still plenty to build at the next layer of abstraction, but my hunch is that as we climb up these layers, the tasks begin to resemble Normal Engineering.
The era of weird hacks and prompt wizardry is quickly becoming a nostalgic memory. RIP the AI Engineer.
Shower thoughts
Benchmarks should focus on measuring higher-level tasks. There was considerable surprise at Claude 4 Sonnet underperforming Claude 3.7 Sonnet on the Aider coding leaderboard, despite Claude 4 obviously being the better coding model. The likely explanation is that narrow editing benchmarks no longer track the multi-turn, self-correcting behaviour these models are now optimised for.