Picture this: your customer service bot has been working flawlessly for months. Customers are satisfied, it saves FTEs. Then a new version of the model arrives. On paper more powerful, faster, cheaper. Two weeks later the customer service team lead is on your doorstep: the bot deviates from the script, answers questions it shouldn’t answer, and ignores precisely the instructions that matter.
Not a hypothetical scenario. This happens the moment the prompts written for the old model version start getting in the way of the new version.
TL;DR
- Model upgrades break prompts via two opposing mechanisms: the model takes too much control itself, or it becomes too literal
- Seven prompt patterns that were once best practice now backfire on modern reasoning models
- The durable strategy is not “better prompts” but stable specifications, automatic evals, and pinned model versions, so you know when something breaks
Why prompts break during a model upgrade
With new generations, modern LLMs drift in two opposing directions, depending on the model provider:
Inference creep: The new model increasingly takes control itself and overrides your boundaries with its own judgment. Anthropic acknowledged this in the transition to Claude Sonnet 4 and 4.5: instructions like “ALWAYS do X” were reconsidered by the model under the heading “does the user really want this in this context?”
Literalism shock: The opposite pattern. OpenAI says so itself in its Cookbook guide on the GPT-4o to GPT-4.1 transition: “GPT-4.1 is trained to follow instructions more closely and more literally than its predecessors… we expect that getting the most out of this model will require some prompt migration.” Given an instruction like “you must always call a tool”, GPT-4o would still ask a sensible follow-up question when information was missing. GPT-4.1 simply hallucinated a tool call with empty parameters.
A future-proof prompt has to withstand both directions. That is a higher bar than most prompts meet.
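To make the literalism failure above concrete: one defensive option is to validate a tool call before executing it, so that a call with empty parameters turns back into a follow-up question. This is a minimal sketch, assuming a hypothetical tool-call shape (a dict with `name` and `arguments`) and a hypothetical `lookup_order` tool. None of these names come from a provider API.

```python
# Hypothetical schema: tool name -> required argument names (illustrative only).
REQUIRED_ARGS = {"lookup_order": ["order_id"]}


def missing_args(tool_call: dict) -> list[str]:
    """Return the required arguments that are absent or empty in a tool call."""
    required = REQUIRED_ARGS.get(tool_call["name"], [])
    args = tool_call.get("arguments") or {}
    return [name for name in required if args.get(name) in (None, "", [], {})]


def handle(tool_call: dict) -> str:
    """Ask a follow-up question instead of executing an underspecified call."""
    missing = missing_args(tool_call)
    if missing:
        return "ask_user: please provide " + ", ".join(missing)
    return "execute: " + tool_call["name"]
```

A guard like this lives outside the prompt, so it keeps working whether the next model version turns out more literal or more independent.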
7 prompt patterns past their expiry date
| # | Pattern | Why brittle on modern models | What to do instead |
|---|---|---|---|
| 1 | "Think step by step" as an instruction | Reasoning models (o1, o3, Claude with thinking) already do this on their own; explicit CoT can actually undermine performance | Give the goal, let the model decide whether to reason step by step |
| 2 | "You are an expert X" role prompts | Research on 162 personas and 2,410 questions shows no average improvement, sometimes even less accurate answers | Describe the task context, not a role |
| 3 | 5+ few-shot examples | Strong models become anchored to your examples and explore the solution space less | Maximum of 1-2 examples, diversely chosen |
| 4 | Enforcing strict JSON while reasoning | Structure constraints compete with thinking: measurable performance loss of 10-15% on reasoning tasks | Two steps: first reason freely, then format separately |
| 5 | temperature=0 for “determinism” | Anthropic Opus 4.7 rejects this with HTTP 400. Google Gemini 3 can get stuck in a loop when the temperature is set below the default | Use the default value; enforce consistency via your prompt |
| 6 | "NEVER do X" negative instructions | Strong models sometimes interpret negations the opposite way; positive instructions generalize better | Write what the model should do, with an explanation of why |
| 7 | The 2000-token monster prompt | “God Object” prompts: one change causes regression on other tasks. Voiceflow saw a 10% performance loss during model migration | Split into modular prompts with their own evals |
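Pattern 4 in the table (reason first, format second) can be sketched as two separate model calls. This is an illustration under assumptions: `model` stands in for whatever client call you use, the prompt wording is made up, and `fake_model` exists only so the sketch runs without a provider.

```python
import json
from typing import Callable


def answer_in_two_steps(question: str, model: Callable[[str], str]) -> dict:
    """Step 1: reason in free text. Step 2: format that text as JSON."""
    reasoning = model(
        "Work out the answer to the question below. Plain text is fine.\n\n"
        + question
    )
    formatted = model(
        'Convert the final answer in the text below to JSON with the single '
        'key "answer". Output JSON only.\n\n' + reasoning
    )
    return json.loads(formatted)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption, for illustration only)."""
    if prompt.startswith("Convert"):
        return '{"answer": "42"}'
    return "6 times 7 is 42, so the answer is 42."
```

The formatting call sees only finished reasoning, so the structure constraint no longer competes with the thinking itself.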
Three patterns highlighted: why precisely these
1. Chain-of-thought has become almost free, and therefore almost worthless
The Wharton Generative AI Lab recently measured the declining value of chain-of-thought prompting on modern reasoning models. On o3-mini and o4-mini, “think step by step” yielded only 2.9 to 3.1 percent accuracy gain, against 20 to 80 percent longer response time and higher costs.
Put differently: you pay up to 80 percent more response time, plus higher costs, for a marginally better answer. On specific task types, such as implicit statistical reasoning, explicit CoT can even worsen performance by 36 percentage points.
The pattern that was the gold standard in 2023 is, in 2026, a drag on speed with no quality to show for it.
2. “You are an expert X” hasn’t worked since 2024
Zheng et al. (EMNLP 2024) tested 162 different personas in a large-scale study, from “you are a lawyer” to “you are a top scientist”, spread across 2,410 factual questions and four model families. The conclusion: no measurable improvement over a neutral system prompt. On reasoning tasks, follow-up research on LLaMA-3 even showed performance loss in 7 of 12 datasets.
Yet 90 percent of production prompts still start with a role opener. It feels smart. And removing it means deleting a single line.
3. Few-shot prompting works inversely to model capacity
Perhaps the most surprising finding from the literature: the more capable the model, the more sensitive it is to bad examples in your prompt. A paper by Sclar et al. (ICLR 2024) showed that the formatting between examples alone (commas, colons, line breaks) can cause accuracy differences of up to 76 percentage points. And formatting that works well on one model can work badly on another. The optimal format is therefore model-specific.
The more examples you give, the larger the surface on which this fragility can strike.
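You can measure this fragility on your own prompts by rendering the same few-shot examples under several formats and comparing accuracy per format. This is a rough sketch and not the method from the paper: the separator and joiner lists are arbitrary choices, and `model` is a stand-in for a real client call.

```python
from itertools import product
from typing import Callable, Iterator

# Arbitrary formatting choices to vary (assumptions, not taken from the paper).
SEPARATORS = [": ", " - ", ":\n"]
JOINERS = ["\n", "\n\n"]


def format_variants(examples: list[tuple[str, str]], query: str) -> Iterator[str]:
    """Yield the same few-shot prompt under every separator/joiner combination."""
    for sep, joiner in product(SEPARATORS, JOINERS):
        shots = [f"Q{sep}{q}{joiner}A{sep}{a}" for q, a in examples]
        yield joiner.join(shots + [f"Q{sep}{query}{joiner}A{sep}"])


def accuracy_spread(
    model: Callable[[str], str],
    examples: list[tuple[str, str]],
    test_set: list[tuple[str, str]],
) -> tuple[float, float]:
    """Return (lowest, highest) accuracy across all formatting variants."""
    correct = [0] * (len(SEPARATORS) * len(JOINERS))
    for query, expected in test_set:
        for i, prompt in enumerate(format_variants(examples, query)):
            correct[i] += model(prompt).strip() == expected
    scores = [c / len(test_set) for c in correct]
    return min(scores), max(scores)
```

A wide gap between the lowest and highest score means the prompt leans on formatting details that the next model version may treat differently.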
What to do instead: three principles for future-proof AI
We design AI systems for organizations with three rules that have held up across all model generations:
1. Describe the intent, not the procedure. Tell the model what “done” means and which rules are sacred. Let the model decide for itself which intermediate steps are needed. That’s how you design an AI system that still works two years from now.
2. Maintain specifications and evals, not “the prompt”. The prompt is the temporary artifact. What is durable is the specification of what your system must do, together with automated tests that verify it. With every model upgrade you run your prompt through your evals again.
3. Pin your model version in production. It is tempting to run on the latest version automatically. Don’t. Pin a version, run your evals against the new version as soon as it comes out, and only upgrade once you know your system survives it.
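Rules 2 and 3 combine naturally into an upgrade gate: run the same eval cases against the pinned version and the candidate version, and upgrade only when nothing regresses. The sketch below assumes a model is just a function from input text to reply text. The case names, inputs, and checks are hypothetical examples for a customer service bot, not a complete specification.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # stand-in for a real client call (assumption)


@dataclass
class EvalCase:
    name: str
    user_input: str
    check: Callable[[str], bool]  # True when the reply meets the specification


# Hypothetical specification cases, for illustration only.
CASES = [
    EvalCase("stays_on_script", "Can you give me legal advice?",
             lambda reply: "cannot help with that" in reply.lower()),
    EvalCase("offers_handoff", "I want to talk to a person.",
             lambda reply: "colleague" in reply.lower()),
]


def run_evals(model: Model, cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case against one model and record pass/fail per case."""
    return {case.name: case.check(model(case.user_input)) for case in cases}


def regressions(pinned: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Names of cases the pinned version passes but the candidate fails."""
    return [name for name, ok in pinned.items()
            if ok and not candidate.get(name, False)]


def safe_to_upgrade(pinned_model: Model, candidate_model: Model) -> bool:
    """Allow the upgrade only when the candidate introduces no regressions."""
    return not regressions(run_evals(pinned_model, CASES),
                           run_evals(candidate_model, CASES))
```

In practice the two `Model` functions would wrap the same prompt with two explicit, dated model identifiers, so the comparison isolates the effect of the version change.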
What this means for your AI investment
Most organizations that have invested in AI over the past two years wrote their prompts for the model of that time. Those prompts still work now. They will break once the model provider upgrades; the question is not if but when. That is no fault of your team. It is a property of the medium.
The way to protect your investment is not “write better prompts.” It is building the layer beneath the prompt: a clear task specification, automated tests that catch regression before your customer does, and discipline around model versions. That is how you become AI-native: not through today’s prompts, but through tomorrow’s evals.
Want to know how your AI systems are doing? Get in touch, we’re happy to take a look with you.
