Beware: AI prompts age faster than you think. 7 patterns that break with every model upgrade

14 May 2026 · Bas van Dijk

Picture this: your customer service bot has been working flawlessly for months. Customers are satisfied, it saves FTEs. Then a new version of the model arrives. On paper more powerful, faster, cheaper. Two weeks later the customer service team lead is on your doorstep: the bot deviates from the script, answers questions it shouldn’t answer, and ignores precisely the instructions that matter.

Not a hypothetical scenario. This happens the moment the prompts written for the old model version start getting in the way of the new version.

TL;DR

  • Model upgrades break prompts via two opposing mechanisms: the model takes too much control itself, or it becomes too literal
  • Seven prompt patterns that were once best practice now backfire on modern reasoning models
  • The durable strategy is not “better prompts” but stable specifications, automatic evals, and pinned model versions, so you know when something breaks

Why prompts break during a model upgrade

With new generations, modern LLMs drift in two opposing directions, depending on the model provider:

Inference creep: The new model increasingly takes control itself and overrides your boundaries with its own judgment. Anthropic acknowledged this in the transition to Claude Sonnet 4 and 4.5: instructions like “ALWAYS do X” were reconsidered by the model under the heading “does the user really want this in this context?”

Literalism shock: The opposite pattern. OpenAI says so itself in its Cookbook guide to the GPT-4o to GPT-4.1 transition: “GPT-4.1 is trained to follow instructions more closely and more literally than its predecessors… we expect that getting the most out of this model will require some prompt migration.” Given an instruction like “you must always call a tool”, GPT-4o would still ask a sensible follow-up question when information was missing. GPT-4.1 simply hallucinated a tool call with empty parameters.

A future-proof prompt has to withstand both directions. That is a higher bar than most prompts meet.
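The literalism failure described above, a tool call with empty parameters, can also be caught in code instead of only in the prompt. A minimal sketch, assuming tool calls arrive as a dict with a `name` and an `arguments` mapping; that shape, the `lookup_order` tool, and both function names are hypothetical and not any provider's API:

```python
# Hypothetical tool-call shape: {"name": "lookup_order", "arguments": {"order_id": "A-1001"}}
REQUIRED_ARGS = {"lookup_order": ["order_id", "postcode"]}


def missing_arguments(tool_call: dict) -> list[str]:
    """Return the required arguments that are absent or empty in a tool call."""
    required = REQUIRED_ARGS.get(tool_call.get("name", ""), [])
    arguments = tool_call.get("arguments") or {}
    return [name for name in required if arguments.get(name) in (None, "")]


def handle(tool_call: dict) -> str:
    """Ask a follow-up question instead of executing a call with empty parameters."""
    missing = missing_arguments(tool_call)
    if missing:
        return "Could you give me your " + " and ".join(missing) + "?"
    return f"executing {tool_call['name']}"
```

A guard like this behaves the same whether the model leans toward too much initiative or too much literalism, because it does not depend on how the model reads the prompt.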

7 prompt patterns past their expiry date

| # | Pattern | Why brittle on modern models | What to do instead |
|---|---------|------------------------------|--------------------|
| 1 | "Think step by step" as an instruction | Reasoning models (o1, o3, Claude with thinking) do this by themselves; explicit CoT can actually undermine performance | Give the goal, let the model decide whether to reason step by step |
| 2 | "You are an expert X" role prompts | Research on 162 personas and 2,410 questions shows no average improvement, and sometimes less accurate answers | Describe the task context, not a role |
| 3 | 5+ few-shot examples | Strong models become anchored to your examples and explore the solution space less | At most 1-2 examples, chosen for diversity |
| 4 | Enforcing strict JSON while reasoning | Structure constraints compete with thinking: measurable performance loss of 10-15% on reasoning tasks | Two steps: first reason freely, then format separately |
| 5 | temperature=0 for "determinism" | Anthropic Opus 4.7 rejects this with HTTP 400. Google Gemini 3 can get stuck in a loop at temperatures below the default | Use the default value; enforce consistency via your prompt |
| 6 | "NEVER do X" negative instructions | Strong models sometimes interpret negations the opposite way; positive instructions generalize better | Write what the model should do, with an explanation of why |
| 7 | The 2000-token monster prompt | "God Object" prompts: one change causes regressions on other tasks. Voiceflow saw a 10% performance loss during model migration | Split into modular prompts, each with its own evals |
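Pattern 4 is the easiest to show in code. A minimal sketch of the two-step approach, assuming a hypothetical `model` callable that takes a prompt string and returns text; the prompt wording and the JSON keys are illustrative choices, not a recommended template:

```python
import json
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical stand-in: prompt in, text out


def reason_then_format(question: str, model: ModelFn) -> dict:
    """Let the model reason without format constraints, then format in a second call."""
    # Step 1: free-form reasoning, no structure imposed.
    reasoning = model(f"Task: {question}\nWork out the answer. Any format is fine.")
    # Step 2: a separate call whose only job is converting that text to JSON.
    raw = model(
        "Convert the analysis below into JSON with the keys "
        '"answer" and "confidence". Output JSON only.\n\n' + reasoning
    )
    return json.loads(raw)
```

The second call is a pure formatting task, so it can run on a smaller, cheaper model if your setup allows it.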

Three patterns highlighted: why precisely these

1. Chain-of-thought has become almost free, and therefore almost worthless

The Wharton Generative AI Lab recently measured the declining value of chain-of-thought prompting on modern reasoning models. On o3-mini and o4-mini, “think step by step” yielded only 2.9 to 3.1 percent accuracy gain, against 20 to 80 percent longer response time and higher costs.

Put differently: you pay up to 80 percent more response time for a marginally better answer. On specific task types, such as implicit statistical reasoning, explicit CoT can even worsen performance by 36 percentage points.
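The trade-off in the Wharton figures can be made concrete with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the "gain per extra time" ratio is an illustrative measure introduced here, not one from the report.

```python
def latency_multiplier(extra_time_pct: float) -> float:
    """Convert 'X percent longer' into a multiplier on response time."""
    return 1 + extra_time_pct / 100


def gain_per_extra_time(accuracy_gain_pct: float, extra_time_pct: float) -> float:
    """Accuracy gained per percent of added response time (illustrative ratio)."""
    return accuracy_gain_pct / extra_time_pct


best_case = gain_per_extra_time(3.1, 20)   # largest gain, smallest slowdown
worst_case = gain_per_extra_time(2.9, 80)  # smallest gain, largest slowdown

print(latency_multiplier(20), latency_multiplier(80))  # 1.2 1.8
print(round(best_case, 3), round(worst_case, 3))       # 0.155 0.036
```

Even in the best case, each percent of extra waiting buys well under a fifth of a percent of accuracy.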

The pattern that was the gold standard in 2023 is, in 2026, a drag on speed with no quality to show for it.

2. “You are an expert X” hasn’t worked since 2024

Zheng et al. (EMNLP 2024) tested 162 different personas in a large-scale study, from “you are a lawyer” to “you are a top scientist”, spread across 2,410 factual questions and four model families. The conclusion: no measurable improvement over a neutral system prompt. On reasoning tasks, follow-up research on LLaMA-3 even showed performance loss in 7 of 12 datasets.

Yet 90 percent of production prompts still start with a role opener. It feels smart. And removing it means deleting a single line.

3. Few-shot prompting works inversely to model capacity

Perhaps the most surprising finding from the literature: the more capable the model, the more sensitive it is to bad examples in your prompt. Sclar et al. (ICLR 2024) showed that the formatting between examples alone (commas, colons, line breaks) can cause accuracy differences of up to 76 percentage points. Formatting that works well on one model can be bad for another, so the optimal format is model-specific.

The more examples you give, the larger the surface on which this fragility can strike.
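One way to find out how exposed your own prompt is: render the same examples in several formats and score each variant. A minimal sketch that only generates the variants; the separators, the `Input`/`Output` labels, and the sample data are arbitrary assumptions, and scoring is left to your own evals:

```python
from itertools import product

LABEL_SEPARATORS = [": ", " - ", ":\n"]
EXAMPLE_SEPARATORS = ["\n", "\n\n"]


def render_prompt(examples, query, label_sep, example_sep):
    """Render the same few-shot content with one particular formatting choice."""
    shots = [f"Input{label_sep}{x}\nOutput{label_sep}{y}" for x, y in examples]
    shots.append(f"Input{label_sep}{query}\nOutput{label_sep}")
    return example_sep.join(shots)


def format_variants(examples, query):
    """Map every separator combination to its rendered prompt."""
    return {
        (label_sep, example_sep): render_prompt(examples, query, label_sep, example_sep)
        for label_sep, example_sep in product(LABEL_SEPARATORS, EXAMPLE_SEPARATORS)
    }


variants = format_variants([("Where is my order?", "shipping")], "I want my money back")
```

The gap between the best and worst scoring variant is a rough measure of how fragile the prompt is on a given model.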

What to do instead: three principles for future-proof AI

We design AI systems for organizations with three rules that have held up across all model generations:

1. Describe the intent, not the procedure. Tell the model what “done” means and which rules are sacred. Let the model decide for itself which intermediate steps are needed. That’s how you design an AI system that still works two years from now.

2. Maintain specifications and evals, not “the prompt”. The prompt is the temporary artifact. What is durable is the specification of what your system must do, together with automated tests that verify it. With every model upgrade you run your prompt through your evals again.

3. Pin your model version in production. It is tempting to run on the latest version automatically. Don’t. Pin a version, run your evals against the new version as soon as it comes out, and only upgrade once you know your system survives it.
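The three principles fit together in very little code. A minimal sketch, assuming models are plain callables from prompt to text; `TaskSpec`, `EvalCase`, `safe_to_upgrade`, and the model identifier are all hypothetical names invented for this illustration:

```python
from dataclasses import dataclass
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical stand-in for a real API call


@dataclass(frozen=True)
class TaskSpec:
    """Principle 1: the intent, not the procedure."""
    done_means: str
    sacred_rules: tuple[str, ...]

    def to_prompt(self, user_input: str) -> str:
        rules = "\n".join(f"- {rule}" for rule in self.sacred_rules)
        return f"Goal: {self.done_means}\nRules:\n{rules}\n\nMessage: {user_input}"


@dataclass(frozen=True)
class EvalCase:
    """Principle 2: one automated check derived from the specification."""
    user_input: str
    passes: Callable[[str], bool]


def pass_rate(spec: TaskSpec, cases: list[EvalCase], model: ModelFn) -> float:
    """Share of eval cases whose model output satisfies its check."""
    results = [case.passes(model(spec.to_prompt(case.user_input))) for case in cases]
    return sum(results) / len(results)


# Principle 3: production names an explicit version (this identifier is made up).
PINNED_MODEL = "example-model-2026-01-15"


def safe_to_upgrade(spec, cases, pinned: ModelFn, candidate: ModelFn) -> bool:
    """Upgrade only if the candidate scores at least as well as the pinned model."""
    return pass_rate(spec, cases, candidate) >= pass_rate(spec, cases, pinned)
```

In this sketch the prompt is generated from the specification, so the specification and the eval cases are the things you version and review, and the pinned identifier only changes after `safe_to_upgrade` returns true.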

What this means for your AI investment

Most organizations that have invested in AI over the past two years wrote their prompts for the model of that time. Those prompts still work now. They will break the next time the model provider upgrades; the question is not if, but when. That is no fault of your team. It is a property of the medium.

The way to protect your investment is not “write better prompts.” It is building the layer beneath the prompt: a clear task specification, automated tests that catch regressions before your customer does, and discipline around model versions. That’s how you become AI-native: not through today’s prompts, but through tomorrow’s evals.

Want to know how your AI systems are doing? Get in touch, we’re happy to take a look with you.
