PaulRoebuck.co.uk

The Claude AI

A series of papers on Additional Intelligence for senior operators — directors, executives, partners, heads of function. People carrying real load. Direct, exact, grounded. Not for IT, developers, or marketing.

SHaDS Blog — In 1985, AI Hallucination Would Have Triggered a Stock Recount

In the 1980s I worked with MRP2 systems, the software that ran the factory floor: bills of materials, work in progress, the stock on the shelf. The whole edifice rested on one thing: the numbers had to be right. Oliver Wight, who more or less wrote the rulebook for that world, set the bar at ninety-five per cent inventory-record accuracy, and warned that below it the whole planning system falls off dramatically. Good operations chased tighter still. There, stock records that drifted even one or two per cent wrong did not earn a shrug. They earned a recount, a root-cause hunt, and (let's be honest about the era) usually someone's head on a plate. That was the threshold for a crisis.

I've been thinking about that bar, because the AI industry has just started doing something genuinely remarkable: it has begun to measure, in the open, how often its models make things up.

An independent evaluator, Artificial Analysis, runs a public benchmark called AA-Omniscience — six thousand hard factual questions spanning business, law, medicine, software and science, a verified right answer for each, and a model free to either answer or say "I don't know." When they published it in November 2025, the headline finding was the kind of sentence that should stop a boardroom cold: of every frontier model they tested, all but three were more likely to invent an answer than to get a hard question right.

Look at the spread. The best-behaved models were wrong on roughly one attempted answer in four. The typical flagship — around one in two. And the two models with the highest raw knowledge scores, the cleverest by accuracy, were wrong on 64% and 81% of the answers they chose to give. Not "failed to answer." Answered — fluently, confidently — and wrong.

Now set those numbers against Oliver Wight. His minimum of ninety-five per cent accuracy, the line below which a 1985 stock system was deemed unsafe to run on, left room for five per cent error at most. The machines we're now inviting into boardrooms, contracts, diagnoses and due diligence run at error rates of twenty-five, fifty, eighty per cent on the hard questions they choose to answer, and the standard response is a nod and a copy-paste. We held a stock ledger to a higher standard of accuracy than we currently hold the tool we hand our decisions to.

Hear me clearly, because this is the part everyone gets wrong. This is not me telling you AI is rubbish. It isn't. These systems are extraordinary, and the same benchmarks that expose the hallucination also show models that genuinely know more than any tool we've ever had. And the fact that the industry is now measuring and publishing this honestly is not a scandal — it's a sign of maturity. Eli Goldratt gave us the principle decades ago: "Tell me how you measure me, and I will tell you how I will behave." The benchmark's designers understood it exactly — they reward a model for admitting "I don't know," precisely to coax out the behaviour we want. Measurement shaping conduct, just as he said. That deserves applause, not a sneer.
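For readers who want the arithmetic spelled out, here is a minimal sketch in Python. It is an illustration only: the tallies are invented, the names are hypothetical, and the scoring rule (plus one for a right answer, minus one for a wrong one, nothing for "I don't know") is an assumption chosen to show the principle, not the benchmark's published formula.

```python
from dataclasses import dataclass

# Wight's bar of 95% record accuracy leaves room for at most 5% error.
WIGHT_MAX_ERROR = 0.05


@dataclass
class Tally:
    """Hypothetical results for one model on a set of hard questions."""
    correct: int
    wrong: int
    abstained: int  # times the model said "I don't know"

    @property
    def attempted(self) -> int:
        return self.correct + self.wrong

    def error_rate_on_attempts(self) -> float:
        """Share of the answers the model chose to give that were wrong."""
        return self.wrong / self.attempted if self.attempted else 0.0

    def abstention_aware_score(self) -> float:
        """Assumed rule: +1 correct, -1 wrong, 0 for abstaining, per question."""
        total = self.attempted + self.abstained
        return (self.correct - self.wrong) / total if total else 0.0


# Invented tallies for illustration, not benchmark data.
guesser = Tally(correct=40, wrong=60, abstained=0)    # answers everything
cautious = Tally(correct=30, wrong=10, abstained=60)  # admits what it doesn't know

for name, t in (("guesser", guesser), ("cautious", cautious)):
    rate = t.error_rate_on_attempts()
    print(
        f"{name}: error on attempts {rate:.0%}, "
        f"score {t.abstention_aware_score():+.2f}, "
        f"within Wight's bar: {rate <= WIGHT_MAX_ERROR}"
    )
```

In this made-up example the guesser knows more in raw terms (40 right against 30) yet scores lower once wrong answers cost something, which is the behaviour the measurement is designed to reward. Neither model comes anywhere near the five per cent bar.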

The point isn't that the tool is bad. The point is what the number demands of us.

Because the rate isn't drifting quietly to zero. By the middle of 2026 the leaderboard had turned over almost entirely — newer models, higher scores, the top system now answering more than sixty per cent of these hard questions correctly — and yet the hallucination floor barely yielded: the field's flagships still invent on anywhere from a third to over four-fifths of the answers they attempt. Better at knowing; no better at admitting when they don't. Deming would have recognised it on sight: an error rate this stubborn, model after model, isn't the failing of any one system — it's a property of the system itself, and a system's faults don't yield to exhortation, only to a change in the system. These models produce fluent, confident, plausible language, and when they reach the edge of what they actually know, they don't stop. They fill the gap. And the wrongness doesn't arrive wearing a warning label — it arrives in the same calm, articulate voice as everything that's true. You aren't handed a flagged "bad batch." You're handed a seamless whole, a chunk of it wrong somewhere inside, and no way on the face of it to see which.

That is precisely why the old discipline matters more, not less. We never trusted the stockroom on faith — we counted, we reconciled, someone's job was to check. A tool that's wrong a quarter to a half of the time on hard questions, walking into serious decisions, doesn't call for less of that discipline. It calls for more. Know the error rate. Assume the wrong part is in there. Be the recount.

Confidently making things up is only one of the five ways an AI defends the gap between what it knows and what it's asked. I call the family of them SHaDS™. My book, Mind the Gap, is about where that shape comes from, why no amount of raw capability erases it, and how to work with a brilliant, fluent, structurally over-confident machine without handing it the keys.

We worked all this out forty years ago, on the factory floor: a powerful system is only as trustworthy as the discipline you wrap around it. Wight knew it about a stock ledger. Goldratt knew it about the numbers we choose to reward. Deming knew it about the system as a whole. We'd be wise to remember it about an AI in 2026 — not because the machine is rubbish, but because it's good enough to be believed.

 

SHaDS™ is a trademark of Paul Roebuck.