The Blind Spot in AI Safety
Jennifer Kinne / May 26, 2026
Detail from Digital Society Bell by Lone Thomasky & Bits&Bäume. Better Images of AI / CC BY 4.0
Last month, thousands of machine learning researchers gathered in Rio de Janeiro for ICLR 2026, one of the most influential AI conferences in the world. Among the papers being presented was one from Anthropic researchers formalizing a claim that has been circulating in AI safety circles since 2023 and is now arriving with empirical backing: that frontier AI systems will fail unpredictably rather than through coherent misalignment. That claim is poised to move into governance discussions. It deserves closer scrutiny.
The paper, also published as a blog post on Anthropic’s alignment research site, asks a question that matters for anyone responsible for deploying AI in high-stakes environments: when advanced AI systems fail, will they do so by coherently pursuing the wrong goals, or by failing unpredictably, in ways that don’t reflect any consistent objective? The authors call the second scenario the “hot mess,” and their conclusion is that as AI systems tackle harder tasks, hot mess failures will dominate. In their own framing, the authors predict an AI future “where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal.” The implication for governance is direct: organizations should focus on preventing accidents rather than constraining misaligned goals.
The empirical observations deserve to be taken seriously. The governance conclusion does not follow from them. And the distance between those two things is a space that machine learning peer review is not built to catch.
What the paper measures
To distinguish between coherent and incoherent failure, the authors use a statistical tool called bias-variance decomposition. The core question is whether a model gives consistent answers across many attempts at the same problem. High consistency means failures are systematic: the model is reliably wrong in the same direction. Low consistency means failures are scattered: the model is unpredictably wrong each time. The authors find that as tasks get harder, AI failures become increasingly scattered. They call this “error-incoherence” and draw the industrial accident conclusion from it.
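For readers who want to see the arithmetic, here is a minimal sketch of a bias-variance decomposition over repeated attempts at one problem. It is a toy with scalar outputs and squared error, not the paper's actual setup; the function name, the target value, and the two simulated "models" are assumptions made purely for illustration.

```python
import random
import statistics


def bias_variance(samples, target):
    """Split mean squared error over repeated attempts into bias^2 + variance."""
    samples = list(samples)
    mean_output = statistics.fmean(samples)
    # Bias: how far the *average* answer sits from the target.
    bias_sq = (mean_output - target) ** 2
    # Variance: how widely individual answers scatter around their own mean.
    variance = statistics.pvariance(samples, mu=mean_output)
    mse = statistics.fmean([(s - target) ** 2 for s in samples])
    return bias_sq, variance, mse


rng = random.Random(0)   # fixed seed so the toy is repeatable
target = 10.0            # hypothetical benchmark label

# Hypothetical model A: reliably wrong in the same direction.
systematic = [rng.gauss(12.0, 0.1) for _ in range(1000)]
# Hypothetical model B: right on average, unpredictably wrong each time.
scattered = [rng.gauss(10.0, 2.0) for _ in range(1000)]

sys_bias, sys_var, sys_mse = bias_variance(systematic, target)
sca_bias, sca_var, sca_mse = bias_variance(scattered, target)

print(f"systematic: bias^2={sys_bias:.3f} variance={sys_var:.3f}")
print(f"scattered:  bias^2={sca_bias:.3f} variance={sca_var:.3f}")
```

In this toy, the first model's error is almost entirely bias and the second's is almost entirely variance, which is the distinction the paragraph above describes. Note that both quantities are defined relative to a fixed `target`; without one, only the scatter can be computed.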
The problem is what consistency actually measures here. The paper assesses whether a model’s outputs are consistent, not whether they are consistent with reality, and the authors acknowledge this themselves. Under “Limitations of our framework,” the blog post states: “To rigorously measure bias and variance, we need well-defined targets.” Those well-defined targets are benchmark labels: the correct answer on a test, the passing result on a coding evaluation. What the framework measures, then, is how consistently a model’s outputs cluster around its own mean relative to those benchmarks. The authors did not follow this flag into its governance implication: a framework requiring well-defined targets cannot evaluate whether a model’s outputs remain grounded in reality across the open-ended, evolving contexts where governance failures actually occur.
This is not an idiosyncratic flaw in this paper. It reflects standard practice across machine learning evaluation, where benchmark performance gets called “understanding” and output consistency gets called “coherence.” ML peer review evaluates whether the math is right. It is not designed to evaluate whether the conceptual claims that the math is being asked to support are valid. By the standards of the venue, the paper is sound. By the standards of governance, the conclusions being drawn from it are insufficient.
What no one is measuring
The paper distinguishes between two types of AI failure: unpredictable incoherence and consistent pursuit of a wrong goal. But there is a third failure mode that falls between them, and that is the most consequential for organizations deploying AI in regulated environments.
That failure mode is what I am calling epistemic drift: a model that appears internally consistent and passes standard validation, but whose outputs have gradually decoupled from ground truth in a directional, accumulating way. The model is not randomly wrong; nor is it coherently pursuing a misaligned goal. It is operating coherently within a reference frame that is not following reality.
The practical consequence is that, while standard validations keep passing and output metrics continue to look normal, the model’s drift becomes embedded in the systems that depend on it before anyone notices the deviation. The bias-variance framework cannot see this; the paper itself notes that “scaling alone won’t eliminate incoherence in errors.” The deeper problem is that scaling the same measurement system won’t reveal the drift that the system is structurally blind to.
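The concern can be illustrated with a deliberately simple simulation. This is a hypothetical toy, not a model of any real system or of the paper's method: the drift rate, noise level, and variable names are all assumptions. Each "update" shifts the reference the model's outputs cluster around, while the scatter of those outputs stays the same.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the toy is repeatable
ground_truth = 10.0      # assumed known here; in deployment it often is not


def sample_outputs(reference, n=200, noise=0.05):
    """Draw n tightly clustered outputs around the model's own reference."""
    return [rng.gauss(reference, noise) for _ in range(n)]


spreads = []  # output-level consistency: scatter around the model's own mean
gaps = []     # distance between the model's mean output and ground truth

for update in range(6):
    # Hypothetical drift: each update moves the reference frame by 0.5.
    reference = ground_truth + 0.5 * update
    outputs = sample_outputs(reference)
    spreads.append(statistics.pstdev(outputs))
    gaps.append(abs(statistics.fmean(outputs) - ground_truth))

for update, (spread, gap) in enumerate(zip(spreads, gaps)):
    print(f"update {update}: spread={spread:.3f} gap_from_truth={gap:.3f}")
```

The spread stays roughly flat across updates while the gap from ground truth grows. A check that looks only at the spread would report nothing unusual. The gap is visible in this sketch only because the simulation is handed `ground_truth` directly, which is exactly what open-ended deployment contexts tend to lack.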
Why this is a governance problem
Organizations deploying AI in regulated environments, such as life sciences, financial services, healthcare, and research institutions, validate systems at a point in time and assume a baseline of reliability going forward. Yet every subsequent update, round of fine-tuning, or preference optimization pass may have shifted that system’s internal reference frame. If governance approaches rely on output-level consistency metrics to detect failure, they are auditing the wrong thing. The system can pass every check while the model drifts from reality.
The pattern is already visible in healthcare. A 2025 study in JAMA Health Forum examining nearly 1,000 FDA-cleared AI medical devices found that 43% of recall events occurred within one year of authorization, and that recalled devices were disproportionately those that had not undergone clinical trials before reaching the market. The FDA’s clearance pathway asked the wrong question: not “does this correspond to clinical reality” but “does this resemble an approved device.” Devices cleared that standard. Then they failed in deployment. Governance frameworks that rely on output-level consistency metrics in AI are asking the same wrong question.
This matters beyond individual organizations. The NIST AI Risk Management Framework, sector regulators including the FDA, SEC, and FTC, and organizational governance frameworks increasingly reference ML safety research as a technical foundation for policy. When that research contains a measurement error – not a mathematical error, but a category error about what is being measured – the governance frameworks built on it inherit the same blindness. A paper that tells organizations how AI will fail, using a tool that cannot see one of the most structurally dangerous failure modes, is not a safe foundation for policy.
To be clear, the paper is not wrong to say that frontier models show increasing output variance on complex tasks. That finding is valuable. The problem is the move from “high variance” to “incoherence” (a conceptual claim the measurement cannot support) and the governance conclusions drawn from it. When that claim clears peer review at a prestigious venue because the math is correct, it enters the literature with authority it hasn’t fully earned. It will be cited, referenced in guidance documents, and used to shape deployment decisions by people who will never read the paper itself.
What needs to change
The question AI safety research actually needs to answer is not whether AI will fail incoherently or coherently. It is whether we can detect when a system’s internal reference frame has departed from reality before that divergence becomes foundational; that is, before it has already shaped the decisions, recommendations, and outputs that organizations and regulators are relying on.
The tools to do this rigorously do not yet exist in standard evaluation practice. We cannot govern well what we have not learned to detect.
