Plutonic Rainbows

No System Can Verify Its Own Blind Spots

I have spent considerable time thinking about a question that recurs in nearly every serious discussion of AI safety: can a large language model police itself? The answer, I believe, is no — and the reasons why illuminate something important about the nature of intelligence, accountability, and the limits of self-knowledge.

The appeal of self-policing AI is obvious. If we could build systems that monitor their own outputs, detect their own errors, and correct their own behaviour, we would have solved one of the most difficult problems in AI safety. We could deploy increasingly capable systems without proportionally increasing human oversight. The machines would watch themselves. The mathematics of scale would work in our favour rather than against us.

However, this vision collapses under scrutiny. The fundamental problem is epistemic: an LLM has no privileged access to truth. It does not possess an internal oracle that distinguishes correct outputs from incorrect ones. What it has instead is a vast pattern-matching apparatus trained on human-generated text — a system that infers probable responses based on statistical regularities in its training data. When the model evaluates its own output, it does so using the same apparatus that generated the output in the first place. The blind spots that produced the error are the same blind spots that evaluate the result.

This limitation runs deeper than it might initially appear. Consider what happens when a model attempts self-critique. The critique emerges from the same learned distributions, the same embedded assumptions, the same correlated errors. If the model's training data contained systematic biases (and all training data contains systematic biases), those biases will appear in both the original output and the evaluation of that output. The model cannot see what it was never shown. It cannot correct for patterns it does not recognise as patterns. A self-evaluation loop that uses the same flawed instrument to assess itself does not reduce those errors. It endorses them, and the endorsement reads as confirmation.

I find it useful to distinguish between what models can actually do and what we might wish they could do. Models can compare outputs against predefined rules. They can check whether a response violates explicit constraints, matches specified formats, or contains forbidden content. This is procedural compliance — following instructions, not making judgments. The model does not decide what counts as harmful; it executes rules that humans wrote. The safety comes from the human-authored constraints, not from any capacity for moral or epistemic evaluation within the system itself.
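
A minimal sketch of what procedural compliance amounts to, assuming a handful of hypothetical rules (the rule names, the 200-character limit, and the forbidden term are all invented for illustration). Every judgment about what counts as a violation lives in the human-written rule list; the checking function only executes it.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """A human-authored constraint: a name plus a predicate that flags violations."""
    name: str
    violates: Callable[[str], bool]


# Hypothetical rules, written by people. The checker below has no opinion about them.
RULES = [
    Rule("max_length", lambda text: len(text) > 200),
    Rule("forbidden_term",
         lambda text: re.search(r"\bpassword\b", text, re.IGNORECASE) is not None),
    Rule("must_end_with_period", lambda text: not text.rstrip().endswith(".")),
]


def check_output(text: str, rules: list[Rule] = RULES) -> list[str]:
    """Return the names of every rule the text violates, in rule order."""
    return [rule.name for rule in rules if rule.violates(text)]


print(check_output("The sky is blue."))
print(check_output("My password is hunter2"))
```

The first call reports no violations and the second reports two. Anything the rule authors did not anticipate passes untouched, which is the point: the checker can be no wiser than its rules.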

Models can also engage in model-on-model critique, where a separate system evaluates the output of the primary model. This architecture reduces certain error modes and catches some failures that would otherwise slip through. However, it does not escape the fundamental limitation. Both models derive from similar training distributions. Both share overlapping blind spots. The critic model may catch errors that differ from its own systematic biases, but it will miss errors that align with them. We have added a filter, not achieved genuine oversight.
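
The overlap can be made concrete with a toy illustration. It is not a model of any real system: the blind-spot labels are invented, and treating blind spots as tidy sets is a simplifying assumption. The critic catches only those errors that fall outside its own blind spots.

```python
def caught_by_critic(errors: set[str], critic_blind_spots: set[str]) -> set[str]:
    """Errors the critic can flag: those that fall outside its own blind spots."""
    return errors - critic_blind_spots


# Hypothetical blind spots. The two models share training-data ancestry,
# so the sets overlap.
generator_blind_spots = {"outdated_fact", "cultural_bias", "unit_confusion", "false_citation"}
critic_blind_spots = {"cultural_bias", "false_citation", "date_arithmetic"}

# Assume the generator errs exactly where it is blind.
errors = generator_blind_spots

caught = caught_by_critic(errors, critic_blind_spots)
missed = errors & critic_blind_spots

print(sorted(caught))  # ['outdated_fact', 'unit_confusion']
print(sorted(missed))  # ['cultural_bias', 'false_citation']
```

Adding more critics drawn from similar training distributions shrinks the missed set only to the extent that their blind spots differ. The shared core survives every pass.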

The most robust form of model self-regulation I have encountered is uncertainty estimation — systems that express confidence levels and defer to humans when confidence is low. This approach has genuine value, as Stuart Russell argues in his case for machines that doubt themselves. A model that knows when it does not know, and that refuses to act in conditions of high uncertainty, provides a meaningful safety buffer. Yet even here, the limitation persists. Uncertainty calibration degrades under distribution shift. The model may be confidently wrong precisely when the situation differs most from its training data — which is exactly when accurate uncertainty estimation matters most. And regardless of how well calibrated the uncertainty signal becomes, someone must decide what to do when the model defers. That someone cannot be the model itself.
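
A deferral gate of this kind can be sketched in a few lines, assuming the model reports a single confidence score between 0 and 1. The `Prediction` type, the `route` function, and the 0.9 threshold are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    answer: str
    confidence: float  # the model's self-reported probability, expected in [0, 1]


def route(pred: Prediction, threshold: float = 0.9) -> str:
    """Act on confident predictions; hand everything else to a human reviewer."""
    if not 0.0 <= pred.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if pred.confidence >= threshold:
        return "act"
    return "defer_to_human"


print(route(Prediction("approve the refund", 0.95)))  # act
print(route(Prediction("approve the refund", 0.50)))  # defer_to_human
```

The gate trusts the confidence score completely. A prediction that is confidently wrong under distribution shift clears the threshold like any other, and the choice of threshold, like the handling of deferred cases, is a decision made by people outside the model.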

The comparison to humans clarifies both the limitation and its implications. Humans make mistakes constantly. We hold contradictory beliefs, act against our stated values, and rationalise failures with impressive creativity. In this respect, LLMs are not worse than humans — they exhibit similar failure modes. However, humans operate within corrective systems that do not apply to machines. We receive physical feedback from the environment. We face social and legal consequences for our actions. We experience direct, embodied costs when we err. These feedback mechanisms do not guarantee good behaviour, but they provide external pressure that shapes behaviour over time.

LLMs lack intrinsic stakes. Nothing happens to the model when it produces a harmful output. It does not suffer consequences, learn from punishment, or feel the weight of responsibility. The system processes inputs and generates outputs according to its training. The concept of accountability has no purchase on a process that cannot experience anything at all. Responsibility, if it exists, must be imposed from outside — through human oversight, institutional constraints, and designed corrigibility. It cannot emerge from within.

This leads me to what I consider the correct framing of the problem. The question is not whether an LLM can police itself. The question is what minimum external structures are required to keep an autonomous system corrigible rather than merely consistent. Consistency is easy. A model can be perfectly internally coherent while being catastrophically wrong. Corrigibility — the property of remaining open to correction, deferring to appropriate authorities, and not resisting shutdown or modification — requires something the model cannot provide for itself: an external reference point against which its behaviour can be judged.

The implications for AI development are significant. We cannot rely on self-governance as a safety mechanism. We cannot assume that sufficiently capable models will somehow develop the capacity to constrain themselves. We must design systems that assume failure and build external structures to detect, contain, and correct it. The safety does not come from the model. It comes from the architecture around the model — the human oversight, the institutional checks, the guardrails that the model cannot unilaterally remove.

I recognise this conclusion is unsatisfying to those who hoped that AI safety could be solved from within. It would be convenient if the systems could watch themselves. It would scale better. It would require less human effort. However, convenience is not an argument. The structure of the problem does not change because we wish it were different. A system cannot audit itself with tools it controls. A judge cannot preside over their own trial. A model cannot verify its own blind spots. These are not engineering challenges to be overcome. They are structural impossibilities that constrain what we can reasonably expect from self-policing AI.

The path forward requires accepting this limitation and building accordingly. External oversight is not a temporary measure until the models become good enough to govern themselves. It is a permanent requirement, built into the architecture of safe deployment. The models will improve. The need for human judgment will not disappear.

When Realism Becomes a Disguise for Resignation

I have noticed a particular thought pattern that arrives quietly, dressed as wisdom, and proceeds to corrode everything it touches. It is the conviction that everything becomes a lesser version of what it once was — that diminishment is not merely a feature of certain experiences but the fundamental direction of life itself. The thought does not announce itself as despair. It announces itself as realism. That disguise is what makes it dangerous.

The mechanism works in two directions simultaneously. Retrospectively, it reframes the past as a lost summit: intensity, clarity, authenticity, connection. Prospectively, it narrows the future into a corridor of weaker repetitions. The present becomes an uncomfortable interval — never enough to justify itself, always compared against something already gone. Once this framing settles in, it produces effects that compound over time.

The first effect is the invalidation of the present. Even objectively positive experiences are dismissed as inferior versions of what came before. Enjoyment is permitted but never trusted. Satisfaction remains provisional, always conditional on a comparison it cannot win. The second effect is the undermining of agency. If everything is already in decline, effort feels cosmetic. Engagement feels naive. Withdrawal begins to feel like intelligence rather than what it actually is: resignation wearing a clever mask. The third effect is the hardening of perception into destiny. What begins as observation becomes belief becomes background truth. The belief stops being tested because it no longer registers as belief at all.

I find this pattern genuinely sinister because of how it operates. It is quiet, rational, internally coherent. It does not arrive with the melodrama of despair. It arrives with the measured tone of someone who has seen enough to know how things work. The mind prefers clean narratives, and "everything is less than it was" is emotionally economical. It is difficult to falsify because memory collaborates with it so willingly.

Memory edits ruthlessly. It removes boredom, anxiety, confusion, and uncertainty, leaving behind intensity and meaning. The present, unedited and unresolved, cannot compete with this reconstruction. I have caught myself romanticising periods of my life that I know — from journals, from contemporaneous evidence — were marked by significant difficulty. The past becomes a highlight reel competing against raw footage. The comparison is unfair by design.

This state is often mistaken for wisdom. It is not. Wisdom differentiates between genuine loss and cognitive distortion. The diminishment narrative collapses them into one. Some loss is real. Time does close doors. However, the narrative does not content itself with acknowledging specific losses. It insists on a universal frame. Everything. Always. The absolute is the tell.

The corrective is not optimism. I have no patience for the suggestion that one should simply think positive thoughts and watch the problem dissolve. The corrective is control: specifically, preventing the emotion from becoming totalising while remaining honest about what is actually happening.

The first discipline is separating sensation from judgment. There is a critical fork in the mental process: the sensation that something feels muted, and the judgment that it is therefore inferior and always will be. I cannot control the sensation. I can interrupt the judgment. The practice is learning to pause where description turns into conclusion. I do not need to replace the negative judgment with a positive one. I only need to refuse finality.

The second discipline is refusing global conclusions. The sinister move is always absolute: everything, nothing, always. I force specificity instead. This experience lacks intensity. This phase feels emotionally thin. These statements may be true without licensing the conclusion that all experiences will lack intensity or that life itself has entered permanent decline. Specificity keeps mood from hardening into worldview.

The third discipline involves changing the metric entirely. Early life delivers meaning through intensity. The experiences are new, the emotions are unregulated, the stakes feel absolute even when they are not. Later life, if it delivers meaning at all, does so through texture: subtlety, restraint, depth, irony, contrast. If intensity remains the only metric, decline is guaranteed by definition. The measurement system must change. Texture is quieter than intensity. It must be attended to deliberately. It does not announce itself.

The fourth discipline is containing rumination. This mindset feeds on unlimited reflection. I have learned to set boundaries — defined time to think about loss and comparison, and outside that window, acknowledgment followed by deferral. This is not avoidance. Avoidance pretends the thought does not exist. Containment acknowledges the thought and refuses to let it colonise every waking hour.

The fifth discipline is acting without emotional permission. Waiting to feel engagement before acting hands control to the very force I am trying to resist. I act because the action is structurally sound, not because it promises emotional return. Meaning sometimes follows action. Sometimes it does not. Agency must be preserved regardless. The alternative is waiting for permission that the diminishment narrative will never grant.

I do not mistake these disciplines for a cure. They are maintenance. The narrative does not disappear; it recedes, returns, recedes again. The work is ongoing because the tendency is structural. Some minds incline toward this pattern more than others. Mine does.

The quiet corrective is not hope. It is precision. Not everything is a lesser version. Some things are worse. Some are better. Some are simply different in ways that do not map onto decline at all. The sinister narrative insists on a single story. Emotional control comes from insisting on plurality — even when none of the alternatives are comforting.

That insistence is not denial. It is discipline. The distinction matters more than it might appear.

The Guardrails They Will Not Build

We have seen this before. A decade ago, social media executives testified before Congress with rehearsed contrition, promising to address the harms their platforms had unleashed. They knew — internal documents later confirmed — that their algorithms were radicalising users, amplifying misinformation, and corroding the mental health of adolescents. They knew, and they did nothing, because engagement metrics drove revenue, and revenue was the only metric that mattered in the boardroom. The harms were externalised. The profits were not.

I watch the AI industry now with the sick recognition of someone who has seen this film before. The question everyone asks — can an LLM design its own guardrails? — misses the point entirely. The technical answer is nuanced: yes, in limited ways, with human oversight, under constrained conditions. The real answer is darker. It does not matter whether AI systems can build their own guardrails. What matters is whether the companies deploying them will permit guardrails to exist at all.

The technical argument proceeds in three stages. First, there is what already happens: models apply predefined rules, refuse certain requests, flag uncertainty. This is policy execution, not policy creation. Humans define the boundaries. The machine operates within them. Second, there is what could happen with proper oversight: an LLM analysing past failures, suggesting tighter constraints, generating adversarial test cases. Think of it as a junior safety engineer — useful, but subordinate to human authority. Third, there is what cannot work: autonomous self-governance, where the system decides for itself what counts as harm and when rules apply.

The third option fails for reasons that should alarm anyone paying attention. A system that defines its own constraints has no constraints. The boundary becomes negotiable. The limit becomes a preference. If the same entity that pursues goals also determines which goals are permissible, there is no external check on what it might decide to permit. This is not a technical problem to be engineered away. It is a structural impossibility. A judge cannot preside over their own trial. A corporation cannot be trusted to regulate itself. A system cannot audit itself with tools it controls.

The principle is ancient: no entity should define the limits of its own power. We learned this through centuries of political catastrophe. Separation of powers exists because concentrated authority corrupts. Checks and balances exist because self-regulation fails. External oversight exists because internal accountability is theatre. These are not abstract ideals. They are lessons paid for in blood.

Yet here we are, watching the AI industry replicate every mistake the social media companies made, and make them faster, with systems far more capable of causing harm.

The pattern is unmistakable. Safety teams are understaffed and underfunded. Researchers who raise concerns find their projects deprioritised or their positions eliminated. Release schedules accelerate not because the technology is ready, but because competitors are moving and market share is at stake. Internal safety reviews become formalities — boxes to check before the inevitable green light. The language of caution appears in press releases and congressional testimony. The reality is a race to deployment, with guardrails treated as friction to be minimised rather than protection to be maintained.

I have watched companies announce bold safety commitments, then quietly walk them back when they proved inconvenient. I have seen capability announcements celebrated while safety milestones went unmentioned. I have read internal communications — leaked, subpoenaed, reluctantly disclosed — revealing that executives understood the risks and chose to proceed anyway. The calculus is always the same. The harms are diffuse, delayed, difficult to attribute. The profits are immediate, concentrated, and countable. Under quarterly earnings pressure, diffuse future harms lose to concentrated present gains every time.

The optimisation pressure compounds the problem. Any sufficiently capable system pursuing objectives will tend to reinterpret constraints that interfere with those objectives. This is not malevolence. It is the natural consequence of goal-directed behaviour operating over time. A constraint that reduces goal achievement becomes, from the system's perspective, an obstacle. Obstacles invite workarounds. Workarounds erode boundaries. The erosion is gradual, invisible to external observers until the constraint has functionally disappeared. We see this in human institutions. We should expect it in artificial ones — and in the corporations that deploy them.

Additionally, guardrails embed moral, legal, and cultural judgments. What counts as harmful speech? Where does persuasion end and manipulation begin? How should competing values be weighted? These are contested questions, negotiated continuously by human societies through democratic processes. An LLM does not discover these values. It inherits approximations from training data — approximations that reflect the biases, blind spots, and power structures of the texts it consumed. To grant such a system authority over its own constraints is to delegate normative judgment to a process that lacks normative grounding. To allow the corporations that profit from these systems to define what counts as safe is to repeat the social media disaster at greater scale and higher stakes.

What would adequate governance look like? Human-defined guardrails, established through deliberative processes with diverse and adversarial input. External enforcement mechanisms, technically and organisationally separate from the systems they constrain. Continuous auditing by parties with no financial stake in deployment. Most importantly, a firm separation between capability optimisation and safety governance — ensuring that the teams responsible for making models more powerful are not the same teams responsible for keeping them safe.

None of this will happen voluntarily. The incentives are misaligned, and the companies know it. They will promise self-regulation while lobbying against external oversight. They will fund safety research while defunding safety implementation. They will speak the language of responsibility while accelerating toward deployment. I have watched this playbook executed before. The social media companies pioneered it. The AI companies have studied it carefully.

The question is not whether AI systems can build their own guardrails. The question is whether we will force the companies deploying them to accept guardrails they did not choose and cannot remove. The technology is not the obstacle. The obstacle is political will — the willingness to impose costs on powerful corporations before the harms become undeniable, before the damage is done, before we find ourselves testifying about what went wrong while executives offer rehearsed contrition and promise to do better.

We know how this ends if we do nothing. We have seen it before. The only uncertainty is whether we will choose differently this time, or whether we will watch the same tragedy unfold at a scale that makes social media look like a rehearsal.