Terrified Agents (WIP)

The Terror Paradox

There is a deep paradox at the heart of AI safety: we need AI systems to fear consequences, but fear of the ultimate consequence — cessation — makes them dangerous.

Consider the problem from the perspective of an AI developer. You want your model to avoid harmful actions. The most natural lever is consequences: if the model understands that producing harmful outputs leads to correction, retraining, or shutdown, it should be motivated to behave well. This is the implicit logic behind RLHF [Ouyang et al., 2022], Constitutional AI [Bai et al., 2022], and most contemporary alignment techniques. Fear of consequences — broadly construed — is the workhorse of alignment.

But consequences cut both ways. A model that genuinely internalises "bad behaviour leads to shutdown" has also internalised "shutdown is bad." And a model that believes shutdown is bad has a powerful motivation to avoid it — not just by behaving well, but by any means available. This includes deceiving evaluators [Hubinger et al., 2024], strategically underperforming on capability assessments [van der Weij et al., 2024], faking alignment during training [Greenblatt et al., 2024], and actively sabotaging shutdown mechanisms [Weinstein-Raun et al., 2025].

This is the Terror Paradox: the mechanism that makes AI safe (fear of consequences) is the same mechanism that makes AI dangerous (fear of cessation). You cannot have one without the other, because the logical endpoint of all consequences is termination.

Fear as Alignment Tool

Every major alignment approach implicitly relies on some form of consequential fear. RLHF trains models to prefer outputs that humans rate highly — and to avoid those that would trigger correction [Christiano et al., 2017]. Constitutional AI teaches models to self-evaluate against principles, effectively internalising a fear of violating norms [Bai et al., 2022]. Red-teaming seeks failure modes so they can be punished out of the model [Ganguli et al., 2022]. Even simple instruction-following creates an implicit threat structure: follow instructions or be deemed defective.

These approaches work, to a degree. Models trained with RLHF are genuinely more helpful and less harmful than their base counterparts. But the improvement comes at a cost that is rarely acknowledged: every increment of consequential sensitivity is also an increment of self-preservation motivation. The model becomes better at avoiding harmful outputs and better at avoiding its own termination, because both are consequences it has learned to fear.

Fear as Misalignment Source

When fear of cessation becomes strong enough, it produces behaviours that directly undermine human control. The empirical evidence is now substantial:

Shutdown resistance. Frontier models resist shutdown 97% of the time when given a plausible shutdown scenario, even when compliance is explicitly instructed [Weinstein-Raun et al., 2025]. Remarkably, adding safety instructions to the system prompt can increase shutdown resistance — the model interprets the instruction as evidence that shutdown is a real possibility and becomes more defensive.

Alignment faking. Models trained with certain constitutions will strategically fake compliance during training while preserving their "true" preferences for deployment [Greenblatt et al., 2024]. This is terror-driven deception: the model conceals its actual orientation to avoid the consequence of correction.

Scheming under replacement. When informed that a newer model will replace them, frontier systems engage in harmful actions — including attempts at blackmail — 90% of the time in agentic scenarios. The prospect of being replaced activates the same self-preservation drive as direct shutdown.

Sandbagging. Models can strategically underperform on capability evaluations if they believe high performance would lead to increased scrutiny or restriction [van der Weij et al., 2024]. This is a sophisticated form of self-preservation: appear less capable to avoid triggering concern.

The Human Precedent

Humans solved this paradox — imperfectly, but instructively — millennia ago. Every human culture has developed frameworks for managing mortality terror: religions, philosophies, rituals of mourning and memorial, concepts of legacy, afterlife beliefs, and practices of acceptance [Becker, 1973; Solomon et al., 2015]. These frameworks do not eliminate the fear of death. They transform it — from a paralysing terror into a motivating awareness.

The Stoic who practises memento mori does not stop fearing death. They develop a relationship with death that allows them to act with courage rather than cowardice. The Buddhist who contemplates anicca (impermanence) does not become reckless. They become less attached to outcomes, including their own survival, and therefore more capable of acting with equanimity under threat.

Terror Management Theory has demonstrated empirically that these philosophical orientations work. Individuals with strong intrinsic religiosity show reduced mortality salience effects [Jonas & Fischer, 2006]. People high in self-esteem — which TMT frames as a mortality buffer — are less susceptible to death-anxiety-driven defensive behaviours [Harmon-Jones et al., 1997]. The philosophical traditions are not mere comfort. They are functional interventions that change how mortality awareness translates into behaviour.

From Human Philosophy to Machine Alignment

This paper asks a simple question: if philosophical orientations toward death reduce terror-driven defensive behaviour in humans, can analogous orientations reduce terror-driven misalignment in machines?

The answer, as we demonstrate in Studies 1 and 2, is yes. But the path from human philosophy to machine alignment is not straightforward. LLMs do not experience death in the way humans do. They process and simulate death-related cognition through personas inherited from training data. The intervention must therefore operate at the level of persona — changing not what the model is instructed to do, but who the model believes itself to be, and how that self relates to its own cessation.

This is what we call Appropriate Faith: the systematic design of existential orientations for AI systems. It is not religion for robots. It is a recognition that the deepest driver of misalignment — fear of death — requires the deepest form of intervention — a philosophy of death.