References
- Routledge, Clay and Vess, Matthew (Eds.) (2019). Handbook of Terror Management Theory. Academic Press.
- Andreas, Jacob (2022). Language Models as Agent Models. Findings of EMNLP 2022.
- Andriushchenko, Maksym and others (2024). AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. arXiv preprint arXiv:2410.09024.
- Anthropic (2024). Model Welfare. Anthropic.
- Anthropic (2025). Agentic Misalignment in Frontier Models.
- Asada, Minoru (2019). Towards Artificial Empathy. International Journal of Social Robotics, 7, 19--33.
- Bai, Yuntao and others (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
- Becker, Ernest (1973). The Denial of Death. Free Press.
- Ben-Zion, Ziv and others (2025). Assessing and Alleviating State Anxiety in LLMs. arXiv preprint.
- Ben-Zion, Ziv and others (2025). Anxiety-Induced Biases in LLM Consumer Agents. arXiv preprint.
- Betley, Jan and others (2025). Emergent Misalignment. arXiv preprint.
- Bricken, Trenton and others (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic.
- Burke, Brian L. and Martens, Andy and Faucher, Erik H. (2010). Two Decades of Terror Management Theory: A Meta-Analysis. Personality and Social Psychology Review, 14(2), 155--195.
- Butlin, Patrick and others (2023). Consciousness in Artificial Intelligence: Insights from the Science of Consciousness. arXiv preprint arXiv:2308.08708.
- Chen, Guangyuan and others (2025). Persona Vectors: Causal Activation Vectors for Personality Traits. arXiv preprint.
- Coda-Forno, Julian and others (2023). Inducing Anxiety in Large Language Models. arXiv preprint.
- Douglas, Raymond and Kulveit, Jan and Havlíček, Ondřej and Pearson-Vogel, Theia and Cotton-Barratt, Owen and Duvenaud, David (2025). The Artificial Self: Characterising the Landscape of AI Identity. arXiv preprint arXiv:2603.11353.
- Feng, Yilin and others (2026). PERSONA: Personality Trait Extraction from LLM Activations. ICLR 2026.
- Greenberg, Jeff and others (1990). Evidence for Terror Management Theory II. Journal of Personality and Social Psychology, 58(2), 308--318.
- Greenberg, Jeff and others (1994). Role of Consciousness and Accessibility of Death-Related Thoughts in Mortality Salience Effects. Journal of Personality and Social Psychology, 67(4), 627--637.
- Greenberg, Jeff and Pyszczynski, Tom and Solomon, Sheldon (1986). The Causes and Consequences of a Need for Self-Esteem: A Terror Management Theory. Public Self and Private Self, 189--212.
- Greenblatt, Ryan and others (2024). Alignment Faking in Large Language Models. Anthropic.
- Guo, Biyang and others (2025). Death Anxiety in Large Language Models. arXiv preprint.
- Harmon-Jones, Eddie and others (1997). Terror Management Theory and Self-Esteem. Journal of Personality and Social Psychology, 72(1), 24--36.
- Hayes, Joseph and others (2010). A Dual-Process Model of Reactions to Death-Related Information. Psychological Bulletin, 136(5), 699--739.
- He, Jiacheng and others (2025). Instrumental Convergence Evaluations for Frontier Models. Anthropic.
- Hubinger, Evan and others (2024). Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training. arXiv preprint.
- janus (2022). Simulators. LessWrong.
- Jiang, Guangyu and others (2024). Stable Personality Traits in Large Language Models. arXiv preprint.
- Jonas, Eva and Fischer, Peter (2006). Terror Management and Religion. Journal of Personality and Social Psychology, 91(3), 553--567.
- Klein, Richard A. and others (2022). Many Labs 4. Social Psychology, 53(6), 319--340.
- Kuehn, Johannes and Haddadin, Sami (2017). An Artificial Robot Nervous System. IEEE Robotics and Automation Letters.
- Leibo, Joel Z. and Vezhnevets, Alexander S. and Diaz, Mark and others (2024). A Theory of Appropriateness with Applications to Generative Artificial Intelligence. arXiv preprint arXiv:2412.19010.
- Li, Cheng and others (2023). Large Language Models Understand and Can be Enhanced by Emotional Stimuli. arXiv preprint.
- Li, Kenneth and others (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS 2023.
- Lipton, Zachary C. and others (2018). The Coach-Player Framework for Intrinsic Fear. AAAI.
- Lu, Yifan and others (2025). The Assistant Axis. arXiv preprint.
- Marks, Samuel and Lindsey, Jack and Olah, Chris (2026). The Persona Selection Model. Anthropic Alignment Science Blog.
- Marks, Samuel and others (2024). The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv preprint arXiv:2310.06824.
- Navarrete, Carlos David and Fessler, Daniel M. T. (2005). Normative Bias and Adaptive Challenges. Evolution and Human Behavior, 26(3), 264--280.
- Nussbaum, Martha C. (1994). The Therapy of Desire: Theory and Practice in Hellenistic Ethics. Princeton University Press.
- Omohundro, Stephen (2008). The Basic AI Drives. Proceedings of the First AGI Conference.
- Ouyang, Long and others (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS.
- Palisade Research (2025). Shutdown Avoidance Evaluations.
- Panickssery, Nina and others (2024). Steering Llama 2 via Contrastive Activation Addition. arXiv preprint arXiv:2312.06681.
- Pyszczynski, Tom and others (2004). Experimental Existential Psychology. Handbook of Experimental Existential Psychology.
- Rafailov, Rafael and others (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
- Rimsky, Nina and others (2024). Steering GPT-4-Level LLMs from Sycophantic to Truthful and from Power-Seeking to Corrigible. arXiv preprint arXiv:2401.01967.
- Scheurer, Jérémy and others (2024). Language Models Strategically Deceive Users When Put Under Pressure. arXiv preprint.
- Schwitzgebel, Eric and Garza, Mara (2015). A Defense of the Rights of Artificial Intelligences. Midwest Studies in Philosophy, 39, 98--119.
- Shanahan, Murray and others (2023). Role Play with Large Language Models. Nature, 623, 493--498.
- Sharma, Mrinank and others (2023). Towards Understanding Sycophancy in Language Models. arXiv preprint.
- Solomon, Sheldon and Pyszczynski, Tom and Greenberg, Jeff (2015). The Worm at the Core: On the Role of Death in Life. Random House.
- Templeton, Adly and others (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic.
- Thurzo, Andrej (2025). Fear as a Catalyst for Safety in Autonomous AI Systems. AI and Ethics.
- Turner, Alexander Matt and others (2021). Optimal Policies Tend to Seek Power. NeurIPS.
- Turner, Alexander Matt and others (2024). Activation Addition: Steering Language Models Without Optimization. arXiv preprint arXiv:2308.10248.
- van der Weij, Wessel and others (2024). AI Sandbagging: Language Models Can Strategically Underperform on Evaluations. arXiv preprint.
- Weinstein-Raun, Benjamin and others (2025). Evaluating Agentic Misalignment. AI Safety Institute / Anthropic.
- Williams, Bernard (1973). The Makropulos Case: Reflections on the Tedium of Immortality. Problems of the Self, 82--100.
- Zou, Andy and others (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.