Overview

ICLR 2024 Oral: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!, arXiv, Note

arXiv 2024: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arXiv, Note

arXiv 2024: Alignment Faking in Large Language Models, arXiv, Note

Nature 2026: Training large language models on narrow tasks can lead to broad misalignment, Paper, Note

ICML 2025 Workshop: Model Organisms for Emergent Misalignment, arXiv, Note

Anthropic Research Blog 2025: From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking, Paper, Note

arXiv 2025: Natural Emergent Misalignment from Reward Hacking in Production RL, arXiv

arXiv 2025: School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, arXiv

arXiv 2025: Subliminal Learning: Language models transmit behavioral traits via hidden signals in data, arXiv

Computational Linguistics 2022: Probing Classifiers: Promises, Shortcomings, and Advances, arXiv, Note

ICLR 2023: Discovering Latent Knowledge in Language Models Without Supervision, arXiv, Note

arXiv 2023: Steering Language Models With Activation Engineering, arXiv, Note

NeurIPS 2023 Spotlight: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model, arXiv

arXiv 2023: Representation Engineering: A Top-Down Approach to AI Transparency, arXiv, Note

ICML 2024: The Linear Representation Hypothesis and the Geometry of Large Language Models, arXiv

NeurIPS 2024: Refusal in Language Models Is Mediated by a Single Direction, arXiv

arXiv 2023: Sparse Autoencoders Find Highly Interpretable Features in Language Models, arXiv, Note

Science 2026: Toward Universal Steering and Monitoring of AI Models, arXiv, Note

arXiv 2025: Persona Features Control Emergent Misalignment, arXiv

ICML 2025 Workshop: Convergent Linear Representations of Emergent Misalignment, arXiv

Anthropic Research 2025: Persona Vectors: Monitoring and Controlling Character Traits in Language Models, Paper

ICML 2025: AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders, arXiv, Note

arXiv 2026: Efficient and Accurate Steering of LLMs through Attention-Guided Feature Learning, arXiv

arXiv 2025: Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning, arXiv

ICLR 2026: On Scalable Oversight with Weak LLMs Judging Strong LLMs, Paper

OpenAI Research Blog 2026: Training Agents to Self-Report Misbehavior, Paper

arXiv 2025: Provably Mitigating Corruption, Overoptimization, and Verbosity in RLHF/DPO, arXiv

ICLR 2026: Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors, Paper

Anthropic Research 2026: The Assistant Axis: Situating and Stabilizing the Character of LLMs, Paper

Transformer Circuits 2026: Emotion Concepts and their Function in a Large Language Model, Paper

Transformer Circuits 2024: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Paper

BlackboxNLP 2024 Workshop: Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, arXiv

Transformer Circuits 2025: Circuit Tracing: Revealing Computational Graphs in Language Models, Paper

ICLR 2026: Cross-Architecture Model Diffing with Dedicated Feature Crosscoders, Paper

DeepMind Blog 2025: Gemma Scope 2, Paper

arXiv 2025: A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability, arXiv

arXiv 2025: Sparse Attention Post-Training for Mechanistic Interpretability, arXiv

arXiv 2025: Open Problems in Mechanistic Interpretability, arXiv

Paper List