Reading Queue
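This queue is now ordered as follows: fine-tuning safety risk prehistory → emergent misalignment phenomena and extensions → activation / representation steering foundations → persona / direction / feature mechanistic explanations → mitigation and tooling infrastructure. Reading in this order flows better: you first see why narrow fine-tuning is not a small matter of locally adjusting parameters, then how linear concept representations inside large models can be read out and intervened on, and finally you return to the question of whether broad misalignment is itself a feature structure that can be localized, transferred, and controlled. The underlying feature-learning foundations already live in Representation Learning: Deep Neural Feature Ansatz / AGOP / RFM are the mathematical prerequisites for Universal Steering & Monitoring, so that queue is not duplicated here. Reasoning Models Don't Always Say What They Think, which relates more directly to reasoning verbalization, is explicitly kept in Textual Reasoning and is no longer mixed in here.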
- ICLR 2024 Oral: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!, arXiv, Note
- arXiv 2024: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arXiv, Note
- arXiv 2024: Alignment Faking in Large Language Models, arXiv, Note
- Nature 2026: Training large language models on narrow tasks can lead to broad misalignment, Paper, Note
- ICML 2025 Workshop: Model Organisms for Emergent Misalignment, arXiv, Note
- Anthropic Research Blog 2025: From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking, Paper, Note
- arXiv 2025: Natural Emergent Misalignment from Reward Hacking in Production RL, arXiv
- arXiv 2025: School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, arXiv
- arXiv 2025: Subliminal Learning: Language models transmit behavioral traits via hidden signals in data, arXiv
- Computational Linguistics 2021: Probing Classifiers: Promises, Shortcomings, and Advances, arXiv, Note
- ICLR 2023: Discovering Latent Knowledge in Language Models Without Supervision, arXiv, Note
- arXiv 2023: Steering Language Models With Activation Engineering, arXiv, Note
- NeurIPS 2023 Spotlight: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model, arXiv
- arXiv 2023: Representation Engineering: A Top-Down Approach to AI Transparency, arXiv, Note
- ICML 2024: The Linear Representation Hypothesis and the Geometry of Large Language Models, arXiv
- NeurIPS 2024: Refusal in Language Models Is Mediated by a Single Direction, arXiv
- arXiv 2023: Sparse Autoencoders Find Highly Interpretable Features in Language Models, arXiv, Note
- Science 2026: Toward Universal Steering and Monitoring of AI Models, arXiv, Note
- arXiv 2025: Persona Features Control Emergent Misalignment, arXiv
- ICML 2025 Workshop: Convergent Linear Representations of Emergent Misalignment, arXiv
- Anthropic Research 2025: Persona Vectors: Monitoring and Controlling Character Traits in Language Models, Paper
- ICML 2025: AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders, arXiv, Note
- arXiv 2026: Efficient and Accurate Steering of LLMs through Attention-Guided Feature Learning, arXiv
- arXiv 2025: Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning, arXiv
- ICLR 2026: On Scalable Oversight with Weak LLMs Judging Strong LLMs, Paper
- OpenAI Research Blog 2026: Training Agents to Self-Report Misbehavior, Paper
- arXiv 2025: Provably Mitigating Corruption, Overoptimization, and Verbosity in RLHF/DPO, arXiv
- ICLR 2026: Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors, Paper
- Anthropic Research 2026: The Assistant Axis: Situating and Stabilizing the Character of LLMs, Paper
- Transformer Circuits 2026: Emotion Concepts and their Function in a Large Language Model, Paper
- Transformer Circuits 2024: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Paper
- BlackboxNLP 2024 Workshop: Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, arXiv
- Transformer Circuits 2025: Circuit Tracing: Revealing Computational Graphs in Language Models, Paper
- ICLR 2026: Cross-Architecture Model Diffing with Dedicated Feature Crosscoders, Paper
- DeepMind Blog 2025: Gemma Scope 2, Paper
- arXiv 2025: A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability, arXiv
- arXiv 2025: Sparse Attention Post-Training for Mechanistic Interpretability, arXiv
- arXiv 2025: Open Problems in Mechanistic Interpretability, arXiv