Reading Queue

The queue is now ordered as: phenomenon exposure → open-source model organisms → representational mechanisms → monitoring and control → tooling infrastructure. Read in this order, the thread is easier to follow: you first see how emergent misalignment was defined and extended, then how it was ported to open-source models and compressed into persona / direction / feature structure, and finally circle back to oversight protocols and interpretability tooling. Reasoning Models Don't Always Say What They Think, which bears more directly on reasoning verbalization, stays in Textual Reasoning and is deliberately no longer mixed in here. A minimal code sketch of what a steering "direction" means follows the list.

  • Nature 2026: Training large language models on narrow tasks can lead to broad misalignment, Paper, Note
  • arXiv 2024: Alignment Faking in Large Language Models, arXiv
  • Anthropic Research Blog 2025: From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking, Paper
  • arXiv 2025: Natural Emergent Misalignment from Reward Hacking in Production RL, arXiv
  • ICML 2025 Workshop: Model Organisms for Emergent Misalignment, arXiv
  • ICML 2025 Workshop: Convergent Linear Representations of Emergent Misalignment, arXiv
  • arXiv 2025: Persona Features Control Emergent Misalignment, arXiv
  • Anthropic Research 2025: Persona Vectors: Monitoring and Controlling Character Traits in Language Models, Paper
  • arXiv 2023: Steering Language Models With Activation Engineering, arXiv, Note
  • Science 2026: Toward Universal Steering and Monitoring of AI Models, arXiv, Note
  • NeurIPS 2024: Refusal in Language Models Is Mediated by a Single Direction, arXiv
  • ICLR 2026: On Scalable Oversight with Weak LLMs Judging Strong LLMs, Paper
  • OpenAI Research Blog 2026: Training Agents to Self-Report Misbehavior, Paper
  • arXiv 2025: Provably Mitigating Corruption, Overoptimization, and Verbosity in RLHF/DPO, arXiv
  • ICLR 2026: Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors, Paper
  • Anthropic Research 2026: The Assistant Axis: Situating and Stabilizing the Character of LLMs, Paper
  • Transformer Circuits 2026: Emotion Concepts and their Function in a Large Language Model, Paper
  • Transformer Circuits 2024: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Paper
  • BlackboxNLP 2024 Workshop: Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, arXiv
  • Transformer Circuits 2025: Circuit Tracing: Revealing Computational Graphs in Language Models, Paper
  • ICLR 2026: Cross-Architecture Model Diffing with Dedicated Feature Crosscoders, Paper
  • DeepMind Blog 2025: Gemma Scope 2, Paper
  • arXiv 2025: A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability, arXiv
  • arXiv 2025: Sparse Attention Post-Training for Mechanistic Interpretability, arXiv
  • arXiv 2025: Open Problems in Mechanistic Interpretability, arXiv
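To make the "direction" vocabulary in the middle of the queue concrete (Persona Vectors, Refusal Is Mediated by a Single Direction, activation engineering), here is a minimal self-contained sketch of difference-of-means steering. It is an assumption-laden toy, not any paper's actual setup: ToyBlock, d_model, alpha, and the random batches are placeholders for real prompt activations in a full LLM; the papers extract directions from genuine contrastive prompt sets.

```python
# Minimal sketch of activation steering: a "direction" is just a vector
# added to one layer's residual-stream output at inference time.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

class ToyBlock(nn.Module):
    """Hypothetical stand-in for one transformer block's residual update."""
    def __init__(self, d):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.mlp(x)  # residual stream: x plus the block's update

model = nn.Sequential(ToyBlock(d_model), ToyBlock(d_model))

# Difference-of-means direction: mean activation on trait-exhibiting
# inputs minus mean activation on neutral inputs. The two random batches
# below are placeholders for activations collected on real prompts.
with torch.no_grad():
    acts_trait = model[0](torch.randn(32, d_model))
    acts_neutral = model[0](torch.randn(32, d_model))
direction = acts_trait.mean(0) - acts_neutral.mean(0)
direction = direction / direction.norm()

alpha = 4.0  # steering strength; negative alpha suppresses instead of amplifies

def steer(module, inputs, output):
    # Forward hook: returning a value replaces the block's output,
    # so this adds alpha * direction to the residual stream here.
    return output + alpha * direction

handle = model[0].register_forward_hook(steer)
with torch.no_grad():
    steered = model(torch.randn(2, d_model))
handle.remove()
print(steered.shape)  # torch.Size([2, 64])
```

The hook-based pattern is the part that carries over to real models: the papers above differ mainly in how the vector is found (difference of means, probing, SAE features), not in how it is applied.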