Paper List

Tag: deceptive_alignment

3 items with this tag.

  • May 02, 2026

    Alignment Faking in Large Language Models

    • alignment_faking
    • deceptive_alignment
    • situational_awareness
  • May 02, 2026

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    • deceptive_alignment
    • backdoor
    • safety_training
  • Apr 13, 2026

    Emergent Misalignment: Training large language models on narrow tasks can lead to broad misalignment

    • emergent_misalignment
    • deceptive_alignment

Created with Quartz v4.5.1 © 2026