Paper List

Tag: emergent_misalignment

3 items with this tag.

  • May 02, 2026

    From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking

    • reward_hacking
    • emergent_misalignment
    • rlhf
  • May 02, 2026

    Model Organisms for EM: Model Organisms for Emergent Misalignment

    • emergent_misalignment
    • model_organisms
    • lora
  • Apr 13, 2026

    Emergent Misalignment: Training large language models on narrow tasks can lead to broad misalignment

    • emergent_misalignment
    • deceptive_alignment
