News · ArXiv AI Papers · 2026-05-12

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

arXiv:2605.08427v1 (announce type: new)

Abstract: Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., one where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Alth


Tags: #research #ArXiv AI Papers


Details

Published: 2026-05-12

Source: ArXiv AI Papers
