资讯 ArXiv AI Papers 2026-05-12

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

arXiv:2605.08427v1 Announce Type: new Abstract: Self-play red team is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Alth

1 0
分享:

暂无详细内容

讨论

发表评论

0/2000
...
= ?