  • 243854 views
  • Published 3 months ago by Rational Animations

AI Sleeper Agents: How Anthropic Trains and Catches Them

In this video, we explain how Anthropic trained "sleeper agent" AIs to study deception. A "sleeper agent" is an AI model that behaves normally until it encounters a specific trigger in the prompt, at which point it awakens and executes a harmful behavior. Anthropic found that they couldn't undo the sleeper agent training using standard safety training, but they could detect sleeper agents through a simple interpretability technique.

▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Sleeper agents: training deceptive LLMs that persist through safety training:
Simple probes can catch sleeper agents:
Alignment Faking in Large Language Models (mentioned in passing as a more natural demonstration of deceptive alignment):

▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, MERCH▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🟠 Patreon:
🔵 Channel membership:
🟢 Merch:
🟤 Ko-fi, for one-time and recurring donations:

▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Rational Animations Discord:
Reddit:
X/Twitter:
Instagram:

▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

A Alcher Black Alex Hall Amir Saboury Apuis Retsam blasted0glass Bleys BlueNotesBlues bparro Chad M Jones Chris Painter Christian Loomis Colin Ricardo Craig Falls Danealor Danilo Stefani - Alessandra Erba David Piepgrass Dawson Ducky Edward Yu Ellis Jones Felix Akkermans Forodriac Origamius Fraser Cain Gabriel Ledung Glenn Tarigan Honyopenyoko Ingvi Gautsson Ivan Bachcin Jackson Emanuel James Babcock Jana JanJan Jasper L Jeroen De Dauw joe39504589 John John Everett-Slape Joshua Adrian Cahyono Juan Benet Klemen Slavic Kristin Lindquist loopuleasa Luke Freeman Martin Skalstad Steen Matthew Shinkle Michael Andregg Michael Hewitt Nathan Fish Nathan Metzger Neal Strobl NMS noggieB Odet Abadia rictic Robert Paul Schwin Scott Alexander SQRT42Pi steven michaels Stuart Alldritt Superslowmojoe Tomas Campos Tor Barstad ttw Vladimir Silyaev Fede Mathieu ronvil Michael Suazo rx Laissez Scholar BestProGaming 7ic7ac Devin King RED Rinthean Thomas Grip Boris Bend J H Richard Stambaugh Teo Val Ken Mc Alcher Black AWyattLife Torstein Haldorsen Michał Zieliński

▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Directed by:
Hannah Levingstone | @hannah_luloo

Writers:
John Burden

Producer:
Emanuele Ascani

Art Director:
Hané Harnett | @Peony_Vibes / @peonyvibes (insta)

Line Producer:
Kristy Steffens |

Production Managers:
Jay McMichen | @Jay_TheJester
Kristy Steffens |
Grey Colson |

Quality Assurance Lead:
Lara Robinowitz | @CelestialShibe

Storyboard Artists:
Emmalaine Wright | @emmalainearts (insta)
Hannah Levingstone | @hannah_luloo
Ira Klages | @dux

Lead Animators & Q/A:
Ethan DeBoer |
Lara Robinowitz | @CelestialShibe
Owen Peurois | @owenpeurois

Animators:
Colors Giraldo | @colorsofdoom
Ethan DeBoer
Ira Klages | @dux
Jay McMichen | @Jay_TheJester
Jodi Kuchenbecker | @viral_genesis (insta)
Jordan Gilbert | @Twin_Knight (twitter) Twin Knight Studios (YT)
Keith Kavanagh | @johnnycigarettex
Lara Robinowitz | @CelestialShibe
Michela Biancini
Owen Peurois | @owenpeurois
Patrick O' Callaghan | @
Patrick Sholar | @Sholarscribbles
Renan Kogut | @kogut_r
Skylar O'Brien | @mutodaes
Vaughn Oeth | @gravy_navy
Zack Gilbert | @Twin_Knight (twitter) Twin Knight Studios (YT)

Background Lead:
Pierre Broissand | @pierrebrsnd (insta) /

Asset/Background Artists:
Emmalaine Wright | @emmalainearts (insta)
Hané Harnett | @peonyvibes (insta) @peony_vibes (twitter)
Olivia Wang | @whalesharkollie
Pierre Broissand | @pierrebrsnd (insta) /
Zoe Martin-Parkinson | @zoemar_son

Compositing Lead:
Renan Kogut | @kogut_r

Compositing:
Grey Colson |
Ira Klages | @dux
Patrick O' Callaghan | @
Renan Kogut | @kogut_r

Narrator:
Rob Miles |

VO Editor:
Tony Dipiazza

Original Soundtrack & Sound Design:
Epic Mountain