LW - DeepMind's "Frontier Safety Framework" is weak and unambitious by Zach Stein-Perlman

Released Saturday, 18th May 2024

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: DeepMind's "Frontier Safety Framework" is weak and unambitious, published by Zach Stein-Perlman on May 18, 2024 on LessWrong.

FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("PF"), and METR's Key Components of an RSP.

DeepMind's FSF has three steps:

1. Create model evals for warning signs of "Critical Capability Levels"
   1. Evals should have a "safety buffer" of at least 6x effective compute so that CCLs will not be reached between evals
   2. They list 7 CCLs across "Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D," and they're thinking about CBRN
      1. E.g. "Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents"
2. Do model evals every 6x effective compute and every 3 months of fine-tuning
   1. This is an "aim," not a commitment
   2. Nothing about evals during deployment
3. "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results. We will also take into account considerations such as additional risks flagged by the review and the deployment context." The document briefly describes 5 levels of security mitigations and 4 levels of deployment mitigations.
   1. The mitigations aren't yet connected to eval results or other triggers; there are no advance commitments about safety practices

The FSF doesn't contain commitments. The blogpost says "The Framework is exploratory and we expect it to evolve significantly" and "We aim to have this initial framework fully implemented by early 2025." The document says similar things. It uses the word "aim" a lot and the word "commit" never. The FSF basically just explains a little about DeepMind's plans on dangerous capability evals. Those details do seem reasonable. (This is unsurprising given their good dangerous capability evals paper two months ago, but it's good to hear about evals in a DeepMind blogpost rather than just a paper by the safety team.)

(Ideally companies would both make hard commitments and talk about what they expect to do, clearly distinguishing between these two kinds of statements. Talking about plans like this is helpful. But with no commitments, DeepMind shouldn't get much credit.)

(Moreover, the FSF is not precise enough to be possible to commit to - DeepMind could commit to doing the model evals regularly, but it doesn't discuss specific mitigations as a function of risk assessment results.[1])

Misc notes (but you should really read the doc yourself):

The document doesn't specify whether "deployment" includes internal deployment. (This is important because maybe lots of risk comes from the lab using AIs internally to do AI development.) Standard usage suggests internal deployment is excluded, and the focus on misuse and related cues also suggests it's excluded, but the mention of ML R&D as a dangerous capability suggests it's included.

The document doesn't mention doing evals during deployment (to account for improvements in scaffolding, prompting, etc.).

The document says "We expect it to evolve substantially as our understanding of the risks and benefits of frontier models improves, and we will publish substantive revisions as appropriate" and a few similar things. The document doesn't say how it will be revised/amended, which isn't surprising, since it doesn't make formal commitments.

No external evals or accountability, but they're "exploring" it.

Public accountability: unfortunately, there's no mention of releasing eval results or even announcing when thresholds are reached. They say "We are exploring internal policies around alerting relevant stakeholder bodies when, for example, ev...
