LW - Anthropic: Reflections on our Responsible Scaling Policy by Zac Hatfield-Dodds

Released Monday, 20th May 2024

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Anthropic: Reflections on our Responsible Scaling Policy, published by Zac Hatfield-Dodds on May 20, 2024 on LessWrong.

Last September we published our first Responsible Scaling Policy (RSP) [LW discussion], which focuses on addressing catastrophic safety failures and misuse of frontier models. In adopting this policy, our primary goal is to help turn high-level safety concepts into practical guidelines for fast-moving technical organizations and demonstrate their viability as possible standards. As we operationalize the policy, we expect to learn a great deal and plan to share our findings. This post shares reflections from implementing the policy so far. We are also working on an updated RSP and will share this soon.

We have found having a clearly-articulated policy on catastrophic risks extremely valuable. It has provided a structured framework to clarify our organizational priorities and frame discussions around project timelines, headcount, threat models, and tradeoffs. The process of implementing the policy has also surfaced a range of important questions, projects, and dependencies that might otherwise have taken longer to identify or gone undiscussed.

Balancing the desire for strong commitments with the reality that we are still seeking the right answers is challenging. In some cases, the original policy is ambiguous and needs clarification. In cases where there are open research questions or uncertainties, setting overly-specific requirements is unlikely to stand the test of time. That said, as industry actors face increasing commercial pressures we hope to move from voluntary commitments to established best practices and then well-crafted regulations.

As we continue to iterate on and improve the original policy, we are actively exploring ways to incorporate practices from existing risk management and operational safety domains. While none of these domains alone will be perfectly analogous, we expect to find valuable insights from nuclear security, biosecurity, systems safety, autonomous vehicles, aerospace, and cybersecurity. We are building an interdisciplinary team to help us integrate the most relevant and valuable practices from each.

Our current framework for doing so is summarized below, as a set of five high-level commitments.

1. Establishing Red Line Capabilities. We commit to identifying and publishing "Red Line Capabilities" which might emerge in future generations of models and would present too much risk if stored or deployed under our current safety and security practices (referred to as the ASL-2 Standard).

2. Testing for Red Line Capabilities (Frontier Risk Evaluations). We commit to demonstrating that the Red Line Capabilities are not present in models, or, if we cannot do so, taking action as if they are (more below). This involves collaborating with domain experts to design a range of "Frontier Risk Evaluations": empirical tests which, if failed, would give strong evidence against a model being at or near a red line capability. We also commit to maintaining a clear evaluation process and a summary of our current evaluations publicly.

3. Responding to Red Line Capabilities. We commit to develop and implement a new standard for safety and security sufficient to handle models that have the Red Line Capabilities. This set of measures is referred to as the ASL-3 Standard. We commit not only to define the risk mitigations comprising this standard, but also to detail and follow an assurance process to validate the standard's effectiveness. Finally, we commit to pause training or deployment if necessary to ensure that models with Red Line Capabilities are only trained, stored and deployed when we are able to apply the ASL-3 standard.

4. Iteratively extending this policy. Before we proceed with activities which require the ASL-3 standard, we commit...
