LW - Highlights: Wentworth, Shah, and Murphy on "Retargeting the Search" by RobertM

Released Thursday, 14th September 2023

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Highlights: Wentworth, Shah, and Murphy on "Retargeting the Search", published by RobertM on September 14, 2023 on LessWrong.

In How To Go From Interpretability To Alignment: Just Retarget The Search, John Wentworth suggests:

When people talk about prosaic alignment proposals, there's a common pattern: they'll be outlining some overcomplicated scheme, and then they'll say "oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are", and then they'll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there's a whole discussion about the specific problems with the overcomplicated scheme.

In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I'll call the strategy "Just Retarget the Search".

We'll need to make two assumptions:

- Some version of the natural abstraction hypothesis holds, and the AI ends up with an internal concept for human values, or corrigibility, or what the user intends, or human mimicry, or some other outer alignment target.
- The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.

Given these two assumptions, here's how to use interpretability tools to align the AI:

1. Identify the AI's internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
2. Identify the retargetable internal search process.
3. Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.

Just retarget the search. Bada-bing, bada-boom.

There was a pretty interesting thread in the comments afterwards that I wanted to highlight.

Rohin Shah (permalink)

Definitely agree that "Retarget the Search" is an interesting baseline alignment method you should be considering.

I like what you call "complicated schemes" over "retarget the search" for two main reasons:

- They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).
- They degrade gracefully with worse interpretability tools, e.g. in debate, even if the debaters can only credibly make claims about whether particular neurons are activated, they can still say stuff like "look, my opponent is thinking about synthesizing pathogens, probably it is hoping to execute a treacherous turn", whereas "Retarget the Search" can't use this weaker interpretability at all. (Depending on background assumptions you might think this doesn't reduce x-risk at all; that could also be a crux.)

johnswentworth (permalink)

I indeed think those are the relevant cruxes.

Evan R. Murphy (permalink)

"They don't rely on the 'mesa-optimizer assumption' that the model is performing retargetable search (which I think will probably be false in the systems we care about)."

Why do you think we probably won't end up with mesa-optimizers in the systems we care about? Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.) and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models.

Rohin Shah (permalink)

It's a very specific claim about how intelligence works, so it gets a low prior, from which I don't update much (because it seems to me we know very little about how intelligence works structurally and the arguments given in favor seem like relatively weak considerations).

Search is computationally inefficient relative to heuristics, and we'll be selecting rea...
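As a purely illustrative aside (not part of the original post or thread): here is a minimal, self-contained sketch of what Wentworth's three steps could look like mechanically, under the strong assumptions that the model really does contain a retargetable search process and that interpretability tools can locate both it and the target concept. The toy model, its `goal` buffer, and the `retarget_the_search` helper are all invented for illustration.

import torch
import torch.nn as nn

# Hypothetical sketch of "Just Retarget the Search".
# The toy model below stands in for an AI whose (assumed) general-purpose
# search process reads its objective from a dedicated goal slot.

class ToySearcherModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        # The "retargetable search" reads its objective from this slot.
        self.register_buffer("goal", torch.zeros(dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Toy "search": score how well the encoded observation matches the goal.
        return self.encoder(obs) @ self.goal


def retarget_the_search(model: ToySearcherModel, target_concept: torch.Tensor) -> ToySearcherModel:
    """Step 3: overwrite the search process's objective with the internal
    representation of the chosen alignment target. Steps 1 and 2 (locating
    that representation and the goal slot) are assumed solved by interpretability."""
    with torch.no_grad():
        model.goal.copy_(target_concept)
    return model


if __name__ == "__main__":
    model = ToySearcherModel(dim=16)
    # Pretend interpretability tools extracted this vector as the internal
    # concept for "what the user intends" (step 1); here it is just random.
    target_concept = torch.randn(16)
    retarget_the_search(model, target_concept)   # steps 2-3
    print(model(torch.randn(16)))                # the toy search now scores against the new goal

In a real system, of course, the hard part is steps 1 and 2, i.e. finding those internal structures at all; the rewiring itself is the easy bit, which is the point of the proposal.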
