AF - EIS XIII: Reflections on Anthropic's SAE Research Circa May 2024 by Stephen Casper

Released Tuesday, 21st May 2024

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS XIII: Reflections on Anthropic's SAE Research Circa May 2024, published by Stephen Casper on May 21, 2024 on The AI Alignment Forum.

Part 13 of 12 in the Engineer's Interpretability Sequence.

TL;DR

On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn't do. Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than practical safety work. Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.

Reflecting on predictions

Please see my original post for 10 specific predictions about what today's paper would and wouldn't accomplish. I think that Anthropic obviously did 1 and 2 and obviously did not do 4, 5, 7, 8, 9, and 10. Meanwhile, I think that their experiments to identify specific and safety-relevant features should count for 3 (proofs of concept for a useful type of task) but definitely do not count for 6 (*competitively* finding and removing a harmful behavior that was represented in the training data). Thus, my assessment is that Anthropic did 1-3 but not 4-10.

I have been wrong with mech interp predictions in the past, but this time, I think I was 10 for 10: everything I predicted with >50% probability happened, and everything I predicted with <50% probability did not happen.

Overall, the paper underperformed my expectations. If you scored the paper relative to my predictions by giving it (1-p) points when it did something that I predicted it would do with probability p and -p points when it did not, the paper would score -0.74.
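To make the scoring rule concrete, here is a minimal sketch in Python. The probabilities below are hypothetical placeholders, not the values from the original predictions post, so the output will not reproduce the -0.74 figure; only the rule itself is taken from the text above.

```python
# A minimal sketch of the scoring rule described above. The probabilities are
# hypothetical placeholders, NOT the values assigned in the original May 5 post.
# Each entry pairs the probability assigned to a prediction with whether the
# paper actually did the predicted thing.
predictions = [
    (0.90, True),   # e.g. a high-probability prediction that happened
    (0.80, True),
    (0.60, True),
    (0.40, False),  # e.g. a low-probability prediction that did not happen
    (0.30, False),
    (0.30, False),
    (0.20, False),
    (0.20, False),
    (0.10, False),
    (0.10, False),
]

# (1 - p) points when a predicted outcome happens, -p points when it does not;
# equivalently, each prediction contributes (outcome - p), where outcome is 1 or 0.
score = sum((1 - p) if happened else -p for p, happened in predictions)
print(f"Total score: {score:.2f}")
```

Under this rule, a positive total means the paper did more than the stated probabilities implied, while a negative total, like the reported -0.74, means it did less.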
A review + thoughts

I think that Anthropic's new SAE work has continued to be like lots of prior high-profile work on mechanistic interpretability - it has focused on presenting illustrative examples, streetlight demos, and cherry-picked proofs of concept. This is useful for science, but it does not yet show that SAEs are helpful and competitive for diagnostic and debugging tasks that could improve AI safety.

I feel increasingly concerned about how Anthropic motivates and sells its interpretability research in the name of safety. Today's paper makes some major motte-and-bailey claims that oversell what was accomplished, like "Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer," "Sparse autoencoders produce interpretable features for large models," and "The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references." The paper also omits relevant past literature on interpretability illusions (e.g., Bolukbasi et al., 2021), to which its methodology seems prone. Normally, problems like this are mitigated by peer review, which Anthropic does not participate in.

Meanwhile, whenever Anthropic puts out new interpretability research, I see a laundry list of posts from the company and its employees promoting it. They always seem to claim the same thing - that some 'groundbreaking new progress has been made' and that 'the model was even more interpretable than they thought' but that 'there remains progress to be made before interpretability is solved'. I won't link to any specific person's posts, but here are Anthropic's posts from today and October 2023.

The way that Anthropic presents its interpretability work has real-world consequences. For example, it seems to have led to viral claims that interpretability will be solved and that we are bound for safe models. It has also led to at least one claim in a pol...
