12 Research Papers on Sparse Autoencoders

Explore what makes sparse autoencoders (SAEs) special and how they are used for different purposes

Over the last few weeks, there have been a lot of interesting studies on sparse autoencoders (SAEs). Today, we summarize them and briefly clarify what SAEs are and when they are used.

SAEs are known for making large language model (LLM) representations interpretable. While supervised learning with manually designed features is time-consuming and doesn't scale well to new problems, SAEs, being unsupervised neural networks, automatically learn meaningful features from data. Unlike regular autoencoders, SAEs include a sparsity constraint, which ensures that only a few neurons are active for any input, helping highlight the most significant patterns. Thanks to these capabilities, SAEs are used for tasks like feature extraction, dimensionality reduction, and pretraining deep networks.
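
To make the sparsity constraint concrete, here is a minimal sketch of an SAE in PyTorch: an overcomplete hidden layer trained to reconstruct its input, with an L1 penalty that keeps most hidden units inactive. The dimensions, the sparsity coefficient, and the random batch standing in for model activations are illustrative assumptions, not details from any of the papers below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete hidden layer trained to reconstruct its
    input, while an L1 penalty keeps most hidden units inactive."""

    def __init__(self, d_input: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_input, d_hidden)  # d_hidden >> d_input (overcomplete)
        self.decoder = nn.Linear(d_hidden, d_input)

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(z)       # reconstruction of the input
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus a sparsity penalty on the hidden activations
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()

# One illustrative training step on a random batch standing in for LLM activations
sae = SparseAutoencoder(d_input=768, d_hidden=768 * 8)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(64, 768)   # stand-in for real residual-stream activations
x_hat, z = sae(activations)
loss = sae_loss(activations, x_hat, z)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The width of the hidden layer and the size of the sparsity coefficient are the main knobs: a wider dictionary and a stronger penalty mean fewer, more specialized features fire per input, at some cost in reconstruction quality.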

Here are 12 studies to better understand sparse autoencoders and their implementation:

  1. Sparse autoencoder (CS294A Lecture notes) by Andrew Ng explains how SAEs can reduce dimensionality, extract features, and handle large datasets. It also explores how sparsity constraints improve feature learning in domains like computer vision and audio processing. → Read more

  2. Four studies from last week on using SAEs to steer LLMs (a simplified sketch of feature steering appears after this list):

    • Can sparse autoencoders be used to decompose and interpret steering vectors? by the University of Oxford examines why SAEs struggle to interpret steering vectors, which control LLM behaviors. The key issues are: 1) steering vectors don’t match the input distribution SAEs are trained on; 2) SAEs can't handle the negative projections that steering vectors contain. → Read more

    • Steering Language Model Refusal with Sparse Autoencoders explores improving LLM safety by steering activations at inference time with SAE features. However, this kind of feature steering can negatively impact performance on benchmarks. → Read more

    • Improving Steering Vectors by Targeting Sparse Autoencoder Features by Microsoft introduces SAE-Targeted Steering (SAE-TS) to improve control over LLMs. SAE-TS targets specific SAE features to achieve desired behavior while reducing unintended side effects. → Read more

    • SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs introduces a method that helps prevent harmful or misaligned outputs: a SCAR module is added to an LLM and steers its output toward or away from specific concepts, such as toxicity, without altering the original model. → Read more

  3. Sparse Autoencoders Find Highly Interpretable Features in Language Models shows how sparse autoencoders can be used to identify meaningful patterns in a language model’s internal activations. It highlights that these features become easier to interpret and edit, allowing precise modifications, like disabling pronoun prediction. → Read more

  4. Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders explores improving SAEs by decoupling the encoding and decoding processes. It shows that advanced encoders improve sparse feature detection and inference with minimal extra cost. → Read more

  5. Direct Preference Optimization Using Sparse Feature-Level Constraints proposes the Feature-level constrained Preference Optimization (FPO) method to improve the alignment of LLMs with human preferences. Instead of relying on RLHF, it uses SAEs to simplify the alignment process, making it more efficient and stable. → Read more

  6. Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders by Google DeepMind proposes JumpReLU SAEs, which balance faithfully reconstructing activations with keeping representations sparse, two goals that can be in conflict (a short sketch of the JumpReLU activation appears after this list). → Read more

    Google DeepMind’s Gemma Scope includes these JumpReLU SAEs trained on multiple layers of Gemma 2 models, uncovering sparse, interpretable features in neural networks and making this kind of interpretability research more accessible and affordable.

  7. Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders explores using SAEs to understand text-to-image diffusion models like SDXL Turbo. Training SAEs on SDXL Turbo's U-Net reveals interpretable features in image generation, with blocks specializing in image composition, local details, and style. → Read more

  8. Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models introduces Specialized Sparse Autoencoders (SSAEs) that focus on specific subdomains to capture "hidden" features and identify rare patterns, using dense retrieval for data selection and advanced training techniques. This helps address specific risks in foundation models (FMs). → Read more

  9. Interpret the Internal States of Recommendation Model with Sparse Autoencoder introduces the RecSAE tool to improve the interpretability of recommendation systems. This plug-in module translates model activations into interpretable features, builds automated concept dictionaries for recommendations, and validates interpretations with precision and recall. → Read more
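
As mentioned in item 2, the common idea behind the steering papers is to intervene on the model's activations at inference time. Below is a simplified sketch of feature steering with an SAE: the decoder direction of a chosen SAE feature is added to (or, with a negative coefficient, subtracted from) the residual-stream activations. The function name, shapes, and random tensors are assumptions for illustration; the four papers differ in how they choose features and scale the intervention.

```python
import torch

def steer_with_sae_feature(resid: torch.Tensor,
                           W_dec: torch.Tensor,
                           feature_idx: int,
                           alpha: float) -> torch.Tensor:
    """Add alpha times one SAE feature's decoder direction to residual-stream
    activations. Positive alpha pushes the model toward the concept that the
    feature represents; negative alpha pushes it away."""
    direction = W_dec[feature_idx]            # (d_model,)
    direction = direction / direction.norm()  # use a unit-length direction
    return resid + alpha * direction

# Illustrative shapes only: 16 token positions, 768-dim residual stream, 6144 SAE features
resid = torch.randn(16, 768)
W_dec = torch.randn(6144, 768)   # stand-in for a trained SAE decoder matrix
steered = steer_with_sae_feature(resid, W_dec, feature_idx=123, alpha=4.0)
```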
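
And for item 6, a rough sketch of the JumpReLU idea: instead of zeroing only negative pre-activations the way ReLU does, JumpReLU zeroes everything below a threshold, cutting off small, noisy activations while passing large ones through unchanged. The threshold value below is illustrative; the paper learns it per feature and trains through the discontinuity with straight-through estimators.

```python
import torch

def jumprelu(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # Zero every pre-activation at or below the threshold theta,
    # pass everything above it through unchanged.
    return z * (z > theta).to(z.dtype)

z = torch.tensor([-0.4, 0.05, 0.3, 1.2])   # illustrative pre-activations
theta = torch.tensor(0.1)                  # learned per feature in the paper
print(jumprelu(z, theta))                  # tensor([0.0000, 0.0000, 0.3000, 1.2000])
```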
