Topic 33: Slim Attention, KArAt, and XAttention Explained – What’s Really Changing in Transformers?
We explore three advanced attention mechanisms that improve how models handle long sequences, cut memory use, and make attention learnable
Attention is a fundamental technique in AI, and it will stay a hot topic for as long as we keep working with architectures like transformers. Attention mechanisms give us a peek into what a model is focusing on when it makes decisions: they let models dynamically weight specific parts of their input, and researchers use the resulting attention weights for interpretability, to figure out why a model made a particular choice.
Two types of attention form the core mechanisms of transformers and once revolutionized the effectiveness of AI models: 1) Self-Attention, which lets each token “look” at all other tokens in a sequence to understand context, and 2) Multi-Head Attention (MHA), which runs several attention operations in parallel to capture different types of relationships. Today they are foundational to all major LLMs, such as GPT, BERT, T5, and LLaMA.
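To ground these two ideas, here is a minimal NumPy sketch of scaled dot-product self-attention and a multi-head wrapper around it. It is illustrative only: real implementations add masking, an output projection, batching, and GPU tensors, and every name and dimension below is made up for the example.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project tokens into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # weighted mix of value vectors

def multi_head_attention(X, heads):
    """Run several attention heads in parallel and concatenate their outputs."""
    outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1)

# Toy usage: 4 tokens, model width 8, two heads of width 4 (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
out = multi_head_attention(X, heads)   # shape (4, 8)
```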
Another notable attention mechanism is DeepSeek’s Multi-Head Latent Attention (MLA), which we covered in one of our previous episodes. It goes further than MHA, reducing memory use by compressing the KV cache into a much smaller latent form.
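For contrast, here is a rough sketch of MLA’s compression idea: each token’s hidden state is projected down to a small latent vector, only that latent is cached, and keys and values are re-expanded from it when attention runs. The projection names and sizes below are illustrative assumptions, and details such as DeepSeek’s decoupled rotary-position keys are left out.

```python
import numpy as np

d_model, d_latent, d_head = 1024, 64, 128   # illustrative sizes: the latent is much smaller than d_model

rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_model, d_latent))   # compress the hidden state into a small latent
W_up_k = rng.normal(size=(d_latent, d_head))    # re-expand the latent into a key at attention time
W_up_v = rng.normal(size=(d_latent, d_head))    # re-expand the latent into a value

kv_cache = []                                   # the per-token cache holds only the latent vector

def mla_step(h):
    """Cache one token's compressed latent; keys and values are rebuilt on the fly."""
    c = h @ W_down                              # (d_latent,) -- this is all that gets stored
    kv_cache.append(c)
    C = np.stack(kv_cache)                      # (seq_len, d_latent)
    K, V = C @ W_up_k, C @ W_up_v               # reconstructed keys and values for attention
    return K, V

h = rng.normal(size=(d_model,))
K, V = mla_step(h)   # the cache stores d_latent floats per token instead of full keys and values
```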
Even these examples show that new attention techniques mean new possibilities and capabilities; they also open the door to steering or guiding generation. Recently, researchers have been focusing more and more on attention, a hint that the community is looking for new mechanisms to take the models we use daily to the next level. Today we dive into three of the latest, and quite different, attention mechanisms: 1) Slim Attention, which processes long context faster and cuts memory use by up to 32 times; 2) XAttention, which improves the effectiveness of sparse attention on long sequences, including text and video; and 3) Kolmogorov-Arnold Attention (KArAt and Fourier-KArAt), a completely different approach that focuses on making attention itself learnable and adaptable. How do they work? Which models can benefit from them? Did we get your attention? Good. Let’s begin. You’ll learn a lot!
What is Slim Attention?
Working with long context remains a serious challenge for all LLMs: it takes up a lot of memory and slows everything down, especially when generating new tokens during inference. Researchers from OpenMachine set out to overcome this issue by focusing on the attention mechanism and proposed Slim Attention. The technique delivers the same results as standard MHA, but faster and with less memory, which is great news for scaling up large models. For example, in models like Whisper, Slim Attention can reduce memory use by 8 times and make text generation up to 5 times faster with large batches. In some cases, such as the T5-11B model, it can even cut memory use by 32 times!
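As a preview, here is a simplified sketch of a K-only cache in the spirit of Slim Attention (treat it as an illustration of the idea, not the authors’ exact implementation): because K and V are linear projections of the same input, V can be rebuilt from K whenever the key projection is square and invertible, so only K needs to be cached.

```python
import numpy as np

d = 64                                   # illustrative head dimension (W_k must be square here)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

X = rng.normal(size=(10, d))             # 10 cached context tokens
K = X @ W_k                              # only K is kept in the cache
W_k_inv_v = np.linalg.inv(W_k) @ W_v     # precomputed once per layer

def attend(x_new):
    """Standard attention output for one new token, but V is rebuilt from K instead of cached."""
    q = x_new @ W_q
    V = K @ W_k_inv_v                    # V = X @ W_v recovered from K, so no V cache is needed
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

out = attend(rng.normal(size=(d,)))

# Sanity check: the rebuilt V matches a directly cached V (up to floating point).
assert np.allclose(K @ W_k_inv_v, X @ W_v)
```

The sanity check at the end is the whole point: the rebuilt values match a directly cached V, which is why the attention output stays identical while the cache shrinks.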
Let’s take a closer look at how Slim Attention achieves these impressive results →
Upgrade if you want to be the first to receive the full articles with detailed explanations and curated resources directly in your inbox. Simplify your learning journey →
