Attention in AI: KV Cache, QKV & Self-Attention in AI Explained

Quick answer: What is attention in AI?

Attention in AI is the mechanism that lets Transformer models decide which parts of an input sequence should influence each token’s representation. It compares queries, keys, and values to build contextual meaning, powers self-attention, and enables faster generation through KV cache reuse.

TL;DR: Attention in AI lets Transformer models dynamically decide which tokens matter most for understanding context. Using queries, keys, and values, attention computes relationships between tokens, enabling contextual representations, efficient long-context processing, and modern capabilities like reasoning, translation, retrieval, and autoregressive text generation.

What is attention in AI? This question may sound simple, but there are actually many aspects worth a deep discussion.

In the previous episodes, we explored what a token in AI is and followed tokens as they became embeddings: dense vectors that are also points in a learned space, coordinates with semantic structure. Embeddings are where meaning lives, but coordinates alone are not enough. The next natural step in our series is attention. So what is its role in the workflow?

Tokenization breaks language into units.
Embedding turns those units into learnable geometry.
Attention makes that geometry interact.

Attention solves a harder task than the previous stages: for this token, in this sentence, at this exact moment, which other tokens are relevant enough to change its meaning? The mechanism lets one token attend to another, decide how relevant it is, and pull in the information it needs. This is the moment a sequence stops being a row of vectors and becomes context.

This mechanism became an indispensable foundation of modern AI and the central computation inside Transformers.

So let’s unpack where attention comes from, how it works, the core components and types of this mechanism, and why attention isn’t the same as understanding. It’s foundational knowledge.

In today’s episode:

The history of attention: from translation to Transformers
How attention works
- From embeddings to contextual representations
- Queries, keys, and values: the QKV mechanism
- How to compute attention step-by-step?
What is KV cache and why does it matter?
How attention mechanisms are evolving
Why attention is not the same as understanding
Sources and further reading

The history of attention: from translation to Transformers

Attention became the center of the Transformer through one of the most influential papers in AI – Attention Is All You Need from Google. But it had appeared three years earlier as a solution to a practical problem in neural machine translation.

Early encoder–decoder models encoded an entire source sentence into a single fixed-length vector – a dense numerical representation that captures the sentence’s meaning – and decoded the translation from that compressed summary. Dzmitry Bahdanau, KyungHyun Cho and Yoshua Bengio in their Neural Machine Translation by Jointly Learning to Align and Translate paper (2014) argued that this fixed-length context vector created a bottleneck, because the model had to compress all relevant information into one representation. They proposed a model that could align source and target words during decoding, letting the decoder softly search over different parts of the input sentence while generating each new word. The decoder dynamically computes a context vector as a weighted combination of source annotations, focusing on the most relevant parts of the input for the current prediction. This was the moment when context became adaptive and target-dependent. For the full story of who actually coined the term 'attention' — and the debate around Schmidhuber's earlier work — see Who came up with the term 'attention'?

Image credit: Karpathy’s post
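The soft search described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the annotation vectors and alignment scores are hypothetical toy values, and the small learned network that produces the scores in the original model is left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(alignment_scores, annotations):
    """Weighted combination of source annotations (one vector per source word)."""
    weights = softmax(alignment_scores)
    dim = len(annotations[0])
    context = [sum(w * h[i] for w, h in zip(weights, annotations)) for i in range(dim)]
    return weights, context

# Hypothetical toy values: three source words with 2-dimensional annotations.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical alignment scores for one decoding step (assumed, not learned here).
scores = [2.0, 0.5, 0.1]

weights, context = context_vector(scores, annotations)
print(weights)  # the first source word receives the largest weight
print(context)
```

A new set of scores is computed at every decoding step, so each target word gets its own context vector.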

Then in 2015, Stanford researchers Minh-Thang Luong, Hieu Pham and Christopher D. Manning published Effective Approaches to Attention-based Neural Machine Translation, which extended this idea with practical attention mechanisms. They introduced global attention, where the decoder attends to all source positions, and local attention, where it focuses only on a smaller window of source words at each step. They also proposed an input-feeding approach, where the model feeds earlier attention information back into later steps so it can keep track of which parts of the source it has already focused on.

Image Credit: Global attention mechanism in Effective Approaches to Attention-based Neural Machine Translation

Image Credit: Local attention window in Effective Approaches to Attention-based Neural Machine Translation

Then finally came the main breakthrough – the architecture built around attention itself: the Transformer, introduced in the Attention Is All You Need paper (2017) by A. Vaswani et al. The Google researchers removed recurrence and convolutions and made self-attention layers the core of the architecture. They also introduced the formulation of attention and the language of queries, keys, and values that became the canonical way to describe how models retrieve and combine contextual information. One of the most influential architectures built on this foundation is BERT; see our deep dive on BERT and its modern variants.

That is the story of how attention became the centerpiece of modern models, but history alone is not enough to understand the basics. Let’s go through every part of the workflow step by step.

How does attention work in transformer models?

From embeddings to contextual representations

A standard Transformer model starts with token embeddings (dense numerical vector representations of tokens that encode semantic and syntactic information in a continuous vector space), combined with positional encodings, because attention by itself has no built-in sense of word order. At this stage, these vectors are still not deeply contextual. They carry information about token identity, and with positional encoding they “know” something about location, but they do not yet define what matters to what.

So these “token embeddings + positional encodings” vectors serve as the inputs to the attention mechanism, entering the Transformer’s attention layers.

Image Credit: Transformer architecture showing self-attention layers and positional encodings, Attention is All You Need
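As a rough sketch of how these input vectors can be built, the snippet below adds the fixed sinusoidal positional encoding from Attention Is All You Need to each token embedding. The embedding values are hypothetical, and the sinusoidal scheme is only one option; the text above does not say which positional scheme a given model uses.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def attention_inputs(token_embeddings):
    """Add the positional encoding to each token embedding, element by element."""
    d_model = len(token_embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, d_model))]
        for pos, emb in enumerate(token_embeddings)
    ]

# Hypothetical example: the same token appears at positions 0 and 1.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
inputs = attention_inputs(embeddings)
print(inputs[0])
print(inputs[1])  # differs from inputs[0] only because of its position
```

The two printed vectors start from identical embeddings, so any difference between them comes from position alone.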

There, each token representation is transformed into queries, keys, and values. These are linear projections of embeddings or hidden states – context-aware vector representations of tokens as they are processed layer by layer inside the model. Embeddings provide the raw material from which attention builds its comparisons. Without embeddings, the mechanism would have nothing meaningful to compare or route through the network.

Queries, keys, and values: the QKV mechanism

A KV cache, or Key-Value cache, is a stored collection of key and value vectors from earlier tokens in a generated sequence. When the model predicts a new token, the Transformer reuses these stored vectors instead of recalculating them every time, so the KV cache makes autoregressive generation much faster, especially for long prompts, long conversations, and reasoning-heavy outputs.

Here is the essential vocabulary that everyone needs to know to understand the full mechanism and attention formulation.

Concept	Simple meaning	Why it matters
Query	What the current token is looking for	Starts the relevance search
Key	What each token exposes about itself	Lets other tokens decide whether to attend to it
Value	The information passed forward if selected	Becomes the attention output
Self-attention	Tokens attending to other tokens in the same sequence	Builds contextual representations
Multi-head attention	Several attention operations running in parallel	Captures different relationships at once
KV cache	Stored keys and values from previous tokens	Speeds up generation and reduces recomputation

The attention mechanism uses queries, keys, and values to decide which tokens are most relevant and what information should be passed forward.

After embeddings and positional information enter the model, each token vector is multiplied by learned weight matrices to produce three different versions of itself:

A query (Q) is what the current token is looking for. It is the signal used to compare against other tokens and determine which ones may be relevant.
A key (K) is information that every token exposes about itself so others can decide whether to attend to it. It acts like a tag or description attached to a token.
A value (V) represents the information a token contributes if attention selects it. It is the actual content being passed along, like a payload. A weighted combination of value vectors becomes the attention output.
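To make the three projections concrete, here is a minimal sketch with one token vector and three weight matrices. All numbers are hypothetical and hand-picked; in a real model the matrices W_q, W_k, and W_v are learned during training and are far larger.

```python
def project(vector, weight):
    """Multiply a row vector by a weight matrix (rows = input dims, columns = output dims)."""
    return [
        sum(v * weight[r][c] for r, v in enumerate(vector))
        for c in range(len(weight[0]))
    ]

# Hypothetical 3-dimensional token vector (embedding plus positional information).
token = [1.0, 0.0, 2.0]

# Hypothetical 3x2 weight matrices standing in for learned parameters.
W_q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_k = [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]
W_v = [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0]]

q = project(token, W_q)  # what this token is looking for
k = project(token, W_k)  # what this token exposes about itself
v = project(token, W_v)  # the content this token passes along if selected
print(q, k, v)
```

The same input vector yields three different vectors because each matrix is free to emphasize different features.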

But why can’t we just use ordinary embeddings, and why do we need to split them into these three vectors?

The answer is simple: a token does not have just one job. A model may need one representation for searching, another for matching, and another for carrying information forward. One word can simultaneously be a requester, a candidate match, and a carrier of content. Without this separation for Q, K, and V, attention would collapse multiple roles into one vector and lose its flexibility.

How to compute attention step-by-step?

Now it’s time to move on to the core aspect – computation. A classic formulation for attention looks like this:

Image Credit: Attention Is All You Need
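For readers who cannot see the figure, the scaled dot-product attention formula from Attention Is All You Need can be written as follows, where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the key vectors:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```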

Let’s figure out what is so important here.

For each token, the model compares its query (Q) to the keys (K) of other tokens. This operation is computed for all tokens simultaneously using matrices Q, K, and V. The dot product between a Q and K measures compatibility. This produces attention scores.
Those scores are divided by √dₖ, the square root of the key dimension. This scaling is needed because as the dimension grows, raw dot products grow in magnitude as well, which makes softmax overly sharp, shrinks its gradients, and makes training less stable.
Then softmax – a mathematical function that converts a set of numbers into probabilities that sum to 1 – turns similarity scores into attention weights, determining how strongly each token should influence the result.
Finally, the model uses those weights to compute a weighted sum of the value (V) vectors. The result is a new representation of the token that has pulled in the most relevant information from the sequence.
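The four steps above can be sketched in plain Python. This is a toy illustration with hypothetical 2-dimensional vectors for three tokens; real implementations do the same arithmetic as batched matrix multiplications on accelerators.

```python
import math

def softmax(scores):
    """Convert scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors, one row per token."""
    d_k = len(K[0])
    all_weights, outputs = [], []
    for q in Q:
        # Step 1: dot product of this query with every key gives raw scores.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Step 2: divide by sqrt(d_k) to keep the scores in a moderate range.
        scaled = [s / math.sqrt(d_k) for s in scores]
        # Step 3: softmax turns the scores into attention weights.
        weights = softmax(scaled)
        # Step 4: weighted sum of the value vectors is the token's new representation.
        output = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        all_weights.append(weights)
        outputs.append(output)
    return all_weights, outputs

# Hypothetical toy inputs: three tokens, 2-dimensional queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

weights, outputs = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over the whole sequence, and the matching row in outputs is that token's contextual representation.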

In decoder self-attention, the model also applies masking, preventing tokens from attending to future positions. This preserves the autoregressive behavior of text generation, where each token can depend only on earlier tokens.
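A common way to apply this mask is to set the scores of future positions to negative infinity before softmax, so their weights become exactly zero. The sketch below assumes that approach and uses hypothetical scores for three tokens.

```python
import math

def causal_attention_weights(score_rows):
    """Row i may attend only to positions 0..i; later positions get weight 0."""
    weights = []
    for i, row in enumerate(score_rows):
        # Replace scores of future positions with -inf before softmax.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)  # always finite because position i itself is allowed
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw attention scores: row = query position, column = key position.
scores = [
    [0.5, 1.0, 2.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second splits its weight across the first two positions, and only the last token sees the full sequence.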

So this is the workflow in plain words: a token enters as a vector, asks a question, scans the sequence for relevant keys, gathers weighted values, and exits as a richer vector. The process repeats for every token and across every Transformer layer. We can also call attention a lookup mechanism that learns relevance in a high-dimensional space. And this is the moment where embeddings become contextual representations, refined layer by layer.

Interestingly, in the Multi-Head Attention (MHA) variant, this workflow is performed several times in parallel using different learned projections of Q, K, and V. Different attention heads (independent attention operations) can focus on different kinds of relationships and representation subspaces, like local dependencies, long-range dependencies, and structural patterns. The outputs of all heads are then concatenated and projected again into a single representation.

Image Credit: Self-attention and Multi-head attention processing different token relationships in parallel, Attention Is All You Need
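The split-attend-concatenate structure of multi-head attention can be sketched as below. This is a deliberately simplified version: each head works on a slice of the input vector as a stand-in for its learned Q, K, and V projections, and the final output projection is omitted. Both simplifications are assumptions made to keep the example short.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head_attention(X, num_heads):
    """Run one attention operation per head, then concatenate the head outputs."""
    d_model = len(X[0])
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        # Simplification: slicing replaces the learned per-head projections.
        part = [x[lo:hi] for x in X]
        head_outputs.append(attend(part, part, part))
    # Concatenate the heads back into one vector per token.
    return [[val for head in head_outputs for val in head[t]] for t in range(len(X))]

# Hypothetical input: three tokens with 4-dimensional vectors, split into 2 heads.
X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 1.0]]
out = multi_head_attention(X, num_heads=2)
print(out)
```

Because each head computes its own attention weights, the first half and the second half of every output vector can mix information from the sequence in different proportions.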

What is KV cache and why does it matter?

In Transformers, text generation works one token at a time, and when generating the next token, the model still needs access to all previous tokens. If it recomputed their keys and values at every step (for example, recomputing everything for tokens 1-500 from scratch to produce token 501), decoding would be very slow.

So there is a solution: systems store the previously computed key and value vectors, forming the KV cache. The model computes only the new query, key, and value vectors for the current token, then attends over the cached representations from previous positions. Past queries aren’t needed, so they are not stored. This reuse mechanism is also what powers prompt caching at the API level: providers store KV states between requests, which is why cached tokens can cost up to 90% less than regular input tokens.
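Here is a minimal sketch of a decoding loop with a KV cache for a single layer and a single head. The KVCache class and the toy query, key, and value vectors are hypothetical; in a real model they would come from the learned projections of each new token.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    d_k = len(q)
    weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

class KVCache:
    """Hypothetical single-layer, single-head cache of past keys and values."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the current token's key and value are new; earlier ones are reused.
        self.keys.append(k)
        self.values.append(v)
        # The query is used once and never stored.
        return attend(q, self.keys, self.values)

# Hypothetical (query, key, value) triples for three generated tokens.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0], [0.0, 2.0]),
    ([1.0, 1.0], [1.0, 1.0], [3.0, 3.0]),
]

cache = KVCache()
outputs = [cache.step(q, k, v) for q, k, v in steps]
print(len(cache.keys), outputs[-1])
```

The output at each step matches what full recomputation over all previous tokens would give, but each step only adds one key and one value instead of rebuilding the whole history.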

But in this case the main tradeoff becomes memory: KV cache grows with sequence length, batch size, number of layers, and in standard multi-head attention, with the number of attention heads, since each head maintains its own cached key and value states.
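A back-of-the-envelope estimate shows how quickly this adds up. The function below multiplies the factors named above; the model configuration (32 layers, 32 heads, head dimension 128, 16-bit values) is a hypothetical example, not a specific model from this article.

```python
def kv_cache_bytes(seq_len, batch_size, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per token, head, and layer."""
    keys_and_values = 2  # K and V are stored separately
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration with 16-bit (2-byte) cache entries.
size = kv_cache_bytes(seq_len=4096, batch_size=1, num_layers=32, num_kv_heads=32, head_dim=128)
print(size / 2**30, "GiB")  # 2.0 GiB for a single 4096-token sequence

# Doubling the context length or the batch size doubles the cache.
print(kv_cache_bytes(8192, 1, 32, 32, 128) / 2**30, "GiB")  # 4.0 GiB
```

Under these assumptions every token costs half a mebibyte of cache, which is why long contexts and large batches become memory-bound.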

This is why so much attention research focuses on making KV storage more efficient. Multi-Query Attention (MQA) shares one key-value head across all query heads. Grouped-Query Attention (GQA) uses several KV groups, keeping much of MQA’s speed while staying closer to full multi-head quality. Cross-Layer Attention (CLA) shares KV activations across neighboring Transformer layers, reducing KV cache memory use by up to 2x compared to MQA and GQA. And Multi-Head Latent Attention (MLA) compresses KV states into lower-rank latent vectors. It was introduced with DeepSeek-V2, which reduced the KV cache by 93.3% compared to DeepSeek 67B.
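The difference between MHA, GQA, and MQA comes down to how many query heads read from each stored key-value head. The sketch below assumes 8 query heads and contiguous grouping of heads; the helper name kv_head_for is hypothetical.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from (contiguous groups assumed)."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

NUM_QUERY_HEADS = 8  # hypothetical model with 8 query heads

mha = [kv_head_for(h, NUM_QUERY_HEADS, 8) for h in range(NUM_QUERY_HEADS)]  # one KV head per query head
gqa = [kv_head_for(h, NUM_QUERY_HEADS, 2) for h in range(NUM_QUERY_HEADS)]  # two shared KV groups
mqa = [kv_head_for(h, NUM_QUERY_HEADS, 1) for h in range(NUM_QUERY_HEADS)]  # a single shared KV head

print("MHA:", mha)
print("GQA:", gqa)
print("MQA:", mqa)

# The cache stores one K and one V per KV head, so it shrinks by this factor.
for name, kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    print(name, "cache is", NUM_QUERY_HEADS // kv_heads, "x smaller than MHA")
```

With these numbers GQA stores a quarter of the keys and values that MHA does, and MQA stores an eighth, while the number of query heads stays the same.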

How attention mechanisms are evolving

If you want to explore the core and most popular attention mechanisms, including the FlashAttention algorithm, which is optimized for GPU hardware and uses fast on-chip memory, and the ones that we have already mentioned here, we recommend checking out 13+ Attention Mechanisms You Should Know.

And here, we’ll focus on some of the most interesting recent attention variants instead.

Elastic core-periphery attention is an alternative building block for Vision Transformers working at high resolution. The model routes communication through a small set of learned “core” tokens. Patch tokens interact only with the cores, while the cores communicate with each other. This reduces attention complexity from quadratic to close to linear in image size.
Mixture of Indexer Sparse Attention (MISA) is a sparse attention mechanism that makes attention over very long contexts faster and cheaper. It dynamically selects only a small subset of the most useful heads for each query. A lightweight router decides which heads to activate based on coarse block-level statistics, and only those heads perform the expensive token-level search.
Block-Filtered Long-Context Attention (BFLA) follows a very similar idea to MISA but sparsifies along the token/block axis. It first groups tokens into coarse blocks and estimates their importance, and then computes full token-level attention only inside the selected regions.
Latent-Condensed Attention (LCA) compresses context inside the latent space used by Multi-head Latent Attention (MLA). It merges groups of similar semantic vectors into compact representations while separately preserving positional information through anchor tokens. This lets the model reduce attention computation and KV cache size at the same time.

Low-Rank Key-Value Attention (LRKV) is another interesting mechanism. It reduces KV cache memory by making attention heads share one main key-value representation while adding small low-rank head-specific corrections on top.

So the core directions in attention research are moving toward more aggressive KV cache reduction and mechanisms adapted specifically for very long contexts. If you look deeper, this trend makes a lot of sense: model reasoning chains are now much longer, agents keep much more context in memory, context windows keep growing, but at the same time we still want the speed and efficiency of much lighter models. And probably, we are still waiting for a new breakthrough in this field.

Why attention is not the same as understanding

Before we finish, let’s clarify one more important point. Attention is often described using human metaphors, but AI attention is about the mechanics and compute. It does not “understand” information the way humans do, relying on memory, awareness, and intent. Attention in AI just computes which parts of the input should influence the output more strongly.

If a model sees the sentence “The big red ball was under the table” and is asked “Where was the red ball?”, attention helps it focus on the relevant parts – “red ball” and “under the table.” But if asked “Where was the blue balloon?”, the model may or may not detect the inconsistency depending on its training and internal representations, while a human immediately notices the mismatch.

That is why attention is not the same as understanding. Attention helps the model stay grounded in context, but it doesn't guarantee correctness. It is just a part of the stack that makes models more relevant. Of course, without attention, models would never have reached this level of performance. However, attention is still ultimately a mechanism – a tool for prioritizing input that helps models stay on track.

Sources and further reading

Attention Is All You Need | Paper
Neural Machine Translation by Jointly Learning to Align and Translate | Paper
Effective Approaches to Attention-based Neural Machine Translation | Paper
Fast Transformer Decoding: One Write-Head is All You Need | Paper
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Paper
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | Paper
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | Paper
Efficient Memory Management for Large Language Model Serving with PagedAttention | Paper
Elastic Attention Cores for Scalable Vision Transformers | Paper
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference | Paper
BFLA: Block-Filtered Long-Context Attention Mechanism | Paper
Latent-Condensed Transformer for Efficient Long Context Modeling | Paper
Low-Rank Key Value Attention | Paper
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference | Paper

FAQ

What is attention in AI?

Attention is a mechanism used in Transformers that allows models to determine which tokens in a sequence are most relevant when processing information. It dynamically computes relationships between tokens to build contextual understanding.

How do queries, keys, and values work in attention?

Queries represent what a token is searching for, keys describe what each token offers, and values contain the information passed forward. Attention compares queries with keys to decide how much of each value to use.

Why is attention important for Transformers?

Attention allows Transformers to process relationships between tokens in parallel instead of sequentially. This enables better context modeling, scalability, and long-range dependency handling compared to older recurrent architectures.

What is KV cache in LLMs?

KV cache stores previously computed key and value vectors during autoregressive generation so the model does not recompute them for every new token. This dramatically speeds up inference in large language models.

Attention vs understanding: what is the difference?

Attention helps models prioritize relevant information in context, but it does not guarantee reasoning, factual correctness, or human-like understanding. It is a computational mechanism, not consciousness or intent.
