Memory systems · LLM inference

From FlashAttention to PagedAttention:
how memory shapes LLM inference

Transformer performance isn't limited by compute — it's limited by how efficiently we move and store data.


Two innovations. Two different problems. One shared insight: in modern AI systems, memory is the constraint — not math.

When engineers talk about making LLMs faster, they reach for bigger GPUs or clever quantization. But the real lever has almost always been memory — specifically, how the system reads, writes, and manages data during inference. FlashAttention and PagedAttention are the two most important answers to that problem, and understanding both gives you a clear mental model for why modern LLM infrastructure looks the way it does.

01 — The attention bottleneck

Standard self-attention follows a deceptively simple formula: compute query, key, and value projections from the input, then compute attention scores, apply softmax, and weight the values. The problem is what lives in the middle.

Standard attention — the naive path

  Q (N × d) × Kᵀ (d × N)  →  attention matrix (N × N), materialized in HBM
  softmax(scores) × V     →  output

That N × N attention matrix is the villain. For a sequence of 4,096 tokens, it's a ~16-million-element matrix per head — written to GPU memory in full, read back for the softmax, written again, read again for the value multiplication. With 32 attention heads, every one of those round trips to HBM happens 32 times per layer.
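The naive path is short enough to sketch in NumPy. This is illustrative only — real kernels are fused CUDA — but each intermediate array below corresponds to a round trip to HBM in the unfused version:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention with the full N x N score matrix materialized.

    Q, K, V: (N, d) arrays. Every intermediate here is an explicit
    allocation -- the analogue of a write/read round trip to HBM.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N), materialized in full
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (N, d) output

# The (N, N) fp32 score matrix alone, at N = 4096:
score_matrix_mib = 4096 * 4096 * 4 / 2**20
print(score_matrix_mib)  # 64.0 MiB -- per head, per layer, per pass
```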

The key insight

Memory reads and writes dominate attention cost — not the matrix multiplications. The FLOPs are cheap. Moving data between compute units and memory is what kills performance.

This distinction between compute-bound and memory-bound operations is foundational. A GPU can perform arithmetic far faster than it can fetch data from memory. When an operation spends most of its time waiting for data, it's memory-bound — and that's exactly what standard attention is.
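A quick roofline-style estimate makes the point concrete. The hardware numbers below are ballpark A100-class figures (not measurements), but the gap is so large that the conclusion doesn't depend on the exact values:

```python
# Roofline back-of-envelope for the softmax step of attention.
# Assumed A100-class figures: ~312 TFLOP/s fp16 compute, ~2 TB/s HBM bandwidth.
peak_flops = 312e12                      # FLOP/s
peak_bw = 2e12                           # bytes/s
machine_balance = peak_flops / peak_bw   # FLOPs needed per byte to be compute-bound

# Softmax over the N x N scores: one fp16 read + one fp16 write per element,
# with only a handful of FLOPs (exp, sum, divide) spent on each.
flops_per_elem = 5
bytes_per_elem = 2 + 2                   # fp16 read + fp16 write
arithmetic_intensity = flops_per_elem / bytes_per_elem

print(machine_balance, arithmetic_intensity)  # ~156 vs 1.25
```

An operation needs roughly 156 FLOPs per byte moved to keep this GPU busy; softmax delivers about 1.25. The ALUs spend almost all their time waiting on HBM.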

02 — FlashAttention: tiling the computation

FlashAttention's core idea is simple enough to state in one sentence: never materialize the full attention matrix. Instead, compute attention in tiles that fit entirely in on-chip SRAM.

FlashAttention — tiled execution

  Q, K, V live in HBM, split into blocks
  for each tile:
      load Q_block, K_block, V_block into SRAM   (stays on-chip)
      compute partial attention with online softmax
      accumulate into the running output
  write the output once to HBM

Three mechanisms that make it work

Tiling: Q, K, and V are divided into blocks sized to fit in SRAM. Each block is loaded once, processed completely, and the intermediate result is accumulated — no round-trips to HBM for partial results.

SRAM vs HBM: SRAM (the GPU's on-chip memory) is roughly 10× faster than HBM but tiny — on the order of 20–40 MB on an A100 vs 80 GB of HBM. FlashAttention's entire optimization is about keeping hot data in SRAM as long as possible.

Online softmax: The standard softmax requires seeing all scores before computing any probabilities. FlashAttention uses a numerically stable online algorithm that updates running statistics tile-by-tile, never needing the full row materialized at once.
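The online softmax is the least obvious of the three mechanisms, so here is a minimal scalar sketch of the running-statistics update. This is illustrative: the real kernel applies the same recurrence per tile and per row, in fused CUDA, but the math is identical.

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax-weighted average, one element at a time.

    Keeps only three running scalars -- max, normalizer, accumulator --
    and never materializes the full score row. When a new max appears,
    previously accumulated terms are rescaled by exp(old_max - new_max).
    """
    m = float("-inf")   # running max (numerical stability)
    l = 0.0             # running sum of exp(score - m)
    acc = 0.0           # running weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)          # rescale old terms (0.0 on first step)
        w = math.exp(s - m_new)              # weight of the new element
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l
```

Because each update only rescales and adds, tiles can be processed in any blocked order — exactly what lets FlashAttention consume K and V block-by-block from SRAM.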

The elegant tradeoff

FlashAttention trades redundant memory movement for recomputation — and that's a good trade on modern GPUs. Recomputing a tile is cheap compared to writing a 16M-element matrix to HBM and reading it back twice.

The results are substantial: FlashAttention reduces the memory footprint of attention from O(N²) to O(N), enables sequences 4–8× longer than standard attention on the same hardware, and improves wall-clock speed by 2–4× on real workloads.

03 — What FlashAttention doesn't solve

This is the section most explainers skip — and it's the one that connects FlashAttention to PagedAttention in a coherent way.

FlashAttention optimizes within a single forward pass. It makes the attention computation itself faster and less memory-hungry during training and single-request inference. But it does nothing for the problems that emerge when you're running a serving system with hundreds of concurrent users generating responses token by token.

What Flash fixes

Memory IO during attention computation. Intermediate matrix materialization. Sequence length scaling in a single pass.

What Flash leaves untouched

KV cache size across decode steps. Memory fragmentation across concurrent requests. Batching efficiency under variable load.

The transition point

FlashAttention is a training and prefill optimization. When you move to autoregressive decode at serving scale, a different memory problem takes over — and that's what PagedAttention addresses.

04 — The KV cache explosion

During autoregressive generation, each new token needs to attend to every token before it. To avoid recomputing keys and values for previous tokens on every step, the serving system caches them — that's the KV cache.

The KV cache size scales as: 2 × layers × heads × head_dim × sequence_length × batch_size × dtype_bytes. For a 70B model serving 100 concurrent users with 4K context, you're looking at hundreds of gigabytes. The problem isn't just size — it's how memory gets allocated.
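Plugging illustrative numbers into that formula makes the scale vivid. The figures below are hypothetical round numbers for a 70B-class model (80 layers, 64 KV heads of dimension 128, fp16) with no grouped-query attention — real deployments often use GQA, which shrinks the KV head count and therefore the cache considerably:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """2 (K and V) x layers x heads x head_dim x tokens x batch x bytes/elem."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# One user at 4K context, fp16, no grouped-query attention:
per_user = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128,
                          seq_len=4096, batch=1)
print(per_user / 2**30)        # 10.0 GiB per user
print(100 * per_user / 2**40)  # ~1 TiB for 100 concurrent users
```

Even granting GQA an 8× reduction, 100 concurrent 4K-context users still demand on the order of a hundred gigabytes of cache — before a single weight is loaded.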

Naive KV cache allocation

  Request A → pre-allocated 4K block (mostly empty)
  Request B → pre-allocated 4K block (mostly empty)
  Request C → pre-allocated 4K block (mostly empty)

  Result: 60–80% of allocated memory is wasted. The GPU OOMs before compute saturates.

The naive approach pre-allocates a contiguous memory block per request at its maximum possible length. Since you don't know how long responses will be, you reserve for the worst case. Most of that memory sits unused. When requests finish at different lengths, the freed blocks can't be easily reused — classic memory fragmentation.

05 — PagedAttention: virtual memory for KV cache

PagedAttention borrows the OS concept of virtual memory: decouple the logical view of a sequence from its physical memory layout. Instead of one contiguous block per request, the KV cache is divided into fixed-size pages that can live anywhere in GPU memory.

PagedAttention — non-contiguous allocation

  Logical sequence:
      [T1] [T2] [T3] [T4] [T5] [T6]
  Physical memory (pages, anywhere in GPU RAM):
      Page A: T1–T2    Page F: T3–T4    Page C: T5–T6
  Block table (logical → physical mapping):
      T1–T2 → Page A    T3–T4 → Page F    T5–T6 → Page C

Pages are allocated on demand as tokens are generated. When a request ends, its pages are freed immediately and made available to any new request — no size constraints, no fragmentation. A block table maps each request's logical token positions to physical page locations at runtime.
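A toy allocator makes the mechanics concrete. This is a hypothetical sketch, not vLLM's implementation — physical pages here are just integer indices into an imagined pool, with no actual tensors behind them:

```python
class PagedKVAllocator:
    """Toy on-demand page allocator with a per-request block table."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # pool of physical pages
        self.block_tables = {}                    # request_id -> [physical page, ...]

    def append_token(self, request_id, position):
        """Record one generated token; allocate a page only at boundaries.

        Returns the physical page holding this token's KV entries.
        """
        table = self.block_tables.setdefault(request_id, [])
        if position % self.page_size == 0:        # crossed into a new logical page
            table.append(self.free_pages.pop())   # grab any free physical page
        return table[position // self.page_size]  # block-table lookup

    def free_request(self, request_id):
        """On completion, pages return to the pool for any future request."""
        self.free_pages.extend(self.block_tables.pop(request_id))
```

Note what is absent: no per-request maximum length, no contiguity requirement. A request holds exactly as many pages as it has generated tokens, rounded up to one page.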

The core insight

PagedAttention decouples logical sequence layout from physical memory layout — the same insight that lets modern OSes run many processes on limited physical RAM. Near-zero fragmentation and near-100% memory utilization become achievable.

The attention computation itself is modified to look up physical page locations via the block table on each decode step. The mathematical result is identical to standard attention — only the memory access pattern changes.

06 — Why this transforms serving

PagedAttention isn't just an optimization — it unlocks a fundamentally different operating mode for inference servers.

Continuous batching becomes viable. Because memory is allocated in small pages and freed immediately, the server can add new requests to the batch mid-flight as others complete — no waiting to assemble a fixed batch. GPU utilization stays high continuously rather than spiking and dipping.

Memory efficiency jumps dramatically. vLLM's benchmarks show 2–4× throughput improvement over static allocation on identical hardware — the difference is almost entirely from eliminated fragmentation.

Copy-on-write sharing becomes possible. Multiple requests sharing a common prefix (e.g. a system prompt) can share the same physical KV cache pages, with pages only duplicated when they diverge. This alone can halve memory usage in multi-tenant chat APIs.
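The copy-on-write scheme reduces to reference counting on physical pages. The sketch below is a hypothetical simplification — names and structure are illustrative, not vLLM's code:

```python
class SharedPrefixPages:
    """Toy copy-on-write sharing of prefix pages via reference counts."""

    def __init__(self):
        self.refcount = {}    # physical page -> number of block tables using it
        self.next_page = 0

    def alloc_page(self):
        page = self.next_page
        self.next_page += 1
        self.refcount[page] = 1
        return page

    def fork(self, block_table):
        """Share a prefix: copy only the table, never the page contents."""
        for page in block_table:
            self.refcount[page] += 1
        return list(block_table)

    def write(self, block_table, logical_idx):
        """Duplicate a page only when a shared page is actually written."""
        page = block_table[logical_idx]
        if self.refcount[page] > 1:          # shared -> copy-on-write
            self.refcount[page] -= 1
            page = self.alloc_page()
            block_table[logical_idx] = page
        return page
```

Forking a request that shares a long system prompt costs one shallow list copy; physical duplication happens only at the first diverging write.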

The dependency

Without PagedAttention, continuous batching would collapse under memory fragmentation. The two techniques are inseparable in practice — PagedAttention is what makes high-throughput serving economically viable.

07 — Direct comparison

Dimension          FlashAttention                            PagedAttention
What it optimizes  Attention compute & memory IO             KV cache memory management
Core mechanism     Tiled SRAM computation, online softmax    Non-contiguous paged allocation, block table
Problem solved     Quadratic memory in attention matrix      Fragmentation across concurrent requests
Scope              Within a single forward pass              Across multiple decode steps + requests
Used in            Training + inference (prefill)            Serving systems (decode)
Primary benefit    Longer context, faster attention          Higher concurrency, better GPU utilization
Origin             Dao et al. (2022)                         vLLM / Kwon et al. (2023)

The key word is orthogonal. These frameworks solve problems at different levels of the system and during different phases of generation. One doesn't subsume the other — they compose.

08 — How they work together

In a production serving system, both are active simultaneously — but operating on different problems at different timescales.

During the prefill phase, FlashAttention handles the full-context attention computation efficiently: tiled execution, SRAM-resident intermediates, no N×N matrix materialization. The KV cache for the prompt is written to paged memory blocks managed by PagedAttention.

During the decode phase, FlashAttention handles each new token's attention over the cache. PagedAttention manages where those cache entries live as the sequence grows, allocating new pages on demand and maintaining the block table for non-contiguous lookup.

FlashAttention optimizes how tokens attend to each other — the computation itself. PagedAttention optimizes where tokens live in memory — the storage and retrieval system. Together they address both sides of the memory problem: reducing unnecessary IO within a pass, and eliminating waste across passes.

"Two sides of the same shift — one optimizes memory access during computation, the other optimizes memory allocation across time."

09 — Tradeoffs and limitations

Neither technique is free. Understanding the costs is what separates a surface-level understanding from a production-ready one.

FlashAttention limitations

  • Complex custom CUDA kernels — hardware-specific, requires careful tuning
  • Not all attention variants supported (some sparse patterns, custom masking)
  • Recomputation adds FLOPs during backprop (acceptable tradeoff, but a real one)
  • Implementation complexity — debugging is harder than standard attention

PagedAttention limitations

  • Indirection overhead — every attention step requires block table lookup
  • Non-contiguous memory access can reduce cache locality
  • Scheduler complexity increases with dynamic page management
  • Memory overhead from the block table itself (minor but real)

In practice, these costs are almost always worth it — the performance gains are not marginal. But knowing the tradeoffs matters when you're debugging latency regressions or designing hardware-specific serving stacks.

System-level takeaway

"Modern LLM systems are not limited by algorithms — they are limited by memory hierarchy."

As models scale, the bottleneck shifts from raw compute to how efficiently the system can move and store the data those computations require. FlashAttention and PagedAttention are two sides of that same shift: one eliminates unnecessary memory traffic during attention computation, the other eliminates wasted memory across the full lifecycle of a serving request.

Understanding them together — as complementary solutions to a single underlying constraint — is what lets you reason clearly about LLM infrastructure. The question is never just "which framework?" It's "where is the memory bottleneck, and what layer does it live at?"

Topics: FlashAttention · PagedAttention · KV cache · attention mechanisms · LLM serving · memory-bound compute · vLLM