- LLM Serving · Frameworks
  Serving LLMs in Production: vLLM vs TensorRT-LLM vs SGLang
  How the three dominant serving frameworks handle KV cache, batching, and throughput — and when to use each.
  Apr 2026
- Parallelism · MoE
  Scaling LLMs in Practice: Parallelism Strategies and MoE
  Data, tensor, and pipeline parallelism — plus what Mixture-of-Experts actually costs.
  Apr 2026
- LLM Inference · Systems
  From Prefill to Decode: How Modern LLM Inference Actually Works
  KV caching, continuous batching, chunked prefill, speculative decoding, and disaggregated serving.
  Mar 2026
- KV Cache · Memory
  From FlashAttention to PagedAttention: How Memory Shapes LLM Inference
  Two orthogonal innovations — one optimises attention compute, the other fixes KV cache fragmentation.
  Apr 2026