- LLM Serving · Frameworks
  Serving LLMs in Production: vLLM vs TensorRT-LLM vs SGLang
  How the three dominant serving frameworks handle KV cache, batching, and throughput — and when to use each.
  Apr 2026
- Parallelism · MoE
  Scaling LLMs in Practice: Parallelism Strategies and MoE
  Data, tensor, and pipeline parallelism — plus what Mixture-of-Experts actually costs.
  Apr 2026
- LLM Inference · Systems
  From Prefill to Decode: How Modern LLM Inference Actually Works
  KV caching, continuous batching, chunked prefill, speculative decoding, and disaggregated serving.
  Mar 2026
- KV Cache · Memory
  From FlashAttention to PagedAttention: How Memory Shapes LLM Inference
  Two orthogonal innovations — one optimises attention compute, the other fixes KV cache fragmentation.
  Apr 2026