What actually matters when you move past "run a model" and start managing memory, latency, and concurrency under real workloads.
Serving LLMs isn't about running a model — it's about managing memory, latency, and concurrency under real workloads. Everything else is a consequence of that.
This post explains how the three dominant serving frameworks — vLLM, TensorRT-LLM, and SGLang — actually work, where each one wins, and what happens when you pick the wrong one. By the end, you'll have a clear mental model for any system design conversation involving LLM infrastructure.
Three constraints govern every serving decision you'll ever make:
Latency: time-to-first-token and per-token generation speed. The user-facing metric. Bad latency kills products.
Throughput: tokens/sec across all users. The cost metric. Low throughput means expensive GPUs sitting idle.
Memory: the KV cache dominates GPU memory during inference. It grows with context length and concurrent users — and it's almost always the real bottleneck, not compute.
Every LLM generation has two distinct phases with different computational profiles. Understanding this split is non-negotiable — all three frameworks are designed around it.
Prefill: the full prompt is processed in one forward pass. Compute-heavy: all tokens are processed in parallel. Fast in wall-clock time, but expensive in FLOPs. Produces the first output token.
Decode: tokens are generated one at a time. Memory-bound: each step requires reading the entire KV cache from GPU memory. This is the slow part — and it's why memory bandwidth matters more than raw FLOP count.
Decode is serial and memory-bound. Doubling your GPU's FLOP count does almost nothing for decode speed. What matters is memory bandwidth and cache efficiency — which is exactly what vLLM, TensorRT-LLM, and SGLang are optimizing.
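A quick back-of-the-envelope calculation shows why. The numbers below are illustrative assumptions (a 70B-parameter model in FP16, roughly 2 TB/s of HBM bandwidth), not benchmarks:

```python
# Upper bound on single-stream decode speed when every step must stream the
# full weights from HBM. Illustrative numbers, not measurements.
weight_bytes = 70e9 * 2        # 70B parameters at 2 bytes (FP16) ~= 140 GB
hbm_bandwidth = 2e12           # ~2 TB/s of memory bandwidth (A100-class, roughly)

max_tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"~{max_tokens_per_sec:.0f} tokens/sec per request, best case")  # ~14

# Batching amortizes the weight reads across many requests, which is why
# throughput-oriented engines work so hard to keep batches full.
```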
During the decode phase, the model needs the key and value tensors for every previous token, across every attention head, in every layer. This is the KV cache — and it's enormous.
For a 70B parameter model with a 4096-token context window at 16-bit precision, the KV cache for a single request can consume several gigabytes of GPU memory. Now multiply that by 50 concurrent users.
KV cache size scales linearly with: number of tokens in context, number of layers, number of attention heads, and the head dimension. With modern long-context models (32K, 128K tokens), this becomes the dominant memory consumer — not the model weights themselves.
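To make the scaling concrete, here is a back-of-the-envelope sizing. The configuration is a hypothetical 70B-class model (80 layers, 64 KV heads, head dimension 128, FP16), not a specific checkpoint:

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (the factor of 2) are stored for every token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

per_request = kv_cache_bytes(num_tokens=4096, num_layers=80, num_kv_heads=64, head_dim=128)
print(f"{per_request / 2**30:.1f} GiB per 4096-token request")     # ~10 GiB
print(f"{50 * per_request / 2**30:.0f} GiB for 50 such requests")  # ~500 GiB

# Grouped-query attention (fewer KV heads) shrinks this proportionally,
# but 32K and 128K contexts push it right back up.
```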
In modern LLM serving, memory — not FLOPs — is the bottleneck. You will hit GPU OOM long before you saturate your compute units.
If you pre-allocate a fixed memory block per request (the naive approach), you face three compounding problems:
Fragmentation: requests end at different lengths, leaving unusable gaps between allocated blocks.
Wasted space: pre-allocated blocks must assume worst-case length, so most memory is reserved but unused.
Premature OOM: the GPU runs out of memory before its compute is anywhere near saturated.
The difference between vLLM, TensorRT-LLM, and SGLang is largely about how each one handles this memory problem — and what tradeoffs they make to solve it.
Before specialized frameworks existed, the standard approach was simple: one request per GPU, static batching, no KV reuse. Load the model, run a request, return the result, repeat.
This works at demo scale. At production scale it falls apart completely:
One request per GPU: your $30K A100 sits at 5% utilization while a single user generates a 200-token response.
Static batching: you must wait to collect a full batch before starting inference — adding latency to every request.
No KV reuse: if the same system prompt appears in 1,000 requests, you recompute it 1,000 times.
A naively served 7B model on an A100 might handle 3–5 concurrent users before latency degrades. The same hardware with vLLM can handle 50–100+. The difference is entirely in memory management and batching strategy.
This is the gap that vLLM, TensorRT-LLM, and SGLang exist to close — each from a different angle.
vLLM's core contribution is a single elegant idea: apply virtual memory techniques from operating systems to the KV cache problem. The result is Paged Attention — arguably the most important systems innovation in LLM serving.
In a traditional OS, virtual memory lets processes use non-contiguous physical memory by mapping logical pages to physical frames. vLLM does the same for KV cache: instead of one contiguous block per request, it allocates fixed-size blocks (pages) that can live anywhere in GPU memory.
Pages are allocated on demand as generation proceeds. When a request ends, its pages are freed immediately. New requests grab those freed pages without waiting for defragmentation. The KV cache manager handles the mapping between logical token positions and physical memory locations.
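A toy sketch of the bookkeeping makes this concrete. It is a conceptual illustration only, not vLLM's actual block manager; the `BlockManager` name and the 16-token page size are assumptions for the example:

```python
class BlockManager:
    """Toy page-table bookkeeping for a paged KV cache (conceptual sketch)."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size                          # tokens per page
        self.free_blocks = list(range(num_physical_blocks))   # physical page pool
        self.block_tables = {}                                # request_id -> physical page ids
        self.token_counts = {}                                # request_id -> tokens stored

    def append_token(self, request_id):
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count % self.block_size == 0:                      # last page is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: preempt or reject")
            table.append(self.free_blocks.pop())              # pages need not be contiguous
        self.token_counts[request_id] = count + 1

    def free(self, request_id):
        # Freed pages go straight back to the pool; no defragmentation needed.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)
```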
Fragmentation drops to near-zero. You can pack far more concurrent requests onto a GPU because memory is used efficiently. Across benchmarks, vLLM achieves 2–4× higher throughput than static allocation on identical hardware.
Paged Attention enables continuous batching: as soon as one request finishes, its GPU slot is immediately filled by the next waiting request — no idle time, no batch assembly delay. The GPU runs at near-constant utilization.
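A minimal sketch of that scheduling loop, with hypothetical `engine.step`, `is_done`, and `emit_result` interfaces standing in for the real engine (which also handles prefill scheduling, preemption, and streaming):

```python
from collections import deque

def serve(engine, waiting: deque, max_batch_size: int):
    running = []
    while running or waiting:
        # Backfill free slots from the queue instead of waiting to assemble a full batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        engine.step(running)                 # one decode step for every active request

        for request in [r for r in running if r.is_done()]:
            request.emit_result()            # its slot is reusable on the very next step
        running = [r for r in running if not r.is_done()]
```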
This is the mechanism that drives vLLM's throughput advantage. High GPU utilization means lower cost-per-token, which is what matters in production.
vLLM is purpose-built for high-concurrency, variable-workload scenarios: chat APIs, multi-tenant systems, anything where many users send requests of unpredictable lengths. Its memory management is excellent and its OpenAI-compatible API makes integration straightforward.
vLLM is a general-purpose runtime, not a hardware-optimized one. For absolute minimum latency on a fixed workload, TensorRT-LLM's kernel-level optimizations will beat it. vLLM trades peak single-request performance for throughput and flexibility.
If vLLM is about memory management, TensorRT-LLM is about extracting every last FLOP from your hardware. NVIDIA's framework compiles models into highly optimized CUDA kernels, applying transformations that a generic runtime simply can't.
Operator fusion: multiple operations that would execute sequentially (layernorm → attention → projection) are fused into a single CUDA kernel. Fewer kernel launches, fewer memory round-trips, lower latency.
Quantization: FP8 and INT8 quantization reduce memory bandwidth requirements and unlock specialized hardware units (Tensor Cores on H100s operate natively at FP8).
Graph optimization: the entire compute graph is analyzed at compile time and restructured to minimize memory transfers between compute steps.
TensorRT-LLM "builds" an engine for a specific model, precision, batch size, and sequence length configuration. This build step takes minutes to hours — but the result is a binary optimized specifically for your GPU and workload. It's the difference between a JIT-compiled script and a hand-tuned C binary.
For latency-critical applications — real-time coding assistants, voice interfaces, anything where time-to-first-token determines user experience — TensorRT-LLM's kernel optimizations can deliver 30–60% lower latency than vLLM on identical hardware.
That compile-time optimization is also TensorRT-LLM's biggest constraint. Changing model architecture, quantization scheme, or sequence length means rebuilding the engine. For teams iterating rapidly on models or serving diverse workloads, this operational cost adds up fast.
TensorRT-LLM is the right choice when you know exactly what you're running, can invest setup time upfront, and the workload is stable. It's the wrong choice if your team ships model updates weekly or serves highly variable request sizes.
SGLang addresses a different problem entirely. Where vLLM and TensorRT-LLM are runtime engines, SGLang is a structured generation framework — it's about controlling what the model produces and how multi-step generation flows are coordinated.
Modern LLM applications rarely involve a single prompt and response. They involve chains: generate a plan, execute tools based on the plan, use tool results to generate the next step, validate output format, retry on failure. Orchestrating this with raw API calls is brittle and slow — you're spending more time on round-trips and JSON parsing than on actual inference.
SGLang provides primitives for expressing these multi-step flows, with the runtime handling batching and scheduling across steps automatically. Critically, it supports RadixAttention — prefix caching that reuses KV cache across requests sharing a common prefix (e.g., a system prompt), eliminating redundant computation.
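To give a flavor of the programming model, here is a small two-step flow written against SGLang's frontend API as described in its documentation (signatures may differ across versions; the endpoint URL and parameter values are placeholders):

```python
import sglang as sgl

@sgl.function
def plan_and_answer(s, question):
    # The shared system prompt is a common prefix across requests, so its KV cache
    # can be reused via prefix caching instead of being recomputed each time.
    s += sgl.system("You are a careful assistant that plans before answering.")
    s += sgl.user("Outline a short plan for answering: " + question)
    s += sgl.assistant(sgl.gen("plan", max_tokens=128))
    s += sgl.user("Now answer the question, following your plan.")
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

# Placeholder endpoint for a locally running SGLang server.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = plan_and_answer.run(question="Why is decode memory-bound?")
print(state["answer"])
```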
For agentic workloads, the bottleneck isn't single-request latency — it's the overhead of coordinating many short, interdependent generations. SGLang's structured approach reduces this overhead significantly while keeping each generation fast.
SGLang isn't a replacement for vLLM or TensorRT-LLM at the kernel level — it typically runs on top of one of them. Think of it as the layer between your application logic and the raw inference engine. You get structured generation control, prefix caching, and pipeline orchestration; the underlying runtime provides the actual token generation.
SGLang adds abstraction overhead. For simple, single-turn request/response patterns, that abstraction buys you nothing — you'd be better served by vLLM directly. Its value accrues with complexity: the more multi-step your generation flow, the more it pays off.
| Dimension | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Primary focus | Memory efficiency + batching | Kernel-level optimization | Structured generation + orchestration |
| Core innovation | Paged Attention | Operator fusion, FP8/INT8 | RadixAttention, prefix caching |
| Throughput | Excellent (continuous batching) | High (fixed workloads) | Good (pipeline-aware) |
| Latency (single req) | Good | Best | Depends on backend |
| Flexibility | High | Low (compile-time) | High |
| Setup complexity | Low | High | Medium |
| Best for | Multi-user APIs, variable load | Fixed workloads, latency-critical | Agents, tools, complex pipelines |
| Weakness | Not hardware-tuned | Rigid, slow to iterate | Not a standalone runtime |
These frameworks aren't competing for the same use case. They solve different layers of the serving problem. In sophisticated production systems, you'll often see them combined: TensorRT-LLM as the execution backend, vLLM's batching strategy on top, SGLang orchestrating multi-step flows at the application layer.
This is the section that matters in system design interviews. Scenario-first thinking beats memorized facts every time.
Keep in mind that these frameworks are combined in practice, not mutually exclusive; the tradeoffs below apply whichever stack you assemble.
Continuous batching increases throughput by keeping GPUs busy — but it does so by interleaving requests. A long-running request shares decode cycles with shorter ones. The long request sees higher latency than if it ran alone. This is the core tension: optimizing for utilization hurts tail latency, and optimizing for tail latency wastes utilization.
The right balance depends on your SLA. A batch inference job has no tail latency requirement — maximize utilization. A real-time chat product has strict P99 requirements — cap batch size and accept lower utilization.
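These limits are explicit configuration, not emergent behavior. As an illustration, vLLM exposes knobs roughly along these lines in its Python API (parameter names follow recent releases, so verify against the current docs; the model id is just an example):

```python
from vllm import LLM, SamplingParams

# Latency-leaning configuration: cap concurrency so a handful of long requests
# cannot starve everyone else's decode steps; raise max_num_seqs for batch jobs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    gpu_memory_utilization=0.90,               # fraction of VRAM for weights + KV cache
    max_model_len=8192,                        # hard cap on tokens per sequence
    max_num_seqs=16,                           # cap on concurrently scheduled sequences
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```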
When KV cache fills up, you have two choices: reject the incoming request or pause and preempt an existing one. Preemption (swapping KV cache to CPU memory) preserves fairness but adds latency. Rejection is simpler but degrades user experience during spikes.
Production systems need explicit admission control logic — a simple queue in front of the inference engine that enforces memory budgets before requests enter the system. Without this, you get cascading OOMs under load.
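A minimal sketch of such a gate, assuming the caller can estimate a request's worst-case KV footprint up front (the class name, budget, and estimator here are illustrative, not any framework's API):

```python
class AdmissionController:
    """Reject or queue requests whose worst-case KV footprint would blow the budget."""

    def __init__(self, kv_budget_bytes, bytes_per_token):
        self.kv_budget = kv_budget_bytes
        self.bytes_per_token = bytes_per_token
        self.reserved = 0

    def try_admit(self, prompt_tokens, max_new_tokens):
        # Reserve the worst case: the full prompt plus the full max_tokens budget.
        worst_case = (prompt_tokens + max_new_tokens) * self.bytes_per_token
        if self.reserved + worst_case > self.kv_budget:
            return False            # caller queues the request or returns 429
        self.reserved += worst_case
        return True

    def release(self, prompt_tokens, max_new_tokens):
        # Return the reservation when the request finishes (or is preempted).
        self.reserved -= (prompt_tokens + max_new_tokens) * self.bytes_per_token
```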
Allowing longer max_tokens per request directly reduces the number of concurrent requests you can serve — it's not a free parameter. A system that allows 8K token responses will support 4× fewer concurrent users than one capped at 2K tokens, all else equal. This needs to be a conscious product decision, not an implementation default.
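The arithmetic behind that claim, under a worst-case reservation model and with illustrative numbers for the KV budget and per-token cost:

```python
# Worst-case planning: how many requests fit if each one may use its full max_tokens?
kv_budget = 40e9          # 40 GB of GPU memory reserved for KV cache (illustrative)
bytes_per_token = 320e3   # ~320 KB of KV per token (illustrative 70B-class model, FP16, GQA)

for max_tokens in (2_048, 8_192):
    concurrent = int(kv_budget / (max_tokens * bytes_per_token))
    print(f"max_tokens={max_tokens}: ~{concurrent} concurrent requests")
# max_tokens=2048: ~61 requests; max_tokens=8192: ~15, the 4x gap from above.
# Paged allocation softens this (memory is claimed as tokens are generated),
# but admission control still has to plan for the cap.
```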
The best answers to LLM serving questions don't pick a framework — they reason about the tradeoff space (memory, latency, throughput, flexibility) and then select accordingly. That reasoning is what interviewers are actually evaluating.
The OOM cascade: a traffic spike hits, the KV cache fills, and the GPU OOMs. The whole server dies. Fix: admission control + explicit memory budgets before requests enter the system.
Tail latency blowup: a few long-running requests get scheduled together. P99 explodes. Fix: max sequence length limits per batch, or priority-based scheduling.
Memory fragmentation: static allocation leaves gaps between freed blocks. The GPU appears to have memory, but no contiguous block large enough for new requests. Fix: use paged attention (i.e., use vLLM).
Idle GPUs: batch size is too small, or static batching introduces wait time. The GPU runs at 15% utilization. Fix: continuous batching + autoscaling on queue depth, not just GPU utilization.
Stale engine: the model is updated, but the TensorRT-LLM engine isn't rebuilt. Silent degradation or a crash at runtime. Fix: engine versioning tied to the model checksum in your CI/CD pipeline.
Stale prefix cache: the system prompt changed, but SGLang's RadixAttention cache wasn't cleared, causing incorrect outputs. Fix: explicit cache invalidation on any shared-prefix update.
Speculative decoding: a small draft model generates candidate tokens; the large model verifies multiple tokens in parallel. Can deliver 2–3× decode speedup with no quality loss. Already shipping in TensorRT-LLM and increasingly in vLLM.
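The core loop is simple enough to sketch. This is the greedy-acceptance variant for intuition only; production implementations use a probabilistic acceptance rule that preserves the target model's output distribution, and the `draft`/`target` interfaces here are hypothetical:

```python
def speculative_step(draft, target, prefix, k=4):
    """One draft-then-verify round of (greedy) speculative decoding, as a toy sketch."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft.next_token(prefix + proposal))

    # 2. The large target model checks all k positions in a single forward pass.
    target_preds = target.next_tokens_parallel(prefix, proposal)   # k + 1 predictions

    # 3. Keep the draft tokens up to the first disagreement, then take the
    #    target's own token there; if everything matched, keep its bonus token.
    accepted = []
    for i, token in enumerate(proposal):
        if target_preds[i] == token:
            accepted.append(token)
        else:
            accepted.append(target_preds[i])
            break
    else:
        accepted.append(target_preds[k])
    return accepted   # between 1 and k + 1 tokens per expensive forward pass
```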
KV cache compression: quantizing the KV cache itself (not just weights) to INT8 or INT4 can halve memory usage with minimal quality impact. Active research area with production implications.
Disaggregated prefill/decode: running prefill and decode on separate GPU pools, since they have different computational profiles. Early results show significant throughput gains for long-context workloads.
Hardware-aware schedulers: scheduling decisions that account for GPU memory topology, NVLink bandwidth, and PCIe bottlenecks — not just token counts. Nascent area with large potential upside.
Serving LLMs is no longer about running models — it's about managing memory, scheduling, and system tradeoffs at scale. vLLM solves the memory problem. TensorRT-LLM solves the latency problem. SGLang solves the orchestration problem. Understanding which problem you actually have is the key decision factor.