Engineering deep dive · LLM Infrastructure

Serving LLMs in Production: vLLM vs TensorRT-LLM vs SGLang

What actually matters when you move past "run a model" and start managing memory, latency, and concurrency under real workloads.

~18 min read · System design · Production inference

Serving LLMs isn't about running a model — it's about managing memory, latency, and concurrency under real workloads. Everything else is a consequence of that.

This post explains how the three dominant serving frameworks — vLLM, TensorRT-LLM, and SGLang — actually work, where each one wins, and what happens when you pick the wrong one. By the end, you'll have a clear mental model for any system design conversation involving LLM infrastructure.

Three constraints govern every serving decision you'll ever make:

Latency

Time-to-first-token and per-token generation speed. The user-facing metric. Bad latency kills products.

Throughput

Tokens/sec across all users. The cost metric. Low throughput means expensive GPUs sitting idle.

Memory

KV cache dominates GPU memory during inference. It grows with context length and concurrent users — and it's almost always the real bottleneck, not compute.

01 — What happens during inference

Every LLM generation has two distinct phases with different computational profiles. Understanding this split is non-negotiable — all three frameworks are designed around it.

Prefill phase

The full prompt is processed in one forward pass. Compute-heavy: all tokens processed in parallel. Fast in wall-clock time, but expensive in FLOPs. Produces the first output token.

Decode phase

Tokens generated one at a time. Memory-bound: each step requires reading the entire KV cache from GPU memory. This is the slow part — and it's why memory bandwidth matters more than raw FLOP count.

Key insight

Decode is serial and memory-bound. Doubling your GPU's FLOP count does almost nothing for decode speed. What matters is memory bandwidth and cache efficiency — which is exactly what vLLM, TensorRT-LLM, and SGLang optimize for.
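That claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes approximate A100-class numbers (~2 TB/s HBM bandwidth, ~312 TFLOPS FP16 peak) and a hypothetical 7B FP16 model at batch size 1: every decode step must stream all 14 GB of weights from HBM, so the memory-time floor dwarfs the compute-time floor.

```python
# Back-of-envelope: why decode is memory-bound, not compute-bound.
# Hardware numbers are approximate A100-80GB specs (assumptions).

weights_bytes = 7e9 * 2          # 7B params at FP16 = 14 GB read per decode step
hbm_bandwidth = 2.0e12           # ~2 TB/s HBM bandwidth
peak_flops    = 312e12           # ~312 TFLOPS FP16 tensor-core peak

# Each generated token needs roughly 2 FLOPs per parameter (multiply + add).
flops_per_token = 2 * 7e9

t_memory  = weights_bytes / hbm_bandwidth    # time to stream the weights once
t_compute = flops_per_token / peak_flops     # time to do the math

print(f"memory-bound floor:  {t_memory * 1e3:.2f} ms/token")
print(f"compute-bound floor: {t_compute * 1e3:.4f} ms/token")
# The ratio is roughly two orders of magnitude: the ALUs spend
# most of each decode step waiting on memory.
print(f"ratio: {t_memory / t_compute:.0f}x")
```

Batching helps precisely because the same streamed weights serve many requests per step, amortizing the memory traffic.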

02 — The real bottleneck: KV cache

During the decode phase, the model needs the key and value tensors for every previous token, across every attention head, in every layer. This is the KV cache — and it's enormous.

For a 70B parameter model with a 4096-token context window at 16-bit precision, the KV cache for a single request can consume several gigabytes of GPU memory. Now multiply that by 50 concurrent users.

Why it explodes

KV cache size scales linearly with: number of tokens in context, number of layers, number of attention heads, and the head dimension. With modern long-context models (32K, 128K tokens), this becomes the dominant memory consumer — not the model weights themselves.
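That scaling is easy to make concrete. The sketch below assumes Llama-2-70B-like shapes (80 layers, 64 heads, head dimension 128, FP16) with full multi-head attention; models that use grouped-query attention keep far fewer KV heads, which shrinks the cache proportionally.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed 70B-class shapes: 80 layers, 64 heads, head dim 128, FP16 (no GQA).
per_request = kv_cache_bytes(num_tokens=4096, num_layers=80,
                             num_kv_heads=64, head_dim=128)
print(f"one 4K-token request: {per_request / 2**30:.1f} GiB")   # 10.0 GiB
print(f"50 concurrent users:  {50 * per_request / 2**30:.0f} GiB")
```

At 32K or 128K contexts the same formula multiplies by 8× or 32× per request, which is why long-context serving is a memory problem before it is anything else.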

The real constraint

In modern LLM serving, memory — not FLOPs — is the bottleneck. You will hit GPU OOM long before you saturate your compute units.

Problems with naive allocation

If you pre-allocate a fixed memory block per request (the naive approach), you face three compounding problems:

  • Fragmentation: requests end at different lengths, leaving unusable gaps between allocated blocks.
  • Wasted space: pre-allocated blocks must assume worst-case length, so most memory is reserved but unused.
  • Premature OOM: the GPU runs out of memory before its compute is anywhere near saturated.

Why this matters for framework choice

The difference between vLLM, TensorRT-LLM, and SGLang is largely about how each one handles this memory problem — and what tradeoffs they make to solve it.

03 — Naive serving: the baseline to beat

Before specialized frameworks existed, the standard approach was simple: one request per GPU, static batching, no KV reuse. Load the model, run a request, return the result, repeat.

This works at demo scale. At production scale it falls apart completely:

  • One request per GPU: your $30K A100 sits at 5% utilization while a single user generates a 200-token response.
  • Static batching: you must wait to collect a full batch before starting inference — adding latency to every request.
  • No KV reuse: if the same system prompt appears in 1,000 requests, you recompute it 1,000 times.

Real-world consequence

A naively served 7B model on an A100 might handle 3–5 concurrent users before latency degrades. The same hardware with vLLM can handle 50–100+. The difference is entirely in memory management and batching strategy.

This is the gap that vLLM, TensorRT-LLM, and SGLang exist to close — each from a different angle.

04 — vLLM: why it changed everything

vLLM's core contribution is a single elegant idea: apply virtual memory techniques from operating systems to the KV cache problem. The result is Paged Attention — arguably the most important systems innovation in LLM serving.

Paged attention, explained plainly

In a traditional OS, virtual memory lets processes use non-contiguous physical memory by mapping logical pages to physical frames. vLLM does the same for KV cache: instead of one contiguous block per request, it allocates fixed-size blocks (pages) that can live anywhere in GPU memory.

Pages are allocated on demand as generation proceeds. When a request ends, its pages are freed immediately. New requests grab those freed pages without waiting for defragmentation. The KV cache manager handles the mapping between logical token positions and physical memory locations.
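A toy allocator makes the mechanism concrete. Everything below is an illustrative sketch, not vLLM's actual API: pages are plain integers, and each request's block table is a Python list mapping logical token blocks to physical pages.

```python
class PagedKVAllocator:
    """Toy sketch of paged KV cache allocation (illustrative only).

    Physical GPU memory is carved into fixed-size pages; each request
    holds a block table mapping its logical token blocks to arbitrary
    physical pages, so nothing needs to be contiguous."""

    def __init__(self, num_pages, tokens_per_page=16):
        self.free_pages = list(range(num_pages))
        self.tokens_per_page = tokens_per_page
        self.block_tables = {}   # request_id -> [physical page ids]
        self.lengths = {}        # request_id -> tokens stored so far

    def append_token(self, request_id):
        """Allocate a new page only when the current one fills up."""
        table = self.block_tables.setdefault(request_id, [])
        n = self.lengths.get(request_id, 0)
        if n % self.tokens_per_page == 0:   # current page full (or none yet)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted -> preempt or reject")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = n + 1

    def free(self, request_id):
        """Request finished: pages are immediately reusable, no compaction."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_pages=4, tokens_per_page=16)
for _ in range(40):                           # 40 tokens -> ceil(40/16) pages
    alloc.append_token("req-A")
print(len(alloc.block_tables["req-A"]))       # 3 pages, allocated on demand
alloc.free("req-A")
print(len(alloc.free_pages))                  # 4: all pages instantly reusable
```

The design choice to show here: memory is committed per page as tokens arrive, never per worst-case request length, so a short answer never strands gigabytes.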

Why this is a big deal

Fragmentation drops to near-zero. You can pack far more concurrent requests onto a GPU because memory is used efficiently. Across benchmarks, vLLM achieves 2–4× higher throughput than static allocation on identical hardware.

Continuous batching

Paged Attention enables continuous batching: as soon as one request finishes, its GPU slot is immediately filled by the next waiting request — no idle time, no batch assembly delay. The GPU runs at near-constant utilization.

This is the mechanism that drives vLLM's throughput advantage. High GPU utilization means lower cost-per-token, which is what matters in production.
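The scheduling policy can be sketched in a few lines. This is a toy simulation (request dicts, a FIFO queue, a hypothetical `max_batch` slot count), not vLLM's scheduler; the point is that the batch is recomposed at every decode step, so a finishing request frees its slot immediately.

```python
from collections import deque

def continuous_batching_step(running, waiting, max_batch):
    """One scheduler iteration: finished requests leave, waiting ones
    join immediately -- the batch is rebuilt every decode step."""
    # 1. Retire finished requests.
    running = [r for r in running if r["remaining"] > 0]
    # 2. Admit from the queue up to capacity (real engines also check KV memory).
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())
    # 3. Decode one token for every running request in a single batched pass.
    for r in running:
        r["remaining"] -= 1
    return running, waiting

waiting = deque({"id": i, "remaining": n} for i, n in enumerate([2, 5, 1, 3]))
running, decoded = [], 0
while running or waiting:
    running, waiting = continuous_batching_step(running, waiting, max_batch=2)
    decoded += len(running)
print(decoded)   # → 11: every requested token decoded, no slot ever idle
```

Contrast with static batching, where the slot freed by the 2-token request would sit idle until the 5-token request finished the whole batch.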

Where vLLM wins and where it doesn't

vLLM is purpose-built for high-concurrency, variable-workload scenarios: chat APIs, multi-tenant systems, anything where many users send requests of unpredictable lengths. Its memory management is excellent and its OpenAI-compatible API makes integration straightforward.

Honest limitations

vLLM is a general-purpose runtime, not a hardware-optimized one. For absolute minimum latency on a fixed workload, TensorRT-LLM's kernel-level optimizations will beat it. vLLM trades peak single-request performance for throughput and flexibility.

05 — TensorRT-LLM: maximum performance path

If vLLM is about memory management, TensorRT-LLM is about extracting every last FLOP from your hardware. NVIDIA's framework compiles models into highly optimized CUDA kernels, applying transformations that a generic runtime simply can't.

What it actually does under the hood

  • Operator fusion: multiple operations that would execute sequentially (layernorm → attention → projection) are fused into a single CUDA kernel. Fewer kernel launches, fewer memory round-trips, lower latency.
  • Quantization: FP8 and INT8 quantization reduce memory bandwidth requirements and unlock specialized hardware units (Tensor Cores on H100s operate natively at FP8).
  • Graph optimization: the entire compute graph is analyzed at compile time and restructured to minimize memory transfers between compute steps.

The compile-time tradeoff

TensorRT-LLM "builds" an engine for a specific model, precision, batch size, and sequence length configuration. This build step takes minutes to hours — but the result is a binary optimized specifically for your GPU and workload. It's the difference between a JIT-compiled script and a hand-tuned C binary.

When this matters

For latency-critical applications — real-time coding assistants, voice interfaces, anything where time-to-first-token determines user experience — TensorRT-LLM's kernel optimizations can deliver 30–60% lower latency than vLLM on identical hardware.

The flexibility cost

That compile-time optimization is also TensorRT-LLM's biggest constraint. Changing model architecture, quantization scheme, or sequence length means rebuilding the engine. For teams iterating rapidly on models or serving diverse workloads, this operational cost adds up fast.

Honest limitations

TensorRT-LLM is the right choice when you know exactly what you're running, can invest setup time upfront, and the workload is stable. It's the wrong choice if your team ships model updates weekly or serves highly variable request sizes.

06 — SGLang: the orchestration layer

SGLang addresses a different problem entirely. Where vLLM and TensorRT-LLM are runtime engines, SGLang is a structured generation framework — it's about controlling what the model produces and how multi-step generation flows are coordinated.

The problem it solves

Modern LLM applications rarely involve a single prompt and response. They involve chains: generate a plan, execute tools based on the plan, use tool results to generate the next step, validate output format, retry on failure. Orchestrating this with raw API calls is brittle and slow — you're spending more time on round-trips and JSON parsing than on actual inference.

SGLang provides primitives for expressing these multi-step flows, with the runtime handling batching and scheduling across steps automatically. Critically, it supports RadixAttention — prefix caching that reuses KV cache across requests sharing a common prefix (e.g., a system prompt), eliminating redundant computation.
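The prefix-reuse idea can be sketched with a set of cached prefixes. Real RadixAttention stores them in a radix tree over token sequences (so lookup and eviction are efficient), but the effect is the same: a request sharing a 100-token system prompt only needs fresh prefill for its novel suffix. All names below are illustrative.

```python
def longest_cached_prefix(cache, token_ids):
    """Longest prefix of token_ids whose KV entries are already cached.
    Toy sketch: a set of prefix tuples stands in for the radix tree."""
    for i in range(len(token_ids), 0, -1):
        if tuple(token_ids[:i]) in cache:
            return i
    return 0

def serve(cache, token_ids):
    """Return how many tokens need fresh prefill, then cache every prefix."""
    hit = longest_cached_prefix(cache, token_ids)
    to_prefill = len(token_ids) - hit
    for i in range(1, len(token_ids) + 1):
        cache.add(tuple(token_ids[:i]))
    return to_prefill

system_prompt = list(range(100))           # 100-token shared system prompt
cache = set()
first  = serve(cache, system_prompt + [500, 501])   # cold: prefill all 102
second = serve(cache, system_prompt + [600])        # warm: prefill only 1
print(first, second)   # → 102 1
```

With thousands of requests per minute against the same system prompt, that 100-token prefill saving compounds into a large fraction of total compute.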

The core insight

For agentic workloads, the bottleneck isn't single-request latency — it's the overhead of coordinating many short, interdependent generations. SGLang's structured approach reduces this overhead significantly while keeping each generation fast.

What it's not

SGLang isn't a replacement for vLLM or TensorRT-LLM at the kernel level — it typically runs on top of one of them. Think of it as the layer between your application logic and the raw inference engine. You get structured generation control, prefix caching, and pipeline orchestration; the underlying runtime provides the actual token generation.

Honest limitations

SGLang adds abstraction overhead. For simple, single-turn request/response patterns, that abstraction buys you nothing — you'd be better served by vLLM directly. Its value accrues with complexity: the more multi-step your generation flow, the more it pays off.

07 — Direct comparison

Dimension            | vLLM                            | TensorRT-LLM                  | SGLang
Primary focus        | Memory efficiency + batching    | Kernel-level optimization     | Structured generation + orchestration
Core innovation      | Paged Attention                 | Operator fusion, FP8/INT8     | RadixAttention, prefix caching
Throughput           | Excellent (continuous batching) | High (fixed workloads)        | Good (pipeline-aware)
Latency (single req) | Good                            | Best                          | Depends on backend
Flexibility          | High                            | Low (compile-time)            | High
Setup complexity     | Low                             | High                          | Medium
Best for             | Multi-user APIs, variable load  | Fixed workloads, latency-critical | Agents, tools, complex pipelines
Weakness             | Not hardware-tuned              | Rigid, slow to iterate        | Not a standalone runtime

The key takeaway

These frameworks aren't competing for the same use case. They solve different layers of the serving problem. In sophisticated production systems, you'll often see them combined: TensorRT-LLM as the execution backend, vLLM's batching strategy on top, SGLang orchestrating multi-step flows at the application layer.

08 — When to use what

This is the section that matters in system design interviews. Scenario-first thinking beats memorized facts every time.

vLLM — reach for this first
  • You're building a multi-tenant chat API or inference endpoint
  • Request sizes are variable and unpredictable
  • You need high throughput and good GPU utilization
  • You're iterating on models frequently (no compile-time penalty)
  • You need an OpenAI-compatible API surface quickly
TensorRT-LLM — when latency is non-negotiable
  • Time-to-first-token is a product requirement (voice, real-time coding)
  • Your model is stable — you're not shipping updates weekly
  • Traffic patterns are predictable (fixed batch sizes, known sequence lengths)
  • You have engineering bandwidth for the build/deploy pipeline
  • You're on NVIDIA hardware and want to extract maximum value from it
SGLang — when generation structure matters
  • You're building agents that chain multiple model calls
  • Tool use, structured JSON output, or constrained generation is required
  • Many requests share a long common prefix (system prompt reuse)
  • You want to express multi-step reasoning flows declaratively
  • Pipeline coordination overhead is hurting performance

In practice, these are often combined — not mutually exclusive. A mature production stack might use SGLang for orchestration, vLLM for memory management, and TensorRT-LLM for latency-critical paths.

09 — System design thinking

GPU utilization vs tail latency

Continuous batching increases throughput by keeping GPUs busy — but it does so by interleaving requests. A long-running request shares decode cycles with shorter ones. The long request sees higher latency than if it ran alone. This is the core tension: optimizing for utilization hurts tail latency, and optimizing for tail latency wastes utilization.

The right balance depends on your SLA. A batch inference job has no tail latency requirement — maximize utilization. A real-time chat product has strict P99 requirements — cap batch size and accept lower utilization.

Memory pressure and admission control

When KV cache fills up, you have two choices: reject the incoming request or pause and preempt an existing one. Preemption (swapping KV cache to CPU memory) preserves fairness but adds latency. Rejection is simpler but degrades user experience during spikes.

Production systems need explicit admission control logic — a simple queue in front of the inference engine that enforces memory budgets before requests enter the system. Without this, you get cascading OOMs under load.
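A minimal sketch of that gate, assuming worst-case reservation (prompt plus `max_new_tokens`) and a hypothetical per-token KV cost; class and method names are illustrative:

```python
from collections import deque

class AdmissionController:
    """Toy memory-budget gate in front of an inference engine (sketch).

    Each request reserves worst-case KV memory before it is admitted;
    anything over budget waits in a FIFO queue instead of OOMing the GPU."""

    def __init__(self, kv_budget_bytes, bytes_per_token):
        self.budget = kv_budget_bytes
        self.per_token = bytes_per_token
        self.in_flight = {}      # request_id -> reserved bytes
        self.queue = deque()

    def _available(self):
        return self.budget - sum(self.in_flight.values())

    def submit(self, request_id, prompt_tokens, max_new_tokens):
        need = (prompt_tokens + max_new_tokens) * self.per_token
        if need > self._available():
            self.queue.append((request_id, need))   # wait, don't crash
            return False
        self.in_flight[request_id] = need
        return True

    def finish(self, request_id):
        self.in_flight.pop(request_id, None)
        # Admit queued requests that now fit, in arrival order.
        while self.queue and self.queue[0][1] <= self._available():
            rid, need = self.queue.popleft()
            self.in_flight[rid] = need

# 10 GiB KV budget, ~320 KiB/token (assumed 70B-class model, no GQA).
ac = AdmissionController(kv_budget_bytes=10 * 2**30, bytes_per_token=320 * 1024)
print(ac.submit("a", prompt_tokens=4000, max_new_tokens=2000))   # True
print(ac.submit("b", prompt_tokens=30000, max_new_tokens=2000))  # False: queued
ac.finish("a")
print("b" in ac.in_flight)                                       # True
```

Worst-case reservation is conservative; real engines reclaim the gap between reserved and actual usage, but the invariant is the same: the budget check happens before the request touches the GPU.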

The max-tokens vs concurrency tradeoff

Allowing longer max_tokens per request directly reduces the number of concurrent requests you can serve — it's not a free parameter. A system that allows 8K token responses will support 4× fewer concurrent users than one capped at 2K tokens, all else equal. This needs to be a conscious product decision, not an implementation default.
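The arithmetic is worth writing down. The numbers below are assumptions (a 40 GiB KV pool after weights, ~160 KiB of KV per token for a hypothetical mid-size model); the 4× ratio falls straight out of the token cap.

```python
def max_concurrent_requests(gpu_kv_bytes, kv_bytes_per_token, context_tokens):
    """Worst-case concurrency if every request uses its full token allowance."""
    return gpu_kv_bytes // (kv_bytes_per_token * context_tokens)

kv_pool = 40 * 2**30    # assumed: 40 GiB left for KV cache after model weights
per_tok = 160 * 1024    # assumed: ~160 KiB of KV per token

for cap in (2_000, 8_000):
    users = max_concurrent_requests(kv_pool, per_tok, cap)
    print(f"max_tokens={cap}: {users} concurrent requests worst-case")
```

Raising the cap from 2K to 8K quadruples the per-request reservation and divides worst-case concurrency by four, which is exactly the product tradeoff the section describes.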

System design signal

The best answers to LLM serving questions don't pick a framework — they reason about the tradeoff space (memory, latency, throughput, flexibility) and then select accordingly. That reasoning is what interviewers are actually evaluating.

10 — Failure modes

KV cache OOM

Traffic spike hits, KV cache fills, GPU OOMs. Whole server dies. Fix: admission control + explicit memory budgets before requests enter the system.

Latency spikes from batching

A few long-running requests get scheduled together. P99 explodes. Fix: max sequence length limits per batch, or priority-based scheduling.

Fragmentation without paged attention

Static allocation leaves gaps between freed blocks. GPU appears to have memory, but no contiguous block large enough for new requests. Fix: use paged attention (i.e., use vLLM).

Underutilized GPUs

Batch size too small, or static batching introduces wait time. GPU runs at 15% utilization. Fix: continuous batching + autoscaling on queue depth, not just GPU %.

TensorRT engine mismatch

Model updated, but engine not rebuilt. Silent degradation or crash at runtime. Fix: engine versioning tied to model checksum in your CI/CD pipeline.
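One way to enforce that tie, sketched in Python. The manifest format and function names here are hypothetical; a real pipeline would also hash the build configuration and the TensorRT-LLM version alongside the weights.

```python
import hashlib
import json
import pathlib

def model_fingerprint(checkpoint_dir):
    """Hash every file in the checkpoint so an engine can be tied
    to an exact set of weights (sketch; names are illustrative)."""
    h = hashlib.sha256()
    for path in sorted(pathlib.Path(checkpoint_dir).rglob("*")):
        if path.is_file():
            h.update(path.name.encode())
            h.update(path.read_bytes())
    return h.hexdigest()

def engine_is_stale(checkpoint_dir, manifest_path):
    """Compare the current checkpoint hash to the one recorded at build time."""
    current = model_fingerprint(checkpoint_dir)
    try:
        recorded = json.loads(pathlib.Path(manifest_path).read_text())["model_sha256"]
    except (FileNotFoundError, KeyError):
        return True          # no manifest -> force a rebuild
    return current != recorded
```

CI would call `engine_is_stale` before every deploy and trigger an engine rebuild (and manifest rewrite) whenever it returns true, turning the silent mismatch into an explicit, automated step.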

Prefix cache invalidation

System prompt changed, but SGLang's RadixAttention cache wasn't cleared. Stale cache causes incorrect outputs. Fix: explicit cache invalidation on any shared prefix update.

Serving LLMs is no longer about running models — it's about managing memory, scheduling, and system tradeoffs at scale. vLLM solves the memory problem. TensorRT-LLM solves the latency problem. SGLang solves the orchestration problem. Understanding which problem you actually have is the key decision factor.
