
Scaling LLMs in Practice: Parallelism Strategies and the Reality of Mixture-of-Experts

Apr 2026

Training doesn't fail because of FLOPs. It fails because of memory limits, communication overhead, and load imbalance. Modern LLM scaling is no longer a compute problem — it's a communication and systems problem.

01. Why parallelism exists

The naive mental model of LLM training is: more GPUs = faster training. Reality is messier. Before you even think about scaling, you're hitting a wall on a single GPU.

Consider a 70B parameter model in BF16. Just storing the weights takes ~140 GB. Add optimizer states (Adam keeps FP32 first and second moments — 8 bytes per parameter, another 560 GB) and gradients, and you're north of 700 GB. An A100 has 80 GB of HBM. This isn't a compute problem. It's a memory problem.
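The accounting above can be made explicit with a back-of-envelope sketch. This assumes the common mixed-precision recipe (BF16 weights and gradients, FP32 master weights and Adam moments); exact numbers vary by framework.

```python
# Back-of-envelope memory accounting for mixed-precision Adam training.
# Assumes BF16 weights/gradients plus FP32 master weights and Adam
# moments -- a common recipe, not the only one.

def training_memory_gb(n_params: float) -> dict:
    GB = 1e9
    weights = 2 * n_params / GB   # BF16: 2 bytes/param
    grads   = 2 * n_params / GB   # BF16 gradients
    adam_m  = 4 * n_params / GB   # FP32 first moment
    adam_v  = 4 * n_params / GB   # FP32 second moment
    master  = 4 * n_params / GB   # FP32 master weights
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": adam_m + adam_v + master,
        "total": weights + grads + adam_m + adam_v + master,
    }

mem = training_memory_gb(70e9)
print(mem["weights"])  # 140.0 GB just for the BF16 weights
print(mem["total"])    # 1120.0 GB before activations -- 14 A100s of HBM
```

Activations, CUDA context, and fragmentation come on top of this, which is why techniques like activation checkpointing and ZeRO sharding exist.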

There are two regimes to understand:

Large batch sizes push you toward the compute-bound regime; small batches and large models push you toward the memory-bound regime. Most large-scale training sits somewhere in between — which is why the parallelism strategy you choose matters so much.

The core question parallelism answers: how do you split a model (and its data) across GPUs such that each GPU stays busy, memory stays within bounds, and communication overhead doesn't eat your throughput?

02. Data parallelism, tensor parallelism, and pipeline parallelism

There are three fundamental axes of parallelism. Each attacks a different dimension of the problem.

Data parallelism (DP)

The simplest idea: replicate the full model on each GPU, split the batch across them, and synchronize gradients after each backward pass via an all-reduce. Ring all-reduce is the standard algorithm — GPUs form a ring, each sending and receiving in a pipeline. Total data transferred per GPU is 2(N-1)/N × model_size. It's bandwidth-efficient, but as models grow, gradient sync cost becomes significant. At 70B parameters, you're synchronizing hundreds of GBs of gradients every step.
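The per-GPU volume formula is easy to sanity-check numerically — a minimal sketch, assuming BF16 gradients for the 70B example above:

```python
# Ring all-reduce moves each buffer twice (reduce-scatter + all-gather),
# each phase transferring (N-1)/N of the buffer per GPU.

def ring_allreduce_bytes(buffer_bytes: float, n_gpus: int) -> float:
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes

grad_bytes = 70e9 * 2  # 70B params, BF16 gradients = 140 GB
for n in (8, 64, 512):
    gb = ring_allreduce_bytes(grad_bytes, n) / 1e9
    print(f"{n:4d} GPUs: {gb:.1f} GB moved per GPU per step")
```

Note the per-GPU cost is nearly flat in N — that is the appeal of ring all-reduce — but at 70B scale it is still ~245+ GB per GPU per step, which is why gradient sync dominates slow interconnects.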

When DP breaks: when the model itself doesn't fit on a single GPU. Gradient sync is manageable — but if you can't even load the model, DP alone does nothing.

Tensor parallelism (TP)

TP splits individual weight matrices across GPUs. A transformer's FFN is essentially two large matmuls. You split the first weight matrix column-wise across GPUs, compute in parallel, then row-wise split the second. Each GPU holds a slice of the weights.
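The column-then-row split can be verified with a toy NumPy FFN (hypothetical small dimensions, two simulated "GPUs"); the final summation is exactly the all-reduce a real TP implementation performs:

```python
import numpy as np

# Tensor-parallel FFN check: column-split W1, row-split W2 across 2 shards,
# then confirm the sharded result matches the unsharded FFN.
rng = np.random.default_rng(0)
d, h = 16, 64
x  = rng.normal(size=(4, d))
W1 = rng.normal(size=(d, h))
W2 = rng.normal(size=(h, d))

dense = np.maximum(x @ W1, 0) @ W2  # reference FFN (ReLU nonlinearity)

partials = []
for i in range(2):
    cols = slice(i * h // 2, (i + 1) * h // 2)
    h_i = np.maximum(x @ W1[:, cols], 0)  # local matmul, no communication
    partials.append(h_i @ W2[cols, :])    # local partial output
sharded = partials[0] + partials[1]       # <- this sum is the all-reduce

assert np.allclose(dense, sharded)
```

The elementwise ReLU commutes with the column split, which is why no communication is needed between the two matmuls — only after the second one.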

The catch: you need all-reduce or all-gather operations after every matmul to reconstruct full activations. This is frequent, fine-grained communication. TP works well within a node over NVLink (~600 GB/s), but falls apart across nodes over InfiniBand (25–400 Gb/s — gigabits, not gigabytes). The bandwidth delta is an order of magnitude or more.

Practical rule: TP degree is bounded by the number of GPUs in a node. 8-way TP within a DGX node is common. 16-way+ TP across nodes is almost always a mistake.

Pipeline parallelism (PP)

PP splits layers across GPUs. GPU 0 runs layers 1–8, GPU 1 runs layers 9–16, and so on. Data flows through like an assembly line. The fundamental problem: pipeline bubbles. In a naive pipeline, GPU 1 sits idle waiting for GPU 0 to finish. With k pipeline stages, bubble fraction is (k-1)/k — at 8 stages, that's ~87.5% idle time.

Micro-batching fixes this. Instead of one large batch, you split it into m micro-batches. While GPU 1 processes micro-batch 1, GPU 0 starts on micro-batch 2. Bubbles shrink to (k-1)/(m+k-1) — large m effectively eliminates bubble overhead, at the cost of higher memory for buffered activations.
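Plugging numbers into the bubble formula shows how quickly micro-batching pays off:

```python
# Pipeline bubble fraction: (k-1)/(m+k-1) for k stages and m micro-batches.
# m=1 recovers the naive pipeline's (k-1)/k.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"8 stages, {m:2d} micro-batches: {bubble_fraction(8, m):.1%} idle")
```

With 8 stages, going from 1 to 64 micro-batches drops idle time from 87.5% to under 10% — but each in-flight micro-batch holds buffered activations, so m is bounded by memory.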

[Figure: Three parallelism axes and their communication patterns. Comm cost — DP: gradient all-reduce (high) · TP: per-layer all-gather (very high) · PP: activation pass only (low)]

03. Hybrid parallelism: what real systems actually do

No single parallelism strategy works at the scale of a 100B+ model. Real training jobs combine all three, and the combination follows from hardware topology: tensor parallelism within a node, where NVLink makes frequent all-reduces affordable; pipeline parallelism across nodes, where only activations cross the slower interconnect; and data parallelism across the remaining replicas to scale the batch.

Frameworks like Megatron-LM formalize this as 3D parallelism. The degree along each axis is a hyperparameter you tune — and it directly impacts MFU (model FLOPs utilization), the real measure of training efficiency.

04. Why dense models stop scaling — and what MoE promises

Here's the core problem with dense transformers at scale: every parameter participates in every forward pass. Compute grows linearly with parameters. You're paying full price per token, every token.

But for any given token, does the model need all parameters? Intuitively, no. A piece of code, a German sentence, a math equation — each likely activates different circuits in the network. What if we only activated a subset of parameters per token?

That's Mixture-of-Experts.

05. MoE: the core idea and the real problems

How it works

In an MoE model, the FFN layer in each transformer block is replaced by N expert FFNs and a router (gating network). For each token, the router selects the top-k experts (usually k=1 or k=2), and only those experts compute. Outputs are combined via weighted sum.

The result: a model with 8× the parameters of a dense model can cost roughly the same compute per token. Mixtral 8×7B has 47B total parameters but activates ~13B per token. Parameter count and FLOPs-per-token are decoupled.

The promise: more capacity (parameters) at similar compute cost. Better perplexity for the same training budget — assuming routing works well.
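The routing mechanics fit in a few lines. A minimal NumPy sketch with toy dimensions (4 experts modeled as plain linear maps, top-2 routing) — real implementations batch tokens by expert rather than looping:

```python
import numpy as np

# Minimal top-2 MoE layer: router softmax over experts, only the two
# highest-scoring experts run per token, outputs combined by router weight.
rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2
tokens   = rng.normal(size=(5, d))
router_w = rng.normal(size=(d, n_experts))
experts  = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy experts

logits = tokens @ router_w
probs  = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
topk   = np.argsort(probs, axis=-1)[:, -k:]  # top-k expert ids per token

out = np.zeros_like(tokens)
for t in range(tokens.shape[0]):
    for e in topk[t]:
        # only k of n_experts compute; the rest get no gradient signal
        out[t] += probs[t, e] * (tokens[t] @ experts[e])
```

Note what is absent: nothing here forces the router to spread load — which is exactly why the problems below appear.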
[Figure: MoE routing — the router's top-2 softmax selects experts 1 and 2; expert 3 stays idle this step and receives no gradient signal]

The real problems

Load imbalance — The router learns which experts are "good" and starts routing everything to them. A few experts become overloaded; most sit idle. You've paid for N experts but effectively trained far fewer.

Communication overhead — Experts are distributed across GPUs. Routing a token to an expert on a different GPU requires an all-to-all — every GPU sends tokens to every other GPU. At scale, this is expensive and latency-sensitive.

Stragglers — The slowest expert dictates step time. One overloaded GPU while others are near-idle means you wait. All parallelism benefits evaporate at the bottleneck.

Routing collapse — Without intervention, training can collapse — the router converges to sending everything to one expert. Gradients for all others vanish. You end up with a very expensive dense model.

06. How MoE systems fix these issues

Each problem has a corresponding mitigation. Understanding why each fix exists matters more than just knowing what it is.

Load balancing loss adds an auxiliary loss term that penalizes uneven expert utilization. It's a soft constraint — too much and the model routes randomly, degrading quality; too little and collapse happens anyway. Tuning this coefficient is non-trivial.
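One common formulation is the Switch Transformer auxiliary loss — n_experts × Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability for expert i. A sketch (toy inputs, top-1 routing):

```python
import numpy as np

# Switch-style load balancing loss. Minimized at 1.0 under uniform routing;
# grows as routing concentrates on few experts.

def load_balance_loss(probs: np.ndarray) -> float:
    n_tokens, n_experts = probs.shape
    assignments = probs.argmax(-1)  # top-1 routing decision per token
    f = np.bincount(assignments, minlength=n_experts) / n_tokens
    P = probs.mean(0)
    return n_experts * float((f * P).sum())

uniform = np.full((1000, 4), 0.25)
collapsed = np.zeros((1000, 4)); collapsed[:, 0] = 1.0
print(load_balance_loss(uniform))    # 1.0 -- balanced routing
print(load_balance_loss(collapsed))  # 4.0 -- everything on expert 0
```

The loss is differentiable through P (the soft probabilities), so gradient descent nudges the router toward balance even though f itself is a hard count.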

Capacity factor and token dropping give each expert a fixed buffer: capacity = capacity_factor × tokens_per_batch / num_experts. If an expert's buffer is full, overflow tokens are dropped — they skip the expert with only the residual contribution. This bounds compute and memory, but dropped tokens mean information loss.
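The dispatch logic can be sketched directly from the formula (hypothetical sizes; real kernels do this with vectorized scatter, not a Python loop):

```python
import numpy as np

# Expert capacity with token dropping: each expert accepts at most
# `capacity` tokens per step; overflow tokens skip the expert entirely
# and survive only via the residual connection.

def dispatch(assignments, n_experts: int, capacity_factor: float):
    n_tokens = len(assignments)
    capacity = int(capacity_factor * n_tokens / n_experts)
    kept, dropped, load = [], [], [0] * n_experts
    for t, e in enumerate(assignments):
        if load[e] < capacity:
            load[e] += 1
            kept.append(t)
        else:
            dropped.append(t)  # information loss: residual path only
    return kept, dropped, capacity

rng = np.random.default_rng(0)
assignments = rng.integers(0, 8, size=1024)  # routed expert id per token
kept, dropped, cap = dispatch(assignments, n_experts=8, capacity_factor=1.0)
print(cap)           # 128 slots per expert
print(len(dropped))  # tokens dropped this step under imbalance
```

Typical capacity factors are around 1.0–1.25: larger values drop fewer tokens but waste compute and memory on padded buffers.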

Expert parallelism distributes experts across GPUs — each GPU owns a subset. The all-to-all for token routing is now the key communication primitive. You're combining TP + PP + DP + EP, each tuned to cluster topology.
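A quick simulation shows why the all-to-all dominates: with experts sharded across GPUs and no locality in routing, almost every token crosses the network. The placement below (contiguous expert blocks per GPU, random routing) is an illustrative assumption:

```python
import numpy as np

# Fraction of tokens that must cross GPUs in expert parallelism.
# Experts 2i and 2i+1 live on GPU i; routing is uniform random.
rng = np.random.default_rng(0)
n_gpus, experts_per_gpu, n_tokens = 8, 2, 4096

home_gpu = rng.integers(0, n_gpus, size=n_tokens)  # GPU holding each token
expert   = rng.integers(0, n_gpus * experts_per_gpu, size=n_tokens)
dest_gpu = expert // experts_per_gpu               # GPU owning that expert

remote = float((home_gpu != dest_gpu).mean())
print(f"{remote:.0%} of tokens cross the all-to-all")  # ~ (n_gpus-1)/n_gpus
```

With 8 GPUs, roughly 7/8 of tokens are remote — which is why expert-aware placement (co-locating frequently co-activated experts) is worth real engineering effort.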

07. Dense vs MoE tradeoffs

Aspect                  Dense      MoE
Compute per token       High       Lower (sparse)
Memory footprint        Lower      Higher (all experts loaded)
Communication overhead  Lower      Higher (all-to-all)
Training stability      Stable     Needs careful tuning
Serving complexity      Simple     High (expert placement)
Quality per FLOP        Baseline   Better (same compute)
When NOT to use MoE: bandwidth-constrained clusters (all-to-all kills you), latency-sensitive serving (routing overhead adds up), small-scale training (complexity not worth it below ~50B equivalent), or when you lack the engineering bandwidth to tune load balancing carefully.

08. Real-world system thinking: cluster topology matters

Parallelism choices aren't made in a vacuum — they're made against a specific cluster topology. Two systems with the same GPU count can have radically different optimal strategies depending on interconnect.

Scheduling and placement matter too. In an MoE system, how you assign experts to GPUs determines expected routing distance. Co-locating frequently co-activated experts reduces all-to-all traffic — inference serving systems like DeepSpeed-MoE implement expert-aware placement for exactly this reason.

The bottleneck shifts: in dense training, the network bottleneck is gradient all-reduce (DP). In MoE training, it becomes the expert routing all-to-all. Optimizing for one doesn't optimize the other.

09. Putting it all together

A realistic setup for training a 100B+ MoE model: tensor parallelism within each node, pipeline parallelism across nodes, expert parallelism sharding experts across GPUs, and data parallelism across the remaining replicas.

This is 4D parallelism. Getting the balance right — bubble overhead vs all-to-all latency vs gradient sync frequency vs memory per GPU — determines whether you hit 40% MFU or 60%+. That gap is the difference between a model training in 3 months vs 5 months.

10. Conclusion

Modern LLM scaling is not "more parameters → better models." It's "better systems → more efficient use of the compute we have."

Parallelism strategies exist to fight memory limits and communication overhead. MoE exists to decouple parameter count from compute. But MoE creates its own communication and stability problems that require careful engineering to resolve.

Engineers need to understand why each technique exists — not just what it does — in order to debug a training run stuck at 35% MFU and raise it to 55%. That's where the real work is.

The real skill: knowing which axis of parallelism is your bottleneck, given your hardware topology and your model architecture. Everything else follows from that.