Training doesn't fail because of FLOPs. It fails because of memory limits, communication overhead, and load imbalance. Modern LLM scaling is no longer a compute problem — it's a communication and systems problem.
01. Why parallelism exists
The naive mental model of LLM training is: more GPUs = faster training. Reality is messier. Before you even think about scaling, you're hitting a wall on a single GPU.
Consider a 70B parameter model in BF16. Just storing the weights takes ~140 GB. Add optimizer states (Adam stores first and second moments — that's 2× parameters in FP32), and you're north of 700 GB. An A100 has 80 GB of HBM. This isn't a compute problem. It's a memory problem.
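The arithmetic behind that wall fits in a few lines. Byte counts below assume BF16 weights and gradients with FP32 Adam moments, and deliberately ignore activations and any FP32 master copy of the weights (both only make things worse):

```python
params = 70e9
GB = 1e9

weights = params * 2       # BF16 weights, 2 bytes each      -> 140 GB
grads   = params * 2       # BF16 gradients, same size       -> 140 GB
moments = params * 4 * 2   # Adam first + second moment, FP32 -> 560 GB

total = (weights + grads + moments) / GB
print(f"total training state: {total:.0f} GB")
print(f"A100s (80 GB) just to hold it: {total / 80:.1f}")
```

Even before a single activation is materialized, the state alone needs double-digit numbers of GPUs.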
There are two regimes to understand:
- Memory-bound: the bottleneck is loading weights/activations from HBM. Compute sits idle waiting for data.
- Compute-bound: the bottleneck is arithmetic throughput. Memory bandwidth is fine, ALUs are saturated.
Large batch sizes push you toward compute-bound. Small batches and large models push you toward memory-bound. Most large-scale training sits somewhere in between — which is why the parallelism strategy you choose matters so much.
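A rough roofline check makes the two regimes concrete. The hardware numbers here are illustrative A100 figures (312 TFLOP/s peak BF16, ~2 TB/s HBM), and the byte count is a simplified model that assumes each operand moves between HBM and the chip exactly once:

```python
# Is a [b, d] x [d, d] matmul memory-bound or compute-bound?
def arithmetic_intensity(b, d, bytes_per_el=2):
    flops = 2 * b * d * d                             # multiply-accumulates
    bytes_moved = bytes_per_el * (b*d + d*d + b*d)    # read A, read W, write out
    return flops / bytes_moved

# Ridge point: intensity above which the A100 becomes compute-bound (~156 FLOPs/byte)
ridge = 312e12 / 2.0e12

for b in (1, 64, 4096):
    ai = arithmetic_intensity(b, 8192)
    regime = "compute-bound" if ai > ridge else "memory-bound"
    print(f"batch {b:5d}: intensity {ai:8.1f} -> {regime}")
```

At batch 1 the intensity is about 1 FLOP per byte: the ALUs starve. Only at large batch does the matmul cross the ridge, which is exactly the "large batches push you compute-bound" claim above.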
02. Data parallelism, tensor parallelism, and pipeline parallelism
There are three fundamental axes of parallelism. Each attacks a different dimension of the problem.
Data parallelism (DP)
The simplest idea: replicate the full model on each GPU, split the batch across them, and synchronize gradients after each backward pass via an all-reduce. Ring all-reduce is the standard algorithm: GPUs form a ring, each sending and receiving chunks in a pipeline. Each GPU transfers 2(N-1)/N × model_size bytes in total. It's bandwidth-efficient, but as models grow, gradient sync cost becomes significant. At 70B parameters, you're synchronizing hundreds of GBs of gradients every step.
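Plugging numbers into the 2(N-1)/N factor, assuming BF16 gradients (some setups reduce in FP32, which doubles this):

```python
# Bytes each GPU sends during a ring all-reduce of the gradients.
def ring_allreduce_bytes(model_params, n_gpus, bytes_per_grad=2):
    payload = model_params * bytes_per_grad
    return 2 * (n_gpus - 1) / n_gpus * payload

gb = ring_allreduce_bytes(70e9, 64) / 1e9
print(f"{gb:.0f} GB per GPU per step")
```

Roughly 2× the gradient payload crosses every link, every optimizer step, regardless of ring size. That is the tax DP pays for its simplicity.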
Tensor parallelism (TP)
TP splits individual weight matrices across GPUs. A transformer's FFN is essentially two large matmuls. You split the first weight matrix column-wise across GPUs, compute in parallel, and split the second row-wise. Each GPU holds a slice of the weights.
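A toy NumPy sketch shows why this particular pairing of splits works: the column split of the first matrix lines up with the row split of the second, so each shard produces a partial sum of the output, and a single all-reduce (here, a plain sum) recovers the exact unsharded result. ReLU stands in for whatever activation the real FFN uses:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, tp = 8, 32, 4
x  = rng.normal(size=(2, d))        # token activations
W1 = rng.normal(size=(d, d_ff))     # first FFN matmul
W2 = rng.normal(size=(d_ff, d))     # second FFN matmul

full = np.maximum(x @ W1, 0) @ W2   # unsharded reference

partials = []
for r in range(tp):
    W1_shard = W1[:, r*d_ff//tp:(r+1)*d_ff//tp]   # column slice of W1
    W2_shard = W2[r*d_ff//tp:(r+1)*d_ff//tp, :]   # matching row slice of W2
    partials.append(np.maximum(x @ W1_shard, 0) @ W2_shard)

sharded = sum(partials)             # this sum is what the all-reduce performs
assert np.allclose(full, sharded)
```

The elementwise activation is what makes this legal: ReLU applied to a column slice equals the column slice of ReLU applied to the whole matrix, so no communication is needed between the two matmuls.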
The catch: you need all-reduce or all-gather operations after every matmul to reconstruct full activations. This is frequent, fine-grained communication. TP works well within a node over NVLink (~600 GB/s), but falls apart across nodes over InfiniBand (25–400 Gb/s, i.e. roughly 3–50 GB/s). The bandwidth delta is one to two orders of magnitude.
Pipeline parallelism (PP)
PP splits layers across GPUs. GPU 0 runs layers 1–8, GPU 1 runs layers 9–16, and so on. Data flows through like an assembly line. The fundamental problem: pipeline bubbles. In a naive pipeline, GPU 1 sits idle waiting for GPU 0 to finish. With k pipeline stages, bubble fraction is (k-1)/k — at 8 stages, that's ~87.5% idle time.
Micro-batching fixes this. Instead of one large batch, you split it into m micro-batches. While GPU 1 processes micro-batch 1, GPU 0 starts on micro-batch 2. Bubbles shrink to (k-1)/(m+k-1) — large m effectively eliminates bubble overhead, at the cost of higher memory for buffered activations.
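Both bubble formulas are one-liners worth playing with:

```python
# Idle fraction of a k-stage pipeline with m micro-batches.
# m = 1 recovers the naive (k-1)/k bubble.
def bubble(k, m=1):
    return (k - 1) / (m + k - 1)

print(bubble(8))        # naive 8-stage pipeline: 87.5% idle
print(bubble(8, 32))    # 32 micro-batches: ~18% idle
print(bubble(8, 256))   # 256 micro-batches: ~2.7% idle
```

The returns diminish: going from 32 to 256 micro-batches buys far less than going from 1 to 32, while the activation memory for in-flight micro-batches keeps growing.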
03. Hybrid parallelism: what real systems actually do
No single parallelism strategy works at the scale of a 100B+ model. Real training jobs combine all three, and the combination follows from hardware topology:
- TP within a node — NVLink makes fine-grained all-gathers cheap. 8-way TP across 8 A100s on a DGX node is standard.
- PP across nodes — inter-node bandwidth is expensive. PP only passes activations at layer boundaries, so it's efficient even over InfiniBand.
- DP globally — replicate the full TP+PP setup across many node groups. Gradient all-reduce can be overlapped with compute.
04. Why dense models stop scaling — and what MoE promises
Here's the core problem with dense transformers at scale: every parameter participates in every forward pass. Compute grows linearly with parameters. You're paying full price per token, every token.
But for any given token, does the model need all parameters? Intuitively, no. A piece of code, a German sentence, a math equation — each likely activates different circuits in the network. What if we only activated a subset of parameters per token?
That's Mixture-of-Experts.
05. MoE: the core idea and the real problems
How it works
In an MoE model, the FFN layer in each transformer block is replaced by N expert FFNs and a router (gating network). For each token, the router selects the top-k experts (usually k=1 or k=2), and only those experts compute. Outputs are combined via weighted sum.
The result: a model with 8× the parameters of a dense model can cost roughly the same compute per token as that dense model. Mixtral 8×7B has 47B total parameters but activates ~13B per token. Parameter count and FLOPs-per-token are decoupled.
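A minimal top-k router sketch makes the mechanism concrete. This is not any particular library's API, just the usual shape of the idea: a single linear gate scores each token against every expert, only the top-k scores survive, and those are renormalized into mixing weights:

```python
import numpy as np

def route(tokens, gate_W, k=2):
    logits = tokens @ gate_W                        # [n_tokens, n_experts]
    top = np.argsort(logits, axis=-1)[:, -k:]       # indices of the k winners
    chosen = np.take_along_axis(logits, top, -1)    # their logits
    w = np.exp(chosen - chosen.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                   # softmax over the winners only
    return top, w                                   # which experts, with what weight

rng = np.random.default_rng(0)
experts_idx, weights = route(rng.normal(size=(4, 16)), rng.normal(size=(16, 8)))
print(experts_idx.shape, weights.shape)             # one (expert, weight) pair set per token
```

Everything outside the chosen experts contributes zero FLOPs for that token, which is the entire point.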
The real problems
- Load imbalance and routing collapse: the router can learn to send most tokens to a few favored experts, leaving the rest undertrained. Left unchecked, a handful of experts end up doing nearly all the work.
- Unbounded per-expert load: without a cap, a popular expert can receive far more tokens in a batch than it has compute and memory budgeted for.
- all-to-all communication: with experts sharded across GPUs, every GPU sends tokens to every other GPU. At scale, this is expensive and latency-sensitive.
06. How MoE systems fix these issues
Each problem has a corresponding mitigation. Understanding why each fix exists matters more than just knowing what it is.
Load balancing loss adds an auxiliary loss term that penalizes uneven expert utilization. It's a soft constraint — too much and the model routes randomly, degrading quality; too little and collapse happens anyway. Tuning this coefficient is non-trivial.
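One common form of this auxiliary term is the Switch Transformer loss, L = α · N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the router's mean probability for expert i. A sketch:

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignment, n_experts, alpha=0.01):
    # f_i: fraction of tokens actually dispatched to each expert (top-1 here)
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    # P_i: mean router probability mass on each expert
    P = router_probs.mean(axis=0)
    return alpha * n_experts * np.sum(f * P)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1024, 8))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
assign = probs.argmax(-1)                     # top-1 routing decision
print(load_balance_loss(probs, assign, 8))
```

Under perfectly uniform routing the loss equals α, and any skew pushes it higher; α is exactly the coefficient the text warns about having to tune.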
Capacity factor and token dropping give each expert a fixed buffer: capacity = capacity_factor × tokens_per_batch / num_experts. If an expert's buffer is full, overflow tokens are dropped: they bypass the expert and carry only the residual connection's contribution. This bounds compute and memory, but dropped tokens mean information loss.
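In numbers, using the formula above with a capacity factor of 1.25 (a commonly used value; the exact number is a tuning choice):

```python
def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    return int(capacity_factor * tokens_per_batch / num_experts)

cap = expert_capacity(4096, 8)      # 640 token slots per expert
print(cap)

# If the router sends 900 tokens to one hot expert, the overflow is dropped:
routed = 900
dropped = max(0, routed - cap)
print(dropped)                      # these tokens skip the FFN entirely
```

The capacity factor is a direct trade: raise it and you waste padded compute on cold experts; lower it and you drop more tokens from hot ones.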
Expert parallelism distributes experts across GPUs — each GPU owns a subset. The all-to-all for token routing is now the key communication primitive. You're combining TP + PP + DP + EP, each tuned to cluster topology.
07. Dense vs MoE tradeoffs
| Aspect | Dense | MoE |
|---|---|---|
| Compute per token | High | Lower (sparse) |
| Memory footprint | Lower | Higher (all experts loaded) |
| Communication overhead | Lower | Higher (all-to-all) |
| Training stability | Stable | Needs careful tuning |
| Serving complexity | Simple | High (expert placement) |
| Quality per FLOP | Baseline | Better (same compute) |
08. Real-world system thinking: cluster topology matters
Parallelism choices aren't made in a vacuum — they're made against a specific cluster topology. Two systems with the same GPU count can have radically different optimal strategies depending on interconnect.
- Intra-node (NVLink, NVSwitch): ~600 GB/s bidirectional. Fine-grained communication (TP's frequent all-gathers) is tolerable. MoE's all-to-all is manageable within a node.
- Inter-node (InfiniBand): 25–400 Gb/s, one to two orders of magnitude slower than NVLink. TP across nodes is almost always wrong. PP minimizes inter-node traffic to activation transfers only. MoE's all-to-all across nodes is the single biggest bottleneck in MoE at scale.
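The gap is easy to quantify. Using the peak figures above, and ignoring latency (which makes the inter-node case look even worse):

```python
# Time to move 1 GB of activations over each class of link.
payload = 1e9                    # 1 GB
nvlink_bps = 600e9               # ~600 GB/s intra-node (NVLink/NVSwitch)
ib_bps = 400e9 / 8               # 400 Gb/s InfiniBand = 50 GB/s

print(f"NVLink:     {payload / nvlink_bps * 1e3:.2f} ms")
print(f"InfiniBand: {payload / ib_bps * 1e3:.2f} ms")
```

A communication pattern that fires after every matmul can tolerate a couple of milliseconds per transfer; it cannot tolerate ten times that, which is why TP stays inside the node.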
Scheduling and placement matter too. In an MoE system, how you assign experts to GPUs determines expected routing distance. Co-locating frequently co-activated experts reduces all-to-all traffic — inference serving systems like DeepSpeed-MoE implement expert-aware placement for exactly this reason.
09. Putting it all together
A realistic setup for training a 100B+ MoE model:
- 8-way TP within each node — NVLink handles frequent all-gather for tensor-split matmuls.
- PP across nodes — layers distributed across node groups. Micro-batching keeps the pipeline full.
- Expert parallelism across nodes — each node group owns a subset of experts. The all-to-all for routing is the main inter-node cost.
- DP globally — multiple copies of the full TP+PP+EP setup. Gradient all-reduce overlapped with compute during the backward pass.
This is 4D parallelism. Getting the balance right (bubble overhead vs all-to-all latency vs gradient sync frequency vs memory per GPU) determines whether you hit 40% MFU or 60%+. That gap is the difference between a training run that takes three months and one that takes four and a half.
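The wall-clock stakes can be sketched with the standard ~6·P·D estimate for training FLOPs. The cluster size and token budget below are assumptions for illustration, not figures from any particular run:

```python
# 100B-parameter model, 2T training tokens, 1024 A100s (312 TFLOP/s BF16 peak).
total_flops = 6 * 100e9 * 2e12
cluster_peak = 1024 * 312e12

for mfu in (0.40, 0.60):
    days = total_flops / (cluster_peak * mfu) / 86400
    print(f"MFU {mfu:.0%}: {days:.0f} days")
```

The ratio is exactly 1.5×: every point of MFU you recover comes straight off the calendar (and the cloud bill).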
10. Conclusion
Modern LLM scaling is not "more parameters → better models." It's "better systems → more efficient use of the compute we have."
Parallelism strategies exist to fight memory limits and communication overhead. MoE exists to decouple parameter count from compute. But MoE creates its own communication and stability problems that require careful engineering to resolve.
Engineers need to understand why each technique exists, not just what it does, in order to take a training run stuck at 35% MFU and raise it to 55%. That's where the real work is.