sharpbyte.dev

Transformer vs. Mixture of Experts

MoE models can be huge on paper but only activate a slice of parameters per token—scaling capacity without paying full dense cost every time.

Dense Transformers recap

In a standard (dense) Transformer block, every token passes through the same feed-forward network. All parameters participate for every token. That is simple to reason about, but compute per token grows in proportion to the parameter count.

At trillion-parameter scale, running every weight on every token becomes prohibitively expensive.

What MoE changes

Mixture-of-Experts (MoE) replaces the block's one large feed-forward network with several smaller expert networks.

A router (gating network) scores every expert for each token and activates only the top-K (often 1 or 2). The outputs of the selected experts are combined, weighted by their router scores.
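To make the routing step concrete, here is a minimal pure-Python sketch of top-K gating for a single token. The function names (`route_top_k`, `moe_layer`), the toy experts, and the choice to renormalize the weights over the selected experts are illustrative assumptions, not a description of any particular model's implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of router logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for one token (hypothetical helper)."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize so the k selected weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)  # the other experts never run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy experts: each one scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]
chosen = route_top_k(logits, k=2)
output = moe_layer([1.0, 1.0], logits, experts, k=2)
```

With these logits, experts 1 and 3 are selected and the other two are skipped entirely, which is where the compute saving comes from.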

Total parameter count can be huge while the active parameters per token stay close to those of a much smaller dense model, one your hardware can serve.
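The gap between total and active parameters can be estimated with simple arithmetic. The sketch below counts only the feed-forward weights of one MoE layer and assumes each expert is a two-matrix MLP without biases. The layer sizes are hypothetical values chosen for illustration.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (up and down projection); biases ignored.
    return 2 * d_model * d_ff

def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts          # one linear layer producing expert scores
    total = num_experts * per_expert + router
    active = top_k * per_expert + router    # only top_k experts run per token
    return total, active

# Hypothetical sizes, for illustration only.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, num_experts=8, top_k=2)
ratio = total / active
```

Under these assumptions the layer stores roughly four times as many feed-forward parameters as it uses for any single token.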

Dense uses all weights; MoE activates a subset of experts per token.
The router sends each token to a small set of experts instead of one giant MLP.

Expert collapse and load balancing

Early in training the router may keep picking the same “lucky” expert. That expert improves, gets picked again, and the others starve. This feedback loop is called expert collapse. Quality suffers because most of the experts barely train.

Teams add auxiliary load-balancing losses, capacity factors, and careful initialization so tokens spread across experts.
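One common form of auxiliary loss, in the style of the Switch Transformer, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert, and sums over experts. The sketch below assumes top-1 routing and omits the small scaling coefficient normally applied before adding this term to the training loss. The function name and the toy inputs are illustrative.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i(f_i * P_i).

    router_probs: per-token probability lists, one entry per expert.
    assignments:  the expert index each token was dispatched to (top-1).
    """
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    frac_tokens = [0.0] * num_experts
    for e in assignments:
        frac_tokens[e] += 1.0 / n_tokens
    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens
                 for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

# Perfectly balanced routing: the loss reaches its minimum of 1.0.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)

# Collapsed routing: every token goes to expert 0 with high confidence.
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], 4)
```

The collapsed case scores close to the number of experts, so minimizing this term pushes the router back toward a uniform spread.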

Without balancing, one expert can dominate while others barely train.

Serving and memory realities

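Even though only a few experts run per token, any expert may be chosen for the next token, so you usually still need all experts resident in GPU memory (or offloaded to CPU memory or disk, at a latency cost). Routing also adds latency variance: a batch in which many tokens hit the same expert is bottlenecked by that expert.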

MoE is a systems problem as much as an architecture choice: plan for sharding, expert parallelism, and monitoring utilization.
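As a rough illustration of what monitoring utilization can look like, the sketch below counts tokens per expert in one batch and compares each count against a per-expert capacity derived from a capacity factor. The helper names, the default factor of 1.25, and the toy batch are assumptions for the example. Real systems differ in how they handle tokens over capacity.

```python
import math
from collections import Counter

def expert_capacity(num_tokens, num_experts, top_k=1, capacity_factor=1.25):
    # Slots per expert: the even share of routed tokens, padded by the factor.
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def utilization_report(assignments, num_experts, capacity):
    """Summarize per-expert load for one batch (hypothetical helper)."""
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    overflow = [max(0, n - capacity) for n in loads]  # tokens past capacity
    mean_load = len(assignments) / num_experts
    return {
        "loads": loads,
        "overflow": overflow,
        "max_over_mean": max(loads) / mean_load,  # 1.0 means perfectly even
    }

# A skewed batch of 8 tokens over 4 experts: expert 0 is "hot".
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
capacity = expert_capacity(num_tokens=8, num_experts=4)
report = utilization_report(assignments, num_experts=4, capacity=capacity)
```

A `max_over_mean` well above 1.0, or a nonzero overflow, flags the kind of batch that causes latency spikes or forces tokens to be dropped or rerouted.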

Load-balancing losses, capacity limits, and distributed serving mitigate common MoE pitfalls.

Key takeaways

  • MoE increases capacity without activating every parameter on every token.
  • Routers decide which experts run, so how well tokens are routed affects both serving speed and model quality.
  • Training and serving MoE need load balancing and memory planning.