MoE models can be huge on paper but only activate a slice of parameters per token—scaling capacity without paying full dense cost every time.
In a standard (dense) Transformer block, every token passes through the same feed-forward network. All parameters participate for every token, which is simple to reason about, but per-token compute grows in step with the parameter count.
At trillion-parameter ambitions, running every weight on every token becomes prohibitively expensive.
Mixture-of-Experts (MoE) replaces the one large feed-forward network with several smaller expert networks.
A router (gating network) scores which experts should handle each token and activates only the top-K (often 1–2).
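The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the names `route_top_k` and `moe_forward` are hypothetical, the router scores are passed in directly (in a real model they would come from a small learned linear layer applied to the token), and the gate weights are renormalized over the chosen experts, which is one common convention assumed here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for one token.

    Gate weights are the softmax probabilities of the k chosen experts,
    renormalized so they sum to 1 (an assumed convention).
    """
    probs = softmax(router_logits)
    # Indices of the k highest-probability experts, best first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)  # the unselected experts never run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy experts: each one just scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token
chosen = route_top_k(logits, k=2)
output = moe_forward([1.0, 1.0], logits, experts, k=2)
```

With these made-up scores, only experts 1 and 3 are evaluated; the other two contribute no compute for this token.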
Total parameter count can be huge while active parameters per token stay closer to a dense model your hardware can serve.
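A quick back-of-the-envelope calculation makes the gap between total and active parameters concrete. The sizes below are hypothetical, chosen only for illustration, and the sketch counts a single MoE layer's feed-forward weights (two weight matrices per expert, biases ignored) plus a linear router.

```python
def ffn_params(d_model, d_ff):
    """Weights in a two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff

def moe_param_counts(d_model, d_ff, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # one linear layer: a score per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active

# Hypothetical layer: 8 experts, 2 active per token.
total, active = moe_param_counts(d_model=4096, d_ff=16384, num_experts=8, top_k=2)
print(f"total:  {total:,}")
print(f"active: {active:,}")
print(f"active fraction: {active / total:.2%}")
```

In this toy configuration the layer stores roughly four times as many parameters as any single token actually uses.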
Early in training the router may keep picking the same “lucky” expert. That expert improves, gets picked again, and the others starve. This failure mode is called expert collapse, and quality suffers because most of the model never trains.
Teams add auxiliary load-balancing losses, capacity factors, and careful initialization so tokens spread across experts.
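As a rough sketch of what an auxiliary load-balancing loss can look like, here is one common formulation (the form used in the Switch Transformer line of work): the number of experts times the sum, over experts, of the fraction of tokens sent to that expert multiplied by the mean router probability for it. The function name and the toy inputs are hypothetical, and top-1 routing is assumed for simplicity.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i f_i * P_i.

    router_probs: one list of probabilities over experts per token.
    assignments:  the expert index each token was routed to (top-1 assumed).
    f_i is the fraction of tokens sent to expert i; P_i is the mean router
    probability for expert i. The value is 1.0 for a perfectly uniform
    spread and grows as routing concentrates on fewer experts.
    """
    n_tokens = len(assignments)
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    p = [sum(tok[i] for tok in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing across 2 experts.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
```

In training, a term like this would typically be multiplied by a small coefficient and added to the main loss; in an autodiff framework the gradient flows through the probabilities, since the token counts themselves are not differentiable.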
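Even though only a few experts run per token, you usually still need to keep every expert's weights available, whether resident in GPU memory or offloaded to slower storage, because any token may be routed to any expert. Routing also adds latency variance: a batch in which many tokens hit the same expert can cause a latency spike.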
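To see how a skewed batch shows up in practice, the sketch below counts tokens per expert and compares each count with a per-expert capacity derived from a capacity factor. The helper names, the default factor of 1.25, and the toy assignments are all assumptions for illustration; how overflow tokens are handled (dropped, rerouted, or buffered) differs between systems and is not modeled here.

```python
import math
from collections import Counter

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Max tokens one expert accepts per batch under a capacity factor."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

def batch_load_report(assignments, num_experts, capacity_factor=1.25):
    """Summarize how evenly one batch's tokens landed on the experts."""
    counts = Counter(assignments)
    cap = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [counts.get(i, 0) for i in range(num_experts)]
    mean_load = len(assignments) / num_experts
    return {
        "capacity": cap,
        "load": load,
        # Tokens beyond capacity on each expert.
        "overflow": [max(0, n - cap) for n in load],
        # Busiest expert relative to a perfectly even split.
        "max_over_mean": max(load) / mean_load,
    }

# A skewed toy batch: 6 of 8 tokens routed to expert 0.
report = batch_load_report([0, 0, 0, 0, 0, 0, 1, 2], num_experts=4)
print(report)
```

Because a step generally cannot finish until the busiest expert does, a ratio like `max_over_mean` is a crude but useful signal to track alongside per-expert utilization.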
MoE is a systems problem as much as an architecture choice: plan for sharding, expert parallelism, and monitoring utilization.