minibatch 0
tokens
…
do forward for layer n attn and router (no experts)
do forward for layer n experts 0–47 on tokens routed to them
continue and repeat
for each layer
…
minibatch 0
output
minibatch 1
tokens
…
do forward for layer n attn and router (no experts)
do forward for layer n experts 48–95 on tokens routed to them
continue and repeat
for each layer
…
minibatch 1
output
···
router output decides which experts are used for each token
send token activations to the GPU(s) with the appropriate experts
send expert results back to the GPU that originally had the token
minibatch 7
tokens
…
do forward for layer n attn and router (no experts)
do forward for layer n experts 336–383 on tokens routed to them
continue and repeat
for each layer
…
minibatch 7
output
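
The dispatch and return steps in the diagram can be sketched as a small single-process simulation. This is only an illustration under stated assumptions, not the actual implementation: it assumes 8 GPUs with 48 contiguous experts each (as the expert ranges above suggest), treats each token activation as a single float, and uses a hypothetical `toy_expert` and an unweighted sum of expert outputs in place of real expert MLPs and router weights. No real inter-GPU communication happens here; the `inbox` lists stand in for it.

```python
NUM_GPUS = 8
NUM_EXPERTS = 384
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 48, matching ranges such as 0-47 and 48-95


def owner_gpu(expert_id):
    """GPU that holds a given expert, assuming contiguous blocks of 48."""
    return expert_id // EXPERTS_PER_GPU


def toy_expert(expert_id, x):
    """Hypothetical stand-in for an expert MLP (assumption, not from the source)."""
    return x * (expert_id + 1)


def moe_layer(activations, routing):
    """Simulate one layer's expert step.

    activations[g][t] is the activation of token t in the minibatch on GPU g.
    routing[g][t] is the list of expert ids the router chose for that token.
    """
    # Step 1: send each token activation to the GPU(s) holding its experts.
    inbox = [[] for _ in range(NUM_GPUS)]
    for src in range(NUM_GPUS):
        for t, x in enumerate(activations[src]):
            for e in routing[src][t]:
                inbox[owner_gpu(e)].append((src, t, e, x))

    # Step 2: each GPU runs only its own experts on the tokens it received,
    # then the result goes back to the GPU that originally had the token.
    out = [[0.0] * len(tokens) for tokens in activations]
    for dst in range(NUM_GPUS):
        for src, t, e, x in inbox[dst]:
            out[src][t] += toy_expert(e, x)  # unweighted sum: an assumption
    return out


# Two tokens on GPU 0; the other GPUs hold empty minibatches in this toy run.
acts = [[1.0, 2.0]] + [[] for _ in range(NUM_GPUS - 1)]
route = [[[0, 383], [48]]] + [[] for _ in range(NUM_GPUS - 1)]
result = moe_layer(acts, route)
```

In this toy run the first token visits expert 0 (on GPU 0) and expert 383 (on GPU 7), while the second visits expert 48 (on GPU 1), so both tokens leave their home GPU and come back, as in the diagram. In a real system these two steps would typically be collective all-to-all exchanges rather than Python lists.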