[Figure: expert parallelism across 8 GPUs — 384 experts per layer, sharded 48 per GPU]

GPU 0 holds all non-expert parameters plus experts 0–47 of every layer and processes minibatch 0; GPU 1 holds all non-expert parameters plus experts 48–95 and processes minibatch 1; and so on up to GPU 7, which holds all non-expert parameters plus experts 336–383 and processes minibatch 7.

For each layer n, every GPU:
  1. runs the layer-n attention and router on its own minibatch's tokens (no experts involved yet);
  2. uses the router output, which decides which experts are used for each token, to send token activations to the GPU(s) holding the appropriate experts;
  3. runs its local experts (e.g. experts 0–47 on GPU 0, experts 336–383 on GPU 7) on the tokens routed to it;
  4. sends the expert results back to the GPU that originally had each token.

This continues and repeats for each layer; at the end, each GPU produces the output for its own minibatch.