[Figure: expert parallelism across 8 GPUs — 384 experts per layer, sharded 48 per GPU]

GPU 0 holds all non-expert parameters plus experts 0–47 of every layer and processes minibatch 0; GPU 1 holds all non-expert parameters plus experts 48–95 and processes minibatch 1; and so on up to GPU 7, which holds all non-expert parameters plus experts 336–383 and processes minibatch 7.

For each layer n, every GPU:
  1. runs the layer-n attention and router on its own minibatch's tokens (no experts involved yet);
  2. uses the router output, which decides which experts are used for each token, to send token activations to the GPU(s) holding the appropriate experts;
  3. runs its local experts (e.g. experts 0–47 on GPU 0, experts 336–383 on GPU 7) on the tokens routed to it;
  4. sends the expert results back to the GPU that originally had each token.

This continues and repeats for each layer; at the end, each GPU produces the output for its own minibatch.