[Figure: pipeline-parallel forward pass. Tokens enter GPU 0, which holds layers 0–7 and runs the forward pass for them. GPU 1 holds layers 8–15, and so on up to GPU 7, which holds layers 56–61. After finishing its layers, each GPU sends its activations to the next GPU.]
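
The layout in the figure can be sketched in a few lines of plain Python. This is a minimal illustration, not a real multi-GPU implementation. The function names partition_layers and pipeline_forward and the toy apply_layer callback are hypothetical. The sketch also assumes 62 layers in total (0–61) and that the GPUs elided in the figure follow the same contiguous 8-layer pattern as GPU 0 and GPU 1.

```python
# Illustrative sketch only: lists of layer indices stand in for GPUs,
# and a callback stands in for running one transformer layer.

def partition_layers(num_layers, num_stages):
    """Split layer indices into contiguous chunks, one chunk per stage (GPU)."""
    per_stage = -(-num_layers // num_stages)  # ceiling division
    return [
        list(range(start, min(start + per_stage, num_layers)))
        for start in range(0, num_layers, per_stage)
    ]


def pipeline_forward(tokens, stages, apply_layer):
    """Run each stage's layers in order, handing activations to the next stage."""
    activations = tokens
    for layer_ids in stages:
        for layer_id in layer_ids:
            activations = apply_layer(layer_id, activations)
        # In a real system, this is the point where the activations
        # would be sent to the next GPU.
    return activations


# Assumption: 62 layers over 8 GPUs, matching the ranges shown in the figure.
stages = partition_layers(num_layers=62, num_stages=8)

# Toy "layer" that just records which layer index touched the activations.
visited = pipeline_forward([], stages, lambda layer_id, acts: acts + [layer_id])
```

With these assumptions, stages[0] covers layers 0–7 and stages[7] covers layers 56–61, so the last GPU holds fewer layers than the others, as in the figure.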