System Info
The Llama4 model family adopts an MoE layer implementation for better efficiency. However, in the current implementation the MoE layer in fact performs an ordinary dense FFN forward pass, with all experts involved in the computation. One can see that the gate_up_proj matrix has the same shape as if all num_experts were active.

I guess the intent was to perform the computation only for the experts selected by the router.
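For illustration, here is a minimal PyTorch sketch of the dense formulation described above. It is not the transformers source; the module name DenseMoESketch and the parameter layout (a fused gate_up_proj of shape (num_experts, hidden_size, 2 * expert_dim)) are assumptions chosen to mirror the report. Every expert processes every token, and the router weights only scale the per-expert outputs.

```python
# Minimal sketch (not the transformers source) of a "dense" MoE forward pass
# with a Llama4-style fused expert weight layout. All names and shapes here
# are assumptions chosen to mirror the report above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseMoESketch(nn.Module):
    """Every expert processes every token; the router only scales outputs."""

    def __init__(self, hidden_size: int, expert_dim: int, num_experts: int, top_k: int = 1):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Fused gate/up projection held for *all* experts at once:
        # shape (num_experts, hidden_size, 2 * expert_dim).
        self.gate_up_proj = nn.Parameter(
            torch.randn(num_experts, hidden_size, 2 * expert_dim) * 0.02
        )
        self.down_proj = nn.Parameter(
            torch.randn(num_experts, expert_dim, hidden_size) * 0.02
        )
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_size)
        logits = self.router(hidden_states)                        # (tokens, E)
        top_vals, top_idx = torch.topk(logits, self.top_k, dim=-1)
        # Zero gate for non-selected experts, sigmoid score for selected ones.
        gates = torch.zeros_like(logits).scatter_(1, top_idx, torch.sigmoid(top_vals))

        # "Expand the inputs so that all experts view all inputs": broadcasting
        # (tokens, hidden) against (E, hidden, 2*expert_dim) runs every expert
        # on every token -> (E, tokens, 2*expert_dim).
        gate_up = torch.matmul(hidden_states, self.gate_up_proj)
        gate, up = gate_up.chunk(2, dim=-1)
        expert_out = torch.matmul(F.silu(gate) * up, self.down_proj)  # (E, tokens, hidden)

        # Non-selected experts contribute zero because their gate is zero,
        # but their FLOPs were still spent: this is the dense formulation.
        return (gates.t().unsqueeze(-1) * expert_out).sum(dim=0)


if __name__ == "__main__":
    moe = DenseMoESketch(hidden_size=64, expert_dim=128, num_experts=8, top_k=1)
    out = moe(torch.randn(10, 64))
    print(out.shape)               # torch.Size([10, 64])
    print(moe.gate_up_proj.shape)  # weights sized for all 8 experts
```

Running the sketch shows a gate_up_proj sized for all eight experts even though only top_k of them receive a non-zero gate per token, which matches the observation above.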
Who can help?
@ArthurZucker
Reproduction
Any usage of the model
Expected behavior
Only the experts chosen by the router should be involved in the computation.
Hey! There are different formulations of MoE; we went with the one that requires a bit more memory but is more time-efficient. We use tensor parallelism to alleviate some of the problems, and we expand the inputs so that all experts see all inputs. It's not highly optimized, but we'll add a better implementation soon!
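For contrast, below is an equally rough sketch of the routed formulation the report expected, where each expert only processes the tokens assigned to it. The function routed_moe_forward and its arguments are illustrative assumptions, not the transformers API.

```python
# Toy sketch of the routed ("sparse") alternative: each expert processes only
# the tokens assigned to it. Function and argument names are illustrative
# assumptions, not the transformers API.
import torch
import torch.nn.functional as F


def routed_moe_forward(hidden_states, router, gate_up_proj, down_proj, top_k=1):
    """hidden_states: (tokens, hidden); gate_up_proj: (E, hidden, 2*expert_dim);
    down_proj: (E, expert_dim, hidden); router: callable returning (tokens, E) logits."""
    num_experts = gate_up_proj.shape[0]
    logits = router(hidden_states)                          # (tokens, E)
    top_vals, top_idx = torch.topk(logits, top_k, dim=-1)   # (tokens, top_k)
    gates = torch.sigmoid(top_vals)

    out = torch.zeros_like(hidden_states)
    for e in range(num_experts):
        token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue  # this expert received no tokens and does no work
        x = hidden_states[token_ids]                         # gather routed tokens only
        gate, up = (x @ gate_up_proj[e]).chunk(2, dim=-1)
        y = (F.silu(gate) * up) @ down_proj[e]
        out.index_add_(0, token_ids, gates[token_ids, slot].unsqueeze(-1) * y)
    return out


if __name__ == "__main__":
    import torch.nn as nn
    E, H, D = 8, 64, 128
    router = nn.Linear(H, E, bias=False)
    gate_up = torch.randn(E, H, 2 * D) * 0.02
    down = torch.randn(E, D, H) * 0.02
    print(routed_moe_forward(torch.randn(10, H), router, gate_up, down).shape)  # (10, 64)
```

Relative to the dense sketch above, per-token compute scales with top_k rather than num_experts, but the data-dependent gathers are harder to batch and parallelize, which is the time-versus-memory trade-off the reply describes.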