System Info
The Llama4 model family adopts an MoE layer implementation for better efficiency. However, in the current implementation the MoE layer in fact performs an ordinary dense FFN forward pass, with all experts involved in the computation. One can see that the gate_up_proj matrix has the same shape as if all num_experts were active.

I guess the intent was to perform the computation only for the experts selected by the router.
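For illustration, here is a minimal PyTorch sketch of the dense formulation described above. It is not the transformers source; the module name DenseMoESketch and the parameter layout (a fused gate_up_proj of shape (num_experts, hidden_size, 2 * expert_dim)) are assumptions chosen to mirror the report. Every expert processes every token, and the router weights only scale the per-expert outputs.

```python
# Minimal sketch (not the transformers source) of a "dense" MoE forward pass
# with a Llama4-style fused expert weight layout. All names and shapes here
# are assumptions chosen to mirror the report above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseMoESketch(nn.Module):
    """Every expert processes every token; the router only scales outputs."""

    def __init__(self, hidden_size: int, expert_dim: int, num_experts: int, top_k: int = 1):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Fused gate/up projection held for *all* experts at once:
        # shape (num_experts, hidden_size, 2 * expert_dim).
        self.gate_up_proj = nn.Parameter(
            torch.randn(num_experts, hidden_size, 2 * expert_dim) * 0.02
        )
        self.down_proj = nn.Parameter(
            torch.randn(num_experts, expert_dim, hidden_size) * 0.02
        )
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_size)
        logits = self.router(hidden_states)                        # (tokens, E)
        top_vals, top_idx = torch.topk(logits, self.top_k, dim=-1)
        # Zero gate for non-selected experts, sigmoid score for selected ones.
        gates = torch.zeros_like(logits).scatter_(1, top_idx, torch.sigmoid(top_vals))

        # "Expand the inputs so that all experts view all inputs": broadcasting
        # (tokens, hidden) against (E, hidden, 2*expert_dim) runs every expert
        # on every token -> (E, tokens, 2*expert_dim).
        gate_up = torch.matmul(hidden_states, self.gate_up_proj)
        gate, up = gate_up.chunk(2, dim=-1)
        expert_out = torch.matmul(F.silu(gate) * up, self.down_proj)  # (E, tokens, hidden)

        # Non-selected experts contribute zero because their gate is zero,
        # but their FLOPs were still spent: this is the dense formulation.
        return (gates.t().unsqueeze(-1) * expert_out).sum(dim=0)


if __name__ == "__main__":
    moe = DenseMoESketch(hidden_size=64, expert_dim=128, num_experts=8, top_k=1)
    out = moe(torch.randn(10, 64))
    print(out.shape)               # torch.Size([10, 64])
    print(moe.gate_up_proj.shape)  # weights sized for all 8 experts
```

Running the sketch shows a gate_up_proj sized for all eight experts even though only top_k of them receive a non-zero gate per token, which matches the observation above.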
Who can help?
@ArthurZucker
Reproduction
Any usage of the model
Expected behavior
Only the experts chosen by the router should be involved in the computation.
Hey! There are different formulations of MoE; we went with the one that requires a bit more memory but is more time-efficient. We use tensor parallelism to alleviate some of the problems, and we expand the inputs so that all experts see all inputs. It's not highly optimized, but we'll add a better implementation soon!
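For contrast, below is an equally rough sketch of the routed formulation the report expected, where each expert only processes the tokens assigned to it. The function routed_moe_forward and its arguments are illustrative assumptions, not the transformers API.

```python
# Toy sketch of the routed ("sparse") alternative: each expert processes only
# the tokens assigned to it. Function and argument names are illustrative
# assumptions, not the transformers API.
import torch
import torch.nn.functional as F


def routed_moe_forward(hidden_states, router, gate_up_proj, down_proj, top_k=1):
    """hidden_states: (tokens, hidden); gate_up_proj: (E, hidden, 2*expert_dim);
    down_proj: (E, expert_dim, hidden); router: callable returning (tokens, E) logits."""
    num_experts = gate_up_proj.shape[0]
    logits = router(hidden_states)                          # (tokens, E)
    top_vals, top_idx = torch.topk(logits, top_k, dim=-1)   # (tokens, top_k)
    gates = torch.sigmoid(top_vals)

    out = torch.zeros_like(hidden_states)
    for e in range(num_experts):
        token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue  # this expert received no tokens and does no work
        x = hidden_states[token_ids]                         # gather routed tokens only
        gate, up = (x @ gate_up_proj[e]).chunk(2, dim=-1)
        y = (F.silu(gate) * up) @ down_proj[e]
        out.index_add_(0, token_ids, gates[token_ids, slot].unsqueeze(-1) * y)
    return out


if __name__ == "__main__":
    import torch.nn as nn
    E, H, D = 8, 64, 128
    router = nn.Linear(H, E, bias=False)
    gate_up = torch.randn(E, H, 2 * D) * 0.02
    down = torch.randn(E, D, H) * 0.02
    print(routed_moe_forward(torch.randn(10, H), router, gate_up, down).shape)  # (10, 64)
```

Relative to the dense sketch above, per-token compute scales with top_k rather than num_experts, but the data-dependent gathers are harder to batch and parallelize, which is the time-versus-memory trade-off the reply describes.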