[Llama4] Chunked Attention #37351
Comments
Also as a note:
Chunked attention enables a 10M context length! It discards unused cache on the go!
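This is easiest to see in the mask itself. Below is a minimal, hypothetical sketch of the chunked causal masking idea (plain PyTorch, not the transformers implementation); `chunked_causal_mask`, its arguments, and the chunk size used are illustrative only:

```python
# Illustrative sketch (not transformers code): chunked causal attention restricts
# each query to keys inside its own fixed-size chunk, so the KV cache for chunks
# that generation has already moved past can be discarded.
import torch

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """True where query position i may attend to key position j."""
    q = torch.arange(seq_len).unsqueeze(1)                # (seq_len, 1)
    k = torch.arange(seq_len).unsqueeze(0)                # (1, seq_len)
    causal = k <= q                                       # standard causal constraint
    same_chunk = (q // chunk_size) == (k // chunk_size)   # block-diagonal constraint
    return causal & same_chunk

# With chunk_size=4, position 5 only attends to positions 4..5, never 0..3,
# so the cached keys/values of the first chunk are no longer needed.
print(chunked_causal_mask(seq_len=8, chunk_size=4).int())
```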
The attention mask reminded me of packed sequences 😄 But that sounds nice! I forgot to mention that I also modified the mask padding (please see vasqu@5f9b658), otherwise smaller sequences will crash on flex attention.
Same issue here when running the sample code from the LLaMA 4 release blog; even without Flex Attention (e.g., using eager or sdpa), it still throws an error once the input length exceeds 8K.

This should be fixed with the latest patch release!
System Info
transformers version: 4.52.0.dev0 (around commit 8bbcdf5)

Who can help?
@ArthurZucker @winglian (fyi)
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Because I'm GPU poor, I modified llama4 to only have one layer and a lower hidden size.
Rough script:
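The author's exact script isn't reproduced here; the following is a hypothetical sketch of a comparable shrunken setup, assuming the Llama4TextConfig / Llama4ForCausalLM classes and the attention_chunk_size field — all values below are made up so the chunked path is reached quickly:

```python
# Hypothetical repro sketch (not the original script): a one-layer, low-hidden-size
# Llama4 with a small attention_chunk_size so inputs longer than one chunk stay cheap.
import torch
from transformers import Llama4TextConfig, Llama4ForCausalLM

config = Llama4TextConfig(
    hidden_size=64,            # shrunk so it fits on modest hardware
    intermediate_size=128,
    num_hidden_layers=1,       # single layer, as described above
    num_attention_heads=4,
    num_key_value_heads=2,
    attention_chunk_size=256,  # lowered far below the default 8192
    vocab_size=1024,
)
model = Llama4ForCausalLM(config).eval()

# Sequences longer than attention_chunk_size exercise the chunked attention mask.
input_ids = torch.randint(0, config.vocab_size, (2, 512))
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    model(input_ids=input_ids, attention_mask=attention_mask)
```

Whether this surfaces the errors below depends on the attention implementation (eager / sdpa / flex) and the installed transformers version.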
This can cause various issues, e.g.
RuntimeError: The size of tensor a (8) must match the size of tensor b (2) at non-singleton dimension 0
ValueError: block_mask was created for block_mask.shape=(2, 1, 8, tensor(8192, device='cuda:0')) but got q_len=8 and kv_len=8. (...)
- looks like more fixes for post-training llama4 #37329 (comment)

Expected behavior
Chunked attention doesn't seem to be handled correctly atm. A lot of code never reaches this territory because a fairly long context is needed to even exceed the required chunk size.