Llama4 chunked attention support #395

Open
quic-rishinr wants to merge 26 commits into add_llama4 from llama4

Conversation

quic-rishinr (Contributor)

No description provided.

quic-rishinr requested a review from ochougul as a code owner May 8, 2025 10:06
quic-rishinr requested a review from vbaddi May 8, 2025 10:06
quic-hemagnih (Contributor)

What is the plan for merging these code changes? Now that we have cut the branch for 1.20, we can plan to merge the Llama4 changes into the main branch.

ochougul and others added 2 commits May 18, 2025 17:03
quic-rishinr force-pushed the llama4 branch 2 times, most recently from e5e2218 to 8bcbdc0 on May 20, 2025 06:36
quic-rishinr and others added 7 commits May 20, 2025 15:00
@@ -929,6 +948,8 @@ def get_specializations(
     "batch_size_times_num_tiles": batch_size_times_num_tiles,
     "img_size": img_size,
     "vision_size": vision_size,
+    "chunk_length": prefill_seq_len,
+    "chunk_ctx_len": chunk_ctx_len,
In the specializations we also need the total CL (context length), right? It is needed for the NoPE layers' KV cache.
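For context, a minimal sketch of what the suggestion could look like: carry the full context length alongside the chunked values so the NoPE (global-attention) layers' KV cache can be sized, while the chunked (RoPE) layers stay bounded by the attention chunk. The `ctx_len` key and the min() rule here are assumptions for illustration, not taken from the PR.

    # Illustrative sketch only (not the PR's code).
    def build_chunked_specialization(prefill_seq_len: int, ctx_len: int, attention_chunk_size: int) -> dict:
        # Chunked (RoPE) layers never need more KV than one attention chunk.
        chunk_ctx_len = min(ctx_len, attention_chunk_size)
        return {
            "chunk_length": prefill_seq_len,  # tokens fed per prefill step
            "chunk_ctx_len": chunk_ctx_len,   # KV length for chunked layers
            "ctx_len": ctx_len,               # total CL, sized for the NoPE layers' KV
        }

    print(build_chunked_specialization(prefill_seq_len=128, ctx_len=16384, attention_chunk_size=8192))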

_, chunk_causal_mask = self._update_causal_mask(
    attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
)
causal_mask = _create_causal_mask(
    position_ids=position_ids, target_length=past_key_values.key_cache[3].shape[-2]
)
Here, instead of the hard-coded key_cache[3], can we generalize this using some config value?
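One possible direction, sketched under the assumption that the text config exposes a per-layer no_rope_layers list with falsy entries marking NoPE (global-attention) layers; the helper name is hypothetical and not part of the PR.

    # Hypothetical helper: find the first NoPE layer from the config instead of
    # hard-coding index 3. Assumes config.no_rope_layers has one entry per layer,
    # falsy for NoPE layers.
    def first_nope_layer_idx(config) -> int:
        for layer_idx, uses_rope in enumerate(config.no_rope_layers):
            if not uses_rope:  # NoPE layer -> full-context KV cache
                return layer_idx
        raise ValueError("No NoPE layer found in config")

    # The hard-coded access could then become, roughly:
    # target_length = past_key_values.key_cache[first_nope_layer_idx(self.config)].shape[-2]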

asmigosw and others added 2 commits June 10, 2025 08:25
Added Hybrid Chunked Cache for Llama4
@@ -259,6 +259,151 @@ def update3D(

return k_out, v_out

def _sliding_update(
@asmigosw As we discussed, please restructure this to reuse the hybrid cache function.
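For reference, a rough sketch of what a sliding/chunked KV update generally does (my own illustration under stated assumptions, not the PR's _sliding_update nor the hybrid cache function being referenced): append the new key/value states, then keep only the most recent window of positions.

    # Illustrative only: a generic sliding-window KV update. The real _sliding_update
    # (and the hybrid cache it should reuse) may differ, e.g. by scattering into a
    # preallocated buffer instead of concatenating.
    import torch

    def sliding_update(k_cache, v_cache, k_new, v_new, window: int):
        # Append new key/value states along the sequence axis ...
        k_out = torch.cat([k_cache, k_new], dim=-2)
        v_out = torch.cat([v_cache, v_new], dim=-2)
        # ... then keep only the last `window` positions for the chunked layer.
        return k_out[..., -window:, :], v_out[..., -window:, :]

    # Example shapes: (batch, heads, seq, head_dim)
    k_cache = torch.zeros(1, 8, 100, 64)
    v_cache = torch.zeros(1, 8, 100, 64)
    k_new = torch.randn(1, 8, 16, 64)
    v_new = torch.randn(1, 8, 16, 64)
    k_out, v_out = sliding_update(k_cache, v_cache, k_new, v_new, window=64)
    print(k_out.shape)  # torch.Size([1, 8, 64, 64])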

quic-amitraj force-pushed the add_llama4 branch 2 times, most recently from 1066b4f to d8a947a on June 10, 2025 15:11