perf(condensation) Condenser that uses cache from agent #7588

Draft
happyherp wants to merge 17 commits into main from condenser_use_cache_from_agent
Conversation

happyherp
Contributor

This is a work in progress.

  • This change is worth documenting at https://docs.all-hands.dev/
  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

End-user friendly description of the problem this fixes or functionality that this introduces.

Greatly reduces the cost of doing a condensation by using the LLM's cache.


Give a summary of what the PR does, explaining any non-trivial design decisions.


Link of any specific issues this addresses.
happyherp#14

@happyherp
Contributor Author

LLM Context Condensation and Cache

Context

I have started to be able to get things done with OpenHands + Claude 3.7. But one thing keeps happening: as soon as the conversation gets longer,

  • I run into the rate-limit. This slows me down, obviously.
  • I spend a lot of money on tokens.

But what really hits me is when an LLM condensation happens.

Cache enables long conversations

Cost of Using Cache with Anthropic API

According to Anthropic's pricing:

  • Prompt caching write: $3.75 / MTok (plain input is $3 / MTok, but I will treat them as the same here)
  • Prompt caching read: $0.30 / MTok
  • Output: $15 / MTok

This means that cached input tokens cost one-tenth as much as fresh input tokens.

Example

Imagine we start with a 10k-token initial prompt.
Then we have 5 follow-up prompts that add 1000 input tokens each.

Legend:

+   1k input cache write ~0.3 Cent
-   1k output ~1.5 Cent
#   1k input cache read ~0.03 Cent

A regular conversation might go like this

+++-  initial prompt and response
####+- follow up
######+- regular response
########+- regular response
##########++++- big response
###############+- follow up
#################+- follow up
###################+- follow up

10 * + * 0.3 = 3 Cent

79 * # * 0.03 ≈ 2.4 Cent

8 * - * 1.5 = 12 Cent

Total: ≈ 17.4 Cent

Imagine we did not have caching

(89 # or +) * 0.3 = 26.7 Cent

8 * - * 1.5 = 12 Cent

Total: 38.7 Cent

That's more than twice the price, so in this example caching really pays off. This also matches my real-life experience.
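
For anyone who wants to reproduce the arithmetic, here is a small Python sketch of the same back-of-the-envelope calculation. The prices and symbol counts are the ones from the legend and diagram above; nothing here is OpenHands code.

# Back-of-the-envelope cost comparison for the example conversation above.
# Prices are cents per 1k tokens, mirroring the legend (Anthropic's
# $3.75 / $0.30 / $15 per MTok, with input and cache write treated the same).
CACHE_WRITE = 0.3   # '+' : fresh input, written to the cache
CACHE_READ = 0.03   # '#' : input replayed from the cache
OUTPUT = 1.5        # '-' : generated output

# Symbol counts (in thousands of tokens) as used in the text above.
writes, reads, outputs = 10, 79, 8

with_cache = writes * CACHE_WRITE + reads * CACHE_READ + outputs * OUTPUT
# Without caching, every input token is billed at the full input rate.
without_cache = (writes + reads) * CACHE_WRITE + outputs * OUTPUT

print(f"with cache:    {with_cache:.1f} cent")    # ~17.4 cent
print(f"without cache: {without_cache:.1f} cent") # ~38.7 cent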

Current condensation does not use cache

Our current condensation method creates a completely new prompt, which does not take advantage of caching.

Continuing the example from above

+++-  initial prompt and response
####+- follow up
######+- regular response
########+- regular response
##########++++- big response
###############+- follow up
#################+- follow up
###################+- follow up
+++++++++++++++++++++-- condensation
###++-  conversation continues

Here a condensation reduces the context window from 21k down to 5k (of which 3k are the initial prompt).

But we pay a lot for it.

Cost of Condensation

For the condensation operation:

21 * + * 0.3 = 6.3 Cent

2 * - * 1.5 = 3 Cent

Total: 9.3 Cent

That is because we now essentially pay for every input token twice, when we could have just paid for the cached version instead (at 10% of the cost).

Condenser uses cache

We could greatly reduce the number of new input tokens for the condensation if we could use the conversation's cache.
This can easily be achieved by

  • using the exact same prompt as before, except
  • with the condensation prompt added as the last message.

That way we only pay the full price for the condensation prompt, which is currently only about 1k tokens. Everything else should be cached.

+++-  
####+- 
######+-
########+- 
##########++++- 
###############+- 
#################+-
###################+- 
#####################+-- condensation cached with prompt last
###++-  conversation continues

1 * + * 0.3 = 0.3 Cent

21 * # * 0.03 = 0.63 Cent

2 * - * 1.5 = 3 Cent

Total: 3.93 Cent

That is less than half of the price without cache (9.3 Cent).

Implementation

I have started to implement this in this PR. Right now the focus is to see whether this works as intended, rather than for the code to be perfect.

I created a new agent, LLMCacheCodeAgent, that inherits from CodeActAgent, except that it configures its own LLMAgentCacheCondenser. This was necessary because the condenser needs to create the exact same prompt using the same LLM as the agent.
I had to add build_llm_completion_params to CodeActAgent so that the condenser can reuse the agent's prompt-generation code.
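
To illustrate the wiring, here is a simplified sketch of the idea. LLMCacheCodeAgent, LLMAgentCacheCondenser and build_llm_completion_params are the names used in this PR; the method body, the CONDENSATION_PROMPT constant and the return handling are illustrative only, not the actual implementation.

# Illustrative sketch only; the real classes live in this PR.
CONDENSATION_PROMPT = (
    "I need you to condense our conversation history to make it more efficient. "
    "Reply with KEEP / REWRITE ... END-REWRITE directives."
)

class LLMAgentCacheCondenser:
    def __init__(self, agent):
        # The condenser must use the *same* LLM and prompt generation as the
        # agent, otherwise the prompt prefix (and therefore the cache) differs.
        self.agent = agent

    def condense(self, events):
        # Reuse the agent's prompt-building code so the system prompt, tools
        # and history are byte-for-byte what is already in the cache.
        params = self.agent.build_llm_completion_params(events)
        # Only the condensation instruction is new, appended as the last
        # message; everything before it should be a cache read.
        params["messages"].append({"role": "user", "content": CONDENSATION_PROMPT})
        response = self.agent.llm.completion(**params)
        return response  # downstream code parses the KEEP/REWRITE directives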

The prompt for the condensation asks the AI to use this format:

KEEP: 1
KEEP: 2
KEEP: 3
REWRITE 4 TO 15 WITH:
User asked about database schema and agent explained the tables and relationships.
END-REWRITE
KEEP: 18

I chose this format because, by referencing the messages we want to keep, we avoid having the LLM quote them, which would produce a lot of output.
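
For reference, a minimal sketch of how such a response could be parsed; this is illustrative and not the parser used in the PR.

import re

def parse_condensation(text: str) -> list[tuple]:
    """Parse KEEP / REWRITE directives from the condensation response.

    Returns ("keep", index) and ("rewrite", start, end, summary) tuples.
    Illustrative sketch, not the PR's actual parser.
    """
    directives = []
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if m := re.match(r"KEEP:\s*(\d+)$", line):
            directives.append(("keep", int(m.group(1))))
        elif m := re.match(r"REWRITE\s+(\d+)\s+TO\s+(\d+)\s+WITH:$", line):
            summary = []
            for body in lines:  # consume lines until END-REWRITE
                if body.strip() == "END-REWRITE":
                    break
                summary.append(body)
            directives.append(
                ("rewrite", int(m.group(1)), int(m.group(2)), "\n".join(summary))
            )
    return directives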

Run instructions

  • Build the backend
  • set DEBUG=1 if you want to see details
  • and run it.
  • select the LLM claude 3.7 (or another one that has caching)
    and the agent LLMAgentCacheCondenser
  • Start a conversation
  • Condensation will happen at 100 events (from llm_cache_code_agent.py) or when you put the word "CONDENSE!" in your query
  • search for "I need you to condense our conversation history to make it more efficient." in your logs directory to find the prompt_xx.log where the condensation happens

Evaluation

It would be great if we could run this against some kind of benchmark where context condensation makes sense, while recording the cost.
I would love to know how much cost this saves in practice compared to the current condensers.

@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch from 1346481 to 6db3f04 Compare March 30, 2025 11:52
@happyherp
Contributor Author

@enyst @csmith49 I would love your feedback on this, as I have seen you are familiar with the codebase.

@csmith49
Collaborator

This is great, I've been wanting to test this idea for a while and your description of the problem/solution is spot-on. I'll spend some more time digging into this in the upcoming week, but a few thoughts I can leave you with now:

  1. We know the condensation wipes the cache, and that we have to pay for it when we do the condensation and after, when the cache is being rebuilt with the new summary. Since the number of tokens would grow unbounded if there was no condensation, there's a break-even point where the condensation strategy pulls ahead in terms of cost. The earlier in the conversation we can push that point, the better -- this looks like it'll do exactly that.

  2. In practice I expect we'll want to avoid a new agent and just modify the condenser interface so that we get the desired behavior, but we can worry about that after evaluating this approach.

  3. In terms of evaluating the performance of this condenser, there are a few metrics we like to look for: cost (in dollars, tokens, and time) and performance impact (both qualitative and quantitative). I've got some notebooks from when I was testing the original implementations that look at everything but the qualitative performance impact, so I'll run an evaluation and get back to you with the results!

@enyst
Collaborator

enyst commented Mar 30, 2025

Oh, very interesting! It's definitely worth looking into, maybe we can improve this. 🤔

Just some quick thoughts: we still cache the system prompt separately, but not the first message, I believe. The system prompt is here:

cache_prompt=with_caching,

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

From what I understand, you propose to also cache the first keep_first messages. That seems absolutely correct... They will be sent all the time. We used to have the caching marker set explicitly on the initial user message too, but we have shuffled it around at some point. Now, every step, we cache the latest user/tool message here. I'm a bit confused though: doesn't that mean that they were sent to Anthropic with the cache marker the first time the agent went through them?

@happyherp
Contributor Author

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

@enyst
My understanding is that Anthropic requires a cache flag to be set on each message for it to be added to the cache. The need to specify it comes from the fact that if the flag is off, the input tokens are slightly cheaper: $3 / MTok vs. $3.75 / MTok.
So you could send a prompt where the first few messages have caching on, followed by others that have it off, when you do not expect them to be useful for caching.
OpenAI does not care; it just does caching for you, albeit at a worse discount of only 50%.
So because that is kind of an edge case that only applies to Anthropic, and the difference between regular input and cache-write input is only 25%, I decided not to worry about it and leave caching always on. We could still get some savings out of it in some situations, but I was not focusing on that.

I'm a bit confused though: doesn't that mean that they were sent to Anthropic with the cache marker, the first time the agent went through them?

From what I understand, yes: when caching is on, all messages get the caching flag, even the ones used during condensation. But the condensation was able to neither

  • use the previous cached entries, nor
  • create entries that could be useful for caching itself,
    because the beginning of the condensation prompt and the regular prompt (which includes tool-use instructions) are different. They have a different prefix (that's what Anthropic calls it), so caching does not happen. That is what I changed by moving the condensation prompt to the end, while keeping everything else the same.

@csmith49

We know the condensation wipes the cache, and that we have to pay for it when we do the condensation and after, when the cache is being rebuilt with the new summary. Since the number of tokens would grow unbounded if there was no condensation, there's a break-even point where the condensation strategy pulls ahead in terms of cost. The earlier in the conversation we can push that point, the better -- this looks like it'll do exactly that.

Yes. I think there is a whole art to how and when you create a summary, which must be balanced against the cache. I agree that we probably want to do it a lot earlier than after 100 messages. I believe it might even be a good idea to request a condensation of just the last observation if it is above a certain size; that way, we could still reuse the cache of the conversation so far. Doing this consistently would keep all the huge observations out of the context window. I am looking at you, translation.json and poetry run pytest.
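
A rough sketch of that last-observation heuristic, purely to illustrate the idea; the threshold and helper names are made up:

MAX_OBSERVATION_TOKENS = 4_000  # made-up threshold

def maybe_condense_last_observation(events, count_tokens, summarize):
    # Only touch the tail of the prompt: every earlier message keeps its
    # position, so the provider's cached prefix stays valid.
    last = events[-1]
    if count_tokens(last) > MAX_OBSERVATION_TOKENS:
        events[-1] = summarize(last)  # e.g. ask the LLM to shrink just this one
    return events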

In practice I expect we'll want to avoid a new agent and just modify the condenser interface so that we get the desired behavior, but we can worry about that after evaluating this approach.

Yes. Otherwise it is impossible to take advantage of caching.

In terms of evaluating the performance of this condenser, there are a few metrics we like to look for: cost (in dollars, tokens, and time) and performance impact (both qualitative and quantitative). I've got some notebooks from when I was testing the original implementations that look at everything but the qualitative performance impact, so I'll run an evaluation and get back to you with the results!

👍

@enyst
Collaborator

enyst commented Mar 30, 2025

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

@enyst My understanding is that Anthropic requires a cache flag to be set on each message for it to be added to the cache.

Just to clarify what I meant here: it needs it on the last message that we want cached. It will then cache the whole prompt, which implies all the previous messages, from the beginning.

This is my understanding from Anthropic's documentation. For example,
https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#prompt-caching-examples

The cache_control parameter is placed on the system message to designate it as part of the static prefix.

During each turn, we mark the final message with cache_control so the conversation can be incrementally cached. The system will automatically lookup and use the longest previously cached prefix for follow-up messages.

This is what we are doing. Every step, we cache the system prompt, and the last message.
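
For concreteness, the request shape looks roughly like this with the Anthropic SDK (a sketch based on their prompt-caching docs, not OpenHands code; the model name, system prompt and message variables are placeholders):

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are OpenHands, a helpful coding agent."  # placeholder
previous_messages = []    # earlier turns, unchanged, so they hit the cached prefix
latest_user_text = "..."  # the newest user/tool message

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # any model that supports prompt caching
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Marks the static prefix (system prompt) as cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        *previous_messages,
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": latest_user_text,
                    # Marking the last message lets the cache grow incrementally.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
    ],
)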

According to Anthropic though, if the last message suddenly didn't have the marker, or it wouldn't be found because it's the first time we sent it, they would look up and use "the longest previously cached prefix".

Edited to add: That's why I was asking, doesn't it find the system message at least? If the answer is no... I'm curious why. I see, the PR changed the order... that seems smart! 🤔 How about tools?


def _build_messages_for_condensation(self, events: List[Event]) -> list[Message]:
    # Process the events into messages using the same format as the agent
    # This ensures we can take advantage of the LLM's cache
Collaborator

Interesting! I wasn't sure if the system prompt being different makes a difference.

@happyherp
Contributor Author

Hold off on your evaluation, @csmith49. The current condensation is buggy.
The wrong events are being removed, because I assumed that the list of events matches the list of messages, which is naive.

@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch from 6db3f04 to 17d691f Compare March 30, 2025 20:08
@csmith49
Collaborator

Hold off on your evaluation, @csmith49. The current condensation is buggy. The wrong events are being removed, because I assumed that the list of events matches the list of messages, which is naive.

No worries, just tag me here when you're ready for me to give it a spin.

@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch 3 times, most recently from 40a2f78 to 9938be3 Compare April 6, 2025 13:50
openhands-agent and others added 12 commits April 6, 2025 22:02
…he event from the message

fixes test_llm_agent_cache_condenser_with_state_with_dependencies
allow CondensationAction.summary_offset to be None. Which means insert at and.

keep all events that have no message

fix failed tests: Messages created for microknowlege where created, but returned

fix after rebase: View class was moved.

removed LLMAgentCacheCondenser.keep_first for simplicity
condenseWithState: apply previous condensations to events
simplified summary creation as we ignore the indices anyways
tested 2 condensations in one session

added test with no KEEP 0

make sure we have at least one user message, to prevent invalid prompt.

removed unnecessary field
added State and Agent to Condenser.condense.
Interface LLMCompletionProvider for Agents that expose their LLM and prompt generation.
CachingCondenser - Base class for Condensers that extend the prompt from the Agent to use Cache
removed agent field
param.messages IS NOT a list of Message objects.
removed import that lead to circular import.
@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch from 9938be3 to 7da4c00 Compare April 7, 2025 08:28
@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch from 89b066a to 04b319d Compare April 7, 2025 18:08
@happyherp
Contributor Author

@csmith49 I think it's now worth trying to get it to run.

Put this in the config.toml:

[agent.CodeActAgent.condenser]
type = "agentcache"
trigger_word = "!CONDENSE!"
max_size = 50

I tested it with Claude 3.7. It sometimes makes bad choices about what it remembers/forgets, but the main goal here was to avoid cache writes. So let's see if it does that properly.
