perf(condensation) Condenser that uses cache from agent #7588

Draft
happyherp wants to merge 17 commits into main from condenser_use_cache_from_agent
Conversation

happyherp
Contributor

This is a work in progress.

  • This change is worth documenting at https://docs.all-hands.dev/
  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

End-user friendly description of the problem this fixes or functionality that this introduces.

Greatly reduces the cost of doing a condensation by using the LLM's cache.


Give a summary of what the PR does, explaining any non-trivial design decisions.


Link of any specific issues this addresses.
happyherp#14

@happyherp
Contributor Author

LLM Context Condensation and Cache

Context

I have started to be able to get things done with OpenHands + Claude 3.7. But one thing keeps happening: as soon as the conversation gets longer,

  • I run into the rate-limit. This slows me down, obviously.
  • I spend a lot of money on tokens.

But what really hits me is when an LLM condensation happens.

Cache enables long conversations

Cost of Using Cache with Anthropic API

According to Anthropic's pricing:

  • Prompt caching write: $3.75 / MTok (plain input is $3 / MTok, but I will treat them as the same here)
  • Prompt caching read: $0.30 / MTok
  • Output: $15 / MTok

This means that cached input tokens cost one-tenth as much as fresh input tokens.

Example

Imagine we start with a 10k-token initial prompt.
Then we have 5 follow-up prompts that add 1000 input tokens each.

Legend:

+   1k input cache write ~0.3 Cent
-   1k output ~1.5 Cent
#   1k input cache read ~0.03 Cent

A regular conversation might go like this

+++-  initial prompt and response
####+- follow up
######+- regular response
########+- regular response
##########++++- big response
###############+- follow up
#################+- follow up
###################+- follow up

10 * + * 0.3 = 3 Cent

79 * # * 0.03 ≈ 2.4 Cent

8 * - * 1.5 = 12 Cent

Total: ≈ 17.4 Cent

Imagine we did not have caching

(89 # or +) * 0.3 = 26.7 Cent

8 * - * 1.5 = 12 Cent

Total: 38.7 Cent

That's more than twice the price, so in this example caching really pays off. This also matches my real-life experience.
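
For anyone who wants to reproduce the arithmetic, here is a small Python sketch of the same back-of-the-envelope calculation. The prices and symbol counts are the ones from the legend and diagram above; nothing here is OpenHands code.

# Back-of-the-envelope cost comparison for the example conversation above.
# Prices are cents per 1k tokens, mirroring the legend (Anthropic's
# $3.75 / $0.30 / $15 per MTok, with input and cache write treated the same).
CACHE_WRITE = 0.3   # '+' : fresh input, written to the cache
CACHE_READ = 0.03   # '#' : input replayed from the cache
OUTPUT = 1.5        # '-' : generated output

# Symbol counts (in thousands of tokens) as used in the text above.
writes, reads, outputs = 10, 79, 8

with_cache = writes * CACHE_WRITE + reads * CACHE_READ + outputs * OUTPUT
# Without caching, every input token is billed at the full input rate.
without_cache = (writes + reads) * CACHE_WRITE + outputs * OUTPUT

print(f"with cache:    {with_cache:.1f} cent")    # ~17.4 cent
print(f"without cache: {without_cache:.1f} cent") # ~38.7 cent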

Current condensation does not use cache

Our current condensation method creates a completely new prompt, which does not take advantage of caching.

Continuing the example from above

+++-  initial prompt and response
####+- follow up
######+- regular response
########+- regular response
##########++++- big response
###############+- follow up
#################+- follow up
###################+- follow up
+++++++++++++++++++++-- condensation
###++-  conversation continues

Here a condensation reduces the context window from 21k down to 5k (of which 3k are the initial prompt).

But we pay a lot for it.

Cost of Condensation

For the condensation operation:

21 * + * 0.3 = 6.3 Cent

2 * - * 1.5 = 3 Cent

Total: 9.3 Cent

That is because we now essentially pay for every input token twice, when we could have just paid for the cached version instead (at 10% of the cost).

Condenser uses cache

We could greatly reduce the number of new input tokens for the condensation if we could use the conversation's cache.
This can easily be achieved by

  • using the exact same prompt as before, except
  • with the condensation prompt added as the last message.

That way we only pay the full price for the condensation prompt, which is currently only about 1k tokens. Everything else should be cached.

+++-  
####+- 
######+-
########+- 
##########++++- 
###############+- 
#################+-
###################+- 
#####################+-- condensation cached with prompt last
###++-  conversation continues

1 * + * 0.3 = 0.3 Cent

21 * # * 0.03 = 0.63 Cent

2 * - * 1.5 = 3 Cent

Total: 3.93 Cent

That is less than half of the price without cache (9.3 Cent).

Implementation

I have started to implement this in this PR. Right now the focus is to see whether this works as intended, rather than for the code to be perfect.

I created a new agent, LLMCacheCodeAgent, that inherits from CodeActAgent, except that it configures its own LLMAgentCacheCondenser. This was necessary because the condenser needs to create the exact same prompt using the same LLM as the agent.
I had to add build_llm_completion_params to CodeActAgent so that the condenser can reuse the agent's prompt-generation code.
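
To illustrate the wiring, here is a simplified sketch of the idea. LLMCacheCodeAgent, LLMAgentCacheCondenser and build_llm_completion_params are the names used in this PR; the method body, the CONDENSATION_PROMPT constant and the return handling are illustrative only, not the actual implementation.

# Illustrative sketch only; the real classes live in this PR.
CONDENSATION_PROMPT = (
    "I need you to condense our conversation history to make it more efficient. "
    "Reply with KEEP / REWRITE ... END-REWRITE directives."
)

class LLMAgentCacheCondenser:
    def __init__(self, agent):
        # The condenser must use the *same* LLM and prompt generation as the
        # agent, otherwise the prompt prefix (and therefore the cache) differs.
        self.agent = agent

    def condense(self, events):
        # Reuse the agent's prompt-building code so the system prompt, tools
        # and history are byte-for-byte what is already in the cache.
        params = self.agent.build_llm_completion_params(events)
        # Only the condensation instruction is new, appended as the last
        # message; everything before it should be a cache read.
        params["messages"].append({"role": "user", "content": CONDENSATION_PROMPT})
        response = self.agent.llm.completion(**params)
        return response  # downstream code parses the KEEP/REWRITE directives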

The prompt for the condensation asks the AI to use this format:

KEEP: 1
KEEP: 2
KEEP: 3
REWRITE 4 TO 15 WITH:
User asked about database schema and agent explained the tables and relationships.
END-REWRITE
KEEP: 18

I chose this format because, by referencing the messages we want to keep, we avoid having the LLM quote them, which would produce a lot of output.
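
For reference, a minimal sketch of how such a response could be parsed; this is illustrative and not the parser used in the PR.

import re

def parse_condensation(text: str) -> list[tuple]:
    """Parse KEEP / REWRITE directives from the condensation response.

    Returns ("keep", index) and ("rewrite", start, end, summary) tuples.
    Illustrative sketch, not the PR's actual parser.
    """
    directives = []
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if m := re.match(r"KEEP:\s*(\d+)$", line):
            directives.append(("keep", int(m.group(1))))
        elif m := re.match(r"REWRITE\s+(\d+)\s+TO\s+(\d+)\s+WITH:$", line):
            summary = []
            for body in lines:  # consume lines until END-REWRITE
                if body.strip() == "END-REWRITE":
                    break
                summary.append(body)
            directives.append(
                ("rewrite", int(m.group(1)), int(m.group(2)), "\n".join(summary))
            )
    return directives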

Run instructions

  • Build the backend
  • set DEBUG=1 if you want to see details
  • and run it.
  • select the LLM claude 3.7 (or another one that has caching)
    and the agent LLMAgentCacheCondenser
  • Start a conversation
  • Condensation will happen at 100 events (from llm_cache_code_agent.py) or when you put the word "CONDENSE!" in your query
  • search for "I need you to condense our conversation history to make it more efficient." in your logs directory to find the prompt_xx.log where the condensation happens

Evaluation

It would be great if we could run this against some kind of benchmark where context condensation makes sense, while recording the cost.
I would love to know how much cost this saves in practice compared to the current condensers.

@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch from 1346481 to 6db3f04 Compare March 30, 2025 11:52
@happyherp
Contributor Author

@enyst @csmith49 I would love your feedback on this, as I have seen you are familiar with the codebase.

@csmith49
Collaborator

This is great, I've been wanting to test this idea for a while and your description of the problem/solution is spot-on. I'll spend some more time digging into this in the upcoming week, but a few thoughts I can leave you with now:

  1. We know the condensation wipes the cache, and that we have to pay for it when we do the condensation and after, when the cache is being rebuilt with the new summary. Since the number of tokens would grow unbounded if there was no condensation, there's a break-even point where the condensation strategy pulls ahead in terms of cost. The earlier in the conversation we can push that point, the better -- this looks like it'll do exactly that.

  2. In practice I expect we'll want to avoid a new agent and just modify the condenser interface so that we get the desired behavior, but we can worry about that after evaluating this approach.

  3. In terms of evaluating the performance of this condenser, there are a few metrics we like to look for: cost (in dollars, tokens, and time) and performance impact (both qualitative and quantitative). I've got some notebooks from when I was testing the original implementations that look at everything but the qualitative performance impact, so I'll run an evaluation and get back to you with the results!

@enyst
Collaborator

enyst commented Mar 30, 2025

Oh, very interesting! It's definitely worth looking into, maybe we can improve this. 🤔

Just some quick thoughts: we still cache the system prompt separately, but not the first message, I believe. The system prompt is here:

cache_prompt=with_caching,

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

From what I understand, you propose to also cache the first keep_first messages. That seems absolutely correct... They will be sent all the time. We used to have the caching marker set explicitly on the initial user message too, but we have shuffled it around at some point. Now, every step, we cache the latest user/tool message here. I'm a bit confused though: doesn't that mean that they were sent to Anthropic with the cache marker the first time the agent went through them?

@happyherp
Contributor Author

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

@enyst
My understanding is that Anthropic requires a cache flag to be set on each message for it to be added to the cache. The need to specify it comes from the fact that if the flag is off, the input tokens are slightly cheaper: $3 / MTok vs. $3.75 / MTok.
So you could send a prompt where the first few messages have caching on, followed by others that have it off, when you do not expect them to be useful for caching.
OpenAI does not care; it just does caching for you, albeit at a worse discount of only 50%.
So because that is kind of an edge case that only applies to Anthropic, and the difference between regular input and cache-write input is only 25%, I decided not to worry about it and leave caching always on. We could still get some savings out of it in some situations, but I was not focusing on that.

I'm a bit confused though: doesn't that mean that they were sent to Anthropic with the cache marker, the first time the agent went through them?

From what I understand, yes: when caching is on, all messages get the caching flag, even the ones used during condensation. But the condensation was able to neither

  • use the previous cached entries, nor
  • create entries that could be useful for caching itself,
    because the beginning of the condensation prompt and the regular prompt (which includes tool-use instructions) are different. They have a different prefix (that's what Anthropic calls it), so caching does not happen. That is what I changed by moving the condensation prompt to the end, while keeping everything else the same.

@csmith49

We know the condensation wipes the cache, and that we have to pay for it when we do the condensation and after, when the cache is being rebuilt with the new summary. Since the number of tokens would grow unbounded if there was no condensation, there's a break-even point where the condensation strategy pulls ahead in terms of cost. The earlier in the conversation we can push that point, the better -- this looks like it'll do exactly that.

Yes. I think there is a whole art to how and when you create a summary, which must be balanced against the cache. I agree that we probably want to do it a lot earlier than after 100 messages. I believe it might even be a good idea to request a condensation of just the last observation if it is above a certain size; that way, we could still reuse the cache of the conversation so far. Doing this consistently would keep all the huge observations out of the context window. I am looking at you, translation.json and poetry run pytest.
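
A rough sketch of that last-observation heuristic, purely to illustrate the idea; the threshold and helper names are made up:

MAX_OBSERVATION_TOKENS = 4_000  # made-up threshold

def maybe_condense_last_observation(events, count_tokens, summarize):
    # Only touch the tail of the prompt: every earlier message keeps its
    # position, so the provider's cached prefix stays valid.
    last = events[-1]
    if count_tokens(last) > MAX_OBSERVATION_TOKENS:
        events[-1] = summarize(last)  # e.g. ask the LLM to shrink just this one
    return events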

In practice I expect we'll want to avoid a new agent and just modify the condenser interface so that we get the desired behavior, but we can worry about that after evaluating this approach.

Yes. Otherwise it is impossible to take advantage of caching.

In terms of evaluating the performance of this condenser, there are a few metrics we like to look for: cost (in dollars, tokens, and time) and performance impact (both qualitative and quantitative). I've got some notebooks from when I was testing the original implementations that look at everything but the qualitative performance impact, so I'll run an evaluation and get back to you with the results!

👍

@enyst
Collaborator

enyst commented Mar 30, 2025

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

@enyst My understanding is that Anthropic requires a cache flag to be set on each message for it to be added to the cache.

Just to clarify what I meant here: it needs it on the last message that we want cached. It will then cache the whole prompt, which implies all the previous messages, from the beginning.

This is my understanding from Anthropic's documentation. For example,
https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#prompt-caching-examples

The cache_control parameter is placed on the system message to designate it as part of the static prefix.

During each turn, we mark the final message with cache_control so the conversation can be incrementally cached. The system will automatically lookup and use the longest previously cached prefix for follow-up messages.

This is what we are doing. Every step, we cache the system prompt, and the last message.
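
For concreteness, the request shape looks roughly like this with the Anthropic SDK (a sketch based on their prompt-caching docs, not OpenHands code; the model name, system prompt and message variables are placeholders):

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are OpenHands, a helpful coding agent."  # placeholder
previous_messages = []    # earlier turns, unchanged, so they hit the cached prefix
latest_user_text = "..."  # the newest user/tool message

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # any model that supports prompt caching
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Marks the static prefix (system prompt) as cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        *previous_messages,
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": latest_user_text,
                    # Marking the last message lets the cache grow incrementally.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
    ],
)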

According to Anthropic though, if the last message suddenly didn't have the marker, or it wouldn't be found because it's the first time we sent it, they would look up and use "the longest previously cached prefix".

Edited to add: That's why I was asking, doesn't it find the system message at least? If the answer is no... I'm curious why. I see, the PR changed the order... that seems smart! 🤔 How about tools?


def _build_messages_for_condensation(self, events: List[Event]) -> list[Message]:
    # Process the events into messages using the same format as the agent
    # This ensures we can take advantage of the LLM's cache
Collaborator

Interesting! I wasn't sure if the system prompt being different makes a difference.

@happyherp
Contributor Author

Hold off on your evaluation, @csmith49. The current condensation is buggy.
The wrong events are being removed, because I assumed that the list of events matches the list of messages, which is naive.

@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch from 6db3f04 to 17d691f Compare March 30, 2025 20:08
@csmith49
Collaborator

Hold off on your evaluation, @csmith49. The current condensation is buggy. The wrong events are being removed, because I assumed that the list of events matches the list of messages, which is naive.

No worries, just tag me here when you're ready for me to give it a spin.

@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch 3 times, most recently from 40a2f78 to 9938be3 Compare April 6, 2025 13:50
openhands-agent and others added 12 commits April 6, 2025 22:02
…he event from the message

fixes test_llm_agent_cache_condenser_with_state_with_dependencies
allow CondensationAction.summary_offset to be None. Which means insert at and.

keep all events that have no message

fix failed tests: Messages created for microknowlege where created, but returned

fix after rebase: View class was moved.

removed LLMAgentCacheCondenser.keep_first for simplicity
condenseWithState: apply previous condensations to events
simplified summary creation as we ignore the indices anyways
tested 2 condensations in one session

added test with no KEEP 0

make sure we have at least one user message, to prevent invalid prompt.

removed unnecessary field
added State and Agent to Condenser.condense.
Interface LLMCompletionProvider for Agents that expose their LLM and prompt generation.
CachingCondenser - Base class for Condensers that extend the prompt from the Agent to use Cache
removed agent field
param.messages IS NOT a list of Message objects.
removed import that lead to circular import.
@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch from 9938be3 to 7da4c00 Compare April 7, 2025 08:28
@happyherp happyherp force-pushed the condenser_use_cache_from_agent branch from 89b066a to 04b319d Compare April 7, 2025 18:08
@happyherp
Contributor Author

@csmith49 I think it's now worth trying to get it to run.

Put this in the config.toml:

[agent.CodeActAgent.condenser]
type = "agentcache"
trigger_word = "!CONDENSE!"
max_size = 50

I tested it with Claude 3.7. It sometimes makes bad choices about what it remembers/forgets, but the main goal here was to avoid cache writes. So let's see if it does that properly.
