perf(condensation) Condenser that uses cache from agent #7588
base: main
Conversation
## LLM Context Condensation and Cache

### Context

I have started to be able to get stuff done with OpenHands + Claude 3.7. But one thing that keeps happening is that the cost adds up over long conversations. What really hits me, though, is when an LLM condensation happens.

### Cache enables long conversations

#### Cost of Using Cache with Anthropic API

According to Anthropic's pricing, cached input tokens cost one-tenth as much as fresh input tokens.

#### Example

Imagine we start with a 10k token initial prompt. Legend (prices per 1k tokens, consistent with the totals below): fresh input = 0.3 Cent, cached input = 0.03 Cent, output = 1.5 Cent.
A regular conversation might go like this. With caching, over the whole session we pay for roughly 10k fresh input tokens, 79k cached input tokens, and 8k output tokens:

- 10k fresh input * 0.3 Cent = 3.0 Cent
- 79k cached input * 0.03 Cent = 2.37 Cent
- 8k output * 1.5 Cent = 12.0 Cent
- Total: 17.3 Cent

Imagine we did not have caching. Then all 89k input tokens are billed at the full rate:

- 89k input * 0.3 Cent = 26.7 Cent
- 8k output * 1.5 Cent = 12.0 Cent
- Total: 38.7 Cent

That's more than twice the price, so in this example caching really brings a benefit. This also matches my real-life experience.
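For anyone who wants to check the arithmetic, here is a minimal sketch of the cost model behind these numbers. The per-1k-token rates are assumptions reverse-engineered from the totals and the one-tenth cache ratio above, not official pricing:

```python
# Minimal cost-model sketch for the example above.
# Rates are assumptions matching the stated totals, not authoritative pricing.
FRESH_INPUT_CENT = 0.3    # cents per 1k fresh input tokens
CACHED_INPUT_CENT = 0.03  # cents per 1k cached input tokens (one-tenth of fresh)
OUTPUT_CENT = 1.5         # cents per 1k output tokens


def cost_cents(fresh_k: float, cached_k: float, output_k: float) -> float:
    """Total cost in cents, with token counts given in thousands."""
    return (fresh_k * FRESH_INPUT_CENT
            + cached_k * CACHED_INPUT_CENT
            + output_k * OUTPUT_CENT)


print(cost_cents(10, 79, 8))  # with caching: 17.37 (the ~17.3 Cent above)
print(cost_cents(89, 0, 8))   # without caching: 38.7 Cent
```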
### Current condensation does not use cache

Our current condensation method creates a completely new prompt, which does not take advantage of caching. Continuing the example from above: here a condensation reduces the context window from 21k down to 5k (of which 3k are the initial prompt). But we pay a lot for it.

#### Cost of Condensation

For the condensation operation:

- 21k input * 0.3 Cent = 6.3 Cent
- 2k output (the summary) * 1.5 Cent = 3.0 Cent
- Total: 9.3 Cent

That is because we essentially pay for every input token twice, when we could have just paid for the cached version instead (at 10% of the cost).

### Condenser uses cache

We could greatly reduce the number of fresh input tokens in the condensation if we could use the cache of the conversation. That way we only pay full price for the condensation prompt, which is just about 1k right now. Everything else should be cached.
With cache, the same condensation costs:

- 1k fresh input (the condensation prompt) * 0.3 Cent = 0.3 Cent
- 21k cached input * 0.03 Cent = 0.63 Cent
- 2k output * 1.5 Cent = 3.0 Cent
- Total: 3.93 Cent

That is less than half of the price without cache (9.3 Cent).

### Implementation

I have started to implement this in this PR. Right now the focus is to see whether this works as intended, rather than for the code to be perfect. I created a new agent LLMCacheCodeAgent that inherits from CodeActAgent. The prompt for the condensation asks the AI to use this format:
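The actual prompt format is not reproduced here. As a purely hypothetical illustration of the idea (based on the `KEEP` keyword that shows up in the commit log below), the expected response would reference messages by index rather than quote them, roughly like:

```python
# Hypothetical shape of a condensation response (not the PR's actual format):
# the model lists indices of messages to keep plus a short summary of the rest,
# so it never has to quote message contents back as output tokens.
condensation_response = {
    "keep": [0, 3, 7],  # indices of events/messages to keep verbatim
    "summary": "Investigated the failing test; the fix touches foo.py.",
}
```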
I chose this format because, by referencing the messages we want to keep, we avoid having the LLM quote the messages, which would cause a lot of output.

### Run instructions
### Evaluation

It would be great if we could run this against some type of benchmark where context condensation makes sense, while recording the cost.
This is great, I've been wanting to test this idea for a while and your description of the problem/solution is spot-on. I'll spend some more time digging into this in the upcoming week, but a few thoughts I can leave you with now:
Oh, very interesting! It's definitely worth looking into; maybe we can improve this. 🤔 Just some quick thoughts: we still cache the system prompt separately, but not the first message, I believe. The system prompt is here:
Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied, or is it all a cache write? From what I understand, you propose to also cache the first message.
@enyst
From what I understand, yes: when caching is on, all messages get the caching flag, even the ones used during condensation. But the condensation was not able to neither
Yes. I think there is a whole art to how and when you create a summary, which must be balanced with the cache. I agree that we probably want to do it a lot earlier than after 100 messages. I believe it might even be a good idea to request a condensation of just the last observation, if it is above a certain size. That way, we could still reuse the cache of the conversation so far. Doing this consistently would keep all the huge observations out of the context window. I am looking at you, translation.json and
Yes. Otherwise it is impossible to take advantage of caching.
👍
Just to clarify what I meant here: it needs the marker on the last message that we want cached. It will then cache the whole prompt up to that point, which implies all the previous messages, from the beginning. This is my understanding from Anthropic's documentation. For example:
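For reference, a minimal sketch of how the marker is placed with the Anthropic SDK, as I understand the prompt-caching docs (model id and message contents here are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # placeholder model id
    max_tokens=1024,
    system=[
        # Cache breakpoint after the system prompt.
        {"type": "text", "text": "You are a helpful coding agent.",
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        {"role": "user", "content": "Earlier message"},
        {"role": "assistant", "content": "Earlier reply"},
        {
            "role": "user",
            "content": [
                # Marking only the last message caches the whole prefix up to
                # (and including) this block; later requests that share this
                # prefix are billed at the cheaper cache-read rate.
                {"type": "text", "text": "Latest user message",
                 "cache_control": {"type": "ephemeral"}},
            ],
        },
    ],
)
```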
This is what we are doing: every step, we cache the system prompt and the last message. According to Anthropic though, if the last message suddenly didn't have the marker, or it would not be found, or it's the first time we sent it, they would look up and use "the longest previously cached prefix". Edited to add: that's why I was asking, doesn't it find the system message at least? If the answer is no... I'm curious why. I see, the PR changed the order... that seems smart! 🤔 How about tools?
```python
def _build_messages_for_condensation(self, events: List[Event]) -> list[Message]:
    # Process the events into messages using the same format as the agent.
    # This ensures we can take advantage of the LLM's cache.
```
Interesting! I wasn't sure if the system prompt being different makes a difference.
Please hold off on your evaluation, @csmith49. The current condensation is buggy.
No worries, just tag me here when you're ready for me to give it a spin.
- …he event from the message; fixes test_llm_agent_cache_condenser_with_state_with_dependencies
- allow CondensationAction.summary_offset to be None, which means insert at end
- keep all events that have no message
- fix failed tests: messages created for microknowledge were created, but returned
- fix after rebase: View class was moved
- removed LLMAgentCacheCondenser.keep_first for simplicity
- condenseWithState: apply previous condensations to events
- simplified summary creation, as we ignore the indices anyway
- tested 2 condensations in one session
- added test with no KEEP 0
- make sure we have at least one user message, to prevent an invalid prompt
- removed unnecessary field
- added State and Agent to Condenser.condense
- Interface LLMCompletionProvider for Agents that expose their LLM and prompt generation
- CachingCondenser: base class for Condensers that extend the prompt from the Agent to use the cache (see the sketch after this commit list)
- removed agent field
- param.messages IS NOT a list of Message objects
- removed import that led to a circular import
- …ons. and this fixes that.
- …stom agent to use the condenser. It can be done via config.toml.
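A rough sketch of how the pieces named in the commit list above might fit together. The class and method signatures below are assumptions for illustration only, not the PR's actual code:

```python
from abc import ABC, abstractmethod
from typing import Any


class LLMCompletionProvider(ABC):
    """Assumed interface: an agent that exposes its LLM and prompt generation."""

    @abstractmethod
    def get_llm(self) -> Any: ...

    @abstractmethod
    def build_prompt(self, events: list[Any]) -> list[dict]: ...


class CachingCondenser(ABC):
    """Assumed base class: a condenser that extends the agent's own prompt,
    so the condensation request reuses the agent's cached prefix."""

    @abstractmethod
    def condensation_instruction(self) -> dict: ...

    def condense(self, events: list[Any], agent: LLMCompletionProvider) -> Any:
        # Build the exact same prompt the agent would send, so the prefix
        # matches what is already cached on the provider side...
        messages = agent.build_prompt(events)
        # ...and append only the small condensation instruction as fresh input.
        messages.append(self.condensation_instruction())
        return agent.get_llm().completion(messages=messages)
```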
@csmith49 I think now it's worth trying to get it to run. Put this in the config.toml:
I tested it with Claude 3.7. It sometimes makes bad choices about what it remembers/forgets. But the goal here was mainly to avoid cache writes, so let's see if it does that properly.
This is a work in progress.
End-user friendly description of the problem this fixes or functionality that this introduces.
Greatly reduces the cost of doing a condensation by using the cache of the LLM.
Give a summary of what the PR does, explaining any non-trivial design decisions.
Link of any specific issues this addresses.
happyherp#14