Recommended/Best practice for chat implementation. Extend input_ids or _gen_begin_reuse()/_gen_feed_tokens()? #770
karlsolomon started this conversation in General
Replies: 0 comments
I'm looking at the example "minimal_chat.py". To maintain a running chat history/context, the example appends to a growing context_ids list on every interaction. Meanwhile, if I understand correctly, the previous input_ids and generations are already stored in the KV cache. Are repeated input/context_ids skipped over somehow, or is this ever-growing list of inputs re-tokenized/re-encoded on every turn? If it is re-encoded every time, wouldn't TTFT degrade as the chat context gets longer?

Would it be better practice to use an implementation akin to, say, _gen_feed_tokens() or _gen_begin_reuse()? Is there a way to leverage the Dynamic Generator in the same way as the Streaming Generator's _gen_feed_tokens() and/or _gen_begin_reuse()?

Apologies if any of these questions are obvious. I'm not experienced with (local) LLMs, nor am I a strong Python developer. Thanks! :)
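To make the pattern concrete, here is roughly what I understand the loop to look like. This is a sketch reconstructed from memory rather than a copy of the example, so details like format_prompt() (a stand-in for the model's actual chat template) and the exact result-dict keys are my assumptions:

```python
# Sketch of the growing-context chat loop, assuming exllamav2's dynamic
# generator API. format_prompt() is a hypothetical helper standing in for
# the model's real chat template.
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob

config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model = model, cache = cache, tokenizer = tokenizer)

def format_prompt(text: str) -> str:
    # Hypothetical template; the real example uses the model's own format
    return f"User: {text}\nAssistant:"

context_ids = torch.empty((1, 0), dtype = torch.long)  # running token history

while True:
    user_text = input("User: ")

    # Only the new turn is tokenized here, but the *whole* history is
    # re-submitted as input_ids below
    new_ids = tokenizer.encode(
        format_prompt(user_text),
        add_bos = (context_ids.shape[-1] == 0),
    )
    context_ids = torch.cat([context_ids, new_ids], dim = -1)

    generator.enqueue(ExLlamaV2DynamicJob(
        input_ids = context_ids,
        max_new_tokens = 512,
        stop_conditions = [tokenizer.eos_token_id],
    ))

    eos = False
    while not eos:
        for result in generator.iterate():
            if result["stage"] != "streaming":
                continue
            eos = result["eos"]
            print(result.get("text", ""), end = "", flush = True)
            # Append generated tokens so the next turn's input starts with
            # the exact sequence the cache has already seen
            if "token_ids" in result:
                context_ids = torch.cat([context_ids, result["token_ids"]], dim = -1)
    print()
```

My worry is the torch.cat step: context_ids is re-submitted in full every turn, and I can't tell whether the generator recognizes the prefix it has already processed in the cache or prefills the whole sequence again from scratch.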