Implement an OpenAI Chat Completion compatibility API #1817
Comments
FYI @bbrowning @booxter, let me know if either of you would like to scope this out (i.e., outline sub-issues and add more details).
Is the goal to adjust the existing Llama Stack chat completion inference parameters to more closely match those used by OpenAI Chat Completions, so that the overall shape of the APIs feels similar? Or is the goal to be able to use existing OpenAI clients against Llama Stack via an OpenAI-compatible chat completion endpoint? If we want actual OpenAI clients to work with Llama Stack, we'll need to adjust things like the path of our chat completions endpoint to match what OpenAI clients expect, ensure OpenAI client api_keys get passed through to our auth middleware properly, and handle the other implicit semantics required for OpenAI client compatibility beyond just the shape of the parameters passed into the API.

We do have some existing code and the start of a test suite that adapts OpenAI client calls into Llama Stack inference calls for chat completions at https://github.com/instructlab/lls-openai-client/blob/fdb343d5743ffb6ce7b54b25e0c6f0e5e314267b/tests/functional/test_chat_completions.py. The code in that repository assumes we want to adapt OpenAI client calls into Llama Stack Inference calls that then go through the remote::vllm backend to get converted back into OpenAI client calls against the remote vLLM server, and the tests verify both the request and response conversions of that roundtrip. We can take inspiration from that code and/or test suite if we want to do something similar directly in Llama Stack.

The path we take in Llama Stack depends on the original question: do we want existing OpenAI clients to just work with Llama Stack, or do we just want a similar parameter shape? Either way, I'm happy to contribute here given our recent learnings from prototyping the OpenAI Python client to Llama Stack Inference API adapter linked above.
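To make the "existing OpenAI clients just work" option concrete, the test would be that an unmodified OpenAI Python client only needs its base URL swapped. Here is a minimal sketch of what that could look like; the route path, port, model name, and API key below are hypothetical placeholders, not a confirmed Llama Stack API:

```python
# Sketch: an unmodified OpenAI Python client pointed at a Llama Stack server.
# The base_url path, port, model name, and api_key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8321/v1/openai/v1",  # hypothetical Llama Stack OpenAI-compatible route
    api_key="dummy-key",  # would need to flow through Llama Stack's auth middleware
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model identifier
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

If only a similar parameter shape is wanted, none of the path or auth concerns above apply; the base URL swap is what distinguishes the two goals.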
I believe during the discussion yesterday @raghotham had suggested making a separate API for this, but I wanted to confirm it as, unfortunately, my notes weren't as precise as I wanted them to be.
Is the intention here to use this purely for OpenAI-compatible chat completion, or will this API also expose Llama Stack functionality such as tool_groups to client applications?
+1 on providing an OpenAI-compatible API; it could live under its own dedicated path.
Confirmed that this is a new API.
From an implementation point of view, do we want to implement our own OpenAI endpoint? Or should we do something like implement a Llama Stack provider for LiteLLM, which would enable anyone using LiteLLM's Python SDK to work with Llama Stack, as well as let us run LiteLLM as a proxy to provide an OpenAI endpoint in front of Llama Stack? Basically, we already have a dependency on litellm for some of our inference providers; do we expand that scope to also let litellm be our OpenAI proxy endpoint?
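For reference, here is a rough sketch of what a LiteLLM SDK call against an OpenAI-compatible Llama Stack endpoint might look like; the endpoint URL, model name, and API key are assumptions for illustration, not an agreed-upon design:

```python
# Sketch: using LiteLLM's Python SDK against a hypothetical OpenAI-compatible
# Llama Stack endpoint. URL, model name, and API key are placeholders.
import litellm

response = litellm.completion(
    model="openai/meta-llama/Llama-3.1-8B-Instruct",  # "openai/" prefix routes via LiteLLM's OpenAI-compatible provider
    api_base="http://localhost:8321/v1/openai/v1",    # hypothetical Llama Stack route
    api_key="dummy-key",
    messages=[{"role": "user", "content": "What is Llama Stack?"}],
)
print(response.choices[0].message.content)
```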
We should definitely make it easy for litellm SDK users to use Llama Stack. Using the litellm proxy server is a good idea; if we run into problems we can either improve it or migrate. +1
So, thinking about this a bit more, I think a first-class OpenAI-compatible server API makes sense to implement directly in the project. Even if litellm proxy support happens later, the benefit of doing this directly is that we could avoid some extra conversion steps as we expose new OpenAI-compatible endpoints. I stubbed in a prototype of this in #1894, with the remote-vllm provider as a working example. We'd also want an inference mixin that handles the OpenAI chat completion request --> Llama Stack chat completion request conversion and the Llama Stack chat completion response --> OpenAI chat completion response conversion. We'd use this mixin for any inference providers that don't have a provider-specific OpenAI-compatible endpoint to proxy directly to.
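For a rough idea of what such a mixin could look like, here is a minimal sketch; the class name, method names, and field accesses are hypothetical illustrations, not the actual #1894 implementation:

```python
# Sketch of a conversion mixin for providers without a native OpenAI-compatible
# backend. All names here are hypothetical, for illustration only.
class OpenAIChatCompletionMixin:
    async def openai_chat_completion(self, model: str, messages: list[dict], **params):
        # 1. OpenAI request -> Llama Stack chat completion request
        ls_messages = [{"role": m["role"], "content": m["content"]} for m in messages]
        sampling_params = {
            "temperature": params.get("temperature"),
            "max_tokens": params.get("max_tokens"),
        }

        # 2. Delegate to the provider's existing Llama Stack inference method
        ls_response = await self.chat_completion(
            model_id=model,
            messages=ls_messages,
            sampling_params=sampling_params,
        )

        # 3. Llama Stack response -> OpenAI-shaped response
        return {
            "object": "chat.completion",
            "model": model,
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": ls_response.completion_message.content,
                    },
                    "finish_reason": "stop",
                }
            ],
        }
```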
I'm getting far enough along in #1894 that various users have already tried it and provided feedback and requests, both on that PR and to me privately via other channels. Is there any comment on the overall approach there, of leaving it up to the individual providers to either proxy directly to their own OpenAI backend (for providers that speak OpenAI natively), raise an error, or use a mixin that does some automatic conversion, which we could improve over time to cover more and more of our providers that don't have a native OpenAI-compatible server API? I can keep poking at this for a while, adding more tests and supporting more edge cases and provider-specific behavior.
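As a complement to the mixin, a provider whose backend already speaks OpenAI natively (e.g. remote vLLM) could simply proxy the request straight through. A rough sketch, with hypothetical class and attribute names:

```python
# Sketch: a provider whose backend already exposes an OpenAI-compatible server
# forwards the request unchanged. Names are illustrative only.
from openai import AsyncOpenAI

class RemoteOpenAICompatibleProvider:
    def __init__(self, base_url: str, api_key: str):
        self._client = AsyncOpenAI(base_url=base_url, api_key=api_key)

    async def openai_chat_completion(self, model: str, messages: list[dict], **params):
        # Pass the request through; the backend (e.g. a remote vLLM server) handles it.
        return await self._client.chat.completions.create(
            model=model, messages=messages, **params
        )
```

Providers that fit neither case would raise an error until the mixin covers them.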
🚀 Describe the new functionality needed
Many AI frameworks support OpenAI's Chat Completions schema.
We would like to enhance Llama Stack to support Chat Completions as well.
A sample chat completion object can be seen in OpenAI's API reference; an illustrative sketch of its shape is included below.
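The following is a rough illustration of the OpenAI chat completion response object, written as a Python dict; the field values are made up, and only the field names and structure matter:

```python
# Illustrative shape of an OpenAI chat completion response object.
chat_completion = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 8, "total_tokens": 17},
}
```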
💡 Why is this needed? What if we don't build it?
Software developers using LLM providers (e.g., OpenAI) want to be able to use their Chat Completions-compatible software with Llama Stack so that they can get up and running quickly.
Other thoughts
To be clarified in further detail.