
Implement an OpenAI Chat Completion compatibility API #1817

Open

franciscojavierarceo opened this issue Mar 27, 2025 · 10 comments
Labels: enhancement (New feature or request)

@franciscojavierarceo
Collaborator

🚀 Describe the new functionality needed

Many AI frameworks support OpenAI's Chat Completions schema.

We would like to enhance Llama Stack to support the Chat Completions API as well.

A sample chat completion response object looks like this:

{
  "id": "chatcmpl-B9MHDbslfkBeAs8l4bebGdFOJ6PeG",
  "object": "chat.completion",
  "created": 1741570283,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image shows a wooden boardwalk path running through a lush green field or meadow. The sky is bright blue with some scattered clouds, giving the scene a serene and peaceful atmosphere. Trees and shrubs are visible in the background.",
        "refusal": null,
        "annotations": []
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1117,
    "completion_tokens": 46,
    "total_tokens": 1163,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_fc9f1d7035"
}

💡 Why is this needed? What if we don't build it?

Software developers already using LLM providers (e.g., OpenAI) want to point their existing Chat Completions-compatible software at Llama Stack so they can get up and running quickly.

Other thoughts

To be clarified in further detail.

@franciscojavierarceo added the enhancement (New feature or request) label on Mar 27, 2025
@franciscojavierarceo
Collaborator Author

FYI @bbrowning @booxter let me know if either of you would like to scope this out (i.e., outline sub-issues and add more details).

@bbrowning
Contributor

Is the goal to adjust the existing Llama Stack chat completion inference parameters to more closely match those used by OpenAI Chat Completions so that the overall shape of the APIs feels similar? Or is the goal to be able to use existing OpenAI clients against Llama Stack with an OpenAI-compatible chat completion endpoint?

If we want actual OpenAI clients to work with Llama Stack, then we'll need to adjust things like the path of our chat completions endpoint to match what OpenAI clients expect, ensure OpenAI client api_keys get passed through to our auth middleware properly, and other such implicit semantics required to have OpenAI client compatibility outside of just the shape of the parameters passed into the API.

We do have some existing code and the start of a test suite that adapts OpenAI client calls into Llama Stack inference calls for chat completions at https://github.com/instructlab/lls-openai-client/blob/fdb343d5743ffb6ce7b54b25e0c6f0e5e314267b/tests/functional/test_chat_completions.py. The code in that repository assumes we want to adapt OpenAI client calls into Llama Stack Inference calls that then go through the remote::vllm backend to get converted back into OpenAI client calls against the remote vLLM server, and the tests verify both the request and response conversions of that roundtrip. We can take inspiration from that code and/or test suite if we want to do something similar directly in Llama Stack.

The path we take in Llama Stack depends on the original question of whether we want existing OpenAI clients to just work with Llama Stack or whether we just want a similar parameter shape. Either way, I'm happy to contribute here given our recent learnings from prototyping the OpenAI Python client to Llama Stack Inference API adapter linked above.
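
For illustration, here is roughly what "existing OpenAI clients just work against Llama Stack" could look like from a user's side. This is a hedged sketch, not settled design: the server address, URL prefix, model id, and API key handling are all assumptions.

from openai import OpenAI

# Hypothetical: point the stock OpenAI Python client at a Llama Stack server
# that exposes an OpenAI-compatible chat completions path. The base URL,
# api_key handling, and model id below are placeholders for illustration only.
client = OpenAI(
    base_url="http://localhost:8321/v1",   # assumed Llama Stack address and prefix
    api_key="not-a-real-key",              # would need to flow through Llama Stack auth
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)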

@franciscojavierarceo
Collaborator Author

If we want actual OpenAI clients to work with Llama Stack, then we'll need to adjust things like the path of our chat completions endpoint to match what OpenAI clients expect, ensure OpenAI client api_keys get passed through to our auth middleware properly, and other such implicit semantics required to have OpenAI client compatibility outside of just the shape of the parameters passed into the API.

I believe during the discussion yesterday @raghotham suggested making a separate API for this, but I wanted to confirm it because, unfortunately, my notes weren't as precise as I wanted them to be.

@deewhyweb

Is the intention here to use this purely for OpenAI-API-compatible chat completion, or will this API also expose Llama Stack functionality such as tool_groups to client applications?

@mattf
Contributor

mattf commented Mar 29, 2025

+1 to providing an OpenAI-compatible /chat/completions endpoint for clients to use in addition to /inference/chat_completion.

It could live under /experimental for now.
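
As a rough illustration of the two paths coexisting; the server address, the /experimental prefix, and the payload field names below are assumptions for the sake of the sketch, not the final design:

import requests

BASE = "http://localhost:8321"  # assumed Llama Stack server address

# Existing Llama Stack inference endpoint (path as named in the comment above;
# payload field names are assumed for illustration).
native = requests.post(
    f"{BASE}/inference/chat_completion",
    json={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}],
    },
)

# Proposed OpenAI-compatible endpoint, possibly nested under /experimental.
openai_compat = requests.post(
    f"{BASE}/experimental/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}],
    },
)
print(native.status_code, openai_compat.status_code)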

@franciscojavierarceo changed the title from "Implement an OpenAI Chat Completion compatibility" to "Implement an OpenAI Chat Completion compatibility API" on Apr 3, 2025
@franciscojavierarceo
Collaborator Author

Confirmed that this is a new API

@bbrowning
Contributor

From an implementation point of view, do we want to implement our own OpenAI endpoint? Or should we do something like implement a Llama Stack provider for LiteLLM, which would enable anyone using LiteLLM's Python SDK to work with Llama Stack and also let us run LiteLLM as a proxy to provide an OpenAI endpoint in front of Llama Stack? Basically, we already have a dependency on litellm for some of our inference providers; do we expand that scope to also let litellm be our OpenAI proxy endpoint?

@mattf
Contributor

mattf commented Apr 4, 2025

From an implementation point of view, do we want to implement our own OpenAI endpoint? Or should we do something like implement a Llama Stack provider for LiteLLM, which would enable anyone using LiteLLM's Python SDK to work with Llama Stack and also let us run LiteLLM as a proxy to provide an OpenAI endpoint in front of Llama Stack? Basically, we already have a dependency on litellm for some of our inference providers; do we expand that scope to also let litellm be our OpenAI proxy endpoint?

We should definitely make it easy for LiteLLM SDK users to use Llama Stack.

Using the LiteLLM proxy server is a good idea. If we run into problems we can either improve it or migrate.

+1
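
To make the LiteLLM option above concrete, here is a minimal sketch of what the SDK path could look like if an OpenAI-compatible endpoint were exposed by Llama Stack (or by a LiteLLM proxy in front of it). The base URL, model id, and api_key are placeholders; the "openai/" prefix is just LiteLLM's way of speaking the OpenAI protocol to a custom api_base.

import litellm

# Hypothetical: send a LiteLLM SDK call to an assumed OpenAI-compatible
# endpoint in front of Llama Stack. Everything below is illustrative only.
response = litellm.completion(
    model="openai/meta-llama/Llama-3.1-8B-Instruct",  # "openai/" -> OpenAI-protocol route
    messages=[{"role": "user", "content": "Say hello"}],
    api_base="http://localhost:8321/v1",              # assumed endpoint
    api_key="not-a-real-key",
)
print(response.choices[0].message.content)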

@bbrowning
Contributor

So, thinking about this a bit more, I think a first-class OpenAI-compatible server API makes sense to implement directly in the project. Even if litellm proxy support happens later, the benefit of doing this directly is that we could avoid some extra conversion steps if we expose new openai_completion and openai_chat_completion methods directly to inference providers, allowing each provider to choose whether to proxy the call straight to its OpenAI-compatible backend (if it has one) or convert it into Llama Stack Inference calls.

I stubbed in a prototype of this in #1894, with the remote-vllm provider as a working example. We'd also want an inference mixin that handles the OpenAI chat completion request --> Llama Stack chat completion request conversion and the Llama Stack chat completion response --> OpenAI chat completion response conversion. We'd use this mixin for any inference provider that doesn't have a provider-specific OpenAI-compatible endpoint to proxy to directly.
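
A minimal sketch of the mixin idea described above, assuming dict-shaped requests and responses. All class, method, and field names here are illustrative assumptions rather than the actual Llama Stack types; a real implementation would use the project's typed request/response models and handle streaming, tools, and errors.

class OpenAIChatCompletionConversionMixin:
    """Adapts OpenAI-style chat completion params onto a provider's native
    chat_completion method, for providers without an OpenAI-compatible
    backend. The host class is assumed to provide an async chat_completion()."""

    async def openai_chat_completion(self, model: str, messages: list, **params):
        # OpenAI request -> Llama Stack-style request (field names assumed).
        native_request = {
            "model_id": model,
            "messages": messages,
            "sampling_params": {
                "temperature": params.get("temperature"),
                "max_tokens": params.get("max_tokens"),
            },
        }
        native_response = await self.chat_completion(**native_request)

        # Llama Stack-style response -> OpenAI response shape (fields assumed).
        return {
            "object": "chat.completion",
            "model": model,
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": native_response["completion_message"]["content"],
                    },
                    "finish_reason": "stop",
                }
            ],
        }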

@bbrowning
Contributor

I'm getting far enough along in #1894 that various users have already tried it and provided feedback and requests, both on that PR and to me privately via other channels. Is there any comment on the overall approach there, of leaving it up to the individual providers to either proxy directly to their own OpenAI backend (for providers that speak OpenAI natively), raise an error, or fall back to a mixin that does some automatic conversion, which we could improve over time to cover more and more of the providers that don't have a native OpenAI-compatible server API?

I can keep poking at this for a while: adding more tests, supporting more edge cases and provider-specific extra_body params, improving the automatic conversion for providers that don't have a native OpenAI backend, etc. But what's the bar for deciding whether to merge this, just so I know where to focus my efforts? Should I keep polishing everything as much as possible, or do we want to consider merging something sooner and then iteratively improving it over time?
