
Implement an OpenAI Chat Completion compatibility API #1817

Open

franciscojavierarceo opened this issue Mar 27, 2025 · 10 comments
Labels: enhancement (New feature or request)

@franciscojavierarceo
Collaborator

🚀 Describe the new functionality needed

Many AI frameworks support OpenAI's Chat Completions schema.

We would like to enhance Llama Stack to support the Chat Completions API as well.

A sample chat completion response object looks like this:

{
  "id": "chatcmpl-B9MHDbslfkBeAs8l4bebGdFOJ6PeG",
  "object": "chat.completion",
  "created": 1741570283,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image shows a wooden boardwalk path running through a lush green field or meadow. The sky is bright blue with some scattered clouds, giving the scene a serene and peaceful atmosphere. Trees and shrubs are visible in the background.",
        "refusal": null,
        "annotations": []
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1117,
    "completion_tokens": 46,
    "total_tokens": 1163,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_fc9f1d7035"
}

💡 Why is this needed? What if we don't build it?

Software developers already using LLM providers (e.g., OpenAI) want to point their existing Chat Completions-compatible software at Llama Stack so they can get up and running quickly.

Other thoughts

To be clarified in further detail.

@franciscojavierarceo added the enhancement (New feature or request) label on Mar 27, 2025
@franciscojavierarceo
Collaborator Author

FYI @bbrowning @booxter let me know if either of you would like to scope this out (i.e., outline sub-issues and add more details).

@bbrowning
Contributor

Is the goal to adjust the existing Llama Stack chat completion inference parameters to more closely match those used by OpenAI Chat Completions so that the overall shape of the APIs feels similar? Or is the goal to be able to use existing OpenAI clients against Llama Stack with an OpenAI-compatible chat completion endpoint?

If we want actual OpenAI clients to work with Llama Stack, then we'll need to adjust things like the path of our chat completions endpoint to match what OpenAI clients expect, ensure OpenAI client api_keys get passed through to our auth middleware properly, and other such implicit semantics required to have OpenAI client compatibility outside of just the shape of the parameters passed into the API.

We do have some existing code and the start of a test suite that adapts OpenAI client calls into Llama Stack inference calls for chat completions at https://github.com/instructlab/lls-openai-client/blob/fdb343d5743ffb6ce7b54b25e0c6f0e5e314267b/tests/functional/test_chat_completions.py. The code in that repository assumes we want to adapt OpenAI client calls into Llama Stack Inference calls that then go through the remote::vllm backend to get converted back into OpenAI client calls against the remote vLLM server, and the tests verify both the request and response conversions of that roundtrip. We can take inspiration from that code and/or test suite if we want to do something similar directly in Llama Stack.

The path we take in Llama Stack depends on the original question of whether we want existing OpenAI clients to just work with Llama Stack or whether we just want a similar parameter shape. Either way, I'm happy to contribute here given our recent learnings from prototyping the OpenAI Python client to Llama Stack Inference API adapter linked above.
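
For illustration, here is roughly what "existing OpenAI clients just work against Llama Stack" could look like from a user's side. This is a hedged sketch, not settled design: the server address, URL prefix, model id, and API key handling are all assumptions.

from openai import OpenAI

# Hypothetical: point the stock OpenAI Python client at a Llama Stack server
# that exposes an OpenAI-compatible chat completions path. The base URL,
# api_key handling, and model id below are placeholders for illustration only.
client = OpenAI(
    base_url="http://localhost:8321/v1",   # assumed Llama Stack address and prefix
    api_key="not-a-real-key",              # would need to flow through Llama Stack auth
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)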

@franciscojavierarceo
Collaborator Author

If we want actual OpenAI clients to work with Llama Stack, then we'll need to adjust things like the path of our chat completions endpoint to match what OpenAI clients expect, ensure OpenAI client api_keys get passed through to our auth middleware properly, and other such implicit semantics required to have OpenAI client compatibility outside of just the shape of the parameters passed into the API.

I believe during the discussion yesterday @raghotham suggested making a separate API for this, but I wanted to confirm it because, unfortunately, my notes weren't as precise as I wanted them to be.

@deewhyweb

Is the intention here to use this purely for OpenAI-API-compatible chat completion, or will this API also expose Llama Stack functionality such as tool_groups to client applications?

@mattf
Contributor

mattf commented Mar 29, 2025

+1 to providing an OpenAI-compatible /chat/completions endpoint for clients to use in addition to /inference/chat_completion.

It could live under /experimental for now.
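
As a rough illustration of the two paths coexisting; the server address, the /experimental prefix, and the payload field names below are assumptions for the sake of the sketch, not the final design:

import requests

BASE = "http://localhost:8321"  # assumed Llama Stack server address

# Existing Llama Stack inference endpoint (path as named in the comment above;
# payload field names are assumed for illustration).
native = requests.post(
    f"{BASE}/inference/chat_completion",
    json={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}],
    },
)

# Proposed OpenAI-compatible endpoint, possibly nested under /experimental.
openai_compat = requests.post(
    f"{BASE}/experimental/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}],
    },
)
print(native.status_code, openai_compat.status_code)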

@franciscojavierarceo changed the title from "Implement an OpenAI Chat Completion compatibility" to "Implement an OpenAI Chat Completion compatibility API" on Apr 3, 2025
@franciscojavierarceo
Collaborator Author

Confirmed that this is a new API

@bbrowning
Contributor

From an implementation point of view, do we want to implement our own OpenAI endpoint? Or should we do something like implement a Llama Stack provider for LiteLLM, which would enable anyone using LiteLLM's Python SDK to work with Llama Stack and also let us run LiteLLM as a proxy to provide an OpenAI endpoint in front of Llama Stack? Basically, we already have a dependency on litellm for some of our inference providers; do we expand that scope to also let litellm be our OpenAI proxy endpoint?

@mattf
Contributor

mattf commented Apr 4, 2025

From an implementation point of view, do we want to implement our own OpenAI endpoint? Or should we do something like implement a Llama Stack provider for LiteLLM, which would enable anyone using LiteLLM's Python SDK to work with Llama Stack and also let us run LiteLLM as a proxy to provide an OpenAI endpoint in front of Llama Stack? Basically, we already have a dependency on litellm for some of our inference providers; do we expand that scope to also let litellm be our OpenAI proxy endpoint?

We should definitely make it easy for LiteLLM SDK users to use Llama Stack.

Using the LiteLLM proxy server is a good idea. If we run into problems we can either improve it or migrate.

+1
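
To make the LiteLLM option above concrete, here is a minimal sketch of what the SDK path could look like if an OpenAI-compatible endpoint were exposed by Llama Stack (or by a LiteLLM proxy in front of it). The base URL, model id, and api_key are placeholders; the "openai/" prefix is just LiteLLM's way of speaking the OpenAI protocol to a custom api_base.

import litellm

# Hypothetical: send a LiteLLM SDK call to an assumed OpenAI-compatible
# endpoint in front of Llama Stack. Everything below is illustrative only.
response = litellm.completion(
    model="openai/meta-llama/Llama-3.1-8B-Instruct",  # "openai/" -> OpenAI-protocol route
    messages=[{"role": "user", "content": "Say hello"}],
    api_base="http://localhost:8321/v1",              # assumed endpoint
    api_key="not-a-real-key",
)
print(response.choices[0].message.content)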

@bbrowning
Contributor

So, thinking about this a bit more, I think a first-class OpenAI-compatible server API makes sense to implement directly in the project. Even if litellm proxy support happens later, the benefit of doing this directly is that we could avoid some extra conversion steps if we expose new openai_completion and openai_chat_completion methods directly to inference providers, allowing each provider to choose whether to proxy the call straight to its OpenAI-compatible backend (if it has one) or convert it into Llama Stack Inference calls.

I stubbed in a prototype of this in #1894, with the remote-vllm provider as a working example. We'd also want an inference mixin that handles the OpenAI chat completion request --> Llama Stack chat completion request conversion and the Llama Stack chat completion response --> OpenAI chat completion response conversion. We'd use this mixin for any inference provider that doesn't have a provider-specific OpenAI-compatible endpoint to proxy to directly.
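
A minimal sketch of the mixin idea described above, assuming dict-shaped requests and responses. All class, method, and field names here are illustrative assumptions rather than the actual Llama Stack types; a real implementation would use the project's typed request/response models and handle streaming, tools, and errors.

class OpenAIChatCompletionConversionMixin:
    """Adapts OpenAI-style chat completion params onto a provider's native
    chat_completion method, for providers without an OpenAI-compatible
    backend. The host class is assumed to provide an async chat_completion()."""

    async def openai_chat_completion(self, model: str, messages: list, **params):
        # OpenAI request -> Llama Stack-style request (field names assumed).
        native_request = {
            "model_id": model,
            "messages": messages,
            "sampling_params": {
                "temperature": params.get("temperature"),
                "max_tokens": params.get("max_tokens"),
            },
        }
        native_response = await self.chat_completion(**native_request)

        # Llama Stack-style response -> OpenAI response shape (fields assumed).
        return {
            "object": "chat.completion",
            "model": model,
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": native_response["completion_message"]["content"],
                    },
                    "finish_reason": "stop",
                }
            ],
        }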

@bbrowning
Contributor

I'm getting far enough along in #1894 that various users have already tried it and provided feedback and requests, both on that PR and to me privately via other channels. Is there any comment on the overall approach there, of leaving it up to the individual providers to either proxy directly to their own OpenAI backend (for providers that speak OpenAI natively), raise an error, or fall back to a mixin that does some automatic conversion, which we could improve over time to cover more and more of the providers that don't have a native OpenAI-compatible server API?

I can keep poking at this for a while: adding more tests, supporting more edge cases and provider-specific extra_body params, improving the automatic conversion for providers that don't have a native OpenAI backend, etc. But what's the bar for deciding whether to merge this, just so I know where to focus my efforts? Should I keep polishing everything as much as possible, or do we want to consider merging something sooner and then iteratively improving it over time?
