Eval bug: OpenAI incompatible image handling in server multimodal #12947


Open
kerlion opened this issue Apr 15, 2025 · 2 comments

Comments


kerlion commented Apr 15, 2025

Name and Version

$ llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
version: 5129 (526739b)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA RTX A6000, compute capability 8.6, VMM: yes

Models

Llama-4-Scout-17B-16E-Instruct

Problem description & steps to reproduce

When I invoke the OpenAI-compatible chat completions API with an image, the server returns a 500 error.

First Bad Commit

500: Failed to parse messages: Unsupported content part type: "image_url"; messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "PLS  desc this pic?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,iVBORw0KGgoxxxxxxxxTkSuQmCC"
        }
      }
    ]
  }
]
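For reference, a request of this shape can be reconstructed with a short script. This is a sketch only: the endpoint URL, port, and the placeholder image bytes are assumptions, not values from the original report.

```python
import base64
import json

# Assumed llama-server endpoint; adjust host/port to your setup.
URL = "http://localhost:8080/v1/chat/completions"

# Placeholder bytes standing in for a real PNG; any base64-encoded image
# payload triggers the same code path on the server.
png_b64 = base64.b64encode(bytes.fromhex("89504e470d0a1a0a")).decode()

payload = {
    "model": "Llama-4-Scout-17B-16E-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "PLS  desc this pic?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{png_b64}"},
                },
            ],
        }
    ],
}

# POSTing this body to URL (e.g. requests.post(URL, json=payload))
# reproduces the 500 "Unsupported content part type: image_url" response
# quoted above.
print(json.dumps(payload)[:60])
```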

Relevant log output

got exception: {"code":500,"message":"Failed to parse messages: Unsupported content part type: \"image_url\"; messages = [\n  {\n    \"role\": \"user\",\n    \"content\": [\n      {\n        \"type\": \"text\",\n        \"text\": \"PLS  desc this pic?\"\n      },\n      {\n        \"type\": \"image_url\",\n        \"image_url\": {\n          \"url\": \"data:image/xxxxx\"\n        }\n      }\n    ]\n  }\n]","type":"server_error"}
srv  log_server_r: request: POST /v1/chat/completions 10.13.23.105 500
@betweenus

Hi. llama-server supports only text input.


Fr0d0Beutl1n commented Apr 19, 2025

Then what is image_data? It sounds very similar, if not the same:

image_data: An array of objects holding base64-encoded image data and the IDs used to reference them in the prompt. You can determine the place of the image in the prompt as in the following: USER:[img-12]Describe the image in detail.\nASSISTANT:. In this case, [img-12] will be replaced by the embeddings of the image with id 12 in the accompanying image_data array: {..., "image_data": [{"data": "<BASE64_STRING>", "id": 12}]}. Use image_data only with multimodal models, e.g., LLaVA.

https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md#post-completion-given-a-prompt-it-returns-the-predicted-completion
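The quoted README paragraph describes llama.cpp's own (non-OpenAI) /completion endpoint. A minimal sketch of that payload shape, assuming a local server; "<BASE64_STRING>" is left as a placeholder for real base64 image bytes:

```python
import json

# Assumed llama-server /completion endpoint (legacy, non-OpenAI API).
URL = "http://localhost:8080/completion"

payload = {
    # [img-12] marks where the image embeddings are spliced into the prompt.
    "prompt": "USER:[img-12]Describe the image in detail.\nASSISTANT:",
    # The id here must match the [img-N] tag in the prompt.
    "image_data": [{"data": "<BASE64_STRING>", "id": 12}],
}

# POST with e.g. requests.post(URL, json=payload) against a multimodal model.
print(json.dumps(payload)[:40])
```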
