
[Usage]: How can I quickly obtain the number of prompt tokens containing multimodal data? #16191

Open
1 task done
yansh97 opened this issue Apr 7, 2025 · 6 comments
Labels
help wanted (Extra attention is needed), multi-modality (Related to multi-modality (#4194)), usage (How to use vllm)

Comments

@yansh97 (Contributor) commented Apr 7, 2025

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

The /tokenize API can only return the number of prompt tokens counting the text and multimodal placeholder tokens; it cannot return the actual number of prompt tokens once the placeholders are expanded to the real multimodal tokens. @DarkLight1337

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@yansh97 yansh97 added the usage How to use vllm label Apr 7, 2025
@DarkLight1337 (Member)

It's not possible yet. Help is welcome!

@DarkLight1337 DarkLight1337 added the help wanted Extra attention is needed label Apr 7, 2025
@DarkLight1337 DarkLight1337 moved this to Planning in Multi-modality Core Apr 7, 2025
@DarkLight1337 (Member)

We need to call the processor from the API server in order to get the multimodal tokens.
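For illustration, here is a minimal sketch of what calling the processor yields, assuming the HuggingFace AutoProcessor for Qwen/Qwen2.5-VL-7B-Instruct rather than vLLM's internal processor interface; the image URL and prompt match the test case later in this thread.

import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What's in this image?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# The processor expands the image placeholder into the actual image tokens,
# so the length of input_ids is the real prompt token count.
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
print("Actual prompt token count:", inputs["input_ids"].shape[1])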

@DarkLight1337 DarkLight1337 moved this from Planning to Todo in Multi-modality Core Apr 7, 2025
@chaunceyjiang (Contributor)

I can try to fix this issue.

@DarkLight1337 DarkLight1337 added the multi-modality Related to multi-modality (#4194) label Apr 7, 2025
@w013nad commented Apr 7, 2025

+1 Would love to have this feature

@chaunceyjiang (Contributor)

import requests

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
url = "http://localhost:8000/tokenize"
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
}]

# Tokenize the chat messages via the /tokenize endpoint.
data = {"model": model_name, "messages": messages}

response = requests.post(url, json=data)
result = response.json()

print("Token IDs:", result["tokens"])
print("Token count:", result["count"])

output:

Token IDs: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 3838, 594, 304, 419, 2168, 30, 151652, 151655, 151653, 151645, 198, 151644, 77091, 198]
Token count: 28

I have a test case, as shown above. I understand that the generated token IDs contain image placeholders (like <image>), but they do not include tokens for the actual image content. This issue asks for the actual token IDs, i.e. with the image content expanded, not just the placeholder.

Hi @DarkLight1337, I need some help. I'm not sure if I understand this correctly, but it seems that the patch embeddings of the image itself don't have token IDs; the embeddings are computed inside the model.

How should I handle this?

@DarkLight1337 (Member)

Yes, that's why I said you need to apply the multi-modal processor instead of just the tokenizer.
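To make the distinction concrete: with the HuggingFace processor for Qwen/Qwen2.5-VL-7B-Instruct (an assumption for illustration, not vLLM's internal processor API), the image content is represented by the <|image_pad|> token repeated once per image token, so the actual count can be read from the expanded prompt even though the embeddings themselves are computed inside the model. A hedged sketch:

import requests
from PIL import Image
from transformers import AutoProcessor

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
# <|image_pad|> is the placeholder the processor repeats for each image token.
pad_id = processor.tokenizer.convert_tokens_to_ids("<|image_pad|>")

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What's in this image?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenizer only: the image stays as a single placeholder token (what /tokenize returns today).
tokenizer_ids = processor.tokenizer(prompt)["input_ids"]
# Full processor: the placeholder is repeated once per actual image token.
processor_ids = processor(text=[prompt], images=[image], return_tensors="pt")["input_ids"][0].tolist()

print("Image placeholders (tokenizer only):", tokenizer_ids.count(pad_id))
print("Image tokens (processor):", processor_ids.count(pad_id))
print("Actual prompt token count:", len(processor_ids))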
