
feat: updated model card for qwen_2.5_vl #37099


Merged · 17 commits · Apr 3, 2025

Changes from 3 commits
193 changes: 73 additions & 120 deletions docs/source/en/model_doc/qwen2_5_vl.md
@@ -14,166 +14,109 @@ rendered properly in your Markdown viewer.

-->

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# Qwen2.5-VL

The [Qwen2.5-VL](https://qwenlm.github.io/blog/qwen2_5-vl/) model is a multimodal vision-language model developed by the Qwen team, Alibaba Group, combining an enhanced ViT encoder with the Qwen2.5 LLM. As an update to [Qwen2-VL](https://arxiv.org/abs/2409.12191), it offers improved visual reasoning and video understanding capabilities. The model uses a refined ViT architecture with SwiGLU and RMSNorm, making it more aligned with the LLM’s structure.

Qwen2.5-VL introduces window attention in the ViT, accelerating both training and inference. It also supports dynamic resolution across both spatial and temporal dimensions, making it highly effective for video analysis. The upgraded MRoPE (Multimodal Rotary Position Embedding) now includes absolute time alignment on the time axis, enabling the model to capture temporal dynamics across varying frame rates and improving its video comprehension.
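The refined ViT settings travel with the checkpoint's configuration. The snippet below is a minimal sketch for inspecting them without downloading the weights; it only assumes the `Qwen/Qwen2.5-VL-7B-Instruct` checkpoint used throughout this page.

```py
from transformers import AutoConfig

# Load only the configuration and print the vision encoder settings
# (patch size, depth, window-attention related fields, etc.).
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
print(config.vision_config)
```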

You can find all the original Qwen2.5-VL checkpoints under the [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) collection.

> [!TIP]
> Click on the Qwen2.5-VL models in the right sidebar for more examples of how to apply Qwen2.5-VL to different vision and language tasks.

The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline
pipe = pipeline(
    task="image-text-to-text",
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    device="cuda",
    torch_dtype=torch.bfloat16
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ]
    }
]
pipe(text=messages, max_new_tokens=20, return_full_text=False)

```
</hfoption>

<hfoption id="AutoModel">

```py
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {
                "type": "text",
                "text": "Describe this image."
            }
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda:0")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

# Video inference: the model also accepts videos as input
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    video_fps=1,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```
</hfoption>
</hfoptions>

### Batch Mixed Media Inference

The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example, continuing with the model and processor loaded above.

```python
# Conversation for the first image
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image1.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

# Conversation with two images
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image2.jpg"},
            {"type": "image", "path": "/path/to/image3.jpg"},
            {"type": "text", "text": "What is written in the pictures?"}
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "user",
        "content": "who are you?"
    }
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image3.jpg"},
            {"type": "image", "path": "/path/to/image4.jpg"},
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "What are the common elements in these media?"},
        ],
    }
]

conversations = [conversation1, conversation2, conversation3, conversation4]
# Preparation for batch inference
inputs = processor.apply_chat_template(
    conversations,
    video_fps=1,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,  # pad the batched conversations to the same length
).to(model.device)

# Batch inference
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```

### Notes

Qwen2.5-VL is a multimodal conversational model designed for image, video, and text understanding, making it highly versatile for both general-purpose and fine-tuned downstream tasks such as image captioning, visual question answering (VQA), scene description, and video understanding.

#### Image Resolution trade-off

@@ -199,6 +142,16 @@ This ensures each image gets encoded using a number between 256-1024 tokens.
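As a rough sketch of how that token budget can be bounded (this assumes the processor's `min_pixels`/`max_pixels` arguments and the 28×28-pixel-per-token patching used by Qwen2.5-VL's image processor):

```py
from transformers import AutoProcessor

# Each visual token corresponds to a 28x28 pixel patch, so these bounds
# keep every image between roughly 256 and 1024 visual tokens.
min_pixels = 256 * 28 * 28
max_pixels = 1024 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```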
By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
conversation = [
{
"role": "user",