diff --git a/docs/ci/update_pytorch_version.md b/docs/ci/update_pytorch_version.md index 2ad3430a4de..69fdc82ef97 100644 --- a/docs/ci/update_pytorch_version.md +++ b/docs/ci/update_pytorch_version.md @@ -91,7 +91,7 @@ source to unblock the update process. ### FlashInfer Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271): -``` +```bash export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX' export FLASHINFER_ENABLE_SM90=1 uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1" @@ -105,14 +105,14 @@ team if you want to get the package published there. ### xFormers Similar to FlashInfer, here is how to build and install xFormers from source: -``` +```bash export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX' MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30" ``` ### Mamba -``` +```bash uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4" ``` diff --git a/docs/cli/README.md b/docs/cli/README.md index df700fb743c..b2587a5e7cd 100644 --- a/docs/cli/README.md +++ b/docs/cli/README.md @@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch} Start the vLLM OpenAI Compatible API server. -Examples: +??? Examples -```bash -# Start with a model -vllm serve meta-llama/Llama-2-7b-hf + ```bash + # Start with a model + vllm serve meta-llama/Llama-2-7b-hf -# Specify the port -vllm serve meta-llama/Llama-2-7b-hf --port 8100 + # Specify the port + vllm serve meta-llama/Llama-2-7b-hf --port 8100 -# Check with --help for more options -# To list all groups -vllm serve --help=listgroup + # Check with --help for more options + # To list all groups + vllm serve --help=listgroup -# To view a argument group -vllm serve --help=ModelConfig + # To view a argument group + vllm serve --help=ModelConfig -# To view a single argument -vllm serve --help=max-num-seqs + # To view a single argument + vllm serve --help=max-num-seqs -# To search by keyword -vllm serve --help=max -``` + # To search by keyword + vllm serve --help=max + ``` ## chat Generate chat completions via the running API server. -Examples: - ```bash # Directly connect to localhost API without arguments vllm chat @@ -60,8 +58,6 @@ vllm chat --quick "hi" Generate text completions based on the given prompt via the running API server. -Examples: - ```bash # Directly connect to localhost API without arguments vllm complete @@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1 vllm complete --quick "The future of AI is" ``` + + ## bench Run benchmark tests for latency online serving throughput and offline inference throughput. @@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput} Benchmark the latency of a single batch of requests. -Example: - ```bash vllm bench latency \ --model meta-llama/Llama-3.2-1B-Instruct \ @@ -104,8 +100,6 @@ vllm bench latency \ Benchmark the online serving throughput. -Example: - ```bash vllm bench serve \ --model meta-llama/Llama-3.2-1B-Instruct \ @@ -120,8 +114,6 @@ vllm bench serve \ Benchmark offline inference throughput. -Example: - ```bash vllm bench throughput \ --model meta-llama/Llama-3.2-1B-Instruct \ @@ -143,7 +135,8 @@ vllm collect-env Run batch prompts and write results to file. -Examples: +
+Examples ```bash # Running with a local file @@ -159,6 +152,8 @@ vllm run-batch \ --model meta-llama/Meta-Llama-3-8B-Instruct ``` +
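+
+The batch file passed via `-i` holds one JSON request per line, modeled on the OpenAI-style batch format used by vLLM's example batch files. As a rough sketch (the field names, file name, and prompts below are illustrative assumptions rather than content from this page), such a file can be generated with a few lines of Python:
+
+```python
+# Illustrative only: write a two-request batch file in the OpenAI-style
+# format read by `vllm run-batch` (one JSON request per line).
+import json
+
+requests = [
+    {
+        "custom_id": f"request-{i}",
+        "method": "POST",
+        "url": "/v1/chat/completions",
+        "body": {
+            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+            "messages": [{"role": "user", "content": prompt}],
+            "max_tokens": 64,
+        },
+    }
+    for i, prompt in enumerate(["Hello world!", "What is vLLM?"], start=1)
+]
+
+with open("my_batch.jsonl", "w") as f:
+    for request in requests:
+        f.write(json.dumps(request) + "\n")
+```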
+ ## More Help For detailed options of any subcommand, use: diff --git a/docs/configuration/conserving_memory.md b/docs/configuration/conserving_memory.md index a1283a503a6..e2303067e3e 100644 --- a/docs/configuration/conserving_memory.md +++ b/docs/configuration/conserving_memory.md @@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra me You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage: -```python -from vllm import LLM -from vllm.config import CompilationConfig, CompilationLevel - -llm = LLM( - model="meta-llama/Llama-3.1-8B-Instruct", - compilation_config=CompilationConfig( - level=CompilationLevel.PIECEWISE, - # By default, it goes up to max_num_seqs - cudagraph_capture_sizes=[1, 2, 4, 8, 16], - ), -) -``` +??? Code + + ```python + from vllm import LLM + from vllm.config import CompilationConfig, CompilationLevel + + llm = LLM( + model="meta-llama/Llama-3.1-8B-Instruct", + compilation_config=CompilationConfig( + level=CompilationLevel.PIECEWISE, + # By default, it goes up to max_num_seqs + cudagraph_capture_sizes=[1, 2, 4, 8, 16], + ), + ) + ``` You can disable graph capturing completely via the `enforce_eager` flag: @@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory. Here are some examples: -```python -from vllm import LLM +??? Code -# Available for Qwen2-VL series models -llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", - mm_processor_kwargs={ - "max_pixels": 768 * 768, # Default is 1280 * 28 * 28 - }) - -# Available for InternVL series models -llm = LLM(model="OpenGVLab/InternVL2-2B", - mm_processor_kwargs={ - "max_dynamic_patch": 4, # Default is 12 - }) -``` + ```python + from vllm import LLM + + # Available for Qwen2-VL series models + llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", + mm_processor_kwargs={ + "max_pixels": 768 * 768, # Default is 1280 * 28 * 28 + }) + + # Available for InternVL series models + llm = LLM(model="OpenGVLab/InternVL2-2B", + mm_processor_kwargs={ + "max_dynamic_patch": 4, # Default is 12 + }) + ``` diff --git a/docs/configuration/env_vars.md b/docs/configuration/env_vars.md index f6d548a19d9..c875931c305 100644 --- a/docs/configuration/env_vars.md +++ b/docs/configuration/env_vars.md @@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system: All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables). -```python ---8<-- "vllm/envs.py:env-vars-definition" -``` +??? Code + + ```python + --8<-- "vllm/envs.py:env-vars-definition" + ``` diff --git a/docs/contributing/README.md b/docs/contributing/README.md index 10c50e00724..e977ec3d2f7 100644 --- a/docs/contributing/README.md +++ b/docs/contributing/README.md @@ -93,25 +93,27 @@ For additional features and advanced configurations, refer to the official [MkDo ## Testing -```bash -pip install -r requirements/dev.txt +??? 
note "Commands" -# Linting, formatting and static type checking -pre-commit install --hook-type pre-commit --hook-type commit-msg + ```bash + pip install -r requirements/dev.txt -# You can manually run pre-commit with -pre-commit run --all-files + # Linting, formatting and static type checking + pre-commit install --hook-type pre-commit --hook-type commit-msg -# To manually run something from CI that does not run -# locally by default, you can run: -pre-commit run mypy-3.9 --hook-stage manual --all-files + # You can manually run pre-commit with + pre-commit run --all-files -# Unit tests -pytest tests/ + # To manually run something from CI that does not run + # locally by default, you can run: + pre-commit run mypy-3.9 --hook-stage manual --all-files -# Run tests for a single test file with detailed output -pytest -s -v tests/test_logger.py -``` + # Unit tests + pytest tests/ + + # Run tests for a single test file with detailed output + pytest -s -v tests/test_logger.py + ``` !!! tip Since the ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12. diff --git a/docs/contributing/model/basic.md b/docs/contributing/model/basic.md index 0c0ba337925..644d21482ef 100644 --- a/docs/contributing/model/basic.md +++ b/docs/contributing/model/basic.md @@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons The initialization code should look like this: -```python -from torch import nn -from vllm.config import VllmConfig -from vllm.attention import Attention - -class MyAttention(nn.Module): - def __init__(self, vllm_config: VllmConfig, prefix: str): - super().__init__() - self.attn = Attention(prefix=f"{prefix}.attn") - -class MyDecoderLayer(nn.Module): - def __init__(self, vllm_config: VllmConfig, prefix: str): - super().__init__() - self.self_attn = MyAttention(prefix=f"{prefix}.self_attn") - -class MyModel(nn.Module): - def __init__(self, vllm_config: VllmConfig, prefix: str): - super().__init__() - self.layers = nn.ModuleList( - [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)] - ) - -class MyModelForCausalLM(nn.Module): - def __init__(self, vllm_config: VllmConfig, prefix: str = ""): - super().__init__() - self.model = MyModel(vllm_config, prefix=f"{prefix}.model") -``` +??? 
Code + + ```python + from torch import nn + from vllm.config import VllmConfig + from vllm.attention import Attention + + class MyAttention(nn.Module): + def __init__(self, vllm_config: VllmConfig, prefix: str): + super().__init__() + self.attn = Attention(prefix=f"{prefix}.attn") + + class MyDecoderLayer(nn.Module): + def __init__(self, vllm_config: VllmConfig, prefix: str): + super().__init__() + self.self_attn = MyAttention(prefix=f"{prefix}.self_attn") + + class MyModel(nn.Module): + def __init__(self, vllm_config: VllmConfig, prefix: str): + super().__init__() + self.layers = nn.ModuleList( + [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)] + ) + + class MyModelForCausalLM(nn.Module): + def __init__(self, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + self.model = MyModel(vllm_config, prefix=f"{prefix}.model") + ``` ### Computation Code diff --git a/docs/contributing/model/multimodal.md b/docs/contributing/model/multimodal.md index bed6d4e653d..6ff2abbae63 100644 --- a/docs/contributing/model/multimodal.md +++ b/docs/contributing/model/multimodal.md @@ -25,59 +25,63 @@ Further update the model as follows: - Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs. - ```python - class YourModelForImage2Seq(nn.Module): - ... + ??? Code - def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor: + ```python + class YourModelForImage2Seq(nn.Module): + ... - assert self.vision_encoder is not None - image_features = self.vision_encoder(image_input) - return self.multi_modal_projector(image_features) + def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor: - def get_multimodal_embeddings( - self, **kwargs: object) -> Optional[MultiModalEmbeddings]: + assert self.vision_encoder is not None + image_features = self.vision_encoder(image_input) + return self.multi_modal_projector(image_features) - # Validate the multimodal input keyword arguments - image_input = self._parse_and_validate_image_input(**kwargs) - if image_input is None: - return None + def get_multimodal_embeddings( + self, **kwargs: object) -> Optional[MultiModalEmbeddings]: - # Run multimodal inputs through encoder and projector - vision_embeddings = self._process_image_input(image_input) - return vision_embeddings - ``` + # Validate the multimodal input keyword arguments + image_input = self._parse_and_validate_image_input(**kwargs) + if image_input is None: + return None + + # Run multimodal inputs through encoder and projector + vision_embeddings = self._process_image_input(image_input) + return vision_embeddings + ``` !!! important The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request. - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. 
If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings. - ```python - from .utils import merge_multimodal_embeddings + ??? Code - class YourModelForImage2Seq(nn.Module): - ... + ```python + from .utils import merge_multimodal_embeddings - def get_input_embeddings( - self, - input_ids: torch.Tensor, - multimodal_embeddings: Optional[MultiModalEmbeddings] = None, - ) -> torch.Tensor: - - # `get_input_embeddings` should already be implemented for the language - # model as one of the requirements of basic vLLM model implementation. - inputs_embeds = self.language_model.get_input_embeddings(input_ids) - - if multimodal_embeddings is not None: - inputs_embeds = merge_multimodal_embeddings( - input_ids=input_ids, - inputs_embeds=inputs_embeds, - multimodal_embeddings=multimodal_embeddings, - placeholder_token_id=self.config.image_token_index) - - return inputs_embeds - ``` + class YourModelForImage2Seq(nn.Module): + ... + + def get_input_embeddings( + self, + input_ids: torch.Tensor, + multimodal_embeddings: Optional[MultiModalEmbeddings] = None, + ) -> torch.Tensor: + + # `get_input_embeddings` should already be implemented for the language + # model as one of the requirements of basic vLLM model implementation. + inputs_embeds = self.language_model.get_input_embeddings(input_ids) + + if multimodal_embeddings is not None: + inputs_embeds = merge_multimodal_embeddings( + input_ids=input_ids, + inputs_embeds=inputs_embeds, + multimodal_embeddings=multimodal_embeddings, + placeholder_token_id=self.config.image_token_index) + + return inputs_embeds + ``` - Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model. @@ -135,42 +139,46 @@ Assuming that the memory usage increases with the number of tokens, the dummy in Looking at the code of HF's `LlavaForConditionalGeneration`: - ```python - # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544 - n_image_tokens = (input_ids == self.config.image_token_index).sum().item() - n_image_features = image_features.shape[0] * image_features.shape[1] + ??? 
Code - if n_image_tokens != n_image_features: - raise ValueError( - f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}" + ```python + # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544 + n_image_tokens = (input_ids == self.config.image_token_index).sum().item() + n_image_features = image_features.shape[0] * image_features.shape[1] + + if n_image_tokens != n_image_features: + raise ValueError( + f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}" + ) + special_image_mask = ( + (input_ids == self.config.image_token_index) + .unsqueeze(-1) + .expand_as(inputs_embeds) + .to(inputs_embeds.device) ) - special_image_mask = ( - (input_ids == self.config.image_token_index) - .unsqueeze(-1) - .expand_as(inputs_embeds) - .to(inputs_embeds.device) - ) - image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype) - inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features) - ``` + image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype) + inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features) + ``` The number of placeholder feature tokens per image is `image_features.shape[1]`. `image_features` is calculated inside the `get_image_features` method: - ```python - # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300 - image_outputs = self.vision_tower(pixel_values, output_hidden_states=True) - - selected_image_feature = image_outputs.hidden_states[vision_feature_layer] - if vision_feature_select_strategy == "default": - selected_image_feature = selected_image_feature[:, 1:] - elif vision_feature_select_strategy == "full": - selected_image_feature = selected_image_feature - else: - raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}") - image_features = self.multi_modal_projector(selected_image_feature) - return image_features - ``` + ??? Code + + ```python + # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300 + image_outputs = self.vision_tower(pixel_values, output_hidden_states=True) + + selected_image_feature = image_outputs.hidden_states[vision_feature_layer] + if vision_feature_select_strategy == "default": + selected_image_feature = selected_image_feature[:, 1:] + elif vision_feature_select_strategy == "full": + selected_image_feature = selected_image_feature + else: + raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}") + image_features = self.multi_modal_projector(selected_image_feature) + return image_features + ``` We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower (`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model). 
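+
+As a quick sanity check, here is a back-of-the-envelope sketch for `llava-hf/llava-1.5-7b-hf` using the published CLIP ViT-L/14-336 configuration (the two config values below are assumptions for illustration, not taken from this page):
+
+```python
+# Rough numeric check of the placeholder token count for llava-1.5-7b-hf.
+image_size = 336  # hf_config.vision_config.image_size (assumed)
+patch_size = 14   # hf_config.vision_config.patch_size (assumed)
+
+num_patches = (image_size // patch_size) ** 2  # 24 * 24 = 576 patch embeddings
+num_positions = num_patches + 1                # plus one class embedding = 577
+
+# The "default" feature-select strategy drops the class position,
+# leaving 576 placeholder feature tokens per image.
+num_image_tokens = num_positions - 1
+print(num_image_tokens)  # 576
+```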
@@ -193,20 +201,22 @@ Assuming that the memory usage increases with the number of tokens, the dummy in To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`: - ```python - # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257 - target_dtype = self.patch_embedding.weight.dtype - patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid] - patch_embeds = patch_embeds.flatten(2).transpose(1, 2) - - class_embeds = self.class_embedding.expand(batch_size, 1, -1) - embeddings = torch.cat([class_embeds, patch_embeds], dim=1) - if interpolate_pos_encoding: - embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width) - else: - embeddings = embeddings + self.position_embedding(self.position_ids) - return embeddings - ``` + ??? Code + + ```python + # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257 + target_dtype = self.patch_embedding.weight.dtype + patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid] + patch_embeds = patch_embeds.flatten(2).transpose(1, 2) + + class_embeds = self.class_embedding.expand(batch_size, 1, -1) + embeddings = torch.cat([class_embeds, patch_embeds], dim=1) + if interpolate_pos_encoding: + embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width) + else: + embeddings = embeddings + self.position_embedding(self.position_ids) + return embeddings + ``` We can infer that `embeddings.shape[1] == self.num_positions`, where @@ -218,55 +228,59 @@ Assuming that the memory usage increases with the number of tokens, the dummy in Overall, the number of placeholder feature tokens for an image can be calculated as: - ```python - def get_num_image_tokens( - self, - *, - image_width: int, - image_height: int, - ) -> int: - hf_config = self.get_hf_config() - hf_processor = self.get_hf_processor() + ??? Code - image_size = hf_config.vision_config.image_size - patch_size = hf_config.vision_config.patch_size + ```python + def get_num_image_tokens( + self, + *, + image_width: int, + image_height: int, + ) -> int: + hf_config = self.get_hf_config() + hf_processor = self.get_hf_processor() - num_image_tokens = (image_size // patch_size) ** 2 + 1 - if hf_processor.vision_feature_select_strategy == "default": - num_image_tokens -= 1 + image_size = hf_config.vision_config.image_size + patch_size = hf_config.vision_config.patch_size - return num_image_tokens - ``` + num_image_tokens = (image_size // patch_size) ** 2 + 1 + if hf_processor.vision_feature_select_strategy == "default": + num_image_tokens -= 1 + + return num_image_tokens + ``` Notice that the number of image tokens doesn't depend on the image width and height. We can simply use a dummy `image_size` to calculate the multimodal profiling data: - ```python - # NOTE: In actuality, this is usually implemented as part of the - # model's subclass of `BaseProcessingInfo`, but we show it as is - # here for simplicity. - def get_image_size_with_most_features(self) -> ImageSize: - hf_config = self.get_hf_config() - width = height = hf_config.image_size - return ImageSize(width=width, height=height) + ??? 
Code - def get_dummy_mm_data( - self, - seq_len: int, - mm_counts: Mapping[str, int], - ) -> MultiModalDataDict: - num_images = mm_counts.get("image", 0) - - target_width, target_height = \ - self.info.get_image_size_with_most_features() + ```python + # NOTE: In actuality, this is usually implemented as part of the + # model's subclass of `BaseProcessingInfo`, but we show it as is + # here for simplicity. + def get_image_size_with_most_features(self) -> ImageSize: + hf_config = self.get_hf_config() + width = height = hf_config.image_size + return ImageSize(width=width, height=height) - return { - "image": - self._get_dummy_images(width=target_width, - height=target_height, - num_images=num_images) - } - ``` + def get_dummy_mm_data( + self, + seq_len: int, + mm_counts: Mapping[str, int], + ) -> MultiModalDataDict: + num_images = mm_counts.get("image", 0) + + target_width, target_height = \ + self.info.get_image_size_with_most_features() + + return { + "image": + self._get_dummy_images(width=target_width, + height=target_height, + num_images=num_images) + } + ``` For the text, we simply expand the multimodal image token from the model config to match the desired number of images. @@ -284,21 +298,23 @@ Assuming that the memory usage increases with the number of tokens, the dummy in Looking at the code of HF's `FuyuForCausalLM`: - ```python - # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322 - if image_patches is not None and past_key_values is None: - patch_embeddings = [ - self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype)) - .squeeze(0) - .to(inputs_embeds.device) - for patch in image_patches - ] - inputs_embeds = self.gather_continuous_embeddings( - word_embeddings=inputs_embeds, - continuous_embeddings=patch_embeddings, - image_patch_input_indices=image_patches_indices, - ) - ``` + ??? Code + + ```python + # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322 + if image_patches is not None and past_key_values is None: + patch_embeddings = [ + self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype)) + .squeeze(0) + .to(inputs_embeds.device) + for patch in image_patches + ] + inputs_embeds = self.gather_continuous_embeddings( + word_embeddings=inputs_embeds, + continuous_embeddings=patch_embeddings, + image_patch_input_indices=image_patches_indices, + ) + ``` The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`, which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`. @@ -312,92 +328,98 @@ Assuming that the memory usage increases with the number of tokens, the dummy in In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`, returning the dimensions after resizing (but before padding) as metadata. 
- ```python - # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544 - image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"]) - batch_images = image_encoding["images"] - image_unpadded_heights = image_encoding["image_unpadded_heights"] - image_unpadded_widths = image_encoding["image_unpadded_widths"] - - # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L - if do_resize: - batch_images = [ - [self.resize(image, size=size, input_data_format=input_data_format) for image in images] - for images in batch_images - ] - - image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images] - image_unpadded_heights = [[image_size[0]] for image_size in image_sizes] - image_unpadded_widths = [[image_size[1]] for image_size in image_sizes] - - if do_pad: - batch_images = [ - [ - self.pad_image( - image, - size=size, - mode=padding_mode, - constant_values=padding_value, - input_data_format=input_data_format, - ) - for image in images + ??? Code + + ```python + # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544 + image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"]) + batch_images = image_encoding["images"] + image_unpadded_heights = image_encoding["image_unpadded_heights"] + image_unpadded_widths = image_encoding["image_unpadded_widths"] + + # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L + if do_resize: + batch_images = [ + [self.resize(image, size=size, input_data_format=input_data_format) for image in images] + for images in batch_images ] - for images in batch_images - ] - ``` - In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata: + image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images] + image_unpadded_heights = [[image_size[0]] for image_size in image_sizes] + image_unpadded_widths = [[image_size[1]] for image_size in image_sizes] + + if do_pad: + batch_images = [ + [ + self.pad_image( + image, + size=size, + mode=padding_mode, + constant_values=padding_value, + input_data_format=input_data_format, + ) + for image in images + ] + for images in batch_images + ] + ``` - ```python - # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425 - model_image_input = self.image_processor.preprocess_with_tokenizer_info( - image_input=tensor_batch_images, - image_present=image_present, - image_unpadded_h=image_unpadded_heights, - image_unpadded_w=image_unpadded_widths, - image_placeholder_id=image_placeholder_id, - image_newline_id=image_newline_id, - variable_sized=True, - ) + In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata: - # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658 - image_height, image_width = image.shape[1], image.shape[2] - if variable_sized: # variable_sized=True - new_h = min( - image_height, - math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height, - ) - new_w = min( - image_width, - math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width, + ??? 
Code + + ```python + # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425 + model_image_input = self.image_processor.preprocess_with_tokenizer_info( + image_input=tensor_batch_images, + image_present=image_present, + image_unpadded_h=image_unpadded_heights, + image_unpadded_w=image_unpadded_widths, + image_placeholder_id=image_placeholder_id, + image_newline_id=image_newline_id, + variable_sized=True, ) - image = image[:, :new_h, :new_w] - image_height, image_width = new_h, new_w - num_patches = self.get_num_patches(image_height=image_height, image_width=image_width) - tensor_of_image_ids = torch.full( - [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device - ) - patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0) - assert num_patches == patches.shape[0] - ``` + # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658 + image_height, image_width = image.shape[1], image.shape[2] + if variable_sized: # variable_sized=True + new_h = min( + image_height, + math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height, + ) + new_w = min( + image_width, + math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width, + ) + image = image[:, :new_h, :new_w] + image_height, image_width = new_h, new_w + + num_patches = self.get_num_patches(image_height=image_height, image_width=image_width) + tensor_of_image_ids = torch.full( + [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device + ) + patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0) + assert num_patches == patches.shape[0] + ``` The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`: - ```python - # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562 - patch_size = patch_size if patch_size is not None else self.patch_size - patch_height, patch_width = self.patch_size["height"], self.patch_size["width"] - - if image_height % patch_height != 0: - raise ValueError(f"{image_height=} must be divisible by {patch_height}") - if image_width % patch_width != 0: - raise ValueError(f"{image_width=} must be divisible by {patch_width}") - - num_patches_per_dim_h = image_height // patch_height - num_patches_per_dim_w = image_width // patch_width - num_patches = num_patches_per_dim_h * num_patches_per_dim_w - ``` + ??? Code + + ```python + # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562 + patch_size = patch_size if patch_size is not None else self.patch_size + patch_height, patch_width = self.patch_size["height"], self.patch_size["width"] + + if image_height % patch_height != 0: + raise ValueError(f"{image_height=} must be divisible by {patch_height}") + if image_width % patch_width != 0: + raise ValueError(f"{image_width=} must be divisible by {patch_width}") + + num_patches_per_dim_h = image_height // patch_height + num_patches_per_dim_w = image_width // patch_width + num_patches = num_patches_per_dim_h * num_patches_per_dim_w + ``` These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. 
Since input images are first resized to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`. @@ -419,23 +441,25 @@ Assuming that the memory usage increases with the number of tokens, the dummy in For the multimodal image profiling data, the logic is very similar to LLaVA: - ```python - def get_dummy_mm_data( - self, - seq_len: int, - mm_counts: Mapping[str, int], - ) -> MultiModalDataDict: - target_width, target_height = \ - self.info.get_image_size_with_most_features() - num_images = mm_counts.get("image", 0) + ??? Code - return { - "image": - self._get_dummy_images(width=target_width, - height=target_height, - num_images=num_images) - } - ``` + ```python + def get_dummy_mm_data( + self, + seq_len: int, + mm_counts: Mapping[str, int], + ) -> MultiModalDataDict: + target_width, target_height = \ + self.info.get_image_size_with_most_features() + num_images = mm_counts.get("image", 0) + + return { + "image": + self._get_dummy_images(width=target_width, + height=target_height, + num_images=num_images) + } + ``` ## 4. Specify processing details @@ -455,6 +479,7 @@ return a schema of the tensors outputted by the HF processor that are related to The output of `CLIPImageProcessor` is a simple tensor with shape `(num_images, num_channels, image_height, image_width)`: + ```python # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345 images = [ @@ -505,35 +530,37 @@ return a schema of the tensors outputted by the HF processor that are related to In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA, we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]: - ```python - def _call_hf_processor( - self, - prompt: str, - mm_data: Mapping[str, object], - mm_kwargs: Mapping[str, object], - ) -> BatchFeature: - processed_outputs = super()._call_hf_processor( - prompt=prompt, - mm_data=mm_data, - mm_kwargs=mm_kwargs, - ) + ??? Code - image_patches = processed_outputs.get("image_patches") - if image_patches is not None: - images = mm_data["images"] - assert isinstance(images, list) + ```python + def _call_hf_processor( + self, + prompt: str, + mm_data: Mapping[str, object], + mm_kwargs: Mapping[str, object], + ) -> BatchFeature: + processed_outputs = super()._call_hf_processor( + prompt=prompt, + mm_data=mm_data, + mm_kwargs=mm_kwargs, + ) - # Original output: (1, num_images, Pn, Px * Py * C) - # New output: (num_images, Pn, Px * Py * C) - assert (isinstance(image_patches, list) - and len(image_patches) == 1) - assert (isinstance(image_patches[0], torch.Tensor) - and len(image_patches[0]) == len(images)) + image_patches = processed_outputs.get("image_patches") + if image_patches is not None: + images = mm_data["images"] + assert isinstance(images, list) - processed_outputs["image_patches"] = image_patches[0] + # Original output: (1, num_images, Pn, Px * Py * C) + # New output: (num_images, Pn, Px * Py * C) + assert (isinstance(image_patches, list) + and len(image_patches) == 1) + assert (isinstance(image_patches[0], torch.Tensor) + and len(image_patches[0]) == len(images)) - return processed_outputs - ``` + processed_outputs["image_patches"] = image_patches[0] + + return processed_outputs + ``` !!! 
note Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling @@ -573,35 +600,37 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`). Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows: - ```python - def _get_prompt_updates( - self, - mm_items: MultiModalDataItems, - hf_processor_mm_kwargs: Mapping[str, object], - out_mm_kwargs: MultiModalKwargs, - ) -> Sequence[PromptUpdate]: - hf_config = self.info.get_hf_config() - image_token_id = hf_config.image_token_index + ??? Code - def get_replacement(item_idx: int): - images = mm_items.get_items("image", ImageProcessorItems) - - image_size = images.get_image_size(item_idx) - num_image_tokens = self.info.get_num_image_tokens( - image_width=image_size.width, - image_height=image_size.height, - ) + ```python + def _get_prompt_updates( + self, + mm_items: MultiModalDataItems, + hf_processor_mm_kwargs: Mapping[str, object], + out_mm_kwargs: MultiModalKwargs, + ) -> Sequence[PromptUpdate]: + hf_config = self.info.get_hf_config() + image_token_id = hf_config.image_token_index + + def get_replacement(item_idx: int): + images = mm_items.get_items("image", ImageProcessorItems) + + image_size = images.get_image_size(item_idx) + num_image_tokens = self.info.get_num_image_tokens( + image_width=image_size.width, + image_height=image_size.height, + ) - return [image_token_id] * num_image_tokens + return [image_token_id] * num_image_tokens - return [ - PromptReplacement( - modality="image", - target=[image_token_id], - replacement=get_replacement, - ), - ] - ``` + return [ + PromptReplacement( + modality="image", + target=[image_token_id], + replacement=get_replacement, + ), + ] + ``` === "Handling additional tokens: Fuyu" @@ -616,117 +645,90 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies We define a helper function to return `ncols` and `nrows` directly: - ```python - def get_image_feature_grid_size( - self, - *, - image_width: int, - image_height: int, - ) -> tuple[int, int]: - image_processor = self.get_image_processor() - target_width = image_processor.size["width"] - target_height = image_processor.size["height"] - patch_width = image_processor.patch_size["width"] - patch_height = image_processor.patch_size["height"] - - if not (image_width <= target_width and image_height <= target_height): - height_scale_factor = target_height / image_height - width_scale_factor = target_width / image_width - optimal_scale_factor = min(height_scale_factor, width_scale_factor) - - image_height = int(image_height * optimal_scale_factor) - image_width = int(image_width * optimal_scale_factor) - - ncols = math.ceil(image_width / patch_width) - nrows = math.ceil(image_height / patch_height) - return ncols, nrows - ``` + ??? 
Code + + ```python + def get_image_feature_grid_size( + self, + *, + image_width: int, + image_height: int, + ) -> tuple[int, int]: + image_processor = self.get_image_processor() + target_width = image_processor.size["width"] + target_height = image_processor.size["height"] + patch_width = image_processor.patch_size["width"] + patch_height = image_processor.patch_size["height"] + + if not (image_width <= target_width and image_height <= target_height): + height_scale_factor = target_height / image_height + width_scale_factor = target_width / image_width + optimal_scale_factor = min(height_scale_factor, width_scale_factor) + + image_height = int(image_height * optimal_scale_factor) + image_width = int(image_width * optimal_scale_factor) + + ncols = math.ceil(image_width / patch_width) + nrows = math.ceil(image_height / patch_height) + return ncols, nrows + ``` Based on this, we can initially define our replacement tokens as: - ```python - def get_replacement(item_idx: int): - images = mm_items.get_items("image", ImageProcessorItems) - image_size = images.get_image_size(item_idx) + ??? Code - ncols, nrows = self.info.get_image_feature_grid_size( - image_width=image_size.width, - image_height=image_size.height, - ) + ```python + def get_replacement(item_idx: int): + images = mm_items.get_items("image", ImageProcessorItems) + image_size = images.get_image_size(item_idx) - # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|` - # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|` - return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows - ``` + ncols, nrows = self.info.get_image_feature_grid_size( + image_width=image_size.width, + image_height=image_size.height, + ) + + # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|` + # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|` + return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows + ``` However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called, a BOS token (``) is also added to the promopt: - ```python - # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435 - model_image_input = self.image_processor.preprocess_with_tokenizer_info( - image_input=tensor_batch_images, - image_present=image_present, - image_unpadded_h=image_unpadded_heights, - image_unpadded_w=image_unpadded_widths, - image_placeholder_id=image_placeholder_id, - image_newline_id=image_newline_id, - variable_sized=True, - ) - prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch( - tokenizer=self.tokenizer, - prompts=prompts, - scale_factors=scale_factors, - max_tokens_to_generate=self.max_tokens_to_generate, - max_position_embeddings=self.max_position_embeddings, - add_BOS=True, - add_beginning_of_answer_token=True, - ) - ``` + ??? 
Code + + ```python + # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435 + model_image_input = self.image_processor.preprocess_with_tokenizer_info( + image_input=tensor_batch_images, + image_present=image_present, + image_unpadded_h=image_unpadded_heights, + image_unpadded_w=image_unpadded_widths, + image_placeholder_id=image_placeholder_id, + image_newline_id=image_newline_id, + variable_sized=True, + ) + prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch( + tokenizer=self.tokenizer, + prompts=prompts, + scale_factors=scale_factors, + max_tokens_to_generate=self.max_tokens_to_generate, + max_position_embeddings=self.max_position_embeddings, + add_BOS=True, + add_beginning_of_answer_token=True, + ) + ``` To assign the vision embeddings to only the image tokens, instead of a string you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]: - ```python - hf_config = self.info.get_hf_config() - bos_token_id = hf_config.bos_token_id # `` - assert isinstance(bos_token_id, int) - - def get_replacement_fuyu(item_idx: int): - images = mm_items.get_items("image", ImageProcessorItems) - image_size = images.get_image_size(item_idx) + ??? Code - ncols, nrows = self.info.get_image_feature_grid_size( - image_width=image_size.width, - image_height=image_size.height, - ) - image_tokens = ([_IMAGE_TOKEN_ID] * ncols + - [_NEWLINE_TOKEN_ID]) * nrows - - return PromptUpdateDetails.select_token_id( - image_tokens + [bos_token_id], - embed_token_id=_IMAGE_TOKEN_ID, - ) - ``` - - Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt, - we can search for it to conduct the replacement at the start of the string: - - ```python - def _get_prompt_updates( - self, - mm_items: MultiModalDataItems, - hf_processor_mm_kwargs: Mapping[str, object], - out_mm_kwargs: MultiModalKwargs, - ) -> Sequence[PromptUpdate]: + ```python hf_config = self.info.get_hf_config() - bos_token_id = hf_config.bos_token_id + bos_token_id = hf_config.bos_token_id # `` assert isinstance(bos_token_id, int) - tokenizer = self.info.get_tokenizer() - eot_token_id = tokenizer.bos_token_id - assert isinstance(eot_token_id, int) - def get_replacement_fuyu(item_idx: int): images = mm_items.get_items("image", ImageProcessorItems) image_size = images.get_image_size(item_idx) @@ -742,15 +744,52 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies image_tokens + [bos_token_id], embed_token_id=_IMAGE_TOKEN_ID, ) + ``` - return [ - PromptReplacement( - modality="image", - target=[eot_token_id], - replacement=get_replacement_fuyu, - ) - ] - ``` + Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt, + we can search for it to conduct the replacement at the start of the string: + + ??? 
Code + + ```python + def _get_prompt_updates( + self, + mm_items: MultiModalDataItems, + hf_processor_mm_kwargs: Mapping[str, object], + out_mm_kwargs: MultiModalKwargs, + ) -> Sequence[PromptUpdate]: + hf_config = self.info.get_hf_config() + bos_token_id = hf_config.bos_token_id + assert isinstance(bos_token_id, int) + + tokenizer = self.info.get_tokenizer() + eot_token_id = tokenizer.bos_token_id + assert isinstance(eot_token_id, int) + + def get_replacement_fuyu(item_idx: int): + images = mm_items.get_items("image", ImageProcessorItems) + image_size = images.get_image_size(item_idx) + + ncols, nrows = self.info.get_image_feature_grid_size( + image_width=image_size.width, + image_height=image_size.height, + ) + image_tokens = ([_IMAGE_TOKEN_ID] * ncols + + [_NEWLINE_TOKEN_ID]) * nrows + + return PromptUpdateDetails.select_token_id( + image_tokens + [bos_token_id], + embed_token_id=_IMAGE_TOKEN_ID, + ) + + return [ + PromptReplacement( + modality="image", + target=[eot_token_id], + replacement=get_replacement_fuyu, + ) + ] + ``` ## 5. Register processor-related classes diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md index be01b9b65f6..6d6366741aa 100644 --- a/docs/contributing/profiling.md +++ b/docs/contributing/profiling.md @@ -97,26 +97,26 @@ to manually kill the profiler and generate your `nsys-rep` report. You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started). -CLI example: - -```bash -nsys stats report1.nsys-rep -... - ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum): - - Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name - -------- --------------- --------- ----------- ----------- -------- --------- ----------- ---------------------------------------------------------------------------------------------------- - 46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of… - 14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of… - 12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off… - 9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_… - 5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel, (bool)1>(T1 *, cons… - 4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel(int)0&&vllm::_typeConvert::exists, void>::type vllm::fused_add_rms_norm_kern… - 1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel(const long *, T1 *, T1 *, const T1 *, in… - 0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0… -... -``` +??? CLI example + + ```bash + nsys stats report1.nsys-rep + ... 
+ ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum): + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ----------- ----------- -------- --------- ----------- ---------------------------------------------------------------------------------------------------- + 46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of… + 14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of… + 12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off… + 9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_… + 5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel, (bool)1>(T1 *, cons… + 4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel(int)0&&vllm::_typeConvert::exists, void>::type vllm::fused_add_rms_norm_kern… + 1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel(const long *, T1 *, T1 *, const T1 *, in… + 0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0… + ... + ``` GUI example: diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md index 93d9e80f5b0..eb84db7871e 100644 --- a/docs/deployment/docker.md +++ b/docs/deployment/docker.md @@ -97,19 +97,21 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits. Keep an eye on memory usage with parallel jobs as it can be substantial (see example below). -```console -# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB) -python3 use_existing_torch.py -DOCKER_BUILDKIT=1 docker build . \ - --file docker/Dockerfile \ - --target vllm-openai \ - --platform "linux/arm64" \ - -t vllm/vllm-gh200-openai:latest \ - --build-arg max_jobs=66 \ - --build-arg nvcc_threads=2 \ - --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \ - --build-arg vllm_fa_cmake_gpu_arches="90-real" -``` +??? Command + + ```console + # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB) + python3 use_existing_torch.py + DOCKER_BUILDKIT=1 docker build . \ + --file docker/Dockerfile \ + --target vllm-openai \ + --platform "linux/arm64" \ + -t vllm/vllm-gh200-openai:latest \ + --build-arg max_jobs=66 \ + --build-arg nvcc_threads=2 \ + --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \ + --build-arg vllm_fa_cmake_gpu_arches="90-real" + ``` !!! note If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution. 
diff --git a/docs/deployment/frameworks/autogen.md b/docs/deployment/frameworks/autogen.md index ad8c167659e..295664daead 100644 --- a/docs/deployment/frameworks/autogen.md +++ b/docs/deployment/frameworks/autogen.md @@ -30,51 +30,53 @@ python -m vllm.entrypoints.openai.api_server \ - Call it with AutoGen: -```python -import asyncio -from autogen_core.models import UserMessage -from autogen_ext.models.openai import OpenAIChatCompletionClient -from autogen_core.models import ModelFamily - - -async def main() -> None: - # Create a model client - model_client = OpenAIChatCompletionClient( - model="mistralai/Mistral-7B-Instruct-v0.2", - base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1", - api_key="EMPTY", - model_info={ - "vision": False, - "function_calling": False, - "json_output": False, - "family": ModelFamily.MISTRAL, - "structured_output": True, - }, - ) - - messages = [UserMessage(content="Write a very short story about a dragon.", source="user")] - - # Create a stream. - stream = model_client.create_stream(messages=messages) - - # Iterate over the stream and print the responses. - print("Streamed responses:") - async for response in stream: - if isinstance(response, str): - # A partial response is a string. - print(response, flush=True, end="") - else: - # The last response is a CreateResult object with the complete message. - print("\n\n------------\n") - print("The complete response:", flush=True) - print(response.content, flush=True) - - # Close the client when done. - await model_client.close() - - -asyncio.run(main()) -``` +??? Code + + ```python + import asyncio + from autogen_core.models import UserMessage + from autogen_ext.models.openai import OpenAIChatCompletionClient + from autogen_core.models import ModelFamily + + + async def main() -> None: + # Create a model client + model_client = OpenAIChatCompletionClient( + model="mistralai/Mistral-7B-Instruct-v0.2", + base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1", + api_key="EMPTY", + model_info={ + "vision": False, + "function_calling": False, + "json_output": False, + "family": ModelFamily.MISTRAL, + "structured_output": True, + }, + ) + + messages = [UserMessage(content="Write a very short story about a dragon.", source="user")] + + # Create a stream. + stream = model_client.create_stream(messages=messages) + + # Iterate over the stream and print the responses. + print("Streamed responses:") + async for response in stream: + if isinstance(response, str): + # A partial response is a string. + print(response, flush=True, end="") + else: + # The last response is a CreateResult object with the complete message. + print("\n\n------------\n") + print("The complete response:", flush=True) + print(response.content, flush=True) + + # Close the client when done. + await model_client.close() + + + asyncio.run(main()) + ``` For details, see the tutorial: diff --git a/docs/deployment/frameworks/cerebrium.md b/docs/deployment/frameworks/cerebrium.md index 84cb2304fac..8e096f26db7 100644 --- a/docs/deployment/frameworks/cerebrium.md +++ b/docs/deployment/frameworks/cerebrium.md @@ -34,25 +34,27 @@ vllm = "latest" Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`: -```python -from vllm import LLM, SamplingParams +??? 
Code -llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1") + ```python + from vllm import LLM, SamplingParams -def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95): + llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1") - sampling_params = SamplingParams(temperature=temperature, top_p=top_p) - outputs = llm.generate(prompts, sampling_params) + def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95): - # Print the outputs. - results = [] - for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - results.append({"prompt": prompt, "generated_text": generated_text}) + sampling_params = SamplingParams(temperature=temperature, top_p=top_p) + outputs = llm.generate(prompts, sampling_params) - return {"results": results} -``` + # Print the outputs. + results = [] + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + results.append({"prompt": prompt, "generated_text": generated_text}) + + return {"results": results} + ``` Then, run the following code to deploy it to the cloud: @@ -62,47 +64,51 @@ cerebrium deploy If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`) -```python -curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \ - -H 'Content-Type: application/json' \ - -H 'Authorization: ' \ - --data '{ - "prompts": [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is" - ] - }' -``` +??? Command + + ```python + curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \ + -H 'Content-Type: application/json' \ + -H 'Authorization: ' \ + --data '{ + "prompts": [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is" + ] + }' + ``` You should get a response like: -```python -{ - "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262", - "result": { - "result": [ - { - "prompt": "Hello, my name is", - "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of" - }, - { - "prompt": "The president of the United States is", - "generated_text": " elected every four years. This is a democratic system.\n\n5. What" - }, - { - "prompt": "The capital of France is", - "generated_text": " Paris.\n" - }, - { - "prompt": "The future of AI is", - "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective." - } - ] - }, - "run_time_ms": 152.53663063049316 -} -``` +??? Response + + ```python + { + "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262", + "result": { + "result": [ + { + "prompt": "Hello, my name is", + "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of" + }, + { + "prompt": "The president of the United States is", + "generated_text": " elected every four years. This is a democratic system.\n\n5. What" + }, + { + "prompt": "The capital of France is", + "generated_text": " Paris.\n" + }, + { + "prompt": "The future of AI is", + "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective." + } + ] + }, + "run_time_ms": 152.53663063049316 + } + ``` You now have an autoscaling endpoint where you only pay for the compute you use! 
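+
+The same endpoint can also be called from Python. Below is a minimal sketch using the `requests` library; the URL and auth token are placeholders mirroring the curl example above:
+
+```python
+# Minimal sketch: call the deployed Cerebrium endpoint from Python.
+# The URL and token are placeholders mirroring the curl example above.
+import requests
+
+response = requests.post(
+    "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run",
+    headers={
+        "Content-Type": "application/json",
+        "Authorization": "<YOUR-AUTH-TOKEN>",
+    },
+    json={"prompts": ["Hello, my name is", "The future of AI is"]},
+    timeout=60,
+)
+
+print(response.status_code)
+print(response.json())  # shape matches the sample response shown above
+```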
diff --git a/docs/deployment/frameworks/dstack.md b/docs/deployment/frameworks/dstack.md index 7de92855745..0b91fc88ce3 100644 --- a/docs/deployment/frameworks/dstack.md +++ b/docs/deployment/frameworks/dstack.md @@ -26,75 +26,81 @@ dstack init Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`: -```yaml -type: service - -python: "3.11" -env: - - MODEL=NousResearch/Llama-2-7b-chat-hf -port: 8000 -resources: - gpu: 24GB -commands: - - pip install vllm - - vllm serve $MODEL --port 8000 -model: - format: openai - type: chat - name: NousResearch/Llama-2-7b-chat-hf -``` +??? Config + + ```yaml + type: service + + python: "3.11" + env: + - MODEL=NousResearch/Llama-2-7b-chat-hf + port: 8000 + resources: + gpu: 24GB + commands: + - pip install vllm + - vllm serve $MODEL --port 8000 + model: + format: openai + type: chat + name: NousResearch/Llama-2-7b-chat-hf + ``` Then, run the following CLI for provisioning: -```console -$ dstack run . -f serve.dstack.yml - -⠸ Getting run plan... - Configuration serve.dstack.yml - Project deep-diver-main - User deep-diver - Min resources 2..xCPU, 8GB.., 1xGPU (24GB) - Max price - - Max duration - - Spot policy auto - Retry policy no - - # BACKEND REGION INSTANCE RESOURCES SPOT PRICE - 1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 - 2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 - 3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 - ... - Shown 3 of 193 offers, $5.876 max - -Continue? [y/n]: y -⠙ Submitting run... -⠏ Launching spicy-treefrog-1 (pulling) -spicy-treefrog-1 provisioning completed (running) -Service is published at ... -``` +??? Command + + ```console + $ dstack run . -f serve.dstack.yml + + ⠸ Getting run plan... + Configuration serve.dstack.yml + Project deep-diver-main + User deep-diver + Min resources 2..xCPU, 8GB.., 1xGPU (24GB) + Max price - + Max duration - + Spot policy auto + Retry policy no + + # BACKEND REGION INSTANCE RESOURCES SPOT PRICE + 1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 + 2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 + 3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 + ... + Shown 3 of 193 offers, $5.876 max + + Continue? [y/n]: y + ⠙ Submitting run... + ⠏ Launching spicy-treefrog-1 (pulling) + spicy-treefrog-1 provisioning completed (running) + Service is published at ... + ``` After the provisioning, you can interact with the model by using the OpenAI SDK: -```python -from openai import OpenAI - -client = OpenAI( - base_url="https://gateway.", - api_key="" -) - -completion = client.chat.completions.create( - model="NousResearch/Llama-2-7b-chat-hf", - messages=[ - { - "role": "user", - "content": "Compose a poem that explains the concept of recursion in programming.", - } - ] -) - -print(completion.choices[0].message.content) -``` +??? Code + + ```python + from openai import OpenAI + + client = OpenAI( + base_url="https://gateway.", + api_key="" + ) + + completion = client.chat.completions.create( + model="NousResearch/Llama-2-7b-chat-hf", + messages=[ + { + "role": "user", + "content": "Compose a poem that explains the concept of recursion in programming.", + } + ] + ) + + print(completion.choices[0].message.content) + ``` !!! 
note dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm) diff --git a/docs/deployment/frameworks/haystack.md b/docs/deployment/frameworks/haystack.md index 2eac4a5279f..04d9eba3065 100644 --- a/docs/deployment/frameworks/haystack.md +++ b/docs/deployment/frameworks/haystack.md @@ -27,29 +27,29 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1 - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server. -```python -from haystack.components.generators.chat import OpenAIChatGenerator -from haystack.dataclasses import ChatMessage -from haystack.utils import Secret - -generator = OpenAIChatGenerator( - # for compatibility with the OpenAI API, a placeholder api_key is needed - api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"), - model="mistralai/Mistral-7B-Instruct-v0.1", - api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1", - generation_kwargs = {"max_tokens": 512} -) - -response = generator.run( - messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")] -) - -print("-"*30) -print(response) -print("-"*30) -``` - -Output e.g.: +??? Code + + ```python + from haystack.components.generators.chat import OpenAIChatGenerator + from haystack.dataclasses import ChatMessage + from haystack.utils import Secret + + generator = OpenAIChatGenerator( + # for compatibility with the OpenAI API, a placeholder api_key is needed + api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"), + model="mistralai/Mistral-7B-Instruct-v0.1", + api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1", + generation_kwargs = {"max_tokens": 512} + ) + + response = generator.run( + messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")] + ) + + print("-"*30) + print(response) + print("-"*30) + ``` ```console ------------------------------ diff --git a/docs/deployment/frameworks/litellm.md b/docs/deployment/frameworks/litellm.md index 3011cde8301..8498feaa297 100644 --- a/docs/deployment/frameworks/litellm.md +++ b/docs/deployment/frameworks/litellm.md @@ -34,21 +34,23 @@ vllm serve qwen/Qwen1.5-0.5B-Chat - Call it with litellm: -```python -import litellm +??? 
Code -messages = [{ "content": "Hello, how are you?","role": "user"}] + ```python + import litellm -# hosted_vllm is prefix key word and necessary -response = litellm.completion( - model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name - messages=messages, - api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1", - temperature=0.2, - max_tokens=80) - -print(response) -``` + messages = [{ "content": "Hello, how are you?","role": "user"}] + + # hosted_vllm is prefix key word and necessary + response = litellm.completion( + model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name + messages=messages, + api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1", + temperature=0.2, + max_tokens=80) + + print(response) + ``` ### Embeddings diff --git a/docs/deployment/frameworks/lws.md b/docs/deployment/frameworks/lws.md index 18282a89ddf..9df95287690 100644 --- a/docs/deployment/frameworks/lws.md +++ b/docs/deployment/frameworks/lws.md @@ -17,99 +17,101 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber Deploy the following yaml file `lws.yaml` -```yaml -apiVersion: leaderworkerset.x-k8s.io/v1 -kind: LeaderWorkerSet -metadata: - name: vllm -spec: - replicas: 2 - leaderWorkerTemplate: - size: 2 - restartPolicy: RecreateGroupOnPodRestart - leaderTemplate: - metadata: - labels: - role: leader - spec: - containers: - - name: vllm-leader - image: docker.io/vllm/vllm-openai:latest - env: - - name: HUGGING_FACE_HUB_TOKEN - value: - command: - - sh - - -c - - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); - python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2" - resources: - limits: - nvidia.com/gpu: "8" - memory: 1124Gi - ephemeral-storage: 800Gi - requests: - ephemeral-storage: 800Gi - cpu: 125 - ports: - - containerPort: 8080 - readinessProbe: - tcpSocket: - port: 8080 - initialDelaySeconds: 15 - periodSeconds: 10 - volumeMounts: - - mountPath: /dev/shm - name: dshm - volumes: - - name: dshm - emptyDir: - medium: Memory - sizeLimit: 15Gi - workerTemplate: - spec: - containers: - - name: vllm-worker - image: docker.io/vllm/vllm-openai:latest - command: - - sh - - -c - - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)" - resources: - limits: - nvidia.com/gpu: "8" - memory: 1124Gi - ephemeral-storage: 800Gi - requests: - ephemeral-storage: 800Gi - cpu: 125 - env: - - name: HUGGING_FACE_HUB_TOKEN - value: - volumeMounts: - - mountPath: /dev/shm - name: dshm - volumes: - - name: dshm - emptyDir: - medium: Memory - sizeLimit: 15Gi ---- -apiVersion: v1 -kind: Service -metadata: - name: vllm-leader -spec: - ports: - - name: http - port: 8080 - protocol: TCP - targetPort: 8080 - selector: - leaderworkerset.sigs.k8s.io/name: vllm - role: leader - type: ClusterIP -``` +??? 
Yaml + + ```yaml + apiVersion: leaderworkerset.x-k8s.io/v1 + kind: LeaderWorkerSet + metadata: + name: vllm + spec: + replicas: 2 + leaderWorkerTemplate: + size: 2 + restartPolicy: RecreateGroupOnPodRestart + leaderTemplate: + metadata: + labels: + role: leader + spec: + containers: + - name: vllm-leader + image: docker.io/vllm/vllm-openai:latest + env: + - name: HUGGING_FACE_HUB_TOKEN + value: + command: + - sh + - -c + - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); + python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2" + resources: + limits: + nvidia.com/gpu: "8" + memory: 1124Gi + ephemeral-storage: 800Gi + requests: + ephemeral-storage: 800Gi + cpu: 125 + ports: + - containerPort: 8080 + readinessProbe: + tcpSocket: + port: 8080 + initialDelaySeconds: 15 + periodSeconds: 10 + volumeMounts: + - mountPath: /dev/shm + name: dshm + volumes: + - name: dshm + emptyDir: + medium: Memory + sizeLimit: 15Gi + workerTemplate: + spec: + containers: + - name: vllm-worker + image: docker.io/vllm/vllm-openai:latest + command: + - sh + - -c + - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)" + resources: + limits: + nvidia.com/gpu: "8" + memory: 1124Gi + ephemeral-storage: 800Gi + requests: + ephemeral-storage: 800Gi + cpu: 125 + env: + - name: HUGGING_FACE_HUB_TOKEN + value: + volumeMounts: + - mountPath: /dev/shm + name: dshm + volumes: + - name: dshm + emptyDir: + medium: Memory + sizeLimit: 15Gi + --- + apiVersion: v1 + kind: Service + metadata: + name: vllm-leader + spec: + ports: + - name: http + port: 8080 + protocol: TCP + targetPort: 8080 + selector: + leaderworkerset.sigs.k8s.io/name: vllm + role: leader + type: ClusterIP + ``` ```bash kubectl apply -f lws.yaml @@ -175,25 +177,27 @@ curl http://localhost:8080/v1/completions \ The output should be similar to the following -```text -{ - "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d", - "object": "text_completion", - "created": 1715138766, - "model": "meta-llama/Meta-Llama-3.1-405B-Instruct", - "choices": [ +??? Output + + ```text { - "index": 0, - "text": " top destination for foodies, with", - "logprobs": null, - "finish_reason": "length", - "stop_reason": null + "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d", + "object": "text_completion", + "created": 1715138766, + "model": "meta-llama/Meta-Llama-3.1-405B-Instruct", + "choices": [ + { + "index": 0, + "text": " top destination for foodies, with", + "logprobs": null, + "finish_reason": "length", + "stop_reason": null + } + ], + "usage": { + "prompt_tokens": 5, + "total_tokens": 12, + "completion_tokens": 7 + } } - ], - "usage": { - "prompt_tokens": 5, - "total_tokens": 12, - "completion_tokens": 7 - } -} -``` + ``` diff --git a/docs/deployment/frameworks/skypilot.md b/docs/deployment/frameworks/skypilot.md index 9763745f237..b649312971b 100644 --- a/docs/deployment/frameworks/skypilot.md +++ b/docs/deployment/frameworks/skypilot.md @@ -24,48 +24,50 @@ sky check See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml). -```yaml -resources: - accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. - use_spot: True - disk_size: 512 # Ensure model checkpoints can fit. - disk_tier: best - ports: 8081 # Expose to internet traffic. 
- -envs: - MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct - HF_TOKEN: # Change to your own huggingface token, or use --env to pass. - -setup: | - conda create -n vllm python=3.10 -y - conda activate vllm - - pip install vllm==0.4.0.post1 - # Install Gradio for web UI. - pip install gradio openai - pip install flash-attn==2.5.7 - -run: | - conda activate vllm - echo 'Starting vllm api server...' - python -u -m vllm.entrypoints.openai.api_server \ - --port 8081 \ - --model $MODEL_NAME \ - --trust-remote-code \ - --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ - 2>&1 | tee api_server.log & - - echo 'Waiting for vllm api server to start...' - while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done - - echo 'Starting gradio server...' - git clone https://github.com/vllm-project/vllm.git || true - python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \ - -m $MODEL_NAME \ - --port 8811 \ - --model-url http://localhost:8081/v1 \ - --stop-token-ids 128009,128001 -``` +??? Yaml + + ```yaml + resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. + use_spot: True + disk_size: 512 # Ensure model checkpoints can fit. + disk_tier: best + ports: 8081 # Expose to internet traffic. + + envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # Change to your own huggingface token, or use --env to pass. + + setup: | + conda create -n vllm python=3.10 -y + conda activate vllm + + pip install vllm==0.4.0.post1 + # Install Gradio for web UI. + pip install gradio openai + pip install flash-attn==2.5.7 + + run: | + conda activate vllm + echo 'Starting vllm api server...' + python -u -m vllm.entrypoints.openai.api_server \ + --port 8081 \ + --model $MODEL_NAME \ + --trust-remote-code \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + 2>&1 | tee api_server.log & + + echo 'Waiting for vllm api server to start...' + while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done + + echo 'Starting gradio server...' + git clone https://github.com/vllm-project/vllm.git || true + python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \ + -m $MODEL_NAME \ + --port 8811 \ + --model-url http://localhost:8081/v1 \ + --stop-token-ids 128009,128001 + ``` Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...): @@ -93,68 +95,67 @@ HF_TOKEN="your-huggingface-token" \ SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file. -```yaml -service: - replicas: 2 - # An actual request for readiness probe. - readiness_probe: - path: /v1/chat/completions - post_data: - model: $MODEL_NAME - messages: - - role: user - content: Hello! What is your name? - max_completion_tokens: 1 -``` - -
-Click to see the full recipe YAML - -```yaml -service: - replicas: 2 - # An actual request for readiness probe. - readiness_probe: - path: /v1/chat/completions - post_data: - model: $MODEL_NAME - messages: - - role: user - content: Hello! What is your name? +??? Yaml + + ```yaml + service: + replicas: 2 + # An actual request for readiness probe. + readiness_probe: + path: /v1/chat/completions + post_data: + model: $MODEL_NAME + messages: + - role: user + content: Hello! What is your name? max_completion_tokens: 1 + ``` -resources: - accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. - use_spot: True - disk_size: 512 # Ensure model checkpoints can fit. - disk_tier: best - ports: 8081 # Expose to internet traffic. - -envs: - MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct - HF_TOKEN: # Change to your own huggingface token, or use --env to pass. - -setup: | - conda create -n vllm python=3.10 -y - conda activate vllm - - pip install vllm==0.4.0.post1 - # Install Gradio for web UI. - pip install gradio openai - pip install flash-attn==2.5.7 - -run: | - conda activate vllm - echo 'Starting vllm api server...' - python -u -m vllm.entrypoints.openai.api_server \ - --port 8081 \ - --model $MODEL_NAME \ - --trust-remote-code \ - --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ - 2>&1 | tee api_server.log -``` - -
+??? Yaml + + ```yaml + service: + replicas: 2 + # An actual request for readiness probe. + readiness_probe: + path: /v1/chat/completions + post_data: + model: $MODEL_NAME + messages: + - role: user + content: Hello! What is your name? + max_completion_tokens: 1 + + resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. + use_spot: True + disk_size: 512 # Ensure model checkpoints can fit. + disk_tier: best + ports: 8081 # Expose to internet traffic. + + envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # Change to your own huggingface token, or use --env to pass. + + setup: | + conda create -n vllm python=3.10 -y + conda activate vllm + + pip install vllm==0.4.0.post1 + # Install Gradio for web UI. + pip install gradio openai + pip install flash-attn==2.5.7 + + run: | + conda activate vllm + echo 'Starting vllm api server...' + python -u -m vllm.entrypoints.openai.api_server \ + --port 8081 \ + --model $MODEL_NAME \ + --trust-remote-code \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + 2>&1 | tee api_server.log + ``` Start the serving the Llama-3 8B model on multiple replicas: @@ -170,8 +171,7 @@ Wait until the service is ready: watch -n10 sky serve status vllm ``` -
-Example outputs: +Example outputs: ```console Services @@ -184,29 +184,29 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4 ``` -
- After the service is READY, you can find a single endpoint for the service and access the service with the endpoint: -```console -ENDPOINT=$(sky serve status --endpoint 8081 vllm) -curl -L http://$ENDPOINT/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "meta-llama/Meta-Llama-3-8B-Instruct", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "Who are you?" - } - ], - "stop_token_ids": [128009, 128001] - }' -``` +??? Commands + + ```bash + ENDPOINT=$(sky serve status --endpoint 8081 vllm) + curl -L http://$ENDPOINT/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Meta-Llama-3-8B-Instruct", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Who are you?" + } + ], + "stop_token_ids": [128009, 128001] + }' + ``` To enable autoscaling, you could replace the `replicas` with the following configs in `service`: @@ -220,57 +220,54 @@ service: This will scale the service up to when the QPS exceeds 2 for each replica. -
-Click to see the full recipe YAML - -```yaml -service: - replica_policy: - min_replicas: 2 - max_replicas: 4 - target_qps_per_replica: 2 - # An actual request for readiness probe. - readiness_probe: - path: /v1/chat/completions - post_data: - model: $MODEL_NAME - messages: - - role: user - content: Hello! What is your name? - max_completion_tokens: 1 - -resources: - accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. - use_spot: True - disk_size: 512 # Ensure model checkpoints can fit. - disk_tier: best - ports: 8081 # Expose to internet traffic. - -envs: - MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct - HF_TOKEN: # Change to your own huggingface token, or use --env to pass. - -setup: | - conda create -n vllm python=3.10 -y - conda activate vllm - - pip install vllm==0.4.0.post1 - # Install Gradio for web UI. - pip install gradio openai - pip install flash-attn==2.5.7 - -run: | - conda activate vllm - echo 'Starting vllm api server...' - python -u -m vllm.entrypoints.openai.api_server \ - --port 8081 \ - --model $MODEL_NAME \ - --trust-remote-code \ - --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ - 2>&1 | tee api_server.log -``` - -
+??? Yaml + + ```yaml + service: + replica_policy: + min_replicas: 2 + max_replicas: 4 + target_qps_per_replica: 2 + # An actual request for readiness probe. + readiness_probe: + path: /v1/chat/completions + post_data: + model: $MODEL_NAME + messages: + - role: user + content: Hello! What is your name? + max_completion_tokens: 1 + + resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. + use_spot: True + disk_size: 512 # Ensure model checkpoints can fit. + disk_tier: best + ports: 8081 # Expose to internet traffic. + + envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # Change to your own huggingface token, or use --env to pass. + + setup: | + conda create -n vllm python=3.10 -y + conda activate vllm + + pip install vllm==0.4.0.post1 + # Install Gradio for web UI. + pip install gradio openai + pip install flash-attn==2.5.7 + + run: | + conda activate vllm + echo 'Starting vllm api server...' + python -u -m vllm.entrypoints.openai.api_server \ + --port 8081 \ + --model $MODEL_NAME \ + --trust-remote-code \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + 2>&1 | tee api_server.log + ``` To update the service with the new config: @@ -288,38 +285,35 @@ sky serve down vllm It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas. -
-Click to see the full GUI YAML +??? Yaml -```yaml -envs: - MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct - ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. - -resources: - cpus: 2 - -setup: | - conda create -n vllm python=3.10 -y - conda activate vllm - - # Install Gradio for web UI. - pip install gradio openai - -run: | - conda activate vllm - export PATH=$PATH:/sbin - - echo 'Starting gradio server...' - git clone https://github.com/vllm-project/vllm.git || true - python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \ - -m $MODEL_NAME \ - --port 8811 \ - --model-url http://$ENDPOINT/v1 \ - --stop-token-ids 128009,128001 | tee ~/gradio.log -``` + ```yaml + envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. + + resources: + cpus: 2 -
+ setup: | + conda create -n vllm python=3.10 -y + conda activate vllm + + # Install Gradio for web UI. + pip install gradio openai + + run: | + conda activate vllm + export PATH=$PATH:/sbin + + echo 'Starting gradio server...' + git clone https://github.com/vllm-project/vllm.git || true + python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \ + -m $MODEL_NAME \ + --port 8811 \ + --model-url http://$ENDPOINT/v1 \ + --stop-token-ids 128009,128001 | tee ~/gradio.log + ``` 1. Start the chat web UI: diff --git a/docs/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md index 8288a4b6e6b..2b1cc6f6fee 100644 --- a/docs/deployment/integrations/production-stack.md +++ b/docs/deployment/integrations/production-stack.md @@ -60,22 +60,22 @@ And then you can send out a query to the OpenAI-compatible API to check the avai curl -o- http://localhost:30080/models ``` -Expected output: +??? Output -```json -{ - "object": "list", - "data": [ + ```json { - "id": "facebook/opt-125m", - "object": "model", - "created": 1737428424, - "owned_by": "vllm", - "root": null + "object": "list", + "data": [ + { + "id": "facebook/opt-125m", + "object": "model", + "created": 1737428424, + "owned_by": "vllm", + "root": null + } + ] } - ] -} -``` + ``` To send an actual chatting request, you can issue a curl request to the OpenAI `/completion` endpoint: @@ -89,23 +89,23 @@ curl -X POST http://localhost:30080/completions \ }' ``` -Expected output: +??? Output -```json -{ - "id": "completion-id", - "object": "text_completion", - "created": 1737428424, - "model": "facebook/opt-125m", - "choices": [ + ```json { - "text": " there was a brave knight who...", - "index": 0, - "finish_reason": "length" + "id": "completion-id", + "object": "text_completion", + "created": 1737428424, + "model": "facebook/opt-125m", + "choices": [ + { + "text": " there was a brave knight who...", + "index": 0, + "finish_reason": "length" + } + ] } - ] -} -``` + ``` ### Uninstall @@ -121,23 +121,25 @@ sudo helm uninstall vllm The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above: -```yaml -servingEngineSpec: - runtimeClassName: "" - modelSpec: - - name: "opt125m" - repository: "vllm/vllm-openai" - tag: "latest" - modelURL: "facebook/opt-125m" +??? 
Yaml - replicaCount: 1 + ```yaml + servingEngineSpec: + runtimeClassName: "" + modelSpec: + - name: "opt125m" + repository: "vllm/vllm-openai" + tag: "latest" + modelURL: "facebook/opt-125m" - requestCPU: 6 - requestMemory: "16Gi" - requestGPU: 1 + replicaCount: 1 - pvcStorage: "10Gi" -``` + requestCPU: 6 + requestMemory: "16Gi" + requestGPU: 1 + + pvcStorage: "10Gi" + ``` In this YAML configuration: * **`modelSpec`** includes: diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md index 7430f99a539..13225ba208f 100644 --- a/docs/deployment/k8s.md +++ b/docs/deployment/k8s.md @@ -29,85 +29,89 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model: -```bash -cat < + Yaml + ```yaml apiVersion: v1 kind: PersistentVolumeClaim @@ -144,6 +151,8 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) volumeMode: Filesystem ``` + + Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models ```yaml @@ -156,13 +165,16 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) stringData: token: "REPLACE_WITH_TOKEN" ``` - + Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model. Here are two examples for using NVIDIA GPU and AMD GPU. NVIDIA GPU: +
+ Yaml + ```yaml apiVersion: apps/v1 kind: Deployment @@ -233,10 +245,15 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) periodSeconds: 5 ``` +
+ AMD GPU: You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X. +
+ Yaml + ```yaml apiVersion: apps/v1 kind: Deployment @@ -305,12 +322,17 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) mountPath: /dev/shm ``` +
+ You can get the full example with steps and sample yaml files from . 2. Create a Kubernetes Service for vLLM Next, create a Kubernetes Service file to expose the `mistral-7b` deployment: +
+ Yaml + ```yaml apiVersion: v1 kind: Service @@ -330,6 +352,8 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) type: ClusterIP ``` +
+ 3. Deploy and Test Apply the deployment and service configurations using `kubectl apply -f `: diff --git a/docs/deployment/nginx.md b/docs/deployment/nginx.md index f0ff5c1d0e7..752be76b386 100644 --- a/docs/deployment/nginx.md +++ b/docs/deployment/nginx.md @@ -36,23 +36,25 @@ docker build . -f Dockerfile.nginx --tag nginx-lb Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`. -```console -upstream backend { - least_conn; - server vllm0:8000 max_fails=3 fail_timeout=10000s; - server vllm1:8000 max_fails=3 fail_timeout=10000s; -} -server { - listen 80; - location / { - proxy_pass http://backend; - proxy_set_header Host $host; - proxy_set_header X-Real-IP $remote_addr; - proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; - proxy_set_header X-Forwarded-Proto $scheme; +??? Config + + ```console + upstream backend { + least_conn; + server vllm0:8000 max_fails=3 fail_timeout=10000s; + server vllm1:8000 max_fails=3 fail_timeout=10000s; } -} -``` + server { + listen 80; + location / { + proxy_pass http://backend; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_set_header X-Forwarded-Proto $scheme; + } + } + ``` [](){ #nginxloadbalancer-nginx-vllm-container } @@ -93,30 +95,32 @@ Notes: - The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus device=ID`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command. - Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`. -```console -mkdir -p ~/.cache/huggingface/hub/ -hf_cache_dir=~/.cache/huggingface/ -docker run \ - -itd \ - --ipc host \ - --network vllm_nginx \ - --gpus device=0 \ - --shm-size=10.24gb \ - -v $hf_cache_dir:/root/.cache/huggingface/ \ - -p 8081:8000 \ - --name vllm0 vllm \ - --model meta-llama/Llama-2-7b-chat-hf -docker run \ - -itd \ - --ipc host \ - --network vllm_nginx \ - --gpus device=1 \ - --shm-size=10.24gb \ - -v $hf_cache_dir:/root/.cache/huggingface/ \ - -p 8082:8000 \ - --name vllm1 vllm \ - --model meta-llama/Llama-2-7b-chat-hf -``` +??? Commands + + ```console + mkdir -p ~/.cache/huggingface/hub/ + hf_cache_dir=~/.cache/huggingface/ + docker run \ + -itd \ + --ipc host \ + --network vllm_nginx \ + --gpus device=0 \ + --shm-size=10.24gb \ + -v $hf_cache_dir:/root/.cache/huggingface/ \ + -p 8081:8000 \ + --name vllm0 vllm \ + --model meta-llama/Llama-2-7b-chat-hf + docker run \ + -itd \ + --ipc host \ + --network vllm_nginx \ + --gpus device=1 \ + --shm-size=10.24gb \ + -v $hf_cache_dir:/root/.cache/huggingface/ \ + -p 8082:8000 \ + --name vllm1 vllm \ + --model meta-llama/Llama-2-7b-chat-hf + ``` !!! note If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`. diff --git a/docs/design/arch_overview.md b/docs/design/arch_overview.md index 14720a392aa..9bfdab17007 100644 --- a/docs/design/arch_overview.md +++ b/docs/design/arch_overview.md @@ -22,31 +22,33 @@ server. 
Here is a sample of `LLM` class usage: -```python -from vllm import LLM, SamplingParams - -# Define a list of input prompts -prompts = [ - "Hello, my name is", - "The capital of France is", - "The largest ocean is", -] - -# Define sampling parameters -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - -# Initialize the LLM engine with the OPT-125M model -llm = LLM(model="facebook/opt-125m") - -# Generate outputs for the input prompts -outputs = llm.generate(prompts, sampling_params) - -# Print the generated outputs -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` +??? Code + + ```python + from vllm import LLM, SamplingParams + + # Define a list of input prompts + prompts = [ + "Hello, my name is", + "The capital of France is", + "The largest ocean is", + ] + + # Define sampling parameters + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + # Initialize the LLM engine with the OPT-125M model + llm = LLM(model="facebook/opt-125m") + + # Generate outputs for the input prompts + outputs = llm.generate(prompts, sampling_params) + + # Print the generated outputs + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + ``` More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs. @@ -178,32 +180,34 @@ vision-language model. To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one: - ```python - class MyOldModel(nn.Module): - def __init__( - self, - config, - cache_config: Optional[CacheConfig] = None, - quant_config: Optional[QuantizationConfig] = None, - lora_config: Optional[LoRAConfig] = None, - prefix: str = "", - ) -> None: - ... - - from vllm.config import VllmConfig - class MyNewModel(MyOldModel): - def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): - config = vllm_config.model_config.hf_config - cache_config = vllm_config.cache_config - quant_config = vllm_config.quant_config - lora_config = vllm_config.lora_config - super().__init__(config, cache_config, quant_config, lora_config, prefix) - - if __version__ >= "0.6.4": - MyModel = MyNewModel - else: - MyModel = MyOldModel - ``` + ??? Code + + ```python + class MyOldModel(nn.Module): + def __init__( + self, + config, + cache_config: Optional[CacheConfig] = None, + quant_config: Optional[QuantizationConfig] = None, + lora_config: Optional[LoRAConfig] = None, + prefix: str = "", + ) -> None: + ... + + from vllm.config import VllmConfig + class MyNewModel(MyOldModel): + def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): + config = vllm_config.model_config.hf_config + cache_config = vllm_config.cache_config + quant_config = vllm_config.quant_config + lora_config = vllm_config.lora_config + super().__init__(config, cache_config, quant_config, lora_config, prefix) + + if __version__ >= "0.6.4": + MyModel = MyNewModel + else: + MyModel = MyOldModel + ``` This way, the model can work with both old and new versions of vLLM. 
diff --git a/docs/design/kernel/paged_attention.md b/docs/design/kernel/paged_attention.md index 6ebe1ee48ac..ff135a73196 100644 --- a/docs/design/kernel/paged_attention.md +++ b/docs/design/kernel/paged_attention.md @@ -448,27 +448,29 @@ elements of the entire head for all context tokens. However, overall, all results for output have been calculated but are just stored in different thread register memory. -```cpp -float* out_smem = reinterpret_cast(shared_mem); -for (int i = NUM_WARPS; i > 1; i /= 2) { - // Upper warps write to shared memory. - ... - float* dst = &out_smem[(warp_idx - mid) * HEAD_SIZE]; - for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) { - ... - dst[row_idx] = accs[i]; - } +??? Code - // Lower warps update the output. - const float* src = &out_smem[warp_idx * HEAD_SIZE]; - for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) { + ```cpp + float* out_smem = reinterpret_cast(shared_mem); + for (int i = NUM_WARPS; i > 1; i /= 2) { + // Upper warps write to shared memory. ... - accs[i] += src[row_idx]; + float* dst = &out_smem[(warp_idx - mid) * HEAD_SIZE]; + for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) { + ... + dst[row_idx] = accs[i]; + } + + // Lower warps update the output. + const float* src = &out_smem[warp_idx * HEAD_SIZE]; + for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) { + ... + accs[i] += src[row_idx]; + } + + // Write out the accs. } - - // Write out the accs. -} -``` + ``` ## Output diff --git a/docs/design/plugin_system.md b/docs/design/plugin_system.md index 0764dfb6501..944f0e680de 100644 --- a/docs/design/plugin_system.md +++ b/docs/design/plugin_system.md @@ -13,28 +13,30 @@ Plugins are user-registered code that vLLM executes. Given vLLM's architecture ( vLLM's plugin system uses the standard Python `entry_points` mechanism. This mechanism allows developers to register functions in their Python packages for use by other packages. An example of a plugin: -```python -# inside `setup.py` file -from setuptools import setup - -setup(name='vllm_add_dummy_model', - version='0.1', - packages=['vllm_add_dummy_model'], - entry_points={ - 'vllm.general_plugins': - ["register_dummy_model = vllm_add_dummy_model:register"] - }) - -# inside `vllm_add_dummy_model.py` file -def register(): - from vllm import ModelRegistry - - if "MyLlava" not in ModelRegistry.get_supported_archs(): - ModelRegistry.register_model( - "MyLlava", - "vllm_add_dummy_model.my_llava:MyLlava", - ) -``` +??? Code + + ```python + # inside `setup.py` file + from setuptools import setup + + setup(name='vllm_add_dummy_model', + version='0.1', + packages=['vllm_add_dummy_model'], + entry_points={ + 'vllm.general_plugins': + ["register_dummy_model = vllm_add_dummy_model:register"] + }) + + # inside `vllm_add_dummy_model.py` file + def register(): + from vllm import ModelRegistry + + if "MyLlava" not in ModelRegistry.get_supported_archs(): + ModelRegistry.register_model( + "MyLlava", + "vllm_add_dummy_model.my_llava:MyLlava", + ) + ``` For more information on adding entry points to your package, please check the [official documentation](https://setuptools.pypa.io/en/latest/userguide/entry_point.html). diff --git a/docs/features/lora.md b/docs/features/lora.md index 04e92dbc459..4ccc3290e56 100644 --- a/docs/features/lora.md +++ b/docs/features/lora.md @@ -29,24 +29,26 @@ We can now submit the prompts and call `llm.generate` with the `lora_request` pa of `LoRARequest` is a human identifiable name, the second parameter is a globally unique ID for the adapter and the third parameter is the path to the LoRA adapter. 
-```python -sampling_params = SamplingParams( - temperature=0, - max_tokens=256, - stop=["[/assistant]"] -) - -prompts = [ - "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]", - "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]", -] - -outputs = llm.generate( - prompts, - sampling_params, - lora_request=LoRARequest("sql_adapter", 1, sql_lora_path) -) -``` +??? Code + + ```python + sampling_params = SamplingParams( + temperature=0, + max_tokens=256, + stop=["[/assistant]"] + ) + + prompts = [ + "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]", + "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]", + ] + + outputs = llm.generate( + prompts, + sampling_params, + lora_request=LoRARequest("sql_adapter", 1, sql_lora_path) + ) + ``` Check out for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options. @@ -68,24 +70,26 @@ The server entrypoint accepts all other LoRA configuration parameters (`max_lora etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it.): -```bash -curl localhost:8000/v1/models | jq . -{ - "object": "list", - "data": [ - { - "id": "meta-llama/Llama-2-7b-hf", - "object": "model", - ... - }, - { - "id": "sql-lora", - "object": "model", - ... - } - ] -} -``` +??? Command + + ```bash + curl localhost:8000/v1/models | jq . + { + "object": "list", + "data": [ + { + "id": "meta-llama/Llama-2-7b-hf", + "object": "model", + ... + }, + { + "id": "sql-lora", + "object": "model", + ... + } + ] + } + ``` Requests can specify the LoRA adapter as if it were any other model via the `model` request parameter. The requests will be processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other @@ -168,36 +172,36 @@ Alternatively, follow these example steps to implement your own plugin: 1. Implement the LoRAResolver interface. 
- Example of a simple S3 LoRAResolver implementation: - - ```python - import os - import s3fs - from vllm.lora.request import LoRARequest - from vllm.lora.resolver import LoRAResolver - - class S3LoRAResolver(LoRAResolver): - def __init__(self): - self.s3 = s3fs.S3FileSystem() - self.s3_path_format = os.getenv("S3_PATH_TEMPLATE") - self.local_path_format = os.getenv("LOCAL_PATH_TEMPLATE") - - async def resolve_lora(self, base_model_name, lora_name): - s3_path = self.s3_path_format.format(base_model_name=base_model_name, lora_name=lora_name) - local_path = self.local_path_format.format(base_model_name=base_model_name, lora_name=lora_name) - - # Download the LoRA from S3 to the local path - await self.s3._get( - s3_path, local_path, recursive=True, maxdepth=1 - ) - - lora_request = LoRARequest( - lora_name=lora_name, - lora_path=local_path, - lora_int_id=abs(hash(lora_name)) - ) - return lora_request - ``` + ??? Example of a simple S3 LoRAResolver implementation + + ```python + import os + import s3fs + from vllm.lora.request import LoRARequest + from vllm.lora.resolver import LoRAResolver + + class S3LoRAResolver(LoRAResolver): + def __init__(self): + self.s3 = s3fs.S3FileSystem() + self.s3_path_format = os.getenv("S3_PATH_TEMPLATE") + self.local_path_format = os.getenv("LOCAL_PATH_TEMPLATE") + + async def resolve_lora(self, base_model_name, lora_name): + s3_path = self.s3_path_format.format(base_model_name=base_model_name, lora_name=lora_name) + local_path = self.local_path_format.format(base_model_name=base_model_name, lora_name=lora_name) + + # Download the LoRA from S3 to the local path + await self.s3._get( + s3_path, local_path, recursive=True, maxdepth=1 + ) + + lora_request = LoRARequest( + lora_name=lora_name, + lora_path=local_path, + lora_int_id=abs(hash(lora_name)) + ) + return lora_request + ``` 2. Register `LoRAResolver` plugin. @@ -234,38 +238,40 @@ The new format of `--lora-modules` is mainly to support the display of parent mo - The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter. - The `root` field points to the artifact location of the lora adapter. -```bash -$ curl http://localhost:8000/v1/models - -{ - "object": "list", - "data": [ - { - "id": "meta-llama/Llama-2-7b-hf", - "object": "model", - "created": 1715644056, - "owned_by": "vllm", - "root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/", - "parent": null, - "permission": [ +??? Command output + + ```bash + $ curl http://localhost:8000/v1/models + + { + "object": "list", + "data": [ { - ..... - } - ] - }, - { - "id": "sql-lora", - "object": "model", - "created": 1715644056, - "owned_by": "vllm", - "root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/", - "parent": meta-llama/Llama-2-7b-hf, - "permission": [ + "id": "meta-llama/Llama-2-7b-hf", + "object": "model", + "created": 1715644056, + "owned_by": "vllm", + "root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/", + "parent": null, + "permission": [ + { + ..... + } + ] + }, { - .... 
+ "id": "sql-lora", + "object": "model", + "created": 1715644056, + "owned_by": "vllm", + "root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/", + "parent": meta-llama/Llama-2-7b-hf, + "permission": [ + { + .... + } + ] } ] - } - ] -} -``` + } + ``` diff --git a/docs/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md index afb9a6d4df9..d4465beb859 100644 --- a/docs/features/multimodal_inputs.md +++ b/docs/features/multimodal_inputs.md @@ -20,111 +20,117 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]: You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples: -```python -from vllm import LLM - -llm = LLM(model="llava-hf/llava-1.5-7b-hf") - -# Refer to the HuggingFace repo for the correct format to use -prompt = "USER: \nWhat is the content of this image?\nASSISTANT:" - -# Load the image using PIL.Image -image = PIL.Image.open(...) - -# Single prompt inference -outputs = llm.generate({ - "prompt": prompt, - "multi_modal_data": {"image": image}, -}) - -for o in outputs: - generated_text = o.outputs[0].text - print(generated_text) - -# Batch inference -image_1 = PIL.Image.open(...) -image_2 = PIL.Image.open(...) -outputs = llm.generate( - [ - { - "prompt": "USER: \nWhat is the content of this image?\nASSISTANT:", - "multi_modal_data": {"image": image_1}, - }, - { - "prompt": "USER: \nWhat's the color of this image?\nASSISTANT:", - "multi_modal_data": {"image": image_2}, - } - ] -) +??? Code -for o in outputs: - generated_text = o.outputs[0].text - print(generated_text) -``` + ```python + from vllm import LLM + + llm = LLM(model="llava-hf/llava-1.5-7b-hf") + + # Refer to the HuggingFace repo for the correct format to use + prompt = "USER: \nWhat is the content of this image?\nASSISTANT:" + + # Load the image using PIL.Image + image = PIL.Image.open(...) + + # Single prompt inference + outputs = llm.generate({ + "prompt": prompt, + "multi_modal_data": {"image": image}, + }) + + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + + # Batch inference + image_1 = PIL.Image.open(...) + image_2 = PIL.Image.open(...) + outputs = llm.generate( + [ + { + "prompt": "USER: \nWhat is the content of this image?\nASSISTANT:", + "multi_modal_data": {"image": image_1}, + }, + { + "prompt": "USER: \nWhat's the color of this image?\nASSISTANT:", + "multi_modal_data": {"image": image_2}, + } + ] + ) + + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + ``` Full example: To substitute multiple images inside the same text prompt, you can pass in a list of images instead: -```python -from vllm import LLM - -llm = LLM( - model="microsoft/Phi-3.5-vision-instruct", - trust_remote_code=True, # Required to load Phi-3.5-vision - max_model_len=4096, # Otherwise, it may not fit in smaller GPUs - limit_mm_per_prompt={"image": 2}, # The maximum number to accept -) - -# Refer to the HuggingFace repo for the correct format to use -prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n" - -# Load the images using PIL.Image -image1 = PIL.Image.open(...) -image2 = PIL.Image.open(...) - -outputs = llm.generate({ - "prompt": prompt, - "multi_modal_data": { - "image": [image1, image2] - }, -}) - -for o in outputs: - generated_text = o.outputs[0].text - print(generated_text) -``` +??? 
Code + + ```python + from vllm import LLM + + llm = LLM( + model="microsoft/Phi-3.5-vision-instruct", + trust_remote_code=True, # Required to load Phi-3.5-vision + max_model_len=4096, # Otherwise, it may not fit in smaller GPUs + limit_mm_per_prompt={"image": 2}, # The maximum number to accept + ) + + # Refer to the HuggingFace repo for the correct format to use + prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n" + + # Load the images using PIL.Image + image1 = PIL.Image.open(...) + image2 = PIL.Image.open(...) + + outputs = llm.generate({ + "prompt": prompt, + "multi_modal_data": { + "image": [image1, image2] + }, + }) + + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + ``` Full example: Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos: -```python -from vllm import LLM +??? Code -# Specify the maximum number of frames per video to be 4. This can be changed. -llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4}) + ```python + from vllm import LLM -# Create the request payload. -video_frames = ... # load your video making sure it only has the number of frames specified earlier. -message = { - "role": "user", - "content": [ - {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."}, - ], -} -for i in range(len(video_frames)): - base64_image = encode_image(video_frames[i]) # base64 encoding. - new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}} - message["content"].append(new_image) - -# Perform inference and log output. -outputs = llm.chat([message]) - -for o in outputs: - generated_text = o.outputs[0].text - print(generated_text) -``` + # Specify the maximum number of frames per video to be 4. This can be changed. + llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4}) + + # Create the request payload. + video_frames = ... # load your video making sure it only has the number of frames specified earlier. + message = { + "role": "user", + "content": [ + {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."}, + ], + } + for i in range(len(video_frames)): + base64_image = encode_image(video_frames[i]) # base64 encoding. + new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}} + message["content"].append(new_image) + + # Perform inference and log output. + outputs = llm.chat([message]) + + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + ``` ### Video Inputs @@ -144,68 +150,72 @@ Full example: To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model, pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary. -```python -from vllm import LLM +??? 
Code -# Inference with image embeddings as input -llm = LLM(model="llava-hf/llava-1.5-7b-hf") + ```python + from vllm import LLM -# Refer to the HuggingFace repo for the correct format to use -prompt = "USER: \nWhat is the content of this image?\nASSISTANT:" + # Inference with image embeddings as input + llm = LLM(model="llava-hf/llava-1.5-7b-hf") -# Embeddings for single image -# torch.Tensor of shape (1, image_feature_size, hidden_size of LM) -image_embeds = torch.load(...) + # Refer to the HuggingFace repo for the correct format to use + prompt = "USER: \nWhat is the content of this image?\nASSISTANT:" -outputs = llm.generate({ - "prompt": prompt, - "multi_modal_data": {"image": image_embeds}, -}) + # Embeddings for single image + # torch.Tensor of shape (1, image_feature_size, hidden_size of LM) + image_embeds = torch.load(...) -for o in outputs: - generated_text = o.outputs[0].text - print(generated_text) -``` + outputs = llm.generate({ + "prompt": prompt, + "multi_modal_data": {"image": image_embeds}, + }) + + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + ``` For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings: -```python -# Construct the prompt based on your model -prompt = ... - -# Embeddings for multiple images -# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM) -image_embeds = torch.load(...) - -# Qwen2-VL -llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4}) -mm_data = { - "image": { - "image_embeds": image_embeds, - # image_grid_thw is needed to calculate positional encoding. - "image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3), +??? Code + + ```python + # Construct the prompt based on your model + prompt = ... + + # Embeddings for multiple images + # torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM) + image_embeds = torch.load(...) + + # Qwen2-VL + llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4}) + mm_data = { + "image": { + "image_embeds": image_embeds, + # image_grid_thw is needed to calculate positional encoding. + "image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3), + } } -} - -# MiniCPM-V -llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4}) -mm_data = { - "image": { - "image_embeds": image_embeds, - # image_sizes is needed to calculate details of the sliced image. - "image_sizes": [image.size for image in images], # list of image sizes + + # MiniCPM-V + llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4}) + mm_data = { + "image": { + "image_embeds": image_embeds, + # image_sizes is needed to calculate details of the sliced image. 
+ "image_sizes": [image.size for image in images], # list of image sizes + } } -} -outputs = llm.generate({ - "prompt": prompt, - "multi_modal_data": mm_data, -}) + outputs = llm.generate({ + "prompt": prompt, + "multi_modal_data": mm_data, + }) -for o in outputs: - generated_text = o.outputs[0].text - print(generated_text) -``` + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + ``` ## Online Serving @@ -235,51 +245,53 @@ vllm serve microsoft/Phi-3.5-vision-instruct --task generate \ Then, you can use the OpenAI client as follows: -```python -from openai import OpenAI - -openai_api_key = "EMPTY" -openai_api_base = "http://localhost:8000/v1" - -client = OpenAI( - api_key=openai_api_key, - base_url=openai_api_base, -) - -# Single-image input inference -image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" - -chat_response = client.chat.completions.create( - model="microsoft/Phi-3.5-vision-instruct", - messages=[{ - "role": "user", - "content": [ - # NOTE: The prompt formatting with the image token `` is not needed - # since the prompt will be processed automatically by the API server. - {"type": "text", "text": "What’s in this image?"}, - {"type": "image_url", "image_url": {"url": image_url}}, - ], - }], -) -print("Chat completion output:", chat_response.choices[0].message.content) - -# Multi-image input inference -image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg" -image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg" - -chat_response = client.chat.completions.create( - model="microsoft/Phi-3.5-vision-instruct", - messages=[{ - "role": "user", - "content": [ - {"type": "text", "text": "What are the animals in these images?"}, - {"type": "image_url", "image_url": {"url": image_url_duck}}, - {"type": "image_url", "image_url": {"url": image_url_lion}}, - ], - }], -) -print("Chat completion output:", chat_response.choices[0].message.content) -``` +??? Code + + ```python + from openai import OpenAI + + openai_api_key = "EMPTY" + openai_api_base = "http://localhost:8000/v1" + + client = OpenAI( + api_key=openai_api_key, + base_url=openai_api_base, + ) + + # Single-image input inference + image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" + + chat_response = client.chat.completions.create( + model="microsoft/Phi-3.5-vision-instruct", + messages=[{ + "role": "user", + "content": [ + # NOTE: The prompt formatting with the image token `` is not needed + # since the prompt will be processed automatically by the API server. 
+ {"type": "text", "text": "What’s in this image?"}, + {"type": "image_url", "image_url": {"url": image_url}}, + ], + }], + ) + print("Chat completion output:", chat_response.choices[0].message.content) + + # Multi-image input inference + image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg" + image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg" + + chat_response = client.chat.completions.create( + model="microsoft/Phi-3.5-vision-instruct", + messages=[{ + "role": "user", + "content": [ + {"type": "text", "text": "What are the animals in these images?"}, + {"type": "image_url", "image_url": {"url": image_url_duck}}, + {"type": "image_url", "image_url": {"url": image_url_lion}}, + ], + }], + ) + print("Chat completion output:", chat_response.choices[0].message.content) + ``` Full example: @@ -311,44 +323,46 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model Then, you can use the OpenAI client as follows: -```python -from openai import OpenAI +??? Code -openai_api_key = "EMPTY" -openai_api_base = "http://localhost:8000/v1" + ```python + from openai import OpenAI -client = OpenAI( - api_key=openai_api_key, - base_url=openai_api_base, -) + openai_api_key = "EMPTY" + openai_api_base = "http://localhost:8000/v1" -video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4" + client = OpenAI( + api_key=openai_api_key, + base_url=openai_api_base, + ) -## Use video url in the payload -chat_completion_from_url = client.chat.completions.create( - messages=[{ - "role": - "user", - "content": [ - { - "type": "text", - "text": "What's in this video?" - }, - { - "type": "video_url", - "video_url": { - "url": video_url - }, - }, - ], - }], - model=model, - max_completion_tokens=64, -) + video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4" -result = chat_completion_from_url.choices[0].message.content -print("Chat completion output from image url:", result) -``` + ## Use video url in the payload + chat_completion_from_url = client.chat.completions.create( + messages=[{ + "role": + "user", + "content": [ + { + "type": "text", + "text": "What's in this video?" + }, + { + "type": "video_url", + "video_url": { + "url": video_url + }, + }, + ], + }], + model=model, + max_completion_tokens=64, + ) + + result = chat_completion_from_url.choices[0].message.content + print("Chat completion output from image url:", result) + ``` Full example: @@ -373,84 +387,88 @@ vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b Then, you can use the OpenAI client as follows: -```python -import base64 -import requests -from openai import OpenAI -from vllm.assets.audio import AudioAsset +??? 
Code -def encode_base64_content_from_url(content_url: str) -> str: - """Encode a content retrieved from a remote url to base64 format.""" + ```python + import base64 + import requests + from openai import OpenAI + from vllm.assets.audio import AudioAsset - with requests.get(content_url) as response: - response.raise_for_status() - result = base64.b64encode(response.content).decode('utf-8') + def encode_base64_content_from_url(content_url: str) -> str: + """Encode a content retrieved from a remote url to base64 format.""" - return result + with requests.get(content_url) as response: + response.raise_for_status() + result = base64.b64encode(response.content).decode('utf-8') -openai_api_key = "EMPTY" -openai_api_base = "http://localhost:8000/v1" + return result -client = OpenAI( - api_key=openai_api_key, - base_url=openai_api_base, -) + openai_api_key = "EMPTY" + openai_api_base = "http://localhost:8000/v1" -# Any format supported by librosa is supported -audio_url = AudioAsset("winning_call").url -audio_base64 = encode_base64_content_from_url(audio_url) + client = OpenAI( + api_key=openai_api_key, + base_url=openai_api_base, + ) -chat_completion_from_base64 = client.chat.completions.create( - messages=[{ - "role": "user", - "content": [ - { - "type": "text", - "text": "What's in this audio?" - }, - { - "type": "input_audio", - "input_audio": { - "data": audio_base64, - "format": "wav" - }, - }, - ], - }], - model=model, - max_completion_tokens=64, -) + # Any format supported by librosa is supported + audio_url = AudioAsset("winning_call").url + audio_base64 = encode_base64_content_from_url(audio_url) -result = chat_completion_from_base64.choices[0].message.content -print("Chat completion output from input audio:", result) -``` + chat_completion_from_base64 = client.chat.completions.create( + messages=[{ + "role": "user", + "content": [ + { + "type": "text", + "text": "What's in this audio?" + }, + { + "type": "input_audio", + "input_audio": { + "data": audio_base64, + "format": "wav" + }, + }, + ], + }], + model=model, + max_completion_tokens=64, + ) + + result = chat_completion_from_base64.choices[0].message.content + print("Chat completion output from input audio:", result) + ``` Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input: -```python -chat_completion_from_url = client.chat.completions.create( - messages=[{ - "role": "user", - "content": [ - { - "type": "text", - "text": "What's in this audio?" - }, - { - "type": "audio_url", - "audio_url": { - "url": audio_url - }, - }, - ], - }], - model=model, - max_completion_tokens=64, -) +??? Code -result = chat_completion_from_url.choices[0].message.content -print("Chat completion output from audio url:", result) -``` + ```python + chat_completion_from_url = client.chat.completions.create( + messages=[{ + "role": "user", + "content": [ + { + "type": "text", + "text": "What's in this audio?" + }, + { + "type": "audio_url", + "audio_url": { + "url": audio_url + }, + }, + ], + }], + model=model, + max_completion_tokens=64, + ) + + result = chat_completion_from_url.choices[0].message.content + print("Chat completion output from audio url:", result) + ``` Full example: @@ -470,61 +488,63 @@ pass a tensor of shape to the corresponding field of the multi-modal dictionary. For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field. The following example demonstrates how to pass image embeddings to the OpenAI server: -```python -image_embedding = torch.load(...) 
-grid_thw = torch.load(...) # Required by Qwen/Qwen2-VL-2B-Instruct - -buffer = io.BytesIO() -torch.save(image_embedding, buffer) -buffer.seek(0) -binary_data = buffer.read() -base64_image_embedding = base64.b64encode(binary_data).decode('utf-8') - -client = OpenAI( - # defaults to os.environ.get("OPENAI_API_KEY") - api_key=openai_api_key, - base_url=openai_api_base, -) - -# Basic usage - this is equivalent to the LLaVA example for offline inference -model = "llava-hf/llava-1.5-7b-hf" -embeds = { - "type": "image_embeds", - "image_embeds": f"{base64_image_embedding}" -} - -# Pass additional parameters (available to Qwen2-VL and MiniCPM-V) -model = "Qwen/Qwen2-VL-2B-Instruct" -embeds = { - "type": "image_embeds", - "image_embeds": { - "image_embeds": f"{base64_image_embedding}" , # Required - "image_grid_thw": f"{base64_image_grid_thw}" # Required by Qwen/Qwen2-VL-2B-Instruct - }, -} -model = "openbmb/MiniCPM-V-2_6" -embeds = { - "type": "image_embeds", - "image_embeds": { - "image_embeds": f"{base64_image_embedding}" , # Required - "image_sizes": f"{base64_image_sizes}" # Required by openbmb/MiniCPM-V-2_6 - }, -} -chat_completion = client.chat.completions.create( - messages=[ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": [ - { - "type": "text", - "text": "What's in this image?", +??? Code + + ```python + image_embedding = torch.load(...) + grid_thw = torch.load(...) # Required by Qwen/Qwen2-VL-2B-Instruct + + buffer = io.BytesIO() + torch.save(image_embedding, buffer) + buffer.seek(0) + binary_data = buffer.read() + base64_image_embedding = base64.b64encode(binary_data).decode('utf-8') + + client = OpenAI( + # defaults to os.environ.get("OPENAI_API_KEY") + api_key=openai_api_key, + base_url=openai_api_base, + ) + + # Basic usage - this is equivalent to the LLaVA example for offline inference + model = "llava-hf/llava-1.5-7b-hf" + embeds = { + "type": "image_embeds", + "image_embeds": f"{base64_image_embedding}" + } + + # Pass additional parameters (available to Qwen2-VL and MiniCPM-V) + model = "Qwen/Qwen2-VL-2B-Instruct" + embeds = { + "type": "image_embeds", + "image_embeds": { + "image_embeds": f"{base64_image_embedding}" , # Required + "image_grid_thw": f"{base64_image_grid_thw}" # Required by Qwen/Qwen2-VL-2B-Instruct }, - embeds, - ], - }, -], - model=model, -) -``` + } + model = "openbmb/MiniCPM-V-2_6" + embeds = { + "type": "image_embeds", + "image_embeds": { + "image_embeds": f"{base64_image_embedding}" , # Required + "image_sizes": f"{base64_image_sizes}" # Required by openbmb/MiniCPM-V-2_6 + }, + } + chat_completion = client.chat.completions.create( + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": [ + { + "type": "text", + "text": "What's in this image?", + }, + embeds, + ], + }, + ], + model=model, + ) + ``` !!! note Only one message can contain `{"type": "image_embeds"}`. diff --git a/docs/features/quantization/auto_awq.md b/docs/features/quantization/auto_awq.md index 4366a080f52..8362672f40b 100644 --- a/docs/features/quantization/auto_awq.md +++ b/docs/features/quantization/auto_awq.md @@ -15,29 +15,31 @@ pip install autoawq After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. 
Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`: -```python -from awq import AutoAWQForCausalLM -from transformers import AutoTokenizer +??? Code -model_path = 'mistralai/Mistral-7B-Instruct-v0.2' -quant_path = 'mistral-instruct-v0.2-awq' -quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } + ```python + from awq import AutoAWQForCausalLM + from transformers import AutoTokenizer -# Load model -model = AutoAWQForCausalLM.from_pretrained( - model_path, **{"low_cpu_mem_usage": True, "use_cache": False} -) -tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) + model_path = 'mistralai/Mistral-7B-Instruct-v0.2' + quant_path = 'mistral-instruct-v0.2-awq' + quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } -# Quantize -model.quantize(tokenizer, quant_config=quant_config) + # Load model + model = AutoAWQForCausalLM.from_pretrained( + model_path, **{"low_cpu_mem_usage": True, "use_cache": False} + ) + tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) -# Save quantized model -model.save_quantized(quant_path) -tokenizer.save_pretrained(quant_path) + # Quantize + model.quantize(tokenizer, quant_config=quant_config) -print(f'Model is quantized and saved at "{quant_path}"') -``` + # Save quantized model + model.save_quantized(quant_path) + tokenizer.save_pretrained(quant_path) + + print(f'Model is quantized and saved at "{quant_path}"') + ``` To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command: @@ -49,27 +51,29 @@ python examples/offline_inference/llm_engine_example.py \ AWQ models are also supported directly through the LLM entrypoint: -```python -from vllm import LLM, SamplingParams - -# Sample prompts. -prompts = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", -] -# Create a sampling params object. -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - -# Create an LLM. -llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ") -# Generate texts from the prompts. The output is a list of RequestOutput objects -# that contain the prompt, generated text, and other information. -outputs = llm.generate(prompts, sampling_params) -# Print the outputs. -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` +??? Code + + ```python + from vllm import LLM, SamplingParams + + # Sample prompts. + prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", + ] + # Create a sampling params object. + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + # Create an LLM. + llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ") + # Generate texts from the prompts. The output is a list of RequestOutput objects + # that contain the prompt, generated text, and other information. + outputs = llm.generate(prompts, sampling_params) + # Print the outputs. 
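+    # Note: output.outputs may hold several candidates when SamplingParams(n=...) > 1; index 0 is the first sample.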
+ for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + ``` diff --git a/docs/features/quantization/bitblas.md b/docs/features/quantization/bitblas.md index 9001725d9c0..3f8ae7a959c 100644 --- a/docs/features/quantization/bitblas.md +++ b/docs/features/quantization/bitblas.md @@ -43,17 +43,19 @@ llm = LLM( ## Read gptq format checkpoint -```python -from vllm import LLM -import torch - -# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint. -model_id = "hxbgsyxh/llama-13b-4bit-g-1" -llm = LLM( - model=model_id, - dtype=torch.float16, - trust_remote_code=True, - quantization="bitblas", - max_model_len=1024 -) -``` +??? Code + + ```python + from vllm import LLM + import torch + + # "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint. + model_id = "hxbgsyxh/llama-13b-4bit-g-1" + llm = LLM( + model=model_id, + dtype=torch.float16, + trust_remote_code=True, + quantization="bitblas", + max_model_len=1024 + ) + ``` diff --git a/docs/features/quantization/fp8.md b/docs/features/quantization/fp8.md index 01d5d9da046..ec7639af805 100644 --- a/docs/features/quantization/fp8.md +++ b/docs/features/quantization/fp8.md @@ -58,22 +58,24 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow. -```python -from llmcompressor.transformers import oneshot -from llmcompressor.modifiers.quantization import QuantizationModifier +??? Code -# Configure the simple PTQ quantization -recipe = QuantizationModifier( - targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]) + ```python + from llmcompressor.transformers import oneshot + from llmcompressor.modifiers.quantization import QuantizationModifier -# Apply the quantization algorithm. -oneshot(model=model, recipe=recipe) + # Configure the simple PTQ quantization + recipe = QuantizationModifier( + targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]) -# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic -SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic" -model.save_pretrained(SAVE_DIR) -tokenizer.save_pretrained(SAVE_DIR) -``` + # Apply the quantization algorithm. + oneshot(model=model, recipe=recipe) + + # Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic + SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic" + model.save_pretrained(SAVE_DIR) + tokenizer.save_pretrained(SAVE_DIR) + ``` ### 3. Evaluating Accuracy diff --git a/docs/features/quantization/gguf.md b/docs/features/quantization/gguf.md index 72f758f653a..014b513eeda 100644 --- a/docs/features/quantization/gguf.md +++ b/docs/features/quantization/gguf.md @@ -41,42 +41,44 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ You can also use the GGUF model directly through the LLM entrypoint: -```python -from vllm import LLM, SamplingParams - -# In this script, we demonstrate how to pass input to the chat method: -conversation = [ - { - "role": "system", - "content": "You are a helpful assistant" - }, - { - "role": "user", - "content": "Hello" - }, - { - "role": "assistant", - "content": "Hello! How can I assist you today?" - }, - { - "role": "user", - "content": "Write an essay about the importance of higher education.", - }, -] - -# Create a sampling params object. -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - -# Create an LLM. 
-llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", - tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0") -# Generate texts from the prompts. The output is a list of RequestOutput objects -# that contain the prompt, generated text, and other information. -outputs = llm.chat(conversation, sampling_params) - -# Print the outputs. -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` +??? Code + + ```python + from vllm import LLM, SamplingParams + + # In this script, we demonstrate how to pass input to the chat method: + conversation = [ + { + "role": "system", + "content": "You are a helpful assistant" + }, + { + "role": "user", + "content": "Hello" + }, + { + "role": "assistant", + "content": "Hello! How can I assist you today?" + }, + { + "role": "user", + "content": "Write an essay about the importance of higher education.", + }, + ] + + # Create a sampling params object. + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + # Create an LLM. + llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", + tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0") + # Generate texts from the prompts. The output is a list of RequestOutput objects + # that contain the prompt, generated text, and other information. + outputs = llm.chat(conversation, sampling_params) + + # Print the outputs. + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + ``` diff --git a/docs/features/quantization/gptqmodel.md b/docs/features/quantization/gptqmodel.md index 53e938d2cbd..2f088f474f1 100644 --- a/docs/features/quantization/gptqmodel.md +++ b/docs/features/quantization/gptqmodel.md @@ -31,28 +31,30 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`: -```python -from datasets import load_dataset -from gptqmodel import GPTQModel, QuantizeConfig +??? Code -model_id = "meta-llama/Llama-3.2-1B-Instruct" -quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit" + ```python + from datasets import load_dataset + from gptqmodel import GPTQModel, QuantizeConfig -calibration_dataset = load_dataset( - "allenai/c4", - data_files="en/c4-train.00001-of-01024.json.gz", - split="train" - ).select(range(1024))["text"] + model_id = "meta-llama/Llama-3.2-1B-Instruct" + quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit" -quant_config = QuantizeConfig(bits=4, group_size=128) + calibration_dataset = load_dataset( + "allenai/c4", + data_files="en/c4-train.00001-of-01024.json.gz", + split="train" + ).select(range(1024))["text"] -model = GPTQModel.load(model_id, quant_config) + quant_config = QuantizeConfig(bits=4, group_size=128) -# increase `batch_size` to match gpu/vram specs to speed up quantization -model.quantize(calibration_dataset, batch_size=2) + model = GPTQModel.load(model_id, quant_config) -model.save(quant_path) -``` + # increase `batch_size` to match gpu/vram specs to speed up quantization + model.quantize(calibration_dataset, batch_size=2) + + model.save(quant_path) + ``` ## Running a quantized model with vLLM @@ -67,32 +69,34 @@ python examples/offline_inference/llm_engine_example.py \ GPTQModel quantized models are also supported directly through the LLM entrypoint: -```python -from vllm import LLM, SamplingParams - -# Sample prompts. 
-prompts = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", -] - -# Create a sampling params object. -sampling_params = SamplingParams(temperature=0.6, top_p=0.9) - -# Create an LLM. -llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2") - -# Generate texts from the prompts. The output is a list of RequestOutput objects -# that contain the prompt, generated text, and other information. -outputs = llm.generate(prompts, sampling_params) - -# Print the outputs. -print("-"*50) -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}") +??? Code + + ```python + from vllm import LLM, SamplingParams + + # Sample prompts. + prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", + ] + + # Create a sampling params object. + sampling_params = SamplingParams(temperature=0.6, top_p=0.9) + + # Create an LLM. + llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2") + + # Generate texts from the prompts. The output is a list of RequestOutput objects + # that contain the prompt, generated text, and other information. + outputs = llm.generate(prompts, sampling_params) + + # Print the outputs. print("-"*50) -``` + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}") + print("-"*50) + ``` diff --git a/docs/features/quantization/int4.md b/docs/features/quantization/int4.md index b7d09206365..185e13649f4 100644 --- a/docs/features/quantization/int4.md +++ b/docs/features/quantization/int4.md @@ -53,51 +53,55 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd It's best to use calibration data that closely matches your deployment data. For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`: -```python -from datasets import load_dataset +??? Code -NUM_CALIBRATION_SAMPLES = 512 -MAX_SEQUENCE_LENGTH = 2048 + ```python + from datasets import load_dataset -# Load and preprocess the dataset -ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft") -ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) + NUM_CALIBRATION_SAMPLES = 512 + MAX_SEQUENCE_LENGTH = 2048 -def preprocess(example): - return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)} -ds = ds.map(preprocess) + # Load and preprocess the dataset + ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft") + ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) -def tokenize(sample): - return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False) -ds = ds.map(tokenize, remove_columns=ds.column_names) -``` + def preprocess(example): + return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)} + ds = ds.map(preprocess) + + def tokenize(sample): + return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False) + ds = ds.map(tokenize, remove_columns=ds.column_names) + ``` ### 3. 
Applying Quantization Now, apply the quantization algorithms: -```python -from llmcompressor.transformers import oneshot -from llmcompressor.modifiers.quantization import GPTQModifier -from llmcompressor.modifiers.smoothquant import SmoothQuantModifier - -# Configure the quantization algorithms -recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) - -# Apply quantization -oneshot( - model=model, - dataset=ds, - recipe=recipe, - max_seq_length=MAX_SEQUENCE_LENGTH, - num_calibration_samples=NUM_CALIBRATION_SAMPLES, -) +??? Code -# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128 -SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128" -model.save_pretrained(SAVE_DIR, save_compressed=True) -tokenizer.save_pretrained(SAVE_DIR) -``` + ```python + from llmcompressor.transformers import oneshot + from llmcompressor.modifiers.quantization import GPTQModifier + from llmcompressor.modifiers.smoothquant import SmoothQuantModifier + + # Configure the quantization algorithms + recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) + + # Apply quantization + oneshot( + model=model, + dataset=ds, + recipe=recipe, + max_seq_length=MAX_SEQUENCE_LENGTH, + num_calibration_samples=NUM_CALIBRATION_SAMPLES, + ) + + # Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128 + SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128" + model.save_pretrained(SAVE_DIR, save_compressed=True) + tokenizer.save_pretrained(SAVE_DIR) + ``` This process creates a W4A16 model with weights quantized to 4-bit integers. @@ -137,34 +141,36 @@ $ lm_eval --model vllm \ The following is an example of an expanded quantization recipe you can tune to your own use case: -```python -from compressed_tensors.quantization import ( - QuantizationArgs, - QuantizationScheme, - QuantizationStrategy, - QuantizationType, -) -recipe = GPTQModifier( - targets="Linear", - config_groups={ - "config_group": QuantizationScheme( - targets=["Linear"], - weights=QuantizationArgs( - num_bits=4, - type=QuantizationType.INT, - strategy=QuantizationStrategy.GROUP, - group_size=128, - symmetric=True, - dynamic=False, - actorder="weight", +??? Code + + ```python + from compressed_tensors.quantization import ( + QuantizationArgs, + QuantizationScheme, + QuantizationStrategy, + QuantizationType, + ) + recipe = GPTQModifier( + targets="Linear", + config_groups={ + "config_group": QuantizationScheme( + targets=["Linear"], + weights=QuantizationArgs( + num_bits=4, + type=QuantizationType.INT, + strategy=QuantizationStrategy.GROUP, + group_size=128, + symmetric=True, + dynamic=False, + actorder="weight", + ), ), - ), - }, - ignore=["lm_head"], - update_size=NUM_CALIBRATION_SAMPLES, - dampening_frac=0.01 -) -``` + }, + ignore=["lm_head"], + update_size=NUM_CALIBRATION_SAMPLES, + dampening_frac=0.01 + ) + ``` ## Troubleshooting and Support diff --git a/docs/features/quantization/int8.md b/docs/features/quantization/int8.md index 1d9fba9dc87..de5ae5c0440 100644 --- a/docs/features/quantization/int8.md +++ b/docs/features/quantization/int8.md @@ -54,54 +54,60 @@ When quantizing activations to INT8, you need sample data to estimate the activa It's best to use calibration data that closely matches your deployment data. For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`: -```python -from datasets import load_dataset +??? 
Code -NUM_CALIBRATION_SAMPLES = 512 -MAX_SEQUENCE_LENGTH = 2048 + ```python + from datasets import load_dataset -# Load and preprocess the dataset -ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft") -ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) + NUM_CALIBRATION_SAMPLES = 512 + MAX_SEQUENCE_LENGTH = 2048 -def preprocess(example): - return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)} -ds = ds.map(preprocess) + # Load and preprocess the dataset + ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft") + ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) -def tokenize(sample): - return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False) -ds = ds.map(tokenize, remove_columns=ds.column_names) -``` + def preprocess(example): + return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)} + ds = ds.map(preprocess) + + def tokenize(sample): + return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False) + ds = ds.map(tokenize, remove_columns=ds.column_names) + ``` + + ### 3. Applying Quantization Now, apply the quantization algorithms: -```python -from llmcompressor.transformers import oneshot -from llmcompressor.modifiers.quantization import GPTQModifier -from llmcompressor.modifiers.smoothquant import SmoothQuantModifier - -# Configure the quantization algorithms -recipe = [ - SmoothQuantModifier(smoothing_strength=0.8), - GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]), -] - -# Apply quantization -oneshot( - model=model, - dataset=ds, - recipe=recipe, - max_seq_length=MAX_SEQUENCE_LENGTH, - num_calibration_samples=NUM_CALIBRATION_SAMPLES, -) - -# Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token -SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token" -model.save_pretrained(SAVE_DIR, save_compressed=True) -tokenizer.save_pretrained(SAVE_DIR) -``` +??? Code + + ```python + from llmcompressor.transformers import oneshot + from llmcompressor.modifiers.quantization import GPTQModifier + from llmcompressor.modifiers.smoothquant import SmoothQuantModifier + + # Configure the quantization algorithms + recipe = [ + SmoothQuantModifier(smoothing_strength=0.8), + GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]), + ] + + # Apply quantization + oneshot( + model=model, + dataset=ds, + recipe=recipe, + max_seq_length=MAX_SEQUENCE_LENGTH, + num_calibration_samples=NUM_CALIBRATION_SAMPLES, + ) + + # Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token + SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token" + model.save_pretrained(SAVE_DIR, save_compressed=True) + tokenizer.save_pretrained(SAVE_DIR) + ``` This process creates a W8A8 model with weights and activations quantized to 8-bit integers. diff --git a/docs/features/quantization/modelopt.md b/docs/features/quantization/modelopt.md index 001d18657da..0bb6003832b 100644 --- a/docs/features/quantization/modelopt.md +++ b/docs/features/quantization/modelopt.md @@ -14,24 +14,26 @@ You can quantize HuggingFace models using the example scripts provided in the Te Below is an example showing how to quantize a model using modelopt's PTQ API: -```python -import modelopt.torch.quantization as mtq -from transformers import AutoModelForCausalLM +??? 
Code -# Load the model from HuggingFace -model = AutoModelForCausalLM.from_pretrained("") + ```python + import modelopt.torch.quantization as mtq + from transformers import AutoModelForCausalLM -# Select the quantization config, for example, FP8 -config = mtq.FP8_DEFAULT_CFG + # Load the model from HuggingFace + model = AutoModelForCausalLM.from_pretrained("") -# Define a forward loop function for calibration -def forward_loop(model): - for data in calib_set: - model(data) + # Select the quantization config, for example, FP8 + config = mtq.FP8_DEFAULT_CFG -# PTQ with in-place replacement of quantized modules -model = mtq.quantize(model, config, forward_loop) -``` + # Define a forward loop function for calibration + def forward_loop(model): + for data in calib_set: + model(data) + + # PTQ with in-place replacement of quantized modules + model = mtq.quantize(model, config, forward_loop) + ``` After the model is quantized, you can export it to a quantized checkpoint using the export API: @@ -48,31 +50,33 @@ with torch.inference_mode(): The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM: -```python -from vllm import LLM, SamplingParams +??? Code -def main(): + ```python + from vllm import LLM, SamplingParams - model_id = "nvidia/Llama-3.1-8B-Instruct-FP8" - # Ensure you specify quantization='modelopt' when loading the modelopt checkpoint - llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True) + def main(): - sampling_params = SamplingParams(temperature=0.8, top_p=0.9) + model_id = "nvidia/Llama-3.1-8B-Instruct-FP8" + # Ensure you specify quantization='modelopt' when loading the modelopt checkpoint + llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True) - prompts = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", - ] + sampling_params = SamplingParams(temperature=0.8, top_p=0.9) - outputs = llm.generate(prompts, sampling_params) + prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", + ] - for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + outputs = llm.generate(prompts, sampling_params) -if __name__ == "__main__": - main() -``` + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + + if __name__ == "__main__": + main() + ``` diff --git a/docs/features/quantization/quantized_kvcache.md b/docs/features/quantization/quantized_kvcache.md index e3ebd024bab..52b8d38ace1 100644 --- a/docs/features/quantization/quantized_kvcache.md +++ b/docs/features/quantization/quantized_kvcache.md @@ -35,20 +35,22 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades Here is an example of how to enable FP8 quantization: -```python -# To calculate kv cache scales on the fly enable the calculate_kv_scales -# parameter +??? 
Code -from vllm import LLM, SamplingParams + ```python + # To calculate kv cache scales on the fly enable the calculate_kv_scales + # parameter -sampling_params = SamplingParams(temperature=0.7, top_p=0.8) -llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", - kv_cache_dtype="fp8", - calculate_kv_scales=True) -prompt = "London is the capital of" -out = llm.generate(prompt, sampling_params)[0].outputs[0].text -print(out) -``` + from vllm import LLM, SamplingParams + + sampling_params = SamplingParams(temperature=0.7, top_p=0.8) + llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", + kv_cache_dtype="fp8", + calculate_kv_scales=True) + prompt = "London is the capital of" + out = llm.generate(prompt, sampling_params)[0].outputs[0].text + print(out) + ``` The `kv_cache_dtype` argument specifies the data type for KV cache storage: - `"auto"`: Uses the model's default "unquantized" data type @@ -71,67 +73,69 @@ pip install llmcompressor Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern): -```python -from datasets import load_dataset -from transformers import AutoModelForCausalLM, AutoTokenizer -from llmcompressor.transformers import oneshot - -# Select model and load it -MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct" -model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto") -tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) - -# Select calibration dataset -DATASET_ID = "HuggingFaceH4/ultrachat_200k" -DATASET_SPLIT = "train_sft" - -# Configure calibration parameters -NUM_CALIBRATION_SAMPLES = 512 # 512 samples is a good starting point -MAX_SEQUENCE_LENGTH = 2048 - -# Load and preprocess dataset -ds = load_dataset(DATASET_ID, split=DATASET_SPLIT) -ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) - -def process_and_tokenize(example): - text = tokenizer.apply_chat_template(example["messages"], tokenize=False) - return tokenizer( - text, - padding=False, - max_length=MAX_SEQUENCE_LENGTH, - truncation=True, - add_special_tokens=False, +??? 
Code + + ```python + from datasets import load_dataset + from transformers import AutoModelForCausalLM, AutoTokenizer + from llmcompressor.transformers import oneshot + + # Select model and load it + MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct" + model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto") + tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) + + # Select calibration dataset + DATASET_ID = "HuggingFaceH4/ultrachat_200k" + DATASET_SPLIT = "train_sft" + + # Configure calibration parameters + NUM_CALIBRATION_SAMPLES = 512 # 512 samples is a good starting point + MAX_SEQUENCE_LENGTH = 2048 + + # Load and preprocess dataset + ds = load_dataset(DATASET_ID, split=DATASET_SPLIT) + ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) + + def process_and_tokenize(example): + text = tokenizer.apply_chat_template(example["messages"], tokenize=False) + return tokenizer( + text, + padding=False, + max_length=MAX_SEQUENCE_LENGTH, + truncation=True, + add_special_tokens=False, + ) + + ds = ds.map(process_and_tokenize, remove_columns=ds.column_names) + + # Configure quantization settings + recipe = """ + quant_stage: + quant_modifiers: + QuantizationModifier: + kv_cache_scheme: + num_bits: 8 + type: float + strategy: tensor + dynamic: false + symmetric: true + """ + + # Apply quantization + oneshot( + model=model, + dataset=ds, + recipe=recipe, + max_seq_length=MAX_SEQUENCE_LENGTH, + num_calibration_samples=NUM_CALIBRATION_SAMPLES, ) -ds = ds.map(process_and_tokenize, remove_columns=ds.column_names) - -# Configure quantization settings -recipe = """ -quant_stage: - quant_modifiers: - QuantizationModifier: - kv_cache_scheme: - num_bits: 8 - type: float - strategy: tensor - dynamic: false - symmetric: true -""" - -# Apply quantization -oneshot( - model=model, - dataset=ds, - recipe=recipe, - max_seq_length=MAX_SEQUENCE_LENGTH, - num_calibration_samples=NUM_CALIBRATION_SAMPLES, -) - -# Save quantized model: Llama-3.1-8B-Instruct-FP8-KV -SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV" -model.save_pretrained(SAVE_DIR, save_compressed=True) -tokenizer.save_pretrained(SAVE_DIR) -``` + # Save quantized model: Llama-3.1-8B-Instruct-FP8-KV + SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV" + model.save_pretrained(SAVE_DIR, save_compressed=True) + tokenizer.save_pretrained(SAVE_DIR) + ``` The above script will create a folder in your current directory containing your quantized model (e.g., `Llama-3.1-8B-Instruct-FP8-KV`) with calibrated scales. diff --git a/docs/features/quantization/quark.md b/docs/features/quantization/quark.md index 35e9dbe2609..6e77584da23 100644 --- a/docs/features/quantization/quark.md +++ b/docs/features/quantization/quark.md @@ -42,20 +42,22 @@ The Quark quantization process can be listed for 5 steps as below: Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index) to fetch model and tokenizer. -```python -from transformers import AutoTokenizer, AutoModelForCausalLM +??? 
Code -MODEL_ID = "meta-llama/Llama-2-70b-chat-hf" -MAX_SEQ_LEN = 512 + ```python + from transformers import AutoTokenizer, AutoModelForCausalLM -model = AutoModelForCausalLM.from_pretrained( - MODEL_ID, device_map="auto", torch_dtype="auto", -) -model.eval() + MODEL_ID = "meta-llama/Llama-2-70b-chat-hf" + MAX_SEQ_LEN = 512 -tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=MAX_SEQ_LEN) -tokenizer.pad_token = tokenizer.eos_token -``` + model = AutoModelForCausalLM.from_pretrained( + MODEL_ID, device_map="auto", torch_dtype="auto", + ) + model.eval() + + tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=MAX_SEQ_LEN) + tokenizer.pad_token = tokenizer.eos_token + ``` ### 2. Prepare the Calibration Dataloader @@ -63,22 +65,24 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic to load calibration data. For more details about how to use calibration datasets efficiently, please refer to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html). -```python -from datasets import load_dataset -from torch.utils.data import DataLoader +??? Code -BATCH_SIZE = 1 -NUM_CALIBRATION_DATA = 512 + ```python + from datasets import load_dataset + from torch.utils.data import DataLoader -# Load the dataset and get calibration data. -dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation") -text_data = dataset["text"][:NUM_CALIBRATION_DATA] + BATCH_SIZE = 1 + NUM_CALIBRATION_DATA = 512 -tokenized_outputs = tokenizer(text_data, return_tensors="pt", - padding=True, truncation=True, max_length=MAX_SEQ_LEN) -calib_dataloader = DataLoader(tokenized_outputs['input_ids'], - batch_size=BATCH_SIZE, drop_last=True) -``` + # Load the dataset and get calibration data. + dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation") + text_data = dataset["text"][:NUM_CALIBRATION_DATA] + + tokenized_outputs = tokenizer(text_data, return_tensors="pt", + padding=True, truncation=True, max_length=MAX_SEQ_LEN) + calib_dataloader = DataLoader(tokenized_outputs['input_ids'], + batch_size=BATCH_SIZE, drop_last=True) + ``` ### 3. Set the Quantization Configuration @@ -94,42 +98,44 @@ kv-cache and the quantization algorithm is AutoSmoothQuant. AutoSmoothQuant config file for Llama is `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`. -```python -from quark.torch.quantization import (Config, QuantizationConfig, - FP8E4M3PerTensorSpec, - load_quant_algo_config_from_file) - -# Define fp8/per-tensor/static spec. -FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(observer_method="min_max", - is_dynamic=False).to_quantization_spec() - -# Define global quantization config, input tensors and weight apply FP8_PER_TENSOR_SPEC. -global_quant_config = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC, - weight=FP8_PER_TENSOR_SPEC) - -# Define quantization config for kv-cache layers, output tensors apply FP8_PER_TENSOR_SPEC. -KV_CACHE_SPEC = FP8_PER_TENSOR_SPEC -kv_cache_layer_names_for_llama = ["*k_proj", "*v_proj"] -kv_cache_quant_config = {name : - QuantizationConfig(input_tensors=global_quant_config.input_tensors, - weight=global_quant_config.weight, - output_tensors=KV_CACHE_SPEC) - for name in kv_cache_layer_names_for_llama} -layer_quant_config = kv_cache_quant_config.copy() - -# Define algorithm config by config file. 
-LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE = - 'examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json' -algo_config = load_quant_algo_config_from_file(LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE) - -EXCLUDE_LAYERS = ["lm_head"] -quant_config = Config( - global_quant_config=global_quant_config, - layer_quant_config=layer_quant_config, - kv_cache_quant_config=kv_cache_quant_config, - exclude=EXCLUDE_LAYERS, - algo_config=algo_config) -``` +??? Code + + ```python + from quark.torch.quantization import (Config, QuantizationConfig, + FP8E4M3PerTensorSpec, + load_quant_algo_config_from_file) + + # Define fp8/per-tensor/static spec. + FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(observer_method="min_max", + is_dynamic=False).to_quantization_spec() + + # Define global quantization config, input tensors and weight apply FP8_PER_TENSOR_SPEC. + global_quant_config = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC, + weight=FP8_PER_TENSOR_SPEC) + + # Define quantization config for kv-cache layers, output tensors apply FP8_PER_TENSOR_SPEC. + KV_CACHE_SPEC = FP8_PER_TENSOR_SPEC + kv_cache_layer_names_for_llama = ["*k_proj", "*v_proj"] + kv_cache_quant_config = {name : + QuantizationConfig(input_tensors=global_quant_config.input_tensors, + weight=global_quant_config.weight, + output_tensors=KV_CACHE_SPEC) + for name in kv_cache_layer_names_for_llama} + layer_quant_config = kv_cache_quant_config.copy() + + # Define algorithm config by config file. + LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE = + 'examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json' + algo_config = load_quant_algo_config_from_file(LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE) + + EXCLUDE_LAYERS = ["lm_head"] + quant_config = Config( + global_quant_config=global_quant_config, + layer_quant_config=layer_quant_config, + kv_cache_quant_config=kv_cache_quant_config, + exclude=EXCLUDE_LAYERS, + algo_config=algo_config) + ``` ### 4. Quantize the Model and Export @@ -139,63 +145,67 @@ HuggingFace `safetensors`, you can refer to [HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html) for more exporting format details. -```python -import torch -from quark.torch import ModelQuantizer, ModelExporter -from quark.torch.export import ExporterConfig, JsonExporterConfig - -# Apply quantization. -quantizer = ModelQuantizer(quant_config) -quant_model = quantizer.quantize_model(model, calib_dataloader) - -# Freeze quantized model to export. -freezed_model = quantizer.freeze(model) - -# Define export config. -LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"] -export_config = ExporterConfig(json_export_config=JsonExporterConfig()) -export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP - -# Model: Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant -EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant" -exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR) -with torch.no_grad(): - exporter.export_safetensors_model(freezed_model, - quant_config=quant_config, tokenizer=tokenizer) -``` +??? Code + + ```python + import torch + from quark.torch import ModelQuantizer, ModelExporter + from quark.torch.export import ExporterConfig, JsonExporterConfig + + # Apply quantization. + quantizer = ModelQuantizer(quant_config) + quant_model = quantizer.quantize_model(model, calib_dataloader) + + # Freeze quantized model to export. + freezed_model = quantizer.freeze(model) + + # Define export config. 
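+    # The group below matches the KV-cache layer patterns ("*k_proj", "*v_proj") used in the quantization config above.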
+ LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"] + export_config = ExporterConfig(json_export_config=JsonExporterConfig()) + export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP + + # Model: Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant + EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant" + exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR) + with torch.no_grad(): + exporter.export_safetensors_model(freezed_model, + quant_config=quant_config, tokenizer=tokenizer) + ``` ### 5. Evaluation in vLLM Now, you can load and run the Quark quantized model directly through the LLM entrypoint: -```python -from vllm import LLM, SamplingParams - -# Sample prompts. -prompts = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", -] -# Create a sampling params object. -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - -# Create an LLM. -llm = LLM(model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant", - kv_cache_dtype='fp8',quantization='quark') -# Generate texts from the prompts. The output is a list of RequestOutput objects -# that contain the prompt, generated text, and other information. -outputs = llm.generate(prompts, sampling_params) -# Print the outputs. -print("\nGenerated Outputs:\n" + "-" * 60) -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}") - print(f"Output: {generated_text!r}") - print("-" * 60) -``` +??? Code + + ```python + from vllm import LLM, SamplingParams + + # Sample prompts. + prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", + ] + # Create a sampling params object. + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + # Create an LLM. + llm = LLM(model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant", + kv_cache_dtype='fp8',quantization='quark') + # Generate texts from the prompts. The output is a list of RequestOutput objects + # that contain the prompt, generated text, and other information. + outputs = llm.generate(prompts, sampling_params) + # Print the outputs. + print("\nGenerated Outputs:\n" + "-" * 60) + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}") + print(f"Output: {generated_text!r}") + print("-" * 60) + ``` Or, you can use `lm_eval` to evaluate accuracy: diff --git a/docs/features/quantization/torchao.md b/docs/features/quantization/torchao.md index a7a517af85a..c45979a3611 100644 --- a/docs/features/quantization/torchao.md +++ b/docs/features/quantization/torchao.md @@ -15,26 +15,28 @@ pip install \ ## Quantizing HuggingFace Models You can quantize your own huggingface model with torchao, e.g. 
[transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code: -```Python -import torch -from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer -from torchao.quantization import Int8WeightOnlyConfig - -model_name = "meta-llama/Meta-Llama-3-8B" -quantization_config = TorchAoConfig(Int8WeightOnlyConfig()) -quantized_model = AutoModelForCausalLM.from_pretrained( - model_name, - torch_dtype="auto", - device_map="auto", - quantization_config=quantization_config -) -tokenizer = AutoTokenizer.from_pretrained(model_name) -input_text = "What are we having for dinner?" -input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") - -hub_repo = # YOUR HUB REPO ID -tokenizer.push_to_hub(hub_repo) -quantized_model.push_to_hub(hub_repo, safe_serialization=False) -``` +??? Code + + ```Python + import torch + from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer + from torchao.quantization import Int8WeightOnlyConfig + + model_name = "meta-llama/Meta-Llama-3-8B" + quantization_config = TorchAoConfig(Int8WeightOnlyConfig()) + quantized_model = AutoModelForCausalLM.from_pretrained( + model_name, + torch_dtype="auto", + device_map="auto", + quantization_config=quantization_config + ) + tokenizer = AutoTokenizer.from_pretrained(model_name) + input_text = "What are we having for dinner?" + input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") + + hub_repo = # YOUR HUB REPO ID + tokenizer.push_to_hub(hub_repo) + quantized_model.push_to_hub(hub_repo, safe_serialization=False) + ``` Alternatively, you can use the [TorchAO Quantization space](https://huggingface.co/spaces/medmekk/TorchAO_Quantization) for quantizing models with a simple UI. diff --git a/docs/features/reasoning_outputs.md b/docs/features/reasoning_outputs.md index 59ef10d9c96..2e6afe61663 100644 --- a/docs/features/reasoning_outputs.md +++ b/docs/features/reasoning_outputs.md @@ -33,34 +33,36 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ Next, make a request to the model that should return the reasoning content in the response. -```python -from openai import OpenAI +??? Code -# Modify OpenAI's API key and API base to use vLLM's API server. -openai_api_key = "EMPTY" -openai_api_base = "http://localhost:8000/v1" + ```python + from openai import OpenAI -client = OpenAI( - api_key=openai_api_key, - base_url=openai_api_base, -) + # Modify OpenAI's API key and API base to use vLLM's API server. 
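+    # vLLM only enforces the API key if the server was started with --api-key; the client just needs a non-empty value.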
+ openai_api_key = "EMPTY" + openai_api_base = "http://localhost:8000/v1" -models = client.models.list() -model = models.data[0].id + client = OpenAI( + api_key=openai_api_key, + base_url=openai_api_base, + ) -# Round 1 -messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}] -# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}` -# For Qwen3 series, if you want to disable thinking in reasoning mode, add: -# extra_body={"chat_template_kwargs": {"enable_thinking": False}} -response = client.chat.completions.create(model=model, messages=messages) + models = client.models.list() + model = models.data[0].id -reasoning_content = response.choices[0].message.reasoning_content -content = response.choices[0].message.content + # Round 1 + messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}] + # For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}` + # For Qwen3 series, if you want to disable thinking in reasoning mode, add: + # extra_body={"chat_template_kwargs": {"enable_thinking": False}} + response = client.chat.completions.create(model=model, messages=messages) -print("reasoning_content:", reasoning_content) -print("content:", content) -``` + reasoning_content = response.choices[0].message.reasoning_content + content = response.choices[0].message.content + + print("reasoning_content:", reasoning_content) + print("content:", content) + ``` The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion. @@ -68,77 +70,81 @@ The `reasoning_content` field contains the reasoning steps that led to the final Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming). -```json -{ - "id": "chatcmpl-123", - "object": "chat.completion.chunk", - "created": 1694268190, - "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", - "system_fingerprint": "fp_44709d6fcb", - "choices": [ - { - "index": 0, - "delta": { - "role": "assistant", - "reasoning_content": "is", - }, - "logprobs": null, - "finish_reason": null - } - ] -} -``` +??? Json + + ```json + { + "id": "chatcmpl-123", + "object": "chat.completion.chunk", + "created": 1694268190, + "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", + "system_fingerprint": "fp_44709d6fcb", + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "reasoning_content": "is", + }, + "logprobs": null, + "finish_reason": null + } + ] + } + ``` OpenAI Python client library does not officially support `reasoning_content` attribute for streaming output. But the client supports extra attributes in the response. You can use `hasattr` to check if the `reasoning_content` attribute is present in the response. For example: -```python -from openai import OpenAI - -# Modify OpenAI's API key and API base to use vLLM's API server. 
-openai_api_key = "EMPTY" -openai_api_base = "http://localhost:8000/v1" - -client = OpenAI( - api_key=openai_api_key, - base_url=openai_api_base, -) - -models = client.models.list() -model = models.data[0].id - -messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}] -# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}` -# For Qwen3 series, if you want to disable thinking in reasoning mode, add: -# extra_body={"chat_template_kwargs": {"enable_thinking": False}} -stream = client.chat.completions.create(model=model, - messages=messages, - stream=True) - -print("client: Start streaming chat completions...") -printed_reasoning_content = False -printed_content = False - -for chunk in stream: - reasoning_content = None - content = None - # Check the content is reasoning_content or content - if hasattr(chunk.choices[0].delta, "reasoning_content"): - reasoning_content = chunk.choices[0].delta.reasoning_content - elif hasattr(chunk.choices[0].delta, "content"): - content = chunk.choices[0].delta.content - - if reasoning_content is not None: - if not printed_reasoning_content: - printed_reasoning_content = True - print("reasoning_content:", end="", flush=True) - print(reasoning_content, end="", flush=True) - elif content is not None: - if not printed_content: - printed_content = True - print("\ncontent:", end="", flush=True) - # Extract and print the content - print(content, end="", flush=True) -``` +??? Code + + ```python + from openai import OpenAI + + # Modify OpenAI's API key and API base to use vLLM's API server. + openai_api_key = "EMPTY" + openai_api_base = "http://localhost:8000/v1" + + client = OpenAI( + api_key=openai_api_key, + base_url=openai_api_base, + ) + + models = client.models.list() + model = models.data[0].id + + messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}] + # For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}` + # For Qwen3 series, if you want to disable thinking in reasoning mode, add: + # extra_body={"chat_template_kwargs": {"enable_thinking": False}} + stream = client.chat.completions.create(model=model, + messages=messages, + stream=True) + + print("client: Start streaming chat completions...") + printed_reasoning_content = False + printed_content = False + + for chunk in stream: + reasoning_content = None + content = None + # Check the content is reasoning_content or content + if hasattr(chunk.choices[0].delta, "reasoning_content"): + reasoning_content = chunk.choices[0].delta.reasoning_content + elif hasattr(chunk.choices[0].delta, "content"): + content = chunk.choices[0].delta.content + + if reasoning_content is not None: + if not printed_reasoning_content: + printed_reasoning_content = True + print("reasoning_content:", end="", flush=True) + print(reasoning_content, end="", flush=True) + elif content is not None: + if not printed_content: + printed_content = True + print("\ncontent:", end="", flush=True) + # Extract and print the content + print(content, end="", flush=True) + ``` Remember to check whether the `reasoning_content` exists in the response before accessing it. You could checkout the [example](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py). @@ -146,41 +152,43 @@ Remember to check whether the `reasoning_content` exists in the response before The reasoning content is also available when both tool calling and the reasoning parser are enabled. 
Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`. -```python -from openai import OpenAI - -client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy") - -tools = [{ - "type": "function", - "function": { - "name": "get_weather", - "description": "Get the current weather in a given location", - "parameters": { - "type": "object", - "properties": { - "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"}, - "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} - }, - "required": ["location", "unit"] +??? Code + + ```python + from openai import OpenAI + + client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy") + + tools = [{ + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather in a given location", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"}, + "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} + }, + "required": ["location", "unit"] + } } - } -}] + }] -response = client.chat.completions.create( - model=client.models.list().data[0].id, - messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}], - tools=tools, - tool_choice="auto" -) + response = client.chat.completions.create( + model=client.models.list().data[0].id, + messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}], + tools=tools, + tool_choice="auto" + ) -print(response) -tool_call = response.choices[0].message.tool_calls[0].function + print(response) + tool_call = response.choices[0].message.tool_calls[0].function -print(f"reasoning_content: {response.choices[0].message.reasoning_content}") -print(f"Function called: {tool_call.name}") -print(f"Arguments: {tool_call.arguments}") -``` + print(f"reasoning_content: {response.choices[0].message.reasoning_content}") + print(f"Function called: {tool_call.name}") + print(f"Arguments: {tool_call.arguments}") + ``` For more examples, please refer to . @@ -192,85 +200,89 @@ For more examples, please refer to . -```python -# import the required packages - -from vllm.reasoning import ReasoningParser, ReasoningParserManager -from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, - DeltaMessage) - -# define a reasoning parser and register it to vllm -# the name list in register_module can be used -# in --reasoning-parser. -@ReasoningParserManager.register_module(["example"]) -class ExampleParser(ReasoningParser): - def __init__(self, tokenizer: AnyTokenizer): - super().__init__(tokenizer) - - def extract_reasoning_content_streaming( - self, - previous_text: str, - current_text: str, - delta_text: str, - previous_token_ids: Sequence[int], - current_token_ids: Sequence[int], - delta_token_ids: Sequence[int], - ) -> Union[DeltaMessage, None]: - """ - Instance method that should be implemented for extracting reasoning - from an incomplete response; for use when handling reasoning calls and - streaming. Has to be an instance method because it requires state - - the current tokens/diffs, but also the information about what has - previously been parsed and extracted (see constructor) - """ - - def extract_reasoning_content( - self, model_output: str, request: ChatCompletionRequest - ) -> tuple[Optional[str], Optional[str]]: - """ - Extract reasoning content from a complete model-generated string. 
- - Used for non-streaming responses where we have the entire model response - available before sending to the client. +??? Code + + ```python + # import the required packages + + from vllm.reasoning import ReasoningParser, ReasoningParserManager + from vllm.entrypoints.openai.protocol import (ChatCompletionRequest, + DeltaMessage) + + # define a reasoning parser and register it to vllm + # the name list in register_module can be used + # in --reasoning-parser. + @ReasoningParserManager.register_module(["example"]) + class ExampleParser(ReasoningParser): + def __init__(self, tokenizer: AnyTokenizer): + super().__init__(tokenizer) + + def extract_reasoning_content_streaming( + self, + previous_text: str, + current_text: str, + delta_text: str, + previous_token_ids: Sequence[int], + current_token_ids: Sequence[int], + delta_token_ids: Sequence[int], + ) -> Union[DeltaMessage, None]: + """ + Instance method that should be implemented for extracting reasoning + from an incomplete response; for use when handling reasoning calls and + streaming. Has to be an instance method because it requires state - + the current tokens/diffs, but also the information about what has + previously been parsed and extracted (see constructor) + """ + + def extract_reasoning_content( + self, model_output: str, request: ChatCompletionRequest + ) -> tuple[Optional[str], Optional[str]]: + """ + Extract reasoning content from a complete model-generated string. + + Used for non-streaming responses where we have the entire model response + available before sending to the client. + + Parameters: + model_output: str + The model-generated string to extract reasoning content from. + + request: ChatCompletionRequest + The request object that was used to generate the model_output. + + Returns: + tuple[Optional[str], Optional[str]] + A tuple containing the reasoning content and the content. + """ + ``` - Parameters: - model_output: str - The model-generated string to extract reasoning content from. +Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in . - request: ChatCompletionRequest - The request object that was used to generate the model_output. +??? Code - Returns: - tuple[Optional[str], Optional[str]] - A tuple containing the reasoning content and the content. + ```python + @dataclass + class DeepSeekReasoner(Reasoner): """ -``` - -Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in . - -```python -@dataclass -class DeepSeekReasoner(Reasoner): - """ - Reasoner for DeepSeek R series models. - """ - start_token_id: int - end_token_id: int - - start_token: str = "" - end_token: str = "" - - @classmethod - def from_tokenizer(cls, tokenizer: PreTrainedTokenizer) -> Reasoner: - return cls(start_token_id=tokenizer.encode( - "", add_special_tokens=False)[0], - end_token_id=tokenizer.encode("", - add_special_tokens=False)[0]) - - def is_reasoning_end(self, input_ids: list[int]) -> bool: - return self.end_token_id in input_ids - ... -``` + Reasoner for DeepSeek R series models. + """ + start_token_id: int + end_token_id: int + + start_token: str = "" + end_token: str = "" + + @classmethod + def from_tokenizer(cls, tokenizer: PreTrainedTokenizer) -> Reasoner: + return cls(start_token_id=tokenizer.encode( + "", add_special_tokens=False)[0], + end_token_id=tokenizer.encode("", + add_special_tokens=False)[0]) + + def is_reasoning_end(self, input_ids: list[int]) -> bool: + return self.end_token_id in input_ids + ... 
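+
+        # `start_token`/`end_token` (shown empty above) are the model's
+        # reasoning delimiters; for DeepSeek R1 they are the "<think>" and
+        # "</think>" tags. `is_reasoning_end` is what lets the structured
+        # output backend detect that reasoning has finished so that grammar
+        # enforcement can begin.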
+ ``` The structured output engine like [xgrammar](https://github.com/mlc-ai/xgrammar) will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case. diff --git a/docs/features/spec_decode.md b/docs/features/spec_decode.md index 5080960f72d..7055cde1e99 100644 --- a/docs/features/spec_decode.md +++ b/docs/features/spec_decode.md @@ -18,29 +18,31 @@ Speculative decoding is a technique which improves inter-token latency in memory The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time. -```python -from vllm import LLM, SamplingParams - -prompts = [ - "The future of AI is", -] -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - -llm = LLM( - model="facebook/opt-6.7b", - tensor_parallel_size=1, - speculative_config={ - "model": "facebook/opt-125m", - "num_speculative_tokens": 5, - }, -) -outputs = llm.generate(prompts, sampling_params) - -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` +??? Code + + ```python + from vllm import LLM, SamplingParams + + prompts = [ + "The future of AI is", + ] + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + llm = LLM( + model="facebook/opt-6.7b", + tensor_parallel_size=1, + speculative_config={ + "model": "facebook/opt-125m", + "num_speculative_tokens": 5, + }, + ) + outputs = llm.generate(prompts, sampling_params) + + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + ``` To perform the same with an online mode launch the server: @@ -60,69 +62,73 @@ python -m vllm.entrypoints.openai.api_server \ Then use a client: -```python -from openai import OpenAI - -# Modify OpenAI's API key and API base to use vLLM's API server. -openai_api_key = "EMPTY" -openai_api_base = "http://localhost:8000/v1" - -client = OpenAI( - # defaults to os.environ.get("OPENAI_API_KEY") - api_key=openai_api_key, - base_url=openai_api_base, -) - -models = client.models.list() -model = models.data[0].id - -# Completion API -stream = False -completion = client.completions.create( - model=model, - prompt="The future of AI is", - echo=False, - n=1, - stream=stream, -) - -print("Completion results:") -if stream: - for c in completion: - print(c) -else: - print(completion) -``` +??? Code + + ```python + from openai import OpenAI + + # Modify OpenAI's API key and API base to use vLLM's API server. + openai_api_key = "EMPTY" + openai_api_base = "http://localhost:8000/v1" + + client = OpenAI( + # defaults to os.environ.get("OPENAI_API_KEY") + api_key=openai_api_key, + base_url=openai_api_base, + ) + + models = client.models.list() + model = models.data[0].id + + # Completion API + stream = False + completion = client.completions.create( + model=model, + prompt="The future of AI is", + echo=False, + n=1, + stream=stream, + ) + + print("Completion results:") + if stream: + for c in completion: + print(c) + else: + print(completion) + ``` ## Speculating by matching n-grams in the prompt The following code configures vLLM to use speculative decoding where proposals are generated by matching n-grams in the prompt. 
For more information read [this thread.](https://x.com/joao_gante/status/1747322413006643259) -```python -from vllm import LLM, SamplingParams - -prompts = [ - "The future of AI is", -] -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - -llm = LLM( - model="facebook/opt-6.7b", - tensor_parallel_size=1, - speculative_config={ - "method": "ngram", - "num_speculative_tokens": 5, - "prompt_lookup_max": 4, - }, -) -outputs = llm.generate(prompts, sampling_params) - -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` +??? Code + + ```python + from vllm import LLM, SamplingParams + + prompts = [ + "The future of AI is", + ] + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + llm = LLM( + model="facebook/opt-6.7b", + tensor_parallel_size=1, + speculative_config={ + "method": "ngram", + "num_speculative_tokens": 5, + "prompt_lookup_max": 4, + }, + ) + outputs = llm.generate(prompts, sampling_params) + + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + ``` ## Speculating using MLP speculators @@ -131,29 +137,31 @@ draft models that conditioning draft predictions on both context vectors and sam For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or [this technical report](https://arxiv.org/abs/2404.19124). -```python -from vllm import LLM, SamplingParams - -prompts = [ - "The future of AI is", -] -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - -llm = LLM( - model="meta-llama/Meta-Llama-3.1-70B-Instruct", - tensor_parallel_size=4, - speculative_config={ - "model": "ibm-ai-platform/llama3-70b-accelerator", - "draft_tensor_parallel_size": 1, - }, -) -outputs = llm.generate(prompts, sampling_params) - -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` +??? Code + + ```python + from vllm import LLM, SamplingParams + + prompts = [ + "The future of AI is", + ] + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + llm = LLM( + model="meta-llama/Meta-Llama-3.1-70B-Instruct", + tensor_parallel_size=4, + speculative_config={ + "model": "ibm-ai-platform/llama3-70b-accelerator", + "draft_tensor_parallel_size": 1, + }, + ) + outputs = llm.generate(prompts, sampling_params) + + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + ``` Note that these speculative models currently need to be run without tensor parallelism, although it is possible to run the main model using tensor parallelism (see example above). Since the @@ -177,31 +185,33 @@ A variety of speculative models of this type are available on HF hub: The following code configures vLLM to use speculative decoding where proposals are generated by an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py). -```python -from vllm import LLM, SamplingParams +??? 
Code -prompts = [ - "The future of AI is", -] -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + ```python + from vllm import LLM, SamplingParams -llm = LLM( - model="meta-llama/Meta-Llama-3-8B-Instruct", - tensor_parallel_size=4, - speculative_config={ - "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", - "draft_tensor_parallel_size": 1, - }, -) + prompts = [ + "The future of AI is", + ] + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) -outputs = llm.generate(prompts, sampling_params) + llm = LLM( + model="meta-llama/Meta-Llama-3-8B-Instruct", + tensor_parallel_size=4, + speculative_config={ + "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", + "draft_tensor_parallel_size": 1, + }, + ) -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + outputs = llm.generate(prompts, sampling_params) -``` + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + + ``` A few important things to consider when using the EAGLE based draft models: diff --git a/docs/features/structured_outputs.md b/docs/features/structured_outputs.md index 044c7966099..b63f344ebd5 100644 --- a/docs/features/structured_outputs.md +++ b/docs/features/structured_outputs.md @@ -33,39 +33,43 @@ text. Now let´s see an example for each of the cases, starting with the `guided_choice`, as it´s the easiest one: -```python -from openai import OpenAI -client = OpenAI( - base_url="http://localhost:8000/v1", - api_key="-", -) -model = client.models.list().data[0].id - -completion = client.chat.completions.create( - model=model, - messages=[ - {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"} - ], - extra_body={"guided_choice": ["positive", "negative"]}, -) -print(completion.choices[0].message.content) -``` +??? Code + + ```python + from openai import OpenAI + client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="-", + ) + model = client.models.list().data[0].id + + completion = client.chat.completions.create( + model=model, + messages=[ + {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"} + ], + extra_body={"guided_choice": ["positive", "negative"]}, + ) + print(completion.choices[0].message.content) + ``` The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template: -```python -completion = client.chat.completions.create( - model=model, - messages=[ - { - "role": "user", - "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n", - } - ], - extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]}, -) -print(completion.choices[0].message.content) -``` +??? Code + + ```python + completion = client.chat.completions.create( + model=model, + messages=[ + { + "role": "user", + "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n", + } + ], + extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]}, + ) + print(completion.choices[0].message.content) + ``` One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats. 
For this we can use the `guided_json` parameter in two different ways:
@@ -75,41 +79,43 @@ For this we can use the `guided_json` parameter in two different ways:
 
 The next example shows how to use the `guided_json` parameter with a Pydantic model:
 
-```python
-from pydantic import BaseModel
-from enum import Enum
-
-class CarType(str, Enum):
-    sedan = "sedan"
-    suv = "SUV"
-    truck = "Truck"
-    coupe = "Coupe"
-
-class CarDescription(BaseModel):
-    brand: str
-    model: str
-    car_type: CarType
-
-json_schema = CarDescription.model_json_schema()
-
-completion = client.chat.completions.create(
-    model=model,
-    messages=[
-        {
-            "role": "user",
-            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
-        }
-    ],
-    "response_format": {
-        "type": "json_schema",
-        "json_schema": {
-            "name": "car-description",
-            "schema": CarDescription.model_json_schema()
+??? Code
+
+    ```python
+    from pydantic import BaseModel
+    from enum import Enum
+
+    class CarType(str, Enum):
+        sedan = "sedan"
+        suv = "SUV"
+        truck = "Truck"
+        coupe = "Coupe"
+
+    class CarDescription(BaseModel):
+        brand: str
+        model: str
+        car_type: CarType
+
+    json_schema = CarDescription.model_json_schema()
+
+    completion = client.chat.completions.create(
+        model=model,
+        messages=[
+            {
+                "role": "user",
+                "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
+            }
+        ],
+        response_format={
+            "type": "json_schema",
+            "json_schema": {
+                "name": "car-description",
+                "schema": json_schema
+            },
+        },
-        },
-    },
-)
-print(completion.choices[0].message.content)
-```
+    )
+    print(completion.choices[0].message.content)
+    ```
 
 !!! tip
     While not strictly necessary, normally it´s better to indicate in the prompt the
@@ -121,33 +127,35 @@ difficult to use, but it´s really powerful. It allows us to define complete
 languages like SQL queries. It works by using a context free EBNF grammar.
 As an example, we can use to define a specific format of simplified SQL queries:
 
-```python
-simplified_sql_grammar = """
-    root ::= select_statement
+??? Code
 
-    select_statement ::= "SELECT " column " from " table " where " condition
+    ```python
+    simplified_sql_grammar = """
+        root ::= select_statement
 
-    column ::= "col_1 " | "col_2 "
+        select_statement ::= "SELECT " column " from " table " where " condition
 
-    table ::= "table_1 " | "table_2 "
+        column ::= "col_1 " | "col_2 "
 
-    condition ::= column "= " number
+        table ::= "table_1 " | "table_2 "
 
-    number ::= "1 " | "2 "
-"""
+        condition ::= column "= " number
 
-completion = client.chat.completions.create(
-    model=model,
-    messages=[
-        {
-            "role": "user",
-            "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
-        }
-    ],
-    extra_body={"guided_grammar": simplified_sql_grammar},
-)
-print(completion.choices[0].message.content)
-```
+        number ::= "1 " | "2 "
+    """
+
+    completion = client.chat.completions.create(
+        model=model,
+        messages=[
+            {
+                "role": "user",
+                "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
+            }
+        ],
+        extra_body={"guided_grammar": simplified_sql_grammar},
+    )
+    print(completion.choices[0].message.content)
+    ```
 
 See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
 
@@ -161,34 +169,36 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r
 
 Note that you can use reasoning with any provided structured outputs feature. 
The following uses one with JSON schema: -```python -from pydantic import BaseModel - - -class People(BaseModel): - name: str - age: int - - -completion = client.chat.completions.create( - model=model, - messages=[ - { - "role": "user", - "content": "Generate a JSON with the name and age of one random person.", - } - ], - response_format={ - "type": "json_schema", - "json_schema": { - "name": "people", - "schema": People.model_json_schema() - } - }, -) -print("reasoning_content: ", completion.choices[0].message.reasoning_content) -print("content: ", completion.choices[0].message.content) -``` +??? Code + + ```python + from pydantic import BaseModel + + + class People(BaseModel): + name: str + age: int + + + completion = client.chat.completions.create( + model=model, + messages=[ + { + "role": "user", + "content": "Generate a JSON with the name and age of one random person.", + } + ], + response_format={ + "type": "json_schema", + "json_schema": { + "name": "people", + "schema": People.model_json_schema() + } + }, + ) + print("reasoning_content: ", completion.choices[0].message.reasoning_content) + print("content: ", completion.choices[0].message.content) + ``` See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html) @@ -202,33 +212,33 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3. Here is a simple example demonstrating how to get structured output using Pydantic models: -```python -from pydantic import BaseModel -from openai import OpenAI - -class Info(BaseModel): - name: str - age: int - -client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy") -model = client.models.list().data[0].id -completion = client.beta.chat.completions.parse( - model=model, - messages=[ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"}, - ], - response_format=Info, -) - -message = completion.choices[0].message -print(message) -assert message.parsed -print("Name:", message.parsed.name) -print("Age:", message.parsed.age) -``` - -Output: +??? Code + + ```python + from pydantic import BaseModel + from openai import OpenAI + + class Info(BaseModel): + name: str + age: int + + client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy") + model = client.models.list().data[0].id + completion = client.beta.chat.completions.parse( + model=model, + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "My name is Cameron, I'm 28. 
What's my name and age?"}, + ], + response_format=Info, + ) + + message = completion.choices[0].message + print(message) + assert message.parsed + print("Name:", message.parsed.name) + print("Age:", message.parsed.age) + ``` ```console ParsedChatCompletionMessage[Testing](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Testing(name='Cameron', age=28)) @@ -238,35 +248,37 @@ Age: 28 Here is a more complex example using nested Pydantic models to handle a step-by-step math solution: -```python -from typing import List -from pydantic import BaseModel -from openai import OpenAI - -class Step(BaseModel): - explanation: str - output: str - -class MathResponse(BaseModel): - steps: list[Step] - final_answer: str - -completion = client.beta.chat.completions.parse( - model=model, - messages=[ - {"role": "system", "content": "You are a helpful expert math tutor."}, - {"role": "user", "content": "Solve 8x + 31 = 2."}, - ], - response_format=MathResponse, -) - -message = completion.choices[0].message -print(message) -assert message.parsed -for i, step in enumerate(message.parsed.steps): - print(f"Step #{i}:", step) -print("Answer:", message.parsed.final_answer) -``` +??? Code + + ```python + from typing import List + from pydantic import BaseModel + from openai import OpenAI + + class Step(BaseModel): + explanation: str + output: str + + class MathResponse(BaseModel): + steps: list[Step] + final_answer: str + + completion = client.beta.chat.completions.parse( + model=model, + messages=[ + {"role": "system", "content": "You are a helpful expert math tutor."}, + {"role": "user", "content": "Solve 8x + 31 = 2."}, + ], + response_format=MathResponse, + ) + + message = completion.choices[0].message + print(message) + assert message.parsed + for i, step in enumerate(message.parsed.steps): + print(f"Step #{i}:", step) + print("Answer:", message.parsed.final_answer) + ``` Output: @@ -296,19 +308,21 @@ These parameters can be used in the same way as the parameters from the Online Serving examples above. One example for the usage of the `choice` parameter is shown below: -```python -from vllm import LLM, SamplingParams -from vllm.sampling_params import GuidedDecodingParams +??? 
Code -llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct") + ```python + from vllm import LLM, SamplingParams + from vllm.sampling_params import GuidedDecodingParams -guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"]) -sampling_params = SamplingParams(guided_decoding=guided_decoding_params) -outputs = llm.generate( - prompts="Classify this sentiment: vLLM is wonderful!", - sampling_params=sampling_params, -) -print(outputs[0].outputs[0].text) -``` + llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct") + + guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"]) + sampling_params = SamplingParams(guided_decoding=guided_decoding_params) + outputs = llm.generate( + prompts="Classify this sentiment: vLLM is wonderful!", + sampling_params=sampling_params, + ) + print(outputs[0].outputs[0].text) + ``` See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html) diff --git a/docs/features/tool_calling.md b/docs/features/tool_calling.md index 93ea164881c..9fb878777a4 100644 --- a/docs/features/tool_calling.md +++ b/docs/features/tool_calling.md @@ -15,44 +15,46 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \ Next, make a request to the model that should result in it using the available tools: -```python -from openai import OpenAI -import json - -client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy") - -def get_weather(location: str, unit: str): - return f"Getting the weather for {location} in {unit}..." -tool_functions = {"get_weather": get_weather} - -tools = [{ - "type": "function", - "function": { - "name": "get_weather", - "description": "Get the current weather in a given location", - "parameters": { - "type": "object", - "properties": { - "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"}, - "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} - }, - "required": ["location", "unit"] +??? Code + + ```python + from openai import OpenAI + import json + + client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy") + + def get_weather(location: str, unit: str): + return f"Getting the weather for {location} in {unit}..." 
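+
+    # Map each advertised tool name to the local Python function that
+    # implements it; the arguments returned by the model are passed to
+    # that function at the end of this example.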
+ tool_functions = {"get_weather": get_weather} + + tools = [{ + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather in a given location", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"}, + "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} + }, + "required": ["location", "unit"] + } } - } -}] - -response = client.chat.completions.create( - model=client.models.list().data[0].id, - messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}], - tools=tools, - tool_choice="auto" -) - -tool_call = response.choices[0].message.tool_calls[0].function -print(f"Function called: {tool_call.name}") -print(f"Arguments: {tool_call.arguments}") -print(f"Result: {get_weather(**json.loads(tool_call.arguments))}") -``` + }] + + response = client.chat.completions.create( + model=client.models.list().data[0].id, + messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}], + tools=tools, + tool_choice="auto" + ) + + tool_call = response.choices[0].message.tool_calls[0].function + print(f"Function called: {tool_call.name}") + print(f"Arguments: {tool_call.arguments}") + print(f"Result: {get_weather(**json.loads(tool_call.arguments))}") + ``` Example output: @@ -301,49 +303,51 @@ A tool parser plugin is a Python file containing one or more ToolParser implemen Here is a summary of a plugin file: -```python - -# import the required packages - -# define a tool parser and register it to vllm -# the name list in register_module can be used -# in --tool-call-parser. you can define as many -# tool parsers as you want here. -@ToolParserManager.register_module(["example"]) -class ExampleToolParser(ToolParser): - def __init__(self, tokenizer: AnyTokenizer): - super().__init__(tokenizer) - - # adjust request. e.g.: set skip special tokens - # to False for tool call output. - def adjust_request( - self, request: ChatCompletionRequest) -> ChatCompletionRequest: - return request - - # implement the tool call parse for stream call - def extract_tool_calls_streaming( - self, - previous_text: str, - current_text: str, - delta_text: str, - previous_token_ids: Sequence[int], - current_token_ids: Sequence[int], - delta_token_ids: Sequence[int], - request: ChatCompletionRequest, - ) -> Union[DeltaMessage, None]: - return delta - - # implement the tool parse for non-stream call - def extract_tool_calls( - self, - model_output: str, - request: ChatCompletionRequest, - ) -> ExtractedToolCallInformation: - return ExtractedToolCallInformation(tools_called=False, - tool_calls=[], - content=text) - -``` +??? Code + + ```python + + # import the required packages + + # define a tool parser and register it to vllm + # the name list in register_module can be used + # in --tool-call-parser. you can define as many + # tool parsers as you want here. + @ToolParserManager.register_module(["example"]) + class ExampleToolParser(ToolParser): + def __init__(self, tokenizer: AnyTokenizer): + super().__init__(tokenizer) + + # adjust request. e.g.: set skip special tokens + # to False for tool call output. 
+ def adjust_request( + self, request: ChatCompletionRequest) -> ChatCompletionRequest: + return request + + # implement the tool call parse for stream call + def extract_tool_calls_streaming( + self, + previous_text: str, + current_text: str, + delta_text: str, + previous_token_ids: Sequence[int], + current_token_ids: Sequence[int], + delta_token_ids: Sequence[int], + request: ChatCompletionRequest, + ) -> Union[DeltaMessage, None]: + return delta + + # implement the tool parse for non-stream call + def extract_tool_calls( + self, + model_output: str, + request: ChatCompletionRequest, + ) -> ExtractedToolCallInformation: + return ExtractedToolCallInformation(tools_called=False, + tool_calls=[], + content=text) + + ``` Then you can use this plugin in the command line like this. diff --git a/docs/getting_started/installation/cpu.md b/docs/getting_started/installation/cpu.md index 00bb5cae43f..3f75d1aef30 100644 --- a/docs/getting_started/installation/cpu.md +++ b/docs/getting_started/installation/cpu.md @@ -76,21 +76,23 @@ Currently, there are no pre-built CPU wheels. ### Build image from source -```console -$ docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai . - -# Launching OpenAI server -$ docker run --rm \ - --privileged=true \ - --shm-size=4g \ - -p 8000:8000 \ - -e VLLM_CPU_KVCACHE_SPACE= \ - -e VLLM_CPU_OMP_THREADS_BIND= \ - vllm-cpu-env \ - --model=meta-llama/Llama-3.2-1B-Instruct \ - --dtype=bfloat16 \ - other vLLM OpenAI server arguments -``` +??? Commands + + ```console + $ docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai . + + # Launching OpenAI server + $ docker run --rm \ + --privileged=true \ + --shm-size=4g \ + -p 8000:8000 \ + -e VLLM_CPU_KVCACHE_SPACE= \ + -e VLLM_CPU_OMP_THREADS_BIND= \ + vllm-cpu-env \ + --model=meta-llama/Llama-3.2-1B-Instruct \ + --dtype=bfloat16 \ + other vLLM OpenAI server arguments + ``` !!! tip For ARM or Apple silicon, use `docker/Dockerfile.arm` @@ -144,32 +146,34 @@ vllm serve facebook/opt-125m - If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND` or using auto thread binding feature by default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores: -```console -$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores - -# The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core. 
-CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ -0 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000 -1 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000 -2 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000 -3 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000 -4 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000 -5 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000 -6 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000 -7 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000 -8 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000 -9 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000 -10 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000 -11 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000 -12 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000 -13 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000 -14 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000 -15 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000 - -# On this platform, it is recommend to only bind openMP threads on logical CPU cores 0-7 or 8-15 -$ export VLLM_CPU_OMP_THREADS_BIND=0-7 -$ python examples/offline_inference/basic/basic.py -``` +??? Commands + + ```console + $ lscpu -e # check the mapping between logical CPU cores and physical CPU cores + + # The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core. + CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ + 0 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000 + 1 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000 + 2 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000 + 3 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000 + 4 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000 + 5 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000 + 6 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000 + 7 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000 + 8 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000 + 9 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000 + 10 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000 + 11 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000 + 12 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000 + 13 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000 + 14 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000 + 15 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000 + + # On this platform, it is recommend to only bind openMP threads on logical CPU cores 0-7 or 8-15 + $ export VLLM_CPU_OMP_THREADS_BIND=0-7 + $ python examples/offline_inference/basic/basic.py + ``` - If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access. diff --git a/docs/getting_started/installation/gpu/rocm.inc.md b/docs/getting_started/installation/gpu/rocm.inc.md index 8019fb50f4d..6bc714fe6e8 100644 --- a/docs/getting_started/installation/gpu/rocm.inc.md +++ b/docs/getting_started/installation/gpu/rocm.inc.md @@ -90,24 +90,26 @@ Currently, there are no pre-built ROCm wheels. 4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps: - ```bash - pip install --upgrade pip - - # Build & install AMD SMI - pip install /opt/rocm/share/amd_smi - - # Install dependencies - pip install --upgrade numba \ - scipy \ - huggingface-hub[cli,hf_transfer] \ - setuptools_scm - pip install "numpy<2" - pip install -r requirements/rocm.txt - - # Build vLLM for MI210/MI250/MI300. - export PYTORCH_ROCM_ARCH="gfx90a;gfx942" - python3 setup.py develop - ``` + ??? 
Commands + + ```bash + pip install --upgrade pip + + # Build & install AMD SMI + pip install /opt/rocm/share/amd_smi + + # Install dependencies + pip install --upgrade numba \ + scipy \ + huggingface-hub[cli,hf_transfer] \ + setuptools_scm + pip install "numpy<2" + pip install -r requirements/rocm.txt + + # Build vLLM for MI210/MI250/MI300. + export PYTORCH_ROCM_ARCH="gfx90a;gfx942" + python3 setup.py develop + ``` This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation. @@ -201,19 +203,21 @@ DOCKER_BUILDKIT=1 docker build \ To run the above docker image `vllm-rocm`, use the below command: -```console -docker run -it \ - --network=host \ - --group-add=video \ - --ipc=host \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --device /dev/kfd \ - --device /dev/dri \ - -v :/app/model \ - vllm-rocm \ - bash -``` +??? Command + + ```console + docker run -it \ + --network=host \ + --group-add=video \ + --ipc=host \ + --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + --device /dev/kfd \ + --device /dev/dri \ + -v :/app/model \ + vllm-rocm \ + bash + ``` Where the `` is the location where the model is stored, for example, the weights for llama2 or llama3 models. diff --git a/docs/getting_started/installation/intel_gaudi.md b/docs/getting_started/installation/intel_gaudi.md index f5970850aae..056caa70814 100644 --- a/docs/getting_started/installation/intel_gaudi.md +++ b/docs/getting_started/installation/intel_gaudi.md @@ -200,7 +200,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 1 `min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling -- `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes. -Example (with ramp-up) +Example (with ramp-up): ```text min = 2, step = 32, max = 64 @@ -209,7 +209,7 @@ min = 2, step = 32, max = 64 => buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64) ``` -Example (without ramp-up) +Example (without ramp-up): ```text min = 128, step = 128, max = 512 @@ -232,19 +232,21 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup: -```text -INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB -INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB -INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB -... 
-INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB -INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB -INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB -INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB -... -INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB -INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB -``` +??? Logs + + ```text + INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB + INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB + INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB + ... + INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB + INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB + INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB + INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB + ... + INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB + INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB + ``` This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations. 
@@ -279,37 +281,39 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi Each described step is logged by vLLM server, as follows (negative values correspond to memory being released): -```text -INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024] -INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)] -INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048] -INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)] -INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used) -INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used) -INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used) -INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used) -INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache -INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0 -INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used) -INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB -... -INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB -INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3) -INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB -... -INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB -INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB -... 
-INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB -INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB -INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB -INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB -INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB -INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)] -INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)] -INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory -INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used) -``` +??? Logs + + ```text + INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024] + INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)] + INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048] + INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)] + INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used) + INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used) + INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used) + INFO 08-02 17:37:54 
hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used) + INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache + INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0 + INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used) + INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB + ... + INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB + INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3) + INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB + ... + INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB + INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB + ... + INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB + INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB + INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB + INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB + INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB + INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)] + INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)] + INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory + INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used) + ``` ### Recommended vLLM Parameters diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md index 38fc9925eb5..d02cb18bcb9 100644 --- a/docs/getting_started/quickstart.md +++ b/docs/getting_started/quickstart.md @@ -147,20 +147,22 @@ curl http://localhost:8000/v1/completions \ Since this server is compatible with OpenAI 
API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package: -```python -from openai import OpenAI - -# Modify OpenAI's API key and API base to use vLLM's API server. -openai_api_key = "EMPTY" -openai_api_base = "http://localhost:8000/v1" -client = OpenAI( - api_key=openai_api_key, - base_url=openai_api_base, -) -completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct", - prompt="San Francisco is a") -print("Completion result:", completion) -``` +??? Code + + ```python + from openai import OpenAI + + # Modify OpenAI's API key and API base to use vLLM's API server. + openai_api_key = "EMPTY" + openai_api_base = "http://localhost:8000/v1" + client = OpenAI( + api_key=openai_api_key, + base_url=openai_api_base, + ) + completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct", + prompt="San Francisco is a") + print("Completion result:", completion) + ``` A more detailed client example can be found here: @@ -184,26 +186,28 @@ curl http://localhost:8000/v1/chat/completions \ Alternatively, you can use the `openai` Python package: -```python -from openai import OpenAI -# Set OpenAI's API key and API base to use vLLM's API server. -openai_api_key = "EMPTY" -openai_api_base = "http://localhost:8000/v1" - -client = OpenAI( - api_key=openai_api_key, - base_url=openai_api_base, -) - -chat_response = client.chat.completions.create( - model="Qwen/Qwen2.5-1.5B-Instruct", - messages=[ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Tell me a joke."}, - ] -) -print("Chat response:", chat_response) -``` +??? Code + + ```python + from openai import OpenAI + # Set OpenAI's API key and API base to use vLLM's API server. + openai_api_key = "EMPTY" + openai_api_base = "http://localhost:8000/v1" + + client = OpenAI( + api_key=openai_api_key, + base_url=openai_api_base, + ) + + chat_response = client.chat.completions.create( + model="Qwen/Qwen2.5-1.5B-Instruct", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Tell me a joke."}, + ] + ) + print("Chat response:", chat_response) + ``` ## On Attention Backends diff --git a/docs/models/generative_models.md b/docs/models/generative_models.md index e52c5ae01cb..355ed506e5d 100644 --- a/docs/models/generative_models.md +++ b/docs/models/generative_models.md @@ -85,35 +85,37 @@ and automatically applies the model's [chat template](https://huggingface.co/doc In general, only instruction-tuned models have a chat template. Base models may perform poorly as they are not trained to respond to the chat conversation. -```python -from vllm import LLM - -llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct") -conversation = [ - { - "role": "system", - "content": "You are a helpful assistant" - }, - { - "role": "user", - "content": "Hello" - }, - { - "role": "assistant", - "content": "Hello! How can I assist you today?" - }, - { - "role": "user", - "content": "Write an essay about the importance of higher education.", - }, -] -outputs = llm.chat(conversation) - -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` +??? 
Code + + ```python + from vllm import LLM + + llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct") + conversation = [ + { + "role": "system", + "content": "You are a helpful assistant" + }, + { + "role": "user", + "content": "Hello" + }, + { + "role": "assistant", + "content": "Hello! How can I assist you today?" + }, + { + "role": "user", + "content": "Write an essay about the importance of higher education.", + }, + ] + outputs = llm.chat(conversation) + + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + ``` A code example can be found here: diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 60f7dacebfa..fff6c729a58 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -70,7 +70,10 @@ To make your model compatible with the Transformers backend, it needs: 2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention. 3. `MyModel` must contain `_supports_attention_backend = True`. -```python title="modeling_my_model.py" +
+modeling_my_model.py + +```python from transformers import PreTrainedModel from torch import nn @@ -93,6 +96,8 @@ class MyModel(PreTrainedModel): _supports_attention_backend = True ``` +
+ Here is what happens in the background when this model is loaded: 1. The config is loaded. @@ -103,7 +108,10 @@ That's it! For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class: -```python title="configuration_my_model.py" +
+configuration_my_model.py + +```python from transformers import PretrainedConfig @@ -123,6 +131,8 @@ class MyConfig(PretrainedConfig): } ``` +
+ - `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported). - `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s: * You only need to do this for layers which are not present on all pipeline stages @@ -198,6 +208,9 @@ huggingface-cli scan-cache --dir ~/.cache/huggingface/hub Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache: +
+Commands + ```console # The `delete-cache` command requires extra dependencies to work with the TUI. # Please run `pip install huggingface_hub[cli]` to install them. @@ -224,6 +237,8 @@ Start deletion. Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M. ``` +
+ #### Using a proxy Here are some tips for loading/downloading models from Hugging Face using a proxy: @@ -600,27 +615,29 @@ Specified using `--task generate`. For the best results, we recommend using the following dependency versions (tested on A10 and L40): - ```text - # Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40) - torch==2.5.1 - torchvision==0.20.1 - transformers==4.48.1 - tokenizers==0.21.0 - tiktoken==0.7.0 - vllm==0.7.0 - - # Optional but recommended for improved performance and stability - triton==3.1.0 - xformers==0.0.28.post3 - uvloop==0.21.0 - protobuf==5.29.3 - openai==1.60.2 - opencv-python-headless==4.11.0.86 - pillow==10.4.0 - - # Installed FlashAttention (for float16 only) - flash-attn>=2.5.6 # Not used in float32, but should be documented - ``` + ??? Dependency versions + + ```text + # Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40) + torch==2.5.1 + torchvision==0.20.1 + transformers==4.48.1 + tokenizers==0.21.0 + tiktoken==0.7.0 + vllm==0.7.0 + + # Optional but recommended for improved performance and stability + triton==3.1.0 + xformers==0.0.28.post3 + uvloop==0.21.0 + protobuf==5.29.3 + openai==1.60.2 + opencv-python-headless==4.11.0.86 + pillow==10.4.0 + + # Installed FlashAttention (for float16 only) + flash-attn>=2.5.6 # Not used in float32, but should be documented + ``` **Note:** Make sure you understand the security implications of using outdated packages. diff --git a/docs/serving/integrations/langchain.md b/docs/serving/integrations/langchain.md index 14ea6a04434..d7e2b41651c 100644 --- a/docs/serving/integrations/langchain.md +++ b/docs/serving/integrations/langchain.md @@ -13,19 +13,21 @@ pip install langchain langchain_community -q To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`. -```python -from langchain_community.llms import VLLM - -llm = VLLM(model="mosaicml/mpt-7b", - trust_remote_code=True, # mandatory for hf models - max_new_tokens=128, - top_k=10, - top_p=0.95, - temperature=0.8, - # tensor_parallel_size=... # for distributed inference -) - -print(llm("What is the capital of France ?")) -``` +??? Code + + ```python + from langchain_community.llms import VLLM + + llm = VLLM(model="mosaicml/mpt-7b", + trust_remote_code=True, # mandatory for hf models + max_new_tokens=128, + top_k=10, + top_p=0.95, + temperature=0.8, + # tensor_parallel_size=... # for distributed inference + ) + + print(llm("What is the capital of France ?")) + ``` Please refer to this [Tutorial](https://python.langchain.com/docs/integrations/llms/vllm) for more details. diff --git a/docs/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md index 3002b2f92e4..7862778464d 100644 --- a/docs/serving/openai_compatible_server.md +++ b/docs/serving/openai_compatible_server.md @@ -15,22 +15,24 @@ vllm serve NousResearch/Meta-Llama-3-8B-Instruct \ To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python). -```python -from openai import OpenAI -client = OpenAI( - base_url="http://localhost:8000/v1", - api_key="token-abc123", -) +??? 
Code -completion = client.chat.completions.create( - model="NousResearch/Meta-Llama-3-8B-Instruct", - messages=[ - {"role": "user", "content": "Hello!"} - ] -) + ```python + from openai import OpenAI + client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="token-abc123", + ) -print(completion.choices[0].message) -``` + completion = client.chat.completions.create( + model="NousResearch/Meta-Llama-3-8B-Instruct", + messages=[ + {"role": "user", "content": "Hello!"} + ] + ) + + print(completion.choices[0].message) + ``` !!! tip vLLM supports some parameters that are not supported by OpenAI, `top_k` for example. @@ -147,27 +149,29 @@ with `--enable-request-id-headers`. > rather than within the vLLM layer for this reason. > See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details. -```python -completion = client.chat.completions.create( - model="NousResearch/Meta-Llama-3-8B-Instruct", - messages=[ - {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"} - ], - extra_headers={ - "x-request-id": "sentiment-classification-00001", - } -) -print(completion._request_id) +??? Code -completion = client.completions.create( - model="NousResearch/Meta-Llama-3-8B-Instruct", - prompt="A robot may not injure a human being", - extra_headers={ - "x-request-id": "completion-test", - } -) -print(completion._request_id) -``` + ```python + completion = client.chat.completions.create( + model="NousResearch/Meta-Llama-3-8B-Instruct", + messages=[ + {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"} + ], + extra_headers={ + "x-request-id": "sentiment-classification-00001", + } + ) + print(completion._request_id) + + completion = client.completions.create( + model="NousResearch/Meta-Llama-3-8B-Instruct", + prompt="A robot may not injure a human being", + extra_headers={ + "x-request-id": "completion-test", + } + ) + print(completion._request_id) + ``` ## API Reference @@ -184,15 +188,19 @@ Code example: The following [sampling parameters][sampling-params] are supported. -```python ---8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params" -``` +??? Code + + ```python + --8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params" + ``` The following extra parameters are supported: -```python ---8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params" -``` +??? Code + + ```python + --8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params" + ``` [](){ #chat-api } @@ -212,15 +220,19 @@ Code example: The following [sampling parameters][sampling-params] are supported. -```python ---8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params" -``` +??? Code + + ```python + --8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params" + ``` The following extra parameters are supported: -```python ---8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params" -``` +??? Code + + ```python + --8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params" + ``` [](){ #embeddings-api } @@ -259,29 +271,31 @@ and passing a list of `messages` in the request. 
Refer to the examples below for Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library: - ```python - import requests - - image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" - - response = requests.post( - "http://localhost:8000/v1/embeddings", - json={ - "model": "TIGER-Lab/VLM2Vec-Full", - "messages": [{ - "role": "user", - "content": [ - {"type": "image_url", "image_url": {"url": image_url}}, - {"type": "text", "text": "Represent the given image."}, - ], - }], - "encoding_format": "float", - }, - ) - response.raise_for_status() - response_json = response.json() - print("Embedding output:", response_json["data"][0]["embedding"]) - ``` + ??? Code + + ```python + import requests + + image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" + + response = requests.post( + "http://localhost:8000/v1/embeddings", + json={ + "model": "TIGER-Lab/VLM2Vec-Full", + "messages": [{ + "role": "user", + "content": [ + {"type": "image_url", "image_url": {"url": image_url}}, + {"type": "text", "text": "Represent the given image."}, + ], + }], + "encoding_format": "float", + }, + ) + response.raise_for_status() + response_json = response.json() + print("Embedding output:", response_json["data"][0]["embedding"]) + ``` === "DSE-Qwen2-MRL" @@ -316,15 +330,19 @@ The following [pooling parameters][pooling-params] are supported. The following extra parameters are supported by default: -```python ---8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params" -``` +??? Code + + ```python + --8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params" + ``` For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead: -```python ---8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params" -``` +??? Code + + ```python + --8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params" + ``` [](){ #transcriptions-api } @@ -343,15 +361,19 @@ Code example: The following [sampling parameters][sampling-params] are supported. -```python ---8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params" -``` +??? Code + + ```python + --8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params" + ``` The following extra parameters are supported: -```python ---8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params" -``` +??? Code + + ```python + --8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params" + ``` [](){ #tokenizer-api } @@ -387,8 +409,6 @@ Code example: You can classify multiple texts by passing an array of strings: -Request: - ```bash curl -v "http://127.0.0.1:8000/classify" \ -H "Content-Type: application/json" \ @@ -401,47 +421,45 @@ curl -v "http://127.0.0.1:8000/classify" \ }' ``` -Response: +??? 
Response -```bash -{ - "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2", - "object": "list", - "created": 1745383065, - "model": "jason9693/Qwen2.5-1.5B-apeach", - "data": [ - { - "index": 0, - "label": "Default", - "probs": [ - 0.565970778465271, - 0.4340292513370514 - ], - "num_classes": 2 - }, + ```bash { - "index": 1, - "label": "Spoiled", - "probs": [ - 0.26448777318000793, - 0.7355121970176697 + "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2", + "object": "list", + "created": 1745383065, + "model": "jason9693/Qwen2.5-1.5B-apeach", + "data": [ + { + "index": 0, + "label": "Default", + "probs": [ + 0.565970778465271, + 0.4340292513370514 + ], + "num_classes": 2 + }, + { + "index": 1, + "label": "Spoiled", + "probs": [ + 0.26448777318000793, + 0.7355121970176697 + ], + "num_classes": 2 + } ], - "num_classes": 2 + "usage": { + "prompt_tokens": 20, + "total_tokens": 20, + "completion_tokens": 0, + "prompt_tokens_details": null + } } - ], - "usage": { - "prompt_tokens": 20, - "total_tokens": 20, - "completion_tokens": 0, - "prompt_tokens_details": null - } -} -``` + ``` You can also pass a string directly to the `input` field: -Request: - ```bash curl -v "http://127.0.0.1:8000/classify" \ -H "Content-Type: application/json" \ @@ -451,33 +469,33 @@ curl -v "http://127.0.0.1:8000/classify" \ }' ``` -Response: +??? Response -```bash -{ - "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682", - "object": "list", - "created": 1745383213, - "model": "jason9693/Qwen2.5-1.5B-apeach", - "data": [ + ```bash { - "index": 0, - "label": "Default", - "probs": [ - 0.565970778465271, - 0.4340292513370514 + "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682", + "object": "list", + "created": 1745383213, + "model": "jason9693/Qwen2.5-1.5B-apeach", + "data": [ + { + "index": 0, + "label": "Default", + "probs": [ + 0.565970778465271, + 0.4340292513370514 + ], + "num_classes": 2 + } ], - "num_classes": 2 + "usage": { + "prompt_tokens": 10, + "total_tokens": 10, + "completion_tokens": 0, + "prompt_tokens_details": null + } } - ], - "usage": { - "prompt_tokens": 10, - "total_tokens": 10, - "completion_tokens": 0, - "prompt_tokens_details": null - } -} -``` + ``` #### Extra parameters @@ -508,8 +526,6 @@ Code example: You can pass a string to both `text_1` and `text_2`, forming a single sentence pair. -Request: - ```bash curl -X 'POST' \ 'http://127.0.0.1:8000/score' \ @@ -523,24 +539,24 @@ curl -X 'POST' \ }' ``` -Response: +??? Response -```bash -{ - "id": "score-request-id", - "object": "list", - "created": 693447, - "model": "BAAI/bge-reranker-v2-m3", - "data": [ + ```bash { - "index": 0, - "object": "score", - "score": 1 + "id": "score-request-id", + "object": "list", + "created": 693447, + "model": "BAAI/bge-reranker-v2-m3", + "data": [ + { + "index": 0, + "object": "score", + "score": 1 + } + ], + "usage": {} } - ], - "usage": {} -} -``` + ``` #### Batch inference @@ -548,95 +564,95 @@ You can pass a string to `text_1` and a list to `text_2`, forming multiple sente where each pair is built from `text_1` and a string in `text_2`. The total number of pairs is `len(text_2)`. -Request: +??? Request -```bash -curl -X 'POST' \ - 'http://127.0.0.1:8000/score' \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "BAAI/bge-reranker-v2-m3", - "text_1": "What is the capital of France?", - "text_2": [ - "The capital of Brazil is Brasilia.", - "The capital of France is Paris." 
- ] -}' -``` + ```bash + curl -X 'POST' \ + 'http://127.0.0.1:8000/score' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "BAAI/bge-reranker-v2-m3", + "text_1": "What is the capital of France?", + "text_2": [ + "The capital of Brazil is Brasilia.", + "The capital of France is Paris." + ] + }' + ``` -Response: +??? Response -```bash -{ - "id": "score-request-id", - "object": "list", - "created": 693570, - "model": "BAAI/bge-reranker-v2-m3", - "data": [ - { - "index": 0, - "object": "score", - "score": 0.001094818115234375 - }, + ```bash { - "index": 1, - "object": "score", - "score": 1 + "id": "score-request-id", + "object": "list", + "created": 693570, + "model": "BAAI/bge-reranker-v2-m3", + "data": [ + { + "index": 0, + "object": "score", + "score": 0.001094818115234375 + }, + { + "index": 1, + "object": "score", + "score": 1 + } + ], + "usage": {} } - ], - "usage": {} -} -``` + ``` You can pass a list to both `text_1` and `text_2`, forming multiple sentence pairs where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`). The total number of pairs is `len(text_2)`. -Request: +??? Request -```bash -curl -X 'POST' \ - 'http://127.0.0.1:8000/score' \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "BAAI/bge-reranker-v2-m3", - "encoding_format": "float", - "text_1": [ - "What is the capital of Brazil?", - "What is the capital of France?" - ], - "text_2": [ - "The capital of Brazil is Brasilia.", - "The capital of France is Paris." - ] -}' -``` + ```bash + curl -X 'POST' \ + 'http://127.0.0.1:8000/score' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "BAAI/bge-reranker-v2-m3", + "encoding_format": "float", + "text_1": [ + "What is the capital of Brazil?", + "What is the capital of France?" + ], + "text_2": [ + "The capital of Brazil is Brasilia.", + "The capital of France is Paris." + ] + }' + ``` -Response: +??? Response -```bash -{ - "id": "score-request-id", - "object": "list", - "created": 693447, - "model": "BAAI/bge-reranker-v2-m3", - "data": [ - { - "index": 0, - "object": "score", - "score": 1 - }, + ```bash { - "index": 1, - "object": "score", - "score": 1 + "id": "score-request-id", + "object": "list", + "created": 693447, + "model": "BAAI/bge-reranker-v2-m3", + "data": [ + { + "index": 0, + "object": "score", + "score": 1 + }, + { + "index": 1, + "object": "score", + "score": 1 + } + ], + "usage": {} } - ], - "usage": {} -} -``` + ``` #### Extra parameters @@ -675,51 +691,51 @@ Code example: Note that the `top_n` request parameter is optional and will default to the length of the `documents` field. Result documents will be sorted by relevance, and the `index` property can be used to determine original order. -Request: +??? 
Request -```bash -curl -X 'POST' \ - 'http://127.0.0.1:8000/v1/rerank' \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "BAAI/bge-reranker-base", - "query": "What is the capital of France?", - "documents": [ - "The capital of Brazil is Brasilia.", - "The capital of France is Paris.", - "Horses and cows are both animals" - ] -}' -``` + ```bash + curl -X 'POST' \ + 'http://127.0.0.1:8000/v1/rerank' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "BAAI/bge-reranker-base", + "query": "What is the capital of France?", + "documents": [ + "The capital of Brazil is Brasilia.", + "The capital of France is Paris.", + "Horses and cows are both animals" + ] + }' + ``` -Response: +??? Response -```bash -{ - "id": "rerank-fae51b2b664d4ed38f5969b612edff77", - "model": "BAAI/bge-reranker-base", - "usage": { - "total_tokens": 56 - }, - "results": [ - { - "index": 1, - "document": { - "text": "The capital of France is Paris." - }, - "relevance_score": 0.99853515625 - }, + ```bash { - "index": 0, - "document": { - "text": "The capital of Brazil is Brasilia." + "id": "rerank-fae51b2b664d4ed38f5969b612edff77", + "model": "BAAI/bge-reranker-base", + "usage": { + "total_tokens": 56 }, - "relevance_score": 0.0005860328674316406 + "results": [ + { + "index": 1, + "document": { + "text": "The capital of France is Paris." + }, + "relevance_score": 0.99853515625 + }, + { + "index": 0, + "document": { + "text": "The capital of Brazil is Brasilia." + }, + "relevance_score": 0.0005860328674316406 + } + ] } - ] -} -``` + ``` #### Extra parameters diff --git a/docs/usage/metrics.md b/docs/usage/metrics.md index 6603aa83b4a..988b9a55172 100644 --- a/docs/usage/metrics.md +++ b/docs/usage/metrics.md @@ -12,28 +12,32 @@ vllm serve unsloth/Llama-3.2-1B-Instruct Then query the endpoint to get the latest metrics from the server: -```console -$ curl http://0.0.0.0:8000/metrics - -# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step. -# TYPE vllm:iteration_tokens_total histogram -vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0 -vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 -vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 -vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 -vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 -vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 -vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 -vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 -vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 -... -``` +??? Output + + ```console + $ curl http://0.0.0.0:8000/metrics + + # HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step. 
+ # TYPE vllm:iteration_tokens_total histogram + vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0 + vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 + vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 + vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 + vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 + vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 + vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 + vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 + vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 + ... + ``` The following metrics are exposed: -```python ---8<-- "vllm/engine/metrics.py:metrics-definitions" -``` +??? Code + + ```python + --8<-- "vllm/engine/metrics.py:metrics-definitions" + ``` Note: when metrics are deprecated in version `X.Y`, they are hidden in version `X.Y+1` but can be re-enabled using the `--show-hidden-metrics-for-version=X.Y` escape hatch, diff --git a/docs/usage/troubleshooting.md b/docs/usage/troubleshooting.md index e9ab425a1d0..9403abfad85 100644 --- a/docs/usage/troubleshooting.md +++ b/docs/usage/troubleshooting.md @@ -60,68 +60,70 @@ To identify the particular CUDA operation that causes the error, you can add `-- If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly. -```python -# Test PyTorch NCCL -import torch -import torch.distributed as dist -dist.init_process_group(backend="nccl") -local_rank = dist.get_rank() % torch.cuda.device_count() -torch.cuda.set_device(local_rank) -data = torch.FloatTensor([1,] * 128).to("cuda") -dist.all_reduce(data, op=dist.ReduceOp.SUM) -torch.cuda.synchronize() -value = data.mean().item() -world_size = dist.get_world_size() -assert value == world_size, f"Expected {world_size}, got {value}" - -print("PyTorch NCCL is successful!") - -# Test PyTorch GLOO -gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo") -cpu_data = torch.FloatTensor([1,] * 128) -dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group) -value = cpu_data.mean().item() -assert value == world_size, f"Expected {world_size}, got {value}" - -print("PyTorch GLOO is successful!") - -if world_size <= 1: - exit() - -# Test vLLM NCCL, with cuda graph -from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator - -pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank) -# pynccl is enabled by default for 0.6.5+, -# but for 0.6.4 and below, we need to enable it manually. -# keep the code for backward compatibility when because people -# prefer to read the latest documentation. -pynccl.disabled = False - -s = torch.cuda.Stream() -with torch.cuda.stream(s): - data.fill_(1) - out = pynccl.all_reduce(data, stream=s) - value = out.mean().item() +??? 
Code + + ```python + # Test PyTorch NCCL + import torch + import torch.distributed as dist + dist.init_process_group(backend="nccl") + local_rank = dist.get_rank() % torch.cuda.device_count() + torch.cuda.set_device(local_rank) + data = torch.FloatTensor([1,] * 128).to("cuda") + dist.all_reduce(data, op=dist.ReduceOp.SUM) + torch.cuda.synchronize() + value = data.mean().item() + world_size = dist.get_world_size() assert value == world_size, f"Expected {world_size}, got {value}" -print("vLLM NCCL is successful!") + print("PyTorch NCCL is successful!") -g = torch.cuda.CUDAGraph() -with torch.cuda.graph(cuda_graph=g, stream=s): - out = pynccl.all_reduce(data, stream=torch.cuda.current_stream()) + # Test PyTorch GLOO + gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo") + cpu_data = torch.FloatTensor([1,] * 128) + dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group) + value = cpu_data.mean().item() + assert value == world_size, f"Expected {world_size}, got {value}" -data.fill_(1) -g.replay() -torch.cuda.current_stream().synchronize() -value = out.mean().item() -assert value == world_size, f"Expected {world_size}, got {value}" + print("PyTorch GLOO is successful!") -print("vLLM NCCL with cuda graph is successful!") + if world_size <= 1: + exit() -dist.destroy_process_group(gloo_group) -dist.destroy_process_group() -``` + # Test vLLM NCCL, with cuda graph + from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator + + pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank) + # pynccl is enabled by default for 0.6.5+, + # but for 0.6.4 and below, we need to enable it manually. + # keep the code for backward compatibility when because people + # prefer to read the latest documentation. + pynccl.disabled = False + + s = torch.cuda.Stream() + with torch.cuda.stream(s): + data.fill_(1) + out = pynccl.all_reduce(data, stream=s) + value = out.mean().item() + assert value == world_size, f"Expected {world_size}, got {value}" + + print("vLLM NCCL is successful!") + + g = torch.cuda.CUDAGraph() + with torch.cuda.graph(cuda_graph=g, stream=s): + out = pynccl.all_reduce(data, stream=torch.cuda.current_stream()) + + data.fill_(1) + g.replay() + torch.cuda.current_stream().synchronize() + value = out.mean().item() + assert value == world_size, f"Expected {world_size}, got {value}" + + print("vLLM NCCL with cuda graph is successful!") + + dist.destroy_process_group(gloo_group) + dist.destroy_process_group() + ``` If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use: @@ -165,25 +167,27 @@ WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously or an error from Python that looks like this: -```console -RuntimeError: - An attempt has been made to start a new process before the - current process has finished its bootstrapping phase. +??? Logs - This probably means that you are not using fork to start your - child processes and you have forgotten to use the proper idiom - in the main module: + ```console + RuntimeError: + An attempt has been made to start a new process before the + current process has finished its bootstrapping phase. - if __name__ == '__main__': - freeze_support() - ... + This probably means that you are not using fork to start your + child processes and you have forgotten to use the proper idiom + in the main module: - The "freeze_support()" line can be omitted if the program - is not going to be frozen to produce an executable. 
+ if __name__ == '__main__': + freeze_support() + ... - To fix this issue, refer to the "Safe importing of main module" - section in https://docs.python.org/3/library/multiprocessing.html -``` + The "freeze_support()" line can be omitted if the program + is not going to be frozen to produce an executable. + + To fix this issue, refer to the "Safe importing of main module" + section in https://docs.python.org/3/library/multiprocessing.html + ``` then you must update your Python code to guard usage of `vllm` behind a `if __name__ == '__main__':` block. For example, instead of this: @@ -207,20 +211,22 @@ if __name__ == '__main__': vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script: -```python -import torch - -@torch.compile -def f(x): - # a simple function to test torch.compile - x = x + 1 - x = x * 2 - x = x.sin() - return x - -x = torch.randn(4, 4).cuda() -print(f(x)) -``` +??? Code + + ```python + import torch + + @torch.compile + def f(x): + # a simple function to test torch.compile + x = x + 1 + x = x * 2 + x = x.sin() + return x + + x = torch.randn(4, 4).cuda() + print(f(x)) + ``` If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for example. diff --git a/docs/usage/usage_stats.md b/docs/usage/usage_stats.md index 750cba7ed9c..78d2a6784bc 100644 --- a/docs/usage/usage_stats.md +++ b/docs/usage/usage_stats.md @@ -10,36 +10,38 @@ The list of data collected by the latest version of vLLM can be found here: