[doc] Fold long code blocks to improve readability #19926

Merged
6 changes: 3 additions & 3 deletions docs/ci/update_pytorch_version.md
@@ -91,7 +91,7 @@ source to unblock the update process.
### FlashInfer
Here is how to build and install it from source with torch 2.7.0+cu128 in the vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):

```
```bash
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
export FLASHINFER_ENABLE_SM90=1
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/[email protected]"
@@ -105,14 +105,14 @@ team if you want to get the package published there.
### xFormers
Similar to FlashInfer, here is how to build and install xFormers from source:

```
```bash
export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/[email protected]"
```

### Mamba

```
```bash
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/[email protected]"
```

49 changes: 22 additions & 27 deletions docs/cli/README.md
@@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}

Start the vLLM OpenAI Compatible API server.

Examples:
??? Examples

```bash
# Start with a model
vllm serve meta-llama/Llama-2-7b-hf
```bash
# Start with a model
vllm serve meta-llama/Llama-2-7b-hf

# Specify the port
vllm serve meta-llama/Llama-2-7b-hf --port 8100
# Specify the port
vllm serve meta-llama/Llama-2-7b-hf --port 8100

# Check with --help for more options
# To list all groups
vllm serve --help=listgroup
# Check with --help for more options
# To list all groups
vllm serve --help=listgroup

# To view an argument group
vllm serve --help=ModelConfig
# To view an argument group
vllm serve --help=ModelConfig

# To view a single argument
vllm serve --help=max-num-seqs
# To view a single argument
vllm serve --help=max-num-seqs

# To search by keyword
vllm serve --help=max
```
# To search by keyword
vllm serve --help=max
```

## chat

Generate chat completions via the running API server.

Examples:

```bash
# Directly connect to localhost API without arguments
vllm chat
@@ -60,8 +58,6 @@ vllm chat --quick "hi"

Generate text completions based on the given prompt via the running API server.

Examples:

```bash
# Directly connect to localhost API without arguments
vllm complete
@@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm complete --quick "The future of AI is"
```

</details>

## bench

Run benchmark tests for latency, online serving throughput, and offline inference throughput.
@@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput}

Benchmark the latency of a single batch of requests.

Example:

```bash
vllm bench latency \
--model meta-llama/Llama-3.2-1B-Instruct \
@@ -104,8 +100,6 @@ vllm bench latency \

Benchmark the online serving throughput.

Example:

```bash
vllm bench serve \
--model meta-llama/Llama-3.2-1B-Instruct \
@@ -120,8 +114,6 @@ vllm bench serve \

Benchmark offline inference throughput.

Example:

```bash
vllm bench throughput \
--model meta-llama/Llama-3.2-1B-Instruct \
@@ -143,7 +135,8 @@ vllm collect-env

Run batch prompts and write results to file.

Examples:
<details>
<summary>Examples</summary>

```bash
# Running with a local file
@@ -159,6 +152,8 @@ vllm run-batch \
--model meta-llama/Meta-Llama-3-8B-Instruct
```

</details>

## More Help

For detailed options of any subcommand, use:
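
The command itself is collapsed in this diff view; a typical form (a sketch, using the subcommands listed above) is:

```bash
# Replace <subcommand> with any of: chat, complete, serve, bench, collect-env, run-batch
vllm <subcommand> --help
```
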
58 changes: 31 additions & 27 deletions docs/configuration/conserving_memory.md
@@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra memory.

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel

llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig(
level=CompilationLevel.PIECEWISE,
# By default, it goes up to max_num_seqs
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
),
)
```
??? Code

```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel

llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig(
level=CompilationLevel.PIECEWISE,
# By default, it goes up to max_num_seqs
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
),
)
```

You can disable graph capturing completely via the `enforce_eager` flag:
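
The example that follows is collapsed in this diff; a minimal sketch, reusing the model name from the snippet above, would be:

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture entirely, trading some
# inference speed for lower memory usage.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enforce_eager=True,
)
```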

@@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.

Here are some examples:

```python
from vllm import LLM
??? Code

# Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_kwargs={
"max_pixels": 768 * 768, # Default is 1280 * 28 * 28
})

# Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B",
mm_processor_kwargs={
"max_dynamic_patch": 4, # Default is 12
})
```
```python
from vllm import LLM

# Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_kwargs={
"max_pixels": 768 * 768, # Default is 1280 * 28 * 28
})

# Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B",
mm_processor_kwargs={
"max_dynamic_patch": 4, # Default is 12
})
```
8 changes: 5 additions & 3 deletions docs/configuration/env_vars.md
Expand Up @@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).

```python
--8<-- "vllm/envs.py:env-vars-definition"
```
??? Code

```python
--8<-- "vllm/envs.py:env-vars-definition"
```
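
For example, a variable can be exported before starting the server; `VLLM_LOGGING_LEVEL` is used here purely for illustration (see the definitions above for the authoritative list):

```bash
# Illustrative only: increase log verbosity for a single server run
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve meta-llama/Llama-2-7b-hf
```
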
30 changes: 16 additions & 14 deletions docs/contributing/README.md
@@ -93,25 +93,27 @@ For additional features and advanced configurations, refer to the official MkDocs documentation.

## Testing

```bash
pip install -r requirements/dev.txt
??? note "Commands"
Reviewer comment (Contributor): Just my two cents, but I was looking at the new page and completely didn't notice this collapsed section and thought it was removed.

# Linting, formatting and static type checking
pre-commit install --hook-type pre-commit --hook-type commit-msg
```bash
pip install -r requirements/dev.txt

# You can manually run pre-commit with
pre-commit run --all-files
# Linting, formatting and static type checking
pre-commit install --hook-type pre-commit --hook-type commit-msg

# To manually run something from CI that does not run
# locally by default, you can run:
pre-commit run mypy-3.9 --hook-stage manual --all-files
# You can manually run pre-commit with
pre-commit run --all-files

# Unit tests
pytest tests/
# To manually run something from CI that does not run
# locally by default, you can run:
pre-commit run mypy-3.9 --hook-stage manual --all-files

# Run tests for a single test file with detailed output
pytest -s -v tests/test_logger.py
```
# Unit tests
pytest tests/

# Run tests for a single test file with detailed output
pytest -s -v tests/test_logger.py
```

!!! tip
Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
56 changes: 29 additions & 27 deletions docs/contributing/model/basic.md
@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their constructors

The initialization code should look like this:

```python
from torch import nn
from vllm.config import VllmConfig
from vllm.attention import Attention

class MyAttention(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.attn = Attention(prefix=f"{prefix}.attn")

class MyDecoderLayer(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")

class MyModel(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.layers = nn.ModuleList(
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
)

class MyModelForCausalLM(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
```
??? Code

```python
from torch import nn
from vllm.config import VllmConfig
from vllm.attention import Attention

class MyAttention(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.attn = Attention(prefix=f"{prefix}.attn")

class MyDecoderLayer(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")

class MyModel(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.layers = nn.ModuleList(
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
)

class MyModelForCausalLM(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
```

### Computation Code
