Commit 5506f60

chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs (#4603)
Signed-off-by: Superjomn <[email protected]>
1 parent fbe4db2 commit 5506f60
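
For LLM-API callers, the flattening means that knobs which previously lived on PyTorchConfig (passed via the pytorch_backend_config argument) are now plain keyword arguments of the LLM constructor. A minimal before/after sketch, using only argument names that appear in the diffs below (for example, examples/llm-api/llm_inference_kv_events.py); it is illustrative rather than an exhaustive list of the flattened knobs:

from tensorrt_llm import LLM

# Before this commit: backend knobs were grouped in PyTorchConfig and handed
# to LLM through the dedicated pytorch_backend_config argument.
#
#   from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
#   pytorch_config = PyTorchConfig(autotuner_enabled=False, kv_cache_dtype='auto')
#   llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
#             backend="pytorch",
#             pytorch_backend_config=pytorch_config)

# After this commit: the same knobs are top-level TorchLlmArgs fields and are
# passed directly to the LLM constructor.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          backend="pytorch",
          autotuner_enabled=False,
          kv_cache_dtype='auto')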

74 files changed: 696 additions, 562 deletions. Only part of this large commit's diff is shown below.

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 20 additions & 24 deletions

@@ -134,9 +134,8 @@ To do the benchmark, run the following command:
 YOUR_DATA_PATH=<your dataset file following the format>

 cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  moe_backend: TRTLLM
+use_cuda_graph: true
+moe_backend: TRTLLM
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3
@@ -202,21 +201,20 @@ python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
 YOUR_DATA_PATH=./dataset.txt

 cat >./extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-  print_iter_log: true
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
 enable_attention_dp: true
 EOF

@@ -257,8 +255,7 @@ To do the benchmark, run the following command:
 YOUR_DATA_PATH=<your dataset file following the format>

 cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
+use_cuda_graph: true
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3
@@ -307,10 +304,9 @@ python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
 YOUR_DATA_PATH=./dataset.txt

 cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_batch_sizes:
-  - 128
+use_cuda_graph: true
+cuda_graph_batch_sizes:
+- 128
 enable_attention_dp: true
 EOF

docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md

Lines changed: 4 additions & 6 deletions

@@ -121,9 +121,8 @@ To benchmark min-latency performance with MTP, you need to follow [this document
 YOUR_DATA_PATH=<your dataset file following the format>

 cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  moe_backend: TRTLLM
+use_cuda_graph: true
+moe_backend: TRTLLM
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3
@@ -177,9 +176,8 @@ To benchmark min-latency performance with MTP Relaxed Acceptance, you need to fo
 YOUR_DATA_PATH=<your dataset file following the format>

 cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  moe_backend: TRTLLM
+use_cuda_graph: true
+moe_backend: TRTLLM
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3

docs/source/performance/perf-benchmarking.md

Lines changed: 1 addition & 2 deletions

@@ -628,8 +628,7 @@ If you would like to force the KV cache quantizaton, you can specify the followi
 when the checkpoint precision is `null`:

 ```yaml
-pytorch_backend_config:
-  kv_cache_dtype: "fp8"
+kv_cache_dtype: "fp8"
 ```

 ```{tip}

docs/source/performance/perf-overview.md

Lines changed: 3 additions & 5 deletions

@@ -200,11 +200,9 @@ trtllm-bench --model $model_name throughput --dataset $dataset_file --backend py

 `llm_options.yml`
 ```yaml
-
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
   - 1
   - 2
   - 4

docs/source/torch/attention.md

Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ The following sections explain how to use these implementations and provide a br


 There are currently three available attention backends: the vanilla backend, the TRT-LLM backend, and the Flashinfer backend.
-You can specify the desired attention backend using `PyTorchConfig.attn_backend`. For instance, to utilize the Flashinfer backend, you can create a `PyTorchConfig` with `attn_backend = "flashinfer"` and then pass it to the `LLM` constructor as follows: `LLM(pytorch_backend_config=pytorch_config)`. This will enable the use of the Flashinfer backend for your model.
+You can specify the desired attention backend using `PyTorchConfig.attn_backend`. For instance, to utilize the Flashinfer backend, you can pass `attn_backend="flashinfer"` to the `LLM` constructor as follows: `LLM(attn_backend="flashinfer")`. This will enable the use of the Flashinfer backend for your model.

 The vanilla backend, `VanillaAttention`, is a reference implementation designed primarily for inflight batching and linear KV cache support. While it serves as a useful baseline, it is not recommended for production use due to its limited optimizations.

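A minimal sketch of the updated usage described in the changed line above; the model name is only a placeholder, and backend="pytorch" mirrors the other examples touched by this commit:

from tensorrt_llm import LLM

# Select the Flashinfer attention backend via the flattened knob (previously a
# PyTorchConfig(attn_backend="flashinfer") passed through pytorch_backend_config).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
          backend="pytorch",
          attn_backend="flashinfer")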

examples/auto_deploy/README.md

Lines changed: 1 addition & 1 deletion

@@ -265,7 +265,7 @@ llm = LLM(
     model=<HF_MODEL_CARD_OR_DIR>,
     backend="autodeploy",
     build_config=build_config,
-    pytorch_backend_config=ad_config,
+    auto_deploy_config=ad_config,
     tensor_parallel_size=<NUM_WORLD_RANK>,
 )

examples/auto_deploy/build_and_run_ad.py

Lines changed: 1 addition & 1 deletion

@@ -73,7 +73,7 @@ def build_llm_from_config(config: SimpleConfig) -> LLM:
         model=factory.model,
         backend="autodeploy",
         build_config=build_config,
-        pytorch_backend_config=ad_config,
+        auto_deploy_config=ad_config,
         tensor_parallel_size=config.world_size,
         tokenizer=factory.init_tokenizer() if config.customize_tokenizer else None,
     )

examples/disaggregated/README.md

Lines changed: 3 additions & 4 deletions

@@ -9,7 +9,7 @@ You can use multiple `trtllm-serve` commands to launch the context and generatio
 for disaggregated serving. For example, you could launch two context servers and one generation servers as follows:

 ```
-echo -e "pytorch_backend_config:\n disable_overlap_scheduler: True\ncache_transceiver_config:\n max_num_tokens: 2048" > context_extra-llm-api-config.yml
+echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\nmax_num_tokens: 2048" > context_extra-llm-api-config.yml
 echo -e "cache_transceiver_config:\n max_num_tokens: 2048" > gen_extra-llm-api-config.yml

 export TRTLLM_USE_UCX_KVCACHE=1
@@ -63,9 +63,8 @@ hostname: localhost
 port: 8000
 model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
 backend: "pytorch"
-pytorch_backend_config:
-  use_cuda_graph: False
-  disable_overlap_scheduler: True
+use_cuda_graph: False
+disable_overlap_scheduler: True
 context_servers:
   num_instances: 1
   tensor_parallel_size: 1

examples/disaggregated/disagg_config.yaml

Lines changed: 2 additions & 3 deletions

@@ -3,9 +3,8 @@ port: 8000
 model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
 free_gpu_memory_fraction: 0.25
 backend: "pytorch"
-pytorch_backend_config:
-  use_cuda_graph: False
-  disable_overlap_scheduler: True
+use_cuda_graph: False
+disable_overlap_scheduler: True
 context_servers:
   num_instances: 1
   tensor_parallel_size: 1

examples/llm-api/llm_inference_kv_events.py

Lines changed: 2 additions & 4 deletions

@@ -1,17 +1,15 @@
 ### Get KV Cache Events

 from tensorrt_llm import LLM, SamplingParams
-from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
 from tensorrt_llm.llmapi import KvCacheConfig


 def main():
-    pytorch_config = PyTorchConfig(autotuner_enabled=False,
-                                   kv_cache_dtype='auto')

     llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
               tensor_parallel_size=2,
-              pytorch_backend_config=pytorch_config,
+              autotuner_enabled=False,
+              kv_cache_dtype='auto',
               kv_cache_config=KvCacheConfig(enable_block_reuse=True,
                                             event_buffer_max_size=1024),
               backend="pytorch")

examples/llm-api/llm_mgmn_trtllm_bench.sh

Lines changed: 3 additions & 4 deletions

@@ -74,10 +74,9 @@ srun -l \

 # This is optional
 cat > /tmp/pytorch_extra_args.txt << EOF
-pytorch_backend_config:
-  use_cuda_graph: false
-  cuda_graph_padding_enabled: false
-  print_iter_log: true
+use_cuda_graph: false
+cuda_graph_padding_enabled: false
+print_iter_log: true
 enable_attention_dp: false
 EOF

examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py

Lines changed: 1 addition & 2 deletions

@@ -100,7 +100,6 @@ def __init__(
         if hasattr(PyTorchConfig, "moe_backend"):
             pytorch_config_params["moe_backend"] = self.moe_backend
             print(f"Info: moe_backend is set to {self.moe_backend}")
-        pytorch_config = PyTorchConfig(**pytorch_config_params)

         # stop words not currently supported by torch backend
         self.use_stop_words = False
@@ -110,7 +109,7 @@ def __init__(
             tensor_parallel_size=tp,
             trust_remote_code=trust_remote_code,
             enable_chunked_prefill=False,
-            pytorch_backend_config=pytorch_config,
+            **pytorch_config_params,
             tokenizer=self.tokenizer,
             kv_cache_config=trt_kv_cache_config,
             moe_expert_parallel_size=self.moe_expert_parallel_size,

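The pattern in this change: the harness now keeps the flattened knobs in a plain dict and expands it into the LLM constructor instead of wrapping it in PyTorchConfig. A rough sketch of that idiom with placeholder values; moe_backend and enable_chunked_prefill come from the diff above, the other entries are illustrative:

from tensorrt_llm import LLM

# Flattened backend knobs collected as ordinary keyword arguments...
pytorch_config_params = {
    "kv_cache_dtype": "auto",  # illustrative knob/value
    "moe_backend": "TRTLLM",   # set only when the installed version supports it
}

# ...and expanded directly into the LLM constructor.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
          backend="pytorch",
          enable_chunked_prefill=False,
          **pytorch_config_params)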

examples/models/core/deepseek_v3/README.md

Lines changed: 38 additions & 44 deletions

@@ -140,10 +140,9 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 --num-requests 24 > /tmp/benchmarking_64k.txt

 cat <<EOF > /tmp/extra-llm-api-config.yml
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes: [1, 4, 8, 12]
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes: [1, 4, 8, 12]
 EOF

 trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
@@ -168,11 +167,10 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 --num-requests 4 > /tmp/benchmarking_128k.txt

 cat <<EOF > /tmp/extra-llm-api-config.yml
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes: [1, 2]
-  moe_max_num_tokens: 16384
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes: [1, 2]
+moe_max_num_tokens: 16384
 EOF

 trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
@@ -193,8 +191,7 @@ Evaluate the model accuracy using `trtllm-eval`.
 1. (Optional) Prepare an advanced configuration file:
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
+use_cuda_graph: true
 enable_attention_dp: true
 EOF
 ```
@@ -236,21 +233,20 @@ To serve the model using `trtllm-serve`:

 ```bash
 cat >./extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-  print_iter_log: true
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
 enable_attention_dp: true
 EOF

@@ -427,21 +423,20 @@ python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
 --input-mean=1024 --output-mean=2048 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt

 cat >/path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-  print_iter_log: true
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
 enable_attention_dp: true
 EOF
 ```
@@ -605,9 +600,8 @@ To enable FP8 MLA, modify the `kv_cache_quant_algo` property. The following show
 Alternatively, configure FP8 MLA through the `kv_cache_dtype` of the PyTorch backend config. An example is to use `--kv_cache_dtype` of `quickstart_advanced.py`. Also, you can edit `extra-llm-api-config.yml` consumed by `--extra_llm_api_options` of `trtllm-serve`, `trtllm-bench` and so on:
 ```yaml
 # ...
-pytorch_backend_config:
-  kv_cache_dtype: fp8
-  # ...
+kv_cache_dtype: fp8
+# ...
 ```

 ### W4AFP8

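Because kv_cache_dtype is now a top-level knob, the FP8 MLA setting above can also be passed straight to the LLM constructor from Python; a hedged sketch, with a placeholder checkpoint path:

from tensorrt_llm import LLM

# FP8 KV cache for MLA via the flattened kv_cache_dtype knob, equivalent to the
# `kv_cache_dtype: fp8` entry in extra-llm-api-config.yml shown above.
llm = LLM(model="/path/to/DeepSeek-R1",  # placeholder checkpoint path
          backend="pytorch",
          kv_cache_dtype="fp8")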

examples/models/core/qwen/README.md

Lines changed: 14 additions & 15 deletions

@@ -653,21 +653,20 @@ To serve the model using `trtllm-serve`:

 ```bash
 cat >./extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-  print_iter_log: true
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
 enable_attention_dp: true
 EOF
