Shared weights when running multiple instances #18
Comments
Any progress?
+1 for this. Did some benchmarking on this today. This is with 1 instance of 3 ONNX models:
[benchmark screenshot not preserved]
This is with 2 instances of 3 ONNX models:
[benchmark screenshot not preserved]
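For reference, the instance count compared in these benchmarks is set per model in Triton's config.pbtxt. A minimal sketch, where the model name and count are illustrative:

```
name: "my_onnx_model"
platform: "onnxruntime_onnx"
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```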
CC @pranavsharma: does ORT provide an API for doing so? Or can an ORT session be used to run different inferences in parallel?
Not fully following. What API are you looking for? I believe Triton already creates a separate session for each instance, and these instances (sessions) can be used to run inferences in parallel. The drawback is that each session has its own copy of the weights, thereby replicating the memory consumption. Someone has submitted code changes to share a session between different instances; we're reviewing the changes. This should fix the memory consumption problem.
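As a rough sketch of what "one session, parallel inferences" means at the API level, here is a single onnxruntime Python session shared across threads (the model path, input name, and shape are placeholders; ORT documents InferenceSession.run() as thread-safe on a single session):

```python
import threading

import numpy as np
import onnxruntime as ort

# One session -> one copy of the weights in memory.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def worker(batch):
    # run() is thread-safe on a single session, so concurrent calls
    # share the same weights instead of replicating them per instance.
    outputs = sess.run(None, {"input": batch})
    print(outputs[0].shape)

threads = [
    threading.Thread(
        target=worker,
        args=(np.random.rand(1, 3, 224, 224).astype(np.float32),),
    )
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```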
Yes, this is what I was looking for. Sorry for not being clear in my previous question; I was just musing over the different ways of using a single copy of the weights across multiple instances that I have seen in other frameworks. For example, TRT stores weights in an "engine" and can create multiple "contexts" that map to the same "engine".
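For comparison, the TRT pattern described above looks roughly like this in the TensorRT Python API (the engine file name is a placeholder):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# The engine holds the weights once...
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# ...and each execution context adds only per-stream activation state,
# so multiple contexts can run concurrently against the same weights.
contexts = [engine.create_execution_context() for _ in range(2)]
```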
@pranavsharma any progress on sharing a session between different instances of ONNX Runtime?
I should be able to get to it this week.
Good! I look forward to hearing from you soon.
@pranavsharma any progress?
Is there any news about sharing GPU memory, as in the PR you mentioned (#141), @pranavsharma? We have to switch models regularly, and sharing memory would be very beneficial.
Also CPU memory.
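One mechanism ORT already exposes in this direction (separate from the session-sharing changes discussed above) is pre-loading an initializer and handing the same buffer to several sessions via SessionOptions.add_initializer. A minimal CPU sketch; the weight file, the initializer name "W", and the model path are placeholders, and it assumes the model actually contains an initializer named "W":

```python
import numpy as np
import onnxruntime as ort

# Load the weight once; the numpy buffer must outlive the sessions.
w = np.load("W.npy")
w_ort = ort.OrtValue.ortvalue_from_numpy(w)

so = ort.SessionOptions()
# Every session created with these options references this buffer for "W"
# instead of materializing its own copy from the model file.
so.add_initializer("W", w_ort)

sess_a = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
sess_b = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
```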