understand (and possibly improve) instance creation times #487

Open
jordanhendricks opened this issue Aug 11, 2023 · 9 comments
@jordanhendricks
Contributor

jordanhendricks commented Aug 11, 2023

In recent experience with rack2, we have had a few discussions around the time it takes to create instances with a lot of memory. Some examples: the not-yet-understood oxidecomputer/omicron#3417, data from @askfongjojo indicating that large instances reliably take around 40 seconds to create and start, and observations that creation of large instances often times out.

Recently we landed support in omicron for using the VMM reservoir, which alleviated some of the pain around creating large instances, but it still takes on the order of 30+ seconds to create instances with > 64 GiB of memory, so I wanted to understand where that time was going.

I looked at a couple of larger instances this week on the dogfood cluster and saw a gap of about 20-25 seconds for a 64 GiB/96 GiB memory instance between the first propolis-server log line and the log line indicating a VNIC was being created for the instance. (I intended to look at more, smaller instances, but was hamstrung by unrelated issues.) In between those two events, by code inspection, we make an OS call to allocate guest memory from the reservoir. @pfmooney did some testing of large VMs and found that the actual reservoir allocation was very fast (on the order of microseconds), but it took around 15 seconds to map ~60 GiB of memory into the guest address space. It thus seems plausible that that's where our time was spent, but we have little in the way of logging to show it.

It does not seem that improving instance creation times for large VMs is a big priority at the moment (though of course, no one is going to complain if instance creation is faster!). That said, from looking at this issue so far, it's clear that we could have better data here. At a minimum, I think we should add timing and logging around the major steps of instance creation (reservoir allocation and guest memory mapping in particular) so we can see where the time goes.
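As a rough illustration of the kind of data that would help, coarse per-step timing like the sketch below would make the reservoir-allocation vs. memory-mapping split visible directly in the logs. This is not actual propolis-server code: the step bodies are placeholders, and the real server would emit the elapsed times through its structured logger rather than eprintln!.

    use std::time::Instant;

    // Run a step and report how long it took; a stand-in for emitting the
    // same elapsed time through the server's structured logger.
    fn timed<T>(label: &str, step: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = step();
        eprintln!("{label} took {} ms", start.elapsed().as_millis());
        out
    }

    fn main() {
        // Placeholder steps standing in for the real work (VMM ioctls).
        timed("allocate guest memory from reservoir", || {
            // observed to take on the order of microseconds
        });
        timed("map guest memory into guest address space", || {
            // observed to take ~15 seconds for ~60 GiB
        });
    }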

@jordanhendricks jordanhendricks added this to the Unscheduled milestone Aug 11, 2023
@jordanhendricks jordanhendricks added the control plane Related to the control plane. label Aug 11, 2023
@jordanhendricks
Contributor Author

related: #471

@askfongjojo

My most recent attempts to create instances with 128 GB of memory began to fail after the sleds on rack2 had had a lot of instances created and then deleted.

root@[fd00:1122:3344:102::3]:32221/omicron> select id, active_sled_id, state, name, time_created from instance where name like 'provision%128m' order by 5;
                   id                  |            active_sled_id            |   state   |          name           |         time_created
---------------------------------------+--------------------------------------+-----------+-------------------------+--------------------------------
  884d2f39-f5fd-4c31-82c4-4e4628c4a6c3 | a2303a7b-fe1f-4010-99da-ed90bba042b0 | destroyed | provision-time-16c-128m | 2023-08-10 23:12:33.204546+00
  b7523d15-a861-4320-9972-c5a7f7d63754 | ae9eccdf-e662-43d2-9493-445cfa934ee8 | destroyed | provision-time-16c-128m | 2023-08-11 00:55:14.505487+00
  bdb77ad9-6963-4ad6-a1a6-829af63cf575 | 2d7b6828-ba9f-44ff-862a-63852d79a410 | destroyed | provision-time-8c-128m  | 2023-08-14 04:59:26.472164+00
  5918a06f-aab8-4752-913b-310f717a3b2b | 6b4ff253-ba5b-4d0c-94c9-7751bdc0bf80 | destroyed | provision-time-16c-128m | 2023-08-14 05:03:32.510433+00
  1c3f0077-2b7b-4c56-a159-9858b9789ec5 | 6b93c9c3-8056-44f4-b2b5-2f461be09819 | destroyed | provision-time-32c-128m | 2023-08-14 05:07:03.469654+00
  4c5fe451-7b95-4f2d-a448-ce20bbb5fea6 | 94f583be-8d15-4b15-92cd-bf22f33179b7 | destroyed | provision-time-32c-128m | 2023-08-14 05:11:08.420303+00
  8c505c88-2432-419f-85ed-8c194a5310d0 | 6b4ff253-ba5b-4d0c-94c9-7751bdc0bf80 | destroyed | provision-time-64c-128m | 2023-08-14 05:13:54.8145+00
  3b90cccf-b798-40ad-8373-7caa62686ed5 | a2303a7b-fe1f-4010-99da-ed90bba042b0 | destroyed | provision-time-32c-128m | 2023-08-14 06:23:44.721355+00
  3fc8f742-5e24-4133-ac09-5e30a5ff6b3c | ae9eccdf-e662-43d2-9493-445cfa934ee8 | destroyed | provision-time-64c-128m | 2023-08-14 06:26:28.147305+00
(9 rows)

Of the above 9 instances, only the first two were created successfully. The subsequent ones all failed after about 1m 45s (the durations are simply the wall-clock time of the CLI calls).

@pfmooney
Collaborator

Filed illumos#15844, which should cover at least some of the provisioning cost we're seeing. I have a patch in the works which should improve things there.

@pfmooney
Collaborator

15844 has landed in illumos-gate. Once that makes its way into stlouis, and onto test hardware, it'd be good to revisit the large instance provisioning tests.

@askfongjojo

Will certainly do so. Currently, provisioning large VMs (with 64/96/128 GB of memory) is still subject to racing against the 60-second client timeout.
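For reference, the 60-second limit is on the HTTP client side. If the client in question is reqwest-based (as progenitor-generated clients are), the per-request timeout is set when the underlying client is built; a minimal sketch, with the 120-second value chosen purely for illustration:

    use std::time::Duration;

    // Hypothetical helper: build an HTTP client with a longer per-request
    // timeout so large-instance provisioning calls aren't cut off at 60s.
    fn build_client() -> Result<reqwest::Client, reqwest::Error> {
        reqwest::Client::builder()
            .timeout(Duration::from_secs(120))
            .build()
    }

Whether raising the timeout is the right fix, versus making provisioning itself faster, is of course a separate question.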

@gjcolombo
Contributor

FWIW, I'm not (yet) convinced that the problem in oxidecomputer/omicron#3927 is specific to large instances: in that issue, Nexus timed out while waiting for sled agent to create the Propolis zone, and I don't have an intuition for what mechanism would make that take longer for larger instances than for smaller ones. (That said, I have no idea what made zone setup take so long, full stop, so pretty much any theory is on the table at this point.)

@askfongjojo

askfongjojo commented Aug 23, 2023

Well, as a "controlled experiment", I have this one instance on rack2 that was successfully provisioned and started with 64 vCPUs and 128 GB of memory: https://oxide.sys.rack2.eng.oxide.computer/projects/try/instances/provision-time-64c-128m

After yesterday's software update, it failed to start up after multiple tries, all due to the same client timeout error:

23:15:06.504Z WARN SledAgent (dropshot (SledAgent)): client disconnected before response returned
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/8ef2bd2/dropshot/src/server.rs:927
    local_addr = [fd00:1122:3344:106::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:107::3]:64621
    req_id = e349db23-b0d1-482c-847c-64e7a11421c4
    uri = /instances/fe88bdb4-8bff-41a8-9ae5-9a21acf53ce3/state

After multiple failed attempts, I stopped all VMs on the sled in question, so this was the only instance being started on sled BRM44220010 (gc25). Next, I updated the memory setting of this instance to 96 GB in the instance table in CRDB. After that, I was able to stop/start the instance (tried that twice) and verify that the guest was also functional (with a certain test workload). So it would appear that this still has something to do with the large memory size?

@pfmooney
Collaborator

Just for reference, 15844 was not merged into stlouis until this morning, so any impact it may have would be missing from what was installed yesterday.

@askfongjojo

askfongjojo commented Aug 28, 2023

Prior to 15844, the end-to-end provisioning times were anywhere between 30-40 seconds for a VM with 32 or 64 GB of memory. VMs with larger memory sizes were closer to 50-60 seconds (or failed with the sled-agent client timeout error if the request went beyond 60s). And I hadn't been able to provision an instance with more than 128 GB.

After 15844, the provisioning times consistently fall in the 21-29s range for VMs of different sizes (sample size = 50; much of the time was spent on VNIC and disk setup). I am also able to spin up 256 GB memory instances and run simple applications on them.
