Add some unit tests for sled-agent Instance creation
At time of writing, instance creation roughly looks like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: `Instance::propolis_ensure`
- sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
- sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
  - `RunningZone::install` and `Zones::boot`
  - `illumos_utils::svc::wait_for_service`
  - `self::wait_for_http_server` for propolis-server itself
- sled-agent: `Instance::ensure_propolis_and_tasks`
- sled-agent: spawn `Instance::monitor_state_task`
- sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
- sled-agent: return ok result
- nexus: `handle_instance_put_result`
Or at least, it does in the happy path. omicron#3927 saw propolis zone
creation take longer than the minute allowed for nexus's call to sled-agent's
`instance_put_state`. That might've looked something like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: `Instance::propolis_ensure`
- sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
- sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
  - `RunningZone::install` and `Zones::boot`
- nexus: i've been waiting a whole minute for this. connection timeout!
- nexus: `handle_instance_put_result`
- sled-agent: [...] return... oh, they hung up. :(
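The failure mode above can be reproduced in miniature with nothing but threads and a channel. This is only a rough sketch: `blocking_put_state` and `PutResult` are hypothetical names, a sleeping thread stands in for zone setup, and `recv_timeout` stands in for the client-side timeout; none of this is the actual sled-agent or nexus code.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome as seen by the caller (standing in for nexus).
#[derive(Debug, PartialEq)]
enum PutResult {
    Ok,
    TimedOut,
}

/// Hypothetical stand-in for a blocking `instance_put_state` call: the callee
/// needs `setup_time` to finish, the caller waits at most `client_timeout`.
fn blocking_put_state(setup_time: Duration, client_timeout: Duration) -> PutResult {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Stand-in for the slow part (zone install and boot).
        thread::sleep(setup_time);
        // If the caller already gave up, the receiver is gone and this send
        // fails: "oh, they hung up".
        let _ = tx.send(());
    });
    match rx.recv_timeout(client_timeout) {
        Ok(()) => PutResult::Ok,
        Err(_) => PutResult::TimedOut,
    }
}
```

Calling it with a `setup_time` longer than `client_timeout` yields `PutResult::TimedOut` even though the background work carries on and eventually completes, which is the mismatch described above.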
To avoid this timeout being implicit at the *Dropshot configuration*
layer (that is to say, we should still have *some* timeout),
we could consider a small refactor to make `instance_put_state` not a
blocking call -- especially since it's already sending nexus updates on
its progress via out-of-band `cpapi_instances_put` calls! That might look
something like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: spawn {
- sled-agent: `Instance::propolis_ensure`
- sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
- sled-agent: `Instance::setup_propolis_locked` (blocking!)
- sled-agent: `Instance::ensure_propolis_and_tasks`
- sled-agent: spawn `Instance::monitor_state_task`
- sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
- sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` nexus currently invokes after getting the response from the blocking call
(With a way for nexus to cancel an instance creation by ID, and a timeout
in sled-agent itself for terminating the attempt and reporting the failure
back to nexus, and a shorter threshold for logging the event of an instance
creation taking a long time.)
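As a rough illustration of that shape (not a proposal for the real API), the sketch below returns to the caller immediately, does the work in a spawned thread, enforces its own deadline, and reports progress and the final outcome over a channel that stands in for the cpapi calls. Everything here is an assumption made for the example: `put_state_nonblocking`, `Report`, and `Attempt` are hypothetical names, cancellation goes through a handle rather than an instance ID, and the long-running-creation log threshold is left out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, Sender};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical out-of-band reports sent back to the caller (nexus).
#[derive(Debug, PartialEq)]
enum Report {
    Started,
    Finished,
    Failed(&'static str),
}

/// Handle the caller keeps so it can cancel the attempt.
struct Attempt {
    cancel: Arc<AtomicBool>,
}

impl Attempt {
    fn cancel(&self) {
        self.cancel.store(true, Ordering::SeqCst);
    }
}

/// Returns immediately; the setup work runs in a background thread that
/// checks for cancellation and its own deadline between steps.
fn put_state_nonblocking(
    setup_steps: u32,
    step_time: Duration,
    deadline: Duration,
    to_nexus: Sender<Report>,
) -> Attempt {
    let cancel = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&cancel);
    thread::spawn(move || {
        let start = Instant::now();
        let _ = to_nexus.send(Report::Started);
        for _ in 0..setup_steps {
            if flag.load(Ordering::SeqCst) {
                let _ = to_nexus.send(Report::Failed("cancelled"));
                return;
            }
            if start.elapsed() > deadline {
                let _ = to_nexus.send(Report::Failed("timed out"));
                return;
            }
            // Stand-in for one chunk of blocking setup work.
            thread::sleep(step_time);
        }
        // Stand-in for the final "result" call back to nexus.
        let _ = to_nexus.send(Report::Finished);
    });
    Attempt { cancel }
}

/// Collects every report for one attempt; the iterator ends once the
/// worker thread exits and drops its sender.
fn collect(rx: Receiver<Report>) -> Vec<Report> {
    rx.iter().collect()
}
```

The point of the sketch is that the timeout becomes an explicit parameter owned by the side doing the work, and the caller learns the outcome from a report rather than from the HTTP response to the original request.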
Before such a change, though, we should really have some more tests around
sled-agent's instance creation code at all! So here's a few.