Add some unit tests for sled-agent Instance creation #4489

Closed

Conversation

lifning
Contributor

@lifning lifning commented Nov 11, 2023

Depends on #4325 for faking zone creation.

At time of writing, instance creation roughly looks like:

  • nexus -> sled-agent: instance_put_state
    • sled-agent: InstanceManager::ensure_state
      • sled-agent: Instance::propolis_ensure
        • sled-agent -> nexus: cpapi_instances_put (if not migrating)
        • sled-agent: Instance::setup_propolis_locked (blocking!)
          • RunningZone::install and Zones::boot
          • illumos_utils::svc::wait_for_service
          • self::wait_for_http_server for propolis-server itself
        • sled-agent: Instance::ensure_propolis_and_tasks
          • sled-agent: spawn Instance::monitor_state_task
        • sled-agent -> nexus: cpapi_instances_put (if not migrating)
      • sled-agent: return ok result
  • nexus: handle_instance_put_result

Or at least, it does in the happy path. #3927 saw propolis zone
creation take longer than the one-minute timeout on nexus's call to
sled-agent's instance_put_state. That might've looked something like:

  • nexus -> sled-agent: instance_put_state
    • sled-agent: InstanceManager::ensure_state
      • sled-agent: Instance::propolis_ensure
        • sled-agent -> nexus: cpapi_instances_put (if not migrating)
        • sled-agent: Instance::setup_propolis_locked (blocking!)
          • RunningZone::install and Zones::boot
  • nexus: i've been waiting a whole minute for this. connection timeout!
  • nexus: handle_instance_put_result
    • sled-agent: [...] return... oh, they hung up. :(
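
To make the failure mode above concrete, here's a minimal std-only sketch of a caller giving up on a blocking handler. Nothing here is real omicron code: `blocking_call`, `CallOutcome`, and the channel standing in for the HTTP connection are all hypothetical, and the durations are scaled down from the real one-minute timeout.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// What the caller (standing in for nexus) observes.
#[derive(Debug, PartialEq)]
pub enum CallOutcome {
    Completed(&'static str),
    TimedOut,
}

/// Hypothetical stand-in for a blocking `instance_put_state` request.
/// The handler does `setup` worth of work before it can reply, and the
/// caller waits at most `client_timeout` for that reply.
///
/// Returns the caller's view of the call, plus whether the handler's
/// reply was actually delivered.
pub fn blocking_call(setup: Duration, client_timeout: Duration) -> (CallOutcome, bool) {
    // The channel plays the role of the connection (an assumption for
    // illustration only).
    let (reply_tx, reply_rx) = mpsc::channel::<&'static str>();

    // "sled-agent": finishes all of the slow work before replying.
    let handler = thread::spawn(move || {
        thread::sleep(setup); // zone install, boot, waiting for the server...
        // If the caller already hung up, the receiver is gone and the
        // send fails.
        reply_tx.send("ok").is_ok()
    });

    // "nexus": waits for the reply, but only for so long.
    let outcome = match reply_rx.recv_timeout(client_timeout) {
        Ok(msg) => CallOutcome::Completed(msg),
        Err(_) => CallOutcome::TimedOut,
    };
    drop(reply_rx); // hang up

    let reply_delivered = handler.join().unwrap_or(false);
    (outcome, reply_delivered)
}
```

With a `setup` longer than `client_timeout`, the caller sees `TimedOut` and the handler's eventual reply goes nowhere, even though the work itself finished.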

To avoid this timeout being implicit at the Dropshot configuration
layer (that is to say, we should still have some timeout),
we could consider a small refactor to make instance_put_state not a
blocking call -- especially since it's already sending nexus updates on
its progress via out-of-band cpapi_instances_put calls! That might look
something like:

  • nexus -> sled-agent: instance_put_state
    • sled-agent: InstanceManager::ensure_state
      • sled-agent: spawn {
        • sled-agent: Instance::propolis_ensure
          • sled-agent -> nexus: cpapi_instances_put (if not migrating)
          • sled-agent: Instance::setup_propolis_locked (blocking!)
          • sled-agent: Instance::ensure_propolis_and_tasks
            • sled-agent: spawn Instance::monitor_state_task
          • sled-agent -> nexus: cpapi_instances_put (if not migrating)
          • sled-agent -> nexus: a cpapi call equivalent to the handle_instance_put_result nexus currently invokes after getting the response from the blocking call

(With a way for nexus to cancel an instance creation by ID, and a timeout
in sled-agent itself for terminating the attempt and reporting the failure
back to nexus, and a shorter threshold for logging the event of an instance
creation taking a long time.)
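
As a rough illustration of the shape proposed above (not the actual implementation), here's a std-only sketch: the call returns immediately, the slow work runs on a spawned thread, and progress comes back out of band. Everything named here is hypothetical: `ensure_state_nonblocking`, `EnsureHandle`, and `Update` are made up, the channel stands in for cpapi calls back to nexus, and the polling loop stands in for the real setup work. The long-creation logging threshold is left out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Receiver};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical out-of-band notifications, standing in for
/// `cpapi_instances_put`-style calls back to nexus.
#[derive(Debug, PartialEq)]
pub enum Update {
    Starting,
    Running,
    Failed(&'static str),
}

/// Handed back to the caller right away.
pub struct EnsureHandle {
    cancel: Arc<AtomicBool>,
    pub updates: Receiver<Update>,
}

impl EnsureHandle {
    /// Ask the in-progress creation attempt to stop.
    pub fn cancel(&self) {
        self.cancel.store(true, Ordering::SeqCst);
    }
}

/// Returns without waiting for setup. `setup` simulates how long the
/// slow work takes; `deadline` is the worker's own timeout.
pub fn ensure_state_nonblocking(setup: Duration, deadline: Duration) -> EnsureHandle {
    let (tx, rx) = mpsc::channel();
    let cancel = Arc::new(AtomicBool::new(false));
    let cancel_flag = Arc::clone(&cancel);

    thread::spawn(move || {
        let _ = tx.send(Update::Starting);
        let started = Instant::now();
        // Poll in small steps so cancellation and the deadline are
        // noticed promptly.
        let result = loop {
            if cancel_flag.load(Ordering::SeqCst) {
                break Update::Failed("cancelled");
            }
            if started.elapsed() >= setup {
                break Update::Running;
            }
            if started.elapsed() >= deadline {
                break Update::Failed("timed out");
            }
            thread::sleep(Duration::from_millis(5));
        };
        // Final report, analogous to the result nexus currently gets
        // from the blocking response.
        let _ = tx.send(result);
    });

    EnsureHandle { cancel, updates: rx }
}
```

The point of the sketch is that the timeout lives in the worker rather than in the connection: the caller always hears a final `Running` or `Failed` update instead of finding out via a dropped request.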

Before such a change, though, we should really have some more tests around
sled-agent's instance creation code at all! So here's a few.

@lifning lifning force-pushed the sled-agent-instance-creation-tests branch 3 times, most recently from dfdfe77 to 0b96df7 Compare November 14, 2023 03:52
@jordanhendricks jordanhendricks self-requested a review November 14, 2023 20:44
@lifning lifning force-pushed the sled-agent-instance-creation-tests branch from 0b96df7 to 8c23c45 Compare November 15, 2023 05:11
@lifning lifning force-pushed the sled-agent-instance-creation-tests branch 3 times, most recently from 74d4b87 to bdeb287 Compare December 1, 2023 23:54
@lifning lifning force-pushed the sled-agent-instance-creation-tests branch 2 times, most recently from e6e7db7 to 2c22a2e Compare December 21, 2023 09:57
@lifning lifning force-pushed the sled-agent-instance-creation-tests branch from 2c22a2e to efb04a8 Compare December 21, 2023 20:36
@lifning
Contributor Author

lifning commented Mar 5, 2024

closing this because it makes more sense to just pull it in as part of #4691

@lifning lifning closed this Mar 5, 2024
Development

Successfully merging this pull request may close these issues.

Propolis zone installation took 81 seconds and caused instance start to time out