-
Notifications
You must be signed in to change notification settings - Fork 43
Add some unit tests for sled-agent Instance creation #4489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
lifning
wants to merge
1
commit into
oxidecomputer:main
from
lifning:sled-agent-instance-creation-tests
Closed
Add some unit tests for sled-agent Instance creation #4489
lifning
wants to merge
1
commit into
oxidecomputer:main
from
lifning:sled-agent-instance-creation-tests
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
dfdfe77
to
0b96df7
Compare
0b96df7
to
8c23c45
Compare
74d4b87
to
bdeb287
Compare
e6e7db7
to
2c22a2e
Compare
At time of writing, instance creation roughly looks like: - nexus -> sled-agent: `instance_put_state` - sled-agent: `InstanceManager::ensure_state` - sled-agent: `Instance::propolis_ensure` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent: `Instance::setup_propolis_locked` (*blocking!*) - `RunningZone::install` and `Zones::boot` - `illumos_utils::svc::wait_for_service` - `self::wait_for_http_server` for propolis-server itself - sled-agent: `Instance::ensure_propolis_and_tasks` - sled-agent: spawn `Instance::monitor_state_task` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent: return ok result - nexus: `handle_instance_put_result` Or at least, it does in the happy path. omicron#3927 saw propolis zone creation take longer than the minute nexus's call to sled-agent's `instance_put_state`. That might've looked something like: - nexus -> sled-agent: `instance_put_state` - sled-agent: `InstanceManager::ensure_state` - sled-agent: `Instance::propolis_ensure` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent: `Instance::setup_propolis_locked` (*blocking!*) - `RunningZone::install` and `Zones::boot` - nexus: i've been waiting a whole minute for this. connection timeout! - nexus: `handle_instance_put_result` - sled-agent: [...] return... oh, they hung up. :( To avoid this timeout being implicit at the *Dropshot configuration* layer (that is to say, we should still have *some* timeout), we could consider a small refactor to make `instance_put_state` not a blocking call -- especially since it's already sending nexus updates on its progress via out-of-band `cpapi_instances_put` calls! That might look something like: - nexus -> sled-agent: `instance_put_state` - sled-agent: `InstanceManager::ensure_state` - sled-agent: spawn { - sled-agent: `Instance::propolis_ensure` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent: `Instance::setup_propolis_locked` (blocking!) - sled-agent: `Instance::ensure_propolis_and_tasks` - sled-agent: spawn `Instance::monitor_state_task` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` nexus currently invokes after getting the response from the blocking call (With a way for nexus to cancel an instance creation by ID, and a timeout in sled-agent itself for terminating the attempt and reporting the failure back to nexus, and a shorter threshold for logging the event of an instance creation taking a long time.) Before such a change, though, we should really have some more tests around sled-agent's instance creation code at all! So here's a few.
2c22a2e
to
efb04a8
Compare
closing this because it makes more sense to just pull it in as part of #4691 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Depends on #4325 for faking zone creation.
At time of writing, instance creation roughly looks like:
instance_put_state
InstanceManager::ensure_state
Instance::propolis_ensure
cpapi_instances_put
(if not migrating)Instance::setup_propolis_locked
(blocking!)RunningZone::install
andZones::boot
illumos_utils::svc::wait_for_service
self::wait_for_http_server
for propolis-server itselfInstance::ensure_propolis_and_tasks
Instance::monitor_state_task
cpapi_instances_put
(if not migrating)handle_instance_put_result
Or at least, it does in the happy path. #3927 saw propolis zone
creation take longer than the minute nexus's call to sled-agent's
instance_put_state
. That might've looked something like:instance_put_state
InstanceManager::ensure_state
Instance::propolis_ensure
cpapi_instances_put
(if not migrating)Instance::setup_propolis_locked
(blocking!)RunningZone::install
andZones::boot
handle_instance_put_result
To avoid this timeout being implicit at the Dropshot configuration
layer (that is to say, we should still have some timeout),
we could consider a small refactor to make
instance_put_state
not ablocking call -- especially since it's already sending nexus updates on
its progress via out-of-band
cpapi_instances_put
calls! That might looksomething like:
instance_put_state
InstanceManager::ensure_state
Instance::propolis_ensure
cpapi_instances_put
(if not migrating)Instance::setup_propolis_locked
(blocking!)Instance::ensure_propolis_and_tasks
Instance::monitor_state_task
cpapi_instances_put
(if not migrating)handle_instance_put_result
nexus currently invokes after getting the response from the blocking call(With a way for nexus to cancel an instance creation by ID, and a timeout
in sled-agent itself for terminating the attempt and reporting the failure
back to nexus, and a shorter threshold for logging the event of an instance
creation taking a long time.)
Before such a change, though, we should really have some more tests around
sled-agent's instance creation code at all! So here's a few.