-
Notifications
You must be signed in to change notification settings - Fork 43
Add VmmState::SagaUnwound
#5855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks right to me (I'm happy to recheck it once the state enum split merges and the PR is targeting main). Just a couple of small comments.
0c55ef3
to
1d91b71
Compare
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely when these changes remain un-merged for a long time. --- Depends on #5854. As part of #5749, it is desirable to distinguish between *why* a VMM record was marked as `Destroyed`, in order to determine which saga is responsible for cleaning up that VMM's resources. The `instance-update` saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, the `instance-start` and `instance-migrate` sagas may also do this when they unwind. This is a requirement to avoid situations where a failed `instance-update` saga cannot unwind, because the instance's generation number has changed while the saga was executing. We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without having to also guarantee that an instance update has occurred, we introduce a new VMM state, `SagaUnwound`, with the following rules: - It is legal to start or migrate an instance if the `active_propolis_id` or `destination_propolis_id` (respectively) is either `NULL` or refers to a VMM that’s in the `SagaUnwound` state (the new VMM ID directly replaces the old ID). - If an instance has an `active_propolis_id` in the `SagaUnwound` state, then the instance appears to be `Stopped`. - If an instance has a `destination_propolis_id` in the `SagaUnwound` state, nothing special happens–the instance’s state is still derived from its active VMM’s state. - The update saga treats `SagaUnwound` VMMs as identical to `Destroyed` VMMs for purposes of deciding whether to remove a VMM ID from an instance. This branch adds a new `VmmState::SagaUnwound` variant. The `SagaUnwound` state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set my the `instance-start` and `instance-migrate` sagas when they unwind. When determining the API instance state from an instance and active VMM query so that the `SagaUnwound` state is mapped to `Destroyed`. Closes #5848, which this replaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the change!
I think I still accept my past self's reason for introducing this state (to recap for other readers: we want to introduce an "instance state update" task; that task requires the exclusive right to clear an instance's Propolis ID fields; unwinding start/migrate sagas therefore need a way to mark that a VMM needs to be GC'ed by the update task). That said, it's sort of a bummer that there are now two states that represent a stopped instance (InstanceState::NoVmm
and InstanceState::Vmm
+ VmmState::SagaUnwound
). As we author and review the update task, it might be worth a little extra elbow grease to make absolutely sure we can't avoid the "updater task is the only thing that can set Propolis IDs to NULL" restriction. But I'm confident enough in that requirement that I'm good with merging this PR.
This branch is part of ongoing work on the
instance-update
saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely when these changes remain un-merged for a long time.Depends on #5854 and should merge only after that PR.
As part of #5749, it is desirable to distinguish between why a VMM record was marked as
Destroyed
, in order to determine which saga is responsible for cleaning up that VMM's resources. Theinstance-update
saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, theinstance-start
andinstance-migrate
sagas may also do this when they unwind. This is a requirement to avoid situations where a failedinstance-update
saga cannot unwind, because the instance's generation number has changed while the saga was executing.We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without having to also guarantee that an instance update has occurred, we introduce a new VMM state,
SagaUnwound
, with the following rules:active_propolis_id
ordestination_propolis_id
(respectively) is eitherNULL
or refers to a VMM that’s in theSagaUnwound
state (the new VMM ID directly replaces the old ID).active_propolis_id
in theSagaUnwound
state, then the instance appears to beStopped
.destination_propolis_id
in theSagaUnwound
state, nothing special happens–the instance’s state is still derived from its active VMM’s state.SagaUnwound
VMMs as identical toDestroyed
VMMs for purposes of deciding whether to remove a VMM ID from an instance.This branch adds a new
VmmState::SagaUnwound
variant. TheSagaUnwound
state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set my theinstance-start
andinstance-migrate
sagas when they unwind. When determining the API instance state from an instance and active VMM query so that theSagaUnwound
state is mapped toDestroyed
.Closes #5848, which this replaces.