Skip to content

Add VmmState::SagaUnwound #5855

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 5, 2024
Merged

Add VmmState::SagaUnwound #5855

merged 6 commits into from
Jun 5, 2024

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Jun 4, 2024

This branch is part of ongoing work on the instance-update saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely when these changes remain un-merged for a long time.


Depends on #5854 and should merge only after that PR.

As part of #5749, it is desirable to distinguish between why a VMM record was marked as Destroyed, in order to determine which saga is responsible for cleaning up that VMM's resources. The instance-update saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, the instance-start and instance-migrate sagas may also do this when they unwind. This is a requirement to avoid situations where a failed instance-update saga cannot unwind, because the instance's generation number has changed while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without having to also guarantee that an instance update has occurred, we introduce a new VMM state, SagaUnwound, with the following rules:

  • It is legal to start or migrate an instance if the active_propolis_id or destination_propolis_id (respectively) is either NULL or refers to a VMM that’s in the SagaUnwound state (the new VMM ID directly replaces the old ID).
  • If an instance has an active_propolis_id in the SagaUnwound state, then the instance appears to be Stopped.
  • If an instance has a destination_propolis_id in the SagaUnwound state, nothing special happens–the instance’s state is still derived from its active VMM’s state.
  • The update saga treats SagaUnwound VMMs as identical to Destroyed VMMs for purposes of deciding whether to remove a VMM ID from an instance.

This branch adds a new VmmState::SagaUnwound variant. The SagaUnwound state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set my the instance-start and instance-migrate sagas when they unwind. When determining the API instance state from an instance and active VMM query so that the SagaUnwound state is mapped to Destroyed.

Closes #5848, which this replaces.

@hawkw hawkw requested a review from gjcolombo June 4, 2024 23:21
Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks right to me (I'm happy to recheck it once the state enum split merges and the PR is targeting main). Just a couple of small comments.

@gjcolombo gjcolombo force-pushed the gjcolombo/separate-instance-vmm-state-enums branch from 0c55ef3 to 1d91b71 Compare June 5, 2024 16:26
Base automatically changed from gjcolombo/separate-instance-vmm-state-enums to main June 5, 2024 20:16
hawkw added 3 commits June 5, 2024 13:27
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.

---

Depends on #5854.

As part of #5749, it is desirable to distinguish between *why* a VMM
record was marked as `Destroyed`, in order to determine which saga is
responsible for cleaning up that VMM's resources. The `instance-update`
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the `instance-start` and
`instance-migrate` sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed `instance-update` saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
`SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the
  `active_propolis_id` or `destination_propolis_id` (respectively) is
  either `NULL` or refers to a VMM that’s in the `SagaUnwound` state
  (the new VMM ID directly replaces the old ID).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state,
  then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound`
  state, nothing special happens–the instance’s state is still derived
  from its active VMM’s state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed`
  VMMs for purposes of deciding whether to remove a VMM ID from an
  instance.

This branch adds a new `VmmState::SagaUnwound` variant. The
`SagaUnwound` state is an internal implementation detail that shouldn't
be exposed to an operator or in the external API. Sled-agents will never
report that a VMM is in this state. Instead, this state is set my the
`instance-start` and `instance-migrate` sagas when they unwind. When
determining the API instance state from an instance and active VMM query
so that the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
@hawkw hawkw force-pushed the eliza/saga-rewound branch from 7b11f1a to 7c95a23 Compare June 5, 2024 20:29
@hawkw hawkw requested a review from gjcolombo June 5, 2024 21:07
Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the change!

I think I still accept my past self's reason for introducing this state (to recap for other readers: we want to introduce an "instance state update" task; that task requires the exclusive right to clear an instance's Propolis ID fields; unwinding start/migrate sagas therefore need a way to mark that a VMM needs to be GC'ed by the update task). That said, it's sort of a bummer that there are now two states that represent a stopped instance (InstanceState::NoVmm and InstanceState::Vmm + VmmState::SagaUnwound). As we author and review the update task, it might be worth a little extra elbow grease to make absolutely sure we can't avoid the "updater task is the only thing that can set Propolis IDs to NULL" restriction. But I'm confident enough in that requirement that I'm good with merging this PR.

@hawkw hawkw marked this pull request as ready for review June 5, 2024 21:25
@hawkw hawkw enabled auto-merge (squash) June 5, 2024 21:25
@hawkw hawkw merged commit 9d67e42 into main Jun 5, 2024
19 checks passed
@hawkw hawkw deleted the eliza/saga-rewound branch June 5, 2024 23:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants