[nexus] add background task for cleaning up abandoned VMMs #5812

Merged: hawkw merged 13 commits into main from eliza/abandoned-vmm-reaper on May 24, 2024

Conversation

@hawkw (Member) commented May 23, 2024

Note: This change is part of the ongoing instance lifecycle management work
in PR #5749. It isn't strictly necessary on its own; it's a component of the
upcoming instance updater saga. However, I thought it would be easier to
review if I factored this change out into a separate PR that can be reviewed
and merged on its own.

The instance update saga (see PR #5749) will only clean up after VMMs
whose IDs appear in an instance record. When a live migration finishes
(successfully or not), we want to allow a new migration to begin as soon
as possible, which means we have to unlink the “unused” side of the
migration --- the source if the migration succeeded, or the target if it
failed --- from the instance, even though that VMM may not be fully
destroyed yet. Once this happens, the instance update saga will no
longer be able to clean up these VMMs, so we’ll need a separate task
that cleans up these "abandoned" VMMs in the background.

This branch introduces an abandoned_vmm_reaper background task that's
responsible for doing this. It queries the database to list VMMs which
are:

  • in the Destroyed state
  • not deleted yet (i.e. time_deleted IS NULL)
  • not pointed to by their corresponding instances (neither the
    active_propolis_id nor the target_propolis_id equals the VMM's ID)

For any VMMs returned by this query, the abandoned_vmm_reaper task
will:

  • remove the sled_resource reservation for that VMM
  • set time_deleted on the VMM record if it was not already set

This cleanup process will be executed periodically in the background.
Eventually, the background task will also be explicitly triggered by the
instance update saga when it knows it has abandoned a VMM.
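
As a rough illustration of the candidate query described above, here is a minimal sketch of the selection criteria, written as a SQL string constant in Rust. The table, column, and state names (vmm, instance, active_propolis_id, 'destroyed', and so on) are assumptions for illustration rather than the actual omicron schema; the real task builds an equivalent query with Diesel, as the review snippets below show.

// Hypothetical sketch of the "abandoned VMM" selection criteria; table and
// column names are assumed, not taken from the real schema.
const FIND_ABANDONED_VMMS: &str = r#"
    SELECT vmm.id
    FROM vmm
    LEFT JOIN instance
           ON instance.active_propolis_id = vmm.id
           OR instance.target_propolis_id = vmm.id
    WHERE vmm.state = 'destroyed'    -- in the Destroyed state
      AND vmm.time_deleted IS NULL   -- not deleted yet
      AND instance.id IS NULL        -- no instance points at this VMM
"#;

In a sketch like this, the left join (rather than an inner join) is what also lets the task reap Destroyed VMMs that were never assigned to an instance at all, a point that comes up in the review discussion below.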

As an aside, I noticed that the current implementation of
DataStore::vmm_mark_deleted will always unconditionally set the
time_deleted field on a VMM record, even if it's already set. This is
"probably fine" for overall correctness: the VMM remains deleted, so the
operation is still idempotent-ish. But, it's not great, as it means
that any queries for VMMs deleted before a certain timestamp may not be
strictly correct, and we're updating the database more frequently than
we really need to. So, I've gone ahead and changed it to only set
time_deleted if the record's time_deleted is null, using
check_if_exists so that the method still returns Ok if the record
was already deleted --- the caller can inspect the returned bool to
determine whether or not they were the actual deleter, but the query
still doesn't fail.
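
Conceptually, the guarded update described in this aside amounts to something like the following sketch. It is a minimal illustration in raw SQL with assumed table and column names; the actual method builds the query with Diesel and the check_if_exists helper mentioned above rather than issuing SQL directly.

// Hypothetical sketch: only stamp time_deleted when it is still NULL, so a
// repeat call never overwrites the original deletion timestamp.
const VMM_MARK_DELETED: &str = r#"
    UPDATE vmm
       SET time_deleted = now()
     WHERE id = $1
       AND time_deleted IS NULL
"#;
// If this updates zero rows but the VMM record exists, some other caller
// already deleted it; the method still returns Ok, and the caller can use
// the returned flag to tell whether it was the actual deleter.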

@hawkw requested review from davepacheco, smklein and bnaecker and removed request for davepacheco May 23, 2024 18:31
// - not deleted yet
.filter(dsl::time_deleted.is_null())
// - not pointed to by their corresponding instances
.left_join(
Collaborator

(as mentioned in the hypervisor meeting) the fact that this is a left_join, not an inner_join, I think means we would also find any "VMMs that are 'destroyed', but not 'deleted', as long as instances aren't pointing at them".

I think this is actually a nice-to-have feature -- could (should?) this background task be responsible for destroying VMMs that are in a terminal state?

@hawkw (Member, Author)

Yeah, we discussed this during the hypervisor sync --- I've updated the comments to more accurately reflect that this task will also clean up VMMs that are Destroyed and have never been assigned to an instance. The one case this task is not responsible for is VMMs that are Destroyed/Failed while actively in use by an instance, which is the responsibility of the (future) instance-update saga.

Collaborator

Sounds good - I think I was a bit narrowly focused on "what is the data", rather than "how did we get in this state" in my review, but your comments here make a lot of sense, especially in the context of the upcoming instance-update saga.

let vmm_id = vmm.id;
slog::trace!(opctx.log, "Deleting abandoned VMM"; "vmm" => %vmm_id);
// Attempt to remove the abandoned VMM's sled resource reservation.
match self.datastore.sled_reservation_delete(opctx, vmm_id).await {
Collaborator

The one spot of this PR that feels weird to me is the overlap with functionality in

pub(crate) async fn notify_instance_updated(

When a VMM is destroyed, that function seems to:

  • Update counters in the project / fleet about "virtual provisioning" (see: virtual_provisioning_collection_delete_instance)
  • Unassign producers from oximeter (see: unassign_producer)
  • Update instance and VMM states to indicate that the VMM has been destroyed
  • Delete the sled reservation (physical usage)
  • Mark the VMM deleted

(We have previously identified that it is a problem these steps aren't atomic! But that's an issue which exists independently of this PR)

In contrast, this background task seems to:

  • Delete the sled reservation (physical usage)
  • Mark the VMM deleted

So, I have a smattering of follow-up questions here:

  • Should this background task take over this responsibility for destroying VMMs from notify_instance_updated ? e.g., should that function start kicking this background task?
  • That function also seems to destroy VMMs in the Failed state. Is that scope we would want this task to handle too?
  • Should the "physical usage" tracking (sled reservation) and "virtual usage" tracking (virtual_provisioning_collection...) updates be more tightly coupled? I think they're both already atomic.

@hawkw (Member, Author)

I think our discussion in the hypervisor sync and on Matrix today has cleared this up --- for readers other than @smklein, though, the goal is that eventually:

  • The behavior currently implemented in notify_instance_updated will move to a saga (see Perform instance state transitions in instance-update saga #5749), and the Nexus upcall will just update the instance's state and kick off that saga
  • When a VMM that's actively in use by an instance is destroyed/failed, a bunch of cleanup needs to happen, including virtual provisioning, Oximeter producer, and networking; the update saga will do this
  • When a VMM is abandoned after a migration (either because it was a failed migration target, or because it was the old VMM left over after a successful migration), the cleanup is much simpler, as the instance is still around and still using the virtual provisioning resources. Only the physical sled resources consumed by the individual Propolis process need to be released; the other resources are still in use by the instance, just not via this VMM.

This background task is only responsible for abandoned VMMs, so it only needs to do the simple cleanup process for them; it doesn't deallocate resources owned by instances when an in-use VMM fails or is stopped.

@bnaecker (Collaborator) left a comment

Nice, clean work, looks good to me! Just one comment, but quite minor.

@hawkw requested a review from smklein May 23, 2024 23:42
@hawkw enabled auto-merge (squash) May 24, 2024 00:07
@hawkw disabled auto-merge May 24, 2024 00:11
@hawkw enabled auto-merge (squash) May 24, 2024 00:23
@hawkw (Member, Author) commented May 24, 2024

Okay, I've added some more comments clarifying which VMMs are candidates for cleanup by this task, and what cleanup is handled elsewhere (e.g. when an active VMM is destroyed). I'm going to merge this once it gets through CI --- thanks @smklein and @bnaecker for the reviews! <3

@hawkw merged commit 27e6b34 into main May 24, 2024
19 checks passed
@hawkw deleted the eliza/abandoned-vmm-reaper branch May 24, 2024 17:36