[nexus] add background task for cleaning up abandoned VMMs #5812
Conversation
The delete query doesn't actually indicate whether the record was present or not, so we don't need to handle that case.
// - not deleted yet
.filter(dsl::time_deleted.is_null())
// - not pointed to by their corresponding instances
.left_join(
(As mentioned in the hypervisor meeting) the fact that this is a `left_join`, not an `inner_join`, I think means we would also find any "VMMs that are 'destroyed', but not 'deleted', as long as instances aren't pointing at them".
I think this is actually a nice-to-have feature -- could (should?) this background task be responsible for destroying VMMs that are in a terminal state?
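To make that concrete, here is a minimal, self-contained sketch of this style of query against a hypothetical schema (the `vmm` and `instance` table definitions, the text-valued `state` column, and the `debug_query` printing are illustrative assumptions, not omicron's real schema or code). Because the join is a `left_join` with an explicit `ON` clause, VMMs that no instance points at survive the join with NULLs on the instance side, and the final `is_null()` filter keeps exactly those rows:

```rust
// Requires diesel's `postgres` and `uuid` features.
use diesel::prelude::*;

// Hypothetical, simplified tables -- not omicron's actual schema.
diesel::table! {
    vmm (id) {
        id -> Uuid,
        time_deleted -> Nullable<Timestamptz>,
        state -> Text,
    }
}

diesel::table! {
    instance (id) {
        id -> Uuid,
        active_propolis_id -> Nullable<Uuid>,
        target_propolis_id -> Nullable<Uuid>,
    }
}

diesel::allow_tables_to_appear_in_same_query!(vmm, instance);

fn main() {
    use instance::dsl as instance_dsl;
    use vmm::dsl;

    let query = dsl::vmm
        // - in the Destroyed state
        .filter(dsl::state.eq("destroyed"))
        // - not deleted yet
        .filter(dsl::time_deleted.is_null())
        // - not pointed to by their corresponding instances: a LEFT JOIN keeps
        //   VMMs with no matching instance row (the instance side comes back NULL)
        .left_join(
            instance_dsl::instance.on(instance_dsl::active_propolis_id
                .eq(dsl::id.nullable())
                .or(instance_dsl::target_propolis_id.eq(dsl::id.nullable()))),
        )
        // Anti-join: keep only the rows where no instance matched.
        .filter(instance_dsl::id.is_null())
        .select(dsl::id);

    // No database needed: just print the SQL this query would generate.
    println!("{}", diesel::debug_query::<diesel::pg::Pg, _>(&query));
}
```

In this sketch, the `left_join` is what lets VMMs with no referencing instance row at all (for example, destroyed VMMs that were never assigned to an instance) show up in the result, which is the behavior discussed above.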
Yeah, we discussed this during the hypervisor sync --- I've updated the comments for this to more accurately reflect that this task will also clean up VMMs that are `Destroyed` and have never been assigned to an instance. The one case that this task is not responsible for is VMMs that are `Destroyed`/`Failed` while actively being used to run an instance; that is the responsibility of the (future) `instance-update` saga.
Sounds good - I think I was a bit narrowly focused on "what is the data", rather than "how did we get in this state", in my review, but your comments here make a lot of sense, especially in the context of the upcoming `instance-update` saga.
let vmm_id = vmm.id;
slog::trace!(opctx.log, "Deleting abandoned VMM"; "vmm" => %vmm_id);
// Attempt to remove the abandoned VMM's sled resource reservation.
match self.datastore.sled_reservation_delete(opctx, vmm_id).await {
The one spot of this PR that feels weird to me is the overlap with functionality in `notify_instance_updated` (`omicron/nexus/src/app/instance.rs`, line 1970 at c2f3515):

pub(crate) async fn notify_instance_updated(
When a VMM is destroyed, that function seems to:
- Update counters in the project / fleet about "virtual provisioning" (see: `virtual_provisioning_collection_delete_instance`)
- Unassign producers from oximeter (see: `unassign_producer`)
- Update instance and VMM states to indicate that the VMM has been destroyed
- Delete the sled reservation (physical usage)
- Mark the VMM deleted
(We have previously identified that it is a problem these steps aren't atomic! But that's an issue which exists independently of this PR)
In contrast, this background task seems to:
- Delete the sled reservation (physical usage)
- Mark the VMM deleted
So, I have a smattering of follow-up questions here:
- Should this background task take over this responsibility for destroying VMMs from `notify_instance_updated`? e.g., should that function start kicking this background task?
- That function also seems to destroy VMMs in the `Failed` state. Is that scope we would want this task to handle too?
- Should the "physical usage" tracking (sled reservation) and "virtual usage" tracking (`virtual_provisioning_collection...`) updates be more tightly coupled? I think they're both already atomic.
I think our discussion in the hypervisor sync and on Matrix today has cleared this up --- for readers other than @smklein, though, the goal is that eventually:
- The behavior that currently runs in `notify_instance_updated` will move to a saga (see "Perform instance state transitions in `instance-update` saga", #5749), and the Nexus upcall will just update the instance's state and kick off that saga
- When a VMM that's actively in use by an instance is destroyed/failed, a bunch of cleanup needs to happen, including virtual provisioning, Oximeter producer, and networking; the update saga will do this
- When a VMM is abandoned after a migration (either because it was a failed migration target, or because it was the old VMM left over after a successful migration), the cleanup is much simpler, as the instance is still around and still using the virtual provisioning resources. Only the physical sled resources consumed by the individual Propolis process need to be released; the other resources are still in use by the instance, just not via this VMM.
This background task is only responsible for abandoned VMMs, so it only needs to do the simple cleanup process for them; it doesn't deallocate resources owned by instances when an in-use VMM fails or is stopped.
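As a rough illustration of that "simple cleanup process", here is a hedged sketch of handling a single abandoned VMM. The `Datastore` trait below is a stand-in for the real Nexus datastore methods quoted in the diff above (`sled_reservation_delete`, `vmm_mark_deleted`); the signatures and error type are assumptions for illustration, not omicron's actual API:

```rust
use uuid::Uuid;

/// Stand-in error type; the real task uses the Nexus datastore's error type.
#[derive(Debug)]
struct DbError(String);

/// Stand-in for the two datastore calls used by the task (assumed signatures).
trait Datastore {
    /// Release the physical sled resources reserved for this VMM's Propolis.
    fn sled_reservation_delete(&self, vmm_id: Uuid) -> Result<(), DbError>;
    /// Set `time_deleted` on the VMM record; `Ok(true)` means this call did it.
    fn vmm_mark_deleted(&self, vmm_id: Uuid) -> Result<bool, DbError>;
}

/// The "simple cleanup" for one abandoned VMM: release its sled reservation,
/// then mark the VMM record deleted. Nothing owned by the *instance* (virtual
/// provisioning counters, Oximeter producer, networking) is touched here.
fn cleanup_abandoned_vmm(ds: &impl Datastore, vmm_id: Uuid) -> Result<(), DbError> {
    ds.sled_reservation_delete(vmm_id)?;
    // `false` here just means another caller already set time_deleted, which
    // is fine -- the record stays deleted either way.
    let _newly_deleted = ds.vmm_mark_deleted(vmm_id)?;
    Ok(())
}
```

Returning the error instead of aborting lets the task's main loop tally failures per VMM and keep working through the rest of the list in the same activation.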
Nice, clean work, looks good to me! Just one comment, but quite minor.
Co-authored-by: Sean Klein <[email protected]>
Note: This change is part of the ongoing instance lifecycle management work in PR #5749. It's not actually necessary on its own; it's just a component of the upcoming instance updater saga. However, I thought it would be easier to review if I factored this change out into a separate PR that can be reviewed and merged on its own.
The instance update saga (see PR #5749) will only clean up after VMMs whose IDs appear in an `instance` record. When a live migration finishes (successfully or not), we want to allow a new migration to begin as soon as possible, which means we have to unlink the "unused" side of the migration --- the source if the migration succeeded, or the target if it failed --- from the instance, even though that VMM may not be fully destroyed yet. Once this happens, the instance update saga will no longer be able to clean up these VMMs, so we'll need a separate task that cleans up these "abandoned" VMMs in the background.
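As a tiny illustration of which side of a finished migration becomes the abandoned VMM (the names here are hypothetical, not omicron's actual types):

```rust
// Hypothetical names, purely for illustration.
enum MigrationOutcome {
    Succeeded,
    Failed,
}

struct MigrationVmms {
    source_vmm_id: uuid::Uuid,
    target_vmm_id: uuid::Uuid,
}

/// Which VMM gets unlinked from the instance (and later reaped)?
fn abandoned_vmm(outcome: MigrationOutcome, vmms: &MigrationVmms) -> uuid::Uuid {
    match outcome {
        // The instance now runs on the target, so the old source is abandoned.
        MigrationOutcome::Succeeded => vmms.source_vmm_id,
        // The instance stays on the source, so the failed target is abandoned.
        MigrationOutcome::Failed => vmms.target_vmm_id,
    }
}
```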
This branch introduces an `abandoned_vmm_reaper` background task that's responsible for doing this. It queries the database to list VMMs which are:

- in the `Destroyed` state
- not deleted yet (i.e., their `time_deleted` has not been set)
- not pointed to by their corresponding instances (neither the `active_propolis_id` nor the `target_propolis_id` equals the VMM's ID)

For any VMMs returned by this query, the `abandoned_vmm_reaper` task will:

- remove the `sled_resource` reservation for that VMM
- set the `time_deleted` on the VMM record if it was not already set

This cleanup process will be executed periodically in the background.
Eventually, the background task will also be explicitly triggered by the
instance update saga when it knows it has abandoned a VMM.
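A minimal sketch of that activation model, written with plain tokio as a stand-in for Nexus's actual background-task machinery (all names and structure here are illustrative assumptions): the task wakes on a periodic timer, and other code can also activate it immediately through a shared handle.

```rust
use std::sync::Arc;
use tokio::sync::Notify;
use tokio::time::{interval, Duration};

async fn reap_abandoned_vmms() {
    // ... query for abandoned VMMs and clean each one up ...
}

/// Spawn the reaper loop; the returned handle can be used to trigger an
/// immediate activation (e.g. by the instance update saga).
fn spawn_reaper(period: Duration) -> Arc<Notify> {
    let notify = Arc::new(Notify::new());
    let trigger = notify.clone();
    tokio::spawn(async move {
        let mut ticker = interval(period);
        loop {
            // Wake on whichever comes first: the periodic tick or an
            // explicit activation from elsewhere in Nexus.
            tokio::select! {
                _ = ticker.tick() => {}
                _ = trigger.notified() => {}
            }
            reap_abandoned_vmms().await;
        }
    });
    notify
}

#[tokio::main]
async fn main() {
    let handle = spawn_reaper(Duration::from_secs(60));
    // Explicit activation, as the instance update saga would eventually do:
    handle.notify_one();
    // A real service runs forever; sleep briefly so this demo exits cleanly.
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```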
As an aside, I noticed that the current implementation of `DataStore::vmm_mark_deleted` will always unconditionally set the `time_deleted` field on a VMM record, even if it's already set. This is "probably fine" for overall correctness: the VMM remains deleted, so the operation is still idempotent-ish. But it's not great, as it means that any queries for VMMs deleted before a certain timestamp may not be strictly correct, and we're updating the database more frequently than we really need to. So, I've gone ahead and changed it to only set `time_deleted` if the record's `time_deleted` is null, using `check_if_exists` so that the method still returns `Ok` if the record was already deleted --- the caller can inspect the returned `bool` to determine whether or not they were the actual deleter, but the query still doesn't fail.
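For illustration, here is a sketch of that conditional update written with plain Diesel against the same hypothetical `vmm` table used in the earlier query sketch, rather than with omicron's `check_if_exists` helper; it approximates the behavior described above by using the affected-row count to report whether this call was the one that set `time_deleted`.

```rust
// Requires diesel's `postgres`, `uuid`, and `chrono` features.
use diesel::pg::PgConnection;
use diesel::prelude::*;

// Hypothetical, simplified table -- not omicron's actual schema.
diesel::table! {
    vmm (id) {
        id -> Uuid,
        time_deleted -> Nullable<Timestamptz>,
        state -> Text,
    }
}

/// Only set `time_deleted` if it is still NULL, and report whether this call
/// was the one that actually performed the deletion.
fn vmm_mark_deleted(
    conn: &mut PgConnection,
    vmm_id: uuid::Uuid,
) -> QueryResult<bool> {
    use vmm::dsl;

    let rows = diesel::update(
        dsl::vmm
            .filter(dsl::id.eq(vmm_id))
            // Never overwrite an existing time_deleted timestamp.
            .filter(dsl::time_deleted.is_null()),
    )
    .set(dsl::time_deleted.eq(chrono::Utc::now()))
    .execute(conn)?;

    // One row updated => we were the deleter; zero rows => the record was
    // already deleted (or doesn't exist -- the real `check_if_exists`-based
    // query can distinguish those two cases, this simplified sketch cannot).
    Ok(rows == 1)
}
```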