-
Notifications
You must be signed in to change notification settings - Fork 43
[nexus] Add virtual provisioning idempotency tests, prevent underflow #5830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// and that operation should have cleaned up the resources already, | ||
// in which case there's nothing to do here. | ||
query.sql(" | ||
do_update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I basically tried to move all the validation of the state_generation into this boolean do_update?
subquery, then re-use that check in multiple spots where we can perform updates below.
AND EXISTS( | ||
SELECT 1 FROM instance WHERE instance.id = $5 AND instance.state_generation < $6 LIMIT 1 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is that (before this PR) do_update
could be "true", but the state generation could be below our "max", which would result in that buggy partial deletion state.
Now:
- If
do_update
is true, we really believe we should perform the insertion/deletion - If that's the case, we can use it as the deciding factor to both delete the
virtual_provisioning_resource
and decrement the usage in thevirtual_provisioning_collection
table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! LGTM as far as the use of the do_update
clause goes--this is more or less the fix I had in mind for #5042, which issue I think this will resolve.
I know the conversion to raw SQL is part of a different PR, but I also looked at the entire expected query output in virtual_provisioning_collection_update_delete_instance.sql
and the structure there also looks right to me (I didn't check the parameter ordering too closely though).
nexus/db-queries/src/db/datastore/virtual_provisioning_collection.rs
Outdated
Show resolved
Hide resolved
This is a corollary PR to #5830 , which fixed the root cause. Due to a bug in the virtual provisioning query, it was possible to undercount virtual provisioning information for instances, which would result in an integer underflow for "total CPU/RAM provisioned" for a {project, silo, fleet}. Although #5830 fixed the root cause, it's possible that in-field systems have an invalid value if they experienced this bug. This PR uses a schema change, exploiting the fact that schema changes occur with instances offline, to reset these values to a known value.
Builds on #5081 and #5089 , but more out of convenience than necessity.
Summary
This PR attempts to validate that, for the "virtual provisioning collection {insert, delete}" operations, they are idempotent. Currently, our usage of
max_instance_gen
only partially prevents updates during instance provisioning deletions:max_instance_gen
is smaller than the observed instance generation number...virtual_provisioning_resource
record (which is great)virtual_provisioning_collection
values (which is really not great).This basically means that we can "only cause the project/silo/fleet usage values to decrement arbitrarily, with no other changes". This has been, mechanically, the root cause of our observed underflows (e.g, #5525).
Details of this change
nexus/db-queries/src/db/datastore/virtual_provisioning_collection.rs
are tests validating idempotency of these operations.nexus/db-queries/src/db/queries/virtual_provisioning_collection_update.rs
are actually changes to the query which change functionality. The objective of these changes is to preserve idempotency of the newly added tests, and to prevent undercounting of virtual provisioning resources. If these changes are reverted, the newly added tests start failing, showing a lack of coverage.