Skip to content

[nexus] Add virtual provisioning idempotency tests, prevent underflow #5830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
May 30, 2024

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented May 28, 2024

Builds on #5081 and #5089 , but more out of convenience than necessity.

Summary

This PR attempts to validate that, for the "virtual provisioning collection {insert, delete}" operations, they are idempotent. Currently, our usage of max_instance_gen only partially prevents updates during instance provisioning deletions:

  • If max_instance_gen is smaller than the observed instance generation number...
  • ... we avoid deleting the virtual_provisioning_resource record (which is great)
  • ... but we still decrement the virtual_provisioning_collection values (which is really not great).

This basically means that we can "only cause the project/silo/fleet usage values to decrement arbitrarily, with no other changes". This has been, mechanically, the root cause of our observed underflows (e.g, #5525).

Details of this change

  • All the changes in nexus/db-queries/src/db/datastore/virtual_provisioning_collection.rs are tests validating idempotency of these operations.
  • All the changes in nexus/db-queries/src/db/queries/virtual_provisioning_collection_update.rs are actually changes to the query which change functionality. The objective of these changes is to preserve idempotency of the newly added tests, and to prevent undercounting of virtual provisioning resources. If these changes are reverted, the newly added tests start failing, showing a lack of coverage.

@hawkw hawkw self-assigned this May 28, 2024
@smklein smklein requested a review from gjcolombo May 29, 2024 00:48
// and that operation should have cleaned up the resources already,
// in which case there's nothing to do here.
query.sql("
do_update
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically tried to move all the validation of the state_generation into this boolean do_update? subquery, then re-use that check in multiple spots where we can perform updates below.

Comment on lines +43 to +45
AND EXISTS(
SELECT 1 FROM instance WHERE instance.id = $5 AND instance.state_generation < $6 LIMIT 1
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that (before this PR) do_update could be "true", but the state generation could be below our "max", which would result in that buggy partial deletion state.

Now:

  • If do_update is true, we really believe we should perform the insertion/deletion
  • If that's the case, we can use it as the deciding factor to both delete the virtual_provisioning_resource and decrement the usage in the virtual_provisioning_collection table

@smklein smklein marked this pull request as ready for review May 29, 2024 01:02
@hawkw hawkw removed their assignment May 29, 2024
@hawkw hawkw self-requested a review May 29, 2024 16:42
Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! LGTM as far as the use of the do_update clause goes--this is more or less the fix I had in mind for #5042, which issue I think this will resolve.

I know the conversion to raw SQL is part of a different PR, but I also looked at the entire expected query output in virtual_provisioning_collection_update_delete_instance.sql and the structure there also looks right to me (I didn't check the parameter ordering too closely though).

@smklein smklein changed the title Add some virtual provisioning tests [nexus] Add virtual provisioning idempotency tests, prevent underflow May 29, 2024
Base automatically changed from convert-virtual-provisioning to main May 29, 2024 21:25
@smklein smklein enabled auto-merge (squash) May 29, 2024 21:34
@smklein smklein disabled auto-merge May 29, 2024 22:47
@smklein smklein enabled auto-merge (squash) May 30, 2024 00:42
@smklein smklein merged commit acbeb27 into main May 30, 2024
19 checks passed
@smklein smklein deleted the virtual-provisioning-double-dip branch May 30, 2024 01:38
smklein added a commit that referenced this pull request Jun 6, 2024
This is a corollary PR to
#5830 , which fixed the
root cause.

Due to a bug in the virtual provisioning query, it was possible to
undercount virtual provisioning information for
instances, which would result in an integer underflow for "total CPU/RAM
provisioned" for a {project, silo, fleet}.

Although #5830 fixed the root cause, it's possible that in-field systems
have an invalid value if they experienced this bug.

This PR uses a schema change, exploiting the fact that schema changes
occur with instances offline, to reset these values to a known value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants