In this comment, @gjcolombo points out that, since failing instance-start and instance-migrate sagas increment an instance's generation number, an instance that is continually failing to start or to migrate can prevent instance-update sagas from completing. This is because incrementing the instance's state generation number invalidates any instance-update saga that started from a previous state generation. Although the unwinding of failed start and migrate sagas doesn't bump the state generation (it just transitions the VMMs they create to SagaUnwound), the forward actions of those sagas do, since they set new active VMM/target VMM/migration IDs.
As Greg put it:
There's one case buried in the design doc that I'm not quite sure is covered here. I think we should file an issue for it and take care of it in a follow-up PR. The text I have in mind is
If start and migrate sagas continually fail, they can “lock out” updates by continually changing the instance’s generation number. One way to deal with this is to force an instance into a “faulted” state after several consecutive failed attempts to start or migrate it, which state can only be cleared by a successful update.
On a second reading, I'm not sure this is too big of a concern for start sagas: any start saga that's setting SagaUnwound must have tried to start a new VMM for the instance, which implies it didn't have one before, which means that the updater must not have had anything to do. But for migration I think this is still a little dangerous: repeatedly trying and failing to migrate will cause the instance's state generation to go up every time this query executes, which can keep the update saga from committing anything.
We should come up with a plan to solve this, but because we're not migrating instances right this second I think it's OK to take care of this in a follow-up PR.
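For context, the invalidation described above is essentially an optimistic-concurrency check. Here's a minimal, self-contained sketch (all names and types below are hypothetical stand-ins, not the actual datastore code) of why a bumped generation keeps an instance-update saga from committing:

```rust
// Sketch of the generation guard: an instance-update saga may only commit if
// the instance's state generation is still the one it observed when it
// started. Any start/migrate saga that bumps the generation in the meantime,
// even one that later unwinds, makes the update stale.

#[derive(Debug)]
struct InstanceRecord {
    state_generation: u64,
    runtime_state: String, // stand-in for the real runtime-state columns
}

#[derive(Debug)]
struct UpdateStale;

fn try_commit_update(
    instance: &mut InstanceRecord,
    observed_generation: u64,
    new_runtime_state: String,
) -> Result<(), UpdateStale> {
    if instance.state_generation != observed_generation {
        // The forward actions of a start or migrate saga (even one that later
        // unwound) advanced the generation after this update saga took its
        // snapshot, so this update is stale and must not be applied.
        return Err(UpdateStale);
    }
    instance.runtime_state = new_runtime_state;
    instance.state_generation += 1;
    Ok(())
}

fn main() {
    let mut instance = InstanceRecord {
        state_generation: 4,
        runtime_state: "vmm".into(),
    };

    // A migrate saga sets new target VMM/migration IDs, bumping the generation...
    instance.state_generation += 1;

    // ...so an update saga that started at generation 4 can no longer commit.
    assert!(try_commit_update(&mut instance, 4, "no-vmm".into()).is_err());
}
```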
At a glance, I think this would probably involve adding counts of failed start and migrate sagas to the instance record, incrementing them when those sagas unwind, putting the instance in a faulted state that prevents new start/migrate sagas from starting when the counts get too high, and having a successful update saga clear that faulted state.
We would have to decide how the "faulted" state that prevents start/migrate sagas from starting would be implemented. As I see it, there are a few obvious options:
Use the values of the failure counter(s) themselves. Starting a start or migrate saga checks that the counts are below the limit, and a successful update saga clears the counters to zero, allowing new start/migrate sagas to proceed (a rough sketch of this follows after this list). This is probably the simplest option, as it doesn't involve a complex DB operation like "increment counter, check if it is over the line, and then set the state". But, it means we can't track the total number of start/migrate failures over the instance's entire lifetime, since the counters get cleared --- I don't know if we really care about that, though.
Represent the faulted state as an InstanceState. Alternatively, we could have an InstanceState variant that we transition the instance to when we want to block new start/migrate sagas. We could either add a new state variant or use InstanceState::Failed for this --- I'd have to think about the implications of that.
If we wanted to track the total number of start/migrate failures, we could never clear the counters, and instead store, in a separate field, the failure count at which the most recent transition to the faulted state occurred. When the faulted state transition occurs, we store the current saga failure count in that field. Subsequent faulted states are then detected as the difference between the total failure count and the count recorded at the last transition to the faulted state.
This solution is a bit more complex, and we'd have to work out how the faulted state interacts with other InstanceState variants... but we probably want it to, in some way. It also makes the faulted state more externally visible, if we choose to return it in the external API or convert it to Failed there, which could be either good or bad. And, it lets the failure counters continually go up, rather than resetting them, which we might want for debugging purposes.
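To make the first option a bit more concrete, here's a rough sketch of the counter-based check. All field and function names below are made up for illustration, not an actual Omicron schema, and in practice each of these would need to be an atomic DB operation rather than an in-memory mutation:

```rust
// Option 1 sketch: per-instance failure counters that gate new start/migrate
// sagas and are reset to zero by a successful instance-update saga.

const MAX_CONSECUTIVE_FAILURES: u32 = 5; // hypothetical threshold

#[derive(Default)]
struct InstanceFailureCounters {
    failed_starts: u32,
    failed_migrations: u32,
}

impl InstanceFailureCounters {
    /// Checked before creating a new start or migrate saga.
    fn may_start_or_migrate(&self) -> bool {
        self.failed_starts < MAX_CONSECUTIVE_FAILURES
            && self.failed_migrations < MAX_CONSECUTIVE_FAILURES
    }

    /// Called when a start saga unwinds (its VMM goes to SagaUnwound).
    fn record_failed_start(&mut self) {
        self.failed_starts += 1;
    }

    /// Called when a migrate saga unwinds.
    fn record_failed_migration(&mut self) {
        self.failed_migrations += 1;
    }

    /// Called by a successful instance-update saga, clearing the "faulted"
    /// condition and allowing new start/migrate sagas again.
    fn clear_on_successful_update(&mut self) {
        self.failed_starts = 0;
        self.failed_migrations = 0;
    }
}

fn main() {
    let mut counters = InstanceFailureCounters::default();

    // Three consecutive migrate failures: still under the threshold.
    for _ in 0..3 {
        counters.record_failed_migration();
    }
    assert!(counters.may_start_or_migrate());

    // A successful update saga clears the counters again.
    counters.clear_on_successful_update();
    assert!(counters.may_start_or_migrate());
}
```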
This is a gut feeling, but I think I am leaning toward a failure-counter based approach here, just because it seems a lot simpler to manage.
I don't think we can reuse the Failed state to represent this condition ("the control plane is doing things with this instance, so you can't start/migrate it now, but you should be able to later"), because in the post-RFD 486 world, Failed instances will be allowed to start. So I think we'd need a new state for this, which comes with a few downsides/raises a few questions:
Adding a new internal instance state grows the instance/VMM state matrix by another row; not a horrible problem, but just another thing to consider in all the cases where we match on instance/VMM state tuples.
We'd have to decide where to write the Maintenance state. I think it probably has to go in the Instance, so that we don't lose information about the state of an instance's active VMM when the maintenance bit is cleared. Does the Maintenance state force us to break the rule that an Instance has an active VMM only if it's in the Vmm state? (Is it possible for a Maintenance instance not to have a VMM, or does being in Maintenance entail that you must have a SagaUnwound VMM of some kind?)
Having a failure counter dodges these questions at the cost of having some additional DB columns, both to store and to consider when running a start/migrate saga. (I do think it's possible to keep a running failure count without having a separate instance state, though it costs another column: the count doesn't get reset; instead the update saga records the failure count value at the time the last update was applied; start/migrate check to see if (current count - last observed count) > threshold instead of checking current count > threshold.)
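If we went the non-resetting route, the check sketched in that parenthetical might look roughly like this (again, field and function names are hypothetical, not an actual schema):

```rust
// Non-resetting variant: the running failure count only ever goes up, the
// update saga records the count it observed when it last committed, and
// start/migrate compare the difference against the threshold.

const FAILURE_THRESHOLD: u32 = 5; // hypothetical threshold

struct InstanceFailureTracking {
    /// Total start/migrate saga failures over the instance's lifetime.
    total_failures: u32,
    /// Value of `total_failures` when the last successful update was applied.
    failures_at_last_update: u32,
}

impl InstanceFailureTracking {
    /// Allowed unless (current count - last observed count) > threshold.
    fn may_start_or_migrate(&self) -> bool {
        self.total_failures - self.failures_at_last_update <= FAILURE_THRESHOLD
    }

    /// Called when a start or migrate saga unwinds; never reset.
    fn record_saga_failure(&mut self) {
        self.total_failures += 1;
    }

    /// The update saga "clears" the fault without losing the lifetime count.
    fn record_successful_update(&mut self) {
        self.failures_at_last_update = self.total_failures;
    }
}

fn main() {
    let mut tracking = InstanceFailureTracking {
        total_failures: 0,
        failures_at_last_update: 0,
    };

    // Repeated failures eventually trip the threshold...
    for _ in 0..=FAILURE_THRESHOLD {
        tracking.record_saga_failure();
    }
    assert!(!tracking.may_start_or_migrate());

    // ...but a successful update clears the fault while the lifetime count
    // is preserved for debugging.
    tracking.record_successful_update();
    assert!(tracking.may_start_or_migrate());
    assert_eq!(tracking.total_failures, FAILURE_THRESHOLD + 1);
}
```

The nice property here is that total_failures never decreases, so it stays useful for debugging, while the faulted condition is still cleared by a successful update.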
I'll mull this over a bit more, but right now I find myself leaning toward the "have a counter" side of these tradeoffs. I can definitely be persuaded otherwise though.
@gjcolombo yeah, the failure counter was my preference as well, thanks. I think it's probably the right thing to do. I thought I had mentioned the two-fields approach to keeping a running failure count, but maybe I forgot to...