-
Notifications
You must be signed in to change notification settings - Fork 21
Add explicit UpstairsState::Disabled
#1721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
So, I believe this path was to handle what happens when we have a bad set of targets for the downstairs. At least that was the idea, and seemed better than just panicing the upstairs. We did actually hit this scenario where a ROP had finished scrubbing but was still "attached" to the downstairs (a bug that is fixed). The actual ROP was eventually deleted, then the port numbers were re-used, and this long running upstairs tried to reconnect (as the new downstairs came online). The UUID mismatch is one check, but if a downstairs has different region info, that would (should) take the same path here and result in the same end state, whatever we decide that state should be. My feeling is this condition means something has gone terribly wrong somewhere, and I do think it's better to just hang and require operator intervention instead of either moving forward or panic. Does that seem like the right idea? |
Yup, that sounds reasonable to me! Do you think having an |
I don't love the name, but I'm also fine with it. I also don't have any other suggestions. Maybe the path forward would be to replace the |
A few specific comments:
One option would be to make this a downstairs state, and not have This discussion has helped me clarify the vibes of what's going on here: certain failures modes mean that a downstairs should not try to reconnect, basically dropping into a fail-safe mode where it hangs out and waits for further instruction. Follow-up questions:
I'm not sure if anyone knows the answers yet, but curious to hear everyone's thoughts. |
So, I'll also think about this overnight, but
Ah, okay. I think this is still okay. If one of our downstairs is wrong in a bad way, then, yeah, stopping the upstairs seems like our best choice of all the bad choices we have.
Yes, and if we do live migration, this is what would happen. In that situation we need to consider how could we allow an upstairs that got the I'm still thinking about the follow-up questions, I'm going to sleep on those and see what the morning brings me :) |
In the case of live migration, the destination propolis would have received a new volume checkout, bumping the gen numbers. The source propolis' upstairs kinda need to be scrapped after this - any downstairs replacement won't fix it. |
Both of those are kinda terminal states. |
The source propolis would need a new volume checkout and a new activation if it want to "take back" the downstairs. You are right that a downstairs replacement would not solve anything here :) |
We do have a |
My current list of things that a reconnect to the same downstairs won't fix: And, if things were connected and the upstairs was activated and some other upstairs takes over, the current upstairs would get kicked out and then attempt to reconnect (which I think could be okay) and then be denied because a higher generation number upstairs has connected.
The negotiation phase should determine what happens. The only place (I believe) that a downstairs would trigger something would be if another upstairs took over. So, what I think should happen is this.
We include James' phone number in the error message. Seriously, I guess it depends. If we fail on startup before activation, then either the VCR (crucible opts) are bad, or we have some rogue downstairs running on the wrong port. It would most likely be a bug in the software, so recovery may not be an option. My first thought is to hang forever, but I wonder if it would be better to panic and trust that we have left behind enough logs to sift through afterwards. If we have already activated, and one of the downstairs pulls the plug, then I think the other two should keep going until either we get another downstairs that pulls the plug, or someone send the upstairs a deactivation request. Once we have two downstairs that have opted out, the upstairs is going to hang on the next flush, though letting the final remaining downstairs get the writes and that flush that are in flight is probably the right thing to do here. And, given that we could be migrating, getting all the data out of the upstairs is what we want. |
There's a weird phantom state that's been hiding in the Upstairs state machine; this PR makes it explicit.
In the handler for
YouAreNoLongerActive
andon_uuid_mismatch
, the Upstairs performs a peculiar ritual:The effects here are somewhat confusing:
DownstairsClient::disable
stops a client withClientStopReason::Disabled
. This stop reason is a special case – it means that the client does not try to reconnect when reinitialized. In all other cases, whether the client connects depends on the upstairs state alone.Upstairs::set_inactive
sets the upstairs state toInitializing
. It does not do anything else – for example, it doesn't try to stop the other Downstairs.The end result is that the problematic client is restarted, and does not connect to the Downstairs.
As we rethink the state machine for RFD 542, I'd like to remove this special case. In this PR:
Upstairs::set_inactive
is renamed toUpstairs::set_disabled
and now sets the upstairs state to a newUpstairsState::Disabled
stateauto_promote: bool
is removed from the negotiation state, because we now only depend on the upstairs stateI still think the semantics of the
UpstairsState::Disabled
are fuzzy and could use some ironing out, but this PR is meant to be a step in the right direction.For example, going straight to
Initializing
without shutting down the other Downstairs seems bad! Once the upstairs is inInitializing
, it will accept aGoActive
request, which will hit this panic on the other Downstairs. This issue remains true after the PR (although the Upstairs will be inDisabled
instead).