Add explicit UpstairsState::Disabled #1721

Open
wants to merge 1 commit into base: main

Conversation

@mkeeter (Contributor) commented May 22, 2025

There's a weird phantom state that's been hiding in the Upstairs state machine; this PR makes it explicit.

In the handler for YouAreNoLongerActive and on_uuid_mismatch, the Upstairs performs a peculiar ritual:

        // Restart the state machine for this downstairs client
        self.downstairs.clients[client_id].disable(&self.state);
        self.set_inactive(CrucibleError::UuidMismatch);

The effects here are somewhat confusing:

  • Calling DownstairsClient::disable stops a client with ClientStopReason::Disabled. This stop reason is a special case – it means that the client does not try to reconnect when reinitialized. In all other cases, whether the client connects depends on the upstairs state alone.
  • Upstairs::set_inactive sets the upstairs state to Initializing. It does not do anything else – for example, it doesn't try to stop the other Downstairs.

The end result is that the problematic client is restarted, and does not connect to the Downstairs.
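To make this concrete, here is a minimal, self-contained sketch of the reconnect decision described above. The types are simplified stand-ins rather than the real Crucible definitions, and the non-Disabled branch is reduced to a placeholder.

    // Simplified stand-ins for the real Crucible types; illustration only.
    #[allow(dead_code)]
    #[derive(Debug)]
    enum UpstairsState {
        Initializing,
        Active,
    }

    #[allow(dead_code)]
    #[derive(Debug)]
    enum ClientStopReason {
        /// The special case, set by `DownstairsClient::disable`.
        Disabled,
        /// Stand-in for every other stop reason.
        Other,
    }

    /// Should a freshly reinitialized client try to reconnect?
    fn should_reconnect(
        stop_reason: Option<&ClientStopReason>,
        up_state: &UpstairsState,
    ) -> bool {
        match stop_reason {
            // `Disabled` suppresses reconnection outright, regardless of the
            // upstairs state.
            Some(ClientStopReason::Disabled) => false,
            // Every other case defers to the upstairs state alone (the real
            // rule is elided here and simplified to "reconnect").
            _ => {
                let _ = up_state;
                true
            }
        }
    }

    fn main() {
        // The client stopped via `disable()` never reconnects, even though
        // `set_inactive()` put the upstairs back in `Initializing`.
        assert!(!should_reconnect(
            Some(&ClientStopReason::Disabled),
            &UpstairsState::Initializing,
        ));
        // The other two clients restart and reconnect as usual.
        assert!(should_reconnect(
            Some(&ClientStopReason::Other),
            &UpstairsState::Initializing,
        ));
    }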

As we rethink the state machine for RFD 542, I'd like to remove this special case. In this PR:

  • Upstairs::set_inactive is renamed to Upstairs::set_disabled and now sets the upstairs state to a new UpstairsState::Disabled state
  • auto_promote: bool is removed from the negotiation state, because we now only depend on the upstairs state

I still think the semantics of UpstairsState::Disabled are fuzzy and could use some ironing out, but this PR is meant to be a step in the right direction.

For example, going straight to Initializing without shutting down the other Downstairs seems bad! Once the upstairs is in Initializing, it will accept a GoActive request, which will hit this panic on the other Downstairs. This issue remains true after the PR (although the Upstairs will be in Disabled instead).
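For reference, here is a rough sketch of the shape of this change. Only UpstairsState::Disabled and set_disabled come from the PR; the other names, the variant list, and the error handling are simplified stand-ins.

    // Sketch only: variants other than `Disabled` are abbreviated, and how
    // the triggering error is surfaced to a pending activation is elided.
    #[allow(dead_code)]
    #[derive(Debug)]
    enum CrucibleError {
        UuidMismatch,
    }

    #[allow(dead_code)]
    #[derive(Debug)]
    enum UpstairsState {
        Initializing,
        Active,
        // ...other variants elided...
        /// New in this PR: the upstairs has been told to stand down, and it
        /// parks here instead of dropping back to `Initializing`.
        Disabled,
    }

    struct Upstairs {
        state: UpstairsState,
    }

    impl Upstairs {
        /// Previously `set_inactive`, which reset the state to `Initializing`.
        fn set_disabled(&mut self, err: CrucibleError) {
            eprintln!("disabling upstairs: {err:?}");
            self.state = UpstairsState::Disabled;
        }
    }

    fn main() {
        let mut up = Upstairs { state: UpstairsState::Active };
        up.set_disabled(CrucibleError::UuidMismatch);
        assert!(matches!(up.state, UpstairsState::Disabled));
    }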

@leftwo (Contributor) commented May 22, 2025

The end result is that the problematic client is restarted, and does not connect to the Downstairs.

So, I believe this path was to handle what happens when we have a bad set of targets for the downstairs.
We don't believe the downstairs we have connected to is correct, so we don't want to keep trying to connect to it, but we also wanted to keep the upstairs running to allow for a downstairs replacement to come in and "fix" it.

At least that was the idea, and it seemed better than just panicking the upstairs.

We did actually hit this scenario where a ROP had finished scrubbing but was still "attached" to the downstairs (a bug that is fixed). The actual ROP was eventually deleted, then the port numbers were re-used, and this long running upstairs tried to reconnect (as the new downstairs came online).

The UUID mismatch is one check, but if a downstairs has different region info, that would (should) take the same path here and result in the same end state, whatever we decide that state should be.

My feeling is this condition means something has gone terribly wrong somewhere, and I do think it's better to just hang and require operator intervention instead of either moving forward or panicking. Does that seem like the right idea?

@mkeeter (Contributor, Author) commented May 22, 2025

My feeling is this condition means something has gone terribly wrong somewhere, and I do think it's better to just hang and require operator intervention instead of either moving forward or panicking. Does that seem like the right idea?

Yup, that sounds reasonable to me!

Do you think having an UpstairsState::Disabled to represent this "terribly wrong" state makes sense? If so, that suggests it should not respond to GoActive requests. I'm not sure how we want to recover; we could add a new API that the agent could hit? We probably don't want to require a full restart, since the upstairs is attached to the Propolis VM.
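If we went that route, a guard could look roughly like the hypothetical sketch below; the method name, error variant, and surrounding types are placeholders for illustration, not the current API.

    // Hypothetical: reject a GoActive request while the upstairs is Disabled.
    // All names here are placeholders, not existing Crucible identifiers.
    #[allow(dead_code)]
    #[derive(Debug, PartialEq)]
    enum UpstairsState {
        Initializing,
        GoActive,
        Active,
        Disabled,
    }

    #[derive(Debug)]
    enum ActivateError {
        UpstairsDisabled,
    }

    struct Upstairs {
        state: UpstairsState,
    }

    impl Upstairs {
        /// Handle an activation request from the guest.
        fn go_active(&mut self) -> Result<(), ActivateError> {
            match self.state {
                // Something went terribly wrong earlier; hang and wait for
                // operator intervention instead of activating against
                // downstairs targets we no longer trust.
                UpstairsState::Disabled => Err(ActivateError::UpstairsDisabled),
                _ => {
                    self.state = UpstairsState::GoActive;
                    Ok(())
                }
            }
        }
    }

    fn main() {
        let mut up = Upstairs { state: UpstairsState::Disabled };
        assert!(up.go_active().is_err());

        let mut up = Upstairs { state: UpstairsState::Initializing };
        assert!(up.go_active().is_ok());
        assert_eq!(up.state, UpstairsState::GoActive);
    }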

@leftwo (Contributor) commented May 22, 2025

Do you think having an UpstairsState::Disabled to represent this "terribly wrong" state makes sense? If so, that suggests it should not respond to GoActive requests. I'm not sure how we want to recover; we could add a new API that the agent could hit? We probably don't want to require a full restart, since the upstairs is attached to the Propolis VM.

I don't love the name, but I'm also fine with it. I also don't have any other suggestions.
I do think that, if you find the upstairs in this state, then yeah it should not respond to GoActive requests.

Maybe the path forward would be to replace the Disabled downstairs from the control plane; we could do that without a VM restart. But, really, if we are in this state things are bad, and by blocking GoActive requests, we have essentially prevented the VM from booting (which is probably good, as we don't trust our downstairs targets).

@mkeeter (Contributor, Author) commented May 22, 2025

A few specific comments:

  • Disabled is not a Downstairs state; it's a newly-added variant in the UpstairsState enum, so it applies to the system as a whole (not to a specific downstairs). The problematic Downstairs will be hanging out in NegotiationState::Start, waiting for the connection one-shot to fire (which will never happen).
  • This could technically happen after boot – once we connect to a Downstairs, getting kicked out with YouAreNoLongerActive could happen at any time, which causes us to enter this state

One option would be to make this a downstairs state, and not have UpstairsState::Disabled.

This discussion has helped me clarify the vibes of what's going on here: certain failure modes mean that a downstairs should not try to reconnect, basically dropping into a fail-safe mode where it hangs out and waits for further instruction.
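As a rough stand-in for that "park and wait" behaviour (using a std channel instead of the real async one-shot, purely for illustration):

    // Stand-in for the fail-safe mode described above: the client task blocks
    // on a one-shot-style channel and only proceeds if someone fires it.
    // This uses a std channel for illustration; the real code is async.
    use std::sync::mpsc;
    use std::time::Duration;

    fn main() {
        let (connect_tx, connect_rx) = mpsc::channel::<()>();

        // A disabled client: nobody ever sends on `connect_tx`, so the client
        // just sits in its equivalent of `NegotiationState::Start`.
        match connect_rx.recv_timeout(Duration::from_millis(100)) {
            Ok(()) => println!("told to connect; start negotiation"),
            Err(_) => println!("still waiting for further instruction"),
        }

        // Keep the sender alive until here so the wait above is "never
        // fired" rather than "hung up".
        drop(connect_tx);
    }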

Follow-up questions:

  • What failures should be classified as "drop into fail-safe mode and stop trying to connect"?
  • What should happen to the upstairs if a downstairs triggers fail-safe mode?
    • What should it do with the other downstairs?
  • How should we recover from fail-safe mode?

I'm not sure if anyone knows the answers yet, but curious to hear everyone's thoughts.

@leftwo (Contributor) commented May 23, 2025

So, I'll also think about this overnight, but

Disabled is not a Downstairs state ..

Ah, okay. I think this is still okay. If one of our downstairs is wrong in a bad way, then, yeah, stopping the upstairs seems like our best choice of all the bad choices we have.

This could technically happen after boot ..

Yes, and if we do live migration, this is what would happen. In that situation we need to consider how we could handle an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

I'm still thinking about the follow-up questions, I'm going to sleep on those and see what the morning brings me :)

@jmpesp (Contributor) commented May 23, 2025

In that situation we need to consider how we could handle an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

In the case of live migration, the destination propolis would have received a new volume checkout, bumping the gen numbers. The source propolis' upstairs kinda needs to be scrapped after this - any downstairs replacement won't fix it.

@jmpesp (Contributor) commented May 23, 2025

Disabled may not carry enough information. We may want to separate out the different failure modes here:

  • CannotActivate when we see mismatched downstairs region info for example
  • KickedOut when we see YouAreNoLongerActive from any of the downstairs.

Both of those are kinda terminal states. Disabled is a bit vague, and I'm not sure we support the idea of Upstairs hanging around doing nothing, apart from before the GoActive is received?
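A sketch of what that split might look like; the variant names follow the suggestion above, but the payloads and helper are illustrative guesses rather than a concrete proposal.

    // Sketch of splitting the single `Disabled` state into more specific
    // terminal variants.  Payloads and helper are illustrative only.
    #[allow(dead_code)]
    #[derive(Debug)]
    enum UpstairsState {
        Initializing,
        Active,
        // ...other variants elided...
        /// Negotiation found something a reconnect can't fix, e.g. mismatched
        /// downstairs region info.
        CannotActivate { reason: String },
        /// A downstairs sent `YouAreNoLongerActive`: some other upstairs
        /// (e.g. a migration target) has taken over.
        KickedOut,
    }

    fn describe(state: &UpstairsState) -> &'static str {
        match state {
            UpstairsState::CannotActivate { .. } => {
                "refusing to activate; downstairs targets look wrong"
            }
            UpstairsState::KickedOut => {
                "another upstairs took over; this one is done"
            }
            _ => "negotiating or running normally",
        }
    }

    fn main() {
        let s = UpstairsState::CannotActivate {
            reason: "block size mismatch".to_string(),
        };
        println!("{}", describe(&s));
        println!("{}", describe(&UpstairsState::KickedOut));
    }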

@leftwo (Contributor) commented May 23, 2025

In that situation we need to consider how we could handle an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

In the case of live migration, the destination propolis would have received a new volume checkout, bumping the gen numbers. The source propolis' upstairs kinda needs to be scrapped after this - any downstairs replacement won't fix it.

The source propolis would need a new volume checkout and a new activation if it wants to "take back" the downstairs. You are right that a downstairs replacement would not solve anything here :)

@leftwo (Contributor) commented May 23, 2025

Disabled may not carry enough information. We may want to separate out the different failure modes here:

  • CannotActivate when we see mismatched downstairs region info for example
  • KickedOut when we see YouAreNoLongerActive from any of the downstairs.

Both of those are kinda terminal states. Disabled is a bit vague, and I'm not sure we support the idea of Upstairs hanging around doing nothing, apart from before the GoActive is received?

We do have a BlockOp::Deactivate that will result in the Upstairs disconnecting from the downstairs and then just sitting there. However, we only use that in tests; Propolis never calls it. In theory the upstairs could self-deactivate and then sit there waiting to either be taken down or to get a new activation. This is also tricky, as a running upstairs has targets and will expect them to stay the same; the re-activation can only provide an updated generation number and not change anything else.

@leftwo (Contributor) commented May 23, 2025

Follow-up questions:

  • What failures should be classified as "drop into fail-safe mode and stop trying to connect"?

My current list of things that a reconnect to the same downstairs won't fix:

  • Generation number too low
  • Expected downstairs UUID mismatch
  • Any RegionInfo mismatch (block size, extents, etc.)
  • Failure to complete reconciliation (never actually seen this; not sure how it would even happen)
  • Incompatible Version (not yet, but maybe someday :) )
  • Encryption mismatch (not expected to ever happen in production)
  • Read-only mismatch
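Collected into code, that list might read roughly like this; the enum and variant names are invented for illustration and are not existing Crucible identifiers.

    // The list above as a hypothetical enum of "reconnecting to the same
    // downstairs won't help" reasons.  Names are invented for illustration.
    #[derive(Debug)]
    enum NoReconnectReason {
        GenerationTooLow,
        UuidMismatch,
        RegionInfoMismatch, // block size, extents, etc.
        ReconciliationFailed,
        IncompatibleVersion,
        EncryptionMismatch,
        ReadOnlyMismatch,
    }

    fn main() {
        // Each of these would drop the client into the fail-safe
        // ("stop trying to connect") mode discussed earlier.
        let terminal = [
            NoReconnectReason::GenerationTooLow,
            NoReconnectReason::UuidMismatch,
            NoReconnectReason::RegionInfoMismatch,
            NoReconnectReason::ReconciliationFailed,
            NoReconnectReason::IncompatibleVersion,
            NoReconnectReason::EncryptionMismatch,
            NoReconnectReason::ReadOnlyMismatch,
        ];
        for reason in &terminal {
            println!("fail-safe trigger: {reason:?}");
        }
    }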

And, if things were connected and the upstairs was activated and some other upstairs took over, the current upstairs would get kicked out, then attempt to reconnect (which I think could be okay), and then be denied because a higher generation number upstairs has connected.

  • What should happen to the upstairs if a downstairs triggers fail-safe mode?

    • What should it do with the other downstairs?

The negotiation phase should determine what happens. The only place (I believe) that a downstairs would trigger something would be if another upstairs took over.

So, what I think should happen is this.

  • If the upstairs is not yet active, then it gives up and does not activate. It can return an error if an activation request has been sent, and then it just sits there (at least for now).
  • How should we recover from fail-safe mode?

We include James' phone number in the error message.

Seriously, I guess it depends. If we fail on startup before activation, then either the VCR (crucible opts) are bad, or we have some rogue downstairs running on the wrong port. It would most likely be a bug in the software, so recovery may not be an option. My first thought is to hang forever, but I wonder if it would be better to panic and trust that we have left behind enough logs to sift through afterwards.

If we have already activated, and one of the downstairs pulls the plug, then I think the other two should keep going until either we get another downstairs that pulls the plug, or someone sends the upstairs a deactivation request. Once we have two downstairs that have opted out, the upstairs is going to hang on the next flush, though letting the final remaining downstairs get the writes and the flush that are in flight is probably the right thing to do here. And, given that we could be migrating, getting all the data out of the upstairs is what we want.
