test failed in CI: test_instance_failed_by_stop_request_does_not_reincarnate #8178
Comments
The change in #8152 changed the part of the test that expects the instance to transition to I think @iximeow was correct to remove the If I'm correct, we should put back the explicit task activation there. Sometimes, it will do nothing, but in some cases it may be necessary to ensure the task runs. Not sure if this is the case though.
Looking at the logs, it looks like this is the last activation of the
If we peek a couple log lines prior, we see that an
So we have somehow arrived in a situation where the
This indicates that the task was activated by a completing update saga, which is ... curious. It looks like we attempt to chain into a successor update saga here, which does nothing:
I think this might actually be a bug in the Nexus code itself, rather than in the test, although other mechanisms prevent it from being a problem in production. It seems like we will activate the
omicron/nexus/db-model/src/instance.rs Lines 475 to 478 in 0313f70
In a production system, this doesn't cause any real problems, because the
omicron/nexus/tests/config.test.toml Lines 144 to 153 in 0313f70
Therefore, in the test, the task is not activated again before the timeout.

We should probably fix the update saga's activation of the reincarnation task to avoid causing these no-op activations. The task shouldn't be activated until the lock is released, which means that if we spawn a child update saga after an update that would like to activate the reincarnation task, it should be the child saga that activates the task, and only after the lock is dropped. This is a bit more complex than the current approach, but it will ensure that we only activate the reincarnation task when it can actually reincarnate the instance.

We could probably also reduce the flakiness of the test by putting the explicit activation of the task back in, but this wouldn't totally fix the flake, as the explicit activation might also occur while the child update saga still has the lock...
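To make the ordering problem concrete, here is a toy model of it. This is only a sketch, not the Nexus code: the `Instance` struct, the `updater_locked` flag, and `activate_reincarnation` are all hypothetical stand-ins, and it assumes the reincarnation task skips any instance whose updater lock is still held.

```rust
/// Hypothetical, simplified instance states (not the real Nexus types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Failed,
    Running,
}

/// Hypothetical instance record for illustration only.
struct Instance {
    state: State,
    /// True while some update saga holds the updater lock.
    updater_locked: bool,
}

/// Toy reincarnation task activation. Assumption: the task does nothing
/// for an instance that is locked or not failed. Returns whether the
/// instance was actually restarted.
fn activate_reincarnation(instance: &mut Instance) -> bool {
    if instance.updater_locked || instance.state != State::Failed {
        return false; // no-op activation
    }
    instance.state = State::Running;
    true
}

/// Suspected flaky ordering: the task is activated while a child update
/// saga still holds the lock, and nothing activates it again afterwards.
fn activate_before_unlock() -> bool {
    let mut instance = Instance { state: State::Failed, updater_locked: true };
    let restarted = activate_reincarnation(&mut instance);
    instance.updater_locked = false; // the child saga finishes later
    restarted
}

/// Proposed ordering: the child saga drops the lock first, then activates.
fn activate_after_unlock() -> bool {
    let mut instance = Instance { state: State::Failed, updater_locked: true };
    instance.updater_locked = false;
    activate_reincarnation(&mut instance)
}

fn main() {
    println!("activated before unlock, restarted: {}", activate_before_unlock());
    println!("activated after unlock, restarted: {}", activate_after_unlock());
}
```

In this model the first ordering leaves the instance failed until the next periodic activation, which matches the behavior described above when the task's period is longer than the test timeout.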
Hmm, actually, it looks like the saga node that spawns a child saga does only activate the
omicron/nexus/src/app/sagas/instance_update/mod.rs Lines 1276 to 1329 in c179ea6
That suggests that it wasn't actually the saga that spawned the no-op child that activated the reincarnation task; it must have been activated from someplace else. I gotta look at the logs a bit more.

I do still think that a no-op child saga should check if the instance needs to reincarnate and poke the task again, though. But I wanna make sure I totally understand what's happening.
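A rough sketch of what that final check in a no-op child saga might look like. Everything here is hypothetical: `InstanceSnapshot`, `needs_reincarnation`, `ReincarnationTask`, and `finish_child_saga` are illustrative names rather than Nexus APIs, and the eligibility rule is an assumption made for the example.

```rust
/// Hypothetical view of the instance as seen by the saga after it has
/// released the updater lock.
struct InstanceSnapshot {
    failed: bool,
    auto_restart_allowed: bool,
}

/// Hypothetical handle to the background task; it only counts activations.
#[derive(Default)]
struct ReincarnationTask {
    activations: u32,
}

impl ReincarnationTask {
    fn activate(&mut self) {
        self.activations += 1;
    }
}

/// Assumed eligibility rule, for illustration only.
fn needs_reincarnation(snapshot: &InstanceSnapshot) -> bool {
    snapshot.failed && snapshot.auto_restart_allowed
}

/// Last step of a child update saga, run after the lock is dropped.
/// Even if the saga itself had nothing to do, it pokes the task when the
/// instance still looks like it should be restarted.
fn finish_child_saga(snapshot: &InstanceSnapshot, task: &mut ReincarnationTask) {
    if needs_reincarnation(snapshot) {
        task.activate();
    }
}

fn main() {
    let mut task = ReincarnationTask::default();
    let snapshot = InstanceSnapshot { failed: true, auto_restart_allowed: true };
    finish_child_saga(&snapshot, &mut task);
    println!("activations: {}", task.activations);
}
```

The point of putting the check at the end of the child is that the poke happens after the lock is gone, so the activation is not wasted.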
This test failed on a CI run on pull request #8174:
https://github.com/oxidecomputer/omicron/pull/8174/checks?check_run_id=42323677841
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01JVB1PD2FS2TGNJ8YWS5PJWKX/ehxX0N3vFukjWLFqMtKuXAvVc02rU1FAD1MPIHUXJrAteusu/01JVB1Q0B25QFZ2F0ZSSNR1559#S7266
Excerpt from the log showing the failure: