test failed in CI: test_instance_failed_by_stop_request_does_not_reincarnate #8178

Open
hawkw opened this issue May 16, 2025 · 4 comments

hawkw commented May 16, 2025

This test failed on a CI run on pull request #8174:

https://github.com/oxidecomputer/omicron/pull/8174/checks?check_run_id=42323677841

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01JVB1PD2FS2TGNJ8YWS5PJWKX/ehxX0N3vFukjWLFqMtKuXAvVc02rU1FAD1MPIHUXJrAteusu/01JVB1Q0B25QFZ2F0ZSSNR1559#S7266

Excerpt from the log showing the failure:

        FAIL [ 192.386s] omicron-nexus::test_all integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate
──── STDOUT:             omicron-nexus::test_all integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate

running 1 test
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: System }, version: Some(RouterVersion { router_id: 83e54225-6e48-443e-90e1-82e7fa9e3b36, version: 3 }), routes: {ResolvedVpcRoute { dest: V6(Ipv6Net { addr: ::, width: 0 }), target: InternetGateway(Instance(54e26dfe-249f-4bd3-a21f-0461c92e8a9f)) }, ResolvedVpcRoute { dest: V4(Ipv4Net { addr: 172.30.0.0, width: 22 }), target: VpcSubnet(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, ResolvedVpcRoute { dest: V4(Ipv4Net { addr: 0.0.0.0, width: 0 }), target: InternetGateway(Instance(54e26dfe-249f-4bd3-a21f-0461c92e8a9f)) }, ResolvedVpcRoute { dest: V6(Ipv6Net { addr: fd7a:5945:4de6::, width: 64 }), target: VpcSubnet(V6(Ipv6Net { addr: fd7a:5945:4de6::, width: 64 })) }} }
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
test integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate has been running for over 60 seconds
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
sled b6d65341-167c-41df-9b5c-41cded99c229 successfully installed routes ResolvedVpcRouteSet { id: RouterId { vni: Vni(16322539), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }, version: None, routes: {} }
test integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate ... FAILED

failures:

failures:
    integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 438 filtered out; finished in 192.00s

──── STDERR:             omicron-nexus::test_all integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate
log file: /var/tmp/omicron_tmp/test_all-94731d9e6db289ca-test_instance_failed_by_stop_request_does_not_reincarnate.23960.0.log
note: configured to log to "/var/tmp/omicron_tmp/test_all-94731d9e6db289ca-test_instance_failed_by_stop_request_does_not_reincarnate.23960.0.log"
DB URL: postgresql://root@[::1]:37220/omicron?sslmode=disable
DB address: [::1]:37220
log file: /var/tmp/omicron_tmp/test_all-94731d9e6db289ca-test_instance_failed_by_stop_request_does_not_reincarnate.23960.2.log
note: configured to log to "/var/tmp/omicron_tmp/test_all-94731d9e6db289ca-test_instance_failed_by_stop_request_does_not_reincarnate.23960.2.log"
[nexus/tests/integration_tests/instances.rs:1439:23] make_forgotten_instance(&cptestctx, instance_name,
InstanceAutoRestartPolicy::BestEffort,).await = a642b2b8-502d-4761-9190-01f5776645d3 (instance)
[nexus/tests/integration_tests/instances.rs:1458:5] instance_wait_for_state(client, instance_id, InstanceState::Failed).await = Instance {
    identity: IdentityMetadata {
        id: a642b2b8-502d-4761-9190-01f5776645d3,
        name: Name(
            "resurgam",
        ),
        description: "instance \"resurgam\"",
        time_created: 2025-05-15T23:55:01.319179Z,
        time_modified: 2025-05-15T23:55:09.802162Z,
    },
    project_id: 354efc7e-8c45-4d5e-b952-752f01b33dba,
    ncpus: InstanceCpuCount(
        4,
    ),
    memory: ByteCount(
        1073741824,
    ),
    hostname: "the-host",
    boot_disk_id: None,
    runtime: InstanceRuntimeState {
        run_state: Failed,
        time_run_state_updated: 2025-05-15T23:55:10.163327Z,
        time_last_auto_restarted: None,
    },
    auto_restart_status: InstanceAutoRestartStatus {
        enabled: true,
        policy: Some(
            BestEffort,
        ),
        cooldown_expiration: None,
    },
}
[nexus/tests/integration_tests/instances.rs:1464:5] nexus_test_utils::background::activate_background_task(&cptestctx.internal_client,
"instance_reincarnation",).await = BackgroundTask {
    current: Idle,
    description: "schedules start sagas for failed instances that can be automatically restarted",
    last: Completed(
        LastResultCompleted {
            details: Object {
                "changed_state": Array [],
                "disabled": Bool(false),
                "errors": Array [],
                "instances_found": Object {
                    "Failed": Number(0),
                    "SagaUnwound": Number(0),
                },
                "instances_reincarnated": Array [],
                "restart_errors": Array [],
            },
            elapsed: Duration {
                nanos: 131868255,
                secs: 0,
            },
            iteration: 2,
            reason: Signaled,
            start_time: 2025-05-15T23:55:11.040059501Z,
        },
    ),
    name: "instance_reincarnation",
    period: Duration {
        nanos: 0,
        secs: 600,
    },
}
[nexus/tests/integration_tests/instances.rs:1491:5] expect_instance_start_ok(client, instance_name).await = ()

thread 'integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate' panicked at nexus/tests/integration_tests/instances.rs:6913:19:
instance a642b2b8-502d-4761-9190-01f5776645d3 did not transition to Starting after 120s: timed out after 120.762367471s
stack backtrace:
   0: rust_begin_unwind
             at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/panicking.rs:695:5
   1: core::panicking::panic_fmt
             at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/panicking.rs:75:14
   2: {async_fn#0}
             at ./tests/integration_tests/instances.rs:6913:19
   3: {async_fn#0}
             at ./tests/integration_tests/instances.rs:6864:6
   4: {async_fn#0}
             at ./tests/integration_tests/instances.rs:1527:14
   5: {async_block#0}
             at ./tests/integration_tests/instances.rs:1430:1
   6: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/future/future.rs:124:9
   7: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/future/future.rs:124:9
   8: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:729:57
   9: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/coop.rs:107:5
  10: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/coop.rs:73:5
  11: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:729:25
  12: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:428:19
  13: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:728:36
  14: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:807:68
  15: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/context/scoped.rs:40:9
  16: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/context.rs:180:26
  17: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/thread/local.rs:310:12
  18: std::thread::local::LocalKey<T>::with
             at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/thread/local.rs:274:15
  19: tokio::runtime::context::set_scheduler
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/context.rs:180:9
  20: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:807:27
  21: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:716:19
  22: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:196:28
  23: tokio::runtime::context::runtime::enter_runtime
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/context/runtime.rs:65:16
  24: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/scheduler/current_thread/mod.rs:184:9
  25: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/runtime.rs:368:47
  26: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.1/src/runtime/runtime.rs:342:13
  27: test_instance_failed_by_stop_request_does_not_reincarnate
             at ./tests/integration_tests/instances.rs:1430:1
  28: test_all::integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate::{{closure}}
             at ./tests/integration_tests/instances.rs:1433:2
  29: core::ops::function::FnOnce::call_once
             at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/ops/function.rs:250:5
  30: core::ops::function::FnOnce::call_once
             at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/var/tmp/omicron_tmp/.tmphUsj2Q"
	If you would like to access the database for debugging, run the following:

	# Run the database
	cargo xtask db-dev run --no-populate --store-dir "/var/tmp/omicron_tmp/.tmphUsj2Q/data"
	# Access the database. Note the port may change if you run multiple databases.
	cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouse process without cleaning it up first (there may still be a child process running (PID 23967) and a temporary directory leaked, /var/tmp/omicron_tmp/test_all-94731d9e6db289ca-test_instance_failed_by_stop_request_does_not_reincarnate.23960.1-clickhouse-3UYIoZ)
failed to clean up ClickHouse data dir:
- /var/tmp/omicron_tmp/test_all-94731d9e6db289ca-test_instance_failed_by_stop_request_does_not_reincarnate.23960.1-clickhouse-3UYIoZ: File exists (os error 17)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpHd7ihX
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpwdfnTW

@hawkw hawkw added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label May 16, 2025
@hawkw hawkw self-assigned this May 16, 2025

hawkw commented May 16, 2025

@iximeow made some changes to this test in PR #8152 that we had hoped would fix this flake (see #8119), but it looks like it's still flaky. Bleh!


hawkw commented May 16, 2025

PR #8152 changed the part of the test that expects the instance to transition to Starting when it's moved to Failed by something other than a stop request, removing the explicit activation of the instance_reincarnation background task and the instance_wait_for_state(..., Failed) calls: https://github.com/oxidecomputer/omicron/pull/8152/files#diff-8cbc06849f2b6c581df600a3f06e0687c983a1be354633bfa675441410ed0387L1518-L1531

I think @iximeow was correct to remove the instance_wait_for_state(..., Failed) call here, as it's possible to miss the transition to Failed when the reincarnation task activates immediately. But I wonder if the problem is that we can now sometimes miss the instance_reincarnation task activation --- perhaps the task was activated by the transition to Failed while it was already running and had already done its database query, so that activation accomplishes nothing and a subsequent one would be required to actually reincarnate the instance.

If I'm correct, we should put back the explicit task activation there. Sometimes, it will do nothing, but in some cases it may be necessary to ensure the task runs. Not sure if this is the case though.
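For illustration, a minimal sketch of what putting the activation back might look like, using the same nexus_test_utils helper that already appears in the test's dbg! output above (this assumes the test's cptestctx is in scope, and elides the surrounding assertions):

    // Sketch only: explicitly poke the reincarnation task after the instance
    // has gone to Failed, rather than relying on the saga-triggered activation
    // landing at the right moment.
    let task = nexus_test_utils::background::activate_background_task(
        &cptestctx.internal_client,
        "instance_reincarnation",
    )
    .await;
    // The activation may legitimately find nothing to do if an earlier run
    // already handled the instance, so don't assert on the details here.
    dbg!(task);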


hawkw commented May 16, 2025

Looking at the logs, this appears to be the last activation of the instance_reincarnation task:

2025-05-15T23:55:53.029Z	INFO	test_instance_failed_by_stop_request_does_not_reincarnate (BackgroundTasks): instance reincarnation completed successfully
    background_task = instance_reincarnation
    instances_changed_state = 0
    instances_found = 0
    instances_reincarnated = 0

If we peek a couple log lines prior, we see that an instance_update saga is running the FetchStateAndStartRealSaga node:

12618	2025-05-15T23:55:53.026Z	DEBG	test_instance_failed_by_stop_request_does_not_reincarnate (SagaExecutor): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    roles = RoleSet { roles: {} }
    saga_id = 72b6196d-8d63-4c4e-b2c7-1ad4f25b021e
    saga_name = start-instance-update
    saga_node = FetchStateAndStartRealSaga
12619	2025-05-15T23:55:53.027Z	DEBG	test_instance_failed_by_stop_request_does_not_reincarnate (SagaExecutor): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    resource = Database
    result = Ok(())
    saga_id = 72b6196d-8d63-4c4e-b2c7-1ad4f25b021e
    saga_name = start-instance-update
    saga_node = FetchStateAndStartRealSaga
12620	2025-05-15T23:55:53.029Z	TRAC	test_instance_failed_by_stop_request_does_not_reincarnate (BackgroundTasks): no more instances in need of reincarnation
    background_task = instance_reincarnation
    reincarnation_reason = start saga failed
    total_found = 0
12621	2025-05-15T23:55:53.029Z	INFO	test_instance_failed_by_stop_request_does_not_reincarnate (BackgroundTasks): instance reincarnation completed successfully
    background_task = instance_reincarnation
    instances_changed_state = 0
    instances_found = 0
    instances_reincarnated = 0

So it seems we have somehow arrived in a situation where the instance_reincarnation task activates before the instance's transition to Failed has been processed. Tracing the activation of the reincarnation task, we see this line:

12598	2025-05-15T23:55:52.984Z	INFO	test_instance_failed_by_stop_request_does_not_reincarnate (SagaExecutor): instance update: instance transitioned to Failed, but can be automatically restarted; activating reincarnation.
    auto_restart_config = InstanceAutoRestart { policy: Some(BestEffort), cooldown: None }
    instance_id = a642b2b8-502d-4761-9190-01f5776645d3
    intended_state = running
    runtime_state = InstanceRuntimeState { time_updated: 2025-05-15T23:55:49.382820Z, gen: Generation(Generation(6)), propolis_id: None, dst_propolis_id: None, migration_id: None, nexus_state: Failed, time_last_auto_restarted: None }
    saga_id = 0c70175f-275f-49c0-9851-6fe2656abe3c
    saga_name = instance-update
12599	2025-05-15T23:55:52.984Z	DEBG	test_instance_failed_by_stop_request_does_not_reincarnate (BackgroundTasks): activating
    background_task = instance_reincarnation
    iteration = 3
    reason = Signaled

This indicates that the task was activated by a completing update saga, which is ... curious. It looks like we attempt to chain into a successor update saga here, which does nothing:

12641	2025-05-15T23:55:53.073Z	INFO	test_instance_failed_by_stop_request_does_not_reincarnate (SagaExecutor): instance update: no updates required, releasing lock.
    current.active_vmm = None
    current.migration = None
    current.runtime_state = InstanceRuntimeState { time_updated: 2025-05-15T23:55:49.382820Z, gen: Generation(Generation(6)), propolis_id: None, dst_propolis_id: None, migration_id: None, nexus_state: Failed, time_last_auto_restarted: None }
    current.target_vmm = None
    instance_id = a642b2b8-502d-4761-9190-01f5776645d3
    saga_id = 72b6196d-8d63-4c4e-b2c7-1ad4f25b021e
    saga_name = start-instance-update

I think this might actually be a bug in the Nexus code itself rather than in the test, although other mechanisms prevent it from being a problem in production. It seems like we will activate the instance_reincarnation background task from update sagas that will also start a "successor" update saga. But the background task does nothing in that case, because the successor update saga is still holding the updater lock on the instance record, and the find_reincarnatable_instances query that drives the reincarnation task explicitly excludes instances locked by an update saga:

// If the instance is currently in the process of being updated,
// let's not mess with it for now and try to restart it on another
// pass.
.and(dsl::updater_id.is_null())

In a production system, this doesn't cause any real problems, because the instance_reincarnation task also activates periodically --- so it will eventually run again after the no-op child update saga has dropped the lock, and the instance will reincarnate. In tests, however, we set its activation period to 600 seconds, because the tests rely on explicit activations of the task finding every eligible instance, and we don't want a periodic activation interfering with that:

# How frequently to attempt to restart Failed instances?
#
# The default activation period for this task is once every sixty seconds.
# However, the task's unit tests rely on explicit activations of the task
# finding all instances eligible for reincarnation. Thus, we don't want the
# periodic activation to occur whilst running those tests, since it may
# reincarnate instances that we expect an explicit activation to reincarnate,
# removing them from the set of instances eligible for reincarnation. Thus, set
# the period much longer than the default for test purposes.
instance_reincarnation.period_secs = 600

Therefore, in the test, the task is not activated again before the timeout.

We should probably fix the update saga's activation of the reincarnation task to avoid these no-op activations. The task shouldn't be activated until the lock is released --- which means that, if we spawn a child update saga after an update that would like to activate the reincarnation task, it should be the child saga that activates the task, and only after the lock is dropped. This is a bit more complex than the current approach, but it ensures that we only activate the reincarnation task when it can actually reincarnate the instance.

We probably could also reduce the flakiness of the test by putting the explicit activation of the task back in, but this wouldn't totally fix the flake, as the explicit activation might also occur while the child update saga still has the lock...


hawkw commented May 16, 2025

Hmm, actually, it looks like the saga node that would spawn a child saga only activates the instance_reincarnation task if it doesn't start a child saga, which is the behavior I thought was correct:

if let Some(update) = UpdatesRequired::for_instance(log, &new_state) {
    debug!(
        log,
        "instance update: additional updates required, preparing a \
         successor update saga...";
        "instance_id" => %instance_id,
        "update.new_runtime_state" => ?update.new_runtime,
        "update.network_config_update" => ?update.network_config,
        "update.destroy_active_vmm" => ?update.destroy_active_vmm,
        "update.destroy_target_vmm" => ?update.destroy_target_vmm,
        "update.deprovision" => ?update.deprovision,
    );
    let saga_dag = SagaInstanceUpdate::prepare(&Params {
        serialized_authn,
        authz_instance,
    })
    .context("failed to build new update saga DAG")?;
    let saga = osagactx
        .nexus()
        .sagas
        .saga_prepare(saga_dag)
        .await
        .context("failed to prepare new update saga")?;
    // N.B. that we don't wait for the successor update saga to *complete*
    // here. We just want to make sure it starts.
    saga.start()
        .await
        .context("failed to start successor update saga")?;
    info!(
        log,
        "instance update: successor update saga started!";
        "instance_id" => %instance_id,
    );
} else {
    // If the instance has transitioned to the `Failed` state and no
    // additional update saga is required, check if the instance's
    // auto-restart policy allows it to be automatically restarted. If
    // it does, activate the instance-reincarnation background task to
    // automatically restart it.
    let karmic_state = new_state
        .instance
        .auto_restart_status(new_state.active_vmm.as_ref());
    if karmic_state.should_reincarnate() {
        info!(
            log,
            "instance update: instance transitioned to Failed, \
             but can be automatically restarted; activating \
             reincarnation.";
            "instance_id" => %instance_id,
            "auto_restart_config" => ?new_state.instance.auto_restart,
            "runtime_state" => ?new_state.instance.runtime_state,
            "intended_state" => %new_state.instance.intended_state,
        );
        nexus.background_tasks.task_instance_reincarnation.activate();

That suggests that the reincarnation task wasn't actually activated by the saga that spawned the no-op child --- it must have been activated from someplace else. I gotta look at the logs a bit more.

I do still think that a no-op child saga should check if the instance needs to reincarnate and poke the task again, though. But I wanna make sure I totally understand what's happening.
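If it does turn out that the no-op child saga should do this, here's a rough sketch of what its "no updates required, releasing lock" path might look like. Only the auto_restart_status / should_reincarnate / activate calls are taken from the excerpt above; release_updater_lock is a hypothetical stand-in for however the child saga actually drops the lock, not a real function:

    // Sketch only, not the actual saga code: once the child saga determines
    // that no further updates are needed, release the updater lock *first*,
    // then check whether the instance still wants to reincarnate and poke the
    // background task.
    release_updater_lock(&osagactx, &authz_instance).await?;

    let karmic_state = new_state
        .instance
        .auto_restart_status(new_state.active_vmm.as_ref());
    if karmic_state.should_reincarnate() {
        // With the lock released, the reincarnation task's query can now see
        // this instance, so the activation is no longer a no-op.
        nexus.background_tasks.task_instance_reincarnation.activate();
    }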
