[7/n] [installinator] write out zone hashes to mupdate-override.json #8155

Open: wants to merge 3 commits into main
Conversation

sunshowers (Contributor) commented May 14, 2025:

Part of RFD 556. In upcoming work, sled-agent will check these hashes at boot time and report an error if there's a mismatch.

Created using spr 1.3.6-beta.1
sunshowers requested a review from jgallagher on May 14, 2025 03:56.
/// Computes the zone hash IDs.
///
/// Hash computation is done in parallel on blocking tasks. If a task panics
/// (should not happen in normal use), a `JoinError` is returned.
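The doc comment above describes fanning hash computation out across blocking tasks. A minimal std-only sketch of the same shape, with `std::thread::scope` standing in for tokio's blocking-task pool and a toy FNV-1a hash standing in for the real zone digest (both stand-ins are assumptions for illustration, not the PR's actual code):

```rust
use std::thread;

/// Toy FNV-1a hash, standing in for the real zone digest.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hash each named blob on its own thread, collecting (name, hash) pairs.
/// With tokio, each closure would instead be a spawn_blocking task and the
/// join would be an await point that can surface a JoinError.
fn hash_all(files: &[(&str, &[u8])]) -> Vec<(String, u64)> {
    thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|(name, data)| s.spawn(move || (name.to_string(), fnv1a(data))))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("hash task panicked"))
            .collect()
    })
}

fn main() {
    let out = hash_all(&[("zone-a", b"hello".as_slice()), ("zone-b", b"world".as_slice())]);
    for (name, hash) in &out {
        println!("{name}: {hash:#x}");
    }
}
```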
Contributor:
Do we need to return an error at all? I think we've pretty liberally unwrapped tokio::spawn() await points where there's obviously no cancellation possible, because that means

(a) in prod we would have aborted due to the task panic anyway
(b) in dev we only propagate other panics

right? This seems pretty similar.

sunshowers (Contributor Author):

A JoinError can also happen due to a runtime shutdown, not just a panic. Since installinator is pretty autonomous I wanted to make sure that errors were reported up to wicket. wdyt?

sunshowers (Contributor Author):

I guess something I could do is to check whether the JoinError is a panic, and if so, panic as you suggested. If it's a cancellation then report that.
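The policy suggested here (re-raise panics, report only cancellation) can be sketched without tokio. The `TaskOutcome` enum below is a hand-rolled stand-in for tokio's `JoinError`; real code would instead match on `JoinError::try_into_panic()` / `is_cancelled()`:

```rust
use std::any::Any;
use std::panic;

/// Stand-in for the outcomes of awaiting a tokio task. (Tokio's real
/// JoinError exposes try_into_panic() and is_cancelled() for this.)
enum TaskOutcome<T> {
    Ok(T),
    Panicked(Box<dyn Any + Send>),
    Cancelled,
}

/// Re-raise a task panic (we'd have aborted anyway), but report
/// cancellation, e.g. runtime shutdown, as an ordinary error the caller
/// can surface to wicket.
fn unwrap_or_report<T>(outcome: TaskOutcome<T>) -> Result<T, String> {
    match outcome {
        TaskOutcome::Ok(v) => Ok(v),
        // Propagate the original panic payload.
        TaskOutcome::Panicked(payload) => panic::resume_unwind(payload),
        // Cancellation becomes a reportable error instead of dead air.
        TaskOutcome::Cancelled => {
            Err("task cancelled: runtime shutting down".to_string())
        }
    }
}

fn main() {
    match unwrap_or_report(TaskOutcome::Ok("zone hashes written")) {
        Ok(msg) => println!("{msg}"),
        Err(err) => eprintln!("error: {err}"),
    }
}
```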

Contributor:

Ehhh, what would cause the runtime to shut down, in practice? We're running under a #[tokio::main], right?

Contributor:

Oh, if we have tests that exercise this and we might hit the whole "runtime cancels tasks in a random order when test ends" thing, then maybe we don't want to panic here. But I don't know that it's worth doing work to bubble any kind of error up to wicket for a case that's only possible in tests.

sunshowers (Contributor Author):

Do you feel like it's a lot of work? I'm just generally concerned with dead air from a hard-to-introspect service I guess.

Contributor:

No, it's not a lot of work. But handling this error in this way is pretty different from how we handle tokio JoinErrors in general. E.g., even from within installinator:

// ...and also attempt to flush the disk write cache
tokio::task::spawn_blocking(move || {
    match dkio::flush_write_cache(f.as_raw_fd()) {
        Ok(()) => Ok(()),
        // Some drives don't support `flush_write_cache`; we don't want
        // to fail in this case.
        Err(err) if err.raw_os_error() == Some(libc::ENOTSUP) => Ok(()),
        Err(err) => Err(err),
    }
})
.await
.expect("task panicked")

I don't think handling this error is wrong. I just think it's weird when we don't bother almost anywhere else, and it's unfortunate to make an otherwise-infallible function return a Result.

let file_name = file_name.clone();
// data is a Bytes so is cheap to clone.
let data: Bytes = data.clone();
// Compute hashes in parallel.
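The comment above notes that `Bytes` is cheap to clone: like an `Arc<[u8]>`, cloning copies a pointer and bumps a refcount rather than copying the payload. A std-only illustration, with `Arc<[u8]>` as an assumed stand-in for `bytes::Bytes`:

```rust
use std::sync::Arc;

fn main() {
    // A large buffer behind a refcount, analogous to bytes::Bytes.
    let data: Arc<[u8]> = vec![0u8; 1024 * 1024].into();

    // "Cloning" is a refcount bump, not a megabyte memcpy.
    let for_hashing = Arc::clone(&data);

    // Both handles point at the same allocation.
    assert!(Arc::ptr_eq(&data, &for_hashing));
    println!("refcount = {}", Arc::strong_count(&data));
}
```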
Contributor:

Should this use JoinSet::spawn_blocking(), since this work is synchronous and compute-bound? (I assume collect() does the equivalent of JoinSet::spawn()?)

sunshowers (Contributor Author):

It should, I started off with that but got too clever by half, whoops. Fixed.
