[7/n] [installinator] write out zone hashes to mupdate-override.json #8155

Open: wants to merge 3 commits into main
Conversation

sunshowers (Contributor) commented May 14, 2025:

Part of RFD 556. In upcoming work, sled-agent will check these hashes at boot time and report an error if there's a mismatch.

Created using spr 1.3.6-beta.1
sunshowers requested a review from jgallagher on May 14, 2025 03:56.
/// Computes the zone hash IDs.
///
/// Hash computation is done in parallel on blocking tasks. If a task panics
/// (should not happen in normal use), a `JoinError` is returned.
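The doc comment above describes fanning hash computation out across blocking tasks. A minimal std-only sketch of the same shape, with `std::thread::scope` standing in for tokio's blocking-task pool and a toy FNV-1a hash standing in for the real zone digest (both stand-ins are assumptions for illustration, not the PR's actual code):

```rust
use std::thread;

/// Toy FNV-1a hash, standing in for the real zone digest.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hash each named blob on its own thread, collecting (name, hash) pairs.
/// With tokio, each closure would instead be a spawn_blocking task and the
/// join would be an await point that can surface a JoinError.
fn hash_all(files: &[(&str, &[u8])]) -> Vec<(String, u64)> {
    thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|(name, data)| s.spawn(move || (name.to_string(), fnv1a(data))))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("hash task panicked"))
            .collect()
    })
}

fn main() {
    let out = hash_all(&[("zone-a", b"hello".as_slice()), ("zone-b", b"world".as_slice())]);
    for (name, hash) in &out {
        println!("{name}: {hash:#x}");
    }
}
```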
Contributor:
Do we need to return an error at all? I think we've pretty liberally unwrapped tokio::spawn() await points where there's obviously no cancellation possible, because that means

(a) in prod we would have aborted due to the task panic anyway
(b) in dev we only propagate other panics

right? This seems pretty similar.

sunshowers (Contributor Author):

A JoinError can also happen due to a runtime shutdown, not just a panic. Since installinator is pretty autonomous I wanted to make sure that errors were reported up to wicket. wdyt?

sunshowers (Contributor Author):

I guess something I could do is to check whether the JoinError is a panic, and if so, panic as you suggested. If it's a cancellation then report that.
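The policy suggested here (re-raise panics, report only cancellation) can be sketched without tokio. The `TaskOutcome` enum below is a hand-rolled stand-in for tokio's `JoinError`; real code would instead match on `JoinError::try_into_panic()` / `is_cancelled()`:

```rust
use std::any::Any;
use std::panic;

/// Stand-in for the outcomes of awaiting a tokio task. (Tokio's real
/// JoinError exposes try_into_panic() and is_cancelled() for this.)
enum TaskOutcome<T> {
    Ok(T),
    Panicked(Box<dyn Any + Send>),
    Cancelled,
}

/// Re-raise a task panic (we'd have aborted anyway), but report
/// cancellation, e.g. runtime shutdown, as an ordinary error the caller
/// can surface to wicket.
fn unwrap_or_report<T>(outcome: TaskOutcome<T>) -> Result<T, String> {
    match outcome {
        TaskOutcome::Ok(v) => Ok(v),
        // Propagate the original panic payload.
        TaskOutcome::Panicked(payload) => panic::resume_unwind(payload),
        // Cancellation becomes a reportable error instead of dead air.
        TaskOutcome::Cancelled => {
            Err("task cancelled: runtime shutting down".to_string())
        }
    }
}

fn main() {
    match unwrap_or_report(TaskOutcome::Ok("zone hashes written")) {
        Ok(msg) => println!("{msg}"),
        Err(err) => eprintln!("error: {err}"),
    }
}
```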

Contributor:

Ehhh, what would cause the runtime to shut down, in practice? We're running under a #[tokio::main], right?

Contributor:

Oh, if we have tests that exercise this and we might hit the whole "runtime cancels tasks in a random order when test ends" thing, then maybe we don't want to panic here. But I don't know that it's worth doing work to bubble any kind of error up to wicket for a case that's only possible in tests.

sunshowers (Contributor Author):

Do you feel like it's a lot of work? I'm just generally concerned with dead air from a hard-to-introspect service I guess.

Contributor:

No, it's not a lot of work. But handling this error in this way is pretty different from how we handle tokio JoinErrors in general. E.g., even from within installinator:

// ...and also attempt to flush the disk write cache
tokio::task::spawn_blocking(move || {
    match dkio::flush_write_cache(f.as_raw_fd()) {
        Ok(()) => Ok(()),
        // Some drives don't support `flush_write_cache`; we don't want
        // to fail in this case.
        Err(err) if err.raw_os_error() == Some(libc::ENOTSUP) => Ok(()),
        Err(err) => Err(err),
    }
})
.await
.expect("task panicked")

I don't think handling this error is wrong. I just think it's weird when we don't bother almost anywhere else, and it's unfortunate to make an otherwise-infallible function return a Result.

let file_name = file_name.clone();
// data is a Bytes so is cheap to clone.
let data: Bytes = data.clone();
// Compute hashes in parallel.
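The comment above notes that `Bytes` is cheap to clone: like an `Arc<[u8]>`, cloning copies a pointer and bumps a refcount rather than copying the payload. A std-only illustration, with `Arc<[u8]>` as an assumed stand-in for `bytes::Bytes`:

```rust
use std::sync::Arc;

fn main() {
    // A large buffer behind a refcount, analogous to bytes::Bytes.
    let data: Arc<[u8]> = vec![0u8; 1024 * 1024].into();

    // "Cloning" is a refcount bump, not a megabyte memcpy.
    let for_hashing = Arc::clone(&data);

    // Both handles point at the same allocation.
    assert!(Arc::ptr_eq(&data, &for_hashing));
    println!("refcount = {}", Arc::strong_count(&data));
}
```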
Contributor:

Should this use JoinSet::spawn_blocking(), since this work is synchronous and compute-bound? (I assume collect() does the equivalent of JoinSet::spawn()?)

sunshowers (Contributor Author):

It should, I started off with that but got too clever by half, whoops. Fixed.
