[7/n] [installinator] write out zone hashes to mupdate-override.json #8155
Conversation
Created using spr 1.3.6-beta.1
installinator/src/write.rs

```rust
/// Computes the zone hash IDs.
///
/// Hash computation is done in parallel on blocking tasks. If a task panics
/// (should not happen in normal use), a `JoinError` is returned.
```
Do we need to return an error at all? I think we've pretty liberally unwrapped `tokio::spawn()` await points where there's obviously no cancellation possible, because that means:

(a) in prod we would have aborted due to the task panic anyway
(b) in dev we only propagate other panics

right? This seems pretty similar.
A `JoinError` can also happen due to a runtime shutdown, not just a panic. Since installinator is pretty autonomous, I wanted to make sure that errors were reported up to wicket. wdyt?
I guess something I could do is check whether the `JoinError` is a panic, and if so, panic as you suggested. If it's a cancellation, then report that.
Ehhh, what would cause the runtime to shut down, in practice? We're running under a `#[tokio::main]`, right?
Oh, if we have tests that exercise this and we might hit the whole "runtime cancels tasks in a random order when test ends" thing, then maybe we don't want to panic here. But I don't know that it's worth doing work to bubble any kind of error up to wicket for a case that's only possible in tests.
Do you feel like it's a lot of work? I'm just generally concerned with dead air from a hard-to-introspect service I guess.
No, it's not a lot of work. But handling this error in this way is pretty different from how we handle tokio `JoinError`s in general. E.g., even from within installinator:
omicron/installinator-common/src/raw_disk_writer.rs, lines 86 to 97 in 03081e1:

```rust
// ...and also attempt to flush the disk write cache
tokio::task::spawn_blocking(move || {
    match dkio::flush_write_cache(f.as_raw_fd()) {
        Ok(()) => Ok(()),
        // Some drives don't support `flush_write_cache`; we don't want
        // to fail in this case.
        Err(err) if err.raw_os_error() == Some(libc::ENOTSUP) => Ok(()),
        Err(err) => Err(err),
    }
})
.await
.expect("task panicked")
```
I don't think handling this error is wrong. I just think it's weird when we don't bother to most anywhere else, and it's unfortunate to make an otherwise-infallible function return a `Result`.
installinator/src/write.rs

```rust
let file_name = file_name.clone();
// data is a Bytes so is cheap to clone.
let data: Bytes = data.clone();
// Compute hashes in parallel.
```
Should this use `JoinSet::spawn_blocking()` since this work is synchronous and compute-bound? (I assume `collect()` does the equivalent of `JoinSet::spawn()`?)
It should, I started off with that but got too clever by half, whoops. Fixed.
Part of RFD 556. In upcoming work, sled-agent will check these hashes at boot time, and mark an error if there's a mismatch.