Document MaybeUninit bit validity #140463

Open: joshlf wants to merge 6 commits into base: master

joshlf
Contributor

@joshlf joshlf commented Apr 29, 2025

Partially addresses rust-lang/unsafe-code-guidelines#555 by clarifying that it is sound to write any byte values (initialized or uninitialized) to any `MaybeUninit<T>` regardless of `T`.

r? @RalfJung

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 29, 2025
Comment on lines +275 to +278
/// If `T` contains initialized bytes at byte offsets where `U` contains padding bytes, these
/// may not be preserved in `MaybeUninit<U>`, and so `transmute(u)` may produce a `T` with
/// uninitialized bytes in these positions. This is an active area of discussion, and this code
/// may become sound in the future.
Contributor Author

@RalfJung I'd like some advice on this. I'm confident that this is correct as written, but could we perhaps make a stronger statement?

In particular, what happens if we round-trip a value which is invalid for `U` but where `U` nonetheless contains initialized bytes? For example, is `3u8` -> `MaybeUninit<bool>` -> `u8` guaranteed to produce `3u8`?
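
For concreteness, the round trip being asked about can be sketched as follows. This is only an illustration: `round_trip_via_bool` is a hypothetical helper, not code from the PR, and whether the result is guaranteed is exactly the open question here.

```rust
use std::mem::{transmute, MaybeUninit};

/// Hypothetical helper: moves a `u8` into a `MaybeUninit<bool>` and back out.
fn round_trip_via_bool(x: u8) -> u8 {
    // `u8` and `MaybeUninit<bool>` are both one byte, so `transmute` compiles.
    // No `bool` value is ever produced; the byte only lives inside the union.
    let m: MaybeUninit<bool> = unsafe { transmute(x) };
    // Read the byte back as a `u8`. We deliberately never call `assume_init`,
    // which would be UB for 3 because 3 is not a valid `bool`.
    unsafe { transmute(m) }
}

fn main() {
    println!("{}", round_trip_via_bool(3));
}
```

Since `bool` has no padding bytes, this appears to fall under the case the thread later treats as sound.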

Contributor Author

Per rust-lang/unsafe-code-guidelines#555 (comment), I've updated to the following text. Does that look good?

/// Note that, so long as every byte position which is initialized in `T` is also initialized
/// in `U`, then the preceding `identity` example *is* sound.

Member

As noted below I don't think the term "bytes initialized in T" makes a lot of sense. But by taking a reasonable guess at what you mean by this (non-padding byte), then yes I think that is sound.

@@ -252,6 +252,33 @@ use crate::{fmt, intrinsics, ptr, slice};
/// std::process::exit(*code); // UB! Accessing uninitialized memory.
/// }
/// ```
///
/// # Validity
Contributor Author

Moving this discussion here:

The MaybeUninit docs probably make sense for this. We now do have a definition of "byte" in the reference that this can link to.

Okay, awesome. And what wording would you recommend? Would it be accurate to say something like the following?

The value of a `[MaybeUninit<u8>; N]` may contain pointer provenance, and so `p: P` -> `[MaybeUninit<u8>; N]` -> `P` preserves the value of `p`, including provenance

@RalfJung would you like me to add language like this to this PR?

Contributor Author

Update: I've added the following as a more concrete and fleshed out draft. I can edit or remove as preferred.

/// # Provenance
///
/// `MaybeUninit` values may contain [pointer provenance][provenance]. Concretely, for any
/// pointer type, `P`, which contains provenance, transmuting `p: P` to
/// `MaybeUninit<[u8; size_of::<P>]>` and then back to `P` will produce a value identical to
/// `p`, including provenance.
///
/// [provenance]: ../ptr/index.html#provenance
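
A minimal sketch of the round trip this draft describes, assuming `P = *const u32` as the pointer type. The alias `P` and the helper `round_trip_ptr` are illustrative names, not part of the PR.

```rust
use std::mem::{size_of, transmute, MaybeUninit};

// Assumed pointer type for illustration.
type P = *const u32;

/// Hypothetical helper: `P` -> `MaybeUninit<[u8; size_of::<P>()]>` -> `P`.
fn round_trip_ptr(p: P) -> P {
    // Store the pointer's bytes (and, per the draft text, its provenance)
    // in a `MaybeUninit` byte array of the same size.
    let bytes: MaybeUninit<[u8; size_of::<P>()]> = unsafe { transmute(p) };
    // Transmute straight back to the pointer type.
    unsafe { transmute(bytes) }
}

fn main() {
    let x = 42u32;
    let q = round_trip_ptr(&x);
    // Dereferencing `q` relies on the round trip having kept the provenance of `&x`.
    println!("{}", unsafe { *q });
}
```

The dereference at the end is the part that depends on the proposed guarantee: comparing addresses alone would not exercise provenance.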

Member

RalfJung commented May 7, 2025

Cc @rust-lang/opsem

Comment on lines +277 to +280
/// If `T` contains initialized bytes at byte offsets where `U` contains padding bytes, these
/// may not be preserved in `MaybeUninit<U>`, and so `transmute(u)` may produce a `T` with
/// uninitialized bytes in these positions. This is an active area of discussion, and this code
/// may become sound in the future.
Member

I don't think it makes sense to say that a type "contains initialized bytes" at some offset. That's a property of a representation.

The typical term for representation bytes that are lost here is "padding". I don't think we have rigorously defined padding anywhere yet, but the term is sufficiently widely-used (and generally with a consistent meaning) that we may just be able to use it here?

Contributor Author

IIUC, you're making two points:

  • We should speak about a type's representation containing bytes, not about the type itself containing bytes
  • In a representation, we should speak about padding bytes rather than uninitialized bytes

Is that right?

One thing that's probably worth distinguishing here is between values and layouts. In my mental model, an uninit byte is one of the possible values that a byte can have (e.g., it's the 257th value that can legally appear in a `MaybeUninit<u8>`). By contrast, padding is a property of a layout - namely, it's a sequence of bytes in a type's layout that happen to have the validity `[MaybeUninit<u8>; PADDING_LEN]`.

Based on this, maybe it's best to say:

If byte offsets exist at which `T`'s representation does not permit uninitialized bytes but `U`'s representation does (e.g. due to padding), then the bytes in `T` at these offsets may not be preserved in `u`, and so `transmute(u)` may produce a `T` with uninitialized bytes at these offsets. This is an active area of discussion, and this code may become sound in the future.

Member

@RalfJung RalfJung May 8, 2025

Is that right?

No. I think both of the following concepts make sense:

  • The representation of a particular value at a particular type contains uninitialized bytes.
  • A type contains padding bytes. (These are bytes which are always ignored by the representation relation.)

But it makes less sense to talk about padding of a representation, or to talk about uninitialized bytes in a type.

So for this PR, the two key points (and they are separate points) are:

  • If U has padding, those bytes may be reset to "uninitialized" as part of the round-trip. If those same bytes are not padding in T, this can therefore mean some of the information of the original T value is lost.
  • If T does not permit uninitialized bytes on those positions, the round-trip is UB.

The second point is just a logical consequence of the first, it does not add any new information. Not sure if it is worth mentioning.
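
To make the first point concrete, here is a small sketch that locates the padding in a hypothetical `#[repr(C)] struct U { a: u8, b: u16 }`, assuming a target where `u16` has alignment 2. The struct and the helper `padding_offsets` are illustrative, and the sketch deliberately does not perform the round trip itself, since that is the operation that may be UB.

```rust
use std::mem::size_of;

// Hypothetical type with one padding byte between `a` and `b`
// (assuming `u16` is 2-byte aligned on the target).
#[repr(C)]
struct U {
    a: u8,
    b: u16,
}

/// Returns the byte offsets of `U` that are not covered by any field.
fn padding_offsets() -> Vec<usize> {
    let u = U { a: 0, b: 0 };
    let base = &u as *const U as usize;
    // Field offsets, computed from addresses so no newer macros are needed.
    let a_start = &u.a as *const u8 as usize - base;
    let b_start = &u.b as *const u16 as usize - base;
    let fields = [
        a_start..a_start + size_of::<u8>(),
        b_start..b_start + size_of::<u16>(),
    ];
    (0..size_of::<U>())
        .filter(|off| !fields.iter().any(|f| f.contains(off)))
        .collect()
}

fn main() {
    // Any offset listed here is one where a `[u8; 4]` -> `MaybeUninit<U>` -> `[u8; 4]`
    // round trip may come back uninitialized, per the first point above.
    println!("{:?}", padding_offsets());
}
```

With `T = [u8; 4]`, the byte at such an offset is not padding in `T`, so losing it both drops information and, because `u8` does not permit uninitialized bytes, makes the round trip UB.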

Contributor Author

  • The representation of a particular value at a particular type contains uninitialized bytes.
  • A type contains padding bytes. (These are bytes which are always ignored by the representation relation.)

Does this imply that a type contains padding bytes, not a type's representation?

I'm thinking through the implications of what you said, and I think I understand something new that I didn't before, and I want to run it by you: In my existing mental model, a padding byte is a location in a type's layout such that every byte value at that location (including uninit) is valid (enums complicate this model, but I don't think that complication is relevant for this discussion - we can just stick to thinking about structs). The problem with this mental model is that, interpreted naively, it implies that different byte values in a padding byte could correspond to different logical values of the type. So e.g. in the type `#[repr(C)] struct T(u8, u16)`, `[0, 0, 0, 0]` and `[0, 1, 0, 0]` would correspond to different values of the type since we're treating the padding byte itself as part of the representation relation. Of course, that is not something we want.

IIUC, by contrast your model is that the representation relation simply doesn't include padding bytes at all. So it'd be more accurate to describe the representation of `T` as consisting of three bytes - at offsets 0, 2, and 3. Every representation of `T` has a "hole" at offset 1 which is not part of the representation. This ensures that there's a 1:1 mapping between logical values and representations. Is that right?

Member

@RalfJung RalfJung May 8, 2025

Does this imply that a type contains padding bytes, not a type's representation?

That's how I think about it. We can't tell which byte is a padding byte by looking at one representation -- it's a property of the type.

In my existing mental model, a padding byte is a location in a type's layout such that every byte value at that location (including uninit) is valid

That would make the only byte of `MaybeUninit<u8>` a padding byte, so I don't think this is the right definition.
That's why I said above: a padding byte is a byte that is ignored by the representation relation. Slightly more formally: if `r` is some representation valid for type `T`, and `r'` is equal to `r` everywhere except for padding bytes, then `r` and `r'` represent the same value.

So it'd be more accurate to describe the representation of T as consisting of three bytes

The representation has 4 bytes. But only 3 of them actually affect the represented value (which is a tuple of two [mathematical] integers).
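
As an illustration of this definition, the sketch below builds `T` from the two byte sequences mentioned earlier in the thread and compares the resulting values. It assumes a target where `#[repr(C)] struct T(u8, u16)` is 4 bytes with padding at offset 1; `from_bytes` is a hypothetical helper.

```rust
use std::mem::transmute;

#[repr(C)]
#[derive(Debug, PartialEq, Clone, Copy)]
struct T(u8, u16);

/// Hypothetical helper: interprets 4 bytes as a `T`.
fn from_bytes(bytes: [u8; 4]) -> T {
    // Assumed sound here because every bit pattern is valid for `u8` and `u16`,
    // and the byte at offset 1 is padding, so its content does not matter.
    unsafe { transmute(bytes) }
}

fn main() {
    let r = from_bytes([0, 0, 0, 0]);
    let r_prime = from_bytes([0, 1, 0, 0]); // differs from `r` only at offset 1
    // Field-wise comparison: both should be the value `T(0, 0)`.
    println!("{}", r == r_prime);
}
```

The derived `PartialEq` compares only the fields, which matches the idea that the represented value is a pair of integers and the padding byte does not contribute to it.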


We seem to be using the term "representation" slightly differently. For me, that's just a `List<Byte>` of appropriate length. You may be using that term to refer to what I call "representation relation"?

/// # Provenance
///
/// `MaybeUninit` values may contain [pointer provenance][provenance]. Concretely, for any
/// pointer type, `P`, which contains provenance, transmuting `p: P` to
Member

I would say "for any value p: P that contains provenance", or so -- "types that contain provenance" doesn't make much sense to me, and restricting this to pointer types seems unnecessary (this also applies to e.g. tuples and arrays containing pointers).

Contributor Author

Good point; updated to use your language.
