
Proposal: Introduce safety-checked UB for reading invalid bit patterns via @bitCast, pointers, and untagged unions #6784


rohlem opened this issue Oct 23, 2020 · 7 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.

@rohlem
Contributor

rohlem commented Oct 23, 2020

Status-quo

The Zig type system allows for types that don't cleanly fit into bytes / registers; I'll call them sparse types for short. These have invalid bit patterns: bit patterns which do not map to any value of the specified type. For example:

  • *T cannot be 0, the bit pattern that ?*T uses to represent null.
  • enum {a, b, c} needs more than 1 bit but less than 2 bits of information, so when stored in 2 bits, one of the four bit patterns remains invalid.
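To make the two bullets concrete, a small sketch (current Zig syntax; the pointer fact relies on the documented null-pointer optimization for ?*T):

```zig
const std = @import("std");

test "sparse types leave bit patterns unused" {
    // ?*u8 stays pointer-sized because null is encoded as address 0,
    // which a plain *u8 is therefore never allowed to hold:
    try std.testing.expect(@sizeOf(?*u8) == @sizeOf(*u8));

    // enum {a, b, c} is backed by a u2, but only 3 of the 4 possible
    // two-bit patterns name a member; 0b11 maps to no value:
    const E = enum { a, b, c };
    try std.testing.expect(@bitSizeOf(E) == 2);
}
```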

The only way to produce invalid values of sparse types is via reinterpretation operations, which include:

  • accessing an inactive field of an untagged union
  • @bitCast, @ptrCast, @intToPtr
  • dereferencing a pointer-to-sparse-type, e.g. to an address that was initialized as a different type, or that was not yet initialized.
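As a sketch of the @bitCast route (current syntax; per status quo, the field read on the last line of the function is not checked):

```zig
const std = @import("std");

const E = enum(u2) { a, b, c }; // the bit pattern 0b11 is invalid
const P = packed struct { e: E, rest: u6 };

fn lowBitsAsEnum(byte: u8) E {
    const p: P = @bitCast(byte); // reinterprets all 8 bits of `byte`
    return p.e; // if byte & 0b11 == 0b11, this value of E is invalid
}

test "valid inputs map to the expected member" {
    try std.testing.expect(lowBitsAsEnum(0b01) == .b);
}
```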

As far as I could find, the documentation doesn't state how invalid values are handled in the general case. Instead, every checked, value-preserving cast is individually documented to trigger safety-checked undefined behaviour if the input value is out-of-bounds for the target type.

This proposal

Formally specify that constructing an invalid value is always immediately illegal (safety-checked undefined) behaviour.

I wrote a couple of example scenarios. These conditions are currently not checked (I can create individual issues if the general idea is accepted):

  • Accessing an inactive field of an untagged union can create invalid values.
  • Dereferencing a pointer-to-sparse-type can read an invalid value of that type.
  • @bitCast-ing to a packed struct/union with fields of sparse types can lead to those fields' values being invalid.
    (Essentially every potentially-new read of a packed struct/union field of a sparse type needs to be checked.)
  • @ptrCast-ing the address of any variable of a sparse type to a modifiable pointer-to-different-type means that the new pointer can be used to write invalid values readable as the source type.
    (I don't think this is something a non-omniscient implementation can guard against, unless we did something radical like outlawing @ptrCast from sparse source types.
    Technically this would work to some extent. You could declare a packed/extern union to get the original behaviour, which also signals to the compiler that invalid values might appear in that location.
    This safety falls apart, though, if you @ptrCast a sparse type into a union it wasn't declared in.)
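The first two scenarios might look as follows (a sketch; extern union is used here since its inactive-field reads are never tracked, and 0xFF is just an arbitrary pattern outside E):

```zig
const E = enum(u8) { a, b, c };

const U = extern union {
    byte: u8,
    e: E,
};

fn viaInactiveField() E {
    const u = U{ .byte = 0xFF };
    return u.e; // 0xFF names no member of E; this proposal would check here
}

fn viaPointerDeref() E {
    var byte: u8 = 0xFF;
    const p: *E = @ptrCast(&byte);
    return p.*; // reads an invalid E through a pointer-to-sparse-type
}
```

Both functions return a bit pattern that no E member names, which is exactly the condition this proposal wants to turn into safety-checked UB.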

As a special case, stage1 currently always loads all integer types by exactly their bit width (masking all other bits off), even when they have byte alignment (i.e. outside of packed types) and even when runtime safety is disabled.
This ensures valid values by hiding the actual invalid bit pattern, potentially hiding semantic errors. I feel like this is also the wrong default, and worth reconsidering.
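A sketch of that masking behaviour (the outcome noted in the comment is hypothetical; what actually happens is implementation-defined, which is exactly what's being questioned here):

```zig
fn maskedLoad() u8 {
    var x: u4 = 0; // byte-aligned: occupies a full byte outside packed types
    const raw: *u8 = @ptrCast(&x);
    raw.* = 0xFF; // plant an out-of-range bit pattern through the alias
    // stage1 masks this load down to 4 bits, returning 0x0F and silently
    // discarding the invalid upper bits instead of detecting them:
    return x;
}
```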

Update (2023-06-26): Sentinel-terminated arrays [n:s]T are also an interesting case to consider in this context. (I can't remember if they were in the language when I wrote this proposal originally.)

  • The sentinel element is considered as part of the type's size: @sizeOf([n:s]T) == @sizeOf([n+1]T)
  • The sentinel element is expected to be present in memory. I think this is asserted by the type system + compiler sometimes, but IMO it's under-specified when exactly this happens.
  • While they are not necessarily sparse types (only if T is), they are also vulnerable to invalid values when reinterpreting memory [n+1]T => [n:s]T.
  • It's also not fully clear in status-quo whether @as([n:s]T, undefined) should correctly establish the sentinel element or not.
    In my personal opinion it should, since it's a type invariant (similar to an unused higher byte in a padded integer), and only the other elements ("effective information") should be undefined.
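Sketch of the sentinel points (current syntax; the @ptrCast line constructs exactly the questionable reinterpretation):

```zig
const std = @import("std");

test "the sentinel is part of the array's size and memory" {
    // The sentinel element occupies real memory:
    try std.testing.expect(@sizeOf([3:0]u8) == @sizeOf([4]u8));

    // Reinterpreting [4]u8 as [3:0]u8 is only sound if element [3] == 0:
    var raw = [4]u8{ 1, 2, 3, 9 }; // 9 violates the sentinel invariant
    const p: *[3:0]u8 = @ptrCast(&raw);
    _ = p; // under this proposal, reading p.* would be safety-checked UB
}
```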
@rohlem rohlem changed the title Proposal: Introduce safety-checked UB for reinterpreting invalid bit patterns via @bitCast, pointers, and untagged unions Proposal: Introduce safety-checked UB for reading invalid bit patterns via @bitCast, pointers, and untagged unions Oct 24, 2020
@cajw1

cajw1 commented Oct 25, 2020

What are the reasons for not "doing something radical" and rejecting all of this at compile time (including untagged unions with sparse fields unless they share the same legal bit patterns)? I suppose I am missing some obvious C interop reasons...

@Vexu Vexu added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Oct 25, 2020
@Vexu Vexu added this to the 0.8.0 milestone Oct 25, 2020
@rohlem
Contributor Author

rohlem commented Oct 25, 2020

@cajw1 I'm not personally opposed to these "radical" changes/limitations either. I wanted to clarify that I wouldn't blindly set them in stone up front though, since

  • they severely limit some forms of composition.
    This means if someone (besides me) has a use case that benefits from them, the program logic of their new workaround might get unreasonably complex, which can more easily lead to (or hide) mistakes.
  • I'm fairly certain most "solutions" only reduce some error surfaces, while not solving the issue. Maybe they're still valuable enough, but harder to justify, and might require assessment of the remaining risk even if implemented.
    • As an example, say const E = enum(u8){a, b}. If we outlaw @ptrCast from *E to *u8 because u8 has a valid bit pattern that is invalid for E, this only secures us against misuse for variables statically allocated as E. As soon as you allocate memory dynamically, you're originally dealing with addresses / pointers to bytes, which can by nature represent all bit patterns.
      So if our source is *u8, an error there can still write any bit pattern it wants, or give out the same address as different pointer types, etc.
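That bullet, sketched (std.mem.Allocator is the real interface; E and the 0xFF pattern are illustrative):

```zig
const std = @import("std");

const E = enum(u8) { a, b };

// Even if @ptrCast from *E to *u8 were outlawed, heap memory starts out
// as bytes, so the byte-level alias exists before the E does:
fn fromHeap(allocator: std.mem.Allocator) !E {
    const bytes = try allocator.alloc(u8, 1);
    defer allocator.free(bytes);
    bytes[0] = 0xFF; // any bit pattern can be written through the *u8 view
    const p: *E = @ptrCast(&bytes[0]);
    return p.*; // may yield an invalid E; banning *E -> *u8 didn't help
}
```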

@cajw1

cajw1 commented Oct 26, 2020

@rohlem , I had not thought about heap allocation...

Looking at mem.zig and Allocator.zig: Allocator.create(...) is defined to return a pointer to undefined memory (line 300, https://github.com/ziglang/zig/blob/master/lib/std/mem/Allocator.zig). Using this before defining it is UB (though not tracked at compile time). I don't understand how the runtime tracks whether a u8 is in the state undefined, or what memsetting a u8 to undefined actually does; understanding #4298 might help, but I so far fail to understand why/how undefined should/could be a runtime concept.

What if @ptrCast(SparseType, valueOfLessSparseType) somehow were defined to return a pointer to undefined memory and :) the compiler ensured that undefineds are always defined before use? EDIT: see @rohlem #6784 (comment), this proposal would be better formulated as:
What if @ptrCast(SparseType, valueOfLessSparseType) somehow were defined to return a pointer to invalid memory and :) the compiler ensured that invalids are always defined before use?

As soon as you cast a non-sparse type to a sparse one, you are really overpromising, so if such a cast were considered to undefine the cast memory (only as a compile-time flag, no runtime effect), that promise would go away. This would be acceptable if the only use-cases for such casts were ones where a compile-time-enforced define afterwards is fine (I am only thinking of the malloc use-case, where it is ok if the only thing you can do after such a cast is a define)... MMIO (and optimal deserialization etc.) will have to use non-sparse types (I suspect extern struct is enforcing this already, but packed structs are not: they accept sparse types as members and can take part in ptrCasts).

@rohlem
Contributor Author

rohlem commented Oct 26, 2020

Just to be clear, this proposal is (intentionally) not about undefined, but the concept of invalid values from other sources. undefined could also be considered such a source, but currently has many implications beyond that.

I'd like not to sidetrack too much, but to quickly address your ideas @cajw1 :

  • undefined in general is a compile-time optimization concept we're currently adopting from LLVM. It's only "tracked" at compile-time and the compiler basically elides computations it can prove aren't deterministic anyway (because one of the inputs is undefined).
    As such it's similar to optimizing based on "undefined behaviour", but doesn't always imply UB; "audit analysis of undefined values; make it clear when undefined is allowed or not" (#1947) gives a bit more details.

  • What if [casting to a sparse type always] returned a pointer to undefined memory and :) the compiler ensured that [it needs to be initialized] before use?

    That's an interesting idea. In order to check the validity of values, you already need to check with a non-sparse (dense) type.
    So in the usual scenario, instead of being able to go "I know I wrote a valid value as a u8" -> cast the pointer -> read it as my enum,
    you're instead required to always check whether the value is valid (whether you already know or not) and save the value -> cast the pointer -> write the value back through the casted pointer -> be able to read it as my enum.

    The only issue I really see with this approach are the ergonomics of undefined used for this job. Currently we put it into the LLVM IR and send that off to optimizations, without any feedback on how far LLVM analyzed/optimized it.
    This means if you misuse undefined even locally, and Zig's analysis doesn't catch you doing it, LLVM might optimize away some of your code without proper warning.
    So if we "solve" this use-case by referring to undefined, that means a normal @ptrCast that a developer forgot to validate may result in broken code if LLVM spotted it and Zig didn't.
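The check -> cast -> write back -> read sequence described above might look like this (a sketch; isValidE is a hypothetical helper, and the write-back is what would re-establish definedness under this idea):

```zig
const std = @import("std");

const E = enum(u8) { a, b };

fn isValidE(byte: u8) bool {
    return byte <= @intFromEnum(E.b); // assumes members are exactly 0 and 1
}

fn readChecked(raw: *u8) ?E {
    const v = raw.*; // 1. check validity via the dense type and save the value
    if (!isValidE(v)) return null;
    const p: *E = @ptrCast(raw); // 2. cast the pointer
    p.* = @enumFromInt(v); // 3. write back through the casted pointer
    return p.*; // 4. only now read it as the enum
}

test "readChecked filters invalid patterns" {
    var b: u8 = 1;
    try std.testing.expect(readChecked(&b).? == .b);
    b = 0xFF;
    try std.testing.expect(readChecked(&b) == null);
}
```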

I'd really appreciate more feedback from more knowledgeable developers than us, but if this is deemed the right direction, then a type distinction * maybe-invalid T (as proto-proposed by @cajw1 in #3328 (comment) ) might be the right avenue. But I think that's best discussed in its own proposal/issue, since @ptrCast is only the fourth bullet point of this more general proposal.
EDIT: changed * maybe-undefined T to * maybe-invalid T, because reasoning about undefined values (even to check if they're invalid) is not supported by current semantics, and I wouldn't want to involve undefined in the general discussion of invalid values.

@kyle-github

kyle-github commented Oct 26, 2020

Introducing a second concept of invalid seems interesting. I had not realized what LLVM does with undefined. At least to me, that makes use of undefined fairly undesirable since you do not really know what the LLVM machinery is going to do with it ☹️. It sounds like undefined is closer to a synonym for unused.

Would invalid actually need to have a keyword? Perhaps the compiler could assign every variable invalid status unless proven otherwise by a valid assignment/initialization? The challenge then becomes determining what a valid assignment/init is.

Valid:

  • assignment from a literal since you can check that the literal is valid at compile time.
  • assignment from a function that returns an instance of that type (assuming the function itself makes a valid instance internally).
  • assignment from another valid variable (or part thereof such as an array/slice element).

Possibly invalid:

  • anything involving a pointer cast.

Unfortunately, I do not see how you will get valid data from an allocator. Maybe you also pass in an initialization function to the allocator? This would take an OO constructor and turn it inside out.

Sorry for the incoherent rambling. Clearly I have not had enough caffeine yet. It does seem like this concept of valid vs. invalid/unproven-valid is possibly very powerful, and probably different from undefined.

@rohlem
Contributor Author

rohlem commented Oct 26, 2020

@kyle-github I think a keyword (type attribute) is necessary as soon as it affects interfaces. To quote the example from #3328 (comment), output parameter pointers (e.g. @addWithOverflow) will (in case of success) overwrite the value pointed to, so they could be declared as * maybeinvalid T.
The compiler internally handling this logic might be viable, but maybe code dealing with pointers ends up as more readable if we allow it this expressiveness.

Either way, I'd like for this proposal to remain about the idea that "we want a panic (safety-checked UB) when we cast to / read invalid values", and not conflate it with solutions for dealing with invalid values (which need to be applied before they are cast to / read).

@kj4tmp
Contributor

kj4tmp commented Oct 6, 2024

Note that if this is accepted, the implementations of std.io.Reader.readStructEndian and std.io.Writer.writeStructEndian
need to be changed because they rely on temporary construction of invalid values before byte swapping for endianness conversion.
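For reference, the pattern in question looks roughly like this (a sketch approximating those std implementations, not their actual code; S and E are illustrative):

```zig
const std = @import("std");

const E = enum(u16) { a = 1, b = 2 };
const S = extern struct { e: E };

// The struct is first materialized in foreign byte order, then each field
// is byte-swapped in place. Before the swap, `e` may hold e.g. 0x0100,
// which is not a member of E: a temporarily-constructed invalid value.
fn readStructForeign(bytes: []const u8) S {
    var s: S = undefined;
    @memcpy(std.mem.asBytes(&s), bytes[0..@sizeOf(S)]);
    s.e = @enumFromInt(@byteSwap(@intFromEnum(s.e))); // reads the invalid value
    return s;
}
```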
