Skip to content

zig build system and std.ChildProcess fails if an environment variable contains an invalid UTF-8 #18694

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
paperclover opened this issue Jan 26, 2024 · 14 comments · Fixed by #19005
Labels
bug Observed behavior contradicts documented or intended behavior os-windows
Milestone

Comments

@paperclover
Copy link
Contributor

Zig Version

0.12.0-dev.2338+9d5a133f1

Steps to Reproduce and Observed Behavior

I ran into this on accident when trying to narrow a bug in bun. I do not suspect this has any real use case.

$Env:damn = "`u{D834}"
zig init
zig build
error: DanglingSurrogateHalf
C:\Users\dave\scoop\apps\zig-dev\0.12.0-dev.2338+9d5a133f1\lib\std\unicode.zig:469:39: 0xbe2e2a in nextCodepoint (build.exe.obj)
            if (it.i >= it.bytes.len) return error.DanglingSurrogateHalf;
                                      ^
C:\Users\dave\scoop\apps\zig-dev\0.12.0-dev.2338+9d5a133f1\lib\std\unicode.zig:806:12: 0xbe3f6e in utf16leToUtf8Alloc (build.exe.obj)
    while (try it.nextCodepoint()) |codepoint| {
           ^
C:\Users\dave\scoop\apps\zig-dev\0.12.0-dev.2338+9d5a133f1\lib\std\process.zig:280:27: 0xbe5409 in getEnvMap (build.exe.obj)
            const value = try std.unicode.utf16leToUtf8Alloc(allocator, value_w);
                          ^
C:\Users\dave\scoop\apps\zig-dev\0.12.0-dev.2338+9d5a133f1\lib\std\Build.zig:221:17: 0xbe811a in create (build.exe.obj)
    env_map.* = try process.getEnvMap(allocator);
                ^
C:\Users\dave\scoop\apps\zig-dev\0.12.0-dev.2338+9d5a133f1\lib\build_runner.zig:79:21: 0xbf3d80 in main (build.exe.obj)
    const builder = try std.Build.create(
                    ^
error: the following build command failed with exit code 1:
C:\test\zig-cache\o\8bd32588574a7225f4b7ae0f400f442b\build.exe C:\Users\dave\scoop\apps\zig-dev\0.12.0-dev.2338+9d5a133f1\zig.exe C:\test C:\test\zig-cache C:\Users\dave\AppData\Local\zig --seed 0xdac8d76b

Expected Behavior

To ignore or handle this error.

@paperclover paperclover added the bug Observed behavior contradicts documented or intended behavior label Jan 26, 2024
@squeek502
Copy link
Collaborator

squeek502 commented Jan 26, 2024

This is part of a larger problem that I've been meaning to create an issue for. Zig currently has many doc comments that allude to accepting/returning WTF-16 (which allows unpaired surrogates), but almost all code actually assumes valid UTF-16 (both as input and output).

At some point we'll likely need to move to proper WTF-16 awareness and conversion to/from something like WTF-8 instead of UTF-8 for the relevant Windows APIs.

@Paul-Dempsey
Copy link

I would like to see more information on how important/prevalent support for malformed Unicode really is. Is this solving a real problem, or defense-in-depth for an attack vector? Is there a genuine need for it as a first-line requirement, or does there just need to be a capability to handle it when needed? Handling malformed Unicode transparently may simply be deferring or hiding an issue that is better surfaced by letting things break.

@squeek502
Copy link
Collaborator

squeek502 commented Jan 26, 2024

@Paul-Dempsey the problem with letting things break is that it means that the Zig compiler and/or the Zig standard library just won't work in certain environments that are otherwise perfectly valid (in terms of what the operating system / syscalls support). The dilemma is illustrated pretty well by the problem being hit in the OP.

@Paul-Dempsey
Copy link

yes, I suppose letting things error is an opinionated approach that isn't the zig way.

The OP's problem is the result of user error: introducing invalid data to the environment. I don't buy that this is a valid scenario just because the OS doesn't prevent you from putting bad data into environment variables. You can also create broken environments by embedding nulls and other characters that are perfectly valid Unicode. An invalid path character in $PATH: will break various tools, but the same character is perfectly valid for other data in the environment.

But, you're correct that the whole WTF-16 issue will need thought. I don't know that this issue with environment variables is the right issue to carry the weight of WTF-16 in general, though.

@squeek502
Copy link
Collaborator

squeek502 commented Jan 27, 2024

I don't buy that this is a valid scenario just because the OS doesn't prevent you from putting bad data into environment variables.

It's not bad data from the perspective of the OS, though, which is what should probably matter. Zig is imposing its own interpretation of validity which doesn't match the operating system in this case.

An invalid path character in $PATH: will break various tools, but the same character is perfectly valid for other data in the environment.

This is a separate issue that is tracked by #15607.

You can also create broken environments by embedding nulls and other characters that are perfectly valid Unicode.

Could you provide an example? I'd be surprised if this were possible using the SetEnvironmentVariableW API.

@Jarred-Sumner
Copy link
Contributor

@Paul-Dempsey @paperdave started investigating this because a user reported InvalidUtf8 in an error message, and while trying to reproduce the error in a Zig program, a very similar error showed up in zig build. The user didn't know they had invalid UTF-8 in an environment variable. While it's strange, deliberately choosing to fail for input the operating system deems valid will make people less likely to want to use any software written in Zig.

@paperclover
Copy link
Contributor Author

paperclover commented Jan 27, 2024

The user didn't know they had invalid UTF-8 in an environment variable.

The user in question did not have Invalid UTF8 in their environment. This error was returned by std.ChildProcess.spawnAndWait. Since I cannot reproduce their issue locally, and the release builds do not print error return traces, I have resorted to testing random things; have not figured out why. They did.

oven-sh/bun#8197

This issue for Zig is not the parallel to the one in Bun. Once I narrow this issue down (unfortunately low priority), I will open a separate issue in Zig for the bug.

@paperclover
Copy link
Contributor Author

by embedding nulls

i do not think it is possible.

image

@paperclover
Copy link
Contributor Author

I don't buy that this is a valid scenario just because the OS doesn't prevent you from putting bad data into environment variables.

After more research and communication with one of the two people who have reported the bug in Bun, this is caused by the environment. Seems to be very easy to set this up when using the Starship prompt in PowerShell.

image

For me, this means I will have to write my own version of std.ChildProcess because the one in the standard library one forbids invalid UTF-8, while the underlying OS does not. On Mac and Linux, these would be forwarded as is regardless, which imo is more reason to be loose on the conversion.

I think the standard library should have APIs going between WTF-8 and UTF16-LE, and use those for the Windows OS layer instead of the stricter UTF-8 (though my ideal world: no text conversion here). This is something I'd be willing to help out fixing in a few weeks.

@paperclover paperclover changed the title zig build system on windows always fails if an environment variable contains an invalid unicode surrogate zig build system and std.ChildProcess fails if an environment variable contains an invalid UTF-8 Jan 27, 2024
@Paul-Dempsey
Copy link

If you want to be WTF compatible, WTF-8 is not required (and in my opinion would be a dangerous mistake). WTF tolerance can be done exactly how it would be done in a C Windows program, using the equivalent of wchar strings/buffers., the native datatype of text in Windows.

@paperclover
Copy link
Contributor Author

then should the standard library use wchar slices for EnvMap and so on?

I am not too well versed in the exact specifics of it, just want to make these apis work more reliably for Windows users.

@squeek502
Copy link
Collaborator

squeek502 commented Jan 27, 2024

EnvMap is currently meant as a cross-platform API that hides the differences between platforms, so it currently attempts to normalize everything to UTF-8.

So, there's a few options IMO:

  • Make the types in EnvMap.hash_map vary depending on the target OS ([]const u16 for Windows, []const u8 for everything else).
    • This would make cross-platform code harder to write, since users would have to deal with the type differences
  • Have a separate OS-specific EnvMap that uses the platform-native types, but leave EnvMap as []const u8
    • This would allow targeted fixes, like making ChildProcess take the OS-specific EnvMap type instead of the cross-platform one, but it might just create more complexity while still leaving the 'root' problem intact (since the cross-platform EnvMap would still try and potentially fail to convert to UTF-8)
  • Convert to/from WTF-8 instead of UTF-8
    • This would be a complete solution, but users would have to be mindful that they shouldn't directly output the WTF-8 values

EDIT: Another option would be to make EnvMap use []const u16 on Windows internally, but have the public API (get, put, etc still take UTF-8 [or perhaps WTF-8] and do the conversion for the user). This might be the best stop-gap solution until an overall strategy is decided on.

EDIT#2: Using []const u16 internally but 'speaking' UTF-8 via the public APIs has some issues, since e.g. how should get be implemented? It would either need to attempt to convert to UTF-8 before returning, which means that it'd need to take an allocator, or use an internal buffer and return a slice of the internal buffer (which would get invalidated on the next get call). Either approach would really complicate usage.

Also, just noticed that there is actually an existing issue for this problem: #1774

@Paul-Dempsey
Copy link

Thanks for mentioning #1774, which refers to https://simonsapin.github.io/wtf-8/! Now I know that WTF-8 is an actual well-defined concept and that spec addresses a number of my concerns, if followed (particularly keeping such data internal to a subsystem and never persisted or communicated outside the process).

@squeek502
Copy link
Collaborator

squeek502 commented Jan 29, 2024

(I linked to the WTF-8 spec in my first comment btw)

@paperdave the starship thing seems like it might be caused by a code page mismatch (that is, the input code page is different from what starship.exe is outputting), but I'm not sure how PowerShell input code pages / $env:foo= works exactly (I don't use PowerShell personally). It would surprise me if starship.exe is intentionally outputting ill-formed UTF-16, though, so I suspect either a code page mismatch or a starship bug.

Also, just to try to make things clearer, environment variables on Windows are always stored as WTF-16, and in Zig they are retrieved from the Process Environment Block (PEB). Any InvalidUtf8 errors are caused by Zig trying to convert from WTF-16.

squeek502 added a commit to squeek502/zig that referenced this issue Feb 19, 2024
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565
squeek502 added a commit to squeek502/zig that referenced this issue Feb 19, 2024
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565
squeek502 added a commit to squeek502/zig that referenced this issue Feb 24, 2024
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565
@andrewrk andrewrk added this to the 0.12.0 milestone Feb 25, 2024
Rexicon226 pushed a commit to Rexicon226/zig that referenced this issue Feb 25, 2024
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565
RossComputerGuy pushed a commit to ExpidusOS-archive/zig that referenced this issue Mar 20, 2024
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Observed behavior contradicts documented or intended behavior os-windows
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants