zig build system and std.ChildProcess fails if an environment variable contains an invalid UTF-8 #18694
Comments
This is part of a larger problem that I've been meaning to create an issue for. Zig currently has many doc comments that allude to accepting/returning WTF-16 (which allows unpaired surrogates), but almost all code actually assumes valid UTF-16 (both as input and output). At some point we'll likely need to move to proper WTF-16 awareness and conversion to/from something like WTF-8 instead of UTF-8 for the relevant Windows APIs.
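To make the WTF-16 vs. UTF-16 distinction concrete, here is a minimal sketch in Python rather than Zig, using only the standard codecs. It assumes that Python's `surrogatepass` error handler is a fair stand-in for "WTF-16 aware" decoding; the helper name `is_valid_utf16le` is made up for illustration and is not part of any Zig or Windows API.

```python
# Two UTF-16LE code units: an unpaired high surrogate (0xD800), then "A" (0x0041).
# This is legal WTF-16, and Windows will happily store it, but it is not valid UTF-16.
raw = b"\x00\xd8A\x00"


def is_valid_utf16le(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes as strict UTF-16LE."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# Strict UTF-16 decoding rejects the unpaired surrogate.
strict_ok = is_valid_utf16le(raw)

# A surrogate-tolerant decode keeps the unpaired surrogate as its own code point.
lenient = raw.decode("utf-16-le", "surrogatepass")
```

Code that assumes valid UTF-16 behaves like the strict path here: it has to fail (or worse) on input the OS considers perfectly fine.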
I would like to see more information on how important/prevalent support for malformed Unicode really is. Is this solving a real problem, or defense-in-depth for an attack vector? Is there a genuine need for it as a first-line requirement, or does there just need to be a capability to handle it when needed? Handling malformed Unicode transparently may simply be deferring or hiding an issue that is better surfaced by letting things break.
@Paul-Dempsey the problem with letting things break is that it means that the Zig compiler and/or the Zig standard library just won't work in certain environments that are otherwise perfectly valid (in terms of what the operating system / syscalls support). The dilemma is illustrated pretty well by the problem being hit in the OP.
Yes, I suppose letting things error is an opinionated approach that isn't the Zig way. The OP's problem is the result of user error: introducing invalid data to the environment. I don't buy that this is a valid scenario just because the OS doesn't prevent you from putting bad data into environment variables. You can also create broken environments by embedding nulls and other characters that are perfectly valid Unicode. An invalid path character in $PATH will break various tools, but the same character is perfectly valid for other data in the environment. But you're correct that the whole WTF-16 issue will need thought. I don't know that this issue with environment variables is the right issue to carry the weight of WTF-16 in general, though.
It's not bad data from the perspective of the OS, though, which is what should probably matter. Zig is imposing its own interpretation of validity which doesn't match the operating system in this case.
This is a separate issue that is tracked by #15607.
Could you provide an example? I'd be surprised if this were possible using the |
@Paul-Dempsey @paperdave started investigating this because a user reported |
This issue for Zig is not the parallel to the one in Bun. Once I narrow this issue down (unfortunately low priority), I will open a separate issue in Zig for the bug.
After more research and communication with one of the two people who have reported the bug in Bun, this is caused by the environment. Seems to be very easy to set this up when using the Starship prompt in PowerShell. For me, this means I will have to write my own version of I think the standard library should have APIs going between WTF-8 and UTF-16LE, and use those for the Windows OS layer instead of the stricter UTF-8 (though my ideal world: no text conversion here). This is something I'd be willing to help out fixing in a few weeks.
If you want to be WTF compatible, WTF-8 is not required (and in my opinion would be a dangerous mistake). WTF tolerance can be done exactly how it would be done in a C Windows program, using the equivalent of wchar strings/buffers, the native datatype of text in Windows.
Then should the standard library use wchar slices for EnvMap and so on? I am not too well versed in the exact specifics of it; I just want to make these APIs work more reliably for Windows users.
So, there's a few options IMO:
EDIT: Another option would be to make EDIT#2: Using Also, just noticed that there is actually an existing issue for this problem: #1774 |
Thanks for mentioning #1774, which refers to https://simonsapin.github.io/wtf-8/! Now I know that WTF-8 is an actual well-defined concept and that spec addresses a number of my concerns, if followed (particularly keeping such data internal to a subsystem and never persisted or communicated outside the process).
(I linked to the WTF-8 spec in my first comment btw) @paperdave the starship thing seems like it might be caused by a code page mismatch (that is, the input code page is different from what Also, just to try to make things clearer, environment variables on Windows are always stored as WTF-16, and in Zig they are retrieved from the Process Environment Block (PEB). Any |
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior. WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8. Closes ziglang#18694 Closes ziglang#1774 Closes ziglang#2565
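As a rough illustration of why a WTF-16 <-> WTF-8 conversion can be lossless where a UTF-16 <-> UTF-8 one cannot, here is a small Python sketch. It is not the Zig implementation from the commit: the function names `wtf16le_to_wtf8` and `wtf8_to_wtf16le` are hypothetical, and Python's `surrogatepass` handler is only an approximation of WTF-8 (it is more permissive than the spec when decoding, since it also accepts a surrogate pair written as two 3-byte sequences).

```python
def wtf16le_to_wtf8(data: bytes) -> bytes:
    """Hypothetical helper: potentially ill-formed UTF-16LE -> WTF-8-style bytes."""
    return data.decode("utf-16-le", "surrogatepass").encode("utf-8", "surrogatepass")


def wtf8_to_wtf16le(data: bytes) -> bytes:
    """Hypothetical helper: WTF-8-style bytes -> potentially ill-formed UTF-16LE."""
    return data.decode("utf-8", "surrogatepass").encode("utf-16-le", "surrogatepass")


# An unpaired high surrogate (0xD800) followed by "A": ill-formed UTF-16.
ill_formed = b"\x00\xd8A\x00"
as_wtf8 = wtf16le_to_wtf8(ill_formed)      # surrogate becomes the 3 bytes ED A0 80
round_tripped = wtf8_to_wtf16le(as_wtf8)   # identical to the input

# A properly paired surrogate (U+1F600) still becomes ordinary 4-byte UTF-8.
paired = b"\x3d\xd8\x00\xde"
paired_as_wtf8 = wtf16le_to_wtf8(paired)
```

Because well-formed input maps to plain UTF-8, only strings that actually contain unpaired surrogates end up with bytes that a strict UTF-8 validator would reject.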
Zig Version
0.12.0-dev.2338+9d5a133f1
Steps to Reproduce and Observed Behavior
I ran into this by accident when trying to narrow down a bug in Bun. I do not suspect this has any real use case.
Expected Behavior
To ignore or handle this error.