Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc #19005

squeek502 · 2024-02-19T21:41:26Z

Motivation

On Windows, paths/environment variables/command line arguments are arbitrary sequences of u16 (known as WTF-16), which means that unpaired surrogate codepoints (U+D800 to U+DFFF) are allowed. Unpaired surrogate codepoints cannot be encoded as valid UTF-8/UTF-16, meaning that UTF-8/UTF-16 cannot represent all possible paths/environment variables/command line arguments on Windows.

On other platforms (but not WASI), paths/environment variables/command line arguments are arbitrary sequences of u8 with no particular encoding. Therefore, invalid UTF-8 sequences are allowed, which in turn means that valid UTF-8 cannot represent all possible paths/environment variables/command line arguments.

On WASI, paths/environment variables/command line arguments are specified to be sequences of Unicode scalar values, meaning that they must be encodable as valid UTF-8/UTF-16. This means that WASI cannot handle all paths/environment variables/command line arguments regardless of the host platform.

Because Zig has cross-platform APIs that deal with slices of u8, some normalization/conversion has to be done for certain platforms. Up to this point, the status quo of Zig has been:

On Windows, convert WTF-16 to UTF-8 and fail if something can't be encoded as valid UTF-8 (or invoke illegal behavior in some buggy/ill-advised cases)
On WASI, Zig would unintentionally hit error.Unexpected if invalid UTF-8 was passed in (the underlying error is ILSEQ, or invalid byte sequence)
On other platforms, Zig does the right thing and does not assume any particular encoding

Possible solutions

Continue with the status quo and have Zig's cross-platform APIs just not be able to handle all paths/environment variables/command line arguments on Windows
- This doesn't seem aligned to the goals of Zig
Scrap the conversions to/from []u8 and force APIs to always deal with WTF-16 directly on Windows
- This would really complicate writing cross-platform code in Zig when dealing with the filesystem, environment variables, and command line arguments
Convert to/from WTF-8 on Windows, which can losslessly encode all possible WTF-16 sequences
- This is the strategy this PR goes with

What is WTF-8?

WTF-8 is a superset of UTF-8 that allows the codepoints U+D800 to U+DFFF (surrogate codepoints) to be encoded using the normal UTF-8 encoding algorithm. Since U+D800 to U+DFFF are the only WTF-16 code units that are normally unrepresentable in UTF-8, this alone is sufficient to losslessly roundtrip any WTF-16 sequence through WTF-8.
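
To make "the normal UTF-8 encoding algorithm" concrete, here is a minimal Python sketch. Python is used purely for illustration, and `wtf8_encode` is a hypothetical helper written for this example, not the `wtf8Encode` added to std.unicode. It applies the usual UTF-8 bit layout without rejecting surrogates, then checks the result against Python's `surrogatepass` error handler, which does the same thing for a single codepoint.

```python
def wtf8_encode(cp: int) -> bytes:
    """Encode one codepoint with the UTF-8 bit layout, surrogates included."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("codepoint out of range")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Surrogates (U+D800 to U+DFFF) land here; strict UTF-8 would reject them.
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


lone_high_surrogate = 0xD83D
encoded = wtf8_encode(lone_high_surrogate)

# Python's "surrogatepass" handler produces the same three bytes.
matches_surrogatepass = (
    encoded == chr(lone_high_surrogate).encode("utf-8", "surrogatepass")
)

# Strict UTF-8 refuses to encode the same codepoint.
try:
    chr(lone_high_surrogate).encode("utf-8")
    strict_utf8_accepts = True
except UnicodeEncodeError:
    strict_utf8_accepts = False
```

For non-surrogate codepoints the helper produces ordinary UTF-8, which is what makes WTF-8 a superset.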

Some notes:

WTF-16 to WTF-8 conversion cannot fail and is always lossless
WTF-8 to WTF-16 conversion can fail if the WTF-8 is invalid (for example, has a sequence with an invalid start byte, or a sequence that encodes an impossibly large codepoint; in other words, the normal rules around UTF-8 with the exception of surrogate codepoints)
WTF-8 -> WTF-16 -> WTF-8 roundtripping relies on the WTF-8 being "well-formed", meaning encoded surrogate codepoints are always unpaired. For example, if the sequence U+D83D U+DCA9 (a high surrogate followed by a low surrogate) was encoded as WTF-8, then when converted to WTF-16 and back to WTF-8 it would be interpreted as a surrogate pair that encodes the codepoint U+1F4A9, so the final WTF-8 would have the byte sequence for U+1F4A9 rather than U+D83D U+DCA9. As long as all surrogate codepoints in WTF-8 are unpaired, though, WTF-8 <-> WTF-16 roundtripping is guaranteed.
The spec says that users should avoid emitting/transmitting WTF-8 encoded bytes, and instead (lossily) convert to a valid Unicode encoding before emitting/transmitting (more on this later)
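
The roundtripping note above can be checked with a short Python sketch. The helper names (`wtf8_to_wtf16le`, `wtf16le_to_wtf8`, `roundtrips`) are hypothetical and only mirror the idea of the conversions, not the std.unicode implementation. The sketch assumes that Python's `surrogatepass` error handler is a fair stand-in for WTF-8/WTF-16 conversion, which holds for the inputs shown here.

```python
def wtf8_to_wtf16le(data: bytes) -> bytes:
    # Raises UnicodeDecodeError if the input is not valid WTF-8
    # (for example, an invalid start byte).
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


def wtf16le_to_wtf8(data: bytes) -> bytes:
    # A valid high+low pair of code units is decoded as one supplementary
    # codepoint; lone surrogates are passed through unchanged.
    text = data.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def roundtrips(wtf8: bytes) -> bool:
    return wtf16le_to_wtf8(wtf8_to_wtf16le(wtf8)) == wtf8


# 'a', unpaired U+D83D, 'b': well-formed WTF-8.
unpaired = b"a\xed\xa0\xbdb"

# U+D83D followed by U+DCA9, each encoded separately: not well-formed.
paired = b"\xed\xa0\xbd\xed\xb2\xa9"

unpaired_ok = roundtrips(unpaired)
paired_ok = roundtrips(paired)

# After the trip through WTF-16 the pair has collapsed into U+1F4A9.
collapsed = wtf16le_to_wtf8(wtf8_to_wtf16le(paired))

# Invalid WTF-8 is rejected on the way to WTF-16.
try:
    wtf8_to_wtf16le(b"\xff")
    invalid_accepted = True
except UnicodeDecodeError:
    invalid_accepted = False
```

Here `unpaired_ok` is true and `paired_ok` is false, and `collapsed` holds the four-byte encoding of U+1F4A9, which matches the behavior described in the note.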

The changes

This PR was initially focused solely on handling WTF-16 via WTF-8, but now has a few interconnected changes:

std.unicode was refactored a bit and function names were made more consistent (e.g. lowercase le changed to the more common Le)
WTF-8 <-> WTF-16 conversion and related functions were added to std.unicode
WASI now properly handles ILSEQ errors and returns error.InvalidUtf8 (now a WASI-only error) in that case
Windows now does WTF-16 <-> WTF-8 conversion everywhere, and errors with error.InvalidWtf8 (a Windows-only error) if any user-supplied inputs are invalid WTF-8
Anything that incorrectly talked about UTF-8 was fixed (e.g. NativeUtf8ComponentIterator was previously incorrectly named [by me])
Some error sets were updated/narrowed/made explicit

The `std.unicode` changes in detail

This same information is in one of the commit messages, but:

std.unicode changes

Renamed functions for consistent Le capitalization and conventions:

utf16leToUtf8Alloc -> utf16LeToUtf8Alloc
utf16leToUtf8AllocZ -> utf16LeToUtf8AllocZ
utf16leToUtf8 -> utf16LeToUtf8
utf8ToUtf16LeWithNull -> utf8ToUtf16LeAllocZ
fmtUtf16le -> fmtUtf16Le

New UTF related functions:

utf16LeToUtf8ArrayList
utf8ToUtf16LeArrayList
utf8ToUtf16LeAlloc
isSurrogateCodepoint

(the ArrayList functions are mostly to allow the Alloc and AllocZ functions to share an implementation)

New WTF related functions/structs:

wtf8Encode
wtf8Decode
wtf8ValidateSlice
Wtf8View
Wtf8Iterator
wtf16LeToWtf8ArrayList
wtf16LeToWtf8Alloc
wtf16LeToWtf8AllocZ
wtf16LeToWtf8
wtf8ToWtf16LeArrayList
wtf8ToWtf16LeAlloc
wtf8ToWtf16LeAllocZ
wtf8ToWtf16Le
wtf8ToUtf8Lossy
wtf8ToUtf8LossyAlloc
wtf8ToUtf8LossyAllocZ
Wtf16LeIterator

Notes/concerns

The WTF-8/WTF-16 functions share a lot of their implementation with the UTF-8/UTF-16 functions. This is nice in some ways (reduces duplicate code), but potentially not so nice in others (changes to the UTF code always have to be mindful of how they affect the WTF code).
InvalidUtf8 has gone from a Windows-only error to a WASI-only error in many places. This may lead to bugs at existing callsites, since the change in meaning won't necessarily show up as a breaking change.
As mentioned before, only well-formed WTF-8 (meaning all surrogates are unpaired) roundtrips properly, but well-formedness is not enforced/validated by the std.unicode implementation. This means it is up to the user to be aware of WTF-8 well-formedness and maintain that property themselves (see the spec section on concatenation for what this means in practice) if they care about the roundtripping property. Note, however, that when converting to WTF-16, paired surrogates in WTF-8 are interpreted as a surrogate pair, so non-well-formed WTF-8 will get interpreted as if it were concatenated according to the spec in the process of being converted to WTF-16.
- Not sure if I've done a good job explaining this. The idea is basically that since WTF-8 is only really meant to be used as a lossless u8 encoding of WTF-16, well-formedness of the WTF-8 doesn't matter too much since it has to be mapped to WTF-16 before it can be used in syscalls.
The spec mentions that "Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet."
- I'm not fully convinced that emitting WTF-8 is that bad. It's literally impossible to accurately represent WTF-16 unpaired surrogates as valid UTF-8, so the choice between converting invalid sequences to � (U+FFFD) before emission and letting the consuming program handle the invalid UTF-8 and do the � replacements itself doesn't seem that consequential: no approach leads to the output being accurately represented as valid Unicode.
- However, I have added std.fs.path.fmtAsUtf8Lossy and std.fs.path.fmtWtf16LeAsUtf8Lossy for any use cases where the paths being printed should definitely be represented as valid UTF-8, with unrepresentable sequences replaced by �.
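
As a rough illustration of what "lossy" means for printing paths, the following Python sketch replaces every encoded surrogate with U+FFFD before producing valid UTF-8. `wtf8_to_utf8_lossy` is a hypothetical name chosen to echo `wtf8ToUtf8Lossy`; the sketch is an assumption about the general idea and says nothing about how the std.unicode or std.fs.path functions are implemented (in particular, how they treat paired surrogates in non-well-formed input).

```python
REPLACEMENT = "\ufffd"


def wtf8_to_utf8_lossy(data: bytes) -> bytes:
    """Return valid UTF-8, with each surrogate codepoint replaced by U+FFFD."""
    text = data.decode("utf-8", "surrogatepass")
    cleaned = "".join(
        REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )
    return cleaned.encode("utf-8")


# A path with an unpaired high surrogate in the file name.
wtf8_path = b"C:\\dir\\\xed\xa0\xbd.txt"
printable = wtf8_to_utf8_lossy(wtf8_path)

# The result is strictly valid UTF-8, so this decode cannot fail.
as_text = printable.decode("utf-8")
```

Because an encoded surrogate and U+FFFD both occupy three bytes, the output has the same length as the input, but the original surrogate can no longer be recovered from it.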

Closes #18694
Closes #1774
Closes #2565


Ill-formed UTF-8 byte sequences are replaced by the replacement character (U+FFFD) according to "U+FFFD Substitution of Maximal Subparts" from Chapter 3 of the Unicode standard, and as specified by https://encoding.spec.whatwg.org/#utf-8-decoder
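
For readers unfamiliar with "U+FFFD Substitution of Maximal Subparts", the sketch below runs the example byte sequence from Chapter 3 of the Unicode standard through Python's built-in decoder with `errors="replace"`. This relies on the assumption that CPython's decoder follows the same practice; it is only an illustration of the substitution rule, not of the Zig code in this PR.

```python
# Example byte sequence from the Unicode standard's discussion of
# U+FFFD substitution: several truncated or stray sequences mixed with ASCII.
ill_formed = bytes.fromhex("61 f1 80 80 e1 80 c2 62 80 63 80 bf 64")

decoded = ill_formed.decode("utf-8", errors="replace")

# Each maximal ill-formed subpart becomes exactly one U+FFFD:
#   f1 80 80 -> one replacement (truncated 4-byte sequence)
#   e1 80    -> one replacement (truncated 3-byte sequence)
#   c2       -> one replacement (truncated 2-byte sequence)
#   80       -> one replacement (stray continuation byte)
#   80 bf    -> two replacements (two stray continuation bytes)
replacement_count = decoded.count("\ufffd")
```

The key point is that a truncated but otherwise plausible prefix is replaced as a unit, while bytes that can never start a sequence are replaced one at a time.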

Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565


andrewrk · 2024-02-25T06:07:27Z

Magnificent.

squeek502 requested a review from kprotty as a code owner February 19, 2024 21:41

squeek502 force-pushed the wtf branch from 660a552 to 33d0c26 Compare February 19, 2024 21:53

squeek502 changed the title ~~Fix handling of Windows (WTF-16) and WASI (UTF-8) paths~~ Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc Feb 19, 2024

squeek502 mentioned this pull request Feb 21, 2024

Callable structs #19025

Closed

squeek502 added 6 commits February 24, 2024 14:04

Update deprecated std.unicode function usages

80508b9

Use stack fallback allocator to usually avoid extra heap allocation i…

abd250b

…n getEnvVarOwned

Add std.fs.path.fmtAsUtf8Lossy/fmtWtf16LeAsUtf8Lossy

9fec608

squeek502 force-pushed the wtf branch from 33d0c26 to 9fec608 Compare February 24, 2024 22:06

andrewrk enabled auto-merge February 25, 2024 06:07

andrewrk mentioned this pull request Feb 25, 2024

x86_64: pass more tests #18906

Merged

andrewrk merged commit 6c2eb0f into ziglang:master Feb 25, 2024

squeek502 mentioned this pull request Feb 25, 2024

Update for latest Zig changes Vexu/arocc#633

Merged

This was referenced Feb 25, 2024

Allow for smaller allocations in node:path methods oven-sh/bun#8932

Merged

Update path bits when we update Zig for unpaired surrogates support in Windows oven-sh/bun#9122

Closed

This was referenced Feb 27, 2024

Breaking change with InvalidUtf8 as zig introduces Wtf8 ziglibs/known-folders#46

Closed

Breaking change with InvalidUtf8 as zig introduces Wtf8 zigtools/zls#1797

Closed

squeek502 mentioned this pull request Mar 30, 2024

Improvements for UEFI #19486

Closed

squeek502 mentioned this pull request Jun 12, 2024

Allow surrogate codepoints when escaped with \u #20270

Open

squeek502 mentioned this pull request Nov 2, 2024

POSIX.1-2024 encourages returning EILSEQ for filenames with a newline #21883

Open
