Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc #19005
Merged
Renamed functions for consistent `Le` capitalization and conventions:
- utf16leToUtf8Alloc -> utf16LeToUtf8Alloc
- utf16leToUtf8AllocZ -> utf16LeToUtf8AllocZ
- utf16leToUtf8 -> utf16LeToUtf8
- utf8ToUtf16LeWithNull -> utf8ToUtf16LeAllocZ
- fmtUtf16le -> fmtUtf16Le

New UTF related functions:
- utf16LeToUtf8ArrayList
- utf8ToUtf16LeArrayList
- utf8ToUtf16LeAlloc
- isSurrogateCodepoint

(the ArrayList functions are mostly to allow the Alloc and AllocZ to share an implementation)

New WTF related functions/structs:
- wtf8Encode
- wtf8Decode
- wtf8ValidateSlice
- Wtf8View
- Wtf8Iterator
- wtf16LeToWtf8ArrayList
- wtf16LeToWtf8Alloc
- wtf16LeToWtf8AllocZ
- wtf16LeToWtf8
- wtf8ToWtf16LeArrayList
- wtf8ToWtf16LeAlloc
- wtf8ToWtf16LeAllocZ
- wtf8ToWtf16Le
- wtf8ToUtf8Lossy
- wtf8ToUtf8LossyAlloc
- wtf8ToUtf8LossyAllocZ
- Wtf16LeIterator
Ill-formed UTF-8 byte sequences are replaced by the replacement character (U+FFFD) according to "U+FFFD Substitution of Maximal Subparts" from Chapter 3 of the Unicode standard, and as specified by https://encoding.spec.whatwg.org/#utf-8-decoder
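As a language-neutral illustration of that substitution policy, here is a small Python sketch. It uses Python's built-in decoder rather than the new `std.unicode` functions, on the assumption that `errors="replace"` follows the same maximal-subparts practice. The input bytes are the ill-formed example sequence from Chapter 3 of the Unicode Standard.

```python
# Illustration only: this is Python's decoder, not Zig's std.unicode.
# The input is the ill-formed example sequence from the Unicode Standard.
ill_formed = bytes([
    0x61,              # 'a'
    0xF1, 0x80, 0x80,  # truncated 4-byte sequence
    0xE1, 0x80,        # truncated 3-byte sequence
    0xC2,              # lead byte followed by a non-continuation byte
    0x62,              # 'b'
    0x80,              # stray continuation byte
    0x63,              # 'c'
    0x80, 0xBF,        # two stray continuation bytes
    0x64,              # 'd'
])

# Each maximal subpart of an ill-formed sequence becomes a single U+FFFD.
lossy = ill_formed.decode("utf-8", errors="replace")

# Print the result as code points so the replacement characters are visible.
print(" ".join(f"U+{ord(ch):04X}" for ch in lossy))
```

The truncated multi-byte prefixes collapse to one replacement character each, while every stray continuation byte gets its own.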
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565
Magnificent.
This was referenced Feb 25, 2024
This was referenced Feb 27, 2024
Motivation
- On Windows, paths/environment variables/command line arguments are arbitrary sequences of `u16` (known as WTF-16), which means that unpaired surrogate codepoints (U+D800 to U+DFFF) are allowed. Unpaired surrogate codepoints cannot be encoded as valid UTF-8/UTF-16, meaning that UTF-8/UTF-16 cannot represent all possible paths/environment variables/command line arguments on Windows.
- On other platforms (but not WASI), paths/environment variables/command line arguments are arbitrary sequences of `u8` with no particular encoding. Therefore, invalid UTF-8 sequences are allowed, which in turn means that valid UTF-8 cannot represent all possible paths/environment variables/command line arguments.
- On WASI, paths/environment variables/command line arguments are specified to be sequences of Unicode scalar values, meaning that they must be encodable as valid UTF-8/UTF-16. This means that WASI cannot handle all paths/environment variables/command line arguments regardless of the host platform.
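The first two points can be checked with any strict UTF-8 codec. The following is a minimal Python sketch, used only for illustration and not part of this PR. The helper names and file names are made up for the example.

```python
# Illustration only: Python's strict UTF-8 codec stands in for "valid UTF-8".
# The helper names and the example file names are hypothetical.

def encodes_as_utf8(text: str) -> bool:
    """Return True if every code point in `text` can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A Windows-style name containing an unpaired high surrogate (U+D800).
windows_name = "file_\ud800.txt"
# A POSIX-style name containing a byte that never appears in valid UTF-8.
posix_name = b"file_\xff.txt"
# A name made only of Unicode scalar values, as WASI requires.
wasi_name = "file_\U0001F4A9.txt"

print(encodes_as_utf8(windows_name))  # False: surrogates are not encodable
print(decodes_as_utf8(posix_name))    # False: 0xFF is never valid in UTF-8
print(encodes_as_utf8(wasi_name))     # True
```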
Because Zig has cross-platform APIs that deal with slices of `u8`, some normalization/conversion has to be done for certain platforms. Up to this point, the status quo of Zig has been:
- [...] `error.Unexpected` if invalid UTF-8 was attempted to be used (the underlying error is `ILSEQ` or invalid byte sequence)

Possible solutions
- [...] `[]u8` and force APIs to always deal with WTF-16 directly on Windows

What is WTF-8?
WTF-8 is a superset of UTF-8 that allows the codepoints `U+D800` to `U+DFFF` (surrogate codepoints) to be encoded using the normal UTF-8 encoding algorithm. Since `U+D800` to `U+DFFF` are the only WTF-16 code units that are normally unrepresentable in UTF-8, this alone is sufficient to be able to losslessly roundtrip from WTF-8 to WTF-16.

Some notes:
- [...] `U+D83D U+DCA9` (a high surrogate followed by a low surrogate) was encoded as WTF-8, then when converted to WTF-16 and back to WTF-8 it'd be interpreted as a surrogate pair that encodes the codepoint `U+1F4A9`, so the final WTF-8 would have the byte sequence for `U+1F4A9` rather than `U+D83D U+DCA9`. As long as all surrogate codepoints in WTF-8 are unpaired, though, `WTF-8` <-> `WTF-16` roundtripping is guaranteed.

The changes
This PR was initially focused solely on handling WTF-16 via WTF-8, but now has a few interconnected changes:
- `std.unicode` was refactored a bit and function names were made more consistent (e.g. lowercase `le` changed to the more common `Le`)
- [...] `std.unicode`
- [...] `ILSEQ` errors and returns `error.InvalidUtf8` (now a WASI-only error) in that case
- [...] `error.InvalidWtf8` (a Windows-only error) if any user-supplied inputs are invalid WTF-8
- [...] `NativeUtf8ComponentIterator` was previously incorrectly named [by me])

The `std.unicode` changes in detail
This same information is in one of the commit messages, but:
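std.unicode changes
Renamed functions for consistent `Le` capitalization and conventions:
- utf16leToUtf8Alloc -> utf16LeToUtf8Alloc
- utf16leToUtf8AllocZ -> utf16LeToUtf8AllocZ
- utf16leToUtf8 -> utf16LeToUtf8
- utf8ToUtf16LeWithNull -> utf8ToUtf16LeAllocZ
- fmtUtf16le -> fmtUtf16Le

New UTF related functions:
- utf16LeToUtf8ArrayList
- utf8ToUtf16LeArrayList
- utf8ToUtf16LeAlloc
- isSurrogateCodepoint

(the ArrayList functions are mostly to allow the Alloc and AllocZ functions to share an implementation)

New WTF related functions/structs:
- wtf8Encode
- wtf8Decode
- wtf8ValidateSlice
- Wtf8View
- Wtf8Iterator
- wtf16LeToWtf8ArrayList
- wtf16LeToWtf8Alloc
- wtf16LeToWtf8AllocZ
- wtf16LeToWtf8
- wtf8ToWtf16LeArrayList
- wtf8ToWtf16LeAlloc
- wtf8ToWtf16LeAllocZ
- wtf8ToWtf16Le
- wtf8ToUtf8Lossy
- wtf8ToUtf8LossyAlloc
- wtf8ToUtf8LossyAllocZ
- Wtf16LeIterator
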
Notes/concerns
- `InvalidUtf8` has gone from a Windows-only error to a WASI-only error in many places. This may lead to bugs at existing callsites since it won't appear as a breaking change.
- [...] `std.unicode` implementation. This means it is up to the user to be aware of WTF-8 well-formedness and maintain that property themselves (see the spec section on concatenation for what this means in practice) if they care about the roundtripping property. Note, however, that when converting to WTF-16, paired surrogates in WTF-8 are interpreted as a surrogate pair, so non-well-formed WTF-8 will get interpreted as if it were concatenated according to the spec in the process of being converted to WTF-16.
- [...] `u8` encoding of WTF-16, well-formedness of the WTF-8 doesn't matter too much since it has to be mapped to WTF-16 before it can be used in syscalls.
- [...] `std.fs.path.fmtAsUtf8Lossy` and `std.fs.path.fmtWtf16LeAsUtf8Lossy` for any use cases where the paths being printed should definitely be represented as valid UTF-8, with unrepresentable sequences replaced by �.

Closes #18694
Closes #1774
Closes #2565
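
The roundtripping behavior described under "What is WTF-8?" and the well-formedness caveat in the notes above can be reproduced outside of Zig. The following is a rough Python sketch, not the `std.unicode` implementation. It assumes that Python's `surrogatepass` error handler is a reasonable stand-in for WTF-8/WTF-16 encoding, and all helper names are hypothetical.

```python
# Illustration only. Python's "surrogatepass" error handler lets the UTF-8
# and UTF-16 codecs encode and decode surrogate code points, which
# approximates WTF-8 and WTF-16. All helper names here are hypothetical.

def wtf8_to_wtf16le(data: bytes) -> bytes:
    """Convert WTF-8 bytes to WTF-16LE bytes (two bytes per code unit)."""
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


def wtf16le_to_wtf8(units: bytes) -> bytes:
    """Convert WTF-16LE bytes to WTF-8 bytes.

    The UTF-16 decoder combines a high surrogate followed by a low
    surrogate into a single supplementary code point.
    """
    text = units.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_utf8_lossy(data: bytes) -> bytes:
    """Replace every surrogate code point with U+FFFD to get valid UTF-8."""
    text = data.decode("utf-8", "surrogatepass")
    cleaned = "".join("\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text)
    return cleaned.encode("utf-8")


# An unpaired high surrogate (U+D83D) encoded as WTF-8.
lone_high = b"\xed\xa0\xbd"
# U+D83D and U+DCA9 each encoded separately: not well-formed WTF-8.
paired = b"\xed\xa0\xbd\xed\xb2\xa9"

# Unpaired surrogates survive the WTF-8 -> WTF-16 -> WTF-8 roundtrip.
print(wtf16le_to_wtf8(wtf8_to_wtf16le(lone_high)) == lone_high)

# Paired surrogates come back as the 4-byte encoding of U+1F4A9 instead.
print(wtf16le_to_wtf8(wtf8_to_wtf16le(paired)) == "\U0001F4A9".encode("utf-8"))

# Lossy conversion yields valid UTF-8 with U+FFFD in place of the surrogate.
print(wtf8_to_utf8_lossy(b"a" + lone_high + b"b"))
```

The second check shows why well-formedness matters only for the roundtripping property: the WTF-16 form is the same either way, but the bytes that come back differ from the non-well-formed input.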