-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Unicode (utf-8) rework #5559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode (utf-8) rework #5559
Conversation
lib/std/unicode.zig
Outdated
/// "ret" cannot be *u21 because when casting to *u32 it would have differn't | ||
/// behavior on Little-Endian and Big-Endian machines, which is too much to ask | ||
/// of our callers. | ||
pub fn utf8Decode(bytes: []const u8, ret: *align(4) u32) !u3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have an outarg here? ==> you could return a struct with codepoint
and bytes_read
fields.
Or have bytes_read as the out-arg so we don't have the endian issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no endianness issue.
I do like multiple return values, but zig (and C) does not handle those well in the language. I think we should just support them the way Go does, which is not very intrusive on the language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no endianness issue.
then what is the doc-comment about?
lib/std/unicode.zig
Outdated
// u32s (that is what the CPU is already working with). | ||
pub fn isValidUnicode(c: u32) !void { | ||
switch (c) { | ||
0x0000...0xd7ff => {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird indentation here
lib/std/unicode.zig
Outdated
|
||
// This accepts u32 instead of u21 because it is much more common to work with | ||
// u32s (that is what the CPU is already working with). | ||
pub fn isValidUnicode(c: u32) !void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
already in the unicode module, call it isValidCodepoint
?
lib/std/unicode.zig
Outdated
@@ -32,7 +82,7 @@ pub fn utf8ByteSequenceLength(first_byte: u8) !u3 { | |||
/// out: the out buffer to write to. Must have a len >= utf8CodepointSequenceLength(c). | |||
/// Errors: if c cannot be encoded in UTF-8. | |||
/// Returns: the number of bytes written to out. | |||
pub fn utf8Encode(c: u21, out: []u8) !u3 { | |||
pub fn utf8Encode(c: u32, out: []u8) !u3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change to a u32
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it is a function. I will look into the LLVM optimization bug.
; Function Attrs: norecurse nounwind readnone
define i1 @nonpow2(i31 %0) local_unnamed_addr #0 {
Entry:
%1 = icmp ult i31 %0, 500
ret i1 %1
}
==>
nonpow2: // @nonpow2
.cfi_startproc
// %bb.0: // %Entry
and w8, w0, #0x7fffffff
cmp w8, #500 // =500
cset w0, lo
ret
lib/std/unicode.zig
Outdated
/// for the given codepoint. | ||
pub fn utf8CodepointSequenceLength(c: u21) !u3 { | ||
/// for the given codepoint. Taking a u21 here would add no safety, as we | ||
/// already have to check the upper bound, which is smaller than maxInt(u21). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zig in general prefers the tighter integer type; in future its likely to be replaced with a ranged integer (see #3806)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but this is a function entry, and if LLVM actually compiles this as a function it will mask the high bits. I can leave it and look into a LLVM patch however.
assert(bytes.len == 2); | ||
assert(bytes[0] & 0b11100000 == 0b11000000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this assert incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new assert is identical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right; that took me a bit to understand.
Why the new one? my intuition is that it would be slower to execute.
* Make the errors clearer. (the return error types are not set because that reveals bugs in stage1) * Make decode and length same operation * Give location of invalid character * Always report errors in Utf8View
I'm interested in improvements to std.unicode, however I will be closing this pull request. Here is why:
you are more than welcome to resurrect this in a new PR. |
reveals bugs in stage1)
@daurnimator Thanks for the review. That part was removed from the patch. I also added a rational for taking u32.
The reality is that even with the
zig_panic("TODO")
addressed, the error set code in stage1 was causing test failures that are hard to debug. Removing the error types gets rid of the test failures (lets see how CI goes...).A big reason I resurrected this patch from a long time ago is that I had put effort into trying to write some polished (all corner cases and error conditions correctly handled) string parser code.