Unicode (utf-8) rework #5559

shawnl · 2020-06-07T19:54:54Z

Make the errors clearer. (the return error types are not set because that
reveals bugs in stage1)
Make decode and length same operation
Give location of invalid character
Always report errors in Utf8View

@daurnimator Thanks for the review. That part was removed from the patch. I also added a rational for taking u32.

The reality is that even with the zig_panic("TODO") addressed, the error set code in stage1 was causing test failures that are hard to debug. Removing the error types gets rid of the test failures (lets see how CI goes...).

A big reason I resurrected this patch from a long time ago is that I had put effort into trying to write some polished (all corner cases and error conditions correctly handled) string parser code.

daurnimator · 2020-06-08T04:19:47Z

lib/std/unicode.zig

+/// "ret" cannot be *u21 because when casting to *u32 it would have differn't
+/// behavior on Little-Endian and Big-Endian machines, which is too much to ask
+/// of our callers.
+pub fn utf8Decode(bytes: []const u8, ret: *align(4) u32) !u3 {


Why have an outarg here? ==> you could return a struct with codepoint and bytes_read fields.
Or have bytes_read as the out-arg so we don't have the endian issue.

There is no endianness issue.

I do like multiple return values, but zig (and C) does not handle those well in the language. I think we should just support them the way Go does, which is not very intrusive on the language.

There is no endianness issue.

then what is the doc-comment about?

daurnimator · 2020-06-08T04:20:10Z

lib/std/unicode.zig

+// u32s (that is what the CPU is already working with).
+pub fn isValidUnicode(c: u32) !void {
+    switch (c) {
+    0x0000...0xd7ff => {},


weird indentation here

daurnimator · 2020-06-08T04:20:41Z

lib/std/unicode.zig

+
+// This accepts u32 instead of u21 because it is much more common to work with
+// u32s (that is what the CPU is already working with).
+pub fn isValidUnicode(c: u32) !void {


already in the unicode module, call it isValidCodepoint?

daurnimator · 2020-06-08T04:20:59Z

lib/std/unicode.zig

@@ -32,7 +82,7 @@ pub fn utf8ByteSequenceLength(first_byte: u8) !u3 {
 /// out: the out buffer to write to. Must have a len >= utf8CodepointSequenceLength(c).
 /// Errors: if c cannot be encoded in UTF-8.
 /// Returns: the number of bytes written to out.
-pub fn utf8Encode(c: u21, out: []u8) !u3 {
+pub fn utf8Encode(c: u32, out: []u8) !u3 {


Why change to a u32 here?

Because it is a function. I will look into the LLVM optimization bug.

; Function Attrs: norecurse nounwind readnone define i1 @nonpow2(i31 %0) local_unnamed_addr #0 { Entry: %1 = icmp ult i31 %0, 500 ret i1 %1 } ==> nonpow2: // @nonpow2 .cfi_startproc // %bb.0: // %Entry and w8, w0, #0x7fffffff cmp w8, #500 // =500 cset w0, lo ret

daurnimator · 2020-06-08T04:21:54Z

lib/std/unicode.zig

-/// for the given codepoint.
-pub fn utf8CodepointSequenceLength(c: u21) !u3 {
+/// for the given codepoint. Taking a u21 here would add no safety, as we
+/// already have to check the upper bound, which is smaller than maxInt(u21).


Zig in general prefers the tighter integer type; in future its likely to be replaced with a ranged integer (see #3806)

Yeah but this is a function entry, and if LLVM actually compiles this as a function it will mask the high bits. I can leave it and look into a LLVM patch however.

daurnimator · 2020-06-08T04:23:00Z

lib/std/unicode.zig

    assert(bytes.len == 2);
-    assert(bytes[0] & 0b11100000 == 0b11000000);


Was this assert incorrect?

the new assert is identical

oh right; that took me a bit to understand.

Why the new one? my intuition is that it would be slower to execute.

* Make the errors clearer. (the return error types are not set because that reveals bugs in stage1) * Make decode and length same operation * Give location of invalid character * Always report errors in Utf8View

andrewrk · 2020-10-10T02:57:21Z

I'm interested in improvements to std.unicode, however I will be closing this pull request. Here is why:

it's messy. a lot going on in 1 commit without a clear focus, making it hard to review, and likely to introduce regressions for other people.
test failing. the build is too old so I can't click it, and there is no comment mentioning why the test failed (perhaps it was a fluke? I can't tell)
merge conflicts. that's my fault for not merging this pull request earlier, but it's an unfortunate fact of life.
there are 58 other open pull requests for me to look at. I need to get to them. you gotta make this a little easier on me, to get merges more quickly.

you are more than welcome to resurrect this in a new PR.

daurnimator reviewed Jun 8, 2020

View reviewed changes

daurnimator added the standard library This issue involves writing Zig code for the standard library. label Jun 8, 2020

Unicode (utf-8) rework

4f75f10

* Make the errors clearer. (the return error types are not set because that reveals bugs in stage1) * Make decode and length same operation * Give location of invalid character * Always report errors in Utf8View

shawnl force-pushed the utf8-new branch from e4bee1a to 4f75f10 Compare June 10, 2020 09:10

andrewrk closed this Oct 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode (utf-8) rework #5559

Unicode (utf-8) rework #5559

shawnl commented Jun 7, 2020

daurnimator Jun 8, 2020

shawnl Jun 10, 2020

daurnimator Jun 10, 2020

daurnimator Jun 8, 2020

daurnimator Jun 8, 2020

daurnimator Jun 8, 2020

shawnl Jun 10, 2020 •

edited

Loading

daurnimator Jun 8, 2020

shawnl Jun 10, 2020

daurnimator Jun 8, 2020

shawnl Jun 10, 2020

daurnimator Jun 10, 2020

andrewrk commented Oct 10, 2020

		assert(bytes.len == 2);
		assert(bytes[0] & 0b11100000 == 0b11000000);

Unicode (utf-8) rework #5559

Unicode (utf-8) rework #5559

Conversation

shawnl commented Jun 7, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shawnl Jun 10, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewrk commented Oct 10, 2020

shawnl Jun 10, 2020 •

edited

Loading