Skip to content

Unicode (utf-8) rework #5559

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Unicode (utf-8) rework #5559

wants to merge 1 commit into from

Conversation

shawnl
Copy link
Contributor

@shawnl shawnl commented Jun 7, 2020

  • Make the errors clearer. (the return error types are not set because that
    reveals bugs in stage1)
  • Make decode and length same operation
  • Give location of invalid character
  • Always report errors in Utf8View

@daurnimator Thanks for the review. That part was removed from the patch. I also added a rational for taking u32.

The reality is that even with the zig_panic("TODO") addressed, the error set code in stage1 was causing test failures that are hard to debug. Removing the error types gets rid of the test failures (lets see how CI goes...).

A big reason I resurrected this patch from a long time ago is that I had put effort into trying to write some polished (all corner cases and error conditions correctly handled) string parser code.

/// "ret" cannot be *u21 because when casting to *u32 it would have differn't
/// behavior on Little-Endian and Big-Endian machines, which is too much to ask
/// of our callers.
pub fn utf8Decode(bytes: []const u8, ret: *align(4) u32) !u3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have an outarg here? ==> you could return a struct with codepoint and bytes_read fields.
Or have bytes_read as the out-arg so we don't have the endian issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no endianness issue.

I do like multiple return values, but zig (and C) does not handle those well in the language. I think we should just support them the way Go does, which is not very intrusive on the language.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no endianness issue.

then what is the doc-comment about?

// u32s (that is what the CPU is already working with).
pub fn isValidUnicode(c: u32) !void {
switch (c) {
0x0000...0xd7ff => {},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird indentation here


// This accepts u32 instead of u21 because it is much more common to work with
// u32s (that is what the CPU is already working with).
pub fn isValidUnicode(c: u32) !void {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already in the unicode module, call it isValidCodepoint?

@@ -32,7 +82,7 @@ pub fn utf8ByteSequenceLength(first_byte: u8) !u3 {
/// out: the out buffer to write to. Must have a len >= utf8CodepointSequenceLength(c).
/// Errors: if c cannot be encoded in UTF-8.
/// Returns: the number of bytes written to out.
pub fn utf8Encode(c: u21, out: []u8) !u3 {
pub fn utf8Encode(c: u32, out: []u8) !u3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change to a u32 here?

Copy link
Contributor Author

@shawnl shawnl Jun 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is a function. I will look into the LLVM optimization bug.

; Function Attrs: norecurse nounwind readnone
define i1 @nonpow2(i31 %0) local_unnamed_addr #0 {
Entry:
  %1 = icmp ult i31 %0, 500
  ret i1 %1
}

==>
nonpow2:                                // @nonpow2
	.cfi_startproc
// %bb.0:                               // %Entry
	and	w8, w0, #0x7fffffff
	cmp	w8, #500                // =500
	cset	w0, lo
	ret

/// for the given codepoint.
pub fn utf8CodepointSequenceLength(c: u21) !u3 {
/// for the given codepoint. Taking a u21 here would add no safety, as we
/// already have to check the upper bound, which is smaller than maxInt(u21).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zig in general prefers the tighter integer type; in future its likely to be replaced with a ranged integer (see #3806)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but this is a function entry, and if LLVM actually compiles this as a function it will mask the high bits. I can leave it and look into a LLVM patch however.

assert(bytes.len == 2);
assert(bytes[0] & 0b11100000 == 0b11000000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this assert incorrect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new assert is identical

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right; that took me a bit to understand.

Why the new one? my intuition is that it would be slower to execute.

@daurnimator daurnimator added the standard library This issue involves writing Zig code for the standard library. label Jun 8, 2020
* Make the errors clearer. (the return error types are not set because that
  reveals bugs in stage1)
* Make decode and length same operation
* Give location of invalid character
* Always report errors in Utf8View
@andrewrk
Copy link
Member

I'm interested in improvements to std.unicode, however I will be closing this pull request. Here is why:

  • it's messy. a lot going on in 1 commit without a clear focus, making it hard to review, and likely to introduce regressions for other people.
  • test failing. the build is too old so I can't click it, and there is no comment mentioning why the test failed (perhaps it was a fluke? I can't tell)
  • merge conflicts. that's my fault for not merging this pull request earlier, but it's an unfortunate fact of life.
  • there are 58 other open pull requests for me to look at. I need to get to them. you gotta make this a little easier on me, to get merges more quickly.

you are more than welcome to resurrect this in a new PR.

@andrewrk andrewrk closed this Oct 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants