Skip to content

std.ascii: remove LUT and deprecations #13370

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Dec 10, 2022
Merged

Conversation

wooster0
Copy link
Contributor

@wooster0 wooster0 commented Oct 31, 2022

I wrote this after #12448 got merged but decided to wait for 0.10.0 to be released before pulling it. Now I've decided to pull it a bit earlier since 0.10.0 is just about to be released, anyway (and in case someone else would accidentally start working on the same thing or something).

This should be merged after 0.10.0 and removes all deprecations, updates the codebase, does some other cleanups etc., and also removes the LUT (Look-Up Table): this was already discussed quite a bit in #11629 (see #11629 (comment) etc.) and I've also discussed this with @topolarity in more detail after that, but basically the reason is that the LUT made us depend on cache performance. It also reduces binary size. See also the commit's description for more detail.

Removing both the deprecations and the LUT at the same time is a very convenient timing, by the way. It shouldn't make review that much harder, I hope.

@wooster0 wooster0 force-pushed the newascii branch 5 times, most recently from 327403a to e4e47d1 Compare November 5, 2022 19:29
@IntegratedQuantum
Copy link
Contributor

I think I found something:

/// Returns whether the character is printable and has some graphical representation.
/// This also returns `true` for the space character.
/// This is the same as `!isControl(c)`.
pub fn isPrint(c: u8) bool {

This is the same as !isControl(c).

This was not true in the lookup-table implementation.
The lookup-tables always had a zero for non-ascii characters.
And I guess some part of the compiler assumes that behaviour.

@IntegratedQuantum
Copy link
Contributor

Right, it should be noted that std.ascii.isPrint now also considers spaces to be printable. This is a bit of behavioral breaking change.

I'm not talking about the spaces.
Also the old version was considering spaces as printable:

pub fn isPrint(c: u8) bool {
    return inTable(c, tIndex.Graph) or c == ' ';
} //                                   ^ Spaces returned true in the old code

The problem is non-ascii characters(c >= 0x80).
In the old code the lookup table was filled with zeroes for non-ascii characters:
mem.set(u8, table[128..256], 0); (line 293)
So for non-ascii characters isPrint(c) == false == isControl(c) was true in the old code.
In the new code isPrint(c) == true for non-ascii characters.

And I think it makes sense to consider non-ascii characters as not printable. For example in UTF-8 all non-ascii characters consist of multiple bytes, so a single byte of that sequence cannot be printed.

Your changes prevent both src/DepTokenizer.zig and std.fmt.fmtSliceEscape* from escaping non-ascii characters.
std.fmt.fmtSliceEscapeLower is used in escapeUnprintables in src/translate_c.zig.
This would explain why test/behaviour/translate_c_macros_not_utf8.h is failing with invalid byte: '\xa9' on the CI.

@wooster0 wooster0 force-pushed the newascii branch 3 times, most recently from 0989cff to a0b58b0 Compare November 19, 2022 12:18
This makes it so that we no longer use a LUT (Look-Up Table):
* The code is much simpler and easier to understand now.
* Using a LUT means we rely on a warm cache.
  Relying on the cache like this results in inconsistent performance and in many cases codegen will be worse.
  Also as @topolarity once pointed out, in some cases while it seems like the code may branch, it actually doesn't:
  ziglang#11629 (comment)
* Other languages' standard libraries don't do this either.
  JFF I wanted to see what other languages codegen compared to us now:
  https://rust.godbolt.org/z/Te4ax9Edf, https://zig.godbolt.org/z/nTbYedWKv
  So we are pretty much on par or better than other languages now.
On x86 interestingly I can see a reduction in codesize by 1 instruction with this.
While not necessarily faster, it might still reduce codesize a bit and this ordering is also more logical
because it follows ASCII table order. Rust's std uses this ordering too.
See https://zig.godbolt.org/z/PqodY8YqY for the difference.
This makes the function consider space to be printable as well (because it is)
and simplifies that function.
@andrewrk andrewrk merged commit 023b597 into ziglang:master Dec 10, 2022
scheibo added a commit to pkmn/engine that referenced this pull request Dec 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants