std.ascii: remove LUT and deprecations #13370

wooster0 · 2022-10-31T16:41:10Z

I wrote this after #12448 got merged but decided to wait for 0.10.0 to be released before pulling it. Now I've decided to pull it a bit earlier since 0.10.0 is just about to be released, anyway (and in case someone else would accidentally start working on the same thing or something).

This should be merged after 0.10.0 and removes all deprecations, updates the codebase, does some other cleanups etc., and also removes the LUT (Look-Up Table): this was already discussed quite a bit in #11629 (see #11629 (comment) etc.) and I've also discussed this with @topolarity in more detail after that, but basically the reason is that the LUT made us depend on cache performance. It also reduces binary size. See also the commit's description for more detail.

Removing both the deprecations and the LUT at the same time is a very convenient timing, by the way. It shouldn't make review that much harder, I hope.

lib/std/net.zig

IntegratedQuantum · 2022-11-18T18:09:16Z

I think I found something:

/// Returns whether the character is printable and has some graphical representation.
/// This also returns `true` for the space character.
/// This is the same as `!isControl(c)`.
pub fn isPrint(c: u8) bool {

This is the same as !isControl(c).

This was not true in the lookup-table implementation.
The lookup-tables always had a zero for non-ascii characters.
And I guess some part of the compiler assumes that behaviour.

IntegratedQuantum · 2022-11-19T10:35:46Z

Right, it should be noted that std.ascii.isPrint now also considers spaces to be printable. This is a bit of behavioral breaking change.

I'm not talking about the spaces.
Also the old version was considering spaces as printable:

pub fn isPrint(c: u8) bool {
    return inTable(c, tIndex.Graph) or c == ' ';
} //                                   ^ Spaces returned true in the old code

The problem is non-ascii characters(c >= 0x80).
In the old code the lookup table was filled with zeroes for non-ascii characters:
mem.set(u8, table[128..256], 0); (line 293)
So for non-ascii characters isPrint(c) == false == isControl(c) was true in the old code.
In the new code isPrint(c) == true for non-ascii characters.

And I think it makes sense to consider non-ascii characters as not printable. For example in UTF-8 all non-ascii characters consist of multiple bytes, so a single byte of that sequence cannot be printed.

Your changes prevent both src/DepTokenizer.zig and std.fmt.fmtSliceEscape* from escaping non-ascii characters.
std.fmt.fmtSliceEscapeLower is used in escapeUnprintables in src/translate_c.zig.
This would explain why test/behaviour/translate_c_macros_not_utf8.h is failing with invalid byte: '\xa9' on the CI.

@topolarity

This makes it so that we no longer use a LUT (Look-Up Table): * The code is much simpler and easier to understand now. * Using a LUT means we rely on a warm cache. Relying on the cache like this results in inconsistent performance and in many cases codegen will be worse. Also as @topolarity once pointed out, in some cases while it seems like the code may branch, it actually doesn't: ziglang#11629 (comment) * Other languages' standard libraries don't do this either. JFF I wanted to see what other languages codegen compared to us now: https://rust.godbolt.org/z/Te4ax9Edf, https://zig.godbolt.org/z/nTbYedWKv So we are pretty much on par or better than other languages now.

On x86 interestingly I can see a reduction in codesize by 1 instruction with this. While not necessarily faster, it might still reduce codesize a bit and this ordering is also more logical because it follows ASCII table order. Rust's std uses this ordering too. See https://zig.godbolt.org/z/PqodY8YqY for the difference.

This makes the function consider space to be printable as well (because it is) and simplifies that function.

wooster0 force-pushed the newascii branch 5 times, most recently from 327403a to e4e47d1 Compare November 5, 2022 19:29

IntegratedQuantum reviewed Nov 18, 2022

View reviewed changes

lib/std/net.zig Outdated Show resolved Hide resolved

wooster0 force-pushed the newascii branch from d0c5c2d to 0f20fd9 Compare November 18, 2022 16:16

wooster0 force-pushed the newascii branch from 0617c0f to b6e63e1 Compare November 18, 2022 22:18

wooster0 force-pushed the newascii branch 3 times, most recently from 0989cff to a0b58b0 Compare November 19, 2022 12:18

wooster0 added 7 commits December 9, 2022 21:57

api(std.ascii): remove deprecated decls

6b7d9b3

docs: minor improvements

626e02a

std.ascii: more tests

45650f7

DepTokenizer.printUnderstandableChar: consider space printable

37b1707

This makes the function consider space to be printable as well (because it is) and simplifies that function.

std.fmt escape functions: reflect printability of space in test

a01993e

wooster0 force-pushed the newascii branch from a0b58b0 to a01993e Compare December 9, 2022 20:57

andrewrk merged commit 023b597 into ziglang:master Dec 10, 2022

scheibo added a commit to pkmn/engine that referenced this pull request Dec 11, 2022

handle std.ascii.isAlpha removal (ziglang/zig#13370)

dc1a025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

std.ascii: remove LUT and deprecations #13370

std.ascii: remove LUT and deprecations #13370

wooster0 commented Oct 31, 2022 •

edited

Loading

IntegratedQuantum commented Nov 18, 2022

IntegratedQuantum commented Nov 19, 2022

std.ascii: remove LUT and deprecations #13370

std.ascii: remove LUT and deprecations #13370

Conversation

wooster0 commented Oct 31, 2022 • edited Loading

IntegratedQuantum commented Nov 18, 2022

IntegratedQuantum commented Nov 19, 2022

wooster0 commented Oct 31, 2022 •

edited

Loading