Skip to content

std.ascii: make toLower toUpper branchless #21369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 14, 2024
Merged

Conversation

CrazyboyQCD
Copy link
Contributor

@CrazyboyQCD CrazyboyQCD commented Sep 10, 2024

Godbolt link
Remove some extended instructions.

@alexrp
Copy link
Member

alexrp commented Sep 10, 2024

Godbolt link

Looks like the wrong link?

@CrazyboyQCD
Copy link
Contributor Author

Godbolt link

Looks like the wrong link?

Updated.

@Rexicon226
Copy link
Contributor

Looks great! What do you think about:

fn toLower(c: u8) u8 {
    return c | (if (std.ascii.isUpper(c)) 32 else c);
}

It's 2 bytes smaller and has a bit smaller block rtp.

@CrazyboyQCD
Copy link
Contributor Author

It's 2 bytes smaller

What is the aspect that is compared?
If for the instructions size, current implementation is 2 bytes smaller and yours is 1 byte smaller(godbolt link).

@Rexicon226
Copy link
Contributor

What is the aspect that is compared?

I take back what I said :). The tool I used to determine the size of the instructions seems to be incorrect for the lea. Well, it's just an idea, your implementation is great as is.

Copy link
Contributor

@Rexicon226 Rexicon226 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet implementation. I can't think of a better one that generates such dead simple codegen.

Block RTP,
Previous:

           Skylake   Tigerlake   
uiCA     | 2.33    | 2.00
llvm-mca | 2.10    | 2.10

This one:

           Skylake   Tigerlake   
uiCA     | 2.41    | 2.00
llvm-mca | 1.88    | 1.88

uiCA thinks it's a tad slower on skylake, but it isn't from any practical point of view. The cmov takes a bit of space and latency in the previous implementation and it's nice to get rid of.

@zeroZshadow
Copy link

Do note that Debug generates slightly worse code due to the overflow check with the multiply.
Would it be worth using the *% operator to avoid this check?

@Rexicon226
Copy link
Contributor

Do note that Debug generates slightly worse code due to the overflow check with the multiply. Would it be worth using the *% operator to avoid this check?

Sounds reasonable. Any other optimization mode figures out the width of the integer and ellids this.

@MatthiasPortzel
Copy link

MatthiasPortzel commented Sep 10, 2024

Would it be worth using the *% operator to avoid this check?

(Not a Zig contributor, just my 2 cents.)
This is not reasonable. A feature of the language is debug safety checks. There’s no reason to work around them here. Even if there was, Zig provides a builtin called @setRuntimeSafety that allows you to disable runtime safety checks for a block.

@Rexicon226
Copy link
Contributor

I believe you misunderstand something.

There’s no reason to work around them here.

There 100% is. We, as the user, can guarantee that this safety check will never be triggered. The largest value this will ever be is 32, which fits into a u8. Wrapping multiplication is the proper way to disable this check.

@MatthiasPortzel
Copy link

MatthiasPortzel commented Sep 11, 2024

I don't think I'm misunderstanding something, I think this just isn't how my brain works.

If you give me a programming language with a debug mode that has runtime checks for integer overflows I expect there to be runtime checks for integer overflows in debug mode.
If you use a wrapping multiplication operator I expect there to be wrapping multiplication happening.
If you want to disable a runtime safety check for performance reasons I expect you to use the builtin that is designed for disabling runtime safety checks.

But I'm not really a systems programmer, and I'm not really cut out for it I guess, because these functions were already branchless. cmovae isn't a branching instruction. What are we doing here!?

This PR hurts readability, maintainability, destroys the semantic meaning of the code, doesn't remove any branches, and you can't even say it improves performance because you haven't done a benchmark.

The only metric that seems to have been considered here is "number of instructions generated in godbolt." So yes, I don't understand. Is this what systems programming is?

(My tone may have been out of hand here, I apologize. I mean no disrespect to any person.)

@Rexicon226
Copy link
Contributor

To clarify,

If you give me a programming language with a debug mode that has runtime checks for integer overflows I expect there to be runtime checks for integer overflows in debug mode.

Integer overflow cannot happen here. It's guaranteed. There's no point in using @setRuntimeSafetyFalse here since we do want runtime-safety. We're just telling it that that specific multiply will not overflow, or more accurately to not care if it overflows, since it won't happen.

But I'm not really a systems programmer, and I'm not really cut out for it I guess, because these functions were already branchless. cmovae isn't a branching instruction. What are we doing here!?

We're removing the large and relatively expensive cmov. It's bottlenecking the block rtp of this function.

you can't even say it improves performance because you haven't done a benchmark.

Doing machine code analysis is the only way to get an understanding of the performance of such a small function. CPU variance would just be too large, and the uses cases too vague to create a real-world benchmark for this. Hence we look at the exact instructions used, how much space it takes up, how much padding was required in the function, what are the instruction dependency bottlenecks, etc.

The only metric that seems to have been considered here is "number of instructions generated in godbolt."

It certainly isn't. The two metrics we care about here the most are how large the function is in bytes and its block rtp.

@T-727
Copy link

T-727 commented Sep 12, 2024

If it helps resolve conflict, you can maybe try this instead?

fn toLower(c: u8) u8 {
    return c | @as(u8, @intFromBool(isUpper(c))) << 5;
}

@SilasLock
Copy link

SilasLock commented Sep 12, 2024

I think it's a good idea to use the left shift operator here, too. It's

  1. a little more readable, since it more directly describes what the code is trying to do (i.e. create a bitmask using the output of isLower or isUpper and use it to swap the 6th bit of c), and
  2. is exactly what the optimizer ends up outputting anyway, which is to use a shl instruction on x86. This can be seen in the Godbolt link at the top of this PR conversation. And for reference, I'm pretty sure this is the optimal thing to do in MIPS, as well*.

To address concerns about readability, I'd propose making the two functions a little more symmetric, and have them both use the XOR operator, rather than having one use an OR while the other uses an XOR. That way it's clearer both toUpper and toLower are doing the same sort of thing (swapping the value of a single bit). I'd also split the mask definition onto a separate line.

pub fn toUpper(c: u8) u8 {
    const mask = @as(u8, @intFromBool(std.ascii.isLower(c))) << 5;
    return c ^ mask;
}

pub fn toLower(c: u8) u8 {
    const mask = @as(u8, @intFromBool(std.ascii.isUpper(c))) << 5;
    return c ^ mask;
}

*MIPS should use a sltu instruction to compute the 0 or 1 bit outcome of isUpper/isLower, so there's even less overhead from the setb instruction used in x86.

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is guaranteed integer overflow cannot occur, then * is the correct operator, not *%. The latter is only appropriate if wrapping integer arithmetic behavior is required.

@Rexicon226
Copy link
Contributor

If it is guaranteed integer overflow cannot occur, then * is the correct operator, not *%. The latter is only appropriate if wrapping integer arithmetic behavior is required.

Fair enough. Not trying to die on a hill here :). I like the shifting version, I think it makes the most sense.

@andrewrk andrewrk merged commit 8ddce90 into ziglang:master Sep 14, 2024
10 checks passed
return c;
}
const mask = @as(u8, @intFromBool(isLower(c))) << 5;
return c ^ mask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be &, rather than ^ (xor)? As is it would toggle the case, which isn't what the function is supposed to do (both by convention and according to doc comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xor looks correct -- you're right that it toggles the case, but only when isLower(c) is true

DivergentClouds pushed a commit to DivergentClouds/zig that referenced this pull request Sep 24, 2024
richerfu pushed a commit to richerfu/zig that referenced this pull request Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants