wasmparser: 15-30% performance regressions from v228 -> v229 #2180

Open
Robbepop opened this issue May 6, 2025 · 15 comments

Robbepop (Collaborator) commented May 6, 2025

I just finished updating Wasmi to the new wasm-tools v229 coming from v228.

  • Roughly 15% performance regressions across the board.
  • Up to 30% performance regressions when Wasmi translation was lazy, so that wasmparser parsing and validation made up a larger share of the work; that figure is probably closer to the real regression in wasmparser.

Wasmi PR: wasmi-labs/wasmi#1501

Benchmarking before and after revealed the following performance regressions:
[screenshot: benchmark results showing the performance regressions]

alexcrichton (Member) commented
Would you be able to help profile and track down where this is coming from? Perhaps via bisection?

Robbepop (Collaborator, Author) commented May 6, 2025

Would you be able to help profile and track down where this is coming from? Perhaps via bisection?

Yes, I can do that.

Robbepop (Collaborator, Author) commented May 6, 2025

@alexcrichton The 30% parse and validation regression first appears here

commit 0354dde10184d7ff813282a2837d9a4ed8e766e6
Author: Keith Winstein <[email protected]>
Date:   Tue Apr 15 14:09:45 2025 -0700
    wasmparser: detect "malformed" cases in parser alone (without validator) (#2134)

The commit right before it does not have the regression:

commit 90c156ff15a81cb0360fb949e6504a4799d3b5cc
Author: Alex Crichton <[email protected]>
Date:   Tue Apr 15 14:34:21 2025 -0500
    Prevent over-deep type hierarchies in wasm-smith (#2143)

I have also run benchmarks on the newly released wasmparser v0.230.0, but it seems to suffer from the same performance regressions as wasmparser v0.229.0.

cc @keithw

keithw (Contributor) commented May 6, 2025

I can replicate this in-repo with cargo bench -p wasmparser -- parse/tests ; let me try to produce a flamegraph and take a look asap (but please don't let me stop anybody else).

keithw (Contributor) commented May 13, 2025

I did some measurements on parsing (without validating) a single huge module on an unloaded AMD Ryzen 7 PRO 4750U, comparing two release builds built with rustc 1.86.0 (05f9846f8 2025-03-31).

For the "before", I tested the commit just before that PR (90c156f), backporting the "full parse" logic from https://github.com/bytecodealliance/wasm-tools/blob/main/src/lib.rs#L299 (with the exception of the ops.finish()?; line since the "before" parser doesn't have a finish function to check anything at the end of a function).

For the "after", I used that PR (0354dde).

In both cases I tested wasm-tools parse without output (i.e. removing https://github.com/bytecodealliance/wasm-tools/blob/main/src/bin/wasm-tools/parse.rs#L25). I ran the command under perf stat a few hundred times and recorded the cycles counter. I ran it enough that the standard error of the mean is <1M cycles for both measurements.
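(For reference, a parse-only driver along these lines looks roughly like the sketch below. It assumes wasmparser's Parser/Payload/OperatorsReader API and is only an illustration of the methodology described above, not the exact backported code.)

```rust
// Minimal parse-only driver, roughly mirroring wasm-tools' parse_binary_wasm.
// Illustration only; assumes wasmparser's Parser/Payload/OperatorsReader API.
use wasmparser::{Parser, Payload};

fn parse_only(bytes: &[u8]) -> wasmparser::Result<()> {
    for payload in Parser::new(0).parse_all(bytes) {
        match payload? {
            // Function bodies are where most of the parsing time is spent.
            Payload::CodeSectionEntry(body) => {
                let mut ops = body.get_operators_reader()?;
                while !ops.eof() {
                    ops.read()?;
                }
                // `ops.finish()?` only exists after the PR in question, so the
                // "before" build described above omits it.
            }
            // Other payloads are parsed by the iterator but otherwise ignored.
            _ => {}
        }
    }
    Ok(())
}
```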

I was surprised to find that in this "single huge module" test, I didn't see a regression:

| Commit | Mean cycle count | Diff |
| --- | --- | --- |
| Before (90c156f) | 823,830,564 | (baseline) |
| After (0354dde) | 813,671,062 | -1.2% |

For parsing+validation, again with this single huge module, I see a roughly 3% slowdown running wasm-tools validate, but nothing like 15-30%.

On the other hand, cargo bench on the spec and local testsuite (which has thousands of mostly tiny modules) definitely documents a regression. So I'm wondering... are you seeing a big slowdown on parsing or parse+validating individual (large) modules, and if so, are you able to share an example module? Or should we be looking at a slowdown related to parser creation/startup time on lots of small modules, and that's getting washed out in the noise on my "single huge module" test?

Robbepop (Collaborator, Author) commented May 17, 2025

@keithw I am not sure I can follow the reasoning behind all your steps above, but let me answer your questions. Sorry for the delay in doing so.

First of all, I am on an Apple M2 Pro and have no access to other machines for benchmarking purposes; from what I remember, I ran the benchmarks on Rust 1.85.

So I'm wondering... are you seeing a big slowdown on parsing or parse+validating individual (large) modules, and if so, are you able to share an example module?

I saw 15% regressions across the board, for large modules such as spidermonkey.wasm (4MB) as well as small modules such as erc20.wasm (~7KB). Additionally, I saw ~30% regressions in cases where Wasmi's own translation was disabled and wasmparser performance therefore played a bigger role, e.g. the lazy-translation mode posted above, which parses and validates a module via wasmparser.

You can view Wasmi benchmarks here:
https://github.com/wasmi-labs/wasmi/blob/main/crates/wasmi/benches/benches.rs#L200-L289

These are parts of the Wasm files I use to benchmark:
https://github.com/wasmi-labs/wasmi/tree/main/crates/wasmi/benches/wasm

Why did you remove the call to ops.finish()? I have checked its implementation and it only performs a single check that the block stack is empty, which should be insignificant.

keithw (Contributor) commented May 18, 2025

Hmm, I'm still having trouble replicating this on x86-64 with spidermonkey.wasm. I wonder if this might be an aarch64 vs. x86-64 issue or something else. When you say wasmparser -- are we talking about validating, or just parsing?

For validation, here's what I get running taskset -c 0 perf stat -e cycles ./target/release/wasm-tools validate ~/spidermonkey.wasm about 1,500 times on commit 0354dde (the PR):

166,774,027 cycles (mean) ± 3,803,251 cycles (stddev), or about 47 milliseconds (43 milliseconds user time)

And with the same command 1,500 times on commit 90c156f (just before that PR):

160,586,953 cycles (mean) ± 5,394,219 cycles (stddev), or about 45 milliseconds (41 milliseconds user time)

So I'm seeing a slowdown in single-threaded validation performance of about 3.9%. (The taskset -c 0 restricts the command to run on one core... otherwise rayon spawns a lot of threads for parallel function-wise validation which I think is probably a distraction from this issue.)

If you're also talking about validation, can I ask what kinds of absolute numbers you're able to measure for the same pair of commits on your machine?

Why did you remove the call to ops.finish()? I have checked its implementation and it only performs a single check that the block stack is empty, which should be insignificant.

The Parser only gained a finish function in that PR, so for my "parse-only" measurements I had to take it out to be able to backport the parse_binary_wasm function from https://github.com/bytecodealliance/wasm-tools/blob/main/src/lib.rs#L299 to the previous commit.

Robbepop (Collaborator, Author) commented May 18, 2025

are we talking about validating, or just parsing?

The largest regressions (~30%) appeared when Wasmi used lazy translation, which eagerly parses and validates Wasm via wasmparser and does basically nothing else expensive during Wasm module creation.

I'm still having trouble replicating this on x86-64

Did you manage to replicate this on aarch64 or do you just have access to x86-64?
In your first post you said you can replicate this via wasmparser's benchmarks.

I'm starting to wonder if the slowdown is connected to the way wasmparser sets up the Parser and Validator. For example, prior to your PR wasmparser::Parser had barely any (or even no) heap-allocated state, and now it does. For validation-related logic, the wasmparser::Validator type has an API to re-use heap-allocated state, which was quite important to make it fast across many consecutive calls. Maybe this is the root issue: wasmparser::Parser might be missing such an API, and we are basically benchmarking the heap allocator here?

Though this would not explain why parse+validate is the slowest combo.
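(For reference, the validator-side allocation-reuse pattern being referred to looks roughly like the sketch below. It assumes wasmparser's Validator/ValidPayload/FuncValidatorAllocations API and is only meant to illustrate the reuse idea.)

```rust
// Sketch of validating a module while re-using function-validator allocations.
// Assumes wasmparser's Validator/ValidPayload/FuncValidatorAllocations API.
use wasmparser::{FuncValidatorAllocations, Parser, ValidPayload, Validator};

fn validate_reusing_allocations(bytes: &[u8]) -> wasmparser::Result<()> {
    let mut validator = Validator::new();
    // Heap allocations handed from one function-body validation to the next.
    let mut allocs = FuncValidatorAllocations::default();
    for payload in Parser::new(0).parse_all(bytes) {
        if let ValidPayload::Func(func, body) = validator.payload(&payload?)? {
            let mut func_validator = func.into_validator(allocs);
            func_validator.validate(&body)?;
            // Recover the allocations for the next function body.
            allocs = func_validator.into_allocations();
        }
    }
    Ok(())
}
```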

alexcrichton (Member) commented May 18, 2025

Like @keithw, I've reproduced a 3-5% regression in this repository's benchmarks (e.g. cargo bench -p wasmparser --bench benches). I checked out the wasmi repository and @Robbepop's PR and see similar regressions in the 15-30% range. I believe the main difference is LTO-vs-not, but I have not confirmed that.

Using main I replaced this vec with a simple depth: u32 counter. That improved performance slightly but didn't fully resolve anything. I then removed all modification of depth and the regression was resolved. This change makes the parser no longer spec-compliant (tests fail), so it isn't a valid change to land, but I wanted to try out some different ideas.

There appears to be something weird going on with wasmi's benchmark build and I don't know what. I removed all modification of my new depth field except in the handling of the delegate instruction, which wasmi is not using. In this build the regression is present. If I simply increment/decrement the depth counter without doing anything else the regression is also still present.

My hunch at this time is that this is right on the edge of something like CPU cache lines, registers being spilled, some inlining threshold, or something like that. I don't know how to narrow it down further, but IMO this isn't very actionable. I think it's possible to use a depth counter in the parser (all tests pass with a few other minor changes) and ensure that full validation still happens in the validator. Technically not spec-compliant, but compliant enough to get past tests for malformed-vs-invalid distinctions. Whether or not this is a good idea I don't know; it doesn't "fix" the 15-30% regression, although it does improve it.

I'm a bit at a loss as to how best to address this myself.

Robbepop (Collaborator, Author) commented
Like @keithw, I've reproduced a 3-5% regression in this repository's benchmarks (e.g. cargo bench -p wasmparser --bench benches).

Hmmm, this is strange! On my machine I also see those 30% regressions for cargo bench -p wasmparser --bench benchmark on the wasm-tools repo, not only on the Wasmi repository. In fact that's how I bisected the origin of the regression.

[screenshot: wasmparser benchmark results showing ~30% regressions]

Though not all wasmparser benchmarks regress by ~30%; some actually only regress by the aforementioned 3-5%.

keithw (Contributor) commented May 19, 2025

@Robbepop, I wanted to bump my question -- are you able to report the absolute cycle counts or times (for the same validate command on spidermonkey.wasm) for the same pair of commits on your machine? Unfortunately I don't have access to an M2 machine (although if one can be rented by the hour, that could work).

Robbepop (Collaborator, Author) commented May 20, 2025

Hey @keithw, sorry, I missed that question. I have tried to find programs on macOS that can provide cycle counters but found none. From what I can tell, even Instruments' cycle counts only work by sampling and are therefore inaccurate, so instead of that behemoth of software I used the very lightweight samply tool. Also, there seems to be no replacement for taskset on macOS.

Do you have ideas how to test this more accurately on MacOS?

Not sure how I can best share the findings of the samply tool, since it produced a huge JSON file which was neatly viewable in the browser. What I can say is that in both runs (pre- and post-PR) the validation ran in ~1s, so, weirdly, I didn't see any regressions using that tool.

Maybe @alexcrichton is right with his suspicion that indeed it is not about cycles but about "CPU cache lines, registers being spilled, some inlining threshold" or inferior branch prediction etc.

alexcrichton (Member) commented
@Robbepop oh, you're right about the in-repository benchmark regressions; at the time I was only looking at the validation of spidermonkey and ignoring all others. I can reproduce, for example, a 30% regression in parsing all tests. I believe this is due to the stack of blocks now being managed where it wasn't before. If I replace that vector with a simple depth: u32 the regression goes down to 20%. Through some inlining and not reading the locals twice (get_operators_reader skips over locals, as opposed to starting where the locals left off, and this logic changed in the PR in question) I can get this down to 7%. Overall nothing major though.

Can you confirm/deny whether the benchmarks in wasmi are validating? Locally as is somewhat expected the biggest regressions are in "just parse the bytes" tests and the smaller regressions are in "ok also validate" tests. That may help explain the discrepancy in wasmi as well. Overall though wasmi is less prone to inlining issues due to enabling LTO, so only the structural depth: u32 change would help (assuming you're not calling get_operators_reader).

@keithw the benchmarks that @Robbepop is mentioning are located in the wasmi repository and are defined with criterion, which measures wall time across iterations. I'd be wary of measuring regressions with wasm-tools-the-CLI since validation is typically so fast that process setup/teardown will probably contribute a nontrivial portion to the measurements. The benchmarks I've been talking about are in this repository and are run with cargo bench -p wasmparser --bench benchmark.

I'm certainly not opposed to adding more #[inline] myself in any case.
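(For anyone wanting to reproduce these numbers in-process rather than via the CLI, a minimal criterion benchmark could look roughly like the sketch below. The module path is made up, and a criterion 0.5-style API plus wasmparser's Validator::validate_all are assumed.)

```rust
// Minimal criterion benchmark measuring in-process validation wall time,
// avoiding CLI process setup/teardown. Sketch only; the file path is made up.
use criterion::{criterion_group, criterion_main, Criterion};
use wasmparser::Validator;

fn bench_validate(c: &mut Criterion) {
    // Hypothetical path to the module under test.
    let wasm = std::fs::read("benches/spidermonkey.wasm").unwrap();
    c.bench_function("validate/spidermonkey", |b| {
        b.iter(|| {
            let mut validator = Validator::new();
            validator.validate_all(&wasm).unwrap();
        })
    });
}

criterion_group!(benches, bench_validate);
criterion_main!(benches);
```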

Robbepop (Collaborator, Author) commented May 21, 2025

Can you confirm/deny whether the benchmarks in wasmi are validating?

@alexcrichton let me go through the Wasmi benchmarks I pasted above and describe what they do:

| Benchmark | Regression | Description |
| --- | --- | --- |
| eager+checked | ~15% | parses and validates via wasmparser and translates via Wasmi |
| eager+checked+metered | ~14% | parses and validates via wasmparser and translates with fuel metering via Wasmi |
| eager+unchecked | ~18% | parses via wasmparser and translates via Wasmi |
| lazy-translation/checked | ~31% | parses and validates via wasmparser |
| lazy/checked | ~16% | parses via wasmparser |
| lazy/unchecked | ~16% | |

In all above benchmarks the non-function body parts of a Wasm module are entirely parsed and validated via wasmparser.

So what we see is the following:

  • lazy-translation/checked is the most influenced by wasmparser and regresses the most of all.
  • lazy/unchecked, which neither parses nor validates any function body, still regressed by ~16%, so there might be more regressions outside of function-body parsing.
  • lazy/checked only uses wasmparser's parsing but does not validate, and regresses by ~16%, interestingly the same as lazy/unchecked.

If I replace that vector with a simple depth: u32 the regression goes down to 20%. Through some inlining and not reading the locals twice (get_operators_reader skips over locals, as opposed to starting where the locals left off, and this logic changed in the PR in question) I can get this down to 7%.

I suppose if one applies all those optimizations to wasmparser's parser then it would no longer be spec compliant?

Overall though wasmi is less prone to inlining issues due to enabling LTO [..]

I don't think that reasoning holds, as inline annotations also act as hints to the compiler. LTO and inlining hints are generally useful in combination if applied correctly.
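(For context, the benchmark modes in the table above map roughly onto wasmi's module-creation API as in the sketch below. This assumes wasmi's Config::compilation_mode, CompilationMode, and Module::{new, new_unchecked}; exact names and signatures may differ between wasmi versions.)

```rust
// Rough sketch of how the benchmark modes above configure wasmi.
// Assumes wasmi's Config/CompilationMode/Module APIs; names may differ by version.
use wasmi::{CompilationMode, Config, Engine, Error, Module};

fn make_module(wasm: &[u8], lazy_translation: bool, checked: bool) -> Result<Module, Error> {
    let mut config = Config::default();
    config.compilation_mode(if lazy_translation {
        // Parse + validate eagerly via wasmparser, translate function bodies lazily.
        CompilationMode::LazyTranslation
    } else {
        CompilationMode::Eager
    });
    let engine = Engine::new(&config);
    if checked {
        // Parses and validates the Wasm binary via wasmparser.
        Module::new(&engine, wasm)
    } else {
        // Skips validation entirely; the caller must guarantee valid input.
        unsafe { Module::new_unchecked(&engine, wasm) }
    }
}
```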

alexcrichton (Member) commented
Hm, ok, I'm a bit surprised then; in pure validation cases in this repository nothing comes close to the 30% regression you're seeing in lazy-translation/checked. Everything here is in the ~5-10% range for parsing+validation. Could you double-check how wasmparser's API is being used? Maybe an old idiom which used to work well isn't the best way to do it any more? Unsure though...

I suppose if one applies all those optimizations to wasmparser's parser then it would no longer be spec compliant?

Yes and no. This was added to validate the distinction between assert_{malformed,invalid} in the spec test suite. The validator can still catch truly invalid wasms; it's a question of whether those checks also happen during parsing. By using a simple depth counter, the parser leans more heavily on validation to perform more checks (but it's already doing most of those checks anyway).

The non-spec-compliance aspect comes about insofar as wasmparser will parse wasms that are malformed, or technically syntactically invalid. Such invalid wasms will still be caught by the validator, however.

IMO the answer to your question, though, is "no": a simple depth counter would still be spec-compliant. In practice, the main distinction the spec makes is whether a module is valid or not, and if it's not valid it only matters that it's rejected somehow, not precisely where. Running the spec tests, however, is a bit of another matter; that's where this becomes very much a gray area, and my opinion here is neither definitive nor universally held.
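(To illustrate the malformed-vs-invalid distinction being discussed, here is a small sketch using the wat and wasmparser crates. The inputs are made-up examples for illustration, not cases from the spec test suite.)

```rust
// Sketch of the malformed-vs-invalid distinction: "invalid" modules are
// well-formed binaries rejected by the validator, while "malformed" binaries
// are rejected by the parser alone. Inputs below are illustrative assumptions.
use wasmparser::{Parser, Validator};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Invalid: assembles fine, but the function body fails type-checking
    // (declared to return an i32 while its body is empty).
    let invalid = wat::parse_str("(module (func (result i32)))")?;
    assert!(Parser::new(0).parse_all(&invalid).all(|p| p.is_ok()));
    assert!(Validator::new().validate_all(&invalid).is_err());

    // Malformed: a truncated binary that the parser rejects on its own.
    let valid = wat::parse_str("(module (func))")?;
    let malformed = &valid[..valid.len() - 1];
    assert!(Parser::new(0).parse_all(malformed).any(|p| p.is_err()));
    Ok(())
}
```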

alexcrichton added a commit to alexcrichton/wasm-tools that referenced this issue May 22, 2025
This commit relaxes a check added in bytecodealliance#2134 which maintains a stack of
frame kinds in the operators reader, in addition to the validator. The
goal of bytecodealliance#2134 was to ensure that spec-wise-syntactically-invalid-modules
are caught in the parser without the need of the validator, but
investigation in bytecodealliance#2180 has shown that this is a source of at least some
of a performance regression. The change here is to relax the check to
still be able to pass spec tests while making such infrastructure
cheaper.

The reader now maintains just a small `depth: u32` counter instead of a
stack of kinds. This means that the reader can still catch invalid
modules such as instructions-after-`end`, but the validator is required
to handle situations such as `else` outside of an `if` block.

This required some adjustments to tests as well as some workarounds for
the upstream spec tests that assert legacy exception-handling
instructions are malformed, not invalid.
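(Conceptually, the relaxed check described in this commit works roughly like the sketch below. This is an illustration only, not wasmparser's actual internals.)

```rust
// Illustrative sketch of depth-only tracking in an operators reader:
// it can still reject operators that appear after the function's final `end`,
// while checks such as "`else` outside of an `if`" are left to the validator.
enum Op {
    Block,
    Loop,
    If,
    Else,
    End,
    Other,
}

struct DepthOnlyReader {
    // Starts at 1 to account for the implicit function-body frame.
    depth: u32,
}

impl DepthOnlyReader {
    fn new() -> Self {
        Self { depth: 1 }
    }

    fn visit(&mut self, op: &Op) -> Result<(), &'static str> {
        if self.depth == 0 {
            return Err("operators remaining after end of function");
        }
        match op {
            Op::Block | Op::Loop | Op::If => self.depth += 1,
            Op::End => self.depth -= 1,
            // A stray `else` cannot be distinguished with only a counter;
            // the validator reports that case instead.
            Op::Else | Op::Other => {}
        }
        Ok(())
    }

    // Analogue of `finish()`: the final frame must have been closed.
    fn finish(&self) -> Result<(), &'static str> {
        if self.depth == 0 {
            Ok(())
        } else {
            Err("control frames remain at end of function body")
        }
    }
}
```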