Inefficiencies in the arrow-json tape implementation #7156

Closed
mwylde opened this issue Feb 20, 2025 · 0 comments · Fixed by #7157

mwylde commented Feb 20, 2025

We rely heavily on arrow-json, and JSON decoding is often a performance-sensitive part of streaming pipelines. After doing some benchmarking, I was surprised to see that arrow-json was significantly slower than Jackson (a popular Java JSON library). After spending some time profiling, I found some easy wins in the TapeDecoder. Together, they amount to a ~32% average improvement for a diverse set of JSON documents, according to the benchmarks here.

Here's an example profile:

[profile image]

A plurality of the time is spent in BufIter::advance_until

fn advance_until<F: FnMut(u8) -> bool>(&mut self, f: F) -> &[u8] {
    let s = self.as_slice();
    match s.iter().copied().position(f) {
        Some(x) => {
            self.advance(x);
            &s[..x]
        }
        None => {
            self.advance(s.len());
            s
        }
    }
}

But BufIter wraps an Iterator, which makes several of these operations slow; in particular, advance() has to call next() in a loop. Re-implementing BufIter directly on top of a byte buffer and an offset lets all of these operations be implemented more efficiently, for an average 22% improvement.
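
A rough sketch of the idea (illustrative only; names follow the issue text rather than the actual arrow-json code):

// Sketch: a cursor over a byte slice, tracked with an offset instead of an Iterator.
struct BufIter<'a> {
    buf: &'a [u8],
    offset: usize,
}

impl<'a> BufIter<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, offset: 0 }
    }

    // Remaining, unconsumed bytes.
    fn as_slice(&self) -> &'a [u8] {
        &self.buf[self.offset..]
    }

    // Advancing is a single bounded addition rather than calling next() in a loop.
    fn advance(&mut self, n: usize) {
        self.offset = (self.offset + n).min(self.buf.len());
    }

    fn advance_until<F: FnMut(u8) -> bool>(&mut self, f: F) -> &'a [u8] {
        let s = self.as_slice();
        let n = s.iter().copied().position(f).unwrap_or(s.len());
        self.advance(n);
        &s[..n]
    }
}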

We can also improve one of the usages of advance_until, the one that finds the end of a string, which is quite expensive for long strings as currently implemented:

let s = iter.advance_until(|b| matches!(b, b'\\' | b'"'));

By using memchr, a SIMD-accelerated byte-search library, we can get an average 16% improvement.
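
For illustration, that scan could be rewritten on top of memchr2 (the two-needle variant of memchr); this is a sketch assuming the slice-backed BufIter above, not the exact arrow-json change:

use memchr::memchr2;

// Find the end of the current string segment: the next backslash or quote.
// memchr2 searches for either byte using SIMD where available.
fn advance_until_escape_or_quote<'a>(iter: &mut BufIter<'a>) -> &'a [u8] {
    let s = iter.as_slice();
    let n = memchr2(b'\\', b'"', s).unwrap_or(s.len());
    iter.advance(n);
    &s[..n]
}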

Another big cost for string-heavy documents is UTF-8 validation, and we can get some quick wins there by using simdutf8 (which has already been discussed in other contexts, e.g. in #7014). This is good for about a 5% improvement.
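
As a minimal sketch, validation can be routed through simdutf8's basic API, which skips detailed error information and is the fastest variant when only validity matters (the actual integration point in the tape decoder may differ):

// Validate a byte slice as UTF-8 using SIMD acceleration via the simdutf8 crate.
fn validate_utf8(bytes: &[u8]) -> Result<&str, simdutf8::basic::Utf8Error> {
    simdutf8::basic::from_utf8(bytes)
}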

Altogether, these changes improve performance in my benchmarks by 25-39%, averaging 32%.

There are also some other opportunities to improve. More operations could be vectorized, in particular skipping whitespace (see the sketch below). Another major cost is pushing strings and numbers into the buffer one by one; it's much faster to copy the entire input into the buffer at the start, although this costs extra memory to store whitespace and other tokens.
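
For reference, the whitespace-skipping loop being targeted looks roughly like the scalar version below (skip_json_whitespace is a hypothetical helper, not arrow-json API); a vectorized version would test a chunk of 16 or 32 bytes per iteration instead of one byte at a time:

// Skip the four JSON whitespace bytes: space, tab, newline, carriage return.
fn skip_json_whitespace(iter: &mut BufIter<'_>) {
    let s = iter.as_slice();
    let n = s
        .iter()
        .position(|&b| !matches!(b, b' ' | b'\t' | b'\n' | b'\r'))
        .unwrap_or(s.len());
    iter.advance(n);
}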
