Inefficiencies in the arrow-json tape implementation #7156

Closed
mwylde opened this issue Feb 20, 2025 · 0 comments · Fixed by #7157

mwylde commented Feb 20, 2025

We rely heavily on arrow-json, and JSON decoding is often a performance-sensitive part of streaming pipelines. After doing some benchmarking, I was surprised to see that arrow-json was significantly slower than Jackson (a popular Java JSON library). After spending some time profiling, I found some easy wins in the TapeDecoder. Together, they amount to a ~32% average improvement for a diverse set of JSON documents, according to the benchmarks here.

Here's an example profile:

[profile image]

A plurality of the time is spent in BufIter::advance_until

fn advance_until<F: FnMut(u8) -> bool>(&mut self, f: F) -> &[u8] {
    let s = self.as_slice();
    match s.iter().copied().position(f) {
        Some(x) => {
            self.advance(x);
            &s[..x]
        }
        None => {
            self.advance(s.len());
            s
        }
    }
}

But BufIter wraps an Iterator, which makes several of these operations slow; in particular, advance() has to call next() in a loop. Re-implementing BufIter directly on top of a byte buffer and an offset lets all of these operations be implemented more efficiently, for an average 22% improvement.
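
A rough sketch of the idea (illustrative only; names follow the issue text rather than the actual arrow-json code):

// Sketch: a cursor over a byte slice, tracked with an offset instead of an Iterator.
struct BufIter<'a> {
    buf: &'a [u8],
    offset: usize,
}

impl<'a> BufIter<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, offset: 0 }
    }

    // Remaining, unconsumed bytes.
    fn as_slice(&self) -> &'a [u8] {
        &self.buf[self.offset..]
    }

    // Advancing is a single bounded addition rather than calling next() in a loop.
    fn advance(&mut self, n: usize) {
        self.offset = (self.offset + n).min(self.buf.len());
    }

    fn advance_until<F: FnMut(u8) -> bool>(&mut self, f: F) -> &'a [u8] {
        let s = self.as_slice();
        let n = s.iter().copied().position(f).unwrap_or(s.len());
        self.advance(n);
        &s[..n]
    }
}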

We can also improve one of the usages of advance_until, the one that finds the end of a string, which is quite expensive for long strings as currently implemented:

let s = iter.advance_until(|b| matches!(b, b'\\' | b'"'));

By using memchr, a SIMD-accelerated byte-search library, we can get an average 16% improvement.
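
For illustration, that scan could be rewritten on top of memchr2 (the two-needle variant of memchr); this is a sketch assuming the slice-backed BufIter above, not the exact arrow-json change:

use memchr::memchr2;

// Find the end of the current string segment: the next backslash or quote.
// memchr2 searches for either byte using SIMD where available.
fn advance_until_escape_or_quote<'a>(iter: &mut BufIter<'a>) -> &'a [u8] {
    let s = iter.as_slice();
    let n = memchr2(b'\\', b'"', s).unwrap_or(s.len());
    iter.advance(n);
    &s[..n]
}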

Another big cost for string-heavy documents is UTF-8 validation, and we can get some quick wins there by using simdutf8 (which has already been discussed in other contexts, e.g. in #7014). This is good for about a 5% improvement.
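
As a minimal sketch, validation can be routed through simdutf8's basic API, which skips detailed error information and is the fastest variant when only validity matters (the actual integration point in the tape decoder may differ):

// Validate a byte slice as UTF-8 using SIMD acceleration via the simdutf8 crate.
fn validate_utf8(bytes: &[u8]) -> Result<&str, simdutf8::basic::Utf8Error> {
    simdutf8::basic::from_utf8(bytes)
}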

Altogether, these changes improve performance in my benchmarks by 25-39%, averaging 32%.

There are also some other opportunities to improve. More operations could be vectorized, in particular skipping whitespace (see the sketch below). Another major cost is pushing strings and numbers into the buffer one by one; it's much faster to copy the entire input into the buffer at the start, although this costs extra memory to store whitespace and other tokens.
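
For reference, the whitespace-skipping loop being targeted looks roughly like the scalar version below (skip_json_whitespace is a hypothetical helper, not arrow-json API); a vectorized version would test a chunk of 16 or 32 bytes per iteration instead of one byte at a time:

// Skip the four JSON whitespace bytes: space, tab, newline, carriage return.
fn skip_json_whitespace(iter: &mut BufIter<'_>) {
    let s = iter.as_slice();
    let n = s
        .iter()
        .position(|&b| !matches!(b, b' ' | b'\t' | b'\n' | b'\r'))
        .unwrap_or(s.len());
    iter.advance(n);
}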
