feat: take kernel for RunArray #3622

askoa · 2023-01-28T21:05:00Z

Which issue does this PR close?

Part of #3520

Rationale for this change

See issue description.

What changes are included in this PR?

Take kernel support for primitive run array
Benchmark for take kernel for primitive run array
Benchmark for primitive run array accessor.
Additional test for RunArrayIter.

Are there any user-facing changes?

arrow-select/src/take.rs

askoa · 2023-01-28T22:59:07Z

The current iteration of the take_run kernel is substantially slower than the other take kernels, mostly because it uses a binary search to determine the indices. I am working on an alternative approach and will post the outcome here.

Take kernel benchmarks

take i32 512            time:   [645.79 ns 656.51 ns 670.97 ns]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

take i32 1024           time:   [934.35 ns 954.59 ns 983.22 ns]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

take check bounds i32 512
                        time:   [929.47 ns 952.88 ns 992.80 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

take check bounds i32 1024
                        time:   [1.5971 µs 1.6644 µs 1.7631 µs]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

take i32 nulls 512      time:   [728.49 ns 750.21 ns 775.88 ns]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

take i32 nulls 1024     time:   [1.1009 µs 1.1676 µs 1.2571 µs]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

take bool 512           time:   [1.8112 µs 1.8761 µs 1.9625 µs]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

take bool 1024          time:   [4.2973 µs 4.3593 µs 4.4482 µs]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

take bool nulls 512     time:   [2.2029 µs 2.2717 µs 2.3672 µs]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

take bool nulls 1024    time:   [4.2868 µs 4.4366 µs 4.6310 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

take str 512            time:   [6.1369 µs 6.3286 µs 6.5568 µs]
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

take str 1024           time:   [10.398 µs 10.606 µs 10.840 µs]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

take str null indices 512
                        time:   [6.6720 µs 7.0256 µs 7.4753 µs]
Found 13 outliers among 100 measurements (13.00%)
  10 (10.00%) high mild
  3 (3.00%) high severe

take str null indices 1024
                        time:   [12.488 µs 12.707 µs 12.960 µs]
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

take str null values 1024
                        time:   [14.919 µs 15.238 µs 15.622 µs]
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe

take str null values null indices 1024
                        time:   [12.459 µs 12.800 µs 13.165 µs]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe

take primitive run logical len: 1024, physical len: 512, indices: 1024
                        time:   [57.119 µs 58.460 µs 59.983 µs]
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

askoa · 2023-01-29T14:37:57Z

I added a different approach for finding the physical indices for the given logical indices. This approach sorts the input logical indices and then loops through the run_ends array instead of doing a binary search.
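The sort-then-loop idea can be sketched roughly as follows. This is a minimal stand-alone illustration, not the code from this PR: it assumes run_ends is a plain, strictly increasing slice of exclusive end offsets, and the function name `physical_indices` and its signature are hypothetical.

```rust
/// Hypothetical helper (not the kernel from this PR): maps each logical index
/// to the physical index of the run that contains it.
/// `run_ends[i]` is assumed to be the exclusive logical end of run `i`.
/// Returns `None` if any logical index is past the end of the array.
fn physical_indices(run_ends: &[usize], logical_indices: &[usize]) -> Option<Vec<usize>> {
    // Sort the *positions* of the requested indices by the logical index they
    // hold, so results can be written back in the caller's original order.
    let mut order: Vec<usize> = (0..logical_indices.len()).collect();
    order.sort_unstable_by_key(|&pos| logical_indices[pos]);

    let mut result = vec![0usize; logical_indices.len()];
    let mut physical = 0;
    for pos in order {
        let logical = logical_indices[pos];
        // Advance the run cursor. It never moves backwards because the
        // logical indices are visited in ascending order.
        while physical < run_ends.len() && run_ends[physical] <= logical {
            physical += 1;
        }
        if physical == run_ends.len() {
            return None; // logical index is out of bounds
        }
        result[pos] = physical;
    }
    Some(result)
}
```

With run ends `[2, 5, 6]` (runs covering logical positions 0..2, 2..5 and 5..6), the logical indices `[5, 0, 3, 1, 4]` map to the physical indices `[2, 0, 1, 0, 1]`. The sort costs extra work up front, which may be one reason this style of lookup only pays off for larger inputs.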

I ran benchmarks comparing the two approaches. When the physical array and the take indices are longer, the loop-based approach is the clear winner, with over 30% improvement compared to the binary search approach. For smaller inputs, however, the binary search approach seems to perform better. The results are below; in them, the feature take_run_loop uses the loop-based approach.

cargo bench --bench primitive_run_take -- --save-baseline="take_run"

cargo bench --features="take_run_loop" --bench primitive_run_take -- --baseline="take_run"

Benchmark result


primitive_run_take/(run_array_len:1024, physical_array_len:512, take_len:4)
                        time:   [2.1093 µs 2.1189 µs 2.1308 µs]
                        change: [-0.7829% -0.2254% +0.3476%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
primitive_run_take/(run_array_len:1024, physical_array_len:512, take_len:16)
                        time:   [2.4320 µs 2.4392 µs 2.4484 µs]
                        change: [-3.0738% -2.1892% -1.3544%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:512, take_len:64)
                        time:   [3.8899 µs 3.8999 µs 3.9125 µs]
                        change: [+3.5299% +4.5620% +5.4698%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
primitive_run_take/(run_array_len:512, physical_array_len:64, take_len:256)
                        time:   [9.4130 µs 9.4375 µs 9.4719 µs]
                        change: [+14.863% +17.956% +20.601%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
primitive_run_take/(run_array_len:512, physical_array_len:128, take_len:512)
                        time:   [16.928 µs 16.995 µs 17.081 µs]
                        change: [+3.4116% +4.2677% +5.0618%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) high mild
  14 (14.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:256, take_len:128)
                        time:   [5.7324 µs 5.7492 µs 5.7709 µs]
                        change: [+11.442% +11.876% +12.367%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:512, take_len:1024)
                        time:   [40.290 µs 40.403 µs 40.548 µs]
                        change: [-30.546% -30.184% -29.833%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe
primitive_run_take/(run_array_len:2048, physical_array_len:1024, take_len:1024)
                        time:   [40.288 µs 40.376 µs 40.488 µs]
                        change: [-33.022% -32.420% -31.912%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

askoa · 2023-01-29T19:48:01Z

This continues the previous comment #3622 (comment)

I looked at the existing take benchmarks to see which parameters they use. They typically have array_len = 512/1024 and take_len = 512/1024. I updated the primitive_run_take benchmarks to use similar parameters, and the results are below. The results look good for the loop-based approach, as it works well for larger arrays and take lengths.

Based on these results, I am removing the binary search based approach and using the loop-based approach for take_run. However, I think there is potential for a future optimization in which the program automatically chooses between the two approaches based on the input parameters.

Benchmark result

primitive_run_take/(run_array_len:512, physical_array_len:64, take_len:512)
                        time:   [17.302 µs 17.449 µs 17.654 µs]
                        change: [+8.7767% +10.098% +11.338%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
primitive_run_take/(run_array_len:512, physical_array_len:128, take_len:512)
                        time:   [17.218 µs 17.307 µs 17.446 µs]
                        change: [-5.5015% -4.1807% -3.1203%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:256, take_len:512)
                        time:   [17.184 µs 17.283 µs 17.420 µs]
                        change: [-13.514% -12.644% -11.789%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:256, take_len:1024)
                        time:   [38.126 µs 38.393 µs 38.728 µs]
                        change: [-29.471% -28.762% -27.898%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
primitive_run_take/(run_array_len:2048, physical_array_len:512, take_len:512)
                        time:   [17.216 µs 17.297 µs 17.421 µs]
                        change: [-22.495% -21.951% -21.439%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
primitive_run_take/(run_array_len:2048, physical_array_len:512, take_len:1024)
                        time:   [38.055 µs 38.322 µs 38.670 µs]
                        change: [-34.420% -33.418% -32.326%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
primitive_run_take/(run_array_len:4096, physical_array_len:1024, take_len:512)
                        time:   [17.353 µs 17.430 µs 17.537 µs]
                        change: [-27.334% -25.704% -23.314%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
primitive_run_take/(run_array_len:4096, physical_array_len:1024, take_len:1024)
                        time:   [38.316 µs 38.830 µs 39.448 µs]
                        change: [-32.168% -29.402% -25.753%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

askoa · 2023-02-04T12:25:33Z

arrow-array/src/array/run_array.rs

+    /// Returns index to the physical array for the given index to the logical array.
+    /// Performs a binary search on the run_ends array for the input index.
+    #[inline]
+    pub fn get_physical_index(&self, logical_index: usize) -> Option<usize> {


This function was moved from TypedRunArray to RunArray because it has nothing to do with the values.
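For readers unfamiliar with the lookup described in the doc comment above, here is a minimal stand-alone sketch of a binary search over run ends. It assumes run_ends is a plain, strictly increasing slice of exclusive end offsets; the free function `physical_index` is hypothetical and is not the RunArray method itself.

```rust
/// Hypothetical stand-alone helper (not the `RunArray` method from this PR).
/// `run_ends[i]` is assumed to be the exclusive logical end of run `i`, so the
/// slice is strictly increasing and its last element is the logical length.
fn physical_index(run_ends: &[usize], logical_index: usize) -> Option<usize> {
    // An empty array, or an index at or past the logical length, has no
    // physical counterpart.
    if logical_index >= *run_ends.last()? {
        return None;
    }
    // Binary search for the first run whose end is strictly greater than the
    // logical index; that run contains the index.
    Some(run_ends.partition_point(|&end| end <= logical_index))
}
```

With run ends `[2, 5, 6]`, logical indices 0 and 1 fall in physical run 0, indices 2 to 4 in run 1, and index 5 in run 2. Because the lookup depends only on the run ends, a sketch like this needs no access to the values.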

tustvold

This looks good to me, thank you for this. I also confirmed that it does not appear to meaningfully impact compile times, which is 👌

arrow-array/src/array/run_array.rs

arrow-array/src/cast.rs

arrow/src/util/bench_util.rs

askoa · 2023-02-04T15:15:44Z

The CI failure seems unrelated.

ursabot · 2023-02-06T11:11:36Z

Benchmark runs are scheduled for baseline = 4835659 and contender = 9131c30. 9131c30 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the arrow Changes to the arrow crate label Jan 28, 2023

tustvold reviewed Jan 28, 2023

arrow-select/src/take.rs (outdated, resolved)

askoa mentioned this pull request Jan 28, 2023

Add ArrayAccessor, Iterator, Extend and benchmarks for RunArray #3603

Merged

tustvold reviewed Jan 28, 2023

arrow-select/src/take.rs (resolved)

ask added 15 commits February 4, 2023 06:42

Rebase to master branch

03e6365

Add take_run kernel and include benchmarks for take_run and primitive…

5d7eeaa

… run accessor.

fix ci issues

ef648c0

fix ci issues

d2c5e26

fix clippy issues

9b1cb28

fix clippy

e93386a

Alternative approach to find physical indices for given logical indices

ba33d94

Remove unused code, refactor benchmarks

9c8e9a6

minor fixes

b7751c7

some refactor

8a65327

add some comments

a048f70

doc fixes

8ebcf7f

change benchmkar parameters

a1aac10

add test for run_iterator

1a406c0

add comments

be661be

askoa commented Feb 4, 2023

askoa marked this pull request as ready for review February 4, 2023 12:31

tustvold approved these changes Feb 4, 2023

ask added 2 commits February 4, 2023 09:26

Fix some PR suggestions

df7f5e9

incorporte pr suggestion

c815f21

askoa mentioned this pull request Feb 5, 2023

feat + fix: IPC support for run encoded array. #3662

Merged

Merge remote-tracking branch 'upstream/master' into run-array-take

543d69e

tustvold merged commit 9131c30 into apache:master Feb 6, 2023

askoa deleted the run-array-take branch February 6, 2023 11:09

askoa mentioned this pull request Feb 12, 2023

take_run improvements #3701

Closed
