feat: take kernel for RunArray #3622

Merged
merged 18 commits into from
Feb 6, 2023

@askoa askoa commented Jan 28, 2023

Which issue does this PR close?

Part of #3520

Rationale for this change

See issue description.

What changes are included in this PR?

  • Take kernel support for primitive run array
  • Benchmark for take kernel for primitive run array
  • Benchmark for primitive run array accessor.
  • Additional test for RunArrayIter.

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 28, 2023

askoa commented Jan 28, 2023

The current iteration of the take_run kernel is substantially slower than the other take kernels, mostly because it uses a binary search to determine the physical indices. I am working on an alternative approach and will update here with the outcome.
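For readers unfamiliar with the lookup being discussed, here is a minimal sketch of a binary search from a logical index to a physical index. It is not the code in this PR: it works on a plain `&[usize]` slice instead of the arrow run-ends array, and the function name `physical_index_binary_search` is hypothetical. It assumes `run_ends[i]` holds the exclusive logical end of run `i`, so the slice is strictly increasing.

```rust
/// Maps a logical index to the index of the run that contains it.
/// Assumption: `run_ends[i]` is the exclusive logical end of run `i`.
fn physical_index_binary_search(run_ends: &[usize], logical_index: usize) -> Option<usize> {
    // `partition_point` binary-searches for the first run whose end is
    // strictly greater than the logical index.
    let run = run_ends.partition_point(|&end| end <= logical_index);
    if run < run_ends.len() {
        Some(run)
    } else {
        // The logical index lies past the last run.
        None
    }
}
```

Calling this once per take index costs one binary search per index, which is the per-element overhead the comment above refers to.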

Take kernel benchmarks
take i32 512            time:   [645.79 ns 656.51 ns 670.97 ns]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

take i32 1024           time:   [934.35 ns 954.59 ns 983.22 ns]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

take check bounds i32 512
                        time:   [929.47 ns 952.88 ns 992.80 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

take check bounds i32 1024
                        time:   [1.5971 µs 1.6644 µs 1.7631 µs]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

take i32 nulls 512      time:   [728.49 ns 750.21 ns 775.88 ns]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

take i32 nulls 1024     time:   [1.1009 µs 1.1676 µs 1.2571 µs]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

take bool 512           time:   [1.8112 µs 1.8761 µs 1.9625 µs]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

take bool 1024          time:   [4.2973 µs 4.3593 µs 4.4482 µs]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

take bool nulls 512     time:   [2.2029 µs 2.2717 µs 2.3672 µs]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

take bool nulls 1024    time:   [4.2868 µs 4.4366 µs 4.6310 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

take str 512            time:   [6.1369 µs 6.3286 µs 6.5568 µs]
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

take str 1024           time:   [10.398 µs 10.606 µs 10.840 µs]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

take str null indices 512
                        time:   [6.6720 µs 7.0256 µs 7.4753 µs]
Found 13 outliers among 100 measurements (13.00%)
  10 (10.00%) high mild
  3 (3.00%) high severe

take str null indices 1024
                        time:   [12.488 µs 12.707 µs 12.960 µs]
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

take str null values 1024
                        time:   [14.919 µs 15.238 µs 15.622 µs]
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe

take str null values null indices 1024
                        time:   [12.459 µs 12.800 µs 13.165 µs]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe

take primitive run logical len: 1024, physical len: 512, indices: 1024
                        time:   [57.119 µs 58.460 µs 59.983 µs]
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe


askoa commented Jan 29, 2023

I added a different approach to find the physical indices for the given logical indices. This approach sorts the input logical indices and then loops through the run_ends array once instead of doing a binary search per index.
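The sort-then-scan idea described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: it uses plain `usize` slices rather than arrow arrays, the name `physical_indices_by_scan` is hypothetical, and `run_ends[i]` is assumed to be the exclusive logical end of run `i`.

```rust
/// Resolves every logical index to its physical (run) index with a single
/// forward pass over `run_ends`. Returns `None` if any index is out of bounds.
fn physical_indices_by_scan(run_ends: &[usize], logical_indices: &[usize]) -> Option<Vec<usize>> {
    // Sort the positions of the requested indices by logical index so the
    // results can be written back in the caller's original order.
    let mut order: Vec<usize> = (0..logical_indices.len()).collect();
    order.sort_unstable_by_key(|&pos| logical_indices[pos]);

    let mut physical = vec![0usize; logical_indices.len()];
    let mut run = 0;
    for pos in order {
        let logical = logical_indices[pos];
        // Because the indices are visited in ascending order, `run` only
        // ever moves forward.
        while run < run_ends.len() && run_ends[run] <= logical {
            run += 1;
        }
        if run == run_ends.len() {
            return None; // logical index past the last run
        }
        physical[pos] = run;
    }
    Some(physical)
}
```

The sort costs O(n log n) in the number of take indices, after which the scan touches each run end at most once. That trade only pays off when there are enough indices to amortize the sort.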

I ran benchmarks comparing the two approaches. When the physical array and the take indices are longer, the loop-based approach is the clear winner, with over 30% improvement over the binary search approach. For smaller inputs, however, the binary search approach performs better. The results are below; the feature take_run_loop selects the loop-based approach.

cargo bench --bench primitive_run_take -- --save-baseline="take_run"

cargo bench --features="take_run_loop" --bench primitive_run_take -- --baseline="take_run"
Benchmark result

primitive_run_take/(run_array_len:1024, physical_array_len:512, take_len:4)
                        time:   [2.1093 µs 2.1189 µs 2.1308 µs]
                        change: [-0.7829% -0.2254% +0.3476%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
primitive_run_take/(run_array_len:1024, physical_array_len:512, take_len:16)
                        time:   [2.4320 µs 2.4392 µs 2.4484 µs]
                        change: [-3.0738% -2.1892% -1.3544%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:512, take_len:64)
                        time:   [3.8899 µs 3.8999 µs 3.9125 µs]
                        change: [+3.5299% +4.5620% +5.4698%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
primitive_run_take/(run_array_len:512, physical_array_len:64, take_len:256)
                        time:   [9.4130 µs 9.4375 µs 9.4719 µs]
                        change: [+14.863% +17.956% +20.601%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
primitive_run_take/(run_array_len:512, physical_array_len:128, take_len:512)
                        time:   [16.928 µs 16.995 µs 17.081 µs]
                        change: [+3.4116% +4.2677% +5.0618%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) high mild
  14 (14.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:256, take_len:128)
                        time:   [5.7324 µs 5.7492 µs 5.7709 µs]
                        change: [+11.442% +11.876% +12.367%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:512, take_len:1024)
                        time:   [40.290 µs 40.403 µs 40.548 µs]
                        change: [-30.546% -30.184% -29.833%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe
primitive_run_take/(run_array_len:2048, physical_array_len:1024, take_len:1024)
                        time:   [40.288 µs 40.376 µs 40.488 µs]
                        change: [-33.022% -32.420% -31.912%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe


askoa commented Jan 29, 2023

This continues the previous comment #3622 (comment)

I looked at the parameters used by the current take benchmarks. They typically have array_len = 512/1024 and take_len = 512/1024. I updated the primitive_run_take benchmarks to use similar parameters; the results are below. The loop-based approach looks good here, as it works well for larger arrays and larger numbers of take indices.

Based on this result, I am removing the binary search based approach and using the loop-based approach for take_run. However, I think there is potential for a future optimization where the kernel automatically chooses between the two approaches based on the input parameters.
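One possible shape for that future optimization is a small dispatcher, sketched below. Everything here is hypothetical and not part of this PR: the `Thresholds` struct, the function names, and the idea of switching on the number of take indices and runs are assumptions for illustration, and the cut-off values would have to come from benchmarks. The sketch uses plain `usize` slices and assumes `run_ends[i]` is the exclusive logical end of run `i`.

```rust
/// Hypothetical tuning knobs; real values must be chosen by benchmarking.
struct Thresholds {
    min_take_len: usize,
    min_physical_len: usize,
}

/// One binary search per logical index.
fn by_binary_search(run_ends: &[usize], logical: &[usize]) -> Option<Vec<usize>> {
    logical
        .iter()
        .map(|&l| {
            let run = run_ends.partition_point(|&end| end <= l);
            (run < run_ends.len()).then_some(run)
        })
        .collect()
}

/// Sort the indices, then make a single forward pass over `run_ends`.
fn by_sorted_scan(run_ends: &[usize], logical: &[usize]) -> Option<Vec<usize>> {
    let mut order: Vec<usize> = (0..logical.len()).collect();
    order.sort_unstable_by_key(|&pos| logical[pos]);
    let mut out = vec![0; logical.len()];
    let mut run = 0;
    for pos in order {
        while run < run_ends.len() && run_ends[run] <= logical[pos] {
            run += 1;
        }
        if run == run_ends.len() {
            return None; // out-of-bounds logical index
        }
        out[pos] = run;
    }
    Some(out)
}

/// Uses the scan only when both inputs are at least as long as the thresholds.
fn physical_indices(run_ends: &[usize], logical: &[usize], t: &Thresholds) -> Option<Vec<usize>> {
    if logical.len() >= t.min_take_len && run_ends.len() >= t.min_physical_len {
        by_sorted_scan(run_ends, logical)
    } else {
        by_binary_search(run_ends, logical)
    }
}
```

Both branches return the same result for the same input, so such a switch would affect only speed. Whether the extra branch and tuning burden are worthwhile is an open question that the benchmarks above do not settle.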

Benchmark result
primitive_run_take/(run_array_len:512, physical_array_len:64, take_len:512)
                        time:   [17.302 µs 17.449 µs 17.654 µs]
                        change: [+8.7767% +10.098% +11.338%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
primitive_run_take/(run_array_len:512, physical_array_len:128, take_len:512)
                        time:   [17.218 µs 17.307 µs 17.446 µs]
                        change: [-5.5015% -4.1807% -3.1203%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:256, take_len:512)
                        time:   [17.184 µs 17.283 µs 17.420 µs]
                        change: [-13.514% -12.644% -11.789%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
primitive_run_take/(run_array_len:1024, physical_array_len:256, take_len:1024)
                        time:   [38.126 µs 38.393 µs 38.728 µs]
                        change: [-29.471% -28.762% -27.898%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
primitive_run_take/(run_array_len:2048, physical_array_len:512, take_len:512)
                        time:   [17.216 µs 17.297 µs 17.421 µs]
                        change: [-22.495% -21.951% -21.439%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
primitive_run_take/(run_array_len:2048, physical_array_len:512, take_len:1024)
                        time:   [38.055 µs 38.322 µs 38.670 µs]
                        change: [-34.420% -33.418% -32.326%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
primitive_run_take/(run_array_len:4096, physical_array_len:1024, take_len:512)
                        time:   [17.353 µs 17.430 µs 17.537 µs]
                        change: [-27.334% -25.704% -23.314%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
primitive_run_take/(run_array_len:4096, physical_array_len:1024, take_len:1024)
                        time:   [38.316 µs 38.830 µs 39.448 µs]
                        change: [-32.168% -29.402% -25.753%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

/// Returns index to the physical array for the given index to the logical array.
/// Performs a binary search on the run_ends array for the input index.
#[inline]
pub fn get_physical_index(&self, logical_index: usize) -> Option<usize> {

This function was moved from TypedRunArray to RunArray because it does not depend on the values.

@askoa askoa marked this pull request as ready for review February 4, 2023 12:31
@tustvold tustvold left a comment

This looks good to me, thank you for this. I also confirmed that it does not appear to meaningfully impact compile times, which is 👌


askoa commented Feb 4, 2023

The CI failure seems unrelated.

@tustvold tustvold merged commit 9131c30 into apache:master Feb 6, 2023
@askoa askoa deleted the run-array-take branch February 6, 2023 11:09

ursabot commented Feb 6, 2023

Benchmark runs are scheduled for baseline = 4835659 and contender = 9131c30. 9131c30 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@askoa askoa mentioned this pull request Feb 12, 2023