
Refactor ParquetExec in preparation for implementing parallel scans for statistics #897


Closed
wants to merge 2 commits

Conversation

andygrove
Member

Which issue does this PR close?

Closes #896.

Rationale for this change

Refactor in preparation for making partition scans run in parallel.

What changes are included in this PR?

Refactor to move some logic from ParquetExec down to ParquetPartition

Are there any user-facing changes?

No.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 16, 2021
// collect statistics for all partitions - we should really do this in parallel threads
for chunk in chunks {
    let filenames: Vec<String> = chunk.iter().map(|x| x.to_string()).collect();
    partitions.push(ParquetPartition::try_from_files(filenames, limit)?);
}
Member Author

The plan is to use tokio::spawn here but this requires making more methods async so I wanted to tackle that as a separate PR.
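A minimal sketch of the intended pattern, using `std::thread::spawn` in place of the planned `tokio::spawn` (the real code would be async); `try_from_files` here is a hypothetical stand-in for the PR's `ParquetPartition::try_from_files`, reduced to returning a placeholder "statistic":

```rust
use std::thread;

// Hypothetical stand-in for ParquetPartition::try_from_files, which in the
// PR gathers statistics for a chunk of files. Here it just returns the
// number of files as a placeholder.
fn try_from_files(filenames: Vec<String>) -> Result<usize, String> {
    Ok(filenames.len())
}

// Spawn one worker per chunk so statistics are collected in parallel,
// then join the handles to gather results in the original chunk order.
fn collect_stats_parallel(chunks: Vec<Vec<String>>) -> Result<Vec<usize>, String> {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| thread::spawn(move || try_from_files(chunk)))
        .collect();
    handles
        .into_iter()
        .map(|h| match h.join() {
            Ok(res) => res,
            Err(_) => Err("worker panicked".to_string()),
        })
        .collect()
}
```

With `tokio::spawn` the shape is the same, but the worker closure becomes an async block and the joins become `.await`s, which is why the surrounding methods would need to become async first.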

Contributor

@alamb alamb left a comment

The idea looks very good to me @andygrove 👍

This may conflict with https://github.com/apache/arrow-datafusion/pull/811/files from @yjshen

I think the comment about truncating the schemas should probably be addressed before merging, but I don't see any bug it would cause at present, so I would be fine with this PR going in as is.

@@ -582,14 +389,244 @@ impl ParquetExec {

impl ParquetPartition {
/// Create a new parquet partition
pub fn new(filenames: Vec<String>, statistics: Statistics) -> Self {
pub fn new(
filenames: Vec<String>,
Contributor

I wonder if having a Vec<FilenameAndSchema> would be clearer than two parallel arrays (and then they could not get out of sync either)
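A sketch of the suggested shape; `FilenameAndSchema` is the reviewer's hypothetical name, and the schema field is simplified to a `String` here (the real code would hold an Arrow schema):

```rust
// Pair each filename with its schema in one struct instead of keeping
// two parallel Vecs that can drift out of sync.
#[derive(Debug, Clone, PartialEq)]
struct FilenameAndSchema {
    filename: String,
    // Simplified: the real code would hold an Arrow SchemaRef.
    schema: String,
}

// Construction keeps the pairing by design: any operation (truncate,
// filter, sort) on the Vec moves filename and schema together.
fn pair_up(filenames: Vec<String>, schemas: Vec<String>) -> Vec<FilenameAndSchema> {
    filenames
        .into_iter()
        .zip(schemas)
        .map(|(filename, schema)| FilenameAndSchema { filename, schema })
        .collect()
}
```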

Member

It seems like the PartitionedFile abstraction proposed by @yjshen in https://github.com/apache/arrow-datafusion/pull/811/files#diff-72f3a52c56e83e00d8c605d461f092617a3c205619376bb373069c662f9cfc93R54 would help solve this problem?

};
// remove files that are not needed in case of limit
let mut filenames = filenames;
filenames.truncate(total_files);
Contributor

do you need to truncate schemas as well?

Member

looks like it.
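The fix the reviewers agree on can be sketched as follows; the names mirror the snippet under review (`total_files` from the limit handling), with the schema type simplified to `String`:

```rust
// Sketch of the agreed fix: when a LIMIT means only `total_files` files
// are needed, both parallel Vecs must be truncated together, otherwise
// filenames and schemas drift out of sync.
fn apply_limit(
    mut filenames: Vec<String>,
    mut schemas: Vec<String>, // simplified; the real code holds schemas
    total_files: usize,
) -> (Vec<String>, Vec<String>) {
    filenames.truncate(total_files);
    schemas.truncate(total_files); // the truncation the review asked for
    (filenames, schemas)
}
```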

@yjshen
Member

yjshen commented Aug 18, 2021

Sorry, I didn't notice #896.

Actually, I was addressing the same issue (async and parallel parquet stats listing) pointed out by @rdettai in #811: https://github.com/apache/arrow-datafusion/pull/811/files#diff-72f3a52c56e83e00d8c605d461f092617a3c205619376bb373069c662f9cfc93R189-R223. Could you please take a look at that PR if you have time?

@andygrove andygrove marked this pull request as draft August 19, 2021 14:33
@andygrove
Member Author

@yjshen I have changed this PR to a draft and will hold off working on it for now. I will review your PR when I have time, probably at the weekend. It looks like you are farther along than I was with this.

@yjshen
Member

yjshen commented Aug 19, 2021

Thanks, @andygrove. It will be great to have your help.

As you mentioned here in the comment:

The plan is to use tokio::spawn here but this requires making more methods async so I wanted to tackle that as a separate PR.

I ran into the same situation while handling the async listing and reading. We may need to decide how async should be propagated through the API: limit async to remote storage access, or change the user-facing API. Looking forward to hearing from you. :P

@andygrove
Member Author

Closing this in favor of the work happening in #811

@andygrove andygrove closed this Aug 21, 2021
@andygrove andygrove deleted the refactor-parquet-exec branch February 6, 2022 17:42
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
Successfully merging this pull request may close these issues.

Refactor ParquetExec::try_from_files in preparation for making it parallel
4 participants