Skip to content

Factor out Substrait consumers into separate files #15794

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gabotechs
Copy link
Contributor

@gabotechs gabotechs commented Apr 21, 2025

Which issue does this PR close?

Rationale for this change

The consumer.rs file grew a bit too big (~3400 LOC). Good thing is that it's easily splittable into separate files, each one responsible for converting one Substrait node into one DataFusion Logical plan node. With this change, people can just go to the file that they care about greatly reducing the amount of information that they need to deal with.

What changes are included in this PR?

A refactor of Substrait consumer.rs file into multiple files following these rules:

  • Every from_* prefixed function responsible for converting one Substrait node into one DataFusion Logical plan node is now factored out into its own file named after the original Substrait node (e.g. cast.rs, literal.rs, aggregate_rel.rs)
  • Every helper function that was only used once for a specific node conversion is moved to the same file that holds the node conversion function
  • Every helper function that is used by two or more node conversion functions is moved to a utils.rs file
  • The visibility rules of the functions to the outside is left intact (with a small exception, see below), making proper use of pub(super) for functions that now need to be shared across different files.
  • All the function names and function bodies are copy-pasted exactly as they were

There's one subtle public API change that might be nice to keep:

$ cargo public-api diff

Removed items from the public API
=================================
(none)

Changed items in the public API
===============================
(none)

Added items to the public API
=============================
+pub fn datafusion_substrait::logical_plan::consumer::from_substrait_type

That one should probably have been public from the beginning, and not exposing it now gets a bit messy as it's used outside of the consume module in the producer.rs tests, and hiding it to the outside would require introducing some #[cfg(test)] to the module definitions. I can hide though if people one a perfect no-api-change refactor.

Are these changes tested?

yes, by current pipelines.

Are there any user-facing changes?

The from_substrait_type function is now exposed

@github-actions github-actions bot added the substrait Changes to the substrait crate label Apr 21, 2025
@gabotechs gabotechs force-pushed the factor-out-substrait-consumer-into-files branch 2 times, most recently from 4bfa3a8 to d974ee1 Compare April 22, 2025 05:45
@gabotechs gabotechs marked this pull request as ready for review April 22, 2025 05:53
@Blizzara
Copy link
Contributor

Thanks! This has indeed been a long time todo :) also cc @vbarua

I think personally I'd prefer a bit less files, but that's just a suggestion: I'd probably do something like:

  • substrait_consumer
  • literal
  • types
  • relations
  • expressions
  • utils

If you prefer to keep things in smaller pieces, I'd suggest separating expressions and relations into their own folders, and considering combining at least things which are either tightly related (struct_type & named_struct -> types) or used by only one other file (like bound.rs -> window_function.rs).

Thoughts?

@gabotechs
Copy link
Contributor Author

Thoughts?

I have only a very slight preference for smaller pieces, but since I wasn’t on the front lines coding this, I trust your judgment much more. I’ll apply your suggestions for the next round, and we’ll see how things shape up from there.

@gabotechs
Copy link
Contributor Author

@Blizzara applied your suggestions. I think scoping expressions and relations into their own modules made a lot of sense.

@Blizzara
Copy link
Contributor

@Blizzara applied your suggestions. I think scoping expressions and relations into their own modules made a lot of sense.

Nice! I didn't dive into the details of each file, but the high-level organisation makes sense to me and looks good 👍 Thanks!

Copy link
Contributor

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions, but otherwise LGTM

The only real change is exposing the from_substrait_type, otherwise refactor is not visible externally and consist only of moving code around.

@gabotechs gabotechs force-pushed the factor-out-substrait-consumer-into-files branch from 4872b9f to b015ed3 Compare April 30, 2025 16:11
@vbarua
Copy link
Contributor

vbarua commented Apr 30, 2025

@alamb this PR should be ready for review

I've checked that this PR consists purely of code moves with no functional changes, at this point in time, aside from from_substrait_type being exposed externally now.

This will likely conflict with #15854, or if that gets merged first we'll need to make sure this PR pulls in those changes.

Full disclosure as well, I work with @gabotechs at DataDog

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gabotechs and @vbarua -- I trust your reviews and I think this is a very nice improvement to the code maintainability.

@alamb
Copy link
Contributor

alamb commented May 1, 2025

I merged up and assuming the CI passes I'll merge this PR in

@alamb
Copy link
Contributor

alamb commented May 1, 2025

Thanks again @vbarua and @gabotechs

@alamb alamb merged commit 185a02d into apache:main May 1, 2025
27 checks passed
@gabotechs gabotechs deleted the factor-out-substrait-consumer-into-files branch May 1, 2025 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[substrait] refactor consumer.rs
4 participants