Skip to content

Remove Wildcard from Expr #7765

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
waynexia opened this issue Oct 7, 2023 · 8 comments
Open

Remove Wildcard from Expr #7765

waynexia opened this issue Oct 7, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@waynexia
Copy link
Member

waynexia commented Oct 7, 2023

Is your feature request related to a problem or challenge?

Expr::Wildcard and Expr::QualifiedWildcard are expressions that reference all columns. But it seems redundant that we don't need a special expr type to do that. This issue proposes to remove these two expr kinds.

Describe the solution you'd like

Wildcard (*) can be expanded to concrete column lists when it appears. This manner seems viable in the three most common use cases I can come up with:

  • (1) generate plan from SQL AST. Some scenarios already acted like this. E.g., select * from table will generate a projection with all fields, rather than a wildcard expr.
  • (2) from LogicalPlanBuilder and (3) from DataFrame. In some aspects, these two entrances are the same. And both are strong typed (or strong schema-ed), which allows us to expand the wildcard immediately using the schema from current stage.

Besides this, Expr::Wildcard is not properly handled in the codebase, because it's not a "first class" expr. Take some functions as examples:

  • expr_to_columns: from the correctness aspect, it should also count columns referenced by Expr::Wildcard and Expr::QualifiedWildcard.
  • create_physical_name: this function requires all the wildcards to be expanded before calling it.

And I find another issue that related to Wildcard: #5473. The solution is to add an optimizer rule that expands all wildcards. This proposal is going to do something similar but in a more eager way.

Describe alternatives you've considered

No response

Additional context

Since Expr is widely used, we may need several versions to deprecate it (depending on the compatibility rule). We may ship this change step by step before fully removing these two variants

  • change all the internal usage of wildcards (most of them are in aggr expr I suppose)
  • change the public interface that may generate wildcard. Like Cannot Count(Expr:Wildcard) with DataFrame API #5473
  • mark Expr::Wildcard and Expr::QualifiedWildcard as #[deprecated]
  • fully remove them
@waynexia waynexia added the enhancement New feature or request label Oct 7, 2023
@alamb
Copy link
Contributor

alamb commented Oct 7, 2023

@waynexia -- I agree with your points about Expr::Wildcard being confusing. I didn't quite understand your proposal to change the public interface that may generate wildcard, specifically how someone would express "select all the columns" when building expressions programmatically 🤔

Concretely, what would be the equivalent to the following?

ctx.table("alltypes_plain")
        .await?
        .select(vec![count(Expr::Wildcard)])

BTW maybe we could improve the situation by adding more documentation to the Expr::Wildcard

@waynexia
Copy link
Member Author

waynexia commented Oct 7, 2023

Concretely, what would be the equivalent to the following?

That's the point 👍 For select * from table we can simply omit the projection. But this doesn't work for count(*). I was thinking about letting "nothing" to represent the wildcard, similar to pandas or substrait. But this is a little bad as we have to use Option<Expr> instead of overload count(). An alternative is to add a specific count() variant for count(*), like count_all(). But whatever this API looks like, we must change the outer usage because the schema is unavailable for an independent count() method. Wildcard helps to postpone the expanding, and we have to do it immediately if it's removed.

The lucky thing is Expr::Wildcard is not that generic, so we only need to handle a few cases about it (or maybe the count(*) is the only case? I have this feeling but am not sure about it).

BTW maybe we could improve the situation by adding more documentation to the Expr::Wildcard

Makes sense, I'll add it first.

@findepi
Copy link
Member

findepi commented Feb 10, 2025

count(*) is a special syntax that needs to be handled explicitly -- perhaps even at the parsing phase.
It doesn't constitute a reason to keep wildcard as an expression.

@findepi
Copy link
Member

findepi commented Feb 10, 2025

BTW it's clear some expressions are mere hacks, not real expressions. It's enough to look at expr.get_type impl and both Wildcard and GroupingSet stand out as such:

Expr::Wildcard { .. } => Ok(DataType::Null),
Expr::GroupingSet(_) => {
// Grouping sets do not really have a type and do not appear in projections
Ok(DataType::Null)
}

@alamb
Copy link
Contributor

alamb commented Feb 10, 2025

count(*) is a special syntax that needs to be handled explicitly -- perhaps even at the parsing phase.
It doesn't constitute a reason to keep wildcard as an expression.

I agree

As long as we also have some way to represent the same thing in the expr_fn API I don't see any reason that we need to have Expr::Wildcard

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 11, 2025

I don't think we use wildcard for count in datafusion, COUNT_STAR_EXPANSION is used instead which is count(1).

As long as we have alternative representation of wildcard (i.e. None in projection expressions) then removing it makes sense to me.

@findepi findepi changed the title proposal: deprecate Expr::Wildcard Remove Wildcard from Expr Feb 12, 2025
@findepi findepi marked this as a duplicate of #14635 Feb 12, 2025
@rkrishn7
Copy link
Contributor

From a user perspective, my opinion is that representing wildcard with None is too implicit. As @alamb mentioned, having an expr_fn helper makes this a bit better, but still not great IMO.

As long as we have alternative representation of wildcard (i.e. None in projection expressions)

How would we differentiate qualified vs. unqualified wildcards with None?

Just to think about alternatives, what if we kept Expr::Wildcard around and guarantee that it is expanded in any LogicalPlans built via LogicalPlanBuilder (related to #14618)?.

We could also think about a type safe way to represent this, but I think that would come with significant breaking changes.

@alamb
Copy link
Contributor

alamb commented Mar 2, 2025

Here is a PR that marks Expr::Wildcard deprecated (thanks @linhr 🙏 )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants