Skip to content

Attach Diagnostic to "more than one column in subquery" error #14438

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #14429
eliaperantoni opened this issue Feb 3, 2025 · 10 comments · Fixed by #15143
Closed
Tracked by #14429

Attach Diagnostic to "more than one column in subquery" error #14438

eliaperantoni opened this issue Feb 3, 2025 · 10 comments · Fixed by #15143
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@eliaperantoni
Copy link
Contributor

Is your feature request related to a problem or challenge?

For a query like:

SELECT (SELECT 1 AS x, 2 AS y) AS col

The only message that the end user of an application built atop of DataFusion sees is:

Error: Invalid (non-executable) plan after Analyzer
caused by
Error during planning: Scalar subquery should only return one column, but found 2: x, y

Caused by:
    Error during planning: Scalar subquery should only return one column, but found 2: x, y

We want to provide a richer message that references and highlights locations in the original SQL query, and contextualises and helps the user understand the error. In the end, it would be possible to display errors in a fashion akin to what was enabled by #13664 for some errors:

See #14429 for more information.

Describe the solution you'd like

Attach a well crafted Diagnostic to the DataFusionError, building on top of the foundations laid in #13664. See #14429 for more information.

Describe alternatives you've considered

No response

Additional context

No response

@eliaperantoni eliaperantoni added the enhancement New feature or request label Feb 3, 2025
@alamb alamb added the good first issue Good for newcomers label Feb 4, 2025
@alamb
Copy link
Contributor

alamb commented Feb 4, 2025

I think this is a good first issue as the need is clear and the tests in https://github.com/apache/datafusion/blob/85fbde2661bdb462fc498dc18f055c44f229604c/datafusion/sql/tests/cases/diagnostic.rs are well structured for extension.

@irenjj
Copy link
Contributor

irenjj commented Feb 7, 2025

take

@eliaperantoni
Copy link
Contributor Author

eliaperantoni commented Feb 13, 2025

Hey @irenjj how is it going with this ticket :) Can I help with anything?

@irenjj
Copy link
Contributor

irenjj commented Feb 15, 2025

Hey @irenjj how is it going with this ticket :) Can I help with anything?

Hi, @eliaperantoni, Sorry for not updating my status for a long time. I was working on other tasks the past few days, but I will address this issue in the next few days.❤️

@changsun20
Copy link
Contributor

Hi @irenjj ,
I noticed this issue was assigned to you a few weeks ago. May I check if there's any progress? If you're open to collaboration, I'd be happy to contribute a PR to help resolve this.
Please let me know if this works for you. Appreciate your efforts!

@irenjj
Copy link
Contributor

irenjj commented Mar 10, 2025

Hi @irenjj , I noticed this issue was assigned to you a few weeks ago. May I check if there's any progress? If you're open to collaboration, I'd be happy to contribute a PR to help resolve this. Please let me know if this works for you. Appreciate your efforts!

Hi @changsun20 ,feel free to take over this task, thank you!❤️

@changsun20
Copy link
Contributor

Hi @eliaperantoni,

After investigating this issue, here are my initial thoughts on implementation:

The most straightforward approach would be to add a new span property to the Subquery struct in datafusion/expr/src/logical_plan/plan.rs:

pub struct Subquery {
    /// The subquery
    pub subquery: Arc<LogicalPlan>,
    /// The outer references used in the subquery
    pub outer_ref_columns: Vec<Expr>,
    pub span: Option<Span>,
}

The span would be extracted in the parse_scalar_subquery function within datafusion/sql/src/expr/subquery.rs:

pub(super) fn parse_scalar_subquery(
    &self,
    subquery: Query,
    // other params...
) -> Result<Expr> {
    // other logic...
    let span = Span::try_from_sqlparser_span(subquery.some_way_to_index_the_span_from_the query());
    Ok(Expr::ScalarSubquery(Subquery {
        subquery: Arc::new(sub_plan),
        outer_ref_columns,
        span,
    }))
}

This span would then be accessible when generating error messages, allowing us to add diagnostic information at datafusion\expr\src\logical_plan\invariants.rs:

// This is the original implementation
if subquery.subquery.schema().fields().len() > 1 {
    return plan_err!(
        "Scalar subquery should only return one column, but found {}: {}",
        subquery.subquery.schema().fields().len(),
        subquery.subquery.schema().field_names().join(", ")
    );
}

However, a drawback of this approach is that modifying the Subquery struct will impact 14 other functions in the codebase, making this change less minimal than ideal (not fulfilling the "as little invasive as possible" requirement).

Please let me know if this is on the right track or if you have any suggestions. Thank you!

@eliaperantoni
Copy link
Contributor Author

Thank you so much @changsun20 for the incredible work, that's a very nice and comprehensive plan! I think it looks awesome and I definitely like it.

Just one minor thing: it seems like there are scalar subqueries and then "in" subqueries. e.g. SELECT * FROM users WHERE id IN (SELECT id FROM admins). Do you think you could easily implement Diagnostic for this second case too, in the same PR? It's okay either way, we can also make a separate ticket :)

@changsun20
Copy link
Contributor

Just one minor thing: it seems like there are scalar subqueries and then "in" subqueries. e.g. SELECT * FROM users WHERE id IN (SELECT id FROM admins). Do you think you could easily implement Diagnostic for this second case too, in the same PR?

Sure, no problem! I'll also address that in my pr.

@eliaperantoni
Copy link
Contributor Author

Just one minor thing: it seems like there are scalar subqueries and then "in" subqueries. e.g. SELECT * FROM users WHERE id IN (SELECT id FROM admins). Do you think you could easily implement Diagnostic for this second case too, in the same PR?

Sure, no problem! I'll also address that in my pr.

That's amazing, thank you ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
4 participants