
fix: Avoid mistaken ILike to string equality optimization #15836

Merged
comphead merged 2 commits into apache:main from the ilike-optimization branch on Apr 29, 2025

Conversation

srh
Contributor

@srh srh commented Apr 24, 2025

Which issue does this PR close?

Rationale for this change

Bugfix

What changes are included in this PR?

Bugfix and unit test cases for the optimization.

Are these changes tested?

Yes.

Are there any user-facing changes?

This changes the behavior of ILIKE.

@github-actions github-actions bot added the optimizer Optimizer rules label Apr 24, 2025
@xudong963 xudong963 added the bug Something isn't working label Apr 24, 2025
Contributor

@comphead comphead left a comment

Thanks @srh for your first contribution. Is the bug you are fixing related to the parser, like #15820, or to query correctness?

In other words, would it be possible to test the change with a SQL query?

@srh
Contributor Author

srh commented Apr 28, 2025

@comphead It isn't parser-related; it's a bug in evaluation, caused by this optimizer.

It seems a query like "SELECT s.col ILIKE 'A' FROM (SELECT 'a' AS col) AS s" will trigger the bug. The query SELECT 'a' ILIKE 'A' evaluates to true because it hits constant folding before ExprSimplifier.

Here is a complete program that reproduces this on 46.0.1:

use datafusion::arrow::util::pretty;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let session_config = SessionConfig::new();
    let ctx = SessionContext::new_with_config(session_config);

    // Compare a lowercase and an uppercase value against the wildcard-free pattern 'A'.
    for value in &["a", "A"] {
        // The subquery keeps the expression from being constant-folded,
        // so it reaches ExprSimplifier.
        let query = format!("SELECT s.col ILIKE 'A' FROM (SELECT '{}' AS col) AS s", value);

        let df = ctx.sql(&query).await?;
        let results = df.collect().await?;

        println!("'{}' ILIKE 'A' result:", value);
        pretty::print_batches(&results)?;
    }
    Ok(())
}

Output on 46.0.1 (before applying this change)

'a' ILIKE 'A' result:
+-----------------------+
| s.col ILIKE Utf8("A") |
+-----------------------+
| false                 |
+-----------------------+
'A' ILIKE 'A' result:
+-----------------------+
| s.col ILIKE Utf8("A") |
+-----------------------+
| true                  |
+-----------------------+
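
For readers who want the semantics in isolation, here is a minimal standalone sketch of why a wildcard-free ILIKE cannot be rewritten to plain equality. It is not DataFusion code: the helper names `ilike_no_wildcards` and `buggy_rewrite` are hypothetical, and using `to_lowercase` to approximate case-insensitive matching is an assumption that holds for simple inputs like the ones above.

```rust
/// Hypothetical model of `value ILIKE pattern` for a pattern with no
/// wildcards: a case-insensitive comparison, approximated with `to_lowercase`.
fn ilike_no_wildcards(value: &str, pattern: &str) -> bool {
    value.to_lowercase() == pattern.to_lowercase()
}

/// Hypothetical model of the mistaken rewrite: ILIKE replaced by `=`,
/// which compares case-sensitively.
fn buggy_rewrite(value: &str, pattern: &str) -> bool {
    value == pattern
}
```

Under these assumptions, `ilike_no_wildcards("a", "A")` is true while `buggy_rewrite("a", "A")` is false, which mirrors the `false` row in the output above. For the value `"A"` both functions agree, so the bug only shows up when the case differs.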

@comphead
Contributor

Thanks @srh for providing the test case. Please add the query to one of the select.slt files to guard against regressions.

-         .contains(['%', '_', escape_char].as_ref()) =>
+ if !like.case_insensitive
+     && !pattern_str
+         .contains(['%', '_', escape_char].as_ref()) =>
  {
      // If the pattern does not contain any wildcards, we can simplify the like expression to an equality expression
      // TODO: handle escape characters
Contributor

Looks like we have completed the TODO as well. Can you remove this comment?

Contributor Author

I think that comment is referring to parsing escape characters and performing the optimization on strings with character escapes too.
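
The guard in the reviewed hunk can be sketched on its own as follows. This is a simplified stand-in, not the optimizer's actual code: the function name `can_rewrite_to_equality` and its flat parameters are hypothetical, whereas the real rule works on a `like` expression and a `pattern_str` inside a match arm.

```rust
/// Hypothetical stand-in for the guard discussed above: returns true when a
/// LIKE pattern may be rewritten to an equality comparison.
fn can_rewrite_to_equality(pattern: &str, escape_char: char, case_insensitive: bool) -> bool {
    // ILIKE must never become `=`, because equality is case-sensitive.
    !case_insensitive
        // A wildcard or the escape character means the pattern is not a
        // plain literal, so the rewrite is skipped.
        && !pattern.contains(&['%', '_', escape_char][..])
}
```

In this sketch, a pattern containing the escape character is simply excluded from the rewrite. Parsing the escapes and still rewriting to equality is the part the remaining TODO refers to.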

@alamb
Contributor

alamb commented Apr 28, 2025

Thank you for this PR @srh

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 29, 2025
@srh
Contributor Author

srh commented Apr 29, 2025

Thanks @srh for providing the test case. Please add the query to one of the select.slt files to guard against regressions.

I have added a test case (not the one with an inner select, because there is a table already set up to use with a non-constant expression) and have confirmed it fails when the fix is reverted.

Contributor

@alamb alamb left a comment

Thank you @srh - this looks good to me

@@ -115,6 +115,12 @@ p1
p1e1
p1m1e1

query T rowsort
Contributor

In case anyone is curious, without the fix this test fails like this:

Completed 1 test files in 0 seconds
External error: query result mismatch:
[SQL] SELECT s FROM test WHERE s ILIKE 'p1';
[Diff] (-expected|+actual)
-   P1
    p1
at test_files/strings.slt:118
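
The diff above can be reproduced outside DataFusion with a small sketch that filters a list of strings and sorts the result, roughly as `rowsort` does. The function `filter_rows` and the sample rows are hypothetical (the actual contents of the `test` table are only partly visible in this thread), and `to_lowercase` again stands in for case-insensitive matching.

```rust
/// Hypothetical model of `SELECT s FROM test WHERE s [I]LIKE pattern` for a
/// wildcard-free pattern, with the output sorted like `rowsort`.
fn filter_rows<'a>(rows: &[&'a str], pattern: &str, case_insensitive: bool) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = rows
        .iter()
        .copied()
        .filter(|s| {
            if case_insensitive {
                // Intended ILIKE behavior: ignore case.
                s.to_lowercase() == pattern.to_lowercase()
            } else {
                // What the mistaken rewrite effectively evaluated: plain equality.
                *s == pattern
            }
        })
        .collect();
    out.sort();
    out
}
```

With sample rows `["p1", "P1", "p1e1", "p1m1e1"]` and pattern `"p1"`, the case-insensitive call returns `["P1", "p1"]`, while the case-sensitive call returns only `["p1"]`, matching the missing `P1` line in the diff.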

Contributor

@comphead comphead left a comment

Thanks @srh for your contribution.

@comphead comphead merged commit f52d9b9 into apache:main Apr 29, 2025
27 checks passed
@srh srh deleted the ilike-optimization branch April 29, 2025 04:18
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* fix: Avoid mistaken ILike to string equality optimization

* test: ILIKE without wildcards
Development

Successfully merging this pull request may close these issues.

ILike with no wildcards is mistakenly optimized to string equality
4 participants