Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regression: query error with case when as the group field #11118

Closed
haohuaijin opened this issue Jun 25, 2024 · 6 comments
Closed

regression: query error with case when as the group field #11118

haohuaijin opened this issue Jun 25, 2024 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@haohuaijin
Copy link
Contributor

haohuaijin commented Jun 25, 2024

Describe the bug

Internal error: Input field name CASE WHEN concat(t.b, Utf8("hello")) = Utf8("test") THEN Utf8("good") ELSE Utf8("bad") END does not match with the projection expression CASE WHEN concat(t.b,Utf8("hello")) = Utf8("test") THEN Utf8("good") ELSE Utf8("bad") END.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

To Reproduce

DataFusion CLI v39.0.0
> create table t(a int, b text);
0 row(s) fetched.
Elapsed 0.019 seconds.

> insert into t values (1,'hello'), (2,'world');
+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.019 seconds.

> select (case when concat(b, 'hello') = 'test' THEN 'good' else 'bad' END) as c from t group by c;
Internal error: Input field name CASE WHEN concat(t.b, Utf8("hello")) = Utf8("test") THEN Utf8("good") ELSE Utf8("bad") END does not match with the projection expression CASE WHEN concat(t.b,Utf8("hello")) = Utf8("test") THEN Utf8("good") ELSE Utf8("bad") END.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Expected behavior

run success

Additional context

I also check in v37 and v38

v37 can run, but v38 report error

DataFusion CLI v37.0.0
❯ create table t(a int, b text);
insert into t values (1,'hello'), (2,'world');
select (case when concat(b, 'hello') = 'test' THEN 'good' else 'bad' END) as c from t group by c;
0 rows in set. Query took 0.006 seconds.

+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.014 seconds.

+-----+
| c   |
+-----+
| bad |
+-----+
1 row in set. Query took 0.020 seconds.
DataFusion CLI v38.0.0
> create table t(a int, b text);
insert into t values (1,'hello'), (2,'world');
select (case when concat(b, 'hello') = 'test' THEN 'good' else 'bad' END) as c from t group by c;
0 row(s) fetched.
Elapsed 0.008 seconds.

+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.

Internal error: Input field name CASE WHEN concat(t.b, Utf8("hello")) = Utf8("test") THEN Utf8("good") ELSE Utf8("bad") END does not match with the projection expression CASE WHEN concat(t.b,Utf8("hello")) = Utf8("test") THEN Utf8("good") ELSE Utf8("bad") END.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
@haohuaijin haohuaijin added the bug Something isn't working label Jun 25, 2024
@haohuaijin haohuaijin changed the title query error with case when as the group field regression: query error with case when as the group field Jun 25, 2024
@doki23
Copy link
Contributor

doki23 commented Jun 25, 2024

It seems that the projection expression lacks a whitespace between t.b and Utf8("hello").

@jonahgao
Copy link
Member

jonahgao commented Jun 25, 2024

It might be related to #10088, and could be due to the gap between physical_name and Expr::display_name of case expressions.

Changing the physical_name of case expression as follows can fix it:

        Expr::Case(case) => {
            let mut name = "CASE ".to_string();
            if let Some(e) = &case.expr {
                let _ = write!(name, "{} ", create_physical_name(e, false)?);
            }
            for (w, t) in &case.when_then_expr {
                let _ = write!(
                    name,
                    "WHEN {} THEN {} ",
                    create_physical_name(w, false)?,
                    create_physical_name(t, false)?
                );
            }
            if let Some(e) = &case.else_expr {
                let _ = write!(name, "ELSE {} ", create_physical_name(e, false)?);
            }
            name += "END";
            Ok(name)
        }

but I'm not sure if this is the correct way to do it. Keeping these two names consistent does not seem like a good design.

@jcsherin
Copy link
Contributor

jcsherin commented Jun 26, 2024

It might be related to #10088, and could be due to the gap between physical_name and Expr::display_name of case expressions.

The error originates from the following check between input field names and projection expressions. It seems like failure is the expected behavior. Before v38 the mismatch was being silently ignored. But now they are being detected.

if col.name() != matching_input_field.name() {
return internal_err!("Input field name {} does not match with the projection expression {}",
matching_input_field.name(),col.name())
}

The use of any scalar function inside case will fail similarly. An example using ends_with,

select (case when ends_with(b, 'hello') THEN 'good' else 'bad' END) as c from t group by c;

Internal error: Input field name 
CASE WHEN ends_with(t.b, Utf8("hello")) THEN Utf8("good") ELSE Utf8("bad") END 
does not match with the projection expression 
CASE WHEN ends_with(t.b,Utf8("hello")) THEN Utf8("good") ELSE Utf8("bad") END.

There exists a formatting mismatch in implementations,

  1. The default implementation of display_name does not put a space after the separator like: .join(", ").

    Ok(format!("{}({})", self.name(), names.join(",")))

  2. But the fmt::Display for Expr::ScalarFunction puts a space after the separator:

    write!(f, "{}({}{})", fun, distinct_str, args.join(", "))

This is a recent issue which reports the same bug and proposes to fix it by standardizing the separator in display_name. Even though the actual fix is small, many cascading changes are required in existing test snapshots.

@jonahgao
Copy link
Member

A failed query that does not involve separators.

> create table t(a int, b text);
0 row(s) fetched.
Elapsed 0.005 seconds.

> select (case a::bigint when 1 THEN 1 END) as c from t group by c;
Internal error: Input field name CASE CAST(t.a AS Int64) WHEN Int64(1) THEN Int64(1) END does not match with the projection expression CASE t.a WHEN Int64(1) THEN Int64(1) END.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

I think having multiple different types of names for expressions, while also ensuring they are the same, seems somewhat burdensome and error-prone.🤔

@alamb
Copy link
Contributor

alamb commented Jun 26, 2024

I think this Issue is resolved in #11133 by @jonahgao

However I am not sure if @jonahgao plans additional changes before we should close this issue -- or if perhaps we should file a new ticket

@jonahgao
Copy link
Member

Let's close it for this case. I'll open a new issue if there are new findings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants