
[SPARK-6201] [SQL] promote string and do widen types for IN #4945

Closed
wants to merge 4 commits

Conversation

adrian-wang
Contributor

@huangjs
Actually, Spark SQL first goes through the analysis phase, in which we widen types and promote strings, and then optimization, where a constant IN is converted into INSET.

So it turns out that we only need to fix this for IN.
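The analyze-then-optimize pipeline described above can be sketched in plain Scala. This is a simplified model with hypothetical stand-in classes (`Lit`, `In`, `InSet`), not Spark's actual Catalyst expressions: analysis coerces types first, then the optimizer rewrites a constant IN list into a set-based INSET for fast membership tests.

```scala
// Simplified, hypothetical model of the pipeline; these case classes
// stand in for Spark's Catalyst expression tree.
sealed trait Expr
case class Lit(v: Double) extends Expr
case class In(value: Expr, list: Seq[Expr]) extends Expr
case class InSet(value: Expr, hset: Set[Double]) extends Expr

// Optimizer step: when every list element is a literal constant,
// the IN list becomes a hash set for O(1) membership lookup.
def optimize(e: Expr): Expr = e match {
  case In(v, list) if list.forall(_.isInstanceOf[Lit]) =>
    InSet(v, list.collect { case Lit(x) => x }.toSet)
  case other => other
}

val optimized = optimize(In(Lit(1.0), Seq(Lit(1.0), Lit(2.0))))
println(optimized)
```

Because type coercion runs during analysis, the INSET rewrite only ever sees an already-coerced list, which is why the fix is needed for IN alone.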

@SparkQA

SparkQA commented Mar 9, 2015

Test build #28382 has finished for PR 4945 at commit 6c838c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case i @ In(a, b) if b.exists(_.dataType == StringType)
    && a.dataType.isInstanceOf[NumericType] =>
  i.makeCopy(Array(a, b.map(_.dataType match {
    case StringType => Cast(a, DoubleType)
Contributor

Causes unmatched exception?

  case StringType => Cast(a, DoubleType)
  case x => x
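The concern raised here is that a match without a default case throws `scala.MatchError` at runtime for any unhandled input; the suggested `case x => x` default closes that hole. A minimal standalone illustration, using hypothetical `DataType` stand-ins rather than Catalyst's real types:

```scala
import scala.util.Try

sealed trait DataType
case object StringType  extends DataType
case object IntegerType extends DataType

// Without a default case the match is non-exhaustive: any type other
// than StringType throws scala.MatchError at runtime.
def withoutDefault(dt: DataType): String = dt match {
  case StringType => "cast to double"
}

// With the suggested `case x => x`-style default, every input is handled.
def withDefault(dt: DataType): String = dt match {
  case StringType => "cast to double"
  case x          => x.toString
}

val failed = Try(withoutDefault(IntegerType)).isFailure
println(failed)                    // true: MatchError was thrown
println(withDefault(IntegerType))  // IntegerType
```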

Contributor

Same as above.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28420 has finished for PR 4945 at commit cd72593.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang
Contributor Author

@liancheng That's reasonable; not every string can be converted to a numeric type.
Hive documents this as:
/**
 * GenericUDFIn
 *
 * Example usage:
 * SELECT key FROM src WHERE key IN ("238", "1");
 *
 * From MySQL page on IN(): To comply with the SQL standard, IN returns NULL
 * not only if the expression on the left hand side is NULL, but also if no
 * match is found in the list and one of the expressions in the list is NULL.
 *
 * Also noteworthy: type conversion behavior is different from MySQL. With
 * expr IN expr1, expr2... in MySQL, exprN will each be converted into the same
 * type as expr. In the Hive implementation, all expr(N) will be converted into
 * a common type for conversion consistency with other UDF's, and to prevent
 * conversions from a big type to a small type (e.g. int to tinyint)
 */

@marmbrus
Contributor

marmbrus commented Apr 3, 2015

What is the status here? I haven't looked closely, but this seems reasonable to me. However, I'd like to see more Hive comparison tests that specifically exercise edge cases (differently sized numbers, strings that can't be converted to numbers, etc.) to make sure we are compatible with what they are doing.

@liancheng
Contributor

The thing that makes me hesitant here is whether we should stick to Hive, because Hive's behavior is actually error-prone and unintuitive. In Hive, IN is implemented as a UDF, and function argument type coercion rules apply here.

Take "1.00" IN (1.0, 2.0) as an example: "1.00", 1.0, and 2.0 are all arguments of GenericUDFIn. When doing type coercion, 1.0 and 2.0 are first converted to the strings "1.0" and "2.0", and then compared with "1.00", so the expression returns false.
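That string-comparison pitfall can be reproduced in plain Scala. This only mimics the coercion direction just described; it is not Hive's actual code:

```scala
// Mimics Hive's GenericUDFIn coercion direction: the numeric list values
// are converted to strings and compared as strings.
val left = "1.00"
val list = Seq(1.0, 2.0)

// Hive-style: the doubles become "1.0" and "2.0"; neither equals "1.00".
val hiveStyle = list.map(_.toString).contains(left)

// Numeric comparison, which most users expect: 1.00 == 1.0.
val numericStyle = list.contains(left.toDouble)

println(hiveStyle)     // false
println(numericStyle)  // true
```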

Personally, I think maybe we should just throw an exception if the left-hand side of IN has a different data type from the right side.

@marmbrus
Contributor

Any comments on @liancheng's suggestion?

@adrian-wang
Contributor Author

If we don't like what Hive does here, we can do it the MySQL way: convert all expressions in the list to the type of the left-hand side of IN. In fact, the main difference here is that Spark SQL promotes strings to numeric types when doing type coercion, while Hive seems to do the opposite.
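A minimal sketch of that MySQL-style rule, using hypothetical names rather than Spark's actual TypeCoercion code: every expression in the IN list is cast to the left-hand side's data type.

```scala
// Hypothetical, simplified types; not Spark's Catalyst classes.
sealed trait DataType
case object StringType extends DataType
case object DoubleType extends DataType

case class Expr(value: String, dataType: DataType)

// Demo "cast": normalizes the textual value when retagging to DoubleType.
def cast(e: Expr, to: DataType): Expr = to match {
  case DoubleType => Expr(e.value.toDouble.toString, DoubleType)
  case StringType => Expr(e.value, StringType)
}

// MySQL-style rule: coerce each list expression to the left-hand side's type.
def coerceIn(lhs: Expr, list: Seq[Expr]): Seq[Expr] =
  list.map(e => if (e.dataType == lhs.dataType) e else cast(e, lhs.dataType))

val lhs     = Expr("1.0", DoubleType)
val coerced = coerceIn(lhs, Seq(Expr("1.00", StringType), Expr("2.0", DoubleType)))
println(coerced.forall(_.dataType == DoubleType))  // true: list now matches lhs type
println(coerced.exists(_.value == lhs.value))      // true: "1.00" now equals 1.0
```

Note how this direction fixes liancheng's example: the string "1.00" is promoted to the numeric 1.0 and matches, instead of the numerics being demoted to strings and failing.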

@marmbrus
Contributor

The mysql way seems reasonable to me. @liancheng ?

@liancheng
Contributor

@adrian-wang @marmbrus Sorry for the late reply. Yeah, the MySQL way also seems reasonable to me.

In both Spark SQL and MySQL, IN is treated more like an operator, which has its own reasonable type coercion rules. However, in Hive, IN is defined as a UDF, which follows the general UDF argument type coercion rules, but those rules don't make sense here.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30732 has finished for PR 4945 at commit 581fa1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30734 has finished for PR 4945 at commit 71e05cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #31058 has started for PR 4945 at commit 71e05cc.

@rxin
Contributor

rxin commented May 6, 2015

Thanks. I'm going to merge this.

asfgit pushed a commit that referenced this pull request May 6, 2015
huangjs
Actually, Spark SQL first goes through the analysis phase, in which we widen types and promote strings, and then optimization, where a constant IN is converted into INSET.

So it turns out that we only need to fix this for IN.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4945 from adrian-wang/inset and squashes the following commits:

71e05cc [Daoyuan Wang] minor fix
581fa1c [Daoyuan Wang] mysql way
f3f7baf [Daoyuan Wang] address comments
5eed4bc [Daoyuan Wang] promote string and do widen types for IN

(cherry picked from commit c3eb441)
Signed-off-by: Yin Huai <yhuai@databricks.com>
asfgit closed this in c3eb441 May 6, 2015
@yhuai
Contributor

yhuai commented May 6, 2015

Merged in master and branch 1.4. Thanks!

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
7 participants