
[SPARK-6201] [SQL] promote string and do widen types for IN #4945

Closed
wants to merge 4 commits

Conversation

adrian-wang
Contributor

@huangjs
Actually, Spark SQL first goes through the analysis phase, in which we widen types and promote strings, and then optimization, where a constant IN is converted into INSET.

So it turns out that we only need to fix this for IN.
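The analyze-then-optimize pipeline described above can be sketched in plain Scala. This is a simplified model with hypothetical stand-in classes (`Lit`, `In`, `InSet`), not Spark's actual Catalyst expressions: analysis coerces types first, then the optimizer rewrites a constant IN list into a set-based INSET for fast membership tests.

```scala
// Simplified, hypothetical model of the pipeline; these case classes
// stand in for Spark's Catalyst expression tree.
sealed trait Expr
case class Lit(v: Double) extends Expr
case class In(value: Expr, list: Seq[Expr]) extends Expr
case class InSet(value: Expr, hset: Set[Double]) extends Expr

// Optimizer step: when every list element is a literal constant,
// the IN list becomes a hash set for O(1) membership lookup.
def optimize(e: Expr): Expr = e match {
  case In(v, list) if list.forall(_.isInstanceOf[Lit]) =>
    InSet(v, list.collect { case Lit(x) => x }.toSet)
  case other => other
}

val optimized = optimize(In(Lit(1.0), Seq(Lit(1.0), Lit(2.0))))
println(optimized)
```

Because type coercion runs during analysis, the INSET rewrite only ever sees an already-coerced list, which is why the fix is needed for IN alone.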

@SparkQA

SparkQA commented Mar 9, 2015

Test build #28382 has finished for PR 4945 at commit 6c838c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case i @ In(a, b) if b.exists(_.dataType == StringType)
    && a.dataType.isInstanceOf[NumericType] =>
  i.makeCopy(Array(a, b.map(_.dataType match {
    case StringType => Cast(a, DoubleType)
Contributor

Causes unmatched exception?

  case StringType => Cast(a, DoubleType)
  case x => x
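The concern raised here is that a match without a default case throws `scala.MatchError` at runtime for any unhandled input; the suggested `case x => x` default closes that hole. A minimal standalone illustration, using hypothetical `DataType` stand-ins rather than Catalyst's real types:

```scala
import scala.util.Try

sealed trait DataType
case object StringType  extends DataType
case object IntegerType extends DataType

// Without a default case the match is non-exhaustive: any type other
// than StringType throws scala.MatchError at runtime.
def withoutDefault(dt: DataType): String = dt match {
  case StringType => "cast to double"
}

// With the suggested `case x => x`-style default, every input is handled.
def withDefault(dt: DataType): String = dt match {
  case StringType => "cast to double"
  case x          => x.toString
}

val failed = Try(withoutDefault(IntegerType)).isFailure
println(failed)                    // true: MatchError was thrown
println(withDefault(IntegerType))  // IntegerType
```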

Contributor

Same as above.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28420 has finished for PR 4945 at commit cd72593.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang
Contributor Author

@liancheng That's reasonable; not every string can be converted to a numeric type.
Hive documents this as:
/**
 * GenericUDFIn
 *
 * Example usage:
 * SELECT key FROM src WHERE key IN ("238", "1");
 *
 * From MySQL page on IN(): To comply with the SQL standard, IN returns NULL
 * not only if the expression on the left hand side is NULL, but also if no
 * match is found in the list and one of the expressions in the list is NULL.
 *
 * Also noteworthy: type conversion behavior is different from MySQL. With
 * expr IN expr1, expr2... in MySQL, exprN will each be converted into the same
 * type as expr. In the Hive implementation, all expr(N) will be converted into
 * a common type for conversion consistency with other UDF's, and to prevent
 * conversions from a big type to a small type (e.g. int to tinyint)
 */

@marmbrus
Contributor

marmbrus commented Apr 3, 2015

What is the status here? I haven't looked closely, but this seems reasonable to me. However, I'd like to see more Hive comparison tests that specifically exercise edge cases (differently sized numbers, strings that can't be converted to numbers, etc.) to make sure we are compatible with what they are doing.

@liancheng
Contributor

The thing that makes me hesitant here is whether we should stick to Hive, because Hive's behavior is actually error-prone and unintuitive. In Hive, IN is implemented as a UDF, and function argument type coercion rules apply here.

Take "1.00" IN (1.0, 2.0) as an example: "1.00", 1.0, and 2.0 are all arguments of GenericUDFIn. When doing type coercion, 1.0 and 2.0 are first converted to the strings "1.0" and "2.0", and then compared with "1.00", so the expression returns false.
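That string-comparison pitfall can be reproduced in plain Scala. This only mimics the coercion direction just described; it is not Hive's actual code:

```scala
// Mimics Hive's GenericUDFIn coercion direction: the numeric list values
// are converted to strings and compared as strings.
val left = "1.00"
val list = Seq(1.0, 2.0)

// Hive-style: the doubles become "1.0" and "2.0"; neither equals "1.00".
val hiveStyle = list.map(_.toString).contains(left)

// Numeric comparison, which most users expect: 1.00 == 1.0.
val numericStyle = list.contains(left.toDouble)

println(hiveStyle)     // false
println(numericStyle)  // true
```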

Personally, I think maybe we should just throw an exception if the left-hand side of IN has a different data type from the right side.

@marmbrus
Contributor

Any comments on @liancheng's suggestion?

@adrian-wang
Contributor Author

If we don't like what Hive does here, we can do it the MySQL way: convert all expressions in the list to the type of the left-hand side of IN. In fact, the main difference here is that Spark SQL promotes strings to numeric types when doing type coercion, while Hive seems to do the opposite.
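A minimal sketch of that MySQL-style rule, using hypothetical names rather than Spark's actual TypeCoercion code: every expression in the IN list is cast to the left-hand side's data type.

```scala
// Hypothetical, simplified types; not Spark's Catalyst classes.
sealed trait DataType
case object StringType extends DataType
case object DoubleType extends DataType

case class Expr(value: String, dataType: DataType)

// Demo "cast": normalizes the textual value when retagging to DoubleType.
def cast(e: Expr, to: DataType): Expr = to match {
  case DoubleType => Expr(e.value.toDouble.toString, DoubleType)
  case StringType => Expr(e.value, StringType)
}

// MySQL-style rule: coerce each list expression to the left-hand side's type.
def coerceIn(lhs: Expr, list: Seq[Expr]): Seq[Expr] =
  list.map(e => if (e.dataType == lhs.dataType) e else cast(e, lhs.dataType))

val lhs     = Expr("1.0", DoubleType)
val coerced = coerceIn(lhs, Seq(Expr("1.00", StringType), Expr("2.0", DoubleType)))
println(coerced.forall(_.dataType == DoubleType))  // true: list now matches lhs type
println(coerced.exists(_.value == lhs.value))      // true: "1.00" now equals 1.0
```

Note how this direction fixes liancheng's example: the string "1.00" is promoted to the numeric 1.0 and matches, instead of the numerics being demoted to strings and failing.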

@marmbrus
Contributor

The mysql way seems reasonable to me. @liancheng ?

@liancheng
Contributor

@adrian-wang @marmbrus Sorry for the late reply. Yeah, the MySQL way also seems reasonable to me.

In both Spark SQL and MySQL, IN is treated more like an operator, which has its own reasonable type coercion rules. However, in Hive, IN is defined as a UDF, which follows the general UDF argument type coercion rules, but those rules don't make sense here.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30732 has finished for PR 4945 at commit 581fa1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30734 has finished for PR 4945 at commit 71e05cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #31058 has started for PR 4945 at commit 71e05cc.

@rxin
Contributor

rxin commented May 6, 2015

Thanks. I'm going to merge this.

asfgit pushed a commit that referenced this pull request May 6, 2015
huangjs
Actually, Spark SQL first goes through the analysis phase, in which we widen types and promote strings, and then optimization, where a constant IN is converted into INSET.

So it turns out that we only need to fix this for IN.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4945 from adrian-wang/inset and squashes the following commits:

71e05cc [Daoyuan Wang] minor fix
581fa1c [Daoyuan Wang] mysql way
f3f7baf [Daoyuan Wang] address comments
5eed4bc [Daoyuan Wang] promote string and do widen types for IN

(cherry picked from commit c3eb441)
Signed-off-by: Yin Huai <yhuai@databricks.com>
asfgit closed this in c3eb441 May 6, 2015
@yhuai
Contributor

yhuai commented May 6, 2015

Merged in master and branch 1.4. Thanks!

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
7 participants