Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16135][SQL] Remove hashCode and euqals in ArrayBasedMapData #13847

Closed
wants to merge 9 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented Jun 22, 2016

What changes were proposed in this pull request?

This pr is to remove hashCode and equals in ArrayBasedMapData because the type cannot be used as join keys, grouping keys, or in equality tests.

How was this patch tested?

Add a new test suite MapDataSuite for comparison tests.

@srowen
Copy link
Member

srowen commented Jun 22, 2016

LGTM

@hvanhovell
Copy link
Contributor

@maropu I think you also need to add these methods to UnsafeArray for this to work.

Where in the spark code base do we compare two (unsafe) MapData objects? Or are you comparing these in your own code?

@hvanhovell
Copy link
Contributor

You could also just also use the approach taken in UnsafeRow.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

It seems UnsafeArrayData already has its own equals and hashCode`.
Currently, spark doesn't compare unsafe MapData though, I think this might cause implicit bugs in future codes.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

aha, yes. It'd better to take the same approach in UnsafeRow?

@hvanhovell
Copy link
Contributor

Yeah you are right about UnsafeArrayData (my bad).

I would take the same approach as UnsafeRow.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

okay, I'm fixing now.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

okay, done.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61041 has started for PR 13847 at commit 30d28bc.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61038 has finished for PR 13847 at commit 50f8be3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Jun 22, 2016

Do we need to hash all values? This could be a performance issue if hashCode is called frequently on very large arrays.

Story: MLlib had some performance issues caused by Vector.hashCode, which is called during Pyrolite serialization. It saves the Vector as the key in a hash map to avoid re-serialization of the same object. But the hashCode costs almost the same as re-serialization.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

Does the current implementation of Vector.hashCode have enough performance? If so, it's okay to follow the impl. to me.

@hvanhovell
Copy link
Contributor

hvanhovell commented Jun 22, 2016

The performance of hashCode() should be pretty good in this case, and this implementation is in line with the ones used in all other Unsafe* objects (MurMurHash). I'd rather be consistent. If this turns out to be a problem, we could always use the first n bytes (similar to the Vector.hashCode) for hashCode(), if it turns out to be a problem.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

At least, we'd be better to leave comments for that.

// This `hashCode` computation could consume much processor time for large data.
// If the computation becomes a bottleneck, we can use a light-weight logic; the first fixed bytes
// are used to compute `hashCode` (See `Vector.hashCode`).
// The same issue exists in `UnsafeMapData.hashCode`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same issue also exists for UnsafeRow...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I'll add now.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61052 has finished for PR 13847 at commit e8eaf70.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61053 has finished for PR 13847 at commit b6bac43.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #3126 has finished for PR 13847 at commit b6bac43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

I think we don't need to implement equals and hashCode for map type, as map type doesn't support equality and ordering by design, see https://issues.apache.org/jira/browse/SPARK-9415

We should remove the equals and hashCode on ArrayBasedMapData and fix tests if needed

@maropu
Copy link
Member Author

maropu commented Jun 23, 2016

Thx, good direction. The current master doesn't throw any exception in an analyzer when map-typed data are passed into collect_set/collect_list. Probably, should we check the case in there?
https://github.com/apache/spark/pull/13802/files#diff-4b06b5fe0cedf425de14469d1356d6ecR474

@cloud-fan
Copy link
Contributor

yea we should improve the type check of CollectSet

@SparkQA
Copy link

SparkQA commented Jun 23, 2016

Test build #61093 has finished for PR 13847 at commit 827785d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jun 23, 2016

I'm now checking failed tests...

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61168 has finished for PR 13847 at commit 431a3fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-16135][SQL] Implement hashCode and euqals in UnsafeMapData [SPARK-16135][SQL] Remove hashCode and euqals in ArrayBasedMapData Jun 24, 2016
@maropu
Copy link
Member Author

maropu commented Jun 24, 2016

@hvanhovell please check again?

@@ -115,7 +115,7 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
new GenericArrayData(0 until length),
new GenericArrayData(Seq.fill(length)(true))))

if (!checkResult(actual, expected)) {
if (!actual.zip(expected).forall { case (data, answer) => checkResult(data, answer)}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we turn map data in actual to scala map and compare it with expected? (also use scala map in expected)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I'll try to replace it.

@cloud-fan
Copy link
Contributor

looks pretty good, thanks for working on it!

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61173 has finished for PR 13847 at commit 78b57c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61174 has finished for PR 13847 at commit 2feb984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61175 has finished for PR 13847 at commit 8ada73c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61178 has finished for PR 13847 at commit d95824b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -39,4 +39,7 @@ abstract class MapData extends Serializable {
i += 1
}
}

// `MapData` should not implement `equals` and `hashCode` because the type cannot be used as join
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put this in the class header?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

@hvanhovell
Copy link
Contributor

LGTM - @cloud-fan?

@SparkQA
Copy link

SparkQA commented Jun 25, 2016

Test build #61208 has finished for PR 13847 at commit e4b1384.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -19,6 +19,10 @@ package org.apache.spark.sql.catalyst.util

import org.apache.spark.sql.types.DataType

/**
* `MapData` should not implement `equals` and `hashCode` because the type cannot be used as join
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not your fault, but I think we need to add some comment for MapData itself, and then follows this comment as a note. We can simply say: An internal data representation for map type in Spark SQL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, how about this?

@SparkQA
Copy link

SparkQA commented Jun 25, 2016

Test build #61237 has finished for PR 13847 at commit 902fe5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@cloud-fan
Copy link
Contributor

LGTM, retest this as the last test pass is 2 days ago.

@SparkQA
Copy link

SparkQA commented Jun 27, 2016

Test build #61298 has finished for PR 13847 at commit 902fe5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 27, 2016
## What changes were proposed in this pull request?
This pr is to remove `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests.

## How was this patch tested?
Add a new test suite `MapDataSuite` for comparison tests.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13847 from maropu/UnsafeMapTest.

(cherry picked from commit 3e4e868)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 3e4e868 Jun 27, 2016
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants