SPARK-2978. Transformation with MR shuffle semantics #2274

sryza · 2014-09-04T08:36:55Z

I didn't add this to the transformations list in the docs because it's kind of obscure, but would be happy to do so if others think it would be helpful.

SparkQA · 2014-09-04T08:39:15Z

QA tests have started for PR 2274 at commit a75f277.

This patch merges cleanly.

SparkQA · 2014-09-04T08:40:23Z

QA tests have finished for PR 2274 at commit a75f277.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-04T08:54:24Z

QA tests have started for PR 2274 at commit a1ef807.

This patch merges cleanly.

SparkQA · 2014-09-04T09:52:06Z

QA tests have finished for PR 2274 at commit a1ef807.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class BlockManagerMaster(
- class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

davies · 2014-09-04T17:19:15Z

python/pyspark/rdd.py

@@ -514,6 +514,30 @@ def __add__(self, other):
            raise TypeError
        return self.union(other)

+    def repartitionAndSortWithinPartition(self, ascending=True, numPartitions=None,
+                                          partitionFunc=portable_hash, keyfunc=lambda x: x):


How about rearranging the parameters to follow the order in the function name (repartition first, then sort)? For example:

repartitionAndSortWithinPartition(self, numPartitions=None, partitionFunc=portable_hash, ascending=True, keyfunc=lambda x: x)

sryza · 2014-09-04T22:35:55Z

Updated patch removes Python version, adds Java version, and adds some additional doc.

mateiz · 2014-09-05T00:27:39Z

Just a nit: it should probably be called repartitionAndSortWithinPartitions (with an "s" at the end).

Also, this name is pretty long. Another one I'd consider is repartitionWithSort, but I'm not sure what other people think.

Finally I think it should be a policy to add all these APIs to Python, and implement them there too. Basically there are two options -- if you're doing this to support a slightly easier transition from MR jobs, but you don't want to do it in Python, you could just have it as a document, or an example, or maybe even a third-party package that takes a Hadoop JobConf and runs it on Spark. But if you want it in Spark, we need to put it in each language. The reason is to allow people to easily read code in one supported language and run it in others -- it's always disappointing when some operators turn out to be missing in yours.

rxin · 2014-09-05T00:54:19Z

@mateiz

The reason to add this is that it is a smaller API that we can support (both source and binary compatibility) in the long run before finalizing ShuffledRDD, which has been in flux and has changed in multiple past releases. Perhaps we can mark this new API as DeveloperApi but commit to maintaining it. What do you think?

The naming is long, but I'm worried repartitionWithSort in a way implies the data are sorted globally.
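To illustrate the distinction drawn above, here is a minimal sketch in plain Python (no Spark required) of what "repartition and sort within partitions" means. The helper `repartition_and_sort_within_partitions`, its parameter names, and the toy identity partitioner are hypothetical stand-ins for illustration, not the code in this patch; the parameter order simply follows the ordering suggested earlier in this thread.

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, str]


def repartition_and_sort_within_partitions(
    pairs: List[Pair],
    num_partitions: int,
    partition_func: Callable[[int], int] = hash,
    ascending: bool = True,
) -> List[List[Pair]]:
    """Hypothetical stand-in: bucket pairs by key, then sort each bucket by key."""
    partitions: List[List[Pair]] = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Route each record to a partition based on its key.
        partitions[partition_func(key) % num_partitions].append((key, value))
    for part in partitions:
        # Sort only within the partition; no ordering is imposed across partitions.
        part.sort(key=lambda kv: kv[0], reverse=not ascending)
    return partitions


data = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c"), (0, "z")]
# Assumption for the demo: an identity partitioner, so even keys land in
# partition 0 and odd keys in partition 1.
parts = repartition_and_sort_within_partitions(data, 2, partition_func=lambda k: k)

# Each partition is sorted by key on its own...
within_sorted = all(part == sorted(part, key=lambda kv: kv[0]) for part in parts)
# ...but concatenating the partitions does not give a globally sorted sequence.
flattened = [kv for part in parts for kv in part]
globally_sorted = flattened == sorted(flattened, key=lambda kv: kv[0])

print(parts)
print(within_sorted, globally_sorted)
```

In this toy run every partition is internally ordered while the concatenation is not, which is the property a name like repartitionWithSort might obscure.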

rxin · 2014-09-05T00:58:13Z

core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala

+   * because it can push the sorting down into the shuffle machinery.
+   */
+  def repartitionAndSortWithinPartition(partitioner: Partitioner)
+      : RDD[(K, V)] = {


You can put this on the previous line.

mateiz · 2014-09-05T01:20:23Z

Ah, I see. Then we can add it, but in that case I'd also add it in Python.

sryza · 2014-09-05T20:32:28Z

Updated patch adds Python back in and adds the 's' at the end.

rxin · 2014-09-05T20:40:55Z

Thanks, Sandy. Can you add a unit test in Java to make sure the thing is callable from Java?

davies · 2014-09-05T21:23:44Z

python/pyspark/tests.py

-
-        self.assertRaises(ValueError, lambda: rdd.countApproxDistinct(0.00000001))
-        self.assertRaises(ValueError, lambda: rdd.countApproxDistinct(0.5))
-


Were these removed by accident during merging?

sryza · 2014-09-08

Yup, my bad.

SparkQA · 2014-09-08T06:47:07Z

QA tests have started for PR 2274 at commit c04b447.

This patch merges cleanly.

SparkQA · 2014-09-08T07:39:53Z

QA tests have finished for PR 2274 at commit c04b447.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-08T07:47:29Z

QA tests have started for PR 2274 at commit 4a5332a.

This patch merges cleanly.

SparkQA · 2014-09-08T08:48:39Z

QA tests have finished for PR 2274 at commit 4a5332a.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2014-09-08T17:18:43Z

LGTM, thanks.

mateiz · 2014-09-08T18:21:31Z

Thanks Sandy! I've merged this.

davies reviewed Sep 4, 2014

sryza force-pushed the sandy-spark-2978 branch from a1ef807 to 423650a on September 4, 2014 22:35

rxin reviewed Sep 5, 2014

sryza force-pushed the sandy-spark-2978 branch from 15b2f90 to 1340d75 on September 5, 2014 20:29

davies reviewed Sep 5, 2014

sryza added 8 commits September 7, 2014 23:07

- f147634: SPARK-2978. Transformation with MR shuffle semantics
- e5381cd: Fix python style warnings
- 48c12c2: Add Java version and additional doc
- 36e0571: Fix import ordering
- 9b0ba99: Fix compilation
- 4c25a54: Add s at the end and a couple other fixes
- 433ad5b: Add Java test
- c04b447: Fix Python doc and add back deleted code

sryza force-pushed the sandy-spark-2978 branch from f249f74 to c04b447 on September 8, 2014 06:11

- 4a5332a: Fix Java test

asfgit closed this in 16a73c2 Sep 8, 2014
