
[SPARK-2075][Core] Make the compiler generate the same bytecode for Hadoop 1.+ and Hadoop 2.+ #3740

Closed
wants to merge 5 commits

Conversation

@zsxwing (Member) commented Dec 19, 2014

NullWritable is a Comparable rather than a Comparable[NullWritable] in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. As a result, it generates different anonymous classes for saveAsTextFile in Hadoop 1.+ and Hadoop 2.+. Therefore, we provide an Ordering for NullWritable here so that the compiler generates the same code.

I used the following commands to confirm that the generated bytecode is the same.

```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt

mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt

diff ~/hadoop1.txt ~/hadoop2.txt
```

However, the compiler still generates different code for classes that call methods of JobContext/TaskAttemptContext: JobContext/TaskAttemptContext is a class in Hadoop 1.+, so calling its methods compiles to invokevirtual, while it is an interface in Hadoop 2.+, where the same calls compile to invokeinterface.

To fix this, we can use reflection to call JobContext/TaskAttemptContext.getConfiguration.
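
For illustration, a minimal sketch of the reflection approach (the object name HadoopJobContextCompat is hypothetical, not the helper name used in the patch):

```scala
import java.lang.reflect.Method

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.JobContext

// Hypothetical compatibility helper: looking up getConfiguration reflectively
// means the call site compiles to Method.invoke, so the emitted bytecode no
// longer depends on whether JobContext is a class (Hadoop 1.+) or an
// interface (Hadoop 2.+).
object HadoopJobContextCompat {
  private val getConfigurationMethod: Method =
    classOf[JobContext].getMethod("getConfiguration")

  def getConfiguration(context: JobContext): Configuration =
    getConfigurationMethod.invoke(context).asInstanceOf[Configuration]
}
```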

@zsxwing changed the title: Add an Ordering for NullWritable to make the compiler generate the same bytecode for RDD → [SPARK-2075][Core] Add an Ordering for NullWritable to make the compiler generate the same bytecode for RDD (Dec 19, 2014)
@SparkQA commented Dec 19, 2014

Test build #24620 has started for PR 3740 at commit fa40db0.

  • This patch merges cleanly.

@SparkQA commented Dec 19, 2014

Test build #24620 has finished for PR 3740 at commit fa40db0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24620/

@zsxwing changed the title: [SPARK-2075][Core] Add an Ordering for NullWritable to make the compiler generate the same bytecode for RDD → [SPARK-2075][Core] Make the compiler generate the same bytecode for Hadoop 1.+ and Hadoop 2.+ (Dec 19, 2014)
@zsxwing (Member, Author) commented Dec 19, 2014

Since WriteInputFormatTestDataGenerator is only used by the Python tests, we can ignore it.

Now, for the rest of the code in Spark core, Scala generates the same bytecode for Hadoop 1.+ and Hadoop 2.+.

@SparkQA commented Dec 19, 2014

Test build #24624 has started for PR 3740 at commit ca03559.

  • This patch merges cleanly.

@SparkQA commented Dec 19, 2014

Test build #24624 has finished for PR 3740 at commit ca03559.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24624/

```scala
// Provide an Ordering for NullWritable so that the compiler will generate the same code.
implicit val nullWritableOrdering = new Ordering[NullWritable] {
  override def compare(x: NullWritable, y: NullWritable): Int = 0
}
this.map(x => (NullWritable.get(), new Text(x.toString)))
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```
A Contributor commented on this diff:

Is the problem here that, when compiling against Hadoop 2, the compiler supplies the Ordering to the implicit rddToPairRDDFunctions, while against Hadoop 1 it instead falls back to the default argument (null) when invoking the implicit?

I wonder if a more explicit solution, like introducing a conversion to PairRDDFunctions that takes an Ordering, is warranted for these cases, e.g.:

```scala
this.map(x => (NullWritable.get(), new Text(x.toString)))
  .toPairRDD(nullWritableOrdering)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```

This would make it less magical why the definition of an implicit Ordering changes the bytecode.
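
For illustration, such a conversion might look like the hypothetical sketch below (this reflects the reviewer's suggestion only, not what the final patch adds):

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.{PairRDDFunctions, RDD}

object ExplicitConversions {
  // Hypothetical explicit conversion: passing the Ordering by hand removes
  // the implicit search whose outcome differs between Hadoop 1.+ and
  // Hadoop 2.+ builds.
  implicit class ExplicitPairRDD[K, V](rdd: RDD[(K, V)]) {
    def toPairRDD(ord: Ordering[K])(
        implicit kt: ClassTag[K], vt: ClassTag[V]): PairRDDFunctions[K, V] =
      new PairRDDFunctions(rdd)(kt, vt, ord)
  }
}
```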

@zsxwing (Member, Author) replied:

Right. An explicit solution is better for such a tricky issue.

@SparkQA commented Dec 19, 2014

Test build #24635 has started for PR 3740 at commit 734bac9.

  • This patch merges cleanly.

@SparkQA commented Dec 19, 2014

Test build #24635 has finished for PR 3740 at commit 734bac9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24635/

@aarondav (Contributor) commented:

@rxin I'm not a core maintainer, and also at this point it's mainly an issue of how to style this so it is clear to future readers, so could you take a look?

```scala
val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
val textClassTag = implicitly[ClassTag[Text]]
val r = this.map(x => (NullWritable.get(), new Text(x.toString)))
```
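
Based on commit e4ad8b5 in the squashed commit list below ("Use null for the implicit Ordering"), this excerpt plausibly continues along the following lines (a sketch, not the verbatim diff):

```scala
// Invoke the conversion explicitly and pass null for the Ordering, so the
// emitted call no longer depends on which implicits each Hadoop build resolves.
RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```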
A Contributor commented on this diff:

Just noticed we can reuse the Text here to reduce GC. Anyway, that's not part of this PR. Would you be willing to submit a new PR for that?
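
For reference, a minimal sketch of that idea (illustrative only, not part of this PR): reuse one mutable Text per partition instead of allocating a new one per record.

```scala
this.mapPartitions { iter =>
  // One shared, mutable Text per partition; each record overwrites it.
  // This is safe only if each pair is written out before the next call to
  // iter.map mutates the buffer.
  val text = new Text()
  iter.map { x =>
    text.set(x.toString)
    (NullWritable.get(), text)
  }
}
```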

@zsxwing (Member, Author) replied:

OK. I'll send another PR for that after this one is merged.

@SparkQA commented Dec 20, 2014

Test build #24670 has started for PR 3740 at commit 39d9df2.

  • This patch merges cleanly.

@SparkQA commented Dec 20, 2014

Test build #24670 has finished for PR 3740 at commit 39d9df2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24670/

@rxin (Contributor) commented Dec 22, 2014

Thanks - merging in master & branch-1.2.

asfgit pushed a commit that referenced this pull request Dec 22, 2014
[SPARK-2075][Core] Make the compiler generate the same bytecode for Hadoop 1.+ and Hadoop 2.+

`NullWritable` is a `Comparable` rather than a `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. As a result, it generates different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, we provide an Ordering for NullWritable here so that the compiler generates the same code.

I used the following commands to confirm that the generated bytecode is the same.
```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt

mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt

diff ~/hadoop1.txt ~/hadoop2.txt
```

However, the compiler still generates different code for classes that call methods of `JobContext/TaskAttemptContext`: `JobContext/TaskAttemptContext` is a class in Hadoop 1.+, so calling its methods compiles to `invokevirtual`, while it is an interface in Hadoop 2.+, where the same calls compile to `invokeinterface`.

To fix this, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`.

Author: zsxwing <zsxwing@gmail.com>

Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits:

39d9df2 [zsxwing] Fix the code style
e4ad8b5 [zsxwing] Use null for the implicit Ordering
734bac9 [zsxwing] Explicitly set the implicit parameters
ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration
fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD

(cherry picked from commit 6ee6aa7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit closed this in 6ee6aa7 on Dec 22, 2014
@zsxwing (Member, Author) commented Dec 22, 2014

This one needs a separate backport to branch-1.2, as the implicit API fixes do not exist in branch-1.2.

@zsxwing (Member, Author) commented Dec 22, 2014

`RDD.rddToPairRDDFunctions` needs to be changed to `rddToPairRDDFunctions` for branch-1.2.
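
i.e., the backported call would plausibly read as follows (a sketch; in branch-1.2 the conversion is still resolved via SparkContext's implicits rather than the RDD object):

```scala
// branch-1.2 sketch: the conversion is referenced by its bare name because
// RDD.rddToPairRDDFunctions does not exist on that branch yet.
rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```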

@zsxwing (Member, Author) commented Dec 22, 2014

See #3758

@rxin (Contributor) commented Dec 22, 2014

Thanks - I will merge the new PR once tests pass.

asfgit pushed a commit that referenced this pull request Dec 22, 2014
backport #3740 for branch-1.2

Author: zsxwing <zsxwing@gmail.com>

Closes #3758 from zsxwing/SPARK-2075-branch-1.2 and squashes the following commits:

b57d440 [zsxwing] SPARK-2075 backport for branch-1.2
@baishuo (Contributor) commented Dec 29, 2014

I learned a lot reviewing this PR, thanks.
