
[SPARK-2075][Core] Make the compiler generate the same bytecode for Hadoop 1.+ and Hadoop 2.+ #3740

Closed
wants to merge 5 commits

Conversation

@zsxwing (Member) commented Dec 19, 2014

NullWritable is a Comparable rather than a Comparable[NullWritable] in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. As a result, it generates different anonymous classes for saveAsTextFile in Hadoop 1.+ and Hadoop 2.+. Therefore, we provide an Ordering for NullWritable here so that the compiler generates the same code.

I used the following commands to confirm that the generated bytecode is the same.

```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt

mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt

diff ~/hadoop1.txt ~/hadoop2.txt
```

However, the compiler still generates different code for classes that call methods of JobContext/TaskAttemptContext: JobContext/TaskAttemptContext is a class in Hadoop 1.+, so calling its methods compiles to invokevirtual, while it is an interface in Hadoop 2.+, where the same calls compile to invokeinterface.

To fix this, we can use reflection to call JobContext/TaskAttemptContext.getConfiguration.
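
For illustration, a minimal sketch of the reflection approach (the object name HadoopJobContextCompat is hypothetical, not the helper name used in the patch):

```scala
import java.lang.reflect.Method

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.JobContext

// Hypothetical compatibility helper: looking up getConfiguration reflectively
// means the call site compiles to Method.invoke, so the emitted bytecode no
// longer depends on whether JobContext is a class (Hadoop 1.+) or an
// interface (Hadoop 2.+).
object HadoopJobContextCompat {
  private val getConfigurationMethod: Method =
    classOf[JobContext].getMethod("getConfiguration")

  def getConfiguration(context: JobContext): Configuration =
    getConfigurationMethod.invoke(context).asInstanceOf[Configuration]
}
```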

@zsxwing changed the title: Add an Ordering for NullWritable to make the compiler generate the same bytecode for RDD → [SPARK-2075][Core] Add an Ordering for NullWritable to make the compiler generate the same bytecode for RDD (Dec 19, 2014)
@SparkQA commented Dec 19, 2014

Test build #24620 has started for PR 3740 at commit fa40db0.

  • This patch merges cleanly.

@SparkQA commented Dec 19, 2014

Test build #24620 has finished for PR 3740 at commit fa40db0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24620/

@zsxwing changed the title: [SPARK-2075][Core] Add an Ordering for NullWritable to make the compiler generate the same bytecode for RDD → [SPARK-2075][Core] Make the compiler generate the same bytecode for Hadoop 1.+ and Hadoop 2.+ (Dec 19, 2014)
@zsxwing (Member, Author) commented Dec 19, 2014

Since WriteInputFormatTestDataGenerator is only used by the Python tests, we can ignore it.

Now, for the rest of the code in Spark core, Scala generates the same bytecode for Hadoop 1.+ and Hadoop 2.+.

@SparkQA commented Dec 19, 2014

Test build #24624 has started for PR 3740 at commit ca03559.

  • This patch merges cleanly.

@SparkQA commented Dec 19, 2014

Test build #24624 has finished for PR 3740 at commit ca03559.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24624/

```scala
// Provide an Ordering for NullWritable so that the compiler will generate the same code.
implicit val nullWritableOrdering = new Ordering[NullWritable] {
  override def compare(x: NullWritable, y: NullWritable): Int = 0
}
this.map(x => (NullWritable.get(), new Text(x.toString)))
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```
A Contributor commented on this diff:

Is the problem here that, when compiling against Hadoop 2, the compiler supplies the Ordering to the implicit rddToPairRDDFunctions, while against Hadoop 1 it instead falls back to the default argument (null) when invoking the implicit?

I wonder if a more explicit solution, like introducing a conversion to PairRDDFunctions that takes an Ordering, is warranted for these cases, e.g.:

```scala
this.map(x => (NullWritable.get(), new Text(x.toString)))
  .toPairRDD(nullWritableOrdering)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```

This would make it less magical why the definition of an implicit Ordering changes the bytecode.
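
For illustration, such a conversion might look like the hypothetical sketch below (this reflects the reviewer's suggestion only, not what the final patch adds):

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.{PairRDDFunctions, RDD}

object ExplicitConversions {
  // Hypothetical explicit conversion: passing the Ordering by hand removes
  // the implicit search whose outcome differs between Hadoop 1.+ and
  // Hadoop 2.+ builds.
  implicit class ExplicitPairRDD[K, V](rdd: RDD[(K, V)]) {
    def toPairRDD(ord: Ordering[K])(
        implicit kt: ClassTag[K], vt: ClassTag[V]): PairRDDFunctions[K, V] =
      new PairRDDFunctions(rdd)(kt, vt, ord)
  }
}
```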

@zsxwing (Member, Author) replied:

Right. An explicit solution is better for such a tricky issue.

@SparkQA commented Dec 19, 2014

Test build #24635 has started for PR 3740 at commit 734bac9.

  • This patch merges cleanly.

@SparkQA commented Dec 19, 2014

Test build #24635 has finished for PR 3740 at commit 734bac9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24635/

@aarondav (Contributor) commented:

@rxin I'm not a core maintainer, and also at this point it's mainly an issue of how to style this so it is clear to future readers, so could you take a look?

```scala
val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
val textClassTag = implicitly[ClassTag[Text]]
val r = this.map(x => (NullWritable.get(), new Text(x.toString)))
```
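
Based on commit e4ad8b5 in the squashed commit list below ("Use null for the implicit Ordering"), this excerpt plausibly continues along the following lines (a sketch, not the verbatim diff):

```scala
// Invoke the conversion explicitly and pass null for the Ordering, so the
// emitted call no longer depends on which implicits each Hadoop build resolves.
RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```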
A Contributor commented on this diff:

Just noticed we can reuse the Text here to reduce GC. Anyway, that's not part of this PR. Would you be willing to submit a new PR for that?
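
For reference, a minimal sketch of that idea (illustrative only, not part of this PR): reuse one mutable Text per partition instead of allocating a new one per record.

```scala
this.mapPartitions { iter =>
  // One shared, mutable Text per partition; each record overwrites it.
  // This is safe only if each pair is written out before the next call to
  // iter.map mutates the buffer.
  val text = new Text()
  iter.map { x =>
    text.set(x.toString)
    (NullWritable.get(), text)
  }
}
```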

@zsxwing (Member, Author) replied:

OK. I'll send another PR for that after this one is merged.

@SparkQA commented Dec 20, 2014

Test build #24670 has started for PR 3740 at commit 39d9df2.

  • This patch merges cleanly.

@SparkQA commented Dec 20, 2014

Test build #24670 has finished for PR 3740 at commit 39d9df2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24670/

@rxin (Contributor) commented Dec 22, 2014

Thanks - merging in master & branch-1.2.

asfgit pushed a commit that referenced this pull request Dec 22, 2014
[SPARK-2075][Core] Make the compiler generate the same bytecode for Hadoop 1.+ and Hadoop 2.+

`NullWritable` is a `Comparable` rather than a `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. As a result, it generates different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, we provide an Ordering for NullWritable here so that the compiler generates the same code.

I used the following commands to confirm that the generated bytecode is the same.
```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt

mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt

diff ~/hadoop1.txt ~/hadoop2.txt
```

However, the compiler still generates different code for classes that call methods of `JobContext/TaskAttemptContext`: `JobContext/TaskAttemptContext` is a class in Hadoop 1.+, so calling its methods compiles to `invokevirtual`, while it is an interface in Hadoop 2.+, where the same calls compile to `invokeinterface`.

To fix this, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`.

Author: zsxwing <zsxwing@gmail.com>

Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits:

39d9df2 [zsxwing] Fix the code style
e4ad8b5 [zsxwing] Use null for the implicit Ordering
734bac9 [zsxwing] Explicitly set the implicit parameters
ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration
fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD

(cherry picked from commit 6ee6aa7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit closed this in 6ee6aa7 on Dec 22, 2014
@zsxwing (Member, Author) commented Dec 22, 2014

This one needs a separate backport to branch-1.2, as the implicit API fixes do not exist in branch-1.2.

@zsxwing (Member, Author) commented Dec 22, 2014

`RDD.rddToPairRDDFunctions` needs to be changed to `rddToPairRDDFunctions` for branch-1.2.
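
i.e., the backported call would plausibly read as follows (a sketch; in branch-1.2 the conversion is still resolved via SparkContext's implicits rather than the RDD object):

```scala
// branch-1.2 sketch: the conversion is referenced by its bare name because
// RDD.rddToPairRDDFunctions does not exist on that branch yet.
rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```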

@zsxwing (Member, Author) commented Dec 22, 2014

See #3758

@rxin (Contributor) commented Dec 22, 2014

Thanks - I will merge the new PR once tests pass.

asfgit pushed a commit that referenced this pull request Dec 22, 2014
backport #3740 for branch-1.2

Author: zsxwing <zsxwing@gmail.com>

Closes #3758 from zsxwing/SPARK-2075-branch-1.2 and squashes the following commits:

b57d440 [zsxwing] SPARK-2075 backport for branch-1.2
@baishuo (Contributor) commented Dec 29, 2014

I learned a lot reviewing this PR, thanks.
