
[SPARK-10949] Update Snappy version to 1.1.2 #8995

Closed
wants to merge 5 commits into from

Conversation

a-roberts
Contributor

Snappy now supports concatenation of serialized streams. This patch contains a version number change, and the "does not support" test is now a "supports" test.

Snappy 1.1.2 changelog mentions:
snappy-java-1.1.2 (22 September 2015)
This is a backward compatible release for 1.1.x.
Add AIX (32-bit) support.
There is no upgrade for the native libraries of the other platforms.

A major change since 1.1.1 is a support for reading concatenated results of SnappyOutputStream(s)
snappy-java-1.1.2-RC2 (18 May 2015)
Fix #107: SnappyOutputStream.close() is not idempotent
snappy-java-1.1.2-RC1 (13 May 2015)
SnappyInputStream now supports reading concatenated compressed results of SnappyOutputStream
There has been no compressed format change since 1.0.5.x, so you can read the compressed results interchangeably between these versions.
Fixes a problem when java.io.tmpdir does not exist.

From https://github.com/xerial/snappy-java/blob/develop/Milestone.md, up to date at the time of this pull request.
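For context, the behaviour described above can be sketched against the snappy-java API roughly as follows (a minimal, illustrative example rather than code from this patch; it concatenates two independently compressed streams and reads them back with a single SnappyInputStream):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

object ConcatenationSketch {
  def main(args: Array[String]): Unit = {
    val combined = new ByteArrayOutputStream()

    // Write two independently compressed streams back to back.
    for (chunk <- Seq("first block ", "second block")) {
      val snappyOut = new SnappyOutputStream(combined)
      snappyOut.write(chunk.getBytes("UTF-8"))
      snappyOut.close() // safe to close each wrapper; ByteArrayOutputStream.close() is a no-op
    }

    // With snappy-java 1.1.2, one SnappyInputStream can read across the
    // boundary between the two concatenated streams.
    val snappyIn = new SnappyInputStream(new ByteArrayInputStream(combined.toByteArray))
    val decompressed = scala.io.Source.fromInputStream(snappyIn, "UTF-8").mkString
    assert(decompressed == "first block second block")
  }
}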

Also note xerial/snappy-java#103
"@xerial not sure how feasible or likely it is for this to happen, but it'd help tremendously Spark's performance because we are experimenting with a new shuffle path that uses channel.transferTo to avoid user space copying. However, for that to work, we'd need the underlying format to support concatenation. As far we know, LZF has this property, and Snappy might also have it (but snappy-java implementation doesn't support it)."

It would be useful to have this in both the 1.5 and 1.6 branches.

@JoshRosen
Contributor

Snappy upgrades have historically been a cause of bugs, so I'm going to veto putting this into 1.5.2. Let's definitely consider it for Spark 1.6, though.

Jenkins, this is ok to test.

@JoshRosen
Contributor

By the way, in addition to the changes here, we need to update code elsewhere in order to benefit from the concatenation of serialized streams. For the Tungsten shuffle write path, the right line to change is

!compressionEnabled || compressionCodec instanceof LZFCompressionCodec;

Rather than changing this here, though, I'd prefer to do something similar to what I did for Serializer, defining a private API to let instances express whether they have this fast-merging property:

private[spark] def supportsRelocationOfSerializedObjects: Boolean = false
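A rough illustration of that instance-flag pattern, with placeholder names rather than actual Spark code (the thread below ends up choosing a helper on the CompressionCodec companion object instead):

// Codecs default to not supporting concatenation and opt in by overriding the flag.
trait ConcatenationAwareCodec {
  def supportsConcatenationOfSerializedStreams: Boolean = false
}

// A codec whose output can safely be concatenated (e.g. LZF, or Snappy with
// snappy-java 1.1.2) overrides the flag to true.
class SketchSnappyCodec extends ConcatenationAwareCodec {
  override def supportsConcatenationOfSerializedStreams: Boolean = true
}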

@SparkQA

SparkQA commented Oct 6, 2015

Test build #1845 has finished for PR 8995 at commit 352bb3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Any update here? Will you have time to address my comments?

@a-roberts
Contributor Author

Josh, apologies for the late response here; I want to make sure I'm understanding your suggestion.

So for your first comment, the naive way would be to add SnappyCompressionCodec such that:

final boolean fastMergeIsSupported = !compressionEnabled ||
    compressionCodec instanceof LZFCompressionCodec ||
    compressionCodec instanceof SnappyCompressionCodec;

but this isn't scalable in the long term: it would be prone to error as we support more codecs, and it feels hacky, requiring this line to be modified whenever new functionality or new codecs become available. Having said that, we already have hard-coded compression codec names in CompressionCodec.scala...

With your proposal we'd add

@Private
private[spark] def supportsSerializedStreams: Boolean = false 

Alternatively, we could make "supportsSerializedStreams" a required method, so that when a user defines their own codec to be used with Spark, they would have to implement it.

@JoshRosen
Contributor

Hey @a-roberts,

How about this:

  • Add a private[spark] method to the private[spark] CompressionCodec companion object and have that method maintain the hardcoded list of compression codecs which support concatenation of serialized streams. This method should accept a CompressionCodec instance and perform the instanceof check. I'd consider naming this something like "supportsConcatenationOfSerializedStreams" to be very explicit and clear.
  • Update fastMergeIsSupported to use this new static method.

I like this approach since it makes it very clear why we're only supporting those two codecs.

I wouldn't worry about third-party / external compression codecs being able to take advantage of this feature.

@a-roberts
Contributor Author

Cheers Josh, makes sense. I'm going to test this on our systems before updating the PR with the changes; here's what I've added (the last parts of each code block).

In core/src/main/scala/org/apache/spark/io/CompressionCodec.scala

private[spark] object CompressionCodec {

  private val configKey = "spark.io.compression.codec"
  private val shortCompressionCodecNames = Map(
    "lz4" -> classOf[LZ4CompressionCodec].getName,
    "lzf" -> classOf[LZFCompressionCodec].getName,
    "snappy" -> classOf[SnappyCompressionCodec].getName)


  private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
    codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
  }

In core/src/main/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriter.java

private long[] mergeSpills(SpillInfo[] spills) throws IOException {
    final File outputFile = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    final boolean compressionEnabled = sparkConf.getBoolean("spark.shuffle.compress", true);
    final CompressionCodec compressionCodec = CompressionCodec$.MODULE$.createCodec(sparkConf);
    final boolean fastMergeEnabled =
      sparkConf.getBoolean("spark.shuffle.unsafe.fastMergeEnabled", true);

    final boolean fastMergeIsSupported = !compressionEnabled || 
      CompressionCodec$.MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);

@JoshRosen
Contributor

That plan sounds fine to me.

Add known compression codecs that support concatenation of serialized streams
Update fastMergeIsSupported so we can support concatenation of serialized streams
@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 17, 2015

Test build #43890 has finished for PR 8995 at commit 3d650c8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Scalastyle fixes for whitespace
@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 18, 2015

Test build #43899 has finished for PR 8995 at commit 1949bcb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, this is ok to test.

@JoshRosen
Contributor

(I'll try to see if I can get Jenkins to just auto-retest this...)

@SparkQA

SparkQA commented Oct 20, 2015

Test build #43944 has finished for PR 8995 at commit 0f87052.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 20, 2015

Test build #43992 has finished for PR 8995 at commit 0f87052.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Hey @a-roberts, any chance that you could fix the merge conflicts here so that I can re-test? I'd like to get this patch in soon so that users can benefit from the faster shuffle spill merging out of the box.

@srowen
Member

srowen commented Oct 27, 2015

@a-roberts are you still working on this?

@JoshRosen
Contributor

I'd really like to get this in; @a-roberts, could you let us know if you no longer plan to work on this so that someone else could take over?

@a-roberts
Contributor Author

Hi, I have been preparing for and enjoying Spark Summit Europe, so it's better for somebody else to take this over; it looks like the files I changed have moved around since testing, so I imagine it won't take long anyway.

@JoshRosen
Contributor

I've opened #9439 to take this over. @a-roberts, do you mind closing this one for now?

@asfgit asfgit closed this in 701fb50 Nov 4, 2015