
Concatenating and deserializing streams written by SnappyOutputStream #103

Closed
JoshRosen opened this issue May 10, 2015 · 12 comments

I'd like to be able to write data to multiple files using separate SnappyOutputStreams, then concatenate the serialized files and read the combined file using a single SnappyInputStream. Does SnappyOutputStream support this use case? If not, is this a prohibitively difficult feature to add? The snappy framing format linked from the Snappy website explicitly supports this use case, but I think SnappyOutputStream uses a different format.

rxin commented May 11, 2015

@xerial not sure how feasible or likely it is for this to happen, but it'd help Spark's performance tremendously, because we are experimenting with a new shuffle path that uses channel.transferTo to avoid user-space copying. However, for that to work, we'd need the underlying format to support concatenation. As far as we know, LZF has this property, and Snappy might also have it (but the snappy-java implementation doesn't support it).

xerial (Owner) commented May 11, 2015

Currently SnappyOutputStream doesn't support this use case, but its data format is simple:

(header) (compressed size, compressed data)*

So it would be easy to let SnappyInputStream read concatenated chunks like this:

(header) (compressed size, compressed data)* ...   (header) (compressed size, compressed data)*

This just requires adding a header check before reading each compressed block; a sketch of that loop follows below.
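
A minimal sketch of that reader loop, for illustration only (MAGIC, HEADER_SIZE, and BlockSink are placeholder names, not the actual snappy-java internals):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Sketch of the header-check idea only; MAGIC, HEADER_SIZE, and BlockSink
// are illustrative placeholders, not the actual snappy-java constants.
final class ConcatenatedSnappyReader {
    private static final byte[] MAGIC = {'S', 'N', 'A', 'P', 'P', 'Y'}; // placeholder
    private static final int HEADER_SIZE = 16;                          // placeholder

    interface BlockSink { void onBlock(byte[] compressed) throws IOException; }

    static void readBlocks(InputStream raw, BlockSink sink) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        DataInputStream data = new DataInputStream(in);
        while (true) {
            // Peek ahead: a header here means another stream was concatenated
            // after the previous one, so skip it and keep reading blocks.
            in.mark(MAGIC.length);
            byte[] peek = new byte[MAGIC.length];
            int n = in.readNBytes(peek, 0, MAGIC.length);
            in.reset();
            if (n == 0) return; // clean EOF between blocks
            if (n == MAGIC.length && Arrays.equals(peek, MAGIC)) {
                data.skipBytes(HEADER_SIZE); // skip the embedded header
                continue;
            }
            int compressedSize = data.readInt(); // block length prefix
            byte[] block = new byte[compressedSize];
            data.readFully(block);               // compressed payload
            sink.onBlock(block);                 // hand off for decompression
        }
    }
}
```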

I think the SnappyFramedInput/OutputStream implementation is less mature (#81) than SnappyOutput/InputStream, so extending SnappyInputStream to support reading concatenated chunks is the easiest approach.

xerial (Owner) commented May 11, 2015

@JoshRosen @rxin
I've created snappy-java-1.1.2-SNAPSHOT, which supports reading the concatenated output of multiple SnappyOutputStreams:
https://oss.sonatype.org/content/repositories/snapshots/org/xerial/snappy/snappy-java/1.1.2-SNAPSHOT/

Does this extension of SnappyInputStream satisfy your use case?
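
For reference, a minimal round-trip sketch of the use case against the public snappy-java API; the in-memory buffers stand in for the separate files, and it assumes a snappy-java version with the concatenation support described above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.xerial.snappy.SnappyInputStream;
import org.xerial.snappy.SnappyOutputStream;

public class ConcatReadExample {
    // Compress a payload with its own, independent SnappyOutputStream.
    static byte[] compress(byte[] data) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (SnappyOutputStream out = new SnappyOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Two independently compressed streams, concatenated byte-for-byte.
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        joined.write(compress("hello ".getBytes("UTF-8")));
        joined.write(compress("world".getBytes("UTF-8")));

        // With snappy-java >= 1.1.2, SnappyInputStream should read across
        // the second stream's header transparently.
        try (SnappyInputStream in = new SnappyInputStream(
                new ByteArrayInputStream(joined.toByteArray()))) {
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) != -1; ) {
                result.write(chunk, 0, n);
            }
            System.out.println(result.toString("UTF-8")); // "hello world"
        }
    }
}
```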

xerial self-assigned this May 11, 2015
@JoshRosen (Author)

@xerial I'll test out the snapshot this afternoon, but based on reading the code it looks like this will address our use case.

Does snappy-java provide a programmatic way to access its version number? I'd like to detect at runtime whether we're using a snappy-java version that supports reading concatenated data. I'm worried about scenarios where Spark ships with 1.1.2+ but a dependency conflict forces an earlier version to be used; in those cases, it would be helpful to detect the older version and fall back to an old code path that doesn't perform the concatenation.

xerial (Owner) commented May 12, 2015

@JoshRosen
Use Snappy.getNativeLibraryVersion(); snappy-java-1.1.2-SNAPSHOT returns the string "1.1.2". Due to a packaging failure (fixed in b1b8276), earlier versions may return "unknown".

I've updated the 1.1.2-SNAPSHOT jar with that fix.
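
For example, a runtime capability check along those lines might look like the following sketch; only Snappy.getNativeLibraryVersion() is real snappy-java API, and the version parsing (including stripping pre-release suffixes such as "-RC1") is a naive placeholder:

```java
import org.xerial.snappy.Snappy;

// Sketch of a runtime capability check; only getNativeLibraryVersion()
// is real snappy-java API, the rest is illustrative.
public class SnappyVersionCheck {
    static boolean supportsConcatenatedStreams() {
        String v = Snappy.getNativeLibraryVersion();
        if (v == null || "unknown".equals(v)) {
            return false; // older jars with the packaging bug report "unknown"
        }
        String[] parts = v.split("\\.");
        int major = parseComponent(parts, 0);
        int minor = parseComponent(parts, 1);
        int patch = parseComponent(parts, 2);
        // Concatenated-stream support landed in 1.1.2.
        return major > 1
                || (major == 1 && minor > 1)
                || (major == 1 && minor == 1 && patch >= 2);
    }

    private static int parseComponent(String[] parts, int i) {
        if (i >= parts.length) return 0;
        // Strip any non-numeric suffix, e.g. "2-RC1" -> "2".
        String digits = parts[i].replaceAll("\\D.*$", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }
}
```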

bokken (Contributor) commented May 12, 2015

SnappyFramedInputStream also supports reading multiple framed streams concatenated together (see the sketch below).

As for @xerial's reference to #81, I do not see any actual issue there. There is a difference in how memory is allocated that significantly impacted an artificial benchmark (reading/writing a single byte at a time) but has little or no impact in real use.
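
For completeness, the framed-stream analogue of the earlier round-trip sketch, again written against the public snappy-java API with in-memory buffers standing in for the separate files:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.xerial.snappy.SnappyFramedInputStream;
import org.xerial.snappy.SnappyFramedOutputStream;

public class FramedConcatExample {
    // Compress a payload with its own, independent framed stream.
    static byte[] compressFramed(byte[] data) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (SnappyFramedOutputStream out = new SnappyFramedOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        joined.write(compressFramed("hello ".getBytes("UTF-8")));
        joined.write(compressFramed("world".getBytes("UTF-8")));

        // Per the comment above, SnappyFramedInputStream handles
        // concatenated framed streams.
        try (SnappyFramedInputStream in = new SnappyFramedInputStream(
                new ByteArrayInputStream(joined.toByteArray()))) {
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) != -1; ) {
                result.write(chunk, 0, n);
            }
            System.out.println(result.toString("UTF-8")); // "hello world"
        }
    }
}
```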

@JoshRosen (Author)

@bokken, in our use case we're always going to be performing fairly large bulk writes, so it sounds like we shouldn't expect a performance penalty from switching to the framed stream. I might actually prefer that approach, since it won't require a dependency upgrade or leave room for dependency conflicts to break our concatenation. @rxin, do you remember why you opened #81, or do you otherwise anticipate problems with switching Spark to the framed stream?

rxin commented May 12, 2015

It might not be a problem anymore with sort-based shuffle, since at any given time we have only one stream. (With the old hash-based shuffle we had a lot of streams.)

xerial (Owner) commented May 13, 2015

@bokken @rxin
That is good to know. I'll close #81.

Note that SnappyFramedOutputStream has an expected overhead from computing CRC32 checksums, and there might be a portability problem since it relies on sun.misc.Cleaner, which is not available on some platforms (e.g., Android). But that would not be a problem unless @rxin's April Fools' joke becomes a reality :)
https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html

Anyway, I'll deploy snappy-java-1.1.2 (-RC1) to Maven Central to make testing easier.

bokken (Contributor) commented May 13, 2015

@xerial, sun.misc.Cleaner is only used via reflection, when it is present. If it is not present, there is simply no aggressive reclamation of direct byte buffers (i.e., native memory). A sketch of that pattern follows below.
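
A sketch of that reflection pattern, for reference; the class below is illustrative rather than the actual snappy-java code, and sun.misc.Cleaner is a JDK-internal API that later JDKs removed (which is exactly why the lookup is guarded):

```java
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

// Illustrative "reflection if present" pattern, not the snappy-java source.
final class DirectBufferCleaner {
    private static final Method CLEANER_METHOD; // DirectByteBuffer.cleaner()
    private static final Method CLEAN_METHOD;   // sun.misc.Cleaner.clean()

    static {
        Method cleaner = null, clean = null;
        try {
            cleaner = Class.forName("java.nio.DirectByteBuffer")
                           .getDeclaredMethod("cleaner");
            cleaner.setAccessible(true);
            clean = Class.forName("sun.misc.Cleaner").getDeclaredMethod("clean");
        } catch (Throwable t) {
            // Not available on this platform (e.g., Android or newer JDKs):
            // fall back to letting the GC reclaim direct buffers eventually.
            cleaner = null;
            clean = null;
        }
        CLEANER_METHOD = cleaner;
        CLEAN_METHOD = clean;
    }

    /** Eagerly frees a direct buffer's native memory when possible; no-op otherwise. */
    static void release(ByteBuffer buf) {
        if (CLEANER_METHOD == null || !buf.isDirect()) return;
        try {
            Object cleaner = CLEANER_METHOD.invoke(buf);
            if (cleaner != null) CLEAN_METHOD.invoke(cleaner);
        } catch (Throwable t) {
            // Best effort only; the GC will reclaim the memory later.
        }
    }
}
```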

xerial (Owner) commented May 13, 2015

@bokken Thanks for the clarification. Then it's safe on any platform.

xerial (Owner) commented May 13, 2015

@JoshRosen
Just deployed snappy-java-1.1.2-RC1, which will be available soon.

I'm closing the ticket now. If you find any problems, please open a new one.
