
Concatenating and deserializing streams written by SnappyOutputStream #103

Closed
JoshRosen opened this issue May 10, 2015 · 12 comments

I'd like to be able to write data to multiple files using separate SnappyOutputStreams, then concatenate the serialized files and read the combined file using a single SnappyInputStream. Does SnappyOutputStream support this use case? If not, is this a prohibitively difficult feature to add? The snappy framing format linked from the Snappy website explicitly supports this use case, but I think SnappyOutputStream uses a different format.

rxin commented May 11, 2015

@xerial not sure how feasible or likely it is for this to happen, but it'd help Spark's performance tremendously, because we are experimenting with a new shuffle path that uses channel.transferTo to avoid user-space copying. However, for that to work, we'd need the underlying format to support concatenation. As far as we know, LZF has this property, and Snappy might also have it (but the snappy-java implementation doesn't support it).

xerial (Owner) commented May 11, 2015

Currently SnappyOutputStream doesn't support this use case, but its data format is simple:

(header) (compressed size, compressed data)*

So it would be easy to let SnappyInputStream read concatenated chunks like this:

(header) (compressed size, compressed data)* ...   (header) (compressed size, compressed data)*

This just requires adding a header check before reading each compressed block; a sketch of that loop follows below.
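
A minimal sketch of that reader loop, for illustration only (MAGIC, HEADER_SIZE, and BlockSink are placeholder names, not the actual snappy-java internals):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Sketch of the header-check idea only; MAGIC, HEADER_SIZE, and BlockSink
// are illustrative placeholders, not the actual snappy-java constants.
final class ConcatenatedSnappyReader {
    private static final byte[] MAGIC = {'S', 'N', 'A', 'P', 'P', 'Y'}; // placeholder
    private static final int HEADER_SIZE = 16;                          // placeholder

    interface BlockSink { void onBlock(byte[] compressed) throws IOException; }

    static void readBlocks(InputStream raw, BlockSink sink) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        DataInputStream data = new DataInputStream(in);
        while (true) {
            // Peek ahead: a header here means another stream was concatenated
            // after the previous one, so skip it and keep reading blocks.
            in.mark(MAGIC.length);
            byte[] peek = new byte[MAGIC.length];
            int n = in.readNBytes(peek, 0, MAGIC.length);
            in.reset();
            if (n == 0) return; // clean EOF between blocks
            if (n == MAGIC.length && Arrays.equals(peek, MAGIC)) {
                data.skipBytes(HEADER_SIZE); // skip the embedded header
                continue;
            }
            int compressedSize = data.readInt(); // block length prefix
            byte[] block = new byte[compressedSize];
            data.readFully(block);               // compressed payload
            sink.onBlock(block);                 // hand off for decompression
        }
    }
}
```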

I think the SnappyFramedInput/OutputStream implementation is less mature (#81) than SnappyOutput/InputStream, so extending SnappyInputStream to support reading concatenated chunks is the easiest approach.

xerial (Owner) commented May 11, 2015

@JoshRosen @rxin
I've created snappy-java-1.1.2-SNAPSHOT, which supports reading the concatenated output of multiple SnappyOutputStreams:
https://oss.sonatype.org/content/repositories/snapshots/org/xerial/snappy/snappy-java/1.1.2-SNAPSHOT/

Does this extension of SnappyInputStream satisfy your use case?
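
For reference, a minimal round-trip sketch of the use case against the public snappy-java API; the in-memory buffers stand in for the separate files, and it assumes a snappy-java version with the concatenation support described above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.xerial.snappy.SnappyInputStream;
import org.xerial.snappy.SnappyOutputStream;

public class ConcatReadExample {
    // Compress a payload with its own, independent SnappyOutputStream.
    static byte[] compress(byte[] data) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (SnappyOutputStream out = new SnappyOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Two independently compressed streams, concatenated byte-for-byte.
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        joined.write(compress("hello ".getBytes("UTF-8")));
        joined.write(compress("world".getBytes("UTF-8")));

        // With snappy-java >= 1.1.2, SnappyInputStream should read across
        // the second stream's header transparently.
        try (SnappyInputStream in = new SnappyInputStream(
                new ByteArrayInputStream(joined.toByteArray()))) {
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) != -1; ) {
                result.write(chunk, 0, n);
            }
            System.out.println(result.toString("UTF-8")); // "hello world"
        }
    }
}
```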

xerial self-assigned this May 11, 2015
@JoshRosen (Author)

@xerial I'll test out the snapshot this afternoon, but based on reading the code it looks like this will address our use case.

Does snappy-java provide a programmatic way to access its version number? I'd like to detect at runtime whether we're using a snappy-java version that supports reading concatenated data. I'm worried about scenarios where Spark ships with 1.1.2+ but a dependency conflict forces an earlier version to be used; in those cases, it would be helpful to detect the older version and fall back to an old code path that doesn't perform the concatenation.

xerial (Owner) commented May 12, 2015

@JoshRosen
Use Snappy.getNativeLibraryVersion(); snappy-java-1.1.2-SNAPSHOT returns the string "1.1.2". Due to a packaging failure (fixed in b1b8276), earlier versions may return "unknown".

I've updated the 1.1.2-SNAPSHOT jar with that fix.
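
For example, a runtime capability check along those lines might look like the following sketch; only Snappy.getNativeLibraryVersion() is real snappy-java API, and the version parsing (including stripping pre-release suffixes such as "-RC1") is a naive placeholder:

```java
import org.xerial.snappy.Snappy;

// Sketch of a runtime capability check; only getNativeLibraryVersion()
// is real snappy-java API, the rest is illustrative.
public class SnappyVersionCheck {
    static boolean supportsConcatenatedStreams() {
        String v = Snappy.getNativeLibraryVersion();
        if (v == null || "unknown".equals(v)) {
            return false; // older jars with the packaging bug report "unknown"
        }
        String[] parts = v.split("\\.");
        int major = parseComponent(parts, 0);
        int minor = parseComponent(parts, 1);
        int patch = parseComponent(parts, 2);
        // Concatenated-stream support landed in 1.1.2.
        return major > 1
                || (major == 1 && minor > 1)
                || (major == 1 && minor == 1 && patch >= 2);
    }

    private static int parseComponent(String[] parts, int i) {
        if (i >= parts.length) return 0;
        // Strip any non-numeric suffix, e.g. "2-RC1" -> "2".
        String digits = parts[i].replaceAll("\\D.*$", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }
}
```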

bokken (Contributor) commented May 12, 2015

SnappyFramedInputStream also supports reading multiple framed streams concatenated together (see the sketch below).

As for @xerial's reference to #81, I do not see any actual issue there. There is a difference in how memory is allocated that significantly impacted an artificial benchmark (reading/writing a single byte at a time) but has little or no impact in real use.
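
For completeness, the framed-stream analogue of the earlier round-trip sketch, again written against the public snappy-java API with in-memory buffers standing in for the separate files:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.xerial.snappy.SnappyFramedInputStream;
import org.xerial.snappy.SnappyFramedOutputStream;

public class FramedConcatExample {
    // Compress a payload with its own, independent framed stream.
    static byte[] compressFramed(byte[] data) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (SnappyFramedOutputStream out = new SnappyFramedOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        joined.write(compressFramed("hello ".getBytes("UTF-8")));
        joined.write(compressFramed("world".getBytes("UTF-8")));

        // Per the comment above, SnappyFramedInputStream handles
        // concatenated framed streams.
        try (SnappyFramedInputStream in = new SnappyFramedInputStream(
                new ByteArrayInputStream(joined.toByteArray()))) {
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) != -1; ) {
                result.write(chunk, 0, n);
            }
            System.out.println(result.toString("UTF-8")); // "hello world"
        }
    }
}
```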

@JoshRosen (Author)

@bokken, in our use case we're always going to be performing fairly large bulk writes, so it sounds like we shouldn't expect a performance penalty from switching to the framed stream. I might actually prefer that approach, since it won't require a dependency upgrade or leave room for dependency conflicts to break our concatenation. @rxin, do you remember why you opened #81, or do you otherwise anticipate problems with switching Spark to the framed stream?

rxin commented May 12, 2015

It might not be a problem anymore with sort-based shuffle, since at any given time we have only one stream. (With the old hash-based shuffle we had a lot of streams.)

xerial (Owner) commented May 13, 2015

@bokken @rxin
That is good to know. I'll close #81.

Note that SnappyFramedOutputStream has an expected overhead from computing CRC32 checksums, and there might be a portability problem since it relies on sun.misc.Cleaner, which is not available on some platforms (e.g., Android). But that would not be a problem unless @rxin's April Fools' joke becomes a reality :)
https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html

Anyway, I'll deploy snappy-java-1.1.2 (-RC1) to Maven Central to make testing easier.

bokken (Contributor) commented May 13, 2015

@xerial, sun.misc.Cleaner is only used via reflection, when it is present. If it is not present, there is simply no aggressive reclamation of direct byte buffers (i.e., native memory). A sketch of that pattern follows below.
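
A sketch of that reflection pattern, for reference; the class below is illustrative rather than the actual snappy-java code, and sun.misc.Cleaner is a JDK-internal API that later JDKs removed (which is exactly why the lookup is guarded):

```java
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

// Illustrative "reflection if present" pattern, not the snappy-java source.
final class DirectBufferCleaner {
    private static final Method CLEANER_METHOD; // DirectByteBuffer.cleaner()
    private static final Method CLEAN_METHOD;   // sun.misc.Cleaner.clean()

    static {
        Method cleaner = null, clean = null;
        try {
            cleaner = Class.forName("java.nio.DirectByteBuffer")
                           .getDeclaredMethod("cleaner");
            cleaner.setAccessible(true);
            clean = Class.forName("sun.misc.Cleaner").getDeclaredMethod("clean");
        } catch (Throwable t) {
            // Not available on this platform (e.g., Android or newer JDKs):
            // fall back to letting the GC reclaim direct buffers eventually.
            cleaner = null;
            clean = null;
        }
        CLEANER_METHOD = cleaner;
        CLEAN_METHOD = clean;
    }

    /** Eagerly frees a direct buffer's native memory when possible; no-op otherwise. */
    static void release(ByteBuffer buf) {
        if (CLEANER_METHOD == null || !buf.isDirect()) return;
        try {
            Object cleaner = CLEANER_METHOD.invoke(buf);
            if (cleaner != null) CLEAN_METHOD.invoke(cleaner);
        } catch (Throwable t) {
            // Best effort only; the GC will reclaim the memory later.
        }
    }
}
```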

xerial (Owner) commented May 13, 2015

@bokken Thanks for the clarification. Then it's safe on any platform.

xerial (Owner) commented May 13, 2015

@JoshRosen
Just deployed snappy-java-1.1.2-RC1, which will be available soon.

I'm closing the ticket now. If you find any problems, please open a new one.
