Concatenating and deserializing streams written by SnappyOutputStream #103
Comments
@xerial not sure how feasible or likely it is for this to happen, but it'd help Spark's performance tremendously because we are experimenting with a new shuffle path that uses channel.transferTo to avoid user-space copying. However, for that to work, we'd need the underlying format to support concatenation. As far as we know, LZF has this property, and Snappy might also have it (but the snappy-java implementation doesn't support it).
Currently SnappyOutputStream doesn't support this use case, but SnappyOutputStream's data format is simple: a stream header followed by a sequence of length-prefixed compressed blocks.
So it would be easy to let SnappyInputStream read concatenated chunks: just add a header check before reading each compression block. I think the SnappyFramedInput/OutputStream implementation is not as mature (#81) as SnappyInput/OutputStream, so extending SnappyInputStream to support reading concatenated chunks is the easiest approach.
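The header-check idea can be sketched with a simplified model of the layout. The header bytes and the unframed "blocks" below are stand-ins, not snappy-java's real SnappyCodec constants; the point is the read loop that skips a repeated stream header at each block boundary, which is what reading concatenated streams requires.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model: a fixed stream header followed by
// [4-byte big-endian length][block payload] pairs.
public class ConcatDemo {
    // Hypothetical magic bytes for this sketch, not snappy-java's actual header.
    static final byte[] HEADER = {-126, 'D', 'E', 'M', 'O', 0, 0, 0};

    static byte[] writeStream(List<byte[]> blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(HEADER, 0, HEADER.length);
        for (byte[] b : blocks) {
            byte[] len = ByteBuffer.allocate(4).putInt(b.length).array();
            out.write(len, 0, len.length);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Reads blocks from data that may contain several concatenated streams:
    // before each block, check whether the next bytes are another stream
    // header and, if so, skip it and keep reading blocks.
    static List<byte[]> readConcatenated(byte[] data) {
        List<byte[]> blocks = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            if (pos + HEADER.length <= data.length
                    && Arrays.equals(Arrays.copyOfRange(data, pos, pos + HEADER.length), HEADER)) {
                pos += HEADER.length; // another stream starts here; skip its header
                continue;
            }
            int len = ByteBuffer.wrap(data, pos, 4).getInt();
            pos += 4;
            blocks.add(Arrays.copyOfRange(data, pos, pos + len));
            pos += len;
        }
        return blocks;
    }
}
```

Because the check happens only at block boundaries, a payload byte sequence that happens to look like a header cannot be misread mid-block; the real implementation has the same property since block lengths are explicit.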
@JoshRosen @rxin Does this extension of SnappyInputStream satisfy your use case?
@xerial I'll test out the snapshot this afternoon, but based on reading the code it looks like this will address our use-case. Does snappy-java provide a programmatic way to access its version number? I'd like to be able to detect at runtime whether we're using a snappy-java version that supports reading concatenated data, since I'm worried about scenarios where Spark ships with 1.1.2+ but a dependency conflict forces an earlier version to be used; in these cases, it would be helpful to be able to detect that we're using an older version and fall back to an old code path which doesn't perform the concatenation.
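The fallback described above could be done defensively via reflection, so the same code works whether or not a new-enough snappy-java is on the classpath. This is a hedged sketch: it assumes a static version accessor on `org.xerial.snappy.Snappy` (the method name below is an assumption), and treats any reflective failure as "too old, use the safe path".

```java
import java.lang.reflect.Method;

// Sketch of runtime detection: ask the snappy-java on the classpath for its
// version, falling back to null (= use the non-concatenating code path) if
// the library or the accessor is missing.
public class SnappyVersionCheck {
    static String snappyVersionOrNull() {
        try {
            Class<?> c = Class.forName("org.xerial.snappy.Snappy");
            // Assumed accessor name; an older snappy-java without it lands in the catch.
            Method m = c.getMethod("getNativeLibraryVersion");
            return (String) m.invoke(null);
        } catch (ReflectiveOperationException e) {
            return null; // library absent, or too old to expose a version
        }
    }

    // Compare dotted version strings numerically, component by component,
    // so e.g. "1.1.10" correctly exceeds "1.1.2".
    static boolean isAtLeast(String version, String minimum) {
        String[] a = version.split("\\."), b = minimum.split("\\.");
        for (int i = 0; i < Math.max(a.length, b.length); i++) {
            int x = i < a.length ? Integer.parseInt(a[i]) : 0;
            int y = i < b.length ? Integer.parseInt(b[i]) : 0;
            if (x != y) return x > y;
        }
        return true;
    }
}
```

A caller would then gate the concatenating shuffle path on something like `v != null && isAtLeast(v, "1.1.2")`.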
@JoshRosen I updated the 1.1.2-SNAPSHOT jar with that fix.
The SnappyFramedInputStream also supports reading multiple framed streams concatenated together. As for @xerial's reference to issue (#81), I do not see any actual issue. There is a difference in the way memory is allocated that significantly impacted a "fake" benchmark (reading/writing a single byte) and has little or no applicability in real use. |
@bokken, in our use-case we're always going to be performing fairly large bulk writes, so it sounds like we shouldn't expect to see a performance penalty from switching to the framed stream. I might actually prefer to take that approach since that won't require a dependency upgrade or introduce the potential for dependency conflicts to break our concatenation. @rxin, do you remember why you opened #81 or otherwise anticipate problems with switching Spark to use the framed stream? |
It might not be a problem anymore with sort-based shuffle, since at any given time we have only one stream. (With the old hash-based shuffle we had a lot of streams.)
@bokken @rxin Note that an expected overhead in SnappyFramedOutputStream is the computation of CRC32, and there might be a portability problem since SnappyFramedOutputStream relies on sun.misc.Cleaner. Anyway, I'll deploy snappy-java-1.1.2 (-RC1) to Maven Central for the ease of your testing.
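For context on the checksum overhead: the snappy framing format attaches a masked CRC-32C of each chunk's uncompressed data, and the masking step is a rotate-and-add defined by the framing-format description. A minimal sketch using the JDK's built-in `java.util.zip.CRC32C` (Java 9+):

```java
import java.util.zip.CRC32C;

// Per-chunk checksum as described by the snappy framing format: CRC-32C of
// the uncompressed data, then masked so that checksums of data that itself
// contains checksums are less fragile.
public class MaskedCrc {
    static int maskedCrc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        int c = (int) crc.getValue();
        // Rotate right by 15 bits, then add a constant, per the framing spec.
        return ((c >>> 15) | (c << 17)) + 0xa282ead8;
    }
}
```

This computation runs once per chunk on write and once on read, which is the overhead being weighed against SnappyOutputStream's plain length-prefixed blocks.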
@xerial, sun.misc.Cleaner is only used via reflection if present. If not present, there is simply no aggressive reclamation of direct byte buffers (i.e. native memory).
@bokken Thanks for the clarification. Then it's safe on any platform.
@JoshRosen Now let me close the ticket. If you find any problem, please open a new ticket. |
I'd like to be able to write data to multiple files using separate SnappyOutputStreams, then concatenate the serialized files and read the combined file using a SnappyInputStream. Does SnappyOutputStream support this use-case? If not, is this a prohibitively difficult feature to add? The snappy framing format linked from the Snappy website explicitly supports this use-case, but I think that SnappyOutputStream uses a different format.