Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

zlib: add adler32 and crc32 #7372

Closed
wants to merge 7 commits into from
Closed

zlib: add adler32 and crc32 #7372

wants to merge 7 commits into from

Conversation

seishun
Copy link

@seishun seishun commented Mar 28, 2014

Continuing from #7213.

I couldn't think of a way to make a streaming interface without over-complicating the API, and it seems like something that could be easily implemented by a user-land library. But if someone can do some benchmarking and show that marshaling the running checksum between JS and C++ causes a significant performance loss, it will be a good reason to reconsider.

The proposed API is pretty straightforward. If it LGTY, I'll add docs and tests.

@Nodejs-Jenkins
Copy link

Thank you for contributing this pull request! Here are a few pointers to make sure your submission will be considered for inclusion.

The following commiters were not found in the CLA:

  • Nikolai Vavilov

You can fix all these things without opening another issue.

Please see CONTRIBUTING.md for more information

@indutny
Copy link
Member

indutny commented Mar 28, 2014

I like, where it goes. Thanks for working for it.

But I think it should offer a streaming interface, there is a lot of the cases, where it won't be possible or possible only with a significant overhead to fit all input data in memory. I think you could try looking at lib/crypto.js to figure out how it should work.

Thanks again!

@seishun
Copy link
Author

seishun commented Mar 28, 2014

Actually you don't have to fit all input data in memory, since both Adler-32 and CRC-32 can be computed incrementally. So you could do something like this:

var crc = 0; // not sure about this part, see below
someStream.on('data', function(data) {
  crc = zlib.crc32(data, crc);
});
someStream.on('end', function() {
  doSomethingWith(crc);
});

One thing I'm not sure about is how to deal with the initial checksum value. The zlib docs don't mention that it's 0 for CRC-32 (and 1 for Adler-32) and instead tell you to pass Z_NULL as the buffer to get the initial value. So maybe we should allow calling zlib.adler32() and zlib.crc32() with no arguments to get the initial value. What are your thoughts?

@indutny
Copy link
Member

indutny commented Mar 30, 2014

I got you, could we still try to produce a Stream API in addition to one that you are proposing?

@seishun
Copy link
Author

seishun commented Mar 30, 2014

Well, if we have a stream API then there isn't any use in supporting incremental checksum for the non-stream API, is there? So maybe we could have the following two pairs:

zlib.createCRC32 and zlib.createAdler32 which return streams, and
zlib.crc32 and zlib.adler32 which just accept a Buffer as the only argument and return a hash of it.

Does that look good to you?

@indutny
Copy link
Member

indutny commented Mar 30, 2014

It wouldn't hurt to have optional argument for incremental result.

@seishun
Copy link
Author

seishun commented Mar 31, 2014

Hmm, what should the stream output? The checksum value as a 4-byte buffer? If so, in which endianness?

@vkurchatkin
Copy link

@seishun I think it doesn't make sense to make this Transform since the output isn't really a stream. Maybe just a writable stream with digest() method?

@seishun
Copy link
Author

seishun commented Mar 31, 2014

@vkurchatkin crypto hashes are Transform streams and their output isn't exactly a stream either.

Currently this outputs the checksum as a Buffer in little-endian, but perhaps this should be made configurable.

I still think that supporting incremental result for the non-stream API is a bad idea for the following reasons:

  • It duplicates functionality, potentially creating confusion for the user ("which one should I use?")
  • We need to document, test and support calling zlib.crc32 with two arguments, one argument and possibly with no arguments (see below)
  • Duplicate code – we need to have both the functions and the classes in node_zlib.cc. Otherwise we could just implement zlib.crc32 using the Stream API, similarly to the existing zlib convenience methods. Alternatively, we can do the opposite and implement the Stream API using the non-stream functions, but that could potentially be slower (due to the marshaling of the running checksum between JS and C++) and we then definitely need to figure out how to provide the initial checksum value to JS (probably by calling zlib.crc32 with no arguments, as mentioned before).

@vkurchatkin
Copy link

@seishun true... but doesn't make sense either)

Currently this outputs the checksum as a Buffer in little-endian, but perhaps this should be made configurable.

Maybe it's better to have result as a number (so, the stream can't be readable) and let the user write it to whatever buffer they want.

@indutny
Copy link
Member

indutny commented Mar 31, 2014

I'm ok with making it a Transform stream, which will emit the result only when .end() is called on it's readable side.

@seishun
Copy link
Author

seishun commented Mar 31, 2014

@indutny any comments on my last commit?

@indutny
Copy link
Member

indutny commented Apr 1, 2014

@seishun I wonder if it could be an readable-object-stream emitting a plain number as the data. Summoning @tjfontaine and @TooTallNate here to get some sort of consensus.

@seishun
Copy link
Author

seishun commented Apr 3, 2014

It seems you can't have a stream that's in object mode for reading and non-object mode for writing, at least according to the current docs. If you mean a writable stream with a digest() method, like @vkurchatkin suggested, then I could do that for the time being, until @tjfontaine and @TooTallNate wake up.

Do you still insist on supporting incremental checksum for the adler32 and crc32 methods? If so, how do you think we should deal with the initial checksum value?

@indutny
Copy link
Member

indutny commented Apr 3, 2014

@seishun I'd rather wait for @tjfontaine or @TooTallNate for streams thing.

Initial value could be implicitly used in C++-land when no argument is passed, does it sounds right for you?

@seishun
Copy link
Author

seishun commented Apr 3, 2014

I think there should be a way to get the initial value in JS as well, for example, if you want to crc32 a bunch of Buffers in a loop:

var crc = get_initial_value_somehow;
buffers.forEach(function(buf) {
  crc = zlib.crc32(buf, crc);
});

It will be especially important if we decide to implement the Stream API in JS-land using the non-stream functions. Perhaps I should do some benchmarks to see if such an implementation would be significantly slower than the current one.

@indutny
Copy link
Member

indutny commented Apr 3, 2014

Ok, I guess you could export it in a require('constants') module then.

@seishun
Copy link
Author

seishun commented Apr 3, 2014

It seems that module is internal to node, we need something easily accessible to the user.

@YurySolovyov
Copy link

require('zlib').crcInitilalValue ?

@indutny
Copy link
Member

indutny commented Apr 3, 2014

Oh, indeed. Better use all-caps name, though. zlib.CRC32_INITIAL_VALUE.

@seishun
Copy link
Author

seishun commented Apr 3, 2014

That seems pretty verbose though, I get the feeling a lot of people would just 0 instead. Is there something wrong with zlib.crc32() (no arguments)?

@vkurchatkin
Copy link

I think crc32(undefined), crc32(null) and crc32(new Buffer(0)) all should work.

@indutny
Copy link
Member

indutny commented Apr 3, 2014

That's how we do it in the rest of core, I think. I can't remember any method that will return initial value when called without arguments.

@seishun
Copy link
Author

seishun commented Apr 3, 2014

Is there any other method where an initial value is needed?

@indutny
Copy link
Member

indutny commented Apr 3, 2014

Not exactly initial value, but yeah: zlib.Z_NO_FLUSH, tls.DEFAULT_CIPHERS and other modules as well.

Anyway, I object a need to call method instead of a static explicit constant.

@vkurchatkin
Copy link

(offtopic) @seishun @indutny see #6284. That would be really nice to have

@indutny
Copy link
Member

indutny commented Apr 3, 2014

@vkurchatkin yeah, I am aware of it.

@seishun
Copy link
Author

seishun commented Apr 3, 2014

@indutny these are not quite the same. Z_NO_FLUSH is a constant in the underlying C library too, so someone who has worked with zlib would naturally expect to be a constant in the node binding too. The initial Adler32/CRC32 value, however, is not – you call adler32/crc32 with Z_NULL to get it – so it would be unintuitive and surprising if it were a constant rather than a method call in node.

Additionally, both crc32(buf) and crc32(buf, crc) return checksums. The initial value is also a completely valid checksum, so it would make sense if it were also returned by somehow calling crc32.

Also, getting back to my point about verbosity. If we force people to use a long and hard-to-type constant, they are more likely to just use 0 directly. Someone reading that code would either need to spend time to figure out where that magic value is coming from, or make incorrect assumptions (for example, that it's 0 for Adler32 too). We shouldn't make it harder to write good code.

Or we could avoid this problem altogether by not supporting incremental checksum for non-stream methods, since I don't see any use-case for that where it wouldn't be simpler to just use a stream.

@vkurchatkin yeah I guess it would help in this case.

@indutny
Copy link
Member

indutny commented Apr 3, 2014

@seishun ok, let's use occam's razor on this. Streams for incremental, methods for one-chunk checksums.

@seishun
Copy link
Author

seishun commented Apr 3, 2014

Good to have reached a consensus. Once we figure out how to deal with the streams, I can move the methods to JS-land to simply wrap the streams, reducing amount of code.

@seishun
Copy link
Author

seishun commented Apr 20, 2014

Since there hasn't been any input on this, I proceeded with the API that we seemed to have agreed on for now. The streams are now object mode for reading and emit a number, and the functions are implemented via the streams to remove them from C++ land.

this._binding = new binding.CRC32();
Transform.call(this);
this._writableState.objectMode = false;
this._readableState.objectMode = true;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can now use the new partial objectMode initialization

@seishun
Copy link
Author

seishun commented Apr 30, 2014

Fixed the objectMode thing.

@tjfontaine
Copy link

Ok, sorry to bounce you around on this, but this really needs to live in the crypto module and behave the way our existing hash implementations work. Your interface pretty much already matches that, we just need to move it to lib/crypto.js instead of lib/zlib.js the binding can stay in the zlib.cc

@seishun
Copy link
Author

seishun commented May 1, 2014

Are you sure about it? Moving checksums to crypto would be extremely unintuitive and misleading since they are not provided by OpenSSL (like everything else in crypto is) and should not be used for any cryptographic purposes.

@JavaScriptDude
Copy link

I would vote that moving it to crypto would not be that ugly. For anyone who does not know zlib, it's not intuitive that checksum functions would be in zlib module. Crypto would be my first guess. Additionally, other checksum and digest API's should get added over time to node core and these would not get added to zlib. It would be unfortunate to have checksum and digest algorithms split across different modules.

Just my 10c.

@bjouhier
Copy link

bjouhier commented May 5, 2014

@JavaScriptDude CRC32 is a hash function but it is not a cryptographic hash by any means: http://en.wikipedia.org/wiki/Cyclic_redundancy_check#CRCs_and_data_integrity

@trevnorris trevnorris added the zlib label May 7, 2014
@tjfontaine
Copy link

it's not a matter of it being cryptographic, but the fact that is a hash means it should be colocated with our other hashes and behave in the same way and use the same facilities -- the fact that it is implemented by zlib is an artifact of implementation

@seishun
Copy link
Author

seishun commented May 7, 2014

Well it would still be a separate class from Hash. And it can't behave the same way as Hash, since crypto hashes return a sequence of bytes, while the zlib checksums return an integer.

@indutny what's your take on this?

@indutny
Copy link
Member

indutny commented May 7, 2014

I'd like things to belong where they come from. If we would ever decide to remove zlib (which won't really happen), both things will go away together :P

@bjouhier
Copy link

bjouhier commented May 8, 2014

I was just raising the point because it may mislead people.
But there is also a rationale in putting all the hash functions together.

@seishun
Copy link
Author

seishun commented May 8, 2014

Rebased and replaced _binding with _handle.

@tjfontaine
Copy link

Our backwards compatibility guarantees basically mean that since we've provided a zlib module we will continue to provide that module, regardless if we link against zlib to implement that. Similarly crypto will exist even if we no longer rely on openssl to implement it.

CRC32 is a hash that works in the same way as the other hashes with .update() and .digest() it needs to live in the same spot as the rest of our hashes, regardless what the return type from .digest() is.

@jasnell
Copy link
Member

jasnell commented Aug 13, 2015

@indutny @chrisdickinson @trevnorris ... unfortunately this one seems to have stalled out. Any further thoughts on this one?

@JavaScriptDude
Copy link

Sure would be great to see some cpu efficient checksum tools in the node.js base for those non-crypto usecases like file change checking algorithms.

@jasnell
Copy link
Member

jasnell commented Aug 13, 2015

@JavaScriptDude ... I agree. At this point tho, I'm just trying to get a sense of where it should land. It's probably not something we'd add to v0.12 at this point so I'm thinking this should be closed here and reopened against the active development stream at http://github.com/nodejs/node

@jasnell
Copy link
Member

jasnell commented Aug 16, 2015

Going to close this here. The PR would never be able to land in it's current state and an updated PR opened against http://github.com/nodejs/node master would be needed. That's not to comment on the validity of this.

@JavaScriptDude
Copy link

For the record, I have added a new issue to open a discussion to Incorporate suite of hash utilities into node #2444

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants