Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: incremental-hasher #261

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

feat: incremental-hasher #261

wants to merge 5 commits into from

Conversation

Gozala
Copy link
Contributor

@Gozala Gozala commented Jul 19, 2023

Proposal for the #260

@Gozala Gozala requested review from rvagg and achingbrain July 19, 2023 17:38
Copy link
Member

@alanshaw alanshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I left some suggestions.

digest(): Digest

/**
* Computes the digest of the given input and writes it into the provided
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of the given input

Do you mean "of the bytes written so far"?

* can be use to control whether multihash prefix is written, if `false`
* only the raw digest writtend omitting the prefix.
*/
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return output? Not sure how useful it is to chain this method.

Suggested change
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): this
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): Uint8Array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there's a better convention for method name when receiving a parameter to mutate. digestInto is a bit awkwardly named (IMO!).

digestBYOB? 🙄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asMultihash - do we need? The point of this library is multiformats.

Copy link
Contributor Author

@Gozala Gozala Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went into this in the linked issue. In practice I have encountered many instances where I do need to leave out the prefix, I could go and use another library in those instances, but seems like we could just expose this. That sayid I agree that this method is kind of meh and we could do better.

@vasco-santos suggested having a whole another method, personally I'm wondering if perhaps we should have methods just to get digest without a prefix as this is low level API anyway ?

Suggested change
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): this
// multihash prefix for it
header: Uint8Array
// only writes the digest without a prefix
digestInto(output: Uint8Array, offset?: number): this

Or alternatively something like this

Suggested change
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): this
encodeMultihashInto(target: Uint8Array, offset?: number): this
encodeMultihashHeaderInto(target: Uint8Array, offset?: number): this
encodeDigestInto(target: Uint8Array, offset?: number): thsi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

digestInto is a bit awkwardly named (IMO!).

I'm happy to call this something, else I was trying to align with varint.encodeInto, which as it turns out was called varint.encodeTo 😅

digestBYOB? 🙄

Works for me although I had to google to figure out what BYOB stand for.

Alternatively we could just have read(bytes: Uint8Array, offfset?: number): this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return output? Not sure how useful it is to chain this method.

I personally find mutations that return values misleading, that said I'm amendable to the idea

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to google to figure out what BYOB stand for.

BYOB was aiming for familiarity with https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamBYOBReader etc. but I guess that was a miss 😆

/**
* Number of bytes that were consumed.
*/
count(): bigint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean if someone is hashing >9PiB of data in JS then 👏👏👏.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah ... is this overkill?

digest(): Digest

/**
* Computes the digest of the given input and writes it into the provided
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Computes the digest of the given input and writes it into the provided
* Computes the digest of the bytes written so far and writes it into the provided

@Gozala
Copy link
Contributor Author

Gozala commented Jul 20, 2023

I'm realizing now that interface here is intentionally non-destructive, as in I could compute digest over and over. Unfortunately node crypto APIs are destructive though

The Hash object can not be used again after hash.digest() method has been called. Multiple calls will cause an error to be thrown.

Instead node provides copy method so you could continue writing into the hasher copy.

Given this the case with node, proposed API seems impractical, perhaps instead we could also introduce same constraint and copy() method. Better implementations could return same instance from the copy while ones that wrap node crypto APIs would avoid making a copy just in case.

On the other hand copy on digest maybe negligible overhead, in which case API without copy would be nicer.

/**
* Writes bytes to be digested.
*/
write(bytes: Uint8Array): this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically in streaming hashers this is called update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with calling it update although I do find that name confusing personally as I think of update as overwrite as opposed to append.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but streaming hashers aren't appending to a buffer they are updating their internal state with the new data you pass.

@alanshaw
Copy link
Member

alanshaw commented Jul 22, 2023

What about digest() -> multihash and rawDigest() -> digest without multihash header?

@alanshaw
Copy link
Member

alanshaw commented Aug 4, 2023

I'd really like this to land! I find myself wanting to do this more and more...

*
* @param [offset=0] - Byte offset in the `target`.
*/
readDigest(target: Uint8Array, offset?: number): this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you describe the use-case for this? it seems like this makes it an onerous API to have to implement

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my mistake, this is the output function!

I think maybe the naming could be better here. We have ample precedent of digest() in JS-land, so we could have digest() and multihash() (or multihashDigest() if you want to be more explicit). In Go-land Sum() is the standard for this action, which has grown on me to make sense (though it's taken time!).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I also see I'm discussing history here - read* being the new versions? I'm not a fan. I also wonder whether we could have nicer APIs that don't require you to pass in a target? I understand that's an important part of this, for efficiency, but casual use typically just wants that done for you. So could the APIs take a target? instead and always return Uint8Array? So you can either choose to supply the bytes to write in to (with optional offset) or not supply one, but either way you get back some bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read* being the new versions? I'm not a fan.

I mean if you think of it as a transform stream, it makes sense to have write and read ops. I don't mind renaming it to something else, but please don't make me come up with a name that everyone will like.

I also wonder whether we could have nicer APIs that don't require you to pass in a target? I understand that's an important part of this, for efficiency, but casual use typically just wants that done for you. So could the APIs take a target? instead and always return Uint8Array? So you can either choose to supply the bytes to write in to (with optional offset) or not supply one, but either way you get back some bytes.

I'm not completely opposed to returning back the target, however I would caution against it as it mixes two very different modes into one and can also lead to mistakes (e.g. you may have passed undefined reference which will no through but happily give you back Uint8Array)

Idea was that if you want to compute digest you just call digest method and use this only in those rare cases when you need to work with slabs of memory.

@Gozala
Copy link
Contributor Author

Gozala commented Aug 12, 2023

For what it's worth Rust Multihash hasher has a similar interface to one proposed here

/// Trait implemented by a hash function implementation.
pub trait Hasher {
    /// Consume input and update internal state.
    fn update(&mut self, input: &[u8]);

    /// Returns the final digest.
    fn finalize(&mut self) -> &[u8];

    /// Reset the internal hasher state.
    fn reset(&mut self);
}

Comment on lines +100 to +103
/**
* Number of bytes that were consumed.
*/
count(): bigint
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* Number of bytes that were consumed.
*/
count(): bigint

Let's just drop this method, we can revisit if we find it really necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants