Alternative hashing for data #1676
Comments
Hi @dunstantom! That is a very interesting idea! So is cache duplication the main motivation for this request? Or do you also want to use something other than md5 for checking whether your dependencies have changed in a pipeline scenario? Thanks, |
I would say it is mainly about having an alternative to md5. On the one hand, one could argue about how SHA-1 compares to md5, but on the other hand (and what I had in mind) we may have our own preferences about which kinds of changes necessitate recalculating/updating a dataset. This ties into limiting cache duplication as well, but more from the perspective of avoiding unnecessary computation than from a concern about storage for the cache. Hope that answers your question. |
@dunstantom Thank you for your answer! 🙂 We have been thinking about moving to SHA too, but other users have also requested faster hashing, so it looks like it is a general problem of supporting different hashing algorithms. Related to #1568. This is something we'll take a closer look at later. Thanks for the feedback! 🙂 |
@efiop I asked earlier this week in #1568 about the possibility of using other hash algorithms. I'm shifting that part of the discussion here since this issue seems to be a better match for my request (#1568 could in principle be solved by changing to one other hash algorithm, i.e., just switching the md5 function for another one). Short report on my progress: I made a local

My first idea about how to implement this is to move the info about the hash algorithm to the Output and to let each instance of an Output have its own hash algorithm. At present the

But I only know a very small part of the code base, so I'm hoping you can help me judge whether this idea makes any sense at all. It would then be up to the specific cache implementation to check the hash function on the Output instance and decide on a cache path. In principle one could imagine something like this in the cache directory:
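(A rough sketch; the algorithm names, digests, and exact paths below are illustrative assumptions, not an actual DVC layout.)

```
.dvc/cache/
├── sha256/
│   └── 2c/
│       └── f24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
└── sha512/
    └── ...
```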
But for backwards compatibility, any md5-hashed files could be stored under their old paths, like so:
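(That is, keeping the two-character-prefix layout DVC already uses for md5; the digest shown is just an example.)

```
.dvc/cache/
└── 5d/
    └── 41402abc4b2a76b9719d911017c592
```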
|
Hi @rasmuse! We used to have an assumption that you might have a few checksums, but the code evolved in such a way that we started to stick with a single-checksum assumption. That being said, the underlying logic is still partially there. For example, if you take a look at
So far, the likelihood of those two colliding has been pretty slim, so we didn't worry about it too much. But if we are moving to a more general approach, it totally makes sense to separate them somehow. I like your idea of creating subdirs for each hash type, and I agree that for backward compatibility we should stick with the current cache structure for md5 caches, meaning that it won't be put into

We actually have a few problems with the current hashing (for example, see #992) and I think we could switch to using

This is how I see it (approximately):
OPTIONAL: make it also accept a number of hashes in a given order of priority. E.g. if the config looks like
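(A hypothetical example; the `hash_types` option name is an assumption, not an existing DVC config key.)

```ini
[core]
hash_types = sha256,md5
```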
it would try to compute

NOTE: if we wanted to, we could easily support per-remote and even per-output/dep hash type configuration later.
Or, maybe, if you specify strictly
🙂 But we could totally start with a simple
NOTE: this is where it might get tricky, because our RemoteLOCAL uses our state db for caching md5s, so we don't have to recompute them each time. For now, you could simply not cache sha values at all, to make it simpler for you. We have some state generalization logic coming in #1654, which will help us easily cache sha (and other hash types) the same way as md5, so we will be able to help you with that later. 🙂 Hope this helps :) Let us know if you have any questions and/or need help. Thanks, |
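As a minimal sketch of the configurable-algorithm idea discussed above (not DVC's actual implementation; the function and parameter names are made up):

```python
import hashlib

def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Compute a hex digest of a file with a user-selected hashlib algorithm."""
    hasher = hashlib.new(algorithm)  # e.g. "md5", "sha256", "blake2b"
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# A cache path could then be derived from both the algorithm and the digest,
# e.g. "<cache_dir>/sha256/<first two hex chars>/<remaining hex chars>"
# for non-md5 algorithms, keeping the legacy layout for md5.
```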
Related to iterative#4144, iterative#3069, iterative#1676
By itself `self.info` is quite confusing, as it is not clear what it is about. Using `hash_info` makes much more sense and is required to support alternative hash types. Related to iterative#4144, iterative#3069, iterative#1676
It's great that this is being worked on. Solving for the general use case of a configurable hash is excellent, even though I'm sure there is some extra complexity there. There are a lot of advantages to different hash schemes. blake3 - fast as heck, also available as a python package with no other deps. (This would be ideal for DVC since it's so fast, but the package is new-ish and old-and-stable is always valuable. So having that configurable based on a user's risk tolerance is 💯 great) |
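For reference, a hedged sketch of what using blake3 from Python looks like, assuming the third-party `blake3` package and its hashlib-style interface:

```python
import hashlib

import blake3  # third-party: pip install blake3

data = b"example bytes"
print(hashlib.md5(data).hexdigest())    # current DVC default
print(blake3.blake3(data).hexdigest())  # much faster, 256-bit digest
```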
It seems to me that @dunstantom's issue is not really addressed by the proposed solution. The problem is that trivial permutations of the data result in different hashes. If I understand it correctly, alternative hash algorithms will not solve this. Only custom hash functions, or possibly wrappers around the built-in hash functions, can. |
I've forked DVC and implemented some changes to allow the user to supply hashes in separate checksum files. If a hash is needed for a file in the dataset, DVC searches for a checksum file with the same name but suffixed with ".md5". If this is found, the hash is retrieved from the checksum file instead of being computed from the dataset file itself. |
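For concreteness, a sketch of what that sidecar layout might look like (file names and digest are illustrative, and this describes the forked behaviour, not stock DVC):

```
data/
├── data.csv         # the dataset file DVC needs a hash for
└── data.csv.md5     # user-supplied checksum file containing, e.g.,
                     # 5d41402abc4b2a76b9719d911017c592
```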
@mbraakhekke Timestamps are a classic problem of reproducibility: e.g. https://wiki.debian.org/ReproducibleBuilds/Howto#Files_in_data.tar_contain_timestamps . Usually the way to solve it is to just strip the timestamps off your files, since if you can use a custom md5, you likely don't care about those stamps at all. Would that be a possible solution for you? Your approach works too, but it is error-prone, as you are pretty much hacking dvc, and dvc won't be able to verify the cache files with the md5 that you provide. We could use it only for dependencies though, as those just need a checksum and not necessarily a full-blown hash. That's why we were discussing alternative hashes, so that you could make dvc aware that you want to use some custom hash, and so that dvc could decide whether or not it can actually treat it as a hash for caching purposes. To my understanding, in the scenario that you've mentioned, stripping the timestamps is the better solution, as it is explicit and doesn't rely on dvc hacks. If this solution is acceptable for you, of course. |
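As an illustration of the timestamp-stripping approach (a generic sketch, not something DVC provides), here is one way to normalize a tar archive before tracking it:

```python
import tarfile

def strip_tar_timestamps(src, dst):
    """Rewrite a tar archive with every member's mtime zeroed, so archives
    built from identical contents yield identical bytes (and identical md5s)."""
    with tarfile.open(src, "r:*") as fin, tarfile.open(dst, "w") as fout:
        for member in fin.getmembers():
            member.mtime = 0
            fileobj = fin.extractfile(member) if member.isfile() else None
            fout.addfile(member, fileobj)

strip_tar_timestamps("results.tar", "results.normalized.tar")
```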
Rephrasing the original problem statement a bit, here's what we are looking for: given a set of binary blobs x_n ∈ X (each x a vector of bytes) and some transformation F : x ↦ y (our "data science function"), what we want is a hash function H : x ↦ h_x such that H(x_m) = H(x_n) whenever F(x_m) = F(x_n). In other words, we want an H whose preimages match the preimages of F. Ignoring adversarial preimages for a moment, for "true" cryptographic hash functions there is only one x_n which satisfies this criterion, since any perturbation of x perturbs H(x); that's the point of crypto hashes. If I understand correctly, this would put our desired functions in the class of locality-sensitive hash functions (LSHF), which are effectively orthogonal to cryptographic hash functions (CHF). md5 is effectively a crypto hash, it's just bad at it (all the more reason not to use it as the default dvc hash). UNF would be an LSHF. The "locality" in this space is domain-specific to the problem at hand, and the given function F varies across projects. For example: does the order of columns in a dataframe matter? If you only use column-name indexing, no. If you use any kind of integer indexing, yes. The "cool" solution would be to allow the .dvc metadata for a given file to specify the hashing algorithm, as a runnable command or python entrypoint. You could specify your own code that would compute the hash any way you pleased. I've been working on a caching/memoization wrapper library and this is exactly what I do. This is great from a UX point of view, and frankly challenging from the Ops/deployment point of view, and likely a nontrivial technical challenge for the DVC team as well. Now you have to somehow inject the python hashing code, which I guess isn't horrible since you already have the dvc metadata structure to specify the path. Environment will be a mess, so I'm just going to assume it'll use the environ DVC itself uses. I can see a route to this solution which involves:
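A rough sketch of what such metadata could look like; the `hash_cmd` field and all values here are hypothetical, not part of any existing or proposed DVC schema:

```yaml
outs:
- path: data/measurements.parquet
  # hypothetical: run this command (or import this entrypoint) to produce
  # the digest DVC records, instead of the built-in md5
  hash_cmd: python -m myproject.hashing data/measurements.parquet
  hash: <digest returned by the custom function>
```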
where
At least, that's how I'd personally do it. |
@xkortex @mbraakhekke @efiop how about this one: #3439 (and we have a PR by @casperdcl that implements it)? Especially the discussion with a specific proposal in the second part of the ticket. It seems to me that that mechanism covers some of the problems discussed here. It might not cover the part related to replacing md5 with a better hash for the sake of performance, cryptographic strength, etc. |
yes, #4363 is still a work in progress; I will probably find time to continue in a few weeks unless someone wants to take it over. |
Late reply. Nice to see everyone's input!
@efiop: Yes, the time stamps are not relevant.
OK, this would be one of those unforeseen consequences I mentioned. Yes, I see that for verification purposes this would cause problems. Indeed, my main motivation is avoiding unnecessary reruns of pipeline stages. So you're saying that the custom hash would complement the standard hash, rather than replace it?
I see your point. However, this is easier said than done, particularly for binary files. Some examples from a project I'm currently working on:
For the first two I could switch to other formats. However, in general this is not always possible, since some systems need to be able to work with certain formats. Regarding the last point (binaries), this is trickier. I guess it is possible to patch a binary to remove timestamps, but I've found there are also other sources of seemingly random variation. I want to convince my team to start using DVC, but most of them have little experience with this sort of thing and little time/patience to figure it out. I really like DVC, but I think this could be an obstacle to widespread adoption, particularly for Windows users. |
@shcheklein, this discussion is quite long. Could you tell me with which post the relevant part begins 😉? |
@mbraakhekke sure, I guess the last comment summarizes it well enough - #3439 (comment). The idea is that for every dependency and/or output in DVC files we could specify a custom script, command, whatever to execute on that dependency/output before we calculate the hash. |
I'd like to echo some of these concerns and add a complementary perspective on why user-defined hashes might be valuable for pipeline stages. In addition to having value for custom analysis of data dependencies, a UDF for hashing also introduces the potential for more fine-grained analysis of code dependencies. The motivating example for me, coming from the Clojure world, is the fact that changes to a source file don't always change a program's tree. Some examples:

**whitespace**

The simplest example: two forms that differ only in their whitespace placement will naturally yield equivalent programs.

```clojure
(map inc (range 25))
;; is the same as
(map inc
     (range 25))
```

Changes like this will trigger a re-run of a pipeline stage, even though no definitions in the program have changed. Granted, this can be avoided by a "format on save" workflow, so it's not as though a solution doesn't currently exist for this particular problem. But there are other considerations:

**regular comments**

Adding a comment to some code after you've run it will also change the hash, a situation likely familiar to many DVC users regardless of their preferred programming language.

```clojure
(mapv dec (range 89 23 -2)) ; a change to this comment changes the hash
```
|
Interesting perspective! I have had similar thoughts but always thought of it a little bit differently. My idea would be to use such tools as a separate processing step which treats the code as data and transforms the human-readable (commented, idiosyncratically formatted, etc.) code into a normalized form (free of comments, normalized formatting, etc.). This normalized code would then be the input to the actual processing step. As long as the normalized code has the same hash digest, the output of the (potentially expensive) calculation can be fetched from cache. The benefit of this approach is that nothing "new" is really introduced into the workflow system: the code normalization step becomes a processing step like any other. What are your thoughts on this idea? |
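A sketch of how that could look as an ordinary DVC pipeline; the `normalize_code.py` script and all file/stage names are hypothetical:

```yaml
stages:
  normalize:
    cmd: python normalize_code.py src/analysis.clj -o build/analysis.normalized.clj
    deps:
      - src/analysis.clj
      - normalize_code.py
    outs:
      - build/analysis.normalized.clj
  analysis:
    cmd: clojure -M build/analysis.normalized.clj
    deps:
      - build/analysis.normalized.clj
    outs:
      - results/output.csv
```

A whitespace or comment change still re-runs the cheap normalize stage, but if the normalized output is byte-identical, the expensive analysis stage is skipped.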
I lean towards thinking this can be solved outside of dvc. The only 'value' I might see in integrating it with dvc is that it would save typing the normalizing code. |
Since md5 is sensitive to the order and format of the data, simple changes to the schema (e.g. swapping two columns) or changing the type of a column (e.g. integer to float) lead to new hash values and duplicated datasets. There are some alternatives that attempt to address this, such as UNF (http://guides.dataverse.org/en/latest/developers/unf/index.html).
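To make the problem concrete, a small illustration (the CSV contents are invented for the example) of how a semantically trivial column swap changes the md5 digest:

```python
import hashlib

# The same two records, with the columns in a different order.
csv_a = b"id,value\n1,3.14\n2,2.72\n"
csv_b = b"value,id\n3.14,1\n2.72,2\n"

print(hashlib.md5(csv_a).hexdigest())  # one digest ...
print(hashlib.md5(csv_b).hexdigest())  # ... and a completely different one
```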
It would be great to specify an alternative hash function in DVC, particularly to be able to provide a user-defined function.