convert-*.py: autogenerate general.uuid if missing #8565
Conversation
@compilade I recall you had an observation about potential issues with autogenerating UUIDs. (Also, tag me on any PR relating to authorship metadata that you said you may want to adjust.)
@mofosyne Yes, there are possible problems. I would like to keep the equivalence of UUIDs across conversions of the same model.
@compilade Well, with the current technique of just hashing all the tensors, that's not quite possible at the moment. Also, source models can now be tracked by the source metadata. I think if people here want the 'uuid' to refer to a semantic model ID, then maybe we could copy the ID over when converting, but generate a new UUID when creating a model from scratch, finetuning, or merging models. One argument against this is that the process of 'down conversion' leads to a difference in responses, so while a converted model is derived from the same single parent, it's still a distinct model in itself. Ultimately, my conceptualization is that eventually we will be able to autogenerate a tree of life, where you could argue the 'down converted' models are the leaves of each model branch. (It would be nice if someone could create a visualization too.)
gguf-py/gguf/gguf_writer.py
Outdated
```python
# Feed every tensor's raw bytes into the UUIDv5 SHA-1 hash.
for name, ti in tensors.items():
    assert ti.tensor is not None
    assert ti.tensor.nbytes == ti.nbytes
    uuidv5_sha1.update(ti.tensor.tobytes('C'))
```
While writing #8645 (comment), I've realized that this specific line is materializing the lazy tensors by reading their data, which would cause a very noticeable memory regression (making it no better than `--no-lazy`, which is not good RAM-usage-wise), at least when there is no UUID specified (which means, by default). This is because the data read is not immediately written (unlike in `GGUFWriter.write_tensors_to_file`), so this puts all the tensors in memory before even writing the metadata.

This will be more visible with models with `BF16` weights and/or MoE models, because their original tensors are not used as-is (type conversion and/or expert stacking), and so the output tensor list is never `mmap`-ed.
If you can prove there is no memory regression, I'd be happy to dismiss this review.
(otherwise, be quick with `Ctrl`+`C` (at least on Linux) to interrupt the conversion with `SIGINT`, to avoid OOM)
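To illustrate the concern, here's a minimal sketch (not code from this PR) of the streaming alternative: hashing each tensor in the same pass that writes it would keep memory flat, but since `general.uuid` lives in the metadata, which is written before the tensor data, a streamed digest arrives too late to be stored.

```python
import hashlib

# Hypothetical sketch: hash tensor data in the same pass that writes it,
# so each tensor can be freed after use instead of all of them being
# materialized up front. `tensors` is assumed to yield (name, bytes) pairs.
def write_and_hash(out_file, tensors):
    sha1 = hashlib.sha1()
    for name, data in tensors:
        sha1.update(data)      # hash while streaming...
        out_file.write(data)   # ...and write immediately, keeping RAM usage flat
    return sha1.digest()
```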
I think you've got a point about the lazy-loading nature of this script and how this will cause problems, @compilade. Perhaps this is more of an argument to close this PR and figure out a different approach to UUID generation. E.g. maybe we could mandate that, on generation of any new model, a UUIDv7 or UUIDv4 is generated for it. For conversion of models, we would only copy the UUID over (or, if we deem a quantized version to be a new model, it would be a UUIDv5 hash of the source model's UUID; see the sketch below). Unsure what to do if a source model lacks an ID; maybe don't generate one?
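A minimal sketch of that derivation idea, using Python's standard `uuid` module (the namespace below is hypothetical, not something from this PR):

```python
import uuid

# Hypothetical namespace for derived GGUF models (illustrative only).
GGUF_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, 'https://example.com/gguf')

source_uuid = uuid.uuid4()  # stands in for the UUID shipped with the source model

# A quantized/converted child gets a deterministic UUIDv5 derived from its
# parent's UUID, so re-running the same conversion reproduces the same ID.
child_uuid = uuid.uuid5(GGUF_NAMESPACE, str(source_uuid))
print(source_uuid, '->', child_uuid)
```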
@mofosyne Hashing the source tensors could work without making the memory usage too high (because they are `mmap`-ed). The CPU overhead of hashing might make conversion slower, though, since it's all single-threaded, and the I/O operations are blocking (nothing else is done when reading and when writing).
@compilade You mean like `generate_source_tensors_uuid()` in this? (I've set this to draft and added `generate_source_tensors_uuid()` just for illustrative purposes.) For the 'source', I found I can't just straight-up hash PyTorch tensors, but had to convert them into a numpy format first. I've also noticed that setting the output to f32 doesn't produce the same UUID, even for the same source model. Still inclined to call it quits and close this PR unless there is actually a working solution I can think of. It still feels like the best approach is just to tell model makers to generate a random UUID when they create their model and are about to distribute it. (E.g. maybe add a --publish flag to any training program, which would then generate a random UUID for it?)
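For context, a small sketch of that numpy round-trip (illustrative, not the draft's actual code); it also shows why an f32 output can't hash to the same UUID as the original type:

```python
import hashlib
import torch

t = torch.ones(4, dtype=torch.float16)

# PyTorch tensors can't be fed to hashlib directly; go through numpy first.
native = hashlib.sha1(t.numpy().tobytes('C')).hexdigest()

# Upcasting to f32 rewrites every byte, so the digest (and any UUID
# derived from it) no longer matches the original-type hash.
as_f32 = hashlib.sha1(t.to(torch.float32).numpy().tobytes('C')).hexdigest()

assert native != as_f32
```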
@mofosyne Yes, pretty much. This reads the whole set of source tensors twice (so it's slow), but I don't really see a way around that, considering the metadata is written before the tensor data.
This is not a problem; converting to numpy first is expected, since that's the form the conversion works with anyway.
No need to change the type when hashing; casting changes the stored bits and makes it impossible to directly use the original data. It's normal that it doesn't result in the same UUID when converting to f32 vs when keeping the original type, because you're giving different bits to the hashing function.
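One bit-preserving way to read this (a sketch under my own assumptions, not code from the PR): reinterpret the bf16 storage as a same-width integer type instead of upcasting, so the hash still sees the original bits:

```python
import torch

t = torch.ones(4, dtype=torch.bfloat16)

# t.numpy() raises TypeError: numpy has no bfloat16 dtype.
# Viewing the same 2-byte storage as int16 keeps the exact bits,
# so the hash remains a function of the original data (no upcast).
bits = t.view(torch.int16).numpy().tobytes('C')
```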
While this would work, there's the overhead of reading all the tensors twice, which is hard to avoid. Making conversion twice as slow on low-RAM systems isn't desirable. If we can think of a solution around that, this would be more useful.
Ideally, yes, but in practice this would be very hard to enforce, and I'm pretty sure big model makers like Meta and Mistral will totally ignore it, because they're likely using custom training frameworks (and/or special-purpose frameworks).

Alternatively, it might be possible to get hashes of the model files directly from Git, but this would not give the same result across file formats of the same model.
… (except for bf16, which is typecast upward)
@compilade Yeah, I just spotted what you meant. For now, regarding bf16 having no direct numpy mapping, I've just cast it upwards, and it's compiling. At least this should work as long as the safetensors are only in float16, float32, float64, or bfloat16. This also doesn't address the lazy-tensor issue, but I'll be happy to try to apply any changes needed if you've got suggestions.
Closing, as I cannot think of any good justification for this feature due to potential issues with an autogenerated UUID. Best to make it optional.
This PR was split from #7499, as it required more thought before being included.
But basically the idea is: if a UUID is not included, we may want to automatically generate one that is deterministically based on the tensor data (so that regenerating the file will give the same hash).
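For illustration, a minimal sketch of that idea (the `uuid_from_tensor_bytes` helper and the `chunks` iterable are hypothetical, not this PR's code), folding a SHA-1 over tensor bytes into a well-formed UUIDv5:

```python
import hashlib
import uuid

# Hypothetical helper: fold a running SHA-1 over all tensor bytes into an
# RFC 4122 UUID. `chunks` stands in for the per-tensor byte streams.
def uuid_from_tensor_bytes(chunks):
    sha1 = hashlib.sha1()
    for chunk in chunks:
        sha1.update(chunk)
    # uuid.UUID(..., version=5) overwrites the version and variant bits,
    # matching how name-based (v5) UUIDs are formed from a SHA-1 digest.
    return uuid.UUID(bytes=sha1.digest()[:16], version=5)

print(uuid_from_tensor_bytes([b'example tensor data']))
```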
At the moment, when generating a new gguf model file, the script prints an extra console line with the generated UUID. Then, when you dump the gguf, you can see it's in the KV store as `general.uuid`.
Just note that this method won't detect models that are semantically the same but quantized differently.