
convert-*.py: autogenerate general.uuid if missing #8565

Closed

Conversation

mofosyne
Collaborator

This PR was split from #7499, as it required more thought before being included.

But basically the idea is that if a UUID is not included, we may want to automatically generate one deterministically from the tensor data (so that regenerating the file will give the same UUID).

At the moment, when generating a new GGUF model file, it will add this extra console line:

INFO:hf-to-gguf:generating general.uuid     86b1ebff-d754-50fc-9245-d23fe329817c

Then, when you dump the GGUF, you can see it's in the KV store:

     13: STRING     |        1 | general.uuid = '86b1ebff-d754-50fc-9245-d23fe329817c'

Just note that this method won't detect models that are semantically the same but quantized differently.
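
For illustration, here is a minimal sketch of that idea, assuming the tensors are available as NumPy arrays; the namespace UUID below is a placeholder (the converter's actual namespace constant is not shown here), and the bit-twiddling follows the RFC 4122 UUIDv5 construction:

    import hashlib
    import uuid

    import numpy as np

    # Placeholder namespace; a real implementation would use a fixed, project-specific namespace UUID.
    UUID_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")

    def deterministic_model_uuid(tensors: dict[str, np.ndarray]) -> uuid.UUID:
        """Derive a UUIDv5-style identifier from the raw tensor bytes."""
        sha1 = hashlib.sha1(UUID_NAMESPACE.bytes)
        for name in sorted(tensors):                  # fixed order keeps the result reproducible
            sha1.update(tensors[name].tobytes('C'))   # hash the raw bytes in C (row-major) order
        digest = bytearray(sha1.digest()[:16])
        digest[6] = (digest[6] & 0x0F) | 0x50         # set the version field to 5
        digest[8] = (digest[8] & 0x3F) | 0x80         # set the RFC 4122 variant bits
        return uuid.UUID(bytes=bytes(digest))

Because the hash is taken over the converted tensor bytes, two different quantizations of the same model produce different UUIDs, which is the limitation noted above.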

@mofosyne mofosyne added the "Review Complexity : Low" label Jul 18, 2024
@github-actions github-actions bot added the "python" label Jul 18, 2024
@mofosyne mofosyne requested a review from compilade July 18, 2024 20:49
@mofosyne
Collaborator Author

mofosyne commented Jul 18, 2024

@compilade I recall you had an observation about potential issues with autogenerating UUIDs. (Also, add me to any PR relating to authorship metadata, which you said you may want to adjust as well.)

@compilade
Collaborator

@compilade I recall you had an observation about potential issues with autogenerating uuids

@mofosyne Yes, there are possible problems.

  • Should the UUID of a model be the same if it's converted to f32, f16, bf16, or q8_0?
    • Why or why not?
  • Should llama-quantize keep the UUID intact? (I think that yes)
    • Is this consistent with the previous point?

I would like to keep the equivalence of convert_hf_to_gguf.py --outtype q8_0 with llama-quantize model-f32.gguf model-q8_0.gguf q8_0.

@mofosyne
Collaborator Author

mofosyne commented Jul 19, 2024

@compilade Well, with the current technique of just hashing all the tensors... that's not quite possible at the moment.

Also, source models can now be tracked by general.source.uuid, so this might not be an issue anymore.

I think if people here want the 'uuid' to refer to a semantic model ID... then maybe we could copy the ID when converting? But if generating a new model from scratch, fine-tuning, or merging models, then generate a new UUID.

Anyway, one argument against this is that the process of 'down conversion' will lead to a difference in responses, so while it is derived from the same single parent as a converted model... it's still a distinct model in itself.

Ultimately, my conceptualization with this... is that eventually we will be able to autogenerate a tree of life. And you could argue that the 'down converted' models are the leaves of each model branch? (It would be nice if someone could create a visualization too.)

@mofosyne mofosyne added the "need feedback" label Jul 19, 2024
for name, ti in tensors.items():
    assert ti.tensor is not None
    assert ti.tensor.nbytes == ti.nbytes
    uuidv5_sha1.update(ti.tensor.tobytes('C'))
Collaborator

@compilade compilade Jul 23, 2024


While writing #8645 (comment), I've realized that this specific line is materializing the lazy tensors by reading their data, which would cause a very noticeable memory regression (making it no better than --no-lazy, which is not good RAM-usage-wise), at least when there is no UUID specified (i.e., by default). This is because the data read is not immediately written (unlike in GGUFWriter.write_tensors_to_file), so this puts all the tensors in memory before the metadata is even written.

This will be more visible with models with BF16 weights and/or MoE models, because their original tensors are not used as-is (type conversion and/or expert stacking) and so the output tensor list is never mmap-ed.

If you can prove there is no memory regression, I'd be happy to dismiss this review.

(otherwise, be quick with Ctrl+C (at least on Linux) to interrupt the conversion with SIGINT, to avoid OOM)

@mofosyne
Collaborator Author

I think you've got a point about the lazy-loading nature of this script and how this will cause problems, @compilade.

Perhaps this is more of an argument to close this PR and figure out a different approach to UUID generation.

E.g. maybe we could mandate that, on generation of any new model, a UUIDv7 or UUIDv4 is generated for it. But for conversion of models, we would only copy the UUID (or, if we deem a quantized version to be a new model, it would be a UUIDv5 hash of the source model's UUID). Unsure what to do if the source model lacks an ID; maybe don't generate one?
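
As a rough sketch of that policy (illustrative names only; note the standard uuid module has no uuid7(), so uuid4() stands in for the random case here):

    import uuid
    from typing import Optional

    def assign_uuid(source_uuid: Optional[uuid.UUID], is_new_model: bool,
                    derivation: str = "") -> Optional[uuid.UUID]:
        """Illustrative policy: random UUID for brand-new models, copied or derived UUID otherwise."""
        if is_new_model:
            # Newly trained / fine-tuned / merged model: mint a fresh random identifier.
            return uuid.uuid4()
        if source_uuid is None:
            # Source model has no ID: don't invent one.
            return None
        # Pure conversion: either copy the source UUID unchanged, or treat the output as a
        # new model derived from it, e.g. a UUIDv5 keyed on the quantization type.
        return uuid.uuid5(source_uuid, derivation) if derivation else source_uuid

For example, assign_uuid(src_uuid, False, "q8_0") would tag a quantized copy, while assign_uuid(None, True) would mint an ID for a freshly trained model.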

@compilade
Collaborator

@mofosyne Hashing the source tensors could work without making the memory usage too high (because they are mmap-ed), and it would also solve the other equivalence problems, since the semantics of the UUID would be about where the model came from, so llama-quantize can leave it unchanged.

The CPU overhead of hashing might make conversion slower, though, since it's all single-threaded, and the I/O operations are blocking (nothing else is done while reading and while writing).
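
A sketch of that direction, assuming the converter can iterate over the source tensors as (name, torch.Tensor) pairs (the exact wiring into convert_hf_to_gguf.py is an assumption here, not the PR's code):

    import hashlib
    import uuid

    import torch

    def source_tensors_uuid(tensors) -> uuid.UUID:
        """Hash the source tensors (as loaded, before any conversion) into a UUIDv5-style ID.

        `tensors` is an iterable of (name, torch.Tensor) pairs. Since data_torch.numpy()
        shares memory with the (possibly mmap-ed) source tensor, no extra copies are made,
        and llama-quantize could carry the resulting value over unchanged.
        """
        sha1 = hashlib.sha1()
        for name, data_torch in sorted(tensors, key=lambda nt: nt[0]):
            # Note: bfloat16 has no NumPy equivalent and needs special handling (see below).
            sha1.update(data_torch.numpy().tobytes('C'))
        digest = bytearray(sha1.digest()[:16])
        digest[6] = (digest[6] & 0x0F) | 0x50   # UUID version 5
        digest[8] = (digest[8] & 0x3F) | 0x80   # RFC 4122 variant
        return uuid.UUID(bytes=bytes(digest))

This is still a second full pass over the tensor data, single-threaded, on top of the pass that writes the output file.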

@mofosyne mofosyne marked this pull request as draft July 26, 2024 16:26
@mofosyne
Collaborator Author

mofosyne commented Jul 26, 2024

@compilade you mean like generate_source_tensors_uuid() in this? (I've set this to draft and added generate_source_tensors_uuid() just for illustrative purposes.)

For the 'source', I found I can't just hash the PyTorch tensors directly, but had to convert them into a NumPy format first.
I've added a 64-bit type, to at least capture any larger PyTorch values (unless it makes sense to stick to f32 or f16).

I've noticed that setting the output to f32 doesn't give the same UUID, even when I set data_torch.to(torch.float32).squeeze().numpy() in generate_source_tensors_uuid(), so I'm unsure what's going on here.

Still inclined to call it quits and close this PR unless there is actually a working solution I can think of. It still feels like the best approach is just to tell model makers to generate a random UUID when they create their model and are about to distribute it (e.g. maybe add a --publish flag for any training program, which would then generate a random UUID for it?).

@compilade
Collaborator

you mean like generate_source_tensors_uuid() in this?

@mofosyne Yes, pretty much. This reads all the source tensors twice (so it's slow), but I don't really see a way around that, considering the metadata is written before the tensor data.

For the 'source', I found I can't just straight up hash pytorch but had to convert it into a numpy format first.

This is not a problem, because using data_torch.numpy() shares the same memory, even if the source is mmap-ed.

I've noticed that setting the output to f32 doesn't give the same UUID, even when I set data_torch.to(torch.float32).squeeze().numpy() in generate_source_tensors_uuid(), so I'm unsure what's going on here.

No need to change the type when hashing; doing so makes it impossible to directly use mmap-ed tensors. But since the tensor objects are not kept afterwards, either way could still work without using too much memory. (Also, squeeze doesn't affect the data, only the shape, so it's not necessary here.)
I also think ignoring tensors is not necessary for the purpose of hashing the source.

It's normal that converting to f32 doesn't result in the same UUID as keeping the original type, because you're giving different bits to the hashing function.

Still inclined to call it quits and close this PR unless there is actually a working solution I can think of.

While this would work, there's the overhead of reading all the tensors twice, which is hard to avoid. Making conversion twice as slow on low-RAM systems isn't desirable. If we can think of a solution around that, this would be more useful.

Still feels like the best approach is just to tell model makers to generate a random UUID when they create their model and are about to distribute it (e.g. maybe add a --publish flag for any training program, which would then generate a random UUID for it?)

Ideally, yes, but in practice this would be very hard to enforce, and I'm pretty sure the big model makers like Meta and Mistral will totally ignore it because they're likely using custom training frameworks. (And/or using special-purpose frameworks like states-spaces/mamba to train mamba-codestral-7B-v0.1, which (at least on release) doesn't even have the standard config.json.)

Alternatively, it might be possible to get hashes of the model files directly from Git, but this would not give the same result for pytorch_model.bin vs model.safetensors of the same model, unlike when hashing the actual tensors.

@mofosyne mofosyne requested a review from compilade July 27, 2024 03:05
convert_hf_to_gguf.py (review comment — outdated, resolved)
@mofosyne
Collaborator Author

mofosyne commented Jul 27, 2024

@compilade yeah, I just spotted what you meant. For now, regarding bf16 having no direct mapping, I've just cast it upwards. It's compiling. At least this should work as long as the safetensors are only in float16, float32, float64, or bfloat16.

Also, this doesn't address the lazy tensor issue... but I'll be happy to try and apply any changes needed if you've got suggestions.
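
For what it's worth, a guess at what that upward cast looks like (a sketch only, not the PR's actual helper):

    import numpy as np
    import torch

    def tensor_to_numpy_for_hashing(data_torch: torch.Tensor) -> np.ndarray:
        """Return a NumPy view of the tensor's data for hashing, upcasting only when
        NumPy has no matching dtype (bfloat16)."""
        if data_torch.dtype == torch.bfloat16:
            # No direct NumPy mapping: cast upwards to float32. This copies the data and
            # changes the bytes, so the resulting hash differs from one taken over the
            # untouched bf16 tensors.
            return data_torch.to(torch.float32).numpy()
        # float16 / float32 / float64 map directly; .numpy() shares memory with the tensor.
        return data_torch.numpy()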

@mofosyne mofosyne requested a review from compilade July 28, 2024 09:01
@mofosyne mofosyne closed this Nov 9, 2024
@mofosyne
Collaborator Author

mofosyne commented Nov 9, 2024

Closing, as I cannot think of any good justification for this feature due to the potential issues with an autogenerated UUID. Best to make it optional.
