
convert-*.py: autogenerate general.uuid if missing #8565

Closed

Conversation

mofosyne
Collaborator

This PR was split from #7499, as it required more thought before being included.

But basically the idea is that if a UUID is not included, we may want to automatically generate one deterministically from the tensor data (so that regenerating the file will give the same UUID).

At the moment, when generating a new GGUF model file, it will add this extra console line:

INFO:hf-to-gguf:generating general.uuid     86b1ebff-d754-50fc-9245-d23fe329817c

Then, when you dump the GGUF, you can see it's in the KV store:

     13: STRING     |        1 | general.uuid = '86b1ebff-d754-50fc-9245-d23fe329817c'

Just note that this method won't detect models that are semantically the same but quantized differently.
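
For illustration, here is a minimal sketch of that idea, assuming the tensors are available as NumPy arrays; the namespace UUID below is a placeholder (the converter's actual namespace constant is not shown here), and the bit-twiddling follows the RFC 4122 UUIDv5 construction:

    import hashlib
    import uuid

    import numpy as np

    # Placeholder namespace; a real implementation would use a fixed, project-specific namespace UUID.
    UUID_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")

    def deterministic_model_uuid(tensors: dict[str, np.ndarray]) -> uuid.UUID:
        """Derive a UUIDv5-style identifier from the raw tensor bytes."""
        sha1 = hashlib.sha1(UUID_NAMESPACE.bytes)
        for name in sorted(tensors):                  # fixed order keeps the result reproducible
            sha1.update(tensors[name].tobytes('C'))   # hash the raw bytes in C (row-major) order
        digest = bytearray(sha1.digest()[:16])
        digest[6] = (digest[6] & 0x0F) | 0x50         # set the version field to 5
        digest[8] = (digest[8] & 0x3F) | 0x80         # set the RFC 4122 variant bits
        return uuid.UUID(bytes=bytes(digest))

Because the hash is taken over the converted tensor bytes, two different quantizations of the same model produce different UUIDs, which is the limitation noted above.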

@mofosyne mofosyne added the "Review Complexity : Low" label Jul 18, 2024
@github-actions github-actions bot added the "python" label Jul 18, 2024
@mofosyne mofosyne requested a review from compilade July 18, 2024 20:49
@mofosyne
Collaborator Author

mofosyne commented Jul 18, 2024

@compilade I recall you had an observation about potential issues with autogenerating UUIDs. (Also, add me to any PR relating to authorship metadata, which you said you may want to adjust as well.)

@compilade
Collaborator

@compilade I recall you had an observation about potential issues with autogenerating uuids

@mofosyne Yes, there are possible problems.

  • Should the UUID of a model be the same if it's converted to f32, f16, bf16, or q8_0?
    • Why or why not?
  • Should llama-quantize keep the UUID intact? (I think that yes)
    • Is this consistent with the previous point?

I would like to keep the equivalence of convert_hf_to_gguf.py --outtype q8_0 with llama-quantize model-f32.gguf model-q8_0.gguf q8_0.

@mofosyne
Collaborator Author

mofosyne commented Jul 19, 2024

@compilade Well, with the current technique of just hashing all the tensors... that's not quite possible at the moment.

Also, source models can now be tracked by general.source.uuid, so this might not be an issue anymore.

I think if people here want the 'uuid' to refer to a semantic model ID... then maybe we could copy the ID when converting? But if generating a new model from scratch, fine-tuning, or merging models, then generate a new UUID.

Anyway, one argument against this is that the process of 'down conversion' will lead to a difference in responses, so while it is derived from the same single parent as a converted model... it's still a distinct model in itself.

Ultimately, my conceptualization with this... is that eventually we will be able to autogenerate a tree of life. And you could argue that the 'down converted' models are the leaves of each model branch? (It would be nice if someone could create a visualization too.)

@mofosyne mofosyne added the "need feedback" label Jul 19, 2024
for name, ti in tensors.items():
    assert ti.tensor is not None
    assert ti.tensor.nbytes == ti.nbytes
    uuidv5_sha1.update(ti.tensor.tobytes('C'))
Collaborator

@compilade compilade Jul 23, 2024


While writing #8645 (comment), I've realized that this specific line is materializing the lazy tensors by reading their data, which would cause a very noticeable memory regression (making it no better than --no-lazy, which is not good RAM-usage-wise), at least when there is no UUID specified (i.e., by default). This is because the data read is not immediately written (unlike in GGUFWriter.write_tensors_to_file), so this puts all the tensors in memory before the metadata is even written.

This will be more visible with models with BF16 weights and/or MoE models, because their original tensors are not used as-is (type conversion and/or expert stacking) and so the output tensor list is never mmap-ed.

If you can prove there is no memory regression, I'd be happy to dismiss this review.

(otherwise, be quick with Ctrl+C (at least on Linux) to interrupt the conversion with SIGINT, to avoid OOM)

@mofosyne
Collaborator Author

I think you've got a point about the lazy-loading nature of this script and how this will cause problems, @compilade.

Perhaps this is more of an argument to close this PR and figure out a different approach to UUID generation.

E.g. maybe we could mandate that, on generation of any new model, a UUIDv7 or UUIDv4 is generated for it. But for conversion of models, we would only copy the UUID (or, if we deem a quantized version to be a new model, it would be a UUIDv5 hash of the source model's UUID). Unsure what to do if the source model lacks an ID; maybe don't generate one?
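
As a rough sketch of that policy (illustrative names only; note the standard uuid module has no uuid7(), so uuid4() stands in for the random case here):

    import uuid
    from typing import Optional

    def assign_uuid(source_uuid: Optional[uuid.UUID], is_new_model: bool,
                    derivation: str = "") -> Optional[uuid.UUID]:
        """Illustrative policy: random UUID for brand-new models, copied or derived UUID otherwise."""
        if is_new_model:
            # Newly trained / fine-tuned / merged model: mint a fresh random identifier.
            return uuid.uuid4()
        if source_uuid is None:
            # Source model has no ID: don't invent one.
            return None
        # Pure conversion: either copy the source UUID unchanged, or treat the output as a
        # new model derived from it, e.g. a UUIDv5 keyed on the quantization type.
        return uuid.uuid5(source_uuid, derivation) if derivation else source_uuid

For example, assign_uuid(src_uuid, False, "q8_0") would tag a quantized copy, while assign_uuid(None, True) would mint an ID for a freshly trained model.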

@compilade
Collaborator

@mofosyne Hashing the source tensors could work without making the memory usage too high (because they are mmap-ed), and it would also solve the other equivalence problems, since the semantics of the UUID would be about where the model came from, so llama-quantize can leave it unchanged.

The CPU overhead of hashing might make conversion slower, though, since it's all single-threaded, and the I/O operations are blocking (nothing else is done while reading and while writing).
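
A sketch of that direction, assuming the converter can iterate over the source tensors as (name, torch.Tensor) pairs (the exact wiring into convert_hf_to_gguf.py is an assumption here, not the PR's code):

    import hashlib
    import uuid

    import torch

    def source_tensors_uuid(tensors) -> uuid.UUID:
        """Hash the source tensors (as loaded, before any conversion) into a UUIDv5-style ID.

        `tensors` is an iterable of (name, torch.Tensor) pairs. Since data_torch.numpy()
        shares memory with the (possibly mmap-ed) source tensor, no extra copies are made,
        and llama-quantize could carry the resulting value over unchanged.
        """
        sha1 = hashlib.sha1()
        for name, data_torch in sorted(tensors, key=lambda nt: nt[0]):
            # Note: bfloat16 has no NumPy equivalent and needs special handling (see below).
            sha1.update(data_torch.numpy().tobytes('C'))
        digest = bytearray(sha1.digest()[:16])
        digest[6] = (digest[6] & 0x0F) | 0x50   # UUID version 5
        digest[8] = (digest[8] & 0x3F) | 0x80   # RFC 4122 variant
        return uuid.UUID(bytes=bytes(digest))

This is still a second full pass over the tensor data, single-threaded, on top of the pass that writes the output file.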

@mofosyne mofosyne marked this pull request as draft July 26, 2024 16:26
@mofosyne
Collaborator Author

mofosyne commented Jul 26, 2024

@compilade you mean like generate_source_tensors_uuid() in this? (I've set this to draft and added generate_source_tensors_uuid() just for illustrative purposes.)

For the 'source', I found I can't just hash the PyTorch tensors directly, but had to convert them into a NumPy format first.
I've added a 64-bit type, to at least capture any larger PyTorch values (unless it makes sense to stick to f32 or f16).

I've noticed that setting the output to f32 doesn't give the same UUID, even when I set data_torch.to(torch.float32).squeeze().numpy() in generate_source_tensors_uuid(), so I'm unsure what's going on here.

Still inclined to call it quits and close this PR unless there is actually a working solution I can think of. It still feels like the best approach is just to tell model makers to generate a random UUID when they create their model and are about to distribute it (e.g. maybe add a --publish flag for any training program, which would then generate a random UUID for it?).

@compilade
Collaborator

you mean like generate_source_tensors_uuid() in this?

@mofosyne Yes, pretty much. This reads all the source tensors twice (so it's slow), but I don't really see a way around that, considering the metadata is written before the tensor data.

For the 'source', I found I can't just straight up hash pytorch but had to convert it into a numpy format first.

This is not a problem, because using data_torch.numpy() shares the same memory, even if the source is mmap-ed.

I've noticed that setting the output to f32 doesn't give the same UUID, even when I set data_torch.to(torch.float32).squeeze().numpy() in generate_source_tensors_uuid(), so I'm unsure what's going on here.

No need to change the type when hashing; doing so makes it impossible to directly use mmap-ed tensors. But since the tensor objects are not kept afterwards, either way could still work without using too much memory. (Also, squeeze doesn't affect the data, only the shape, so it's not necessary here.)
I also think ignoring tensors is not necessary for the purpose of hashing the source.

It's normal that converting to f32 doesn't result in the same UUID as keeping the original type, because you're giving different bits to the hashing function.

Still inclined to call it quits and close this PR unless there is actually a working solution I can think of.

While this would work, there's the overhead of reading all the tensors twice, which is hard to avoid. Making conversion twice as slow on low-RAM systems isn't desirable. If we can think of a solution around that, this would be more useful.

Still feels like the best approach is just to tell model makers to generate a random UUID when they create their model and are about to distribute it (e.g. maybe add a --publish flag for any training program, which would then generate a random UUID for it?)

Ideally, yes, but in practice this would be very hard to enforce, and I'm pretty sure the big model makers like Meta and Mistral will totally ignore it because they're likely using custom training frameworks. (And/or using special-purpose frameworks like states-spaces/mamba to train mamba-codestral-7B-v0.1, which (at least on release) doesn't even have the standard config.json.)

Alternatively, it might be possible to get hashes of the model files directly from Git, but this would not give the same result for pytorch_model.bin vs model.safetensors of the same model, unlike when hashing the actual tensors.

@mofosyne mofosyne requested a review from compilade July 27, 2024 03:05
convert_hf_to_gguf.py (review comment — outdated, resolved)
@mofosyne
Collaborator Author

mofosyne commented Jul 27, 2024

@compilade yeah, I just spotted what you meant. For now, regarding bf16 having no direct mapping, I've just cast it upwards. It's compiling. At least this should work as long as the safetensors are only in float16, float32, float64, or bfloat16.

Also, this doesn't address the lazy tensor issue... but I'll be happy to try and apply any changes needed if you've got suggestions.
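
For what it's worth, a guess at what that upward cast looks like (a sketch only, not the PR's actual helper):

    import numpy as np
    import torch

    def tensor_to_numpy_for_hashing(data_torch: torch.Tensor) -> np.ndarray:
        """Return a NumPy view of the tensor's data for hashing, upcasting only when
        NumPy has no matching dtype (bfloat16)."""
        if data_torch.dtype == torch.bfloat16:
            # No direct NumPy mapping: cast upwards to float32. This copies the data and
            # changes the bytes, so the resulting hash differs from one taken over the
            # untouched bf16 tensors.
            return data_torch.to(torch.float32).numpy()
        # float16 / float32 / float64 map directly; .numpy() shares memory with the tensor.
        return data_torch.numpy()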

@mofosyne mofosyne requested a review from compilade July 28, 2024 09:01
@mofosyne mofosyne closed this Nov 9, 2024
@mofosyne
Collaborator Author

mofosyne commented Nov 9, 2024

Closing, as I cannot think of any good justification for this feature due to the potential issues with an autogenerated UUID. Best to make it optional.
