convert-hf : reduce repeated boilerplate from write_tensors #7031

Closed
wants to merge 19 commits from compilade/convert-hf-refactor

Conversation


@compilade commented May 1, 2024

A lot of the lines in convert-hf-to-gguf.py are repeated from Model.write_tensors.
With some refactoring, it is possible to remove 600+ lines while mostly keeping the original behavior.

Important

Continued at #7075. The testing section below is still valid for both PRs.

Changes Summary

  • Fix (almost) all type errors reported by Pyright
    • I've also included an appropriate pyrightconfig.json to let it load the gguf library correctly
  • upgrade SentencePiece requirement to ~=0.2.0 from ~=0.1.98
  • rename all encoding="utf8" to encoding="utf-8", for consistency, even if they are the same in Python.
  • make Model a plain class instead of an ABC (Abstract Base Class); an earlier revision of this PR made it a Protocol, which didn't work out (see the discussion below)
    • no abstract methods are used, so using ABC doesn't seem necessary anymore
    • but forbid direct instantiation in Model.__init__
    • allow setting the model_arch field in subclasses exactly as it's already done, but without making pyright output type errors about Literal[gguf.MODEL_ARCH.SOMETHING] not compatible with type property
      • an error is now thrown when forgetting to set the model_arch field, by checking for this in __init_subclass__ (so it's checked even for unused subclasses of Model)
  • biggest change: add a modify_tensors callback which is used in the model classes instead of overriding write_tensors (see the sketch after this list)
    • called for each tensor of the model
    • allows filtering, renaming, removing, modifying, and adding new tensors
    • all modifications are done on data_torch, which should make lazy computation easier to implement in the future (instead of also having to wrap complex numpy operations)
      • MoE weight stacking has been simplified
  • when f16 output is chosen, it's possible to selectively override the type of output tensors by overriding extra_f32_tensors and/or extra_f16_tensors in the architecture-specific model classes
  • allow unusual model part names
    • These unusual names were previously included in the part count, but not in the part names
    • This allows converting (for example) https://huggingface.co/jtatman/TinyMistral-248m-v2.5-4x-Moe which has a single model-00001-of-00001.safetensors part (previously, single-part models were expected to name their model file model.safetensors)
  • support conversion of Mamba models with a SentencePiece tokenizer
  • add ps.einops to llama-python-extra Nix packages
    • was previously missing, and is used for InternLM2Model in convert-hf-to-gguf.py
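To make the modify_tensors bullet above more concrete, here is a minimal sketch of how an architecture-specific class might use it. The exact signature, the bid argument, and helper names like Model.register and map_tensor_name are assumptions based on the current script and may not match this PR line-for-line.

# Hedged sketch only -- not the exact code from this PR.
# Meant to live inside convert-hf-to-gguf.py, where Model is defined.
import gguf  # from gguf-py

@Model.register("SomeForCausalLM")      # hypothetical architecture name
class SomeModel(Model):
    model_arch = gguf.MODEL_ARCH.LLAMA  # placeholder arch for the sketch

    def modify_tensors(self, data_torch, name, bid):
        # Returning an empty list drops the tensor entirely.
        if name.endswith(".rotary_emb.inv_freq"):
            return []
        # Otherwise rename (and optionally transform) the tensor;
        # the default behaviour is a single (new_name, tensor) pair.
        return [(self.map_tensor_name(name), data_torch)]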

Testing

To make sure I didn't break previous work, I need help with testing conversion of existing architectures; I'm sure some of you already have the original model files downloaded, so testing conversion would be greatly appreciated.

$ python3 convert-hf-to-gguf.py --outtype f16 --outfile ./path/to/output.f16.gguf ./path/to/original/model/

If your username is mentioned below, it's because I think you might have the relevant model nearby, and/or I'd like you to check whether or not I broke your previous work, because I did modify quite a bit of model-specific conversion code.

It would be nice to automate such testing in the future, though I'm not sure how feasible it is. It could require finding the smallest model for each architecture, though unfortunately some architectures have only been used for very large models.

Also, @mofosyne, note that this will cause (some) conflicts with #6511, though they shouldn't be too hard to resolve (since most of them will be caused by removed code). I would be fine with your PR getting merged first to handle conflicts here, though I don't know if you'd prefer to do it on your side instead.

Also pinging @cebtenzzre because of #5825 which was another effort at simplifying convert-hf-to-gguf.py, so you should be familiar with the code.

Out of scope

Anything related to BPE pre-tokenizer errors was not caused by this PR and won't be fixed here to reduce potential conflicts.

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing.

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.
`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".
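The sorting mentioned above amounts to something like this sketch (the helper name and filtering are assumptions, not the exact code from the commit):

import os

def get_part_names(dir_model: str, suffix: str = ".safetensors") -> list[str]:
    # os.listdir returns files in arbitrary order, so sort the names to keep
    # multi-part models (model-00009-of-00042.safetensors, ...) in order.
    return sorted(f for f in os.listdir(dir_model) if f.endswith(suffix))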
@compilade added the "refactoring" and "need feedback" labels May 1, 2024
@@ -101,7 +101,7 @@ def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new

     for tensor in reader.tensors:
         # Dimensions are written in reverse order, so flip them first
-        shape = np.flipud(tensor.shape)
+        shape = np.flipud(tensor.shape).tolist()
Contributor

Any particular reason for coercing shape to list here (and not elsewhere)?

Collaborator Author

@compilade May 2, 2024

It's passed to GGUFWriter.add_tensor_info, which expects a Sequence[int] for the shape, and this shape is of type NDArray[uint32] which caused the error:

Argument of type "NDArray[uint32]" cannot be assigned to parameter "tensor_shape" of type "Sequence[int]" in function "add_tensor_info"
"NDArray[uint32]" is incompatible with "Sequence[int]"
(reportGeneralTypeIssues)

This comes from the shape field of a ReaderTensor, and it is coerced to list in other places, like in gguf-dump.py:

"shape": tensor.shape.tolist(),

prettydims = ', '.join('{0:5}'.format(d) for d in list(tensor.shape) + [1] * (4 - len(tensor.shape)))

But the shape field of ReaderTensor is only used in 7 places (including its definition). In other places, the "shape" usually comes directly from either Numpy or PyTorch, which use tuple[int, ...] for the shape type, which is compatible with Sequence[int].
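For illustration only (a sketch; only the first line is what the change above does), any Sequence[int]-compatible coercion would satisfy Pyright here:

shape = np.flipud(tensor.shape).tolist()          # the change above: list[int]
shape = [int(d) for d in reversed(tensor.shape)]  # an equivalent alternative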

Contributor

I see; although GGUFWriter.add_tensor_info's typing is then perhaps not correct, I understand why it's simpler to just make it a list.

Collaborator Author

GGUFWriter.add_tensor_info's typing seems correct to me; it's used in 2 other places, and both use shapes which are already compatible with Sequence[int].

shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)

self.gguf.add_tensor_info(name, tensor.shape, data_type, data_nbytes, raw_dtype=raw_dtype)

So using Sequence[int] there seems appropriate, as it's the most general type (I think?) which can be indexed (it avoids having to cast tuple[int, ...] into list[int], or list[int] into tuple[int, ...]).
This is how the shape is used in add_tensor_info:

for i in range(n_dims):
    self.ti_data += self._pack("Q", tensor_shape[n_dims - 1 - i])

Contributor

@CISC May 2, 2024

Sure, all I'm saying is that that also works with NDArray[uint32] (even though it's not compatible with Sequence[int]).


jploski commented May 2, 2024

An attempt to convert MPT fails early with an unhelpful error message, as shown below:

$ python3 convert-hf-to-gguf.py --outfile /mnt/f2fs/mpt/mpt7b.gguf --outtype f16 /mnt/seagate/dalai/mpt/models/mpt-7b
Loading model: mpt-7b
Traceback (most recent call last):
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 2306, in <module>
    main()
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 2287, in main
    model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file)
  File "/mnt/seagate/miniconda3/lib/python3.10/typing.py", line 1456, in _no_init_or_replace_init
    cls.__init__(self, *args, **kwargs)
TypeError: MPTModel.__init__() takes exactly one argument (the instance to initialize)


teleprint-me commented May 2, 2024

Could be a bound method being treated as a static method. cls may have no self instance and so raises an error? It's strange because it's obviously receiving the model path as well as other required arguments for the factory.


jploski commented May 2, 2024

Could be a bound method being treated as a static method. cls may have no self instance and so raises an error? It's strange because it's obviously receiving the model path as well as other required arguments for the factory.

Not sure what that means in terms of fixing it "properly".

The error disappears when I add an explicit __init__ method in MPTModel which calls super().__init__ with the same parameters. Then a different error occurs, which seems to be less related to Python and more to the actual conversion at hand:

Traceback (most recent call last):
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 2311, in <module>
    main()
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 2295, in main
    model_instance.set_gguf_parameters()
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 628, in set_gguf_parameters
    block_count = self.hparams["n_layers"]
AttributeError: 'MPTModel' object has no attribute 'hparams'
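For reference, the workaround mentioned above is roughly the following (a sketch, not the exact patch); it avoids the TypeError, but as the traceback shows, the attributes normally set up in Model.__init__ (like self.hparams) apparently still end up missing:

# Sketch inside convert-hf-to-gguf.py, where Model and gguf are available.
class MPTModel(Model):
    model_arch = gguf.MODEL_ARCH.MPT

    # Explicit __init__ that just forwards everything to the base class.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)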

It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.

compilade commented May 2, 2024

I think this error happens because Protocol in Python isn't meant to be used as a statically-checked ABC. Subclasses of Protocol aren't supposed to be instantiated.
It's weird that it worked for me though.

Seems like Model needs to be an ABC instead (or simply an object?), but I still want an error to be thrown when the model_arch property of any subclass of Model is not defined. Maybe using __init_subclass__ would be more appropriate than using the Model.register decorator for this check, and would also make it possible to ensure that this decorator is always used?
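For what it's worth, the __init_subclass__ idea could look roughly like this (a sketch under the assumptions above, not necessarily the final code):

import gguf  # from gguf-py

class Model:
    model_arch: gguf.MODEL_ARCH  # every subclass must set this

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs at class-definition time, so even subclasses that are never
        # registered or instantiated are checked.
        if "model_arch" not in cls.__dict__:
            raise TypeError(f"Missing property 'model_arch' for {cls.__name__!r}")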

…tion

There are no abstract methods used anyway,
so using ABC isn't really necessary.

jploski commented May 2, 2024

I think this error happens because Protocol in Python isn't meant to be used as a statically-checked ABC. Subclasses of Protocol aren't supposed to be instantiated. It's weird that it worked for me though.

Seems like Model needs to be an ABC instead (or simply an object?), but I still want an error to be thrown when the model_arch property of any subclass of Model is not defined. Maybe using __init_subclass__ would be more appropriate than using the Model.register decorator for this check, and would also make it possible to ensure that this decorator is always used?

I can't recommend anything in that regard. But I checked that after your last commit the script works and produces exactly the same GGUF for MPT-7B as the master branch version (and the resulting model works).

 # we don't need these
-if name.endswith((".attention.masked_bias", ".attention.bias", ".attention.rotary_emb.inv_freq")):
+if name.endswith((".attention.masked_bias", ".attention.bias", ".rotary_emb.inv_freq")):
Collaborator

If you want to blacklist tensors on a per-model basis you can use

MODEL_TENSOR_SKIP: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
    MODEL_ARCH.LLAMA: [
        MODEL_TENSOR.ROPE_FREQS,
        MODEL_TENSOR.ATTN_ROT_EMBD,
    ],
    MODEL_ARCH.BAICHUAN: [
        MODEL_TENSOR.ROPE_FREQS,
        MODEL_TENSOR.ATTN_ROT_EMBD,
    ],
    MODEL_ARCH.PERSIMMON: [
        MODEL_TENSOR.ROPE_FREQS,
    ],

Collaborator Author

@compilade May 4, 2024

Nice idea! I didn't know about MODEL_TENSOR_SKIP. Right now it's only used for MODEL_ARCH.LLAMA in convert.py. The exclusion lists for the other architectures were never really tested (e.g. MODEL_ARCH.PLAMO and MODEL_ARCH.GROK both have MODEL_TENSOR.ATTN_ROT_EMBD in MODEL_TENSORS, but they are not in MODEL_TENSOR_SKIP, since they rely on matching the name like you pointed out).

I'm not sure about using this; modify_tensors can already be used to skip tensors, and I think it's cleaner this way for architecture-specific use, while common skips (as it is) belong in the base Model.write_tensors (the line on which this comment thread was started). BertModel has a nice example for an architecture-specific exclusion:

# we are only using BERT for embeddings so we don't need the pooling layer
if name in ("embeddings.position_ids", "pooler.dense.weight", "pooler.dense.bias"):
    return []  # we don't need these

It was already done similarly in write_tensors (I tried to keep most of the model-specific code unchanged when moving it from write_tensors into modify_tensors). Otherwise, tensor types that won't be serialized would still need a serialized name, even if they're never used. BERT's pooling layers don't have an enum, and they probably shouldn't have one if it's only going to be carried around but remain unused, like LLM_TENSOR_ATTN_ROT_EMBD in llama.cpp. (If they do get used, I agree they should be added, but they aren't for now.)

{ LLM_TENSOR_ATTN_ROT_EMBD, "blk.%d.attn_rot_embd" },

So instead, maybe it would be worth considering removing MODEL_TENSOR.ROPE_FREQS and MODEL_TENSOR.ATTN_ROT_EMBD? But this is likely out of the scope of this PR.

@compilade force-pushed the compilade/convert-hf-refactor branch from 66ecf6c to 98f2d0e on May 4, 2024 03:06

CISC commented May 4, 2024

@compilade BTW, I added your typing changes to #6778 (and a progress bar, inspired by your other PR), so will be a simpler rebase for whichever gets merged last. :)

@ggerganov added the "high priority" label May 5, 2024
@compilade
Collaborator Author

(For anyone following only this and not #7075)

#7075 is starting to be more ready, and I'm considering unifying both PRs to make the branches easier to manage. To keep the naming less confusing, I think it would be best to change the base branch of #7075 to master so that it includes the commits from here, and then close this PR as superseded.

However, doing this won't move over the review comments, so I'm wondering if that would still be fine? (@CISC, @Galunid)


CISC commented May 6, 2024

However, doing this won't move over the review comments, so I'm wondering if that would still be fine?

Fine by me, it was more for my benefit rather than this PR anyway. :)


Galunid commented May 6, 2024

However, doing this won't move over the review comments, so I'm wondering if that would still be fine?

Sure, we can do that

@compilade
Collaborator Author

Closing in favor of #7075

@compilade closed this May 6, 2024