Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ggml : unified file format #220

Closed
philpax opened this issue May 31, 2023 · 82 comments · Fixed by #302
Closed

ggml : unified file format #220

philpax opened this issue May 31, 2023 · 82 comments · Fixed by #302
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed refactoring Refactoring

Comments

@philpax
Copy link
Contributor

philpax commented May 31, 2023

Obsoletes #147, #150, ggerganov/llama.cpp#1575, ggerganov/llama.cpp#1590, rustformers/llm#143, and probably some other issues across some other repositories.

Please see the spec PR at #302; the following is left as-is so you can see the original proposal.


Current state of affairs

Overview

At present, there are two GGML file formats floating around for LLMs (and potentially other ggml-using projects, I haven't looked too much at the implementation of whisper):

  • GGML unversioned
  • GGJTv3 (same as v1 and v2, but with different quantization formats), which is similar to GGML but includes a version and aligns the tensors to allow for memory-mapping

Both of these formats share the same fundamental structure:

  • a magic number with an optional version number
  • model-specific hyperparameters that include a ftype that should describe the type of the majority of the tensors, and for GGML files, the quantization version encoded using a modulo in the ftype
  • an embedded vocabulary, which is a list of strings with length prepended. The GGMF/GGJT formats embed a f32 score next to the strings
  • finally, a list of tensors with their length-prepended name, type, and (aligned, in the case of GGJT) tensor data

We have more details on the format here: https://github.com/rustformers/llm/tree/main/crates/ggml#format

Drawbacks

Unfortunately, over the last few months, there are a few issues that have become apparent with the existing models:

  • There's no way to identify which model architecture a given model is for, because that information isn't present
    • Similarly, existing programs cannot intelligently fail upon encountering new architectures
  • Adding or removing any new hyperparameters is a breaking change, which is impossible for a reader to detect without herculean hacks
  • Each model architecture requires its own conversion script to their architecture's variant of GGML
  • Maintaining backwards compatibility without breaking the structure of the format requires clever tricks, like packing the quantization version into the ftype, which are not guaranteed to be picked up by readers/writers, and are not consistent between the two formats

GGJTv4/GGUF

Based on this, I'd like to propose a new format that's designed to be universal and addresses these issues. It is largely identical to GGJTv3, but makes one important difference: the hyperparameters are encoded as an array of key-value pairs that can be read in any order, and these hyperparameters are used to encode additional information about the model. A really important property I'd like to keep is single-file deployment: if I give you a GGUF file and you have a compatible executor, it should Just Work:tm without any additional conversion or extra files.

"Specification"

To quote from ggerganov/llama.cpp#1575 (comment):

Instead of storing the hyperparameters as

n_vocab: i32,
n_ctx: i32,
n_embd: i32,
n_head: i32,
n_layer: i32,
n_rot: i32,
use_parallel_residual: bool,
file_type: i32,

it's instead stored as an array of

key_length: u32,
key: [u8; key_length],
value_type: ValueType,
value: raw binary little-endian representation of value

so that you might have

[
  {
    key_length: 6,
    key: 'n_embd',
    value_type: ValueType::I32,
    value: 2560
  },
  {
    key_length: 11,
    key = 'use_parallel_residual',
    value_type = ValueType::Bool,
    value: true
  },
  ...
]

The brackets are for notational convenience - in practice, they're flatpacked and would come after each other in the binary. The ValueType enum would be standardized (like ggml_type), and so would the ways to represent each type of value.

This would allow for the addition of more parameters, readers to be more resilient to models coming from other sources, etc, because you'd be looking up values by key and trying to read them by binary.

It wouldn't be freeform - the storage medium would be entirely structured, so that any reader could pick up data from it without having to know about the other fields. As time goes on, I imagine this would look like ID3v2, with commonly-used tags being standardized by the community for whatever metadata they want to attach.

The main thing I want to achieve is to a) allow the reading of a GGML file knowing nothing else about it, even if you can't do anything with it and b) allow for community model authors to add useful metadata in a way that won't cause breakage for future readers, while still remaining maximally compatible.

Filling in some of the missing details:

Keys

Keys are ASCII lower_snake_case with dots for separation. Their length is stored before the key. They have a maximal length of 256 (open for debate, just a number I picked that seems like a reasonable upper bound).

This means that:

  • vocabulary.hugging_face is a valid key
  • vocabulary-hugging-face is not
  • Vocabulary.HuggingFace is not
  • vocabulary.hugging-face is not

I'd say we're looking at something like TOML keys without quotation.

Values

Values are one of the following types:

  • U32: little-endian unsigned 32-bit integer
  • I32: little-endian signed 32-bit integer (honestly not sure if this is necessary, I feel like a lot of the existing i32 use has been more just due to the use of int than anything)
  • F32: IEEE754 32-bit floating point number
  • String: UTF-8 string data, length prepended
  • Bytes: Raw binary data with no specific meaning attached, length prepended
  • Boolean: 1-byte value where 0 is false and 1 is true. Anything else is invalid. I considered making anything other than 0 true, but being strict on this will help detect misbehaving writers.

Standardized key-value pairs

This list is incomplete. Feel free to suggest additions. Where possible, I've tried to use the original names from the models to remove a layer of semantic confusion.

This is just from a quick appraisal of the models that llm supports. There are likely other fields that we can standardise ahead of time by looking at the HuggingFace config.

General

  • general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc. (List more if you can think of them, and they're not just variants of existing architectures!)
  • general.quantization_version: u32: version of quantization scheme
  • general.file_type: String: type of the majority of the tensors in the file. This shouldn't have any semantic meaning and should be purely informational, hence the use of String.
  • general.license: String: SPDX license of the model
  • general.description: String: information about the model, including provenance
  • general.original_model_url: String: path to the original model that this GGML file was created from

LLM

  • llm.context_length: u32: size of the maximum supported context
  • llm.hidden_size: u32: embedding layer size
  • llm.num_hidden_layers: u32: number of hidden layers
  • llm.num_rotary: u32: int(hparams["rotary_pct"]*(hparams["hidden_size"]//hparams["num_attention_heads"]))
  • llm.use_parallel_residual: bool: whether or not the parallel residual logic should be used
  • llm.max_seq_len: u32: Maximum sequence length
  • llm.attention.num_heads: u32: number of attention heads
  • llm.attention.alibi_bias_max: f32: The maximum bias to use for ALiBI
  • llm.attention.clip_kqv: f32: not sure

Vocabulary

  • vocabulary.embedded_size: u32: size of the embedded vocabulary. Zero if there is no embedded vocabulary.
  • vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model (e.g. https://huggingface.co/mosaicml/mpt-7b-instruct/blob/main/tokenizer.json). Optional, but highly recommended for best tokenization quality with supported executors.

Future

This is not something we should aim for in the MVP, but ggml now has support for exporting the computation graph. A sample computation graph could be embedded to allow an executor to run the model without having direct support for the architecture.

Migration

The existing migrations have been pretty messy for the ecosystem and for the community. We should try to avoid causing significant upset by providing a migration path.

My suggestion is to switch over all model implementations, including llama.cpp, to GGUF, but offer a very straightforward conversion utility that does not require Python and can convert GGML and GGJTv3 to GGUF with all required information.

If interested, we could also include support for GGJT v1 and v2 using ggerganov/llama.cpp#1504 (although the requantisation process is inherently lossy).

Hopefully, this is the last time we have to bite this bullet. Even if we make breaking changes (like quantization version) again, software consuming GGUF can intelligently decide what to do based on the available information in the hyperparameters.

New model architectures can use GGUF without any additional work, so no breaking changes should be necessary there, either.

Conversion of Python models to GGUF

Ideally, all of the existing convert-h5-to-ggml.py and convert.py scripts can be entirely deprecated. Instead, there is one script that takes an arbitrary HuggingFace model and converts it to a compatible GGUF file. This vastly reduces the maintenance burden and makes it simpler to action changes across the ecosystem when necessary.


cc @ggerganov @LostRuins @KerfuffleV2 @LLukas22 @TheBloke @iacore @comex and others who work with GGML models

@Green-Sky
Copy link
Contributor

technically speaking, we also had a GGMFv1, the one before the memory mapped GGJTv1

@Green-Sky
Copy link
Contributor

there is also the new .ggml wip file, which contains the computation graph. 3b697a2

@cztomsik
Copy link

cztomsik commented May 31, 2023

Wonderful, I thought I would go for safetensors but they are not really willing to extend the spec for quantized dtypes. Obviously, I wanted to avoid GGML because there was no spec. If there is a spec, I am all in for this.

BTW: I would be also super-exited about the graph-saving/loading, I thought I would refactor my logic to be agnostic over Model vs. Graph, because both just need input and have output, so for inference, it shouldn't matter if I have a graph or a real model.
(where model can be for example fine-tuned or something, but graph can be only evaluated)

@klosax
Copy link
Contributor

klosax commented May 31, 2023

This is the first step to realize a unified llm API and interface and that would handle any supported architecture.

ggerganov/llama.cpp#1602 (comment)
#185
#145 (comment)

@KerfuffleV2
Copy link

Their length is stored before the key. They have a maximal length of 256

255 makes more sense if you're going to use a byte to store the length. Unless you want to always add +1 to that value.

general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc.

It might make more sense to make something like general.type which could be ggml and then put GGML-specific stuff under ggml like ggml.architecture: String. That way there would be the possibility of using this container format for non-GGML models.

If you're going to go through a bunch of trouble designing a model container format, it seems like it would make sense to make it something that could just generally be used.

That would also mean tools that manipulate it wouldn't really have to care if it was GGML or some other type of model.

llm.num_hidden_layers: u32: number of hidden layers

and similar - Why not use the architecture as the base key? A different architecture of model isn't necessarily going to have hidden layers, rotary, etc. It might have its own stuff. Just as an example, RWKV models don't even have attention heads.

So instead, you'd have llama.num_rotary.


Did I miss it or does this not really describe how/where the actual tensors get defined? I actually like the SafeTensors approach quite a bit where the metadata just defines position and length. The only thing I'd change from that is adding a requirement that the tensor data has to start align.

You'd want to pick an alignment that is pretty future proof and works for most CPU architectures and most types. What is it GGML uses internally, 64 bytes? That wastes a little space but it's not enough to really matter.

@philpax
Copy link
Contributor Author

philpax commented May 31, 2023

This is the first step to realize a unified llm API and interface and that would handle any supported architecture.

ggerganov/llama.cpp#1602 (comment) #185 #145 (comment)

Yep! We already implement this in llm in the Rust world, but we'd love to see upstream support for this and to begin consolidating the various examples into a cohesive framework so that we can all benefit.

Their length is stored before the key. They have a maximal length of 256

255 makes more sense if you're going to use a byte to store the length. Unless you want to always add +1 to that value.

Sure. I wasn't thinking about using a byte for the length, but that's entirely reasonable.

general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc.

It might make more sense to make something like general.type which could be ggml and then put GGML-specific stuff under ggml like ggml.architecture: String. That way there would be the possibility of using this container format for non-GGML models.

If you're going to go through a bunch of trouble designing a model container format, it seems like it would make sense to make it something that could just generally be used.

That would also mean tools that manipulate it wouldn't really have to care if it was GGML or some other type of model.

I'm not opposed, but I'd like to see a motivating case first. I think this is most likely to be implemented by all parties if we can agree on a reasonable extension from the original format.

llm.num_hidden_layers: u32: number of hidden layers

and similar - Why not use the architecture as the base key? A different architecture of model isn't necessarily going to have hidden layers, rotary, etc. It might have its own stuff. Just as an example, RWKV models don't even have attention heads.

So instead, you'd have llama.num_rotary.

No particular reason. I saw a commonality and merged them; if people decide using the architecture as base key makes more sense, I'm happy to go with that.

Did I miss it or does this not really describe how/where the actual tensors get defined? I actually like the SafeTensors approach quite a bit where the metadata just defines position and length. The only thing I'd change from that is adding a requirement that the tensor data has to start align.

You'd want to pick an alignment that is pretty future proof and works for most CPU architectures and most types. What is it GGML uses internally, 64 bytes? That wastes a little space but it's not enough to really matter.

Yeah, this just uses the current GGJTv3 scheme in the interest of minimising the amount of work required to migrate to the format. No opposition to moving to a ST-like format from me, but I also don't feel particularly strongly about it. What does everyone else think?

@klosax
Copy link
Contributor

klosax commented May 31, 2023

general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc.

It might make more sense to make something like general.type which could be ggml and then put GGML-specific stuff under ggml like ggml.architecture: String. That way there would be the possibility of using this container format for non-GGML models.

Even better (the value sets the key):

general.type = ggml
ggml.type = llm
llm.architecture = llama
llama.num_rotary

@LostRuins
Copy link
Contributor

@klosax i'd say the ggml magic would take care of that - ideally non-ggml formats shouldn't be using it as a container format. No need to over engineer it (my 2c).

@LLukas22
Copy link

Could we also include some optional generation parameters. Which contain default values for some sampling parameters? Or would that be to specific?

@LostRuins
Copy link
Contributor

I would recommend including stuff that's mainly essential for loading the model - things that are required for proper functioning. Samplers are technically not even dependent on the model - user is free to do with the output logits as they please.

@philpax
Copy link
Contributor Author

philpax commented May 31, 2023

Agree with sampling parameters not being essential (especially since you can use whatever sampler you want with whatever model.)

That being said, that reminds me - it might be a good idea to include suggested prompt formats as one of the standardised config parameters. Feel free to 👍 or 👎 this post if you think that's too extra.

@LostRuins
Copy link
Contributor

LostRuins commented May 31, 2023

Hmm I think that will be fine as an optional parameter, but not as a standard parameter. Standard params should be stuff that are required for loading correctly, like use_parallel_residual and quantization types.

Also prompt formats may not even make sense universally (they're kindof an instruct model thing). I have a model trained on literature, it has no prompt format, it just spews out prose. I also have another model that just generates long sequences of increasing numbers. Likewise... base llama has no prompt format.

@danforbes
Copy link
Contributor

danforbes commented May 31, 2023

Is it possible that something like this could be useful https://github.com/khonsulabs/pot? It seems like at a high-level this discussion revolves around the best way to construct a self-describing data format, which is a problem that I think has already been addressed to a certain extent.

More ideas here: https://github.com/yasammez/nachricht#prior-art

@philpax
Copy link
Contributor Author

philpax commented May 31, 2023

Hmm I think that will be fine as an optional parameter, but not as a standard parameter.

Yes. Sorry, to clarify, when I say "standard" I don't mean they should be included in all models. It's just that if you do add a prompt format, you should call it something we've declared here, so that things that expect it know what to look for.

I'll go through the list of k-v pairs up there to clarify which ones are required and which ones are standardised-in-name but otherwise optional, but I'll wait for feedback on the rest of the proposal first.

Is it possible that something like this could be useful https://github.com/khonsulabs/pot? It seems like at a high-level this discussion revolves around the best way to construct a self-describing data format, which is a problem that I think has already been addressed to a certain extent.

Normally, yeah, I'd just use a self-descriptive standard format. However, GGML/llama.cpp aim to be as dependency-free as possible, so something moderately bespoke but not too complex is more likely to be accepted by the wider community.

@danforbes
Copy link
Contributor

GGML/llama.cpp aim to be as dependency-free as possible

Adopting a format or specification doesn't necessarily mean taking on any new dependencies, and it would allow for greater focus to be placed on the "secret sauce", which I think will be the standardized key/value pairs and what they are meant to specify.

@klosax
Copy link
Contributor

klosax commented May 31, 2023

That being said, that reminds me - it might be a good idea to include suggested prompt formats as one of the standardised config parameters. Feel free to +1 or -1 this post if you think that's too extra.

I think this will be needed to run inference in instruction mode on any instruction tuned model. It is maybe enough with a key telling what supported standardized prompt formatting to use. If the key is missing, no instruction mode inference will be available. llm.instruct_format = alpaca

@iacore
Copy link

iacore commented May 31, 2023

I recommend extending safetensors. Only libggml need to load the model correctly anyways. See original discussion here: rustformers/llm#143

What extensions we need

  • include GGML types like ggml_q4_0
  • include hypeparams and vocab in metadata (this in already in spec)

@danforbes
Copy link
Contributor

I recommend extending safetensors.

Considering that the safetensors project already answers the question Yet another format? I think this is unambiguously the right thing to do.

@KerfuffleV2
Copy link

Considering that the safetensors project already answers the question

You'd have to fork it to do that, they don't seem interested in extending it. Based on existing discussion, it seems like they want to lock it down and reduce its extensibility further by, for example, forbidding gaps between tensors (even though the format currently would allow it since the metadata only says where tensors start and their length).

@iacore
Copy link

iacore commented May 31, 2023

You'd have to fork it to do that, they don't seem interested in extending it.

I disagree. The format itself is very simple. The huggingface parser is not that good, and we need to write the parser in C (for ggml) anyways. The safetensors format is just a format. If we get enough people to use our version, then our version becomes the "official" one.

@KerfuffleV2
Copy link

KerfuffleV2 commented May 31, 2023

I agree with all of the above, but that's basically what I'd call "forking" it. Taking that project and basing another one on it that takes a different approach, has different requirements, sets different restrictions, etc.

quick edit: Probably also should add: While I'm not really a fan of the direction they seem to have chosen, I personally wouldn't use the approach of forcefully trying to take control away. If it was me, I'd start with the SafeTensors format but call it something different.

@philpax
Copy link
Contributor Author

philpax commented May 31, 2023

I agree with Kerfuffle that that would be a non-ideal turn of events and would likely alienate an ecosystem that we should stay on good terms with.

safetensors as a format comes with certain assumptions that we should not singlehandedly override - it will end up causing a similar problem to what we have now, but with someone else's format, and with a lot more bad blood.


In any case, I'd like to request that we keep discussion about switching formats or fundamentally changing the structure of this format out of this issue. Feel free to open another issue.

I'm looking for a solution that solves the immediate issues the ecosystem is encountering at the least cost possible; we're not trying to find the perfect solution here, but the one that enables the most reusability / functionality at the least cost.

This format's not perfect by any means, but it's simple, easy to work with and understand (i.e. can be parsed from C without too much suffering), and more importantly: it powers an existing ecosystem with inertia.

The more complicated we make this change and the more parties we involve, the harder it will be to actually make the change. Let's keep it on track.

@iacore
Copy link

iacore commented May 31, 2023

we can name this .safetensors-ggml or something.

@klosax
Copy link
Contributor

klosax commented May 31, 2023

vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model .. Optional, but highly recommended for best tokenization quality with supported executors.

Why would json give a higher quality than the current layout?

Some models dont have a tokenizer.json, Replit uses spiece.model. How should such vocab be handled?

To support any vocab, maybe a key like vocabulary.encoding (defaulting to utf-8) would be needed?

@iacore
Copy link

iacore commented May 31, 2023

vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model .. Optional, but highly recommended for best tokenization quality with supported executors.

This makes the tokenizer config less portable. The tokenizer file is usually loaded by an external library from a file.

@philpax
Copy link
Contributor Author

philpax commented May 31, 2023

vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model .. Optional, but highly recommended for best tokenization quality with supported executors.

Why would json give a higher quality than the current layout?

Some models dont have a tokenizer.json, Replit uses spiece.model. How should such vocab be handled?

To support any vocab, maybe a key like vocabulary.encoding (defaulting to utf-8) would be needed?

Good question. For context: llm has support for using tokenizers directly, so we can load a tokenizer.json (which seems common for the models we support). That JSON file has a lot of specifics about tokenization that aren't captured in the (token, score) embedded vocabulary.

I wasn't aware of the existence of other ways to store the tokenization data, and I'd have to look into it. Do you have any further information about it that I could look into?

To support any vocab, maybe a key like vocabulary.encoding (defaulting to utf-8) would be needed?

Is encoding the only thing that can diverge? I'll admit I am not too across the nuances here - my understanding is that the HF models have their complex tokenizers, and then the Python conversion scripts load those in and extract (token, score) tuples that a GGML executor can use to tokenize a string, except it may not account for all of the complexities of the original tokenizer.

This makes the tokenizer config less portable. The tokenizer file is usually loaded by an external library from a file.

Yes, that's why it's optional. The (token, score) scheme can still be used, but I'd like for users to be able to use the original HF tokenizers out of the box if possible.

@TheBloke
Copy link
Contributor

TheBloke commented May 31, 2023

I thoroughly support any effort to produce a new format which will be future-proof and will protect against any more breaking changes.

I know it's probably not on the cards but what I would really love is if this change would eventually lead to llama.cpp being able to load any GGML model, like GPTJ, MPT, etc. If that's not being considered then at least if a standardised format would allow for non-compatible clients to inform the user that this is not a supported model then that would help a lot.

The idea of using safetensors sounds smart, although if it is used I think it'd be ideal to change the name for this fork of safetensors. safeggml perhaps. Otherwise I am envisaging a lot more support requests along the lines of "I downloaded the GPTQ, why won't it work in llama.cpp - they're both safetensors?"

I really like the idea of an embedded prompt template. Users are asking more and more for prompt templates to be communicated. Having that in the format itself sounds like a great idea.

I have a feature request of my own: multi-part files. It'd be really helpful if this change could bring back support for multi-part GGML files. safetensors would support that natively I guess. This would be useful because of the Hugging Face Hub limit of 50GB per file, which prevents uploading 65B q8_0 models unless they're uploaded eg as a multi-part ZIP, which is messy and extra work for uploader and user alike. I could also imagine that in the future we might see some new larger models - perhaps a Falcon 80B for example - which might similarly not be possible to upload in the higher quant sizes. Multi-part GGML would solve that neatly.

Great work, hope this gets implemented!

@klosax
Copy link
Contributor

klosax commented May 31, 2023

I wasn't aware of the existence of other ways to store the tokenization data, and I'd have to look into it. Do you have any further information about it that I could look into?

Replit is implemented here. Look at the conversion script. It needs a special tokenizer implemented in main.cpp

In the MPT example you can see what had to be done to correctly encode (in convert script) and decode (in main.cpp) the gpt-neox vocab.

Maybe the vocabs that are not json could be converted to it when creating the gguf file?

@philpax
Copy link
Contributor Author

philpax commented May 31, 2023

I thoroughly support any effort to produce a new format which will be future-proof and will protect against any more breaking changes.

Awesome! Yeah, I figured you might have a stake in this 😂

I know it's probably not on the cards but what I would really love is if this change would eventually lead to llama.cpp being able to load any GGML model, like GPTJ, MPT, etc. If that's not being considered then at least if a standardised format would allow for non-compatible clients to inform the user that this is not a supported model then that would help a lot.

Agreed, that would be ideal. I left the possibility of this open in the future section:

This is not something we should aim for in the MVP, but ggml now has support for exporting the computation graph. A sample computation graph could be embedded to allow an executor to run the model without having direct support for the architecture.

but I'm not sure how far along the cgraph export/import functionality is, or how stable it is. I figured we can add that as an extension once that's solidified a bit.

I'd be happy just to have llm and friends gracefully fail when the architecture isn't recognised, instead of plowing through and trying to read invalid hyperparameters 😂

The idea of using safetensors sounds smart, although if it is used I think it'd be ideal to change the name for this fork of safetensors. safeggml perhaps. Otherwise I am envisaging a lot more support requests along the lines of "I downloaded the GPTQ, why won't it work in llama.cpp - they're both safetensors?"

100% agreed - we were bouncing around ST support a couple months ago for llm, but one of my primary concerns is that we'd encourage users to seek out non-GGML-augmented ST models and get confused by those not working. An extension change might work, but we'd still have to set up our own pipelines for doing so and we'd still be creating a format that wouldn't be compatible.

I'm not opposed to the use of safetensors (we're likely to support it in llm at some point, or a variant of it), but it's easier to make GGML fit-for-purpose than to try to repurpose a format that other people are using and doesn't support what we need yet.

I really like the idea of an embedded prompt template. Users are asking more and more for prompt templates to be communicated. Having that in the format itself sounds like a great idea.

Glad to hear it. Do you have any suggestions for what that might look like/what needs to be supported?

I have a feature request of my own: multi-part files. It'd be really helpful if this change could bring back support for multi-part GGML files.

Aaahhh, I did think about this but I'm not sure about it. I feel like that's conflating a distribution concern with a deployment concern; do you think you'd still need this if it weren't for the HF limit? Would it be a significant improvement over uploading multipart ZIPs?

Replit is implemented here. Look at the conversion script. It needs a special tokenizer implemented in main.cpp

Ah... I see... they have a custom sentencepiece tokenizer. Yeah, not sure how to best handle that. @Narsil is that something tokenizers can support and/or be a part of tokenizer.json?

@apage43
Copy link
Contributor

apage43 commented May 31, 2023

If a major file format change is going to happen again the tokenizer configs for the models using huggingface tokenizers
BPE/GPT-2-like tokenizers ought to be improved (i.e. all but the SentencePiece ones - which I think are less broken but I haven't looked into them as much), all the formats that only store the vocab list and not the merges and have no way of identifying the "additional" tokens are, unfortunately, incomplete.

When encoding they should, after doing the "pretokenizing" stage with the regex, merge bigrams in the order they occur in the merges list, which will not necessarily get the same result as just taking the longest matching token. The logic in minGPT's implementation of GPT2's tokenizer is a good reference: https://github.com/karpathy/minGPT/blob/master/mingpt/bpe.py#L95

Tokens added after "training", the ones in the "added_tokens" section of tokenizer.json need to be handled separately - see this comment in tokenizers for an explanation: https://github.com/huggingface/tokenizers/blob/cb819724eff2769aa1211b0f296649ceb502ccc4/tokenizers/src/tokenizer/added_vocabulary.rs#L130-L140

Lastly, to totally match the behavior of tokenizers, unicode normalization is required - I think most models settled on NFC form but tokenizers supports all of them.

I have a C++ implementation of enough of that to correctly encode ChatML prompts as used by MPT-7B-Chat at https://github.com/apage43/bpe.cpp but it depends on ICU for two things, the unicode normalization, which might be possible to live without, and the pretokenizing regex being unicode-aware when splitting on "letter" characters, which is somewhat important for handling non-English text.

@philpax
Copy link
Contributor Author

philpax commented Jun 25, 2023

Sorry about the delay, I'll get to making the PR within the next few days 👍

@philpax

Hm, it should be laid out in memory as [{k: v}, vocabulary-if-present[], tensor data aligned]. That being said, for v2 of the spec (in the PR), I'm likely going to take Kerfuffle's suggestion and decouple the tensor info from the data, so it's more likely going to be [{k: v}, vocabulary-if-present[], tensor-info[]], where each tensor-info has an (aligned) offset into the file where the actual tensor data can be found, and the data is decoupled entirely from the metadata.

This sounds great. Lets also envision a way to specify the tensor alignment upon writing the GGUF files. There are 2 aspects of the alignment:

* pad the meta data up towards the specified alignment

* pad each tensor data up towards the specified alignment

This is important, since for example the Metal implementation requires the shared Metal buffers to be page aligned and we currently do some extra tricks to make this work. Ideally, one could simply generate a page aligned GGUF and avoid having to do that. Also, there are some arguments about having page aligned tensors will generally perform better when running on the CPU. Note that different alignment sizes do not break format compatibility - everything works the same way, we just have the benefit of loading the model data directly memory aligned. We can optionally have the alignment size explicitly written in the meta data too general.alignment: u32

Seems reasonable. I'll account for this in the spec. Given that this might be a requirement for Metal loading, should we align both the metadata and tensor data to some large predefined alignment to ensure the models are always loadable?

I'm also thinking about whether the vocabulary-array can be consolidated into the key-value pairs (so that we can avoid specialising for the current wonky vocabulary), but I can't think of a nice way to do that. It would involve one of the following:

value types being complicated (adding an array type, adding tuples, adding arrays of tuples)
the k-v structure is replaced entirely with something more powerful (JSON/BSON/msgpack/whatever), requiring a dependency to handle that or a constrained reader
encoding the vocabulary as a string that has to be parsed, which is the least-effort solution, but also the ugliest

I guess we can support arr(type) where type is one of the fundamental types u32, f32, String, etc.. This way we can have:

* `vocabulary.text: arr(String)`

* `vocabulary.score: arr(f32)`

* etc.

Ah, yeah, I suppose we could always use separate fields. That would handle the case where there aren't any scores, either. Great, I'll account for that in the spec.

Should there be support for nested arrays (e.g. arr(arr(u32))) or should that be ruled out for now?

I'd also like to use a unified file format to store the text-training checkpoint files. They also follow the pattern of storing key:value pairs and some tensors.

They are different from the regular model files, because key:values and tensors from both model and optimizer state need to be stored.

Is that different? Unless I'm mistaken, wouldn't that just be an extended version of this format with more KVs and tensors?

@xaedes
Copy link

xaedes commented Jun 25, 2023

Is that different? Unless I'm mistaken, wouldn't that just be an extended version of this format with more KVs and tensors?

Indeed, they are not really that different, just different from the current model files.

@philpax
Copy link
Contributor Author

philpax commented Jun 25, 2023

Okay, I've written a first pass at the spec at #302! Have at it there - please make any further suggestions against that PR, so that we have a unified document/vision that we can update.

@ggerganov
Copy link
Owner

Should there be support for nested arrays (e.g. arr(arr(u32))) or should that be ruled out for now?

It does not hurt to be in the spec and supported in the future.

@StellaAthena
Copy link

I’m confused about the claim that safetensors wont add support for ggml. The primary maintainer of the project discusses supporting the idea here. I think that if you submit a PR with an update that supports quantization it’ll likely be approved.

@Green-Sky
Copy link
Contributor

Green-Sky commented Jul 1, 2023

I think the main point against using safetensors is it's json usage, AND it's inability to be mapped into memory as is (i think) (you probably can).

@StellaAthena
Copy link

I think the main point against using safetensors is it's json usage, AND it's inability to be mapped into memory as is (i think) (you probably can).

Can you elaborate as to why this is problematic?

@philpax
Copy link
Contributor Author

philpax commented Jul 1, 2023

Hi! (Big fan of your work!)

We were considering using safetensors before (and may still do in the future), but there are a few issues.

Fundamentally, it boils down to a few things:

  • GGML is a moving target (especially with its quantization), so we'd like a format that we can alter as required.
    • The quantization format has already changed twice, which was mentioned in the dtypes issue.
  • GGML is a single-file library with no dependencies, so JSON parsing will require integrating an existing parser or writing a new one that only handles what's needed
  • safetensors already has its existing usage patterns (next to other files in a directory, like the tokenizer); a feature of the existing GGML models is single-file deployments, which may lead to user confusion

I think that safetensors will be supported by executors in the future - including the necessary extensions for GGML use - but GGUF's designed to resolve the issues with the current format while still leaving room for rapid evolution.

@cztomsik
Copy link

cztomsik commented Jul 2, 2023

that only handles what's needed

I think this would be the case, and honestly, if that's the only blocker, I'd be happy to do it.

I also believe we could use safetensors, if:

  • there was a written spec (let's make it easy for all the people to implement parsers/writers)
  • there was an "unknown" data type, so we could put there anything, without having to worry about Q_XX versions, this is currently not possible and the mentioned PR does not solve that I've missed the u8 dtype, we could probably use that?
  • metadata restriction was relaxed to include any kind of JSON, because then we could include the vocab easily as another json (we can do that even now but to be honest, it's awkward to stringify/double-encode JSON so that it can be embedded in another JSON)

I am probably missing a few things (unicode?), but I think we could just say that our format is a subset of the safetensors (to make the parsing simpler)

@philpax
Copy link
Contributor Author

philpax commented Jul 2, 2023

Aye, I think it would be possible to make safetensors work with enough work. My suspicion, though, is that the amount of work is on par with defining our own format, and it'd come with two disadvantages:

  • our safetensors-subset/light-fork wouldn't be compatible with other safetensors models anyway, and they wouldn't be compatible with us, so we're not gaining much from using it
  • we do not control the development of the format, so changing the format for something in our ecosystem is harder to do without further forking safetensors

I think safetensors support is an excellent idea, but I don't think we can/should make it the primary format for this ecosystem until the rate of development slows down and things can be more standardised.

@apage43
Copy link
Contributor

apage43 commented Jul 3, 2023

It might be reasonable to support reading safetensors in quantize, maybe even directly loading f16/f32 weights from them for inference (though both of these would also need a way to convey the extra metadata/config/hparams - there's the safetensors json metadata but it doesn't standardize names or schemas for those things as safetensors isn't really meant to be a "everything you need to run the model" format, its just meant to store ... the tensors.)

but it probably doesn't make sense as a format for storing quantized models unless the quantization formats become more standardized (at the very least not exclusively used by ggml)

@cztomsik
Copy link

cztomsik commented Jul 4, 2023

our safetensors-subset/light-fork wouldn't be compatible with other safetensors models anyway, and they wouldn't be compatible with us, so we're not gaining much from using it

I think if we are a subset then everybody can read us but we can't read any other model, which is IMHO fine. Reading itself of course does not mean it will be useful, the client still needs to know what is in the file but that's how it is with every format except graph-dumps.

we do not control the development of the format

The question is how they are eager to define/relax the spec and participate in joint development.

The main benefit of safetensors is that we are trying to be forward-compatible here and burning brain cycles even when JSON has already solved all of that. The metadata itself is enough to describe anything, the rest are just tensor bytes.


That said, I personally don't care that much about the format, the most important thing is to get it done and supported ASAP. The fragmentation is crazy ATM. So feel free to just ignore whatever I said :)

@StellaAthena
Copy link

Regarding nested JSONs The conversation about nested JSON support is about the metadata field, correct? Metadata is (by definition) auxiliary information that is non-essential. If there’s data stored as metadata that you need to have present to run a model, it seems like something is being misused more than anything else.

Regarding the Spec What info is desired that isn’t found here? Is there a standard way for file specs to be written? I’m happy to write up the desired document (consulting with Nicolas of course) if that would be helpful.

Regarding Everything Else My primary interest is in improving the interoperability of the open source ecosystem and reducing duplication of work. I came to this thread because I tweeted about how I was excited ST was gaining traction and someone replied with a list of complaints and linked to this thread.

I’m very much not here to tell the ggml community how you all should prioritize things or what decisions you should make. If we can make ggml and ST happy at the same time instead of necessitating the creation of another format, that’s a big win in my book. If y’all decide that you have too divergent values from ST or that your library isn’t stable enough I understand.

P.S.: Are there leader(s) of this community? I would love to learn about how y’all’re organizing and managing the community, and if there’s anything that EleutherAI can either do to help or learn from you about.

@saharNooby
Copy link

@StellaAthena I believe the leader is @ggerganov, who created ggml, llama.cpp etc. and organizes work here in form of roadmaps.

@philpax
Copy link
Contributor Author

philpax commented Jul 5, 2023

Regarding nested JSONs The conversation about nested JSON support is about the metadata field, correct? Metadata is (by definition) auxiliary information that is non-essential. If there’s data stored as metadata that you need to have present to run a model, it seems like something is being misused more than anything else.

We store the model configuration within the model (i.e. hyperparameters, model structure information, tokenizer). This is to allow single-file deployments of models, because the configuration rarely changes and users enjoy being able to download one file and get going with their existing executor.

The proposal being made with regards to safetensors here is to store that configuration in the JSON metadata, to allow for the same kind of experience. This seems doable, but would require readers to be able to read JSON, which is harder with GGML's single-file C header. (Although this appears to be changing with the addition of more backends.)

I can't comment on whether that would be a misuse of the metadata field, but other ST readers would, most likely, ignore the presence of this data.

Regarding the Spec What info is desired that isn’t found here? Is there a standard way for file specs to be written? I’m happy to write up the desired document (consulting with Nicolas of course) if that would be helpful.

That looks good to me. I've seen several different ST implementations, so I assume that this is sufficient. I assume this was missed by the person who raised the concern.

Regarding Everything Else My primary interest is in improving the interoperability of the open source ecosystem and reducing duplication of work. I came to this thread because I tweeted about how I was excited ST was gaining traction and someone replied with a list of complaints and linked to this thread.

I'm sorry to hear that - we have absolutely no bad blood with safetensors, it's just not necessarily the right fit for our ecosystem at this moment due to our slightly different constraints.

I’m very much not here to tell the ggml community how you all should prioritize things or what decisions you should make. If we can make ggml and ST happy at the same time instead of necessitating the creation of another format, that’s a big win in my book. If y’all decide that you have too divergent values from ST or that your library isn’t stable enough I understand.

I agree that we should unify the formats if possible. It's just that GGML moves very quickly - we had two quantization format breaks in two weeks the other month - and having full control of the format will allow us to maintain that velocity.

My main concern with making safetensors work for us is that we risk breaking the wider ecosystem or creating a somewhat-incompatible fork of safetensors that's compatible on paper, but not necessarily in practice. (e.g. ggml safetensors models being all-in-one with custom quantization, while other safetensors models have their config in separate files with well-defined quantization).

The quantization is really the major sticking point for me; our quantization formats are fickle and won't be compatible with the larger safetensors ecosystem, which will lead to a lot of user confusion.

With that being said - my hope is that in a few months time, this format will be retired, and we're all on safetensors. My part of the ecosystem (rustformers/llm) is likely to support safetensors as a format to load before that, as adding additional dependencies is easy for us.

P.S.: Are there leader(s) of this community? I would love to learn about how y’all’re organizing and managing the community, and if there’s anything that EleutherAI can either do to help or learn from you about.

As mentioned, Georgi is the lead of the ecosystem and is the head of the newly-founded ggml.ai. I'm the primary maintainer of rustformers/llm, a Rust library that uses GGML to implement several architectures with a unified interface. Executors outside of Georgi's repositories (llm included) are somewhat of an organic development, and aren't necessarily organized.

It's still early days for this ecosystem - as far as I know, we don't actually have a place to communicate synchronously - so there's a lot of work to be done. Georgi is currently the BDFL :-)

@ggerganov
Copy link
Owner

Hi @StellaAthena

I pretty much agree with everything that @philpax said. In the long run we will support and integrate with safetensors, but at the moment we are primarily focused on consolidating the work from various ggml-related repos and making life easier for the maintainers involved.

More specifically, the main goal atm is to complete the current roadmap which would lay a good foundation for the project. After that we will look into extending support and collaboration further.

if there’s anything that EleutherAI can either do to help or learn from you about.

The best way to help now is to help complete the roadmap.

@qnixsynapse
Copy link
Contributor

The idea of using safetensors sounds smart, although if it is used I think "

The idea of safetensors sounds very good to me.

it'd be ideal to change the name for this fork of safetensors. safeggml perhaps. Otherwise I am envisaging a lot more support requests along the lines of "I downloaded the GPTQ, why won't it work in llama.cpp - they're both safetensors?

Or better llama_weights.q4_0.safetensors where q4_0 is quantization and it won't create any confusion imo.

@klosax
Copy link
Contributor

klosax commented Aug 6, 2023

Note: The discussion about the file format is continued in PR #302.

@iashchak
Copy link

iashchak commented Sep 30, 2023

@philpax Thank you for such a great proposal.

I have a few questions:

I'm wondering, you said:

Ideally, all of the existing convert-h5-to-ggml.py and convert.py scripts can be entirely deprecated. Instead, there is one script that takes an arbitrary HuggingFace model and converts it to a compatible GGUF file.

As for ggml/gguf user there is only conver-blabla.py path to convert some custom model (as it was recently done for baichuan model at llama.cpp) or there are any other place where I can put mappings/conversion logic?

I see converters are placed in many repos now:
llama.cpp / koboldcpp / ggml

Where is the main place for it? Maybe it is a good place to specify some global registry-kind repo for them / template repo?


PS I wish to write some kind of happy path for contributors

Thank you in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed refactoring Refactoring
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.