Make loading weights 10-100x faster #613

Merged (9 commits, Mar 30, 2023)

Conversation

@jart (Contributor) commented Mar 29, 2023

This is a breaking change that's going to give us three benefits:

  1. Your inference commands should load 100x faster
  2. You may be able to safely load models 2x larger
  3. You can run many concurrent inference processes

This was accomplished by changing the file format so we can mmap()
weights directly into memory without having to read() or copy them.
That ensures, first, that the kernel can make its file cache pages
directly accessible to our inference processes, and second, that those
file cache pages are much less likely to get evicted (which would force
loads to hit disk), because they're no longer competing with memory
pages that were needlessly created by gigabytes of standard I/O.
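As a minimal sketch of this pattern (assumed POSIX, error handling
trimmed; the helper name map_weights is illustrative and not the PR's
actual code):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the weights read-only. Every process that maps the same file shares
// the kernel's page-cache pages for it, so nothing is copied into private
// buffers and no extra memory is allocated by buffered reads.
static void *map_weights(const char *fname, size_t *out_len) {
    int fd = open(fname, O_RDONLY);
    if (fd == -1) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return NULL;
    *out_len = (size_t)st.st_size;
    return addr;
}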

The new file format supports single-file models like LLaMA 7B, and
it also supports multi-file models like LLaMA 13B. Our Python tool
now merges the foo.1, foo.2, etc. files back into a single file, so
the C++ code that maps it doesn't need to reshape the data every
time. That has made llama.cpp much simpler; much of its loading code
has now been deleted.

Furthermore, this change ensures that tensors are aligned properly
on a 32-byte boundary. That opens the door to seeing if we can get
additional performance gains on some microprocessors, by using ops
that require memory alignment.
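A rough sketch of the arithmetic involved (illustrative only; TENSOR_ALIGN
and align_offset are made-up names, not necessarily the project's):

#include <stdint.h>

#define TENSOR_ALIGN 32  /* hypothetical name for the 32-byte boundary */

// Round a file offset up to the next multiple of 32. Padding each tensor's
// data out to such an offset, combined with mmap() returning a page-aligned
// base address, means every tensor pointer inside the mapping is 32-byte
// aligned and usable with SIMD loads that require alignment.
static uint64_t align_offset(uint64_t offset) {
    return (offset + TENSOR_ALIGN - 1) & ~(uint64_t)(TENSOR_ALIGN - 1);
}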

Lastly, note that both POSIX and Windows are supported.

This PR solves issue #91.

This PR was written in collaboration with @slaren. This PR is also rebased on
PR #586 so please do not squash merge! Use either merge or rebase.

jart added the performance and breaking change labels on Mar 29, 2023.
jart mentioned this pull request on Mar 30, 2023.
@luminalle commented

Should the other converters also be rewritten to handle this new format?

@jart (Contributor, Author) commented Mar 30, 2023

Yes indeed. I just fixed the quantize program. Now I'm hunting down all the tests.

@jart (Contributor, Author) commented Mar 30, 2023

All tests look green except for a CMake test. For example: https://github.com/ggerganov/llama.cpp/actions/runs/4559537462/jobs/8043597142?pr=613 I'm stumped on this error. I can't figure out where the file models/ggml-vocab.bin comes from. Does anyone know? Could it be a stale cache?

@FNsi (Contributor) commented Mar 30, 2023

> All tests look green except for a CMake test. For example: https://github.com/ggerganov/llama.cpp/actions/runs/4559537462/jobs/8043597142?pr=613 I'm stumped on this error. I can't figure out where the file models/ggml-vocab.bin comes from. Does anyone know? Could it be a stale cache?

#355 mentioned "Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)"

@@ -20,7 +20,7 @@
#endif

#define LLAMA_FILE_VERSION 1
#define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
@bakkot (Contributor) commented Mar 30, 2023

Nit: why change the magic rather than the version? I assumed the plan was to keep the magic constant forever. If you bump the version instead, old executables will recognize new model files and give a more useful error message. And it's nice to distinguish between "this is definitely a model file for this project, but it's the wrong version" vs "this is some random junk we don't know anything about".

(This PR is a very neat bit of engineering; please don't let my nitpick distract from that.)
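For illustration, a tiny sketch of the distinction (reusing the constants
from the hunk above; check_header is a made-up helper, not the project's
actual loader):

#include <stdint.h>
#include <stdio.h>

#define LLAMA_FILE_MAGIC   0x67676d66u  // 'ggmf'
#define LLAMA_FILE_VERSION 1u

// With a stable magic, an old binary can recognize a newer model file and
// report "wrong version" instead of "this isn't a model file at all".
static int check_header(uint32_t magic, uint32_t version) {
    if (magic != LLAMA_FILE_MAGIC) {
        fprintf(stderr, "not a llama model file (bad magic)\n");
        return -1;
    }
    if (version != LLAMA_FILE_VERSION) {
        fprintf(stderr, "unsupported model file version %u (expected %u)\n",
                version, (unsigned)LLAMA_FILE_VERSION);
        return -1;
    }
    return 0;
}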

A collaborator replied:

not a nitpick but a real change request :)

A collaborator added:

(nvm)

@ggerganov (Owner) commented Mar 30, 2023

@jart
The models/ggml-vocab.bin is generated by convert-pth-to-ggml.py by providing an extra arg.

I had expected mmap support to be much more intrusive, but in fact it turned out to be very compact. llama.cpp is much simpler now. Good stuff!

Regarding the version comment - yes, the plan was to bump the version and not the magic. But I'm OK with changing the magic to commemorate the significance of this update. In fact, maybe we can make this a thing: everybody who makes a significant contribution to the project gets their initials appended to the version. What do you think? 😄

Let me play with this tonight before merging. We have to take special care that all the other ggml model files floating around (Alpaca, GPT4All, Chinese LLaMA, etc.) have a nice way to convert to this new format, and update the instructions in the README.

Also, maybe some synchronisation with #545 would be needed.

@jart (Contributor, Author) commented Mar 30, 2023

File updated. A lot more tests are green now. No idea what's up with the sanitizer.

I thought so too; I was pleasantly surprised by how well it worked out. Glad we took a few weeks to think.

I'm honored to hear you say that. I can round up the magic to 64 bytes if you like, so there's room to hand out kudos without breaking backwards compatibility in the future. Since my initials also act as a stamp of approval, I'm going to send a follow-up change after this that'll harden the loading code, so that folks will be able to trade model files in this format on HuggingFace with maximum safety and confidence.

#545 is an ambitious unification. I've done my best to comment my changes to make the merge less painful for the author. I've sought to update the other scripts too, but don't know how to run them. One thing you could also consider with this project is having a contrib/ folder, where folks can merge as much of their own stuff as they want, under the expectation that the ones who need it are the ones who maintain it.

int fd = open(fname, O_RDONLY);
if (fd == -1) return 0;
int64_t length = lseek(fd, 0, SEEK_END);
void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
A contributor commented:

  1. Is it safer to use mmap64 for 4GB+ files?
  2. It seems mmap, mmap64, and MapViewOfFile support mapping from a given offset. Is it possible to map from header_len (as the offset)? If we can do that, there's no need to align the model file, right?

jart (Contributor, Author) replied:

  1. The right thing to do on 32-bit platforms is to have your build system define -D_FILE_OFFSET_BITS=64, which will cause your system header files to automatically #define mmap mmap64 (see the sketch below).
  2. File offsets passed to mmap() need to be page size aligned, so I don't think so.
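A minimal sketch of point 1 (assuming a 32-bit glibc target; the file name
and the check itself are purely illustrative):

// Build with something like: cc -D_FILE_OFFSET_BITS=64 check_offsets.c
// On 32-bit platforms this define widens off_t to 64 bits, and the libc
// headers transparently redirect mmap()/lseek()/open() to their 64-bit
// variants, so 4GB+ model files work without calling mmap64() by hand.
#include <assert.h>
#include <sys/types.h>

int main(void) {
    static_assert(sizeof(off_t) == 8, "large file offsets enabled");
    return 0;
}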

@pgoodman commented Mar 31, 2023

@jart Is it possible to ensure the file size is a multiple of the hugepage size (e.g. using ftruncate), to benefit from fewer TLB misses when the model data is accessed? (Corresponding mmap hints or other system-specific APIs, e.g. for macOS, might need to be used.)

jart (Contributor, Author) replied:

It doesn't matter with mmap() if the file length isn't page size aligned, even with smaller pages. You should be good to go if you modify the mmap() code in llama.cpp by hand and actually manage to get huge pages to work without nuking your machine :-)

TIL!

(Resolved review threads on convert-pth-to-ggml.py and llama.cpp.)
jart added a commit to jart/llama.cpp that referenced this pull request on Mar 30, 2023:

If you deleted your old Meta LLaMA .pth files, then the migrate-ggml-2023-03-30-pr613.py script will allow you to convert your old ggml files into the new mmap()'able format.

See ggerganov#613
@jart (Contributor, Author) commented Mar 30, 2023

@ggerganov This change now includes a migration tool named migrate-ggml-2023-03-30-pr613.py. It ensures that users of the old GGML file format who've deleted the original .pth files will be able to convert their ggml+ggmf files to the new ggml+ggjt format. Please take a look.

@x02Sylvie commented

Having an issue migrating the Alpaca model ggml-alpaca-13b-q4.bin: the Python script seems to think the model has two n_parts rather than one. Would adding an --n_parts argument to the conversion script, so you can manually specify --n_parts 1 just like when running Alpaca models on llama.cpp, resolve the issue?
[attached screenshot: migrate]

@jart (Contributor, Author) commented Mar 30, 2023

@x02Sylvie I don't have access to the Alpaca model. Could you send a pull request fixing that after this gets merged?

@x02Sylvie commented Mar 30, 2023

I don't really know Python, so I'd rather leave the pull request to someone smarter than me.

I did, however, manage to get the Alpaca 13B model converted by manually setting n_parts to 1 in the .py conversion script. I'm unsure whether that's the proper place to set n_parts, though. I changed

def get_n_parts(dim):
    mappings = {4096: 1, 5120: 2, 6656: 4, 8192: 8}

    n_parts = mappings.get(dim)

    if n_parts is None:
        print(f"Invalid dim: {dim}")
        sys.exit(1)
    print(f"n_parts = {n_parts}\n")
    return n_parts

to

def get_n_parts(dim):
    mappings = {4096: 1, 5120: 2, 6656: 4, 8192: 8}

    n_parts = 1

    if n_parts is None:
        print(f"Invalid dim: {dim}")
        sys.exit(1)
    print(f"n_parts = {n_parts}\n")
    return n_parts

The model does work after conversion, however.

@gaceladri commented

Hello,

I cannot load gpt4all after converting it to the new ggml format using your script:
python3 convert-gpt4all-to-ggml.py models/gpt4all/gpt4all-lora-quantized.bin ./models/tokenizer.model

I have opened a new issue probably related to this: #655 (comment)

@gaceladri commented

I could run it with the previous version https://github.com/ggerganov/llama.cpp/tree/master-ed3c680

@rabidcopy (Contributor) commented Mar 31, 2023

> Hello,
>
> I cannot load gpt4all after converting it to the new ggml format using your script: python3 convert-gpt4all-to-ggml.py models/gpt4all/gpt4all-lora-quantized.bin ./models/tokenizer.model
>
> I have opened a new issue probably related to this: #655 (comment)

You also need to run the resulting file through migrate-ggml-2023-03-30-pr613.py.

gpt4all weights -> convert-gpt4all-to-ggml.py -> converted gpt4all weights -> migrate-ggml-2023-03-30-pr613.py -> gpt4all weights compatible with the latest version of llama.cpp

@gaceladri commented

It worked. Thank you for your fast response!

Nuked88 pushed a commit to Nuked88/llama.http that referenced this pull request on Mar 31, 2023 (same commit message as above).
@asklar commented Apr 1, 2023

Great work, @jart and @slaren! <3

ShoufaChen added a commit to ShoufaChen/langchain-patch that referenced this pull request on Apr 4, 2023:

As noted in https://github.com/ggerganov/llama.cpp/blob/master/migrate-ggml-2023-03-30-pr613.py, the llama.cpp authors made a breaking change to the file format on 2023-03-30 in ggerganov/llama.cpp#613. Therefore, we additionally need to use migrate-ggml-2023-03-30-pr613.py to convert the LLaMA model.

hwchase17 pushed a commit to langchain-ai/langchain that referenced this pull request on Apr 6, 2023, with the same note.
Labels: breaking change (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility); performance (Speed related topics)