simple convert scripts run out of tmpfs space on Linux #3433
If I'm reading the code right, that temp file is only written to (sequentially, no seeks) until the point where it's rewound and copied to the output file. If that's correct, then as far as Linux is concerned there shouldn't really be any benefit to using a tempfile for a write-only workload, since the kernel's page cache already works in a similar way, and the final write to the output file still takes as long as it takes. If, however, the tempfile does in fact provide a performance benefit, something else must be non-optimal; Python's buffering, I would imagine, e.g. the buffer size specified in [...].
Back when I did my cleanup stuff, I also added a feature to [...]. Just for example, the GGML to GGUF converter script always sets that: llama.cpp/convert-llama-ggml-to-gguf.py, lines 236 to 241 in 9476b01.
The only reason to use a tempfile is on platforms other than Linux (*BSD, macOS, Windows) where the temp folder is usually not a ramdisk. This allows you to convert a model with an output size larger than your RAM+swap space. But on typical Linux installs, use_temp_file has no benefit unless you set TMPDIR to disk.
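For illustration, a minimal sketch (editor's illustration, not from the thread, and assuming the gguf-py writer API of that time) of a simple convert script opting out of the temp file via the use_temp_file keyword; the output path and tensor are placeholders:

```python
import gguf
import numpy as np

# Sketch only: use_temp_file=False keeps tensor data buffered in RAM instead of
# spooling it to a temp file under TMPDIR (tmpfs on most Linux installs).
writer = gguf.GGUFWriter("out.gguf", "llama", use_temp_file=False)
writer.add_architecture()
writer.add_tensor("token_embd.weight", np.zeros((8, 8), dtype=np.float32))  # placeholder tensor
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```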
This sounds like the best option, but I'm not sure how difficult it would be to achieve, or whether we can come up with a simple enough API to reuse in the simple convert scripts.
There's an approach that should mostly avoid the temp file and memory issues, and that's to just use mmap. There is one limitation though: on 32-bit systems only files up to 2 GB can be mmapped. Would that be acceptable?
AFAIK Python on 32-bit Linux is built with 64-bit file offsets ([...]). Edit: we use mmap for safetensors files in convert.py anyway.
The numpy reference for numpy.memmap: [...]. The source for the function is here: https://github.com/numpy/numpy/blob/main/numpy/core/memmap.py - it's pretty short if you want to take a look and see whether there's a reason it would only support 2 GB on 32-bit. (It kind of looks like the main reason to use that function is that it simplifies using an offset, which with a raw mmap you'd have to handle yourself.)
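For reference, a minimal sketch of the np.memmap idea being discussed; the file name, offset, and shape below are made up for illustration:

```python
import numpy as np

# Map tensor data at a known byte offset instead of reading it into RAM.
# Pages are only faulted in as elements are actually touched.
tensor = np.memmap("model-00001.bin", dtype=np.float16, mode="r",
                   offset=4096, shape=(4096, 4096))
print(tensor[:2, :2])
```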
Unfortunately, an approach like this would not work with convert.py because the original LLaMA tensors are split between multiple files and must be concatenated. And I would rather not have different versions of LazyUnpickler between convert.py and the gguf package.
Why not? Like I mentioned, it can be [...]
Sure, there should be only one version of it in one place. I'm not sure how that relates to loading the data via mmap, though.
LazyUnpickler already waits until the last second to load tensor data from disk: it builds a chain of 'load' functions and then calls them when the data is needed. mmap would not be any simpler (it requires extra effort to find the offset in the zipfile, which is not directly exposed by the API), nor would it save significant memory in convert.py as far as I can tell.
That's true, but it's not too hard, since the API does expose an offset to the local file header (you basically just have to find the name length and the extra field length, and the raw data will be right past that).
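A rough sketch of that trick (illustrative only; the file and member names are hypothetical, and it assumes a stored, i.e. uncompressed, zip entry as in pytorch checkpoints):

```python
import struct
import zipfile

def raw_data_offset(zf: zipfile.ZipFile, name: str) -> int:
    """Byte offset of a stored member's raw data within the zip file."""
    info = zf.getinfo(name)
    # local file header: 30 fixed bytes; name length at +26, extra length at +28
    zf.fp.seek(info.header_offset + 26)
    name_len, extra_len = struct.unpack("<HH", zf.fp.read(4))
    return info.header_offset + 30 + name_len + extra_len

# usage sketch: the offset could then be handed to np.memmap(..., offset=...)
# with zipfile.ZipFile("consolidated.00.pth") as zf:
#     off = raw_data_offset(zf, "consolidated/data/0")
```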
Well, you wouldn't run out of temp space, because there'd no longer be a need to use that approach, right? Also, when concatenating tensors, the data from the first one in the sequence wouldn't be written to, so it would basically be the same as a read-only mmap. I'm not sure if permuting touches every element or not. The gguf interface could probably be changed a little to accept a list of tensor arrays that need to be saved; that way it potentially wouldn't be necessary to concatenate everything into one big numpy array at all.
convert.py doesn't use tmpfs, it calls GGUFWriter.write_tensor_data directly. This issue refers to the simple scripts that call GGUFWriter.add_tensor.
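To make the distinction concrete, here is a rough sketch of the streaming path convert.py uses (again assuming the gguf-py writer API of that time; the names and tensor are placeholders):

```python
import gguf
import numpy as np

data = np.zeros((8, 8), dtype=np.float32)  # stand-in for a converted tensor

w = gguf.GGUFWriter("streamed.gguf", "llama")
w.add_architecture()
w.add_tensor_info("blk.0.attn_q.weight", data.shape, data.dtype, data.nbytes)  # metadata only
w.write_header_to_file()
w.write_kv_data_to_file()
w.write_ti_data_to_file()
w.write_tensor_data(data)  # tensor bytes go straight to the output file, no temp storage
w.close()
```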
I wonder if a better temporary solution is to set the dir used for the spooled temp file to something other than the default.
Python's temporary file handling stuff should respect environment variables like TMPDIR.
I'm not sure if you understood what I meant. Python's SpooledTemporaryFile exposes a dir parameter.
What I meant is there's already a convention for allowing users to set where temp files get stored, and I actually think it would probably be weird to override that behavior and use some other temp directory.
The point of SpooledTemporaryFile is actually that dir is supposed to be on disk - it's designed to use memory until a certain threshold, then write to dir, which is assumed to have more space, not less. Otherwise, you would just use RAM and not use a file object at all. TMPDIR is the default for consistency with e.g. NamedTemporaryFile, but on Linux we really should not spool to /tmp - we should use the current directory (like transformers does) or /var/tmp.
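For reference, a minimal sketch of that spooling behaviour (the threshold and directory below are arbitrary examples):

```python
import tempfile

# Data stays in memory until max_size is exceeded, then rolls over to a file in dir.
with tempfile.SpooledTemporaryFile(max_size=256 * 1024 * 1024, dir="/var/tmp") as f:
    f.write(b"\x00" * 1024)                  # still in RAM
    f.write(b"\x00" * (512 * 1024 * 1024))   # exceeds max_size, now backed by /var/tmp
    f.seek(0)
    header = f.read(16)
```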
This issue was closed because it has been inactive for 14 days since being marked as stale. |
By default, Linux prevents tmpfs from using more than 50% of available system memory. This is normally a good thing, but the simple convert scripts write all tensor data to tmpfs before saving the output file, causing this exception if the converted model is larger than 50% of system RAM (ref).
This is annoying. You can set TMPDIR=/var/tmp to work around this, but then you need twice as much free disk space as the size of the output file - which I don't always have.
The smallest change that would fix this problem would be to provide a way to effectively disable use_temp_file on Linux, while still supporting use of e.g. /var/tmp if desired. That way, I could leverage 100% of my RAM, and my swap space, in order to convert these models. If we choose this route, we should make sure the converted tensor data is freed as it is written, to avoid unnecessary swapping - right now it is not.
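A hypothetical illustration of the "freed as it is written" part (the names are made up; this is not existing llama.cpp code): pop each buffered tensor before writing it, so the writer never holds all tensor data plus the written output in memory at once.

```python
from typing import BinaryIO, List
import numpy as np

def write_and_free(tensors: List[np.ndarray], fout: BinaryIO, align: int = 32) -> None:
    # Hypothetical helper: drop the reference to each buffered tensor as soon as
    # it has been written, so the allocator can reclaim the memory immediately.
    while tensors:
        t = tensors.pop(0)
        t.tofile(fout)
        pad = (align - t.nbytes % align) % align  # GGUF-style alignment padding
        fout.write(b"\x00" * pad)
        del t
```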
We can't just change the simple scripts to convert on-the-fly, because they load one pytorch file at a time, so making two passes over the input tensors would have a high I/O cost.
We could make LazyUnpickler part of the gguf module. That way, we can significantly reduce memory usage, and avoid the need to load the entire model into memory, while hopefully still keeping the non-LLaMA conversion scripts relatively simple.
@ggerganov what do you think?