Replies: 12 comments 55 replies
-
It actually did not; you are not seeing the actual RAM usage, because the OS now counts it as filesystem cache. Since #613 the model is a memory-mapped file. It does not do the impossible, it's just that gnome-system-monitor does not show cached files. 😄 Try …
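For anyone curious what "memory-mapped" means here, below is a minimal sketch of the general pattern (not llama.cpp's actual loader, and the file path is hypothetical). The mapped pages are backed by the kernel's page cache rather than by private process memory, which is why monitors that only count anonymous RSS make the usage look tiny:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    // Hypothetical model path; the real loader maps whatever file you pass it.
    int fd = open("ggml-model-q4_0.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st {};
    fstat(fd, &st);

    // Read-only mapping: pages live in the page cache and are only
    // faulted in when the weights are actually touched.
    void *weights = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }

    std::printf("mapped %lld bytes at %p\n", (long long) st.st_size, weights);

    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```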
-
Hi. I'm the author of #613, which is what made this improvement. I'm glad you're happy with the fact that LLaMA 30B (a 20GB file) can be evaluated with only 4GB of memory usage!

The thing that makes this possible is that we're now using mmap() to load models. This lets us load the read-only weights into memory without having to read() them or even copy them. The operating system only has to create page table entries which reserve 20GB of virtual memory addresses. Crudely speaking, mapping 20GB of RAM requires only about 40MB of page tables (20GB of 4KB pages is roughly five million page table entries at 8 bytes each).

Here's why lazy loading of memory matters. LLaMA 30B appears to be a sparse model. While there's 20GB of weights, depending on your prompt, I suppose only a small portion of that needs to be used at evaluation time. It should be possible to measure exactly how many lazy loads are happening using a tool I wrote called rusage.com, which I blogged about two weeks ago: https://justine.lol/rusage/

If I run 30B on my Intel machine:
In that run, 400k page faults happen, which means only 1.6 gigabytes (400k faults at 4KB per page) of the weights actually got loaded.

Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse. Maybe we made some kind of rare mistake where llama.cpp is somehow evaluating 30B as though it were the 7B model. Anything's possible, but I don't think it's likely. I was pretty careful in writing this change, comparing the deterministic output of the LLaMA model before and after the Git commit. I haven't, however, actually found the time to reconcile the output of LLaMA C++ with something like PyTorch. It'd be great if someone could help with that, and possibly help us understand, from more of a data science (rather than systems engineering) perspective, why 30B is sparse.

Until then, I hope you enjoy it! You can now run a bigger, badder, better model on your PC without compromising other programs. The recent change also means you can run multiple LLaMA …
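For anyone who wants to reproduce this measurement without rusage.com, here is a rough sketch (my own illustration, not rusage.com or llama.cpp code) of reading the fault counters that getrusage(2) exposes on Linux; minor faults are the lazy page-cache loads being discussed, major faults had to hit the disk:

```cpp
#include <cstdio>
#include <sys/resource.h>

int main() {
    // ... run the model evaluation here ...

    struct rusage ru {};
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }

    // ru_minflt: faults served from memory / page cache (no disk read needed).
    // ru_majflt: faults that required reading the backing file from disk.
    long faults = ru.ru_minflt + ru.ru_majflt;
    std::printf("minor faults: %ld, major faults: %ld\n", ru.ru_minflt, ru.ru_majflt);

    // With 4 KiB pages, 400k faults correspond to roughly 1.6 GB touched.
    std::printf("approx bytes touched: %ld\n", faults * 4096L);
    return 0;
}
```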
-
Wow, mmap has the side effect of not loading file data until it is used; even so, I am surprised this project previously loaded the whole model rather than only what it needed. This looks like a hint that rearranging the model in memory might potentially give great performance improvements.
-
Pretty sure it's the input embedding weights matrix that benefits a lot from lazy loading. It contributes a significant amount to the model's size. Each of the 32k tokens in the vocabulary has its own row in the matrix, but only a tiny subset of tokens is needed for each query. (Edit: I was wrong. While there is sparsity in the embeddings, they are not a significant fraction of the model's weights.)
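To illustrate the point about the embedding matrix, here is a sketch with made-up names and dimensions in the spirit of LLaMA 30B (not llama.cpp's actual code): only the rows for tokens that occur in the prompt are ever read, so with lazily mapped weights only those rows get faulted in:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical dimensions: 32k vocabulary, 6656-wide embeddings.
constexpr int kVocabSize = 32000;
constexpr int kEmbedDim  = 6656;

// 'tok_embeddings' would point into the mmap'd weights; a row is only
// paged in when this loop touches it.
std::vector<float> embed_prompt(const float *tok_embeddings,
                                const std::vector<int32_t> &prompt_tokens) {
    std::vector<float> out;
    out.reserve(prompt_tokens.size() * kEmbedDim);
    for (int32_t tok : prompt_tokens) {
        assert(tok >= 0 && tok < kVocabSize);
        const float *row = tok_embeddings + static_cast<size_t>(tok) * kEmbedDim;
        out.insert(out.end(), row, row + kEmbedDim);  // touches exactly one row per token
    }
    return out;
}
```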
-
It would be great to see the same memory-use improvement from …
-
I don't think there's any actual sparseness happening - I did a quick check with …
-
In my test using the 30B GPTQ 4-bit model under Windows with 16GB of memory, it still consumes more than 16GB... and during the very slow processing, the SSD read speed is at 100% capacity.
-
I'm becoming somewhat doubtful of how much …
-
To make this scale on clusters of low-memory hardware you need either: …
If so, you could evaluate the model in parallel for a speedup. Same loop in PicoGPT for readability. (The evaluation loop is at llama.cpp/examples/main/main.cpp, line 243 in d8d4e86.) Example of arbitrary-length Roman numeral evaluation as a prefix sum; the hardest part is usually coming up with an identity state. --update-- https://gist.github.com/chadbrewbaker/ffe95290fc945af63611693688dfe54d
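A generic illustration of the prefix-sum idea (my own sketch, unrelated to the gist above or to llama.cpp's loop): if the per-step state update is an associative operation with an identity element, a scan such as std::inclusive_scan can evaluate it in a parallel-friendly way:

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Toy associative "state": running maximum plus a count.
// Any operation used in a scan must be associative and have an identity.
struct State {
    int max_val = 0;  // identity: 0 (assuming non-negative inputs)
    int count   = 0;  // identity: 0
};

State combine(const State &a, const State &b) {
    return State{std::max(a.max_val, b.max_val), a.count + b.count};
}

int main() {
    std::vector<State> steps = {{3, 1}, {1, 1}, {7, 1}, {2, 1}};
    std::vector<State> prefix(steps.size());

    // Runs sequentially here, but because 'combine' is associative the same
    // scan can be evaluated as a parallel prefix sum across machines.
    std::inclusive_scan(steps.begin(), steps.end(), prefix.begin(), combine, State{});

    for (const auto &s : prefix)
        std::printf("max=%d count=%d\n", s.max_val, s.count);
    return 0;
}
```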
-
Please do it with the Koala LLM!
-
Does this work with the 4-bit versions? Would there be any advantage to running the 4-bit 65B model on a consumer card (I have 24GB)?
-
Hello, my apologies if this is not the right question/request, but could someone post a TL;DR?
-
(Edit: apologies, I should have clarified initially that I'm running on Linux. I didn't realize it might not be obvious from the screenshot alone for non-Linux users. All tests are done on Ubuntu-based Linux Mint 21.1.)
I've been playing only with the 30B model so far, since neither 7B nor 13B were very engaging.
As recently as yesterday the 30B model filled close to 30GB, but today's release now runs fine with less than 6GB (and that's including the system's own memory usage).
Initially I thought it must be a bug, but I couldn't notice any quality loss in the responses, and then I saw there was some major change introduced only hours ago; the fundamentals of that change are a little over my head, though.
Maybe someone smarter than me can at least roughly explain, in basic terms (if that's even possible at all), how memory usage dropped fivefold overnight?
Thanks a lot in advance.