Replies: 12 comments 55 replies
-
It actually did not; you are not seeing the actual RAM usage, because the OS now counts it as filesystem cache. Since #613 the model is a memory-mapped file. It does not do the impossible, it's just that gnome-system-monitor does not show cached files. 😄 Try …
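For anyone curious what "memory-mapped" means here, below is a minimal sketch of the general pattern (not llama.cpp's actual loader, and the file path is hypothetical). The mapped pages are backed by the kernel's page cache rather than by private process memory, which is why monitors that only count anonymous RSS make the usage look tiny:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    // Hypothetical model path; the real loader maps whatever file you pass it.
    int fd = open("ggml-model-q4_0.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st {};
    fstat(fd, &st);

    // Read-only mapping: pages live in the page cache and are only
    // faulted in when the weights are actually touched.
    void *weights = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }

    std::printf("mapped %lld bytes at %p\n", (long long) st.st_size, weights);

    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```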
-
Hi. I'm the author of #613, which is what made this improvement. I'm glad you're happy with the fact that LLaMA 30B (a 20GB file) can be evaluated with only 4GB of memory usage!

The thing that makes this possible is that we're now using mmap() to load models. This lets us load the read-only weights into memory without having to read() them or even copy them. The operating system only has to create page table entries which reserve 20GB of virtual memory addresses. Crudely speaking, mapping 20GB of RAM requires only about 40MB of page tables (20GB of 4KB pages is roughly five million page table entries at 8 bytes each).

Here's why lazy loading of memory matters. LLaMA 30B appears to be a sparse model. While there's 20GB of weights, depending on your prompt, I suppose only a small portion of that needs to be used at evaluation time. It should be possible to measure exactly how many lazy loads are happening using a tool I wrote called rusage.com, which I blogged about two weeks ago: https://justine.lol/rusage/

If I run 30B on my Intel machine:
In that run, 400k page faults happen, which means only 1.6 gigabytes (400k faults at 4KB per page) of the weights actually got loaded.

Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse. Maybe we made some kind of rare mistake where llama.cpp is somehow evaluating 30B as though it were the 7B model. Anything's possible, but I don't think it's likely. I was pretty careful in writing this change, comparing the deterministic output of the LLaMA model before and after the Git commit. I haven't, however, actually found the time to reconcile the output of LLaMA C++ with something like PyTorch. It'd be great if someone could help with that, and possibly help us understand, from more of a data science (rather than systems engineering) perspective, why 30B is sparse.

Until then, I hope you enjoy it! You can now run a bigger, badder, better model on your PC without compromising other programs. The recent change also means you can run multiple LLaMA …
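For anyone who wants to reproduce this measurement without rusage.com, here is a rough sketch (my own illustration, not rusage.com or llama.cpp code) of reading the fault counters that getrusage(2) exposes on Linux; minor faults are the lazy page-cache loads being discussed, major faults had to hit the disk:

```cpp
#include <cstdio>
#include <sys/resource.h>

int main() {
    // ... run the model evaluation here ...

    struct rusage ru {};
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }

    // ru_minflt: faults served from memory / page cache (no disk read needed).
    // ru_majflt: faults that required reading the backing file from disk.
    long faults = ru.ru_minflt + ru.ru_majflt;
    std::printf("minor faults: %ld, major faults: %ld\n", ru.ru_minflt, ru.ru_majflt);

    // With 4 KiB pages, 400k faults correspond to roughly 1.6 GB touched.
    std::printf("approx bytes touched: %ld\n", faults * 4096L);
    return 0;
}
```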
-
Wow, mmap has the side effect of not loading file data until it is used; even so, I am surprised this project previously loaded the whole model rather than only what it needed. This looks like a hint that rearranging the model in memory might potentially give great performance improvements.
-
Pretty sure it's the input embedding weights matrix that benefits a lot from lazy loading. It contributes a significant amount to the model's size. Each of the 32k tokens in the vocabulary has its own row in the matrix, but only a tiny subset of tokens is needed for each query. (Edit: I was wrong. While there is sparsity in the embeddings, they are not a significant fraction of the model's weights.)
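To illustrate the point about the embedding matrix, here is a sketch with made-up names and dimensions in the spirit of LLaMA 30B (not llama.cpp's actual code): only the rows for tokens that occur in the prompt are ever read, so with lazily mapped weights only those rows get faulted in:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical dimensions: 32k vocabulary, 6656-wide embeddings.
constexpr int kVocabSize = 32000;
constexpr int kEmbedDim  = 6656;

// 'tok_embeddings' would point into the mmap'd weights; a row is only
// paged in when this loop touches it.
std::vector<float> embed_prompt(const float *tok_embeddings,
                                const std::vector<int32_t> &prompt_tokens) {
    std::vector<float> out;
    out.reserve(prompt_tokens.size() * kEmbedDim);
    for (int32_t tok : prompt_tokens) {
        assert(tok >= 0 && tok < kVocabSize);
        const float *row = tok_embeddings + static_cast<size_t>(tok) * kEmbedDim;
        out.insert(out.end(), row, row + kEmbedDim);  // touches exactly one row per token
    }
    return out;
}
```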
-
It would be great to see the same memory-use improvement from …
-
I don't think there's any actual sparseness happening - I did a quick check with …
-
In my test using the 30B GPTQ 4-bit model under Windows with 16GB of memory, it still consumes more than 16GB... and during the very slow processing, the SSD read speed is at 100% capacity.
-
I'm becoming somewhat doubtful of how much …
-
To make this scale on clusters of low-memory hardware you need either: …
If so, you could evaluate the model in parallel for a speedup. Same loop in PicoGPT for readability. (The evaluation loop is at llama.cpp/examples/main/main.cpp, line 243 in d8d4e86.) Example of arbitrary-length Roman numeral evaluation as a prefix sum; the hardest part is usually coming up with an identity state. --update-- https://gist.github.com/chadbrewbaker/ffe95290fc945af63611693688dfe54d
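A generic illustration of the prefix-sum idea (my own sketch, unrelated to the gist above or to llama.cpp's loop): if the per-step state update is an associative operation with an identity element, a scan such as std::inclusive_scan can evaluate it in a parallel-friendly way:

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Toy associative "state": running maximum plus a count.
// Any operation used in a scan must be associative and have an identity.
struct State {
    int max_val = 0;  // identity: 0 (assuming non-negative inputs)
    int count   = 0;  // identity: 0
};

State combine(const State &a, const State &b) {
    return State{std::max(a.max_val, b.max_val), a.count + b.count};
}

int main() {
    std::vector<State> steps = {{3, 1}, {1, 1}, {7, 1}, {2, 1}};
    std::vector<State> prefix(steps.size());

    // Runs sequentially here, but because 'combine' is associative the same
    // scan can be evaluated as a parallel prefix sum across machines.
    std::inclusive_scan(steps.begin(), steps.end(), prefix.begin(), combine, State{});

    for (const auto &s : prefix)
        std::printf("max=%d count=%d\n", s.max_val, s.count);
    return 0;
}
```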
-
Please do it with the Koala LLM!
-
Does this work with the 4-bit versions? Would there be any advantage to running the 4-bit 65B model on a consumer card (I have 24GB)?
-
Hello, my apologies if this is not the right question/request, but could someone post a TL;DR?
-
(Edit: apologies, I should have clarified initially that I'm running on Linux. I didn't realize it might not be obvious from the screenshot alone for non-Linux users. All tests are done on Ubuntu-based Linux Mint 21.1.)
I've been playing only with the 30B model so far, since neither 7B nor 13B were very engaging.
As recently as yesterday the 30B model filled close to 30GB, but today's release now runs fine with less than 6GB (and that's including the system's own memory usage).
Initially I thought it must be a bug, but I couldn't notice any quality loss in the responses, and then I saw there was some major change introduced only hours ago; the fundamentals of that change are a little over my head, though.
Maybe someone smarter than me can at least roughly explain, in basic terms (if that's even possible at all), how memory usage dropped fivefold overnight?
Thanks a lot in advance.