
Question about the external fragmentation in a long running process #632

romange opened this issue Oct 8, 2022 · 5 comments
romange commented Oct 8, 2022

Hello Daan,

The project I am working on uses mimalloc underneath.

A user reported the following problem:
Dragonfly reports 80 GiB of used memory (tracked by aggregating mi_usable_size), but the mimalloc stats show 90 GiB committed. He also says that the backend gains roughly 0.5% of committed memory each day while used memory does not change. Assuming my "used" accounting is correct, this means the backend does not leak (otherwise "used" would grow as well). Dragonfly has 16 threads.

mimalloc stats output:

heap stats:    peak      total      freed    current       unit      count   
  reserved:   90.1 GiB   90.1 GiB      0       90.1 GiB                        not all freed!
 committed:   90.0 GiB   90.4 GiB  412.0 MiB   90.0 GiB                        not all freed!
   touched:  180.1 MiB  180.1 MiB   53.1 MiB  126.9 MiB                        not all freed!
  segments:    1.4 Ki     1.4 Ki       0        1.4 Ki                         not all freed!
     pages:  834.0 Ki   834.2 Ki     288      833.9 Ki                         not all freed!
   commits:    508    
   threads:     16         16          0         16                            not all freed!
  searches:     0.0 avg
numa nodes:       1
   elapsed:  111589.444 s
   process: user: 227549.812 s, system: 255655.955 s, faults: 0, rss: 89.2 GiB, commit: 90.0 GiB
  1. Having a 12.5% increase of "committed" over "used" seems normal to me. What ratios do you observe for long-running processes?
  2. I obtained a sample of area sizes (via mi_heap_visit_blocks(.., false, ...)) from a single heap, i.e. from one thread out of 16. That heap has 5.9 GiB of committed memory and 5 GiB used. Looking at the unused memory, I see lots of areas that have unused blocks. See the chart below. Why does this happen? Specifically, can mi_malloc create a new area of block size 80 if another one still has unused memory?
     "Unused" in this context means committed - used.

I am also attaching the detailed table of areas from which the chart was created. The first column in the table is the number of occurrences of triple <reserved, committed, used>.
[chart: unused memory vs block size]
[attachment: sorted-malloc.txt]

daanx commented Nov 22, 2022

Thanks @romange for the detailed feedback (and apologies for the late reply).

  1. 12.5% increase of committed over used is excellent (!) :-) For a long running process there will always be fragmentation and thus more committed than actually used.

  2. It should be the case that, per thread, we never allocate a fresh (mimalloc) page while there is still a page with available blocks of the right size (see page.c:mi_page_queue_find_free_ex). However, we slowly extend the "top" of the page (page->capacity vs page->reserved items), touching the memory of a page in roughly 4 KiB increments -- so a partially used page only increases the RSS by its touched part.

  3. We do commit the memory though per mimalloc page at once so that may affect the measurements. We could also commit inside a page on-demand. I tried this once and it helps but is also relatively expensive ...

So, for your measurements it is best to distinguish between the unused but untouched parts (the difference between the capacity and total reserved space), and the unused but touched part (the blocks in the free lists).

I suspect the graph with the "unused but untouched" should have much less unused memory? (let me know :-) )

romange commented Nov 23, 2022

Hi @daanx, thanks for following up on my questions. Meanwhile, I've been studying the domain myself.
Unfortunately, as time passes the difference between committed and used memory in the process grows well beyond 40-50%, so I need to solve the issue. Below is another graph that shows the distribution of (committed - used) sizes for different block sizes.

[graph: distribution of (committed - used) by block size]

As you said yourself in #645, any improvements on the allocator side will be incremental (i.e. they can reduce the probability or the severity of fragmentation), because nothing can be done about an under-utilized but non-empty page: the allocator cannot move its contents somewhere else. Therefore, it seems that the only solution (if someone really needs to solve the external fragmentation issue) is app-aware defragmentation. This is what I described in dragonflydb/dragonfly#448

Redis, btw, implemented something similar with jemalloc. They call it "active-defrag". They introduced a function that tells how much a page is utilized: redis/redis@e8099ca and use it to move memory around.

We are taking a similar approach with mimalloc. This is the function that I believe will help us to determine whether we need to realloc the ptr that belongs to the underutilized page: https://github.com/romange/helio/pull/27/files#diff-a9d9d3912eb3a1eeced1a0726da5a12f80f764b999f683c40d96a45c8c31884b

What do you think about it?

daanx commented Nov 25, 2022

Ah, I see -- I agree that an allocator cannot generally "solve" the fragmentation problem since we cannot move objects. I understand now that you would like to give the application itself the opportunity to reallocate objects that sit in a fragmented page.

  • Best is to avoid the fragmentation you see in the first place: do you know the root cause of the application allocating many objects of some size but only keeping few live after a while? If this is somewhat isolated, you could consider using an extra mi_heap and allocate most temporary objects in that heap so they won't fragment the long lived ones.
  • I looked a bit at your "is page underutilized" function: the measure is right and it seems thread-safe as well -- just don't access the xheap pointer, since if it belongs to another thread, that thread may be terminating and deallocating its heap structure. You don't need the page queue, right? Checking page->prev != NULL is enough to know it is not the first page in a queue.
  • I wonder if mimalloc could do more as well: we could visit all pages and reduce the capacity back to the top-most object that is still live -- that depends a bit on luck, but it might on average reduce the used-vs-committed gap by 50% in your graphs. Of course, this would be somewhat expensive and cannot be done automatically (it would need a mi_compact call or something).

Very interesting to work on this

romange commented Dec 17, 2022

@daanx We ended up patching mimalloc with the following patch: https://github.com/romange/helio/blob/master/patches/mimalloc-v2.0.5.patch

It seems to be doing its job: it reduced the committed/used ratio from 1.25 to 1.08.

@anthonyalayo

> Best is to avoid the fragmentation you see in the first place: do you know the root cause of the application allocating many objects of some size but only keeping few live after a while? If this is somewhat isolated, you could consider using an extra mi_heap and allocate most temporary objects in that heap so they won't fragment the long lived ones.

@daanx I am hitting the same issue as the original poster, and I tried to use an extra mi_heap like you noted. However, the next issue I hit was #720, where I can't actually use it because I need it in a multi-threaded environment.
