
LLP: mmap labels instead of loading them in memory #117

Merged: 1 commit, Dec 16, 2024

Conversation

@progval (Contributor) commented Dec 16, 2024

LLP on SWH's graph now needs more than 3 TB, which makes it crash at this step when anything else that needs significant memory is running on our 4 TB machine, like a previous version of the graph.

I have not benchmarked this yet.

@vigna (Owner) commented Dec 16, 2024

At that point in the code, we allocate result_labels, and then in the loop we in turn allocate two other arrays. On the other hand, if NLL works as expected, the label store has been dropped, freeing two arrays. So we have one more array, which might explain why the problem appears there.

On the other hand, we are not dropping the graph, which would be the obvious thing to do, and which would help if the graph is not memory-mapped (I know it is in the CLI tool, though). So I pushed a small commit that makes the argument either a graph or a reference to a graph, and in the CLI we pass the graph.

Just to understand whether the memory usage is sensible, how many nodes are we talking about here?

@vigna (Owner) commented Dec 16, 2024

Incidentally, with more knowledge of Rust I'm unhappy with the fact that we save arrays as Vec<T> instead of Box<[T]>. The capacity is there for nothing, as everything is immutable.

@zacchiro (Collaborator) commented

Just to understand whether the memory usage is sensible, how many nodes are we talking about here?

Quoting @progval from elsewhere:

44.5B nodes, 769B edges

@vigna (Owner) commented Dec 16, 2024

So it's 400GB per array (gulp). Still, 4 arrays is 1.6 TB. Unless Rust is doing something weird, there's nothing else in memory at that point. Where does the 3TB come from?

@progval (Contributor, author) commented Dec 16, 2024

I don't know.

LLP logged this:

[2024-12-12T23:17:53Z INFO  webgraph::algo::llp] 44,572,995,153 nodes, 13m 9s, 56421974.87 nodes/s, 17.72 ns/node; 100.00% done, 1ms to end [1972503.27 nodes/s, 506.97 ns/node]
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Completed.
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Elapsed: 13m 13s [44,573,066,306 nodes, 56206034.65 nodes/s, 17.79 ns/node]
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Log-gap cost: 2770667245468
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] 4 gammas, 1d 5h 48m 11s, 3.22 gammas/d, 7.45 h/gamma; 100.00% done, 0ms to end; res/vir/avail/free/total mem 2.11TB/2.12TB/1.25TB/1.12TB/4.33TB
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Completed.
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Elapsed: 1d 5h 48m 11s [4 gammas, 3.22 gammas/d, 7.45 h/gamma]; res/vir/avail/free/total mem 2.11TB/2.12TB/1.25TB/1.12TB/4.33TB
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Best gamma: 0.0625	with log-gap cost 2412757022861
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Worst gamma: 0.5	with log-gap cost 2770667245468
[2024-12-12T23:32:16Z INFO  webgraph::algo::llp] Starting step 0...

and then the machine's memory use kept growing, up to 2.91TB, where LLP crashed and total memory use fell to 20GB, as you can see here: https://grafana.softwareheritage.org/goto/qRAeEdSNz?orgId=1

@vigna (Owner) commented Dec 16, 2024

I'm trying to understand where the memory occupancy comes from. The algorithm currently uses 33 bytes per node, so that accounts for roughly 1.5 TB. There's a missing half terabyte. Unfortunately, I don't remember whether sysinfo includes memory mappings in the memory footprint (which is printed by the progress logger). I remember it gives different results on Linux and macOS.

The other suspicious thing is that the label store (which accounts for 16 of those bytes) should be automatically dropped before the "Elapsed: 1d 5h 48m 11s" line. If that is happening, the 2.1 TB looks weird; if it is not happening, there's something going on. I tried to add a manual drop (on macOS) and the occupancy does not change.

So my guess is that the memory includes the graph. How large is the graph + ef + dcf?

Do you set RUST_MIN_STACK? We allocate one such stack for each thread, and if it's big, and there are many threads, that might become a problem.

vigna merged commit 8398846 into vigna:main on Dec 16, 2024 (2 of 3 checks passed)
@progval (Contributor, author) commented Dec 16, 2024

So my guess is that the memory includes the graph. How large is the graph + ef + dcf?

-rw-r--r-- 1 vlorentz vlorentz  40G Dec 11 17:05 graph-bfs-simplified.dcf
-rw-r--r-- 1 vlorentz vlorentz  50G Dec 11 15:10 graph-bfs-simplified.ef
-rw-r--r-- 1 vlorentz vlorentz 830G Dec 11 14:58 graph-bfs-simplified.graph
-rw-r--r-- 1 vlorentz vlorentz  53G Dec 11 14:58 graph-bfs-simplified.offsets

Do you set RUST_MIN_STACK? We allocate one such stack for each thread, and if it's big, and there are many threads, that might become a problem.

8MB × ~250 threads, so that's not it

@vigna (Owner) commented Dec 16, 2024

Mmmhhh. No. That would mean 2.5 TB, not 2.1.

Another weird thing is

[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] 4 gammas, 1d 5h 48m 11s, 3.22 gammas/d, 7.45 h/gamma; 100.00% done, 0ms to end; res/vir/avail/free/total mem 2.11TB/2.12TB/1.25TB/1.12TB/4.33TB
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Completed.
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Elapsed: 1d 5h 48m 11s [4 gammas, 3.22 gammas/d, 7.45 h/gamma]; res/vir/avail/free/total mem 2.11TB/2.12TB/1.25TB/1.12TB/4.33TB

So, in theory, between the first and the last line the label store stops being used, and NLL should deallocate it. But the memory is the same.

@vigna (Owner) commented Dec 16, 2024

Sorry, my bad, it's 25 bytes per node. I don't know how I computed 8 × 3 = 32, which of course is not true.

So the arrays are 1.25 TB, the graph 800 GB, and the rest of the structures 150 GB. I think we have our 2.11 TB occupancy.

With the additional array, we'll get to 1.65 TB. So the thing is crashing for 1.65 TB—the rest is memory-mapped.

However, if the label store is not dropped, we have an additional 800 GB which explains the 2.91 TB occupancy.

So the next question is: why has it not been dropped?

@vigna (Owner) commented Dec 16, 2024

Note: memory usage does not go down even with an explicit drop after the main loop.

@progval (Contributor, author) commented Dec 16, 2024

Could it be the memory allocator keeping it in its pool?

@vigna (Owner) commented Dec 17, 2024

Well, then the promise of Rust is quite bogus. That is behavior comparable to a garbage collector's.

It could be a delay in sysinfo's detection of the occupied memory, but there is a refresh call in ProgressLogger::done, and your graph shows no reduction in memory occupancy.

I'll do more tests during the week; these two days I have my last 8 hours of teaching. But, definitely, this thing shouldn't crash because it uses too much memory, unless other processes are using > 2.5 TB of memory.

@progval (Contributor, author) commented Dec 17, 2024

It is a behavior comparable to a garbage collector.

Not exactly; a memory allocator does not need to be aware of references between objects.


Also, I just freed 400 GB on our machine (by removing an old graph from tmpfs), and LLP passed this time.

Logs of the end:

[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Elapsed: 13m 18s [44,573,066,306 nodes, 55805766.87 nodes/s, 17.92 ns/node]
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Log-gap cost: 2773287290429
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] 4 gammas, 1d 4h 41m 13s, 3.35 gammas/d, 7.17 h/gamma; 100.00% done, 0ms to end; res/vir/avail/free/total mem 2.11TB/2.12TB/129.74GB/136.82GB/4.33TB
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Completed.
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Elapsed: 1d 4h 41m 13s [4 gammas, 3.35 gammas/d, 7.17 h/gamma]; res/vir/avail/free/total mem 2.11TB/2.12TB/129.74GB/136.82GB/4.33TB
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Best gamma: 0.0625	with log-gap cost 2413493439855
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Worst gamma: 0.5	with log-gap cost 2773287290429
[2024-12-17T02:32:43Z INFO  webgraph::algo::llp] Starting step 0...
[2024-12-17T03:15:34Z INFO  webgraph::algo::llp] Number of labels: 25738043939
[2024-12-17T03:15:34Z INFO  webgraph::algo::llp] Finished step 0.
[2024-12-17T03:15:53Z INFO  webgraph::algo::llp] Starting step 1...
[2024-12-17T03:58:39Z INFO  webgraph::algo::llp] Number of labels: 28963610559
[2024-12-17T03:58:39Z INFO  webgraph::algo::llp] Finished step 1.
[2024-12-17T03:58:57Z INFO  webgraph::algo::llp] Starting step 2...
[2024-12-17T04:44:41Z INFO  webgraph::algo::llp] Number of labels: 30414506967
[2024-12-17T04:44:41Z INFO  webgraph::algo::llp] Finished step 2.
[2024-12-17T04:45:11Z INFO  webgraph::algo::llp] Starting step 3...
[2024-12-17T05:27:03Z INFO  webgraph::algo::llp] Number of labels: 30414506967
[2024-12-17T05:27:03Z INFO  webgraph::algo::llp] Finished step 3.
[2024-12-17T05:34:25Z INFO  webgraph::cli::run::llp] Elapsed: 116825.486112324
[2024-12-17T05:34:25Z INFO  webgraph::cli::run::llp] Saving permutation...
[2024-12-17T05:42:45Z INFO  webgraph::cli::run::llp] Completed in 117326.244502336 seconds

graph of memory usage: https://grafana.softwareheritage.org/goto/fSxfx5IHz?orgId=1
