
LLP: mmap labels instead of loading them in memory #117

Merged: 1 commit, Dec 16, 2024

Conversation

@progval (Contributor) commented Dec 16, 2024

LLP on SWH's graph now needs more than 3 TB, which makes it crash at this step when anything else that needs significant memory is running on our 4 TB machine, like a previous version of the graph.

I have not benchmarked this yet.

@vigna (Owner) commented Dec 16, 2024

At that point in the code, we allocate result_labels, and then in the loop we in turn allocate two other arrays. On the other hand, if NLL works as expected, the label store has been dropped, freeing two arrays. So we have one more array, which might explain why the problem appears there.

On the other hand, we are not dropping the graph, which would be the obvious thing to do, and which would help if the graph is not memory-mapped (I know it is in the CLI tool, though). So I pushed a small commit that makes the argument either a graph or a reference to a graph, and in the CLI we pass the graph.

Just to understand whether the memory usage is sensible, how many nodes are we talking about here?

@vigna (Owner) commented Dec 16, 2024

Incidentally, with more knowledge of Rust I'm unhappy with the fact that we save arrays as Vec<T> instead of Box<[T]>. The capacity is there for nothing, as everything is immutable.

@zacchiro (Collaborator) commented

Just to understand whether the memory usage is sensible, how many nodes are we talking about here?

Quoting @progval from elsewhere:

44.5B nodes, 769B edges

@vigna (Owner) commented Dec 16, 2024

So it's 400GB per array (gulp). Still, 4 arrays is 1.6 TB. Unless Rust is doing something weird, there's nothing else in memory at that point. Where does the 3TB come from?

@progval (Contributor, author) commented Dec 16, 2024

I don't know.

LLP logged this:

[2024-12-12T23:17:53Z INFO  webgraph::algo::llp] 44,572,995,153 nodes, 13m 9s, 56421974.87 nodes/s, 17.72 ns/node; 100.00% done, 1ms to end [1972503.27 nodes/s, 506.97 ns/node]
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Completed.
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Elapsed: 13m 13s [44,573,066,306 nodes, 56206034.65 nodes/s, 17.79 ns/node]
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Log-gap cost: 2770667245468
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] 4 gammas, 1d 5h 48m 11s, 3.22 gammas/d, 7.45 h/gamma; 100.00% done, 0ms to end; res/vir/avail/free/total mem 2.11TB/2.12TB/1.25TB/1.12TB/4.33TB
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Completed.
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Elapsed: 1d 5h 48m 11s [4 gammas, 3.22 gammas/d, 7.45 h/gamma]; res/vir/avail/free/total mem 2.11TB/2.12TB/1.25TB/1.12TB/4.33TB
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Best gamma: 0.0625	with log-gap cost 2412757022861
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Worst gamma: 0.5	with log-gap cost 2770667245468
[2024-12-12T23:32:16Z INFO  webgraph::algo::llp] Starting step 0...

and then the machine's memory use kept growing, up to 2.91TB, where LLP crashed and total memory use fell to 20GB, as you can see here: https://grafana.softwareheritage.org/goto/qRAeEdSNz?orgId=1

@vigna (Owner) commented Dec 16, 2024

I'm trying to understand where the memory occupancy comes from. The algorithm currently uses 33 bytes per node, so that accounts for roughly 1.5 TB. There's a missing half terabyte. Unfortunately, I don't remember whether sysinfo includes memory mappings in the memory footprint (which is printed by the progress logger). I remember it gives different results on Linux and macOS.

The other suspicious thing is that the label store (which accounts for 16 of those bytes) should be automatically dropped before the "Elapsed: 1d 5h 48m 11s" line. If that is happening, the 2.1 TB looks weird; if it is not happening, there's something going on. I tried to add a manual drop (on macOS) and the occupancy does not change.

So my guess is that the memory includes the graph. How large is the graph + ef + dcf?

Do you set RUST_MIN_STACK? We allocate one such stack for each thread, and if it's big, and there are many threads, that might become a problem.

vigna merged commit 8398846 into vigna:main on Dec 16, 2024 (2 of 3 checks passed)
@progval (Contributor, author) commented Dec 16, 2024

So my guess is that the memory includes the graph. How large is the graph + ef + dcf?

-rw-r--r-- 1 vlorentz vlorentz  40G Dec 11 17:05 graph-bfs-simplified.dcf
-rw-r--r-- 1 vlorentz vlorentz  50G Dec 11 15:10 graph-bfs-simplified.ef
-rw-r--r-- 1 vlorentz vlorentz 830G Dec 11 14:58 graph-bfs-simplified.graph
-rw-r--r-- 1 vlorentz vlorentz  53G Dec 11 14:58 graph-bfs-simplified.offsets

Do you set RUST_MIN_STACK? We allocate one such stack for each thread, and if it's big, and there are many threads, that might become a problem.

8MB × ~250 threads, so that's not it

@vigna (Owner) commented Dec 16, 2024

Mmmhhh. No. That would mean 2.5 TB, not 2.1.

Another weird thing is

[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] 4 gammas, 1d 5h 48m 11s, 3.22 gammas/d, 7.45 h/gamma; 100.00% done, 0ms to end; res/vir/avail/free/total mem 2.11TB/2.12TB/1.25TB/1.12TB/4.33TB
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Completed.
[2024-12-12T23:17:56Z INFO  webgraph::algo::llp] Elapsed: 1d 5h 48m 11s [4 gammas, 3.22 gammas/d, 7.45 h/gamma]; res/vir/avail/free/total mem 2.11TB/2.12TB/1.25TB/1.12TB/4.33TB

So, in theory, between the first and the last line the label store stops being used, and NLL should deallocate it. But the memory is the same.

@vigna (Owner) commented Dec 16, 2024

Sorry, my bad, it's 25 bytes per node. I don't know how I computed 8 × 3 = 32, which of course is not true.

So the arrays are 1.25 TB, the graph 800 GB, and the rest of the structures 150 GB. I think we have our 2.11 TB occupancy.

With the additional array, we'll get to 1.65 TB. So the thing is crashing for 1.65 TB—the rest is memory-mapped.

However, if the label store is not dropped, we have an additional 800 GB which explains the 2.91 TB occupancy.

So the next question is: why has it not been dropped?

@vigna (Owner) commented Dec 16, 2024

Note: memory usage does not go down even with an explicit drop after the main loop.

@progval (Contributor, author) commented Dec 16, 2024

Could it be the memory allocator keeping it in its pool?

@vigna (Owner) commented Dec 17, 2024

Well, then the promise of Rust is quite bogus. That is behavior comparable to a garbage collector's.

It could be a delay in sysinfo's detection of the occupied memory, but there is a refresh call in ProgressLogger::done, and your graph shows no reduction in memory occupancy.

I'll do more tests during the week; these two days I have my last 8 hours of teaching. But, definitely, this thing shouldn't crash because it uses too much memory, unless other processes are using > 2.5 TB of memory.

@progval (Contributor, author) commented Dec 17, 2024

It is a behavior comparable to a garbage collector.

Not exactly; a memory allocator does not need to be aware of references between objects.


Also, I just freed 400 GB on our machine (by removing an old graph from tmpfs), and LLP passed this time.

Logs of the end:

[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Elapsed: 13m 18s [44,573,066,306 nodes, 55805766.87 nodes/s, 17.92 ns/node]
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Log-gap cost: 2773287290429
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] 4 gammas, 1d 4h 41m 13s, 3.35 gammas/d, 7.17 h/gamma; 100.00% done, 0ms to end; res/vir/avail/free/total mem 2.11TB/2.12TB/129.74GB/136.82GB/4.33TB
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Completed.
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Elapsed: 1d 4h 41m 13s [4 gammas, 3.35 gammas/d, 7.17 h/gamma]; res/vir/avail/free/total mem 2.11TB/2.12TB/129.74GB/136.82GB/4.33TB
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Best gamma: 0.0625	with log-gap cost 2413493439855
[2024-12-17T02:19:55Z INFO  webgraph::algo::llp] Worst gamma: 0.5	with log-gap cost 2773287290429
[2024-12-17T02:32:43Z INFO  webgraph::algo::llp] Starting step 0...
[2024-12-17T03:15:34Z INFO  webgraph::algo::llp] Number of labels: 25738043939
[2024-12-17T03:15:34Z INFO  webgraph::algo::llp] Finished step 0.
[2024-12-17T03:15:53Z INFO  webgraph::algo::llp] Starting step 1...
[2024-12-17T03:58:39Z INFO  webgraph::algo::llp] Number of labels: 28963610559
[2024-12-17T03:58:39Z INFO  webgraph::algo::llp] Finished step 1.
[2024-12-17T03:58:57Z INFO  webgraph::algo::llp] Starting step 2...
[2024-12-17T04:44:41Z INFO  webgraph::algo::llp] Number of labels: 30414506967
[2024-12-17T04:44:41Z INFO  webgraph::algo::llp] Finished step 2.
[2024-12-17T04:45:11Z INFO  webgraph::algo::llp] Starting step 3...
[2024-12-17T05:27:03Z INFO  webgraph::algo::llp] Number of labels: 30414506967
[2024-12-17T05:27:03Z INFO  webgraph::algo::llp] Finished step 3.
[2024-12-17T05:34:25Z INFO  webgraph::cli::run::llp] Elapsed: 116825.486112324
[2024-12-17T05:34:25Z INFO  webgraph::cli::run::llp] Saving permutation...
[2024-12-17T05:42:45Z INFO  webgraph::cli::run::llp] Completed in 117326.244502336 seconds

graph of memory usage: https://grafana.softwareheritage.org/goto/fSxfx5IHz?orgId=1
