
Unreasonably high memory consumption during restore #1067

Closed
buengese opened this issue Feb 17, 2024 · 12 comments · Fixed by rustic-rs/rustic_core#165
Labels
A-commands (Area: Related to commands in `rustic`), A-core (Area: Generally related to `rustic_core`), C-bug (Category: Something isn't working as expected)

Comments

@buengese

buengese commented Feb 17, 2024

I've recently tried out rustic on one of my existing repositories and noticed unreasonably high memory usage compared to restic: rustic was killed by the OOM killer after consuming more than 12 GB of RAM, while the original restic uses less than 2 GB of RAM during the entire restore.

The command being run is: rustic -r /path/to/repo restore 8c15dc1a .
Rustic version: 0.7.0

The repo in question is around 4.5 TB in size and contains 14991 files.

Edit: I'm just noticing that this might be more a rustic_core kind of issue...

@github-actions github-actions bot added the S-triage Status: Waiting for a maintainer to triage this issue/PR label Feb 17, 2024
@aawsome
Member

aawsome commented Feb 17, 2024

Thanks for opening this issue @buengese
Actually, this is unexpected. The only reason I can imagine is that it is related to the local scan rustic does to determine which already-present data can be re-used.
Can you give the output of the rustic restore run until it aborts?

@buengese
Author

buengese commented Feb 17, 2024

Here is the basic output:

root@vm-nas:/mnt/media-ingress/restore# rustic --log-level=info --log-file rustic-restore.log -r <redacted> restore 8c15dc1a .
enter repository password: [hidden]
[00:00:00] reading index...               ████████████████████████████████████████       1503/1503
[00:00:00] collecting file information... ⠁
[WARN] Note: additional entries exist in destination
[00:00:23] collecting file information...
Files:  15011 to restore, 0 unchanged, 0 verified, 0 to modify, 1 additional
Dirs:   991 to restore, 15 to modify, 0 additional
[00:01:39] restoring file contents...     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 998.72 MiB/4.47 TiB   21.10 MiB/s  (ETA 5d)
Getötet ("Killed")

Additionally, this is from the log file at info logging level, though it doesn't seem to contain any particularly useful info.

19:22:53 [INFO] using no config file, none of these exist: /root/.config/rustic/rustic.toml, /etc/rustic/rustic.toml, ./rustic.toml
19:23:00 [INFO] repository local:<redacted>: password is correct.
19:23:01 [INFO] using cache at /root/.cache/rustic/361eb9d4ab59bbaac47481b8f351c4f681eac44c8e4887c567bd58c15395c3d6
19:23:02 [INFO] getting snapshot...
19:23:26 [WARN] Note: additional entries exist in destination
19:23:26 [INFO] total restore size: 4.5 TiB

OOM message from the Kernel log:

[18949.707360] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0-1,oom_memcg=/lxc/1003,task_memcg=/lxc/1003/ns/user.slice/user-0.slice/session-50.scope,task=rustic,pid=65670,uid=0
[18949.708190] Memory cgroup out of memory: Killed process 65670 (rustic) total-vm:14989124kB, anon-rss:11756028kB, file-rss:1920kB, shmem-rss:0kB, UID:0 pgtables:23448kB oom_score_adj:0

Worth noting that this is an LXC container, so cgroups are involved. While I don't think that should cause any weird behaviour, it is something worth keeping in mind.

@aawsome
Member

aawsome commented Feb 17, 2024

Ok, trying to investigate what could use that much memory during the actual restore... Can you please give some further information:

  • Output of rustic repoinfo (to see pack size statistics)
  • Output of /bin/time rustic restore -n or something similar to get the memory usage of a dry-run restore

Thanks a lot!

@buengese
Author

buengese commented Feb 17, 2024

Sure.
Something I only just noticed: the repo is using a non-standard pack size of 128M.

Here is the repoinfo output

| File type | Count | Total Size |
|-----------|-------|------------|
| Key       |     1 |      449 B |
| Snapshot  |     1 |      234 B |
| Index     |  1503 |  372.5 MiB |
| Pack      |  9013 |    4.4 TiB |
| Total     | 10518 |    4.4 TiB |


| Blob type |   Count | Total Size | Total Size in Packs |
|-----------|---------|------------|---------------------|
| Tree      |    1007 |  209.4 MiB |           209.4 MiB |
| Data      | 3157727 |    4.4 TiB |             4.4 TiB |
| Total     | 3158734 |    4.4 TiB |             4.4 TiB |

| Blob type  | Pack Count | Minimum Size | Maximum Size |
|------------|------------|--------------|--------------|
| Tree packs |          2 |     93.0 MiB |    116.5 MiB |
| Data packs |       9011 |    472.4 MiB |    519.8 MiB |

Time output looks like this:

167.71user 8.23system 2:57.78elapsed 98%CPU (0avgtext+0avgdata 1138612maxresident)k
37680889inputs+0outputs (2149major+647536minor)pagefaults 0swaps

The dry run behaved just fine and peaked at only a little over 1.2 GB. So I've gone a little further and thrown a profiler at it during a real restore. I'm sure there is a more Rust-native tool than heaptrack, but I'm not that familiar with the Rust ecosystem. This is where all the allocations seem to be coming from:

[Screenshot: heaptrack allocation profile, 2024-02-17 21:34]

@aawsome
Member

aawsome commented Feb 17, 2024

Actually you have a (data) pack size of around 500 MiB - the rustic "standard" would be around 100 MiB for your repository size.

Thanks for the memory profiling. Yes, it seems the following is happening:

  • Your restore needs most of the data from the repository
  • This means you basically need the full content of each pack file
  • which implies that restore reads the complete pack file into memory before processing it
  • And - this is not yet optimized in any way - rustic uses a concurrency of 20 for restoring, which means 20 pack files are kept in memory in parallel.
  • This gives the 10 GiB memory usage plus the 1.2 GiB rustic needs for index data and the "restore plan". There is some other overhead, so arriving at around 12 GiB sounds plausible.
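
For reference, a quick back-of-the-envelope check of these numbers; a minimal sketch with values assumed from this thread (20 reader threads, ~500 MiB packs, ~1.2 GiB measured by the dry run), not actual rustic code:

```rust
// Rough memory estimate for the current restore implementation, using values
// from this thread: 20 reader threads, ~500 MiB data packs (see the repoinfo
// output above) and ~1.2 GiB for index data plus the restore plan (dry run).
fn main() {
    let reader_threads = 20.0_f64;
    let pack_size_mib = 500.0_f64;
    let index_and_plan_mib = 1200.0_f64;

    let pack_buffers_gib = reader_threads * pack_size_mib / 1024.0;
    let total_gib = pack_buffers_gib + index_and_plan_mib / 1024.0;

    println!("pack buffers in memory: ~{pack_buffers_gib:.1} GiB"); // ~9.8 GiB
    println!("estimated total:        ~{total_gib:.1} GiB");        // ~10.9 GiB, plus other overhead
}
```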

IIRC restic uses a streaming approach, which of course doesn't need the pack-sized buffers in memory.

I'll think about what we can improve here. But this might need an enhancement to how rustic talks to the backend and may also need to wait until we tackle concurrency the right way...

In the meantime, I can only suggest decreasing the restore concurrency by self-compiling rustic after setting https://github.com/rustic-rs/rustic_core/blob/5ccf12c71fd6730e78f2679ea18c0e5be7cd84e4/crates/core/src/commands/restore.rs#L35 to a lower value. Sorry.

@aawsome
Member

aawsome commented Feb 17, 2024

An alternative would be to use lower pack sizes (but I don't know if you want to do this).
To let rustic do the repacking, the following steps are needed:

  • use rustic config to set the relevant data pack size settings. Don't forget to specify --set-max-packsize-tolerate-percent so rustic doesn't tolerate overly large pack files
  • run rustic prune - it should repack all packs that are too large.

@buengese
Author

buengese commented Feb 17, 2024

Yeah, that makes perfect sense to me. Thank you for looking into this. That being said, I think a concurrency of 20 might just be too high a default given the current approach of reading all data into memory. I know the default pack size for both restic and rustic is much lower, but there are reasons (specifically, limits on the maximum number of files per directory imposed by a number of cloud storage providers) to increase it. Maybe it would be possible to limit concurrency based on pack size for now.
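
A minimal sketch of what such a pack-size-aware limit could look like; purely illustrative (the memory budget, function name and numbers are assumptions, not rustic's actual policy):

```rust
// Derive the restore reader concurrency from a memory budget and the repository's
// pack size instead of using a fixed constant. Illustrative only; not rustic's API.
fn readers_for(memory_budget_mib: u64, pack_size_mib: u64, max_readers: u64) -> u64 {
    (memory_budget_mib / pack_size_mib).clamp(1, max_readers)
}

fn main() {
    // A ~2 GiB budget with ~500 MiB packs allows 4 concurrent readers...
    println!("{}", readers_for(2048, 500, 20)); // 4
    // ...while with smaller packs (e.g. ~32 MiB) the existing cap of 20 still applies.
    println!("{}", readers_for(2048, 32, 20)); // 20
}
```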

@aawsome
Member

aawsome commented Feb 17, 2024

Actually, we could limit the amount of data which is read from the backend at once, e.g. to something like 32 MiB or 64 MiB. That would not decrease performance (IMO) and would keep memory usage for pack file data below MAX_READER_THREADS_NUM * LIMIT.
And it would keep memory usage independent from the pack size.
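
A minimal sketch of that idea, assuming a plain local file and ignoring blob boundaries; the names, the example path and the 32 MiB limit are illustrative, not rustic_core's backend API:

```rust
use std::fs::File;
use std::io::{self, Read};

/// Process a pack file in fixed-size chunks instead of loading it whole, so the
/// peak memory per reader stays around `limit` bytes regardless of pack size.
/// Simplified sketch: real code would have to respect blob boundaries and would
/// read from the backend, not necessarily from a local file.
fn process_in_chunks(path: &str, limit: usize, mut handle: impl FnMut(&[u8])) -> io::Result<()> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; limit];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // end of pack file
        }
        handle(&buf[..n]); // decrypt/verify/write the data contained in this chunk
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // 32 MiB read limit; with 20 readers that bounds pack buffers to ~640 MiB total.
    process_in_chunks("example.pack", 32 * 1024 * 1024, |chunk| {
        println!("got {} bytes", chunk.len());
    })
}
```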

@aawsome
Member

aawsome commented Feb 17, 2024

Another optimization would be to remove index entries from memory just before starting the actual restore (all information needed for the restore is already contained in the restore plan). That would, in your case, free most of the 1.2 GiB during the actual restore. Combined with a read limit as described above, I think memory usage would drop below 1.5 GiB.
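
A minimal sketch of that second idea; the types and functions are placeholders rather than rustic_core's actual structures, and the whole index is dropped here for simplicity, whereas the real change would only drop the data entries:

```rust
// Placeholder types standing in for rustic_core's in-memory index and restore plan.
struct Index;       // blob -> pack location entries; can be GiB-sized for large repos
struct RestorePlan; // everything the restore phase needs, derived from the index

fn build_plan(_index: &Index) -> RestorePlan {
    RestorePlan
}

fn execute_plan(_plan: &RestorePlan) {
    // long-running phase: restore file contents
}

fn restore(index: Index) {
    let plan = build_plan(&index);
    drop(index); // the index is no longer needed; free its memory before restoring
    execute_plan(&plan);
}

fn main() {
    restore(Index);
}
```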

@buengese
Author

That does sound like an even better plan.

@aawsome
Member

aawsome commented Feb 17, 2024

@buengese Can you try whether rustic-rs/rustic_core#165 works for you? The memory usage with that PR should be roughly 2 GiB in your case.
To try it out, you have to self-compile rustic-rs using rustic_core from the PR.

@aawsome aawsome added C-bug Category: Something isn't working as expected A-commands Area: Related to commands in `rustic` A-core Area: Generally related to `rustic_core` and removed S-triage Status: Waiting for a maintainer to triage this issue/PR labels Feb 17, 2024
aawsome added a commit to rustic-rs/rustic_core that referenced this issue Feb 21, 2024
@buengese
Author

Sorry for not answering here earlier, but I can confirm that this works.

github-merge-queue bot pushed a commit that referenced this issue Sep 23, 2024
reduce memory usage of the `restore` command by removing the data
entries from the index before doing the actual restore.

see #1067

TODO:
- [x] remove Cargo changes once
rustic-rs/rustic_core#166 is merged