Further memory investigation #287

Closed
martinsumner opened this issue Jul 12, 2019 · 8 comments

@martinsumner
Owner

Had a Riak instance in production where there were a lot of leveled_cdb processes, each referencing a large amount of binary memory. This could be cleared by calling garbage_collect() explicitly, but garbage collection didn't otherwise seem to be attempting this.

The binaries referenced may have been in the active journal. This was after a Riak restart, so perhaps it is related to the scan on startup?

@martinsumner
Owner Author

martinsumner commented Jul 17, 2019

There are two directions from which to look at this problem - the (over-)use of memory by the database, and the (under-)use of memory by the OS page cache.

Firstly, the situation in which the problem arose was one where the database had been left idle for a long time (days), and then a volume test was started. At that stage an unexpectedly high proportion of memory is taken by Riak, and within that the majority appears to be binary references held by leveled_cdb processes that are actually ready for garbage collection.

I've struggled to find concise references on how GC is triggered for a process (and of course BEAM memory management also changed significantly between R16 and OTP 20, so any reference may no longer be relevant). However, there are references to idle processes not triggering garbage collection, and to idle processes needing to be hibernated because of this.

It has been guessed that scan_over_file in CDB has a high risk of producing lots of GC'able binary references, but this hasn't been proven at non-trivial scale in tests. GC may be triggered at the end of a scan to eliminate this as a possibility, but it is difficult to be certain that this will actually make a real difference.
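A minimal sketch of what triggering GC at the end of a scan could look like (the function and variable names here are illustrative, not the actual leveled_cdb code):

```erlang
%% Illustrative sketch only: force a collection of the scanning process once
%% the fold over the file has completed, so that sub-binaries created while
%% reading do not keep the larger file-read binaries alive.
scan_then_collect(ScanFun, Handle, Acc0) ->
    Acc = ScanFun(Handle, Acc0),        % e.g. the fold performed by scan_over_file
    true = erlang:garbage_collect(),    % explicit GC of the calling process
    Acc.
```

If the scan runs inside the leveled_cdb gen_server process itself, the same erlang:garbage_collect/0 call could instead be made at the end of the relevant callback clause.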

In terms of the under-use of memory by the OS page cache, fadvise should be our friend. When we start up the database, we would normally want the ledger to be in the page cache - so an option is to fadvise it as willneed on startup. However, this may have an impact on startup times, as the fadvise may involve a synchronous read (https://stackoverflow.com/questions/4936520/posix-fadvisewillneed-makes-io-slower). Will this punitively impact startup times?
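For reference, a sketch of issuing the willneed advice from Erlang using file:advise/4 (the helper name is illustrative; this is not the leveled implementation):

```erlang
-include_lib("kernel/include/file.hrl").

%% Illustrative sketch: advise the OS to pre-load a ledger file into the
%% page cache shortly after startup.
preload_file(FileName) ->
    {ok, IO} = file:open(FileName, [read, raw, binary]),
    {ok, #file_info{size = Size}} = file:read_file_info(FileName),
    ok = file:advise(IO, 0, Size, will_need),
    file:close(IO).
```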

Note that the database was left idle after startup - hence the assumption that adjusting startup behaviour may have a positive impact.

@martinsumner
Owner Author

martinsumner commented Jul 17, 2019

Note that in volume tests where we start an empty store, and then load it as part of the volume test, there are no issues with memory allocation. Riak will take a minimal memory footprint in comparison to the page cache.

If this is specifically an issue with non-GC when idle - is hibernate a better answer? Should leveled_cdb files hibernate after an inactivity timeout?
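As a sketch of the hibernate approach, assuming a standard gen_server (the timeout value and callback clauses are illustrative, not the actual leveled_cdb code):

```erlang
-define(HIBERNATE_TIMEOUT, 60000).  % e.g. hibernate after 60s of inactivity

%% Returning a timeout from each callback means a 'timeout' message is
%% delivered if no other message arrives within that period ...
handle_call(Msg, _From, State) ->
    Reply = do_request(Msg, State),    % placeholder for the real work
    {reply, Reply, State, ?HIBERNATE_TIMEOUT}.

%% ... at which point the process can hibernate, which also forces a full
%% garbage collection and shrinks the heap.
handle_info(timeout, State) ->
    {noreply, State, hibernate}.
```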

@martinsumner
Owner Author

If there is to be a page cache load via fadvise on startup, this needs to be configurable, as it won't necessarily be of help when leveled is used as an AAE backend.

@martinsumner
Owner Author

martinsumner commented Jul 17, 2019

Regarding the earlier question of whether an fadvise willneed on the ledger at startup will punitively impact startup times:

Testing this, it makes a negligible difference to startup times. With fadvise there is higher disk utilisation in the minutes following startup - so it looks like this activity is pushed to the background.

However, this is when restarting, when the pages are likely to be already cached.

Doing this after a reboot, the node startup time was 30% slower, but there was a lot more disk activity post-startup (about 3x as much disk I/O).

@martinsumner
Owner Author

When testing in an NHS pre-production environment, the following was observed:

  • Following startup (with a full data set) and prior to sending new load, memory usage is low;
  • Following a load test, memory usage is very high, much higher than has been observed during basho_bench tests.

The additional memory is all on the binary heap (not the process heaps), and the files with the largest amount of binary memory referenced are SST files. Further investigation reveals the binary references are related to the cache of header information (an array of binaries).

This cache is built lazily after startup - by splitting out the header binary from the overall block when the block is first loaded. However, there is no binary copy - so the header is a sub-binary that retains a reference to the whole block. Hence the unexpectedly high volume of binary memory referenced in the nodes following startup.

See 5bef21d, where a unit test has been added to demonstrate this (and the issue fixed by using binary:copy whenever a header binary is added to the array).
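The shape of the problem and of the fix, as a sketch (the slot/array handling here is simplified and illustrative rather than the actual leveled_sst code):

```erlang
%% Splitting the header out of a block binary creates a sub-binary that keeps
%% the whole block alive; binary:copy/1 allocates a fresh binary so that only
%% the header bytes remain referenced from the cache.
cache_header(SlotID, BlockBin, HeaderSize, HeaderArray) ->
    <<Header:HeaderSize/binary, _Rest/binary>> = BlockBin,
    array:set(SlotID, binary:copy(Header), HeaderArray).
```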

@martinsumner
Owner Author

#288

@martinsumner
Owner Author

Further to the PR: to close this issue, the level at which the page cache is pre-loaded needs to be configurable.
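A sketch of how such an option might look (the option name cache_preload_level and the default are hypothetical, not the actual leveled configuration):

```erlang
%% Illustrative only: pre-load (via the will_need advice) files whose level
%% number is at or below a configurable maximum, so that AAE backends can
%% opt out of or limit the pre-load work.
maybe_preload(FileName, Level, Opts) ->
    MaxLevel = proplists:get_value(cache_preload_level, Opts, 0),
    case Level =< MaxLevel of
        true  -> preload_file(FileName);   % as sketched earlier in the thread
        false -> ok
    end.
```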

@martinsumner
Owner Author

#292
