[Feature Request] Make remote snapshot local file_cache block size configurable #14990
Comments
One question: if we change the block size of an existing file_cache, how do we handle the old blocks that were written with a different block size? Clear them all and repopulate the cache, or split/combine the old blocks into new blocks?
@finnegancarroll are we proposing this setting to be static or dynamic?
Thanks @finnegancarroll this is an interesting feature request. I'm curious about this part:
Have you done any measurements on how much baseline cache usage can be reduced with different block sizes? And with smaller block sizes, would there be additional data that has to be re-downloaded each time? Coming at it from a different perspective: if the problem we are trying to solve is reducing baseline cache usage, would it be viable to introduce some custom logic for handling the metadata blocks instead?
Hi @finnegancarroll,
I am interested in learning more details about this part.
Is your feature request related to a problem? Please describe
To perform a search on a remote snapshot we download only the specific blocks of the snapshot needed to complete the search. These blocks have a fixed 8MB size and are stored on disk in a local, reference-counted file cache. While there are benefits to pulling down large blocks to take advantage of spatial locality and reduce the overhead of accessing the remote store, we also risk overpopulating the cache with unneeded data.
The large block size is particularly noticeable when initializing a remote snapshot. For each segment, Lucene opens and holds onto file references to its metadata. Lucene never closes these references, so the corresponding blocks must remain downloaded and present in the cache for the lifetime of the process. For metadata in particular, 8MB per block is far more than needed, so the baseline disk usage of the cache could be drastically reduced with a more conservative block size. A rough sketch of the arithmetic is shown below.
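To make the arithmetic concrete, here is a minimal, hypothetical sketch. It is not the actual OpenSearch or Lucene code, and the segment count and read sizes are illustrative assumptions; it only models how the block size drives both per-read download volume and the baseline space pinned by blocks whose file references are never closed.

```java
// Illustrative sketch only -- not the actual OpenSearch/Lucene implementation.
// Models (a) how much data one small read pulls into the file_cache and
// (b) the baseline space pinned by blocks that can never be evicted because
// Lucene keeps their file references open.
public final class BlockSizeSketch {

    /** Bytes downloaded to serve a read of 'length' bytes at 'offset', given a block size. */
    static long bytesFetched(long offset, long length, long blockSize) {
        long firstBlock = offset / blockSize;
        long lastBlock = (offset + length - 1) / blockSize;
        return (lastBlock - firstBlock + 1) * blockSize;
    }

    public static void main(String[] args) {
        long eightMiB = 8L * 1024 * 1024;
        long oneMiB = 1L * 1024 * 1024;

        // A 4 KiB metadata read still pins a whole block in the cache.
        System.out.println(bytesFetched(0, 4096, eightMiB)); // 8388608 bytes
        System.out.println(bytesFetched(0, 4096, oneMiB));   // 1048576 bytes

        // Hypothetical numbers: 500 segments, each holding open one small
        // metadata file whose block can never be evicted.
        long pinnedFiles = 500;
        System.out.println(pinnedFiles * eightMiB); // ~4 GiB of baseline cache usage
        System.out.println(pinnedFiles * oneMiB);   // ~500 MiB with a smaller block size
    }
}
```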
Describe the solution you'd like
Can this block size be a configurable setting for a remote snapshot repository?
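For illustration, here is a hypothetical sketch of how such a setting might be declared using OpenSearch's Setting infrastructure. The key name is an assumption (no such setting exists today), import paths and factory overloads may differ across OpenSearch versions, and it is sketched as a node-scoped static setting even though a repository-level setting is the alternative this request mentions.

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;
import org.opensearch.core.common.unit.ByteSizeUnit;
import org.opensearch.core.common.unit.ByteSizeValue;

public final class RemoteSnapshotCacheSettings {
    // Hypothetical key; the 8MB default mirrors the current hard-coded block size.
    public static final Setting<ByteSizeValue> FILE_CACHE_BLOCK_SIZE_SETTING =
        Setting.byteSizeSetting(
            "node.search.cache.block_size",
            new ByteSizeValue(8, ByteSizeUnit.MB),
            Property.NodeScope);
}
```

Whether the setting ends up static (as sketched) or dynamic ties into the earlier question about what to do with blocks already cached under a different size.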
Related component
Search:Searchable Snapshots
Describe alternatives you've considered
Alternatively, could a smaller default still improve performance? How was 8MB selected?
Additional context
Some short tests were run with 13GB of OSB Big5 data restored from a remote snapshot hosted local to the cluster. This means there is very little overhead for accessing the remote snapshot; a more robust test should use an actual remote store to get a better idea of how the overhead of more frequent block downloads impacts performance.
file_cache capacity is set to 10MB so that the cache can easily be fully populated.
The OSB query-string-on-message workload was chosen because of the large number of block downloads it requires; a less expensive query might never access any doc fields.