
[BUG] 2.11.1-2.16.0 Cluster crashes because search node run out of disk space after some time #11676

Open
sandervandegeijn opened this issue Dec 27, 2023 · 63 comments
Labels: bug (Something isn't working), Search:Searchable Snapshots

@sandervandegeijn

sandervandegeijn commented Dec 27, 2023

Describe the bug

We have dedicated search nodes within the cluster with 100GB of SAN storage as a cache layer. On normal nodes OpenSearch would have gone red because the watermarks would have been exceeded. There is no such notice here; it keeps filling the disks until they are full, after which you get errors like:

[screenshots of the error messages]

We have not set node.search.cache.size because these nodes don't do anything other than searching.

Also, the "Object object" error message isn't really helpful ghehe :)

Related component

Storage:Snapshots

To Reproduce

Leave the cluster running for a while, do several searches to fill the cache -> error.

Expected behavior

On a dedicated search node, use something like 90% of the available storage but nothing more, to prevent stability issues.

Additional Details

Plugins
Default docker 2.11.1 images with S3 plugin installed.

Screenshots
N/A

Host/Environment (please complete the following information):
Kubernetes with docker images 2.11.1

@andrross
Member

@sandervandegeijn Have you configured a value for cluster.filecache.remote_data_ratio?

Can you share the output of GET _nodes/stats/file_cache when a node gets into this state?

@sandervandegeijn
Author

sandervandegeijn commented Dec 27, 2023

I have implemented the node.search.cache.size setting, setting it to 95GB of the 100GB available on the PVC. I'll see if that holds. Ideally I would like to use a percentage, and if it is not set by me, a proper default would be nice.

The cluster.filecache.remote_data_ratio doesn't do what I need if I understand the docs correctly: I don't want to be limited by the cache size. The cache is a bonus and should not prohibit the functioning of the cluster.
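
For reference, a minimal sketch of how both settings can be applied; the setting names are the ones discussed in this thread, while the values below are only illustrative assumptions for a node with a 100GB cache disk.

opensearch.yml (static, per search node):

node.search.cache.size: 95gb

Dynamic cluster setting via the REST API:

PUT _cluster/settings
{
  "persistent": {
    "cluster.filecache.remote_data_ratio": 5
  }
}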

@andrross
Member

The cluster.filecache.remote_data_ratio doesn't do what I need if I understand the docs correctly, I don't want to be limited by the cache size.

Unfortunately there are still limits here. If the actively-referenced data at any one time exceeds what is available on the machine, then you will run into problems like what you've described here. The result of the _nodes/stats/file_cache API will help determine if that is what is going on here. The remote_data_ratio setting is meant as a safeguard to prevent assigning more remote shards than can reasonably be supported.

@sandervandegeijn
Author

sandervandegeijn commented Dec 28, 2023

Will take a look at it tomorrow. What would "reasonably" mean, and what would be the limiting factor? I'm currently on a cluster with:

  • 6 data nodes with 1TB full flash SAN and 16GB RAM
  • 3 master / coordinating nodes and 8GB RAM
  • 10 search nodes with 100GB full flash SAN and 16GB RAM. Backend storage is an S3 cluster with spinning disks and another flash cache in front of the spinning disks.
  • 2 ingest nodes and 2GB RAM

There's around 3.5TB on the hot nodes and some 15TB on the S3 cluster (growing by ±0.5TB/day; the final size will be around 100TB). The search nodes make use of our S3 solution; I'm pulling more than 20gbit/s from the S3 appliance (probably more now that more search nodes have been added, but there are no more recent measurements). I/O response times are stable. All nodes can take 12 CPU cores if need be.

Performance is fine; a search on source.ip = x.x.x.x over the whole dataset takes around 10-15s, which is more than acceptable.

@andrross
Member

What would reasonable mean, what would be the limiting factor?

The strictly limiting factor will be shown by file_cache.active_in_bytes. That is the size of all the actively-referenced (i.e. unevictable) data. If your search workload leads to referencing more data at any point in time than there is storage available on the node, then you'll start seeing failures. In practice, as the active_in_bytes size approaches the total cache size, performance might start to degrade as well, because you'll see increased cache misses, which means that search requests will have to pull down data on demand more frequently.

@sandervandegeijn
Author

Okay, I've got stats for some of the nodes:

"MVyDvVRmT3OP-wzqw2elag": {
      "timestamp": 1703846452872,
      "name": "opensearch-search-nodes-0",
      "transport_address": "10.244.206.182:9300",
      "host": "10.244.206.182",
      "ip": "10.244.206.182:9300",
      "roles": [
        "search"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "file_cache": {
        "timestamp": 1703846452872,
        "active_in_bytes": 81091561675,
        "total_in_bytes": 0,
        "used_in_bytes": 81703930059,
        "evictions_in_bytes": 265176318281,
        "active_percent": 99,
        "used_percent": 0,
        "hit_count": 313719,
        "miss_count": 63017
      }
    },
    "w1voQkb2SUKuBjDvK_Mfqw": {
      "timestamp": 1703846452883,
      "name": "opensearch-search-nodes-8",
      "transport_address": "10.244.207.169:9300",
      "host": "10.244.207.169",
      "ip": "10.244.207.169:9300",
      "roles": [
        "search"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "file_cache": {
        "timestamp": 1703846452883,
        "active_in_bytes": 72728861570,
        "total_in_bytes": 0,
        "used_in_bytes": 73525779330,
        "evictions_in_bytes": 257238910529,
        "active_percent": 99,
        "used_percent": 0,
        "hit_count": 349777,
        "miss_count": 59203
      }
    },
    "J3g3YbS7Q_qbvLniDz19RQ": {
      "timestamp": 1703846452883,
      "name": "opensearch-search-nodes-9",
      "transport_address": "10.244.204.127:9300",
      "host": "10.244.204.127",
      "ip": "10.244.204.127:9300",
      "roles": [
        "search"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "file_cache": {
        "timestamp": 1703846452883,
        "active_in_bytes": 75778946930,
        "total_in_bytes": 0,
        "used_in_bytes": 76533921650,
        "evictions_in_bytes": 258573952202,
        "active_percent": 99,
        "used_percent": 0,
        "hit_count": 362107,
        "miss_count": 60526
      }
    }
  }

From what you've written, is the data always transferred to the local storage of the nodes? I was under the impression it would get the data from S3 and it would write the most used parts to the local storage for performance gains. That there will be more cache misses when the local storage / S3 ratio changes is obvious (and a performance penalty I would happily pay to be able to archive the data).

Since I've implemented node.search.cache.size at effectively 95% of the local storage I've not seen errors anymore (so far).

@sandervandegeijn
Author

Okay, I had another run-in with full disks on the search nodes. From a sysadmin's perspective, the cache should be a bonus for performance optimisation and should not fill up the entire disk of the search node, regardless of the total storage in use on the S3 side.

Can I make this happen? :)

@andrross
Member

I was under the impression it would get the data from S3 and it would write the most used parts to the local storage for performance gains...From a sysadmin's perspective, the cache should be a bonus for performance optimisation and should not fill up the entire disk of the search node regardless of the total storage in use on the S3 side.

@sandervandegeijn Fair points! However, the current architecture requires that the blocks pulled from the remote object store be written to disk and then opened as files on disk to serve searches. This means we don't have to have all actively-referenced data loaded into memory, because we can rely on the operating system to page this data between the disk and page cache as appropriate.

I believe the scenario you are encountering is that the sum of the actively-referenced data is exceeding the disk space on the node. At a minimum, I believe a better experience would be to fail new searches that come in and require downloading new data, as opposed to letting the system run out of disk space and fail unpredictably. What do you think?

Separately, it does look like there's another bug here where your file cache stats are showing "total_in_bytes": 0. I've not seen that before, but "total_in_bytes" is supposed to be the size of the file cache and it is clearly not zero here.

@sandervandegeijn
Author

sandervandegeijn commented Jan 11, 2024

Okay, uhm... These kinds of things are better discussed in person and with a whiteboard, but I'll try:

Failing gracefully and handling the situation is always better than just letting it blow up. But in this case it's not even doing searches in parallel; this is one search that's going through ±20TB of data, where 3.5TB is on full flash storage (within the cluster) and the rest on S3. The search nodes (10) have full flash storage (100GB each), so 1TB of active storage in addition to the hot nodes. It explains why I'm seeing 20-25gbit/s from the S3 storage: basically it's pulling all data in as fast as possible. It also explains why doing concurrent searches over the whole dataset is almost impossible (which would be acceptable).

Without having any knowledge of the caveats, and I intend this as constructive feedback:

This 1:20 ratio is about as much as I can stomach in terms of storage costs; ideally I would like something like 1:100, where I can live with slower performance. I really don't care that it would take 5-10 minutes for one search; it's archive, cold cold storage. Lower ratios negate the advantage of the lower storage costs on the S3 side, and it's not going to be practical. I'm also talking to another university that wants to store and search 5 petabytes (!) of logs. Searching 5PB this way seems almost impossible. You need too much primary storage to be cost effective when the ratios are too low.

That's a shame; it would be the ideal cost-effective solution to be able to do this. Thinking through this, it seems we have three problems:

  1. All referenced data needs to be on disk, so large searches also require large amounts of primary storage in the cluster, maybe almost in the 1:<10 ratio.
  2. The search nodes that run out of storage seem to trigger a red state on the cluster
  3. It needs to pull all the data from the object storage

Are there ways to improve the implementation in such a way that not all data always needs to be copied to primary storage, but can be used directly from the object storage, with the cache in the search nodes preventing excessive reads of the most used data? The first two problems are the main things to solve; the third one would be very nice to solve if possible (by reading only the relevant data, as you would on block storage). You could work around the third one with massive bandwidth, although it's not efficient.

If this is not possible, I'm thinking about totally different ways of scaling - maybe doing one primary cluster for hot ingest and then a cluster per time unit (month?) with slower storage while using cross cluster search to search through everything at once.

Have you experimented with these kinds of clusters for petabytes of data?

edit: ouch, my cluster went red because of this, it seems. I would really like the search nodes to be a "bonus" that never causes the cluster to stop working. As long as I have regular nodes that are functioning, the cluster should remain functional and keep ingesting data. The logs/monitoring can spam me to kingdom come about the search nodes not functioning. I updated the list of problems.

sandervandegeijn changed the title from "[BUG] 2.11.1 Search node disk runs full after some time" to "[BUG] 2.11.1 Search node disk out of space after some time" on Jan 11, 2024
@andrross
Member

Thanks for the feedback! To clarify, and to get deeper into the technical details...

Searchable snapshots use the Lucene Directory and IndexInput interfaces to abstract away the remote versus local details. When an IndexInput instance is opened, if the specific part of the file being pointed to is not already on disk, it pulls down that part of the file in an 8MiB chunk. When the "cursor" of the IndexInput moves past the current 8MiB part it will then download the next 8MiB part, keeping previous parts cached on disk. So in the absolute simplest case, if you had a single search on a single shard that had a single open IndexInput, then only 8MiB of data would be actively referenced at any one time. Reality is generally more complicated than that, but I would not expect a single search to have to actively reference all data of the index. I think we need to dig deeper into specifically what is happening in this case, because it does not match my expectation for how the system should perform.
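
To illustrate the mechanism described above, here is a rough, hypothetical Java sketch; the class and method names are invented (this is not the actual OnDemandBlockSnapshotIndexInput code) and the remote read is a placeholder, with only the 8 MiB block arithmetic mirroring the explanation:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class OnDemandBlockReaderSketch {
    static final long BLOCK_SIZE = 8L * 1024 * 1024; // 8 MiB default block size

    private final Path cacheDir;

    OnDemandBlockReaderSketch(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    // Returns the local path of the block covering 'offset', fetching it first if absent.
    Path blockFor(String fileName, long offset) throws IOException {
        long blockIndex = offset / BLOCK_SIZE;                    // which 8 MiB chunk of the file
        Path block = cacheDir.resolve(fileName + ".part" + blockIndex);
        if (Files.notExists(block)) {
            long start = blockIndex * BLOCK_SIZE;
            // In OpenSearch this would be a ranged read against the snapshot repository (e.g. S3).
            byte[] bytes = downloadRange(fileName, start, BLOCK_SIZE);
            Files.write(block, bytes);                            // previously fetched blocks stay cached on disk
        }
        return block;
    }

    private byte[] downloadRange(String fileName, long start, long length) {
        throw new UnsupportedOperationException("repository-specific ranged read");
    }
}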

@sandervandegeijn
Author

Thanks for the explanation. Would it be necessary to write those chunks to disk, or would it be possible to do this in memory and either discard the data immediately or write it to disk as a cache when it's a block that is used frequently? The current problem is that - from what I understand - when you do a search over a large dataset, a large portion of that dataset needs to be on primary storage.

For us, the point of using searchable snapshots like this is to save on storage costs (full flash vs S3 is about 20x). This would enable us to keep data available for 180 days (security forensics use case).

I'm happy to provide details or schedule a call to show you around the cluster.

@sandervandegeijn
Author

sandervandegeijn commented Jan 14, 2024

Okay, unfortunately I'm having to scrap the current setup. The search node problems cause too many interruptions; the cluster went down again because of them. I'll try to leave the data on the S3 side and maybe even take snapshots so I can restore everything in a second cluster to be able to help with debugging. But this is getting too cumbersome for production use, I'm afraid.

@andrross
Member

The current problem is that - from what I understand - when you do a search over a large dataset a large portion of that dataset needs to be on primary storage.

@sandervandegeijn This should not be the case. With a very large data set and a relatively small primary storage cache, the expected behavior is that you might see a lot of thrashing - i.e. re-downloading of data that was evicted from primary storage - but that should manifest as slow queries and not cluster instability or disk exhaustion. Obviously something is not working correctly here though.

@sandervandegeijn
Author

Okay, that is what I would have expected as well. With a smaller cache, cache misses will be more frequent and more traffic will be generated. That's to be expected.

@sandervandegeijn
Author

I had to delete the data as well (for now). Do you have any means to reproduce this at scale?

@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@sandervandegeijn Thanks for filing this issue

@sandervandegeijn
Author

Is it possible to solve this in the 2.12 release?

@sandervandegeijn
Author

Guys, is there any chance this will be solved in 2.13? I can't use the feature right now because it makes the clusters unstable.

@sandervandegeijn
Author

Guys? :)

@peternied
Member

@sandervandegeijn what do you think about initiating a draft pull request to tackle this challenge, or perhaps motivating someone directly impacted by it to contribute a pull request? I'd be happy to help review the changes.

@sandervandegeijn
Author

sandervandegeijn commented Mar 29, 2024

Afraid I'm not a very capable Java dev, so it's quite a task for me. Second excuse: I can't build OpenSearch on my ARM-based MacBook as far as I can tell (which is a requirement for the coming weeks; I'm not taking the Lenovo P15 hernia-inducing brick on the camping trip ;) )

@peternied
Member

@sandervandegeijn Hmm I thought ARM was supported for core, but I'm a windows person that ssh's into linux machines 🤷. If you have time give it a try and feel free to create issues if it doesn't work. Maybe you know someone at your university that could take on the challenge while you are away. Enjoy camping in any case!

@sandervandegeijn
Author

Promise I'll give it a try, but I hope it won't come down to my Java skills to get this fixed for 2.14 ghehe.

@Pallavi-AWS
Member

@anasalkouz can we please look into prioritizing the handling of disk issues when the disk cache is small compared to the remote data being addressed?

@finnegancarroll
Contributor

So far I'm not seeing cases where a search node is completely leaking file references. It is, however, very easy to oversubscribe the local file cache while addressing a sufficiently large remote snapshot. The two ways I notice this happening are as follows:

1.) As discussed above, Lucene does not promise to close IndexInputs (file references) after completing a search. OpenSearch relies on GC to close these after the fact, which can allow references to overstay their welcome if GC is not aggressive enough.

2.) To reduce file handle usage, Lucene collapses several per-segment files into a single Compound File. These files appear to maintain high reference counts regardless of GC and as such are never evicted, even as the cache overflows.

If possible, the below information would help narrow down the issue:

  • Forcing (more like "suggesting") GC on an oversubscribed search node consuming excessive disk space and observing whether disk space is freed as a result.
    jcmd <pid> GC.run

  • I would be interested to see how many segments are present in indices serviced by the search node.
    /<index>/_stats?pretty

  • To verify that disk space used is actively accounted for by the file cache.
    /_nodes/stats/file_cache?pretty

PS - In the case of "total_in_bytes": 0, OpenSearch will probe the 'path.data' directory with getUsableSpace() as an estimate. This is liable to return 0 if there is any external disk I/O going on concurrently. I would be interested to know if this bug is consistently reproducible, as that would eliminate the possibility of some background disk operation messing with this method.
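
For context, a tiny standalone Java snippet showing the JDK call in question; the path below is just the container default seen in the logs above, and whether concurrent I/O makes the call return 0 in your environment is exactly what's being asked:

import java.io.File;

public class UsableSpaceProbe {
    public static void main(String[] args) {
        // The JDK method referenced above for estimating cache capacity.
        File dataPath = new File("/usr/share/opensearch/data");
        System.out.println("usable bytes: " + dataPath.getUsableSpace());
        // File.getUsableSpace() returns 0 when the value cannot be determined
        // (e.g. the path does not name a partition), which would surface as
        // "total_in_bytes": 0 in the stats above.
    }
}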

@sandervandegeijn
Author

I don't have a formal testing env I can use for the tests, and I'm hesitant to do it in production since it blew up the last few times. I'm also somewhat pressed for time because our daughter was born last week. I can help in a couple of weeks.

Is there someone else that could help out?

@sandervandegeijn
Author

Hi, is there any news? I'm starting at work again, so I can see if I can help out. :)

@rlevytskyi

No news.
The issue still persists on 2.12.

@sandervandegeijn
Author

Without changes I'm hesitant to try again (and possibly blow up another cluster ghehe). We have missed the 2.15 release window; I'm hoping for 2.16. This issue is important for OpenSearch to be a valid candidate for multiple educational institutions in the Netherlands. We need the storage tiering this feature provides to keep the costs acceptable.

@finnegancarroll
Contributor

This issue is still being investigated and I'll be spending time this week working towards a fix.
Can you share more details of your use case? In particular how much query throughput the affected nodes experience and roughly how long it takes for a node to fill its disk?

@sandervandegeijn
Author

sandervandegeijn commented Jul 3, 2024

Great thanks!

In our case we had 10 search nodes with 100GB primary storage each, with about 20TB of snapshot data. It took 10-20 searches to break the cluster, the searches being hours or days apart. During a search I was peaking at about 25gbit/s of throughput.

@andrross
Member

andrross commented Jul 3, 2024

about 20TB of snapshot data

@sandervandegeijn Can you share some statistics about this data? I think it might be helpful to know: total number of indexes, total number of primary shards, average number of segments per shard, and number of replicas (if any) configured for the searchable snapshot indexes.

@sandervandegeijn
Author

sandervandegeijn commented Jul 5, 2024

Okay, I'm typing along while setting up the testing environment.

First try
Using 2.15.0. First conservative try: 5 search nodes, 20GB storage each. Couldn't restore the 20TB-ish of data; I exceeded the max number of shards per node. Also, the primary storage of the search nodes is 100% used without firing a single query, judging by df -h.

Second try
Retrying with 10 search nodes, 100GB of storage each, and a max of 20k shards per node (very high, but I got errors on the first try). No replicas configured; indices have 4 primaries.

  1. 30% restore done: already 15% of the search nodes' disk space in use without firing a query.
  2. 60% done: 28% in use. 1770 indices restored.
  3. 100% done: 37% in use. 2405 indices restored.

GET */_stats

{
  "_shards": {
    "total": 7330,
    "successful": 7330,
    "failed": 0
  },
  "_all": {
    "primaries": {
      "docs": {
        "count": 75303014367,
        "deleted": 0
      },
      "store": {
        "size_in_bytes": 19541917513409,
        "reserved_in_bytes": 0
      }

GET _cat/segments?v
7283 rows (for the first month or so I didn't force merge; after that everything is being force merged because of your tip)

Trying to create an index pattern for log-*:

opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,045][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [SubLWTJHS9-qBhoqYDHpnA][opensearch-search-nodes-3][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,045][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [bCQS-sKDQ82jqqsHXAWWjw][opensearch-search-nodes-9][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,045][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [f1mZ-i8KR76i1JdJ0suvmQ][opensearch-search-nodes-8][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,046][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [-9h1gQg5RcCR3S6E1ECg3Q][opensearch-search-nodes-7][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,046][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [VB_j9mF6THOKA-_tOY-drw][opensearch-search-nodes-0][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,046][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [SicKggTVRm2yKQTOAOPpXQ][opensearch-search-nodes-4][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,046][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [-1eNOqW0SryX_LsJ5RKtjg][opensearch-search-nodes-5][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,046][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [s1Ffo8v9T0C8tsul2nrLzw][opensearch-search-nodes-2][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,046][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [yF820wFTR7mQr44EFt_pIQ][opensearch-search-nodes-6][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
opensearch-master-nodes-0 opensearch-master-nodes [2024-07-05T20:10:18,047][WARN ][o.o.c.r.a.DiskThresholdMonitor] [opensearch-master-nodes-0] flood stage disk watermark [95%] exceeded on [PozHDaJBRgemBVZfxQtUOw][opensearch-search-nodes-1][/usr/share/opensearch/data/nodes/0] free: 2.7gb[2.8%], all indices on this node will be marked read-only
[screenshots]

Something is going wrong with the free space detection, it seems. There is enough space on these nodes; the watermark notice is not valid.

[screenshot]

huh? Separate bug?

Third try
Rebuilding the cluster with 2.14.0. Started a fresh cluster, ran the security tool. Did nothing else; no restoring of the searchable snapshots:

[screenshot]

Already complaining about full disks on all search nodes. Checked one:

[screenshot]

sandervandegeijn changed the title from "[BUG] 2.11.1 Search node disk out of space after some time" to "[BUG] 2.11.1-2.15.0 Search node disk out of space after some time" on Jul 6, 2024
@finnegancarroll
Contributor

The incorrect reporting of disk space could be related to this recently merged issue: #14004.
When restarting a search node with an existing file cache already occupying disk, the calculated availableCapacity does not account for files already in the cache. availableCapacity is then reported as the difference between the total remaining space and the previous cache size.

@sandervandegeijn
Author

sandervandegeijn commented Jul 10, 2024

If you have a docker image for me I can verify? :)

Can I provide more data to help resolve the issue?

@finnegancarroll
Contributor

To clarify, I don't think this fix will impact your actual disk usage. While the capacity of the file cache is incorrectly reported in some cases according to #14004, the file cache capacity is only used to determine if the cache is overflowing and can begin evicting stale entries. So an artificially low capacity would only make evictions start sooner.

@finnegancarroll
Contributor

I believe the fastest way to reduce disk usage will be to make the size of fetched blocks configurable to the user, which I am working on here.

The current default block size of 8 MiB is quite large considering our cache is largely populated by metadata when restoring a remote searchable snapshot, and it seems as if we are agnostic as to the size of the data we actually need to perform any given operation. That is, 100 bytes of required metadata in our remote repository will likely be downloaded as an 8 MiB block.

By modifying my block size to be smaller, I've seen the disk usage of my file cache drop drastically.
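
As a back-of-the-envelope illustration of the read amplification described above (the 8 MiB default is from this thread; the 1 MiB alternative is just a hypothetical smaller value):

public class BlockAmplification {
    public static void main(String[] args) {
        long neededBytes = 100;                                      // tiny metadata read from the remote repo
        for (long blockSize : new long[] {8L << 20, 1L << 20}) {     // 8 MiB vs a hypothetical 1 MiB
            long blocks = (neededBytes + blockSize - 1) / blockSize; // ceiling division
            long downloaded = blocks * blockSize;                    // bytes written into the file cache
            System.out.println(blockSize + "-byte blocks -> " + downloaded + " bytes cached");
        }
        // 8 MiB blocks: 8,388,608 bytes cached for a 100-byte read (~84,000x amplification)
        // 1 MiB blocks: 1,048,576 bytes cached (~10,500x)
    }
}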

@sandervandegeijn
Author

sandervandegeijn commented Jul 15, 2024

Cool! Is that something that is adjustable after snapshots have been created already? Or can it be adjusted for a current repo for new snapshots?

Will this incur more calls to the S3 storage when you make it smaller? Then there might be an optimum to be found.

From what I've read this makes the odds of this bug occurring smaller, but it doesn't fix the underlying problem; is that correct? Can I provide more info to help find a structural fix for this one?

@sandervandegeijn
Author

sandervandegeijn commented Jul 23, 2024

Any news? Can I help you guys somehow? I'm afraid we're going to miss the 2.16 deadline.

We are moving along with our plans for a new SOC design; OpenSearch can fill an important role here, but we kind of need this functionality. Another option would be to tier the storage within the cluster (hot/warm architecture), but we like the data being immutable as a snapshot, and it brings less overhead. Boot times of the cluster with S3/searchable snapshots are much, much lower than with a warm storage tier (NFS/spinning disks) that needs to initialize all the shards.

@linuxpi
Collaborator

linuxpi commented Jul 25, 2024

[Storage Triage - attendees 1 2 3 4 5 6 7 8]

@finnegancarroll Do you think we can share some information here, or can @sandervandegeijn help provide some more details which might help? Let's see if we can target 2.17 maybe?

sandervandegeijn changed the title from "[BUG] 2.11.1-2.15.0 Search node disk out of space after some time" to "[BUG] 2.11.1-2.16.0 Cluster crashes because search node run out of disk space after some time" on Jul 25, 2024
@finnegancarroll
Contributor

Cool! Is that something that is adjustable after snapshots have been created already? Or can it be adjusted for a current repo for new snapshots?

As planned, the effect of this block size setting would be limited to the local cache, so I think it would be appropriate as a node-level setting. The remote snapshot would not need to change.

Will this incur more calls to the S3 storage when you make it smaller? Then there might be an optimum to be found.

This is definitely a con of lowering the block size, and I've noticed performance hits when pushing the block size down too far.

From what I've read this makes the odds of this bug occurring smaller, but it doesn't fix the underlying problem; is that correct? Can I provide more info to help find a structural fix for this one?

That's correct. This change will not fix the underlying problems with this cache and is only an attempt to soften the impact on disk usage. I have not been able to produce a case where subsequent queries inflate the size of the cache indefinitely. What I have noticed is that the configured capacity of this particular cache has little bearing on its disk usage. The logic which removes cached files from disk is tightly coupled with garbage collection, causing heavy spikes in disk usage up until GC runs and the cache can be "cleaned". With this change I would only hope to shrink these spikes; a true fix will likely require a change in design.

@finnegancarroll
Contributor

Hi @linuxpi,

Here are some more details regarding the behavior of this cache as I currently understand it.

OpenSearch provides Lucene with OnDemandBlockSnapshotIndexInputs, which negotiate downloading blocks from the remote store while implementing all the functionality of a regular IndexInput. This is accomplished by downloading whichever 8 MiB block of the file is referenced by the IndexInput whenever the data is actually needed by Lucene.

Each block downloaded by an OnDemandBlockSnapshotIndexInput is stored in a ref-counted file cache. The originating IndexInput does not contribute to the reference count, but all clones of that block increment the reference count. The reference count for any given block is decremented when the referencing clone is closed. This is problematic because Lucene makes no promise to close IndexInputs it has cloned. The solution to this is where our design creates a dependency on GC. Each IndexInput is registered with a Java Cleaner, which maintains a list of phantom references to each IndexInput. When GC runs, any object in the Cleaner's list that is now only phantom reachable is moved into a queue. A single thread constantly iterates through this queue and calls the cleanup task for each object, in our case closing the IndexInput.
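
A minimal, self-contained Java sketch of that Cleaner/phantom-reference pattern; the class names are hypothetical stand-ins rather than the OpenSearch implementation, and it only illustrates why a ref count drops only after both GC and the Cleaner thread have run:

import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

public class CleanerRefCountSketch {
    private static final Cleaner CLEANER = Cleaner.create();

    // Stand-in for a cached block whose on-disk bytes can only be evicted at ref count 0.
    static final class CachedBlock {
        final AtomicInteger refCount = new AtomicInteger(1);
        void decRef() { refCount.decrementAndGet(); }
    }

    // Stand-in for a cloned IndexInput that Lucene may never close explicitly.
    static final class BlockClone {
        BlockClone(CachedBlock block) {
            // The cleanup action must not capture 'this', or the clone never becomes phantom reachable.
            CLEANER.register(this, block::decRef);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CachedBlock block = new CachedBlock();
        new BlockClone(block);       // clone goes out of scope without being closed
        System.gc();                 // "suggest" GC, like jcmd <pid> GC.run
        Thread.sleep(200);           // give the Cleaner thread time to run the cleanup action
        System.out.println("ref count: " + block.refCount.get()); // usually 0, but only after GC + Cleaner
    }
}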

This design not only ties our disk usage to the performance and behavior of GC, but also to the single thread registered with the Cleaner that must actually do the work of closing our IndexInputs and then decrementing their ref counts. It's also worth noting that even when GC runs, you will often be left with a full cache despite reaching a reference count of zero on many of your entries. For example:

Directly after forcing GC

"file_cache" : {
    "timestamp" : 1722210969134,
    "active_in_bytes" : 249869432,
    "total_in_bytes" : 10485760,
    "used_in_bytes" : 434418808,
    "evictions_in_bytes" : 51220894703,
    "active_percent" : 58,
    "used_percent" : 4143,
    "hit_count" : 57169,
    "miss_count" : 6221
}

While GC might close all phantom-reachable clones of an IndexInput, the actual cache will not trigger an eviction and close the originating IndexInput until it is accessed. So you often have IndexInputs with ref count zero remaining on disk while OpenSearch is idle.

The effect of this seems to be disk usage that trends upwards during a search, with substantial drops that correlate with GC followed closely by a cache eviction.

Here's a small snippet of my cache size in bytes logged every second while running the OSB big5 workload from a remote snapshot: directory_size_bytes.log

In addition to the above, Lucene does its own caching and holds onto some select IndexInputs containing metadata for each segment in an index. These IndexInputs maintain a high reference count for the lifetime of the cache and can never be evicted. #14990 has some information relating to how that behavior affects the baseline size of our cache.

@finnegancarroll
Contributor

In terms of structural changes I've been wondering if we could move to a true LRU cache with strictly enforced capacity. It seems like we could extend our cache implementation to evict blocks from disk without waiting for their ref count to reach zero as long as we hold onto the metadata necessary to fetch the block once it's needed again.
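
A hedged sketch of what such a strictly bounded LRU could look like; this is just a LinkedHashMap-based illustration, not a proposal for the actual FileCache class. Eviction here ignores ref counts entirely, as suggested above, and assumes an evicted block can simply be re-fetched later:

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Capacity-bounded LRU keyed by block id; the value is the block size in bytes.
class BoundedLruCacheSketch extends LinkedHashMap<String, Long> {
    private final long capacityBytes;
    private long usedBytes;

    BoundedLruCacheSketch(long capacityBytes) {
        super(16, 0.75f, true);      // accessOrder = true gives least-recently-used iteration order
        this.capacityBytes = capacityBytes;
    }

    // Assumes each block id is inserted once; a real cache would handle re-insertions and ref counts.
    void putBlock(String blockId, long sizeBytes) {
        put(blockId, sizeBytes);
        usedBytes += sizeBytes;
        Iterator<Map.Entry<String, Long>> it = entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
            Map.Entry<String, Long> eldest = it.next();
            usedBytes -= eldest.getValue();
            it.remove();             // here the block file would be deleted from disk,
                                     // keeping only the metadata needed to re-fetch it
        }
    }
}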

@andrross
Member

@finnegancarroll What about changing the behavior to fail a search instead of allowing the cache size to be oversubscribed? That doesn't necessarily solve the underlying problem but it is a better experience than allowing the node to crash.

@sandervandegeijn
Author

sandervandegeijn commented Jul 31, 2024

That would be better; better to fail gracefully (kind of). The consequences are a lot easier to deal with than a red cluster that doesn't ingest data anymore.

In terms of structural changes I've been wondering if we could move to a true LRU cache with strictly enforced capacity. It seems like we could extend our cache implementation to evict blocks from disk without waiting for their ref count to reach zero as long as we hold onto the metadata necessary to fetch the block once it's needed again.

That does sound like a better solution; at least stability would be certain. Maybe there are even more intelligent schemes than LRU.

That being said, the ideal solution would not require a cache at all: do the work in memory as much as possible and write frequently used blocks to disk if disk is available. To be really nit-picky: if it's required, it's not really a cache. The cache should be there to optimize performance; having less of it should only mean less/degraded performance. If this were possible, stability would be almost guaranteed.

@sandervandegeijn
Author

sandervandegeijn commented Aug 13, 2024

2.16 is out now; what can we do to resolve this for the 2.17.0 release? I would happily throw in some hoodies from Wageningen University and buy you a decent bottle of whiskey if it helps ;)

@AmiStrn
Contributor

AmiStrn commented Aug 14, 2024

We may benefit from a cache clearing API to mitigate this issue when we get rejected queries or when the cache is suspected of overflowing. Currently, the cache clearing API does not clear this cache as it does with other caches (like the request/fielddata caches).
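
For comparison, the existing index-level clear cache API that does reset the request and fielddata caches; the index name below is just a placeholder:

POST /my-index/_cache/clear?request=true&fielddata=true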

@sandervandegeijn
Author

sandervandegeijn commented Aug 16, 2024

Sorry, I'm triggered a bit by the "6 months plus" label; this bug is already 8 (!) months old and breaks production functionality. If little priority is given to this, that would be a shame, but I can't assess the value of the other issues; that prioritization might be valid.

But in that case searchable snapshots should be removed as functionality until this is fixed, because it's unreliable. This should not be in a stable OpenSearch release in its current state.

Projects
Status: Later (6 months plus)
Status: Todo
Development

No branches or pull requests