Prevent ILM from spuriously rolling over (many) empty indices #86203

Closed
Tracked by #89283
joegallo opened this issue Apr 26, 2022 · 4 comments · Fixed by #89557
Labels: :Data Management/ILM+SLM (Index and Snapshot lifecycle management), >enhancement, Team:Data Management (Meta label for data/management team)

Comments


joegallo commented Apr 26, 2022

Description

This has been brought up a few times in different forms, see #46161, #73349, #83039, #85054.

Any ILM policy with a max_age associated with the rollover action could trigger this scenario, but in order to talk about something concrete, I'll use metricbeat as an example (emphasizing again, though, that this isn't unique to metricbeat -- it's just the nature of how rollover currently works with a max_age).

With a test 8.1.3 Elasticsearch cluster, I ran metricbeat-8.1.2 for a few seconds and then stopped it, then ran metricbeat-8.1.3 for a bit longer. The default metricbeat policy has rollover with "max_age" : "30d" (30 days), but in order to illustrate this problem better I've set that to "1m" (1 minute) and lowered the ILM poll interval to 5 seconds so rollovers happen quickly:

PUT /_cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "5s"
  }
}

PUT _ilm/policy/metricbeat
{
  "policy" : {
    "phases" : {
      "hot" : {
        "min_age" : "0ms",
        "actions" : {
          "rollover" : {
            "max_size" : "50gb",
            "max_age" : "1m"
          }
        }
      }
    }
  }
}

After a few minutes, my cluster looks like this:

GET _cat/indices/.ds-metricbeat-*?s=index
yellow open .ds-metricbeat-8.1.2-2022.04.26-000001 GBqDAprYSl2NmFzi81n9Ug 1 1 1134 0 652.2kb 652.2kb
yellow open .ds-metricbeat-8.1.2-2022.04.26-000002 Ybd4SCiWT0-7W0v9zKPR4A 1 1    0 0    225b    225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000003 3_9OmFkOSKaEfF_J-_D9TA 1 1    0 0    225b    225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000004 4olQItwcTtCOWrBotcqoLw 1 1    0 0    225b    225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000005 N9_gYkcORWSVfacUwnDegw 1 1    0 0    225b    225b
yellow open .ds-metricbeat-8.1.3-2022.04.26-000001 kWW-N_bfRbO0vMR4z3F72g 1 1  862 0 639.2kb 639.2kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000002 qzu-L-zZQqqm-6GQAZMtgA 1 1  235 0 431.2kb 431.2kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000003 iW68NzFyTv-CCAg3Rsfj4A 1 1  265 0   494kb   494kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000004 NaIa-gUjShKpEHzNcAxA3w 1 1  234 0 451.7kb 451.7kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000005 lDhzxqtdR8miPnqvsf7HDQ 1 1  271 0 595.8kb 595.8kb

That is, for a little while, the first writer (metricbeat version 8.1.2) wrote documents, and then it stopped and was upgraded and replaced by the second writer (metricbeat version 8.1.3). Each of those writers uses a versioned datastream (metricbeat-8.1.2 and metricbeat-8.1.3 respectively).

The problem is easy to see -- notice that we're getting a new empty (0 document) .ds-metricbeat-8.1.2-[...] index every minute, and that we'll keep accumulating them forever. ILM doesn't have any special logic around empty indices like this, i.e. empty indices are treated the same as non-empty indices as far as ILM is concerned.

In this simple scenario, we know that the metricbeat-8.1.2 datastream is done now, and can be retired. However, there's no particular point in time where Elasticsearch itself or some individual metricbeat process could know that. I'm using just one metricbeat writer, but I could be running one on each of N hosts. No one writer process in this scenario knows that it is special and should "turn off the lights when it's done".

To further complicate matters, maybe I have a weekly batch process which runs on Sunday evening and writes some logs after a long quiet period (and its logs are still being monitored by metricbeat version 8.1.2) -- when it does so, we could end up with more data flowing into the current metricbeat-8.1.2 write index. Let's call that the "sporadic writer" case. In that case, we'd end up with periods of no data flowing in and the accumulation of empty indices, followed by one or more non-empty indices, and then back to accumulating empty indices again.

ILM doesn't know whether there's a sporadic writer out there or not, and ignorant of whether more documents will be coming one day, it dutifully executes the policy, rolling over the now defunct metricbeat-8.1.2 datastream every minute and leaving a trail of empty .ds-metricbeat-8.1.2-[...] indices in its wake.

An additional note: my illustration here is datastream specific, but in broad strokes this issue also exists in a pre-datastream indexing strategy built around aliases. It would be most excellent if we were able to solve both the datastream-based and alias-based versions of this empty index problem (but, reserving a degree of freedom, I don't think the solution must necessarily be precisely the same in both cases).
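
(For reference, a quick way to watch the empty backing indices accumulate is the _cat API with just the index name and document count columns -- a minimal sketch using the same pattern as above:)

GET _cat/indices/.ds-metricbeat-*?h=index,docs.count&s=index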

@joegallo added the >enhancement, WIP, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), team-discuss, and needs:triage (Requires assignment of a team area label) labels Apr 26, 2022
@elasticmachine added the Team:Data Management (Meta label for data/management team) label Apr 26, 2022
@elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)


joegallo commented Apr 26, 2022

Original scenario

green open .ds-metricbeat-8.1.2-2022.05.01-000015 CCsJepDKRemPqXzWWKlYPg 1 1    437735 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.02-000016 HnmRnMB3R1WZMHj38Mrj7g 1 1    434799 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.03-000017 D2DL7n97QsiNX7mFbghYjA 1 1    432417 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.04-000018 A9zEIESPSAaFJ8_siF-tmw 1 1    440193 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.05-000019 wHCtcxm9QLeEnWCn8V9zWg 1 1    434248 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.06-000020 3RA8H5AISemuGbhKYBbcTw 1 1    186666 0   1.7gb   0.8gb
green open .ds-metricbeat-8.1.2-2022.05.07-000021 6cyuaHx1SY-NV1xHQ-RAGQ 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.08-000022 MZqmoyw_QmWiY5sEpLnJiw 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.09-000023 3Js4NRNJSkGcXtao6TxUYQ 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.10-000024 8_vBYrnDTEmibFv0mNIxwg 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.11-000025 G7K0euo3TuuInSt_I4JsgA 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.12-000026 UxaTjfZ8QBuGVJWM-9gENQ 1 1         0 0    225b    225b

A writer (metricbeat) is in place at normal load from 5/1 to 5/5, then it cuts out partway through 5/6. After that, because of daily rollovers, empty indices begin to accumulate (one per day).

Solution 1: Don't roll over empty indices (#46161, #85054)

green open .ds-metricbeat-8.1.2-2022.05.01-000015 CCsJepDKRemPqXzWWKlYPg 1 1    437735 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.02-000016 HnmRnMB3R1WZMHj38Mrj7g 1 1    434799 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.03-000017 D2DL7n97QsiNX7mFbghYjA 1 1    432417 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.04-000018 A9zEIESPSAaFJ8_siF-tmw 1 1    440193 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.05-000019 wHCtcxm9QLeEnWCn8V9zWg 1 1    434248 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.06-000020 3RA8H5AISemuGbhKYBbcTw 1 1    186666 0   1.7gb   0.8gb
green open .ds-metricbeat-8.1.2-2022.05.07-000021 6cyuaHx1SY-NV1xHQ-RAGQ 1 1         0 0    225b    225b # but today is the 12th

Pros:

  • Straightforward to explain and to implement
  • Low risk since no indices are being deleted
  • Can be implemented as a new option in the _rollover API itself

Cons:

  • Doesn't handle sporadic writers very well -- a sporadic writer, writing new data on the 18th of May, would write a few minutes' worth of data to .ds-metricbeat-8.1.2-2022.05.07-000021, and then that index would be rolled over the next time ILM runs, because now it's not empty anymore.
  • Unless users are mindful to _rollover when they put a new template, the .ds-metricbeat-8.1.2-2022.05.07-000021 index could have 'old' settings, mappings, etc.
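
To make Solution 1 concrete, here's a sketch of how the guard could look if expressed as an extra rollover condition in the policy -- the min_docs name below mirrors the min_* conditions that later shipped in newer Elasticsearch versions, but treat it as illustrative of the idea rather than a settled design:

PUT _ilm/policy/metricbeat
{
  "policy" : {
    "phases" : {
      "hot" : {
        "actions" : {
          "rollover" : {
            "max_size" : "50gb",
            "max_age" : "1m",
            "min_docs" : 1
          }
        }
      }
    }
  }
}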

Solution 2: Delete empty indices after they've been rolled over (#73349)

green open .ds-metricbeat-8.1.2-2022.05.01-000015 CCsJepDKRemPqXzWWKlYPg 1 1    437735 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.02-000016 HnmRnMB3R1WZMHj38Mrj7g 1 1    434799 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.03-000017 D2DL7n97QsiNX7mFbghYjA 1 1    432417 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.04-000018 A9zEIESPSAaFJ8_siF-tmw 1 1    440193 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.05-000019 wHCtcxm9QLeEnWCn8V9zWg 1 1    434248 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.06-000020 3RA8H5AISemuGbhKYBbcTw 1 1    186666 0   1.7gb   0.8gb
green open .ds-metricbeat-8.1.2-2022.05.12-000026 UxaTjfZ8QBuGVJWM-9gENQ 1 1         0 0    225b    225b # empty indices are created (and deleted) each day

Pros:

  • Still seems pretty straightforward to implement, though a bit less obvious to explain
  • Handles sporadic writers well, and doesn't fall behind in terms of template changes (regardless of whether _rollover is called manually)

Cons:

  • Creating and then deleting these indices seems a bit noisy
  • A little risky -- need to be sure that we're only deleting indices that are empty
  • Snapshotting gets a little touchy (can't delete during a snapshot)
  • ILM permissions annoyances -- the policy now also needs delete permissions rather than just create/rollover permissions
  • Major con: won't work for TSDB indices because they may be written to even after they are no longer the write index
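
As a manual approximation of what Solution 2 would automate (a sketch, reusing an index name from the original scenario above): confirm the rolled-over backing index really is empty, then delete it. Only non-write, genuinely empty backing indices should ever be candidates.

GET .ds-metricbeat-8.1.2-2022.05.07-000021/_count

DELETE .ds-metricbeat-8.1.2-2022.05.07-000021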

Solution 3: Lazy rollover for datastreams

green open .ds-metricbeat-8.1.2-2022.05.01-000015 CCsJepDKRemPqXzWWKlYPg 1 1    437735 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.02-000016 HnmRnMB3R1WZMHj38Mrj7g 1 1    434799 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.03-000017 D2DL7n97QsiNX7mFbghYjA 1 1    432417 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.04-000018 A9zEIESPSAaFJ8_siF-tmw 1 1    440193 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.05-000019 wHCtcxm9QLeEnWCn8V9zWg 1 1    434248 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.06-000020 3RA8H5AISemuGbhKYBbcTw 1 1    186666 0   1.7gb   0.8gb

If datastreams rolled over lazily (i.e. on the next write) then no additional backing indices would be created for the datastream in the above scenario, because there wouldn't be any writes to it.

Pros:

  • Handles sporadic writers well, and doesn't fall behind in terms of template changes (regardless of whether _rollover is called manually)

Cons:

  • Only solves the problem for datastreams; no change for indices behind an alias
  • Overcomplicates the datastream model:
    • We go from always having a write index to sometimes having a write index (breaking API change?)
    • TSDB complications around time series start and end times
  • Latency issues on document writes due to create-on-demand
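
If Solution 3 were surfaced through the existing rollover API, one could imagine marking a rollover as deferred until the next write. The lazy parameter below is illustrative only (newer Elasticsearch releases expose something along these lines, but treat the exact syntax as an assumption here):

POST metricbeat-8.1.2/_rollover?lazy=true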


joegallo commented Apr 28, 2022

In the same neighborhood as this, there's a problem with policies that are only size based and have no time component (or, in a min_* world, anything with a min_size/min_docs) -- imagine that filebeat-8.1.3 writes 10g per day, and I roll over at 10g. What happens if I upgrade to filebeat-8.1.4 mid-morning, and only 500MB has been written into the last filebeat-8.1.3 index? With a time based rollover, nothing interesting happens -- everything works as expected. But with a size based policy at 10g, or a time based policy with an associated min_size of 1g, my "tailed off" last 500MB of filebeat-8.1.3 data would never roll over.

The infinite empty rollover problem and the never-rolled-over "tailed off" data problem aren't quite two sides of the same coin, but they are related.

edit: Note, the workaround/solution to a never-rolled-over "tailed off" index is quite straightforward -- manually hit the _rollover API for the associated datastream or alias.
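
For the scenario described in this comment, that would be (using the data stream name from above):

POST filebeat-8.1.3/_rollover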


ruflin commented Nov 30, 2022

Just wanted to open an issue around this, really great to have this fixed! Is there any magic available to get rid of all the empty indices that are already created?

Update: Wrote my own quick python script and just deleted 1200 empty indices ... 🥳
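
(Since there was no built-in cleanup, a console-level sketch of the same idea -- not the actual script -- is to list backing indices with a zero document count and then delete each non-write empty index individually; the index name below is just one of the empty ones from the original description:)

GET _cat/indices/.ds-metricbeat-*?h=index,docs.count&s=index

DELETE .ds-metricbeat-8.1.2-2022.04.26-000002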
