Prevent ILM from spuriously rolling over (many) empty indices #86203

Closed
Tracked by #89283
joegallo opened this issue Apr 26, 2022 · 4 comments · Fixed by #89557
Labels: :Data Management/ILM+SLM (Index and Snapshot lifecycle management), >enhancement, Team:Data Management (Meta label for data/management team)

Comments


joegallo commented Apr 26, 2022

Description

This has been brought up a few times in different forms, see #46161, #73349, #83039, #85054.

Any ILM policy with a max_age associated with the rollover action could trigger this scenario, but in order to talk about something concrete, I'll use metricbeat as an example (emphasizing again, though, that this isn't unique to metricbeat -- it's just the nature of how rollover currently works with a max_age).

With a test 8.1.3 Elasticsearch cluster, I ran metricbeat-8.1.2 for a few seconds and then stopped it, then ran metricbeat-8.1.3 for a bit longer. The default metricbeat policy has rollover with "max_age" : "30d" (30 days), but in order to illustrate this problem better I've set that to "1m" (1 minute) and lowered the ILM poll interval to 5 seconds so rollovers happen quickly:

PUT /_cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "5s"
  }
}

PUT _ilm/policy/metricbeat
{
  "policy" : {
    "phases" : {
      "hot" : {
        "min_age" : "0ms",
        "actions" : {
          "rollover" : {
            "max_size" : "50gb",
            "max_age" : "1m"
          }
        }
      }
    }
  }
}

After a few minutes, my cluster looks like this:

GET _cat/indices/.ds-metricbeat-*?s=index
yellow open .ds-metricbeat-8.1.2-2022.04.26-000001 GBqDAprYSl2NmFzi81n9Ug 1 1 1134 0 652.2kb 652.2kb
yellow open .ds-metricbeat-8.1.2-2022.04.26-000002 Ybd4SCiWT0-7W0v9zKPR4A 1 1    0 0    225b    225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000003 3_9OmFkOSKaEfF_J-_D9TA 1 1    0 0    225b    225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000004 4olQItwcTtCOWrBotcqoLw 1 1    0 0    225b    225b
yellow open .ds-metricbeat-8.1.2-2022.04.26-000005 N9_gYkcORWSVfacUwnDegw 1 1    0 0    225b    225b
yellow open .ds-metricbeat-8.1.3-2022.04.26-000001 kWW-N_bfRbO0vMR4z3F72g 1 1  862 0 639.2kb 639.2kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000002 qzu-L-zZQqqm-6GQAZMtgA 1 1  235 0 431.2kb 431.2kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000003 iW68NzFyTv-CCAg3Rsfj4A 1 1  265 0   494kb   494kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000004 NaIa-gUjShKpEHzNcAxA3w 1 1  234 0 451.7kb 451.7kb
yellow open .ds-metricbeat-8.1.3-2022.04.26-000005 lDhzxqtdR8miPnqvsf7HDQ 1 1  271 0 595.8kb 595.8kb

That is, for a little while, the first writer (metricbeat version 8.1.2) wrote documents, and then it stopped and was upgraded and replaced by the second writer (metricbeat version 8.1.3). Each of those writers uses a versioned datastream (metricbeat-8.1.2 and metricbeat-8.1.3 respectively).

The problem is easy to see -- notice that we're getting a new empty (0 document) .ds-metricbeat-8.1.2-[...] index every minute, and that we'll keep accumulating them forever. ILM doesn't have any special logic around empty indices like this, i.e. empty indices are treated the same as non-empty indices as far as ILM is concerned.

In this simple scenario, we know that the metricbeat-8.1.2 datastream is done now, and can be retired. However, there's no particular point in time where Elasticsearch itself or some individual metricbeat process could know that. I'm using just one metricbeat writer, but I could be running one on each of N hosts. No one writer process in this scenario knows that it is special and should "turn off the lights when it's done".

To further complicate matters, maybe I have a weekly batch process which runs on Sunday evening and writes some logs after a long quiet period (and its logs are still being monitored by metricbeat version 8.1.2) -- when it does so, we could end up with more data flowing into the current metricbeat-8.1.2 write index. Let's call that the "sporadic writer" case. In that case, we'd end up with periods of no data flowing in and the accumulation of empty indices, followed by one or more non-empty indices, and then back to accumulating empty indices again.

ILM doesn't know whether there's a sporadic writer out there or not, and ignorant of whether more documents will be coming one day, it dutifully executes the policy, rolling over the now defunct metricbeat-8.1.2 datastream every minute and leaving a trail of empty .ds-metricbeat-8.1.2-[...] indices in its wake.

An additional note: my illustration here is datastream specific, but in broad strokes this issue also exists in a pre-datastream indexing strategy built around aliases. It would be most excellent if we were able to solve both the datastream-based and alias-based versions of this empty index problem (but, reserving a degree of freedom, I don't think the solution must necessarily be precisely the same in both cases).
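
(For reference, a quick way to watch the empty backing indices accumulate is the _cat API with just the index name and document count columns -- a minimal sketch using the same pattern as above:)

GET _cat/indices/.ds-metricbeat-*?h=index,docs.count&s=index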

@joegallo added the >enhancement, WIP, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), team-discuss, and needs:triage (Requires assignment of a team area label) labels Apr 26, 2022
@elasticmachine added the Team:Data Management (Meta label for data/management team) label Apr 26, 2022
@elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)


joegallo commented Apr 26, 2022

Original scenario

green open .ds-metricbeat-8.1.2-2022.05.01-000015 CCsJepDKRemPqXzWWKlYPg 1 1    437735 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.02-000016 HnmRnMB3R1WZMHj38Mrj7g 1 1    434799 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.03-000017 D2DL7n97QsiNX7mFbghYjA 1 1    432417 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.04-000018 A9zEIESPSAaFJ8_siF-tmw 1 1    440193 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.05-000019 wHCtcxm9QLeEnWCn8V9zWg 1 1    434248 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.06-000020 3RA8H5AISemuGbhKYBbcTw 1 1    186666 0   1.7gb   0.8gb
green open .ds-metricbeat-8.1.2-2022.05.07-000021 6cyuaHx1SY-NV1xHQ-RAGQ 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.08-000022 MZqmoyw_QmWiY5sEpLnJiw 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.09-000023 3Js4NRNJSkGcXtao6TxUYQ 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.10-000024 8_vBYrnDTEmibFv0mNIxwg 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.11-000025 G7K0euo3TuuInSt_I4JsgA 1 1         0 0    225b    225b
green open .ds-metricbeat-8.1.2-2022.05.12-000026 UxaTjfZ8QBuGVJWM-9gENQ 1 1         0 0    225b    225b

A writer (metricbeat) is in place at normal load from 5/1 to 5/5, then it cuts out partway through 5/6. After that, because of daily rollovers, empty indices begin to accumulate (one per day).

Solution 1: Don't roll over empty indices (#46161, #85054)

green open .ds-metricbeat-8.1.2-2022.05.01-000015 CCsJepDKRemPqXzWWKlYPg 1 1    437735 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.02-000016 HnmRnMB3R1WZMHj38Mrj7g 1 1    434799 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.03-000017 D2DL7n97QsiNX7mFbghYjA 1 1    432417 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.04-000018 A9zEIESPSAaFJ8_siF-tmw 1 1    440193 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.05-000019 wHCtcxm9QLeEnWCn8V9zWg 1 1    434248 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.06-000020 3RA8H5AISemuGbhKYBbcTw 1 1    186666 0   1.7gb   0.8gb
green open .ds-metricbeat-8.1.2-2022.05.07-000021 6cyuaHx1SY-NV1xHQ-RAGQ 1 1         0 0    225b    225b # but today is the 12th

Pros:

  • Straightforward to explain and to implement
  • Low risk since no indices are being deleted
  • Can be implemented as a new option in the _rollover API itself

Cons:

  • Doesn't handle sporadic writers very well -- a sporadic writer, writing new data on the 18th of May, would write a few minutes' worth of data to .ds-metricbeat-8.1.2-2022.05.07-000021, and then that index would be rolled over the next time ILM runs, because now it's not empty anymore.
  • Unless users are mindful to _rollover when they put a new template, the .ds-metricbeat-8.1.2-2022.05.07-000021 index could have 'old' settings, mappings, etc.
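
To make Solution 1 concrete, here's a sketch of how the guard could look if expressed as an extra rollover condition in the policy -- the min_docs name below mirrors the min_* conditions that later shipped in newer Elasticsearch versions, but treat it as illustrative of the idea rather than a settled design:

PUT _ilm/policy/metricbeat
{
  "policy" : {
    "phases" : {
      "hot" : {
        "actions" : {
          "rollover" : {
            "max_size" : "50gb",
            "max_age" : "1m",
            "min_docs" : 1
          }
        }
      }
    }
  }
}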

Solution 2: Delete empty indices after they've been rolled over (#73349)

green open .ds-metricbeat-8.1.2-2022.05.01-000015 CCsJepDKRemPqXzWWKlYPg 1 1    437735 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.02-000016 HnmRnMB3R1WZMHj38Mrj7g 1 1    434799 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.03-000017 D2DL7n97QsiNX7mFbghYjA 1 1    432417 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.04-000018 A9zEIESPSAaFJ8_siF-tmw 1 1    440193 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.05-000019 wHCtcxm9QLeEnWCn8V9zWg 1 1    434248 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.06-000020 3RA8H5AISemuGbhKYBbcTw 1 1    186666 0   1.7gb   0.8gb
green open .ds-metricbeat-8.1.2-2022.05.12-000026 UxaTjfZ8QBuGVJWM-9gENQ 1 1         0 0    225b    225b # empty indices are created (and deleted) each day

Pros:

  • Still seems pretty straightforward to implement, though a bit less obvious to explain
  • Handles sporadic writers well, and doesn't fall behind in terms of template changes (regardless of whether _rollover is called manually)

Cons:

  • Creating and then deleting these indices seems a bit noisy
  • A little risky -- need to be sure that we're only deleting indices that are empty
  • Snapshotting gets a little touchy (can't delete during a snapshot)
  • ILM permissions annoyances -- the policy now also needs delete permissions rather than just create/rollover permissions
  • Major con: won't work for TSDB indices because they may be written to even after they are no longer the write index
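
As a manual approximation of what Solution 2 would automate (a sketch, reusing an index name from the original scenario above): confirm the rolled-over backing index really is empty, then delete it. Only non-write, genuinely empty backing indices should ever be candidates.

GET .ds-metricbeat-8.1.2-2022.05.07-000021/_count

DELETE .ds-metricbeat-8.1.2-2022.05.07-000021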

Solution 3: Lazy rollover for datastreams

green open .ds-metricbeat-8.1.2-2022.05.01-000015 CCsJepDKRemPqXzWWKlYPg 1 1    437735 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.02-000016 HnmRnMB3R1WZMHj38Mrj7g 1 1    434799 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.03-000017 D2DL7n97QsiNX7mFbghYjA 1 1    432417 0   2.5gb   1.2gb
green open .ds-metricbeat-8.1.2-2022.05.04-000018 A9zEIESPSAaFJ8_siF-tmw 1 1    440193 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.05-000019 wHCtcxm9QLeEnWCn8V9zWg 1 1    434248 0   2.6gb   1.3gb
green open .ds-metricbeat-8.1.2-2022.05.06-000020 3RA8H5AISemuGbhKYBbcTw 1 1    186666 0   1.7gb   0.8gb

If datastreams rolled over lazily (i.e. on the next write) then no additional backing indices would be created for the datastream in the above scenario, because there wouldn't be any writes to it.

Pros:

  • Handles sporadic writers well, and doesn't fall behind in terms of template changes (regardless of whether _rollover is called manually)

Cons:

  • Only solves the problem for datastreams; no change for indices behind an alias
  • Overcomplicates the datastream model:
    • We go from always having a write index to sometimes having a write index (breaking API change?)
    • TSDB complications around time series start and end times
  • Latency issues on document writes due to create-on-demand
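
If Solution 3 were surfaced through the existing rollover API, one could imagine marking a rollover as deferred until the next write. The lazy parameter below is illustrative only (newer Elasticsearch releases expose something along these lines, but treat the exact syntax as an assumption here):

POST metricbeat-8.1.2/_rollover?lazy=true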


joegallo commented Apr 28, 2022

In the same neighborhood as this, there's a problem with policies that are only size based and have no time component (or, in a min_* world, anything with a min_size/min_docs) -- imagine that filebeat-8.1.3 writes 10g per day, and I roll over at 10g. What happens if I upgrade to filebeat-8.1.4 mid-morning, and only 500MB has been written into the last filebeat-8.1.3 index? With a time based rollover, nothing interesting happens -- everything works as expected. But with a size based policy at 10g, or a time based policy with an associated min_size of 1g, my "tailed off" last 500MB of filebeat-8.1.3 data would never roll over.

The infinite empty rollover problem and the never-rolled-over "tailed off" data problem aren't quite two sides of the same coin, but they are related.

edit: Note, the workaround/solution to a never-rolled-over "tailed off" index is quite straightforward -- manually hit the _rollover API for the associated datastream or alias.
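
For the scenario described in this comment, that would be (using the data stream name from above):

POST filebeat-8.1.3/_rollover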


ruflin commented Nov 30, 2022

Just wanted to open an issue around this, really great to have this fixed! Is there any magic available to get rid of all the empty indices that are already created?

Update: Wrote my own quick python script and just deleted 1200 empty indices ... 🥳
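
(Since there was no built-in cleanup, a console-level sketch of the same idea -- not the actual script -- is to list backing indices with a zero document count and then delete each non-write empty index individually; the index name below is just one of the empty ones from the original description:)

GET _cat/indices/.ds-metricbeat-*?h=index,docs.count&s=index

DELETE .ds-metricbeat-8.1.2-2022.04.26-000002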
