
Watcher: Fix race condition when reloading watches #33157

Merged

Conversation

spinscale
Contributor

@spinscale spinscale commented Aug 27, 2018

The current watcher implementation had two issues on reload that could
lead to existing watches not being properly cleared out.

The first fix ensures that when `TriggerService.start()` is called, the
trigger engine implementations remove the current watches instead of
adding to the existing ones in `TickerScheduleTriggerEngine.start()`.
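
A minimal sketch of that idea, using simplified stand-in classes rather than the actual Elasticsearch code: `start()` builds a fresh schedule map and swaps it in, instead of merging the incoming watches into the existing map with `putAll`, so entries from a previous run cannot survive a reload.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for the real watcher classes, for illustration only.
class Watch {
    private final String id;
    Watch(String id) { this.id = id; }
    String id() { return id; }
}

class TickerScheduleTriggerEngineSketch {
    // volatile so the ticker thread always sees the most recent map
    private volatile Map<String, Watch> schedules = new ConcurrentHashMap<>();

    public void start(Collection<Watch> watches) {
        // Build a fresh map from the incoming watches ...
        Map<String, Watch> newSchedules = new ConcurrentHashMap<>();
        for (Watch watch : watches) {
            newSchedules.put(watch.id(), watch);
        }
        // ... and replace the old map instead of calling schedules.putAll(...),
        // so watches from a previous start() cannot linger after a reload.
        this.schedules = newSchedules;
    }
}
```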

The second fix is a bit more subtle, as the underlying issue is rooted
in concurrent code.

When `WatcherService.reload()` is called, it in turn calls
`WatcherService.reloadInner()`, which is synchronized. In the reload
method we cleared out existing watches and executions. However, there
was still a small window of time: when two cluster states arrived in
relatively quick succession, timing could cause the second clearing to
happen before the trigger engine was started for the first one. This
could lead to `TriggerEngine.start(Collection<Watch> watches)` being
called twice with different sets of watches, without the existing ones
being cleared out in between, resulting in the execution of watches on
this node that should not be executed.
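
One way to close such a window is to tag every reload with a generation counter and let only the most recent reload actually start the trigger engine. The sketch below is illustrative only; the names are hypothetical and not the exact code of this change.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; class and method names are hypothetical.
class WatcherReloadSketch {
    private final AtomicLong reloadGeneration = new AtomicLong();

    void reload(Object clusterState) {
        long generation = reloadGeneration.incrementAndGet();
        // clear out existing watches and executions here ...
        reloadInner(clusterState, generation);
    }

    private synchronized void reloadInner(Object clusterState, long generation) {
        Collection<Object> watches = loadWatches(clusterState);
        // If another reload started in the meantime, do not start the trigger
        // engine with this (now stale) set of watches; the newer reload will.
        if (generation == reloadGeneration.get()) {
            startTriggerEngine(watches);
        }
    }

    private Collection<Object> loadWatches(Object clusterState) {
        return Collections.emptyList();
    }

    private void startTriggerEngine(Collection<Object> watches) {
        // hand the given watches to the trigger engine
    }
}
```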

Also, there were two minor fixes:

1. If the node is not a data node, we forgot to set the status to
STARTING when watcher is being started. This should not be a big issue,
because a non-data node does not spend a lot of time loading, as there
are no watches which need loading.
2. If a new cluster state came in during a reload, we had two checks in
place to abort loading the current one: the first one before we load all
the watches of the local node, and the second before watcher starts
with those new watches. It turned out that the first check was not
returning, which meant we always tried to load all the watches and then
failed on the second check. This has been fixed here (see the sketch
after this list).
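
A hedged sketch of the second minor fix, with hypothetical names rather than the actual PR code: the point is simply that the first staleness check must return, otherwise the expensive load always runs and only the second check aborts.

```java
// Illustrative sketch only; names are hypothetical, not the actual PR code.
class ClusterStateCheckSketch {
    // Updated whenever a newer cluster state arrives on another thread.
    private volatile long latestClusterStateVersion;

    void reload(long versionBeingProcessed, Object clusterState) {
        // First check: a newer cluster state already came in, so abort early.
        // Before the fix this check did not return, so the load below always ran.
        if (versionBeingProcessed < latestClusterStateVersion) {
            return;
        }

        Object watches = loadAllLocalWatches(clusterState);

        // Second check, right before starting watcher with the loaded watches.
        if (versionBeingProcessed < latestClusterStateVersion) {
            return;
        }
        start(watches);
    }

    private Object loadAllLocalWatches(Object clusterState) {
        return new Object();
    }

    private void start(Object watches) {
        // start watcher with the loaded watches
    }
}
```
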
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

Contributor

@hub-cap hub-cap left a comment


putAll considered evil :P Another good find!

@spinscale spinscale added v6.4.1 and removed v6.4.1 labels Aug 30, 2018
@spinscale
Contributor Author

I removed the 6.4.1 label from this one, as the main issue cannot currently be triggered by the cluster state listener, because it checks the watcher state beforehand and only calls TriggerService.start() when watcher is not started. It still makes sense to fix this in the concrete engine though, so putting it in master and 6.x.

@spinscale spinscale merged commit b6f762d into elastic:master Aug 30, 2018
spinscale added a commit that referenced this pull request Aug 30, 2018
This commit ensures that when `TriggerService.start()` is called, the
trigger engine implementations remove the current watches instead of
adding to the existing ones in `TickerScheduleTriggerEngine.start()`.

There are two additional minor fixes, where the result remains the same but less code gets executed.

1. If the node is not a data node, we forgot to set the status to
STARTING when watcher is being started. This should not be a big issue,
because a non-data node does not spend a lot of time loading, as there
are no watches which need loading.
2. If a new cluster state came in during a reload, we had two checks in
place to abort loading the current one: the first one before we load all
the watches of the local node, and the second before watcher starts
with those new watches. It turned out that the first check was not
returning, which meant we always tried to load all the watches and then
failed on the second check. This has been fixed here.
dnhatn added a commit that referenced this pull request Sep 1, 2018
* 6.x:
  Mute test watcher usage stats output
  [Rollup] Fix FullClusterRestart test
  TEST: Disable soft-deletes in ParentChildTestCase
  TEST: Disable randomized soft-deletes settings
  Integrates soft-deletes into Elasticsearch (#33222)
  drop `index.shard.check_on_startup: fix` (#32279)
  Fix AwaitsFix issue number
  Mute SmokeTestWatcherWithSecurityIT testsi
  [DOCS] Moves ml folder from x-pack/docs to docs (#33248)
  TEST: mute more SmokeTestWatcherWithSecurityIT tests
  [DOCS] Move rollup APIs to docs (#31450)
  [DOCS] Rename X-Pack Commands section (#33005)
  Fixes SecurityIntegTestCase so it always adds at least one alias (#33296)
  TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307)
  MINOR: Remove Dead Code from PathTrie (#33280) (#33306)
  Fix pom for build-tools (#33300)
  Lazy evaluate java9home (#33301)
  SQL: test coverage for JdbcResultSet (#32813)
  Work around to be able to generate eclipse projects (#33295)
  Different handling for security specific errors in the CLI. Fix for #33230 (#33255)
  [ML] Refactor delimited file structure detection (#33233)
  SQL: Support multi-index format as table identifier (#33278)
  Enable forbiddenapis server java9 (#33245)
  [MUTE] SmokeTestWatcherWithSecurityIT flaky tests
  Add region ISO code to GeoIP Ingest plugin (#31669) (#33276)
  Don't be strict for 6.x
  Update serialization versions for custom IndexMetaData backport
  Replace IndexMetaData.Custom with Map-based custom metadata (#32749)
  Painless: Fix Bindings Bug (#33274)
  SQL: prevent duplicate generation for repeated aggs (#33252)
  TEST: Mute testMonitorClusterHealth
  Fix serialization of empty field capabilities response (#33263)
  Fix nested _source retrieval with includes/excludes (#33180)
  [DOCS] TLS file resources are reloadable (#33258)
  Watcher: Ensure TriggerEngine start replaces existing watches (#33157)
  Ignore module-info in jar hell checks (#33011)
  Fix docs build after #33241
  [DOC] Repository GCS ADC not supported (#33238)
  Upgrade to latest Gradle 4.10  (#32801)
  Fix/30904 cluster formation part2 (#32877)
  Move file-based discovery to core (#33241)
  HLRC: add client side RefreshPolicy (#33209)
  [Kerberos] Add unsupported languages for tests (#33253)
  Watcher: Reload properly on remote shard change (#33167)
  Fix classpath security checks for external tests. (#33066)
  [Rollup] Only allow aggregating on multiples of configured interval (#32052)
  Added deprecation warning for rescore in scroll queries (#33070)
  Apply settings filter to get cluster settings API (#33247)
  [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability   (#32743)
  HLRC: create base timed request class (#33216)
  HLRC: Use Optional in validation logic (#33104)
  Painless: Add Bindings (#33042)
spinscale added a commit to spinscale/elasticsearch that referenced this pull request Sep 3, 2018
This commit reverts most of elastic#33157 as it introduces another race
condition and breaks a common case of watcher, when the first watch is
added to the system and the index does not exist yet.

This means that the index will be created, which triggers a reload, but
during this time the put watch operation that triggered it is not yet
indexed, so both processes finish at roughly the same time and should
not overwrite each other but complement each other.

This commit reverts the logic of cleaning out the ticker engine watches
on start-up, as this is already done when execution is paused; execution
is also paused in the cluster state listener, because there we can be
sure that the watches index has not yet been created.

This also adds a new test that starts a one-node cluster and emulates
the case of a non-existing watches index and a watch being added, which
should result in proper execution.

Closes elastic#33320
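
As a rough sketch of why the start-up clearing became redundant (simplified stand-in code, not the real ticker engine): pausing execution already empties the schedule map, and execution is paused from the cluster state listener when the watches index does not exist yet, so `start()` does not have to wipe existing entries itself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of the behaviour described above, not the real code.
class TickerEnginePauseSketch {
    private final Map<String, Object> schedules = new ConcurrentHashMap<>();

    // Called when watcher execution is paused, e.g. by the cluster state
    // listener when the watches index does not exist (yet).
    void pauseExecution() {
        schedules.clear();
    }

    // Because pauseExecution() already empties the map, start() no longer
    // needs to wipe existing entries itself before adding the given watches.
    void start(Map<String, Object> watches) {
        schedules.putAll(watches);
    }
}
```
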
spinscale added a commit that referenced this pull request Sep 21, 2018
…3360)

spinscale added a commit that referenced this pull request Sep 21, 2018
…3360)

kcm pushed a commit that referenced this pull request Oct 30, 2018
…3360)
