
Watcher: Fix race condition when reloading watches #33157

Merged

Conversation

spinscale
Contributor

@spinscale spinscale commented Aug 27, 2018

The current watcher implementation had two issues on reload that could
lead to existing watches not being properly cleared out.

The first fix ensures that when `TriggerService.start()` is called, the
trigger engine implementations remove the current watches instead of
adding to the existing ones in `TickerScheduleTriggerEngine.start()`.
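
A minimal sketch of that idea, using simplified stand-in classes rather than the actual Elasticsearch code: `start()` builds a fresh schedule map and swaps it in, instead of merging the incoming watches into the existing map with `putAll`, so entries from a previous run cannot survive a reload.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for the real watcher classes, for illustration only.
class Watch {
    private final String id;
    Watch(String id) { this.id = id; }
    String id() { return id; }
}

class TickerScheduleTriggerEngineSketch {
    // volatile so the ticker thread always sees the most recent map
    private volatile Map<String, Watch> schedules = new ConcurrentHashMap<>();

    public void start(Collection<Watch> watches) {
        // Build a fresh map from the incoming watches ...
        Map<String, Watch> newSchedules = new ConcurrentHashMap<>();
        for (Watch watch : watches) {
            newSchedules.put(watch.id(), watch);
        }
        // ... and replace the old map instead of calling schedules.putAll(...),
        // so watches from a previous start() cannot linger after a reload.
        this.schedules = newSchedules;
    }
}
```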

The second fix is a bit more subtle, as the underlying issue is rooted
in concurrent code.

When `WatcherService.reload()` is called, it in turn calls
`WatcherService.reloadInner()`, which is synchronized. In the reload
method we cleared out existing watches and executions. However, there
was still a small window of time: when two cluster states arrived in
relatively quick succession, timing could cause the second clearing to
happen before the trigger engine was started for the first one. This
could lead to `TriggerEngine.start(Collection<Watch> watches)` being
called twice with different sets of watches, without the existing ones
being cleared out in between, resulting in the execution of watches on
this node that should not be executed.
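
One way to close such a window is to tag every reload with a generation counter and let only the most recent reload actually start the trigger engine. The sketch below is illustrative only; the names are hypothetical and not the exact code of this change.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; class and method names are hypothetical.
class WatcherReloadSketch {
    private final AtomicLong reloadGeneration = new AtomicLong();

    void reload(Object clusterState) {
        long generation = reloadGeneration.incrementAndGet();
        // clear out existing watches and executions here ...
        reloadInner(clusterState, generation);
    }

    private synchronized void reloadInner(Object clusterState, long generation) {
        Collection<Object> watches = loadWatches(clusterState);
        // If another reload started in the meantime, do not start the trigger
        // engine with this (now stale) set of watches; the newer reload will.
        if (generation == reloadGeneration.get()) {
            startTriggerEngine(watches);
        }
    }

    private Collection<Object> loadWatches(Object clusterState) {
        return Collections.emptyList();
    }

    private void startTriggerEngine(Collection<Object> watches) {
        // hand the given watches to the trigger engine
    }
}
```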

Also, there were two minor fixes:

1. If the node is not a data node, we forgot to set the status to
STARTING when watcher is being started. This should not be a big issue,
because a non-data node does not spend a lot of time loading, as there
are no watches which need loading.
2. If a new cluster state came in during a reload, we had two checks in
place to abort loading the current one: the first one before we load all
the watches of the local node, and the second before watcher starts
with those new watches. It turned out that the first check was not
returning, which meant we always tried to load all the watches and then
failed on the second check. This has been fixed here (see the sketch
after this list).
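
A hedged sketch of the second minor fix, with hypothetical names rather than the actual PR code: the point is simply that the first staleness check must return, otherwise the expensive load always runs and only the second check aborts.

```java
// Illustrative sketch only; names are hypothetical, not the actual PR code.
class ClusterStateCheckSketch {
    // Updated whenever a newer cluster state arrives on another thread.
    private volatile long latestClusterStateVersion;

    void reload(long versionBeingProcessed, Object clusterState) {
        // First check: a newer cluster state already came in, so abort early.
        // Before the fix this check did not return, so the load below always ran.
        if (versionBeingProcessed < latestClusterStateVersion) {
            return;
        }

        Object watches = loadAllLocalWatches(clusterState);

        // Second check, right before starting watcher with the loaded watches.
        if (versionBeingProcessed < latestClusterStateVersion) {
            return;
        }
        start(watches);
    }

    private Object loadAllLocalWatches(Object clusterState) {
        return new Object();
    }

    private void start(Object watches) {
        // start watcher with the loaded watches
    }
}
```
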
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

Contributor

@hub-cap hub-cap left a comment


putAll considered evil :P Another good find!

@spinscale spinscale added v6.4.1 and removed v6.4.1 labels Aug 30, 2018
@spinscale
Contributor Author

I removed the 6.4.1 label from this one, as the main issue cannot currently be triggered by the cluster state listener, because it checks the watcher state beforehand and only calls TriggerService.start() when watcher is not started. It still makes sense to fix this in the concrete engine though, so putting it in master and 6.x.

@spinscale spinscale merged commit b6f762d into elastic:master Aug 30, 2018
spinscale added a commit that referenced this pull request Aug 30, 2018
This commit ensures that when `TriggerService.start()` is called, the
trigger engine implementations remove the current watches instead of
adding to the existing ones in `TickerScheduleTriggerEngine.start()`.

There are two additional minor fixes, where the result remains the same but less code gets executed.

1. If the node is not a data node, we forgot to set the status to
STARTING when watcher is being started. This should not be a big issue,
because a non-data node does not spend a lot of time loading, as there
are no watches which need loading.
2. If a new cluster state came in during a reload, we had two checks in
place to abort loading the current one: the first one before we load all
the watches of the local node, and the second before watcher starts
with those new watches. It turned out that the first check was not
returning, which meant we always tried to load all the watches and then
failed on the second check. This has been fixed here.
dnhatn added a commit that referenced this pull request Sep 1, 2018
* 6.x:
  Mute test watcher usage stats output
  [Rollup] Fix FullClusterRestart test
  TEST: Disable soft-deletes in ParentChildTestCase
  TEST: Disable randomized soft-deletes settings
  Integrates soft-deletes into Elasticsearch (#33222)
  drop `index.shard.check_on_startup: fix` (#32279)
  Fix AwaitsFix issue number
  Mute SmokeTestWatcherWithSecurityIT testsi
  [DOCS] Moves ml folder from x-pack/docs to docs (#33248)
  TEST: mute more SmokeTestWatcherWithSecurityIT tests
  [DOCS] Move rollup APIs to docs (#31450)
  [DOCS] Rename X-Pack Commands section (#33005)
  Fixes SecurityIntegTestCase so it always adds at least one alias (#33296)
  TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307)
  MINOR: Remove Dead Code from PathTrie (#33280) (#33306)
  Fix pom for build-tools (#33300)
  Lazy evaluate java9home (#33301)
  SQL: test coverage for JdbcResultSet (#32813)
  Work around to be able to generate eclipse projects (#33295)
  Different handling for security specific errors in the CLI. Fix for #33230 (#33255)
  [ML] Refactor delimited file structure detection (#33233)
  SQL: Support multi-index format as table identifier (#33278)
  Enable forbiddenapis server java9 (#33245)
  [MUTE] SmokeTestWatcherWithSecurityIT flaky tests
  Add region ISO code to GeoIP Ingest plugin (#31669) (#33276)
  Don't be strict for 6.x
  Update serialization versions for custom IndexMetaData backport
  Replace IndexMetaData.Custom with Map-based custom metadata (#32749)
  Painless: Fix Bindings Bug (#33274)
  SQL: prevent duplicate generation for repeated aggs (#33252)
  TEST: Mute testMonitorClusterHealth
  Fix serialization of empty field capabilities response (#33263)
  Fix nested _source retrieval with includes/excludes (#33180)
  [DOCS] TLS file resources are reloadable (#33258)
  Watcher: Ensure TriggerEngine start replaces existing watches (#33157)
  Ignore module-info in jar hell checks (#33011)
  Fix docs build after #33241
  [DOC] Repository GCS ADC not supported (#33238)
  Upgrade to latest Gradle 4.10  (#32801)
  Fix/30904 cluster formation part2 (#32877)
  Move file-based discovery to core (#33241)
  HLRC: add client side RefreshPolicy (#33209)
  [Kerberos] Add unsupported languages for tests (#33253)
  Watcher: Reload properly on remote shard change (#33167)
  Fix classpath security checks for external tests. (#33066)
  [Rollup] Only allow aggregating on multiples of configured interval (#32052)
  Added deprecation warning for rescore in scroll queries (#33070)
  Apply settings filter to get cluster settings API (#33247)
  [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability   (#32743)
  HLRC: create base timed request class (#33216)
  HLRC: Use Optional in validation logic (#33104)
  Painless: Add Bindings (#33042)
spinscale added a commit to spinscale/elasticsearch that referenced this pull request Sep 3, 2018
This commit reverts most of elastic#33157 as it introduces another race
condition and breaks a common case of watcher, when the first watch is
added to the system and the index does not exist yet.

This means that the index will be created, which triggers a reload, but
during this time the put watch operation that triggered it is not yet
indexed, so both processes finish at roughly the same time and should
not overwrite each other but complement each other.

This commit reverts the logic of cleaning out the ticker engine watches
on start-up, as this is already done when execution is paused; execution
is also paused in the cluster state listener, because there we can be
sure that the watches index has not yet been created.

This also adds a new test that starts a one-node cluster and emulates
the case of a non-existing watches index and a watch being added, which
should result in proper execution.

Closes elastic#33320
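
As a rough sketch of why the start-up clearing became redundant (simplified stand-in code, not the real ticker engine): pausing execution already empties the schedule map, and execution is paused from the cluster state listener when the watches index does not exist yet, so `start()` does not have to wipe existing entries itself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of the behaviour described above, not the real code.
class TickerEnginePauseSketch {
    private final Map<String, Object> schedules = new ConcurrentHashMap<>();

    // Called when watcher execution is paused, e.g. by the cluster state
    // listener when the watches index does not exist (yet).
    void pauseExecution() {
        schedules.clear();
    }

    // Because pauseExecution() already empties the map, start() no longer
    // needs to wipe existing entries itself before adding the given watches.
    void start(Map<String, Object> watches) {
        schedules.putAll(watches);
    }
}
```
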
spinscale added a commit that referenced this pull request Sep 21, 2018
…3360)

spinscale added a commit that referenced this pull request Sep 21, 2018
…3360)

kcm pushed a commit that referenced this pull request Oct 30, 2018
…3360)
