Fix race in clear scroll #31259

jasontedor · 2018-06-12T02:47:02Z

Here is the problem: if two threads are racing and one hits a failure freeing a context while the other succeeds, we can expose the value of the has-failure marker to the succeeding thread before the failing thread has had a chance to set it. This happens if the failing thread counts down the expected number of operations, is then put to sleep by a gentle lullaby from the OS, and the other thread then counts down to zero. Since the failing thread did not get to set the failure marker, the succeeding thread would respond that the clear scroll succeeded, and that makes that thread a liar. This commit addresses this by setting the failure marker before we potentially expose its value to another thread.
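To make the ordering concrete, here is a minimal, self-contained sketch of the fixed pattern. It is not the actual `ClearScrollController` code: the class `ClearScrollSketch`, the `Tracker` type, and its fields are hypothetical stand-ins built on `java.util.concurrent.atomic`, and the real controller responds through a listener rather than storing a result. The point it illustrates is that the failing path sets the failure marker before it counts down, so whichever thread reaches zero is guaranteed to observe the marker.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names are illustrative, not the Elasticsearch API.
final class ClearScrollSketch {

    static final class Tracker {
        private final AtomicInteger expectedOps;
        private final AtomicBoolean hasFailed = new AtomicBoolean(false);
        private final AtomicInteger freedContexts = new AtomicInteger();
        // Stays null until the last expected operation has completed.
        private volatile Boolean succeeded;

        Tracker(int ops) {
            this.expectedOps = new AtomicInteger(ops);
        }

        // Called by a thread that got a response from a node.
        void onFreedContext(boolean freed) {
            if (freed) {
                freedContexts.incrementAndGet();
            }
            if (expectedOps.decrementAndGet() == 0) {
                // Safe to read: every failing thread set the marker
                // before its own count down.
                succeeded = hasFailed.get() == false;
            }
        }

        // Called by a thread whose free-context request failed.
        void onFailedFreedContext() {
            // Set the marker first. Counting down first would let a racing
            // thread reach zero and read the marker before it is set.
            hasFailed.set(true);
            if (expectedOps.decrementAndGet() == 0) {
                succeeded = false;
            }
        }

        Boolean succeeded() {
            return succeeded;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Tracker tracker = new Tracker(2);
        Thread failing = new Thread(tracker::onFailedFreedContext);
        Thread succeeding = new Thread(() -> tracker.onFreedContext(true));
        failing.start();
        succeeding.start();
        failing.join();
        succeeding.join();
        System.out.println("succeeded = " + tracker.succeeded());
    }
}
```

With this ordering, the sketch reports `succeeded = false` regardless of which thread performs the final count down. Swapping the two statements in `onFailedFreedContext` reintroduces the window described above.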

elasticmachine · 2018-06-12T02:47:04Z

Pinging @elastic/es-search-aggs

jasontedor · 2018-06-12T02:47:46Z

This addresses this build failure:

17:06:14    2> jun 11, 2018 3:06:14 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
17:06:14    2> ADVERTENCIA: Uncaught exception in thread: Thread[Thread-92,5,TGRP-ClearScrollControllerTests]
17:06:14  Suite: org.elasticsearch.action.search.ClearScrollControllerTests
17:06:14    2> java.lang.AssertionError
17:06:14    1> [2018-06-11T15:06:14,077][INFO ][o.e.a.s.ClearScrollControllerTests] [testClearScrollIdsWithFailure]: before test
17:06:14    2> 	at __randomizedtesting.SeedInfo.seed([B475D8195532593A]:0)
17:06:14    2> 	at org.junit.Assert.fail(Assert.java:86)
17:06:14    1> [2018-06-11T15:06:14,085][WARN ][o.e.a.s.ClearScrollControllerTests] Clear SC failed on node[{node_2}{x49Meh1-TFaqWs_0u2chCQ}{0.0.0.0}{0.0.0.0:2}]
17:06:14    2> 	at org.junit.Assert.assertTrue(Assert.java:41)
17:06:14    2> 	at org.junit.Assert.assertFalse(Assert.java:64)
17:06:14    2> 	at org.junit.Assert.assertFalse(Assert.java:74)
17:06:14    1> java.lang.IllegalArgumentException: boom
17:06:14    2> 	at org.elasticsearch.action.search.ClearScrollControllerTests$5.onResponse(ClearScrollControllerTests.java:195)
17:06:14    2> 	at org.elasticsearch.action.search.ClearScrollControllerTests$5.onResponse(ClearScrollControllerTests.java:189)
17:06:14    2> 	at org.elasticsearch.action.search.ClearScrollController.onFreedContext(ClearScrollController.java:130)
17:06:14    2> 	at org.elasticsearch.action.search.ClearScrollController.lambda$cleanScrollIds$2(ClearScrollController.java:115)
17:06:14    2> 	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
17:06:14    2> 	at org.elasticsearch.action.search.ClearScrollControllerTests$6.lambda$sendFreeContext$0(ClearScrollControllerTests.java:232)
17:06:14    1> 	at org.elasticsearch.action.search.ClearScrollControllerTests$6.lambda$sendFreeContext$0(ClearScrollControllerTests.java:227) ~[test/:?]
17:06:14    2> 	at java.lang.Thread.run(Thread.java:748)
17:06:14    2> REPRODUCE WITH: ./gradlew :server:test -Dtests.seed=B475D8195532593A -Dtests.class=org.elasticsearch.action.search.ClearScrollControllerTests -Dtests.method="testClearScrollIdsWithFailure" -Dtests.security.manager=true -Dtests.locale=es-CR -Dtests.timezone=America/Boise
17:06:14    1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
17:06:14    1> [2018-06-11T15:06:14,096][INFO ][o.e.a.s.ClearScrollControllerTests] [testClearScrollIdsWithFailure]: after test
17:06:14  ERROR   0.03s J2 | ClearScrollControllerTests.testClearScrollIdsWithFailure <<< FAILURES!
17:06:14     > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=919, name=Thread-92, state=RUNNABLE, group=TGRP-ClearScrollControllerTests]
17:06:14     > 	at __randomizedtesting.SeedInfo.seed([B475D8195532593A:73C3E7A276FA0DD6]:0)
17:06:14    2> NOTE: test params are: codec=Lucene70, sim=RandomSimilarity(queryNorm=true): {}, locale=es-CR, timezone=America/Boise
17:06:14    2> NOTE: Linux 3.16.0-4-amd64 amd64/Oracle Corporation 1.8.0_172 (64-bit)/cpus=16,threads=1,free=355015096,total=524812288
17:06:14     > Caused by: java.lang.AssertionError
17:06:14     > 	at __randomizedtesting.SeedInfo.seed([B475D8195532593A]:0)
17:06:14     > 	at org.elasticsearch.action.search.ClearScrollControllerTests$5.onResponse(ClearScrollControllerTests.java:195)
17:06:14     > 	at org.elasticsearch.action.search.ClearScrollControllerTests$5.onResponse(ClearScrollControllerTests.java:189)
17:06:14     > 	at org.elasticsearch.action.search.ClearScrollController.onFreedContext(ClearScrollController.java:130)
17:06:14     > 	at org.elasticsearch.action.search.ClearScrollController.lambda$cleanScrollIds$2(ClearScrollController.java:115)
17:06:14     > 	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
17:06:14     > 	at org.elasticsearch.action.search.ClearScrollControllerTests$6.lambda$sendFreeContext$0(ClearScrollControllerTests.java:232)
17:06:14     > 	at java.lang.Thread.run(Thread.java:748)
17:06:14    1> [2018-06-11T15:06:14,105][INFO ][o.e.a.s.ClearScrollControllerTests] [testClearAll]: before test
17:06:14    1> [2018-06-11T15:06:14,107][INFO ][o.e.a.s.ClearScrollControllerTests] [testClearAll]: after test
17:06:14    1> [2018-06-11T15:06:14,109][INFO ][o.e.a.s.ClearScrollControllerTests] [testClearScrollIds]: before test
17:06:14    1> [2018-06-11T15:06:14,127][INFO ][o.e.a.s.ClearScrollControllerTests] [testClearScrollIds]: after test
17:06:14  Completed [794/1067] on J2 in 0.06s, 3 tests, 1 error <<< FAILURES!

jpountz

LGTM. Great catch.

jasontedor · 2018-06-12T14:27:47Z

Thanks for reviewing @jpountz.

jasontedor added the review, :Search/Search, v7.0.0, v6.4.0, v6.3.1, and v5.6.11 labels Jun 12, 2018

jasontedor added 2 commits June 11, 2018 22:54

Adjust comment

ec28da9

Adjust comment for clarity again

f35c4a7

jpountz approved these changes Jun 12, 2018

jasontedor merged commit a365435 into elastic:master Jun 12, 2018

jasontedor deleted the clear-scroll-race branch June 12, 2018 14:27

davidkyle added the >bug label Jul 5, 2018

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
