
[CI] EsqlActionBreakerIT class failing #118238

Closed
elasticsearchmachine opened this issue Dec 9, 2024 · 9 comments · Fixed by #118370
Assignees
martijnvg

Labels
medium-risk (an open issue or test failure that is a medium risk to future releases)
:StorageEngine/Mapping (the storage related side of mappings)
Team:StorageEngine
>test-failure (triaged test failures from CI)

Comments

elasticsearchmachine (Collaborator) commented Dec 9, 2024

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:esql:internalClusterTest" --tests "org.elasticsearch.xpack.esql.action.EsqlActionBreakerIT.testEvalWhere" -Dtests.seed=56B8063E89FE55A1 -Dtests.locale=uk-UA -Dtests.timezone=Etc/GMT-12 -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: Leftover exchanges ExchangeService{sinks=[OkHnDwl5R6aapn8_w1-FTw:4566/29, OkHnDwl5R6aapn8_w1-FTw:4566/29[n]]} on node node_s3

Issue Reasons:

  • [main] 2 failures in class org.elasticsearch.xpack.esql.action.EsqlActionBreakerIT (1.0% fail rate in 210 executions)
  • [main] 2 failures in step part3 (5.3% fail rate in 38 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (5.3% fail rate in 38 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine added the :Analytics/ES|QL, >test-failure, Team:Analytics, and needs:risk labels Dec 9, 2024
elasticsearchmachine (Collaborator, Author) commented

Pinging @elastic/es-analytical-engine (Team:Analytics)

alex-spies added the medium-risk label and removed the needs:risk label Dec 9, 2024
alex-spies (Contributor) commented

In both build scans, testScriptField is the first test to fail, and I think it may leave the cluster in an inconsistent state because it trips AssertionErrors and thus kills a node. @nik9000 or @dnhatn, do you maybe know of a recent change that could've caused this?

Without digging deeper into this test suite, there may be two issues here:

  1. A potential bug related to usage of the StoredFieldSourceProvider (see stack traces) and the StoredFieldsReader moving between threads; a sketch of that thread-confinement check follows this list.
  2. Killing a node should not lead to leaked "memory" in the breaker.
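
For context, here is a minimal, illustrative Java sketch (hypothetical, not Lucene's actual code) of the thread-confinement pattern behind the AssertingStoredFieldsReader assertion in the traces below: the reader records the thread that acquired it and asserts that every subsequent access happens on that same thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the check in Lucene's AssertingStoredFieldsReader.
class ThreadConfinedReader {
    private final Thread acquiringThread = Thread.currentThread();

    void document(int docId) {
        // Fails like the "acquired in ... consumed in ..." assertion below
        // when a different worker thread consumes the reader.
        assert Thread.currentThread() == acquiringThread
            : "acquired in " + acquiringThread + " but consumed in " + Thread.currentThread();
        // ... load stored fields for docId ...
    }

    public static void main(String[] args) throws Exception {
        ThreadConfinedReader reader = new ThreadConfinedReader(); // acquired on main
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // Surfaces the AssertionError (wrapped in ExecutionException) when run with -ea.
        worker.submit(() -> reader.document(0)).get();
        worker.shutdown();
    }
}
```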
Dec 9 build scan has:

java.lang.AssertionError: unexpected source provider [org.elasticsearch.search.lookup.StoredFieldSourceProvider$LeafStoredFieldSourceProvider@6056ce1c]	
	at __randomizedtesting.SeedInfo.seed([D7E8E0C29F45546C]:0)	
	at org.elasticsearch.search.lookup.StoredFieldSourceProvider.getSource(StoredFieldSourceProvider.java:39)	
	at org.elasticsearch.xpack.esql.plugin.ReinitializingSourceProvider.getSource(ReinitializingSourceProvider.java:37)	
	at org.elasticsearch.search.lookup.LeafSearchLookup.lambda$new$0(LeafSearchLookup.java:40)	
	at org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:107)	
	at org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:127)	
	at org.elasticsearch.script.LongFieldScript$1$1.execute(LongFieldScript.java:29)	
	at org.elasticsearch.script.AbstractFieldScript.runForDoc(AbstractFieldScript.java:159)	
	at org.elasticsearch.index.fielddata.LongScriptDocValues.advanceExact(LongScriptDocValues.java:26)	
	at org.elasticsearch.search.MultiValueMode$6.advanceExact(MultiValueMode.java:534)	
	at org.elasticsearch.index.fielddata.FieldData$17.advanceExact(FieldData.java:620)	
	at org.apache.lucene.search.comparators.LongComparator$LongLeafComparator.getValueForDoc(LongComparator.java:70)	
	at org.apache.lucene.search.comparators.LongComparator$LongLeafComparator.copy(LongComparator.java:95)	
	at org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector.collectAnyHit(TopFieldCollector.java:124)	
	at org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector$1.collect(TopFieldCollector.java:208)	
	at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:277)	
	at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:236)	
	at org.elasticsearch.compute.lucene.LuceneOperator$LuceneScorer.scoreNextRange(LuceneOperator.java:191)	
	at org.elasticsearch.compute.lucene.LuceneTopNSourceOperator.collect(LuceneTopNSourceOperator.java:150)	
	at org.elasticsearch.compute.lucene.LuceneTopNSourceOperator.getCheckedOutput(LuceneTopNSourceOperator.java:130)	
	at org.elasticsearch.compute.lucene.LuceneOperator.getOutput(LuceneOperator.java:116)	
	at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:258)	
	at org.elasticsearch.compute.operator.Driver.run(Driver.java:189)	
	at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:378)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)	
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)	
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)	
	at java.base/java.lang.Thread.run(Thread.java:1575)


Dec 06 build scan:

WARNING: Uncaught exception in thread: Thread[#143,elasticsearch[node_s1][esql_worker][T#1],5,TGRP-EsqlActionBreakerIT]	
java.lang.AssertionError: StoredFieldsReader are only supposed to be consumed in the thread in which they have been acquired. But was acquired in Thread[#154,elasticsearch[node_s1][esql_worker][T#3],5,TGRP-EsqlActionBreakerIT] and consumed in Thread[#143,elasticsearch[node_s1][esql_worker][T#1],5,TGRP-EsqlActionBreakerIT].	
	at __randomizedtesting.SeedInfo.seed([8CD8DB465774CF97]:0)	
	at org.apache.lucene.tests.codecs.asserting.AssertingCodec.assertThread(AssertingCodec.java:44)	
	at org.apache.lucene.tests.codecs.asserting.AssertingStoredFieldsFormat$AssertingStoredFieldsReader.document(AssertingStoredFieldsFormat.java:75)	
	at org.elasticsearch.search.internal.FieldUsageTrackingDirectoryReader$FieldUsageTrackingLeafReader$FieldUsageTrackingStoredFieldsReader.document(FieldUsageTrackingDirectoryReader.java:272)	
	at org.elasticsearch.index.fieldvisitor.StoredFieldLoader$ReaderStoredFieldLoader.advanceTo(StoredFieldLoader.java:205)	
	at org.elasticsearch.search.lookup.StoredFieldSourceProvider$LeafStoredFieldSourceProvider.getSource(StoredFieldSourceProvider.java:59)	
	at org.elasticsearch.search.lookup.StoredFieldSourceProvider.getSource(StoredFieldSourceProvider.java:41)	
	at org.elasticsearch.xpack.esql.plugin.ReinitializingSourceProvider.getSource(ReinitializingSourceProvider.java:37)	
	at org.elasticsearch.search.lookup.LeafSearchLookup.lambda$new$0(LeafSearchLookup.java:40)	
	at org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:107)	
	at org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:127)	
	at org.elasticsearch.script.LongFieldScript$1$1.execute(LongFieldScript.java:29)	
	at org.elasticsearch.script.AbstractFieldScript.runForDoc(AbstractFieldScript.java:159)	
	at org.elasticsearch.index.mapper.LongScriptBlockDocValuesReader.read(LongScriptBlockDocValuesReader.java:69)	
	at org.elasticsearch.index.mapper.LongScriptBlockDocValuesReader.read(LongScriptBlockDocValuesReader.java:65)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator$LoadFromMany.read(ValuesSourceReaderOperator.java:418)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator$LoadFromMany.run(ValuesSourceReaderOperator.java:357)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.process(ValuesSourceReaderOperator.java:158)	
	at org.elasticsearch.compute.operator.AbstractPageMappingOperator.getOutput(AbstractPageMappingOperator.java:76)	
	at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:258)	
	at org.elasticsearch.compute.operator.Driver.run(Driver.java:189)	
	at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:378)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)	
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)	
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)	
	at java.base/java.lang.Thread.run(Thread.java:1575)	
Dec 06, 2024 7:30:28 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException	
WARNING: Uncaught exception in thread: Thread[#154,elasticsearch[node_s1][esql_worker][T#3],5,TGRP-EsqlActionBreakerIT]	
java.lang.AssertionError: unexpected source provider [org.elasticsearch.search.lookup.StoredFieldSourceProvider$LeafStoredFieldSourceProvider@318cd0f0]	
	at __randomizedtesting.SeedInfo.seed([8CD8DB465774CF97]:0)	
	at org.elasticsearch.search.lookup.StoredFieldSourceProvider.getSource(StoredFieldSourceProvider.java:39)	
	at org.elasticsearch.xpack.esql.plugin.ReinitializingSourceProvider.getSource(ReinitializingSourceProvider.java:37)	
	at org.elasticsearch.search.lookup.LeafSearchLookup.lambda$new$0(LeafSearchLookup.java:40)	
	at org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:107)	
	at org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:127)	
	at org.elasticsearch.script.LongFieldScript$1$1.execute(LongFieldScript.java:29)	
	at org.elasticsearch.script.AbstractFieldScript.runForDoc(AbstractFieldScript.java:159)	
	at org.elasticsearch.index.mapper.LongScriptBlockDocValuesReader.read(LongScriptBlockDocValuesReader.java:69)	
	at org.elasticsearch.index.mapper.LongScriptBlockDocValuesReader.read(LongScriptBlockDocValuesReader.java:65)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator$LoadFromMany.read(ValuesSourceReaderOperator.java:418)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator$LoadFromMany.run(ValuesSourceReaderOperator.java:357)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.process(ValuesSourceReaderOperator.java:158)	
	at org.elasticsearch.compute.operator.AbstractPageMappingOperator.getOutput(AbstractPageMappingOperator.java:76)	
	at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:258)	
	at org.elasticsearch.compute.operator.Driver.run(Driver.java:189)	
	at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:378)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)	
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)	
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)	
	at java.base/java.lang.Thread.run(Thread.java:1575)

nik9000 self-assigned this Dec 9, 2024
nik9000 (Member) commented Dec 9, 2024

I recall something about this! I'll see if I can refresh my memory later today.

dnhatn (Member) commented Dec 9, 2024

@martijnvg It seems to be related to the recent changes made to the SourceProvider in #117792. Could you please take a look at this?

martijnvg (Member) commented

I will take a look at this.

martijnvg assigned martijnvg and unassigned nik9000 Dec 10, 2024
martijnvg added the :StorageEngine/Mapping label and removed the :Analytics/ES|QL label Dec 10, 2024
elasticsearchmachine added the Team:StorageEngine label and removed the Team:Analytics label Dec 10, 2024
elasticsearchmachine (Collaborator, Author) commented

Pinging @elastic/es-storage-engine (Team:StorageEngine)

martijnvg (Member) commented

The underlying issue is that the SourceProvider ends up being used by multiple threads concurrently. The SourceProvider abstraction is used by both _search and es|ql, and the two work entirely differently. The assertion that trips was written based on the usage in _search, and it doesn't hold within the context of es|ql: depending on the data partitioning, the same segment may be accessed by different threads concurrently. So the fact that the assertion trips is expected in the context of es|ql.

We noticed issues some time ago and introduced ReinitializingSourceProvider as a workaround to this problem via #117792.
It looks like this workaround isn't working as expected, since each thread should end up with its own SourceProvider (and for example the assertion in StoredFieldSourceProvider shouldn't trip). I think we can fix the workaround.
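
A minimal sketch of the race, assuming simplified stand-in types for Source and SourceProvider (the class shape and the `perThreadProvider` field are modeled on the fix commits referenced below; everything else is illustrative): the ownership check runs against a local read of the field, but the return re-reads the shared field, so a concurrent thread can swap in its own provider in between.

```java
import java.util.function.Supplier;

// Simplified stand-ins for the real Elasticsearch types; illustrative only.
interface Source {}
interface SourceProvider { Source getSource(int doc); }

class RacySourceProviderSketch implements SourceProvider {
    private record PerThread(Thread owner, SourceProvider delegate) {}

    private final Supplier<SourceProvider> factory;
    private volatile PerThread perThreadProvider; // shared across worker threads

    RacySourceProviderSketch(Supplier<SourceProvider> factory) {
        this.factory = factory;
    }

    @Override
    public Source getSource(int doc) {
        Thread current = Thread.currentThread();
        PerThread provider = perThreadProvider;
        if (provider == null || provider.owner() != current) {
            provider = new PerThread(current, factory.get());
            perThreadProvider = provider;
        }
        // BUG (the stale read described in the fix commits below): re-reading
        // the field instead of using the local `provider` means another thread
        // may have replaced `perThreadProvider` since the check above, so this
        // thread can consume a delegate that belongs to a different thread.
        return perThreadProvider.delegate().getSource(doc);
    }
}
```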

Accessing source, and in particular synthetic source, is very inefficient today, both when block loaders fall back to _source and when runtime fields are used. For the former we have an issue (#115394) that describes one solution; we could apply this to runtime fields as well. The upside is that we would move away from SourceProvider, which was designed just for _search.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Dec 10, 2024
The previous fix to ensure that each thread uses its own SourceProvider wasn't good enough.
When multiple threads access `ReinitializingSourceProvider`, the simple thread accounting could still result in the returned `SourceProvider` being used by multiple threads concurrently.

The ReinitializingSourceProvider was introduced via elastic#117792

Closes elastic#118238
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Dec 11, 2024
The previous fix to ensure that each thread uses its own SourceProvider wasn't good enough. The read from the `perThreadProvider` field could be stale and therefore return a previous source provider. Instead, the source provider should be returned from the `provider` local variable.

This change also addresses another issue: sometimes the current docid goes backwards compared to the last-seen docid, which causes issues when the synthetic source provider is used, as doc values can't advance backwards. This change addresses that by returning a new source provider if a backwards docid is detected.

Closes elastic#118238
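
A hedged sketch of the two fixes this commit message describes, reusing the Source/SourceProvider stand-ins from the earlier sketch (only `perThreadProvider` and the `provider` local come from the commit message; other names are illustrative): return the delegate via the local variable instead of re-reading the field, and reinitialize whenever the docid moves backwards, since doc values can only advance forward.

```java
import java.util.function.Supplier;

// Reuses the Source/SourceProvider stand-ins from the earlier sketch.
class FixedSourceProviderSketch implements SourceProvider {
    // Mutable holder so each delegate can track the last doc it served.
    private static final class PerThread {
        final Thread owner;
        final SourceProvider delegate;
        int lastDoc = -1;

        PerThread(Thread owner, SourceProvider delegate) {
            this.owner = owner;
            this.delegate = delegate;
        }
    }

    private final Supplier<SourceProvider> factory;
    private volatile PerThread perThreadProvider;

    FixedSourceProviderSketch(Supplier<SourceProvider> factory) {
        this.factory = factory;
    }

    @Override
    public Source getSource(int doc) {
        Thread current = Thread.currentThread();
        PerThread provider = perThreadProvider;
        // Reinitialize on a foreign thread, and also when the docid goes
        // backwards: doc values can't advance backwards, so a fresh provider
        // is needed to serve an earlier document.
        if (provider == null || provider.owner != current || doc < provider.lastDoc) {
            provider = new PerThread(current, factory.get());
            perThreadProvider = provider;
        }
        provider.lastDoc = doc;
        // Fix for the stale read: always use the local `provider`; never
        // re-read the shared field after the ownership check.
        return provider.delegate.getSource(doc);
    }
}
```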
elasticsearchmachine (Collaborator, Author) commented

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in class org.elasticsearch.xpack.esql.action.EsqlActionBreakerIT (1.0% fail rate in 210 executions)
  • [main] 2 failures in step part3 (5.3% fail rate in 38 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (5.3% fail rate in 38 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Dec 11, 2024
…ticsearch.xpack.esql.action.EsqlActionBreakerIT #118238
elasticsearchmachine pushed a commit that referenced this issue Dec 11, 2024
The previous fix to ensure that each thread uses its own SourceProvider wasn't good enough. The read from the `perThreadProvider` field could be stale and therefore return a previous source provider. Instead, the source provider should be returned from the `provider` local variable.

This change also addresses another issue: sometimes the current docid goes backwards compared to the last-seen docid, which causes issues when the synthetic source provider is used, as doc values can't advance backwards. This change addresses that by returning a new source provider if a backwards docid is detected.

Closes #118238
martijnvg (Member) commented

The issue didn't auto-close; this is fixed by #118370.
