
[CI] EsqlActionBreakerIT class failing #118238

Closed
elasticsearchmachine opened this issue Dec 9, 2024 · 9 comments · Fixed by #118370
Assignees
martijnvg

Labels
medium-risk (an open issue or test failure that is a medium risk to future releases)
:StorageEngine/Mapping (the storage related side of mappings)
Team:StorageEngine
>test-failure (triaged test failures from CI)

Comments

elasticsearchmachine (Collaborator) commented Dec 9, 2024

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:esql:internalClusterTest" --tests "org.elasticsearch.xpack.esql.action.EsqlActionBreakerIT.testEvalWhere" -Dtests.seed=56B8063E89FE55A1 -Dtests.locale=uk-UA -Dtests.timezone=Etc/GMT-12 -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: Leftover exchanges ExchangeService{sinks=[OkHnDwl5R6aapn8_w1-FTw:4566/29, OkHnDwl5R6aapn8_w1-FTw:4566/29[n]]} on node node_s3

Issue Reasons:

  • [main] 2 failures in class org.elasticsearch.xpack.esql.action.EsqlActionBreakerIT (1.0% fail rate in 210 executions)
  • [main] 2 failures in step part3 (5.3% fail rate in 38 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (5.3% fail rate in 38 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine added the :Analytics/ES|QL, >test-failure, Team:Analytics, and needs:risk labels Dec 9, 2024
elasticsearchmachine (Collaborator, Author) commented

Pinging @elastic/es-analytical-engine (Team:Analytics)

alex-spies added the medium-risk label and removed the needs:risk label Dec 9, 2024
alex-spies (Contributor) commented

In both build scans, testScriptField is the first test to fail, and I think it may leave the cluster in an inconsistent state because it trips AssertionErrors and thus kills a node. @nik9000 or @dnhatn, do you maybe know of a recent change that could've caused this?

Without digging deeper into this test suite, there may be two issues here:

  1. A potential bug related to usage of the StoredFieldSourceProvider (see stack traces) and the StoredFieldsReader moving between threads; a sketch of that thread-confinement check follows this list.
  2. Killing a node should not lead to leaked "memory" in the breaker.
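
For context, here is a minimal, illustrative Java sketch (hypothetical, not Lucene's actual code) of the thread-confinement pattern behind the AssertingStoredFieldsReader assertion in the traces below: the reader records the thread that acquired it and asserts that every subsequent access happens on that same thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the check in Lucene's AssertingStoredFieldsReader.
class ThreadConfinedReader {
    private final Thread acquiringThread = Thread.currentThread();

    void document(int docId) {
        // Fails like the "acquired in ... consumed in ..." assertion below
        // when a different worker thread consumes the reader.
        assert Thread.currentThread() == acquiringThread
            : "acquired in " + acquiringThread + " but consumed in " + Thread.currentThread();
        // ... load stored fields for docId ...
    }

    public static void main(String[] args) throws Exception {
        ThreadConfinedReader reader = new ThreadConfinedReader(); // acquired on main
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // Surfaces the AssertionError (wrapped in ExecutionException) when run with -ea.
        worker.submit(() -> reader.document(0)).get();
        worker.shutdown();
    }
}
```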
Dec 9 build scan has:

java.lang.AssertionError: unexpected source provider [org.elasticsearch.search.lookup.StoredFieldSourceProvider$LeafStoredFieldSourceProvider@6056ce1c]	
	at __randomizedtesting.SeedInfo.seed([D7E8E0C29F45546C]:0)	
	at org.elasticsearch.search.lookup.StoredFieldSourceProvider.getSource(StoredFieldSourceProvider.java:39)	
	at org.elasticsearch.xpack.esql.plugin.ReinitializingSourceProvider.getSource(ReinitializingSourceProvider.java:37)	
	at org.elasticsearch.search.lookup.LeafSearchLookup.lambda$new$0(LeafSearchLookup.java:40)	
	at org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:107)	
	at org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:127)	
	at org.elasticsearch.script.LongFieldScript$1$1.execute(LongFieldScript.java:29)	
	at org.elasticsearch.script.AbstractFieldScript.runForDoc(AbstractFieldScript.java:159)	
	at org.elasticsearch.index.fielddata.LongScriptDocValues.advanceExact(LongScriptDocValues.java:26)	
	at org.elasticsearch.search.MultiValueMode$6.advanceExact(MultiValueMode.java:534)	
	at org.elasticsearch.index.fielddata.FieldData$17.advanceExact(FieldData.java:620)	
	at org.apache.lucene.search.comparators.LongComparator$LongLeafComparator.getValueForDoc(LongComparator.java:70)	
	at org.apache.lucene.search.comparators.LongComparator$LongLeafComparator.copy(LongComparator.java:95)	
	at org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector.collectAnyHit(TopFieldCollector.java:124)	
	at org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector$1.collect(TopFieldCollector.java:208)	
	at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:277)	
	at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:236)	
	at org.elasticsearch.compute.lucene.LuceneOperator$LuceneScorer.scoreNextRange(LuceneOperator.java:191)	
	at org.elasticsearch.compute.lucene.LuceneTopNSourceOperator.collect(LuceneTopNSourceOperator.java:150)	
	at org.elasticsearch.compute.lucene.LuceneTopNSourceOperator.getCheckedOutput(LuceneTopNSourceOperator.java:130)	
	at org.elasticsearch.compute.lucene.LuceneOperator.getOutput(LuceneOperator.java:116)	
	at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:258)	
	at org.elasticsearch.compute.operator.Driver.run(Driver.java:189)	
	at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:378)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)	
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)	
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)	
	at java.base/java.lang.Thread.run(Thread.java:1575)


Dec 06 build scan:

WARNING: Uncaught exception in thread: Thread[#143,elasticsearch[node_s1][esql_worker][T#1],5,TGRP-EsqlActionBreakerIT]	
java.lang.AssertionError: StoredFieldsReader are only supposed to be consumed in the thread in which they have been acquired. But was acquired in Thread[#154,elasticsearch[node_s1][esql_worker][T#3],5,TGRP-EsqlActionBreakerIT] and consumed in Thread[#143,elasticsearch[node_s1][esql_worker][T#1],5,TGRP-EsqlActionBreakerIT].	
	at __randomizedtesting.SeedInfo.seed([8CD8DB465774CF97]:0)	
	at org.apache.lucene.tests.codecs.asserting.AssertingCodec.assertThread(AssertingCodec.java:44)	
	at org.apache.lucene.tests.codecs.asserting.AssertingStoredFieldsFormat$AssertingStoredFieldsReader.document(AssertingStoredFieldsFormat.java:75)	
	at org.elasticsearch.search.internal.FieldUsageTrackingDirectoryReader$FieldUsageTrackingLeafReader$FieldUsageTrackingStoredFieldsReader.document(FieldUsageTrackingDirectoryReader.java:272)	
	at org.elasticsearch.index.fieldvisitor.StoredFieldLoader$ReaderStoredFieldLoader.advanceTo(StoredFieldLoader.java:205)	
	at org.elasticsearch.search.lookup.StoredFieldSourceProvider$LeafStoredFieldSourceProvider.getSource(StoredFieldSourceProvider.java:59)	
	at org.elasticsearch.search.lookup.StoredFieldSourceProvider.getSource(StoredFieldSourceProvider.java:41)	
	at org.elasticsearch.xpack.esql.plugin.ReinitializingSourceProvider.getSource(ReinitializingSourceProvider.java:37)	
	at org.elasticsearch.search.lookup.LeafSearchLookup.lambda$new$0(LeafSearchLookup.java:40)	
	at org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:107)	
	at org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:127)	
	at org.elasticsearch.script.LongFieldScript$1$1.execute(LongFieldScript.java:29)	
	at org.elasticsearch.script.AbstractFieldScript.runForDoc(AbstractFieldScript.java:159)	
	at org.elasticsearch.index.mapper.LongScriptBlockDocValuesReader.read(LongScriptBlockDocValuesReader.java:69)	
	at org.elasticsearch.index.mapper.LongScriptBlockDocValuesReader.read(LongScriptBlockDocValuesReader.java:65)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator$LoadFromMany.read(ValuesSourceReaderOperator.java:418)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator$LoadFromMany.run(ValuesSourceReaderOperator.java:357)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.process(ValuesSourceReaderOperator.java:158)	
	at org.elasticsearch.compute.operator.AbstractPageMappingOperator.getOutput(AbstractPageMappingOperator.java:76)	
	at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:258)	
	at org.elasticsearch.compute.operator.Driver.run(Driver.java:189)	
	at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:378)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)	
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)	
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)	
	at java.base/java.lang.Thread.run(Thread.java:1575)	
Dec 06, 2024 7:30:28 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException	
WARNING: Uncaught exception in thread: Thread[#154,elasticsearch[node_s1][esql_worker][T#3],5,TGRP-EsqlActionBreakerIT]	
java.lang.AssertionError: unexpected source provider [org.elasticsearch.search.lookup.StoredFieldSourceProvider$LeafStoredFieldSourceProvider@318cd0f0]	
	at __randomizedtesting.SeedInfo.seed([8CD8DB465774CF97]:0)	
	at org.elasticsearch.search.lookup.StoredFieldSourceProvider.getSource(StoredFieldSourceProvider.java:39)	
	at org.elasticsearch.xpack.esql.plugin.ReinitializingSourceProvider.getSource(ReinitializingSourceProvider.java:37)	
	at org.elasticsearch.search.lookup.LeafSearchLookup.lambda$new$0(LeafSearchLookup.java:40)	
	at org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:107)	
	at org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:127)	
	at org.elasticsearch.script.LongFieldScript$1$1.execute(LongFieldScript.java:29)	
	at org.elasticsearch.script.AbstractFieldScript.runForDoc(AbstractFieldScript.java:159)	
	at org.elasticsearch.index.mapper.LongScriptBlockDocValuesReader.read(LongScriptBlockDocValuesReader.java:69)	
	at org.elasticsearch.index.mapper.LongScriptBlockDocValuesReader.read(LongScriptBlockDocValuesReader.java:65)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator$LoadFromMany.read(ValuesSourceReaderOperator.java:418)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator$LoadFromMany.run(ValuesSourceReaderOperator.java:357)	
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.process(ValuesSourceReaderOperator.java:158)	
	at org.elasticsearch.compute.operator.AbstractPageMappingOperator.getOutput(AbstractPageMappingOperator.java:76)	
	at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:258)	
	at org.elasticsearch.compute.operator.Driver.run(Driver.java:189)	
	at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:378)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)	
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)	
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)	
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)	
	at java.base/java.lang.Thread.run(Thread.java:1575)

nik9000 self-assigned this Dec 9, 2024
nik9000 (Member) commented Dec 9, 2024

I recall something about this! I'll see if I can refresh my memory later today.

dnhatn (Member) commented Dec 9, 2024

@martijnvg It seems to be related to the recent changes made to the SourceProvider in #117792. Could you please take a look at this?

martijnvg (Member) commented

I will take a look at this.

martijnvg assigned martijnvg and unassigned nik9000 Dec 10, 2024
martijnvg added the :StorageEngine/Mapping label and removed the :Analytics/ES|QL label Dec 10, 2024
elasticsearchmachine added the Team:StorageEngine label and removed the Team:Analytics label Dec 10, 2024
elasticsearchmachine (Collaborator, Author) commented

Pinging @elastic/es-storage-engine (Team:StorageEngine)

martijnvg (Member) commented

The underlying issue is that the SourceProvider ends up being used by multiple threads concurrently. The SourceProvider abstraction is used by both _search and es|ql, and the two work entirely differently. The assertion that trips was written based on the usage in _search, and it doesn't hold within the context of es|ql: depending on the data partitioning, the same segment may be accessed by different threads concurrently. So the fact that the assertion trips is expected in the context of es|ql.

We noticed issues some time ago and introduced ReinitializingSourceProvider as a workaround to this problem via #117792.
It looks like this workaround isn't working as expected, since each thread should end up with its own SourceProvider (and for example the assertion in StoredFieldSourceProvider shouldn't trip). I think we can fix the workaround.
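
A minimal sketch of the race, assuming simplified stand-in types for Source and SourceProvider (the class shape and the `perThreadProvider` field are modeled on the fix commits referenced below; everything else is illustrative): the ownership check runs against a local read of the field, but the return re-reads the shared field, so a concurrent thread can swap in its own provider in between.

```java
import java.util.function.Supplier;

// Simplified stand-ins for the real Elasticsearch types; illustrative only.
interface Source {}
interface SourceProvider { Source getSource(int doc); }

class RacySourceProviderSketch implements SourceProvider {
    private record PerThread(Thread owner, SourceProvider delegate) {}

    private final Supplier<SourceProvider> factory;
    private volatile PerThread perThreadProvider; // shared across worker threads

    RacySourceProviderSketch(Supplier<SourceProvider> factory) {
        this.factory = factory;
    }

    @Override
    public Source getSource(int doc) {
        Thread current = Thread.currentThread();
        PerThread provider = perThreadProvider;
        if (provider == null || provider.owner() != current) {
            provider = new PerThread(current, factory.get());
            perThreadProvider = provider;
        }
        // BUG (the stale read described in the fix commits below): re-reading
        // the field instead of using the local `provider` means another thread
        // may have replaced `perThreadProvider` since the check above, so this
        // thread can consume a delegate that belongs to a different thread.
        return perThreadProvider.delegate().getSource(doc);
    }
}
```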

Accessing source, and in particular synthetic source, is very inefficient today, both when block loaders fall back to _source and when runtime fields are used. For the former we have an issue (#115394) that describes one solution; we could apply this to runtime fields as well. The upside is that we would move away from SourceProvider, which was designed just for _search.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Dec 10, 2024
The previous fix to ensure that each thread uses its own SourceProvider wasn't good enough.
When multiple threads access `ReinitializingSourceProvider`, the simple thread accounting could still result in the returned `SourceProvider` being used by multiple threads concurrently.

The ReinitializingSourceProvider was introduced via elastic#117792

Closes elastic#118238
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Dec 11, 2024
The previous fix to ensure that each thread uses its own SourceProvider wasn't good enough. The read from the `perThreadProvider` field could be stale and therefore return a previous source provider. Instead, the source provider should be returned from the `provider` local variable.

This change also addresses another issue: sometimes the current docid goes backwards compared to the last-seen docid, which causes issues when the synthetic source provider is used, as doc values can't advance backwards. This change addresses that by returning a new source provider if a backwards docid is detected.

Closes elastic#118238
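
A hedged sketch of the two fixes this commit message describes, reusing the Source/SourceProvider stand-ins from the earlier sketch (only `perThreadProvider` and the `provider` local come from the commit message; other names are illustrative): return the delegate via the local variable instead of re-reading the field, and reinitialize whenever the docid moves backwards, since doc values can only advance forward.

```java
import java.util.function.Supplier;

// Reuses the Source/SourceProvider stand-ins from the earlier sketch.
class FixedSourceProviderSketch implements SourceProvider {
    // Mutable holder so each delegate can track the last doc it served.
    private static final class PerThread {
        final Thread owner;
        final SourceProvider delegate;
        int lastDoc = -1;

        PerThread(Thread owner, SourceProvider delegate) {
            this.owner = owner;
            this.delegate = delegate;
        }
    }

    private final Supplier<SourceProvider> factory;
    private volatile PerThread perThreadProvider;

    FixedSourceProviderSketch(Supplier<SourceProvider> factory) {
        this.factory = factory;
    }

    @Override
    public Source getSource(int doc) {
        Thread current = Thread.currentThread();
        PerThread provider = perThreadProvider;
        // Reinitialize on a foreign thread, and also when the docid goes
        // backwards: doc values can't advance backwards, so a fresh provider
        // is needed to serve an earlier document.
        if (provider == null || provider.owner != current || doc < provider.lastDoc) {
            provider = new PerThread(current, factory.get());
            perThreadProvider = provider;
        }
        provider.lastDoc = doc;
        // Fix for the stale read: always use the local `provider`; never
        // re-read the shared field after the ownership check.
        return provider.delegate.getSource(doc);
    }
}
```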
elasticsearchmachine (Collaborator, Author) commented

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in class org.elasticsearch.xpack.esql.action.EsqlActionBreakerIT (1.0% fail rate in 210 executions)
  • [main] 2 failures in step part3 (5.3% fail rate in 38 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (5.3% fail rate in 38 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Dec 11, 2024
…ticsearch.xpack.esql.action.EsqlActionBreakerIT #118238
elasticsearchmachine pushed a commit that referenced this issue Dec 11, 2024
The previous fix to ensure that each thread uses its own SourceProvider wasn't good enough. The read from the `perThreadProvider` field could be stale and therefore return a previous source provider. Instead, the source provider should be returned from the `provider` local variable.

This change also addresses another issue: sometimes the current docid goes backwards compared to the last-seen docid, which causes issues when the synthetic source provider is used, as doc values can't advance backwards. This change addresses that by returning a new source provider if a backwards docid is detected.

Closes #118238
martijnvg (Member) commented

The issue didn't auto-close; this is fixed by #118370.
