[fix][broker][branch-3.0] Fail fast if the extensible load manager failed to start (#23297) #23302

Merged
lhotari merged 1 commit into apache:branch-3.0 on Sep 14, 2024

Conversation

lhotari
Member

@lhotari lhotari commented Sep 13, 2024

Cherry-picked from commit fc60ec0

Motivation

Backporting PR #23297 to branch-3.0.

Other context

There are some test failures and problems that show up when running LoadManagerFailFastTest.
Could someone who understands the problems take over this PR?

In LoadManagerFailFastTest.testServiceUnitStateChannelFailure, there's a continuous loop of "Failed to get the channel owner" exceptions:

2024-09-13T07:23:58,402 - WARN  - [pulsar-load-manager-67-1:ExtensibleLoadManagerImpl] - The broker:localhost:56523 failed to set the role. Retrying 10 th ...
java.lang.RuntimeException: Failed to get the channel owner.
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.isChannelOwner(ServiceUnitStateChannelImpl.java:467) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.playLeader(ExtensibleLoadManagerImpl.java:793) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.lambda$start$11(ExtensibleLoadManagerImpl.java:334) ~[classes/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.113.Final.jar:4.1.113.Final]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Invalid channel state:Closed
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096) ~[?:?]
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.isChannelOwner(ServiceUnitStateChannelImpl.java:462) ~[classes/:?]
	... 9 more
Caused by: java.lang.IllegalStateException: Invalid channel state:Closed
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.getChannelOwnerAsync(ServiceUnitStateChannelImpl.java:441) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.isChannelOwnerAsync(ServiceUnitStateChannelImpl.java:449) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.isChannelOwner(ServiceUnitStateChannelImpl.java:462) ~[classes/:?]
	... 9 more
2024-09-13T07:23:58,503 - ERROR - [pulsar-load-manager-67-1:ServiceUnitStateChannelImpl] - Failed to get the channel owner.
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Invalid channel state:Closed
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096) ~[?:?]
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.isChannelOwner(ServiceUnitStateChannelImpl.java:462) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.playLeader(ExtensibleLoadManagerImpl.java:793) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.lambda$start$11(ExtensibleLoadManagerImpl.java:334) ~[classes/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.113.Final.jar:4.1.113.Final]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.lang.IllegalStateException: Invalid channel state:Closed
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.getChannelOwnerAsync(ServiceUnitStateChannelImpl.java:441) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.isChannelOwnerAsync(ServiceUnitStateChannelImpl.java:449) ~[classes/:?]
	... 10 more
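
The log above shows the role-setting task being retried ("Retrying 10 th ...") even though the root cause, `IllegalStateException: Invalid channel state:Closed`, cannot recover on its own. As an illustration only, here is a minimal sketch of a retry loop that stops as soon as it sees such a terminal cause. This is not the change made in this PR or in #23297; the class `FailFastRoleRetry`, its methods, and the rule "any `IllegalStateException` in the cause chain is terminal" are all assumptions made for the example.

```java
import java.util.concurrent.ExecutionException;

// Hypothetical sketch: none of these names exist in Pulsar.
final class FailFastRoleRetry {

    /** How the retry loop ended. */
    public enum Outcome { SUCCEEDED, GAVE_UP_TERMINAL, GAVE_UP_EXHAUSTED }

    /** Outcome plus the number of attempts that were actually made. */
    public static final class Result {
        public final Outcome outcome;
        public final int attempts;

        Result(Outcome outcome, int attempts) {
            this.outcome = outcome;
            this.attempts = attempts;
        }
    }

    /**
     * Walks the cause chain. Assumption for this sketch: an
     * IllegalStateException anywhere in the chain means the failure
     * is terminal and retrying is pointless.
     */
    static boolean isTerminal(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IllegalStateException) {
                return true;
            }
        }
        return false;
    }

    /** Runs the task up to maxAttempts times, stopping early on a terminal failure. */
    public static Result runWithRetry(Runnable task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.run();
                return new Result(Outcome.SUCCEEDED, attempt);
            } catch (RuntimeException e) {
                if (isTerminal(e)) {
                    // Fail fast: do not burn the remaining attempts.
                    return new Result(Outcome.GAVE_UP_TERMINAL, attempt);
                }
                // Otherwise treat the failure as transient and try again.
            }
        }
        return new Result(Outcome.GAVE_UP_EXHAUSTED, maxAttempts);
    }

    public static void main(String[] args) {
        // Mimics the exception nesting seen in the log above.
        Runnable closedChannel = () -> {
            throw new RuntimeException("Failed to get the channel owner.",
                    new ExecutionException(
                            new IllegalStateException("Invalid channel state:Closed")));
        };
        Result r = runWithRetry(closedChannel, 10);
        System.out.println(r.outcome + " after " + r.attempts + " attempt(s)");
    }
}
```

With the exception nesting from the log, the sketch gives up on the first attempt instead of the tenth. A real implementation would need a more precise test for "terminal" than the exception type alone.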

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added the doc-not-needed (Your PR changes do not impact docs) label Sep 13, 2024
@heesung-sn heesung-sn self-assigned this Sep 13, 2024
@heesung-sn
Contributor

I will work on this tomorrow.

@heesung-sn
Contributor

@lhotari I think some code was not cherry-picked. I fixed it.

@heesung-sn
Contributor

I see that these tests in the flaky suite are constantly failing:


[Flaky tests suite](https://github.com/apache/pulsar/actions/runs/10858650204/job/30138570112#step:8:2677)
Process completed with exit code 1.
[StreamingEntryReaderTests.testCanCancelReadEntryRequestAndResumeReading: pulsar-broker/src/test/java/org/apache/pulsar/broker/service/streamingdispatch/StreamingEntryReaderTests.java#L311](https://github.com/apache/pulsar/pull/23302/files#annotation_26266780631)
Condition with org.apache.pulsar.broker.service.streamingdispatch.StreamingEntryReaderTests was not fulfilled within 10 seconds.
[StreamingEntryReaderTests.testCanReadEntryFromMLedgerWaitingForNewEntry: pulsar-broker/src/test/java/org/apache/pulsar/broker/service/streamingdispatch/StreamingEntryReaderTests.java#L237](https://github.com/apache/pulsar/pull/23302/files#annotation_26266780634)
Condition with org.apache.pulsar.broker.service.streamingdispatch.StreamingEntryReaderTests was not fulfilled within 10 seconds.
[StreamingEntryReaderTests.testCanReadEntryFromMLedgerSizeExceededLimit: pulsar-broker/src/test/java/org/apache/pulsar/broker/service/streamingdispatch/StreamingEntryReaderTests.java#L191](https://github.com/apache/pulsar/pull/23302/files#annotation_26266780635)
expected [2] but found [0]
[StreamingEntryReaderTests.testCanReadEntryFromMLedgerHappyPath: pulsar-broker/src/test/java/org/apache/pulsar/broker/service/streamingdispatch/StreamingEntryReaderTests.java#L135](https://github.com/apache/pulsar/pull/23302/files#annotation_26266780637)
Condition with org.apache.pulsar.broker.service.streamingdispatch.StreamingEntryReaderTests was not fulfilled within 10 seconds.
[StreamingEntryReaderTests.testWillCancelReadAfterExhaustingRetry: pulsar-broker/src/test/java/org/apache/pulsar/broker/service/streamingdispatch/StreamingEntryReaderTests.java#L435](https://github.com/apache/pulsar/pull/23302/files#annotation_26266780638)
expected [8] but found [3]

@heesung-sn
Contributor

Otherwise, LGTM.

@lhotari lhotari merged commit 6d8b15d into apache:branch-3.0 Sep 14, 2024
44 of 45 checks passed
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 19, 2024
…iled to start (apache#23297) (apache#23302)

Co-authored-by: Yunze Xu <xyzinfernity@163.com>
(cherry picked from commit 6d8b15d)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 19, 2024
…iled to start (apache#23297) (apache#23302)

Co-authored-by: Yunze Xu <xyzinfernity@163.com>
(cherry picked from commit 6d8b15d)
Labels: doc-not-needed (Your PR changes do not impact docs), ready-to-test
3 participants