Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] release orphan replicator after replicator is closed. #21203

Closed
wants to merge 9 commits into from

Conversation

hanmz
Copy link
Contributor

@hanmz hanmz commented Sep 19, 2023

Motivation

When the replicator.producer start failed the method does not exit the loop(checkTopicActiveAndRetryStartProducer) even if the replicator is disconnect. You can reproduce by the test testExitRetryStartProducerAfterReplicatorDisconnect.

Modifications

If the replicator is already disconnected, stop to retryStartProducer.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: hanmz#2

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 19, 2023
@hanmz hanmz changed the title [fix][broker] release orphan replicator after replicator is disconnected. [fix][broker] release orphan replicator after replicator is closed. Sep 19, 2023
@hanmz hanmz requested a review from Jason918 September 19, 2023 11:07
@hanmz hanmz requested a review from poorbarcode September 21, 2023 08:52
@hanmz hanmz requested a review from poorbarcode September 22, 2023 07:00
@@ -116,6 +118,13 @@ public String getRemoteCluster() {
// This method needs to be synchronized with disconnects else if there is a disconnect followed by startProducer
// the end result can be disconnect.
public synchronized void startProducer() {
// This method comes from some actives call and may be call again after disconnect
// so here we will first mark isClosed is false
isClosed = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the stat is set to closed, it can no longer be recovered to no-closed. If the replicator needs to keep working, it shouldn't be set to closed, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally this is the case. But here, we can call the method startProducer again after call the method disconnect.
In fact, this class can be closed and opened again.
In fact, this class can be closed and opened again.

So we should only set the stat of Replicator to closed when it is exactly is closed, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the stat is set to closed, it can no longer be recovered to no-closed. If the replicator needs to keep working, it shouldn't be set to closed, right?

Normally this is the case. But here, we can call the method startProducer again after call the method disconnect.
In fact, this class can be closed and opened again.

Essentially, what we want to do here is to allow the retry thread (checkTopicActiveAndRetryStartProducer) to end normally when the replicator is closed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we add a stat, it must be meaningful, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codelipenghui @Technoboy- What do you think about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, this class can be closed and opened again.

There is already a state [Stopped, Starting, Started, Stopping] is working on this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently no clear indication that this replicator will not be used again after it is disconnect.

Mabey it is not very reasonable to use the close status identifier here, but it can work normally.

I think it is also reasonable to use some other markers as judgment conditions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a definition for this new state isClosed? How is it different from State.Stopped when it is true?

@poorbarcode
Copy link
Contributor

Add a context:

This PR is trying to fix the issue below

  • Close replication task and Start producer of replicator were executed at the same time.
  • Start producer of replicator runs after Close replication task
  • If the producer started failed, retry; if the next start also failed, retry again....loop.

Once the producer creates success, the loop will be stopped. Explain: After the cursor was closed. The method readEntries | readMoreEntries will get a CursorAlreadyClosedException, then the producer will be closed. see https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentReplicator.java#L440

if (exception instanceof CursorAlreadyClosedException) {
    log.error("[{}] Error reading entries because replicator is"
                    + " already deleted and cursor is already closed {}, ({})",
            replicatorId, ctx, exception.getMessage(), exception);
    // replicator is already deleted and cursor is already closed so, producer should also be stopped
    closeProducerAsync();
    return;
}

So this PR only trying to fix the scenario that the producer always creates fails.

@hanmz
Copy link
Contributor Author

hanmz commented Sep 26, 2023

Add a context:

This PR is trying to fix the issue below

  • Close replication task and Start producer of replicator were executed at the same time.
  • Start producer of replicator runs after Close replication task
  • If the producer started failed, retry; if the next start also failed, retry again....loop.

Once the producer creates success, the loop will be stopped. Explain: After the cursor was closed. The method readEntries | readMoreEntries will get a CursorAlreadyClosedException, then the producer will be closed. see https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentReplicator.java#L440

if (exception instanceof CursorAlreadyClosedException) {
    log.error("[{}] Error reading entries because replicator is"
                    + " already deleted and cursor is already closed {}, ({})",
            replicatorId, ctx, exception.getMessage(), exception);
    // replicator is already deleted and cursor is already closed so, producer should also be stopped
    closeProducerAsync();
    return;
}

So this PR only trying to fix the scenario that the producer always creates fails.

YES That's right

@Jason918
Copy link
Contributor

@poorbarcode PTAL.

@Jason918
Copy link
Contributor

Jason918 commented Oct 8, 2023

@hanmz Please check the CI failure.

@hanmz
Copy link
Contributor Author

hanmz commented Oct 8, 2023

/pulsarbot rerun-failure-checks

@poorbarcode
Copy link
Contributor

Current PR is pending on this concern.

Copy link
Contributor

@poorbarcode poorbarcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this new state is not meaningful in the current design.

@poorbarcode
Copy link
Contributor

#21946 will fix the issue that the current PR tries to fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants