Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Replication stuck when partitions count between two clusters is not the same #22983

Merged
merged 16 commits into from
Jul 15, 2024

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Jun 26, 2024

Motivation

Auto-creation rule of both clusters

  • cluster-c1: non-partitioned
  • cluster-c2: partitioned topic with 2 partitions

Issue 1 (Issue)

  • Enable double-way replication c1 -> c2 & c2 -> c1.
  • Create a non-partitioned topic tp1 on c1.
    • The messages will be attempted to replicate into both partitions of c2, but it will fail due to a ClassCaseExceptionpic-1
  • At the moment, the topics on both clusters are below:
    • non-partitioned topic tp-1
    • partitioned topic tp-1 with 2 partitions.
  • The new partitions tp-1-partition-0 and tp-1-partition-1 on the cluster c2 will also start replication.
    • Replicator will trigger topic auto-creation for c1
  • Finally, the topics on both clusters are below:
    • c1:
      • non-partitioned topic tp-1
      • non-partitioned topic tp-1-partition-0
      • non-partitioned topic tp-1-partition-1
    • c2: partitioned topic tp-1 with 2 partitions.
  • The new partitions tp-1-partition-0 and tp-1-partition-1 on the cluster c1 will also start replication.
    • Since the replicator of non-partitioned topic tp-1 is already connected to the remote cluster, the new topics' replicator can not be created successfully and it will get an error Producer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1

Issue 2

  • Enable one-way replication c1 -> c2.
  • Create a non-partitioned topic tp1 on c1.
    • The messages will be attempted to replicate into both partitions of c2, but it will fail due to a ClassCaseExceptionpic-1
  • Finally, the topics on both clusters are below:
    • c1: non-partitioned topic tp-1
    • c2: partitioned topic tp-1 with 2 partitions.

pic-1:
Screenshot 2024-06-27 at 00 20 45

Modifications

Replicator will only trigger a non-partitioned topic creation.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added this to the 3.4.0 milestone Jun 26, 2024
@poorbarcode poorbarcode self-assigned this Jun 26, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 26, 2024
@poorbarcode poorbarcode changed the title [improve][log] Improve replicator's log when partitions count between two clusters is not the same [fix][broker] Replication stuck when partitions count between two clusters is not the same Jun 27, 2024
@poorbarcode poorbarcode changed the title [fix][broker] Replication stuck when partitions count between two clusters is not the same [fix][broker] [break-change] Replication stuck when partitions count between two clusters is not the same Jun 27, 2024
@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost labels Jun 28, 2024
@nodece
Copy link
Member

nodece commented Jun 28, 2024

When a topic type is different(non-partitioned/partitioned) on the local and remote clusters, the pulsar-client has been thrown Producer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1, I think the user has to handle this error manually, not depend on the pulsar to fix that.

@codelipenghui
Copy link
Contributor

@poorbarcode You mentioned "break-change" in the PR's title. Do you mean this PR will introduce break change or it fixed a break change? If it will fix a break change, could you please link the change which introduced the break change?

@poorbarcode poorbarcode changed the title [fix][broker] [break-change] Replication stuck when partitions count between two clusters is not the same [fix][broker] Replication stuck when partitions count between two clusters is not the same Jun 28, 2024
@poorbarcode
Copy link
Contributor Author

poorbarcode commented Jun 28, 2024

@codelipenghui

@poorbarcode You mentioned "break-change" in the PR's title. Do you mean this PR will introduce break change or it fixed a break change? If it will fix a break change, could you please link the change which introduced the break change?

Sorry, I assumed that the code class cast was added by #21946, if so, issue-2 described in Motivation can successfully copy data to c2 even though there are two partitions on the remote cluster.

I checked the history commits, and the class cast code was added when creating the class AbstractReplicator. I ran a test with the tag before #21946, and it does not work also.

So there is no break change( all the two scenarios can not work ), I removed the label break-change.

@poorbarcode
Copy link
Contributor Author

poorbarcode commented Jun 28, 2024

@nodece

When a topic type is different(non-partitioned/partitioned) on the local and remote clusters, the pulsar-client has been thrown Producer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1, I think the user has to handle this error manually, not depend on the pulsar to fix that.

Yes, it does. see here

if (metadata.partitions != 0) {
    log.error("[{}] The partitions count between two clusters is not the same(remote partitions: {}),"
                    + " the replicator can not be created successfully: {}", replicatorId, metadata.partitions,
            state);
    // This exception will be caught below, so it can be any typed.
    checkPartitionsSameFuture.completeExceptionally(new RuntimeException(replicatorId
            + "Can not replicate data to a partitioned topic."));
}

This PR only changes the scenario that there is no topic on the remote cluster. And makes the error logs clearer

@poorbarcode
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

@poorbarcode poorbarcode requested review from nodece and Demogorgon314 and removed request for nodece July 5, 2024 19:23
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 85.71429% with 4 lines in your changes missing coverage. Please review.

Project coverage is 73.45%. Comparing base (bbc6224) to head (ad95488).
Report is 449 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #22983      +/-   ##
============================================
- Coverage     73.57%   73.45%   -0.13%     
+ Complexity    32624     2412   -30212     
============================================
  Files          1877     1914      +37     
  Lines        139502   143616    +4114     
  Branches      15299    15668     +369     
============================================
+ Hits         102638   105489    +2851     
- Misses        28908    30049    +1141     
- Partials       7956     8078     +122     
Flag Coverage Δ
inttests 27.52% <42.85%> (+2.93%) ⬆️
systests 24.69% <21.42%> (+0.36%) ⬆️
unittests 72.51% <85.71%> (-0.33%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...ache/pulsar/broker/service/AbstractReplicator.java 65.35% <100.00%> (-19.65%) ⬇️
...ar/client/impl/conf/ProducerConfigurationData.java 87.20% <100.00%> (-4.56%) ⬇️
...roker/service/persistent/PersistentReplicator.java 67.98% <50.00%> (-0.89%) ⬇️
...rg/apache/pulsar/client/impl/PulsarClientImpl.java 74.52% <90.47%> (+0.22%) ⬆️

... and 493 files with indirect coverage changes

@poorbarcode poorbarcode merged commit a8ce990 into apache:master Jul 15, 2024
51 checks passed
poorbarcode added a commit that referenced this pull request Jul 15, 2024
…sters is not the same (#22983)

(cherry picked from commit a8ce990)
@poorbarcode
Copy link
Contributor Author

Since branch-3.2 does not contain PIP-344, so this PR can not cherry-pick into branch-3.2, removed the label release:3.2

poorbarcode added a commit that referenced this pull request Jul 15, 2024
…sters is not the same (#22983)

(cherry picked from commit a8ce990)
shibd added a commit that referenced this pull request Jul 16, 2024
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jul 25, 2024
…sters is not the same (apache#22983)

(cherry picked from commit a8ce990)
(cherry picked from commit 25542d8)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jul 25, 2024
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jul 26, 2024
…sters is not the same (apache#22983)

(cherry picked from commit a8ce990)
(cherry picked from commit 25542d8)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jul 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost cherry-picked/branch-3.0 cherry-picked/branch-3.3 doc-not-needed Your PR changes do not impact docs ready-to-test release/3.0.6 release/3.3.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants