Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker] Added isActive in ManagedCursorImpl #19341

Merged
merged 1 commit into from
Jan 28, 2023

Conversation

heesung-sn
Copy link
Contributor

@heesung-sn heesung-sn commented Jan 28, 2023

Motivation

When there are many concurrent subscriptions in a topic, broker performance degrades because many io-threads are waiting for the lock, synchronized (activeCursors) while calling checkBackloggedCursors.

"pulsar-io-4-2" #77 prio=5 os_prio=0 cpu=17289.90ms elapsed=170.09s tid=0x00007f85b0085800 nid=0xa2 runnable  [0x00007f859d05e000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.deactivateCursor(ManagedLedgerImpl.java:3526)
        - locked <0x00007f8921b24c58> (a org.apache.bookkeeper.mledger.impl.ManagedCursorContainer)
        at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setInactive(ManagedCursorImpl.java:1116)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$checkBackloggedCursors$95(PersistentTopic.java:2363)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic$$Lambda$1346/0x00007f8465c23cb0.accept(Unknown Source)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:544)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:272)
--
        at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)

"pulsar-io-4-3" #78 prio=5 os_prio=0 cpu=17224.94ms elapsed=170.09s tid=0x00007f85b0087000 nid=0xa3 waiting for monitor entry  [0x00007f859cf5d000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.deactivateCursor(ManagedLedgerImpl.java:3526)
        - waiting to lock <0x00007f8921b24c58> (a org.apache.bookkeeper.mledger.impl.ManagedCursorContainer)
        at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setInactive(ManagedCursorImpl.java:1116)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$checkBackloggedCursors$95(PersistentTopic.java:2363)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic$$Lambda$1346/0x00007f8465c23cb0.accept(Unknown Source)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:544)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:272)
--
        at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)

Modifications

Added the isActve var in ManagedCursorImpl in order to minimize the access to activeCursors in ManagedLedgerImpl

Verifying this change

  • Make sure that the change passes the CI checks.

This change can be verified as follows:

simulate many concurrent subscriptions

#!/bin/sh

while :
do
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest -time 3 -ss a &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss b &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss c &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss d &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss e &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss f &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss g &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss h &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss i &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss j &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss k &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss l &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss m &

   for job in `jobs -p`
     do
       echo $job
       wait $job
     done
done

before this PR

bin/pulsar-perf produce my-topic -r 10 -m 1000
2023-01-27T18:47:30,430-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:       0 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:47:40,510-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---     11.8 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 6834.356 ms - med: 6834.815 - 95pct: 12128.831 - 99pct: 12629.887 - 99.9pct: 12728.447 - 99.99pct: 12728.447 - Max: 12728.447
2023-01-27T18:47:50,531-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:00,555-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:10,583-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:20,616-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:30,656-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:40,683-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:50,711-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:49:00,734-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:49:01,300-0800 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - ------------- DONE (reached the maximum number: 1000 of production) --------------
2023-01-27T18:49:06,306-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats --- 119 records sent --- 1.123 msg/s --- 0.009 Mbit/s
2023-01-27T18:49:06,337-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats --- Latency: mean: 6834.356 ms - med: 6834.815 - 95pct: 12128.831 - 99pct: 12629.887 - 99.9pct: 12728.447 - 99.99pct: 12728.447 - 99.999pct: 12728.447 - Max: 12728.447

after this PR

bin/pulsar-perf produce my-topic -r 10 -m 1000

org.apache.pulsar.testclient.PerformanceProducer - Created 1 producers
2023-01-27T18:57:42,190-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      31 msg ---      3.1 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 2848.717 ms - med: 2265.311 - 95pct: 5757.055 - 99pct: 5822.847 - 99.9pct: 5822.847 - 99.99pct: 5822.847 - Max: 5822.847
2023-01-27T18:57:52,230-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      49 msg ---      1.8 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 10118.172 ms - med: 9954.495 - 95pct: 12453.503 - 99pct: 12554.047 - 99.9pct: 12554.047 - 99.99pct: 12554.047 - Max: 12554.047
2023-01-27T18:58:02,254-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      87 msg ---      3.8 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 17079.874 ms - med: 16370.687 - 95pct: 20056.703 - 99pct: 20304.127 - 99.9pct: 20304.127 - 99.99pct: 20304.127 - Max: 20304.127
2023-01-27T18:58:12,277-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     126 msg ---      3.9 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 25037.297 ms - med: 25613.823 - 95pct: 26452.607 - 99pct: 26574.591 - 99.9pct: 26574.591 - 99.99pct: 26574.591 - Max: 26574.591
2023-01-27T18:58:22,299-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     174 msg ---      4.8 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 30616.232 ms - med: 31042.431 - 95pct: 32272.895 - 99pct: 32440.447 - 99.9pct: 32440.447 - 99.99pct: 32440.447 - Max: 32440.447
2023-01-27T18:58:32,325-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     240 msg ---      6.6 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 35219.516 ms - med: 35478.527 - 95pct: 36584.447 - 99pct: 36717.823 - 99.9pct: 36817.663 - 99.99pct: 36817.663 - Max: 36817.663
2023-01-27T18:58:42,354-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     314 msg ---      7.4 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 36440.735 ms - med: 36492.543 - 95pct: 37402.623 - 99pct: 37561.087 - 99.9pct: 37659.391 - 99.99pct: 37659.391 - Max: 37659.391
2023-01-27T18:58:52,377-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     397 msg ---      8.3 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 40985.450 ms - med: 41395.967 - 95pct: 42582.783 - 99pct: 42684.927 - 99.9pct: 42727.167 - 99.99pct: 42727.167 - Max: 42727.167
2023-01-27T18:59:02,406-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     519 msg ---     12.2 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 38750.025 ms - med: 38156.543 - 95pct: 41149.439 - 99pct: 41394.175 - 99.9pct: 41492.479 - 99.99pct: 41492.479 - Max: 41492.479
2023-01-27T18:59:12,436-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     645 msg ---     12.6 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 36111.466 ms - med: 35880.959 - 95pct: 37313.535 - 99pct: 37454.335 - 99.9pct: 37470.719 - 99.99pct: 37470.719 - Max: 37470.719
2023-01-27T18:59:12,993-0800 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - ------------- DONE (reached the maximum number: 1000 of production) --------------
2023-01-27T18:59:17,999-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats --- 723 records sent --- 6.826 msg/s --- 0.053 Mbit/s
2023-01-27T18:59:18,027-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats --- Latency: mean: 32857.773 ms - med: 35737.599 - 95pct: 41621.759 - 99pct: 42467.839 - 99.9pct: 42684.927 - 99.99pct: 42727.167 - 99.999pct: 42727.167 - Max: 42727.167

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: heesung-sn#23

ledger.deactivateCursor(this);
if (isActive) {
ledger.deactivateCursor(this);
isActive = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is still the possibility of inconsistent status in the concurrent scenario, but I'm not sure whether it is a problem.

@mattisonchao
Copy link
Member

I am still wondering why we need synchronized (activeCursors). It looks like ManagedCursorContainer is thread-safe.
@lhotari Could you help to explain more about it? because I found you added it in #17273.

@codecov-commenter
Copy link

Codecov Report

Merging #19341 (71cedd7) into master (644be5f) will increase coverage by 31.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #19341       +/-   ##
=============================================
+ Coverage     32.42%   63.46%   +31.04%     
- Complexity     6347    25878    +19531     
=============================================
  Files          1644     1818      +174     
  Lines        123712   133107     +9395     
  Branches      13486    14644     +1158     
=============================================
+ Hits          40109    84474    +44365     
+ Misses        77694    40879    -36815     
- Partials       5909     7754     +1845     
Flag Coverage Δ
inttests 24.93% <85.71%> (+<0.01%) ⬆️
systests 25.49% <100.00%> (-0.18%) ⬇️
unittests 60.74% <100.00%> (+43.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 77.92% <100.00%> (+38.10%) ⬆️
...ce/ConsistentHashingStickyKeyConsumerSelector.java 76.92% <0.00%> (-3.85%) ⬇️
...in/java/org/apache/pulsar/PulsarBrokerStarter.java 28.14% <0.00%> (-3.00%) ⬇️
.../apache/pulsar/client/impl/ClientCnxIdleState.java 68.88% <0.00%> (-2.23%) ⬇️
...service/nonpersistent/NonPersistentReplicator.java 58.41% <0.00%> (-1.99%) ⬇️
...in/java/org/apache/pulsar/common/api/AuthData.java 71.42% <0.00%> (ø)
...ava/org/apache/pulsar/client/api/schema/Field.java 80.00% <0.00%> (ø)
.../apache/pulsar/broker/namespace/LookupOptions.java 87.50% <0.00%> (ø)
.../apache/pulsar/common/naming/SystemTopicNames.java 55.55% <0.00%> (ø)
...apache/pulsar/common/util/SafeCollectionUtils.java 0.00% <0.00%> (ø)
... and 1214 more

@lhotari
Copy link
Member

lhotari commented Jan 30, 2023

I am still wondering why we need synchronized (activeCursors). It looks like ManagedCursorContainer is thread-safe. @lhotari Could you help to explain more about it? because I found you added it in #17273.

@mattisonchao I guess I had added it to be consistent with the other method deactivateCursor, https://github.com/apache/pulsar/pull/17273/files#diff-f6a849bd8fdb782ef6c17a2e07a2c54c3dc7d1655c00ec3546d5f3b3fc61e970L3527 . That method contained synchronized(activeCursors)

liangyepianzhou pushed a commit that referenced this pull request Feb 7, 2023
liangyepianzhou pushed a commit that referenced this pull request Feb 7, 2023
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Feb 28, 2023
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Feb 28, 2023
(cherry picked from commit 96fb7da)
(cherry picked from commit f61aafa)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Feb 28, 2023
(cherry picked from commit 96fb7da)
(cherry picked from commit d1e4008)
@heesung-sn heesung-sn deleted the ManagedCursorImpl.isActive branch April 2, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants