Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove checkBackloggedCursors from subscribe #24

Closed
wants to merge 1 commit into from

Conversation

heesung-sn
Copy link
Owner

@heesung-sn heesung-sn commented Jan 28, 2023

Motivation

Related to apache#19341,

  • When there are many concurrent subscriptions in a topic(10k> ), brokers' performance degrades because checkBackloggedCursors() iterates all subscriptions for each subscribe call.
"pulsar-io-4-2" #77 prio=5 os_prio=0 cpu=17289.90ms elapsed=170.09s tid=0x00007f85b0085800 nid=0xa2 runnable  [0x00007f859d05e000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.deactivateCursor(ManagedLedgerImpl.java:3526)
        - locked <0x00007f8921b24c58> (a org.apache.bookkeeper.mledger.impl.ManagedCursorContainer)
        at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setInactive(ManagedCursorImpl.java:1116)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$checkBackloggedCursors$95(PersistentTopic.java:2363)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic$$Lambda$1346/0x00007f8465c23cb0.accept(Unknown Source)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:544)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:272)
--
        at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)

"pulsar-io-4-3" #78 prio=5 os_prio=0 cpu=17224.94ms elapsed=170.09s tid=0x00007f85b0087000 nid=0xa3 waiting for monitor entry  [0x00007f859cf5d000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.deactivateCursor(ManagedLedgerImpl.java:3526)
        - waiting to lock <0x00007f8921b24c58> (a org.apache.bookkeeper.mledger.impl.ManagedCursorContainer)
        at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setInactive(ManagedCursorImpl.java:1116)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$checkBackloggedCursors$95(PersistentTopic.java:2363)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic$$Lambda$1346/0x00007f8465c23cb0.accept(Unknown Source)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:544)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:272)
--
        at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)
  • Running checkBackloggedCursors() in the subscription path is redundant as checkBackloggedCursors() is called in updateStats() every statsUpdateFrequencyInSecs(by default 1min).

Modifications

Removed checkBackloggedCursors() from subscribe() call.

Verifying this change

  • Make sure that the change passes the CI checks.

This change can be verified as follows:

simulate many concurrent subscriptions

#!/bin/sh

while :
do
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest -time 3 -ss a &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss b &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss c &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss d &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss e &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss f &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss g &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss h &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss i &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss j &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss k &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss l &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss m &

   for job in `jobs -p`
     do
       echo $job
       wait $job
     done
done

before this PR

bin/pulsar-perf produce my-topic -r 10 -m 1000
2023-01-27T18:47:30,430-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:       0 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:47:40,510-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---     11.8 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 6834.356 ms - med: 6834.815 - 95pct: 12128.831 - 99pct: 12629.887 - 99.9pct: 12728.447 - 99.99pct: 12728.447 - Max: 12728.447
2023-01-27T18:47:50,531-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:00,555-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:10,583-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:20,616-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:30,656-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:40,683-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:50,711-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:49:00,734-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:49:01,300-0800 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - ------------- DONE (reached the maximum number: 1000 of production) --------------
2023-01-27T18:49:06,306-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats --- 119 records sent --- 1.123 msg/s --- 0.009 Mbit/s
2023-01-27T18:49:06,337-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats --- Latency: mean: 6834.356 ms - med: 6834.815 - 95pct: 12128.831 - 99pct: 12629.887 - 99.9pct: 12728.447 - 99.99pct: 12728.447 - 99.999pct: 12728.447 - Max: 12728.447

after this PR

 bin/pulsar-perf produce my-topic -r 10 -m 1000

2023-01-27T21:26:43,991-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      15 msg ---      1.5 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 5465.266 ms - med: 6249.727 - 95pct: 7482.431 - 99pct: 7584.639 - 99.9pct: 7584.639 - 99.99pct: 7584.639 - Max: 7584.639
2023-01-27T21:26:54,025-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      62 msg ---      4.6 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 9959.775 ms - med: 10424.703 - 95pct: 12068.927 - 99pct: 12823.935 - 99.9pct: 12823.935 - 99.99pct: 12823.935 - Max: 12823.935
2023-01-27T21:27:04,048-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      97 msg ---      3.5 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 16149.440 ms - med: 16641.855 - 95pct: 19121.023 - 99pct: 19414.143 - 99.9pct: 19414.143 - 99.99pct: 19414.143 - Max: 19414.143
2023-01-27T21:27:14,070-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     131 msg ---      3.4 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 22474.266 ms - med: 22154.495 - 95pct: 24488.959 - 99pct: 25234.559 - 99.9pct: 25234.559 - 99.99pct: 25234.559 - Max: 25234.559
2023-01-27T21:27:24,106-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     192 msg ---      6.1 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 28824.695 ms - med: 28932.479 - 95pct: 30471.679 - 99pct: 30507.007 - 99.9pct: 30516.479 - 99.99pct: 30516.479 - Max: 30516.479
2023-01-27T21:27:34,143-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     269 msg ---      7.7 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 31316.609 ms - med: 30886.527 - 95pct: 32863.231 - 99pct: 32963.199 - 99.9pct: 33045.759 - 99.99pct: 33045.759 - Max: 33045.759
2023-01-27T21:27:44,178-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     418 msg ---     14.9 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 30904.110 ms - med: 31308.287 - 95pct: 33457.919 - 99pct: 33696.511 - 99.9pct: 33730.303 - 99.99pct: 33730.303 - Max: 33730.303
2023-01-27T21:27:54,205-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     792 msg ---     37.3 msg/s ---      0.3 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 12566.494 ms - med: 11749.119 - 95pct: 25876.223 - 99pct: 27032.319 - 99.9pct: 27429.375 - 99.99pct: 27429.375 - Max: 27429.375
2023-01-27T21:28:04,231-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     892 msg ---     10.0 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 158.367 ms - med: 103.553 - 95pct: 512.813 - 99pct: 626.371 - 99.9pct: 729.767 - 99.99pct: 729.767 - Max: 729.767
2023-01-27T21:28:14,254-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     992 msg ---     10.0 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 113.023 ms - med:  87.656 - 95pct: 275.551 - 99pct: 362.147 - 99.9pct: 384.981 - 99.99pct: 384.981 - Max: 384.981
2023-01-27T21:28:14,871-0800 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - ------------- DONE (reached the maximum number: 1000 of production) --------------
2023-01-27T21:28:19,879-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats --- 1000 records sent --- 9.434 msg/s --- 0.074 Mbit/s
2023-01-27T21:28:19,942-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats --- Latency: mean: 15404.022 ms - med: 14796.287 - 95pct: 32638.591 - 99pct: 33425.023 - 99.9pct: 33696.511 - 99.99pct: 33730.303 - 99.999pct: 33730.303 - Max: 33730.303

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: #24

@heesung-sn heesung-sn force-pushed the removeCheckBackloggedCursorsFromSubscribe branch from 1df04c6 to 1f41b31 Compare January 28, 2023 05:23
@heesung-sn heesung-sn changed the title Remove check backlogged cursors from subscribe Remove checkBackloggedCursors from subscribe Jan 28, 2023
@heesung-sn heesung-sn force-pushed the removeCheckBackloggedCursorsFromSubscribe branch 2 times, most recently from 70f1883 to 559194d Compare January 29, 2023 01:26
…dCursor(single subscription check) from subscribe
@heesung-sn heesung-sn force-pushed the removeCheckBackloggedCursorsFromSubscribe branch from 559194d to 3cb8c55 Compare January 29, 2023 01:27
@heesung-sn heesung-sn closed this Jan 29, 2023
@heesung-sn heesung-sn deleted the removeCheckBackloggedCursorsFromSubscribe branch April 2, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant