Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Fix thread safety issues in accessing ManagedCursorContainer.heap ArrayList #16049

Merged

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Jun 14, 2022

Motivation

  • heap is an ArrayList which isn't thread safe and it's currently accessed concurrently.

This could lead to thread safety issues. Here's an example: (stacktraces are from a fork of branch-2.10)

2022-06-13T13:57:27,163+0000 [bookkeeper-ml-cache-eviction-19-1] WARN  org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Exception while performing cache eviction: Index 0 out of bounds for length 0
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
	at jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64) ~[?:?]
	at jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70) ~[?:?]
	at jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248) ~[?:?]
	at java.util.Objects.checkIndex(Objects.java:372) ~[?:?]
	at java.util.ArrayList.get(ArrayList.java:459) ~[?:?]
	at org.apache.bookkeeper.mledger.impl.ManagedCursorContainer.getSlowestReadPositionForActiveCursors(ManagedCursorContainer.java:108) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.getEarlierReadPositionForActiveCursors(ManagedLedgerImpl.java:2185) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.doCacheEviction(ManagedLedgerImpl.java:2175) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$doCacheEviction$4(ManagedLedgerFactoryImpl.java:282) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) ~[?:?]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.doCacheEviction(ManagedLedgerFactoryImpl.java:278) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.cacheEvictionTask(ManagedLedgerFactoryImpl.java:263) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
	at java.lang.Thread.run(Thread.java:829) [?:?]

Modifications

  • protect access using the existing rwLock

…eap" field access

- heap is an ArrayList which isn't thread safe
@lhotari
Copy link
Member Author

lhotari commented Jun 14, 2022

The thread safety issue was introduced in #12045 . A similar pattern was used in #14985 changes.

@lhotari lhotari requested a review from hangc0276 June 14, 2022 07:37
@lhotari
Copy link
Member Author

lhotari commented Jun 14, 2022

There's a chance that this fix causes a performance regression. If that's the case, we would have to revisit the solution for tracking the slowest positions.

@lhotari
Copy link
Member Author

lhotari commented Jun 14, 2022

When evaluating the #12045 and #14985 changes, it seems that the solution breaks broker caching. I'll create an issue for tracking this.

@lhotari
Copy link
Member Author

lhotari commented Jun 14, 2022

The issue about the broker cache regression is #16054

@315157973
Copy link
Contributor

The previous PR is used to solve the memory problem caused by frequent traversal.
I don't think add a read lock for cache evictions have any impact on the Broker's read and write performance.

Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. heap should be guarded by rwLock.

@michaeljmarshall
Copy link
Member

There's a chance that this fix causes a performance regression. If that's the case, we would have to revisit the solution for tracking the slowest positions.

@lhotari - is there a specific test scenario that we should run to proactively check for a performance regression? The changes looks good to me, but it'd be helpful to diagnose a performance regression sooner than later, if possible.

codelipenghui pushed a commit that referenced this pull request Jun 28, 2022
@codelipenghui codelipenghui added the type/bug The PR fixed a bug or issue reported a bug label Jun 28, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jul 4, 2022
…eap" field access (apache#16049)

(cherry picked from commit ec9676f)
(cherry picked from commit cec950e)
dragonls pushed a commit to dragonls/pulsar that referenced this pull request Oct 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker cherry-picked/branch-2.10 doc-not-needed Your PR changes do not impact docs release/2.10.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants