[improve][misc][WIP] Detect "double release" and "use after release" bugs with recycled objects #22110

lhotari · 2024-02-23T16:01:07Z

Motivation

In Pulsar, users have reported issues that could be caused by "double release" or "use after release" bugs with recycled objects.

Here are some issues that could potentially be caused by "double release" or "use after release" bugs:
#22035
#21892

Another example of such a potential issue: #21421/#21933. It is possible that these issues are fixed by apache/bookkeeper#4196. The other possible root cause is a "double release" or "use after release" bug that corrupts the buffer and causes the checksum calculation to fail.

Outside of Java, there's a known bug pattern called "double-free" or "Doubly freeing memory" (with malloc).
The "double release" bug pattern is similar, but it occurs with the recycled object pattern based on Netty's Recycler, which Pulsar uses for performance reasons.

There is also a "use after free" bug pattern. Something similar can happen with Netty recycled objects when an object instance gets used after it has been released.

The solution in this PR attempts to help detect "double release" and "use after release" bugs.

Modifications

- Get rid of any .setRefCnt(1) calls in production code, since they could hide real issues.
- Add an additional detection feature that can be disabled by setting the system property -Dpulsar.refcount.check.on_access=false.
  - The refcount check adds a volatile read. This isn't a performance problem in most use cases. For production usage, we could add -Dpulsar.refcount.check.on_access=false to the bin/pulsar script by default if we are afraid of the performance overhead. All tests should run with the checks enabled so that we can find the source of problems.
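To illustrate the check-on-access idea, here is a minimal sketch. Only the system property name comes from the description above; the CheckedEntry class and its methods are hypothetical stand-ins and are not the classes changed in this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a recycled, reference-counted object (not Pulsar code).
class CheckedEntry {
    // Assumed to mirror the flag described above: checks stay on unless it is set to "false".
    static final boolean CHECK_ON_ACCESS =
            Boolean.parseBoolean(System.getProperty("pulsar.refcount.check.on_access", "true"));

    private final AtomicInteger refCnt = new AtomicInteger(1);
    private byte[] data;

    CheckedEntry(byte[] data) {
        this.data = data;
    }

    // Accessor guarded by the refcount check (one volatile read when enabled).
    byte[] getData() {
        checkRefCount();
        return data;
    }

    void release() {
        while (true) {
            int current = refCnt.get();
            if (current < 1) {
                throw new IllegalStateException("double release detected");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                if (current == 1) {
                    data = null; // a real implementation would return the instance to the Recycler here
                }
                return;
            }
        }
    }

    private void checkRefCount() {
        if (CHECK_ON_ACCESS && refCnt.get() < 1) {
            throw new IllegalStateException("use after release detected");
        }
    }
}
```

With the default setting, calling getData() after the final release() fails fast instead of silently returning stale or null state, and a second release() is reported as a double release.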

Documentation

doc
doc-required
doc-not-needed
doc-complete


lhotari · 2024-02-23T20:32:06Z

After making the changes, there are a lot of unit test failures. I haven't had a chance to look into the details. This PR is still a very early proposal about how to start detecting "double release" and "use after release" bugs.

BewareMyPower

Recycled objects are widely used in Pulsar, not only for EntryImpl. My 1st concern: should we apply the checks in all of these places? For example, the client side could also use recycled objects.

My 2nd concern: I'm afraid Pulsar currently allows a recycled object to be accessed behind a "null check". We need to investigate such cases.

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java


lhotari · 2024-02-26T05:49:56Z

> The recycled objects are widely used in Pulsar, not only for EntryImpl. The 1st concern is that should we apply checks to all these places? For example, the client side could also use recycled objects.

@BewareMyPower Yes, that's a good point. It looks like the AbstractCASReferenceCounted/AbstractReferenceCounted base class prevents a lot of the problems, since exceptions would be thrown at some point on any execution path where a "double release" or "use after release" bug exists.

> The 2nd concern is, I'm afraid currently Pulsar allows a recycled object is accessed with a "null check". We need to investigate such cases.

Yes, those are cases where the problem is getting hidden.

It would be great to find a better way to track down the issues. That's why I started this PR, which is more of an experiment to find a way to detect "double release" and "use after release" bugs and also to raise awareness of such bug patterns with recycled objects. These are bug patterns that most Java developers have never had to deal with because of Java's garbage collection. With recycled objects and Netty ByteBufs, that all changes.

lhotari · 2024-02-26T05:55:52Z

I guess there's also the possibility of "double release" and "use after release" bugs with Netty ByteBufs.

In Netty, there's the leak detector for detecting when you don't release buffers, but there seems to be nothing for detecting "use after release" bugs. I guess "double release" would be detected by the io.netty.buffer.AbstractReferenceCountedByteBuf base class, which throws an exception on double release.
That leaves the "use after release" bugs undetected.

It seems that the solution might be a Java agent, written with Byte Buddy or a similar library, that adds additional checks through bytecode instrumentation when the agent is activated. I'm thinking of something like https://github.com/reactor/BlockHound, but for a completely different purpose: helping to detect "use after release" bugs.

lhotari · 2024-02-26T06:16:51Z

> The 2nd concern is, I'm afraid currently Pulsar allows a recycled object is accessed with a "null check". We need to investigate such cases.

Getting back to this one more time. Netty protects against most "use after release" ByteBuf bugs by setting the fields to null, so NPEs pop up as a sign of issues. Therefore it's extremely important that NPEs aren't suppressed with null checks.
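A small sketch of why a null check hides the problem. RecycledHolder is a hypothetical class used only for illustration; it assumes, as described above, that fields are set to null when the object is recycled.

```java
// Hypothetical recycled object whose field is nulled on recycle (not Pulsar or Netty code).
class RecycledHolder {
    private String payload;

    RecycledHolder(String payload) {
        this.payload = payload;
    }

    // Simulates returning the instance to the pool.
    void recycle() {
        payload = null;
    }

    // Anti-pattern: the null check turns a use-after-release into a silent default value.
    int lengthWithNullCheck() {
        return payload == null ? 0 : payload.length();
    }

    // Without the null check, the use-after-release surfaces as a NullPointerException.
    int length() {
        return payload.length();
    }
}
```

After recycle(), lengthWithNullCheck() quietly returns 0, while length() throws and points at the call site that used the object too late.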

onobc

Nice work @lhotari

My comments are mostly questions - 1 minor suggestion on naming. Other than that LGTM.

Also, because it is coupled w/ the "emergency brake" property, it seems harmless to add in.

The remaining piece will be to find out where we are currently doing the null checks and remove them.

onobc · 2024-03-10T19:55:20Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java

+        return entry;
+    }
+
+    private static EntryImpl getEntryFromRecycler() {


👍🏻 factoring this out makes it easier to follow (and add other common behaviors in one place later if needed - basic DRY stuff).

onobc · 2024-03-10T19:59:26Z

...r-common/src/main/java/org/apache/pulsar/common/util/AbstractValidatingReferenceCounted.java

+        setRefCnt(1);
+    }
+
+    public static <T extends AbstractValidatingReferenceCounted> T getAndCheck(Recycler<T> recycler) {


[NIT] It is not really doing a "check". In the ACRC it was called getEntryFromRecycler. I think it would be helpful to name them the same thing (sans "entry"), maybe getInstanceFromRecycler.

onobc · 2024-03-10T20:02:58Z

...r-common/src/main/java/org/apache/pulsar/common/util/AbstractValidatingReferenceCounted.java

+    }
+
+    public final void resetRefCnt() {
+        setRefCnt(1);


It is unfortunate that setRefCnt is protected. Otherwise, this could have been done in a single util rather than inserted into the hierarchy (composition over inheritance). Also, the code in both hierarchies is almost identical.

The checkOnAccess var is just a static lookup.

The checkRefCount and getFromRecycler methods could be passed the counter or the recycler, respectively.

Not the end of the world but my brain is having trouble leaving it alone.

onobc · 2024-03-10T20:04:35Z

pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java

@@ -1374,8 +1375,10 @@ protected ProducerImpl.ChunkedMessageCtx newObject(
                };

        public static ChunkedMessageCtx get(int totalChunks) {
-            ChunkedMessageCtx chunkedMessageCtx = RECYCLER.get();
-            chunkedMessageCtx.setRefCnt(totalChunks);
+            ChunkedMessageCtx chunkedMessageCtx = getAndCheck(RECYCLER);


Is setting ref count to 1 and then retaining N-1 less performant than setting ref count to N?

OR

Is setting ref count to 1 and then retaining N-1 more functionally correct than setting ref count to N?
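For the functional side of this question, here is a minimal model showing that "start at 1, then retain N-1" ends at the same count as setting N directly. ChunkCtxModel is hypothetical and is not ProducerImpl.ChunkedMessageCtx; the positive-increment check is an assumption of the model. The sketch says nothing about relative performance.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of a reference-counted context (not Pulsar's ChunkedMessageCtx).
class ChunkCtxModel {
    // A freshly obtained instance starts with a reference count of 1.
    private final AtomicInteger refCnt = new AtomicInteger(1);

    // Model assumption: the increment must be positive.
    ChunkCtxModel retain(int increment) {
        if (increment <= 0) {
            throw new IllegalArgumentException("increment must be > 0");
        }
        refCnt.addAndGet(increment);
        return this;
    }

    int refCnt() {
        return refCnt.get();
    }

    // "Start at 1, then retain N-1": the single-chunk case needs no extra retain.
    static ChunkCtxModel forChunks(int totalChunks) {
        ChunkCtxModel ctx = new ChunkCtxModel();
        if (totalChunks > 1) {
            ctx.retain(totalChunks - 1);
        }
        return ctx;
    }
}
```

In this model the single-chunk case has to skip the retain call, which is the kind of edge case a direct "set to N" never had to consider.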

lhotari · 2024-05-17T09:08:19Z

Pattern used in Netty to detect "use after release" bugs:

https://github.com/netty/netty/blob/c07dec11a90b82674db1529d0402af8e62d6589f/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java#L1431-L1456

https://github.com/netty/netty/blob/94cfa608a8071e8a9005e0d52e0199b3584f4ae7/buffer/src/main/java/io/netty/buffer/ByteBuf.java#L2485-L2491

lhotari · 2024-05-24T09:22:51Z

There are 2 ByteBuf reference count handling issues in the BookKeeper client that have been recently fixed: apache/bookkeeper#4289 and apache/bookkeeper#4293.
These fixes are expected to be delivered in BookKeeper 4.16.6 and 4.17.1 so that we can deliver the fixes in Pulsar releases 3.0.6, 3.2.4 and 3.3.1.

lhotari · 2024-05-24T09:41:20Z

I recently discovered another bug pattern with ByteBufs which is due to an incorrect assumption of how Netty ByteBuf reference counting works for derived buffers.

Writing some pseudo code to explain:

ByteBuf buf = PulsarByteBufAllocator.DEFAULT.buffer(16);
...
ByteBuf cachedBuffer = buf.duplicate().retain();
...
buf.release();
...

The assumption could be that the above code is correct. Why is it wrong?

To understand how a duplicate buffer works, one can take a look at the source of AbstractPooledDerivedByteBuf:
https://github.com/netty/netty/blob/5085cef149134951c94a02a743ed70025c8cdad4/buffer/src/main/java/io/netty/buffer/AbstractPooledDerivedByteBuf.java#L32-L37

The reference count of the duplicated buffer (which extends AbstractPooledDerivedByteBuf in the case of pooled buffers) is independent of the parent buffer's reference count. The duplicated buffer calls .release() on the parent buffer when its own reference count reaches 0. Understanding this helps to use .retain() and .release() correctly when .duplicate() or .slice() is used.
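The behaviour described above can be modelled with a simplified sketch. ModelBuffer is hypothetical and is not Netty code; it only mimics a derived buffer that keeps its own reference count and releases its parent once that count reaches 0.

```java
// Simplified, hypothetical model of a derived buffer with its own reference count.
class ModelBuffer {
    private final ModelBuffer parent; // null for a root buffer
    private int refCnt = 1;

    ModelBuffer() {
        this(null);
    }

    private ModelBuffer(ModelBuffer parent) {
        this.parent = parent;
    }

    // Creates a derived view: retains the parent once and starts its own count at 1.
    ModelBuffer derive() {
        retain();
        return new ModelBuffer(this);
    }

    ModelBuffer retain() {
        if (refCnt < 1) {
            throw new IllegalStateException("retain after release");
        }
        refCnt++;
        return this;
    }

    // Returns true when this buffer's own count reached 0.
    boolean release() {
        if (refCnt < 1) {
            throw new IllegalStateException("double release");
        }
        if (--refCnt == 0) {
            if (parent != null) {
                parent.release(); // the derived view gives back its single hold on the parent
            }
            return true;
        }
        return false;
    }

    int refCnt() {
        return refCnt;
    }
}
```

In this model, an extra retain() on the derived view has to be balanced by a release() on the derived view, not on the parent. Whether a particular Netty call behaves like derive() here should be checked against the Netty source linked above.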

lhotari · 2024-05-31T10:41:21Z

One potential source of double release bugs is race conditions in the RangeCache implementation used for the broker cache. There are 2 PRs to address the race conditions: #22789 and #22814.

lhotari · 2024-06-19T14:07:49Z

The Netty 4.1.111.Final upgrade will prevent some problems in this area, more details in #22892

WIP Add checks for preventing hard-to-debug double release bugs with recycled objects

397b6d4

lhotari added the ready-to-test label Feb 23, 2024

lhotari added this to the 3.3.0 milestone Feb 23, 2024

lhotari requested review from merlimat, codelipenghui, RobertIndie, BewareMyPower, onobc and liudezhi2098 February 23, 2024 16:01

lhotari self-assigned this Feb 23, 2024

lhotari mentioned this pull request Feb 23, 2024

[Bug] [Broker] Entry digest does not match #22103

Open

2 tasks

github-actions bot added the doc-not-needed Your PR changes do not impact docs label Feb 23, 2024

lhotari changed the title ~~WIP Detect double release bugs with recycled objects~~ [improve][misc] WIP Detect double release bugs with recycled objects Feb 23, 2024

lhotari marked this pull request as ready for review February 23, 2024 16:03

lhotari changed the title ~~[improve][misc] WIP Detect double release bugs with recycled objects~~ [improve][misc] WIP Detect "double release" and "use after release" bugs with recycled objects Feb 23, 2024

lhotari changed the title ~~[improve][misc] WIP Detect "double release" and "use after release" bugs with recycled objects~~ [improve][misc][WIP] Detect "double release" and "use after release" bugs with recycled objects Feb 23, 2024

lhotari added 2 commits February 23, 2024 22:47

refCnt starts at 1

373e343

Fix retain logic for ChunkedMessageCtx to match setRefCnt(totalChunks)

dc3e1a9

lhotari mentioned this pull request Feb 23, 2024

[fix][Offload] fix indexEntries NullPointerException error #22035

Closed

15 tasks

BewareMyPower reviewed Feb 26, 2024

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java (outdated)

Remove refCnt==0 check since object might be new or reused when it gets here

d59f9ab

lhotari marked this pull request as draft February 26, 2024 06:03

onobc requested changes Mar 10, 2024

coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024

lhotari mentioned this pull request May 17, 2024

[Bug] parseMessageMetadata error when broker entry metadata enable with high loading #22601

Closed

3 tasks

lhotari modified the milestones: 4.0.0, 4.1.0 Oct 11, 2024

lhotari added the release/4.0.1 label Oct 11, 2024
