
Fix message buffer being copied from direct memory to heap memory #10330

Conversation

@BewareMyPower commented Apr 22, 2021

Motivation

When I tested Pulsar with AppendIndexMetadataInterceptor configured, i.e.

brokerEntryMetadataInterceptors=org.apache.pulsar.common.intercept.AppendIndexMetadataInterceptor

I found that heap memory usage grew very quickly, causing frequent GC. After analyzing the heap dump, I found that many messages, which have BrokerEntryMetadata at the head, were stored in heap memory instead of direct memory.

When I removed the above configuration, heap memory grew much more slowly.

Modifications

  • Copy the two buffers into a single buffer in direct memory instead of using a CompositeByteBuf. When a CompositeByteBuf with more than one component reaches BookKeeper's checksum logic, its nioBuffer() method is called and its internal buffers are concatenated into a single buffer in heap memory.
  • Add a reference count check to ensure that, after this change, the input and output buffers of addBrokerEntryMetadata are ultimately released once the output buffer is released.
  • Slightly refactor the code so that the return value of ManagedLedgerInterceptor#beforeAddEntry is actually used.
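The idea behind the first bullet can be sketched as follows. This is an illustrative analog only: it uses plain java.nio buffers and a hypothetical method name, while the actual patch works on Netty's ByteBuf.

```java
import java.nio.ByteBuffer;

public class DirectBufferMerge {
    // Illustrative sketch (hypothetical name; the real patch uses Netty's
    // ByteBuf): copy the metadata header and the payload into one direct
    // buffer so the result is a single contiguous off-heap region instead of
    // a composite that the checksum path would later flatten onto the heap.
    static ByteBuffer mergeIntoDirect(ByteBuffer header, ByteBuffer payload) {
        ByteBuffer merged = ByteBuffer.allocateDirect(header.remaining() + payload.remaining());
        merged.put(header).put(payload); // the one deliberate copy, off-heap
        merged.flip();                   // prepare the merged buffer for reading
        return merged;
    }
}
```

The trade-off is one up-front copy in direct memory in exchange for avoiding a later hidden copy into heap memory.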

Verifying this change

  • Make sure that the change passes the CI checks.


This change is a trivial rework / code cleanup without any test coverage.

After this change, the heap memory growth problem was solved.

@BewareMyPower commented Apr 22, 2021

Here are the monitoring statistics with and without this change. The first case (before 00:15) uses this PR's addBrokerEntryMetadata method.

[screenshot: monitoring chart, 2021-04-23]

[screenshot: monitoring chart, 2021-04-23]

I used a producer with nearly 200 MB/s write speed.

./bin/pulsar-perf produce -r 300000 my-topic

As we can see, the heap memory of the second case (after 00:15) increased very quickly and GC happened very frequently. The only difference between the two cases is the pulsar-common.jar.

I'm not a Netty expert and am quite confused about why the CompositeByteBuf causes the whole buffer to be copied into heap memory.

Here's a dump generated by jmap -dump:format=b.

[screenshot: heap dump analysis showing many message buffers in heap memory]

We can see there are many messages in heap memory. The first two bytes are [14, 2], which correspond to magicBrokerEntryMetadata. See

public static final short magicBrokerEntryMetadata = 0x0e02;
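Splitting the 16-bit magic into its two bytes shows why the dumped buffers start with [14, 2] (the wrapper class and method name below are illustrative):

```java
public class BrokerEntryMetadataMagic {
    public static final short magicBrokerEntryMetadata = 0x0e02;

    // Split the 16-bit magic into the two bytes that appear at the start of
    // each entry carrying BrokerEntryMetadata: 0x0e == 14, 0x02 == 2.
    static byte[] magicBytes() {
        return new byte[]{
            (byte) ((magicBrokerEntryMetadata >>> 8) & 0xff), // high byte: 14
            (byte) (magicBrokerEntryMetadata & 0xff)          // low byte: 2
        };
    }
}
```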

@BewareMyPower

@codelipenghui @hangc0276 @wuzhanpeng @dockerzhang @aloyszhang PTAL. Also, it would be appreciated if you could compare the heap memory usage before and after this PR.

I'm not sure whether something is wrong with my test environment. The VM is an AWS EC2 Ubuntu Server 18.04 LTS (HVM) ami-05248307900d52e3a i3.4xlarge.

@BewareMyPower

It looks like the heap memory problem is on the BK side. I suspect that when LedgerHandle#asyncAddEntry(ByteBuf data, AddCallback cb, Object ctx) accepts a CompositeByteBuf as the data, it eventually concatenates the CompositeByteBuf's buffers into a single buffer in heap memory.

@eolivelli left a comment

It looks like the heap memory problem is on the BK side. I suspect that when LedgerHandle#asyncAddEntry(ByteBuf data, AddCallback cb, Object ctx) accepts a CompositeByteBuf as the data, it eventually concatenates the CompositeByteBuf's buffers into a single buffer in heap memory.

Probably you are right.

In this patch we are forcibly copying the data, and we should avoid it.

@eolivelli

@dlg99 PTAL

@codelipenghui

It seems the problem is related to the crc32 checksum in the BookKeeper client. Since we have more than one component in the CompositeByteBuf, CompositeByteBuf.hasMemoryAddress() returns false; see the implementation in CompositeByteBuf:

public boolean hasMemoryAddress() {
    switch (this.componentCount) {
    case 0:
        return Unpooled.EMPTY_BUFFER.hasMemoryAddress();
    case 1:
        return this.components[0].buf.hasMemoryAddress();
    default:
        return false;
    }
}

The crc32 checksum in BookKeeper then falls back to nioBuffer() to resume the CRC hash, and CompositeByteBuf.nioBuffer() copies the data into a HeapByteBuffer when the component count is greater than 1.

public static int resumeChecksum(int previousChecksum, ByteBuf payload) {
    if (payload.hasMemoryAddress() && (CRC32C_HASH instanceof Sse42Crc32C)) {
        return CRC32C_HASH.resume(previousChecksum, payload.memoryAddress() + payload.readerIndex(),
            payload.readableBytes());
    } else if (payload.hasArray()) {
        return CRC32C_HASH.resume(previousChecksum, payload.array(), payload.arrayOffset() + payload.readerIndex(),
            payload.readableBytes());
    } else {
        return CRC32C_HASH.resume(previousChecksum, payload.nioBuffer());
    }
}

@BewareMyPower

@codelipenghui Thanks for your help. I'll update the PR description after I change beforeAddEntry back.

@eolivelli As @codelipenghui mentioned, we cannot avoid copying data here because CompositeByteBuf#nioBuffer will be called on the BookKeeper side. I think we can open an issue in https://github.com/apache/bookkeeper. After the BookKeeper side fixes the problem, we could change the code here to avoid copying data.

@BewareMyPower

I've updated the code and PR description. PTAL again. @eolivelli @codelipenghui

@codelipenghui

@BewareMyPower Could you please add one issue to track the Pulsar side and another to track the BookKeeper side? We need to fix both before the 2.8.0 release, which also requires a BookKeeper release containing the fix. I agree we can copy the data for now to avoid copying data into the JVM heap, since this only happens when broker entry metadata is enabled. If this cannot be fixed completely, we'd better disable the feature by default in 2.8.0.

@eolivelli WDYT?

@eolivelli

@dlg99 please follow up on BK.

@dlg99 commented Apr 23, 2021

@codelipenghui nice catch. The checksum part should be easy to fix: DigestManager.computeDigestAndPackageForSending can check whether the ByteBuf is an instance of CompositeByteBuf, treat it as an Iterable (which CompositeByteBuf implements), and update the digest in a loop, or something along those lines.
PCBC.addEntry does something similar with ByteBufList.
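The suggested loop can be sketched with the JDK's CRC32C standing in for BookKeeper's digest code. This is a sketch of the technique only; the real fix lives in BookKeeper's DigestManager and operates on Netty buffer components.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

public class ComponentChecksum {
    // Sketch of the suggested fix (JDK CRC32C standing in for BookKeeper's
    // digest): update the checksum from each component in a loop instead of
    // flattening a composite buffer, which is what copies data to the heap.
    static long checksumComponents(Iterable<ByteBuffer> components) {
        CRC32C crc = new CRC32C();
        for (ByteBuffer component : components) {
            crc.update(component); // incremental update, no concatenation
        }
        return crc.getValue();
    }
}
```

Because CRC is resumable, checksumming the components one by one yields the same value as checksumming the concatenated bytes, which is what makes the loop a drop-in replacement for the flattening path.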

@BewareMyPower

since this only happens when broker entry metadata is enabled, if this cannot be fixed completely, we'd better disable this feature by default in 2.8.0.

This feature is already disabled by default, because the default brokerEntryMetadataInterceptors is empty and managedLedgerInterceptor is null.

@BewareMyPower

BTW, I'm not sure whether BK can be released in time so that Pulsar 2.8.0 isn't delayed too long. For now I suggest applying this PR's patch first. If BK can't be released in time, we could pick up the BK fix in Pulsar 2.8.1.

@dlg99 commented Apr 23, 2021

apache/bookkeeper#2701 should fix it.
The existing tests should be enough; we are not unit-testing memory allocations.
I didn't wait for the test run locally; let's see what the CI checks say.

@BewareMyPower

@dlg99 Thanks for your quick fix.

My concern is: once the PR is merged, when will BK be released so that we can upgrade Pulsar's BK dependency?

@dlg99 commented Apr 23, 2021

@BewareMyPower I think we have a few other interesting changes since BK 4.13.0 to justify a 4.13.1 release.
@eolivelli what do you think? We'll need to review pending PRs, decide what can be merged, etc. The release process (voting etc.) easily takes a week.

@BewareMyPower meanwhile, can I ask you to build BK locally and confirm that the fix in fact improves the heap allocation situation in your scenario? Let's treat this as a prerequisite for starting the release process.

@eolivelli

@dlg99 I agree with you

@BewareMyPower

@dlg99 OK, I'll verify it soon.

@BewareMyPower

@dlg99 It looks like there's no significant improvement.

[screenshot: monitoring chart showing heap memory usage]

I'll add some logs to check whether the BK dependency is updated. BTW, my build steps are:

  1. Build the latest BK with PR 2701:
mvn clean install -DskipTests
  2. Upgrade Pulsar's BK version:
<bookkeeper.version>4.14.0-SNAPSHOT</bookkeeper.version>
  3. Build Pulsar:
mvn clean install -DskipTests -Pcore-modules
  4. Upload distribution/server/target/apache-pulsar-2.8.0-SNAPSHOT-bin.tar.gz to my VM and start it.

@dlg99 commented Apr 23, 2021

@BewareMyPower Looking at the original chart, heap utilization used to go up to 4.5G and is now about 3G (assuming I'm interpreting these charts correctly, the tests and configuration are the same, etc.). I'd expect heap usage to peak at the same level but grow more slowly, with less frequent GC pauses; however, I have no idea what these charts actually show.

You may need to run a profiler to check where the allocations happen. IIRC, JFR can record data over some interval, and then JMC can show you where the most allocations happened, with stack traces etc.

@BewareMyPower

I'll try to debug BK dependencies first. Maybe it was not caused by the checksum.

@BewareMyPower commented Apr 23, 2021

@dlg99 I just found something wrong with @codelipenghui's analysis, which explains why your PR didn't help.

We can debug BrokerEntryMetadataE2ETest and go to

ByteBuf duplicateBuffer = data.retainedDuplicate();
// internally asyncAddEntry() will take the ownership of the buffer and release it at the end
addOpCount = ManagedLedgerImpl.ADD_OP_COUNT_UPDATER.incrementAndGet(ml);
lastInitTime = System.nanoTime();
ledger.asyncAddEntry(duplicateBuffer, this, addOpCount);

  • The original data is io.netty.buffer.CompositeByteBuf
  • However, duplicateBuffer is io.netty.buffer.UnpooledDuplicatedByteBuf

So the argument passed to LedgerHandle#asyncAddEntry is an UnpooledDuplicatedByteBuf, not a CompositeByteBuf.


I tried to change the code above to

            ByteBuf duplicateBuffer = data.retainedDuplicate();
            final ByteBuf originalData = data;
            data = duplicateBuffer;

            // internally asyncAddEntry() will take the ownership of the buffer and release it at the end
            addOpCount = ManagedLedgerImpl.ADD_OP_COUNT_UPDATER.incrementAndGet(ml);
            lastInitTime = System.nanoTime();
            ledger.asyncAddEntry(originalData, this, addOpCount);

But it still didn't work (even worse). I've added some debug logs to BK:

    public static int resumeChecksum(int previousChecksum, ByteBuf payload) {
        log.info("XYZ resumeChecksum payload.hasMemoryAddress(): {}, hasArray(): {}, is CompositeByteBuf: {},"
                        + " class name: {}",
                payload.hasMemoryAddress(), payload.hasArray(), (payload instanceof CompositeByteBuf),
                payload.getClass().getName());

Then I sent one message, and the logs were:

19:41:21.303 [BookKeeperClientWorker-OrderedExecutor-12-0] INFO com.scurrilous.circe.checksum.Crc32cIntChecksum - XYZ resumeChecksum payload.hasMemoryAddress(): true, hasArray(): false, is CompositeByteBuf: false, class name: io.netty.buffer.PooledUnsafeDirectByteBuf
19:41:21.304 [BookKeeperClientWorker-OrderedExecutor-12-0] INFO com.scurrilous.circe.checksum.Crc32cIntChecksum - XYZ resumeChecksum payload.hasMemoryAddress(): true, hasArray(): false, is CompositeByteBuf: false, class name: io.netty.buffer.PooledUnsafeDirectByteBuf
19:41:21.305 [BookKeeperClientWorker-OrderedExecutor-12-0] INFO com.scurrilous.circe.checksum.Crc32cIntChecksum - XYZ resumeChecksum payload.hasMemoryAddress(): true, hasArray(): false, is CompositeByteBuf: false, class name: io.netty.buffer.AbstractPooledDerivedByteBuf$PooledNonRetainedSlicedByteBuf

The CompositeByteBuf was split into a PooledUnsafeDirectByteBuf and an AbstractPooledDerivedByteBuf$PooledNonRetainedSlicedByteBuf. It seems nioBuffer() wasn't called, but the heap memory problem still existed and might even have gotten worse.

@merlimat

So the argument passed to LedgerHandle#asyncAddEntry is an UnpooledDuplicatedByteBuf, not a CompositeByteBuf.

@BewareMyPower We might need to use ByteBuf.unwrap() to access the wrapped buffer.

I tried to change the code above to
ByteBuf duplicateBuffer = data.retainedDuplicate();
But it still didn't work (even worse). I've added some debug logs to BK:

There's an extra retain there, which would also require another release on the buffer.
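The unwrap() approach could look roughly like this. This is a sketch under the assumption that Netty is on the classpath; unwrapToComposite is a hypothetical helper name, not BookKeeper's actual code.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;

public class UnwrapHelper {
    // Hypothetical helper: walk unwrap() until a CompositeByteBuf (or the
    // bottom of the wrapper chain) is reached, so derived buffers such as the
    // UnpooledDuplicatedByteBuf returned by retainedDuplicate() don't hide
    // the composite from instanceof checks.
    static ByteBuf unwrapToComposite(ByteBuf buf) {
        ByteBuf current = buf;
        while (!(current instanceof CompositeByteBuf) && current.unwrap() != null) {
            current = current.unwrap();
        }
        return current;
    }
}
```

Note that unwrapping only recovers the composite's components for iteration; reference-count ownership still belongs to the outer derived buffer.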

@dlg99 commented Apr 23, 2021

@BewareMyPower I updated the PR to unwrap the DuplicatedByteBuf.

@michaeljmarshall left a comment

Nice find!

 }
 }

-private boolean beforeAddEntry(OpAddEntry addOperation) {
+private Optional<OpAddEntry> beforeAddEntry(OpAddEntry addOperation) {
     // if no interceptor, just return true to make sure addOperation will be initiate()
@michaeljmarshall
This comment should be updated to reflect the new return type.

@BewareMyPower

OK, but this PR may not be merged, since the BookKeeper side could fix the issue; I may do this code refactor in another PR.

@BewareMyPower commented Apr 24, 2021

@dlg99 Thanks a lot for your help. It should work now.

But there's still a small problem that is not related to your code change. When I run Pulsar with BK 4.14-SNAPSHOT, it looks like the JNI library can't be loaded:

WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider

This is why I still can't get a good test result. However, when I run Pulsar with BK 4.13, it loads the JNI library successfully:

INFO com.scurrilous.circe.checksum.Crc32cIntChecksum - SSE4.2 CRC32C provider initialized

Both run on the same VM, so I wonder what could cause this?

@BewareMyPower

Also, I have a question: should we add an option to configure whether the CompositeByteBuf is merged? On platforms that can't load the Circe JNI library, the performance impact of frequent GC is much greater than that of the in-memory data copy. With the Java-based CRC32c provider, the merged ByteBuf stays in direct memory, but a CompositeByteBuf will hit nioBuffer() and copy data into heap memory.

@eolivelli

BK 4.14 is basically equal to 4.13 in this scope, so it's probably a problem with how you built it locally.
Which platform are you on? Are you on Linux? Which JDK?

@BewareMyPower

@eolivelli I built BK on macOS with JDK 8; see #10330 (comment)

And I run Pulsar with BK 4.14 on an AWS EC2 Ubuntu Server 18.04 LTS (HVM) ami-05248307900d52e3a instance, with JDK 11.

@eolivelli

Probably (I am not sure) the problem is that you have to build BK on Linux in order to build the CirceChecksum library correctly.

When we release BK, even while using Macs, we run the build in a Docker environment; you can use the script in the 'dev' folder, for instance.

@BewareMyPower

@eolivelli Sounds reasonable. I'll give it a shot.

jiazhai pushed a commit to apache/bookkeeper that referenced this pull request Apr 26, 2021
Descriptions of the changes in this PR:

Handling CompositeByteBuf in a way that avoids unnecessary data copy.

### Motivation

apache/pulsar#10330

apache/pulsar#10330 (comment)

### Changes

Handling CompositeByteBuf in a way that avoids unnecessary data copy.
@dlg99

dlg99 commented Apr 26, 2021

@BewareMyPower can you confirm that the copy to heap issue is resolved by the bookkeeper change?

@BewareMyPower

@dlg99 Yes, it's resolved. Sorry, I forgot to update the result. Here's the comparison with the same workload as #10330 (comment):

[screenshot: monitoring chart showing stable heap memory usage]

The heap memory is stable and there's no GC.

@BewareMyPower

Hi @eolivelli, I see apache/bookkeeper#2701 has been merged for a few days. Is there any plan for a BookKeeper release now? Then Pulsar's BK dependency could be upgraded to fix #10330.

@eolivelli

Can you please start a discussion on dev@bookkeeper.apache.org?
I am fine with cutting a release.

@BewareMyPower

@eolivelli Sorry for the late reply; I'm on vacation now. I sent an email just now, PTAL.
