Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heap memory leak problem when ledger replication failed #2794

Conversation

gaozhangmin
Copy link
Contributor

@gaozhangmin gaozhangmin commented Sep 15, 2021

Motivation

production environment, memory leak always happened, and there were ledger cannot be replicated successfully.

This cause by when openLedgerNoRecovery with BKNotEnoughBookiesException, the LedgerHandler won't closed properly, caused memory leak

try (LedgerHandle lh = admin.openLedgerNoRecovery(ledgerIdToReplicate)) {
Set<LedgerFragment> fragments =
getUnderreplicatedFragments(lh, conf.getAuditorLedgerVerificationPercentage());
if (LOG.isDebugEnabled()) {
LOG.debug("Founds fragments {} for replication from ledger: {}", fragments, ledgerIdToReplicate);
}
boolean foundOpenFragments = false;
for (LedgerFragment ledgerFragment : fragments) {
if (!ledgerFragment.isClosed()) {
foundOpenFragments = true;
continue;
}
if (!tryReadingFaultyEntries(lh, ledgerFragment)) {
LOG.error("Failed to read faulty entries, so giving up replicating ledgerFragment {}",
ledgerFragment);
continue;
}
try {
admin.replicateLedgerFragment(lh, ledgerFragment, onReadEntryFailureCallback);
} catch (BKException.BKBookieHandleNotAvailableException e) {
LOG.warn("BKBookieHandleNotAvailableException while replicating the fragment", e);
} catch (BKException.BKLedgerRecoveryException e) {
LOG.warn("BKLedgerRecoveryException while replicating the fragment", e);
} catch (BKException.BKNotEnoughBookiesException e) {
LOG.warn("BKNotEnoughBookiesException while replicating the fragment", e);
}
}
if (foundOpenFragments || isLastSegmentOpenAndMissingBookies(lh)) {
deferLedgerLockRelease = true;
deferLedgerLockRelease(ledgerIdToReplicate);
return false;
}
fragments = getUnderreplicatedFragments(lh, conf.getAuditorLedgerVerificationPercentage());
if (fragments.size() == 0) {
LOG.info("Ledger replicated successfully. ledger id is: " + ledgerIdToReplicate);
underreplicationManager.markLedgerReplicated(ledgerIdToReplicate);
return true;
} else {
deferLedgerLockRelease = true;
deferLedgerLockReleaseOfFailedLedger(ledgerIdToReplicate);
numDeferLedgerLockReleaseOfFailedLedger.inc();
// Releasing the underReplication ledger lock and compete
// for the replication again for the pending fragments
return false;
}
} catch (BKNoSuchLedgerExistsOnMetadataServerException e) {
// Ledger might have been deleted by user
LOG.info("BKNoSuchLedgerExistsOnMetadataServerException while opening "
+ "ledger {} for replication. Other clients "
+ "might have deleted the ledger. "
+ "So, no harm to continue", ledgerIdToReplicate);
underreplicationManager.markLedgerReplicated(ledgerIdToReplicate);
getExceptionCounter("BKNoSuchLedgerExistsOnMetadataServerException").inc();
return false;
} catch (BKNotEnoughBookiesException e) {
logBKExceptionAndReleaseLedger(e, ledgerIdToReplicate);
throw e;

Changes

close LedgerHandler when openComplete with exception

@gaozhangmin
Copy link
Contributor Author

@eolivelli PTAL

@gaozhangmin gaozhangmin changed the title Memory leak problem when ledger replication Memory leak problem when ledger replication failed Sep 15, 2021
@gaozhangmin
Copy link
Contributor Author

@hangc0276

@eolivelli
Copy link
Contributor

@nicoloboschi @RaulGracia @Ghatage PTAL if you have time please

@gaozhangmin which memory is leaked ? direct memory ?

@gaozhangmin
Copy link
Contributor Author

@nicoloboschi @RaulGracia @Ghatage PTAL if you have time please

@gaozhangmin which memory is leaked ? direct memory ?

heap memory

@gaozhangmin gaozhangmin changed the title Memory leak problem when ledger replication failed Heap memory leak problem when ledger replication failed Sep 16, 2021
@gaozhangmin
Copy link
Contributor Author

gaozhangmin commented Sep 16, 2021

@nicoloboschi @RaulGracia @Ghatage PTAL if you have time please

@gaozhangmin which memory is leaked ? direct memory ?
image

@gaozhangmin
Copy link
Contributor Author

image

@gaozhangmin
Copy link
Contributor Author

I've tested this pr in my test cluster, it can solve the heap memory leak problem.

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 nice catch

I'd like to point out that the dead objects are kept alive by AbstractZkLedgerManager

Copy link
Contributor

@dlg99 dlg99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with one nit

@gaozhangmin gaozhangmin force-pushed the memory-leak-when-replication-with-BKNotEnoughBookiesException branch from 0ea9789 to f745744 Compare September 24, 2021 03:09
@gaozhangmin
Copy link
Contributor Author

@dlg99 PTAL

@dlg99
Copy link
Contributor

dlg99 commented Sep 24, 2021

@gaozhangmin looks good! Thank you.

Copy link
Contributor

@RaulGracia RaulGracia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, would it make sense to add some specific test to validate the scenario in which memory was getting leaked (i.e., ledger handles are being closed in the situation in which they were left open before)?

@gaozhangmin
Copy link
Contributor Author

gaozhangmin commented Sep 30, 2021

LGTM, would it make sense to add some specific test to validate the scenario in which memory was getting leaked (i.e., ledger handles are being closed in the situation in which they were left open before)?

@RaulGracia I tried, But it's hard to verify if the ReadOnlyLedgerHandle is closed.

@zymap zymap merged commit 4dc4260 into apache:master Oct 15, 2021
hsaputra pushed a commit that referenced this pull request Oct 20, 2021
BP-44: USE metrics. A proposal for improving BookKeeper metrics so that operators can employ the USE method for diagnosing performance issues.


Reviewers: Henry Saputra <hsaputra@apache.org>, Andrey Yegorov <None>, Enrico Olivelli <eolivelli@gmail.com>

This closes #2835 from Vanlightly/BP-44-use-metrics and squashes the following commits:

8d9baab [Jack Vanlightly] Added link to USE method and listed each term of USE
5a0f67d [Jack Vanlightly] BP-44 USE metrics
a9b576d [Yunze Xu] Release semaphore when addEntry accepts the same entries (#2832)
148bf22 [Yun Tang] Ensure to release cache during KeyValueStorageRocksDB#closec (#2821)
4dc4260 [gaozhangmin] Heap memory leak problem when ledger replication failed (#2794)
a522fa3 [Raúl Gracia] Issue 2815: Upgrade to log4j2 to get rid of CVE-2019-17571 (#2816)
0465052 [Nicolò Boschi] Upgrade httpclient from 4.5.5 to 4.5.13 (#2793)
594a056 [Raúl Gracia] Issue 2795: Bookkeeper upgrade using Bookie ID may fail due to cookie mismatch (#2796)
354cf37 [Raúl Gracia] Upgraded dependencies with CVEs (#2792)
e413c70 [Raúl Gracia] Issue 2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option (#2779)
883231e [pradeepbn] Building bookkeeper with gradle on java11
zymap pushed a commit that referenced this pull request Oct 26, 2021
### Motivation

production environment, memory leak always happened, and there were ledger cannot be replicated successfully.

This cause by when `openLedgerNoRecovery` with `BKNotEnoughBookiesException`,  the LedgerHandler won't  closed properly, caused memory leak

https://github.com/apache/bookkeeper/blob/c7236adc3cb659e65ae5ce53b7156569d7f50ebd/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L364-L424

### Changes

close LedgerHandler when openComplete with exception

(cherry picked from commit 4dc4260)
hsaputra pushed a commit that referenced this pull request Oct 29, 2021
### Motivation

In order to complete migration to Gradle we must build all the subprojects.

### Changes

- Enabled `sh` integration tests with gradle, located in `tests/scripts/src/test/bash/gradle`
- Added these modules to the build
    - `bookkeeper-http:servlet-http-server` 
    - `metadata-drivers:etcd`
    - `tests:backward-compat:*`
    - `tests:shaded:*`
    - `stream:bk-grpc-name-resolver`
- DL shading process is now performed (before it didn't build any jar)
- Groovy tests (`tests:backward-compat:*`) now are triggered by the build/tests itself; with Maven, there is a "runner" project (`tests/integration-tests-base-groovy`); in Gradle is useless so it is skipped


### Test

- Both `bin/bookkeper standalone` and `bin/bookkeper_gradle standalone` work locally
- Tests are passing locally  

Master Issue: #2849



Reviewers: Henry Saputra <hsaputra@apache.org>, Prashant Kumar <None>

This closes #2850 from nicoloboschi/fix/2849/gradle and squashes the following commits:

00b49f4 [Nicolò Boschi] Fix common_gradle.sh regex
bd739fd [Nicolò Boschi] fix sh tests
43230ba [Nicolò Boschi] revert sh files. Avoid to modify maven files, create gradle versions to faciltate migration
d1f95e4 [Nicolò Boschi] fix shaded deps
bcab40d [Nicolò Boschi] fix build
5fd0341 [Nicolò Boschi] fix build
0082e0e [Nicolò Boschi] fix build
2c32ac1 [Nicolò Boschi] fixes
3bc0b26 [Nicolò Boschi] bookkeeper-server-shaded-tests
ba89132 [Nicolò Boschi] shaded tests
6d39e33 [Nicolò Boschi] sh tests
e0032bc [Nicolò Boschi] actually run arquillian groovy tests
08dcc39 [Nicolò Boschi] backwards
2361f79 [Nicolò Boschi] hierarchical-ledger-manager
8388e11 [Nicolò Boschi] current-server-old-clients
6a24344 [Nicolò Boschi] bc-non-fips
2faca01 [Nicolò Boschi] bk-grpc-name-resolver
991bc11 [Nicolò Boschi] servlet-http-server
675ef7b [Nicolò Boschi] etcd
b1d5e14 [ZhangJian He] A empty implement in EtcdLedgerManagerFactory to let the project can compile (#2845)
bd5c50b [shustsud] Add error handling to readLedgerMetadata in over-replicated ledger GC (#2844)
746f9f6 [Matteo Merli] Remove direct ZK access for Auditor (#2842)
4117200 [ZhangJian He] the compare should be >= instead of > (#2782)
14ef56f [Prashant Kumar] BookieId can not be cast to BookieSocketAddress (#2843)
e10f3fe [ZhangJian He] Forget to close preAllocator log on shutdown (#2819)
53954ca [shustsud] Add ensemble check to over-replicated ledger GC (#2813)
919fdd3 [Prashant Kumar] Issue:2840 Create bookie shellscript for gradle (#2841)
031d168 [gaozhangmin] fix-npe-when-pulsar-ZkBookieRackAffinityMapping-getBookieAddressResolver (#2788)
3dd671c [Prashant Kumar] Migrate bookkeepr-server:test to gradle run unit tests excepts org.apache.bookkeeper.bookie. org.apache.bookkeeper.client org.apache.bookkeeper.replication org.apache.bookkeeper.tls. (#2812)
f6903b8 [Jack Vanlightly] BP-44 USE metrics
a4afaa4 [Matteo Merli] Eliminate direct ZK access in ScanAndCompareGarbageCollector (#2833)
a9b576d [Yunze Xu] Release semaphore when addEntry accepts the same entries (#2832)
148bf22 [Yun Tang] Ensure to release cache during KeyValueStorageRocksDB#closec (#2821)
4dc4260 [gaozhangmin] Heap memory leak problem when ledger replication failed (#2794)
a522fa3 [Raúl Gracia] Issue 2815: Upgrade to log4j2 to get rid of CVE-2019-17571 (#2816)
0465052 [Nicolò Boschi] Upgrade httpclient from 4.5.5 to 4.5.13 (#2793)
594a056 [Raúl Gracia] Issue 2795: Bookkeeper upgrade using Bookie ID may fail due to cookie mismatch (#2796)
354cf37 [Raúl Gracia] Upgraded dependencies with CVEs (#2792)
e413c70 [Raúl Gracia] Issue 2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option (#2779)
883231e [pradeepbn] Building bookkeeper with gradle on java11
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
### Motivation

production environment, memory leak always happened, and there were ledger cannot be replicated successfully.



This cause by when `openLedgerNoRecovery` with `BKNotEnoughBookiesException`,  the LedgerHandler won't  closed properly, caused memory leak

https://github.com/apache/bookkeeper/blob/c7236adc3cb659e65ae5ce53b7156569d7f50ebd/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L364-L424

### Changes

close LedgerHandler when openComplete with exception
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
BP-44: USE metrics. A proposal for improving BookKeeper metrics so that operators can employ the USE method for diagnosing performance issues.


Reviewers: Henry Saputra <hsaputra@apache.org>, Andrey Yegorov <None>, Enrico Olivelli <eolivelli@gmail.com>

This closes apache#2835 from Vanlightly/BP-44-use-metrics and squashes the following commits:

8d9baab [Jack Vanlightly] Added link to USE method and listed each term of USE
5a0f67d [Jack Vanlightly] BP-44 USE metrics
a9b576d [Yunze Xu] Release semaphore when addEntry accepts the same entries (apache#2832)
148bf22 [Yun Tang] Ensure to release cache during KeyValueStorageRocksDB#closec (apache#2821)
4dc4260 [gaozhangmin] Heap memory leak problem when ledger replication failed (apache#2794)
a522fa3 [Raúl Gracia] Issue 2815: Upgrade to log4j2 to get rid of CVE-2019-17571 (apache#2816)
0465052 [Nicolò Boschi] Upgrade httpclient from 4.5.5 to 4.5.13 (apache#2793)
594a056 [Raúl Gracia] Issue 2795: Bookkeeper upgrade using Bookie ID may fail due to cookie mismatch (apache#2796)
354cf37 [Raúl Gracia] Upgraded dependencies with CVEs (apache#2792)
e413c70 [Raúl Gracia] Issue 2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option (apache#2779)
883231e [pradeepbn] Building bookkeeper with gradle on java11
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
### Motivation

In order to complete migration to Gradle we must build all the subprojects.

### Changes

- Enabled `sh` integration tests with gradle, located in `tests/scripts/src/test/bash/gradle`
- Added these modules to the build
    - `bookkeeper-http:servlet-http-server` 
    - `metadata-drivers:etcd`
    - `tests:backward-compat:*`
    - `tests:shaded:*`
    - `stream:bk-grpc-name-resolver`
- DL shading process is now performed (before it didn't build any jar)
- Groovy tests (`tests:backward-compat:*`) now are triggered by the build/tests itself; with Maven, there is a "runner" project (`tests/integration-tests-base-groovy`); in Gradle is useless so it is skipped


### Test

- Both `bin/bookkeper standalone` and `bin/bookkeper_gradle standalone` work locally
- Tests are passing locally  

Master Issue: apache#2849



Reviewers: Henry Saputra <hsaputra@apache.org>, Prashant Kumar <None>

This closes apache#2850 from nicoloboschi/fix/2849/gradle and squashes the following commits:

00b49f4 [Nicolò Boschi] Fix common_gradle.sh regex
bd739fd [Nicolò Boschi] fix sh tests
43230ba [Nicolò Boschi] revert sh files. Avoid to modify maven files, create gradle versions to faciltate migration
d1f95e4 [Nicolò Boschi] fix shaded deps
bcab40d [Nicolò Boschi] fix build
5fd0341 [Nicolò Boschi] fix build
0082e0e [Nicolò Boschi] fix build
2c32ac1 [Nicolò Boschi] fixes
3bc0b26 [Nicolò Boschi] bookkeeper-server-shaded-tests
ba89132 [Nicolò Boschi] shaded tests
6d39e33 [Nicolò Boschi] sh tests
e0032bc [Nicolò Boschi] actually run arquillian groovy tests
08dcc39 [Nicolò Boschi] backwards
2361f79 [Nicolò Boschi] hierarchical-ledger-manager
8388e11 [Nicolò Boschi] current-server-old-clients
6a24344 [Nicolò Boschi] bc-non-fips
2faca01 [Nicolò Boschi] bk-grpc-name-resolver
991bc11 [Nicolò Boschi] servlet-http-server
675ef7b [Nicolò Boschi] etcd
b1d5e14 [ZhangJian He] A empty implement in EtcdLedgerManagerFactory to let the project can compile (apache#2845)
bd5c50b [shustsud] Add error handling to readLedgerMetadata in over-replicated ledger GC (apache#2844)
746f9f6 [Matteo Merli] Remove direct ZK access for Auditor (apache#2842)
4117200 [ZhangJian He] the compare should be >= instead of > (apache#2782)
14ef56f [Prashant Kumar] BookieId can not be cast to BookieSocketAddress (apache#2843)
e10f3fe [ZhangJian He] Forget to close preAllocator log on shutdown (apache#2819)
53954ca [shustsud] Add ensemble check to over-replicated ledger GC (apache#2813)
919fdd3 [Prashant Kumar] Issue:2840 Create bookie shellscript for gradle (apache#2841)
031d168 [gaozhangmin] fix-npe-when-pulsar-ZkBookieRackAffinityMapping-getBookieAddressResolver (apache#2788)
3dd671c [Prashant Kumar] Migrate bookkeepr-server:test to gradle run unit tests excepts org.apache.bookkeeper.bookie. org.apache.bookkeeper.client org.apache.bookkeeper.replication org.apache.bookkeeper.tls. (apache#2812)
f6903b8 [Jack Vanlightly] BP-44 USE metrics
a4afaa4 [Matteo Merli] Eliminate direct ZK access in ScanAndCompareGarbageCollector (apache#2833)
a9b576d [Yunze Xu] Release semaphore when addEntry accepts the same entries (apache#2832)
148bf22 [Yun Tang] Ensure to release cache during KeyValueStorageRocksDB#closec (apache#2821)
4dc4260 [gaozhangmin] Heap memory leak problem when ledger replication failed (apache#2794)
a522fa3 [Raúl Gracia] Issue 2815: Upgrade to log4j2 to get rid of CVE-2019-17571 (apache#2816)
0465052 [Nicolò Boschi] Upgrade httpclient from 4.5.5 to 4.5.13 (apache#2793)
594a056 [Raúl Gracia] Issue 2795: Bookkeeper upgrade using Bookie ID may fail due to cookie mismatch (apache#2796)
354cf37 [Raúl Gracia] Upgraded dependencies with CVEs (apache#2792)
e413c70 [Raúl Gracia] Issue 2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option (apache#2779)
883231e [pradeepbn] Building bookkeeper with gradle on java11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants