RATIS-2184. Improve TestRaftWithGrpc test stability #1177

Open
wants to merge 2 commits into
base: master

jianghuazhu
Contributor

What changes were proposed in this pull request?

Improve stability of TestRaftWithGrpc tests.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/RATIS-2184

How was this patch tested?

ci:
https://github.com/jianghuazhu/ratis/actions/runs/11792452996

@jianghuazhu
Contributor Author

@szetszwo @duongkame , could you take a look?
Thanks.

Contributor

@szetszwo szetszwo left a comment

@jianghuazhu , thanks a lot for working on this! Please see the comments inlined.

Comment on lines +264 to +275
ReferenceCountedObject<EntryWithData> entryWithData = null;
try {
  entryWithData = getRaftLog().retainEntryWithData(next);
  if (!buffer.offer(entryWithData.get())) {
    entryWithData.release();
    break;
  }
  offered.put(next, entryWithData);
} catch (Exception e) {
  if (entryWithData != null) {
    entryWithData.release();
  }
Contributor

LogAppenderDaemon failed
org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index = 4269
	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainEntryWithData(SegmentedRaftLog.java:334)
	at org.apache.ratis.server.leader.LogAppenderBase.nextAppendEntriesRequest(LogAppenderBase.java:264)

For the above particular exception, this change won't help since, when retainEntryWithData(..) throws an exception, entryWithData must be null.

This change will help if other methods (e.g. get(), offer(..), put(..)) throw an exception. However, these methods throw only runtime exceptions/errors (e.g. OutOfMemoryError). We may not need to handle them.
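
A minimal sketch of the first point, assuming nothing about the real Ratis classes: `CountedRef`, `retainEntry` and `NullOnThrowDemo` are hypothetical stand-ins. When the call on the right-hand side of the assignment throws, the assignment never happens, so the null check in the catch block cannot find anything to release.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a reference-counted entry; not the Ratis API. */
final class CountedRef {
  private final AtomicInteger released = new AtomicInteger();

  void release() {
    released.incrementAndGet();
  }

  int releasedCount() {
    return released.get();
  }
}

final class NullOnThrowDemo {
  /** Simulates a lookup that fails before it can return a reference. */
  static CountedRef retainEntry(long index) {
    throw new IllegalStateException("Log entry not found: index = " + index);
  }

  /** Returns true only if the catch block found a non-null reference to release. */
  static boolean catchSawReference(long index) {
    CountedRef ref = null;
    try {
      ref = retainEntry(index); // throws, so the assignment never happens
      return false;
    } catch (IllegalStateException e) {
      if (ref != null) {        // always false on this path
        ref.release();
        return true;
      }
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(catchSawReference(4269)); // prints "false"
  }
}
```

The guard only matters when a later statement in the try-block throws after the reference has already been assigned.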

pom.xml Outdated
@@ -643,7 +643,7 @@
<enableProcessChecker>all</enableProcessChecker>
<forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
<!-- @argLine is filled by jacoco maven plugin. @{} means late evaluation -->
-<argLine>-Xmx2g -XX:+HeapDumpOnOutOfMemoryError @{argLine}</argLine>
+<argLine>-Xmx8g -XX:+HeapDumpOnOutOfMemoryError @{argLine}</argLine>
Contributor

Turning on the advanced reference trace does need more memory. I recall making a similar change when running with the advanced reference trace.

pom.xml Outdated
-<maxmem>2048m</maxmem>
+<maxmem>4096m</maxmem>
Contributor

Are you sure that it needs more memory for compilation?

@jianghuazhu
Contributor Author

@szetszwo , I added some comments in RATIS-2184.
I hope you can give some advice.

@jianghuazhu
Contributor Author

jianghuazhu commented Dec 18, 2024

@szetszwo , I updated it.
While running the tests, I found some problems, including:

  1. When cluster#shutdown() is called, there is no guarantee that RaftServerProxy will release all resources in advance.
  2. The stack traces show that LogSegment#readSegmentFile() is a very important entry point.

Therefore, I made a few improvements:

  1. When cluster#shutdown() is called, release as many resources as possible.
  2. Where necessary, catch exceptions and call ReferenceCountedObject#release().

After many tests, no problems were found when running TestRaftWithGrpc.
ci : https://github.com/jianghuazhu/ratis/actions/runs/12390154687

@szetszwo
Contributor

> ... RaftServerProxy will release all resources in advance.

What resources? We need to fix it if that is really the case.

> After many tests, no problems were found ...

That's great! Let me start a build to run the tests repeatedly.

Contributor

@szetszwo szetszwo left a comment

@jianghuazhu , Thanks for the update! Please see the comments inlined.

Comment on lines 313 to 319
/*grpcServerMetrics.unregister();
CompletableFuture<LifeCycle.State> future = super.stopAsync();
if (appendLogRequestObserver != null) {
  appendLogRequestObserver.stop();
  appendLogRequestObserver = null;
}
return future;*/
Contributor

This is a good change. Could you remove the commented code?

@@ -33,7 +33,6 @@
public final class ReferenceCountedLeakDetector {
private static final Logger LOG = LoggerFactory.getLogger(ReferenceCountedLeakDetector.class);
// Leak detection is turned off by default.

Contributor

Let's revert this whitespace change.

toReturn.set(entryRef);
} else {
try {
final LogEntryProto entry = entryRef.retain();
Contributor

entryRef.retain() should be called before the try-block. Otherwise, if it throws an exception, we will call release() without having successfully retained the entry.
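
A minimal sketch of the difference, under stated assumptions: `SimpleRef` and `RetainBeforeTryDemo` are hypothetical names, not Ratis' `ReferenceCountedObject`, and the cleanup is placed in a `finally` block purely for illustration. With retain() inside the try-block, a failed retain still reaches release(), which here fails in turn and hides the original exception.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted holder; a stand-in, not the Ratis API. */
final class SimpleRef<T> {
  private final T value;
  private final boolean failOnRetain;
  private final AtomicInteger count = new AtomicInteger();

  SimpleRef(T value, boolean failOnRetain) {
    this.value = value;
    this.failOnRetain = failOnRetain;
  }

  T retain() {
    if (failOnRetain) {
      throw new IllegalStateException("retain failed");
    }
    count.incrementAndGet();
    return value;
  }

  void release() {
    if (count.decrementAndGet() < 0) {
      throw new IllegalStateException("released without retain");
    }
  }

  int count() {
    return count.get();
  }
}

final class RetainBeforeTryDemo {
  /** retain() is outside the try-block: release() runs only after a successful retain. */
  static <T> int useSafely(SimpleRef<T> ref) {
    final T value = ref.retain(); // if this throws, nothing was retained and nothing is released
    try {
      return String.valueOf(value).length();
    } finally {
      ref.release();
    }
  }

  /** retain() is inside the try-block: a failed retain is still followed by release(). */
  static <T> int useUnsafely(SimpleRef<T> ref) {
    try {
      final T value = ref.retain();
      return String.valueOf(value).length();
    } finally {
      ref.release(); // runs even when retain() threw
    }
  }

  public static void main(String[] args) {
    final SimpleRef<String> ok = new SimpleRef<>("entry", false);
    System.out.println(useSafely(ok) + " " + ok.count()); // prints "5 0"
    try {
      useUnsafely(new SimpleRef<>("entry", true));
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // prints "released without retain"
    }
    try {
      useSafely(new SimpleRef<>("entry", true));
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // prints "retain failed"
    }
  }
}
```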

try {
ref = retainLog(index);
Contributor

Similar to the previous case, retainLog(index) should be called before the try-block.

@szetszwo
Contributor

> Let me start a build to run the tests repeatedly.

Started https://github.com/szetszwo/ratis/actions/runs/12398883445

@jianghuazhu
Contributor Author

> Let me start a build to run the tests repeatedly.
>
> Started https://github.com/szetszwo/ratis/actions/runs/12398883445

Sorry, there still seem to be some errors or omissions that I have not found yet.

@szetszwo
Contributor

@jianghuazhu , The 10x100 build timed out. Let's retry with 10x10:
https://github.com/szetszwo/ratis/actions/runs/12414246844

@szetszwo
Contributor

@jianghuazhu , compared with master, your branch has indeed improved the success rate.

Could you clean up the code? We may merge it first and make further improvements in a separate JIRA.

@jianghuazhu
Contributor Author

> @jianghuazhu , compared with master, your branch has indeed improved the success rate.
>
> Could you clean up the code? We may merge it first and make further improvements in a separate JIRA.

Thanks @szetszwo .
I have updated it.
