Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-11083. Avoid duplicate creation of RunningDatanodeState #6886

Merged
merged 4 commits into from
Jul 25, 2024

Conversation

jianghuazhu
Copy link
Contributor

What changes were proposed in this pull request?

When DN is running, the RunningDatanodeState is created every time the heartbeat is sent to the SCM. This operation is very frequent and should be reused as much as possible.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11083

How was this patch tested?

Make sure that your unit tests pass.

@jianghuazhu
Copy link
Contributor Author

ci : https://github.com/jianghuazhu/ozone/actions/runs/9717049941
In CI/CD, several unit tests failed:
TestECContainerRecovery
TestStorageContainerManager
TestReconAndAdminContainerCLI
TestDecommissionAndMaintenance

It seems they all failed for the same reason.
image
This needs to be addressed in another jira, and I'll create a new issue later.

@adoroszlai @hemantk-12 @errose28 , can you help review this PR.
thank you.

1 similar comment
@jianghuazhu
Copy link
Contributor Author

ci : https://github.com/jianghuazhu/ozone/actions/runs/9717049941
In CI/CD, several unit tests failed:
TestECContainerRecovery
TestStorageContainerManager
TestReconAndAdminContainerCLI
TestDecommissionAndMaintenance

It seems they all failed for the same reason.
image
This needs to be addressed in another jira, and I'll create a new issue later.

@adoroszlai @hemantk-12 @errose28 , can you help review this PR.
thank you.

@adoroszlai adoroszlai marked this pull request as draft June 29, 2024 11:47
@adoroszlai
Copy link
Contributor

n CI/CD, several unit tests failed:
TestECContainerRecovery
TestStorageContainerManager
TestReconAndAdminContainerCLI
TestDecommissionAndMaintenance

This needs to be addressed in another jira

These tests are not encountering BindException without this change, so why would you address in another Jira instead of this one?

@jianghuazhu jianghuazhu force-pushed the HDDS-11083 branch 4 times, most recently from 750d650 to ac48935 Compare July 4, 2024 03:08
@jianghuazhu
Copy link
Contributor Author

Thanks @adoroszlai for the comments and review.
I'm trying to solve these problems.

@jianghuazhu jianghuazhu force-pushed the HDDS-11083 branch 4 times, most recently from 30126e0 to 61ad24c Compare July 17, 2024 10:01
@jianghuazhu
Copy link
Contributor Author

ci:https://github.com/jianghuazhu/ozone/actions/runs/9974167102
I have updated it and all CI passes.
Can you review it again, @adoroszlai ?
Thanks.

Comment on lines 58 to 61
/**
* Clean up some resources.
*/
void clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add default no-op implementation.

Comment on lines 546 to 549
@Override
public void clear() {

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change needed in this file after adding default no-op implementation.

Comment on lines 182 to 185
@Override
public void clear() {

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change needed in this file after adding default no-op implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated it.

@jianghuazhu
Copy link
Contributor Author

Do you think there are other places in this PR that need to be improved? If so, I will work hard to improve it. @adoroszlai @hemantk-12

@adoroszlai
Copy link
Contributor

@jianghuazhu Thanks for working on this. Sorry, I have no time to review it, please ask others.

@jianghuazhu
Copy link
Contributor Author

@jianghuazhu Thanks for working on this. Sorry, I have no time to review it, please ask others.
Thanks @adoroszlai .

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jianghuazhu , thanks for working on this! Please see the comment/question inlined.

@@ -655,41 +659,45 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
// we called stop DatanodeStateMachine, this sets state to SHUTDOWN, and
// there is a chance of getting task as null.
if (task != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change the if-statement to return instead of add a new indentation.

@@ -654,7 +658,11 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
     // Adding not null check, in a case where datanode is still starting up, but
     // we called stop DatanodeStateMachine, this sets state to SHUTDOWN, and
     // there is a chance of getting task as null.
-    if (task != null) {
+    if (task == null) {
+      return;
+    }
+
+    try {
       if (this.isEntering()) {
         task.onEnter();
       }
@@ -691,6 +699,8 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
         // that we can terminate the datanode.
         setShutdownOnError();
       }
+    } finally {
+      task.clear();
     }
   }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update later.

Comment on lines 121 to 142
Callable<EndPointStates> endpointTask = null;
for (EndpointStateMachine endpointStateMachine : connectionManager.getValues()) {
if (endpointStateMachine.getAddressString().equals(endpoint.getAddressString())) {
if (endpoint.getState().getValue() == EndPointStates.GETVERSION.getValue()) {
endpointTask = new VersionEndpointTask(endpoint, conf,
context.getParent().getContainer());
} else if (endpoint.getState().getValue() == EndPointStates.REGISTER.getValue()) {
endpointTask = RegisterEndpointTask.newBuilder()
.setConfig(conf)
.setEndpointStateMachine(endpoint)
.setContext(context)
.setDatanodeDetails(context.getParent().getDatanodeDetails())
.setOzoneContainer(context.getParent().getContainer())
.build();
} else if (endpoint.getState().getValue() == EndPointStates.HEARTBEAT.getValue()) {
endpointTask = HeartbeatEndpointTask.newBuilder()
.setConfig(conf)
.setEndpointStateMachine(endpoint)
.setDatanodeDetails(context.getParent().getDatanodeDetails())
.setContext(context)
.build();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to create an endpointTask each time instead of using the endpointTasks map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @szetszwo .
There are two reasons for this:

  1. If endpointTasks is kept and initialized once, unfortunately, some CI fail.
  2. If endpointTasks is created each time execute is executed, a VersionEndpointTask, RegisterEndpointTask, and HeartbeatEndpointTask are created. However, HeartbeatEndpointTask is more often used, so it can be created based on the endpoint state, which is less expensive.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jianghuazhu , thanks for the updates and the explanation! Please see the comments inlined.

Comment on lines 122 to 123
for (EndpointStateMachine endpointStateMachine : connectionManager.getValues()) {
if (endpointStateMachine.getAddressString().equals(endpoint.getAddressString())) {
Copy link
Contributor

@szetszwo szetszwo Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This for loop and the if-statement are not needed since endpointStateMachine is not used to build the task. Also, we already has it in the execute(..) method.

Callable<EndPointStates> endpointTask = null;
for (EndpointStateMachine endpointStateMachine : connectionManager.getValues()) {
if (endpointStateMachine.getAddressString().equals(endpoint.getAddressString())) {
if (endpoint.getState().getValue() == EndPointStates.GETVERSION.getValue()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use switch-case instead.

Comment on lines 119 to 120
private Callable<EndPointStates> getEndPointTask(
EndpointStateMachine endpoint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename it to buildEndPointTask.

  private Callable<EndPointStates> buildEndPointTask(EndpointStateMachine endpoint) {
    switch (endpoint.getState()) {
      case GETVERSION:
        return new VersionEndpointTask(endpoint, conf, context.getParent().getContainer());
      case REGISTER:
        return RegisterEndpointTask.newBuilder()
            .setConfig(conf)
            .setEndpointStateMachine(endpoint)
            .setContext(context)
            .setDatanodeDetails(context.getParent().getDatanodeDetails())
            .setOzoneContainer(context.getParent().getContainer())
            .build();
      case HEARTBEAT:
        return HeartbeatEndpointTask.newBuilder()
            .setConfig(conf)
            .setEndpointStateMachine(endpoint)
            .setDatanodeDetails(context.getParent().getDatanodeDetails())
            .setContext(context)
            .build();
      default:
        return null;
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @szetszwo .
I'll update soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated it, @szetszwo .

@szetszwo szetszwo marked this pull request as ready for review July 21, 2024 07:29
Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@szetszwo
Copy link
Contributor

Just have restarted the failed tests.

@jianghuazhu
Copy link
Contributor Author

Just have restarted the failed tests.
Thanks @szetszwo .

@szetszwo szetszwo merged commit c760804 into apache:master Jul 25, 2024
39 checks passed
@adoroszlai
Copy link
Contributor

Thanks @jianghuazhu for the patch, @szetszwo for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants