Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix cluster not able to spin up issue when disk usage exceeds threshold #15258

Merged
merged 9 commits into from
Oct 16, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Fix protobuf-java leak through client library dependencies ([#16254](https://github.com/opensearch-project/OpenSearch/pull/16254))
- Fix multi-search with template doesn't return status code ([#16265](https://github.com/opensearch-project/OpenSearch/pull/16265))
- Fix wrong default value when setting `index.number_of_routing_shards` to null on index creation ([#16331](https://github.com/opensearch-project/OpenSearch/pull/16331))
- Fix disk usage exceeds threshold cluster can't spin up issue ([#15258](https://github.com/opensearch-project/OpenSearch/pull/15258)))


### Security

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,13 +31,15 @@

package org.opensearch.bootstrap;

import org.opensearch.common.logging.LogConfigurator;
import org.opensearch.common.settings.KeyStoreCommandTestCase;
import org.opensearch.common.settings.KeyStoreWrapper;
import org.opensearch.common.settings.SecureSettings;
import org.opensearch.common.settings.Settings;
import org.opensearch.common.util.io.IOUtils;
import org.opensearch.core.common.settings.SecureString;
import org.opensearch.env.Environment;
import org.opensearch.node.Node;
import org.opensearch.test.OpenSearchTestCase;
import org.junit.After;
import org.junit.Before;
Expand All @@ -51,8 +53,14 @@
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import static org.hamcrest.Matchers.equalTo;
import static org.mockito.Mockito.doAnswer;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

public class BootstrapTests extends OpenSearchTestCase {
Environment env;
Expand Down Expand Up @@ -131,4 +139,38 @@ private void assertPassphraseRead(String source, String expected) {
}
}

public void testInitExecutionOrder() throws Exception {
AtomicInteger order = new AtomicInteger(0);
CountDownLatch countDownLatch = new CountDownLatch(1);
Thread mockThread = new Thread(() -> {
assertEquals(0, order.getAndIncrement());
countDownLatch.countDown();
});

Node mockNode = mock(Node.class);
doAnswer(invocation -> {
try {
boolean threadStarted = countDownLatch.await(1000, TimeUnit.MILLISECONDS);
assertTrue(
"Waited for one second but the keepAliveThread isn't started, please check the execution order of"
+ "keepAliveThread.start and node.start",
threadStarted
);
} catch (InterruptedException e) {
fail("Thread interrupted");
}
assertEquals(1, order.getAndIncrement());
return null;
}).when(mockNode).start();

LogConfigurator.registerErrorListener();
Bootstrap testBootstrap = new Bootstrap(mockThread, mockNode);
Bootstrap.setInstance(testBootstrap);

Bootstrap.startInstance(testBootstrap);

verify(mockNode).start();
assertEquals(2, order.get());
}

}
21 changes: 19 additions & 2 deletions server/src/main/java/org/opensearch/bootstrap/Bootstrap.java
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,17 @@
private final Thread keepAliveThread;
private final Spawner spawner = new Spawner();

// For testing purpose
static void setInstance(Bootstrap bootstrap) {
INSTANCE = bootstrap;
}

// For testing purpose
Bootstrap(Thread keepAliveThread, Node node) {
this.keepAliveThread = keepAliveThread;
this.node = node;
}

/** creates a new instance */
Bootstrap() {
keepAliveThread = new Thread(new Runnable() {
Expand Down Expand Up @@ -336,8 +347,10 @@
}

private void start() throws NodeValidationException {
node.start();
// keepAliveThread should start first than node to ensure the cluster can spin up successfully in edge cases:
// https://github.com/opensearch-project/OpenSearch/issues/14791
keepAliveThread.start();
node.start();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively speaking, if node.start() doesn't succeeded, then there is nothing to keep alive and the JVM shutting down seems like the right thing to do. It seems like this change could lead to a partially-started or not-at-all-started node running indefinitely because the keepAliveThread will just keep it alive in some sort of zombie state after node.start() failed. What am I missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively responding, I've no objection to the JVM shutting down, but having spent way too many hours trying to figure out why the JVM shut down in various situations, I'd be really happy to have at least one thread faithfully logging whatever issues caused it to shut down, before shutting itself down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all node.start() failing means the JVM should quit, like the case in the corresponding issue, if we're able to keep JVM running, user can fix this issue simply by changing the cluster settings. So the fix is to provide user the capability to interact with the running cluster(even partially-started).
And for those cases node.start() failing and the cluster is not available at all case, this fix doesn't have real impact because in the end user needs to check error/fatal logs to fix the root cause. This change doesn't add any blocking to user but w/o this change, the first case users are blocked.
So in all I think this is a positive enhancement to user.

Copy link
Member

@andrross andrross Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like the case in the corresponding issue, if we're able to keep JVM running, user can fix this issue simply by changing the cluster settings.

This seems like a really blunt change for an extremely specific case. In general, quitting the JVM and letting it restart is preferable to continuing to run in a partially-started state. Prior to this change, a transient failure in node.start() might automatically recover because the JVM would quit and whatever is monitoring the process would restart it. Now there is a risk (in my opinion) that the process will continue running in a non-functional partially-started state that will require human intervention to resolve. How do we know that is not a risk here?

The assumption that anything useful can be done with a partially started node was true in the specific case mentioned but is not true in general.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but having spent way too many hours trying to figure out why the JVM shut down in various situations, I'd be really happy to have at least one thread faithfully logging whatever issues caused it to shut down, before shutting itself down.

@dbwiddis Not logging why the JVM shutdown is clearly a problem. However, I'd argue this change might make things considerably worse. Instead of knowing there's a problem (because the process restarted) we'll instead have a partially-started node running in a crippled state with no indication that something went wrong during startup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can create the revert PR now, and I'll look into the solution 2 for a long term solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zane-neo . Please also consider solution 1 and/or the bug I reported. One of the two of those issues should be fixed ASAP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrross @dblock @dbwiddis Do you think we can conclude that all plugins shouldn't block the OpenSearch from starting up even plugin itself encountering issue? I believe if this is true, we can simple do a try catch check for the onNodeStarted method to avoid this. The pros of this fix is it's simple and effective, and it can prevent this won't happen in other plugins.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we can conclude that all plugins shouldn't block the OpenSearch from starting up

@zane-neo I don't think this is true generally. For example, if the security plugin can't properly start up it would be safer to fail versus coming up with security disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrross thanks, I created this PR: opensearch-project/observability#1873 for short term fix, and I will continue the long term fix exploration.

}

static void stop() throws IOException {
Expand Down Expand Up @@ -410,7 +423,7 @@
throw new BootstrapException(e);
}

INSTANCE.start();
startInstance(INSTANCE);

Check warning on line 426 in server/src/main/java/org/opensearch/bootstrap/Bootstrap.java

View check run for this annotation

Codecov / codecov/patch

server/src/main/java/org/opensearch/bootstrap/Bootstrap.java#L426

Added line #L426 was not covered by tests

// We don't close stderr if `--quiet` is passed, because that
// hides fatal startup errors. For example, if OpenSearch is
Expand Down Expand Up @@ -462,6 +475,10 @@
}
}

static void startInstance(Bootstrap instance) throws NodeValidationException {
instance.start();
}

@SuppressForbidden(reason = "System#out")
private static void closeSystOut() {
System.out.close();
Expand Down
Loading