Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify REMOTE_HOST_GONE for transport failure #7691

Merged
merged 3 commits into from
Apr 27, 2017

Conversation

wenleix
Copy link
Contributor

@wenleix wenleix commented Mar 28, 2017

Related Issue: #7618

Work in progress. Early comments on structure are welcome.

TODO: adding tests.

@dain
Copy link
Contributor

dain commented Mar 29, 2017

I'm cool with putting this in the failure detector, but the detector will need two methods. One that tells us that the host is "having issues" and one that tells us that "we can't even connect". Currently the failure detector can only tell you the first one, so you can split the problem into "very long GC" vs. "host rebooted", which is the entire point of this.

@wenleix wenleix force-pushed the die_node branch 5 times, most recently from 056e382 to 0e6a15b Compare April 5, 2017 06:15
@wenleix
Copy link
Contributor Author

wenleix commented Apr 5, 2017

Tested and verified HOST_UNRECHABLE (maybe rename to WORKER_UNRECHABLE?) will be returned as status if the worker machine is rebooted during query execution. On the other side, killing the Presto process will return the same TOO_MANY_REQEUST_FAILED error. This separates software error (Presto) and hardware error (machine reboot).

screen shot 2017-04-04 at 22 51 15

Some open design questions:

  1. The error message returned to CLI would still say
Query 20170405_074858_00002_w6es6 failed: Encountered too many errors talking to a worker node.

We can probably amend something like "Worker is gone"?

  1. As we model "WORKER_GONE" as a cause for the failure, so I use the same detection mechanism (ratio > failureRatioThreshold) in 2c995ab .

  2. In internal Stats class, the current implementation maintains designated counter and method for UnrechabilityCounter and UnrechabilityRatio.

An alternative implementation is to leverage failureCountByType, and implements getRecentFailureRatioByType. However, this might be tricky when several FailureTypes maps to one FailureCause, etc.

@wenleix
Copy link
Contributor Author

wenleix commented Apr 5, 2017

An alternative design for FailureDetector would be the following:

public interface FailureDetector
{
    Set<ServiceDescriptor> getFailed();

    Map<ServiceDescriptor, FailureCause> getFailedAndCause();
}

@wenleix wenleix changed the title [WIP] Identify HOST_UNREACHABLE error by checking FailureDetector Identify HOST_UNREACHABLE error by checking FailureDetector Apr 6, 2017
@wenleix
Copy link
Contributor Author

wenleix commented Apr 6, 2017

@raghavsethi : Ready for review :)

@raghavsethi
Copy link
Contributor

Can you rebase?

@wenleix
Copy link
Contributor Author

wenleix commented Apr 7, 2017

@raghavsethi done. thx!

Copy link
Contributor

@raghavsethi raghavsethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The architecture needs to be rethought a bit.

@@ -52,7 +54,8 @@ public ExecutionFailureInfo(
@JsonProperty("suppressed") List<ExecutionFailureInfo> suppressed,
@JsonProperty("stack") List<String> stack,
@JsonProperty("errorLocation") @Nullable ErrorLocation errorLocation,
@JsonProperty("errorCode") @Nullable ErrorCode errorCode)
@JsonProperty("errorCode") @Nullable ErrorCode errorCode,
@JsonProperty("destinationHost") @Nullable HostAddress destinationHost)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Execution can fail for several reasons, only one of which has a 'destination host'. This needs to be re-thought.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this remoteHost, and add a comment saying that it is populated for comms failures.

@@ -390,7 +390,7 @@ public void testErrorCodes()
{
assertEquals(new PageTooLargeException().getErrorCode(), PAGE_TOO_LARGE.toErrorCode());
assertEquals(new PageTransportErrorException("").getErrorCode(), PAGE_TRANSPORT_ERROR.toErrorCode());
assertEquals(new PageTransportTimeoutException("", null).getErrorCode(), PAGE_TRANSPORT_TIMEOUT.toErrorCode());
assertEquals(new PageTransportTimeoutException(null, "", null).getErrorCode(), PAGE_TRANSPORT_TIMEOUT.toErrorCode());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be disallowed.

*/
package com.facebook.presto.spi;

public class PrestoTransportException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need four constructors? Would prefer fewer of these and more code changes.

@@ -13,11 +13,22 @@
*/
package com.facebook.presto.failureDetector;

import com.facebook.presto.spi.HostAddress;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inquisition is the wrong word.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FailureDetector now attempts to determine the state of a node based on the last exception encountered.

{
ALIVE,
UNKNOWN,
GONE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNKNOWN is super weird - not a good fit for this enum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reorder to: ALIVE, UNRESPONSIVE, GONE, UNKNOWN

return State.UNRESPONSIVE;
}
else {
return State.UNKNOWN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things:

  • Can we make this so it can look up hosts in a map? Not saying it's the best way, just curious.
  • When does the unknown branch actually get taken? Should it ever happen? My guess is that it's super rare.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these thoughts!! :)

  • Today the tasks is a map from UUID (get from the ServiceDescriptor) to the MonitoringTask. I didn't see an easy way to reverse lookup the ServiceDescriptor based on hostName.

  • It's not common but it does happen. Here are something I am currently aware of:

    • Something like org.eclipse.jetty.client.HttpResponseException: HTTP protocol violation will be returned if we ping on a non-HTTP port (e.g. port 22)
    • Some kind of "EOFException" will be returned when the machine is restarting, or some weird firewall issue that allows the connection but doesn't allow any communication.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            if (hostAddress.equals(fromUri(task.uri))) {
                if (!task.isFailed()) {
                    return State.ALIVE;
                }
                
                Exception lastException = task.getStats().getLastFailureException();
                if (lastException instanceof SocketTimeoutException || lastException instanceof UnknownHostException) {
                    return State.GONE;
                }
                
                if (lastException instanceof ConnectException) {
                    return State.UNRESPONSIVE;
                }
                
                return State.UNKNOWN;
            }

@Override
public State getState(HostAddress hostAddress)
{
return State.UNKNOWN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the coordinator is always in 'unknown' state? Seems sketchy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch should only be triggered when it's NOT on coordinator right? (Since coordinator is false). So my understanding is that this is on the worker, and there is no failure detector on worker so it looks to me UNKNOWN is the only answer :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment saying that this is because failure detectors are not available on workers.

@wenleix wenleix force-pushed the die_node branch 2 times, most recently from a689f0e to de0f00a Compare April 17, 2017 19:04
@wenleix
Copy link
Contributor Author

wenleix commented Apr 18, 2017

Addressed the comments and rebased. @raghavsethi would you like to take another look? Thank you !! :)

public State getState(HostAddress hostAddress)
{
for (MonitoringTask task : tasks.values()) {
if (hostAddress.equals(HostAddress.fromUri(task.uri))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static import

@Override
public State getState(HostAddress hostAddress)
{
return State.UNKNOWN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment saying that this is because failure detectors are not available on workers.

@@ -52,7 +54,8 @@ public ExecutionFailureInfo(
@JsonProperty("suppressed") List<ExecutionFailureInfo> suppressed,
@JsonProperty("stack") List<String> stack,
@JsonProperty("errorLocation") @Nullable ErrorLocation errorLocation,
@JsonProperty("errorCode") @Nullable ErrorCode errorCode)
@JsonProperty("errorCode") @Nullable ErrorCode errorCode,
@JsonProperty("destinationHost") @Nullable HostAddress destinationHost)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this remoteHost, and add a comment saying that it is populated for comms failures.

@@ -380,7 +380,7 @@ public void onFailure(Throwable t)
uri,
backoff.getFailureCount(),
backoff.getTimeSinceLastSuccess().convertTo(SECONDS));
t = new PageTransportTimeoutException(message, t);
t = new PageTransportTimeoutException(HostAddress.fromUri(uri), message, t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static import

public RuntimeException toException(FailureDetector failureDetector)
{
if (getDestinationHost() != null &&
failureDetector.getState(getDestinationHost()) == FailureDetector.State.GONE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we agreed this should go somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ! I think moving to SqlStageExecution looks better! :)

@@ -79,6 +79,7 @@
CORRUPT_PAGE(0x0001_0013, INTERNAL_ERROR),
OPTIMIZER_TIMEOUT(0x0001_0014, INTERNAL_ERROR),
OUT_OF_SPILL_SPACE(0x0001_0015, INTERNAL_ERROR),
HOST_UNREACHABLE(0x0001_0016, INTERNAL_ERROR),
Copy link
Contributor

@raghavsethi raghavsethi Apr 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that GONE maps to HOST_UNREACHABLE. This should be consistent.

@wenleix
Copy link
Contributor Author

wenleix commented Apr 19, 2017

@raghavsethi Let me know if there is any further thoughts ! :)

Copy link
Contributor

@raghavsethi raghavsethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good except minor comments. I'll have another look before you merge.

@@ -13,11 +13,22 @@
*/
package com.facebook.presto.failureDetector;

import com.facebook.presto.spi.HostAddress;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FailureDetector now attempts to determine the state of a node based on the last exception encountered.

{
ALIVE,
UNKNOWN,
GONE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reorder to: ALIVE, UNRESPONSIVE, GONE, UNKNOWN

for (MonitoringTask task : tasks.values()) {
if (hostAddress.equals(fromUri(task.uri))) {
if (!task.isFailed()) {
return State.ALIVE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static import all of these enum values

Copy link
Contributor Author

@wenleix wenleix Apr 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raghavsethi : The reason for the ordering is because ALIVE and UNKNOWN are the two states will always exits. And we can add more states as we can identify more reasons :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then, UNKOWN, ALIVE, GONE, RESPONSIVE

}
}
}
return State.UNKNOWN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add newline before this

Copy link
Contributor Author

@wenleix wenleix Apr 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raghavsethi return vs. if-else is an interesting question. Looks like nowadays pepole are preferring using early return :): https://www.quora.com/Which-one-is-better-using-if-else-or-return

In this case, the first if branch will handle when the host is not failing, and the else branch is all about different type of failed situations :). I don't have strong opinion in either way, though :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's debatable in other scenarios, but in this one early-return makes it strictly easier to reason about code paths.

return State.UNRESPONSIVE;
}
else {
return State.UNKNOWN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            if (hostAddress.equals(fromUri(task.uri))) {
                if (!task.isFailed()) {
                    return State.ALIVE;
                }
                
                Exception lastException = task.getStats().getLastFailureException();
                if (lastException instanceof SocketTimeoutException || lastException instanceof UnknownHostException) {
                    return State.GONE;
                }
                
                if (lastException instanceof ConnectException) {
                    return State.UNRESPONSIVE;
                }
                
                return State.UNKNOWN;
            }

@@ -122,7 +124,8 @@ public void requestFailed(Throwable reason)
// fail the task, if we have more than X failures in a row and more than Y seconds have passed since the last request
if (backoff.failure()) {
// it is weird to mark the task failed locally and then cancel the remote task, but there is no way to tell a remote task that it is failed
PrestoException exception = new PrestoException(TOO_MANY_REQUESTS_FAILED,
PrestoException exception = new PrestoTransportException(TOO_MANY_REQUESTS_FAILED,
HostAddress.fromUri(taskUri),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import

@@ -55,21 +57,26 @@ public static ExecutionFailureInfo toFailure(Throwable failure)
}
// todo prevent looping with suppressed cause loops and such
String type;
HostAddress destinationHost = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null is the default value - you don't need to assign it.

Copy link
Contributor Author

@wenleix wenleix Apr 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raghavsethi : For class field, yes. For local variable, it doesn't seem so. (Compiler will complain "Variable xxx might not have been initialized" :) )

Anyway, rename it to remoteHost :)

@@ -418,5 +425,25 @@ private synchronized void updateMemoryUsage(TaskStatus taskStatus)
previousMemory = currentMemory;
stateMachine.updateMemoryUsage(deltaMemoryInBytes);
}

private ExecutionFailureInfo identifyHostGone(ExecutionFailureInfo executionFailureInfo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private ExecutionFailureInfo checkHostStatus(ExecutionFailureInfo executionFailureInfo)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raghavsethi I think the most accurate names are

  • checkHostStatusToRewriteFailureInfo
  • rewriteFailureInfoBasedOnHostStatus

... So I think the best decision is to use checkHostStatus

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Maybe rewriteTransportFailure captures both things?

@@ -79,6 +79,7 @@
CORRUPT_PAGE(0x0001_0013, INTERNAL_ERROR),
OPTIMIZER_TIMEOUT(0x0001_0014, INTERNAL_ERROR),
OUT_OF_SPILL_SPACE(0x0001_0015, INTERNAL_ERROR),
HOST_GONE(0x0001_0016, INTERNAL_ERROR),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REMOTE_HOST_GONE is a little more descriptive. @martint this works for you?

@raghavsethi raghavsethi assigned wenleix and unassigned raghavsethi Apr 19, 2017
@wenleix wenleix changed the title Identify HOST_UNREACHABLE error by checking FailureDetector Identify REMOTE_HOST_GONE error by checking FailureDetector Apr 20, 2017
@wenleix wenleix force-pushed the die_node branch 2 times, most recently from 7822a15 to 733cf28 Compare April 20, 2017 05:06
@wenleix wenleix changed the title Identify REMOTE_HOST_GONE error by checking FailureDetector Identify REMOTE_HOST_GONE for transport failure Apr 20, 2017
@wenleix wenleix force-pushed the die_node branch 2 times, most recently from e455ced to d7991e5 Compare April 21, 2017 17:22
@wenleix
Copy link
Contributor Author

wenleix commented Apr 21, 2017

Revised and jenkin is passed :) @raghavsethi

@wenleix wenleix assigned raghavsethi and unassigned wenleix Apr 21, 2017
Copy link
Contributor

@raghavsethi raghavsethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, except that minor comment.

@@ -322,7 +322,7 @@ public void onSuccess(PagesResponse result)

if (!isNullOrEmpty(taskInstanceId) && !result.getTaskInstanceId().equals(taskInstanceId)) {
// TODO: update error message
throw new PrestoException(REMOTE_TASK_MISMATCH, format("%s (%s)", REMOTE_TASK_MISMATCH_ERROR, HostAddress.fromUri(uri)));
throw new PrestoException(REMOTE_TASK_MISMATCH, format("%s (%s)", REMOTE_TASK_MISMATCH_ERROR, fromUri(uri)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this earlier - but this is not an indication that stuff is currently broken on the remote node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raghavsethi As we discussed, this is not a PrestoTransportException. The URI is used for exception message. And it get changed because we refactor to static import fromUri :)

@raghavsethi raghavsethi assigned wenleix and unassigned raghavsethi Apr 24, 2017
@wenleix wenleix force-pushed the die_node branch 3 times, most recently from b19f311 to 7809b60 Compare April 27, 2017 04:31
wenleix added 3 commits April 26, 2017 22:27
FailureDetector now attempts to determine the state of a node based on
the last exception encountered.
The remote host will also be reported during a transport error
occur. This information help us give more precise error code if failure
detector already knows the target node is dead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants