-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Identify REMOTE_HOST_GONE for transport failure #7691
Conversation
I'm cool with putting this in the failure detector, but the detector will need two methods. One that tells us that the host is "having issues" and one that tells us that "we can't even connect". Currently the failure detector can only tell you the first one, so you can split the problem into "very long GC" vs. "host rebooted", which is the entire point of this. |
056e382
to
0e6a15b
Compare
Tested and verified Some open design questions:
We can probably amend something like "Worker is gone"?
An alternative implementation is to leverage |
An alternative design for
|
@raghavsethi : Ready for review :) |
Can you rebase? |
@raghavsethi done. thx! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The architecture needs to be rethought a bit.
@@ -52,7 +54,8 @@ public ExecutionFailureInfo( | |||
@JsonProperty("suppressed") List<ExecutionFailureInfo> suppressed, | |||
@JsonProperty("stack") List<String> stack, | |||
@JsonProperty("errorLocation") @Nullable ErrorLocation errorLocation, | |||
@JsonProperty("errorCode") @Nullable ErrorCode errorCode) | |||
@JsonProperty("errorCode") @Nullable ErrorCode errorCode, | |||
@JsonProperty("destinationHost") @Nullable HostAddress destinationHost) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Execution can fail for several reasons, only one of which has a 'destination host'. This needs to be re-thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would call this remoteHost, and add a comment saying that it is populated for comms failures.
@@ -390,7 +390,7 @@ public void testErrorCodes() | |||
{ | |||
assertEquals(new PageTooLargeException().getErrorCode(), PAGE_TOO_LARGE.toErrorCode()); | |||
assertEquals(new PageTransportErrorException("").getErrorCode(), PAGE_TRANSPORT_ERROR.toErrorCode()); | |||
assertEquals(new PageTransportTimeoutException("", null).getErrorCode(), PAGE_TRANSPORT_TIMEOUT.toErrorCode()); | |||
assertEquals(new PageTransportTimeoutException(null, "", null).getErrorCode(), PAGE_TRANSPORT_TIMEOUT.toErrorCode()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be disallowed.
*/ | ||
package com.facebook.presto.spi; | ||
|
||
public class PrestoTransportException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need four constructors? Would prefer fewer of these and more code changes.
@@ -13,11 +13,22 @@ | |||
*/ | |||
package com.facebook.presto.failureDetector; | |||
|
|||
import com.facebook.presto.spi.HostAddress; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inquisition is the wrong word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FailureDetector now attempts to determine the state of a node based on the last exception encountered.
{ | ||
ALIVE, | ||
UNKNOWN, | ||
GONE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UNKNOWN is super weird - not a good fit for this enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reorder to: ALIVE, UNRESPONSIVE, GONE, UNKNOWN
return State.UNRESPONSIVE; | ||
} | ||
else { | ||
return State.UNKNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of things:
- Can we make this so it can look up hosts in a map? Not saying it's the best way, just curious.
- When does the unknown branch actually get taken? Should it ever happen? My guess is that it's super rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for these thoughts!! :)
-
Today the
tasks
is a map fromUUID
(get from theServiceDescriptor
) to theMonitoringTask
. I didn't see an easy way to reverse lookup theServiceDescriptor
based on hostName. -
It's not common but it does happen. Here are something I am currently aware of:
- Something like
org.eclipse.jetty.client.HttpResponseException: HTTP protocol violation
will be returned if we ping on a non-HTTP port (e.g. port 22) - Some kind of "EOFException" will be returned when the machine is restarting, or some weird firewall issue that allows the connection but doesn't allow any communication.
- Something like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (hostAddress.equals(fromUri(task.uri))) {
if (!task.isFailed()) {
return State.ALIVE;
}
Exception lastException = task.getStats().getLastFailureException();
if (lastException instanceof SocketTimeoutException || lastException instanceof UnknownHostException) {
return State.GONE;
}
if (lastException instanceof ConnectException) {
return State.UNRESPONSIVE;
}
return State.UNKNOWN;
}
@Override | ||
public State getState(HostAddress hostAddress) | ||
{ | ||
return State.UNKNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the coordinator is always in 'unknown' state? Seems sketchy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch should only be triggered when it's NOT on coordinator right? (Since coordinator
is false). So my understanding is that this is on the worker, and there is no failure detector on worker so it looks to me UNKNOWN
is the only answer :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment saying that this is because failure detectors are not available on workers.
a689f0e
to
de0f00a
Compare
Addressed the comments and rebased. @raghavsethi would you like to take another look? Thank you !! :) |
public State getState(HostAddress hostAddress) | ||
{ | ||
for (MonitoringTask task : tasks.values()) { | ||
if (hostAddress.equals(HostAddress.fromUri(task.uri))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static import
@Override | ||
public State getState(HostAddress hostAddress) | ||
{ | ||
return State.UNKNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment saying that this is because failure detectors are not available on workers.
@@ -52,7 +54,8 @@ public ExecutionFailureInfo( | |||
@JsonProperty("suppressed") List<ExecutionFailureInfo> suppressed, | |||
@JsonProperty("stack") List<String> stack, | |||
@JsonProperty("errorLocation") @Nullable ErrorLocation errorLocation, | |||
@JsonProperty("errorCode") @Nullable ErrorCode errorCode) | |||
@JsonProperty("errorCode") @Nullable ErrorCode errorCode, | |||
@JsonProperty("destinationHost") @Nullable HostAddress destinationHost) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would call this remoteHost, and add a comment saying that it is populated for comms failures.
@@ -380,7 +380,7 @@ public void onFailure(Throwable t) | |||
uri, | |||
backoff.getFailureCount(), | |||
backoff.getTimeSinceLastSuccess().convertTo(SECONDS)); | |||
t = new PageTransportTimeoutException(message, t); | |||
t = new PageTransportTimeoutException(HostAddress.fromUri(uri), message, t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static import
public RuntimeException toException(FailureDetector failureDetector) | ||
{ | ||
if (getDestinationHost() != null && | ||
failureDetector.getState(getDestinationHost()) == FailureDetector.State.GONE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we agreed this should go somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you ! I think moving to SqlStageExecution
looks better! :)
@@ -79,6 +79,7 @@ | |||
CORRUPT_PAGE(0x0001_0013, INTERNAL_ERROR), | |||
OPTIMIZER_TIMEOUT(0x0001_0014, INTERNAL_ERROR), | |||
OUT_OF_SPILL_SPACE(0x0001_0015, INTERNAL_ERROR), | |||
HOST_UNREACHABLE(0x0001_0016, INTERNAL_ERROR), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like that GONE maps to HOST_UNREACHABLE. This should be consistent.
@raghavsethi Let me know if there is any further thoughts ! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good except minor comments. I'll have another look before you merge.
@@ -13,11 +13,22 @@ | |||
*/ | |||
package com.facebook.presto.failureDetector; | |||
|
|||
import com.facebook.presto.spi.HostAddress; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FailureDetector now attempts to determine the state of a node based on the last exception encountered.
{ | ||
ALIVE, | ||
UNKNOWN, | ||
GONE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reorder to: ALIVE, UNRESPONSIVE, GONE, UNKNOWN
for (MonitoringTask task : tasks.values()) { | ||
if (hostAddress.equals(fromUri(task.uri))) { | ||
if (!task.isFailed()) { | ||
return State.ALIVE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static import all of these enum values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raghavsethi : The reason for the ordering is because ALIVE
and UNKNOWN
are the two states will always exits. And we can add more states as we can identify more reasons :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK then, UNKOWN, ALIVE, GONE, RESPONSIVE
} | ||
} | ||
} | ||
return State.UNKNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add newline before this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raghavsethi return
vs. if-else
is an interesting question. Looks like nowadays pepole are preferring using early return :): https://www.quora.com/Which-one-is-better-using-if-else-or-return
In this case, the first if branch will handle when the host is not failing, and the else branch is all about different type of failed situations :). I don't have strong opinion in either way, though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's debatable in other scenarios, but in this one early-return makes it strictly easier to reason about code paths.
return State.UNRESPONSIVE; | ||
} | ||
else { | ||
return State.UNKNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (hostAddress.equals(fromUri(task.uri))) {
if (!task.isFailed()) {
return State.ALIVE;
}
Exception lastException = task.getStats().getLastFailureException();
if (lastException instanceof SocketTimeoutException || lastException instanceof UnknownHostException) {
return State.GONE;
}
if (lastException instanceof ConnectException) {
return State.UNRESPONSIVE;
}
return State.UNKNOWN;
}
@@ -122,7 +124,8 @@ public void requestFailed(Throwable reason) | |||
// fail the task, if we have more than X failures in a row and more than Y seconds have passed since the last request | |||
if (backoff.failure()) { | |||
// it is weird to mark the task failed locally and then cancel the remote task, but there is no way to tell a remote task that it is failed | |||
PrestoException exception = new PrestoException(TOO_MANY_REQUESTS_FAILED, | |||
PrestoException exception = new PrestoTransportException(TOO_MANY_REQUESTS_FAILED, | |||
HostAddress.fromUri(taskUri), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static import
@@ -55,21 +57,26 @@ public static ExecutionFailureInfo toFailure(Throwable failure) | |||
} | |||
// todo prevent looping with suppressed cause loops and such | |||
String type; | |||
HostAddress destinationHost = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null is the default value - you don't need to assign it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raghavsethi : For class field, yes. For local variable, it doesn't seem so. (Compiler will complain "Variable xxx might not have been initialized" :) )
Anyway, rename it to remoteHost
:)
@@ -418,5 +425,25 @@ private synchronized void updateMemoryUsage(TaskStatus taskStatus) | |||
previousMemory = currentMemory; | |||
stateMachine.updateMemoryUsage(deltaMemoryInBytes); | |||
} | |||
|
|||
private ExecutionFailureInfo identifyHostGone(ExecutionFailureInfo executionFailureInfo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private ExecutionFailureInfo checkHostStatus(ExecutionFailureInfo executionFailureInfo)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raghavsethi I think the most accurate names are
checkHostStatusToRewriteFailureInfo
rewriteFailureInfoBasedOnHostStatus
... So I think the best decision is to use checkHostStatus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. Maybe rewriteTransportFailure
captures both things?
@@ -79,6 +79,7 @@ | |||
CORRUPT_PAGE(0x0001_0013, INTERNAL_ERROR), | |||
OPTIMIZER_TIMEOUT(0x0001_0014, INTERNAL_ERROR), | |||
OUT_OF_SPILL_SPACE(0x0001_0015, INTERNAL_ERROR), | |||
HOST_GONE(0x0001_0016, INTERNAL_ERROR), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REMOTE_HOST_GONE is a little more descriptive. @martint this works for you?
7822a15
to
733cf28
Compare
e455ced
to
d7991e5
Compare
Revised and jenkin is passed :) @raghavsethi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, except that minor comment.
@@ -322,7 +322,7 @@ public void onSuccess(PagesResponse result) | |||
|
|||
if (!isNullOrEmpty(taskInstanceId) && !result.getTaskInstanceId().equals(taskInstanceId)) { | |||
// TODO: update error message | |||
throw new PrestoException(REMOTE_TASK_MISMATCH, format("%s (%s)", REMOTE_TASK_MISMATCH_ERROR, HostAddress.fromUri(uri))); | |||
throw new PrestoException(REMOTE_TASK_MISMATCH, format("%s (%s)", REMOTE_TASK_MISMATCH_ERROR, fromUri(uri))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this earlier - but this is not an indication that stuff is currently broken on the remote node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raghavsethi As we discussed, this is not a PrestoTransportException
. The URI is used for exception message. And it get changed because we refactor to static import fromUri
:)
b19f311
to
7809b60
Compare
FailureDetector now attempts to determine the state of a node based on the last exception encountered.
The remote host will also be reported during a transport error occur. This information help us give more precise error code if failure detector already knows the target node is dead.
Related Issue: #7618
Work in progress. Early comments on structure are welcome.
TODO: adding tests.