Identify REMOTE_HOST_GONE for transport failure #7691

wenleix · 2017-03-28T22:13:41Z

Related Issue: #7618

Work in progress. Early comments on structure are welcome.

TODO: add tests.

dain · 2017-03-29T03:04:01Z

I'm cool with putting this in the failure detector, but the detector will need two methods: one that tells us that the host is "having issues" and one that tells us that "we can't even connect". Currently the failure detector can only tell you the first one. With both, you can split the problem into "very long GC" vs. "host rebooted", which is the entire point of this.

wenleix · 2017-04-05T08:10:35Z

Tested and verified that HOST_UNREACHABLE (maybe rename to WORKER_UNREACHABLE?) is returned as the status if the worker machine is rebooted during query execution. On the other hand, killing the Presto process still returns the same TOO_MANY_REQUESTS_FAILED error. This separates software errors (Presto) from hardware errors (machine reboot).

Some open design questions:

The error message returned to CLI would still say

Query 20170405_074858_00002_w6es6 failed: Encountered too many errors talking to a worker node.

We can probably amend something like "Worker is gone"?

Since we model "WORKER_GONE" as a cause of the failure, I use the same detection mechanism (ratio > failureRatioThreshold) in 2c995ab.
In the internal Stats class, the current implementation maintains a dedicated counter and method for UnrechabilityCounter and UnrechabilityRatio.

An alternative implementation is to leverage failureCountByType and implement getRecentFailureRatioByType. However, this might be tricky when several FailureTypes map to one FailureCause, etc.

wenleix · 2017-04-05T08:16:29Z

An alternative design for FailureDetector would be the following:

public interface FailureDetector
{
    Set<ServiceDescriptor> getFailed();

    Map<ServiceDescriptor, FailureCause> getFailedAndCause();
}

wenleix · 2017-04-06T00:35:39Z

@raghavsethi : Ready for review :)

raghavsethi · 2017-04-07T02:30:04Z

Can you rebase?

wenleix · 2017-04-07T20:22:14Z

@raghavsethi done. thx!

raghavsethi

The architecture needs to be rethought a bit.

raghavsethi · 2017-04-10T20:11:04Z

presto-main/src/main/java/com/facebook/presto/execution/ExecutionFailureInfo.java

@@ -52,7 +54,8 @@ public ExecutionFailureInfo(
            @JsonProperty("suppressed") List<ExecutionFailureInfo> suppressed,
            @JsonProperty("stack") List<String> stack,
            @JsonProperty("errorLocation") @Nullable ErrorLocation errorLocation,
-            @JsonProperty("errorCode") @Nullable ErrorCode errorCode)
+            @JsonProperty("errorCode") @Nullable ErrorCode errorCode,
+            @JsonProperty("destinationHost") @Nullable HostAddress destinationHost)


Execution can fail for several reasons, only one of which has a 'destination host'. This needs to be re-thought.

I would call this remoteHost, and add a comment saying that it is populated for comms failures.

raghavsethi · 2017-04-10T20:13:47Z

presto-main/src/test/java/com/facebook/presto/operator/TestHttpPageBufferClient.java

@@ -390,7 +390,7 @@ public void testErrorCodes()
    {
        assertEquals(new PageTooLargeException().getErrorCode(), PAGE_TOO_LARGE.toErrorCode());
        assertEquals(new PageTransportErrorException("").getErrorCode(), PAGE_TRANSPORT_ERROR.toErrorCode());
-        assertEquals(new PageTransportTimeoutException("", null).getErrorCode(), PAGE_TRANSPORT_TIMEOUT.toErrorCode());
+        assertEquals(new PageTransportTimeoutException(null, "", null).getErrorCode(), PAGE_TRANSPORT_TIMEOUT.toErrorCode());


This should be disallowed.

raghavsethi · 2017-04-10T20:14:50Z

presto-spi/src/main/java/com/facebook/presto/spi/PrestoTransportException.java

+ */
+package com.facebook.presto.spi;
+
+public class PrestoTransportException


Do we really need four constructors? Would prefer fewer of these and more code changes.

raghavsethi · 2017-04-10T20:15:13Z

presto-main/src/main/java/com/facebook/presto/failureDetector/FailureDetector.java

@@ -13,11 +13,22 @@
 */
 package com.facebook.presto.failureDetector;

+import com.facebook.presto.spi.HostAddress;


Inquisition is the wrong word.

FailureDetector now attempts to determine the state of a node based on the last exception encountered.

raghavsethi · 2017-04-10T20:17:04Z

presto-main/src/main/java/com/facebook/presto/failureDetector/FailureDetector.java

+    {
+        ALIVE,
+        UNKNOWN,
+        GONE,


UNKNOWN is super weird - not a good fit for this enum.

Reorder to: ALIVE, UNRESPONSIVE, GONE, UNKNOWN

raghavsethi · 2017-04-10T20:19:45Z

presto-main/src/main/java/com/facebook/presto/failureDetector/HeartbeatFailureDetector.java

+                        return State.UNRESPONSIVE;
+                    }
+                    else {
+                        return State.UNKNOWN;


Couple of things:

Can we make this so it can look up hosts in a map? Not saying it's the best way, just curious.

When does the unknown branch actually get taken? Should it ever happen? My guess is that it's super rare.

Thank you for these thoughts!! :)

Today, tasks is a map from UUID (obtained from the ServiceDescriptor) to MonitoringTask. I didn't see an easy way to do a reverse lookup of the ServiceDescriptor based on hostName.

It's not common, but it does happen. Here are the cases I am currently aware of:

Something like org.eclipse.jetty.client.HttpResponseException: HTTP protocol violation will be returned if we ping a non-HTTP port (e.g. port 22).

Some kind of "EOFException" will be returned when the machine is restarting, or when some weird firewall issue allows the connection but doesn't allow any communication.

if (hostAddress.equals(fromUri(task.uri))) {
    if (!task.isFailed()) {
        return State.ALIVE;
    }
    Exception lastException = task.getStats().getLastFailureException();
    if (lastException instanceof SocketTimeoutException || lastException instanceof UnknownHostException) {
        return State.GONE;
    }
    if (lastException instanceof ConnectException) {
        return State.UNRESPONSIVE;
    }
    return State.UNKNOWN;
}
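The early-return classification suggested above can be tried in isolation. This is only a minimal sketch: the `classify` helper and the `HostStateSketch` class are hypothetical, and the `failed` and `lastException` parameters stand in for `task.isFailed()` and `task.getStats().getLastFailureException()`.

```java
import java.io.EOFException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

public class HostStateSketch
{
    // Same values as the enum under review; the ordering here is arbitrary.
    enum State { UNKNOWN, ALIVE, GONE, UNRESPONSIVE }

    // Hypothetical helper, not part of the PR: maps the last heartbeat failure to a state.
    static State classify(boolean failed, Exception lastException)
    {
        if (!failed) {
            return State.ALIVE;
        }
        if (lastException instanceof SocketTimeoutException || lastException instanceof UnknownHostException) {
            return State.GONE;
        }
        if (lastException instanceof ConnectException) {
            return State.UNRESPONSIVE;
        }
        // Anything else (for example an EOFException) is not classified.
        return State.UNKNOWN;
    }

    public static void main(String[] args)
    {
        System.out.println(classify(false, null));
        System.out.println(classify(true, new SocketTimeoutException()));
        System.out.println(classify(true, new ConnectException()));
        System.out.println(classify(true, new EOFException()));
    }
}
```

Each failed case exits through exactly one return, so the UNKNOWN fallthrough is easy to spot.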

raghavsethi · 2017-04-10T20:20:46Z

presto-main/src/main/java/com/facebook/presto/server/ServerMainModule.java

+                @Override
+                public State getState(HostAddress hostAddress)
+                {
+                    return State.UNKNOWN;


So the coordinator is always in 'unknown' state? Seems sketchy.

This branch should only be triggered when it's NOT on the coordinator, right? (Since coordinator is false.) So my understanding is that this is on the worker, and there is no failure detector on the worker, so it looks to me like UNKNOWN is the only answer :(

Add a comment saying that this is because failure detectors are not available on workers.
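A rough sketch of what the requested comment could look like on the worker-side binding. The interface below is a simplified, hypothetical stand-in (a `String` replaces `HostAddress`, and only `getState` is modelled), so it illustrates the shape rather than the actual ServerMainModule code.

```java
public class WorkerFailureDetectorSketch
{
    enum State { UNKNOWN, ALIVE, GONE, UNRESPONSIVE }

    // Simplified stand-in for FailureDetector; the real getState takes a HostAddress.
    interface FailureDetector
    {
        State getState(String hostAddress);
    }

    // Failure detectors are not available on workers, so a worker has no
    // heartbeat data about other nodes and can only report UNKNOWN.
    static final FailureDetector WORKER_DETECTOR = hostAddress -> State.UNKNOWN;

    public static void main(String[] args)
    {
        System.out.println(WORKER_DETECTOR.getState("worker-1:8080"));
    }
}
```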

wenleix · 2017-04-18T03:49:07Z

Addressed the comments and rebased. @raghavsethi would you like to take another look? Thank you !! :)

raghavsethi · 2017-04-18T15:44:47Z

presto-main/src/main/java/com/facebook/presto/failureDetector/HeartbeatFailureDetector.java

+    public State getState(HostAddress hostAddress)
+    {
+        for (MonitoringTask task : tasks.values()) {
+            if (hostAddress.equals(HostAddress.fromUri(task.uri))) {


Static import

raghavsethi · 2017-04-18T15:45:43Z

presto-main/src/main/java/com/facebook/presto/server/ServerMainModule.java

+                @Override
+                public State getState(HostAddress hostAddress)
+                {
+                    return State.UNKNOWN;


Add a comment saying that this is because failure detectors are not available on workers.

raghavsethi · 2017-04-18T15:46:33Z

presto-main/src/main/java/com/facebook/presto/execution/ExecutionFailureInfo.java

@@ -52,7 +54,8 @@ public ExecutionFailureInfo(
            @JsonProperty("suppressed") List<ExecutionFailureInfo> suppressed,
            @JsonProperty("stack") List<String> stack,
            @JsonProperty("errorLocation") @Nullable ErrorLocation errorLocation,
-            @JsonProperty("errorCode") @Nullable ErrorCode errorCode)
+            @JsonProperty("errorCode") @Nullable ErrorCode errorCode,
+            @JsonProperty("destinationHost") @Nullable HostAddress destinationHost)


I would call this remoteHost, and add a comment saying that it is populated for comms failures.

raghavsethi · 2017-04-18T16:09:04Z

presto-main/src/main/java/com/facebook/presto/operator/HttpPageBufferClient.java

@@ -380,7 +380,7 @@ public void onFailure(Throwable t)
                            uri,
                            backoff.getFailureCount(),
                            backoff.getTimeSinceLastSuccess().convertTo(SECONDS));
-                    t = new PageTransportTimeoutException(message, t);
+                    t = new PageTransportTimeoutException(HostAddress.fromUri(uri), message, t);


Static import

raghavsethi · 2017-04-18T16:11:37Z

presto-main/src/main/java/com/facebook/presto/execution/ExecutionFailureInfo.java

+    public RuntimeException toException(FailureDetector failureDetector)
+    {
+        if (getDestinationHost() != null &&
+                failureDetector.getState(getDestinationHost()) == FailureDetector.State.GONE) {


I thought we agreed this should go somewhere else?

Thank you ! I think moving to SqlStageExecution looks better! :)

raghavsethi · 2017-04-18T16:12:34Z

presto-spi/src/main/java/com/facebook/presto/spi/StandardErrorCode.java

@@ -79,6 +79,7 @@
    CORRUPT_PAGE(0x0001_0013, INTERNAL_ERROR),
    OPTIMIZER_TIMEOUT(0x0001_0014, INTERNAL_ERROR),
    OUT_OF_SPILL_SPACE(0x0001_0015, INTERNAL_ERROR),
+    HOST_UNREACHABLE(0x0001_0016, INTERNAL_ERROR),


I don't like that GONE maps to HOST_UNREACHABLE. This should be consistent.

wenleix · 2017-04-19T17:53:53Z

@raghavsethi Let me know if there are any further thoughts ! :)

raghavsethi

Looks good except minor comments. I'll have another look before you merge.

raghavsethi · 2017-04-19T18:19:02Z

presto-main/src/main/java/com/facebook/presto/failureDetector/FailureDetector.java

@@ -13,11 +13,22 @@
 */
 package com.facebook.presto.failureDetector;

+import com.facebook.presto.spi.HostAddress;


FailureDetector now attempts to determine the state of a node based on the last exception encountered.

raghavsethi · 2017-04-19T18:22:27Z

presto-main/src/main/java/com/facebook/presto/failureDetector/FailureDetector.java

+    {
+        ALIVE,
+        UNKNOWN,
+        GONE,


Reorder to: ALIVE, UNRESPONSIVE, GONE, UNKNOWN

raghavsethi · 2017-04-19T18:23:59Z

presto-main/src/main/java/com/facebook/presto/failureDetector/HeartbeatFailureDetector.java

+        for (MonitoringTask task : tasks.values()) {
+            if (hostAddress.equals(fromUri(task.uri))) {
+                if (!task.isFailed()) {
+                    return State.ALIVE;


Static import all of these enum values

@raghavsethi : The reason for the ordering is that ALIVE and UNKNOWN are the two states that will always exist, and we can add more states as we identify more reasons :)

OK then, UNKNOWN, ALIVE, GONE, UNRESPONSIVE

raghavsethi · 2017-04-19T18:26:52Z

presto-main/src/main/java/com/facebook/presto/failureDetector/HeartbeatFailureDetector.java

+                }
+            }
+        }
+        return State.UNKNOWN;


Add newline before this

@raghavsethi return vs. if-else is an interesting question. Looks like nowadays people prefer early return :): https://www.quora.com/Which-one-is-better-using-if-else-or-return

In this case, the first if branch handles the case where the host is not failing, and the else branch is all about the different kinds of failure :). I don't have a strong opinion either way, though :)

I agree that it's debatable in other scenarios, but in this one early-return makes it strictly easier to reason about code paths.

raghavsethi · 2017-04-19T18:27:55Z

presto-main/src/main/java/com/facebook/presto/failureDetector/HeartbeatFailureDetector.java

+                        return State.UNRESPONSIVE;
+                    }
+                    else {
+                        return State.UNKNOWN;


if (hostAddress.equals(fromUri(task.uri))) {
    if (!task.isFailed()) {
        return State.ALIVE;
    }
    Exception lastException = task.getStats().getLastFailureException();
    if (lastException instanceof SocketTimeoutException || lastException instanceof UnknownHostException) {
        return State.GONE;
    }
    if (lastException instanceof ConnectException) {
        return State.UNRESPONSIVE;
    }
    return State.UNKNOWN;
}

raghavsethi · 2017-04-19T18:30:25Z

presto-main/src/main/java/com/facebook/presto/server/remotetask/RequestErrorTracker.java

@@ -122,7 +124,8 @@ public void requestFailed(Throwable reason)
        // fail the task, if we have more than X failures in a row and more than Y seconds have passed since the last request
        if (backoff.failure()) {
            // it is weird to mark the task failed locally and then cancel the remote task, but there is no way to tell a remote task that it is failed
-            PrestoException exception = new PrestoException(TOO_MANY_REQUESTS_FAILED,
+            PrestoException exception = new PrestoTransportException(TOO_MANY_REQUESTS_FAILED,
+                    HostAddress.fromUri(taskUri),


static import

raghavsethi · 2017-04-19T18:30:57Z

presto-main/src/main/java/com/facebook/presto/util/Failures.java

@@ -55,21 +57,26 @@ public static ExecutionFailureInfo toFailure(Throwable failure)
        }
        // todo prevent looping with suppressed cause loops and such
        String type;
+        HostAddress destinationHost = null;


null is the default value - you don't need to assign it.

@raghavsethi : For a class field, yes. For a local variable, it doesn't seem so. (The compiler will complain "Variable xxx might not have been initialized" :) )
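The difference between the two cases is Java's definite-assignment rule. A small illustration follows; the class, the field, and the `worker-1:8080` value are made up for the example and are not taken from Failures.java.

```java
public class DefaultValueSketch
{
    // A field is implicitly initialized; a reference field starts out as null.
    static String fieldHost;

    static String pickRemoteHost(boolean transportFailure)
    {
        // A local variable has no default value. Without "= null" the compiler
        // rejects the return below, because the if does not assign on every path.
        String remoteHost = null;
        if (transportFailure) {
            remoteHost = "worker-1:8080";
        }
        return remoteHost;
    }

    public static void main(String[] args)
    {
        System.out.println(fieldHost);
        System.out.println(pickRemoteHost(false));
        System.out.println(pickRemoteHost(true));
    }
}
```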

Anyway, rename it to remoteHost :)

raghavsethi · 2017-04-19T18:36:00Z

presto-main/src/main/java/com/facebook/presto/execution/SqlStageExecution.java

@@ -418,5 +425,25 @@ private synchronized void updateMemoryUsage(TaskStatus taskStatus)
            previousMemory = currentMemory;
            stateMachine.updateMemoryUsage(deltaMemoryInBytes);
        }
+
+        private ExecutionFailureInfo identifyHostGone(ExecutionFailureInfo executionFailureInfo)


private ExecutionFailureInfo checkHostStatus(ExecutionFailureInfo executionFailureInfo)

@raghavsethi I think the most accurate names are

checkHostStatusToRewriteFailureInfo

rewriteFailureInfoBasedOnHostStatus

... So I think the best decision is to use checkHostStatus

I see your point. Maybe rewriteTransportFailure captures both things?
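For readers following the naming discussion, here is a hedged sketch of what a method called rewriteTransportFailure might do, based only on the check shown earlier in this review (remote host present and detector state GONE). `FailureInfo` is a hypothetical, stripped-down stand-in for ExecutionFailureInfo, error codes are plain strings, and the host is a `String` rather than a `HostAddress`.

```java
public class RewriteTransportFailureSketch
{
    enum State { UNKNOWN, ALIVE, GONE, UNRESPONSIVE }

    // Simplified stand-in for FailureDetector.
    interface FailureDetector
    {
        State getState(String remoteHost);
    }

    // Hypothetical stand-in for ExecutionFailureInfo.
    static final class FailureInfo
    {
        final String errorCode;
        final String remoteHost; // null unless this is a communication failure

        FailureInfo(String errorCode, String remoteHost)
        {
            this.errorCode = errorCode;
            this.remoteHost = remoteHost;
        }
    }

    // Rewrites the error code only when the detector already knows the remote host is gone.
    static FailureInfo rewriteTransportFailure(FailureDetector detector, FailureInfo failure)
    {
        if (failure.remoteHost == null || detector.getState(failure.remoteHost) != State.GONE) {
            return failure;
        }
        return new FailureInfo("REMOTE_HOST_GONE", failure.remoteHost);
    }

    public static void main(String[] args)
    {
        FailureDetector detector = host -> host.equals("dead:8080") ? State.GONE : State.ALIVE;
        System.out.println(rewriteTransportFailure(detector, new FailureInfo("TOO_MANY_REQUESTS_FAILED", "dead:8080")).errorCode);
        System.out.println(rewriteTransportFailure(detector, new FailureInfo("TOO_MANY_REQUESTS_FAILED", "live:8080")).errorCode);
    }
}
```

Failures without a remote host, or against a host that is not known to be gone, pass through unchanged, which is why a name mentioning only the host check undersells the rewrite.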

raghavsethi · 2017-04-19T18:37:09Z

presto-spi/src/main/java/com/facebook/presto/spi/StandardErrorCode.java

@@ -79,6 +79,7 @@
    CORRUPT_PAGE(0x0001_0013, INTERNAL_ERROR),
    OPTIMIZER_TIMEOUT(0x0001_0014, INTERNAL_ERROR),
    OUT_OF_SPILL_SPACE(0x0001_0015, INTERNAL_ERROR),
+    HOST_GONE(0x0001_0016, INTERNAL_ERROR),


REMOTE_HOST_GONE is a little more descriptive. @martint this works for you?

wenleix · 2017-04-21T19:03:30Z

Revised, and Jenkins has passed :) @raghavsethi

raghavsethi

Looks good, except that minor comment.

raghavsethi · 2017-04-21T19:28:14Z

presto-main/src/main/java/com/facebook/presto/operator/HttpPageBufferClient.java

@@ -322,7 +322,7 @@ public void onSuccess(PagesResponse result)

                        if (!isNullOrEmpty(taskInstanceId) && !result.getTaskInstanceId().equals(taskInstanceId)) {
                            // TODO: update error message
-                            throw new PrestoException(REMOTE_TASK_MISMATCH, format("%s (%s)", REMOTE_TASK_MISMATCH_ERROR, HostAddress.fromUri(uri)));
+                            throw new PrestoException(REMOTE_TASK_MISMATCH, format("%s (%s)", REMOTE_TASK_MISMATCH_ERROR, fromUri(uri)));


Sorry I missed this earlier - but this is not an indication that stuff is currently broken on the remote node.

@raghavsethi As we discussed, this is not a PrestoTransportException. The URI is used for the exception message, and the line changed only because we refactored to a static import of fromUri :)

facebook-github-bot added the CLA Signed label Mar 28, 2017

wenleix force-pushed the die_node branch 5 times, most recently from 056e382 to 0e6a15b Compare April 5, 2017 06:15

wenleix force-pushed the die_node branch from 0e6a15b to f7b4ecc Compare April 5, 2017 22:44

wenleix assigned raghavsethi Apr 6, 2017

wenleix changed the title ~~[WIP] Identify HOST_UNREACHABLE error by checking FailureDetector~~ Identify HOST_UNREACHABLE error by checking FailureDetector Apr 6, 2017

wenleix force-pushed the die_node branch from f7b4ecc to ec269ba Compare April 7, 2017 04:02

raghavsethi suggested changes Apr 10, 2017


wenleix force-pushed the die_node branch 2 times, most recently from a689f0e to de0f00a Compare April 17, 2017 19:04

raghavsethi suggested changes Apr 18, 2017


wenleix force-pushed the die_node branch from de0f00a to aafcfd3 Compare April 18, 2017 18:12

raghavsethi suggested changes Apr 19, 2017


raghavsethi assigned wenleix and unassigned raghavsethi Apr 19, 2017

raghavsethi added the changes-requested label Apr 19, 2017

wenleix changed the title ~~Identify HOST_UNREACHABLE error by checking FailureDetector~~ Identify REMOTE_HOST_GONE error by checking FailureDetector Apr 20, 2017

wenleix force-pushed the die_node branch 2 times, most recently from 7822a15 to 733cf28 Compare April 20, 2017 05:06

wenleix changed the title ~~Identify REMOTE_HOST_GONE error by checking FailureDetector~~ Identify REMOTE_HOST_GONE for transport failure Apr 20, 2017

wenleix force-pushed the die_node branch 2 times, most recently from e455ced to d7991e5 Compare April 21, 2017 17:22

wenleix assigned raghavsethi and unassigned wenleix Apr 21, 2017

raghavsethi approved these changes Apr 21, 2017


raghavsethi assigned wenleix and unassigned raghavsethi Apr 24, 2017

wenleix added ready-to-merge and removed changes-requested labels Apr 26, 2017

wenleix force-pushed the die_node branch 3 times, most recently from b19f311 to 7809b60 Compare April 27, 2017 04:31

wenleix added 3 commits April 26, 2017 22:27

Add getState() to FailureDetector

7feed9f

FailureDetector now attempts to determine the state of a node based on the last exception encountered.

Add remote host info to ExecutionFailureInfo

bb841ab

The remote host will also be reported when a transport error occurs. This information helps us give a more precise error code if the failure detector already knows the target node is dead.

Identify REMOTE_HOST_GONE for transport failure

4b35396

wenleix force-pushed the die_node branch from 7809b60 to 4b35396 Compare April 27, 2017 05:28

wenleix merged commit 4b35396 into prestodb:master Apr 27, 2017

wenleix deleted the die_node branch November 18, 2017 05:56

sanchitkashyap mentioned this pull request Jul 22, 2020

Add remote Host info to PageTransportErrorException trinodb/trino#4511

Merged
