
[SPARK-4253]Ignore spark.driver.host in yarn-cluster and standalone-cluster mode #3112

Closed
wants to merge 6 commits

Conversation

WangTaoTheTonic
Contributor

Also, I see the description in the configuration docs is "This is used for communicating with the executors and the standalone Master." Should we update it in this patch as well?

@SparkQA

SparkQA commented Nov 5, 2014

Test build #22939 has started for PR 3112 at commit 2286e6b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 5, 2014

Test build #22939 has finished for PR 3112 at commit 2286e6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22939/

@tgravescs
Contributor

Does this only occur if the user manually sets spark.driver.host?

@WangTaoTheTonic
Contributor Author

Yep. Since setting this config item is fine in yarn-client mode, and switching between yarn-client and yarn-cluster mode should stay convenient, I think it would be better to ignore it in cluster mode.

@tgravescs
Contributor

I'm just curious: what is the reason to set it in client mode?

Note that in yarn-client mode the ApplicationMaster handles setting this, so I'm wondering if there is some other use case where you need to set it manually.

@WangTaoTheTonic
Contributor Author

If we don't set it, the YARN cluster would probably not recognize the driver's host (the default value of spark.driver.host), because the nodes in the YARN cluster may not be able to resolve that hostname. Setting it to the driver's IP address helps.

@tgravescs
Contributor

Sorry, perhaps my question was misleading. Are you, as a user, setting spark.driver.host for some reason? The user doesn't need to set it in YARN mode (either client or cluster): SparkContext sets it to the local hostname for you, and that gets propagated to the executors.

I'm not at all saying this isn't a good change to protect users from doing bad things; I'm just wondering why it was set in the first place. Perhaps there is some use case I'm not aware of.
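
As a rough sketch of the default behavior described above (illustrative only, not the actual SparkContext code; the InetAddress call stands in for Spark's internal Utils.localHostName()):

import java.net.InetAddress
import org.apache.spark.SparkConf

// Sketch: if the user has not set spark.driver.host, fall back to the submitting
// machine's hostname, which is the address the executors will later try to reach.
val conf = new SparkConf()
conf.setIfMissing("spark.driver.host", InetAddress.getLocalHost.getHostName)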

@WangTaoTheTonic
Contributor Author

Here is the scenario: a client submits an application (in yarn-client mode) from a node that is not part of the YARN cluster. The ApplicationMaster may then fail to resolve the client's hostname, like this:

14/11/07 11:11:02 INFO ApplicationMaster: Waiting for Spark driver to be reachable.
14/11/07 11:11:02 ERROR ApplicationMaster: Failed to connect to driver at datasight1:10084, retrying ...
14/11/07 11:11:02 ERROR ApplicationMaster: Failed to connect to driver at datasight1:10084, retrying ...
14/11/07 11:11:02 ERROR ApplicationMaster: Failed to connect to driver at datasight1:10084, retrying ...

I think it happens when the ApplicationMaster cannot resolve the client's hostname (because the client's hostname-IP pair was not added to /etc/hosts or the DNS server).

If we set spark.driver.host, the AM will use the configured value (for instance, the client's IP address) instead of the default value (Utils.localHostName()), and the error will not occur.

That is to say, in yarn-client mode this configuration actually is useful sometimes.
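
As a concrete illustration of the yarn-client workaround described above (a sketch only; the IP address is just the example value from the test results later in this thread, and should be whatever address the cluster nodes can actually route to):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: in yarn-client mode the driver runs on the submitting machine, so point
// spark.driver.host at an address the YARN nodes can reach instead of relying on the
// client's (possibly unresolvable) hostname.
val conf = new SparkConf()
  .setAppName("driver-host-example")
  .set("spark.driver.host", "9.91.8.140") // example: the client's routable IP address

val sc = new SparkContext(conf)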

@WangTaoTheTonic
Contributor Author

Additionally, I think spark.driver.host is useful in all client modes, including standalone, Mesos (which I don't know very well), and yarn-client. When the cluster cannot resolve the client's hostname, we must set this configuration to the client's IP address to avoid failing to connect to the driver.

If I understood it wrong, please correct me.

@tgravescs
Contributor

Ok, I understand the use case for it in yarn client mode.

@JoshRosen
Contributor

I don't think that this problem is limited to YARN; I think I've also seen it when submitting jobs to a standalone cluster using spark-submit's cluster deploy mode. If a user has set spark.driver.host in their default configuration file and submits the driver to a remote machine for which this setting is invalid, then I think they'll still run into this error.

When using a cluster deploy mode, whether through YARN or Spark's standalone cluster manager, I think we should completely ignore any values of spark.driver.host that might have been inherited from the submitting machine.

@WangTaoTheTonic
Contributor Author

@JoshRosen Sorry, I had only checked the 1.1.0 version of the docs (where only client mode is supported for standalone), not the latest documentation. Now I ignore spark.driver.host in both YARN and standalone cluster mode. Please review.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23198 has started for PR 3112 at commit ff8d5f7.

  • This patch merges cleanly.

@WangTaoTheTonic changed the title from "[SPARK-4253]Ignore spark.driver.host in yarn-cluster mode" to "[SPARK-4253]Ignore spark.driver.host in yarn-cluster and standalone-cluster mode" on Nov 11, 2014
@SparkQA

SparkQA commented Nov 11, 2014

Test build #23200 has started for PR 3112 at commit 667cf24.

  • This patch merges cleanly.


// Set the web ui port to be ephemeral so we don't conflict with other spark processes
// running on the same box
conf.set("spark.ui.port", "0")

Why do we need this change? We already have logic that tries to bind to a higher-numbered port if the desired port is unavailable.

@JoshRosen
Contributor

In Spark standalone cluster deploy mode, I think that master will still point to a regular standalone master URL (e.g. spark://my-master-hostname), but the driver will be launched inside of a worker. To fix this case, I think we'd have to add code on the submitter side to strip out / ignore the spark.driver.host property so the submitter's value isn't used on the remote machine that hosts the driver.

If we do the ignoring inside of SparkContext itself, then this might cause problems later down the road if the remote container wants to inject some machine-specific values for these settings. Instead, I think that the right way to fix this is to stop the inappropriate sending of machine-local properties to remote machines. Essentially, I think this fix should go into SparkSubmit rather than SparkContext.
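
A minimal sketch of what such a SparkSubmit-side fix could look like (illustrative only, not the actual patch; sanitizeForClusterDeploy is a hypothetical helper name):

// Hypothetical helper: before forwarding properties for a cluster-deploy-mode
// submission, drop spark.driver.host so the submitting machine's value never
// reaches the remote driver.
def sanitizeForClusterDeploy(props: Map[String, String], deployMode: String): Map[String, String] = {
  if (deployMode == "cluster") props - "spark.driver.host" else props
}

// Example: the inherited host "datasight1" is removed for a cluster-mode submission.
val submitted = sanitizeForClusterDeploy(
  Map("spark.driver.host" -> "datasight1", "spark.app.name" -> "demo"),
  deployMode = "cluster")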

@WangTaoTheTonic
Contributor Author

Regarding "then this might cause problems later down the road if the remote container wants to inject some machine-specific values for these settings": like what? We do not just ignore it; we replace its value with the host name of the remote machine that launches the driver.

Regarding "In Spark Standalone cluster deploy mode, I think that master will still point to a regular standalone master URL (e.g. spark://my-master-hostname)": I don't understand what master represents here, the master URL or the value of spark.driver.host?

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23198 has finished for PR 3112 at commit ff8d5f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23198/

@WangTaoTheTonic
Contributor Author

Ohh, I got what you mean. It is indeed inappropriate to set spark.master to standalone-master, as we still need the master URL.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23200 has finished for PR 3112 at commit 667cf24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23200/

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23213 has started for PR 3112 at commit 32a3f3f.

  • This patch merges cleanly.

@WangTaoTheTonic
Contributor Author

Now we ignore it in SparkSubmit.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23213 has finished for PR 3112 at commit 32a3f3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23213/

@WangTaoTheTonic
Contributor Author

@JoshRosen Is it ok?

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23313 has started for PR 3112 at commit 02c4e49.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23313 has finished for PR 3112 at commit 02c4e49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23313/

@JoshRosen
Contributor

I have an integration test framework that has a reproduction for the case where I originally observed this problem, so I'd like to try to run your PR there to verify that it fixes the issue. Super minor nit, but do you mind reverting those unrelated formatting changes?

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23347 has started for PR 3112 at commit ed1a25c.

  • This patch merges cleanly.

@WangTaoTheTonic
Contributor Author

@JoshRosen I have reverted the stray dot, which I think was introduced while modifying the comments. The space between ! and args.isEmpty in ApplicationMasterArguments is unnecessary, so I kept that change.

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23347 has finished for PR 3112 at commit ed1a25c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23347/

@JoshRosen
Contributor

I haven't forgotten about reviewing this; while attempting to run my integration test to see whether this fixed my bug, I uncovered a potential regression (unrelated to this patch) in the master branch (and the current 1.1.1 RC): https://issues.apache.org/jira/browse/SPARK-4434

Once we fix that issue, I'd like to try re-running my test.

@WangTaoTheTonic
Contributor Author

@JoshRosen I have tested this on my cluster in YARN mode; here are my results:

Before this patch:

| configure spark.driver.host | client IP-hostname pair in /etc/hosts on cluster nodes | yarn-client result | yarn-cluster result |
| --- | --- | --- | --- |
| Yes | Yes | OK | Exception in thread "Driver" org.jboss.netty.channel.ChannelException: Failed to bind to: /9.91.8.140:23283 |
| Yes | No | OK | Exception in thread "Driver" org.jboss.netty.channel.ChannelException: Failed to bind to: /9.91.8.140:23283 |
| No | Yes | OK | OK |
| No | No | Failed to connect to driver at datasight1:23520, retrying … | OK |

After this patch:

| configure spark.driver.host | client IP-hostname pair in /etc/hosts on cluster nodes | yarn-client result | yarn-cluster result |
| --- | --- | --- | --- |
| Yes | Yes | OK | OK |
| Yes | No | OK | OK |
| No | Yes | OK | OK |
| No | No | Failed to connect to driver at datasight1:23520, retrying … | OK |

I haven't tested standalone mode, but I think we will get the same behavior.

asfgit pushed a commit that referenced this pull request Dec 4, 2014
…cluster modes

In yarn-cluster and standalone-cluster modes, we don't know where driver will run until it is launched.  If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf then this will lead to errors when the driver launches.

This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:

ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
02c4e49 [WangTao] add comment
32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext
667cf24 [WangTaoTheTonic] document fix
ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode

(cherry picked from commit 8106b1e)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
asfgit closed this in 8106b1e on Dec 4, 2014
@JoshRosen
Contributor

Thanks for testing this out! I also tested it with my own integration test and it now passes, so this looks good to me. I'm going to merge this into master and branch-1.2. I'll edit the commit message to reflect the bug description from JIRA.

@JoshRosen
Contributor

I've also backported this to branch-1.1.

asfgit pushed a commit that referenced this pull request Dec 4, 2014
…cluster modes

In yarn-cluster and standalone-cluster modes, we don't know where driver will run until it is launched.  If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf then this will lead to errors when the driver launches.

This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:

ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
02c4e49 [WangTao] add comment
32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext
667cf24 [WangTaoTheTonic] document fix
ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode

(cherry picked from commit 8106b1e)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala