
[SPARK-4253]Ignore spark.driver.host in yarn-cluster and standalone-cluster mode #3112

Closed
wants to merge 6 commits

Conversation

WangTaoTheTonic
Contributor

Also, I see the description in the configuration docs is "This is used for communicating with the executors and the standalone Master." Should we update it in this patch as well?

@SparkQA

SparkQA commented Nov 5, 2014

Test build #22939 has started for PR 3112 at commit 2286e6b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 5, 2014

Test build #22939 has finished for PR 3112 at commit 2286e6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22939/

@tgravescs
Contributor

Does this only occur if the user manually sets spark.driver.host?

@WangTaoTheTonic
Contributor Author

Yep. Since setting this config item is fine in yarn-client mode, and switching between yarn-client and yarn-cluster mode should stay convenient, I think it would be better to ignore it in cluster mode.

@tgravescs
Contributor

I'm just curious: what is the reason to set it in client mode?

Note that in yarn-client mode the ApplicationMaster handles setting this, so I'm wondering if there is some other use case where you need to set it manually.

@WangTaoTheTonic
Contributor Author

If we don't set it, the YARN cluster would probably not recognize the driver's host (the default value of spark.driver.host), because the nodes in the YARN cluster may not be able to resolve that hostname. Setting it to the driver's IP address helps.

@tgravescs
Contributor

Sorry, perhaps my question was misleading. Are you, as a user, setting spark.driver.host for some reason? The user doesn't need to set it in YARN mode (either client or cluster): SparkContext sets it to the local hostname for you, and that gets propagated to the executors.

I'm not at all saying this isn't a good change to protect users from doing bad things; I'm just wondering why it was set in the first place. Perhaps there is some use case I'm not aware of.
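
As a rough sketch of the default behavior described above (illustrative only, not the actual SparkContext code; the InetAddress call stands in for Spark's internal Utils.localHostName()):

import java.net.InetAddress
import org.apache.spark.SparkConf

// Sketch: if the user has not set spark.driver.host, fall back to the submitting
// machine's hostname, which is the address the executors will later try to reach.
val conf = new SparkConf()
conf.setIfMissing("spark.driver.host", InetAddress.getLocalHost.getHostName)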

@WangTaoTheTonic
Contributor Author

Here is the scenario: a client submits an application (in yarn-client mode) from a node that is not part of the YARN cluster. The ApplicationMaster may then fail to resolve the client's hostname, like this:

14/11/07 11:11:02 INFO ApplicationMaster: Waiting for Spark driver to be reachable.
14/11/07 11:11:02 ERROR ApplicationMaster: Failed to connect to driver at datasight1:10084, retrying ...
14/11/07 11:11:02 ERROR ApplicationMaster: Failed to connect to driver at datasight1:10084, retrying ...
14/11/07 11:11:02 ERROR ApplicationMaster: Failed to connect to driver at datasight1:10084, retrying ...

I think it happens when the ApplicationMaster cannot resolve the client's hostname (because the client's hostname-IP pair was not added to /etc/hosts or the DNS server).

If we set spark.driver.host, the AM will use the configured value (for instance, the client's IP address) instead of the default value (Utils.localHostName()), and the error will not occur.

That is to say, in yarn-client mode this configuration actually is useful sometimes.
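
As a concrete illustration of the yarn-client workaround described above (a sketch only; the IP address is just the example value from the test results later in this thread, and should be whatever address the cluster nodes can actually route to):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: in yarn-client mode the driver runs on the submitting machine, so point
// spark.driver.host at an address the YARN nodes can reach instead of relying on the
// client's (possibly unresolvable) hostname.
val conf = new SparkConf()
  .setAppName("driver-host-example")
  .set("spark.driver.host", "9.91.8.140") // example: the client's routable IP address

val sc = new SparkContext(conf)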

@WangTaoTheTonic
Contributor Author

Additionally, I think spark.driver.host is useful in all client modes, including standalone, Mesos (which I don't know very well), and yarn-client. When the cluster cannot resolve the client's hostname, we must set this configuration to the client's IP address to avoid failing to connect to the driver.

If I understood it wrong, please correct me.

@tgravescs
Contributor

Ok, I understand the use case for it in yarn client mode.

@JoshRosen
Contributor

I don't think that this problem is limited to YARN; I think I've also seen it when submitting jobs to a standalone cluster using spark-submit's cluster deploy mode. If a user has set spark.driver.host in their default configuration file and submits the driver to a remote machine for which this setting is invalid, then I think they'll still run into this error.

When using a cluster deploy mode, whether through YARN or Spark's standalone cluster manager, I think we should completely ignore any values of spark.driver.host that might have been inherited from the submitting machine.

@WangTaoTheTonic
Contributor Author

@JoshRosen Sorry, I had only checked the 1.1.0 version of the docs (where only client mode is supported for standalone), not the latest documentation. Now I ignore spark.driver.host in both YARN and standalone cluster mode. Please review.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23198 has started for PR 3112 at commit ff8d5f7.

  • This patch merges cleanly.

@WangTaoTheTonic changed the title from "[SPARK-4253]Ignore spark.driver.host in yarn-cluster mode" to "[SPARK-4253]Ignore spark.driver.host in yarn-cluster and standalone-cluster mode" on Nov 11, 2014
@SparkQA

SparkQA commented Nov 11, 2014

Test build #23200 has started for PR 3112 at commit 667cf24.

  • This patch merges cleanly.


// Set the web ui port to be ephemeral so we don't conflict with other spark processes
// running on the same box
conf.set("spark.ui.port", "0")

Why do we need this change? We already have logic that tries to bind to a higher-numbered port if the desired port is unavailable.

@JoshRosen
Contributor

In Spark standalone cluster deploy mode, I think that master will still point to a regular standalone master URL (e.g. spark://my-master-hostname), but the driver will be launched inside of a worker. To fix this case, I think we'd have to add code on the submitter side to strip out / ignore the spark.driver.host property so the submitter's value isn't used on the remote machine that hosts the driver.

If we do the ignoring inside of SparkContext itself, then this might cause problems later down the road if the remote container wants to inject some machine-specific values for these settings. Instead, I think that the right way to fix this is to stop the inappropriate sending of machine-local properties to remote machines. Essentially, I think this fix should go into SparkSubmit rather than SparkContext.
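
A minimal sketch of what such a SparkSubmit-side fix could look like (illustrative only, not the actual patch; sanitizeForClusterDeploy is a hypothetical helper name):

// Hypothetical helper: before forwarding properties for a cluster-deploy-mode
// submission, drop spark.driver.host so the submitting machine's value never
// reaches the remote driver.
def sanitizeForClusterDeploy(props: Map[String, String], deployMode: String): Map[String, String] = {
  if (deployMode == "cluster") props - "spark.driver.host" else props
}

// Example: the inherited host "datasight1" is removed for a cluster-mode submission.
val submitted = sanitizeForClusterDeploy(
  Map("spark.driver.host" -> "datasight1", "spark.app.name" -> "demo"),
  deployMode = "cluster")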

@WangTaoTheTonic
Contributor Author

Regarding "then this might cause problems later down the road if the remote container wants to inject some machine-specific values for these settings": like what? We do not just ignore it; we replace its value with the host name of the remote machine that launches the driver.

Regarding "In Spark Standalone cluster deploy mode, I think that master will still point to a regular standalone master URL (e.g. spark://my-master-hostname)": I don't understand what master represents here, the master URL or the value of spark.driver.host?

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23198 has finished for PR 3112 at commit ff8d5f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23198/

@WangTaoTheTonic
Contributor Author

Ohh, I got what you mean. It is indeed inappropriate to set spark.master to standalone-master, as we still need the master URL.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23200 has finished for PR 3112 at commit 667cf24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23200/

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23213 has started for PR 3112 at commit 32a3f3f.

  • This patch merges cleanly.

@WangTaoTheTonic
Contributor Author

Now we ignore it in SparkSubmit.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23213 has finished for PR 3112 at commit 32a3f3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23213/

@WangTaoTheTonic
Contributor Author

@JoshRosen Is it ok?

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23313 has started for PR 3112 at commit 02c4e49.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23313 has finished for PR 3112 at commit 02c4e49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23313/

@JoshRosen
Contributor

I have an integration test framework that has a reproduction for the case where I originally observed this problem, so I'd like to try to run your PR there to verify that it fixes the issue. Super minor nit, but do you mind reverting those unrelated formatting changes?

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23347 has started for PR 3112 at commit ed1a25c.

  • This patch merges cleanly.

@WangTaoTheTonic
Contributor Author

@JoshRosen I have reverted the stray dot, which I think was introduced while modifying the comments. The space between ! and args.isEmpty in ApplicationMasterArguments is unnecessary, so I kept that change.

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23347 has finished for PR 3112 at commit ed1a25c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23347/

@JoshRosen
Contributor

I haven't forgotten about reviewing this; while attempting to run my integration test to see whether this fixed my bug, I uncovered a potential regression (unrelated to this patch) in the master branch (and the current 1.1.1 RC): https://issues.apache.org/jira/browse/SPARK-4434

Once we fix that issue, I'd like to try re-running my test.

@WangTaoTheTonic
Contributor Author

@JoshRosen I have tested this on my cluster in YARN mode; here are my results:

Before this patch:

| configure spark.driver.host | client IP-hostname pair in /etc/hosts on cluster nodes | yarn-client result | yarn-cluster result |
| --- | --- | --- | --- |
| Yes | Yes | OK | Exception in thread "Driver" org.jboss.netty.channel.ChannelException: Failed to bind to: /9.91.8.140:23283 |
| Yes | No | OK | Exception in thread "Driver" org.jboss.netty.channel.ChannelException: Failed to bind to: /9.91.8.140:23283 |
| No | Yes | OK | OK |
| No | No | Failed to connect to driver at datasight1:23520, retrying … | OK |

After this patch:

| configure spark.driver.host | client IP-hostname pair in /etc/hosts on cluster nodes | yarn-client result | yarn-cluster result |
| --- | --- | --- | --- |
| Yes | Yes | OK | OK |
| Yes | No | OK | OK |
| No | Yes | OK | OK |
| No | No | Failed to connect to driver at datasight1:23520, retrying … | OK |

I haven't tested standalone mode, but I think we will get the same behavior.

asfgit pushed a commit that referenced this pull request Dec 4, 2014
…cluster modes

In yarn-cluster and standalone-cluster modes, we don't know where driver will run until it is launched.  If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf then this will lead to errors when the driver launches.

This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:

ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
02c4e49 [WangTao] add comment
32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext
667cf24 [WangTaoTheTonic] document fix
ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode

(cherry picked from commit 8106b1e)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
asfgit closed this in 8106b1e on Dec 4, 2014
@JoshRosen
Contributor

Thanks for testing this out! I also tested it with my own integration test and it now passes, so this looks good to me. I'm going to merge this into master and branch-1.2. I'll edit the commit message to reflect the bug description from JIRA.

@JoshRosen
Contributor

I've also backported this to branch-1.1.

asfgit pushed a commit that referenced this pull request Dec 4, 2014
…cluster modes

In yarn-cluster and standalone-cluster modes, we don't know where driver will run until it is launched.  If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf then this will lead to errors when the driver launches.

This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:

ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
02c4e49 [WangTao] add comment
32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext
667cf24 [WangTaoTheTonic] document fix
ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode

(cherry picked from commit 8106b1e)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala