Add pre-installed conda configuration and use to find rlib directory #700

jeremyjliu · 2020-07-07T21:29:12Z

Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

Conda env change, not filed in upstream.

What changes were proposed in this pull request?

Executor uses CondaEnvironment.condaEnvPath + /lib/R/library or looks in SPARK_HOME for the sparkR package in order to launch a worker/daemon R process. Thus, when conda is already pre-installed on docker image and CondaEnvironment is not set, RRunner is unable to locate a rLibDir to look for the sparkR package files. This is not an issue for python, as python is able to locate its own installed modules when launched non-interactively, unlike the Rscript command that requires an explicit file path.

This PR introduces a SparkConf variable to note the pre-installed conda env path that can be read by RRunner when appropriate.

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

rshkv · 2020-07-08T10:28:08Z

Looks reasonable, just don't fully understand the problem. Can you elaborate this bit?

Thus, when conda is already pre-installed on docker image and CondaEnvironment is not set, RRunner is unable to locate a rLibDir to look for the sparkR package files.

rLibDir is referring to /lib/R/library? Why can't we get Rscript from the pre-installed Conda env? I'd assume pre-installed means Conda has added itself to the path, no?

jeremyjliu · 2020-07-08T15:51:07Z

What we currently do for pre-installed conda environments, is pass the Rscript path via the R_COMMAND variable: https://github.com/palantir/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/R.scala#L37

R_COMMAND would be "/opt/conda/bin/Rscript", which is where conda installs the Rscript executable (as well as the R executable) normally. Rscript works in that it expects a path of an .R file to execute, so the full command in RRunner.scala needs to be /opt/conda/bin/Rscript /opt/conda/lib/R/library/SparkR/worker/worker.R. This works differently than in python, where the analogous full command is /opt/conda/bin/python -m pyspark.worker and python is able automatically locate the relevant module/file.

rLibDir is referring to /lib/R/library?
yup, specifically /lib/R/library that exists
Why can't we get Rscript from the pre-installed Conda env?
we can, it's passed in as the R_COMMAND config variable
I'd assume pre-installed means Conda has added itself to the path, no?
not relevant, due to how Rscript works. the python executable installed by conda knows where its packages are installed, because conda properly installs them into the right site-packages. likewise, so does the R executable (used for interactive sessions) but sadly not the Rscript executable (used for executing scripts).

robert3005 · 2020-07-17T12:46:21Z

this needs an entry in https://github.com/palantir/spark/blob/master/FORK.md. We should change pr template to include mention of that

jeremyjliu · 2020-07-18T00:07:10Z

👍 for this PR, I did remember: https://github.com/palantir/spark/pull/700/files#diff-a4d566a93b3dacc2c090d52eeb50ece6R29

robert3005 · 2020-07-18T00:39:37Z

All good my bad for missing it

Add pre-installed conda environment variable

e7e1eb7

jeremyjliu changed the title ~~Add pre-installed conda configuration and utilize to local rlib directory~~ Add pre-installed conda configuration and use to find rlib directory Jul 7, 2020

Update FORK.md

83c3b57

rshkv approved these changes Jul 13, 2020

View reviewed changes

rshkv merged commit 597d5a2 into palantir:master Jul 13, 2020

rshkv mentioned this pull request Feb 26, 2021

Spark3 diff on top of tag 3.0.1 #735

Closed

jdcasale mentioned this pull request Mar 3, 2021

Palantir Spark Diff on v3.0.2 #737

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pre-installed conda configuration and use to find rlib directory #700

Add pre-installed conda configuration and use to find rlib directory #700

jeremyjliu commented Jul 7, 2020 •

edited

Loading

rshkv commented Jul 8, 2020

jeremyjliu commented Jul 8, 2020

robert3005 commented Jul 17, 2020

jeremyjliu commented Jul 18, 2020

robert3005 commented Jul 18, 2020

Add pre-installed conda configuration and use to find rlib directory #700

Add pre-installed conda configuration and use to find rlib directory #700

Conversation

jeremyjliu commented Jul 7, 2020 • edited Loading

Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

What changes were proposed in this pull request?

How was this patch tested?

rshkv commented Jul 8, 2020

jeremyjliu commented Jul 8, 2020

robert3005 commented Jul 17, 2020

jeremyjliu commented Jul 18, 2020

robert3005 commented Jul 18, 2020

jeremyjliu commented Jul 7, 2020 •

edited

Loading