[ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node #118
Conversation
jongyoul commented on Jun 24, 2015
- Spark supports pyspark on yarn cluster without deploying python libraries from Spark 1.4
- https://issues.apache.org/jira/browse/SPARK-6869
- apache/spark#5580, apache/spark#5478 ([SPARK-6869][PySpark] Add pyspark archives path to PYTHONPATH)
export PYTHONPATH="${ZEPPELIN_HOME}/python/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
else
export PYTHONPATH="$PYTHONPATH${ZEPPELIN_HOME}/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
fi
@Leemoonsoo In the case of PYTHONPATH on PysparkInterpreter, it's not affected by the Python driver. In the past, Zeppelin loaded pyspark from SPARK_HOME/python.
Raising a question on the necessity of downloading Spark on every dev machine, because I recently heard a joke about it from @anthonycorbacho (somebody at the Strata conference).
I.e. is there a reason we can't have it under some kind of profile?
Do you think it's good to make a profile? There is no problem making one. I'll handle it so that joke stays just a joke.
Can anyone help me understand why the last build fails?
You can download the log file and search for "BUILD FAILURE".
@Leemoonsoo thanks. I will check it tomorrow. And what do you think of making a profile for yarn-pyspark?
Making a profile for pyspark is not a bad idea. However, pyspark can work not only with yarn but also with mesos and standalone clusters. So I think it would be better if the profile looks like
I've rebased.
@bzz @Leemoonsoo Review this again, please.
@jongyoul
Thank you very much for contributing this! It would be great to have a high-level summary of the changes, so please correct me in case I misunderstand something: This PR lets users of pyspark skip setting the PYTHONPATH env var, copying Python modules to every node of the cluster, and having Spark installed (in the case of pyspark in local mode on one machine). It does this by adding a new artifact to the Zeppelin build, python, hidden behind an optional build profile, that brings py4j as well as the Python code of pyspark by downloading (and caching) an actual Spark distribution and re-packing those into a zip file available on the Zeppelin classpath at runtime. Is that correct? One question: the python dir is not a maven submodule now, but maybe it should be one day. What do you guys think?
@bzz Your understanding is correct. I'm worried about mesos cluster mode because I haven't tested my PR on that cluster yet, but a standalone cluster will be OK because that cluster already has the Python libraries. I've only tested it in local mode and yarn cluster mode. Concerning the directory name python, I think a Python interpreter will be added someday; can you recommend a directory name?
if [[ x"" == x${PYTHONPATH} ]]; then
export PYTHONPATH="${ZEPPELIN_HOME}/python/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
else
export PYTHONPATH="$PYTHONPATH${ZEPPELIN_HOME}/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
How about adding a colon (:)?
from ....="$PYTHONPATH${ZEPPELIN_HOME}/lib/py....
to ....="$PYTHONPATH:${ZEPPELIN_HOME}/lib/py....
Right, I'm a bit confused by this. Is it not supposed to have a colon?
@Leemoonsoo @felixcheung Yes, you guys are right. I couldn't find this error because I don't set an extra PYTHONPATH at all and didn't test that case. I'll fix it.
I'm trying to remove some unused settings. I hope we eventually won't need to set SPARK_HOME to use pyspark.
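The colon fix discussed above can be sketched as follows. This is a hedged sketch, not the exact merged script: the `ZEPPELIN_PY_LIBS` variable and the `/opt/zeppelin` default are illustrative names introduced here, and the quoted diff's `[[ x"" == x${PYTHONPATH} ]]` test is rewritten as a plain `-z` check.

```shell
# Sketch of the corrected PYTHONPATH handling, with the colon the reviewers
# asked for. ZEPPELIN_PY_LIBS and the /opt/zeppelin default are illustrative,
# not taken from the merged script.
ZEPPELIN_HOME="${ZEPPELIN_HOME:-/opt/zeppelin}"
ZEPPELIN_PY_LIBS="${ZEPPELIN_HOME}/python/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
if [ -z "${PYTHONPATH}" ]; then
  # No existing PYTHONPATH: use the bundled archives alone.
  export PYTHONPATH="${ZEPPELIN_PY_LIBS}"
else
  # Existing PYTHONPATH: append the archives after a colon separator,
  # instead of concatenating them directly onto the previous entry.
  export PYTHONPATH="${PYTHONPATH}:${ZEPPELIN_PY_LIBS}"
fi
```

Without the colon, the last entry of any pre-existing PYTHONPATH and the pyspark.zip path would fuse into one nonexistent path, which is exactly the bug the review comments point at.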
I've heard that closing and reopening the issue is the easiest way to trigger a Travis build, but in my case I have a merge conflict.
…very yarn node - rebasing
- Removed redundant dependency setting
…very yarn node - Followed spark's way to support pyspark - https://issues.apache.org/jira/browse/SPARK-6869 - apache/spark#5580 - https://github.com/apache/spark/pull/5478/files
…very yarn node - Removed verbose setting
- Excludes python/** from apache-rat
…very yarn node - rebasing
…very yarn node - Fixed checkstyle
…very yarn node - rebasing
…very yarn node - Dummy for trigger
- Changed the location of pyspark's directory into interpreter/spark
@Leemoonsoo I've rebased it again because all of the Travis tests passed.
Thanks @jongyoul. Great work!
…very yarn node - Spark supports pyspark on yarn cluster without deploying python libraries from Spark 1.4 - https://issues.apache.org/jira/browse/SPARK-6869 - apache/spark#5580, apache/spark#5478

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #118 from jongyoul/ZEPPELIN-18 and squashes the following commits:

a47e27c [Jongyoul Lee] - Fixed test script for spark 1.4.0
72a65fd [Jongyoul Lee] - Fixed test script for spark 1.4.0
ee6d100 [Jongyoul Lee] - Cleanup codes
47fd9c9 [Jongyoul Lee] - Cleanup codes
248e330 [Jongyoul Lee] - Cleanup codes
4cd10b5 [Jongyoul Lee] - Removed meaningless codes comments
c9cda29 [Jongyoul Lee] - Removed setting SPARK_HOME - Changed the location of pyspark's directory into interpreter/spark
ef240f5 [Jongyoul Lee] - Fixed typo
06002fd [Jongyoul Lee] - Fixed typo
4b35c8d [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Dummy for trigger
682986e [Jongyoul Lee] rebased
8a7bf47 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing
ad610fb [Jongyoul Lee] rebased
94bdf30 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Fixed checkstyle
929333d [Jongyoul Lee] rebased
64b8195 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing
0a2d90e [Jongyoul Lee] rebased
b05ae6e [Jongyoul Lee] [ZEPPELIN-18] Remove setting SPARK_HOME for PySpark - Excludes python/** from apache-rat
71e2a92 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Removed verbose setting
0ddb436 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Followed spark's way to support pyspark - https://issues.apache.org/jira/browse/SPARK-6869 - apache/spark#5580 - https://github.com/apache/spark/pull/5478/files
1b192f6 [Jongyoul Lee] [ZEPPELIN-18] Remove setting SPARK_HOME for PySpark - Removed redundant dependency setting
32fd9e1 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing

(cherry picked from commit 3bd2b21)
Signed-off-by: Lee moon soo <moon@apache.org>