Python Bindings for launching PySpark Jobs from the JVM #364
Conversation
This change overhauls the underlying architecture of the submission client, but it is intended to entirely preserve the existing behavior of Spark applications, so users should find it to be an invisible change. The philosophy behind this design is to reconsider the breakdown of the submission process. It operates on the abstraction of "submission steps": transformation functions that take the previous state of the driver and return its new state. The driver's state includes its Spark configuration and the Kubernetes resources that will be used to deploy it.

Such a refactor moves away from a features-first API design, which considers different containers to serve a set of features. The previous design, for example, had a container files resolver API object that returned different resolutions of the dependencies added by the user, but it was up to the main Client to know how to intelligently invoke all of those APIs. As a result, the API surface area of the file resolver became untenably large, and it was not intuitive how it was to be used or extended.

This design changes the encapsulation layout: every module is now responsible for changing the driver specification directly. An orchestrator builds the correct chain of steps and hands it to the client, which applies it verbatim. The main client then makes any final modifications that put the different pieces of the driver together, particularly attaching the driver container itself to the pod and applying the Spark configuration as command-line arguments.
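For readers skimming the design discussion, here is a minimal, self-contained Scala sketch of the step-based pattern described above. All names (KubernetesDriverSpec, DriverConfigurationStep, the example step, and the placeholder config keys) are illustrative rather than the exact identifiers used in this PR, and the Kubernetes resources are reduced to strings to keep the sketch dependency-free.

// The driver's state: its Spark configuration plus the resources that will deploy it.
// (Real code would carry Kubernetes model objects here; strings keep the sketch standalone.)
case class KubernetesDriverSpec(
    sparkConf: Map[String, String],
    otherResources: Seq[String])

// A "submission step" is a transformation from the previous driver state to the next one.
trait DriverConfigurationStep {
  def configureDriver(spec: KubernetesDriverSpec): KubernetesDriverSpec
}

// Example step: record the primary PySpark resource and py-files in the driver's conf.
// The config keys below are placeholders for illustration, not real Spark properties.
class PySparkStep(primaryPyFile: String, pyFiles: Seq[String]) extends DriverConfigurationStep {
  override def configureDriver(spec: KubernetesDriverSpec): KubernetesDriverSpec =
    spec.copy(sparkConf = spec.sparkConf ++ Map(
      "x.pyspark.primary" -> primaryPyFile,
      "x.pyspark.files" -> pyFiles.mkString(",")))
}

// The orchestrator builds the chain of steps; the client simply folds over it verbatim.
object StepOrchestrator {
  def run(initial: KubernetesDriverSpec, steps: Seq[DriverConfigurationStep]): KubernetesDriverSpec =
    steps.foldLeft(initial)((spec, step) => step.configureDriver(spec))
}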
@mccheah PTAL
childArgs += "org.apache.spark.deploy.PythonRunner" | ||
childArgs += "--other-py-files" | ||
childArgs += args.pyFiles |
No need to check if it's empty?
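For illustration, the guard being suggested might look something like the following sketch (the method and parameter names mirror the diff above but are otherwise hypothetical):

import scala.collection.mutable.ArrayBuffer

// Sketch only: append --other-py-files to the child arguments only when the user
// actually supplied py-files, instead of always passing a possibly empty value.
def appendPyFilesArgs(childArgs: ArrayBuffer[String], pyFiles: String): Unit = {
  if (pyFiles != null && pyFiles.nonEmpty) {
    childArgs += "--other-py-files"
    childArgs += pyFiles
  }
}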
This was not supposed to include #363 so please hold off on reviews until we revert that merge.
This reverts commit 4533df2.
@tnachen PTAL now
} else None
// Since you might need jars for SQL UDFs in PySpark
def sparkJarFilter() : Seq[String] =
pythonResource.map { p => p.sparkJars}.getOrElse(
Style: { p => p.sparkJars }
To be consistent with everywhere else here.
Option(new PythonSubmissionResourcesImpl(mainAppResource, appArgs))
} else None
// Since you might need jars for SQL UDFs in PySpark
def sparkJarFilter() : Seq[String] =
Style: sparkJarFilter(): Seq[String] =
jarsDownloadPath: String,
filesDownloadPath: String) extends ContainerLocalizedFilesResolver {
filesDownloadPath: String ) extends ContainerLocalizedFilesResolver {
Remove extra space before )
@@ -32,13 +32,15 @@ import org.apache.spark.util.Utils
*/
private[spark] trait DriverInitContainerComponentsProvider {

def provideContainerLocalizedFilesResolver(): ContainerLocalizedFilesResolver
def provideContainerLocalizedFilesResolver(
Remove extra space after )
resolvedPrimaryPySparkResource: String,
resolvedPySparkFiles: String,
driverContainerName: String,
driverPodBuilder: PodBuilder) : Pod
Let's make the style consistent with the : spacing.
}
}
}
override def primaryPySparkResource (
Same with the ( spacing.
import org.mockito.Mockito.when
import org.scalatest.BeforeAndAfter

Remove extra spaces
Besides style problems, nothing jumps out to me as being wrong. Since it's tested already, I think it LGTM.
@foxish waiting on your OK to merge
LGTM. Feel free to merge. Thanks!
Likewise, LGTM
* Adding PySpark Submit functionality. Launching Python from JVM
* Addressing scala idioms related to PR351
* Removing extends Logging which was necessary for LogInfo
* Refactored code to leverage the ContainerLocalizedFileResolver
* Modified Unit tests so that they would pass
* Modified Unit Test input to pass Unit Tests
* Setup working environment for integration tests for PySpark
* Comment out Python thread logic until Jenkins has python in Python
* Modifying PythonExec to pass on Jenkins
* Modifying python exec
* Added unit tests to ClientV2 and refactored to include pyspark submission resources
* Modified unit test check
* Scalastyle
* PR 348 file conflicts
* Refactored unit tests and styles
* further scala stylizing and logic
* Modified unit tests to be more specific towards Class in question
* Removed space delimiting for methods
* Submission client redesign to use a step-based builder pattern. (Full rationale as described in the comment above.)
* Don't add the init-container step if all URIs are local.
* Python arguments patch + tests + docs
* Revert "Python arguments patch + tests + docs" This reverts commit 4533df2.
* Revert "Don't add the init-container step if all URIs are local." This reverts commit e103225.
* Revert "Submission client redesign to use a step-based builder pattern." This reverts commit 5499f6d.
* style changes
* space for styling
--conf spark.kubernetes.driver.docker.image=spark-driver-py:latest \
--conf spark.kubernetes.executor.docker.image=spark-executor-py:latest \
--conf spark.kubernetes.initcontainer.docker.image=spark-init:latest \
--py-files local:///opt/spark/examples/src/main/python/sort.py \
I know this is unrelated, but so happy to have local support :)
I know this is merged, but I figured I'd leave some comments/questions if that works for folks?
}

private[spark] class ContainerLocalizedFilesResolverImpl(
sparkJars: Seq[String],
sparkFiles: Seq[String],
pySparkFiles: Seq[String],
primaryPyFile: String,
Should this be an Option?
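For illustration, the suggestion amounts to modelling a possibly absent primary resource as an Option rather than a plain String. The following is a simplified sketch, with the class and members trimmed down from the diff above and a placeholder body:

// Sketch: a trimmed-down resolver that takes the primary PySpark file as Option[String].
// The real resolver maps submitter-local URIs to their in-container download paths;
// here only the Option handling is shown.
class ContainerLocalizedFilesResolverSketch(
    pySparkFiles: Seq[String],
    primaryPyFile: Option[String]) {

  def describePrimaryResource(): String = primaryPyFile match {
    case Some(uri) => s"primary PySpark resource: $uri"
    case None      => "no primary PySpark resource (non-Python application)"
  }
}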
val containerLocalizedFilesResolver = provideContainerLocalizedFilesResolver()
// Bypass init-containers if `spark.jars` and `spark.files` is empty or only has `local://` URIs
if (KubernetesFileUtils.getNonContainerLocalFiles(uris).nonEmpty) {
// Bypass init-containers if `spark.jars` and `spark.files` and '--py-files'
I know this is pre-existing but why?
if (KubernetesFileUtils.getNonContainerLocalFiles(uris).nonEmpty) {
// Bypass init-containers if `spark.jars` and `spark.files` and '--py-files'
// is empty or only has `local://` URIs
if ((KubernetesFileUtils.getNonContainerLocalFiles(uris) ++ pySparkSubmitted).nonEmpty) {
So right now pySparkSubmitted is defined as val pySparkSubmitted = KubernetesFileUtils.getOnlySubmitterLocalFiles(pySparkFiles), which does not seem to be what is desired, based on the previous comment and on the logic used for the non-container-local files beforehand.
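A sketch of the more consistent check being hinted at here, under the assumption that the intent is to require an init-container whenever any jar, file, or py-file is not container-local. The helper is inlined for self-containment; the real code would presumably reuse KubernetesFileUtils.getNonContainerLocalFiles for all three categories:

// Sketch: decide whether an init-container is needed by treating --py-files exactly
// like spark.jars and spark.files, i.e. only non-container-local (non "local://") URIs count.
// The startsWith test is a simplification of the project's KubernetesFileUtils helper.
def needsInitContainer(
    sparkJars: Seq[String],
    sparkFiles: Seq[String],
    pySparkFiles: Seq[String]): Boolean = {
  def nonContainerLocal(uris: Seq[String]): Seq[String] =
    uris.filterNot(_.startsWith("local://"))
  nonContainerLocal(sparkJars ++ sparkFiles ++ pySparkFiles).nonEmpty
}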
# RUN apk add --update alpine-sdk python-dev
# RUN pip install numpy

ENV PYTHON_VERSION 2.7.13
This seems like a weird version to set Python to? Do we expect people to roll their own Py3 containers?
ENV PYTHON_VERSION 2.7.13
ENV PYSPARK_PYTHON python
ENV PYSPARK_DRIVER_PYTHON python
ENV PYTHONPATH ${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${PYTHONPATH}
Could we source this from one of the existing PySpark scripts so it doesn't need to be updated?
If not, and we do need this, we should add a comment about needing to update this when we update it elsewhere. Also, Spark master is now on py4j 0.10.6.
What changes were proposed in this pull request?
This PR adds the ability to launch PySpark jobs from the JVM on Kubernetes. The changes include PySpark-specific driver and executor Docker images (spark-driver-py and spark-executor-py). These images differ from the base images by including Python and PySpark-specific environment variables. The user entry point also differs for driver-py: in addition to the driver args, you must include the location of the primary PySpark file and the distributed py-files.
Example Spark Submit
This is an example spark-submit that uses the custom pyspark docker images and distributes the staged sort.py file across the cluster. The entry point for the driver is: org.apache.spark.deploy.PythonRunner <FILE_DOWNLOADS_PATH>/pi.py <FILE_DOWNLOADS_PATH>/sort.py 100
How was this patch tested?
This was fully tested by building a make-distribution environment and running on a local minikube cluster with a single executor; the spark-submit example above is a representative submission.
Integration and unit tests have been added.
Future Versions of this PR
Launching JVM from Python (log issue)
MemoryOverhead testing (OOMKilled errors)