[WIP] Submission client redesign to use a step-based builder pattern #363
Conversation
This change overhauls the underlying architecture of the submission client, but it is intended to entirely preserve the existing behavior of Spark applications. Users should therefore find this to be an invisible change.

The philosophy behind this design is to reconsider the breakdown of the submission process. It operates on the abstraction of "submission steps", which are transformation functions that take the previous state of the driver and return the new state of the driver. The driver's state includes its Spark configurations and the Kubernetes resources that will be used to deploy it.

Such a refactor moves away from a features-first API design, which considers different containers to serve a set of features. The previous design, for example, had a container files resolver API object that returned different resolutions of the dependencies added by the user. However, it was up to the main Client to know how to intelligently invoke all of those APIs, so the API surface area of the file resolver became untenably large and it was not intuitive how it was to be used or extended.

This design changes the encapsulation layout: every module is now responsible for changing the driver specification directly. An orchestrator builds the correct chain of steps and hands it to the client, which then calls it verbatim. The main client then makes any final modifications that put the different pieces of the driver together, particularly attaching the driver container itself to the pod and applying the Spark configuration as command-line arguments.

Unit tests are being rewritten. This passes integration tests in my development environment.
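To make the abstraction concrete, here is a minimal sketch of what a step and the driver state it transforms could look like. The names (KubernetesDriverSpec, prepareSubmission) and the exact fields are illustrative, not necessarily the branch's final signatures:

```scala
import io.fabric8.kubernetes.api.model.{Container, HasMetadata, Pod}
import org.apache.spark.SparkConf

// The driver's state: the pod and container being built up, any additional
// Kubernetes resources to create alongside them, and the accumulated Spark conf.
// (Illustrative shape only.)
private[spark] case class KubernetesDriverSpec(
    driverPod: Pod,
    driverContainer: Container,
    otherKubernetesResources: Seq[HasMetadata],
    driverSparkConf: SparkConf)

// A submission step is a transformation from the previous driver state to a new one.
private[spark] trait KubernetesSubmissionStep {
  def prepareSubmission(previousSpec: KubernetesDriverSpec): KubernetesDriverSpec
}
```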
.addNewEnv()
  .withName(ENV_PYSPARK_FILES)
  .withValue(
    KubernetesFileUtils.resolveFilePaths(otherPyFiles, filesDownloadPath).mkString(","))
If otherPyFiles is empty, then this needs to have a value that isn't the empty string ("null", for example); otherwise the PythonRunner will shift all the driver's arguments to the left by one.
Correct; it would be wise to pass down the driver arguments to this class and modify the driver container's env variables. I will send a patch over for this.
We shouldn't need to pass down the arguments to get this. Just:
.withValue(if (otherPyFiles.isEmpty) "null" else KubernetesFileUtils....
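Spelled out fully, and just as a sketch continuing the builder fragment quoted above (it reuses the same resolveFilePaths call and filesDownloadPath value), the suggestion would look something like:

```scala
.addNewEnv()
  .withName(ENV_PYSPARK_FILES)
  .withValue(
    // Guard against an empty list so the env var is never the empty string,
    // which would otherwise shift the driver's arguments in PythonRunner.
    if (otherPyFiles.isEmpty) {
      "null"
    } else {
      KubernetesFileUtils.resolveFilePaths(otherPyFiles, filesDownloadPath).mkString(",")
    })
  .endEnv()
```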
@@ -621,14 +621,22 @@ object SparkSubmit {
    if (isKubernetesCluster) {
      childMainClass = "org.apache.spark.deploy.kubernetes.submit.Client"
      if (args.isPython) {
        childArgs += "--py-file"
Naming convention: for the sake of consistency, this should be --primary-py-file.
        childArgs += args.pyFiles
      } else {
        childArgs += "--primary-java-resource"
Java and Python resources can be resolved with just a check for whether the resource ends with .py, so why is it necessary to pass down extra information from SparkSubmit that isn't needed, per se? IMO, SparkSubmit should have minimal things being passed into our childMainClass.
In general I'm against having logic that requires file names to have a certain extension.
It makes the code brittle; if upstream spark-submit starts giving us Python files with extensions that aren't .py, then we want to be resilient against that.
}
val sparkJars = submissionSparkConf.getOption("spark.jars")
  .map(_.split(","))
  .getOrElse(Array.empty[String]) ++ additionalMainAppJar.toSeq
Why the decision to make this into a Seq, i.e. .toSeq, vs. +: additionalMainAppJar.getOrElse("")?
There's no point in having an empty string in the sequence if just not having the element at all would achieve the same effect.
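A tiny illustration of the difference (the values here are made up for the example):

```scala
val additionalMainAppJar: Option[String] = None
val userJars = Seq("local:///opt/app/dep.jar")

// Option.toSeq contributes no element when the Option is empty:
val jarsWithToSeq = userJars ++ additionalMainAppJar.toSeq
// => Seq("local:///opt/app/dep.jar")

// getOrElse("") would leave an empty entry that spark.jars then has to tolerate or filter:
val jarsWithGetOrElse = additionalMainAppJar.getOrElse("") +: userJars
// => Seq("", "local:///opt/app/dep.jar")
```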
.addNewEnv()
  .withName(ENV_DRIVER_ARGS)
  .withValue(appArgs.mkString(" "))
  .endEnv()
Thoughts on creating a submission step that resolves the arguments here, instead of overwriting this created env var in PythonStep?
Check the most recent push that I made to this branch, which resolves this and the previous comments about the argument issues.
It seems more consistent to just load everything as environment variables correctly. Both Python and Java expect SPARK_DRIVER_ARGS to be the same; Python just additionally expects the environment variables PRIMARY_PY_FILE and PY_FILES. I don't think we need any special argument parsing here; we just need to load the correct environment variables with the correct values, and the Docker image will take care of formatting them properly.
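A rough sketch of that approach, reusing the fabric8 container builder pattern already shown in this PR. The constant ENV_PYSPARK_PRIMARY and the variables appArgs, isPython, primaryPyFile, and otherPyFiles are assumed to be in scope and are named illustratively; they may not match the branch exactly:

```scala
import io.fabric8.kubernetes.api.model.{Container, ContainerBuilder}

// Both the Java and Python Docker images read the driver arguments from the same
// env var; the Python image additionally reads the primary resource and extra py-files.
def withDriverEnv(driverContainer: Container): Container = {
  val withArgs = new ContainerBuilder(driverContainer)
    .addNewEnv()
      .withName(ENV_DRIVER_ARGS)
      .withValue(appArgs.mkString(" "))
      .endEnv()
  if (isPython) {
    withArgs
      .addNewEnv()
        .withName(ENV_PYSPARK_PRIMARY)
        .withValue(primaryPyFile)
        .endEnv()
      .addNewEnv()
        .withName(ENV_PYSPARK_FILES)
        // "null" rather than "" when there are no extra py-files (see earlier thread).
        .withValue(if (otherPyFiles.isEmpty) "null" else otherPyFiles.mkString(","))
        .endEnv()
      .build()
  } else {
    withArgs.build()
  }
}
```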
val pythonStep = mainAppResource match {
  case PythonMainAppResource(mainPyResource) =>
    Option(new PythonStep(mainPyResource, additionalPythonFiles, filesDownloadPath))
  case _ => Option.empty[KubernetesSubmissionStep]
So here, instead of an empty step, you would resolve the arguments to just be appArgs, as you would have before in the BaseSubmissionStep, but otherwise you use the PythonStep. I believe SparkR has similar arguments to Java/Scala, so this generalization could just be called NonPythonArgumentResolver.
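A sketch of that suggestion; NonPythonArgumentResolver is the reviewer's proposed (hypothetical) step, and its constructor argument is illustrative:

```scala
// Always produce an argument-resolving step instead of an empty Option:
val argumentResolutionStep: KubernetesSubmissionStep = mainAppResource match {
  case PythonMainAppResource(mainPyResource) =>
    new PythonStep(mainPyResource, additionalPythonFiles, filesDownloadPath)
  case _ =>
    new NonPythonArgumentResolver(appArgs)
}
```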
import org.apache.spark.deploy.kubernetes.config._
import org.apache.spark.deploy.kubernetes.constants._

private[spark] class DriverKubernetesCredentialsStep(
Should this be generalized to a credentials step as a whole, covering secure HDFS secrets as well? Or should HDFS mounting be considered another, separate step? @kimoonkim @foxish
HDFS should be separate. This class is already a little large as it is given that it's resolving various configuration options. We can extend this to allow users to give their certificates through a JKS trustStore.
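For reference, a sketch of the kind of secret such a credentials step could create and mount; the helper name, key names, and read-from-disk approach are illustrative, not the branch's actual code:

```scala
import java.nio.file.{Files, Paths}
import java.util.Base64
import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.{Secret, SecretBuilder}

// Build a secret holding base64-encoded credential files (e.g. client key/cert, CA cert),
// which the credentials step would then mount into the driver pod.
def buildCredentialsSecret(secretName: String, credentialFiles: Map[String, String]): Secret = {
  val encoded = credentialFiles.map { case (key, path) =>
    key -> Base64.getEncoder.encodeToString(Files.readAllBytes(Paths.get(path)))
  }
  new SecretBuilder()
    .withNewMetadata()
      .withName(secretName)
      .endMetadata()
    .withData(encoded.asJava)
    .build()
}
```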
I think this refactor promotes a reasonably structured architecture that we can easily adapt and introduce to future developers. The ability to just write a KubernetesSubmissionStep to apply a new modification to the driver pod, container, resources, etc. is seemingly painless. This is especially good timing with the integration of secure HDFS, where pod modifications will be important; having that be offset to a step that is wrapped in an Option() [depending on whether it is necessary or not] seems quite organized.
This was not supposed to merge yet, unfortunately - we'll be working on a revert on the base branch and re-opening this in a new push.
…-jars Sanitize empty strings for jars and files
Applies changes assuming PySpark is present from #364.
The current steps are:
- BaseSubmissionStep: Baseline configurations such as the Docker image and resource requests.
- DriverKubernetesCredentialsStep: Resolves Kubernetes credentials configuration in the driver pod. Mounts a secret if necessary.
- InitContainerBootstrapStep: Attaches the init-container, if necessary, to the driver pod. This is optional and won't be loaded if all URIs are "local" or there are no URIs at all.
- DependencyResolutionStep: Sets the classpath, spark.jars, and spark.files properties. This step is partially not isolated, as it assumes that remote or locally submitted files will be downloaded to a given location. Unit tests should verify that this contract holds.
- PythonStep: Configures Python environment variables if using PySpark.
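To tie the list together, here is a sketch of how the client could apply the orchestrator's chain of steps verbatim, using the illustrative trait sketched near the top of this description; submissionSteps and initialDriverSpec are assumed inputs:

```scala
// Apply each step in order; every step transforms the previous driver spec.
val resolvedDriverSpec = submissionSteps.foldLeft(initialDriverSpec) { (spec, step) =>
  step.prepareSubmission(spec)
}
// The client then attaches the driver container to the pod and applies the
// accumulated Spark configuration as command-line arguments, as described above.
```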