This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

[WIP] Submission client redesign to use a step-based builder pattern #363

Merged
merged 2 commits into pyspark-integration on Jun 30, 2017

Conversation

mccheah

@mccheah mccheah commented Jun 30, 2017

Applies changes assuming the PySpark support from #364 is present.

This change overhauls the underlying architecture of the submission client, but it is intended to entirely preserve existing behavior of Spark applications. Therefore users will find this to be an invisible change.

The philosophy behind this design is to reconsider the breakdown of the submission process. It operates off the abstraction of "submission steps", which are transformation functions that take the previous state of the driver and return the new state of the driver. The driver's state includes its Spark configurations and the Kubernetes resources that will be used to deploy it.
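
In code, such a step can be modeled as a single-method trait over an immutable driver specification. A minimal sketch, assuming a spec shape roughly like the following (the method and field names here are illustrative, not necessarily the exact ones in this change):

import io.fabric8.kubernetes.api.model.{Container, HasMetadata, Pod}
import org.apache.spark.SparkConf

// The full state of the driver as it passes through the step chain.
private[spark] case class KubernetesDriverSpec(
    driverPod: Pod,
    driverContainer: Container,
    otherKubernetesResources: Seq[HasMetadata],
    driverSparkConf: SparkConf)

// A submission step: takes the previous driver state and returns the new one.
private[spark] trait KubernetesSubmissionStep {
  def prepareSubmission(previousSpec: KubernetesDriverSpec): KubernetesDriverSpec
}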

Such a refactor moves away from a features-first API design, in which different components each serve a set of features. The previous design, for example, had a container-files-resolver API object that returned different resolutions of the dependencies added by the user. However, it was up to the main Client to know how to invoke all of those APIs intelligently. As a result, the API surface of the file resolver became untenably large, and it was not intuitive how it should be used or extended.

This design changes the encapsulation layout; every module is now responsible for changing the driver specification directly. An orchestrator builds the correct chain of steps and hands it to the client, which then calls it verbatim. The main client then makes any final modifications that put the different pieces of the driver together, particularly to attach the driver container itself to the pod and to apply the Spark configuration as command-line arguments.
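
Concretely, the hand-off between orchestrator and client reduces to a fold over the step chain; roughly as follows (a sketch built on the trait above; the orchestrator accessor name is hypothetical):

// Orchestrator: decide which steps apply and in what order.
val steps: Seq[KubernetesSubmissionStep] = orchestrator.getAllSubmissionSteps()

// Client: apply each step verbatim, threading the evolving driver spec through.
val resolvedSpec = steps.foldLeft(initialDriverSpec) { (currentSpec, step) =>
  step.prepareSubmission(currentSpec)
}

// The client then attaches the driver container to the pod and turns the
// resolved SparkConf into command-line arguments for the driver.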

The current steps are listed below; a rough sketch of how the chain might be assembled follows the list:

  1. BaseSubmissionStep: Baseline configurations such as the docker image and resource requests.
  2. DriverKubernetesCredentialsStep: Resolves Kubernetes credentials configuration in the driver pod. Mounts a secret if necessary.
  3. InitContainerBootstrapStep: Attaches the init-container, if necessary, to the driver pod. This step is optional and won't be loaded if all URIs are "local" or there are no URIs at all.
  4. DependencyResolutionStep: Sets the classpath, spark.jars, and spark.files properties. This step is not fully isolated, since it assumes that remote or locally submitted files will be downloaded to a given location. Unit tests should verify that this contract holds.
  5. PythonStep: Configures Python environment variables if using PySpark.
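
As a rough illustration of how such a chain might be assembled, with the optional steps contributed as Options (the variable names here are hypothetical; the PythonStep match mirrors the code reviewed below):

val maybeInitContainerStep: Option[KubernetesSubmissionStep] =
  if (hasOnlyLocalUris) None else Some(initContainerBootstrapStep)

val maybePythonStep: Option[KubernetesSubmissionStep] = mainAppResource match {
  case PythonMainAppResource(mainPyResource) =>
    Some(new PythonStep(mainPyResource, additionalPythonFiles, filesDownloadPath))
  case _ => None
}

val submissionSteps: Seq[KubernetesSubmissionStep] =
  Seq(baseSubmissionStep, driverKubernetesCredentialsStep) ++
    maybeInitContainerStep.toSeq ++
    Seq(dependencyResolutionStep) ++
    maybePythonStep.toSeq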

@mccheah
Author

mccheah commented Jun 30, 2017

Unit tests are being rewritten. This passes integration tests in my development environment.

@mccheah mccheah changed the title [WIP] Submission steps refactor [WIP] Submission client redesign to use a step-based builder pattern Jun 30, 2017
@ifilonenko
Member

ifilonenko commented Jun 30, 2017

Thanks for this! (y) @ash211 @foxish PTAL

@ifilonenko ifilonenko self-requested a review June 30, 2017 01:51
.addNewEnv()
.withName(ENV_PYSPARK_FILES)
.withValue(
KubernetesFileUtils.resolveFilePaths(otherPyFiles, filesDownloadPath).mkString(","))
Author

If otherPyFiles is empty, then this needs to have a value that isn't the empty string ("null", for example). Otherwise the PythonRunner will shift all of the driver's arguments to the left by one.

Member

Correct. It would be wise to pass the driver arguments down to this class and modify the driver container's ENV variables. I will send over a patch for this.

Author

We shouldn't need to pass down the arguments to get this. Just:

.withValue(if (otherPyFiles.isEmpty) "null" else KubernetesFileUtils....
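
Spelled out fully, reusing the resolveFilePaths call from the snippet above, the guarded expression might look like:

.withValue(
  if (otherPyFiles.isEmpty) {
    "null"  // sentinel so PythonRunner's positional arguments don't shift left
  } else {
    KubernetesFileUtils.resolveFilePaths(otherPyFiles, filesDownloadPath).mkString(",")
  })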

@@ -621,14 +621,22 @@ object SparkSubmit {
if (isKubernetesCluster) {
childMainClass = "org.apache.spark.deploy.kubernetes.submit.Client"
if (args.isPython) {
childArgs += "--py-file"
Member

For the sake of naming consistency, this should be --primary-py-file.

childArgs += args.pyFiles
} else {
childArgs += "--primary-java-resource"
Member

Java and Python resources can be resolved with just a check for whether the resource ends with .py, so why is it necessary to pass down extra information from SparkSubmit that isn't needed, per se? IMO SparkSubmit should pass minimal things into our childMainClass.

Author

In general I'm against having logic that requires file names to have a certain extension.

Author

It makes the code brittle; if upstream spark-submit starts giving us Python files with extensions that aren't .py then we want to be resilient against that.

}
val sparkJars = submissionSparkConf.getOption("spark.jars")
.map(_.split(","))
.getOrElse(Array.empty[String]) ++ additionalMainAppJar.toSeq
Member
@ifilonenko ifilonenko Jun 30, 2017

.toSeq() vs. +: additionalMainAppJar.getOrElse("")

Why the decision to make this into a Seq?

Author

There's no point in having an empty string in the sequence if just not having the element at all would achieve the same effect.
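
For reference, the behavioral difference between the two approaches (a small illustrative example, not code from this change):

val sparkJars = Array("a.jar", "b.jar")
val additionalMainAppJar: Option[String] = None

sparkJars ++ additionalMainAppJar.toSeq          // contains "a.jar", "b.jar" only; nothing appended
sparkJars :+ additionalMainAppJar.getOrElse("")  // contains "a.jar", "b.jar", ""; a stray empty entry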

.addNewEnv()
.withName(ENV_DRIVER_ARGS)
.withValue(appArgs.mkString(" "))
.endEnv()
Member

Thoughts on creating a submission step that resolves the arguments here, instead of overwriting this created ENV variable in PythonStep?

Member
@ifilonenko ifilonenko Jun 30, 2017

Check the most recent push I made to this branch, which resolves this and the previous comments about argument issues.

Author

It seems more consistent to just load everything as environment variables correctly. Both Python and Java expect SPARK_DRIVER_ARGS to be the same; Python just expects the additional environment variables PRIMARY_PY_FILE and PY_FILES. I don't think we need any special argument parsing here; we just need to load the correct environment variables with the correct values, and the Docker image will take care of formatting them properly.

val pythonStep = mainAppResource match {
case PythonMainAppResource(mainPyResource) =>
Option(new PythonStep(mainPyResource, additionalPythonFiles, filesDownloadPath))
case _ => Option.empty[KubernetesSubmissionStep]
Member

So here, instead of an empty step, you would resolve the arguments to just be appArgs, as you would have before in the BaseSubmissionStep; otherwise you use the PythonStep. I believe SparkR has similar arguments to Java/Scala, so this generalization could just be called NonPythonArgumentResolver.
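
A rough sketch of that suggestion (NonPythonArgumentResolver is hypothetical here and only illustrates the shape of the idea):

val argumentResolutionStep: KubernetesSubmissionStep = mainAppResource match {
  case PythonMainAppResource(mainPyResource) =>
    new PythonStep(mainPyResource, additionalPythonFiles, filesDownloadPath)
  case _ =>
    // Java/Scala (and presumably SparkR) would just pass appArgs through unchanged.
    new NonPythonArgumentResolver(appArgs)
}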

import org.apache.spark.deploy.kubernetes.config._
import org.apache.spark.deploy.kubernetes.constants._

private[spark] class DriverKubernetesCredentialsStep(
Member

Should this be generalized into a credentials step as a whole, covering secure HDFS secrets as well? Or should HDFS mounting be considered a separate step? @kimoonkim @foxish

Author

HDFS should be separate. This class is already a little large as it is, given that it's resolving various configuration options. We can extend it to allow users to provide their certificates through a JKS trustStore.

@ifilonenko
Member

I think this refactor promotes a reasonably structured architecture that we can easily adapt and introduce to future developers. The ability to just write a KubernetesSubmissionStep to apply a new modification to the driver pod, container, resources, etc. is seemingly painless. This is especially good timing with the integration of secure HDFS, where pod modifications will be important; having that offloaded to a step that is wrapped in an Option() [depending on whether it is necessary] seems quite organized.

@ifilonenko ifilonenko merged commit e103225 into pyspark-integration Jun 30, 2017
@mccheah
Author

mccheah commented Jun 30, 2017

This was not supposed to merge yet, unfortunately - we'll be working on a revert on the base branch and re-opening this in a new push.

ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
…-jars

Sanitize empty strings for jars and files