
[WIP][SPARK-27495][Core][YARN][k8s] Stage Level Scheduling code for reference #27053

Closed
tgravescs wants to merge 183 commits

Conversation

@tgravescs (Contributor) commented Dec 30, 2019

What changes were proposed in this pull request?

This is all the code for the stage level scheduling feature, except for documentation.

This is meant to be a reference when reviewing, as I'm splitting this into multiple PRs with the intention of making them easier to review. Note that only YARN currently supports this, and it requires dynamic allocation to be enabled because we currently acquire new executors that match the profile exactly. We do not try to fit tasks into executors that were acquired for a different profile.
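
As a minimal sketch of that dynamic allocation prerequisite (the app name is hypothetical and exact settings depend on the deployment; the external shuffle service is the typical companion to dynamic allocation on YARN):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: stage level scheduling as described here relies on dynamic
// allocation, because executors are requested to match each ResourceProfile exactly.
val conf = new SparkConf()
  .setAppName("stage-level-scheduling-example") // hypothetical app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // typical companion setting on YARN
val sc = new SparkContext(conf)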

At a high level, in order to support different stages having different ResourceProfiles, the changes required include:

  • Add a ResourceProfileManager that tracks the profiles and is used to map a ResourceProfile id to the actual ResourceProfile. This allows us to pass around and store the id rather than the entire profile.
  • Introduce the concept of a default profile. This is essentially the profile you get today, without stage level scheduling, built from the application-level configs.
  • ImmutableResourceProfile - the actual resource profile used internally by Spark, which is immutable. This allows the user to create and change a ResourceProfile in their code, but as soon as they associate the profile with an RDD, Spark internally uses the immutable version so that it doesn't change.
  • YARN cluster manager updated to handle the profiles and request the correct containers from YARN. I had to introduce priorities here because YARN doesn't allow you to create containers with different resources within the same priority. Now we have priority = ResourceProfile id, and it's easy to match the container we get back from YARN to the ResourceProfile we requested it for.
  • ExecutorAllocationManager, ExecutorMonitor, CoarseGrainedExecutorBackend - updated to track executors per ResourceProfile.
  • Scheduler - updated to handle the ResourceProfile associated with an RDD. It creates the Stage with the appropriate ResourceProfile. It has logic for handling conflicting ResourceProfiles when multiple RDDs with different ResourceProfiles are put into the same stage. The default behavior is to throw an exception, but there is a config that allows the scheduler to merge the profiles using the max value of each resource (see the sketch after this list). The task scheduler was updated to make sure the resources of each executor meet the task resources for that profile and to assign them out properly.
  • I updated all the locations that used the hardcoded task cpus or other global configs to use the ResourceProfile-based configs.
  • RDD API added, and ResourceProfile, ExecutorResourceRequests, and TaskResourceRequests made public.
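
For illustration, a minimal self-contained sketch of the "merge by max" conflict resolution described above. The Profile case class and helper names here are made up for the example; they are not Spark's internal types.

object MergeProfilesSketch {
  // Hypothetical stand-in for a resource profile (not Spark's class):
  // maps a resource name (e.g. "cores", "memoryMb", "gpu") to the amount requested.
  case class Profile(executorResources: Map[String, Double],
                     taskResources: Map[String, Double])

  // Keep the larger request for every resource that appears in either profile.
  private def maxMerge(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
    (a.keySet ++ b.keySet).map { k =>
      k -> math.max(a.getOrElse(k, 0.0), b.getOrElse(k, 0.0))
    }.toMap

  // Conflicting profiles in one stage are merged per-resource by max when the
  // merge config is enabled; the default behavior is to throw an exception.
  def merge(p1: Profile, p2: Profile): Profile =
    Profile(maxMerge(p1.executorResources, p2.executorResources),
            maxMerge(p1.taskResources, p2.taskResources))

  def main(args: Array[String]): Unit = {
    val a = Profile(Map("cores" -> 2.0, "memoryMb" -> 6144.0, "gpu" -> 1.0), Map("cpus" -> 2.0))
    val b = Profile(Map("cores" -> 4.0, "memoryMb" -> 4096.0), Map("cpus" -> 1.0, "gpu" -> 1.0))
    // Per-resource max: cores=4, memoryMb=6144, gpu=1; cpus=2, gpu=1
    println(merge(a, b))
  }
}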

The end-user API looks like this:
val rpBuilder = new ResourceProfileBuilder()
val ereq = new ExecutorResourceRequests()
val treq = new TaskResourceRequests()

ereq.cores(2).memory("6g").memoryOverhead("2g").pysparkMemory("2g").resource("gpu", 2, "/home/tgraves/getGpus")
treq.cpus(2).resource("gpu", 2)
val resourceProfile = rpBuilder.require(ereq).require(treq).build
val rdd = sc.parallelize(1 to 1000, 6).withResources(resourceProfile).map(x => (x, x))
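
In this example, each executor acquired for resourceProfile has 2 cores, 6g of heap, 2g of memory overhead, 2g of pyspark memory, and 2 GPUs (discovered via the given script), and each task in stages using the profile requires 2 CPUs and 2 GPUs.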

Why are the changes needed?

Allow different stages to use different executor/task resources.

Does this PR introduce any user-facing change?

Yes: the RDD.withResources API and the ResourceProfile, ExecutorResourceRequest, and TaskResourceRequest APIs.

How was this patch tested?

Unit tests and manually.

@SparkQA commented Jan 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21443/

@SparkQA commented Jan 13, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21443/

@SparkQA commented Jan 14, 2020

Test build #116664 has finished for PR 27053 at commit f99f1cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21503/

@SparkQA commented Jan 14, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21503/

@SparkQA commented Jan 15, 2020

Test build #116728 has finished for PR 27053 at commit 954ba00.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor, Author) commented:

test this please

@SparkQA commented Jan 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21559/

@SparkQA commented Jan 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21559/

@SparkQA commented Jan 15, 2020

Test build #116787 has finished for PR 27053 at commit 954ba00.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 22, 2020

Test build #117240 has finished for PR 27053 at commit 8f40a0c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 22, 2020

Test build #117243 has finished for PR 27053 at commit 585df54.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22003/

@SparkQA commented Jan 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22003/

@SparkQA commented Jan 27, 2020

Test build #117451 has finished for PR 27053 at commit 0738be0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22210/

@SparkQA commented Jan 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22210/

@SparkQA commented Mar 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/24123/

@SparkQA commented Mar 5, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/24123/

@github-actions commented:
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label on Jun 14, 2020
@github-actions bot closed this on Jun 15, 2020