
[WIP][SPARK-27495][Core][YARN][k8s] Stage Level Scheduling code for reference #27053

Closed
tgravescs wants to merge 183 commits

Conversation

@tgravescs (Contributor) commented Dec 30, 2019

What changes were proposed in this pull request?

This is all the code for the stage level scheduling feature, except for documentation.

This is meant to be a reference when reviewing, as I'm splitting this into multiple PRs with the intention of making them easier to review. Note that only YARN currently supports this, and it requires dynamic allocation to be enabled because we currently acquire new executors that match the profile exactly. We do not try to fit tasks into executors that were acquired for a different profile.
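
As a minimal sketch of that dynamic allocation prerequisite (the app name is hypothetical and exact settings depend on the deployment; the external shuffle service is the typical companion to dynamic allocation on YARN):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: stage level scheduling as described here relies on dynamic
// allocation, because executors are requested to match each ResourceProfile exactly.
val conf = new SparkConf()
  .setAppName("stage-level-scheduling-example") // hypothetical app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // typical companion setting on YARN
val sc = new SparkContext(conf)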

At a high level, in order to support different stages having different ResourceProfiles, the changes required include:

  • Add a ResourceProfileManager that tracks the profiles and is used to map a ResourceProfile id to the actual ResourceProfile. This allows us to pass around and store the id rather than the entire profile.
  • Introduce the concept of a default profile. This is essentially the profile you get today, without stage level scheduling, built from the application-level configs.
  • ImmutableResourceProfile - the actual resource profile used internally by Spark, which is immutable. This allows the user to create and change a ResourceProfile in their code, but as soon as they associate the profile with an RDD, Spark internally uses the immutable version so that it doesn't change.
  • YARN cluster manager updated to handle the profiles and request the correct containers from YARN. I had to introduce priorities here because YARN doesn't allow you to create containers with different resources within the same priority. Now we have priority = ResourceProfile id, and it's easy to match the container we get back from YARN to the ResourceProfile we requested it for.
  • ExecutorAllocationManager, ExecutorMonitor, CoarseGrainedExecutorBackend - updated to track executors per ResourceProfile.
  • Scheduler - updated to handle the ResourceProfile associated with an RDD. It creates the Stage with the appropriate ResourceProfile. It has logic for handling conflicting ResourceProfiles when multiple RDDs with different ResourceProfiles are put into the same stage. The default behavior is to throw an exception, but there is a config that allows the scheduler to merge the profiles using the max value of each resource (see the sketch after this list). The task scheduler was updated to make sure the resources of each executor meet the task resources for that profile and to assign them out properly.
  • I updated all the locations that used the hardcoded task cpus or other global configs to use the ResourceProfile-based configs.
  • RDD API added, and ResourceProfile, ExecutorResourceRequests, and TaskResourceRequests made public.
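
For illustration, a minimal self-contained sketch of the "merge by max" conflict resolution described above. The Profile case class and helper names here are made up for the example; they are not Spark's internal types.

object MergeProfilesSketch {
  // Hypothetical stand-in for a resource profile (not Spark's class):
  // maps a resource name (e.g. "cores", "memoryMb", "gpu") to the amount requested.
  case class Profile(executorResources: Map[String, Double],
                     taskResources: Map[String, Double])

  // Keep the larger request for every resource that appears in either profile.
  private def maxMerge(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
    (a.keySet ++ b.keySet).map { k =>
      k -> math.max(a.getOrElse(k, 0.0), b.getOrElse(k, 0.0))
    }.toMap

  // Conflicting profiles in one stage are merged per-resource by max when the
  // merge config is enabled; the default behavior is to throw an exception.
  def merge(p1: Profile, p2: Profile): Profile =
    Profile(maxMerge(p1.executorResources, p2.executorResources),
            maxMerge(p1.taskResources, p2.taskResources))

  def main(args: Array[String]): Unit = {
    val a = Profile(Map("cores" -> 2.0, "memoryMb" -> 6144.0, "gpu" -> 1.0), Map("cpus" -> 2.0))
    val b = Profile(Map("cores" -> 4.0, "memoryMb" -> 4096.0), Map("cpus" -> 1.0, "gpu" -> 1.0))
    // Per-resource max: cores=4, memoryMb=6144, gpu=1; cpus=2, gpu=1
    println(merge(a, b))
  }
}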

The end-user API looks like this:
val rpBuilder = new ResourceProfileBuilder()
val ereq = new ExecutorResourceRequests()
val treq = new TaskResourceRequests()

ereq.cores(2).memory("6g").memoryOverhead("2g").pysparkMemory("2g").resource("gpu", 2, "/home/tgraves/getGpus")
treq.cpus(2).resource("gpu", 2)
val resourceProfile = rpBuilder.require(ereq).require(treq).build
val rdd = sc.parallelize(1 to 1000, 6).withResources(resourceProfile).map(x => (x, x))
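
In this example, each executor acquired for resourceProfile has 2 cores, 6g of heap, 2g of memory overhead, 2g of pyspark memory, and 2 GPUs (discovered via the given script), and each task in stages using the profile requires 2 CPUs and 2 GPUs.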

Why are the changes needed?

Allow different stages to use different executor/task resources.

Does this PR introduce any user-facing change?

Yes: the RDD.withResources API and the ResourceProfile, ExecutorResourceRequest, and TaskResourceRequest APIs.

How was this patch tested?

Unit tests and manually.

@SparkQA commented Jan 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21443/

@SparkQA commented Jan 13, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21443/

@SparkQA commented Jan 14, 2020

Test build #116664 has finished for PR 27053 at commit f99f1cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21503/

@SparkQA commented Jan 14, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21503/

@SparkQA commented Jan 15, 2020

Test build #116728 has finished for PR 27053 at commit 954ba00.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor, Author) commented:

test this please

@SparkQA commented Jan 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21559/

@SparkQA commented Jan 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/21559/

@SparkQA commented Jan 15, 2020

Test build #116787 has finished for PR 27053 at commit 954ba00.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 22, 2020

Test build #117240 has finished for PR 27053 at commit 8f40a0c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 22, 2020

Test build #117243 has finished for PR 27053 at commit 585df54.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22003/

@SparkQA commented Jan 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22003/

@SparkQA commented Jan 27, 2020

Test build #117451 has finished for PR 27053 at commit 0738be0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22210/

@SparkQA commented Jan 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22210/

@SparkQA commented Mar 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/24123/

@SparkQA commented Mar 5, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/24123/

@github-actions commented:
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label on Jun 14, 2020
@github-actions bot closed this on Jun 15, 2020