
[SPARK-20540][CORE] Fix unstable executor requests. #17813

Closed

Conversation

rdblue (Contributor) commented Apr 30, 2017

There are two problems fixed in this commit. First, the
ExecutorAllocationManager sets a timeout to avoid requesting executors
too often, but the timeout is always advanced from its previous value
by the timeout interval rather than being set from the current time.
If a call is delayed, for example by lock contention, for longer than
the scheduler backlog timeout, the deadline falls behind the clock and
the manager requests more executors on every run. This seems to be the
main cause of SPARK-20540.
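
For illustration only, here is a hypothetical, self-contained sketch of
why advancing the deadline from its previous value drifts behind the
clock when a check runs late; the names and the 1-second interval are
assumptions, and this is not the actual Spark code.

```scala
// Hypothetical, simplified model of the backlog deadline; not the actual
// Spark code. It only shows why advancing the deadline from its previous
// value drifts behind the clock when the check runs late.
object BacklogDeadlineSketch {
  val backlogTimeoutMs: Long = 1000L // assumed interval, for illustration

  // Buggy update: the next deadline ignores when the check actually ran.
  def advanceFromPrevious(previousDeadline: Long): Long =
    previousDeadline + backlogTimeoutMs

  // Fixed update: the next deadline is anchored on the current time.
  def advanceFromNow(now: Long): Long =
    now + backlogTimeoutMs

  def main(args: Array[String]): Unit = {
    val oldDeadline = 0L
    val now = 5000L // the check ran 5 seconds late, e.g. due to lock contention
    println(advanceFromPrevious(oldDeadline)) // 1000: already in the past,
                                              // so the next run requests again
    println(advanceFromNow(now))              // 6000: a full interval from now
  }
}
```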

The second problem is that the total number of requested executors is
not tracked by the CoarseGrainedSchedulerBackend. Instead, it derives
the value from three variables: the number of known executors, the
number of executors pending removal, and the number of pending executor
requests. But the number of pending executors is never less than 0,
even when more executors are known than were requested. When executors
are killed and not replaced, the request sent to YARN can therefore ask
for too many executors, because the scheduler's state is slightly out
of date. This is fixed by tracking the currently requested total
explicitly.
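
For illustration, a hypothetical sketch of the difference between
deriving the target from current counts and tracking it explicitly; the
field names mirror the variables discussed here, but the class is a
simplified stand-in, not the real CoarseGrainedSchedulerBackend.

```scala
// Hypothetical, simplified backend state; not the real
// CoarseGrainedSchedulerBackend. It contrasts a target derived from
// possibly stale counts with an explicitly tracked requested total.
class RequestedTotalSketch {
  var numExistingExecutors: Int = 10
  var numPendingExecutors: Int = 0      // floored at 0, as described above
  var executorsPendingToRemove: Int = 0
  var requestedTotalExecutors: Int = 10 // explicitly tracked target (the fix)

  // Old behavior: re-derive the target from the current counts each time.
  def derivedTotal: Int =
    numExistingExecutors + numPendingExecutors - executorsPendingToRemove

  // Kill executors without replacement, keeping the tracked target in sync.
  def killWithoutReplacement(count: Int): Unit = {
    executorsPendingToRemove += count
    requestedTotalExecutors = math.max(requestedTotalExecutors - count, 0)
  }

  // Later the cluster manager confirms the kills; in the real backend these
  // two updates do not happen atomically or in a guaranteed order.
  def confirmKills(count: Int): Unit = {
    executorsPendingToRemove -= count
    numExistingExecutors -= count
  }
}
```

In this sketch both totals agree once every update has landed, but if
the derived value is recomputed between the two updates in confirmKills
(after the pending-to-remove count is cleared but before the known
executors drop out), it briefly exceeds what was actually requested;
the tracked requestedTotalExecutors does not depend on that ordering.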

How was this patch tested?

Existing tests.

rdblue (Contributor, Author) commented Apr 30, 2017

@vanzin, can you take a look at this? It is a dynamic allocation bug.

SparkQA commented May 1, 2017

Test build #76332 has finished for PR 17813 at commit 96a7686.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|numExistingExecutors = $numExistingExecutors
|numPendingExecutors = $numPendingExecutors
|executorsPendingToRemove = ${executorsPendingToRemove.size}
""".stripMargin)
Contributor

nit: indentation

Contributor Author

Thanks, I'll fix it.

doRequestTotalExecutors(
numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size)
requestedTotalExecutors = math.max(requestedTotalExecutors - executorsToKill.size, 0)
if (requestedTotalExecutors !=
Contributor

Won't this cause the message to be logged in the situation you describe in the PR description? Isn't that an "expected" situation? If so I'd demote this message, since users tend to get scared when messages like this one show up.

Contributor Author

Yes, it would. I can change it to debug or remove it. This was mainly for us to see how often it happened. With the fix to the request timing, this doesn't tend to happen at all. It's only when the method is called every 100ms that you see the behavior constantly, because there isn't enough time for kills and requests to complete before the totals are recomputed.

skonto (Contributor) commented Oct 25, 2017

Hey @rdblue, I have seen this message while testing dynamic allocation on Mesos:

17/10/25 13:58:44 INFO MesosCoarseGrainedSchedulerBackend: Actual list of executor(s) to be killed is 1
17/10/25 13:58:44 DEBUG MesosCoarseGrainedSchedulerBackend: killExecutors(ArrayBuffer(1), false, false): Executor counts do not match:
requestedTotalExecutors  = 0
numExistingExecutors     = 2
numPendingExecutors      = 0
executorsPendingToRemove = 1

The executors are removed at some point after that message.
Test is here.
What should I expect here? I am a bit confused.

Contributor Author

This is just informational. The problem is that the allocation manager's state isn't synced with the scheduler. Instead, the allocator sends messages that try to steer the scheduler backend toward the same state. For example, instead of telling the scheduler backend that the desired number of executors is 10, the allocator sends a message to add 2 executors. When the two get out of sync because of failures or network delay, you end up with these messages.

When you see these, make sure you're just out of sync (and will eventually get back in sync), and not in a state where the scheduler and allocator can't reconcile the required number of executors. That's what this PR tried to fix.

The long-term solution is to update the communication so that the allocator requests its ideal state, always telling the scheduler backend how many executors it currently needs, instead of killing or requesting more.
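
For illustration, a hypothetical sketch of the two communication styles
described above; the trait and its method names are made up for this
example (they loosely echo the real client interface) and are not the
actual Spark API.

```scala
// Hypothetical interface sketch; not the actual Spark API.
trait AllocatorToBackendSketch {
  // Imperative style: the allocator issues deltas ("add N", "kill these"),
  // so the backend's idea of the target can drift whenever a message is
  // delayed, lost, or applied against a stale view of the cluster.
  def requestAdditionalExecutors(numAdditional: Int): Unit
  def killExecutors(executorIds: Seq[String]): Unit

  // Declarative style (the long-term direction mentioned above): the
  // allocator always states the total it currently needs, so the next
  // message corrects any earlier drift instead of compounding it.
  def requestTotalExecutors(total: Int): Unit
}
```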

@rdblue rdblue force-pushed the SPARK-20540-fix-dynamic-allocation branch from 96a7686 to 3e46f4f on May 1, 2017 18:35
SparkQA commented May 1, 2017

Test build #76356 has finished for PR 17813 at commit 3e46f4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue (Contributor, Author) commented May 1, 2017

@vanzin, I fixed your review comments and tests are passing.

vanzin (Contributor) commented May 1, 2017

LGTM. Merging to master / 2.2 / 2.1.

asfgit pushed a commit that referenced this pull request May 1, 2017
(cherry picked from commit 2b2dd08)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
rdblue (Contributor, Author) commented May 1, 2017

Thanks!

asfgit pushed a commit that referenced this pull request May 1, 2017
(cherry picked from commit 2b2dd08)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit asfgit closed this in 2b2dd08 May 1, 2017
yoonlee95 pushed a commit to yoonlee95/spark that referenced this pull request Aug 17, 2017
(cherry picked from commit 2b2dd08)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>