
[SPARK-20540][CORE] Fix unstable executor requests. #17813

Closed

Conversation

rdblue (Contributor) commented Apr 30, 2017

There are two problems fixed in this commit. First, the
ExecutorAllocationManager sets a timeout to avoid requesting executors
too often, but the timeout is always advanced from its previous value
by the timeout interval rather than being set from the current time.
If a call is delayed, for example by lock contention, for longer than
the scheduler backlog timeout, the deadline falls behind the clock and
the manager requests more executors on every run. This seems to be the
main cause of SPARK-20540.
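
For illustration only, here is a hypothetical, self-contained sketch of
why advancing the deadline from its previous value drifts behind the
clock when a check runs late; the names and the 1-second interval are
assumptions, and this is not the actual Spark code.

```scala
// Hypothetical, simplified model of the backlog deadline; not the actual
// Spark code. It only shows why advancing the deadline from its previous
// value drifts behind the clock when the check runs late.
object BacklogDeadlineSketch {
  val backlogTimeoutMs: Long = 1000L // assumed interval, for illustration

  // Buggy update: the next deadline ignores when the check actually ran.
  def advanceFromPrevious(previousDeadline: Long): Long =
    previousDeadline + backlogTimeoutMs

  // Fixed update: the next deadline is anchored on the current time.
  def advanceFromNow(now: Long): Long =
    now + backlogTimeoutMs

  def main(args: Array[String]): Unit = {
    val oldDeadline = 0L
    val now = 5000L // the check ran 5 seconds late, e.g. due to lock contention
    println(advanceFromPrevious(oldDeadline)) // 1000: already in the past,
                                              // so the next run requests again
    println(advanceFromNow(now))              // 6000: a full interval from now
  }
}
```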

The second problem is that the total number of requested executors is
not tracked by the CoarseGrainedSchedulerBackend. Instead, it derives
the value from three variables: the number of known executors, the
number of executors pending removal, and the number of pending executor
requests. But the number of pending executors is never less than 0,
even when more executors are known than were requested. When executors
are killed and not replaced, the request sent to YARN can therefore ask
for too many executors, because the scheduler's state is slightly out
of date. This is fixed by tracking the currently requested total
explicitly.
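
For illustration, a hypothetical sketch of the difference between
deriving the target from current counts and tracking it explicitly; the
field names mirror the variables discussed here, but the class is a
simplified stand-in, not the real CoarseGrainedSchedulerBackend.

```scala
// Hypothetical, simplified backend state; not the real
// CoarseGrainedSchedulerBackend. It contrasts a target derived from
// possibly stale counts with an explicitly tracked requested total.
class RequestedTotalSketch {
  var numExistingExecutors: Int = 10
  var numPendingExecutors: Int = 0      // floored at 0, as described above
  var executorsPendingToRemove: Int = 0
  var requestedTotalExecutors: Int = 10 // explicitly tracked target (the fix)

  // Old behavior: re-derive the target from the current counts each time.
  def derivedTotal: Int =
    numExistingExecutors + numPendingExecutors - executorsPendingToRemove

  // Kill executors without replacement, keeping the tracked target in sync.
  def killWithoutReplacement(count: Int): Unit = {
    executorsPendingToRemove += count
    requestedTotalExecutors = math.max(requestedTotalExecutors - count, 0)
  }

  // Later the cluster manager confirms the kills; in the real backend these
  // two updates do not happen atomically or in a guaranteed order.
  def confirmKills(count: Int): Unit = {
    executorsPendingToRemove -= count
    numExistingExecutors -= count
  }
}
```

In this sketch both totals agree once every update has landed, but if
the derived value is recomputed between the two updates in confirmKills
(after the pending-to-remove count is cleared but before the known
executors drop out), it briefly exceeds what was actually requested;
the tracked requestedTotalExecutors does not depend on that ordering.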

How was this patch tested?

Existing tests.

rdblue (Contributor, Author) commented Apr 30, 2017

@vanzin, can you take a look at this? It is a dynamic allocation bug.

SparkQA commented May 1, 2017

Test build #76332 has finished for PR 17813 at commit 96a7686.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|numExistingExecutors = $numExistingExecutors
|numPendingExecutors = $numPendingExecutors
|executorsPendingToRemove = ${executorsPendingToRemove.size}
""".stripMargin)
Contributor

nit: indentation

Contributor Author

Thanks, I'll fix it.

doRequestTotalExecutors(
numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size)
requestedTotalExecutors = math.max(requestedTotalExecutors - executorsToKill.size, 0)
if (requestedTotalExecutors !=
Contributor

Won't this cause the message to be logged in the situation you describe in the PR description? Isn't that an "expected" situation? If so I'd demote this message, since users tend to get scared when messages like this one show up.

Contributor Author

Yes, it would. I can change it to debug or remove it. This was mainly for us to see how often it happened. With the fix to the request timing, this doesn't tend to happen at all. It's only when the method is called every 100ms that you see the behavior constantly, because there isn't enough time for kills and requests to complete before the totals are recomputed.

skonto (Contributor) commented Oct 25, 2017

Hey @rdblue, I have seen this message while testing dynamic allocation on Mesos:

17/10/25 13:58:44 INFO MesosCoarseGrainedSchedulerBackend: Actual list of executor(s) to be killed is 1
17/10/25 13:58:44 DEBUG MesosCoarseGrainedSchedulerBackend: killExecutors(ArrayBuffer(1), false, false): Executor counts do not match:
requestedTotalExecutors  = 0
numExistingExecutors     = 2
numPendingExecutors      = 0
executorsPendingToRemove = 1

The executors are removed at some point after that message.
Test is here.
What should I expect here? I am a bit confused.

Contributor Author

This is just informational. The problem is that the allocation manager's state isn't synced with the scheduler. Instead, the allocator sends messages that try to steer the scheduler backend toward the same state. For example, instead of telling the scheduler backend that the desired number of executors is 10, the allocator sends a message to add 2 executors. When the two get out of sync because of failures or network delay, you end up with these messages.

When you see these, make sure you're just out of sync (and will eventually get back in sync), and not in a state where the scheduler and allocator can't reconcile the required number of executors. That's what this PR tried to fix.

The long-term solution is to update the communication so that the allocator requests its ideal state, always telling the scheduler backend how many executors it currently needs, instead of killing or requesting more.
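
For illustration, a hypothetical sketch of the two communication styles
described above; the trait and its method names are made up for this
example (they loosely echo the real client interface) and are not the
actual Spark API.

```scala
// Hypothetical interface sketch; not the actual Spark API.
trait AllocatorToBackendSketch {
  // Imperative style: the allocator issues deltas ("add N", "kill these"),
  // so the backend's idea of the target can drift whenever a message is
  // delayed, lost, or applied against a stale view of the cluster.
  def requestAdditionalExecutors(numAdditional: Int): Unit
  def killExecutors(executorIds: Seq[String]): Unit

  // Declarative style (the long-term direction mentioned above): the
  // allocator always states the total it currently needs, so the next
  // message corrects any earlier drift instead of compounding it.
  def requestTotalExecutors(total: Int): Unit
}
```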

@rdblue rdblue force-pushed the SPARK-20540-fix-dynamic-allocation branch from 96a7686 to 3e46f4f on May 1, 2017 18:35
SparkQA commented May 1, 2017

Test build #76356 has finished for PR 17813 at commit 3e46f4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue (Contributor, Author) commented May 1, 2017

@vanzin, I fixed your review comments and tests are passing.

vanzin (Contributor) commented May 1, 2017

LGTM. Merging to master / 2.2 / 2.1.

asfgit pushed a commit that referenced this pull request May 1, 2017
(cherry picked from commit 2b2dd08)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
rdblue (Contributor, Author) commented May 1, 2017

Thanks!

asfgit pushed a commit that referenced this pull request May 1, 2017
(cherry picked from commit 2b2dd08)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit asfgit closed this in 2b2dd08 May 1, 2017
yoonlee95 pushed a commit to yoonlee95/spark that referenced this pull request Aug 17, 2017
(cherry picked from commit 2b2dd08)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>