-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft: Performance mega boost - queue per app #1990
base: master
Are you sure you want to change the base?
Conversation
/assign @ChenYi015 Please review the changes when you get a chance |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive work! A few questions:
- What do we need K8s burst and QPS for?
- What's the impact on memory and CPU consumption of setting a higher or lower frequency of queue cleaning?
- Will there be edge cases where a queue's mutex is already deleted due to not receiving any updates for a while but the application still exists?
Hi @yuchaoran2011 thanks for reviewing!
|
Thanks for this patch! We also ran into similar issues at scale. What we observed was a race condition between the SparkOperator reconciliation and k8s Pod GC. We have autoscaler that terminates nodes when no Pods is running so completed Pods are deleted from API server when the node is terminated. If the SparkOperator could not update the status in time, it won't find the driver Pod and thus make the app as failed due to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ben-tzionlmb Looking forward to see this feature to be merged. Would be nice to add some tests with this feature.
- -max-queue-time-without-update-in-minutes=30 | ||
- -queue-cleaner-interval-in-minutes=10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make these two values configurable? Would there be a possibility for users to modify these values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I will
timestamp=$(date +%Y-%m-%d_%H-%M-%S) | ||
|
||
# Function to fetch goroutine profile | ||
fetch_goroutine_profile() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this script used anywhere? Couldn't find the references or docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script is used for manual profiling which we've added as part of this feature (disabled by default). I probably need to add some documenting on how to use it
@bnetzi Thanks for the clarification. That makes sense. The default is 50, if I remember correctly. Doubling the Spark operator QPS strikes a balance between allowing for more requests from operator to be processed and not overloading k8s API server. Overall the idea looks really solid. I second fleshing out this PR to include tests |
type AppQueuesMap struct { | ||
appQueueMutex *sync.RWMutex | ||
appQueueMap map[string]AppQueueInfo | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of this struct is strange. The overall type name is AppQueuesMap
but a field is named similarly appQueueMap
. Can we give the struct type a more intuitive name? Something that indicates that it includes the mutex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is not good naming, I will change it
The default in golang is actually 5 qps and 10 burst which is very low. Maybe we should put default value which is lower than 100 and 200 by default for small k8s clusters or such with mix usage |
…me docs Signed-off-by: bnetzi <bnetzi@gmail.com>
53e6a2b
to
0f3789f
Compare
Please help me to understand your statement in the description more
For the original runWorker()->processNext(), |
@bnetzi I think @vincentye38 brought up a valid point above. What he's proposing (i.e. having more workers concurrently process items) complements the solution proposed in your PR. Have you tried bumping up the number of workers to see what effect that has on performance? |
@vincentye38 @yuchaoran2011 Thanks for commenting First I would mention that we obviously increased the amount of controller threads while raising the cpu, and it did not help at all. The weirdest thing is that although we were stuck and everything was very slow - the cpu was only 5% on a 32 vCPU instance. I think that in order to understand it we need to understand some of the mechanism and parameters involved here. In my understanding (and this behavior is what we saw in the logs as well) each of the go routines is waiting in line to preform a q.Get() but the next item in queue be processed by another go routine only when q.Done() is called which is after we are doing the sync. It actually makes a lot of sense as if we have an event that triggers app creation, and then some app update - we want the creation to be done and the app state to be synced before we are preforming the next operation. Also - the operator is using a a RateLimitingInterface workqueue, and this queue is also adding waits when the rate limit is reached, so it makes everything even worse. We tried to increase the limits and it helped a little but did not help enough, and that's why we decided on a queue per app approach. Another point to mention here, (and in my opinion it is what causing us to get to finally 20 minutes after running a long time in a large scale env) is that we have a vast amount of events per app. It is not just app creation, driver pod start and finish, but also each executor pod state change is an event. This is causing much longer waiting time as there are much more |
@bnetzi Thank you for explaining. It's interesting that the cpu utilization is so low.
My understanding from the source code of q.Get() and q.Done() is that go routines can pickup other items from the queue. the queue deduplicates item which has identical one in the queue or in process when the item is inserted. The new identical item will be set aside into I guess the problem that cause low cpu utilization, low throughput and blocking items being dequeued is RateLimitingInterface. Could you mind to test out using WorkQueue directory without wrapping it with RateLimitingInterface? Thank you. |
Although from the source code it looks like it - I don't think the way you describe it is the way it actually works. We are missing something that would explain exactly why it doesn't work in parallel. In any case - in my opinion the solution we came up with is a better design overall and it actually works for us in large scale production and after extensive tests. I don't have the capacity nor the resources to preform more tests at large scale just in order to understand the bug here, especially as we have a working version . However, If you can try to create a version without rate limit and test it yourself I think it would be beneficial |
Adding a google doc I've made for illustrating the issue and our solution https://docs.google.com/document/u/0/d/14x76jK7naRjxeFCkqUe7MgoDzZjYETixXXa828o0kSM/mobilebasic |
I wanted to share that we've been running a build where we replaced the RateLimitingInterface with the default workqueue. We've seen a significant reduction in reconciliation delay, even without increasing the client QPS as suggested in this PR. Occasionally, during peak hours, we still experience up to a 5-minute delay before the process starts, but this might not be entirely due to the operator reconciliation delay. We haven't noticed any negative impacts from removing the rate limiter. This makes me wonder why the spark-operator uses it. What specific scenario is it designed to protect against? |
ratelimit is necessary to control the pressure of apiserver and limit the repeated updates of the spark application |
LastUpdateTs time.Time | ||
} | ||
|
||
// A struck that will have map of queues and mutex for the map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct
// assert.NotNil(t, m.appQueueMap[app.Name].LastUpdateTs) | ||
|
||
// Check that the SparkApplication was enqueued. | ||
q := ctrl.GetOrCreateRelevantQueue(app.Name).Queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should GetOrCreateRelevantQueue with from calling key, err := keyFunc(app)
// Check that m is not nil. | ||
assert.NotNil(t, m) | ||
// Check that m.appQueueMap[appName].Queue is not nil. | ||
assert.NotNil(t, m.appQueueMap[app.Name]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be
key, _ := keyFunc(app)
m.appQueueMap[key]
// Start starts the Controller by registering a watcher for SparkApplication objects. | ||
func (c *Controller) Start(workers int, stopCh <-chan struct{}) error { | ||
func (c *Controller) Start(maxAge time.Duration, checkInterval time.Duration, stopCh <-chan struct{}) error { | ||
// Wait for all involved caches to be synced, before processing items from the queue is started. | ||
if !cache.WaitForCacheSync(stopCh, c.cacheSynced) { | ||
return fmt.Errorf("timed out waiting for cache to sync") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, we are breaking the leader election feature, as previously the worker thread creation is blocked by startCh so that only the leader operator has active worker thread. But we right now we are creating worker thread as new sparkapp comes in
https://www.kubeflow.org/docs/components/spark-operator/user-guide/leader-election/
One potential solution is that we rely on startCh
(which previously used to block worker thread creation) to set a global variable. e.g. startWorker
, which will only be set to true when the operator receive from startCh and become leader. And then we can skip start worker thread if startWorker
is false
We are doing load test on spark operator and we also run into this issue. The test is run on an EKS cluster with 50 m5.4xlarge EC2 instance. We keep increasing the input traffic load by creating many spark application from client, we noticed spark job submitted from operator can keep up with the speed of the input rate at the begining, but once the input rate goes between 250 or 300 jobs per minute the spark-submit rate become stalled and cannot further increase. The start job latency once increased to more than 800 seconds. We are running operator with 30 worker and tried increase further but it does not really help. Looking at the metrics, we noticed the same behavior that CPU utilization of the operator is always below 10 vcores, but at the same time, the workqueue depth keep increasing to even more than 1000, where we have to stop the test. We suspect we are mostly hit by not able to increase qps and burst as we see lots of such errors in operator log
But we may also be impacted by the single workqueue mutex given the high traffic. It will be really appreciated if this issue can be fixed in further spark operator versions. Thanks. |
For all the followers if this MR, per the massive change in spark operator design, I've waited with this MR to see if it is still relevant after the changes done. I haven't had time to test V2 at scale yet, nor to accommodate my changes to V2. |
My company actually went with a different implementation (customizing the workqueue ratelimiter) to handle high load clusters I've actually open a PR to backport it here -> #2186 To add context we implemented this to fix a customer issue because their airflow would submit 25K SparkApplication CR at the same time |
I actually tested V2 operator on the cluster with same size (50 ec2 m5.4xlarge) today see similar behavior as V1. The job start latency is 2-4 seconds when the input rate is around 50-100 jobs per minutes. After increasing it to 150, start latency goes to 12s, which is stills table. Further increased to 200 job per minutes causes the start latency increased continuously, at one time it reached 150s. It is also interesting that after the operator run with 200 job per min for some time (maybe after 30 minutes), the start latency and running job number all dropped to 0 (and at same time SUBMITTED/COMPLETED/SUCCEEDING job number all keep increasing), if I randomly sample job from operator log, the latency still increases to more than 4 minutes. |
@xyzyng from what @ImpSy wrote it sounds like the main issue with V2 is just some tweaks with the controller runtime configurations (which make sense, since it used for many high volume operators) @vara-bonthu - referring you to the discussion here, I think we just need to provide in general more options to configure the controller runtime, and that my PR is irrelevant |
@bnetzi , @vara-bonthu , regarding the point '
Summary of changes :
|
Thanks @gangahiremath ! Please add your feedback to this issue #2193 |
Hey @bnetzi -
Thank you. |
I'm not sure that MR 2186 is the solution here, as there are also qps and burst configurations for other k8s API users (such as CRD Listers etc.) As for the JVM overhead. Although it has some performance impact, moving to golang submit would require an operator version for every change in spark submit and would probably tend to break / have missing functions more often on new spark versions. |
I can't promise when, but if someone else wouldn't in the meantime - I will make sure V2 won't have scale performance issues. It might take a while but our platform is heavily based on spark operator and we do no intend to be stuck with V1. The thing is in V2 the whole code was changed and many infra issues were fixed by moving to be based on controller runtime. |
Another solution could be to move the submission phase to a dedicated pod |
I think in the first place the reason we have a queue is to make sure all actions on the pod are preformed by order. If you would separate it you might change this behavior. But in general I do think the spark operator pod is doing way too much work, especially executors related stuff and should be separated to different containers at least |
I only made this proposition because that's the implementation that the |
@bnetzi - Just wanted to share some feedback and our findings from testing on this PR. We (Salesforce) applied your PR and tested your changes at scale within our clusters. We compared your version, which is essentially a "queueless" version (since there's only one job in each queue) against the version @gangahiremath had authored, which skips spark-submit thus avoiding all of the JVM latency. From our testing we found @gangahiremath version outperform significantly than the queueless version as well as the current master branch. Also, we found that this PR runs into OOM issues as soon as the cluster is stressed. Specifically, we tested for one job submission per second for 15 minutes, i.e. 900 jobs in total and the SO with this PR experienced OOM. I would guess that arises from the fact that all the controller threads are just spark-submitting the jobs as soon as they are enqueued in their respective queues and cause OOM. Other than that, your approach is correct in the sense that having a single queue does affect SO performance whenever the rate of enqueueing becomes greater than the rate of dequeueing within the queue causing the queue length to grow. The performance is further affected because of the queue being guarded by a mutex. One approach could also be to use a concurrent lock-free single queue, which wouldn't require having a mutex. This PR does outperform the vanilla version of spark operator under light/normal load. Hope this is helpful. Thanks. |
@c-h-afzal - thanks for the info! We also tried @gangahiremath version and under heavy load it didn't performed well for us. The difference might be:
P.s. I would not say though that this PR is queue less, as each app operations are in a queue, and it prevents concurrent actions on the same app, which is the point of the current queue. Anyway - we started to work on a version that is compatible with spark operator V2, so more updates on that soon |
Issues related
#1138
#1574
#783
#1489
Purpose of this PR
First of all - this PR is mainly a Draft which we think should be discussed, and that's why we are submitting it even though we haven't added documentation not unit tests.
The original design of spark-operator posses one queue which is used for all the applications.
This design is causing a huge latency issue when trying to deal with hundreds / thousands of applications concurrently.
In benchmarks me and my fellows from Mobileye preformed, we showed clearly a linear latency increase depend on the amount of apps handled.
When getting to more than 500 applications, the avg time from creating a spark application object until pod creation is ~130 seconds for app to start. When going up to 1200 apps concurrently it can go Up to 20 minutes on average for each spark application to be created.
Scale up vertically would not be helpful as the CPU is doing nothing, most of the time is spent on the queue mutex.
The change we are presenting here is to create a queue for each app.
It required a big change all around the code, but it is not changing the main flow in any way.
Our benchmarks showed that even with 1000 apps concurrently, Avg time for application creation is ~7 seconds
We also added a nice feature of using memoryLimit for driver / executor which is larger than the request by using the admission webhook.
Proposed changes:
Change Category
What are we still missing:
I would point out that this code currently runs on our production environment with massive scale without any issues.