Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vine/WQ: Serial Task Latency #3432

Closed
dthain opened this issue Aug 2, 2023 · 17 comments
Closed

Vine/WQ: Serial Task Latency #3432

dthain opened this issue Aug 2, 2023 · 17 comments
Assignees
Labels
bug For modifications that fix a flaw in the code. Parsl TaskVine Work Queue

Comments

@dthain
Copy link
Member

dthain commented Aug 2, 2023

@benclifford reports:

The parsl+wq and parsl+taskvine test suite takes 493s to run, vs 8s for 
the same tests run in parsl+htex.

Right now that's the dominant time in our CI - lets say 15 minutes out of 
a 22 minute CI runtime.

I think that's coming from sequential task handling somewhere in the 
WQExecutor/WQ stack taking around 1 second: where task B is not submitted 
until the parsl DFK has become aware of task A completing. (task A being 
one test, task B being the next test, or in some cases task B being a task 
inside a test that depends on task A completing).

Is there a way to make this faster? In principle, parsl+wq coprocesses 
should be down at the 8s mark too.
The bit of CI that does this test is this:

pytest parsl/tests/ -k "not cleannet and not issue363" --config parsl/tests/configs/workqueue_ex.py

(that's the same test suite parameterised by different test 
configurations, so --config parsl/tests/configs/htex_local.py will get you 
the same tests but through the high throughput executor)
@dthain dthain added bug For modifications that fix a flaw in the code. Work Queue TaskVine Parsl labels Aug 2, 2023
@dthain
Copy link
Member Author

dthain commented Aug 2, 2023

There are two potential culprits here:
1 - The Vine (or WQ) manager itself. We need a simple benchmark to measure time to (submit/wait)xN to measure the average end-to-end latency of a single task. Should be of interest to @colinthomas-z80
2 - The Vine (or WQ) executor module in Parsl. There is an extra process in the middle that accepts tasks from Parsl and feeds them to Vine (or WQ) and I suspect there may be an accidental delay in that module. Should be of interest to @tphung3

@dthain
Copy link
Member Author

dthain commented Sep 14, 2023

@colinthomas-z80 has cleaned up a number of items to improve performance in the TaskVine scheduler natively. (Colin, please post a brief summary here of the performance that you are able to get so far.

@dthain
Copy link
Member Author

dthain commented Sep 14, 2023

Next step: I suspect that the Parsl-TaskVine executor can run into a degenerate case where it waits one second for a TV task to complete, then goes back to checking for newly submitted Parsl tasks. This could result in awful latency for serial task submission. Note the m.wait(1) here:
https://github.com/Parsl/parsl/blob/master/parsl/executors/taskvine/manager.py#L382

@colinthomas-z80
Copy link
Contributor

Figure_1

Regarding the size of the ready task queue we now have a linear increase in execution time across different task batch sizes.

The figure shows the performance after various stages of the development process up to the current.

I will work up a profile on the serial task execution now that this is stable

@dthain
Copy link
Member Author

dthain commented Dec 20, 2023

@colinthomas-z80 I know that you did some work on this problem, please summarize and link here.

Is there more to be done, or is it adequately fixed/

@colinthomas-z80
Copy link
Contributor

Regarding the serial task latency in parsl, we found the parsl executor loop was waiting on additional task retrievals after retrieving a single task. In the serialized scenario there are no more tasks to wait on, so the latency was a function of our wait timeout.

I removed this additional waiting with a check on the task queue size after each retrieval in the parsl wq executor. The PR is still open at Parsl/parsl#2984.

This improved the serial latency from 1 task per second, to about 6 tasks per second.

@dthain
Copy link
Member Author

dthain commented Dec 20, 2023

Ok, is the same problem present in the TaskVine executor, or no?

@colinthomas-z80
Copy link
Contributor

Yes the problem is present, I'm querying Ben to see if he is content with the fix, and if so I will update taskvine as well

@benclifford
Copy link

colin's PR 2984 looks good and I'm merging it now - 7m test suite run turned into 10s test suite run!

@dthain
Copy link
Member Author

dthain commented Jan 5, 2024

Nice!

@benclifford
Copy link

I added this comment to the PR after I merged it, after thinking about it some more: basically I think @colinthomas-z80 PR trades off latency for 100% use of a single core...

Parsl/parsl#2984 (comment)

@colinthomas-z80
Copy link
Contributor

@benclifford I will see about putting a wait on the submit queue if we have no work to do.

@colinthomas-z80
Copy link
Contributor

colinthomas-z80 commented Jan 9, 2024

I opened #3018 which should wait on the input queue for a task to appear when there is nothing to be done. It is hard to quantify CPU usage but it seems to average lower on the system monitor tool.

@dthain
Copy link
Member Author

dthain commented Jan 29, 2024

@colinthomas-z80 please check if the Vine executor has the same problem and can be addressed by the same solution in Parsl.

@colinthomas-z80
Copy link
Contributor

3038 is open at parsl, updating TaskVine with the loop changes

@colinthomas-z80
Copy link
Contributor

Merged

@dthain
Copy link
Member Author

dthain commented Jun 12, 2024

Performance has been dramatically improved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For modifications that fix a flaw in the code. Parsl TaskVine Work Queue
Projects
Development

No branches or pull requests

3 participants