
Avoid comm/compute overlap for short-lived tasks #12964

Merged: 3 commits into chapel-lang:master on May 17, 2019

Conversation

ronawho (Contributor) commented May 6, 2019

Avoid comm/compute overlap for short-lived (minimal comm) tasks.
Comm/compute overlap can artificially increase the lifetime of tasks
because we yield, put a task at the end of the queue, and will have to
cycle through all currently running/queued tasks before getting back to
it. This can lead to OOM situations when you have a lot of short-lived
tasks with no user-induced yielding. This can occur pretty easily today
because things like reductions will create numLocales tasks on locale 0.
These tasks are short-lived and don't have any user-induced yields, so
the OOM is pretty extreme.

Here we eliminate comm/compute overlap for unregistered puts/gets and
fast-ons in gasnet. Under ugni, we avoid doing comm/compute overlap for
FMA (short gets/puts and AMOs) unless a task has issued at least 100 FMA
operations. This allows us to get comm/compute overlap in the cases where
it really matters for performance (many FMA requests and large BTE comm)
without increasing the lifetime of short-lived tasks. We use
task-local storage to track the number of FMA requests issued. This is
fast (~half the cost of an atomic operation), so the overhead is
negligible and we still get good performance for oversubscribed
RA-atomics.
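
As a rough sketch of the heuristic (not the actual ugni comm layer code; the struct, helper, and threshold names below are invented, though the real runtime does expose task-local data via `chpl_task_getPrvData()`):

```c
/* Hypothetical sketch of the FMA-counting heuristic. The names
 * comm_task_prv_t, get_task_prv(), and FMA_YIELD_THRESHOLD are
 * invented for illustration; they are not the real ugni code. */
#include <stdint.h>

#define FMA_YIELD_THRESHOLD 100

typedef struct {
  uint64_t num_fma_issued;   /* comm ops issued by this task so far */
} comm_task_prv_t;

/* assume this returns the current task's private data, e.g. built on
 * chpl_task_getPrvData() in the real runtime */
extern comm_task_prv_t* get_task_prv(void);
extern void chpl_task_yield(void);

static void wait_for_fma_completion(volatile int* done) {
  comm_task_prv_t* prv = get_task_prv();
  prv->num_fma_issued++;
  while (!*done) {
    if (prv->num_fma_issued >= FMA_YIELD_THRESHOLD) {
      chpl_task_yield();   /* comm-heavy task: yield for overlap */
    }
    /* else: spin without yielding so short-lived tasks complete
     * quickly and their stacks are reclaimed promptly */
  }
}
```

Reading the task-local counter is cheap (per the numbers above, about half the cost of an atomic), which is why the check can sit on the comm fast path.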

This isn't a perfect solution (it primarily avoids task yields for
endcount manipulation), but I think it's better than what we had
before. A better long-term solution is probably to decrease the size of
our task stacks, and/or to avoid registering task stacks and keep them
on non-hugepages so we can limit the amount of physical memory used for
them.

This should allow us to remove all our execenvs that reduce the task
stack size for multi-locale and scalability testing.

Closes #12874

gbtitus (Member) commented May 7, 2019

As we've discussed previously, this still seems like an awfully indirect way to do what we want, which is basically to throttle task creation as a side effect of yields. I'd rather we do this by adding a throttle to Qthreads. But I'm resigned to my fate here.

That said, is there a strong reason not to do this directly in chpl_task_yield()? We'd need to add an argument saying that we were yielding to wait for communication to complete. But then the only difference, it seems to me, is that instead of not yielding only when waiting for small transfers after the 100th, we'd not yield when waiting for any transfer (including BTE) after the 100th. That doesn't seem a big difference, and I think it's possible we could re-code the Qthreads chpl_task_yield() so that this was an effectively cost-free check. (We already do a qthread_shep() call there, but if we had to do a chpl_task_getPrvData() to support this change we could probably figure out from the result of that whether we had a shepherd or not. And I'd guess those 2 calls probably take about the same amount of time.)

ronawho (Contributor, Author) commented May 7, 2019

> As we've discussed previously, this still seems like an awfully indirect way to do what we want, which is basically to throttle task creation as a side effect of yields. I'd rather we do this by adding a throttle to Qthreads. But I'm resigned to my fate here.

I agree this is weird, but I think it's pretty simple for now and puts us in a better state. I'll open up a new issue for a better long-term fix.

For a Qthreads throttle, I'm not quite sure what that should look like just yet. For the case of oversubscribed ra-atomics we do want to task yield so we can get overlap, and it's possible that not all tasks will be running, so a yield_to_already_running_task() sort of thing isn't quite what we want. We have some design and thinking to do here.

> That said, is there a strong reason not to do this directly in chpl_task_yield()? We'd need to add an argument saying that we were yielding to wait for communication to complete. But then the only difference, it seems to me, is that instead of not yielding only when waiting for small transfers after the 100th, we'd not yield when waiting for any transfer (including BTE) after the 100th. That doesn't seem a big difference, and I think it's possible we could re-code the Qthreads chpl_task_yield() so that this was an effectively cost-free check. (We already do a qthread_shep() call there, but if we had to do a chpl_task_getPrvData() to support this change we could probably figure out from the result of that whether we had a shepherd or not. And I'd guess those 2 calls probably take about the same amount of time.)

Hmm, yeah, that's an interesting/good idea. Let me think about that.
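
For concreteness, a sketch of what that suggested yield variant might look like (purely illustrative: `chpl_task_yield_ex()`, the reason enum, and the helper are all invented names, and the real `chpl_task_yield()` takes no arguments):

```c
/* Hypothetical reason-aware variant of chpl_task_yield(); none of
 * these names exist in the real runtime. */
#include <stdbool.h>

typedef enum {
  chpl_yield_generic,        /* plain cooperative yield */
  chpl_yield_wait_for_comm   /* yielding while waiting on a transfer */
} chpl_task_yield_reason_t;

/* hypothetical helper: consults task-local comm counters, e.g. kept
 * via chpl_task_getPrvData(), and reports whether this task has
 * issued enough comm operations to be worth overlapping */
extern bool task_has_issued_enough_comm(void);
extern void qthreads_do_yield(void);  /* stand-in for the normal path */

void chpl_task_yield_ex(chpl_task_yield_reason_t reason) {
  if (reason == chpl_yield_wait_for_comm &&
      !task_has_issued_enough_comm()) {
    return;  /* short-lived task: skip the yield entirely */
  }
  qthreads_do_yield();  /* fall through to the normal yield path */
}
```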

gbtitus (Member) left a comment:

👍

ronawho (Contributor, Author) commented May 16, 2019

Moving this logic to a yield wrapper is captured in https://github.com/Cray/chapel-private/issues/209. Offline we agreed to go with this to start and make improvements later.

The three commits:

1. Add a test that creates thousands of remote tasks (more task stacks
   than there is physical memory for if they all actually ran
   concurrently). I believe this is a test we want to work under
   qthreads since there are no user-induced task yields. It currently
   fails under both ugni and gasnet because of task yielding in the comm
   layers.

2. Unregistered puts/gets and fast-ons in gasnet should be relatively
   quick (and are currently used for end count manipulation, which can
   artificially increase the lifetime of tasks), so avoid yielding for
   them.

3. Avoid comm/compute overlap for FMA in ugni for tasks that haven't
   issued at least 100 FMA operations. This is a little bit odd, but it
   prevents us from doing comm/compute overlap for tasks that haven't
   done much comm. This allows us to run an arbitrary number of tasks
   with no user-induced yields, but still get comm/compute overlap for
   RA. Here we use task-local storage to track the amount of comm.
   Accessing task-local storage is pretty cheap (~half the cost of a
   processor atomic), so this adds virtually no overhead.
ronawho force-pushed the reduce-comm-compute-overlap branch from 13ed38b to f504fc6 on May 17, 2019 04:31
ronawho merged commit 742cc6b into chapel-lang:master on May 17, 2019
ronawho deleted the reduce-comm-compute-overlap branch on May 17, 2019 06:24
ronawho added a commit that referenced this pull request Mar 16, 2020
Avoid more comm/compute overlap in ugni

[reviewed by @gbtitus]

In #12964 we eliminated comm/compute overlap for FMA operations when a
task hadn't done ~100 FMA operations. Here we extend that to RDMA and
chained FMA operations as well. This is motivated by aggregation work
where we do on-stmts followed by large RDMA PUTs/GETs, but we really
don't want too many tasks started at once since that increases memory
pressure. This also adds an option to completely disable comm/compute
overlap in ugni, but it's not enabled by default since it has a 2x
performance hit for SSCA.
mppf added a commit that referenced this pull request Jan 6, 2021
Adjust remote cache to better handle task yields in comm events

Before this PR, the `--cache-remote` cache used a "lock" to ensure
that only a single task enters the cache API functions. This PR adjusts
it to instead "lock" at the cache entry level. (Here "lock" means to
check a variable indicating the entry is in use, to loop and yield until
it is 0, and then to set it. This is similar to how a lock works, but in
this case it only manages access from other tasks as a result of
`chpl_task_yield`.)
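
As a rough illustration of that entry-level "lock" (not the actual cache code: `current_task_id()` and the function names are invented for the sketch, though `entryReservedByTask` is the field name this PR adds):

```c
/* Minimal sketch of the yield-based entry "lock" described above. As
 * the PR notes, the variable only manages access from other tasks that
 * interleave at chpl_task_yield() points, so in that setting a plain
 * field check-and-set with no yield in between is sufficient; a true
 * cross-thread lock would need atomics instead. */
#include <stdint.h>

extern void chpl_task_yield(void);
extern uint64_t current_task_id(void);   /* hypothetical helper */

struct cache_entry {
  uint64_t entryReservedByTask;  /* 0 when free, else the holder's id */
  /* ... page data, dirty bits, etc. ... */
};

static void reserve_entry(struct cache_entry* e) {
  while (e->entryReservedByTask != 0)
    chpl_task_yield();           /* wait until the holder releases it */
  e->entryReservedByTask = current_task_id();  /* no yield since check */
}

static void release_entry(struct cache_entry* e) {
  e->entryReservedByTask = 0;
}
```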

Details:
* Added the `entryReservedByTask` field to the cache entries. Adjusted
  `flush_entry`, `cache_get`, and `cache_put` to take this "lock".
* Passed `task_local` to many functions since it is used to know whether
  the current task already holds the "lock" on an entry. This is
  primarily used in assertion checking right now.
* Factored per-page operations out of put/get/invalidate
* Added "lock" code around per-page operations -- put/get/invalidate as
  well as evictions take the lock and release it when completed
* Adjusted the code waiting for nonblocking comm operations
  (`do_wait_for`) to better tolerate task yields
* Took care when locking another entry while an entry is already locked,
  since that kind of thing can cause deadlock. The place it was present
  was in the `cache_get` code: it was making sure that there was a freed
  page while waiting for the nonblocking GET. Since this is a
  performance hint, I adjusted it to just give up if the page needing to
  be evicted is locked (see the try-reserve sketch after this list).
* Automatic readahead when doing GETs does not lock multiple entries
  because it only happens on a cache hit - so it unlocks the entry with
  the data before starting the prefetching (which will lock each page
  involved, one at a time).
* Added `EXTRA_YIELD` setting for debugging which does yielding around
  comms events. This allowed me to debug the issues with yielding using a
  local gasnet configuration.
* Added more debug printing/tracing that I used during debugging.
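
Continuing the same invented sketch from above, the deadlock-avoidance bullet amounts to a non-blocking try-reserve that simply gives up when the victim entry is held:

```c
/* Non-blocking variant: never wait while already holding another
 * entry, which is how lock-ordering deadlock is avoided in the sketch.
 * Uses the invented cache_entry/reserve helpers from the sketch above. */
#include <stdbool.h>

static bool try_reserve_entry(struct cache_entry* e) {
  if (e->entryReservedByTask != 0)
    return false;                /* someone holds it; caller gives up */
  e->entryReservedByTask = current_task_id();
  return true;
}

/* During a GET, pre-freeing a page is only a performance hint, so skip
 * it rather than risk deadlock while this task already holds an entry: */
static void maybe_evict_for_space(struct cache_entry* victim) {
  if (try_reserve_entry(victim)) {
    /* flush_entry(victim); ... */
    release_entry(victim);
  }
  /* else: give up; the GET proceeds without a pre-freed page */
}
```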

Additionally, I noticed that the strided get/put calls were invalidating
the entire cache but now we have functions to invalidate just a region.
At the same time, we have strided "common helper" functions that
enumerate the contiguous regions in
runtime/include/chpl-comm-strd-xfer.h. So, I updated the strided get/put
support in the cache to only invalidate the portions updated by the
operation, instead of the entire cache. To do so, I needed to add another
variant of the code in chpl-comm-strd-xfer.h since the other functions
there call `comm_get` / `comm_put` and this one needs to invalidate. We
can change the setting `STRIDED_INVALIDATE_ALL` to check the performance
of this change independently.
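
As a hedged sketch of that idea, here is a one-dimensional version of walking a strided transfer as contiguous spans and invalidating only the bytes each span touches; `cache_invalidate_region()` is an invented stand-in for the cache's region-invalidation entry point, and the real helpers in runtime/include/chpl-comm-strd-xfer.h handle arbitrary dimension counts:

```c
/* Sketch: treat a 1-D strided transfer as nblocks contiguous blocks of
 * blk_bytes each, stride_bytes apart, and invalidate only those spans
 * instead of the entire cache. */
#include <stddef.h>

extern void cache_invalidate_region(void* addr, size_t len);  /* hypothetical */

static void invalidate_strided_1d(void* base, size_t blk_bytes,
                                  size_t stride_bytes, size_t nblocks) {
  char* p = (char*)base;
  for (size_t i = 0; i < nblocks; i++) {
    cache_invalidate_region(p, blk_bytes);  /* only the touched span */
    p += stride_bytes;
  }
}
```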

Related to PRs #15266, #16020 and issues
Cray/chapel-private#585 and
Cray/chapel-private#879. PR #12964 talks a bit
about why comm-compute overlap changes are important in comm performance.
This PR improves performance for several Arkouda operations when running
with `--cache-remote`. The reason it improves performance is that the
previous coarse-grained locking strategy added yields without really
making progress on communication in oversubscribed settings: tasks would
be created to try to do comms but would not be able to make progress.

Reviewed by @gbtitus - thanks!

- [x] test/runtime/configMatters/comm/cache-remote and
  test/optimizations/cache-remote pass with `VERIFY` and `EXTRA_YIELDS`
  enabled (except for `ra_prefetch.chpl`, which times out because the
  problem size is too large for this configuration, but works with a
  smaller problem size)
- [x] full local gasnet testing
- [x] full local gasnet testing with `--cache-remote`
Successfully merging this pull request may close these issues:

* Comm/compute overlap increases task lifetime under ugni