Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device synchronize prior to freeing a set of RapidsBuffer #8936

Merged
merged 2 commits into from
Aug 8, 2023

Conversation

abellina
Copy link
Collaborator

@abellina abellina commented Aug 6, 2023

Closes #8610

This issue changes a function in RapidsBufferCatalog, spillBufferAndFree, to just copy the buffer to the appropriate tier, but the actual freeing part happens as the batch of copies have been already scheduled. What happens then is:

  1. Copy to lower tier the buffers to spill
  2. Return set of spilled buffers
  3. Device synchronize to prevent use-after-free issues in async cuDF calls
  4. Actually call safeFree on the set of spilled buffers.

The real solution would be to create an event per thread and synchronize at spill time against the event that was used to record for a specific rapids buffer.

I have run with this patch 6 times and no illegal access so far. I did have this failure which caused a task failure: #8939, that needs to be looked at. But this is not related to this change.

Follow on for Cuda.Event usage: #8937
Follow on for ColumnView issue in lazy spillable gather map #8938

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina
Copy link
Collaborator Author

abellina commented Aug 6, 2023

This succeeded 6 times when I ran it at 30TB. One of the runs had a task failure unrelated (not an illegal access) but scary as well #8939

@abellina abellina marked this pull request as ready for review August 6, 2023 13:26
@abellina
Copy link
Collaborator Author

abellina commented Aug 6, 2023

build

jlowe
jlowe previously approved these changes Aug 7, 2023
Copy link
Contributor

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit on comment clarification, but don't want to block this going in over that.

…Catalog.scala


Fix comment as suggested

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@abellina
Copy link
Collaborator Author

abellina commented Aug 7, 2023

build

@abellina abellina merged commit 6c50e8d into NVIDIA:branch-23.08 Aug 8, 2023
@abellina abellina deleted the sync_before_freeing_in_spill branch August 8, 2023 13:24
@jlowe jlowe changed the title Device syncronize prior to freeing a set of RapidsBuffer Device synchronize prior to freeing a set of RapidsBuffer Aug 8, 2023
@sameerz sameerz added the bug Something isn't working label Aug 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] query 95 @ SF30K fails with OOM exception
5 participants