[BUG] q93 failed in this week's NDS runs #4045

Closed
abellina opened this issue Nov 5, 2021 · 4 comments · Fixed by #4046
Labels
bug Something isn't working P1 Nice to have for release performance A performance related task/issue

Comments

abellina (Collaborator) commented Nov 5, 2021

We ran q93 without any configuration changes and got an OOM exception, which we have not seen in the past:

Executor task launch worker for task 123.0 in stage 36.0 (TID 12818) 21/11/04 21:13:46:119 INFO DeviceMemoryEventHandler: Device allocation of 345571976 bytes failed, device store has 0 bytes. Total RMM allocated is 9319438848 bytes.

This looks like pool fragmentation: the relatively small ~350 MB allocation should fit when only ~9 GB is allocated in RMM and the GPU has 40 GB of memory total.

In the past this query worked without issues because the RMM pool had a separate maximum size, so it could grow and allocate the extra memory required to satisfy the 350 MB request. Since the pool limit changes (#4019 and rapidsai/cudf#9583), the pool has a single fixed size that is computed at initialization from free (rather than total) GPU memory and does not grow. That initial value is further reduced by the "reserve" config, which defaults to 1 GB.

Since the failure we have run a few scenarios with spark.rapids.memory.gpu.reserve lowered to 128 MB and have gotten the query to pass (the default is 1 GB). However, in the past we have had issues with kernels that need ~1 GB just for the kernel launch, so a 128 MB default may be too restrictive. A sketch of the configuration used for that experiment is shown below.
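
A minimal sketch of how that experiment could be expressed, assuming a SparkSession built in Scala; the value syntax ("128m" vs. an explicit byte count) may vary by plugin version, so check the configs doc for the release you run:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nds-q93")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Experimental value from the runs described above; the default is 1 GB.
  .config("spark.rapids.memory.gpu.reserve", "128m")
  .getOrCreate()
```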

Filing this to track the issue, as it is something we would like to clarify at least in the docs/configs for the 21.12 release.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 5, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 9, 2021
@abellina abellina reopened this Nov 19, 2021
abellina (Collaborator, Author):

This happened in this week's run again, even after #4046.

We started the pool with:

dispatcher-Executor 21/11/19 01:02:28:852 INFO GpuDeviceManager: Initializing RMM ARENA pool size = 39580.1875 MB on gpuId 0

Note that the total memory on the GPU is 40536 MiB, so when the pool was created, free memory must have been ~39836 MiB (39580 + 256 for the reserve). That still leaves roughly 1 GB on the GPU that we are not able to grow into later in the query. A back-of-the-envelope check of those numbers follows.
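
Just the arithmetic from this comment as a small sketch (not the actual GpuDeviceManager sizing code), using the numbers in the log line above:

```scala
// All values in MiB, taken from the log and nvidia-smi output quoted in this comment.
val totalMiB   = 40536.0     // total device memory for this GPU
val poolMiB    = 39580.1875  // ARENA pool size from the log
val reserveMiB = 256.0       // reserve setting used in this run
val freeAtInit = poolMiB + reserveMiB   // ~39836 MiB free when the pool was created
val headroom   = totalMiB - freeAtInit  // memory the fixed-size pool can never grow into
println(f"free at init ≈ $freeAtInit%.1f MiB, unreachable headroom ≈ $headroom%.1f MiB")
```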

The error is the same: we ran out of memory after removing everything from the catalog, and we cannot allocate ~350 MiB with only ~9 GB allocated.

Executor task launch worker for task 60.0 in stage 36.0 (TID 12755) 21/11/19 01:03:01:46 INFO DeviceMemoryEventHandler: Device allocation of 345617912 bytes failed, device store has 0 bytes. Total RMM allocated is 8627529472 bytes.
Executor task launch worker for task 60.0 in stage 36.0 (TID 12755) 21/11/19 01:03:01:46 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 345617912 bytes. Total RMM allocated is 8805725184 bytes.
Executor task launch worker for task 108.0 in stage 36.0 (TID 12803) 21/11/19 01:03:01:64 ERROR Executor: Exception in task 108.0 in stage 36.0 (TID 12803)
java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/home/jenkins/agent/workspace/jenkins-cudf_nightly-pre_release-github-101-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:157: Maximum pool size exceeded
        at ai.rapids.cudf.Table.concatenate(Native Method)
        at ai.rapids.cudf.Table.concatenate(Table.java:1424)

@Salonijain27 Salonijain27 added the P0 Must have for release label Nov 22, 2021
abellina (Collaborator, Author):

I verified this is not a leak by printing the RMM log and keeping a dictionary of allocations, deleting entries as frees were seen (see the sketch below). The issue with this query is fragmentation, and unfortunately it does not fail consistently.
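
A sketch of that leak check: replay an RMM allocation log and keep a map of outstanding allocations, removing entries as frees are seen. This assumes a CSV log shaped like "Thread,Time,Action,Pointer,Size,Stream" and a hypothetical file name; the real column layout depends on the RMM logging adaptor version, so treat the indices as illustrative:

```scala
import scala.io.Source
import scala.collection.mutable

val outstanding = mutable.Map.empty[String, Long]
val source = Source.fromFile("rmm_log.csv")  // hypothetical log file name
for (line <- source.getLines().drop(1)) {    // drop the CSV header
  val cols = line.split(",").map(_.trim)
  val (action, ptr, size) = (cols(2), cols(3), cols(4).toLong)
  action match {
    case "allocate" => outstanding(ptr) = size
    case "free"     => outstanding.remove(ptr)
    case _          => // ignore other events (e.g. allocation failures)
  }
}
source.close()
println(s"outstanding allocations: ${outstanding.size}, bytes: ${outstanding.values.sum}")
```

If everything has been freed by the end of the log, the outstanding count is zero and the remaining failures point at fragmentation rather than a leak.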

@abellina abellina added P1 Nice to have for release and removed P0 Must have for release labels Jan 11, 2022
abellina (Collaborator, Author) commented Jan 11, 2022

This query, among others, suffers from device memory fragmentation, especially with UCX. There is no quick solution in sight with the current allocators, but the async allocator is very promising. I am downgrading this to P1 for 22.02, and we should see whether the async allocator can reach a state where it can be recommended or made the default in 22.04.
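
For anyone wanting to try the async allocator ahead of it becoming the default, a sketch of opting in via spark.rapids.memory.gpu.pool is below. This is an assumption about the accepted value names and is a startup setting, so it has to be provided when the session/executors start; confirm against the configs doc for the plugin release you run:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Select the cudaMallocAsync-backed pool instead of ARENA.
  .config("spark.rapids.memory.gpu.pool", "ASYNC")
  .getOrCreate()
```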

@sameerz sameerz added the performance A performance related task/issue label Jan 11, 2022
abellina (Collaborator, Author):

With the async allocator set as the default in 22.04 (#4515), we expect fragmentation issues like this to be very unlikely, so we are going to close this and reopen it if we see it again in any of our runs.
