[BUG] q93 failed in this week's NDS runs #4045
This happened in this week's run again, even after #4046. We started the pool with:
Note that the total memory on the GPU is 40536MiB, so at the time we started the pool, free memory must have been ~39836MiB (39580 for the pool + 256 for the reserve). That still leaves roughly 1GB on the GPU that we are not able to grow into later in the query. The error is the same: we ran out of memory after removing everything from the catalog, and we can't allocate ~350MiB with ~9GB allocated.
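In round numbers, the sizing arithmetic looks like this. This is a minimal sketch that paraphrases the behavior described above (pool sized from free memory at init time, minus the reserve, and never grown afterwards); it is not the plugin's actual code:

```python
# Reconstruction of the pool sizing observed in this run (values in MiB).
# Assumption: pool size = free memory at init - reserve, and the pool never grows.
total_mib = 40536      # total device memory reported by the GPU
reserve_mib = 256      # reserve in effect when the pool was started
pool_mib = 39580       # pool size logged at startup

free_at_init_mib = pool_mib + reserve_mib           # ~39836 MiB free at init
headroom_mib = total_mib - free_at_init_mib         # ~700 MiB (roughly 1GB in round numbers)
print(free_at_init_mib, headroom_mib)               # headroom the fixed-size pool can never use
```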
I verified this is not a leak by printing the RMM log and keeping a dictionary of allocations, deleting entries when frees were seen. The issue with this query is fragmentation, and it does not fail consistently, unfortunately.
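For anyone who wants to repeat that check, a sketch along the lines described above. The CSV column names (`Action`, `Pointer`, `Size`) are assumptions about the RMM logging output, not copied from this run's log:

```python
import csv

# Replay an RMM allocation log, tracking live allocations by pointer and
# removing them when a matching free is seen. Column names are assumed.
def check_for_leaks(log_path):
    live = {}  # pointer -> size of outstanding allocation
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            action = row["Action"].strip().lower()
            ptr = row["Pointer"]
            if action == "allocate":
                live[ptr] = int(row["Size"])
            elif action == "free":
                live.pop(ptr, None)  # tolerate frees for pointers we never saw allocated
    print(f"outstanding allocations: {len(live)}, bytes: {sum(live.values())}")
    return live

# check_for_leaks("rmm_log.csv")  # an empty result => no leak, pointing at fragmentation
```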
This query, among others, suffers from device memory fragmentation, especially with UCX. There isn't a quick solution in sight with the current allocators, but the async allocator is very promising. I am downgrading this to P1 for 22.02, and we should see whether the async allocator can reach a state where it can be recommended, or made the default, in 22.04.
With the async allocator set as the default in 22.04 (#4515), we expect fragmentation issues like this to be very unlikely. So we are going to close this and reopen it if we see it again in any of our runs.
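For anyone hitting something similar on older branches, the allocator can be switched explicitly. A PySpark sketch follows; the `spark.rapids.memory.gpu.pool=ASYNC` key/value is taken from the plugin's config naming, but treat it as an assumption and check the config docs for your release:

```python
from pyspark.sql import SparkSession

# Minimal sketch: opt into the cudaMallocAsync-backed pool instead of the default
# pool allocator. Config key/value assumed from the spark-rapids configs; verify
# against the release you are running.
spark = (
    SparkSession.builder
    .appName("nds-q93")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.memory.gpu.pool", "ASYNC")
    .getOrCreate()
)
```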
We ran q93 without any setting changes and got an OOM exception, where we hadn't in the past:
This looks like a fragmented pool: the relatively small ~350MB allocation should fit when there is ~9GB worth of memory in RMM, and our GPU has 40GB of memory in total.
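To make the fragmentation point concrete, a toy illustration of why total free space is not the same as a usable contiguous block. This is purely a simulation, not how RMM's pool/arena allocators actually track memory:

```python
# Toy model: plenty of free bytes in total, but no single contiguous gap large
# enough for a ~350 MB request.
MB = 1024 * 1024
pool_size = 30 * 1024 * MB  # 30 GB pool

# Live allocations as (offset, size) pairs, spaced so every free gap is < 350 MB.
live = [(off, 512 * MB) for off in range(0, pool_size, 768 * MB)]

def largest_free_gap(allocs, pool):
    gaps, prev_end = [], 0
    for off, size in sorted(allocs):
        gaps.append(off - prev_end)
        prev_end = off + size
    gaps.append(pool - prev_end)
    return max(gaps)

total_free = pool_size - sum(size for _, size in live)
print(f"free in total: {total_free // MB} MB")                         # ~10 GB free overall
print(f"largest gap:   {largest_free_gap(live, pool_size) // MB} MB")  # only 256 MB contiguous
# A 350 MB allocation fails even though far more than 350 MB is free in aggregate.
```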
That said, in the past this query worked without issues, because the RMM pool had a maximum-size config it could grow to in order to allocate the extra memory needed to satisfy the 350MB request. Since the pool limit changes (i.e. #4019 and rapidsai/cudf#9583) we now have a single value for the size of the pool, calculated at initialization using `free` instead of `total` GPU memory, and it does not grow. The initial value is further reduced by the "reserve" config, set to 1GB by default.

Since the failure we've run a few scenarios lowering the `spark.rapids.memory.gpu.reserve` setting to `128MB` (the default is 1GB) and have gotten the query to pass; see the sketch below. But in the past we have had issues with kernels that need ~1GB just for the kernel launch, so a 128MB default may be too restrictive.

Filing this to track, as this is something we would like to clarify at least in the docs/configs for the 21.12 release.
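A sketch of the lowered-reserve run described above. The `spark.rapids.memory.gpu.reserve` key is the plugin config named in this issue; the value is given in bytes here, and whether size suffixes like `128m` are accepted is an assumption to verify against the config docs for your release:

```python
from pyspark.sql import SparkSession

# Shrink the GPU reserve so more of the free memory goes into the RMM pool at
# init time. 128 MiB expressed in bytes; the exact accepted format is assumed.
spark = (
    SparkSession.builder
    .appName("nds-q93-small-reserve")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.memory.gpu.reserve", str(128 * 1024 * 1024))
    .getOrCreate()
)
```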