[BUG] q93 failed in this week's NDS runs #4045

Closed
abellina opened this issue Nov 5, 2021 · 4 comments · Fixed by #4046
Labels
bug Something isn't working P1 Nice to have for release performance A performance related task/issue

Comments

abellina (Collaborator) commented Nov 5, 2021

We ran q93 without any configuration changes and got an OOM exception, which we have not seen in the past:

Executor task launch worker for task 123.0 in stage 36.0 (TID 12818) 21/11/04 21:13:46:119 INFO DeviceMemoryEventHandler: Device allocation of 345571976 bytes failed, device store has 0 bytes. Total RMM allocated is 9319438848 bytes.

This looks like pool fragmentation: the relatively small ~350 MB allocation should fit when only ~9 GB is allocated in RMM and the GPU has 40 GB of memory total.

In the past this query worked without issues because the RMM pool had a separate maximum size, so it could grow and allocate the extra memory required to satisfy the 350 MB request. Since the pool limit changes (#4019 and rapidsai/cudf#9583), the pool has a single fixed size that is computed at initialization from free (rather than total) GPU memory and does not grow. That initial value is further reduced by the "reserve" config, which defaults to 1 GB.

Since the failure we have run a few scenarios with spark.rapids.memory.gpu.reserve lowered to 128 MB and have gotten the query to pass (the default is 1 GB). However, in the past we have had issues with kernels that need ~1 GB just for the kernel launch, so a 128 MB default may be too restrictive. A sketch of the configuration used for that experiment is shown below.
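
A minimal sketch of how that experiment could be expressed, assuming a SparkSession built in Scala; the value syntax ("128m" vs. an explicit byte count) may vary by plugin version, so check the configs doc for the release you run:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nds-q93")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Experimental value from the runs described above; the default is 1 GB.
  .config("spark.rapids.memory.gpu.reserve", "128m")
  .getOrCreate()
```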

Filing this to track the issue, as it is something we would like to clarify at least in the docs/configs for the 21.12 release.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 5, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 9, 2021
@abellina abellina reopened this Nov 19, 2021
abellina (Collaborator, Author):

This happened in this week's run again, even after #4046.

We started the pool with:

dispatcher-Executor 21/11/19 01:02:28:852 INFO GpuDeviceManager: Initializing RMM ARENA pool size = 39580.1875 MB on gpuId 0

Note that the total memory on the GPU is 40536 MiB, so when the pool was created, free memory must have been ~39836 MiB (39580 + 256 for the reserve). That still leaves roughly 1 GB on the GPU that we are not able to grow into later in the query. A back-of-the-envelope check of those numbers follows.
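
Just the arithmetic from this comment as a small sketch (not the actual GpuDeviceManager sizing code), using the numbers in the log line above:

```scala
// All values in MiB, taken from the log and nvidia-smi output quoted in this comment.
val totalMiB   = 40536.0     // total device memory for this GPU
val poolMiB    = 39580.1875  // ARENA pool size from the log
val reserveMiB = 256.0       // reserve setting used in this run
val freeAtInit = poolMiB + reserveMiB   // ~39836 MiB free when the pool was created
val headroom   = totalMiB - freeAtInit  // memory the fixed-size pool can never grow into
println(f"free at init ≈ $freeAtInit%.1f MiB, unreachable headroom ≈ $headroom%.1f MiB")
```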

The error is the same: we ran out of memory after removing everything from the catalog, and we cannot allocate ~350 MiB with only ~9 GB allocated.

Executor task launch worker for task 60.0 in stage 36.0 (TID 12755) 21/11/19 01:03:01:46 INFO DeviceMemoryEventHandler: Device allocation of 345617912 bytes failed, device store has 0 bytes. Total RMM allocated is 8627529472 bytes.
Executor task launch worker for task 60.0 in stage 36.0 (TID 12755) 21/11/19 01:03:01:46 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 345617912 bytes. Total RMM allocated is 8805725184 bytes.
Executor task launch worker for task 108.0 in stage 36.0 (TID 12803) 21/11/19 01:03:01:64 ERROR Executor: Exception in task 108.0 in stage 36.0 (TID 12803)
java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/home/jenkins/agent/workspace/jenkins-cudf_nightly-pre_release-github-101-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:157: Maximum pool size exceeded
        at ai.rapids.cudf.Table.concatenate(Native Method)
        at ai.rapids.cudf.Table.concatenate(Table.java:1424)

@Salonijain27 Salonijain27 added the P0 Must have for release label Nov 22, 2021
abellina (Collaborator, Author):

I verified this is not a leak by printing the RMM log and keeping a dictionary of allocations, deleting entries as frees were seen (see the sketch below). The issue with this query is fragmentation, and unfortunately it does not fail consistently.
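
A sketch of that leak check: replay an RMM allocation log and keep a map of outstanding allocations, removing entries as frees are seen. This assumes a CSV log shaped like "Thread,Time,Action,Pointer,Size,Stream" and a hypothetical file name; the real column layout depends on the RMM logging adaptor version, so treat the indices as illustrative:

```scala
import scala.io.Source
import scala.collection.mutable

val outstanding = mutable.Map.empty[String, Long]
val source = Source.fromFile("rmm_log.csv")  // hypothetical log file name
for (line <- source.getLines().drop(1)) {    // drop the CSV header
  val cols = line.split(",").map(_.trim)
  val (action, ptr, size) = (cols(2), cols(3), cols(4).toLong)
  action match {
    case "allocate" => outstanding(ptr) = size
    case "free"     => outstanding.remove(ptr)
    case _          => // ignore other events (e.g. allocation failures)
  }
}
source.close()
println(s"outstanding allocations: ${outstanding.size}, bytes: ${outstanding.values.sum}")
```

If everything has been freed by the end of the log, the outstanding count is zero and the remaining failures point at fragmentation rather than a leak.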

@abellina abellina added P1 Nice to have for release and removed P0 Must have for release labels Jan 11, 2022
abellina (Collaborator, Author) commented Jan 11, 2022

This query, among others, suffers from device memory fragmentation, especially with UCX. There is no quick solution in sight with the current allocators, but the async allocator is very promising. I am downgrading this to P1 for 22.02, and we should see whether the async allocator can reach a state where it can be recommended or made the default in 22.04.
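
For anyone wanting to try the async allocator ahead of it becoming the default, a sketch of opting in via spark.rapids.memory.gpu.pool is below. This is an assumption about the accepted value names and is a startup setting, so it has to be provided when the session/executors start; confirm against the configs doc for the plugin release you run:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Select the cudaMallocAsync-backed pool instead of ARENA.
  .config("spark.rapids.memory.gpu.pool", "ASYNC")
  .getOrCreate()
```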

@sameerz sameerz added the performance A performance related task/issue label Jan 11, 2022
abellina (Collaborator, Author):

With the async allocator set as the default in 22.04 (#4515), we expect fragmentation issues like this to be very unlikely, so we are going to close this and reopen it if we see it again in any of our runs.
