[FEA] Trigger host memory spilling when more host memory is needed #8881
Labels
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
task
Work required that improves the product but is not user facing
Is your feature request related to a problem? Please describe.
The HostMemoryStore can spill memory to disk, but today that only happens when a spill from the GPU needs more host memory to complete. The goal here is to expose an API so that the new APIs from #8879 can call into it when memory is needed.
At the same time we need to tie the two APIs together, which gets a little complicated. The spill storage would need to decide whether to spill pinned or pageable memory to satisfy the request. We should favor spilling pageable memory over pinned unless we have no other choice. In fact, spilling pinned memory can be a follow-on issue if we need it.
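The pageable-first preference could be sketched roughly as follows. This is only an illustration in Python, not plugin code: `HostBuffer` and `pick_spill_candidates` are hypothetical names, and the real store would have its own buffer tracking and spill priorities.

```python
from dataclasses import dataclass


@dataclass
class HostBuffer:
    """Hypothetical stand-in for a spillable host buffer."""
    name: str
    size: int
    pinned: bool


def pick_spill_candidates(buffers, bytes_needed):
    """Choose buffers to spill: pageable first, pinned only as a last resort.

    Returns (chosen_buffers, bytes_freed). bytes_freed may be less than
    bytes_needed if there is not enough spillable memory overall.
    """
    chosen, freed = [], 0
    # sorted() is stable and False sorts before True, so every pageable
    # buffer is considered before any pinned buffer.
    for buf in sorted(buffers, key=lambda b: b.pinned):
        if freed >= bytes_needed:
            break
        chosen.append(buf)
        freed += buf.size
    return chosen, freed
```

With this ordering a pinned buffer is only selected once all pageable buffers together cannot cover the request, which matches the "unless we have no other choice" rule above.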
The host memory spill storage would be updated to use the new allocation APIs, and the new allocation APIs would be updated to call into the host spill storage spill API if an allocation would violate a limit but there is enough spillable memory to cover it. Ideally reservations will not trigger a spill, but because memory can be made non-spillable dynamically, it might be simplest to just spill as much as is needed for the reservation when it is allocated. If we do this, we should have a follow-on issue to understand what it would take to avoid it. To avoid any deadlocks, the spill storage will use the maxPriority flag when doing host memory allocations.
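A minimal sketch of that interaction, assuming a single byte limit and a very simplified store. `HostSpillStore`, `HostAllocator`, and their methods are hypothetical names for illustration; they are not the APIs from #8879, and priorities such as maxPriority are left out.

```python
class HostSpillStore:
    """Hypothetical store tracking which allocated buffers are spillable."""

    def __init__(self):
        self.spillable = {}  # buffer id -> size in bytes

    def add(self, buf_id, size):
        self.spillable[buf_id] = size

    def spillable_bytes(self):
        return sum(self.spillable.values())

    def spill(self, bytes_needed):
        """Spill whole buffers until at least bytes_needed is freed."""
        freed = 0
        for buf_id in list(self.spillable):
            if freed >= bytes_needed:
                break
            freed += self.spillable.pop(buf_id)  # real code would write to disk
        return freed


class HostAllocator:
    """Hypothetical limit-enforcing allocator that spills on demand."""

    def __init__(self, limit, store):
        self.limit, self.used, self.store = limit, 0, store

    def alloc(self, size):
        if size > self.limit:
            # Can never fit, no matter how much is spilled.
            raise MemoryError("allocation exceeds the pool limit")
        shortfall = self.used + size - self.limit
        if shortfall > 0:
            if self.store.spillable_bytes() < shortfall:
                return False  # caller would block or retry later
            self.used -= self.store.spill(shortfall)
        self.used += size
        return True
```

The allocator only asks the store to spill when the limit would otherwise be violated and the store reports enough spillable memory. An allocation that exceeds the limit outright is reported separately, because spilling cannot help there.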
In addition, if the spill storage code is informed that an allocation is too large to ever fit in the pool, then instead of allocating it anyway, which is what happens today, we will need to spill the data from GPU memory to disk using bounce buffers (possibly on-heap buffers) and bypass a single large CPU allocation altogether.
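The bounce-buffer idea can be illustrated with a chunked copy through one small, fixed-size buffer. This is a sketch under loud assumptions: `src` is an in-memory stream standing in for GPU memory, `dst` stands in for the disk file, and `spill_via_bounce_buffer` is a hypothetical name. A real implementation would use device-to-host copies, not Python file objects.

```python
import io


def spill_via_bounce_buffer(src, dst, total_bytes, bounce_size):
    """Copy total_bytes from src to dst through one reusable bounce buffer.

    Host memory use is bounded by bounce_size regardless of total_bytes,
    so no single large host allocation is ever made.
    """
    bounce = bytearray(bounce_size)
    view = memoryview(bounce)
    copied = 0
    while copied < total_bytes:
        want = min(bounce_size, total_bytes - copied)
        n = src.readinto(view[:want])  # stand-in for a device-to-host copy
        if not n:
            raise EOFError("source ended before total_bytes were copied")
        dst.write(view[:n])  # stand-in for the write to disk
        copied += n
    return copied
```

For example, copying 2560 bytes with a 100-byte bounce buffer:

```python
src = io.BytesIO(bytes(range(256)) * 10)
dst = io.BytesIO()
spill_via_bounce_buffer(src, dst, 2560, 100)
```

The same loop run in the other direction would cover reading data back when the destination allocation can never be satisfied in one piece.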
We should also update the code that reads data out of the disk spill storage and puts it on the GPU. It should use the new allocation APIs, and if an allocation could never succeed it will also need to move the data back to the GPU in a way that is guaranteed to work. We still need this code to operate at super-priority levels.
Ideally this should expose callbacks to the allocation APIs so that, as more memory is made spillable, blocked threads can be woken up. If this needs to wait for #8882, a follow-on issue needs to be filed so we don't drop this.
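One possible shape for such a callback, sketched with a condition variable. `SpillableTracker`, `made_spillable`, and `wait_for` are hypothetical names; how this would actually tie into #8879 or #8882 is not decided here.

```python
import threading


class SpillableTracker:
    """Hypothetical tracker that wakes blocked allocators when memory becomes spillable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._spillable = 0

    def made_spillable(self, size):
        """Callback invoked by the spill store when a buffer becomes spillable."""
        with self._cond:
            self._spillable += size
            self._cond.notify_all()  # wake every blocked thread to re-check

    def wait_for(self, size, timeout=None):
        """Block until at least size bytes are spillable, then claim them.

        Returns False if the timeout expires first.
        """
        with self._cond:
            ok = self._cond.wait_for(lambda: self._spillable >= size, timeout)
            if ok:
                self._spillable -= size
            return ok
```

Using `notify_all` keeps the sketch simple: every waiter re-checks its own requirement, so one that needs less memory is not stuck behind one that needs more.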