GpuBroadcastToCpuExec is used to fix up a plan where a GPU broadcast is being reused for a CPU plan. However, today it re-executes the portion of the plan corresponding to the original broadcast, which not only wastes time but also artificially inflates the reported metrics for that portion of the plan (e.g.: row counts from file reads will be a multiple of the actual number of rows in the file).
Ideally GpuBroadcastToCpuExec should reuse the existing GPU broadcast, but we also do not want to require the driver to have a GPU. Today, deserializing a GPU broadcast automatically places the data in GPU memory as part of deserialization, but this is a use-case where we need the GPU broadcast data to remain in host memory after deserialization so the driver can work with it safely. One possible solution is to treat GPU broadcasts the way we treat GPU shuffles when using the legacy shuffle, i.e.: leave the data being transferred in host memory and update the plan with something similar to GpuShuffleCoalesceExec that expects its columnar input to be in host memory and whose sole job is to coalesce the data and put it on the GPU. For the GpuBroadcastToCpuExec use-case, we would deal with the host data representation directly in the driver.
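The coalesce-then-transfer pattern described above can be sketched as a toy model. This is not plugin code: the object and method names below (HostCoalesce, coalesceHostBatches, copyToDevice) are hypothetical, and plain Int arrays stand in for serialized host-memory columnar batches; the real operator would resemble GpuShuffleCoalesceExec.

```scala
// Toy sketch of the proposed approach: keep broadcast data in host memory,
// coalesce it there, and only then (on executors with a GPU) do a single
// host-to-device transfer. All names here are hypothetical.
object HostCoalesce {
  // Each Array[Int] stands in for one host-memory batch received from the
  // broadcast; the real code would hold serialized columnar data instead.
  def coalesceHostBatches(batches: Seq[Array[Int]]): Array[Int] =
    // Concatenate everything while still on the host, so only one
    // host-to-device copy is needed afterwards.
    batches.foldLeft(Array.empty[Int])(_ ++ _)

  // Placeholder for the single host-to-device transfer. The driver-side
  // GpuBroadcastToCpuExec path would skip this step entirely and read the
  // coalesced host representation directly.
  def copyToDevice(hostData: Array[Int]): Array[Int] = hostData
}
```

The point of coalescing before transfer is that the driver (no GPU required) and the executors (one bulk copy to the device) can share the same host-side representation.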
@sperlingxx have you actually verified this (e.g.: running a query that uses DPP on both the CPU and GPU)? I tried running NDS query 5 on partitioned data to verify myself, but that threw an exception, see #4625