The general idea is that if we see a UDF that has to run on the CPU, we can pull back to the CPU just the columns it needs, do the processing on the CPU, and then send the resulting column back to the GPU for more processing.
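To make that round trip concrete, here is a minimal sketch, assuming a cudf-style Java/Scala API where the input is a device `ColumnVector` and the UDF is a plain Scala function. The helper name and the int-only, null-free handling are simplifications for illustration, not the plugin's actual code.

```scala
import ai.rapids.cudf.{ColumnVector => CudfColumnVector}

// Sketch only: pull just the column the UDF needs back to the host, run the
// CPU UDF row by row, then build a new device column from the results so the
// rest of the expression keeps running on the GPU.
// (Int-only, nulls ignored, names hypothetical.)
def runCpuUdfOnGpuColumn(
    deviceInput: CudfColumnVector, // the one column the UDF needs, on the GPU
    udf: Int => Int): CudfColumnVector = {
  // 1. Copy only the required column from device to host memory.
  val hostInput = deviceInput.copyToHost()
  try {
    // 2. Evaluate the UDF on the CPU, one row at a time.
    val rows = hostInput.getRowCount.toInt
    val results = new Array[Int](rows)
    var i = 0
    while (i < rows) {
      results(i) = udf(hostInput.getInt(i))
      i += 1
    }
    // 3. Send the resulting column back to the GPU for more processing.
    CudfColumnVector.fromInts(results: _*)
  } finally {
    hostInput.close()
  }
}
```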
There are a few things that make this hard.
Heuristics. There is no good way to know whether we should run this on the GPU or not. There may be a few situations where it is easy to tell, but for now I think we just want a config that enables or disables this feature. Once we have more experience, we can start to come up with a plan for heuristics.
Memory Management. We have no good way right now to avoid going over the target batch size when doing a project. Most of the code written for going from rows to columns assumes it can batch things however it wants, so we might have to rewrite some code to support dynamically growing the data, or more likely process multiple smaller batches that we concatenate before the expression is done.
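A rough sketch of the multiple-smaller-batches approach, assuming we copy the needed column to the host once, evaluate the UDF over fixed-size row slices, and use cudf's `ColumnVector.concatenate` to stitch the per-slice results back together. Names, slice sizing, and the int-only handling are all illustrative.

```scala
import scala.collection.mutable.ArrayBuffer
import ai.rapids.cudf.{ColumnVector => CudfColumnVector}

// Sketch only: evaluate the CPU UDF over fixed-size row slices so no single
// intermediate result exceeds the target batch size, then concatenate the
// per-slice device columns into one output column.
// (Int-only, nulls ignored, names hypothetical.)
def runCpuUdfInSlices(
    deviceInput: CudfColumnVector,
    udf: Int => Int,
    targetRowsPerSlice: Int): CudfColumnVector = {
  val hostInput = deviceInput.copyToHost()
  val pieces = ArrayBuffer[CudfColumnVector]()
  try {
    val totalRows = hostInput.getRowCount.toInt
    var start = 0
    while (start < totalRows) {
      val end = math.min(start + targetRowsPerSlice, totalRows)
      // Evaluate just this slice of rows on the CPU.
      val results = (start until end).map(i => udf(hostInput.getInt(i))).toArray
      // Each slice becomes its own small device column.
      pieces += CudfColumnVector.fromInts(results: _*)
      start = end
    }
    // Concatenate the per-slice columns before the expression finishes.
    CudfColumnVector.concatenate(pieces.toSeq: _*)
  } finally {
    hostInput.close()
    pieces.foreach(_.close())
  }
}
```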
Metrics. It would be nice to know how much time we are spending on the CPU doing this processing for a UDF, but we have not set up anything to measure it, and the APIs are not configured for it either. This might be follow-on work, but it would be great to have a relatively clean way for project (or other operators) to expose a metric showing how much time is being spent on these UDFs.
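As a rough illustration of the kind of metric this could be, here is a self-contained timing accumulator. A real implementation would hook into the plugin's existing operator metrics rather than a standalone class, and the `cpuUdfTimeNs` name is made up.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical accumulator for time spent running row-based UDFs on the CPU.
// A real implementation would plug into the existing operator metrics instead
// of using a standalone class like this.
class CpuUdfTimeMetric {
  private val cpuUdfTimeNs = new AtomicLong(0L)

  /** Run `block` (the CPU UDF evaluation) and record how long it took. */
  def timed[T](block: => T): T = {
    val start = System.nanoTime()
    try {
      block
    } finally {
      cpuUdfTimeNs.addAndGet(System.nanoTime() - start)
    }
  }

  /** Total CPU UDF time so far, in nanoseconds. */
  def totalNanos: Long = cpuUdfTimeNs.get()
}
```

A project (or other operator) would then wrap each CPU UDF evaluation in `timed { ... }` and report `totalNanos` alongside its other metrics.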
Multiple UDFs. It is possible that the output of one UDF is fed into another UDF, or that there is a little processing in between them. In situations like this it would probably be best to keep as much of that processing on the CPU at once, but deciding where to make the cutoff and how to bind it all together gets difficult. For now I would say we just do the basic thing and accept sub-optimal performance when we run into these odd situations, but we should have a follow-on issue to look at how to do this properly once we have more experience with it.
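To illustrate the grouping idea on a toy expression tree (these are not Catalyst's real `Expression` classes), the sketch below treats a whole `udf2(udf1(c) + 1)` subtree as a single CPU block so the GPU/CPU boundary is only crossed once for the chain.

```scala
// Toy expression tree, only to illustrate the grouping idea; hypothetical types.
sealed trait Expr
case class ColumnRef(name: String) extends Expr
case class Literal(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class CpuUdf(name: String, child: Expr) extends Expr

// True if the whole subtree is made of things we would evaluate on the CPU
// anyway (UDFs plus the cheap glue between them), meaning it can be pulled
// back in a single CPU round trip instead of one round trip per UDF.
def cpuOnlySubtree(e: Expr): Boolean = e match {
  case CpuUdf(_, child) => cpuOnlySubtree(child)
  case Add(l, r)        => cpuOnlySubtree(l) && cpuOnlySubtree(r)
  case Literal(_)       => true
  case ColumnRef(_)     => true
}

// Example: udf2(udf1(c) + 1) is one CPU block, so we cross the GPU/CPU
// boundary once for the whole chain.
val chained = CpuUdf("udf2", Add(CpuUdf("udf1", ColumnRef("c")), Literal(1)))
assert(cpuOnlySubtree(chained))
```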
We can probably begin with the simple case of a single UDF, and handle the more complicated case of multiple UDFs later.
Stage 1: (Release 21.12)
Enable a single UDF to run on the CPU, along the lines of the pseudocode sketched above.
Add a rapids configuration to enable/disable this feature.
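Usage might look something like the following; the exact config key is an assumption here (the real one is defined by the work in #3897).

```scala
// Assuming an active SparkSession named `spark` (e.g. in spark-shell), and a
// hypothetical config key; see #3897 for the key the plugin actually defines.
spark.conf.set("spark.rapids.sql.rowBasedUDF.enabled", "true")  // turn the feature on
spark.conf.set("spark.rapids.sql.rowBasedUDF.enabled", "false") // turn it back off
```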
@firestarman this issue targets 21.12 but the description covers more than what will be done in 21.12. It would be good to file a separate issue or set of issues to cover the desired functionality that will be addressed beyond 21.12 so this can track what needs to be done for the 21.12 release. For example, #3897 is merged and is the bulk of the work, but will we try to address #3942 or #3904 for 21.12?
Filed #3979 to track the remaining tasks.
I plan to try to fix #3942 first to catch the 21.12 release, while #3904 will probably be addressed beyond 21.12.
Stage 2: (#3979) covers the remaining tasks beyond 21.12.
How will this work with the current implementations (the UDF compiler and RapidsUDF)? Below is the order of precedence for where a UDF runs.