[QST]Which operations of Spark DataFrame are suitable for GPU acceleration? #5377

Answered by jlowe
YeahNew asked this question in General

I tried adding timestamps before and after an operation, such as filter, join, or agg, to calculate the operation's execution time.

Note that Spark normally executes in a row-by-row fashion, while the RAPIDS Accelerator operates on columnar batches. Can you elaborate more on how you isolated the timing for these operations? It's easy to accidentally measure more than what was intended (i.e.: also the cost of the operations producing the input).
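A further complication is that Spark DataFrame transformations are lazily evaluated: wrapping filter or join in timestamps typically measures only query-plan construction, while timing the action that triggers execution measures the entire upstream pipeline. The same pitfall can be sketched in plain Python with generators (no Spark APIs; the row counts and sleep are illustrative):

```python
import time

def slow_source(n):
    """Simulates an expensive upstream stage (e.g., a file scan)."""
    for i in range(n):
        time.sleep(0.001)  # pretend each row is costly to produce
        yield i

# "Transformation": lazy, does no work yet (analogous to DataFrame.filter).
t0 = time.perf_counter()
filtered = (x for x in slow_source(500) if x % 2 == 0)
build_time = time.perf_counter() - t0

# "Action": forces the WHOLE pipeline to run (analogous to .count()).
t1 = time.perf_counter()
count = sum(1 for _ in filtered)
run_time = time.perf_counter() - t1

# build_time is near zero; run_time includes the upstream "scan",
# so it is NOT the cost of the filter alone.
print(f"build: {build_time:.4f}s  run: {run_time:.4f}s  count: {count}")
```

In Spark terms, isolating one operator's cost usually means looking at per-operator metrics in the SQL tab of the Spark UI rather than bracketing API calls with timestamps.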

Also the scale factor of the data is fairly low. GPUs do not excel at processing small amounts of data. You will probably see better performance by increasing the amount of data each task sees (e.g.: increasing spark.sql.files.maxPartitionBytes, …
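For context, spark.sql.files.maxPartitionBytes controls the maximum bytes of input packed into a single partition when reading files (Spark's default is 128 MB), so raising it hands each task a larger chunk of data per columnar batch. A hedged illustration; the 512m value and job name are arbitrary choices, not recommendations from this thread:

```
# Illustrative only: larger input splits per task for GPU-friendly batch sizes.
spark-submit \
  --conf spark.sql.files.maxPartitionBytes=512m \
  my_job.py
```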

Replies: 5 comments

Answer selected by sameerz
Labels
question Further information is requested
This discussion was converted from issue #1043 on April 28, 2022 22:56.