
[QST] RowDataSourceScanExec cannot run on the GPU #5332

Answered by revans2
eyalhir74 asked this question in General

Okay, I'll get into some architecture here to try to explain things. Reading data into Spark usually involves a few operations. Note that the order of these operations, and the machine they run on, can change depending on the input format.

  1. Predicate push down/metadata calculations - This figures out which data actually needs to be read, to avoid reading too much.
  2. Data transfer - This is the actual copying of the data from where it is stored to the Spark node so it can be processed further.
  3. Data decoding - This translates the data into the format that Spark wants.
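To make the three steps concrete, here is a minimal toy sketch in plain Python. It is not Spark or RAPIDS code: `Chunk`, `make_chunk`, `select_chunks`, `transfer`, `decode`, and `read` are hypothetical names made up for illustration, and JSON stands in for a real file format such as Parquet or ORC.

```python
import json
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stored chunk: min/max metadata plus encoded rows."""
    min_id: int
    max_id: int
    payload: bytes  # JSON-encoded rows, standing in for a real file format


def make_chunk(rows):
    # Record min/max of "id" so readers can skip the chunk without decoding it.
    ids = [r["id"] for r in rows]
    return Chunk(min(ids), max(ids), json.dumps(rows).encode("utf-8"))


def select_chunks(chunks, lo, hi):
    # Step 1: predicate push down. Only metadata is inspected here; chunks
    # whose [min_id, max_id] range cannot overlap [lo, hi] are skipped.
    return [c for c in chunks if c.max_id >= lo and c.min_id <= hi]


def transfer(chunk):
    # Step 2: data transfer. Stands in for the copy from storage to the
    # processing node; in this toy it simply hands back the encoded bytes.
    return chunk.payload


def decode(raw, lo, hi):
    # Step 3: data decoding. Turn the encoded bytes into rows and apply
    # the predicate row by row.
    return [r for r in json.loads(raw.decode("utf-8")) if lo <= r["id"] <= hi]


def read(chunks, lo, hi):
    rows = []
    for chunk in select_chunks(chunks, lo, hi):
        rows.extend(decode(transfer(chunk), lo, hi))
    return rows


# Three chunks holding ids 0-4, 5-9, and 10-14.
STORE = [
    make_chunk([{"id": i, "v": i * 10} for i in range(start, start + 5)])
    for start in (0, 5, 10)
]

if __name__ == "__main__":
    print(read(STORE, 6, 11))
```

In this toy, a query for ids 6 through 11 never transfers or decodes the first chunk, because its metadata already rules it out. Which of the three steps can be accelerated depends on where each one runs for a given input format.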

For file formats, like Parquet and ORC, stored in a blob store, like S3, we can only really accelerate the data decoding. …

Replies: 7 comments

Answer selected by sameerz

This discussion was converted from issue #4903 on April 27, 2022 15:31.