Currently Comet cannot be triggered if Spark users read data from a cached RDD. To support this use case, we'll need to add support for Spark's InMemoryRelation.
@advancedxy Yea, CometRowToColumnarExec could be a more general solution, not only for InMemoryRelation but also for other types of data sources like CSV, JSON, etc. The advantage of implementing an Arrow-based CachedBatchSerializer here is that we can avoid the extra cost of row-to-columnar conversion, and potentially be more space-efficient because of better compression.
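To make the trade-off concrete, here is a rough sketch (plain arrow-java, not Comet's actual CometRowToColumnarExec, with a made-up two-column schema) of what a row-to-columnar step has to do: copy every field of every row into Arrow vectors. An Arrow-based cache serializer would keep the cached data in this layout already, so reads from the cache could skip this copy.

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VarCharVector, VectorSchemaRoot}

object RowToArrowSketch {
  def main(args: Array[String]): Unit = {
    val rows = Seq((1, "a"), (2, "b"), (3, "c"))

    val allocator = new RootAllocator()
    val idVec = new IntVector("id", allocator)
    val nameVec = new VarCharVector("name", allocator)
    idVec.allocateNew(rows.size)
    nameVec.allocateNew(rows.size)

    // The per-row, per-field copies below are the conversion overhead the
    // comment above refers to.
    rows.zipWithIndex.foreach { case ((id, name), i) =>
      idVec.setSafe(i, id)
      nameVec.setSafe(i, name.getBytes("UTF-8"))
    }
    idVec.setValueCount(rows.size)
    nameVec.setValueCount(rows.size)

    val batch = VectorSchemaRoot.of(idVec, nameVec)
    println(s"Arrow batch with ${batch.getRowCount} rows")

    batch.close()      // also closes the contained vectors
    allocator.close()
  }
}
```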
Yea, of course. I get the rationale. We could always add specialized operators to improve performance as long as it's worth the effort and there's interest in implementing it.
What is the problem the feature request solves?
Currently Comet cannot be triggered if Spark users read data from a cached RDD. To support this use case, we'll need to add support for Spark's InMemoryRelation. It looks like we may need to implement an Arrow-based CachedBatchSerializer.
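For illustration, a minimal repro sketch of the gap (the app name, scratch path, and column are made up): once a DataFrame is cached, subsequent queries plan an InMemoryTableScanExec over the InMemoryRelation instead of the original Parquet scan, so Comet native execution does not kick in for them.

```scala
import org.apache.spark.sql.SparkSession

object CachedRelationRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("comet-inmemoryrelation-repro")
      .master("local[*]")
      .getOrCreate()

    val path = "/tmp/comet_cache_repro"  // illustrative scratch path
    spark.range(0, 1000).toDF("id").write.mode("overwrite").parquet(path)

    val df = spark.read.parquet(path)
    df.cache()
    df.count()  // materializes the cache, i.e. the InMemoryRelation

    // This query now starts from the cached batches rather than the Parquet
    // files, so Comet's native operators are not used for it even when Comet
    // is enabled for Parquet reads.
    df.selectExpr("sum(id)").explain()

    spark.stop()
  }
}
```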
Describe the potential solution
Add Comet support for InMemoryRelation, so that Spark queries that start from a cached RDD can also use Comet native execution.
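As a rough sketch of how this could be wired up (the CometArrowCachedBatchSerializer class name below is hypothetical, not an existing Comet class): Spark 3.1+ lets a build plug in its own cache format through the static conf spark.sql.cache.serializer, which must name an implementation of org.apache.spark.sql.columnar.CachedBatchSerializer.

```scala
import org.apache.spark.sql.SparkSession

// spark.sql.cache.serializer is a static conf, so it has to be set before the
// session is created (builder config or spark-defaults.conf).
val spark = SparkSession.builder()
  .appName("comet-arrow-cache")
  .config("spark.sql.cache.serializer",
    "org.apache.comet.CometArrowCachedBatchSerializer")  // hypothetical class
  .getOrCreate()

// With an Arrow-backed serializer, df.cache() would store columnar batches,
// so a query starting from the cached data could feed them to Comet's native
// operators without an extra row-to-columnar conversion step.
```

The serializer implementation itself would provide that API's methods for converting rows and columnar batches to cached batches and back, storing Arrow data inside the cached batches.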
Additional context
It is not a priority as of now, but it will be something good to have in the future.