Replies: 2 comments 9 replies
-
Sounds interesting. @majetideepak Deepak, what do you think?
-
This sounds reasonable to me. I can see the benefit if there are many small files.
-
In Spark, the file length and modification time are present in PartitionedFile, although only the file name, offset, and length are used to construct the split, which is then sent downstream from Gluten to Velox.
At present, Velox makes an additional call to fetch metadata and initialize the file length, even though this information is available upstream and is simply not used to construct the split. For example, in ABFS:
velox/velox/connectors/hive/storage_adapters/abfs/AbfsFileSystem.cpp
Line 68 in 90fc393
and in S3:
velox/velox/connectors/hive/storage_adapters/s3fs/S3FileSystem.cpp
Line 92 in 90fc393
If fileSize and modificationTime from every PartitionedFile object are added to HiveConnectorSplit, we can save one network call per file handle. Additionally, if these values are appended to the key used in the file handle cache, the cache can be enabled by default even when files are being overwritten, because an overwritten file will have a different modification time and length and so will not match the stale cache entry. If this idea sounds good, I would like to take this up. We've seen an improvement on the order of 102 seconds in total scan time across all tasks for TPC-DS, and a modest gain overall.
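To make the idea concrete, here is a rough sketch of a cache keyed on path, size, and modification time. It is not Velox's actual code: FileProperties, FileHandleKey, FileHandle, FileHandleCache, and the fetchSize callback are hypothetical names for illustration, and the real file handle cache and HiveConnectorSplit may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical: metadata carried on the split. Both fields are optional so
// that callers who do not know them fall back to the current behavior.
struct FileProperties {
  std::optional<int64_t> fileSize;
  std::optional<int64_t> modificationTime;
};

// Hypothetical cache key: the path plus the two properties. An overwritten
// file gets a different key, so a stale handle is never returned for it.
struct FileHandleKey {
  std::string path;
  std::optional<int64_t> fileSize;
  std::optional<int64_t> modificationTime;

  bool operator==(const FileHandleKey& other) const {
    return path == other.path && fileSize == other.fileSize &&
        modificationTime == other.modificationTime;
  }
};

struct FileHandleKeyHash {
  size_t operator()(const FileHandleKey& key) const {
    size_t h = std::hash<std::string>{}(key.path);
    auto combine = [&h](size_t v) {
      h ^= v + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    };
    combine(std::hash<int64_t>{}(key.fileSize.value_or(-1)));
    combine(std::hash<int64_t>{}(key.modificationTime.value_or(-1)));
    return h;
  }
};

// Hypothetical stand-in for an opened file.
struct FileHandle {
  std::string path;
  int64_t fileSize;
};

class FileHandleCache {
 public:
  // fetchSize stands in for the remote metadata call (e.g. to ABFS or S3).
  using SizeFetcher = std::function<int64_t(const std::string&)>;

  explicit FileHandleCache(SizeFetcher fetchSize)
      : fetchSize_(std::move(fetchSize)) {}

  std::shared_ptr<FileHandle> get(
      const std::string& path,
      const FileProperties& props) {
    FileHandleKey key{path, props.fileSize, props.modificationTime};
    auto it = handles_.find(key);
    if (it != handles_.end()) {
      return it->second;
    }
    // Only go to remote storage when the split did not carry the size.
    const int64_t size =
        props.fileSize.has_value() ? *props.fileSize : fetchSize_(path);
    auto handle = std::make_shared<FileHandle>(FileHandle{path, size});
    handles_.emplace(std::move(key), handle);
    return handle;
  }

 private:
  SizeFetcher fetchSize_;
  std::unordered_map<
      FileHandleKey,
      std::shared_ptr<FileHandle>,
      FileHandleKeyHash>
      handles_;
};
```

In this sketch, a split that carries the size never triggers the metadata call, and a file rewritten with a new size or modification time misses the cache instead of reusing the old handle. Eviction of the stale entries is left out here and would still rely on whatever policy the cache already has.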
@mbasmanova @zhli1142015