Add file length and modification time to split #9305

acvictor · 2024-03-29T12:37:06Z

In Spark, the file length and modification time are present in PartitionedFile, but only the file name, offset, and length are used to construct the split that Gluten sends downstream to Velox.

At present, Velox makes an additional call to fetch the file metadata and initialize the file length, even though this information is available upstream and is simply not included in the split. For example, in ABFS:

velox/velox/connectors/hive/storage_adapters/abfs/AbfsFileSystem.cpp

Line 68 in 90fc393

auto properties = fileClient_->GetProperties();

and in S3:

velox/velox/connectors/hive/storage_adapters/s3fs/S3FileSystem.cpp

Line 92 in 90fc393

auto outcome = client_->HeadObject(request);

If fileSize and modificationTime from every PartitionedFile object are added to HiveConnectorSplit, we can save one network call per file handle. Additionally, if these values are appended to the key used in the file handle cache, the cache can be enabled by default even when files are being overwritten, because an overwritten file will have a different modification time and length and therefore a different key.

We've seen an improvement on the order of 10² seconds in total scan time across all tasks for TPC-DS, and a modest gain overall. If this idea sounds good, I would like to take this up.
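To make the proposal concrete, here is a minimal, self-contained sketch of the two ideas: passing optional file properties along with the path, and including them in the cache key. All names here (`FileProperties`, `FileHandleKey`, `FileHandleCache`, `metadataCalls`) are hypothetical and are not the existing Velox types; the metadata round trip is simulated with a counter instead of a real `GetProperties()` or `HeadObject()` call.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical: per-file metadata that Spark's PartitionedFile already has.
struct FileProperties {
  std::optional<uint64_t> fileSize;
  std::optional<int64_t> modificationTime;
};

// Hypothetical cache key: path plus size and modification time, so that an
// overwritten file maps to a different entry than its previous version.
struct FileHandleKey {
  std::string path;
  uint64_t fileSize;
  int64_t modificationTime;

  bool operator==(const FileHandleKey& other) const {
    return path == other.path && fileSize == other.fileSize &&
        modificationTime == other.modificationTime;
  }
};

struct FileHandleKeyHash {
  static void combine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  }

  std::size_t operator()(const FileHandleKey& key) const {
    std::size_t seed = std::hash<std::string>{}(key.path);
    combine(seed, std::hash<uint64_t>{}(key.fileSize));
    combine(seed, std::hash<int64_t>{}(key.modificationTime));
    return seed;
  }
};

// Stand-in for a real file handle; only the length matters for this sketch.
struct FileHandle {
  uint64_t length;
};

class FileHandleCache {
 public:
  FileHandle get(const std::string& path, const FileProperties& props) {
    FileHandleKey key{
        path, props.fileSize.value_or(0), props.modificationTime.value_or(0)};
    auto it = handles_.find(key);
    if (it != handles_.end()) {
      ++hits_;
      return it->second;
    }
    uint64_t length = 0;
    if (props.fileSize.has_value()) {
      // Size came with the split: no metadata request is needed.
      length = *props.fileSize;
    } else {
      // Simulated remote metadata request (placeholder length of 0).
      ++metadataCalls_;
    }
    return handles_.emplace(std::move(key), FileHandle{length}).first->second;
  }

  int hits() const { return hits_; }
  int metadataCalls() const { return metadataCalls_; }

 private:
  std::unordered_map<FileHandleKey, FileHandle, FileHandleKeyHash> handles_;
  int hits_ = 0;
  int metadataCalls_ = 0;
};
```

With this shape, looking up the same path twice with the same properties hits the cache, while the same path with a new size or modification time creates a fresh entry instead of returning a stale handle. A lookup without properties falls back to the metadata call, which is the current behavior.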

@mbasmanova @zhli1142015