Delta: Support data skipping on tstamp columns #9

istreeter · 2023-09-15T20:45:41Z

This loader is intended to work well with Delta's data skipping feature to enable efficient downstream queries in the warehouse. It should work like this:

1. The loader outputs a file in which the load_tstamp field is set to a single uniform value.
2. The Delta metadata file contains statistics, so it knows the load_tstamp value for each file without needing to open the file.
3. An incremental query is run, including a clause like `SELECT ? WHERE load_tstamp > ?`.
4. The query engine goes directly to the relevant files, using the Delta metadata. It does not need to scan every file in the partition.
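
To make the skipping step concrete, here is a minimal sketch in plain Python. It does not use Delta's API. `FileStats` and `files_to_scan` are hypothetical names, and the file paths and timestamps are made up for illustration. It only shows why per-file min/max statistics on load_tstamp let an engine rule out files without opening them.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for the per-file statistics kept in table metadata."""
    path: str
    min_load_tstamp: datetime
    max_load_tstamp: datetime


def files_to_scan(files, since):
    """Return paths of files that may contain rows with load_tstamp > since.

    A file is skipped when even its largest load_tstamp is not after `since`.
    """
    return [f.path for f in files if f.max_load_tstamp > since]


# The loader writes one uniform load_tstamp per file, so min == max here.
files = [
    FileStats("part-0001.parquet", datetime(2023, 9, 15, 10), datetime(2023, 9, 15, 10)),
    FileStats("part-0002.parquet", datetime(2023, 9, 15, 11), datetime(2023, 9, 15, 11)),
    FileStats("part-0003.parquet", datetime(2023, 9, 15, 12), datetime(2023, 9, 15, 12)),
]

# Incremental query: WHERE load_tstamp > 2023-09-15 10:30
selected = files_to_scan(files, datetime(2023, 9, 15, 10, 30))
print(selected)
```

Because each file carries a single load_tstamp value, the min/max range is as tight as it can be, so the pruning is exact in this sketch.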

However, this feature only works if load_tstamp is one of the first few columns in the table (by default, Delta collects statistics only for the first 32 columns). Currently, the load_tstamp column is the 129th column, which means we don't get the statistics and we don't get the efficient query.

The solution is to re-order the atomic columns when we first create the table.
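
A rough sketch of the re-ordering idea, again in plain Python and not taken from the loader's code. `reorder_columns` is a hypothetical helper, the `col_N` names are placeholders for the real atomic columns, and `STATS_COLUMN_LIMIT = 32` assumes Delta's default for the `delta.dataSkippingNumIndexedCols` table property.

```python
# Assumption: Delta's default number of leading columns that get statistics.
STATS_COLUMN_LIMIT = 32


def reorder_columns(columns, priority):
    """Move the `priority` columns to the front, keeping the rest in their original order."""
    missing = [name for name in priority if name not in columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    rest = [name for name in columns if name not in priority]
    return list(priority) + rest


# Placeholder schema: 128 other columns, then load_tstamp as the 129th.
columns = [f"col_{i}" for i in range(1, 129)] + ["load_tstamp"]
reordered = reorder_columns(columns, ["load_tstamp"])

print(columns.index("load_tstamp") < STATS_COLUMN_LIMIT)    # False
print(reordered.index("load_tstamp") < STATS_COLUMN_LIMIT)  # True
```

The order has to be fixed when the table is first created, as noted above, so a helper like this would be applied to the schema before the create-table step.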

istreeter closed this as completed in 28a01bb Sep 18, 2023
