Delta: Support data skipping on tstamp columns #9

Closed
istreeter opened this issue Sep 15, 2023 · 0 comments

@istreeter
Collaborator

This loader is intended to work well with Delta's data skipping feature to enable efficient downstream queries in the warehouse. It should work like this:

  1. The loader writes a file in which every row has the same load_tstamp value.
  2. The Delta metadata file contains statistics for each data file, so it knows the load_tstamp value for each file without needing to open the file.
  3. An incremental query is run with a clause like SELECT ? WHERE load_tstamp > ?.
  4. The query engine is able to go directly to the relevant files, using the Delta metadata. It does not need to scan every file in the partition.
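
To make the steps above concrete, here is a minimal sketch of the skipping decision in plain Python. It is a simplified model, not Delta's actual implementation: the FileStats class, the files_to_scan function and the file names are all hypothetical, and the only idea taken from the description is that per-file min/max statistics let the engine rule files out without opening them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class FileStats:
    """Per-file statistics, loosely modelled on what the Delta metadata records (hypothetical)."""
    path: str
    min_load_tstamp: Optional[datetime]  # None means no statistics were collected
    max_load_tstamp: Optional[datetime]


def files_to_scan(files: List[FileStats], since: datetime) -> List[str]:
    """Return the paths that might contain rows with load_tstamp > since."""
    selected = []
    for f in files:
        if f.max_load_tstamp is None:
            # No statistics: the file cannot be ruled out, so it must be scanned.
            selected.append(f.path)
        elif f.max_load_tstamp > since:
            # The statistics show the file may hold matching rows.
            selected.append(f.path)
    return selected


# Each file has a single uniform load_tstamp, so min == max (step 1).
files = [
    FileStats("part-0001.parquet", datetime(2023, 9, 14, 10), datetime(2023, 9, 14, 10)),
    FileStats("part-0002.parquet", datetime(2023, 9, 15, 10), datetime(2023, 9, 15, 10)),
    FileStats("part-0003.parquet", None, None),  # a file written without statistics
]

print(files_to_scan(files, since=datetime(2023, 9, 15, 0)))
# ['part-0002.parquet', 'part-0003.parquet']
```

The third file shows the failure mode: with no statistics for load_tstamp, the file can never be skipped.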

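However, this feature only works if load_tstamp is one of the first few columns in the table: by default, Delta collects statistics only for the first 32 columns (configurable with the delta.dataSkippingNumIndexedCols table property). Currently, the load_tstamp column is the 129th column, which means we don't get the statistics and we don't get the efficient query.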

The solution is to re-order the atomic columns when we first create the table, so that load_tstamp falls within the columns for which Delta collects statistics.
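
A rough sketch of the re-ordering idea, in plain Python rather than the loader's own code. The column names col_1..col_128 are placeholders standing in for the atomic columns, and reorder_columns and has_stats are hypothetical helpers. The limit of 32 is Delta's default for delta.dataSkippingNumIndexedCols.

```python
from typing import List

NUM_INDEXED_COLS = 32  # Delta's default for delta.dataSkippingNumIndexedCols


def reorder_columns(columns: List[str], first: List[str]) -> List[str]:
    """Move the columns named in `first` to the front, keeping the rest in order."""
    missing = [c for c in first if c not in columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    rest = [c for c in columns if c not in first]
    return list(first) + rest


def has_stats(columns: List[str], name: str, limit: int = NUM_INDEXED_COLS) -> bool:
    """True if the column's position is within the first `limit` columns."""
    return columns.index(name) < limit


# Placeholder schema: 128 columns followed by load_tstamp as the 129th.
original = [f"col_{i}" for i in range(1, 129)] + ["load_tstamp"]
reordered = reorder_columns(original, first=["load_tstamp"])

print(has_stats(original, "load_tstamp"))   # False
print(has_stats(reordered, "load_tstamp"))  # True
```

Doing this at table creation time matters because the column order is fixed in the table schema from then on; the sketch only illustrates the ordering, not how the loader builds its schema.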
