Brainstorming better file stats APIs / documentation #1100
Comments
This is only relevant to files being written by delta-rs, right? Or do you want to expose an API to collect stats for existing files? |
@aersam - hard to say because I'm not sure about the existing APIs / default behavior. But yes, if there is a Delta Lake that doesn't have stats collected, I think there should be some API to add stats. If the current behavior is to collect stats for the first N columns and then the data is reordered, we may want a way to kick off stats generation for the "new first N columns" as well. |
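To make the "first N columns" scenario concrete, here is a rough sketch in plain Python. It is not delta-rs code: `collect_stats` and `num_indexed_cols` are made-up names for illustration, and the output shape only mirrors the `numRecords` / `minValues` / `maxValues` / `nullCount` fields of the per-file statistics in the Delta protocol.

```python
from __future__ import annotations

from typing import Any, Optional


def collect_stats(
    rows: list[dict[str, Any]],
    columns: list[str],
    num_indexed_cols: Optional[int] = None,  # hypothetical knob; None = all columns
) -> dict[str, Any]:
    """Compute Delta-style per-file stats for the first `num_indexed_cols` columns."""
    indexed = columns if num_indexed_cols is None else columns[:num_indexed_cols]
    min_values: dict[str, Any] = {}
    max_values: dict[str, Any] = {}
    null_count: dict[str, int] = {}
    for col in indexed:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        null_count[col] = len(values) - len(non_null)
        if non_null:  # min/max are undefined for an all-null column
            min_values[col] = min(non_null)
            max_values[col] = max(non_null)
    return {
        "numRecords": len(rows),
        "minValues": min_values,
        "maxValues": max_values,
        "nullCount": null_count,
    }


rows = [
    {"id": 1, "colA": "b", "colB": 10.0},
    {"id": 2, "colA": None, "colB": 7.5},
    {"id": 3, "colA": "a", "colB": 9.0},
]

# Stats for the first two columns in the original column order.
stats_first_two = collect_stats(rows, ["id", "colA", "colB"], num_indexed_cols=2)

# After reordering, the "new first N columns" differ, so `id` loses its stats
# and `colB` gains them unless stats are regenerated.
stats_reordered = collect_stats(rows, ["colB", "colA", "id"], num_indexed_cols=2)
```

With a limit of two, `id` has stats before the reorder but not after it, which is the case where an API to regenerate stats would help.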
OK, sounds good. Also, there are stats in the Delta log and in Parquet, right? Are stats from the Parquet files used as well? |
Parquet stats are used when querying the table the right way ... more generally speaking, if we want to implement #1041 the right way, we would probably generate the Delta log by reading (or inferring, if not available) metadata from the Parquet files. That said, while Parquet is practically standard, nothing in the Delta spec says that data files must be in Parquet. |
That's true, I'd really like support for underlying Arrow IPC files :) but that's another topic |
FWIW, I don't think we have any logic that limits stats collection to a certain number of columns. We just collect for all of them. TBH I'm somewhat skeptical that the stats collection has that much overhead. I'd hold off on doing this until we profile and find this is a meaningful bottleneck. |
This is great. Way easier from a user perspective. I say we just document it in that case. For all the users coming from something else, this is what they'd expect. For all users coming from delta-io/delta, this would be quite surprising. |
@wjones127 - chatted about the price to pay for collecting stats and it's twofold (write-side and read-side):
I'm just chiming in here again because I was only thinking about this from the write-side, but seems like there are performance implications from the read-side as well. |
This functionality of stats collection and usage is now on par with other delta implementations |
From what I understand, delta-rs collects stats on some files in a Delta table, but not others. It's important for users to make sure stats are collected for columns that are commonly used for filtering to maximize file skipping. If a user is frequently filtering on `colA`, then they'll definitely want `colA` to have stats collected.
Some APIs that could be nice:
Documenting the current behavior a bit would be nice too.
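For anyone unfamiliar with why stats matter for file skipping, here is a minimal sketch of the idea in plain Python. It is only an illustration, not how delta-rs implements it: `may_contain` and `files_to_read` are hypothetical helpers, the file names are invented, and the stats dictionaries just follow the `minValues` / `maxValues` layout of Delta per-file statistics.

```python
from __future__ import annotations

from typing import Any


def may_contain(stats: dict[str, Any], column: str, value: Any) -> bool:
    """Return False only when the stats prove no row has `column == value`."""
    lo = stats.get("minValues", {}).get(column)
    hi = stats.get("maxValues", {}).get(column)
    if lo is None or hi is None:
        # No stats for this column: the file cannot be skipped safely.
        return True
    return lo <= value <= hi


def files_to_read(files: dict[str, dict[str, Any]], column: str, value: Any) -> list[str]:
    """Keep every file whose min/max range could contain `value`."""
    return [path for path, stats in files.items() if may_contain(stats, column, value)]


# Hypothetical table with three data files; the last one has no stats at all.
files = {
    "part-0.parquet": {"minValues": {"colA": 1}, "maxValues": {"colA": 10}},
    "part-1.parquet": {"minValues": {"colA": 11}, "maxValues": {"colA": 20}},
    "part-2.parquet": {},
}

# Filtering on colA == 15 skips part-0 but must still read part-2.
selected = files_to_read(files, "colA", 15)
```

The file without stats always has to be read, which is why it matters that stats exist for the columns people actually filter on.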