Brainstorming better file stats APIs / documentation #1100

Closed
MrPowers opened this issue Jan 25, 2023 · 9 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@MrPowers
Collaborator

From what I understand, delta-rs collects stats on some files in a Delta table, but not others. It's important for users to make sure stats are collected for columns that are commonly used for filtering to maximize file skipping. If a user is frequently filtering on colA, then they'll definitely want colA to have stats collected.

Some APIs that could be nice:

  • function that returns columns with stats that are currently being collected
  • make it easy for users to "turn off" stats collector for certain columns (e.g. for columns that are never filtered on that are expensive for stats collection)
  • from what I understand, delta-io/delta performs stats collection for the first N columns. Certain column types are notoriously bad for stats collection. If delta-rs also uses the "collect stats for the first N columns approach", it could be nice to make it easier for the user to select the right columns

Documenting the current behavior a bit would be nice too.
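
As a rough illustration of the first idea in the list, the sketch below reports which columns have min/max statistics recorded for a single file. It assumes the layout the Delta protocol uses for per-file statistics: each `add` action in the transaction log carries a `stats` field holding a JSON string with `numRecords`, `minValues`, `maxValues`, and `nullCount`. The function name `columns_with_stats` and the sample action are hypothetical and are not part of the delta-rs API.

```python
import json


def columns_with_stats(add_action: dict) -> list[str]:
    """Return the top-level columns that have both min and max stats.

    Hypothetical helper for illustration, not a delta-rs API.
    """
    raw = add_action.get("stats")
    if not raw:
        # Stats are optional in the log, so a file may carry none at all.
        return []
    stats = json.loads(raw)
    min_values = stats.get("minValues", {})
    max_values = stats.get("maxValues", {})
    # A column is usable for skipping only if both bounds are present.
    return sorted(set(min_values) & set(max_values))


# Made-up `add` action, for illustration only.
sample_action = {
    "path": "part-00000.parquet",
    "stats": json.dumps(
        {
            "numRecords": 3,
            "minValues": {"colA": 1, "colB": "a"},
            "maxValues": {"colA": 9, "colB": "z"},
            "nullCount": {"colA": 0, "colB": 0, "colC": 3},
        }
    ),
}

print(columns_with_stats(sample_action))  # ['colA', 'colB']
```

Here `colC` has a null count but no bounds, so it is not reported. A table-level version would presumably need to combine the result across every active file, since files written at different times may carry different stats.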

@MrPowers MrPowers added the enhancement New feature or request label Jan 25, 2023
@aersam
Contributor

aersam commented Jan 25, 2023

This is only relevant for files being written by delta-rs, right? Or do you want to expose an API to collect stats for existing files?

@MrPowers
Collaborator Author

@aersam - hard to say because I'm not sure about the existing APIs / default behavior. But yes, if there is a Delta Lake that doesn't have stats collected, I think there should be some API to add stats. If the current behavior is to collect stats for the first N columns and then the data is reordered, we may want a way to kick off stats generation for the "new first N columns" as well.

@aersam
Contributor

aersam commented Jan 25, 2023

Ok, sounds good. Also, there are stats in the Delta log and in the Parquet files, right? Are stats from the Parquet files used as well?

@roeap
Collaborator

roeap commented Jan 25, 2023

Parquet stats are being used when querying the table the right way ... more generally speaking, if we want to implement #1041 the right way, we would probably generate the delta log by reading (or inferring if not available) metadata from the parquet files.

That said, while Parquet is practically standard, nothing in the Delta spec says that data files must be in Parquet.

@aersam
Contributor

aersam commented Jan 25, 2023

That's true, I'd really like support for underlying Arrow IPC files :) but that's another topic.
The approach of using Parquet stats to generate Delta log stats makes sense.
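
A minimal sketch of that approach, under stated assumptions: Parquet footers store statistics per row group, while the Delta log stores one stats entry per file, so the row-group values need to be folded into file-level values. The input here is a plain list of dicts standing in for statistics already read from a Parquet footer. The input shape and the `file_stats_from_row_groups` name are hypothetical; only the output keys (`numRecords`, `minValues`, `maxValues`, `nullCount`) follow the Delta protocol.

```python
import json


def file_stats_from_row_groups(row_groups: list[dict]) -> str:
    """Fold per-row-group stats into one Delta-style stats JSON string.

    Assumed input shape (hypothetical):
    {"num_rows": int, "columns": {name: {"min": ..., "max": ..., "null_count": int}}}
    """
    num_records = 0
    min_values: dict = {}
    max_values: dict = {}
    null_count: dict = {}
    incomplete: set = set()  # columns whose bounds are unknown in some row group

    for rg in row_groups:
        num_records += rg["num_rows"]
        for name, col in rg["columns"].items():
            null_count[name] = null_count.get(name, 0) + col["null_count"]
            if col["null_count"] == rg["num_rows"]:
                # All-null row group: it contributes no bounds.
                continue
            lo, hi = col.get("min"), col.get("max")
            if lo is None or hi is None:
                # Bounds missing for non-null data, so file-level bounds would be unsafe.
                incomplete.add(name)
                continue
            min_values[name] = lo if name not in min_values else min(min_values[name], lo)
            max_values[name] = hi if name not in max_values else max(max_values[name], hi)

    for name in incomplete:
        min_values.pop(name, None)
        max_values.pop(name, None)

    return json.dumps(
        {
            "numRecords": num_records,
            "minValues": min_values,
            "maxValues": max_values,
            "nullCount": null_count,
        }
    )


# Made-up row-group statistics, for illustration only.
sample_row_groups = [
    {
        "num_rows": 2,
        "columns": {
            "colA": {"min": 3, "max": 7, "null_count": 0},
            "colB": {"min": None, "max": None, "null_count": 2},
        },
    },
    {
        "num_rows": 3,
        "columns": {
            "colA": {"min": 1, "max": 5, "null_count": 1},
            "colB": {"min": "k", "max": "m", "null_count": 0},
        },
    },
]

print(file_stats_from_row_groups(sample_row_groups))
```

The sketch drops the bounds for a column as soon as one row group lacks them for non-null data, on the reasoning that a too-narrow range would cause files to be skipped incorrectly, while a missing range only costs a missed skip.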

@wjones127
Collaborator

FWIW, I don't think we have any logic that limits stats collection to a certain number of columns. We just collect for all of them.

TBH I'm somewhat skeptical that the stats collection has that much overhead. I'd hold off on doing this until we profile and find this is a meaningful bottleneck.

@MrPowers
Collaborator Author

> I don't think we have any logic that limits stats collection to a certain number of columns

This is great. Way easier from a user perspective.

I say we just document it in that case. For all the users coming from something else, this is what they'd expect. For all users coming from delta-io/delta, this would be quite surprising.

@MrPowers
Collaborator Author

@wjones127 - chatted about the price to pay for collecting stats and it's twofold (write-side and read-side):

  • write-side: As you mentioned, when you're writing you need to pay the price to collect stats
  • read-side: when you're reading, you pay a price to attempt skipping. Attempting skipping on pointless columns isn't great because you get a regression on query performance.

I'm just chiming in here again because I was only thinking about this from the write-side, but seems like there are performance implications from the read-side as well.
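
To make the read-side cost concrete, here is a minimal sketch of min/max file skipping for an equality filter. Each candidate file has its stats consulted before any data is read, and a file can be skipped only when the value falls outside its recorded range. The file list, the column names, and the `files_to_read` helper are made up for illustration; this is not how delta-rs implements pruning.

```python
def files_to_read(files: list[dict], column: str, value) -> list[str]:
    """Return paths of files that might contain rows where `column == value`."""
    keep = []
    for f in files:
        lo = f["minValues"].get(column)
        hi = f["maxValues"].get(column)
        if lo is None or hi is None:
            # No stats for this column: the file cannot be ruled out.
            keep.append(f["path"])
        elif lo <= value <= hi:
            # Value lies inside the file's range: the file must be read.
            keep.append(f["path"])
        # Otherwise the value is outside the range and the file is skipped.
    return keep


# Made-up per-file stats, for illustration only.
files = [
    {
        "path": "a.parquet",
        "minValues": {"colA": 1, "id": "00a1"},
        "maxValues": {"colA": 10, "id": "ffe9"},
    },
    {
        "path": "b.parquet",
        "minValues": {"colA": 11, "id": "0003"},
        "maxValues": {"colA": 20, "id": "fff2"},
    },
    {
        "path": "c.parquet",
        "minValues": {"id": "0010"},
        "maxValues": {"id": "ffa0"},
    },
]

print(files_to_read(files, "colA", 15))    # ['b.parquet', 'c.parquet']
print(files_to_read(files, "id", "7c2d"))  # ['a.parquet', 'b.parquet', 'c.parquet']
```

In this toy data `colA` is clustered, so the check eliminates `a.parquet`. The `id` column behaves like a random identifier whose range in every file covers almost the whole domain, so the checks run for every file and skip nothing, which is the kind of pointless skipping attempt described above.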

@ion-elgreco
Collaborator

ion-elgreco commented Aug 19, 2024

This functionality of stats collection and usage is now on par with other delta implementations

6 participants