Brainstorming better file stats APIs / documentation #1100

MrPowers · 2023-01-25T14:04:12Z

From what I understand, delta-rs collects stats on some files in a Delta table, but not others. It's important for users to make sure stats are collected for columns that are commonly used for filtering to maximize file skipping. If a user is frequently filtering on colA, then they'll definitely want colA to have stats collected.

Some APIs that could be nice:

- A function that returns the columns for which stats are currently being collected
- An easy way for users to "turn off" stats collection for certain columns (e.g. columns that are never filtered on and are expensive to collect stats for)
- From what I understand, delta-io/delta performs stats collection for the first N columns. Certain column types are notoriously bad for stats collection. If delta-rs also uses the "collect stats for the first N columns" approach, it could be nice to make it easier for the user to select the right columns

Documenting the current behavior a bit would be nice too.
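
To make the first idea concrete, here is a rough sketch of a "which columns have stats" helper. It is not an existing delta-rs API. The name `columns_with_stats` and the sample data are hypothetical. It assumes each `add` action carries its `stats` as a JSON string with `minValues` / `maxValues` keys, as described in the Delta protocol, and it only reports top-level column names.

```python
import json


def columns_with_stats(add_actions):
    """Return sorted column names that have min/max stats in at least one file.

    `add_actions` is an iterable of dicts shaped like Delta log `add` actions.
    `stats` is a JSON string, or missing/None when no stats were written.
    (Hypothetical helper, not part of delta-rs.)
    """
    columns = set()
    for action in add_actions:
        raw = action.get("stats")
        if not raw:
            continue  # this file was written without stats
        stats = json.loads(raw)
        # Only top-level keys are collected; nested struct fields are not expanded.
        columns.update(stats.get("minValues", {}))
        columns.update(stats.get("maxValues", {}))
    return sorted(columns)


# Made-up add actions: one file with stats for two columns, one file without stats.
example_actions = [
    {
        "path": "part-0.parquet",
        "stats": json.dumps(
            {
                "numRecords": 3,
                "minValues": {"colA": 1, "colB": "a"},
                "maxValues": {"colA": 9, "colB": "z"},
                "nullCount": {"colA": 0, "colB": 0},
            }
        ),
    },
    {"path": "part-1.parquet", "stats": None},
]

print(columns_with_stats(example_actions))
```

A real version would probably also report which files are missing stats, since a column only helps with skipping in files where it was actually collected.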


aersam · 2023-01-25T16:18:44Z

This is only relevant to files being written by delta-rs, right? Or do you want to expose an API to collect stats for existing files?

MrPowers · 2023-01-25T16:56:01Z

@aersam - hard to say because I'm not sure about the existing APIs / default behavior. But yes, if there is a Delta Lake that doesn't have stats collected, I think there should be some API to add stats. If the current behavior is to collect stats for the first N columns and then the data is reordered, we may want a way to kick off stats generation for the "new first N columns" as well.

aersam · 2023-01-25T17:20:45Z

Ok, sounds good. Also, there are stats in the Delta log and in the Parquet files, right? Are the stats from the Parquet files used as well?

roeap · 2023-01-25T17:45:38Z

Parquet stats are being used when querying the table the right way ... more generally speaking, if we want to implement #1041 the right way, we would probably generate the delta log by reading (or inferring if not available) metadata from the parquet files.

That said, while Parquet is practically standard, nothing in the Delta spec says that data files must be in Parquet.

aersam · 2023-01-25T18:06:55Z

That's true, I'd really like support for underlying Arrow IPC files :) but that's another topic.
The approach of using Parquet stats to generate Delta log stats makes sense.
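
As a rough illustration of that approach (a sketch, not how delta-rs does it): Parquet keeps statistics per row group, while the Delta log keeps one `stats` entry per file, so the row-group values have to be folded together. The sketch below assumes the row-group statistics have already been read from the Parquet footer into plain dicts. The function name, the input shape, and the sample numbers are all hypothetical, and it assumes every row group lists the same columns with a `null_count`.

```python
import json


def file_stats_from_row_groups(row_groups):
    """Fold per-row-group statistics into one Delta-style `stats` JSON string.

    Each row group is a dict such as
    {"num_rows": 100, "columns": {"colA": {"min": 1, "max": 50, "null_count": 0}}}.
    (Hypothetical input shape, chosen only for this sketch.)
    """
    num_records = 0
    mins, maxs, nulls = {}, {}, {}
    incomplete = set()  # columns lacking min/max in at least one row group
    for rg in row_groups:
        num_records += rg["num_rows"]
        for name, col in rg["columns"].items():
            nulls[name] = nulls.get(name, 0) + col["null_count"]
            if col.get("min") is None or col.get("max") is None:
                incomplete.add(name)
                continue
            mins[name] = col["min"] if name not in mins else min(mins[name], col["min"])
            maxs[name] = col["max"] if name not in maxs else max(maxs[name], col["max"])
    # Conservative choice: drop bounds for any column with a gap in its stats.
    for name in incomplete:
        mins.pop(name, None)
        maxs.pop(name, None)
    return json.dumps(
        {
            "numRecords": num_records,
            "minValues": mins,
            "maxValues": maxs,
            "nullCount": nulls,
        }
    )


# Made-up statistics for a file with two row groups.
example_row_groups = [
    {"num_rows": 100, "columns": {"colA": {"min": 1, "max": 50, "null_count": 0}}},
    {"num_rows": 80, "columns": {"colA": {"min": 40, "max": 90, "null_count": 2}}},
]

print(file_stats_from_row_groups(example_row_groups))
```

Where the footer has no statistics at all, the values would have to be inferred by scanning the data, which this sketch does not attempt.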

wjones127 · 2023-01-26T01:33:09Z

FWIW, I don't think we have any logic that limits stats collection to a certain number of columns. We just collect for all of them.

TBH I'm somewhat skeptical that the stats collection has that much overhead. I'd hold off on doing this until we profile and find this is a meaningful bottleneck.

MrPowers · 2023-01-26T02:36:12Z

> I don't think we have any logic that limits stats collection to a certain number of columns

This is great. Way easier from a user perspective.

I say we just document it in that case. For all the users coming from something else, this is what they'd expect. For all users coming from delta-io/delta, this would be quite surprising.

MrPowers · 2023-01-26T21:26:23Z

@wjones127 - chatted about the price to pay for collecting stats, and it's twofold (write-side and read-side):

- write-side: As you mentioned, when you're writing you need to pay the price to collect stats
- read-side: when you're reading, you pay a price to attempt skipping. Attempting skipping on pointless columns isn't great because you get a regression in query performance.

I'm just chiming in here again because I was only thinking about this from the write-side, but seems like there are performance implications from the read-side as well.
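
To illustrate the read-side point, here is a minimal sketch of min/max file skipping for an equality filter. It is not the delta-rs implementation. The function name and sample files are hypothetical, and it ignores details such as truncated string stats and type conversion. Every file's stats are parsed and checked whether or not the check ends up skipping anything, which is where the read-side cost comes from.

```python
import json


def files_to_scan(add_actions, column, value):
    """Return paths of files that may contain rows where `column == value`.

    A file is skipped only when its min/max stats prove the value is absent.
    (Hypothetical helper, not part of delta-rs.)
    """
    keep = []
    for action in add_actions:
        raw = action.get("stats")
        stats = json.loads(raw) if raw else {}
        lo = stats.get("minValues", {}).get(column)
        hi = stats.get("maxValues", {}).get(column)
        if lo is not None and hi is not None and not (lo <= value <= hi):
            continue  # value is outside [min, max], so this file can be skipped
        keep.append(action["path"])  # no stats, or the value may be present
    return keep


def _stats(lo, hi):
    # Build a made-up stats JSON string covering only colA.
    return json.dumps({"minValues": {"colA": lo}, "maxValues": {"colA": hi}})


example_files = [
    {"path": "part-0.parquet", "stats": _stats(1, 10)},
    {"path": "part-1.parquet", "stats": _stats(11, 20)},
    {"path": "part-2.parquet", "stats": None},  # written without stats
]

# Filtering on colA skips part-0; part-2 has no stats and must be read.
print(files_to_scan(example_files, "colA", 15))
# Filtering on a column without stats skips nothing, but the checks still ran.
print(files_to_scan(example_files, "colB", "x"))
```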

ion-elgreco · 2024-08-19T19:32:28Z

This functionality of stats collection and usage is now on par with other delta implementations

MrPowers added the enhancement New feature or request label Jan 25, 2023

junjunjd mentioned this issue Oct 13, 2023

Collect stats on parquet files when converting a Parquet table to a Delta table #1719

Open

rtyler added the good first issue Good for newcomers label Oct 25, 2023

ion-elgreco closed this as completed Aug 19, 2024
