Reduce memory footprint of file-level statistics #435
I did some memory inspections. Just in case anyone else is working from a Mac (@dispanser uses Arch): I tried to reproduce this and was very confused because my …
Here we indeed see ~44G memory utilization (post table load). We also observe heavy Swap and Memory Compression by the OS.
I also think there is room to reduce the memory usage by 10x. I would drop all of `stats`, `stats_parsed`, `partition_values` and `parsed_partition_values` from memory. Both stats and partition values can be stored in a columnar format, e.g. an arrow record batch, which should also help with applying filter push-down on table scans.
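A minimal sketch of what such a columnar layout could look like, assuming the `arrow` crate and a made-up set of stat columns (not the actual delta-rs schema): one row per file, with the column names stored once in the Arrow schema instead of being repeated inside every `Add` action.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Build a columnar view of file-level stats: one row per file, one column
/// per statistic. The "min.id" / "max.id" columns are illustrative only.
fn stats_batch(
    paths: Vec<String>,
    num_records: Vec<i64>,
    min_id: Vec<i64>,
    max_id: Vec<i64>,
) -> Result<RecordBatch, arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("path", DataType::Utf8, false),
        Field::new("num_records", DataType::Int64, false),
        Field::new("min.id", DataType::Int64, true),
        Field::new("max.id", DataType::Int64, true),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(StringArray::from(paths)),
        Arc::new(Int64Array::from(num_records)),
        Arc::new(Int64Array::from(min_id)),
        Arc::new(Int64Array::from(max_id)),
    ];
    RecordBatch::try_new(schema, columns)
}
```

Pruning files for a scan then becomes a vectorized filter over these arrays rather than a walk over millions of per-file maps.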
@houqp @dispanser I'll try to do some refactoring and see what I can come up with. The first approach would be exactly as you suggest: just try to store the fields using record batches. I hope to make some progress (first time digging into the codebase) but I'll ask for advice here if I get stuck :)
Thanks @mgill25, let us know if you need any help :)
Tombstones have a similar structure and suffer from the memory consumption problem to some extent, having a field […]. If we find an efficient way to represent […]
Good call @dispanser. For tombstones, I think we can further optimize by removing all other fields and keeping only `path` and `deleted_timestamp`. I don't think we need access to the other fields for vacuum.
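A hypothetical slimmed-down tombstone along those lines (field names follow the comment above; the retention check is purely illustrative, not the delta-rs vacuum logic):

```rust
/// Hypothetical minimal tombstone: vacuum only needs to know which path
/// was removed and when, so every other `Remove` field could be dropped
/// after log replay.
struct SlimTombstone {
    path: String,
    deleted_timestamp: Option<i64>, // epoch milliseconds
}

impl SlimTombstone {
    /// Illustrative retention check: is this tombstone old enough to vacuum?
    fn expired(&self, now_ms: i64, retention_ms: i64) -> bool {
        self.deleted_timestamp
            .map(|ts| now_ms - ts > retention_ms)
            // assumption for illustration: no timestamp means eligible
            .unwrap_or(true)
    }
}
```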
@houqp and @dispanser Just to provide an update here: I have some working code, with no doubt lots of enhancements to be done. On the plus side, I can read […]. The annoying part so far has mostly been trying to parse the extremely deeply nested structure and rearrange it into some familiar data structures. I suspect things will get better with time :)
@mgill25 please feel free to send a draft PR directly if you prefer; we can comment and iterate there.
# Description

This is still very much a work in progress, opening it up for visibility and discussion. Finally I do hope that we can make the switch to arrow-based log handling. Aside from the hoped-for advantages in memory footprint, I also believe it opens us up to many future optimizations.

To make the transition we introduce two new structs:

- `Snapshot` - a half-lazy version of the snapshot, which only tries to get the `Protocol` & `Metadata` actions ASAP. These drive all our planning activities, and without them there is not much we can do.
- `EagerSnapshot` - an intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility layer for the current `DeltaTable` APIs.

One conceptually larger change is related to how we view the availability of information. Until now, `DeltaTableState` could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is the `DeltaTable`, which now only carries the table configuration and the `LogStore`; the state / snapshot is now optional. Consequently, all code that works against a snapshot no longer needs to handle the case that metadata / schema etc. may not be available.

This also has implications for the datafusion integration. We already work mostly against snapshots, but should abolish most traits implemented for `DeltaTable`, as it does not provide the information (and never has) that is at least required to execute a query.

Some larger notable changes include:

* remove `DeltaTableMetadata` and always use the `Metadata` action.
* arrow and parquet are now required, as such the features got removed. Personally I would also argue that if you cannot read checkpoints, you cannot read delta tables :) - so hopefully users weren't using arrow-free versions.

### Major follow-ups:

* (pre-0.17) review integration with `log_store` and `object_store`. Currently we make use mostly of `ObjectStore` inside the state handling. What we really use is `head` / `list_from` / `get` - my hope would be that we end up with a single abstraction...
* test cleanup - we are currently dealing with test flakiness and have several approaches to scaffolding tests. Since we have the `deltalake-test` crate now, this can be reconciled.
* ...
* do more processing on borrowed data ...
* perform file-heavy operations on arrow data
* update checkpoint writing to leverage new state handling and arrow ...
* switch to exposing URL in public APIs

## Questions

* should paths be percent-encoded when written to checkpoint?

# Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
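A rough conceptual sketch of the lazy/eager split described above; the struct and field names here are illustrative stand-ins, not the actual types introduced by the PR:

```rust
use arrow::record_batch::RecordBatch;

// Illustrative stand-ins for the protocol/metadata actions (fields trimmed).
struct Protocol { min_reader_version: i32, min_writer_version: i32 }
struct Metadata { schema_string: String, partition_columns: Vec<String> }

/// Lazy side: only what planning needs (protocol + metadata) is loaded
/// up front; file actions stay in the log until they are asked for.
struct Snapshot {
    version: i64,
    protocol: Protocol,
    metadata: Metadata,
}

/// Eager side: replays the log once and keeps the file actions as Arrow
/// record batches, acting as a compatibility layer for the existing
/// `DeltaTable` APIs.
struct EagerSnapshot {
    snapshot: Snapshot,
    files: Vec<RecordBatch>,
}
```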
@dispanser - do you still have access to that table? If so, I'd be very curious to see what it looks like after the move to an arrow backend :).
Closing this one as it's potentially outdated and we have already made improvements here. If people perceive new issues with memory usage, please create a new issue for tracking :)
Description
I've been playing with a relatively large delta table recently, in the context of #425 (ignore tombstones for readers).
While the 5.3 million file paths themselves occupy about 600M, the overall memory consumption of a small Rust program after loading the delta table adds up to 49.4G.
Here's the memory consumption for the various fields in `action::Add`, on that table with 5.3 million files and 19 columns (each column has file statistics enabled). I think we should be able to get that down considerably. The biggest offender is `stats_parsed`, which is of type `Option<parquet::record::Row>`. If I understand the concept correctly, it has `minValues`, `maxValues`, `nullCounts` and `numRecords` as top-level entries. Each of the first three then has a list / vec of tuples of (column name, value), so it effectively repeats each of my column names three times, for every individual file.
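To make the repetition concrete, here is roughly what the parsed stats for a single file contain, using the field names from the description above and made-up values (a sketch built with `serde_json`, not actual table data):

```rust
use serde_json::json;

fn main() {
    // Each file's stats carry the column names once under each of
    // minValues, maxValues and nullCounts -- i.e. three copies of every
    // column name per file, multiplied across 5.3 million files.
    let per_file_stats = json!({
        "numRecords": 1000,
        "minValues": { "id": 1,    "ts": "2021-01-01T00:00:00" },
        "maxValues": { "id": 9999, "ts": "2021-01-31T23:59:59" },
        "nullCounts": { "id": 0,   "ts": 3 }
    });
    println!("{per_file_stats:#}");
}
```

Storing those column names once per table, for example in an Arrow schema as suggested earlier in the thread, would remove that per-file duplication entirely.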