
Reduce memory footprint of file-level statistics #435

Closed
dispanser opened this issue Sep 18, 2021 · 10 comments
Labels
enhancement New feature or request

Comments

@dispanser
Contributor

dispanser commented Sep 18, 2021

Description

I've been playing with a relatively large delta table recently, in the context of #425 (ignore tombstones for readers).

While the 5.3 million file paths themselves occupy about 600M, the overall memory consumption of a small Rust program after loading the delta table adds up to 49.4G.

Here's the memory consumption for the various fields in action::Add, on that table with 5.3 million files, and 19 columns (each column has file statistics enabled).

| field | mem |
| --- | --- |
| stats parsed | 37.1G |
| stats | 6.4G |
| partition values | 2.1G |
| parsed partition values | 1.9G |
| path | 600M |

I think we should be able to get that down considerably. The biggest offender is stats_parsed, which is of type Option<parquet::record::Row>. If I understand the concept correctly, it has minValues, maxValues, nullCounts and numRecords as top-level entries. Each of the first three then has a list / vec of tuples of (column name, value), so it effectively repeats each of my column names three times, for every individual file.
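For illustration, here's a minimal sketch of that shape (hypothetical Rust types, not the actual `parquet::record::Row` layout):

```rust
// Hypothetical sketch of the per-file statistics shape described
// above -- NOT the actual parquet::record::Row API. With F files,
// each column name ends up stored roughly 3 * F times.
struct FileStats {
    num_records: i64,
    min_values: Vec<(String, ScalarValue)>, // (column name, value)
    max_values: Vec<(String, ScalarValue)>, // column names repeated
    null_counts: Vec<(String, i64)>,        // ...and repeated again
}

// Placeholder value type, just for the sketch.
enum ScalarValue {
    Long(i64),
    Str(String),
}
```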

@dispanser dispanser added the enhancement New feature or request label Sep 18, 2021
@dispanser dispanser changed the title Reduce memory footprint Reduce memory footprint of file-level statistics Sep 18, 2021
@mgill25

mgill25 commented Sep 18, 2021

I did some memory inspections. Just in case anyone else is working from a Mac (@dispanser uses Arch):

I tried to reproduce this and was very confused because my htop did not show high memory usage. I was expecting the program with the same large dataset to blow up, or at least be OOM-killed. But it turns out both memory compression and swap played a role in displaying an incomplete picture, as far as htop was concerned.

_delta_log size: ~15G

htop after Table Load:

[screenshot]

Activity Monitor shows the rest of the picture:
[screenshot]

Here we indeed see ~44G memory utilization (post table load). We also observe heavy Swap and Memory Compression by the OS:

[screenshot]

The numbers come down when the program is killed:
[screenshot]

@houqp
Member

houqp commented Sep 18, 2021

I also think there is room to reduce the memory usage by 10x. I would drop all of stats, stats_parsed, partition_values and parsed_partition_values from memory. Both stats and partition values can be stored in a columnar format, e.g. an arrow record batch, which should also help with applying filter push-down on table scans.
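For a rough idea of what that could look like (hypothetical schema and helper, just a sketch using the arrow crate; in practice the per-column min/max fields would be derived from the table schema):

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

// Hypothetical sketch: one row per file, one column per tracked
// statistic. Column names live once in the schema instead of being
// repeated inside every single Add action.
fn stats_batch() -> Result<RecordBatch> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("path", DataType::Utf8, false),
        Field::new("num_records", DataType::Int64, true),
        Field::new("min.value", DataType::Int64, true),
        Field::new("max.value", DataType::Int64, true),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["part-00000.parquet"])),
            Arc::new(Int64Array::from(vec![Some(1_000)])),
            Arc::new(Int64Array::from(vec![Some(0)])),
            Arc::new(Int64Array::from(vec![Some(42)])),
        ],
    )
}
```

With this layout each column name is stored once in the schema, and the values are densely packed arrays, which is also what a filter push-down would want to operate on.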

@mgill25

mgill25 commented Sep 19, 2021

@houqp @dispanser I'll try to do some refactoring and see what I can come up with. The first approach would be exactly as you suggest: just try to store the fields using record batches. I hope to make some progress (first time digging into the codebase), but I'll ask for advice here if I get stuck :)

@houqp
Member

houqp commented Sep 20, 2021

Thanks @mgill25, let us know if you need any help :)

@dispanser
Contributor Author

Tombstones have a similar structure and suffer from the memory consumption problem to some extent: their field `pub partition_values: HashMap<String, Option<String>>` contributes considerably to the overall memory consumption of tombstones.

If we find an efficient way to represent Add operations, it would be great if that were also applied to tombstones in a later PR.

@houqp
Member

houqp commented Sep 22, 2021

Good call @dispanser. For tombstones, I think we can further optimize by removing all other fields and keeping only `path` and `deleted_timestamp` - I don't think we need access to the other fields for vacuum.
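As a sketch (hypothetical struct; the real `Remove` action carries more fields than this):

```rust
// Hypothetical slimmed-down tombstone, keeping only what vacuum
// actually needs, per the suggestion above.
struct SlimTombstone {
    path: String,
    deleted_timestamp: Option<i64>,
}
```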

@mgill25

mgill25 commented Sep 26, 2021

@houqp and @dispanser Just to provide an update here, I have some working code, with no doubt lots of enhancements to be done.

On the plus side, I can read stats_parsed and then make record batches out of it for any provided columns. On the negative side, I'm still working with the assumption of i64 as the data type (to make testing easy). I'll do some cleanup tomorrow and post a gist for you to take a look at.

The annoying part so far has mostly been trying to parse the extremely deeply nested structure and re-arrange it into more familiar data structures. I suspect things will get better with time :)

@houqp
Member

houqp commented Sep 26, 2021

@mgill25 please feel free to send a draft PR directly if you prefer, we can comment and iterate there.

mgill25 added a commit to mgill25/delta-rs that referenced this issue Oct 9, 2021
rtyler added a commit that referenced this issue Jan 23, 2024
# Description

This is still very much a work in progress, opening it up for visibility
and discussion.

Finally, I do hope that we can make the switch to arrow-based log handling. Aside from the hoped-for advantages in memory footprint, I believe it also opens us up to many future optimizations.

To make the transition we introduce two new structs:

- `Snapshot` - a half-lazy version of the snapshot, which only tries to get the `Protocol` & `Metadata` actions ASAP. Of course, these drive all our planning activities, and without them there is not much we can do.
- `EagerSnapshot` - an intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility layer for the current `DeltaTable` APIs (a rough sketch of both structs follows below).
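A rough sketch of how the two could relate (hypothetical fields, for illustration only; the actual definitions in this PR differ):

```rust
use arrow::record_batch::RecordBatch;

// Placeholder action types, for illustration only.
struct Protocol;
struct Metadata;

// Hypothetical shape of the two structs described above.
struct Snapshot {
    protocol: Protocol, // fetched from the log as early as possible
    metadata: Metadata, // drives all planning activities
    // file actions are only materialized on demand
}

struct EagerSnapshot {
    snapshot: Snapshot,
    // file actions kept as arrow data after log replay
    files: Vec<RecordBatch>,
}
```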

One conceptually larger change is related to how we view the availability of information. Up until now, `DeltaTableState` could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is the `DeltaTable`, which now only carries the table configuration and the `LogStore`; the state / snapshot is now optional. Consequently, code that works against a snapshot no longer needs to handle the case that metadata / schema etc. may not be available.

This also has implications for the datafusion integration. We already work mostly against snapshots, but should abolish most traits implemented for `DeltaTable`, as it does not provide (and never has provided) the information that is at least required to execute a query.

Some larger notable changes include:

* remove `DeltaTableMetadata` and always use the `Metadata` action.
* arrow and parquet are now required, so the corresponding feature flags were removed. Personally, I would also argue that if you cannot read checkpoints, you cannot read delta tables :) - so hopefully users weren't relying on arrow-free builds.

### Major follow-ups:

* (pre-0.17) review integration with `log_store` and `object_store`. Currently we mostly use `ObjectStore` inside the state handling. What we really use is `head` / `list_from` / `get` - my hope would be that we end up with a single abstraction...
* test cleanup - we are currently dealing with test flakiness and have several approaches to scaffolding tests. Since we have the `deltalake-test` crate now, this can be reconciled.
* ...
* do more processing on borrowed data ...
* perform file-heavy operations on arrow data
* update checkpoint writing to leverage new state handling and arrow ...
* switch to exposing URL in public APIs

## Questions

* should paths be percent-encoded when written to checkpoint?

# Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435


---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
@roeap
Collaborator

roeap commented Jan 28, 2024

@dispanser - do you still have access to that table? If so, I'd be very curious to see what it looks like after the move to an arrow backend :).

RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this issue Feb 2, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Aug 19, 2024

Closing this one as it's potentially outdated and we have already made improvements here. If people perceive new issues with memory usage, please create a new issue for tracking :)
