
Reduce memory footprint of file-level statistics #435

Closed
dispanser opened this issue Sep 18, 2021 · 10 comments
Labels
enhancement New feature or request

Comments

@dispanser
Contributor

dispanser commented Sep 18, 2021

Description

I've been playing with a relatively large delta table recently, in the context of #425 (ignore tombstones for readers).

While the 5.3 million file paths themselves occupy about 600M, the overall memory consumption of a small Rust program after loading the delta table adds up to 49.4G.

Here's the memory consumption for the various fields in action::Add, on that table with 5.3 million files, and 19 columns (each column has file statistics enabled).

| field | mem |
| --- | --- |
| stats parsed | 37.1G |
| stats | 6.4G |
| partition values | 2.1G |
| parsed partition values | 1.9G |
| path | 600M |

I think we should be able to get that down considerably. The biggest offender is stats_parsed, which is of type Option<parquet::record::Row>. If I understand the concept correctly, it has minValues, maxValues, nullCounts and numRecords as top-level entries. Each of the first three then has a list / vec of tuples of (column name, value), so it effectively repeats each of my column names three times, for every individual file.
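For illustration, here's a minimal sketch of that shape (hypothetical Rust types, not the actual `parquet::record::Row` layout):

```rust
// Hypothetical sketch of the per-file statistics shape described
// above -- NOT the actual parquet::record::Row API. With F files,
// each column name ends up stored roughly 3 * F times.
struct FileStats {
    num_records: i64,
    min_values: Vec<(String, ScalarValue)>, // (column name, value)
    max_values: Vec<(String, ScalarValue)>, // column names repeated
    null_counts: Vec<(String, i64)>,        // ...and repeated again
}

// Placeholder value type, just for the sketch.
enum ScalarValue {
    Long(i64),
    Str(String),
}
```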

@dispanser dispanser added the enhancement New feature or request label Sep 18, 2021
@dispanser dispanser changed the title Reduce memory footprint Reduce memory footprint of file-level statistics Sep 18, 2021
@mgill25

mgill25 commented Sep 18, 2021

I did some memory inspections. Just in case anyone else is working from a Mac (@dispanser uses Arch):

I tried to reproduce this and was very confused because my htop did not show high memory usage. I was expecting the program with the same large dataset to blow up, or at least be OOM-killed. But it turns out both memory compression and swap played a role in displaying an incomplete picture, as far as htop was concerned.

_delta_log size: ~15G

htop after Table Load:

[screenshot]

Activity Monitor shows the rest of the picture:
[screenshot]

Here we indeed see ~44G memory utilization (post table load). We also observe heavy Swap and Memory Compression by the OS:

[screenshot]

The numbers come down when the program is killed:
[screenshot]

@houqp
Member

houqp commented Sep 18, 2021

I also think there is room to reduce the memory usage by 10x. I would drop all of stats, stats_parsed, partition_values and parsed_partition_values from memory. Both stats and partition values can be stored in a columnar format, e.g. an arrow record batch, which should also help with applying filter push-down on table scans.
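For a rough idea of what that could look like (hypothetical schema and helper, just a sketch using the arrow crate; in practice the per-column min/max fields would be derived from the table schema):

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

// Hypothetical sketch: one row per file, one column per tracked
// statistic. Column names live once in the schema instead of being
// repeated inside every single Add action.
fn stats_batch() -> Result<RecordBatch> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("path", DataType::Utf8, false),
        Field::new("num_records", DataType::Int64, true),
        Field::new("min.value", DataType::Int64, true),
        Field::new("max.value", DataType::Int64, true),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["part-00000.parquet"])),
            Arc::new(Int64Array::from(vec![Some(1_000)])),
            Arc::new(Int64Array::from(vec![Some(0)])),
            Arc::new(Int64Array::from(vec![Some(42)])),
        ],
    )
}
```

With this layout each column name is stored once in the schema, and the values are densely packed arrays, which is also what a filter push-down would want to operate on.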

@mgill25

mgill25 commented Sep 19, 2021

@houqp @dispanser I'll try to do some refactoring and see what I can come up with. The first approach would be exactly as you suggest: just try to store the fields using record batches. I hope to make some progress (first time digging into the codebase), but I'll ask for advice here if I get stuck :)

@houqp
Member

houqp commented Sep 20, 2021

Thanks @mgill25, let us know if you need any help :)

@dispanser
Contributor Author

Tombstones have a similar structure and suffer from the memory consumption problem to some extent: their field `pub partition_values: HashMap<String, Option<String>>` contributes considerably to the overall memory consumption of tombstones.

If we find an efficient way to represent Add operations, it would be great if that were also applied to tombstones in a later PR.

@houqp
Member

houqp commented Sep 22, 2021

Good call @dispanser. For tombstones, I think we can further optimize by removing all other fields and keeping only `path` and `deleted_timestamp` - I don't think we need access to the other fields for vacuum.
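As a sketch (hypothetical struct; the real `Remove` action carries more fields than this):

```rust
// Hypothetical slimmed-down tombstone, keeping only what vacuum
// actually needs, per the suggestion above.
struct SlimTombstone {
    path: String,
    deleted_timestamp: Option<i64>,
}
```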

@mgill25

mgill25 commented Sep 26, 2021

@houqp and @dispanser Just to provide an update here, I have some working code, with no doubt lots of enhancements to be done.

On the plus side, I can read stats_parsed and then make record batches out of it for any provided columns. On the negative side, I'm still working with the assumption of i64 as the data type (to make testing easy). I'll do some cleanup tomorrow and post a gist for you to take a look at.

The annoying part so far has mostly been trying to parse the extremely deeply nested structure and re-arrange it into more familiar data structures. I suspect things will get better with time :)

@houqp
Member

houqp commented Sep 26, 2021

@mgill25 please feel free to send a draft PR directly if you prefer, we can comment and iterate there.

mgill25 added a commit to mgill25/delta-rs that referenced this issue Oct 9, 2021
rtyler added a commit that referenced this issue Jan 23, 2024
# Description

This is still very much a work in progress, opening it up for visibility
and discussion.

Finally, I do hope that we can make the switch to arrow-based log handling. Aside from the hoped-for advantages in memory footprint, I believe it also opens us up to many future optimizations.

To make the transition we introduce two new structs:

- `Snapshot` - a half-lazy version of the snapshot, which only tries to get the `Protocol` & `Metadata` actions ASAP. Of course, these drive all our planning activities, and without them there is not much we can do.
- `EagerSnapshot` - an intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility layer for the current `DeltaTable` APIs (a rough sketch of both structs follows below).
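A rough sketch of how the two could relate (hypothetical fields, for illustration only; the actual definitions in this PR differ):

```rust
use arrow::record_batch::RecordBatch;

// Placeholder action types, for illustration only.
struct Protocol;
struct Metadata;

// Hypothetical shape of the two structs described above.
struct Snapshot {
    protocol: Protocol, // fetched from the log as early as possible
    metadata: Metadata, // drives all planning activities
    // file actions are only materialized on demand
}

struct EagerSnapshot {
    snapshot: Snapshot,
    // file actions kept as arrow data after log replay
    files: Vec<RecordBatch>,
}
```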

One conceptually larger change is related to how we view the availability of information. Up until now, `DeltaTableState` could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is the `DeltaTable`, which now only carries the table configuration and the `LogStore`; the state / snapshot is now optional. Consequently, code that works against a snapshot no longer needs to handle the case that metadata / schema etc. may not be available.

This also has implications for the datafusion integration. We already work mostly against snapshots, but should abolish most traits implemented for `DeltaTable`, as it does not provide (and never has provided) the information that is at least required to execute a query.

Some larger notable changes include:

* remove `DeltaTableMetadata` and always use the `Metadata` action.
* arrow and parquet are now required, so the corresponding feature flags were removed. Personally, I would also argue that if you cannot read checkpoints, you cannot read delta tables :) - so hopefully users weren't relying on arrow-free builds.

### Major follow-ups:

* (pre-0.17) review integration with `log_store` and `object_store`. Currently we mostly use `ObjectStore` inside the state handling. What we really use is `head` / `list_from` / `get` - my hope would be that we end up with a single abstraction...
* test cleanup - we are currently dealing with test flakiness and have several approaches to scaffolding tests. Since we have the `deltalake-test` crate now, this can be reconciled.
* ...
* do more processing on borrowed data ...
* perform file-heavy operations on arrow data
* update checkpoint writing to leverage new state handling and arrow ...
* switch to exposing URL in public APIs

## Questions

* should paths be percent-encoded when written to checkpoint?

# Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435


---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
@roeap
Collaborator

roeap commented Jan 28, 2024

@dispanser - do you still have access to that table? If so, I'd be very curious to see what it looks like after the move to an arrow backend :).

RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this issue Feb 2, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Aug 19, 2024

Closing this one as it's potentially outdated and we have already made improvements here. If people perceive new issues with memory usage, please create a new issue for tracking :)
