Use async writer + multipart + explore Datafusion sink #1984

Open
ion-elgreco opened this issue Dec 19, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@ion-elgreco
Collaborator

ion-elgreco commented Dec 19, 2023

Description

The Rust writer in its current state keeps a buffer instead of streaming to disk, which causes the writer to use a lot of extra memory.

We need to address this performance issue.

@wjones127 mentioned what needs to happen here: https://delta-users.slack.com/archives/C013LCAEB98/p1700330750311529?thread_ts=1700325888.484149&cid=C013LCAEB98

"
What we want to do instead is stream out to disk. Right now the writer is ArrowWriter :

pub(super) arrow_writer: ArrowWriter<ShareableBuffer>,

We should change it so it uses put_multipart and AsyncArrowWriter. That would make the type AsyncArrowWriter<Box<dyn AsyncWrite + Unpin + Send>>
What we want to do instead is combine
"
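To make the memory difference concrete, here is a minimal, std-only sketch. It is not the delta-rs writer and does not use the `parquet` or `object_store` APIs; `CountingSink`, `write_buffered` and `write_streaming` are hypothetical names made up for illustration. It only contrasts the two strategies described above: accumulating everything in a buffer before handing it to the sink, versus pushing each batch to the sink as soon as it is produced.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for an upload target (for example, a multipart upload).
/// It only counts the bytes it receives.
#[derive(Default)]
struct CountingSink {
    total_bytes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.total_bytes += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Buffered strategy: append every batch to one in-memory buffer and write it
/// out at the end. Returns the peak number of bytes held in memory.
fn write_buffered<W: Write>(batches: &[Vec<u8>], sink: &mut W) -> io::Result<usize> {
    let mut buffer = Vec::new();
    for batch in batches {
        buffer.extend_from_slice(batch);
    }
    sink.write_all(&buffer)?;
    sink.flush()?;
    Ok(buffer.len())
}

/// Streaming strategy: hand each batch to the sink right away. The peak number
/// of bytes held at once is the size of the largest single batch.
fn write_streaming<W: Write>(batches: &[Vec<u8>], sink: &mut W) -> io::Result<usize> {
    let mut peak = 0;
    for batch in batches {
        sink.write_all(batch)?;
        peak = peak.max(batch.len());
    }
    sink.flush()?;
    Ok(peak)
}

fn main() -> io::Result<()> {
    // Eight fake "batches" of 1 KiB each.
    let batches: Vec<Vec<u8>> = (0..8).map(|i| vec![i as u8; 1024]).collect();

    let mut buffered_sink = CountingSink::default();
    let buffered_peak = write_buffered(&batches, &mut buffered_sink)?;

    let mut streaming_sink = CountingSink::default();
    let streaming_peak = write_streaming(&batches, &mut streaming_sink)?;

    // Both strategies deliver the same bytes; only the peak memory differs.
    assert_eq!(buffered_sink.total_bytes, streaming_sink.total_bytes);
    println!("buffered peak: {buffered_peak} bytes, streaming peak: {streaming_peak} bytes");
    Ok(())
}
```

In the buffered case the peak grows with the total amount written, while in the streaming case it is bounded by the largest batch. A real implementation would presumably get the same effect by writing through an async writer into a multipart upload, as the quote suggests.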

Another thing to explore is the Datafusion sink functionality, as suggested by @roeap.

@ion-elgreco ion-elgreco added the enhancement New feature or request label Dec 19, 2023
@wjones127
Collaborator

Originally discussed in: #1225

@roeap
Collaborator

roeap commented Dec 19, 2023

Btw, I had a quick look into the datafusion sinks, and I believe they may not be the best fit for us, considering the work delta needs to do on write. More specifically, I had a look at whether it would make sense to implement a FileFormat for Delta...

The TableProvider does, however, have more methods available now that integrate into the framework. They also did some really great work integrating with @wjones127's multi-part writer....

@tustvold

FYI I plan to add some first-party writer support to the parquet crate as part of apache/arrow-rs#5524

@ion-elgreco
Collaborator Author

ion-elgreco commented Mar 17, 2024

@tustvold awesome!

@aersam this may be interesting to look out for


5 participants