Do not load full source into RAM on write_to_deltalake #2255
Labels: enhancement (New feature or request)
Comments
I can pick it up, but I'd rather do it in the write.rs operation.

@aersam that's fine!

Ok, I see partitioning makes this quite complicated 🙂 And DataFusion's MemoryExec is not helpful, so it might take some time.

I'll just implement it using chunks. This isn't perfect, but it should work and is less invasive than rewriting the whole partitioning.
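The chunked approach described above can be sketched in pure Python (the real implementation lives in Rust's write.rs; all names and the list-based "batches" here are hypothetical stand-ins, not the delta-rs API):

```python
from itertools import islice

def record_batches():
    """Stand-in for a lazily produced stream of Arrow RecordBatches."""
    for i in range(10):
        yield [i] * 3  # each "batch" is just a small list in this sketch

def write_in_chunks(batches, chunk_size):
    """Consume the stream chunk by chunk: peak memory is bounded by
    chunk_size batches, never by the size of the whole source."""
    it = iter(batches)
    chunk_sizes = []
    while True:
        chunk = list(islice(it, chunk_size))  # only this chunk is in RAM
        if not chunk:
            break
        # in delta-rs this is where the chunk would be handed to the writer
        chunk_sizes.append(len(chunk))
    return chunk_sizes

print(write_in_chunks(record_batches(), 4))  # → [4, 4, 2]
```

The trade-off matches the comment above: chunking bounds memory without touching the partitioning logic, at the cost of not being a true batch-at-a-time stream.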
This was referenced Mar 8, 2024
Description

In python/lib.rs, the first thing write_to_deltalake does is collect the batches into a Vec. Doesn't this load all RecordBatches into RAM? That doesn't seem like a good thing to me. I think the main reason is that write.rs tries to derive the schema from the batches, but the schema is already known on the Python side anyway, so why not pass it directly?

Use Case

I don't want to waste resources ;)
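The proposed change can be illustrated with a minimal pure-Python sketch (function names and the dict-based "batches" are hypothetical; the actual code is Rust in write.rs): collecting exists only to learn the schema, so passing the schema in lets the stream be consumed lazily.

```python
def infer_schema(batch):
    # hypothetical: derive column names from the first batch
    return sorted(batch.keys())

def write_collected(batches):
    """Current pattern: materialize the entire source just to get the schema."""
    collected = list(batches)          # whole source resident in RAM
    schema = infer_schema(collected[0])
    return schema, len(collected)

def write_streaming(batches, schema):
    """Proposed pattern: schema is supplied by the caller, so batches
    can be consumed one at a time without collecting."""
    count = 0
    for _batch in batches:             # only one batch live at a time
        count += 1
    return schema, count

source = ({"a": i, "b": i * 2} for i in range(5))
print(write_streaming(source, ["a", "b"]))  # → (['a', 'b'], 5)
```

Both variants produce the same result; the difference is purely in peak memory, which is the point of the issue.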
Related Issue(s)