use ObjectStore for dataframe writes #6987
Conversation
Thank you @devinjdangelo
I plan to review this PR tomorrow if no one else beats me to it
Looks like I didn't handle temp directories in the unit tests correctly for all environments based on the checks failing. I'll look into fixing that tonight.
Thank you for the contribution @devinjdangelo -- this is looking very close. I think we should consider using multi-part uploads to stream the files to the object store, but otherwise 👌 -- other than the CI failures, this PR looks good to me.
It would also be awesome, longer term, to unify these "write_to_json" type methods into DataSinks that are object store enabled -- like https://github.com/apache/arrow-datafusion/blob/9338880d8f64f8143e348a60beee8af2789fa8ae/datafusion/core/src/physical_plan/joins/symmetric_hash_join.rs#L390-L408
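As a rough illustration of that idea, a unified object-store-enabled sink could look something like the hypothetical trait below (the name `ObjectStoreSink` and its signature are invented here, not an existing DataFusion API):

```rust
// Hypothetical sketch of the "object store enabled DataSink" idea -- not an
// existing DataFusion trait. It only illustrates how write_csv / write_json /
// write_parquet could share one streaming sink abstraction.
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::error::Result;
use datafusion::physical_plan::SendableRecordBatchStream;
use object_store::{path::Path, ObjectStore};

#[async_trait]
trait ObjectStoreSink: Send + Sync {
    /// Serialize the stream in this sink's format (CSV, JSON, Parquet, ...)
    /// and stream it to `location` in `store`, returning the rows written.
    async fn write_all(
        &self,
        input: SendableRecordBatchStream,
        store: Arc<dyn ObjectStore>,
        location: &Path,
    ) -> Result<u64>;
}
```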
```rust
let out_path = format!("s3://{bucket_name}/test_write/");
df.write_parquet(&out_path, None).await?;

//write as JSON to s3 instead
```
Did you mean to leave this commented out?
Yes. The example is showing how you would write as parquet to S3, but I added the comments just to call out that it is nearly the same code for writing any other file type to S3. I could uncomment and clone the df so the example writes out all file types instead if that is preferred.
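For context, the overall pattern the example demonstrates looks roughly like this sketch (the bucket, credentials source, and `data.csv` input are placeholders, and the URL-based `register_object_store` call reflects the DataFusion API of this period):

```rust
// Condensed sketch of the write-to-S3 pattern; names and inputs are placeholders.
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::*;
use object_store::aws::AmazonS3Builder;
use url::Url;

async fn write_df_to_s3(bucket_name: &str) -> Result<()> {
    // Build an S3 ObjectStore from AWS_* environment variables.
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name(bucket_name)
        .build()?;

    let ctx = SessionContext::new();
    // Register the store so s3://<bucket>/... paths resolve to it.
    let s3_url = Url::parse(&format!("s3://{bucket_name}")).expect("valid url");
    ctx.runtime_env()
        .register_object_store(&s3_url, Arc::new(s3));

    let df = ctx.read_csv("data.csv", CsvReadOptions::new()).await?;

    // Write as parquet; write_json / write_csv take the same kind of URL.
    let out_path = format!("s3://{bucket_name}/test_write/");
    df.write_parquet(&out_path, None).await?;
    Ok(())
}
```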
```rust
let plan: Arc<dyn ExecutionPlan> = plan.clone();
let filename = format!("{}/part-{i}.csv", parsed.prefix());
let file = object_store::path::Path::parse(filename)?;
buffer = Vec::new();
```
Do I understand this code correctly that it effectively buffers the entire output in memory prior to writing to the remote store?
What do you think about using the multi-part put API instead? https://docs.rs/object_store/0.6.1/object_store/#multipart-put-object
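For reference, basic usage of that API in object_store 0.6.x looks roughly like the following sketch (the path and payload are placeholders):

```rust
// Minimal usage of the multipart put API from the linked object_store docs:
// put_multipart hands back an AsyncWrite that uploads parts as data is written,
// and shutdown() completes the upload.
use object_store::{path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

async fn stream_upload(
    store: &dyn ObjectStore,
    data: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    let location = Path::from("test_write/part-0.csv");
    let (_multipart_id, mut writer) = store.put_multipart(&location).await?;
    writer.write_all(data).await?; // may be called repeatedly as batches arrive
    writer.shutdown().await?;      // finalize the multipart upload
    Ok(())
}
```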
Yes, everything is buffered into memory and uploaded in a single put, which will be very memory inefficient if you are writing large files.
If you are writing many small files, multipart upload could add unnecessary overhead creating and finalizing the uploads which may only have a single part.
If we must choose only one or the other, I would also favor multipart upload, since large files could fail in the current implementation, whereas small files would at worst be slower in a multipart implementation. I will work on a multipart implementation of this!
This is probably too ambitious for this PR, but in the future we could automatically select the optimal choice between put and multipart-put based on information in the ExecutionPlan (maybe this?), i.e. if you anticipate files averaging more than some threshold_bytes, do a streaming multipart-put, otherwise a single put.
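A very rough sketch of what that future heuristic might look like, with the threshold and the `estimated_bytes` input purely assumed:

```rust
// Hypothetical follow-up sketch, explicitly beyond this PR: pick the upload
// strategy from an estimated output size. The 10 MiB cutoff and the
// `estimated_bytes` input (which would have to come from plan statistics)
// are assumptions.
use bytes::Bytes;
use object_store::{path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

const MULTIPART_THRESHOLD: usize = 10 * 1024 * 1024;

async fn upload(
    store: &dyn ObjectStore,
    location: &Path,
    estimated_bytes: Option<usize>,
    data: Bytes, // already-serialized output, used directly in the single-put case
) -> Result<(), Box<dyn std::error::Error>> {
    match estimated_bytes {
        // Small file: one round trip, no multipart bookkeeping.
        Some(n) if n < MULTIPART_THRESHOLD => store.put(location, data).await?,
        // Large or unknown size: stream parts to the store instead.
        _ => {
            let (_id, mut writer) = store.put_multipart(location).await?;
            writer.write_all(&data).await?;
            writer.shutdown().await?;
        }
    }
    Ok(())
}
```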
> If we must choose only one or the other, I would also favor multipart upload, since large files could fail in the current implementation, whereas small files would at worst be slower in a multipart implementation. I will work on a multipart implementation of this!
Thank you -- I think if the overhead of doing a multi-part upload is too big, it would make sense to add (optional) buffering in the object_store crate (if it doesn't have one already) so that multiple small multi-part writes are coalesced into a smaller number of larger writes.
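One possible shape for that kind of coalescing, assuming the crate does not already buffer internally (all names here are invented):

```rust
// Sketch of the coalescing idea: accumulate small writes locally and only
// forward them to the multipart writer once `capacity` bytes have built up.
use tokio::io::{AsyncWrite, AsyncWriteExt};

struct CoalescingWriter<W: AsyncWrite + Unpin> {
    inner: W,        // e.g. the writer returned by put_multipart
    buffer: Vec<u8>,
    capacity: usize, // e.g. 5 MiB, the minimum S3 part size
}

impl<W: AsyncWrite + Unpin> CoalescingWriter<W> {
    fn new(inner: W, capacity: usize) -> Self {
        Self { inner, buffer: Vec::with_capacity(capacity), capacity }
    }

    /// Buffer `data`, flushing to the inner writer once enough has accumulated.
    async fn write(&mut self, data: &[u8]) -> std::io::Result<()> {
        self.buffer.extend_from_slice(data);
        if self.buffer.len() >= self.capacity {
            self.inner.write_all(&self.buffer).await?;
            self.buffer.clear();
        }
        Ok(())
    }

    /// Flush any remaining bytes and finalize the inner writer.
    async fn finish(mut self) -> std::io::Result<()> {
        if !self.buffer.is_empty() {
            self.inner.write_all(&self.buffer).await?;
        }
        self.inner.shutdown().await
    }
}
```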
Thanks for the feedback @alamb! I'll work on a multipart upload version of this and on fixing the failing checks. I'm not sure if this is what you meant by unifying these methods into DataSinks, but I did experiment a little with creating a generic trait. Edit: found the trait here. I could refactor with this first so I don't have to replicate the changes in all 3 methods.
I am sorry -- I ran out of time again today, but this PR is at the top of my list for tomorrow.
No worries, @alamb. Today is better anyway, since I just pushed up changes to support multipart incremental uploads instead of always buffering the entire file. For parquet it was relatively straightforward to use the parquet crate's async writer. For CSV and JSON, I initially tried something like this:

```rust
let mut buffer = Vec::new();
while let Some(next_batch) = stream.next().await {
    let batch = next_batch?;
    let mut writer = csv::Writer::new(buffer);
    writer.write(&batch)?;
    buffer = writer.into_inner();
    multipart_writer.write_all(&buffer).await?;
    buffer.clear();
}
```

This approach mostly worked, except that recreating the writer for each batch re-emits the CSV header on every write.
This is looking very cool @devinjdangelo -- I think the code looks good to me.
I will see if @tustvold has time to review this prior to merge.
Had a brief look; I think this is perfectly valid. My one thought is that the writing logic for CSV and JSON could be made significantly simpler by not keeping the writer around. Something like:
```rust
let mut buffer = Vec::with_capacity(1024);
let mut out = store.put_multipart(...).await?;
while let Some(batch) = stream.next().await.transpose()? {
    let mut writer = make_writer(&mut buffer).await?;
    writer.write(&batch)?;
    out.write_all(&buffer).await?;
    buffer.clear();
}
```
The only wrinkle is that for CSV it will need to only enable headers on WriterBuilder for the first write; however, this should still be simpler than the futures locking dance.
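A sketch of how that wrinkle could be handled, assuming the arrow-rs WriterBuilder of this era (`has_headers`; newer releases rename it to `with_header`):

```rust
// Sketch (not the exact PR code): rebuild the CSV writer per batch, but only
// request a header row for the first batch.
use arrow::csv::WriterBuilder;
use arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use futures::{Stream, StreamExt};
use tokio::io::{AsyncWrite, AsyncWriteExt};

async fn write_csv_stream(
    mut stream: impl Stream<Item = Result<RecordBatch>> + Unpin,
    mut out: impl AsyncWrite + Unpin, // e.g. the writer from put_multipart
) -> Result<()> {
    let mut buffer = Vec::with_capacity(1024);
    let mut first = true;
    while let Some(batch) = stream.next().await.transpose()? {
        let mut writer = WriterBuilder::new().has_headers(first).build(&mut buffer);
        writer.write(&batch)?;
        drop(writer); // release the &mut borrow on `buffer`
        first = false;
        out.write_all(&buffer).await?;
        buffer.clear();
    }
    out.shutdown().await?; // finalize the multipart upload
    Ok(())
}
```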
Oh! Thanks @tustvold, I agree that approach is simpler, and I had tried it. I did not realize I could control the headers via the WriterBuilder like that. I just pushed a commit to revert back to that approach.
Thanks @devinjdangelo -- once the CI passes I plan to merge this PR.
Which issue does this PR close?
Related to #1777
Rationale for this change
Many ETL and other workflows need to persist data to an ObjectStore. Currently, DataFrame write_parquet, write_json, and write_csv only support writing to a local file system.
What changes are included in this PR?
DataFrame write methods (write_parquet, write_json, and write_csv) now expect an ObjectStore URL registered in the current context. Unit tests are updated to accommodate this change, and an example of writing to S3 is added.
Are these changes tested?
Existing unit tests are updated to cover this change.
Are there any user-facing changes?
Yes, users will now need to pass an ObjectStore URL to the write methods rather than a local path.