Support for Async CSV Writer #3740
I wonder if this could be achieved by simply writing a batch to an in-memory buffer?
I think this requires constantly creating a "blocking" writer for each record batch, since it will own the in-memory buffer. I couldn't think of a solution for how to keep buffer ownership while writing with the usual `Writer`. Do you have any idea how I can code that? Btw, I verified the performance degradation; I agree with you that CPU-bound computations like serialization shouldn't be async since there is no gain. I am trying to isolate the IO-bound operation (flush) as async, as you said.
Perhaps something like the following (not tested):
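The original snippet was not preserved here; a minimal sketch of the idea, assuming the arrow CSV `Writer` exposes `into_inner` to recover the underlying buffer:

```rust
use arrow::csv;
use arrow::record_batch::RecordBatch;
use tokio::io::{AsyncWrite, AsyncWriteExt};

// Serialize one batch with the blocking CSV writer into an owned
// buffer, recover the buffer, then hand the bytes to the async sink.
// Header handling is elided for brevity.
async fn write_batch<W: AsyncWrite + Unpin>(
    sink: &mut W,
    batch: &RecordBatch,
    buffer: Vec<u8>,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let mut writer = csv::Writer::new(buffer); // writer owns the buffer
    writer.write(batch)?;                      // CPU-bound, synchronous
    let mut buffer = writer.into_inner();      // take the buffer back
    sink.write_all(&buffer).await?;            // IO-bound, async
    buffer.clear();                            // reuse across batches
    Ok(buffer)
}
```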
Whilst creating the `Writer` per batch …
I made a benchmark of record batch writing across 3 cases: the ordinary writer, an async writer, and a buffered async writer (the current discussion).

- Low batch size (10) with batch count 1000, using the usual schema.
- Larger batch size (1000) with batch count 100, using the usual schema.
- The usual batch size (4096) with batch count 100, using the usual schema.
I think the buffered version also scales immediately to JSON and Avro:
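The snippet that followed was not preserved; one plausible shape for that generalisation, where the trait and function names are illustrative rather than an actual arrow-rs API:

```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// Hypothetical abstraction: any format that can serialize a batch
/// into an in-memory buffer (CSV today, JSON/Avro later).
trait BatchSerializer {
    fn serialize(&mut self, batch: &RecordBatch, buf: &mut Vec<u8>) -> Result<(), ArrowError>;
}

/// Buffered async writing, generic over the serialization format.
async fn write_all<S, W>(
    serializer: &mut S,
    batches: &[RecordBatch],
    sink: &mut W,
) -> Result<(), Box<dyn std::error::Error>>
where
    S: BatchSerializer,
    W: AsyncWrite + Unpin,
{
    let mut buf = Vec::with_capacity(8 * 1024);
    for batch in batches {
        buf.clear();
        serializer.serialize(batch, &mut buf)?; // CPU-bound, synchronous
        sink.write_all(&buf).await?;            // IO-bound, async
    }
    sink.flush().await?;
    Ok(())
}
```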
If you are also satisfied with the results of the buffered version, I will add this functionality to CSV and JSON. cc @tustvold
The performance across all three seems to be basically comparable. It would be interesting to see a profile, but I suspect the difference is in the sizing of the intermediate buffer, and the optimal size will be highly dependent on the destination sink.
Thus far we have managed to avoid async within arrow-rs, and I think this encourages a nice separation of compute and IO. What do you think about adding this functionality to DataFusion instead, and perhaps just adding a doc comment to arrow-rs showing how it can be done? E.g. something like the following (not tested):
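The original example was not preserved here; a reconstruction in that spirit, assuming a `futures` stream of batches, any `tokio::io::AsyncWrite` sink, and an `into_inner` method on the CSV `Writer`:

```rust
use arrow::csv;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use futures::{Stream, StreamExt};
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// Drain a stream of batches into an async sink: serialization stays
/// synchronous and CPU-bound, only the byte transfer is awaited.
/// Header handling is elided for brevity.
async fn write_csv_stream<S, W>(
    mut batches: S,
    mut sink: W,
) -> Result<(), Box<dyn std::error::Error>>
where
    S: Stream<Item = Result<RecordBatch, ArrowError>> + Unpin,
    W: AsyncWrite + Unpin,
{
    let mut buffer = Vec::with_capacity(1024);
    while let Some(batch) = batches.next().await {
        // The blocking writer takes ownership of the buffer...
        let mut writer = csv::Writer::new(buffer);
        writer.write(&batch?)?;
        // ...and gives it back once the batch is serialized.
        buffer = writer.into_inner();
        sink.write_all(&buffer).await?;
        buffer.clear();
    }
    sink.flush().await?;
    Ok(())
}
```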
I am OK with the separation. The main idea behind adding this is illustrated by `arrow-rs/object_store/src/lib.rs`, lines 191 to 217 (at `3508674`).
I was planning to add a new API like `append`. It would look like:
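The proposed signature was not preserved above; a hypothetical sketch of what such an API might look like:

```rust
use async_trait::async_trait;
use bytes::Bytes;
use object_store::{path::Path, Result};

/// Hypothetical extension: append-style writes for the stores that
/// can support them (e.g. a local filesystem).
#[async_trait]
pub trait AppendableObjectStore {
    /// Append `bytes` to the object at `location`, creating the
    /// object if it does not yet exist.
    async fn append(&self, location: &Path, bytes: Bytes) -> Result<()>;
}
```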
Consider the …
Store-specific functionality, let alone operating-system-specific functionality, doesn't seem like a great fit for the object_store crate. Python's fsspec, which is more filesystem-focused, doesn't support appends either. I'm not familiar with your use-case, but trying to shoehorn streaming through OS-specific filesystem APIs seems a little odd to me. Is there a reason you opted for such an approach over a custom operator? A custom operator would also allow generalising to streaming brokers like Kafka, Kinesis, etc., and potentially using things like Unix domain sockets, which have better async stories.
Suppose you read a … I was looking for a … Overall, we believe that overfitting to batch solutions is mostly avoidable. However, I understand your concern.
I don't think this is avoidable; arrow is a columnar data format, and it fundamentally assumes batching to amortise dispatch overheads. Row-based streaming would require a completely different architecture, likely using a JIT. FWIW, Kafka and Kinesis aren't really streaming APIs; under the hood they rely on aggressive batching for throughput.
I am aware; I wrote a lot of that logic. My confusion is why this is the point of integration, instead of, say, a custom operator or TableProvider? That would be more flexible and would avoid all these issues. ListingTable is intended for ad-hoc querying of files in the absence of a catalog; I would expect service workloads to always make use of some sort of catalog, be it Hive MetaStore, Glue, Delta, Lakehouse, etc.
@tustvold, I think there is maybe some terminology-related confusion going on here w.r.t. batching. I am sure @metesynnada was not trying to say he wants to avoid batching in its entirety. I think what he envisions (albeit maybe not conveyed clearly) is simply an API that operates with an async writer, so that non-IO operations can carry on while the actual write to the object store is taking place. The current API (i.e. the …) does not allow this.

Given that we are analyzing this part of the code, one good thing we can do is to investigate whether it makes sense to avoid the new IO thread and instead use async primitives to do the actual writing within the same thread. I am not entirely sure what the advantages/disadvantages of doing that would be; @metesynnada can do some measurements to quantify this. Maybe you can share the reasoning behind the current choice?
FWIW, tokio doesn't support non-blocking filesystem IO (tokio-uring is still experimental), so it will always dispatch to a separate blocking threadpool. This is what I was alluding to when I suggested sockets might be a more compelling primitive than filesystem APIs, as they support epoll.
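For illustration, the dispatch tokio performs for file IO is roughly equivalent to wrapping the blocking call in `spawn_blocking` yourself:

```rust
use tokio::task;

// tokio has no true non-blocking file IO; `tokio::fs` operations are
// shipped to a blocking threadpool, roughly like this:
async fn write_file(path: std::path::PathBuf, data: Vec<u8>) -> std::io::Result<()> {
    task::spawn_blocking(move || std::fs::write(&path, &data))
        .await
        .expect("blocking write task panicked")
}
```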
This is true, but each put creates a new file, overwriting anything already present, which I suspect will be a problem?
OK, so we will need proper append support.
Right. A simple mechanism to choose between overwrite/append behaviors should be enough for @metesynnada's purposes. Everything is already async. Any suggestions on how we can add this capability? |
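One possible shape for such a mechanism, purely illustrative and not what was ultimately proposed:

```rust
/// Hypothetical knob for choosing put semantics on stores that
/// support appending (e.g. a local filesystem).
#[derive(Clone, Copy, Debug)]
pub enum WriteMode {
    /// Replace any existing object (the current `put` behaviour).
    Overwrite,
    /// Extend the existing object, creating it if absent.
    Append,
}
```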
For more context, I believe the use case for this feature may be related to apache/datafusion#5130. I can see a "native" async CSV writer in arrow-rs being a compelling UX feature (aka it would make it easier to use and integrate). However, I don't know how much easier it would be or how much maintenance it would require.

I really like the idea, mentioned in a few places on this thread, of creating an async CSV writer by wrapping the existing blocking / batching one outside of arrow-rs initially (I don't think there is any reason this cannot be done). Once we have the actual code and an example use of it, we will be in a much better position to evaluate how it performs compared to a "native" async writer, how it is actually used, and whether we should reconsider incorporating it into arrow-rs.
cc @tustvold, I introduced a new API for objects that support append. What do you think about it?
I'm not a massive fan of introducing an abstraction into the object_store crate that few, if any, actual object stores support... Perhaps @alamb might be able to facilitate a synchronous chat, if you would be amenable? I think we may be able to support your use-case without needing to graft it onto filesystem APIs.

Edit: Whilst Azure does have an Append Blob concept, I think we would need to bifurcate the API, as the way you interact with them is different. The general approach of object_store has been to expose the common functionality across multiple backends; it is unclear how we could do this in a portable manner.
A synchronous chat would be OK. Meanwhile, you can check apache/datafusion#5130 for more insight into the use case. In short, we want to support appending to files (on the stores that support it).
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I want an async CSV writer to use in an async context. The current implementation is blocking, even if we are in a tokio environment.
Describe the solution you'd like
A writer using `tokio::io::AsyncWrite` might be the solution.

Describe alternatives you've considered
NA
Additional context
NA