Return format of Arrow Stream data to an Elixir NIF #1033
-
Hi! This is more about my lack of Rust knowledge and the "best" way to return data back to Elixir. Basically, here is my use case:
Sorry for this long ramble, just looking for some opinions about writing data back. It shouldn't be many gigabytes; from my initial testing most files would be a couple of megabytes at most. Example Arrow Stream file here (from Snowflake example data): https://github.com/joshuataylor/arrow_fail_read_example/blob/main/example_snowflake_data

edit: I think the best option at this point is to do something similar to what the CSV writer does to write back to Elixir types.
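The "do what the CSV writer does" idea above can be sketched in plain Rust. This is an illustrative model only (the `Int64Column` type and `columns_to_rows` function are stand-ins, not arrow2's actual API): walk the columnar data row by row, the way a CSV writer does, producing the `Option`-per-value shape that maps naturally onto Elixir lists with `nil` for nulls.

```rust
// Illustrative sketch (not arrow2's actual API): a hypothetical nullable
// Int64 column, modeled as Arrow does it -- a values buffer plus a
// validity mask.
struct Int64Column {
    values: Vec<i64>,
    validity: Vec<bool>, // true = valid, false = null
}

// Transpose columns into rows of Option<i64>, the shape that maps
// naturally onto Elixir lists, with None standing in for Elixir's nil.
fn columns_to_rows(columns: &[Int64Column]) -> Vec<Vec<Option<i64>>> {
    let n_rows = columns.first().map_or(0, |c| c.values.len());
    (0..n_rows)
        .map(|row| {
            columns
                .iter()
                .map(|col| col.validity[row].then(|| col.values[row]))
                .collect()
        })
        .collect()
}

fn main() {
    let a = Int64Column { values: vec![1, 2, 3], validity: vec![true, false, true] };
    let b = Int64Column { values: vec![10, 20, 30], validity: vec![true, true, true] };
    let rows = columns_to_rows(&[a, b]);
    assert_eq!(rows[1], vec![None, Some(20)]);
    println!("{:?}", rows);
}
```

The per-row transposition is the key cost: it touches every value once, which is why the discussion below focuses on whether that conversion can be avoided entirely by sharing Arrow memory instead.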
-
This is a very interesting question! Generally, there are two questions to answer:

1. Which allocator are you planning to use to handle Snowflake's Arrow data?
   1.1. Rust's allocator
   1.2. Elixir's allocator
2. Which in-memory format do you plan to use in your application?
   2.1. Arrow
   2.2. A custom in-memory format

From these questions, we can design our needs:

a. (1.1, 2.1) - a mechanism to expose shared structs like `Arc<dyn Array>` to Elixir (this is what the Arrow C data interface is intended for)
b. (1.1, 2.2) - a mechanism to expose non-owning structs like `&[T]` to Elixir, and a mechanism to convert them to the custom format
c. (1.2, 2.1) - a mechanism to expose `Arc<dyn Array>` to Elixir and…

Starting very small, consider a single array struct: how would we map this to Elixir? Is it possible to perform a memcopy of it? Without an Elixir Arrow library we can't benefit from sharing a reference or even memcopying the whole thing. In this case, I would probably try converting to Elixir types directly.

Note that this does incur a CPU cost, as we are back in the traditional ser-de conversion between Arrow and another in-memory format. With a minimal Elixir Arrow library (i.e. one that implements reading from the C data interface / FFI), we could leverage Rust's implementation for everything else (e.g. Arrow IPC, Parquet, etc.) by passing Rust's Arrow structures across the boundary. Hope this helps somehow.
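As a concrete example of the kind of struct involved, the Arrow C data interface defines `ArrowArray`, transcribed here from the Arrow specification into `#[repr(C)]` Rust. This is the ABI-stable shape a Rust producer fills in and an Elixir NIF consumer could read, so both sides share buffers without copying (arrow2 ships its own FFI module for this; the transcription below is of the spec itself, not arrow2's code):

```rust
use std::os::raw::c_void;

// The ArrowArray struct from the Arrow C data interface specification,
// in #[repr(C)] layout so it matches the C ABI field-for-field.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    // Called by the consumer when it is done with the array, so the
    // producer (here, Rust) can free the memory it allocated.
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}
```

The `release` callback is what makes the allocator question above tractable: Rust keeps ownership of the buffers, and Elixir only signals when it is done with them.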
-
It does! Thank you so much for your incredibly thorough answer! I think a first step is experimenting with how fast it is to encode to Elixir types in Rust and return that, versus using the foreign function interface and doing the conversion in Elixir. As my first experiment, I'm skipping serde and just encoding the values using rustler's `.encode`, which I'll then benchmark. I'll post back my findings, as they will probably be relevant for other languages as well, and for a generic Elixir Arrow binding at some point too (this is just a Snowflake-specific binding right now, where we read Arrow streaming files).

edit: For the first implementation, I've ended up just serialising to Elixir formats, which is great as we don't have to convert the column types like we do with the JSON implementation (https://github.com/joshuataylor/req_snowflake/blob/initial/lib/req_snowflake/req_snowflake.ex#L251). It's also absurdly fast so far (I honestly thought I hadn't set up the benchmark properly); I'll provide benchmark results once I get the rest of the types mapped.
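The "just encode the values" approach can be modeled without the rustler crate. In the sketch below, `Term` and `Encoder` are stand-ins for rustler's actual types (rustler's real `Encoder::encode` also takes an `Env` handle); the point is the shape of the conversion, where each Rust value knows how to turn itself into an Elixir term and Arrow nulls become `nil`:

```rust
// Crate-free model of rustler-style encoding. Term and Encoder here are
// illustrative stand-ins, not rustler's actual API.
#[derive(Debug, PartialEq)]
enum Term {
    Int(i64),
    List(Vec<Term>),
    Nil, // Elixir's nil, used for Arrow nulls
}

trait Encoder {
    fn encode(&self) -> Term;
}

impl Encoder for Option<i64> {
    fn encode(&self) -> Term {
        match self {
            Some(v) => Term::Int(*v),
            None => Term::Nil,
        }
    }
}

// Encoding a whole column is then just a map over its values.
impl Encoder for Vec<Option<i64>> {
    fn encode(&self) -> Term {
        Term::List(self.iter().map(|v| v.encode()).collect())
    }
}

fn main() {
    let column: Vec<Option<i64>> = vec![Some(1), None, Some(3)];
    assert_eq!(
        column.encode(),
        Term::List(vec![Term::Int(1), Term::Nil, Term::Int(3)])
    );
}
```

Skipping serde and implementing encoding per type like this avoids an intermediate representation: each Arrow value is visited exactly once on its way to becoming a term.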
-
Again, thank you so much for your explanation, it really helped. I went with the approach of serialising to Rust types, which Rustler can just convert for me. I'm looking into other options to see if there is another way to return data to Elixir using some tricks, but from my initial testing it's fast. Snowflake tends to send files ranging from 100 KB-1 MB up to 20 MB, so it's not a huge amount of data to parse anyway. I've got an initial PR here: joshuataylor/snowflake_arrow#1

Here are some initial benchmarking results; these aren't cast into Elixir structs etc. yet:

Laptop, Apple MacBook M1
Desktop, AMD Ryzen 5 5600X (slowish single-core performance)
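A micro-benchmark of this kind can be sketched with `std::time::Instant` (an illustrative harness, not the benchmark actually run above; `convert` is a hypothetical stand-in for the Arrow-to-Elixir-types step being measured):

```rust
use std::time::Instant;

// Illustrative stand-in for the step being benchmarked: converting a
// columnar values buffer plus validity mask into the Option-per-value
// form that gets encoded for Elixir.
fn convert(values: &[i64], validity: &[bool]) -> Vec<Option<i64>> {
    values
        .iter()
        .zip(validity)
        .map(|(v, ok)| ok.then(|| *v))
        .collect()
}

fn main() {
    let n: i64 = 1_000_000;
    let values: Vec<i64> = (0..n).collect();
    // Every 10th value is null, roughly mimicking sparse nulls.
    let validity: Vec<bool> = (0..n).map(|i| i % 10 != 0).collect();

    let start = Instant::now();
    let converted = convert(&values, &validity);
    let elapsed = start.elapsed();

    assert_eq!(converted.len(), n as usize);
    println!("converted {} values in {:?}", n, elapsed);
}
```

In a real benchmark you'd want repeated runs and a warm-up pass (e.g. via Benchee on the Elixir side, as used elsewhere in this thread, or criterion in Rust) rather than a single `Instant` measurement.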
-
It's hard to compare against the JSON implementation, as there we get back strings/integers and then still need to cast them. I'll create proper benchmarks later which compare the full decode process we have to go through with the JSON approach (as we want to map these to Elixir types), whereas with Arrow we can do all of this in Rust and return exactly what we need to Elixir. But for parsing the files, here are similar JSON files from the same result set, closest to the file sizes that Snowflake returns. This uses Jason, which is a pure-Elixir JSON decoder; for a fair comparison we should also test with a JSON NIF, but 🤷♂️

Removed benchmarks as I realised I was only getting partial results for larger files.