Async reader #96

Merged
merged 32 commits into main from kyle/async-reader
Jul 10, 2022

Conversation

@kylebarron (Owner) commented Apr 24, 2022

Working async reader just for the arrow2 bindings.

TODO:

  • Test with a parquet dataset
  • Make read_row_group take a RowGroupMetaData instead of a FileMetaData?
  • Ideally wouldn't need to copy the metadata object for each fetch, but not clear if it's possible in current wasm-bindgen for the async functions to take references as function parameters, since wasm-bindgen doesn't support lifetimes

Very early, just putting this up for planning.

Todo:

  • Working fetch example from Rust (from here)

Future work:

  • Update arrow2/parquet2 so that no initial HEAD request is needed to get the length of the data. It should be possible to use a negative (suffix) byte range for the first fetch, and that should include all byte ranges.
  • Configurable initial read size for the footer (currently hard-coded in arrow2 to 64 KB)
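
One way the first item could work, as a rough sketch: HTTP allows a suffix byte range (`Range: bytes=-N`) that asks for the last N bytes, and a partial-content response carries a `Content-Range` header of the form `bytes start-end/total`, so the total length could be read from the same response that returns the footer. The helper names below are hypothetical and are not arrow2 or parquet2 APIs.

```rust
/// Build the value of a `Range` header that requests the last `n` bytes.
/// Hypothetical helper, not part of arrow2/parquet2.
fn suffix_range_header(n: u64) -> String {
    format!("bytes=-{}", n)
}

/// Parsed form of a `Content-Range: bytes start-end/total` header value.
#[derive(Debug, PartialEq)]
struct ContentRange {
    start: u64,
    end: u64,           // inclusive, as in the HTTP header
    total: Option<u64>, // `None` when the server reports `*`
}

/// Parse a `Content-Range` value; returns `None` for anything malformed.
fn parse_content_range(value: &str) -> Option<ContentRange> {
    let rest = value.trim().strip_prefix("bytes ")?;
    let (range, total) = rest.split_once('/')?;
    let (start, end) = range.split_once('-')?;
    let start = start.trim().parse::<u64>().ok()?;
    let end = end.trim().parse::<u64>().ok()?;
    if end < start {
        return None;
    }
    let total = match total.trim() {
        "*" => None,
        t => Some(t.parse::<u64>().ok()?),
    };
    Some(ContentRange { start, end, total })
}
```

With this, a single request for the last 64 KB would yield both the footer bytes and, through `total`, the file length. Whether a given server or CDN honors suffix ranges and exposes `Content-Range` to browser code would still need checking.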

@kylebarron (Owner, Author) commented:

I'm currently hitting the compiler error "future cannot be sent between threads safely".

From rustwasm/wasm-bindgen#2409 (comment):

Futures generated with wasm-bindgen aren't Send so if you use async traits with wasm-bindgen you'll have to turn that off. Otherwise you'll need to arrange yourself for futures to be send by spawning tasks and such and using channels.

arrow2, at least, is designed around futures that implement Send, so it would be ideal to figure out how to make this work.

The async-executor example (https://github.com/wasm-rs/async-executor/blob/master/example/src/lib.rs) seems like a good place to start.
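
To illustrate the underlying rule in plain Rust, without wasm-bindgen: a future is `Send` only if everything it holds across an `.await` is `Send`. In the sketch below, `Rc` is a stand-in for the non-`Send` JS handles a wasm-bindgen future would carry. All function names are hypothetical, and `block_on` is a tiny busy-polling executor included only so the example runs on its own.

```rust
use std::future::Future;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Compiles only when `T` is `Send`; does nothing at runtime.
fn assert_send<T: Send>(value: T) -> T {
    value
}

/// Trivial async function used as an await point.
async fn yield_value(x: u32) -> u32 {
    x
}

/// Holds an `Arc` across the await, so the future is `Send`.
fn send_future() -> impl Future<Output = u32> + Send {
    let shared = Arc::new(1u32);
    async move { yield_value(*shared).await + *shared }
}

/// Holds an `Rc` across the await, so the future is NOT `Send`.
fn local_future() -> impl Future<Output = u32> {
    let shared = Rc::new(1u32);
    async move { yield_value(*shared).await + *shared }
}

// Uncommenting this function fails to compile, because `Rc<u32>` is not `Send`:
// fn broken() { assert_send(local_future()); }

/// Waker that does nothing; enough for futures that never actually suspend.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal single-threaded executor: polls the future until it is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Both futures run fine on a single thread; only the `Send` bound distinguishes them. Two options that I believe match the quoted advice are `#[async_trait(?Send)]` from the async-trait crate, which drops the `Send` requirement on trait futures, and `wasm_bindgen_futures::spawn_local`, which accepts non-`Send` futures. Whether either fits arrow2's traits is an open question here.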

@kylebarron (Owner, Author) commented Apr 30, 2022

Tons of progress:

  • Structure of AsyncParquetFile class. Async constructor to fetch content length and metadata.
  • Working async read_row_group! (for a single row group)

Future work:

  • Right now read_row_group takes self instead of &self because the latter gives a lifetime error: "value does not live long enough". https://stackoverflow.com/questions/42503296/value-does-not-live-long-enough
  • Figure out how to send only one request for the whole row group? I think right now it sends one request per column.
  • Check that multiple requests are indeed being made concurrently, not sequentially
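
For the single-request idea, one possible approach is to compute the smallest byte range that covers every column chunk in the row group, fetch it once, and slice each column out of the returned buffer. A minimal sketch, assuming each column chunk is described by a hypothetical `ByteRange` (start offset and length in bytes); these are illustrative types, not arrow2 or parquet2 ones:

```rust
/// Byte range of one column chunk: offset from the start of the file, and length.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteRange {
    start: u64,
    length: u64,
}

impl ByteRange {
    /// Exclusive end offset of the range.
    fn end(&self) -> u64 {
        self.start + self.length
    }
}

/// Smallest single range that covers every chunk; `None` for an empty slice.
fn covering_range(chunks: &[ByteRange]) -> Option<ByteRange> {
    let start = chunks.iter().map(|c| c.start).min()?;
    let end = chunks.iter().map(|c| c.end()).max()?;
    Some(ByteRange { start, length: end - start })
}

/// Slice one chunk's bytes out of the buffer that was fetched for `covering`.
/// Panics if `chunk` does not lie inside `covering`.
fn slice_chunk<'a>(buffer: &'a [u8], covering: ByteRange, chunk: ByteRange) -> &'a [u8] {
    let offset = (chunk.start - covering.start) as usize;
    &buffer[offset..offset + chunk.length as usize]
}
```

The trade-off is that when only a few columns are selected, the covering range may include bytes of unwanted columns that sit between them, so per-column requests could still be cheaper in that case.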

@kylebarron (Owner, Author) commented:

It looks like this isn't generally possible without consuming self. See: rustwasm/wasm-bindgen#1858 (comment), rustwasm/wasm-bindgen#1733 (comment), rustwasm/wasm-bindgen#2195 (comment).

Given this, a class-based solution will only work if the entire class can be cloned. At that point, I think it'll be simpler to have a functional API instead:

pub async fn read_row_group(
    url: String,
    content_length: usize,
    metadata: FileMetaData,
    i: usize,
) -> Result<Uint8Array, JsValue>

The inputs can be copied on the JS side. So to create an async iterator, you'd do

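const metadata = await read_parquet_metadata(url);
const batches = [];
// numRowGroups is a count, so loop over indices rather than iterating it
for (let i = 0; i < metadata.numRowGroups; i++) {
  // Arguments follow the Rust signature above; content_length is assumed to be
  // known already, and metadata is copied first if the binding consumes it
  const batch = await read_row_group(url, content_length, metadata, i);
  batches.push(batch);
}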
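
The ownership constraint can be sketched in plain Rust. As I understand it, the future behind an exported async function has to be `'static`, so it cannot borrow `&self`. In the sketch below, `FileHandle`, `require_static`, and `read_all` are hypothetical stand-ins, and `block_on` is a tiny busy-polling executor included only so the example runs without wasm-bindgen.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for the URL plus metadata needed to read a row group.
#[derive(Clone)]
struct FileHandle {
    url: String,
    num_row_groups: usize,
}

impl FileHandle {
    /// Consumes `self`, so the returned future owns its data and is `'static`.
    /// Returns a description of the request instead of doing real I/O.
    async fn read_row_group(self, i: usize) -> Result<String, String> {
        if i >= self.num_row_groups {
            return Err(format!("row group {} out of range", i));
        }
        Ok(format!("{}#row-group-{}", self.url, i))
    }
}

/// Stand-in for an API that only accepts `'static` futures.
fn require_static<F: Future + 'static>(fut: F) -> F {
    fut
}

/// Because each call consumes its receiver, clone the handle per row group.
async fn read_all(file: FileHandle) -> Result<Vec<String>, String> {
    let mut batches = Vec::new();
    for i in 0..file.num_row_groups {
        batches.push(require_static(file.clone().read_row_group(i)).await?);
    }
    Ok(batches)
}

/// Waker that does nothing; enough for futures that never actually suspend.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal single-threaded executor: polls the future until it is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

The per-call `clone()` here plays the same role as copying the inputs on the JS side. With a functional API the copy is explicit in the caller rather than hidden inside a class.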

@kylebarron kylebarron marked this pull request as ready for review July 10, 2022 20:17
@kylebarron kylebarron merged commit b49105a into main Jul 10, 2022
@kylebarron kylebarron deleted the kyle/async-reader branch July 10, 2022 21:05
@kylebarron kylebarron mentioned this pull request Jul 10, 2022