Create an Arrow Coder for Beam that allows us to create Ray Datasets #17
Thanks, I think it's worth tracking this in a Beam issue as well. Could you provide some references for Ray Datasets that would inform how an Arrow-encoded PCollection can integrate with it?
There's some very rough, superficial stuff I wrote here: https://docs.google.com/document/d/1DcuKhCPnZezIvu9vFMsM4BRdBv0kgAWewOJqRbS42GI/edit# Specifically, I suppose an integration we could have is something like the sketch below:
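A hedged sketch of one bridge that already works today: Beam materializes Arrow-backed Parquet, and Ray Datasets reads it back. The schema and the `/tmp/beam_out` path are made up purely for illustration; both `WriteToParquet` and `ray.data.read_parquet` are real APIs.

```python
import apache_beam as beam
from apache_beam.io.parquetio import WriteToParquet
import pyarrow as pa
import ray

# Illustrative schema; a real pipeline would derive this from its data.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

with beam.Pipeline() as p:
    (p
     | beam.Create([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
     | WriteToParquet("/tmp/beam_out/part", schema,
                      file_name_suffix=".parquet"))

# Ray Datasets can read the Arrow-backed Parquet files directly.
ds = ray.data.read_parquet("/tmp/beam_out")
print(ds.take(2))
```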
or something like that.
Beam has a few utilities to convert to and from Beam and Arrow schemas (see here). A first step would be to write an Arrow coder: a Beam Coder that can encode and decode Arrow RecordBatches.
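A minimal sketch of such a coder, assuming each element is a single `pyarrow.RecordBatch` serialized with Arrow's IPC stream format (the class name is illustrative, not an existing Beam API):

```python
import apache_beam as beam
import pyarrow as pa

class ArrowRecordBatchCoder(beam.coders.Coder):
    """Sketch: encode/decode a single pyarrow.RecordBatch via Arrow IPC."""

    def encode(self, batch):
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, batch.schema) as writer:
            writer.write_batch(batch)
        return sink.getvalue().to_pybytes()

    def decode(self, encoded):
        with pa.ipc.open_stream(pa.py_buffer(encoded)) as reader:
            return reader.read_next_batch()

    def is_deterministic(self):
        # Arrow IPC bytes are not guaranteed to be byte-stable across
        # pyarrow versions, so play it safe here.
        return False
```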
Then we can write a simple PTransform that converts a Beam PCollection of rows with a schema into a Beam PCollection where each element is an Arrow RecordBatch (and vice versa), as sketched below. That makes it easier to pass data into Ray's Datasets (and also from Datasets back into Beam). A second step could be to encode Beam Rows as batches of Arrow records, but we can think about that once we've done the first step.
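A hedged sketch of those two transforms, assuming elements are dicts and that the Arrow schema is supplied by the caller rather than derived from the PCollection's Beam schema (the class names are illustrative):

```python
import apache_beam as beam
from apache_beam.transforms.util import BatchElements
import pyarrow as pa

class RowsToRecordBatches(beam.PTransform):
    """Sketch: batch dict-like rows into pyarrow.RecordBatch elements."""

    def __init__(self, arrow_schema):
        self._schema = arrow_schema

    def expand(self, pcoll):
        return (pcoll
                | BatchElements(min_batch_size=100, max_batch_size=10000)
                | beam.Map(lambda rows: pa.RecordBatch.from_pylist(
                    rows, schema=self._schema)))

class RecordBatchesToRows(beam.PTransform):
    """The inverse direction: explode each RecordBatch back into dict rows."""

    def expand(self, pcoll):
        return pcoll | beam.FlatMap(lambda batch: batch.to_pylist())
```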
This coder is the first step toward allowing us to create Ray Datasets based on Beam PCollections.
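Once the pipeline produces RecordBatch elements, handing them to Ray is straightforward with APIs that exist today; here a tiny hand-built batch stands in for the pipeline's output:

```python
import pyarrow as pa
import ray

# Stand-in for RecordBatch elements collected from the pipeline above.
batches = [pa.RecordBatch.from_pylist([{"id": 1}, {"id": 2}])]

# ray.data.from_arrow builds a Dataset backed by Arrow data.
ds = ray.data.from_arrow(pa.Table.from_batches(batches))
print(ds.take(2))
```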
fyi @TheNeuralBit