async zarr #1104
(I'm not sure if this needs any specs discussion, since it provides an alternative user API, but it doesn't actually change what zarr does or what the metadata looks like.)
I think this might be more appropriate to discuss under the zarr-python repository, since it is just about the zarr-python API. You might find it interesting to look at the tensorstore Python API for ideas, as tensorstore provides an async API. An alternative to consider would be to just address the limitation in pyscript directly: I think with the help of a separate webworker thread it is possible to emulate sync fetch requests. I think there are also options for compiling C code to "async" WebAssembly, though that does hurt performance.
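(For concreteness, here is a rough, non-authoritative illustration of the tensorstore model mentioned above, where every operation returns a Future that is also awaitable. The spec details below, in-memory kvstore, dtype, and shape, are placeholders for the example, not a recommendation for zarr-python's API.)

```python
# Hedged illustration of tensorstore's futures-based async API.
# All spec values here are assumptions made for the example only.
import tensorstore as ts


async def demo():
    store = await ts.open(
        {"driver": "zarr", "kvstore": "memory://"},
        create=True, dtype=ts.uint8, shape=[100, 100],
    )
    await store[:10, :10].write(1)        # write returns an awaitable future
    block = await store[:10, :10].read()  # read returns a future of a numpy array
    return block
```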
I am happy to have this in zarr-python instead. How do you imagine using the tensorstore model? The problem we are facing is being forced to call zarr's synchronous code in an async context, so using another Futures abstraction sounds like even more complexity.
In general, getting IO to work well in pyscript is an unsolved problem, and webworkers-as-threads might be part of the solution. Certainly, that's the only way the browser allows binary sync connections. To be sure, though: we do not want sync requests and to pay the latency cost for every single chunk.
We are stuck with the sync Python API being called from an async context, so this is a Python programming problem. Anything lower level will not help us. As I said at the start though, pyscript is not the only reason to want this.
I think from an API perspective futures are the most natural choice. You can always create an async API on top of a sync API using a thread pool. In general it might be best to gradually add async APIs to zarr-python from the top down, using thread pools as needed to convert lower-level components from sync to async. Ultimately we would want to add async store implementations so that there are no sync I/O components left. The codecs are pure computation and don't need to be converted; they would just always require a thread pool.
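(A minimal sketch of what I mean by "async API on top of a sync API using a thread pool"; the store and key names are placeholders, not real zarr-python internals.)

```python
# Hedged sketch: wrap a blocking store read so async callers can await it.
import asyncio


async def async_get(store, key):
    # store[key] is the usual blocking read of one chunk's bytes
    return await asyncio.to_thread(store.__getitem__, key)


async def get_many(store, keys):
    # several blocking reads issued concurrently on the default thread pool
    return await asyncio.gather(*(async_get(store, k) for k in keys))
```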
My understanding is that pyscript is built by compiling CPython to WebAssembly, where sync Python corresponds to sync WebAssembly/JavaScript. I was proposing that instead it could be compiled such that sync Python corresponds to async JavaScript. Thinking about it more though, I realize that in addition to being a major re-architecting of pyscript, it would also come with major restrictions on "re-entering" Python during other operations, and therefore wouldn't really be practical.
The fsspec store and indeed the JS HTTP fetch methods are async, so we actually already have this at the bottom of the stack. Making the compute part "concurrent" isn't useful; it's the IO that matters. You are advocating a completely async alternate codepath all the way through zarr? I am trying to make use of zarr's simplicity to implement something that works quickly without changing the core. Question to everyone: what does the zarr.js API look like, is there any async there? I would assume there must be.
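(To show what "already at the bottom of the stack" means, a hedged sketch of fetching several chunk URLs concurrently with fsspec's async layer; the URLs are placeholders and this is not zarr-python code.)

```python
# Sketch using fsspec's async HTTP filesystem directly.
import fsspec


async def fetch_chunks(urls):
    fs = fsspec.filesystem("http", asynchronous=True)
    session = await fs.set_session()  # required when used in asynchronous mode
    try:
        # _cat with a list issues the requests concurrently and
        # returns a dict of url -> bytes
        return await fs._cat(urls)
    finally:
        await session.close()
```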
AIUI PR ( #534 ) was exploring the …
I have started writing a blog post about my implementation; it might be out this afternoon. That won't say anything new to people who are already on this thread, but might gain more general interest. Specifically for pyodide/pyscript, I think it's still fair to say that the IO story is very far from solved for typical pydata libraries.
This was a great discussion. Pointing folks to the continuation of this idea slated for …
A quick sketch of how we can couple zarr with async code. This is aimed slightly at pyscript, but can be useful in its own right: for instance, I asked a while ago what it would take to fetch chunks concurrently not just from one array but, say, one chunk each from multiple arrays in a dataset.
This sketch is only for reading...
Outline:

- __getitem__ produces an AsyncArray
- the async path runs from _get_selection (which calls IO) up to __getitem__ (which is the user-facing API)
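(A rough, hypothetical rendering of the shape the outline above describes. The AsyncArray name and wrapper are illustrative only; here the blocking __getitem__ is pushed to a thread pool just so the example is self-contained, whereas the actual sketch would make the path from _get_selection, where the IO happens, up to __getitem__ natively async.)

```python
# Hedged sketch of the user-facing side: __getitem__ returns an awaitable,
# so selections from several arrays can be gathered concurrently.
import asyncio
import zarr


class AsyncArray:
    def __init__(self, arr: zarr.Array):
        self._arr = arr

    def __getitem__(self, selection):
        # returning a coroutine lets callers write: data = await arr[selection]
        return asyncio.to_thread(self._arr.__getitem__, selection)


async def main():
    grp = zarr.open_group("example.zarr", mode="r")  # path is illustrative
    a, b = AsyncArray(grp["a"]), AsyncArray(grp["b"])
    # one selection from each of several arrays, fetched concurrently
    xa, xb = await asyncio.gather(a[:64], b[:64])
    return xa, xb
```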