Include fragment metadata when pickling array #614
Comments
Hi @gatesn, Right now the pickling implementation only serializes the URI and some very high level metadata (key and timestamp) so that the array can be reopened on the worker.
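For illustration, here is a minimal sketch of that pattern: pickle only what is needed to reopen, and rebuild everything else on the worker. `ArrayHandle`, its `loaded_state` attribute, and the example URI are hypothetical stand-ins, not the actual TileDB-Py classes or implementation.

```python
import pickle


class ArrayHandle:
    """Hypothetical stand-in for an opened array (not the TileDB-Py class)."""

    def __init__(self, uri, key=None, timestamp=None):
        self.uri = uri
        self.key = key
        self.timestamp = timestamp
        # Stands in for state loaded at open time, e.g. fragment metadata.
        self.loaded_state = self._open()

    def _open(self):
        # A real implementation would read from storage here.
        return {"uri": self.uri, "opened_at": self.timestamp}

    def __reduce__(self):
        # Pickle only the arguments needed to reopen the array;
        # loaded_state is deliberately left out and rebuilt on unpickle.
        return (ArrayHandle, (self.uri, self.key, self.timestamp))


controller = ArrayHandle("s3://bucket/array", key=None, timestamp=1_600_000_000_000)
worker = pickle.loads(pickle.dumps(controller))
```

In this sketch the worker calls `_open()` again when it unpickles, so anything loaded at open time is fetched once per worker rather than shipped with the pickle.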
That's an interesting idea. As far as I know, we don't currently expose an API in libtiledb to pre-populate the fragment metadata, but I'll ping @stavrospapadopoulos about that and see what he thinks.
We serialize the opened-at timestamp, so workers should have consistent views on the data. I've just created an issue to add support for timestamp ranges (start/end timestamp) which is a new feature in TileDB 2.3.
Yes
That would be great. Happy to discuss ideas/proposals and review PRs, and also call/chat for higher bandwidth if needed. If you are specifically interested in working on either of the above issues (start/end timestamp serialization, or SparseArray serialization), let us know - otherwise we'll slot those in for the next few weeks.
@ihnorton this is where the Cap'n Proto serialization of the objects comes in for items like these.
Ah thanks for the info, I'll need to check some timelines but will let you know on the PR front.
Isn't it possible to open the array 'in the past' and write additional data? Which would then be included by the workers? I realise in practice this is unlikely to happen, but with a small amount of clock skew and a read-after-write workflow the chances of hitting this increase slightly.
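To make the scenario concrete, here is a small sketch under a simplified model in which a reader opened at a given timestamp sees every fragment stamped at or before it. The function name and the integer timestamps are made up for illustration and do not call any TileDB API.

```python
def visible_fragments(fragment_timestamps, opened_at):
    """Simplified model: a reader sees fragments stamped at or before opened_at."""
    return sorted(ts for ts in fragment_timestamps if ts <= opened_at)


fragments = [10, 50]   # fragments that exist when the controller opens
opened_at = 100        # the opened-at timestamp that gets pickled

controller_view = visible_fragments(fragments, opened_at)

# After the controller has opened, a writer with a lagging clock (or one
# writing 'in the past') adds a fragment stamped earlier than opened_at.
fragments.append(90)

# A worker reopening at the same timestamp now sees the extra fragment.
worker_view = visible_fragments(fragments, opened_at)
```

Under this model the two views differ even though both use the same opened-at timestamp, which is the read-after-write case described above.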
Yes. Given the eventually-consistent design, clock skew is an issue to be aware of in general unless workers are synchronized. Your idea to serialize fragment info would help, because the controller and worker views would then be consistent unless someone intentionally overwrote existing timestamp(s). We are working on an API to open an array with a (sub)set of fragment IDs, which could be used for this purpose; or the
I'm curious about the comment in this test that sort of mentions that array data shouldn't be serialized when pickling. Does that mean the contents of the array itself? Or does that also cover fragment metadata?
TileDB-Py/tiledb/tests/test_libtiledb.py
Lines 3050 to 3053 in 0a150c0
I think it would be neat if the pickled array included the fragment metadata, so each worker doesn't need to separately download the files from VFS (which may lead to consistency problems?)
Also, is there a reason this is only for Dense arrays currently? Or just because that's the Dask integration that exists? Happy to look at contributing anything along these lines :)