Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include fragment metadata when pickling array #614

Open
gatesn opened this issue Jul 1, 2021 · 4 comments
Open

Include fragment metadata when pickling array #614

gatesn opened this issue Jul 1, 2021 · 4 comments

Comments

@gatesn
Copy link
Contributor

gatesn commented Jul 1, 2021

I'm curious about the comment in this test that sort of mentions that array data shouldn't be serialized when pickling. Does that mean the contents of the array itself? Or does that also cover fragment metadata?

class PickleTest(DiskTestCase):
# test that DenseArray and View can be pickled for multiprocess use
# note that the current pickling is by URI and attributes (it is
# not, and likely should not be, a way to serialize array data)

It would be neat I think if the pickled array included fragment metadatas so each worker doesn't need to separately download the files from VFS (which may lead to consistency problems?)

Also, is there a reason this is only for Dense arrays currently? Or just because that's the Dask integration that exists? Happy to look at contributing anything along these lines :)

@ihnorton
Copy link
Member

ihnorton commented Jul 1, 2021

Hi @gatesn,

Right now the pickling implementation only serializes the URI and some very high level metadata (key and timestamp) so that the array can be reopened on the worker.

It would be neat I think if the pickled array included fragment metadatas so each worker doesn't need to separately download the files from VFS (which may lead to consistency problems?)

That's an interesting idea. As far as I know, we don't currently expose an API in libtiledb to pre-populate the fragment metadata, but I'll ping @stavrospapadopoulos about that and see what he thinks.

which may lead to consistency problems?

We serialize the opened-at timestamp, so workers should have consistent views on the data. I've just created an issue to add support for timestamp ranges (start/end timestamp) which is a new feature in TileDB 2.3.

Or just because that's the Dask integration that exists?

Yes

Happy to look at contributing anything along these lines

That would be great. Happy to discuss ideas/proposals and review PRs, and also call/chat for higher bandwidth if needed. If you are specifically interested in working on either of the above issues (start/end timestamp serialization, or SparseArray serialization), let us know - otherwise we'll slot those in for the next few weeks.

@Shelnutt2
Copy link
Member

@ihnorton this is where the cap'n proto serialization of the objects comes for items like these.

@gatesn
Copy link
Contributor Author

gatesn commented Jul 1, 2021

Ah thanks for the info, I'll need to check some timelines but will let you know on the PR front.

We serialize the opened-at timestamp, so workers should have consistent views on the data. I've just created an issue to add support for timestamp ranges (start/end timestamp) which is a new feature in TileDB 2.3.

Isn't it possible to open the array 'in the past' and write additional data? Which would then be included by the workers?

I realise in practice this is unlikely to happen, but with a small amount of clock skew and a read-after-write workflow the chances of hitting this increase slightly.

@ihnorton
Copy link
Member

ihnorton commented Jul 2, 2021

Isn't it possible to open the array 'in the past' and write additional data? Which would then be included by the workers?

Yes. Given the eventually-consistent design, clock skew is an issue to be aware of in general unless workers are synchronized. Your idea to serialize fragment info would help, because the controller and worker views would then be consistent unless someone intentionally overwrote existing timestamp(s).

We are working on an API to open an array with a (sub)set of fragment IDs, which could be used for this purpose; or the Array serialization mechanism in TileDB core could be used, as mentioned by @Shelnutt2 (that API is not currently exposed in Python, but only a few functions would need to be wrapped).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants