Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZEP0002 Review #254

Closed
normanrz opened this issue Jul 27, 2023 · 57 comments
Closed

ZEP0002 Review #254

normanrz opened this issue Jul 27, 2023 · 57 comments

Comments

@normanrz
Copy link
Member

normanrz commented Jul 27, 2023

Hi everyone,

I hope you’re doing well.

Thank you all for taking the time to review the ZEP0001 and V3 specification. The V3 specification is approved and accepted by the ZSC, ZIC and the Zarr community.

The initial discussion on sharding dates back to 11/2021; please see zarr-developers/zarr-python#877. There have been major developments since the proposal of sharding, some of them are:

Now, we want to put forth the ZEP0002 - Sharding Codec for voting.

We have created this issue to track the approvals from the ZSC, ZIC and the broader Zarr community.
Specific technical feedback on sharding should be made via narrowly scoped issues on the zarr-specs repository that link to this issue.

Now, according to the section, ‘How does a ZEP becomes accepted’ - ZEP0000, a ZEP must satisfy three conditions for approval:

  • Unanimous approval of the Zarr Steering Council
  • Majority approval of the Zarr Implementations Council
  • And, no vetos from the Zarr Implementations Council

As an implementation council member, you have three options for your vote:

  • YES - Approve the ZEP0002. This indicates that your implementation intends to implement support for sharding codec.
  • ABSTAIN - This would be appropriate if you do not want to implement the sharding codec.
  • VETO - As a ZIC member, you can veto a ZEP, including this one.

We request you, the ZIC, and the ZSC review the ZEP0002 and let us know your thoughts. We’ve listed steps to read and understand the sharding completely. They are as follows:

  • Please read and review the ZEP0002. This document contains various vital sections, e.g. Motivation and Scope, Usage and Impact, Backward Compatibility and a detailed description with illustrations.
  • After this, please review the specification document here: https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/v1.0.html.
  • Gather your thoughts
  • Optional but recommended: begin implementing the ZEP experimentally.
  • Cast your vote by dropping a comment on the issue here.

We understand that the whole process takes time, so we’ve decided to have a relaxed timeline for ZEP0002 voting. We’d appreciate your vote by 31 October 2023, 23:59:59 AoE.

Example implementations

Please let us know if there are any questions. Thank you for your time.

Voting status:

Github user Project Vote
@joshmoore ZSC YES
@ryan-williams ZSC YES
@alimanfoo ZSC YES
@rabernat ZSC YES
@jakirkham ZSC YES
@andersy005 freeman-lab/zarr-js YES
@axtimwalde saalfeldlab/n5-zarr YES
@aschampion sci-rs/zarr YES
@meggart JuliaIO/Zarr.jl YES
@jbms google/tensorstore YES
@constantinpape constantinpape/z5 ABSTAIN
@WardF Unidata/netcdf-c + Unidata/netcdf-java YES
@davidbrochart xtensor-stack/xtensor-zarr ABSTAIN
@grlee77 zarr-developers/zarr-python YES
@manzt gzuidhof/zarr.js YES
@normanrz
Copy link
Member Author

CC: @zarr-developers/implementation-council @zarr-developers/steering-council

@clbarnes
Copy link
Contributor

clbarnes commented Aug 2, 2023

Just a quick sanity check - this spec depends heavily on storage backends supporting Range requests; suffixes in particular (for getting the shard location footer). Over on this issue apache/arrow-rs#4611 , it's been suggested that common stores like S3 don't support suffixes. I don't have a store/ account I could use to test - could anyone give it a go? If not it's obviously a HEAD then a non-suffix GET Range request in serial, which isn't ideal. But possibly not that big an issue when accounting for the lack of multipart range support.

From apache/arrow-rs#4612 it looks like multipart ranges aren't supported by any cloud provider (have yet to interact with a server implementation which does tbh, and you need to jump through a lot of hoops to interpret the response as, according to the HTTP spec, they can basically send you whatever they want) and that particular library gets around it by making a bunch of different requests in parallel. Presumably that isn't a blocker in and of itself, although making hundreds or thousands of requests to pull down a region of a single shard is hardly ideal, especially as any given backend could just decide to send you the entire shard, or indeed just about any subset of it, with each request!

@normanrz
Copy link
Member Author

normanrz commented Aug 2, 2023

Just a quick sanity check - this spec depends heavily on storage backends supporting Range requests; suffixes in particular (for getting the shard location footer). Over on this issue apache/arrow-rs#4611 , it's been suggested that common stores like S3 don't support suffixes. I don't have a store/ account I could use to test - could anyone give it a go? If not it's obviously a HEAD then a non-suffix GET Range request in serial, which isn't ideal. But possibly not that big an issue when accounting for the lack of multipart range support.

All major cloud providers (including S3, GCS, Azure) and static file HTTP server applications support requesting suffixes.
Here is an example against S3:

curl -H 'Range: bytes=-524292' https://static.webknossos.org/data/zarr_v3/l4_sample/color/1/c/0/3/3/1 | wc -c

From apache/arrow-rs#4612 it looks like multipart ranges aren't supported by any cloud provider (have yet to interact with a server implementation which does tbh, and you need to jump through a lot of hoops to interpret the response as, according to the HTTP spec, they can basically send you whatever they want) and that particular library gets around it by making a bunch of different requests in parallel. Presumably that isn't a blocker in and of itself, although making hundreds or thousands of requests to pull down a region of a single shard is hardly ideal, especially as any given backend could just decide to send you the entire shard, or indeed just about any subset of it, with each request!

Multipart ranges are not as widely supported. I know that S3 doesn't support it. Issuing single range requests per inner chunk is equivalent to using Zarr without sharding. So, I would argue sharding doesn't make the situation worse. On the contrary, implementations can choose to coalesce the byte ranges of multiple chunks into single requests to reduce the number of requests. This works especially well, if the chunks are laid out in an order that matches the common access pattern (e.g. Z-ordering). Implementations can also download entire shards. In our standard configuration that means reducing the number of requests by a factor of ~32,000.

@clbarnes
Copy link
Contributor

clbarnes commented Aug 2, 2023

Issuing single range requests per inner chunk is equivalent to using Zarr without sharding.

Yes, so long as the existence of sharding doesn't change users' heuristics on (sub)chunk layout. I suppose it comes down to whether you see sharding as a way to coalesce your small chunks into single files, or as a way to access inner regions of your large chunks!

In our standard configuration that means reducing the number of requests by a factor of ~32,000.

The "worst case" I was thinking about was when you need about half the chunk - clients may want some internal strategy trading off making thousands of small requests to download just what you need, or downloading considerably more than you need in 1 request to then slice the result.

@JackKelly
Copy link

JackKelly commented Aug 2, 2023

clients may want some internal strategy trading off making thousands of small requests to download just what you need, or downloading considerably more than you need in 1 request to then slice the result

Yes, and perhaps the Zarr implementation would allow the user to specify the threshold. e.g. "I'm reading data from a RAID6 array of HDDs, which has very high bandwidth but also terrifyingly high latencies for random reads, so it's faster to read sequentially, even if I throw away 90% of the chunks after reading them from disk".

@constantinpape
Copy link

Hi everyone,
I will go on vacation next week, and don't have time to really read up on the changes here.
Overall I am very excited about sharding and its adoption in zarr, as this is a feature that I would extensively use myself / in our group down the line. I just have one high-level question before a formal vote:
How do the changes here affect zarr without sharding? I assume the default will still be to use normal chunking, and libraries that don't support sharding yet can just use the rest of the zarr spec as usual? (And would need to be adapted to raise an appropriate exception if it encounters a sharded zarr array)

@normanrz
Copy link
Member Author

normanrz commented Aug 8, 2023

Sharding is specified as a new codec. That means that Zarr without sharding is not affected. Libraries that don't support it can still use the normal chunking. Of course, libraries that don't support sharding will not be able to open/read arrays that have been created with sharding.

@constantinpape
Copy link

Thanks for the clarification @normanrz !
Than I vote ABSTAIN (as I don't have the capacity to implement it in z5 myself), but with full support for going this direction!

@manzt
Copy link
Member

manzt commented Aug 10, 2023

I vote YES! Currently migrating zarrita.js (future for zarr.js) to ZEP0001 and hope to have the capacity to support ZEP0002.

@jbms
Copy link
Contributor

jbms commented Aug 17, 2023

I vote yes for tensorstore and neuroglancer!

@meggart
Copy link
Member

meggart commented Aug 17, 2023

I vote YES for Zarr.jl , it is definitely something I would need and use, but can not promise an implementation time line.

@manzt
Copy link
Member

manzt commented Aug 23, 2023

I've implemented the sharding codec in manzt/zarrita.js v0.3.2. It is compatible with the latest from scalableminds/zarrita.

@jbms
Copy link
Contributor

jbms commented Sep 15, 2023

tensorstore now supports zarr v3 with sharding.

@andersy005
Copy link
Member

I vote yes for zarr-js. We recently added experimental support for Zarr v3 including support for sharding codec

@jbms
Copy link
Contributor

jbms commented Oct 6, 2023

Neuroglancer also now supports zarr v3 with sharding.

@jbms
Copy link
Contributor

jbms commented Oct 6, 2023

Is there by any chance an interesting public v3 sharded image dataset, that could be used as an example?

@normanrz
Copy link
Member Author

normanrz commented Oct 9, 2023

We have a public EM dataset at https://static.webknossos.org/data/zarr_v3/l4dense_motta_et_al_demo. Here is the data in Webknossos: https://webknossos.org/links/yo8nbTpjtSm_F210

There is no multiscale metadata for v3, yet. So this is what the hierarchy looks like:

├── color
│   ├── 1
│   ├── 2-2-1
│   ├── 4-4-2
│   ├── 8-8-4
│   ├── 16-16-8
│   ├── 32-32-16
│   ├── 64-64-32
│   ├── 128-128-64
│   ├── 256-256-128
│   ├── 512-512-256
│   └── 1024-1024-512
└── segmentation
    ├── 1
    ├── 2-2-1
    ├── 4-4-2
    ├── 8-8-4
    ├── 16-16-8
    ├── 32-32-16
    ├── 64-64-32
    ├── 128-128-64
    ├── 256-256-128
    ├── 512-512-256
    └── 1024-1024-512

The arrays use chunk shape [1, 32, 32, 32] and shard shape [1, 1024, 1024, 1024]. The EM (color) data is 195G and the segmentation is 10G in total.

EM data by Motta et al., Science 2019, segmentation by scalable minds.

@jbms
Copy link
Contributor

jbms commented Oct 10, 2023

Thanks. I was able to access that dataset in Neuroglancer.

Note though that there appears to be a misconfiguration with your server for e.g. https://static.webknossos.org/data/zarr_v3/l4dense_motta_et_al_demo/color/1/zarr.json (which seems to involve cloudfront and s3). Specifically the issue is that cloudfront is caching the responses without regard to the Origin request header, but the response (namely the Access-Control-Allow-Origin header depends on the Origin request header). In particular, if you make a request with Origin: https://whatever.com, it will return Access-Control-Allow-Origin: https://whatever.com. However, that response will be cached, and a subsequent request with a different origin will still receive Access-Control-Allow-Origin: https://whatever.com which won't have the desired effect. Since these requests do not require cookies to be sent, it would also work to return Access-Control-Allow-Origin: *, which would not need to depend on the Origin.

As far as OME, I know you are working on an update to the OME-zarr spec for zarr v3. Neuroglancer also supports the existing OME-zarr metadata for zarr v3, exactly as it is supported for zarr v2.

@jbms
Copy link
Contributor

jbms commented Oct 10, 2023

Note: If you have a strong objection to Neuroglancer supporting the existing OME-zarr metadata with zarr v3, I am open to changing that.

@normanrz
Copy link
Member Author

Thanks! I fixed the header caching and added OME v0.4 metadata to color and segmentation.

@jbms
Copy link
Contributor

jbms commented Oct 12, 2023

Thanks, I'm able to view it in neuroglancer now. Regarding the OME metadata, I think you need to add a half-voxel translation to each scale to account for the fact that OME assumes the origin is in the center of a voxel, in order to properly align the scales.

@WardF
Copy link

WardF commented Oct 24, 2023

I vote YES on behalf of the Unidata seat at the table. Thanks!

@normanrz
Copy link
Member Author

Thanks to everybody, who already voted! Everybody else, I am looking forward to your votes. Please note that the deadline for voting is in less than a week. Thanks!

cc @zarr-developers/implementation-council @zarr-developers/steering-council

@rabernat
Copy link
Contributor

I vote in favor.

@axtimwalde
Copy link

I vote in favor to not further block this extension.

I find some aspects of the current extension proposal unfortunate and hope for an improved sharding extension in the future. Here are my opinions:

  1. The line of arguments for introducing shards is not compelling and lacks quantifiable justification: e.g. on a filesystem that can store 2.4TB, 10mio files are not too many files. Cloud storage providers, in our experience, do not currently impose limits on the number of objects and keys.
  2. The specification as a codec for arrays of dtype causes two issues:
  3. Since a shard is now a chunk of an array of dtype, implementers of extensible codec readers could fall back to use case scenario 1, i.e. read and decompress the entire shard. In that scenario, the shard is a chunk, existing codecs often create internal compression blocks already, so shards are obsolete. In any implementation of a zarr-library that does not require copy-conversion of chunks after decompression reading shards completely is the same as reading chunks, i.e. shards are obsolete.
  4. The fact that the chunk-size in the shard has to align with the shard-size by spec leaves room for complication and error. It'd be preferable to store the size of the grids in less fragile ways, e.g. storing an array of shards of dtype as (1) an array of chunks of chunks of dtype or (2) as an array of arrays of chunks of dtype would simplify and generalize this. Both ideas can be stacked, so shards of shards would be possible. I understand that this is difficult with the v3 spec (in particular the use of numcodec strings for dtypes) at this time and recognize this difficulty by voting in favor of this extension.
  5. Storing the fix-size index at the end of the shard instead of at the beginning feels like the worst of two possible decision. It requires that the size of the stream is known to read just the index, and it requires that the entire index must be rewritten if a chunk grows (instead of just all the following chunks).
  6. The documentation about checksums is incomplete and confusing.

@jbms
Copy link
Contributor

jbms commented Oct 27, 2023

We could add an "index_location" configuration option that may be either "start" or "end", defaulting to "end". Would that address your concerns?

Why not just allow for an arbitrary byte offset where 0 indicates the "start"?

Since it is a configuration option, it would presumably have to be the same for all shards, which means that anything other than "start" or "end" is unlikely to be useful, and "end" couldn't be indicated by a fixed byte offset.

@mkitti
Copy link
Contributor

mkitti commented Oct 27, 2023

Could end be the additive inverse of the number of the byte length of the index? Basically you would just forward that negative value to be the Range field of the request. We could just make it the exact HTTP Range request that one would use to obtain the index.

@clbarnes
Copy link
Contributor

The line of arguments for introducing shards is not compelling and lacks quantifiable justification: e.g. on a filesystem that can store 2.4TB, 10mio files are not too many files

Our institute uses cephfs for its large-scale cluster-accessible storage, which gets grumpy if you have too many files. Strictly I think that file counts only impact performance when there are too many in one directory (which is much more likely with V2-style dot-separated chunk indices), but in practice cephfs' administration tools impose quotas on total number of files. (LMB example here - not sure how modern this setup is or how common similar types of storage are elsewhere)

@normanrz
Copy link
Member Author

We could add an "index_location" configuration option that may be either "start" or "end", defaulting to "end". Would that address your concerns?

Why not just allow for an arbitrary byte offset where 0 indicates the "start"?

Since it is a configuration option, it would presumably have to be the same for all shards, which means that anything other than "start" or "end" is unlikely to be useful, and "end" couldn't be indicated by a fixed byte offset.

I like the idea of having an index_location option. That would also be helpful for some of our use cases.
@mkitti What would be your use case for a numeric value instead of start/end?

  1. The line of arguments for introducing shards is not compelling and lacks quantifiable justification: e.g. on a filesystem that can store 2.4TB, 10mio files are not too many files. Cloud storage providers, in our experience, do not currently impose limits on the number of objects and keys.

Our original motivation: Some file systems that hold multiple petabytes are configured to have larger block sizes (e.g. 2MB). For the chunk sizes that we need for interactive visualization (e.g 64,64,64) and assuming a uint8 datatype, storing each chunk as one file (=262kB) would be wasteful. Also, prior to sharding we were constanly running out of inodes on our file system.

@normanrz
Copy link
Member Author

I added a proof-of-concept for the index_location option to zarrita: https://github.com/scalableminds/zarrita/compare/sharding-index-location

@mkitti
Copy link
Contributor

mkitti commented Oct 28, 2023

@mkitti What would be your use case for a numeric value instead of start/end?

  1. We could reference a chunk index in the middle of the shard.
  2. We could append a chunk to the end of the file without rewriting the chunk index.
  3. We could have more than one chunk index in a shard.
  4. For a HDF5 file, we may be able to access multiple datasets (arrays) within the file with just the sharding codec.

@normanrz
Copy link
Member Author

@mkitti What would be your use case for a numeric value instead of start/end?

  1. We could reference a chunk index in the middle of the shard.
  2. We could append a chunk to the end of the file without rewriting the chunk index.
  3. We could have more than one chunk index in a shard.
  4. For a HDF5 file, we may be able to access multiple datasets (arrays) within the file with just the sharding codec.

Since the codec configuration, including the index_location, is per array, these seem to me only practical for single-shard arrays?

@jbms
Copy link
Contributor

jbms commented Oct 29, 2023

Another consideration is that both start and end are relatively easy to support when writing but an arbitrary byte offset would be kind of tricky --- you may have to add padding bytes.

@mkitti
Copy link
Contributor

mkitti commented Oct 29, 2023

Yes, I understand. I'm imagining a scenario where there is an existing HDF5 file that is symlinked and used as a shard in multiple arrays.

That said a few of Stephan's scenarios include the index somewhere in the middle of the file, so having the index at an arbitrary offset would help address those.

Implementations may need to calculate the length of the chunk index anyways so is checking if the location is the negative length really that different?

I think the main burden is having to validate another parameter.

My proposal is that we accept anything that is a valid value for a HTTP Range request.

@mkitti
Copy link
Contributor

mkitti commented Oct 29, 2023

Another consideration is that both start and end are relatively easy to support when writing but an arbitrary byte offset would be kind of tricky --- you may have to add padding bytes.

I'm confused. Wouldn't the writer be the one setting the indexLocation?

@jbms
Copy link
Contributor

jbms commented Oct 29, 2023

You might want to write to an existing array created by a different implementation.

Which of Stephan's points relates to having an index in the middle?

@axtimwalde
Copy link

@mkitti I agree that doing this as a codec enables other shard codecs in the future, this is great. Sorry if I wasn't clear and caused confusion. None of my scenarios demand an index in the middle and I agree with everybody else that this sounds complicated and does not generalize across shards. Start and end as options sounds great.

Thanks for the updates on the many files argument. It'd be great if specific cases where this is relevant could be listed. For our use of AWS S3, and the institute managed file system, we try to reach 1MB blocks and consult with the providers and administrators to make it so that this is ok. The streaming-speed argument holds and is most important for the streaming case (which can be lessened by parallel asynchronous access), random access of single chunks would not be accelerated by shards. I haven't found strict rules about number of files/ keys in the S3 or GC docs but I probably haven't looked carefully enough. Generally, I believe that it would be great to have concrete real world examples where this is useful. @clbarnes ' hostile admin example may be a good one.

@mkitti
Copy link
Contributor

mkitti commented Oct 30, 2023

There seems to be a general preference for the index to be at the end from the others, so if we ever needed to add chunks to the end of the file, then perhaps having the index in the middle is at least a temporary scenario before a rewrite.

My abstractions generally do not depend on the index anywhere in particular. Kerchunk exports chunk information to an external file. I also have a program to move HDF5's chunk index to an arbitrary location in the file.

My understanding is that the Betzig lab, or perhaps more specifically the Advanced Bioimaging Center at UC Berkeley, also encountered some issues with file limits on Lustre based clusters.

The practical limits probably should be directory based, but from a file system perspective.

The other complaint I have heard from storage admins is a relatively high amount of metadata IOPs hitting thr file systems, basically stat calls. This problem is avoided on object storage systems due to the use of buckets rather than per file access control.

@mkitti
Copy link
Contributor

mkitti commented Oct 30, 2023

Searching around, I found another anecdote of someone running out of inodes when using Zarr:
pangeo-data/pangeo#659

The solution there was to use ZipStore.

Incidentally, zip files also have a central directory at the end of a file.

I'm also reminded here of @rabernat 's benchmarks of the initial implementation in zarr-python. I would be interested in hearing how the other implementations of sharding have avoided the pitfalls that Ryan encountered.

@davidbrochart
Copy link
Contributor

Hi everyone,
I vote ABSTAIN, since I don't have time to work on xtensor-zarr anymore.

@aschampion
Copy link

Just to amplify @normanrz's example, since that was exactly what motivated my and @clbarnes original interest in sharding. The block allocation and inode limits of our various network filesystems meant that 2-4MB files were the best common denominator, but this made chunks far larger than optimal for remote visualization. I also see this as somewhat of an amelioration on getting in-memory chunk-size to be comparable for multiple datasets for computational purposes, when compressed size may differ greatly for efficient storage purposes (e.g., raw microscopy vs. seg), or more generally, some accommodation for memory-awareness/striding.

I vote YES on behalf of sci-rs/zarr.

@joshmoore
Copy link
Member

Happy 🎃, everyone.

I vote YES for the ZSC.

I also assume that there will be a follow-on PR to introduce the index location. The exact process of those adjustments is still a bit up in the air. I’d propose as with #263 that we will ping the ZIC for votes or vetoes there.

And in general, as with ZEP1, please keep any further clarifications and questions coming as implementations are written. But I think we’ll all be quite enthused to have another ZEP signed off on. Thanks, all!

jbms commented 3 weeks ago
Note: If you have a strong objection to Neuroglancer supporting the existing OME-zarr metadata with zarr v3, I am open to changing that.

Though this is more a discussion for elsewhere, I personally don’t see any issue with having support for that combination, but I’d highly suggest we not expose the community to that mix until Norman’s NGFF spec is decided on (i.e. let’s not write or publish them)

@joshmoore
Copy link
Member

I vote YES for the ZSC.

Sorry for the confusion, I should have said, "as a member of the ZSC". I've updated the description but also pinged all of the remaining voters.

@ryan-williams
Copy link
Member

ryan-williams commented Oct 31, 2023

I vote YES (and updated OP), thanks all!

@alimanfoo
Copy link
Member

Hi all, I vote YES as a member of the ZSC. Congratulations on the degree of consensus achieved and on the progress towards multiple implementations. I note there are still technical dimensions to be explored but am confident this is a good step forward.

@jakirkham
Copy link
Member

I vote YES

Appreciate all the hard work everyone has done here. Am sure implementers and users will appreciate all the thought and effort that has gone into this implementation. This is a major achievement! 👏

Would make two suggestions we can discuss separately. Have filed new issues to discuss those items independently:

Again thanks for all of your hard work and congratulations! 🎉

@grlee77
Copy link
Contributor

grlee77 commented Nov 1, 2023

Hi, all. Sorry about missing the deadline on this.

I was in favor last time I looked at it, but let me review the latest version this evening before officially voting.

@grlee77
Copy link
Contributor

grlee77 commented Nov 1, 2023

I vote YES as a ZIC member, but cannot personally commit immediate effort toward a codec-based sharding implementation in zarr-python.

I also agree with above comments that index_location with start or end options seems like a reasonable addition.

Congratulations to the ZEP0002 authors! I found the proposal be well written and the codec-based version was easier to follow than an earlier draft I had read.

@normanrz
Copy link
Member Author

normanrz commented Nov 2, 2023

This concludes the voting process. ZEP2 has been accepted by ZIC and ZSC 🎉 . Thanks everybody for reviewing the specification, providing feedback and participating in the voting process!

I am looking forward to seeing the sharding codec being implemented in the various Zarr implementations.

As already mentioned by @joshmoore, there may be smaller changes (e.g. index_location) before the spec is finalized.

@joshmoore
Copy link
Member

In the spirit of looking forward to an exciting new year, I'm going to close this issue. If anyone has concerns about that, please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests