Spec version 1 #17
@@ -1,15 +1,16 @@
# fsspec-reference-maker

Functions to make reference descriptions for ReferenceFileSystem

Proposed spec for the structure required by ReferenceFileSystem:
### Version 0

```
Prototype spec for the structure required by ReferenceFileSystem:

```json
{
  "key0": "data",
  "key1": {
    ["protocol://target_url", 10000, 100]
  }
  "key1": ["protocol://target_url", 10000, 100]
}
```
where:
@@ -18,7 +19,7 @@ where:

For example, Zarr data in this proposed spec might be represented as:

```
```json
{
  ".zgroup": "{\n \"zarr_format\": 2\n}",
  ".zattrs": "{\n \"Conventions\": \"UGRID-0.9.0\"\n}",
@@ -28,7 +29,100 @@ For example, Zarr data in this proposed spec might be represented as:
},
```

### Version 1

The metadata structure is expressed in JSON. With possible future binary storage in mind, note that "version", "gen" and
"templates" should be considered attributes, while "refs" is the data that ought to dominate the storage size. The
previous definition, Version 0, is compatible with the "refs" entry; Version 1 adds features on top of it. It will also
be possible to *expand* a Version 1 document into Version 0 format.

```
{
  "version": (required, must be equal to) 1,
  "templates": (optional, zero or more arbitrary keys) {
    "template_name": jinja-str
  },
  "gen": (optional, zero or more items) [
    {
      "key": (required) jinja-str,
      "url": (required) jinja-str,
      "offset": (required) jinja-str,
      "length": (required) jinja-str,
      "dimensions": (required, one or more arbitrary keys) {
        "variable_name": (required)
          {"start": (optional) int, "stop": (required) int, "step": (optional) int}
          OR
          [int, ...]
      }
    }
  ],
  "refs": (optional, zero or more arbitrary keys) {
    "key_name": (required) str OR [url(jinja-str)] OR [url(jinja-str), offset(int), length(int)]
  }
}
```

Where:
- `jinja-str` is a string which will be rendered by jinja2 or its non-Python equivalent; i.e., it may be
  a literal string, or may include "{{..}}" annotations, where
  - for the values associated with a template_name, the variables are supplied by the reference URL strings that
    use this template
  - for the values within a "gen" object, variables come from the "dimensions" and "templates"
- the `str` format of a reference value may be
  - a string starting "base64:", which will be decoded to binary
  - any other string, interpreted as ASCII data
- the `str` version of ref values indicates data, the one-element array a whole URL, and the three-element version
  a binary section of a URL

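The three value forms listed above can be told apart mechanically. The following is only a minimal sketch in Python, assuming the templates have already been rendered; the helper name `classify_ref` and its return tuples are hypothetical and not part of this repository.

```python
import base64


def classify_ref(value):
    """Classify one already-rendered "refs" value (hypothetical helper).

    Returns ("data", bytes), ("url", url) or ("range", url, offset, length).
    """
    if isinstance(value, str):
        if value.startswith("base64:"):
            # Inline binary payload: strip the prefix and decode the rest.
            return ("data", base64.b64decode(value[len("base64:"):]))
        # Any other string is inline ASCII data.
        return ("data", value.encode("ascii"))
    if len(value) == 1:
        # One-element array: the whole content of the URL.
        return ("url", value[0])
    if len(value) == 3:
        # Three-element array: a byte range within the URL.
        url, offset, length = value
        return ("range", url, offset, length)
    raise ValueError(f"unrecognised reference value: {value!r}")


print(classify_ref("data"))
print(classify_ref("base64:AAEC"))
print(classify_ref(["http://target_url", 10000, 100]))
```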
Here is an example:

```json
{
  "version": 1,
Review comment: Do we want to link explicitly to a spec document URL, as in the new zarr spec? https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#entry-point-metadata
  "templates": {
    "u": "server.domain/path",
    "f": "{{c}}"
  },
  "gen": [
    {
      "key": "gen_key{{i}}",
      "url": "http://{{u}}_{{i}}",
      "offset": "{{(i + 1) * 1000}}",
      "length": "1000",
      "dimensions":
Review comment: Somehow
Review comment: Happy to hear other suggestions, but it seems clear enough to me (as described) - in terms of inputs for the cartesian product.
Review comment: Please go ahead and merge when ready. I did not mean to delay that with my comment.
        {
          "i": {"stop": 5}
        }
    }
  ],
  "refs": {
    "key0": "data",
    "key1": ["http://target_url", 10000, 100],
    "key2": ["http://{{u}}", 10000, 100],
    "key3": ["http://{{f(c='text')}}", 10000, 100]
  }
}
```
Review comment: It's not clear to me whether this json document is meant to be an example or the actual specification. If it is meant to be an example, I would replace

Here the variable `i` takes the values `[0, 1, 2, 3, 4]` (as with a range, `stop` is exclusive, and the omitted `start`
and `step` are taken as 0 and 1), which could equally have been provided in array form. Where there is more than one
variable, a cartesian product is formed.

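The expansion of "dimensions" described above might be sketched as follows. This is an illustration only: the helper names `expand_dimension` and `iter_combinations` are hypothetical, and the defaults of 0 for "start" and 1 for "step" are an assumption inferred from the `{"stop": 5}` example rather than something the spec text states.

```python
import itertools


def expand_dimension(spec):
    """Turn one "dimensions" entry into a list of ints (hypothetical helper)."""
    if isinstance(spec, dict):
        # Range form: "stop" is required; "start" and "step" are assumed
        # to default to 0 and 1, matching the {"stop": 5} example.
        return list(range(spec.get("start", 0), spec["stop"], spec.get("step", 1)))
    # Array form: the values are listed explicitly.
    return list(spec)


def iter_combinations(dimensions):
    """Yield one {variable: value} dict per point of the cartesian product."""
    names = list(dimensions)
    value_lists = [expand_dimension(dimensions[name]) for name in names]
    for combo in itertools.product(*value_lists):
        yield dict(zip(names, combo))


print(expand_dimension({"stop": 5}))
for variables in iter_combinations({"i": {"stop": 2}, "j": [10, 20]}):
    print(variables)
```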
The Version 1 example above evaluates to the Version 0 equivalent:
```json
{
  "key0": "data",
  "key1": ["http://target_url", 10000, 100],
  "key2": ["http://server.domain/path", 10000, 100],
  "key3": ["http://text", 10000, 100],
  "gen_key0": ["http://server.domain/path_0", 1000, 1000],
  "gen_key1": ["http://server.domain/path_1", 2000, 1000],
  "gen_key2": ["http://server.domain/path_2", 3000, 1000],
  "gen_key3": ["http://server.domain/path_3", 4000, 1000],
  "gen_key4": ["http://server.domain/path_4", 5000, 1000]
}
```
such that accessing, for instance, "key0" returns `b"data"` and accessing "gen_key0" returns 1000 bytes
from the given URL, at an offset of 1000.

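To make the access behaviour concrete, here is a rough sketch of a lookup over a Version 0 style mapping. It is not the ReferenceFileSystem implementation: `read_reference` and `recording_fetch` are hypothetical names, the fetcher fabricates zero bytes instead of contacting a server, and the key `gen_key0` is the one produced by the `"gen_key{{i}}"` template in the example.

```python
def read_reference(refs, key, fetch_range):
    """Return the bytes for `key` from a Version 0 style mapping.

    `fetch_range(url, offset, length)` is a caller-supplied stand-in for
    whatever performs the remote read; offset and length are None when the
    whole URL is wanted.
    """
    value = refs[key]
    if isinstance(value, str):
        # Inline data is returned directly, without touching any URL.
        return value.encode("ascii")
    if len(value) == 1:
        return fetch_range(value[0], None, None)
    url, offset, length = value
    return fetch_range(url, offset, length)


requests_made = []


def recording_fetch(url, offset, length):
    # Hypothetical fetcher: records the request and fabricates zero bytes
    # instead of contacting any server.
    requests_made.append((url, offset, length))
    return bytes(length or 0)


refs = {
    "key0": "data",
    "gen_key0": ["http://server.domain/path_0", 1000, 1000],
}

print(read_reference(refs, "key0", recording_fetch))           # inline data, no request
print(len(read_reference(refs, "gen_key0", recording_fetch)))  # size of the fetched range
print(requests_made)                                           # the single range request
```

Passing the fetcher in as a parameter keeps the lookup logic independent of any particular transport, which is a design choice of this sketch rather than something the spec requires.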
## Examples

Run a notebook example comparing reading HDF5 using this approach vs. native Zarr format: <br>
[![Binder](https://aws-uswest2-binder.pangeo.io/badge_logo.svg)](https://aws-uswest2-binder.pangeo.io/v2/gh/intake/fsspec-reference-maker/main?urlpath=lab%2Ftree%2Fexamples%2Fike_intake.ipynb)

Review comment: Do you want to explicitly reference the json spec here? Or the Zarr spec? In general, I feel like we should reference the other specs that this builds on.