Spec version 1 #17

Merged
merged 5 commits into from
Mar 12, 2021
106 changes: 100 additions & 6 deletions README.md
@@ -1,15 +1,16 @@
# fsspec-reference-maker

Functions to make reference descriptions for ReferenceFileSystem


### Version 0

Prototype spec for the structure required by ReferenceFileSystem:

```json
{
    "key0": "data",
    "key1": ["protocol://target_url", 10000, 100]
}
```
where:
@@ -18,7 +19,7 @@ where:

For example, Zarr data in this proposed spec might be represented as:

```json
{
".zgroup": "{\n \"zarr_format\": 2\n}",
".zattrs": "{\n \"Conventions\": \"UGRID-0.9.0\"\n}",
@@ -28,7 +29,100 @@ For example, Zarr data in this proposed spec might be represented as:
},
```

### Version 1

Metadata structure in JSON. We note, for future possible binary storage, that "version", "gen" and "templates" should
be considered attributes, and "refs" as the data that ought to dominate the storage size. The previous definition,
Version 0, is compatible with the "refs" entry, but here we add features. It will also be possible to *expand*
this new enhanced spec into Version 0 format.

Review comment (Contributor): Do you want to explicitly reference the json spec here? Or the Zarr spec? In general, I feel like we should reference the other specs that this builds on.


```
{
    "version": (required, must be equal to) 1,
    "templates": (optional, zero or more arbitrary keys) {
        "template_name": jinja-str
    },
    "gen": (optional, zero or more items) [
        {
            "key": (required) jinja-str,
            "url": (required) jinja-str,
            "offset": (required) jinja-str,
            "length": (required) jinja-str,
            "dimensions": (required, one or more arbitrary keys) {
                "variable_name": (required)
                    {"start": (optional) int, "stop": (required) int, "step": (optional) int}
                    OR
                    [int, ...]
            }
        }
    ],
    "refs": (optional, zero or more arbitrary keys) {
        "key_name": (required) str OR [url(jinja-str)] OR [url(jinja-str), offset(int), length(int)]
    }
}
```

Where:
- `jinja-str` is a string which will be rendered by jinja2 or its non-Python equivalent; i.e., it may be
  a literal string, or may include "{{..}}" annotations, where
  - for the values associated with a template_name, the variables are passed in by the reference URL strings that
    use this template
  - for the values within a "gen" object, variables come from the "dimensions" and "templates"
- the `str` format of a reference value may be
  - a string starting with "base64:", which will be decoded to binary
  - any other string, interpreted as ASCII data
- the `str` version of ref values indicates data, the one-element array a whole URL, and the three-element version
  a binary section of a URL
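
The three forms of a "refs" value listed above could be told apart roughly as follows. This is only a minimal sketch, not the ReferenceFileSystem implementation: the function name `classify_ref` and its tuple return shape are hypothetical, and it assumes that the "base64:" prefix is stripped before the remainder is decoded.

```python
import base64


def classify_ref(value):
    """Interpret one "refs" value (hypothetical helper, for illustration only)."""
    if isinstance(value, str):
        if value.startswith("base64:"):
            # Assumption: everything after the prefix is standard base64.
            return ("data", base64.b64decode(value[len("base64:"):]))
        # Any other string is treated as ASCII data.
        return ("data", value.encode("ascii"))
    if isinstance(value, list) and len(value) == 1:
        # One-element array: the whole content of the URL.
        return ("whole_url", value[0])
    if isinstance(value, list) and len(value) == 3:
        # Three-element array: a byte section of the URL.
        url, offset, length = value
        return ("byte_range", url, offset, length)
    raise ValueError(f"unrecognised reference value: {value!r}")
```

URL strings are returned unrendered here; any "{{..}}" template annotations would still need to be rendered before use.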

Here is an example:

```json
{
    "version": 1,
    "templates": {
        "u": "server.domain/path",
        "f": "{{c}}"
    },
    "gen": [
        {
            "key": "gen_key{{i}}",
            "url": "http://{{u}}_{{i}}",
            "offset": "{{(i + 1) * 1000}}",
            "length": "1000",
            "dimensions":
              {
                "i": {"stop": 5}
              }
        }
    ],
    "refs": {
      "key0": "data",
      "key1": ["http://target_url", 10000, 100],
      "key2": ["http://{{u}}", 10000, 100],
      "key3": ["http://{{f(c='text')}}", 10000, 100]
    }
}
```

Review comments on this example:

- Contributor, on `"version": 1`: Do we want to link explicitly to a spec document URL, as in the new zarr spec? https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#entry-point-metadata
- Collaborator, on `"dimensions"`: Somehow dimensions seems misplaced to me. What is i a dimension of? Perhaps something self-explanatory like "gen_vars", although not as convenient as a single word.
  - Member Author: Happy to hear other suggestions, but it seems clear enough to me (as described) - in terms of inputs for the cartesian product.
  - Collaborator: Please go ahead and merge when ready. I did not mean to delay that with my comment.
- Contributor: It's not clear to me whether this json document is meant to be an example or the actual specification. If it is meant to be an example, I would replace protocol:// with an example protocol (e.g. http).

Here the variable `i` takes the values `[0, 1, 2, 3, 4]`, which could have been provided in array form. Where there
is more than one variable, a cartesian product is formed.
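
The expansion of "dimensions" into variable bindings might be sketched as below. This is an illustration under stated assumptions, not the project's code: the helper names `dimension_values` and `expand_dimensions` are hypothetical, and the dict form is assumed to follow Python `range` semantics with `start` defaulting to 0 and `step` to 1, which matches the `[0, 1, 2, 3, 4]` stated above for `{"stop": 5}`. Rendering the jinja strings with each binding is deliberately left out.

```python
from itertools import product


def dimension_values(spec):
    """Return the list of values for one "dimensions" entry."""
    if isinstance(spec, dict):
        # Assumption: range-like semantics with default start=0, step=1.
        return list(range(spec.get("start", 0), spec["stop"], spec.get("step", 1)))
    # Array form: the values are given explicitly.
    return list(spec)


def expand_dimensions(dimensions):
    """Form the cartesian product of all variables as a list of binding dicts."""
    names = list(dimensions)
    value_lists = [dimension_values(dimensions[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]
```

Each resulting dict, for instance `{"i": 0}`, would then supply the variables for rendering the "key", "url", "offset" and "length" strings of one generated reference.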

This example evaluates to the Version 0 equivalent:
```json
{
    "key0": "data",
    "key1": ["http://target_url", 10000, 100],
    "key2": ["http://server.domain/path", 10000, 100],
    "key3": ["http://text", 10000, 100],
    "gen_key0": ["http://server.domain/path_0", 1000, 1000],
    "gen_key1": ["http://server.domain/path_1", 2000, 1000],
    "gen_key2": ["http://server.domain/path_2", 3000, 1000],
    "gen_key3": ["http://server.domain/path_3", 4000, 1000],
    "gen_key4": ["http://server.domain/path_4", 5000, 1000]
}
```
such that accessing, for instance, "key0" returns `b"data"` and accessing "gen_key0" returns 1000 bytes
from the given URL, at an offset of 1000.
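
As a rough illustration of what such a byte-section reference means for an HTTP target, the offset and length can be turned into a standard `Range` request header. This is a sketch only: the helper `http_range_header` is hypothetical, and the README does not state that this is how ReferenceFileSystem performs the read.

```python
def http_range_header(offset, length):
    """Build an HTTP Range header covering `length` bytes starting at `offset`."""
    first = offset
    last = offset + length - 1  # the end position of an HTTP byte range is inclusive
    return {"Range": f"bytes={first}-{last}"}


# The reference ["http://server.domain/path_0", 1000, 1000] from the example above:
headers = http_range_header(1000, 1000)
```

Here `headers` asks for bytes 1000 through 1999, that is, 1000 bytes at an offset of 1000.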

## Examples

Run a notebook example comparing reading HDF5 using this approach vs. native Zarr format: <br>
[![Binder](https://aws-uswest2-binder.pangeo.io/badge_logo.svg)](https://aws-uswest2-binder.pangeo.io/v2/gh/intake/fsspec-reference-maker/main?urlpath=lab%2Ftree%2Fexamples%2Fike_intake.ipynb)