Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec version 1 #17

Merged
merged 5 commits into from
Mar 12, 2021
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 60 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,16 @@
# fsspec-reference-maker

Functions to make reference descriptions for ReferenceFileSystem


Proposed spec for the structure required by ReferenceFileSystem:
### Version 0

```
Prototype spec for the structure required by ReferenceFileSystem:

```json
{
"key0": "data",
"key1": {
["protocol://target_url", 10000, 100]
}
"key1": ["protocol://target_url", 10000, 100]
}
```
where:
Expand All @@ -18,7 +19,7 @@ where:

For example, Zarr data in this proposed spec might be represented as:

```
```json
{
".zgroup": "{\n \"zarr_format\": 2\n"},
".zattrs": "{\n \"Conventions\": \"UGRID-0.9.0\n\"},
Expand All @@ -28,7 +29,60 @@ For example, Zarr data in this proposed spec might be represented as:
},
```

### Version 1

Metadata structure in JSON. We note, for future possible binary storage, that "version", "gen" and "templates" should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to explicitly reference the json spec here? Or the Zarr spec? In general, I feel like we should reference the other specs that this builds on.

be considered attributes, and "refs" as the data that ought to dominate the storage size. The previous definition,
Version 0, is compatible with the "refs" entry, but here we add features. It will also be possible to *expand*
this new enhanced spec into Version 0 format.

```json
{
"version": 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to link explicitly to a spec document URL, as in the new zarr spec? https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#entry-point-metadata

"templates": {
"u": "long_text_template",
"f": "{c}"
},
"gen": [
{
"key": "gen_key{i}",
"url": "protocol://{u}_{i}",
"offset": "{(i + 1) * 1000}",
Copy link
Contributor

@manzt manzt Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong, but I don't think this is a valid template.

Suggested change
"offset": "{(i + 1) * 1000}",
"offset": "({{ i }} + 1) * 1000",

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be "{{(i + 1) * 1000}}" - to evaluate the whole thing within braces as an expression. Your version would evaluate to the string "(0 + 1) * 1000" for i=0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct. My mistake!

"length": "1000",
"i": "range(9)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting python code in the spec seems problematic

}
],
"refs": {
"key0": "data",
"key1": ["protocol://target_url", 10000, 100],
"key2": ["protocol://{u}", 10000, 100],
"key3": ["protocol://{f(c='text')}", 10000, 100]
}
}
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me whether this json document is meant to be an example or the actual specification. If it is meant to be an example, I would replace protocol:// with an example protocol (e.g. http).


Explanation of fields follows. Only "version" and "refs" are required:

- version: set to 1 for this spec.
- templates: set of named string templates. These can be plain strings, to be included verbatim, or format strings
(anything containing "{" and "}" characters) which will be called with parameters. The format specifiers for each
variable follows the python string formatting spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a language-agnostic version of format strings we could point to? I am uncomfortable with using python concepts in the spec, as it would prevent implementations from other languages.

- gen: programmatically generated key/value pairs. Each entry adds one or more items to "refs"; in practice, in the
implementation, we may choose to populate these or create them on-demand. Any of the fields can contain
templated parameters.
- key, url: generated key names and target URLs
- offset, length: to define the bytes range, will be converted to int
- additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a
dimension of generated keys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must confess that I can't really understand how this works by reading the above text. I think a significant rewrite may be needed.

- refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains
"{" and "}".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we be more clear about how the template is related to the "ref" section.


In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explicitly expand these lists rather than using ellipses.

Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be gen_key0 and gen_key8 instead of gen_ref...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... which is why it's good to have your spec be in code and tested too, right?

["protocol://long_text_template_9", 9000, 1000].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this whole section would be easier to read if you showed the complete representation that would be generated by the template.



## Examples

Run a notebook example comparing reading HDF5 using this approach vs. native Zarr format: <br>
[![Binder](https://aws-uswest2-binder.pangeo.io/badge_logo.svg)](https://aws-uswest2-binder.pangeo.io/v2/gh/intake/fsspec-reference-maker/main?urlpath=lab%2Ftree%2Fexamples%2Fike_intake.ipynb)
Expand Down