# Spec version 1 #17
## Conversation
Thanks so much for working on this @martindurant! I'm so excited about the possibilities.
My major concerns are the following:
- This is not really a "spec"; it's more of an example. A true spec would be more explicit about MUST, MAY, etc., and try to enumerate the full space of allowed possibilities.
- It seems like we are baking python language conventions into the spec. Is that the right choice?
````
@@ -28,7 +29,60 @@ For example, Zarr data in this proposed spec might be represented as:
},
```

### Version 1

Metadata structure in JSON. We note, for future possible binary storage, that "version", "gen" and "templates" should
````
Do you want to explicitly reference the json spec here? Or the Zarr spec? In general, I feel like we should reference the other specs that this builds on.
```json
{
  "version": 1,
```
Do we want to link explicitly to a spec document URL, as in the new zarr spec? https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#entry-point-metadata
> - version: set to 1 for this spec.
> - templates: set of named string templates. These can be plain strings, to be included verbatim, or format strings
>   (anything containing "{" and "}" characters) which will be called with parameters. The format specifiers for each
>   variable follows the python string formatting spec.
Is there a language-agnostic version of format strings we could point to? I am uncomfortable with using python concepts in the spec, as it would prevent implementations from other languages.
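For comparison, here is a small sketch (not from the PR itself) of the same URL template rendered with Python's `str.format` as in the draft, and with a mustache-style engine (chevron); the values of `u` and `i` are assumed purely for illustration.

```python
# Hedged sketch: Python format-string style (as drafted) vs. mustache style.
import chevron

u, i = "long_text_template", 3  # assumed example values for the named parameters

python_style = "protocol://{u}_{i}".format(u=u, i=i)
mustache_style = chevron.render("protocol://{{u}}_{{i}}", {"u": u, "i": i})
assert python_style == mustache_style == "protocol://long_text_template_3"
```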
```json
    "url": "protocol://{u}_{i}",
    "offset": "{(i + 1) * 1000}",
    "length": "1000",
    "i": "range(9)"
```
Putting python code in the spec seems problematic
> - additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a
>   dimension of generated keys
> - refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains
>   "{" and "}".
Could we be more clear about how the template is related to the "ref" section.
> - key, url: generated key names and target URLs
> - offset, length: to define the bytes range, will be converted to int
> - additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a
>   dimension of generated keys
I must confess that I can't really understand how this works by reading the above text. I think a significant rewrite may be needed.
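As a reading aid (a sketch of one possible interpretation, not spec text): a Python implementation might evaluate `"range(9)"` to get the values of `i`, render the key/url/offset templates for each value, and emit one reference per generated key. The `"gen_key{i}"` key template and the value of `u` below are assumed from the surrounding example.

```python
# Hedged hand-expansion of one "gen" entry from the example.
refs = {}
u = "long_text_template"          # assumed rendering of the named template "u"
for i in range(9):                # "i": "range(9)" treated as an iterable
    key = f"gen_key{i}"           # assumed key template "gen_key{i}"
    url = f"protocol://{u}_{i}"   # "url": "protocol://{u}_{i}"
    offset = (i + 1) * 1000       # "offset": "{(i + 1) * 1000}"
    refs[key] = [url, offset, 1000]   # "length": "1000"
```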
"key3": ["protocol://{f(c='text')}", 10000, 100] | ||
} | ||
} | ||
``` |
It's not clear to me whether this json document is meant to be an example or the actual specification. If it is meant to be an example, I would replace `protocol://` with an example protocol (e.g. http).
> - refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains
>   "{" and "}".
>
> In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
Could you explicitly expand these lists rather than using ellipses.
> In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
> Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8":
> ["protocol://long_text_template_9", 9000, 1000].
Perhaps this whole section would be easier to read if you showed the complete representation that would be generated by the template.
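Presumably something like the following, shown here as a sketch under the assumption that `u` renders to "long_text_template" and `i` runs over `range(9)` (i.e. 0 through 8); entries for i = 2..7 are elided.

```python
# Hedged illustration of the complete expansion requested above.
expanded_refs = {
    "key2": ["protocol://long_text_template", 10000, 100],  # offset/length assumed
    "key3": ["protocol://text", 10000, 100],
    "gen_ref0": ["protocol://long_text_template_0", 1000, 1000],
    "gen_ref1": ["protocol://long_text_template_1", 2000, 1000],
    # ... offset = (i + 1) * 1000 for each i ...
    "gen_ref8": ["protocol://long_text_template_8", 9000, 1000],
}
```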
I thought more like an example with explanations that can together be turned into a formal spec :) It could be encoded in a jsonschema and used for validation.
Good idea. Indeed, when implemented in code, this ought to be a test case.
I think version should be a number, but it would be reasonable to also add a link/DOI, etc., in a different field. In the python implementation, we would probably not check this field.
I am happy to have one pointed out, but I believe python's is moderately simple and similar to other languages. Other popular template engines like jinja are also typically tied to a particular implementation.
The alternative is to come up with our own language... For the specific case of
This is the python version of the "moustache" templating framework, which appears to have libs in many, many languages: https://github.com/noahmorrison/chevron
> "{" and "}".
>
> In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
> Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8":
Should it be `gen_key0` and `gen_key8` instead of `gen_ref...`?
... which is why it's good to have your spec be in code and tested too, right?
Only comment on top of those given so far, I think: regarding the Python-ism points, I could certainly see having a second implementation to keep us honest. I'd probably default to doing so in Java, but I won't start on that until I hear whether or not @manzt already has one in JS (which is what usually happens...)
Indeed. And that the remote chunks referenced are byte blocks too. To be included in JSON, these literals would have to be ascii (which is fine and normal for zarr). We could optionally introduce an encoding like b64 at this point. As an interesting contrast, for something like the grib2 case, where we need to make a local file to pass to C, one could imagine using this kind of structure and specifying a "literals processor" and/or "reference processor" somehow. Very useful, but I think beyond Version 1.
Perhaps we want to restrict what sort of python expressions can be used. Do we envision needing anything besides
Open to suggestions. Using a specific dict as you suggest, though, would hinder future expansion of possibilities, so I would still prefer
Then what about
My point is that functions and arguments can be encoded as json and then translated to python, rather than putting python code as strings into the json document.
That syntax is OK with me.
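As a sketch of that idea (the `{"func": ..., "args": ...}` shape below is hypothetical, not something agreed in this thread), an implementation could whitelist the callables it understands and translate JSON nodes into them:

```python
# Hypothetical illustration: encode a function call as JSON and translate it in
# Python against a whitelist, instead of embedding Python source in the document.
ALLOWED = {"range": range}

def resolve(node):
    """Turn {"func": "range", "args": [9]} into range(9); pass other values through."""
    if isinstance(node, dict) and "func" in node:
        return ALLOWED[node["func"]](*node.get("args", []))
    return node

assert list(resolve({"func": "range", "args": [9]})) == list(range(9))
```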
Thanks for working on this @martindurant
I think moustache is a good choice for this, echoing @rabernat's concern with python-isms.
Ideally we could define the "set" of builtins that the spec may include in JSON, like @rabernat's example. This would help clarify which builtins are absolutely necessary and, as @joshmoore said, "keep us honest" between implementations.
Also somewhat unclear to me. Can
I've written a few ad-hoc pieces of code to parse the spec, but it's coupled with a custom zarr.js store. I can try to separate out pieces and work on something more "official" and reusable for the JS folks.
Since you don't have an fsspec to handle the references, I imagine a custom store is the only way to go. That's how this "pieces of HDF" was initially implemented too, but I am keen that the idea be transferable beyond zarr if possible, at least in python.
Makes sense. I could imagine an
As a side note, I'm not super familiar with
could be challenging.

EDIT: One option would be to use templating strictly for substitution and then parse and evaluate the resulting expression, although use of eval like this can pose a security risk (at least it's warned against in the browser):

```python
import chevron

template = "({{ i }} + 1) * 1000"
expr = chevron.render(template=template, data=dict(i=10))  # "(10 + 1) * 1000"
offset = eval(expr)  # 11000
```
Checking whether this in the spec:

```json
{
  "refs": {
    "key0": "data"
  }
}
```

covers the use case below (here for context):

```json
{
  "key0": "data:application/octet-stream;base64:<array data>"
}
```
Good point @ajelenak: this is not described. If the only two formats we consider are text (ascii) and b64-binary, then maybe it can be shorter and simpler.
Simpler by avoiding " |
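A minimal sketch of how an implementation might accept both literal forms; the `base64:` marker here is an assumption for illustration, not something fixed by this PR.

```python
# Hedged sketch: plain-ascii literals vs. base64-encoded binary literals in "refs".
import base64

def decode_literal(value: str) -> bytes:
    if value.startswith("base64:"):                      # assumed prefix for binary data
        return base64.b64decode(value[len("base64:"):])
    return value.encode("ascii")                         # plain text literal

assert decode_literal("data") == b"data"
assert decode_literal("base64:" + base64.b64encode(b"\x00\x01").decode()) == b"\x00\x01"
```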
Updates as requested. Given the availability of nunjucks (js), liquid (ruby), twig (php) and probably others, I think sticking with jinja2 is fine. It means double-braces, unless all of the above permit setting the variable delimiter as python's version does.
Untested but looks active: https://github.com/HubSpot/jinjava
```json
{
    "key": "gen_key{{i}}",
    "url": "http://{{u}}_{{i}}",
    "offset": "{(i + 1) * 1000}",
```
I could be wrong, but I don't think this is a valid template.
"offset": "{(i + 1) * 1000}", | |
"offset": "({{ i }} + 1) * 1000", |
It should be "{{(i + 1) * 1000}}" - to evaluate the whole thing within braces as an expression. Your version would evaluate to the string "(0 + 1) * 1000" for i=0.
Yes, you are correct. My mistake!
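For reference, a quick check of the corrected form, assuming jinja2 is the engine as discussed above:

```python
# The double-braced expression is evaluated by the engine; the result is a string
# that an implementation then casts to int for offset/length.
import jinja2

rendered = jinja2.Template("{{ (i + 1) * 1000 }}").render(i=0)
assert rendered == "1000"   # not the literal string "(0 + 1) * 1000"
offset = int(rendered)
```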
I was able to write a parser in JS to expand the V1 spec: https://github.com/manzt/reference-spec-reader. Thanks for your work on this @martindurant. To reiterate a potential concern: I don't have the background to evaluate this concern in this context, but it is something to be aware of.
The specific concern, at least the example, is about passing strings to the browser, which then evaluates them. In our case, we don't eval strings, we turn strings into other strings, with the exception of casting to `int`.
I'm struggling to see how evaluating this template |
This particular value has three stages:
Thanks for walking me through that. This was a misunderstanding on my part (re: which functions are allowed via jinja).
I have specified what the "ref" entries should be, and included the possibility that they are whole files (i.e., no offset/length - but we don't need to scan all files to find their lengths). I think this can complete the spec, and I can start making the implementation in fsspec as well as a recipe in pangeo-forge ( @rabernat ?). This whole-file change would account for the grib2 case or small-hdf case, where we suppose there will be a "filter" stage in the zarr loader which takes the bytes, writes them to disc (or BytesIO for HDF), loads and selects data and passes back the numpy buffer. This meets the intake-informaticslab use case (especially if we have caching, which could be for the original file or the final array buffer). Of course, specifying this (python) function doesn't happen in this JSON spec, it's either contained in the zarr metadata or even in an intake spec. We could have another convention for those, if we like, such as
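A small sketch of the three kinds of "refs" entries this would allow; the URLs are made up, and the single-element-list form for whole files is an assumption about how "no offset/length" would be written.

```python
# Hedged sketch of reference forms: inline literal, whole file, and byte range.
refs = {
    "inline_key": "ascii data",                               # literal content stored directly
    "whole_file_key": ["http://example.com/whole.grb2"],      # read the entire target file
    "range_key": ["http://example.com/big.bin", 1000, 100],   # url, offset, length
}
```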
Merging today, unless I hear otherwise. This will be followed by implementation in fsspec's ReferenceFileSystem, and then the pangeo-forge recipe (which will depend on fsspec HEAD).
"url": "http://{{u}}_{{i}}", | ||
"offset": "{{(i + 1) * 1000}}", | ||
"length": "1000", | ||
"dimensions": |
Somehow `dimensions` seems misplaced to me. What is `i` a dimension of? Perhaps something self-explanatory like "gen_vars", although not as convenient as a single word.
Happy to hear other suggestions, but it seems clear enough to me (as described) - in terms of inputs for the cartesian product.
Please go ahead and merge when ready. I did not mean to delay that with my comment.
See fsspec/kerchunk#17. Note that this *expands* the given data and builds a dircache. In the future, entries should be available only on demand.
Implementation: fsspec/filesystem_spec#568. When that's in, I'll go back to fsspec_reference_maker.hdf and edit it accordingly, to be called from a pangeo-forge recipe.
Not sure if anyone here is interested, but figured this an appropriate place to share for now (working on a blog post). I have released a JS package on npm that includes a parser for the specification and a valid store implementation. Package: https://github.com/manzt/reference-spec-reader. The example url is a preview of this PR in Vizarr.
Vizarr is so cool! Can we point it at some climate data?
@rabernat I'm not sure. I'd have to look at the Zarr (both hierarchy structure and chunk shape for arrays). Currently Vizarr relies on the "multiscales" extension for representing downsampled data, and the last two dimensions must be