Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec version 1 #17

Merged
merged 5 commits into from
Mar 12, 2021
Merged

Spec version 1 #17

merged 5 commits into from
Mar 12, 2021

Conversation

martindurant
Copy link
Member

No description provided.

@martindurant
Copy link
Member Author

Fixes #7
Fises #8
Fixes #13

Ping @rabernat @rsignell-usgs @manzt @joshmoore

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for working on this @martindurant! I'm so excited about the possibilities.

My major concerns are the following:

  • This is not really a "spec"; it's more of an example. A true spec would be more explicit about MUST, MAY, etc. and try to enumerate the full space of allowed possibilities .
  • It seems like we are baking python language conventions into the spec. Is that the right choice?

@@ -28,7 +29,60 @@ For example, Zarr data in this proposed spec might be represented as:
},
```

### Version 1

Metadata structure in JSON. We note, for future possible binary storage, that "version", "gen" and "templates" should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to explicitly reference the json spec here? Or the Zarr spec? In general, I feel like we should reference the other specs that this builds on.


```json
{
"version": 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to link explicitly to a spec document URL, as in the new zarr spec? https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#entry-point-metadata

README.md Outdated
- version: set to 1 for this spec.
- templates: set of named string templates. These can be plain strings, to be included verbatim, or format strings
(anything containing "{" and "}" characters) which will be called with parameters. The format specifiers for each
variable follows the python string formatting spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a language-agnostic version of format strings we could point to? I am uncomfortable with using python concepts in the spec, as it would prevent implementations from other languages.

README.md Outdated
"url": "protocol://{u}_{i}",
"offset": "{(i + 1) * 1000}",
"length": "1000",
"i": "range(9)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting python code in the spec seems problematic

README.md Outdated
- additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a
dimension of generated keys
- refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains
"{" and "}".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we be more clear about how the template is related to the "ref" section.

README.md Outdated
- key, url: generated key names and target URLs
- offset, length: to define the bytes range, will be converted to int
- additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a
dimension of generated keys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must confess that I can't really understand how this works by reading the above text. I think a significant rewrite may be needed.

"key3": ["protocol://{f(c='text')}", 10000, 100]
}
}
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me whether this json document is meant to be an example or the actual specification. If it is meant to be an example, I would replace protocol:// with an example protocol (e.g. http).

README.md Outdated
- refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains
"{" and "}".

In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explicitly expand these lists rather than using ellipses.

README.md Outdated

In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8":
["protocol://long_text_template_9", 9000, 1000].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this whole section would be easier to read if you showed the complete representation that would be generated by the template.

@martindurant
Copy link
Member Author

This is not really a "spec"; it's more of an example

I thought more like an example with explanations that can together be turned into a formal spec :) It could be encoded in a jsonschema and used for validation.

if you showed the complete representation that would be generated by the template

Good idea. Indeed, when implemented in code, this ought to be a test case.

Do we want to link explicitly to a spec document URL

I think version should be a number, but it would be reasonable to also add a link/DOI, etc., in a different field. In the python implementation, we would probably not check this field.

Is there a language-agnostic version of format strings we could point to?

I am happy to have one pointed out, but I believe python's is moderately simple and similar to other languages. Other popular template engines like jinja are also typically tied to a particular implementation.

Putting python code in the spec seems problematic

The alternative is to come up with our own language... For the specific case of range, you could replace with a list literal, but would not want to do that for long lists.

@martindurant
Copy link
Member Author

This is the python version of the "moustache" templating framework, which appears to have libs in many many languages: https://github.com/noahmorrison/chevron

README.md Outdated
"{" and "}".

In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be gen_key0 and gen_key8 instead of gen_ref...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... which is why it's good to have your spec be in code and tested too, right?

@joshmoore
Copy link

Only comment on top of those given so far, I think, is that data doesn't seem to be specified. My assumption is that it's the raw return value with no need for further lookup.

Regarding the Python-ism points, I could certainly see having a second implementation to keep us honest. I'd probably default to doing so in Java, but I won't start on that until I hear whether or not @manzt already has one in JS (which is what usually happens...)

@martindurant
Copy link
Member Author

data doesn't seem to be specified. My assumption is that it's the raw return value with no need for further lookup.

Indeed. And that the remote chunks referenced are byte blocks too. To be included in JSON, these literals would have to be ascii (which is fine and normal for zarr). We could optionally introduce an encoding like b64 at this point.

As an interesting contrast, for something like the grib2 case, where we need to make a local file to pass to C, one could imagine using this kind of structure and specify a "literals processor" and/or "reference processor" somehow. Very useful, but I think beyond Version 1.

@rabernat
Copy link
Contributor

rabernat commented Mar 3, 2021

Perhaps we want to restrict what sort of python expressions can be used. Do we envision needing anything besides range? If not, we could do something like

"i": {"start": 0, "stop": 9}

@martindurant
Copy link
Member Author

Do we envision needing anything besides range?

open to suggestions. By using a specific dict as you suggest, though, would hinder future expansion of possibilities, so I would still prefer range (with one, two or three arguments). I originally had imagined allowing anything in python builtins, but indeed that would be hard to make language agnostic.

@rabernat
Copy link
Contributor

rabernat commented Mar 3, 2021

Then what about

"i": {"range": {"start": 0, "stop": 9}}

My point is that functions and arguments can be encoded as json and then translated to python, rather than putting python code as strings into the json document.

@martindurant
Copy link
Member Author

That syntax is OK with me.

@manzt
Copy link
Contributor

manzt commented Mar 3, 2021

Thanks for working on this @martindurant

This is the python version of the "moustache" templating framework, which appears to have libs in many many languages.

I think moustache is a good choice for this, echoing @rabernat concern with python-isms.

I originally had imagined allowing anything in python builtins, but indeed that would be hard to make language agnostic.

Ideally we could define the "set" of builtins that the spec may include in JSON like @rabernat's example. This would help clarify which builtins are absolutely necessary and as @joshmoore said "keep us honest" between implementations.

Only comment on top of those given so far, I think, is that data doesn't seem to be specified. My assumption is that it's the raw return value with no need for further lookup.

Also somewhat unclear to me. Can data be both an ascii-encoded blob that needs to be parsed as JSON or just JSON?

I won't start on that until I hear whether or not @manzt already has one in JS (which is what usually happens...)

I've written a few ad-hoc pieces of code to parse the spec, but it's coupled with a custom zarr.js store. I can try to separate out pieces and work on something more "official" and reusable for the JS folks.

@martindurant
Copy link
Member Author

I've written a few ad-hoc pieces of code to parse the spec, but it's coupled with a custom zarr.js store.

Since you don't have an fsspec to handle the references, I imagine a custom store is the only way to go. That's how this "pieces of HDF" was initially implemented too; but I am keen that the idea be transferrable beyond zarr if possible, at least in python.

@manzt
Copy link
Contributor

manzt commented Mar 3, 2021

Since you don't have an fsspec to handle the references, I imagine a custom store is the only way to go.

Makes sense. I could image an reference-spec-reader-js that is just a parser for spec and returns either data or a ref tuple for a given key, delegating how/if a data request should be made to someone else. I think for now (since I'll probably be one of few users), it makes sense to keep these together.

@manzt
Copy link
Contributor

manzt commented Mar 3, 2021

As a side note, I'm not super familiar with mustache, but I'm not sure (don't think) expression evaluation is supported. So handing:

  "offset": "{(i + 1) * 1000}"

could be challenging.

EDIT: One option would be to use templating strictly for substitution and then parse and evaluate the resulting expression. Although use of eval like this can pose a security risk (at least it's warned against in the browser)

import chevron
template = "({{ i }} + 1) * 1000}"
expr = chevron.render(template=template, data=dict(i=10)) # "(10 + 1) * 1000"
offset = eval(expr)

@ajelenak
Copy link
Collaborator

ajelenak commented Mar 4, 2021

Checking whether this in the spec:

{
    "refs": {
         "key0": "data",
    }
}

covers the use case below (here for context):

{
    "key0": "data:application/octet-stream;base64:<array data>"
}

@martindurant
Copy link
Member Author

Good point @ajelenak : this is not described. If the only two formats we consider are text (ascii) and b64-binary, then maybe it can be shorter and simpler.

@ajelenak
Copy link
Collaborator

ajelenak commented Mar 4, 2021

Simpler by avoiding "data:application/octet-stream;base64:"? I think that in cases of specs explicit is better than implicit. But avoiding the data URI part will certainly save some bytes.

@martindurant
Copy link
Member Author

Updates as requested.

Given the availability of nunjucks (js), liquid (ruby), twig (php) and probably others, I think sticking with jinja2 is fine. It means double-braces, unless all of the above permit setting the variable delimiter as python'd version does.

@joshmoore
Copy link

Given the availability of nunjucks (js), liquid (ruby), twig (php) and probably others

Untested but looks active: https://github.com/HubSpot/jinjava

@martindurant
Copy link
Member Author

README.md Outdated
{
"key": "gen_key{{i}}",
"url": "http://{{u}}_{{i}}",
"offset": "{(i + 1) * 1000}",
Copy link
Contributor

@manzt manzt Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong, but I don't think this is a valid template.

Suggested change
"offset": "{(i + 1) * 1000}",
"offset": "({{ i }} + 1) * 1000",

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be "{{(i + 1) * 1000}}" - to evaluate the whole thing within braces as an expression. Your version would evaluate to the string "(0 + 1) * 1000" for i=0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct. My mistake!

@manzt
Copy link
Contributor

manzt commented Mar 10, 2021

I was able to write a parser in JS to expand the V1 spec: https://github.com/manzt/reference-spec-reader. Thanks for your work on this @martindurant .

To reiterate a potential concern, nunjucks (and I'm guessing other template renders) has a very clear warning that evaluating user defined templates is a security vulnerability:

nunjucks does not sandbox execution so it is not safe to run user-defined templates or inject user-defined content into template definitions.

I don't have the background to evaluate this is concern in this context, but something to be aware of.

@martindurant
Copy link
Member Author

The specific concern, at least the example, is about passing strings to the browser, which then evaluates them. In our case, we don't eval strings, we turn strings into other strings with the exception of cast to int.

@manzt
Copy link
Contributor

manzt commented Mar 10, 2021

In our case, we don't eval strings, we turn strings into other strings with the exception of cast to int.

I'm struggling to see how evaluating this template "{{ (i + 1) * 1000 }}", for example, is a combination of string substitution and integer casting. Ultimately the substituted string expression must be evaluated, but I could missing something.

@martindurant
Copy link
Member Author

This particular value has three stages:

  • arithmetic evaluation based on the value of i (no functions allowed here except a very limited set that jinja implements)
  • forms a string output
  • converted to int with a simple cast (either is an int, or error - no arbitrary evaluation).

@manzt
Copy link
Contributor

manzt commented Mar 10, 2021

Thanks for walking me through that. This was a misunderstanding on my part (re: which functions are allowed via jinja).

@martindurant
Copy link
Member Author

I have specified what the "ref" entries should be, and included the possibility that they are whole files (i.e., no offset/length - but we don't need to scan all files to find their lengths).

I thin this can complete the spec, and I can start making the implementation in fsspec as well as a recipe in pangeo-forge ( @rabernat ?).

This whole-file change would account for the grib2 case or small-hdf case, where we suppose there will be a "filter" stage in the zarr loader which takes the bytes, writes them to disc (or BytesIO for HDF), loads and selects data and passes back the numpy buffer.

This meets the intake-informaticslab use case (especially if we have caching, which could be for the original file or the final array buffer). Of course, specifying this (python) function doesn't happen in this JSON spec, it's either contained in the zarr metadata or even in an intake spec. We could have another convention for those, if we like, such as grib2, hdf:tree/var or generic module:func. I don't intend to implement anything like this quite just yet, but we want it to be possible.

cc @tam203 @arh89

@martindurant
Copy link
Member Author

Merging today, unless I hear otherwise. This will be followed by implementation in fsspec's ReferenceFileSystem, and then the pangeo-forge recipe (which will depend on fsspec HEAD).

"url": "http://{{u}}_{{i}}",
"offset": "{{(i + 1) * 1000}}",
"length": "1000",
"dimensions":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow dimensions seems misplaced to me. What is i a dimension of? Perhaps something self-explanatory like "gen_vars", although not as convenient as a single word.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to hear other suggestions, but it seems clear enough to me (as described) - in terms of inputs for the cartesian product.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please go ahead and merge when ready. I did not mean to delay that with my comment.

@martindurant martindurant merged commit 7f0e2d9 into fsspec:main Mar 12, 2021
@martindurant martindurant deleted the spec1 branch March 12, 2021 02:07
martindurant pushed a commit to martindurant/filesystem_spec that referenced this pull request Mar 12, 2021
See fsspec/kerchunk#17

Note that this *expands* the given data. and builds a dircache.
In the future, entries should be available only on demand.
@martindurant
Copy link
Member Author

Implementation: fsspec/filesystem_spec#568

When that's in, I'll go back fsspec_reference_maker.hdf and edit it accordingly, to be called from a pangeo-forge recipe.

@manzt
Copy link
Contributor

manzt commented Mar 28, 2021

Not sure if anyone here is interested, but figured this an appropriate place to share for now (working on a blob post). I have released an JS package on npm that includes a parser for the specification and a valid store implementation for Zarr.js in the browser (relies on http(s) protocol).

package: https://github.com/manzt/reference-spec-reader

example: https://6060eed9dc5ef100075be024--vizarr.netlify.app/?source=https://gist.githubusercontent.com/manzt/436fc2966c484205a2c60824f659b412/raw/cdc69f2ce645d953185f10d7552501bfd459dd12/Vanderbilt-Spraggins-Kidney-MxIF.ome.tif.json&channel_axis=0

The example url is a preview of this PR in vizarr, an interative viewer for zarr-based images. The reference (v0) was generated via tifffile for a multiscale OME-TIFF we host on GCS. Same reference can be viewed in python via napari with fsspec + dask & zarr.

@rabernat
Copy link
Contributor

Vizarr is so cool! Can we point it at some climate data?

@manzt
Copy link
Contributor

manzt commented Mar 31, 2021

@rabernat I'm not sure. I'd have to look at the Zarr (both hierarchy structure and chunk shape for arrays). Currently Vizarr relies on the "multiscales" extension for representing downsampled data, the last two dimensions must be (y, x) and non-xy dimensions must have a chunksize of one. The other thing I'm curious about is coordinate system. Happy to chat!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants