Spec version 1 #17

martindurant · 2021-03-01T20:56:33Z

No description provided.

martindurant · 2021-03-02T16:23:30Z

Fixes #7
Fixes #8
Fixes #13

Ping @rabernat @rsignell-usgs @manzt @joshmoore

rabernat

Thanks so much for working on this @martindurant! I'm so excited about the possibilities.

My major concerns are the following:

This is not really a "spec"; it's more of an example. A true spec would be more explicit about MUST, MAY, etc. and try to enumerate the full space of allowed possibilities.
It seems like we are baking python language conventions into the spec. Is that the right choice?

rabernat · 2021-03-02T16:43:33Z

README.md

@@ -28,7 +29,60 @@ For example, Zarr data in this proposed spec might be represented as:
 },
 ```

+### Version 1
+
+Metadata structure in JSON. We note, for future possible binary storage, that "version", "gen" and "templates" should


Do you want to explicitly reference the json spec here? Or the Zarr spec? In general, I feel like we should reference the other specs that this builds on.

rabernat · 2021-03-02T16:45:34Z

README.md

+
+```json
+{
+    "version": 1,


Do we want to link explicitly to a spec document URL, as in the new zarr spec? https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#entry-point-metadata

rabernat · 2021-03-02T16:47:15Z

README.md

+- version: set to 1 for this spec.
+- templates: set of named string templates. These can be plain strings, to be included verbatim, or format strings
+  (anything containing "{" and "}" characters) which will be called with parameters. The format specifiers for each
+  variable follows the python string formatting spec.


Is there a language-agnostic version of format strings we could point to? I am uncomfortable with using python concepts in the spec, as it would prevent implementations from other languages.

rabernat · 2021-03-02T16:48:40Z

README.md

+            "url": "protocol://{u}_{i}",
+            "offset": "{(i + 1) * 1000}",
+            "length": "1000",
+            "i": "range(9)"


Putting python code in the spec seems problematic

rabernat · 2021-03-02T16:51:41Z

README.md

+    - additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a 
+      dimension of generated keys
+- refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains 
+  "{" and "}".


Could we be clearer about how the template is related to the "ref" section?

rabernat · 2021-03-02T16:52:21Z

README.md

+    - key, url: generated key names and target URLs
+    - offset, length: to define the bytes range, will be converted to int
+    - additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a 
+      dimension of generated keys


I must confess that I can't really understand how this works by reading the above text. I think a significant rewrite may be needed.

rabernat · 2021-03-02T16:54:16Z

README.md

+      "key3": ["protocol://{f(c='text')}", 10000, 100]
+    }
+}
+```


It's not clear to me whether this json document is meant to be an example or the actual specification. If it is meant to be an example, I would replace protocol:// with an example protocol (e.g. http).

rabernat · 2021-03-02T16:55:53Z

README.md

+- refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains 
+  "{" and "}".
+
+In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].


Could you explicitly expand these lists rather than using ellipses?

rabernat · 2021-03-02T16:56:59Z

README.md

+
+In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
+Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8":
+["protocol://long_text_template_9", 9000, 1000].


Perhaps this whole section would be easier to read if you showed the complete representation that would be generated by the template.

martindurant · 2021-03-02T20:49:49Z

This is not really a "spec"; it's more of an example

I thought more like an example with explanations that can together be turned into a formal spec :) It could be encoded in a jsonschema and used for validation.

if you showed the complete representation that would be generated by the template

Good idea. Indeed, when implemented in code, this ought to be a test case.

Do we want to link explicitly to a spec document URL

I think version should be a number, but it would be reasonable to also add a link/DOI, etc., in a different field. In the python implementation, we would probably not check this field.

Is there a language-agnostic version of format strings we could point to?

I am happy to have one pointed out, but I believe python's is moderately simple and similar to other languages. Other popular template engines like jinja are also typically tied to a particular implementation.

Putting python code in the spec seems problematic

The alternative is to come up with our own language... For the specific case of range, you could replace with a list literal, but would not want to do that for long lists.

martindurant · 2021-03-02T21:06:48Z

This is the python version of the "moustache" templating framework, which appears to have libs in many many languages: https://github.com/noahmorrison/chevron

ajelenak · 2021-03-03T02:08:43Z

README.md

+  "{" and "}".
+
+In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..].
+Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8":


Should it be gen_key0 and gen_key8 instead of gen_ref...?

... which is why it's good to have your spec be in code and tested too, right?

joshmoore · 2021-03-03T14:01:36Z

Only comment on top of those given so far, I think, is that data doesn't seem to be specified. My assumption is that it's the raw return value with no need for further lookup.

Regarding the Python-ism points, I could certainly see having a second implementation to keep us honest. I'd probably default to doing so in Java, but I won't start on that until I hear whether or not @manzt already has one in JS (which is what usually happens...)

martindurant · 2021-03-03T14:10:50Z

data doesn't seem to be specified. My assumption is that it's the raw return value with no need for further lookup.

Indeed. And that the remote chunks referenced are byte blocks too. To be included in JSON, these literals would have to be ascii (which is fine and normal for zarr). We could optionally introduce an encoding like b64 at this point.

As an interesting contrast, for something like the grib2 case, where we need to make a local file to pass to C, one could imagine using this kind of structure and specify a "literals processor" and/or "reference processor" somehow. Very useful, but I think beyond Version 1.

rabernat · 2021-03-03T14:17:35Z

Perhaps we want to restrict what sort of python expressions can be used. Do we envision needing anything besides range? If not, we could do something like

"i": {"start": 0, "stop": 9}

martindurant · 2021-03-03T14:23:37Z

Do we envision needing anything besides range?

Open to suggestions. Using a specific dict as you suggest, though, would hinder future expansion of possibilities, so I would still prefer range (with one, two or three arguments). I originally had imagined allowing anything in python builtins, but indeed that would be hard to make language agnostic.

rabernat · 2021-03-03T14:28:27Z

Then what about

"i": {"range": {"start": 0, "stop": 9}}

My point is that functions and arguments can be encoded as json and then translated to python, rather than putting python code as strings into the json document.
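As a rough illustration of that idea, the sketch below translates a JSON-encoded call into a Python iterable through an allow-list, so no Python source ever appears in the document. The names `ALLOWED_GENERATORS` and `resolve_dimension` are hypothetical, and accepting a plain list as an alternative is an assumption, not something the spec text states.

```python
import json
from typing import Any, Callable, Dict, List


def _range(*, stop: int, start: int = 0, step: int = 1) -> range:
    """Keyword-only wrapper so the JSON keys map directly onto arguments."""
    return range(start, stop, step)


# Hypothetical allow-list: only functions named here may be called from JSON.
ALLOWED_GENERATORS: Dict[str, Callable[..., Any]] = {"range": _range}


def resolve_dimension(spec: Any) -> List[int]:
    """Turn a JSON value into a concrete list of values.

    Accepts either a literal list or a single-key object such as
    {"range": {"start": 0, "stop": 9}}.
    """
    if isinstance(spec, list):
        return list(spec)
    if isinstance(spec, dict) and len(spec) == 1:
        name, kwargs = next(iter(spec.items()))
        if name not in ALLOWED_GENERATORS:
            raise ValueError(f"generator {name!r} is not allowed")
        if not isinstance(kwargs, dict):
            raise ValueError(f"arguments for {name!r} must be a JSON object")
        return list(ALLOWED_GENERATORS[name](**kwargs))
    raise ValueError(f"cannot interpret dimension spec: {spec!r}")


values = resolve_dimension(json.loads('{"range": {"start": 0, "stop": 9}}'))
print(values)
```

Another language would only need its own table of allowed function names, which is what keeps the encoding language agnostic.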

martindurant · 2021-03-03T17:50:18Z

That syntax is OK with me.

manzt · 2021-03-03T18:45:30Z

Thanks for working on this @martindurant

This is the python version of the "moustache" templating framework, which appears to have libs in many many languages.

I think moustache is a good choice for this, echoing @rabernat concern with python-isms.

I originally had imagined allowing anything in python builtins, but indeed that would be hard to make language agnostic.

Ideally we could define the "set" of builtins that the spec may include in JSON like @rabernat's example. This would help clarify which builtins are absolutely necessary and as @joshmoore said "keep us honest" between implementations.

Only comment on top of those given so far, I think, is that data doesn't seem to be specified. My assumption is that it's the raw return value with no need for further lookup.

Also somewhat unclear to me. Can data be either an ascii-encoded blob that needs to be parsed as JSON or just JSON?

I won't start on that until I hear whether or not @manzt already has one in JS (which is what usually happens...)

I've written a few ad-hoc pieces of code to parse the spec, but it's coupled with a custom zarr.js store. I can try to separate out pieces and work on something more "official" and reusable for the JS folks.

martindurant · 2021-03-03T18:50:03Z

I've written a few ad-hoc pieces of code to parse the spec, but it's coupled with a custom zarr.js store.

Since you don't have an fsspec to handle the references, I imagine a custom store is the only way to go. That's how this "pieces of HDF" was initially implemented too; but I am keen that the idea be transferrable beyond zarr if possible, at least in python.

manzt · 2021-03-03T19:02:45Z

Since you don't have an fsspec to handle the references, I imagine a custom store is the only way to go.

Makes sense. I could imagine a reference-spec-reader-js that is just a parser for the spec and returns either data or a ref tuple for a given key, delegating how/if a data request should be made to someone else. I think for now (since I'll probably be one of few users), it makes sense to keep these together.

manzt · 2021-03-03T22:46:34Z

As a side note, I'm not super familiar with mustache, but I'm not sure (don't think) expression evaluation is supported. So handling:

  "offset": "{(i + 1) * 1000}"

could be challenging.

EDIT: One option would be to use templating strictly for substitution and then parse and evaluate the resulting expression. Although use of eval like this can pose a security risk (at least it's warned against in the browser)

import chevron
template = "({{ i }} + 1) * 1000"
expr = chevron.render(template=template, data=dict(i=10)) # "(10 + 1) * 1000"
offset = eval(expr)
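For comparison, the same substitute-then-evaluate idea can be sketched without `eval`, by parsing the rendered string with the standard-library `ast` module and accepting only integer arithmetic. This is a minimal illustration, not what any implementation in this thread does: `substitute` is a hypothetical stand-in for a mustache renderer and handles only `{{ name }}` placeholders.

```python
import ast
import operator
import re

# Only these binary operators are accepted; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def safe_int_expr(expr: str) -> int:
    """Evaluate an integer arithmetic expression without calling eval()."""

    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        # Names, calls, attribute access, strings, etc. all end up here.
        raise ValueError(f"disallowed syntax: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))


def substitute(template: str, data: dict) -> str:
    """Hypothetical stand-in for a mustache renderer: plain substitution only."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}", lambda m: str(data[m.group(1)]), template
    )


expr = substitute("({{ i }} + 1) * 1000", {"i": 10})
offset = safe_int_expr(expr)
print(expr, "->", offset)
```

Because function calls and names are rejected outright, a string such as `__import__('os')` raises `ValueError` instead of being executed.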

ajelenak · 2021-03-04T00:26:35Z

Checking whether this in the spec:

{
    "refs": {
         "key0": "data",
    }
}

covers the use case below (here for context):

{
    "key0": "data:application/octet-stream;base64:<array data>"
}

martindurant · 2021-03-04T13:45:43Z

Good point @ajelenak : this is not described. If the only two formats we consider are text (ascii) and b64-binary, then maybe it can be shorter and simpler.

ajelenak · 2021-03-04T17:37:35Z

Simpler by avoiding "data:application/octet-stream;base64:"? I think that in cases of specs explicit is better than implicit. But avoiding the data URI part will certainly save some bytes.
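To make the trade-off concrete, here is a small sketch of the shorter form: plain ascii is stored verbatim and anything else is base64-encoded behind a marker. The `base64:` prefix and the function names are assumptions made for illustration; the thread does not settle on an exact encoding here.

```python
import base64

# Hypothetical marker for binary payloads; not fixed by the discussion above.
B64_PREFIX = "base64:"


def encode_inline(raw: bytes) -> str:
    """Encode bytes for embedding as a JSON string value."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        text = None
    # Fall back to base64 for non-ascii data, and also for ascii data that
    # happens to start with the marker, so decoding stays unambiguous.
    if text is None or text.startswith(B64_PREFIX):
        return B64_PREFIX + base64.b64encode(raw).decode("ascii")
    return text


def decode_inline(value: str) -> bytes:
    """Invert encode_inline: strip the marker if present, else take ascii."""
    if value.startswith(B64_PREFIX):
        return base64.b64decode(value[len(B64_PREFIX):])
    return value.encode("ascii")


print(encode_inline(b'{"zarr_format": 2}'))
print(encode_inline(b"\x00\xff\x10"))
```

The full data URI form carries the media type as well, at the cost of the extra bytes mentioned above.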

martindurant · 2021-03-09T15:29:21Z

Updates as requested.

Given the availability of nunjucks (js), liquid (ruby), twig (php) and probably others, I think sticking with jinja2 is fine. It means double-braces, unless all of the above permit setting the variable delimiter as Python's version does.

joshmoore · 2021-03-09T17:26:53Z

Given the availability of nunjucks (js), liquid (ruby), twig (php) and probably others

Untested but looks active: https://github.com/HubSpot/jinjava

martindurant · 2021-03-09T21:37:17Z

https://github.com/jinja2cpp/Jinja2Cpp

manzt · 2021-03-09T22:59:51Z

README.md

+        {
+            "key": "gen_key{{i}}",
+            "url": "http://{{u}}_{{i}}",
+            "offset": "{(i + 1) * 1000}",


I could be wrong, but I don't think this is a valid template.

Suggested change

-            "offset": "{(i + 1) * 1000}",
+            "offset": "({{ i }} + 1) * 1000",

It should be "{{(i + 1) * 1000}}" - to evaluate the whole thing within braces as an expression. Your version would evaluate to the string "(0 + 1) * 1000" for i=0.

Yes, you are correct. My mistake!

manzt · 2021-03-10T16:48:56Z

I was able to write a parser in JS to expand the V1 spec: https://github.com/manzt/reference-spec-reader. Thanks for your work on this @martindurant .

To reiterate a potential concern, nunjucks (and I'm guessing other template renders) has a very clear warning that evaluating user defined templates is a security vulnerability:

nunjucks does not sandbox execution so it is not safe to run user-defined templates or inject user-defined content into template definitions.

I don't have the background to evaluate this concern in this context, but it is something to be aware of.

martindurant · 2021-03-10T16:55:39Z

The specific concern, at least in the example, is about passing strings to the browser, which then evaluates them. In our case, we don't eval strings, we turn strings into other strings with the exception of cast to int.

manzt · 2021-03-10T17:11:44Z

In our case, we don't eval strings, we turn strings into other strings with the exception of cast to int.

I'm struggling to see how evaluating this template "{{ (i + 1) * 1000 }}", for example, is a combination of string substitution and integer casting. Ultimately the substituted string expression must be evaluated, but I could be missing something.

martindurant · 2021-03-10T17:14:56Z

This particular value has three stages:

arithmetic evaluation based on the value of i (no functions allowed here except a very limited set that jinja implements)
forms a string output
converted to int with a simple cast (either is an int, or error - no arbitrary evaluation).

manzt · 2021-03-10T20:54:30Z

Thanks for walking me through that. This was a misunderstanding on my part (re: which functions are allowed via jinja).

martindurant · 2021-03-11T02:18:09Z

I have specified what the "ref" entries should be, and included the possibility that they are whole files (i.e., no offset/length - but we don't need to scan all files to find their lengths).
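A reader of the three shapes described here (inline data, a whole-file reference, and a byte-range reference) might classify entries along the following lines. This is only a sketch under assumptions: the `Ref` tuple, `parse_ref`, and `range_header` are hypothetical names, and treating a one-element list as the whole-file form is my reading of "no offset/length".

```python
from typing import NamedTuple, Optional, Union


class Ref(NamedTuple):
    """A reference to remote bytes; offset/length are None for a whole file."""
    url: str
    offset: Optional[int]
    length: Optional[int]


def parse_ref(entry) -> Union[bytes, Ref]:
    """Classify one value from the refs mapping."""
    if isinstance(entry, str):
        # Inline data: returned as-is, no remote lookup needed.
        return entry.encode("ascii")
    if isinstance(entry, (list, tuple)):
        if len(entry) == 1:
            return Ref(entry[0], None, None)
        if len(entry) == 3:
            url, offset, length = entry
            # Offset and length may arrive as strings after templating.
            return Ref(url, int(offset), int(length))
    raise ValueError(f"unrecognised ref entry: {entry!r}")


def range_header(ref: Ref) -> dict:
    """HTTP Range header for a byte-range ref (the end byte is inclusive)."""
    if ref.offset is None:
        return {}
    return {"Range": f"bytes={ref.offset}-{ref.offset + ref.length - 1}"}


ref = parse_ref(["http://example.com/data.nc", 1000, 100])
print(ref, range_header(ref))
```

Keeping the parser separate from the fetch step means a whole-file ref never requires knowing the file length in advance.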

I think this can complete the spec, and I can start making the implementation in fsspec as well as a recipe in pangeo-forge ( @rabernat ?).

This whole-file change would account for the grib2 case or small-hdf case, where we suppose there will be a "filter" stage in the zarr loader which takes the bytes, writes them to disc (or BytesIO for HDF), loads and selects data and passes back the numpy buffer.

This meets the intake-informaticslab use case (especially if we have caching, which could be for the original file or the final array buffer). Of course, specifying this (python) function doesn't happen in this JSON spec; it's either contained in the zarr metadata or even in an intake spec. We could have another convention for those, if we like, such as grib2, hdf:tree/var or generic module:func. I don't intend to implement anything like this just yet, but we want it to be possible.

cc @tam203 @arh89

martindurant · 2021-03-11T16:20:19Z

Merging today, unless I hear otherwise. This will be followed by implementation in fsspec's ReferenceFileSystem, and then the pangeo-forge recipe (which will depend on fsspec HEAD).

ajelenak · 2021-03-11T17:38:11Z

README.md

+            "url": "http://{{u}}_{{i}}",
+            "offset": "{{(i + 1) * 1000}}",
+            "length": "1000",
+            "dimensions": 


Somehow dimensions seems misplaced to me. What is i a dimension of? Perhaps something self-explanatory like "gen_vars", although not as convenient as a single word.

Happy to hear other suggestions, but it seems clear enough to me (as described) - in terms of inputs for the cartesian product.
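To illustrate the cartesian-product reading of "dimensions", the sketch below expands one generator entry into concrete refs using `itertools.product`. It is not the fsspec implementation: `render` is a tiny stand-in for jinja2 that supports only names, integer literals and `+ - *` inside `{{ }}`, and giving each dimension as a plain list is an assumption made to keep the example short.

```python
import ast
import itertools
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def _eval(node, ctx):
    """Evaluate a very small expression subset against the context."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.Name) and node.id in ctx:
        return ctx[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, ctx), _eval(node.right, ctx))
    raise ValueError("unsupported expression")


def render(template: str, ctx: dict) -> str:
    """Stand-in for jinja2: replace each {{ expr }} with its evaluated value."""
    def repl(match):
        tree = ast.parse(match.group(1).strip(), mode="eval")
        return str(_eval(tree.body, ctx))
    return re.sub(r"\{\{(.*?)\}\}", repl, template)


def expand_gen(gen: dict, templates: dict) -> dict:
    """Create one ref per point in the product of all dimensions."""
    names = list(gen["dimensions"])
    refs = {}
    for combo in itertools.product(*(gen["dimensions"][n] for n in names)):
        ctx = {**templates, **dict(zip(names, combo))}
        refs[render(gen["key"], ctx)] = [
            render(gen["url"], ctx),
            int(render(gen["offset"], ctx)),  # rendered string, then plain cast
            int(render(gen["length"], ctx)),
        ]
    return refs


gen = {
    "key": "gen_key{{i}}",
    "url": "http://{{u}}_{{i}}",
    "offset": "{{(i + 1) * 1000}}",
    "length": "1000",
    "dimensions": {"i": [0, 1, 2]},  # assumed list form
}
expanded = expand_gen(gen, {"u": "long_text_template"})
for key, ref in expanded.items():
    print(key, ref)
```

With two dimensions of sizes m and n the same loop yields m × n keys, which is the sense in which each named variable adds a dimension of generated keys.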

Please go ahead and merge when ready. I did not mean to delay that with my comment.

See fsspec/kerchunk#17. Note that this *expands* the given data and builds a dircache. In the future, entries should be available only on demand.

martindurant · 2021-03-12T20:39:31Z

Implementation: fsspec/filesystem_spec#568

When that's in, I'll go back to fsspec_reference_maker.hdf and edit it accordingly, to be called from a pangeo-forge recipe.

manzt · 2021-03-28T21:16:13Z

Not sure if anyone here is interested, but I figured this is an appropriate place to share for now (working on a blog post). I have released a JS package on npm that includes a parser for the specification and a valid store implementation for Zarr.js in the browser (relies on http(s) protocol).

package: https://github.com/manzt/reference-spec-reader

example: https://6060eed9dc5ef100075be024--vizarr.netlify.app/?source=https://gist.githubusercontent.com/manzt/436fc2966c484205a2c60824f659b412/raw/cdc69f2ce645d953185f10d7552501bfd459dd12/Vanderbilt-Spraggins-Kidney-MxIF.ome.tif.json&channel_axis=0

The example url is a preview of this PR in vizarr, an interactive viewer for zarr-based images. The reference (v0) was generated via tifffile for a multiscale OME-TIFF we host on GCS. Same reference can be viewed in python via napari with fsspec + dask & zarr.

rabernat · 2021-03-29T00:19:53Z

Vizarr is so cool! Can we point it at some climate data?

manzt · 2021-03-31T21:44:01Z

@rabernat I'm not sure. I'd have to look at the Zarr (both hierarchy structure and chunk shape for arrays). Currently Vizarr relies on the "multiscales" extension for representing downsampled data, the last two dimensions must be (y, x) and non-xy dimensions must have a chunksize of one. The other thing I'm curious about is coordinate system. Happy to chat!

Martin Durant added 2 commits March 1, 2021 15:56

Spec version 1

262ba48

too many zeros

bdbb0e6

rabernat reviewed Mar 2, 2021

ajelenak reviewed Mar 3, 2021

responses

fe2c8a8

manzt reviewed Mar 9, 2021

Add braces; add expanded form

b5560bf

Clarify ref structure

3115b08

ajelenak reviewed Mar 11, 2021

martindurant merged commit 7f0e2d9 into fsspec:main Mar 12, 2021

martindurant deleted the spec1 branch March 12, 2021 02:07

martindurant mentioned this pull request Mar 16, 2021

[Proposal]: Dump ReferenceFileSystem spec for ZarrTiffStore that can be read natively as zarr cgohlke/tifffile#56

Closed

cisaacstern mentioned this pull request Oct 13, 2023

Status of JSON Schema for references spec #373

Open