From 262ba4850008b4e2b9b47d25d8afa90a40c32486 Mon Sep 17 00:00:00 2001 From: Martin Durant Date: Mon, 1 Mar 2021 15:56:06 -0500 Subject: [PATCH 1/5] Spec version 1 --- README.md | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 60 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 43ca309b..7aef228d 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,16 @@ # fsspec-reference-maker + Functions to make reference descriptions for ReferenceFileSystem -Proposed spec for the structure required by ReferenceFileSystem: +### Version 0 -``` +Prototype spec for the structure required by ReferenceFileSystem: + +```json { "key0": "data", - "key1": { - ["protocol://target_url", 10000, 100] - } + "key1": ["protocol://target_url", 10000, 100] } ``` where: @@ -18,7 +19,7 @@ where: For example, Zarr data in this proposed spec might be represented as: -``` +```json { ".zgroup": "{\n \"zarr_format\": 2\n"}, ".zattrs": "{\n \"Conventions\": \"UGRID-0.9.0\n\"}, @@ -28,7 +29,60 @@ For example, Zarr data in this proposed spec might be represented as: }, ``` +### Version 1 + +Metadata structure in JSON. We note, for future possible binary storage, that "version", "gen" and "templates" should +be considered attributes, and "refs" as the data that ought to dominate the storage size. The previous definition, +Version 0, is compatible with the "refs" entry, but here we add features. It will also be possible to *expand* +this new enhanced spec into Version 0 format. + +```json +{ + "version": 1, + "templates": { + "u": "long_text_template", + "f": "{c}" + }, + "gen": [ + { + "key": "gen_key{i}", + "url": "protocol://{u}_{i}", + "offset": "{(i + 1) * 1000}", + "length": "1000", + "i": "range(10)" + } + ], + "refs": { + "key0": "data", + "key1": ["protocol://target_url", 10000, 100], + "key2": ["protocol://{u}", 10000, 100], + "key3": ["protocol://{f(c='text')}", 10000, 100] + } +} +``` + +Explanation of fields follows. 
Only "version" and "refs" are required: + +- version: set to 1 for this spec. +- templates: set of named string templates. These can be plain strings, to be included verbatim, or format strings + (anything containing "{" and "}" characters) which will be called with parameters. The format specifiers for each + variable follows the python string formatting spec. +- gen: programmatically generated key/value pairs. Each entry adds one or more items to "refs"; in practice, in the + implementation, we may choose to populate these or create them on-demand. Any of the fields can contain + templated parameters. + - key, url: generated key names and target URLs + - offset, length: to define the bytes range, will be converted to int + - additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a + dimension of generated keys +- refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains + "{" and "}". + +In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..]. +Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 10000, 1000] to "gen_ref9": +["protocol://long_text_template_9", 100000, 1000]. + +## Examples Run a notebook example comparing reading HDF5 using this approach vs. native Zarr format:
[![Binder](https://aws-uswest2-binder.pangeo.io/badge_logo.svg)](https://aws-uswest2-binder.pangeo.io/v2/gh/intake/fsspec-reference-maker/main?urlpath=lab%2Ftree%2Fexamples%2Fike_intake.ipynb) From bdbb0e625d1c89028fa1b8e6b2e5ba4f52c96254 Mon Sep 17 00:00:00 2001 From: Martin Durant Date: Mon, 1 Mar 2021 17:03:47 -0500 Subject: [PATCH 2/5] too many zeros --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 7aef228d..63e37828 100644 --- a/README.md +++ b/README.md @@ -49,7 +49,7 @@ this new enhanced spec into Version 0 format. "url": "protocol://{u}_{i}", "offset": "{(i + 1) * 1000}", "length": "1000", - "i": "range(10)" + "i": "range(9)" } ], "refs": { @@ -78,8 +78,8 @@ Explanation of fields follows. Only "version" and "refs" are required: "{" and "}". In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..]. -Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 10000, 1000] to "gen_ref9": -["protocol://long_text_template_9", 100000, 1000]. +Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8": +["protocol://long_text_template_9", 9000, 1000]. ## Examples From fe2c8a8edea45b875b3ec44a4679af6aa0e424ba Mon Sep 17 00:00:00 2001 From: Martin Durant Date: Tue, 9 Mar 2021 10:21:56 -0500 Subject: [PATCH 3/5] responses --- README.md | 94 ++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 66 insertions(+), 28 deletions(-) diff --git a/README.md b/README.md index 63e37828..aaebce41 100644 --- a/README.md +++ b/README.md @@ -36,51 +36,89 @@ be considered attributes, and "refs" as the data that ought to dominate the stor Version 0, is compatible with the "refs" entry, but here we add features. It will also be possible to *expand* this new enhanced spec into Version 0 format. 
+ +``` +{ + "version": (required, must be equal to) 1, + "templates": (optional, zero or more arbitrary keys) { + "template_name": jinja-str + }, + "gen": (optional, zero or more items) [ + "key": (required) jinja-str, + "url": (required) jinja-str, + "offset": (required) jinja-str, + "length": (required) jinja-str, + "dimensions": (required, one or more arbitrary keys) { + "variable_name": (required) + {"start": (optional) int, "stop": (required) int, "step": (optional) int} + OR + [int, ...] + } + ], + "refs": (optional, zero or more arbitrary keys) { + "key_name": (required) str OR [url(jinja-str), offset(int), length(int)] + } +} +``` + +Where: +- `jinja-str` is a string which will be rendered by jinja2 or its non-python equivalent; i.e., it may be + a literal string, or may include "{{..}}" annotations, where + - for the values associated with a template_name, the variables are to be passed in reference URL strings that + use this template + - for the values within a "gen" object, variables come from the "dimensions" and "templates" +- the `str` format of a reference value may be + - a string starting "base64:", which will be decoded to binary + - any other string, interpreted as ascii data + +Here is an example + ```json { "version": 1, "templates": { - "u": "long_text_template", - "f": "{c}" + "u": "server.domain/path", + "f": "{{c}}" }, "gen": [ { - "key": "gen_key{i}", - "url": "protocol://{u}_{i}", + "key": "gen_key{{i}}", + "url": "http://{{u}}_{{i}}", "offset": "{(i + 1) * 1000}", "length": "1000", - "i": "range(9)" + "dimensions": + { + "i": {"stop": 5} + } } ], "refs": { "key0": "data", - "key1": ["protocol://target_url", 10000, 100], - "key2": ["protocol://{u}", 10000, 100], - "key3": ["protocol://{f(c='text')}", 10000, 100] + "key1": ["http://target_url", 10000, 100], + "key2": ["http://{{u}}", 10000, 100], + "key3": ["http://{{f(c='text')}}", 10000, 100] } } ``` +Here the variable `i` takes the values `[0, 1, 2, 3, 4]`, which could have been
provided in array form. Where there +is more than one variable, a cartesian product is formed. -Explanation of fields follows. Only "version" and "refs" are required: - -- version: set to 1 for this spec. -- templates: set of named string templates. These can be plain strings, to be included verbatim, or format strings - (anything containing "{" and "}" characters) which will be called with parameters. The format specifiers for each - variable follows the python string formatting spec. -- gen: programmatically generated key/value pairs. Each entry adds one or more items to "refs"; in practice, in the - implementation, we may choose to populate these or create them on-demand. Any of the fields can contain - templated parameters. - - key, url: generated key names and target URLs - - offset, length: to define the bytes range, will be converted to int - - additional named parameters: for each iterable found (i.e., returns successfully from `iter()`), creates a - dimension of generated keys -- refs: keys with either data or [url, offset, length]. The URL will be treated as a template if it contains - "{" and "}". - -In the example, "key2" becomes ["protocol://long_text_template", ..] and "key3" becomes ["protocol://text", ..]. -Also contained will be keys "gen_ref0": ["protocol://long_text_template_0", 1000, 1000] to "gen_ref8": -["protocol://long_text_template_9", 9000, 1000]. 
- +This example evaluates to the Version 0 equivalent: +```json +{ + "key0": "data", + "key1": ["http://target_url", 10000, 100], + "key2": ["http://server.domain/path", 10000, 100], + "key3": ["http://text", 10000, 100], + "gen_key0": ["http://server.domain/path_0", 1000, 1000], + "gen_key1": ["http://server.domain/path_1", 2000, 1000], + "gen_key2": ["http://server.domain/path_2", 3000, 1000], + "gen_key3": ["http://server.domain/path_3", 4000, 1000], + "gen_key4": ["http://server.domain/path_4", 5000, 1000] +} +``` +such that accessing, for instance, "key0" returns `b"data"` and accessing "gen_key0" returns 1000 bytes +from the given URL, at an offset of 1000. ## Examples From b5560bf90d5393509c4bc32e6aba939188f93c81 Mon Sep 17 00:00:00 2001 From: Martin Durant Date: Wed, 10 Mar 2021 09:13:29 -0500 Subject: [PATCH 4/5] Add braces; add expanded form --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index aaebce41..660c248a 100644 --- a/README.md +++ b/README.md @@ -84,7 +84,7 @@ Here is an example { "key": "gen_key{{i}}", "url": "http://{{u}}_{{i}}", - "offset": "{(i + 1) * 1000}", + "offset": "{{(i + 1) * 1000}}", "length": "1000", "dimensions": { From 3115b087153644b5d35a5533c6719d6704734f6f Mon Sep 17 00:00:00 2001 From: Martin Durant Date: Wed, 10 Mar 2021 15:23:02 -0500 Subject: [PATCH 5/5] Clarify ref structure --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 660c248a..39b62467 100644 --- a/README.md +++ b/README.md @@ -56,7 +56,7 @@ this new enhanced spec into Version 0 format.
} ], "refs": (optional, zero or more arbitrary keys) { - "key_name": (required) str OR [url(jinja-str), offset(int), length(int)] + "key_name": (required) str OR [url(jinja-str)] OR [url(jinja-str), offset(int), length(int)] } } ``` @@ -70,6 +70,8 @@ Where: - the `str` format of a reference value may be - a string starting "base64:", which will be decoded to binary - any other string, interpreted as ascii data +- a `str` ref value holds inline data, a one-element array references a whole URL, and a three-element array + references a byte range (offset, length) within a URL Here is an example
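
Separately from the README's own JSON example, the expansion rules this series describes (dimension ranges, the cartesian product over variables, and the three forms a ref value can take) can be sketched in Python. This is a minimal sketch under stated assumptions, not the ReferenceFileSystem implementation: the function names `expand_dimensions` and `describe_ref` are hypothetical, jinja2 rendering is deliberately left out, and the defaults `start=0` and `step=1` are inferred from the example in which `{"stop": 5}` yields `[0, 1, 2, 3, 4]`.

```python
import base64
import itertools


def expand_dimensions(dimensions):
    """Yield one dict of variable values per point of the cartesian product.

    Each dimension is either {"start", "stop", "step"} (only "stop" required;
    the defaults used here are an assumption) or an explicit list of ints.
    """
    names = list(dimensions)
    axes = []
    for name in names:
        spec = dimensions[name]
        if isinstance(spec, dict):
            axes.append(range(spec.get("start", 0), spec["stop"], spec.get("step", 1)))
        else:
            axes.append(list(spec))
    for combo in itertools.product(*axes):
        yield dict(zip(names, combo))


def describe_ref(value):
    """Classify an already-rendered "refs" value (hypothetical helper)."""
    if isinstance(value, str):
        if value.startswith("base64:"):
            # "base64:" prefix: decode the remainder to binary
            return ("data", base64.b64decode(value[len("base64:"):]))
        # any other string is interpreted as ascii data
        return ("data", value.encode("ascii"))
    if len(value) == 1:
        # one-element array: the whole URL
        return ("whole_url", value[0])
    # three-element array: a byte range within the URL
    url, offset, length = value
    return ("byte_range", url, offset, length)


points = list(expand_dimensions({"i": {"stop": 5}}))
```

With two variables, for instance `{"i": {"stop": 2}, "j": [10, 20, 30]}`, the generator yields six combinations, one per pair; each combination would then be passed to the template renderer to produce a key, URL, offset and length.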