Implement templating for easier modifications of metadata like descriptions #68
Comments
I'm not sure about the STAC API part. Do you mean jsonnet being returned by the STAC API? I've not looked at the STAC API yet, so that might be weird. What we have now with Earth Engine:
I believe that's basically what I was thinking. I was mostly thinking in the context of:
Some of the use cases and concepts I was thinking about:
The functions I made for anything that has a global extent:
That means that a collection or item would only have to put something like this in the .jsonnet:
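As a rough Python analogue of the idea (not the jsonnet from the original comment), a shared helper for global extents could look something like this. The function name `global_extent`, the collection id, and the start date are all hypothetical; only the extent layout (`spatial.bbox`, `temporal.interval`) follows the STAC Collection spec.

```python
from typing import Optional


def global_extent(start: str, end: Optional[str] = None) -> dict:
    """Return a STAC extent object covering the whole globe.

    `end=None` marks an open-ended (still growing) temporal range.
    """
    return {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [[start, end]]},
    }


# A collection then only needs one call instead of repeating the bbox.
# The id and start date below are made up for illustration.
collection = {
    "type": "Collection",
    "id": "example-global-dataset",
    "extent": global_extent("2000-01-01T00:00:00Z"),
}
```

The point of the helper is the same as in the jsonnet version: the boilerplate lives in one reviewed place, and each collection states only what is specific to it.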
Trouble spots:
Some of these items boil down to this: projects need to do careful code reviews on new jsonnet, just as they would for Python, JavaScript, or any other full language. There is no networking or ability to read arbitrarily named files outside of those passed |
Interesting, thanks for all this context and ideation. I like the idea that we (meaning the folks managing public open STAC datasets, as well as anyone else who'd like to use them) could all pull from a repo that is full of the common information about datasets. That way we can collaborate on making sure all the collection-level data we are capturing is complete and accurate. It would be pretty simple to set up a repo with templates for some of the collections that AI for Earth will be hosting in our STAC API soon. It would actually be nice to seed that with the collection information that GEE has put together. The pipeline of updating those collection templates to updating our STAC API information would be manual at first, but could eventually be automated. The STAC API part of it is just updating the collections in the database to be served through the API, so the pipeline would go one step further:
I didn't realize jsonnet was as powerful as it was - is this overkill? Have you seen situations where this is a lot better than say a markdown file or YAML that can be edited by anyone, then parsed and templated into JSON via Jinja2 or something similar? That sounds like what you have now with your reference to a brittle script, and good to know that eventually breaks...though even for the example of a global extent, there's a bit of requisite knowledge of how to import a library of functions like that into jsonnet and then call it in the new syntax. There's a cognitive burden for introducing more languages for folks to learn, and I want to be cognizant of that as the folks I think we'd want to attract to write content for the templates and review for correctness might not want to learn a new syntax.

That said, you've done a lot of digging into this and I trust you've considered this from many angles - if you think jsonnet is the best way forward, I'd be very willing to try it out. That'd start with creating a repo of jsonnet templates (probably in this github org) and like I mentioned seeding it with publicly available collection information for datasets we're interested in (the current next-month-horizon list is NAIP, Landsat 8 C2, Sentinel 2, and ASTER).

For some of the trouble spots - I think if the process was based on a PR-review workflow that systems could hook into to get notifications of new updates, with a CI that ensured you could review the rendered product (even potentially with a Netlify PR preview, which would be nice), that would mitigate some of the issues around attacks. If folks wanted an extra layer, they could keep their own internal repo and review upstream changes themselves before kicking off their own processes.
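For comparison, the "editable Markdown, then templated into JSON" alternative raised above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses the standard library's `string.Template` rather than Jinja2, the template text and the `render_collection` name are hypothetical, and the description is inlined here where a real setup would read it from a Markdown file (for example with `pathlib.Path.read_text`).

```python
import json
from string import Template

# Hypothetical collection template; $-placeholders are filled at render time.
COLLECTION_TEMPLATE = Template("""{
  "type": "Collection",
  "id": "$collection_id",
  "description": $description_json
}""")


def render_collection(collection_id: str, description_md: str) -> dict:
    """Render the template and parse the result to prove it is valid JSON."""
    rendered = COLLECTION_TEMPLATE.substitute(
        collection_id=collection_id,
        # json.dumps escapes quotes and newlines in the Markdown text.
        description_json=json.dumps(description_md),
    )
    return json.loads(rendered)


# Stand-in for the contents of an editable Markdown file.
description = "# NAIP\n\nAerial imagery with \"quotes\" and newlines."
collection = render_collection("naip", description)
```

Escaping the Markdown with `json.dumps` before substitution is the step a plain text template tends to get wrong, which is one of the ways a simple script of this kind can become brittle.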
This is a good point - how is the STAC metadata in GEE currently licensed? If we had a single repo with all the metadata and clear licensing that we all used, that would be ideal for sure. |
@lossyrob, lots of good material there. Some follow-on thoughts (that don't address everything)...
Anything we do for the Earth Engine is going to require human review by a Google employee before triggering any update. We do automate as much of the process as possible. If there is a common library of jsonnet things, we would be motivated to contribute to it and hopefully use it.
It is a lot, and I'm trying to keep the fraction of the (Turing-complete) language to a minimum. One thing it can't do is read files outside of libjsonnet files. e.g. it can't do a
It would be awesome to have others give it a go. If anyone tries it and doesn't like it, reverting to just the JSON is just running the forward transform. That's a new one to me.
There is currently no license on the metadata contained in the GEE STAC catalog and there is no copyright assertion in the JSON files. As for the jsonnet code around that, I am pretty sure that Google (via me) will be releasing the jsonnet code around that metadata as Apache 2.0. All of the code snippets that I shared to date should have the copyright header and the SPDX Apache 2.0 tag, so folks should already be able to use the prototype examples I've shared (at least from the open source license point-of-view). |
@schwehr on a tactical note, I tried to spin up a jsonnet for a collection I'm templating. tbh I found the documentation a bit hard to parse, so maybe this is easy and I just couldn't find a way to do it - I really want the "description" field for the collection to be a markdown file that can be edited easily, and then imported into the template as part of the render process. Do you know how to do that in jsonnet? |
I got some help getting started, so hopefully I can pass it along. Ask away so we can capture some of the parts that are confusing at startup. Separate markdown files are not an option as jsonnet can't bring them in. The best I can offer the I have a similar issue with things that have large CSV tables in separate files that need to be expanded into the resulting STAC json somehow. A quick example of markdown in
Then running
I've been working on code to convert parts of the existing EE STAC JSON into jsonnet functions. However, it's super specific stuff, e.g.:

    import re

    # Matches a hard-coded "self" link object in the jsonnet source.
    LINK_SELF_RE = re.compile(
        r"""{\s+rel:\s*'self',\s*href:\s*'https://[^']+',\s+},""", re.M)

    def LinkSelf(src: str) -> str:
        # Replace only the first match with a reference to self_url.
        link_self = """{ rel: 'self', href: self_url },"""
        result = LINK_SELF_RE.sub(link_self, src, 1)
        return result

This bit of Python finds the self link and cleans it up to be:
Where self_url is defined at the top of the file:
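As a rough usage sketch of the `LinkSelf` helper shown earlier in this comment, here it is applied to a small sample. The sample jsonnet text and the `example.com` URL are hypothetical and are not taken from the Earth Engine catalog; the function is repeated (under a snake_case name) so the snippet runs on its own.

```python
import re

# Same pattern as in the comment above: a hard-coded "self" link object.
LINK_SELF_RE = re.compile(
    r"""{\s+rel:\s*'self',\s*href:\s*'https://[^']+',\s+},""", re.M)


def link_self(src: str) -> str:
    """Replace the first hard-coded self link with a self_url reference."""
    replacement = """{ rel: 'self', href: self_url },"""
    return LINK_SELF_RE.sub(replacement, src, 1)


# Hypothetical input, shaped like JSON that was converted to jsonnet.
sample = """{
  links: [
    {
      rel: 'self',
      href: 'https://example.com/catalog/collection.json',
    },
  ],
}"""

cleaned = link_self(sample)
print(cleaned)
```

Because the pattern requires a trailing comma after the closing brace and single-quoted strings, it only matches source formatted this way, which is the "super specific" caveat mentioned above.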
|
Related to this issue of templating is how to lay out STAC catalogs. For something like jsonnet, the structure determines how hard it is to find required libsonnet files that might be sprinkled through the tree. I have two prototypes described here: https://twitter.com/kurtschwehr/status/1371896201063243777 Things like an
We have total flexibility, as STAC links are just URLs. Things could span multiple buckets and cloud hosting services (S3, GCS, Azure Blob Storage, etc.). People could go nuts and make a CAS (content-addressable store), but I don't think folks are going to appreciate a structure like this with hashes.
|
Sort-of-related, I've implemented less-than-templating for the modis package which I called fragments. I was noticing how almost every stactools package had a I'd be curious what folks think about formalizing the "fragments" concept into a stactools-supported setup. It's very intentionally less-than-templating, as I was trying to ensure that logic didn't creep into the static metadata (keep constants constant). |
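To make the "fragments" idea concrete, here is a minimal Python sketch of what less-than-templating could look like. It is not the stactools or modis package implementation: the names `_FRAGMENTS`, `load_fragment`, and `build_collection`, and the fragment contents, are hypothetical, and the fragments are kept in a dict here where a real package would more likely ship them as static JSON files.

```python
import copy

# Hypothetical static fragments: plain data, no logic, one entry per product.
_FRAGMENTS = {
    "example-product": {
        "title": "Example product",
        "license": "proprietary",
        "keywords": ["example", "global"],
    },
}


def load_fragment(name: str) -> dict:
    """Return a deep copy so callers cannot mutate the shared constants."""
    return copy.deepcopy(_FRAGMENTS[name])


def build_collection(collection_id: str, fragment_name: str) -> dict:
    """Combine generated fields with the constant metadata fragment."""
    collection = {"type": "Collection", "id": collection_id}
    collection.update(load_fragment(fragment_name))
    return collection


collection = build_collection("example-collection", "example-product")
```

Since a fragment is only data, a metadata fix is a one-line change to a static file that a non-Python contributor can review, which is the "keep constants constant" goal described above.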
In radiantearth/stac-spec#986, @schwehr outlines a technique to use jsonnet to implement templating, which would allow users who are not necessarily Python devs to be able to effectively edit metadata that is then used to generate STAC Catalogs, Collections and Items.
I think this would be an effective technique to use in stactools. If we had jsonnet templates for the collections and items that were used to generate those objects, then users could make pull requests against stactools to update those templates in case there are any metadata errors or additions. There could be a core function that would take an object, say a Collection, and a template, and then update the collection with any values the template provides.
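A minimal sketch of such a core function, assuming the collection and the rendered template are both available as plain dictionaries (for example, a Collection serialized to a dict and a jsonnet template already rendered to JSON). The name `apply_template` and the sample values are hypothetical; this is not an existing stactools API.

```python
import copy
from typing import Any, Dict


def apply_template(collection: Dict[str, Any],
                   template: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `collection` with template values merged in.

    Nested dicts are merged key by key; any other value in the template
    (strings, numbers, lists) replaces the existing value outright.
    """
    result = copy.deepcopy(collection)
    for key, value in template.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_template(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical existing collection and a template carrying corrections.
existing = {
    "id": "example",
    "description": "Old description",
    "providers": [{"name": "Old provider"}],
    "summaries": {"gsd": [30], "platform": ["example-sat"]},
}
template = {
    "description": "Corrected description",
    "summaries": {"gsd": [15]},
}
updated = apply_template(existing, template)
```

Replacing lists wholesale rather than merging them is a deliberate simplification here; whether lists such as `providers` should be merged or replaced is a design decision the real function would need to settle.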
For example, as someone who maintains a STAC API, I could: