
Multi-dataset abstraction layer #142

Open
JackKelly opened this issue Jun 21, 2024 · 2 comments
Labels
enhancement (New feature or request), performance (Improvements to runtime performance), usability (Make things more user-friendly)

Comments

JackKelly (Owner) commented Jun 21, 2024

Maybe have a layer which sits above multiple datasets. Those datasets could be in any format (zarr, grib, etc.) and live anywhere (maybe some datasets are on local disks, some are in cloud object storage). Possibly some data is duplicated to optimise for different read patterns (see #141).

Users would query the "multi-dataset layer". When reading, the "multi-dataset layer" would select which underlying dataset to use for a given query, and could merge multiple datasets (e.g. NWP and satellite).
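
For illustration, a minimal sketch of what that routing could look like in Python with xarray. Everything here is hypothetical (the `MultiDataset` name, the `register`/`sel` API, the store paths, and the toy routing rule), not an existing interface:

```python
import xarray as xr


class MultiDataset:
    """Hypothetical facade over several physical copies of one logical dataset.

    Each copy may live anywhere (local disk, object storage) and in any format
    (Zarr, GRIB, ...); reads are routed to whichever copy suits the query.
    """

    def __init__(self):
        self._copies = []  # list of (open_fn, optimised_for) pairs

    def register(self, open_fn, *, optimised_for: str) -> None:
        """Register one physical copy via a function that opens it lazily."""
        self._copies.append((open_fn, optimised_for))

    def sel(self, **query) -> xr.Dataset:
        """Route the query to the best-suited copy, then read from it."""
        return self._pick(query)().sel(**query)

    def _pick(self, query):
        # Toy routing rule: point queries (lat/lon given) favour a copy chunked
        # along time; everything else favours a copy chunked as whole images.
        target = (
            "timeseries"
            if "latitude" in query and "longitude" in query
            else "whole_image"
        )
        for open_fn, optimised_for in self._copies:
            if optimised_for == target:
                return open_fn
        return self._copies[0][0]  # fall back to the first registered copy


# Hypothetical usage: register two copies of the same NWP data, then let the
# layer route each query to whichever copy suits it.
mds = MultiDataset()
mds.register(lambda: xr.open_zarr("s3://bucket/nwp_time_chunked.zarr"),
             optimised_for="timeseries")
mds.register(lambda: xr.open_zarr("/data/nwp_image_chunked.zarr"),
             optimised_for="whole_image")
point_timeseries = mds.sel(latitude=51.5, longitude=-0.1)
```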

Perhaps this layer could also be responsible for keeping multiple on-disk datasets up-to-date when new data comes along (e.g. duplicating new data to two different datasets, which are optimised for different read patterns). But maybe that's best kept disaggregated as something the user can schedule in a data orchestration tool like Dagster.

Also, maybe the layer could automatically figure out when it'd be worth creating a new "optimised" dataset, e.g. by keeping track of the read patterns it serves.
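
As a rough sketch of that bookkeeping (the `ReadPatternTracker` class, the threshold, and the pattern names are all made up for illustration):

```python
from collections import Counter


class ReadPatternTracker:
    """Hypothetical bookkeeping: count reads per access pattern so the layer
    can notice when materialising a new optimised copy would pay off."""

    def __init__(self, threshold: int = 1_000):
        self.threshold = threshold  # reads of one pattern before suggesting a copy
        self.counts = Counter()

    def record(self, pattern: str) -> None:
        """Record one read, e.g. pattern="timeseries" or "whole_image"."""
        self.counts[pattern] += 1

    def suggest_new_copies(self, existing_copies: set) -> list:
        """Patterns that are read often but have no copy optimised for them."""
        return [
            pattern
            for pattern, n in self.counts.items()
            if n >= self.threshold and pattern not in existing_copies
        ]
```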

Maybe this fits into "layer 5: applications"?


JackKelly (Owner, Author) commented:

It might be best to store multiple representations of a given dataset at creation time, rather than first creating a dense Zarr, and then creating a differently chunked dataset from that Zarr. So you could imagine wanting to pipe data into multiple drains in parallel.
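
A minimal sketch of the "multiple drains" idea, assuming an xarray-to-Zarr ingest path (the store paths, chunk sizes, and append semantics are placeholders; a real pipeline would also feed the drains concurrently and handle the initial write before any appends):

```python
import xarray as xr


def drain_batch(batch: xr.Dataset) -> None:
    """Hypothetical ingest step: write one incoming batch of data to several
    stores at once, each rechunked for a different read pattern, instead of
    writing a single dense Zarr and rechunking it afterwards."""
    drains = {
        # store path             -> chunking optimised for one read pattern
        "nwp_time_chunked.zarr":  {"time": -1, "latitude": 8, "longitude": 8},
        "nwp_image_chunked.zarr": {"time": 1, "latitude": -1, "longitude": -1},
    }
    for store, chunks in drains.items():
        # Sequential here for brevity; in practice the drains would be fed in parallel.
        batch.chunk(chunks).to_zarr(store, append_dim="time")
```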

JackKelly changed the title from "Multi-dataset layer" to "Multi-dataset abstraction layer" on Jun 27, 2024
JackKelly (Owner, Author) commented:

Better analogy: YouTube stores each video multiple times, each with a different compression setting and resolution. Let's do the same for ndim arrays!
