collaborate with Pangeo? #1
Hello! Indeed, you might find some inspiration from intake-astro, which contains code for parallel loading of FITS with Dask from any fsspec URL, including cloud object stores and more. This was to a great extent based on conversations with sunpy people such as @Cadair (who I believe has also developed the idea further).
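For illustration, here is a minimal sketch of the pattern described here, written with plain `dask.delayed` rather than intake-astro's actual driver API (the URLs, shape, and dtype are hypothetical):

```python
import dask
import dask.array as da
import fsspec
from astropy.io import fits

# Hypothetical fsspec URLs; the same pattern works for local paths,
# HTTP(S), s3://, gs://, etc.
urls = [f"s3://example-bucket/frame-{i:03d}.fits" for i in range(10)]

@dask.delayed
def load(url):
    # Each task opens one remote FITS file and reads its primary HDU.
    with fsspec.open(url, "rb") as f:
        with fits.open(f) as hdul:
            return hdul[0].data.copy()  # copy so data outlives the handle

# Assumed per-frame shape/dtype; stacking yields one lazy Dask array.
frames = [da.from_delayed(load(u), shape=(2048, 2048), dtype="float32")
          for u in urls]
cube = da.stack(frames)  # .compute() triggers the parallel reads
```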
hey @rabernat and @martindurant. Thank you for the pointers. I've been focused on writing a CI/CD solution these past few weeks and am just now getting around to reading my GitHub notifications. I'll be free to follow up with these leads in a few weeks. Thank you again!
No worries @jbcurtin! Just about everyone is snowed under right now! 🙃 We recently published a preprint you might be interested in: Cloud Native Repositories for Big Scientific Data.
hey @rabernat, thank you for sending over "Cloud Native Repositories for Big Scientific Data" (CNRBSD). My background is in MLOps, DevOps, and Data Engineering (ETL), and my views align with many of the points the CNRBSD paper makes. 👍

I've looked into Zarr, Cloud Object Storage, and a few other solutions for the project that initiated this GitHub repository, Cloud Optimized FITS. The need came about when we started looking at serving TESS data cubes over AWS Lambda. The mindset I've adopted for this problem area is that management of ARD data can be organized around whatever the scientist's tools specifically require: Dask(-ML), Kubeflow, Jupyter Notebooks, Astropy, Python, etc. Depending on the tools in use, an appropriate implementation can be written in a small number of sprints to accommodate the scientist and optimize loading data into the algorithms they create.

The most basic blueprint of this resource-serving structure is the HTTP(S) protocol as implemented in Nginx, Caddy, and other comparable services capable of serving static files, including cloud object storage. By focusing on the HTTP(S) headers these servers provide (see the sketch below), we get the basic primitives required for the more complex work of serving data over larger networks (the Internet). While working with this idea we looked at cloud providers (AWS, GCP, DigitalOcean) and didn't consider the specialized requirements of HPC environments. The abstraction layer we're looking to leverage has been implemented in most if not all modern HTTP(S) servers, such as Nginx, Apache2, Jetty, JBoss, and Caddy.
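To make the header-based approach concrete, here is a minimal sketch, assuming a hypothetical URL; any static file server or object store that honors Range requests behaves the same way:

```python
import requests

# Hypothetical FITS file behind any Range-capable HTTP(S) server.
url = "https://example.com/data/tess-cube.fits"

# Ask for just the first 2880-byte FITS header block, not the whole file.
resp = requests.get(url, headers={"Range": "bytes=0-2879"})
assert resp.status_code == 206          # 206 Partial Content
header_block = resp.content             # exactly 2880 bytes
```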
I can't speak officially, but I'd be willing to wager that everyone supportive of the research that has gone into this GitHub repository would welcome the idea of collaborating with Pangeo. Is there someone I can reach out to via email? Where should I look on the Internet to learn more about collaborating with Pangeo?
(your cloud-optimised-fits link points to a non-public Google Doc) Note that if your data is stored on a server supporting range requests, or in a cloud store, you can already load FITS files without any special handling, e.g.:
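Something along these lines (the object path inside the pangeo bucket is hypothetical):

```python
import fsspec
from astropy.io import fits

# Hypothetical object in the pangeo bucket; astropy reads it through a
# seekable fsspec file handle, fetching bytes over the network as needed.
with fsspec.open("gs://pangeo-data/example/image.fits", "rb") as f:
    with fits.open(f) as hdul:
        hdul.info()
        data = hdul[1].data
```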
(notice that this data is in a pangeo bucket) If you read the thread pangeo-data/pangeo#269, you will see that many cloud big-data operations have been demonstrated with FITS data, and there is no need to invent new server technology or formats. As well as the code I linked above to fetch extension data from multiple FITS files in parallel, I would also like to draw your attention to https://github.com/intake/fsspec-reference-maker/, which is an effort to extract metadata and/or offsets to binary blocks within cloud-accessible files. The process was conceived with HDF5 files in mind, as a way to make them directly readable with zarr. Something very similar could work for FITS files (so long as there is no whole-file compression). The interesting difference would be
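As a rough illustration of the reference idea (the URL, offset, and length below are made up), fsspec's ReferenceFileSystem can serve byte ranges of remote files under logical keys:

```python
import fsspec

# Version-0 style reference set: each key maps to [url, offset, length],
# a byte range inside some cloud-accessible file (values are hypothetical).
refs = {
    "image/0.0": ["https://example.com/data/image.fits", 5760, 4_194_304],
}

fs = fsspec.filesystem("reference", fo=refs)
block = fs.cat("image/0.0")  # fetches exactly that byte range
```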
@DPeterK: this all sounds like the many-small-files problem you are facing, and it occurs to me that, after all, I can think of a way to write your method in terms of an fsspec implementation (or a modification to ReferenceFileSystem), where each block returns bytes, as expected by a filesystem, but the bytes are generated by reading the original data files.
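A bare-bones sketch of that idea (the class name and generator mapping are hypothetical, not an existing fsspec class):

```python
import fsspec

class GeneratedBlockFileSystem(fsspec.AbstractFileSystem):
    """Read-only filesystem whose 'files' are byte blocks produced on
    demand, e.g. by reading and repackaging many small original files."""

    protocol = "genblock"

    def __init__(self, generators, **kwargs):
        # generators: mapping of path -> zero-argument callable -> bytes
        super().__init__(**kwargs)
        self.generators = generators

    def cat_file(self, path, start=None, end=None, **kwargs):
        data = self.generators[path]()  # build the block only when asked
        return data[start:end]
```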
And here is the example from your readme, again with no special code, and without downloading the 44 GB file:
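Roughly like this (the bucket path and HDU layout are assumptions, and the `use_fsspec` keyword needs a recent astropy):

```python
from astropy.io import fits

# Hypothetical ~44 GB FITS cube in object storage. `.section` reads only
# the byte ranges backing the requested slice, so nothing close to the
# full file ever comes over the wire.
url = "s3://example-bucket/huge-cube.fits"
with fits.open(url, use_fsspec=True, fsspec_kwargs={"anon": True}) as hdul:
    cutout = hdul[1].section[0, 1000:1100, 1000:1100]
```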
hey @martindurant, thank you for sending this overview. (Please request access to the Doc. I'll run it past managers to make sure the info can be shared publicly before I open it up.)
hey @martindurant, managers have approved opening up the documents. I've updated the comment above to point to documents meant for the public. |
I just discovered this project and the cloud-fits repo. It looks like a great contribution!
I wanted to invite you to collaborate with the Pangeo community around cloud-native patterns for scientific data analysis. We are mostly coming from the geospatial / weather / climate world. A lot of our effort has gone into developing cloud-native workflows that scale with Dask for distributed processing, and the file format is an important part of this.
We have written about our approach to cloud-based data a bit here: http://pangeo.io/data.html#data-in-the-cloud
We use the Zarr format heavily.
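A sketch of the pattern (the bucket and variable names are hypothetical):

```python
import xarray as xr

# Open a (hypothetical) Zarr store in object storage lazily; each variable
# is backed by Dask, so analysis runs chunk-by-chunk in parallel instead
# of download-then-process.
ds = xr.open_zarr("gs://example-bucket/climate.zarr")
monthly = ds["temperature"].groupby("time.month").mean()
result = monthly.compute()  # triggers the distributed/cloud reads
```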
We had some explorations with astro data early on in our project here: https://github.com/pangeo-data/pangeo-astro-examples
I'm tagging @martindurant, an ex-astronomer who now maintains filesystem-spec, s3fs, gcsfs, and other such tools, which are crucial for getting good performance with data in object storage. Martin has often speculated about how FITS could be adapted to work better with object storage. I imagine he might want to say hello.
Thanks again for your work on open source! If you think a conversation with Pangeo folks would be helpful to your goals, we'd be happy to set something up.