Develop benchmarking criteria for consistent comparison across format options #2

asteiker commented Aug 3, 2023

Candidate criteria:

  • Formats / chunking schemes to compare
    • Re-chunked HDF5
    • Cloud-optimized HDF5
    • GeoParquet
    • Zarr
    • Kerchunk JSON
    • h5coro
  • Environment
    • CryoCloud (small instance)
    • Assume we'll store all example files in CryoCloud (i.e., in Sync or shared_public)
  • Libraries or clients used to open/read data
  • For each format option:
    • Dataset(s)
      • Based on community feedback/discussion, initial focus on ATL03
    • Files
      • Single and multiple? File sizes can vary by several GB; optimally, produce and test 10 files
    • Variable(s)
    • Spatial subset(s)
    • Temporal subset(s)
    • Aggregation
    • End-to-end wall clock time
      • Time to re-chunk or reformat
      • Time to open/read file (see the timing sketch after this list)
        • Multiple tools/libraries/clients to compare per format option?
          • GeoPandas, xarray
          • Should we consider Dask DataFrames?
    • Compute cost
    • Do we include a real-world example?
      • Time series of a 60-day repeat cycle
      • Real-world example tie-in: Jakobshavn surface height
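
As a starting point for the open/read wall-clock measurements above, here's a minimal sketch of what a per-format timing comparison could look like with xarray. The file paths, group, and variable names are placeholders rather than the actual staged ATL03 files, and the other format options (GeoParquet via GeoPandas, Kerchunk references, h5coro) would need their own reader cases added:

```python
import time

import xarray as xr

# Placeholder paths, group, and variable names -- the real benchmark would point
# at the ATL03 test files staged in CryoCloud (Sync or shared_public).
CASES = {
    "rechunked-hdf5": {"path": "ATL03_rechunked.h5", "engine": "h5netcdf"},
    "zarr-store": {"path": "ATL03.zarr", "engine": "zarr"},
}


def time_open_read(path, engine, group="gt1l/heights", variable="h_ph"):
    """Wall-clock seconds to open a dataset and load a single variable."""
    start = time.perf_counter()
    ds = xr.open_dataset(path, engine=engine, group=group)
    ds[variable].load()  # force the actual read; open_dataset alone is lazy
    ds.close()
    return time.perf_counter() - start


for name, case in CASES.items():
    print(f"{name}: {time_open_read(**case):.2f} s")
```

Each case would be run against the same spatial/temporal subset, repeated a few times, and reported alongside the re-chunk/reformat time and compute cost.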