Develop benchmarking criteria for consistent comparison across format options #2

asteiker commented Aug 3, 2023

Candidate criteria:

  • Formats / chunking schemes to compare
    • Re-chunked HDF5
    • Cloud-optimized HDF5
    • GeoParquet
    • Zarr
    • Kerchunk JSON
    • h5coro
  • Environment
    • CryoCloud (small instance)
    • Assume we'll store all example files in CryoCloud (i.e., in Sync or shared_public)
  • Libraries or clients used to open/read data
  • For each format option:
    • Dataset(s)
      • Based on community feedback/discussion, initial focus on ATL03
    • Files
      • Single and multiple? File sizes can vary by several GB; optimally, produce and test 10 files
    • Variable(s)
    • Spatial subset(s)
    • Temporal subset(s)
    • Aggregation
    • End-to-end wall clock time
      • Time to re-chunk or reformat
      • Time to open/read file (see the timing sketch after this list)
        • Multiple tools/libraries/clients to compare per format option?
          • GeoPandas, xarray
          • Should we consider Dask DataFrames?
    • Compute cost
    • Do we include a real-world example?
      • Time series of a 60-day repeat cycle
      • Real-world example tie-in: Jakobshavn surface height
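
As a starting point for the open/read wall-clock measurements above, here's a minimal sketch of what a per-format timing comparison could look like with xarray. The file paths, group, and variable names are placeholders rather than the actual staged ATL03 files, and the other format options (GeoParquet via GeoPandas, Kerchunk references, h5coro) would need their own reader cases added:

```python
import time

import xarray as xr

# Placeholder paths, group, and variable names -- the real benchmark would point
# at the ATL03 test files staged in CryoCloud (Sync or shared_public).
CASES = {
    "rechunked-hdf5": {"path": "ATL03_rechunked.h5", "engine": "h5netcdf"},
    "zarr-store": {"path": "ATL03.zarr", "engine": "zarr"},
}


def time_open_read(path, engine, group="gt1l/heights", variable="h_ph"):
    """Wall-clock seconds to open a dataset and load a single variable."""
    start = time.perf_counter()
    ds = xr.open_dataset(path, engine=engine, group=group)
    ds[variable].load()  # force the actual read; open_dataset alone is lazy
    ds.close()
    return time.perf_counter() - start


for name, case in CASES.items():
    print(f"{name}: {time_open_read(**case):.2f} s")
```

Each case would be run against the same spatial/temporal subset, repeated a few times, and reported alongside the re-chunk/reformat time and compute cost.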