[use case demonstration] Kvikio Direct-to-gpu -> xarray -> xbatcher -> ml model #87

Open · jhamman opened this issue Aug 25, 2022 · 19 comments

@jhamman (Contributor) commented Aug 25, 2022

What is your issue?

Recent developments by @NVIDIA and @dcherian are opening the door to direct-to-GPU data loading in Xarray. Combined with Xbatcher and the TensorFlow or PyTorch data loaders, this could mean that a complete workflow, from Zarr all the way to ML model training, runs without ever handling data on the CPU.

Here's a short illustration of the potential workflow:

import xarray as xr
import xbatcher

# `store` is a Zarr store with the training data; the kvikio engine
# reads chunks directly into GPU memory
ds = xr.open_dataset(store, engine="kvikio", consolidated=False)

# window the input (xvars) and target (yvars) variables into
# batches of 10 time steps each
x_gen = xbatcher.BatchGenerator(ds[xvars], {"time": 10})
y_gen = xbatcher.BatchGenerator(ds[yvars], {"time": 10})

# wrap the two generators in a tf.data-compatible dataset
tf_dataset = xbatcher.loaders.keras.CustomTFDataset(x_gen, y_gen)

model.fit(tf_dataset, ...)
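
Conceptually, each `BatchGenerator` above just slices the dataset into fixed-size windows along the named dimension. A rough pure-Python sketch of that windowing logic (a hypothetical helper for illustration, not xbatcher's actual implementation):

```python
def iter_batch_bounds(dim_size, batch_size=10):
    """Yield (start, stop) slice bounds that window a dimension of
    length `dim_size` into consecutive batches of `batch_size` steps.
    The final batch may be shorter if the dimension doesn't divide evenly."""
    for start in range(0, dim_size, batch_size):
        yield start, min(start + batch_size, dim_size)

# e.g. a 25-step time dimension in batches of 10
bounds = list(iter_batch_bounds(25, batch_size=10))
```

Each (start, stop) pair then corresponds to one ds.isel(time=slice(start, stop)) batch handed to the model.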

This would be awesome to demonstrate in a single example. Perhaps as a second tutorial on Xbatcher's documentation site.

xref: xarray-contrib/cupy-xarray#10

cc @dcherian, @negin513, and @weiji14

@dcherian

I like how you tagged NVIDIA hahaha.

The RAPIDS folks (@jakirkham, @madsbk, @jacobtomlinson) were really interested in a blog post about this stuff.

@weiji14 (Member) commented Aug 25, 2022

👍 for a blog post. I'd be happy to contribute to a draft as @dcherian suggested at a recent Pangeo meeting, for https://medium.com/pangeo (or https://medium.com/rapids-ai), but we probably need to wait for pydata/xarray#6874 and zarr-developers/zarr-python#934 to get merged, and for new xarray and Zarr releases, first.

One issue with having this kvikio tutorial on xbatcher's documentation though is that we don't have GPUs in GitHub Actions CI or Readthedocs, so it can't be built dynamically 🙂 We'll either need to cache the outputs, or find another way or place to host the tutorial.

@jhamman (Contributor, Author) commented Aug 25, 2022

I love the idea of a blog post here. Perhaps we publish the post in a few places at once (xarray's blog would also work).

> One issue with having this kvikio tutorial on xbatcher's documentation though is that we don't have GPUs in GitHub Actions CI or Readthedocs, so it can't be built dynamically 🙂 We'll either need to cache the outputs, or find another way or place to host the tutorial.

I think it's probably worth publishing a "cached" notebook here even though most folks won't be able to run it. A strong disclaimer at the top stating its purpose will probably be sufficient to avoid confusion in the future.

@dcherian commented Aug 25, 2022

OK, thanks for the prompt. I added a super brief intro blog post here: xarray-contrib/xarray.dev#308 to get the word out. The blog post proposed here could then just link to that one for extra details.

@weiji14 (Member) commented Sep 2, 2022

> One issue with having this kvikio tutorial on xbatcher's documentation though is that we don't have GPUs in GitHub Actions CI or Readthedocs, so it can't be built dynamically 🙂 We'll either need to cache the outputs, or find another way or place to host the tutorial.

> I think it's probably worth publishing a "cached" notebook here even though it won't be run by most folks. A strong disclaimer at the top stating the purpose will probably be sufficient to avoid confusion in the future.

At https://discourse.pangeo.io/t/statement-of-need-integrating-jupyterbook-and-jupyterhubs-via-ci/2705, there are some ideas on how to run 'expensive' (read: GPU-required) notebooks via the Pangeo Binder JupyterHub. It'll be more work than the caching solution, but it probably allows for easier long-term reproducibility for the wider community, especially if the GPU Direct Storage/kvikIO technology gets updated in the future and we need to re-run things for newer versions. Thoughts?

@maxrjones (Member)

> One issue with having this kvikio tutorial on xbatcher's documentation though is that we don't have GPUs in GitHub Actions CI or Readthedocs, so it can't be built dynamically 🙂 We'll either need to cache the outputs, or find another way or place to host the tutorial.

> I think it's probably worth publishing a "cached" notebook here even though it won't be run by most folks. A strong disclaimer at the top stating the purpose will probably be sufficient to avoid confusion in the future.

> At https://discourse.pangeo.io/t/statement-of-need-integrating-jupyterbook-and-jupyterhubs-via-ci/2705, there's some ideas on how to run 'expensive' (read: GPU required) notebooks via the Pangeo Binder Jupyter Hub. It'll be more work than the caching solution, but probably allows for easier reproducibility long-term for the wider community, especially if the GPU direct storage/kvikIO technology gets updated in the future and we need to re-run things for newer versions. Thoughts?

I think the eventual goal should be to build the examples that are 'expensive' and cross-cutting in terms of software (e.g., kvikIO direct-to-GPU -> xarray -> xbatcher -> ML model) as part of the Project Pythia cookbooks, and to link to those cookbooks from the individual package docs (e.g., xbatcher). But, as discussed on that thread, some infrastructure developments are required before Project Pythia can support those examples. The notebook discussed here could be a great test case for the integration between JupyterHubs and JupyterBook, and could be "cached" in the xbatcher docs while that development happens.

@weiji14 (Member) commented Sep 5, 2022

Just on the infrastructure point, I noticed that GPU-enabled GitHub Actions is on the roadmap (github/roadmap#505), but I'm unsure whether it will be limited to Teams/Enterprise plans only, as with https://github.blog/changelog/2022-09-01-github-actions-larger-runners-are-now-in-public-beta. In theory, this would allow us to store an uncached version of the notebook and run it from time to time (though it will probably cost some $$).

Still, I think the Project Pythia cookbook method is worth pursuing, as the close integration with Pangeo Binder would allow users to actually run the example kvikIO notebook on the cloud. In practical terms, we could:

  1. Wait for the PRs mentioned in "Add Kvikio backend entrypoint" (cupy-xarray#10) to be merged, and for releases of xarray/cupy-xarray/zarr
  2. Have a 'cached' kvikIO notebook
  3. Have an un-cached kvikIO notebook using either:
     a. GitHub Actions GPU runners (if they become available)
     b. Project Pythia infrastructure

@joshmoore commented Sep 8, 2022

> @weiji14 commented 14 days ago:
> but probably need to wait for ... zarr-developers/zarr-python#934 to get merged and new xarray and Zarr releases first.

Now available in zarr-python 2.13.0a2 for testing.

@dcherian commented Sep 9, 2022

Is there a cloud provider that has the necessary GDS stuff set up?

@weiji14 (Member) commented Sep 10, 2022

> Is there a cloud provider that has the necessary GDS stuff set up?

I tried running on Microsoft Planetary Computer (gpu-pytorch container); GPU Direct Storage doesn't work yet, but compatibility mode does. Below are the results from running python single-node-io.py (script from https://github.com/rapidsai/kvikio/blob/29c52f76035002d91f301895250c0ff14f18f50a/python/benchmarks/single-node-io.py):

----------------------------------
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   WARNING - KvikIO compat mode   
      libcufile.so not used       
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GPU               | Unknown (install pynvml)
GPU Memory Total  | Unknown (install pynvml)
BAR1 Memory Total | Unknown (install pynvml)
GDS driver        | N/A (Compatibility Mode)
GDS config.json   | /etc/cufile.json
----------------------------------
nbytes            | 10485760 bytes (10.00 MiB)
4K aligned        | True
pre-reg-buf       | True
diretory          | /tmp/tmp9a8nd5kz
nthreads          | 1
nruns             | 1
==================================
cufile read       |   4.28 GiB/s
cufile write      |  92.59 MiB/s
posix read        |   1.23 GiB/s
posix write       |   1.24 GiB/s

We could try to get a PR in to install the necessary GPU Direct Storage and kvikIO packages; the maintainers are usually pretty responsive. Edit: opened an issue at microsoft/planetary-computer-containers#51.
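
As an aside, whether KvikIO fell back to compatibility mode can also be checked from Python rather than by scanning the benchmark banner. A minimal sketch using kvikio.defaults.compat_mode() (available in recent kvikIO releases; treat the exact API as an assumption), degrading gracefully when kvikio isn't installed:

```python
def gds_enabled():
    """Return True only if kvikio is installed AND reports that real
    cuFile/GDS I/O is in use (i.e. it is NOT in compatibility mode)."""
    try:
        import kvikio.defaults
    except ImportError:
        return False  # kvikio not installed at all
    # compat_mode() is True when libcufile.so is not being used
    return not kvikio.defaults.compat_mode()

print("GPU Direct Storage active:", gds_enabled())
```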

@weiji14 (Member) commented Sep 10, 2022

Oh, and if we do get GPU Direct Storage set up on Microsoft Planetary Computer (on Azure West Europe), I have an idea to get a demo working with the https://github.com/carbonplan/cmip6-downscaling dataset (since it's also on Azure West Europe?). This may or may not require the multi-resolution issue at #93 to be resolved, but it looks like a good Zarr machine learning dataset to play with.

As a start, I did try this quickly:

xr.open_dataset(
    "https://cpdataeuwest.blob.core.windows.net/cp-cmip/version1/data/DeepSD/ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.DeepSD.pr.zarr",
    engine="kvikio",
    consolidated=False,
)

but got a strange GroupNotFoundError: group not found at path '' (using xr.open_zarr worked fine, though). So realistically, there are still a few things to iron out in cupy-xarray and xarray, maybe a month or two's worth of work?
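
One thing worth checking when that error appears (a guess, not a confirmed diagnosis): whether the store actually exposes the per-group metadata that consolidated=False readers need. In Zarr v2, an unconsolidated reader looks for a top-level .zgroup key, while consolidated readers only need the single .zmetadata document. A minimal sketch over a plain listing of store keys (hypothetical helper, not xarray's actual logic):

```python
def classify_zarr_store(keys):
    """Classify a Zarr v2 store by the root metadata keys it exposes.

    Readers opened with consolidated=False look for '.zgroup' at the root;
    if only '.zmetadata' is present, an unconsolidated open can fail with
    a 'group not found' style error."""
    if ".zgroup" in keys:
        return "unconsolidated-readable"
    if ".zmetadata" in keys:
        return "consolidated-only"
    return "no-root-group-metadata"

# e.g. a store that only published consolidated metadata
status = classify_zarr_store({".zmetadata", "pr/.zarray"})
```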

@weiji14 (Member) commented Aug 1, 2023

Ok, looks like I've severely underestimated how long this is going to take 😅 I'm hoping to get some time to work on this in October 2023 🤞, but for now here's a TODO list of things that need to happen:

  • Documentation. Right now everything is in a blog post. There's been some related work at https://github.com/negin513/cupy-xarray-tutorials (not direct-to-GPU, but CPU->GPU), which we could build on top of.
  • Cloud infrastructure. Maybe start with one cloud provider (AWS?), and ensure that the disk partitions, network connections and all that are set up properly to ensure low I/O latency.

Longer term, we'll also look into:

@dcherian commented Aug 1, 2023

> Maybe start with one cloud provider (AWS?), and ensure that the disk partition, network connections and all that are setup properly to ensure low I/O latency.

It may be a lot easier to experiment on NCAR systems once they can do it. @negin513 seems very interested in this kind of thing :)

@maxrjones (Member)

Thanks for creating the to-do list @weiji14! As we discussed earlier today, I'll also have some time in October to contribute, and am particularly interested in the kerchunk connections.

@jakirkham

Starting with the name-brand CSPs is a reasonable first step.

While lesser known, CoreWeave has been putting good effort into configuring hardware optimally.

Though if you have your own system that you are planning to use long term, setting up there sounds good.

@weiji14 (Member) commented Aug 2, 2023

Cool, the idea is to enable more people to run kvikIO/NVIDIA GPUDirect Storage, either on a local GPU or in the cloud if they don't have one. That's why I'd like to start with the documentation, and we could experiment on NCAR systems first to understand how involved the configuration would be. Once we've figured out the config settings, we can then expand to other HPC or commercial cloud systems. That CoreWeave offering does look nice, though I can't tell from their webpage whether they support NVIDIA GDS (I'd like to hope that they do)!

@weiji14 (Member) commented Oct 13, 2023

I've managed to run some benchmark experiments on a WeatherBench2/ERA5 subset comparing the kvikIO (GPU-based) and zarr (CPU-based) engines at zarr-developers/zarr-benchmark#14, where I also describe the technical details in more depth. And yes, the benchmark code uses xbatcher too 😉

[figure: compare_kvikio_zarr benchmark plot]

Initial results are that kvikIO takes ~25% less time to load data than zarr (though I'm not confident in that number yet, because it changes drastically between subsequent runs due to some strange factors like caching). I'll be giving a talk next week at FOSS4G SotM Oceania 2023 to get people excited about this, and hope that things can move forward a bit more 😄
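
One way to tame the run-to-run variance mentioned above is to time several repetitions and report only the post-warmup best, so cold-cache outliers don't dominate. A minimal sketch of such a harness (hypothetical helper, not the actual benchmark code):

```python
import time

def best_of(load_fn, n_runs=5, warmup=1):
    """Time a data-loading callable over several runs and return the best
    wall-clock duration in seconds. The first `warmup` runs are discarded
    so one-off costs (imports, page-cache fills) don't skew the result."""
    durations = []
    for i in range(warmup + n_runs):
        start = time.perf_counter()
        load_fn()
        elapsed = time.perf_counter() - start
        if i >= warmup:
            durations.append(elapsed)
    return min(durations)
```

For example, best_of(lambda: ds.isel(time=slice(0, 10)).load()) could be run once per engine to compare the kvikio and zarr backends on equal footing.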

@KiranModukuri

@weiji14, can you please describe where these tests were run: on a local machine or in a cloud environment?

@weiji14 (Member) commented Oct 25, 2023

Hi @KiranModukuri, yes, these tests were run locally (using an NVIDIA RTX A2000 8GB GPU). I did try to set up a GCP container to run the benchmarks (WeatherBench2's ERA5 is at https://console.cloud.google.com/storage/browser/weatherbench2/datasets/era5), but I was running into quota issues allocating GPUs in us-central1, where the dataset is stored.
