More work on getting Dataflow to run #9
Conversation
An important note! In order to properly deploy (and in particular write to the proper location), the …
Ughhh, I think Dataflow also does not like uppercase letters; this is annoying…
Got past the naming issue, but there was a problem with the requirements: the xarray requirement was not formatted correctly (wondering how that did not fail in the local tests, but no big deal for now).
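For what it's worth, a lint-style pre-flight check like the sketch below could catch malformed specifiers locally. This is only a sketch, assuming the feedstock ships a plain `requirements.txt` (pip-only syntax like `-e ...` would need extra handling):

```python
# Hedged sketch of a local check that might have caught the malformed xarray
# requirement before Dataflow did. Assumes a plain requirements.txt.
from pathlib import Path

from packaging.requirements import InvalidRequirement, Requirement

for line in Path("requirements.txt").read_text().splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue  # skip blanks and comments
    try:
        Requirement(line)  # raises on malformed version specifiers etc.
    except InvalidRequirement as err:
        print(f"Malformed requirement {line!r}: {err}")
```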
Getting another error here that should have been caught in a local test?
I think this was due to depending on an old pgf-recipe version.
Hmmmm, the Dataflow job seems stalled… I'll try to switch from Dataflow Prime to specific high-RAM workers.
The run succeeded 🎉 But the target path:

```python
path = "gs://leap-scratch/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
import xarray as xr
xr.open_dataset(path, engine='zarr', chunks={})
```

Looking at the temp storage location, we can take a look at the output:

```python
path = "gs://leap-scratch/data-library/feedstocks/output/eNATL_feedstock/enatl60-blbt02-9908751732-1/eNATL60_BLBT02.zarr"
import xarray as xr
xr.open_dataset(path, engine='zarr', chunks={})
```

We should never get rid of the Copy Stage!!! I have added an ad-hoc comment to the recipe in leap-stc/LEAP_template_feedstock#53, but it would be very helpful to get feedback from @SammyAgrawal on where this concept would be best explained for new feedstock creators.
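For context, here is a rough illustration of what the Copy Stage accomplishes. This is not the recipe's actual transform (the real pipeline uses a dedicated stage); the paths are taken from the runs above:

```python
# Illustration only: the Dataflow job writes the store to a per-run temp
# prefix, and the Copy Stage then moves it to the canonical location users open.
import gcsfs

fs = gcsfs.GCSFileSystem()
temp = (
    "leap-scratch/data-library/feedstocks/output/eNATL_feedstock/"
    "enatl60-blbt02-9908751732-1/eNATL60_BLBT02.zarr"
)
final = "leap-scratch/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
fs.copy(temp, final, recursive=True)  # recursive copy of the whole zarr store
```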
Ah, the full run just failed with:
WTF, why is this failing all the time now… EDIT: Apparently because Zenodo does not allow parallel downloads… lame.
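If Zenodo really rejects concurrent requests, one workaround is to serialize the fetches. A minimal sketch (the URLs are placeholders, not the actual recipe inputs):

```python
# Not the recipe code: fetch Zenodo-hosted inputs one at a time instead of in
# parallel. URLs below are hypothetical.
import shutil
import urllib.request

urls = [
    "https://zenodo.org/record/0000000/files/file_0.nc",  # hypothetical
    "https://zenodo.org/record/0000000/files/file_1.nc",  # hypothetical
]

for url in urls:
    filename = url.rsplit("/", 1)[-1]
    # strictly sequential: no thread pool or parallel map over the URLs
    with urllib.request.urlopen(url) as resp, open(filename, "wb") as out:
        shutil.copyfileobj(resp, out)
```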
It worked, but something seems off…

```python
ds = xr.open_dataset(
    "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr",
    engine='zarr',
    chunks={},
)
```

The time is just [0, 1, 0, 1, …]; I wonder whether the time encoding is lost here. Might be related to #5? @SammyAgrawal do you still have a 'raw' file xarray representation around so we can compare? Also, this is just 31 time steps, which seems low?
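One way to narrow this down (a sketch, to be compared against whatever the raw file shows) is to check whether the time coordinate still carries its CF encoding:

```python
# Diagnostic sketch: inspect the stored time coordinate to see whether the CF
# encoding (units/calendar) survived the write.
import xarray as xr

store = "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
ds = xr.open_dataset(store, engine="zarr", chunks={})

print(ds.time.values[:10])  # shows the suspicious [0, 1, 0, 1, ...] pattern
print(ds.time.encoding)     # expect 'units'/'calendar' here if encoding was kept
print(ds.time.attrs)        # raw attrs, in case the values were never decoded
```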
@jbusecke how do you feel about merging this branch into main? Contingent on how you feel about the eNATL output.
A few comments that hopefully are not too much work?
```diff
@@ -42,6 +42,7 @@ jobs:
           # AT that point, screw it, not worth it.
           run: |
             jobname="${{ env.JOB_NAME }}"
+            echo "$JOB_NAME"
```
Is this useful to bring over to the template feedstock?
configs/config_dataflow.py (Outdated)
```python
c.DataflowBakery.use_dataflow_prime = False
c.DataflowBakery.machine_type = "e2-highmem-16"
c.DataflowBakery.disk_size_gb = 400
c.DataflowBakery.use_shuffle = False
```
What does this do? I am actually just curious. Again it might be good to document this as a 'case' in the template feedstock.
I should cut it, since I had to create a fork of pangeo-forge-runner to add it. It disables Dataflow Shuffle, which I thought had some disk-space limitations, but I think I was wrong, so we can use shuffle.
name: "The even cooler large Proto Dataset" # no pyramids | ||
url: "gs://leap-scratch/data-library/feedstocks/proto_feedstock/large.zarr" | ||
- id: "enatl60-blbt02" | ||
name: "Needs a name" |
@auraoupa Can you help here? This name would show up in the LEAP catalog; see the marked portion here as an example.
ds = ds.set_coords(["deptht", "depthw", "nav_lon", "nav_lat", "tmask"]) | ||
|
||
ds = ds.rename({"time_counter": "time"}) | ||
ds = ds.set_coords(("nav_lat", "nav_lon")) |
Where did `t_mask` go? See #8 (comment)
Good question, I'll rerun a subset to see what was up. We might have to regen.
ds = ds.set_coords(("nav_lat", "nav_lon")) | ||
ds.attrs["deptht"] = ds.deptht.values[0] | ||
ds = ds.drop("deptht") | ||
ds = ds[["vosaline", "votemper", "vovecrtz"]] |
Ah, probably dropped here! Subselecting `["vosaline", "votemper", "vovecrtz"]` discards any data variable that was not promoted to a coordinate first.
ds = ds.set_coords(["deptht", "depthw", "nav_lon", "nav_lat", "tmask"]) | ||
|
||
ds = ds.rename({"time_counter": "time"}) | ||
ds = ds.set_coords(("nav_lat", "nav_lon")) |
ds = ds.set_coords(("nav_lat", "nav_lon")) | |
ds = ds.set_coords(("nav_lat", "nav_lon", "t_mask")) |
?
Ah, I think I remember. I'm pretty sure some of the input netCDF files are missing "t_mask".
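If the inputs really are inconsistent, one defensive option (just a sketch; the helper name is made up) is to promote only the coordinates that are actually present in each file:

```python
# Hedged sketch: only promote variables to coordinates if the input file
# actually contains them, since some source netCDFs apparently lack "t_mask".
import xarray as xr

def set_available_coords(ds: xr.Dataset) -> xr.Dataset:
    candidates = ("nav_lat", "nav_lon", "t_mask")
    present = [name for name in candidates if name in ds.variables]
    return ds.set_coords(present)
```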
I kind of think we should do our 'prod' build from main, then iterate from that if we need updates? Also, I wonder if we should figure out how to incorporate a git commit into the dataset metadata?
Already part of the injected attrs by default 😁
Ah incredible, I forgot about this haha.
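For anyone checking: the injected attrs should be inspectable directly on the written store. A quick sketch (the exact attribute names depend on the injection step and are not spelled out here):

```python
# Sanity-check sketch: the injected provenance attrs, including the git
# commit, should be visible on the dataset attributes.
import xarray as xr

store = "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
ds = xr.open_dataset(store, engine="zarr", chunks={})
for key, value in ds.attrs.items():
    print(f"{key}: {value}")
```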
Dataflow does not like underscores in job names. Our machinery here, using pangeo-forge-runner, uses the `recipe_id` to make job names. Until we have a more general fix upstream, we will have to fix this within each feedstock. This is somewhat frustrating (I have stumbled upon this many times); I wonder if there is an easy way to check/validate the values of `recipe_id` automatically with the linting (cc @andersy005?).
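Such a check could look something like the sketch below. The function name is made up, and the pattern follows Dataflow's documented job-name rule (lowercase letters, digits, and hyphens, starting with a letter):

```python
# Hedged sketch of a lint-time recipe_id check; not an existing
# pangeo-forge-runner hook.
import re

_DATAFLOW_JOB_NAME = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")

def validate_recipe_id(recipe_id: str) -> None:
    if not _DATAFLOW_JOB_NAME.fullmatch(recipe_id):
        raise ValueError(
            f"recipe_id {recipe_id!r} would make an invalid Dataflow job name: "
            "use only lowercase letters, digits, and hyphens."
        )

validate_recipe_id("enatl60-blbt02")    # passes
# validate_recipe_id("eNATL60_BLBT02")  # would raise ValueError
```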