More work on getting Dataflow to run #9
Conversation
An important note! In order to properly deploy (and in particular write to the proper location), the …
Ughhh, I think Dataflow also does not like uppercase letters; this is annoying…
Got past the naming issue, but there was a problem with the requirements: the xarray requirement was not formatted correctly (wondering how that did not fail in the local tests, but no big deal for now).
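For what it's worth, a lint-style pre-flight check like the sketch below could catch malformed specifiers locally. This is only a sketch, assuming the feedstock ships a plain `requirements.txt` (pip-only syntax like `-e ...` would need extra handling):

```python
# Hedged sketch of a local check that might have caught the malformed xarray
# requirement before Dataflow did. Assumes a plain requirements.txt.
from pathlib import Path

from packaging.requirements import InvalidRequirement, Requirement

for line in Path("requirements.txt").read_text().splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue  # skip blanks and comments
    try:
        Requirement(line)  # raises on malformed version specifiers etc.
    except InvalidRequirement as err:
        print(f"Malformed requirement {line!r}: {err}")
```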
Getting another error here that should have been caught in a local test?
I think this was due to depending on an old pgf-recipe version.
Hmmmm, the Dataflow job seems stalled… I'll try to switch from Dataflow Prime to specific high-RAM workers.
The run succeeded 🎉 But the target path:

```python
path = "gs://leap-scratch/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
import xarray as xr
xr.open_dataset(path, engine='zarr', chunks={})
```

Looking at the temp storage location, we can take a look at the output:

```python
path = "gs://leap-scratch/data-library/feedstocks/output/eNATL_feedstock/enatl60-blbt02-9908751732-1/eNATL60_BLBT02.zarr"
import xarray as xr
xr.open_dataset(path, engine='zarr', chunks={})
```

We should never get rid of the Copy Stage!!! I have added an ad-hoc comment to the recipe in leap-stc/LEAP_template_feedstock#53, but it would be very helpful to get feedback from @SammyAgrawal on where this concept would be best explained for new feedstock creators.
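For context, here is a rough illustration of what the Copy Stage accomplishes. This is not the recipe's actual transform (the real pipeline uses a dedicated stage); the paths are taken from the runs above:

```python
# Illustration only: the Dataflow job writes the store to a per-run temp
# prefix, and the Copy Stage then moves it to the canonical location users open.
import gcsfs

fs = gcsfs.GCSFileSystem()
temp = (
    "leap-scratch/data-library/feedstocks/output/eNATL_feedstock/"
    "enatl60-blbt02-9908751732-1/eNATL60_BLBT02.zarr"
)
final = "leap-scratch/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
fs.copy(temp, final, recursive=True)  # recursive copy of the whole zarr store
```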
Ah, the full run just failed with:
WTF, why is this failing all the time now… EDIT: Apparently because Zenodo does not allow parallel downloads… lame.
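If Zenodo really rejects concurrent requests, one workaround is to serialize the fetches. A minimal sketch (the URLs are placeholders, not the actual recipe inputs):

```python
# Not the recipe code: fetch Zenodo-hosted inputs one at a time instead of in
# parallel. URLs below are hypothetical.
import shutil
import urllib.request

urls = [
    "https://zenodo.org/record/0000000/files/file_0.nc",  # hypothetical
    "https://zenodo.org/record/0000000/files/file_1.nc",  # hypothetical
]

for url in urls:
    filename = url.rsplit("/", 1)[-1]
    # strictly sequential: no thread pool or parallel map over the URLs
    with urllib.request.urlopen(url) as resp, open(filename, "wb") as out:
        shutil.copyfileobj(resp, out)
```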
It worked, but something seems off…

```python
ds = xr.open_dataset(
    "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr",
    engine='zarr',
    chunks={},
)
```

The time is just [0, 1, 0, 1, …]; I wonder whether the time encoding is lost here. Might be related to #5? @SammyAgrawal do you still have a 'raw' file xarray representation around so we can compare? Also, this is just 31 time steps, which seems low?
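One way to narrow this down (a sketch, to be compared against whatever the raw file shows) is to check whether the time coordinate still carries its CF encoding:

```python
# Diagnostic sketch: inspect the stored time coordinate to see whether the CF
# encoding (units/calendar) survived the write.
import xarray as xr

store = "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
ds = xr.open_dataset(store, engine="zarr", chunks={})

print(ds.time.values[:10])  # shows the suspicious [0, 1, 0, 1, ...] pattern
print(ds.time.encoding)     # expect 'units'/'calendar' here if encoding was kept
print(ds.time.attrs)        # raw attrs, in case the values were never decoded
```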
@jbusecke how do you feel about merging this branch into main? Contingent on how you feel about the eNATL output.
A few comments that hopefully are not too much work?
```diff
@@ -42,6 +42,7 @@ jobs:
           # AT that point, screw it, not worth it.
           run: |
             jobname="${{ env.JOB_NAME }}"
+            echo "$JOB_NAME"
```
Is this useful to bring over to the template feedstock?
configs/config_dataflow.py (Outdated)
```python
c.DataflowBakery.use_dataflow_prime = False
c.DataflowBakery.machine_type = "e2-highmem-16"
c.DataflowBakery.disk_size_gb = 400
c.DataflowBakery.use_shuffle = False
```
What does this do? I am actually just curious. Again it might be good to document this as a 'case' in the template feedstock.
I should cut it, since I had to create a fork of pangeo-forge-runner to add it. It disables Dataflow Shuffle, which I thought had some disk-space limitations, but I think I was wrong, so we can use shuffle.
name: "The even cooler large Proto Dataset" # no pyramids | ||
url: "gs://leap-scratch/data-library/feedstocks/proto_feedstock/large.zarr" | ||
- id: "enatl60-blbt02" | ||
name: "Needs a name" |
@auraoupa Can you help here? This name would show up in the LEAP catalog; see the marked portion here as an example.
ds = ds.set_coords(["deptht", "depthw", "nav_lon", "nav_lat", "tmask"]) | ||
|
||
ds = ds.rename({"time_counter": "time"}) | ||
ds = ds.set_coords(("nav_lat", "nav_lon")) |
Where did `t_mask` go? See #8 (comment)
Good question, I'll rerun a subset to see what was up. We might have to regen.
ds = ds.set_coords(("nav_lat", "nav_lon")) | ||
ds.attrs["deptht"] = ds.deptht.values[0] | ||
ds = ds.drop("deptht") | ||
ds = ds[["vosaline", "votemper", "vovecrtz"]] |
Ah, probably dropped here! Subselecting `["vosaline", "votemper", "vovecrtz"]` discards any data variable that was not promoted to a coordinate first.
ds = ds.set_coords(["deptht", "depthw", "nav_lon", "nav_lat", "tmask"]) | ||
|
||
ds = ds.rename({"time_counter": "time"}) | ||
ds = ds.set_coords(("nav_lat", "nav_lon")) |
ds = ds.set_coords(("nav_lat", "nav_lon")) | |
ds = ds.set_coords(("nav_lat", "nav_lon", "t_mask")) |
?
Ah, I think I remember. I'm pretty sure some of the input netCDF files are missing "t_mask".
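If the inputs really are inconsistent, one defensive option (just a sketch; the helper name is made up) is to promote only the coordinates that are actually present in each file:

```python
# Hedged sketch: only promote variables to coordinates if the input file
# actually contains them, since some source netCDFs apparently lack "t_mask".
import xarray as xr

def set_available_coords(ds: xr.Dataset) -> xr.Dataset:
    candidates = ("nav_lat", "nav_lon", "t_mask")
    present = [name for name in candidates if name in ds.variables]
    return ds.set_coords(present)
```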
I kind of think we should do our 'prod' build from main, then iterate from that if we need updates? Also, I wonder if we should figure out how to incorporate a git commit into the dataset metadata?
Already part of the injected attrs by default 😁
Ah incredible, I forgot about this haha.
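For anyone checking: the injected attrs should be inspectable directly on the written store. A quick sketch (the exact attribute names depend on the injection step and are not spelled out here):

```python
# Sanity-check sketch: the injected provenance attrs, including the git
# commit, should be visible on the dataset attributes.
import xarray as xr

store = "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
ds = xr.open_dataset(store, engine="zarr", chunks={})
for key, value in ds.attrs.items():
    print(f"{key}: {value}")
```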
Dataflow does not like underscores in job names. Our machinery here, using pangeo-forge-runner, uses the `recipe_id` to make job names. Until we have a more general fix upstream, we will have to fix this within each feedstock. This is somewhat frustrating (I have stumbled upon this many times); I wonder if there is an easy way to check/validate the values of `recipe_id` automatically with the linting (cc @andersy005?).
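Such a check could look something like the sketch below. The function name is made up, and the pattern follows Dataflow's documented job-name rule (lowercase letters, digits, and hyphens, starting with a letter):

```python
# Hedged sketch of a lint-time recipe_id check; not an existing
# pangeo-forge-runner hook.
import re

_DATAFLOW_JOB_NAME = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")

def validate_recipe_id(recipe_id: str) -> None:
    if not _DATAFLOW_JOB_NAME.fullmatch(recipe_id):
        raise ValueError(
            f"recipe_id {recipe_id!r} would make an invalid Dataflow job name: "
            "use only lowercase letters, digits, and hyphens."
        )

validate_recipe_id("enatl60-blbt02")    # passes
# validate_recipe_id("eNATL60_BLBT02")  # would raise ValueError
```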