Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complex domain detection for collections of data items with CLI support #566

Conversation

robertbartel
Copy link
Contributor

Adding basic capabilities for detecting domains for collections of data items, be they a directory of files or a dataset. Also including CLI support, both as dedicated detection command and within dataset operation commands involving uploading data.

Note that this branch builds on the branch for #565. This PR should remain in draft status until #565 is merged.

@robertbartel robertbartel added enhancement New feature or request maas MaaS Workstream labels Apr 2, 2024
@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/2/collection_detectors branch from 9ae5ed8 to daa7049 Compare April 2, 2024 18:44
python/lib/client/dmod/client/__main__.py Outdated Show resolved Hide resolved
python/lib/client/dmod/client/__main__.py Outdated Show resolved Hide resolved
python/lib/client/dmod/client/__main__.py Outdated Show resolved Hide resolved
python/lib/client/dmod/client/dmod_client.py Show resolved Hide resolved
python/lib/client/dmod/client/dmod_client.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/dataset.py Outdated Show resolved Hide resolved
python/lib/modeldata/dmod/test/__init__.py Outdated Show resolved Hide resolved
@@ -622,12 +797,17 @@ def handle_type_map(t):

@root_validator()
def validate_sufficient_restrictions(cls, values):
data_format = values.get("data_format")
if data_format == DataFormat.EMPTY or data_format == DataFormat.GENERIC:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment here that it is permissible for an EMPTY or GENERIC to have restrictions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to this, we should document dataset invariants at some point. So almost like a retrospective design doc. Seems like something we could put together in a DMOD team brown bag that deep dives into the concepts of a DMOD dataset.

@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/2/collection_detectors branch 3 times, most recently from a8d139f to 5abe142 Compare April 4, 2024 15:23
@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/2/collection_detectors branch 4 times, most recently from 5a11135 to d9e0f1a Compare April 10, 2024 15:13
Copy link
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reworking a lot of things, @robertbartel! I think we are getting close and a lot of my comments can be moved to TODOs.

python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
@robertbartel
Copy link
Contributor Author

I need to make another pass or two, but I think most things have been addressed. @christophertubbs and @aaraney, I think many of the open conversation threads can probably be closed. Could you mark any threads you feel still need discussion and/or changes?

@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/2/collection_detectors branch 3 times, most recently from d882a7a to 5563a42 Compare May 7, 2024 14:05
@robertbartel robertbartel marked this pull request as ready for review May 7, 2024 14:05
Copy link
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks great! We are getting really close to / might already be at the point where we can merge. I left a few comments / suggestions that are mostly stylistic but I do think will improve usage of the APIs.

python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
python/lib/core/dmod/core/data_domain_detectors.py Outdated Show resolved Hide resolved
@@ -94,9 +97,8 @@ class DataFormat(PydanticEnum):
"""
AORC_CSV = (0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fairly certain the AORC variable names are the following (in sorted order):

APCP_surface
DLWRF_surface
DSWRF_surface
latitude
longitude
PRES_surface
SPFH_2maboveground
time
TMP_2maboveground
UGRD_10maboveground
VGRD_10maboveground

It looks like the AORC pdf that was public is now private. So it is a little more difficult to verify this. I did find AORC data on AWS's Registry of Open Data (here is a link to the bucket itself). If you go to the bucket that contains the zarr chunks, you can click on one of the year "directories" and the children directories are the variable names. This is how zarr chunks are organized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find the variable names spelled out in the publication of the AORC 1.1 dataset, but I figured it was worth linking it either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be our source of truth for better or worse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I deliberately changed this here because I noticed the data I was seeing was different than the previous fields. And the data I had were CSVs I'd regridded myself using @jduckerOWP utility.

@aaraney, you're talking about the official AORC format. For better or worse, that's not exactly the same data format we see (or at least have seen) in the CSV regriddings, and I am fairly sure this was intentional. In particular, (aside: ug, this old chestnut again), the CSVs include RAINRATE instead of APCP_surface.

There probably should be something formal and official outside of DMOD (and ngen) defining NextGen regridded CSV and NetCDF data formats, though I don't think that there is right now (@hellkite500?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a working link to some documentation (https://www.weather.gov/media/owp/operations/aorc_v1_1_methods.pdf).

Copy link
Member

@aaraney aaraney May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing a little more digging, I am thinking these names are coming from the NWM AORC 1.1 dataset. I pulled a netcdf forcing file from the NWM 3.0 retrospective dataset and dumped and sorted the variable names:

char crs ;
double x(x) ;
double y(y) ;
float RAINRATE(time, y, x) ;
int LQFRAC(time, y, x) ;
int LWDOWN(time, y, x) ;
int PSFC(time, y, x) ;
int Q2D(time, y, x) ;
int reference_time(reference_time) ;
int SWDOWN(time, y, x) ;
int T2D(time, y, x) ;
int time(time) ;
int U2D(time, y, x) ;
int V2D(time, y, x) ;

These line up perfectly with the WRF-Hydro forcing inputs (table 5.4):

SWDOWN Incoming shortwave radiation W/m2
LWDOWN Incoming longwave radiation W/m2
Q2D Specific humidity kg/kg
T2D Air temperature K
PSFC Surface pressure Pa
U2D Near surface wind in the u-component m/s
V2D Near surface wind in the v-component m/s
RAINRATE Precipitation rate mm/s or kg/m2/s

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What i'm getting at is, what we are calling AORC forcing are spatially aggregated AORC forcing that closely follow the WRF-Hydro forcing naming convention.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When really, this code is just making sure they follow the WRF-Hydro forcing naming conventions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I don't think this is on us. There seems to be overloading of the term AORC forcing in several places amongst the ORG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #613 so that this conversation has a place to continue. I haven't prioritized it highly, but I wouldn't argue if you felt like it needed to be closer to the top of our list, so feel free to adjust the priority or move it up the board.

discrete_restrictions=new_d_restricts)

def __eq__(self, other):
return isinstance(other, DataDomain) and self.__hash__() == other.__hash__()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return isinstance(other, DataDomain) and self.__hash__() == other.__hash__()
return isinstance(other, DataDomain) and hash(self) == hash(other)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaraney, I'm curious if you had something particular in mind when you suggested this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is purely a style/idiomatic suggestion. They will desugar to the dunder methods like you have it, but generally it is advised to avoid calling dunder methods unless it can't be avoided. For example, you would likely never write return self.__str__(), you would instead write return str(self). It doesnt need to be changed, just is the "pythonic" way for better or worse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, fair enough.

Including both detector types and serialized domains in exception
message raised when multiple detector types successfully produce
different domains.
Using explicit params in AorcCsvFileDomainDetector __init__ rather than
*args and **kwargs for better clarity.
Fixing regex to catch no-suffix case in _get_catchment_id() function.
Call to super must happen first to ensure item_name can be inferred from
item when needed before enforcing item_name was set.
Fixing issue where region string check for "conus" value wasn't handling
case where there was no region string; also fixing other unit test for
region checking where code was altered due to VPU ids not being integer
only.
Required for use of domain detectors.
And for subclasses (for now, only AorcCsvFileDomainDetector overrides
__init__).
Improve signature for AbstractUniversalItemDomainDetector with use of
explicit params rather than kwargs.
Use BytesIO, rather than StringIO, when detecting and needing to use
a bytes object as the data item from which to load a dataframe.
@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/2/collection_detectors branch from 56a2db0 to 6f4dd60 Compare May 13, 2024 13:35
Bumping core to 0.15.2 to address issue with Github Actions cache, and
updating modeldata and client as well to account for this transitively
(and, in the case of client, address the aforementioned cache issue).
Copy link
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for working through this, @robertbartel!

I think we should continue the dialog about how the general we refer to forcing datasets either in an issue here or in the ngen repo. There seems to be several places where the use of AORC forcing have become overloaded to the point that I think it would be confusing to an outsider.

@aaraney aaraney dismissed christophertubbs’s stale review May 13, 2024 18:19

Requested changes have been addressed.

@aaraney aaraney merged commit 5734d93 into NOAA-OWP:master May 13, 2024
8 checks passed
@robertbartel robertbartel deleted the f/dataset_domain_autodetect/2/collection_detectors branch May 13, 2024 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request maas MaaS Workstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants