Complex domain detection for collections of data items with CLI support #566
9ae5ed8
to
daa7049
Compare
python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py
Outdated
python/lib/modeldata/dmod/test/test_geopackage_hydrofabric_domain_detector.py
@@ -622,12 +797,17 @@ def handle_type_map(t):

    @root_validator()
    def validate_sufficient_restrictions(cls, values):
        data_format = values.get("data_format")
        if data_format == DataFormat.EMPTY or data_format == DataFormat.GENERIC:
I would add a comment here that it is permissible for an `EMPTY` or `GENERIC` to have restrictions.
Related to this, we should document dataset invariants at some point. So almost like a retrospective design doc. Seems like something we could put together in a DMOD team brown bag that deep dives into the concepts of a DMOD dataset.
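A minimal sketch of the kind of in-code comment suggested above. It uses a plain dataclass rather than the pydantic `@root_validator` from the PR so that it runs with only the standard library. The `DataFormat` members, the `continuous_restrictions` field, and the error message are assumptions for illustration, not DMOD's actual definitions.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class DataFormat(Enum):
    # Hypothetical stand-in for the real DataFormat enum.
    EMPTY = auto()
    GENERIC = auto()
    AORC_CSV = auto()


@dataclass
class DomainSketch:
    # Hypothetical stand-in for DataDomain; field names are assumptions.
    data_format: DataFormat
    continuous_restrictions: List[str] = field(default_factory=list)
    discrete_restrictions: List[str] = field(default_factory=list)

    def __post_init__(self):
        # EMPTY and GENERIC domains are exempt from the "at least one
        # restriction" rule. They may still carry restrictions; that is
        # permitted, just not required.
        if self.data_format in (DataFormat.EMPTY, DataFormat.GENERIC):
            return
        if not (self.continuous_restrictions or self.discrete_restrictions):
            raise ValueError(
                f"{self.data_format.name} domain needs at least one restriction"
            )
```

Writing the exemption down next to the early return is one way to start capturing the dataset invariants mentioned in this thread.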
python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py
Outdated
a8d139f
to
5abe142
Compare
5a11135
to
d9e0f1a
Compare
Thanks for reworking a lot of things, @robertbartel! I think we are getting close and a lot of my comments can be moved to TODOs.
python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py
Outdated
python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py
Outdated
python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py
Outdated
python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py
Outdated
d9e0f1a
to
5bb5602
Compare
I need to make another pass or two, but I think most things have been addressed. @christophertubbs and @aaraney, I think many of the open conversation threads can probably be closed. Could you mark any threads you feel still need discussion and/or changes?
d882a7a
to
5563a42
Compare
Overall this looks great! We are getting really close to / might already be at the point where we can merge. I left a few comments / suggestions that are mostly stylistic but I do think will improve usage of the APIs.
@@ -94,9 +97,8 @@ class DataFormat(PydanticEnum):
    """
    AORC_CSV = (0,
I am fairly certain the AORC variable names are the following (in sorted order):
APCP_surface
DLWRF_surface
DSWRF_surface
latitude
longitude
PRES_surface
SPFH_2maboveground
time
TMP_2maboveground
UGRD_10maboveground
VGRD_10maboveground
It looks like the AORC PDF that was public is now private, so it is a little more difficult to verify this. I did find AORC data on AWS's Registry of Open Data (here is a link to the bucket itself). If you go to the bucket that contains the `zarr` chunks, you can click on one of the year "directories" and the children directories are the variable names. This is how `zarr` chunks are organized.
I couldn't find the variable names spelled out in the publication of the AORC 1.1 dataset, but I figured it was worth linking it either way.
This should probably be our source of truth for better or worse.
So I deliberately changed this here because I noticed the data I was seeing was different from the previous fields. And the data I had were CSVs I'd regridded myself using @jduckerOWP's utility.
@aaraney, you're talking about the official AORC format. For better or worse, that's not exactly the same data format we see (or at least have seen) in the CSV regriddings, and I am fairly sure this was intentional. In particular (aside: ugh, this old chestnut again), the CSVs include `RAINRATE` instead of `APCP_surface`.
There probably should be something formal and official outside of DMOD (and ngen) defining NextGen regridded CSV and NetCDF data formats, though I don't think that there is right now (@hellkite500?).
I found a working link to some documentation (https://www.weather.gov/media/owp/operations/aorc_v1_1_methods.pdf).
After doing a little more digging, I am thinking these names are coming from the NWM AORC 1.1 dataset. I pulled a netcdf forcing file from the NWM 3.0 retrospective dataset and dumped and sorted the variable names:
char crs ;
double x(x) ;
double y(y) ;
float RAINRATE(time, y, x) ;
int LQFRAC(time, y, x) ;
int LWDOWN(time, y, x) ;
int PSFC(time, y, x) ;
int Q2D(time, y, x) ;
int reference_time(reference_time) ;
int SWDOWN(time, y, x) ;
int T2D(time, y, x) ;
int time(time) ;
int U2D(time, y, x) ;
int V2D(time, y, x) ;
These line up perfectly with the WRF-Hydro forcing inputs (table 5.4):
| Variable | Description | Units |
| --- | --- | --- |
| SWDOWN | Incoming shortwave radiation | W/m2 |
| LWDOWN | Incoming longwave radiation | W/m2 |
| Q2D | Specific humidity | kg/kg |
| T2D | Air temperature | K |
| PSFC | Surface pressure | Pa |
| U2D | Near surface wind in the u-component | m/s |
| V2D | Near surface wind in the v-component | m/s |
| RAINRATE | Precipitation rate | mm/s or kg/m2/s |
What I'm getting at is that what we are calling `AORC` forcing is spatially aggregated `AORC` forcing that closely follows the `WRF-Hydro` forcing naming convention.
When really, this code is just making sure they follow the `WRF-Hydro` forcing naming conventions.
To be clear, I don't think this is on us. There seems to be overloading of the term `AORC` forcing in several places across the org.
I opened #613 so that this conversation has a place to continue. I haven't prioritized it highly, but I wouldn't argue if you felt like it needed to be closer to the top of our list, so feel free to adjust the priority or move it up the board.
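To make the naming distinction in this thread concrete, here is a small sketch that labels a CSV header by which of the two name sets it contains. The name sets are copied from the comments above (the official AORC list without its coordinate names, and the WRF-Hydro table 5.4 names); the function name and the returned labels are hypothetical and not part of DMOD.

```python
import csv
import io

# Official AORC variable names, as listed earlier in this thread
# (coordinate names latitude, longitude, and time left out).
AORC_NAMES = {
    "APCP_surface", "DLWRF_surface", "DSWRF_surface", "PRES_surface",
    "SPFH_2maboveground", "TMP_2maboveground",
    "UGRD_10maboveground", "VGRD_10maboveground",
}

# WRF-Hydro forcing input names, as quoted from table 5.4 above.
WRF_HYDRO_NAMES = {
    "SWDOWN", "LWDOWN", "Q2D", "T2D", "PSFC", "U2D", "V2D", "RAINRATE",
}


def classify_forcing_header(csv_text: str) -> str:
    """Label a CSV by which forcing naming convention its header follows."""
    # Only the first row (the header) is inspected.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    columns = {name.strip() for name in header}
    if WRF_HYDRO_NAMES <= columns:
        return "wrf-hydro-style"
    if AORC_NAMES <= columns:
        return "aorc-official"
    return "unknown"
```

Under this sketch, a regridded CSV that carries `RAINRATE` and the other table 5.4 names is reported as `wrf-hydro-style`, even though it is commonly called AORC forcing.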
                          discrete_restrictions=new_d_restricts)

    def __eq__(self, other):
        return isinstance(other, DataDomain) and self.__hash__() == other.__hash__()
-        return isinstance(other, DataDomain) and self.__hash__() == other.__hash__()
+        return isinstance(other, DataDomain) and hash(self) == hash(other)
@aaraney, I'm curious if you had something particular in mind when you suggested this.
This is purely a style/idiomatic suggestion. They will desugar to the dunder methods like you have it, but generally it is advised to avoid calling dunder methods unless it can't be avoided. For example, you would likely never write `return self.__str__()`; you would instead write `return str(self)`. It doesn't need to be changed, it's just the "pythonic" way, for better or worse.
Eh, fair enough.
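For reference, a small sketch of the style point discussed above. `HashableDomain` is a hypothetical stand-in for `DataDomain`, and its two fields are assumptions chosen only to make the example self-contained.

```python
class HashableDomain:
    """Hypothetical stand-in for DataDomain, used only to show the idiom."""

    def __init__(self, data_format: str, restrictions: tuple):
        self.data_format = data_format
        self.restrictions = restrictions

    def __hash__(self) -> int:
        return hash((self.data_format, self.restrictions))

    def __eq__(self, other) -> bool:
        # The built-in hash() dispatches to __hash__, in the same way
        # that str() dispatches to __str__.
        return isinstance(other, HashableDomain) and hash(self) == hash(other)

    def __str__(self) -> str:
        return f"{self.data_format}{list(self.restrictions)}"


a = HashableDomain("AORC_CSV", ("cat-1", "cat-2"))
b = HashableDomain("AORC_CSV", ("cat-1", "cat-2"))
same = a == b          # compared through hash(self) == hash(other)
label = str(a)         # preferred over a.__str__()
```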
python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py
Outdated
python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py
Outdated
5563a42
to
8265387
Compare
Including both detector types and serialized domains in exception message raised when multiple detector types successfully produce different domains.
Using explicit params in AorcCsvFileDomainDetector __init__ rather than *args and **kwargs for better clarity.
Fixing regex to catch no-suffix case in _get_catchment_id() function.
Call to super must happen first to ensure item_name can be inferred from item when needed before enforcing item_name was set.
Fixing issue where region string check for "conus" value wasn't handling case where there was no region string; also fixing other unit test for region checking where code was altered due to VPU ids not being integer only.
Required for use of domain detectors.
And for subclasses (for now, only AorcCsvFileDomainDetector overrides __init__).
Improve signature for AbstractUniversalItemDomainDetector with use of explicit params rather than kwargs.
Use BytesIO, rather than StringIO, when detecting and needing to use a bytes object as the data item from which to load a dataframe.
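The last commit message above concerns wrapping a `bytes` data item in the right kind of in-memory buffer. A minimal sketch of that choice follows. The commit mentions loading a dataframe; this sketch swaps in the standard library `csv` reader so it has no third-party dependency, and `rows_from_item` is a hypothetical helper, not a DMOD function.

```python
import csv
import io


def rows_from_item(item):
    """Parse CSV rows from a data item given as either bytes or str."""
    if isinstance(item, bytes):
        # Bytes need a binary buffer; io.StringIO(item) would raise TypeError.
        # UTF-8 is an assumption about the item's encoding.
        buffer = io.TextIOWrapper(io.BytesIO(item), encoding="utf-8", newline="")
    else:
        buffer = io.StringIO(item, newline="")
    with buffer:
        return list(csv.reader(buffer))
```

Branching on the item type keeps the caller from having to decode bytes up front.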
56a2db0
to
6f4dd60
Compare
Bumping core to 0.15.2 to address issue with GitHub Actions cache, and updating modeldata and client as well to account for this transitively (and, in the case of client, address the aforementioned cache issue).
This looks great! Thanks for working through this, @robertbartel!
I think we should continue the dialog about how we, in general, refer to forcing datasets, either in an issue here or in the `ngen` repo. There seem to be several places where the term `AORC` forcing has become overloaded to the point that I think it would be confusing to an outsider.
Requested changes have been addressed.
Adding basic capabilities for detecting domains for collections of data items, be they a directory of files or a dataset. Also including CLI support, both as dedicated detection command and within dataset operation commands involving uploading data.
Note that this branch builds on the branch for #565. This PR should remain in draft status until #565 is merged.