Add core functionality for parsing attributes from an open file #3
Conversation
Here's a demo notebook.

Note: the latest commit includes the necessary code for creating both the CSV and JSON files.
Looks good
@andersy005: I think this is coming together, but the public API is still a bit confusing. I welcome arguments against this, but it seems to me that the most useful form (from the user's perspective... or mine) of what you are doing would be reflected in the following public API:

- `Builder.__init__()`: The result of constructing the `Builder` should be a valid object that knows (as well as possible) that it will run without error.
- `Builder.build()`: The result of the `Builder.build` method should be the construction of the `Builder.df` and (possibly) `Builder.dict` (?) attributes, described below.
- `Builder.df`: the pre-state of the CSV file, and
- `Builder.dict`: the pre-state of the JSON file. Perhaps this should be named something else?
- `Builder.save()`: Write the `df` and `dict` attributes to files. Perhaps this should be named something else, too?
Does that cover the user behavior you are aiming for?
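For concreteness, the surface described above could be sketched along these lines (only the method and attribute names come from the list; the class internals are entirely hypothetical):

```python
import json
from collections import OrderedDict

import pandas as pd


class Builder:
    """Hypothetical sketch of the proposed init -> build -> save surface."""

    def __init__(self, paths):
        # The constructed object should be valid and "know" (as well as
        # possible) that it will run without error, so validate up front.
        if not paths:
            raise ValueError('at least one input path is required')
        self.paths = list(paths)
        self.df = None
        self.dict = None

    def build(self):
        # Construct the pre-states of the CSV (df) and JSON (dict) files.
        self.df = pd.DataFrame({'path': self.paths})
        self.dict = OrderedDict(catalog_file='catalog.csv')
        return self

    def save(self, csv_path, json_path):
        # Write the df and dict attributes to files.
        self.df.to_csv(csv_path, index=False)
        with open(json_path, 'w') as f:
            json.dump(self.dict, f, indent=2)
```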
@kmpaul, I like the simplicity of the public API suggested above, but I do have a few questions:

1. Could you expand on "it will run without error"? What errors should we check for? I am asking because I think this can be very tricky. For example, if the user provides a …
2. Does this imply that all necessary arguments are handled in `Builder.__init__()`?
3. One of the things I was aiming for was a `Builder` object that behaves as a pipeline, i.e. one that makes it easy to replace intermediate steps if need be. For example, when building a large catalog (for instance, a CMIP6 catalog), you may want to run the pipeline end-to-end the first time only. For subsequent runs, you would only collect the list of available files, compare this list to the list of files in the existing catalog, and only parse attributes of files that don't already appear in the previously built catalog. My understanding of the public API you suggested above is that all the steps are tightly coupled, making it a little tricky to replace intermediate steps when necessary. I could be missing something, though.
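The incremental scenario described above could look roughly like this (the helper name and the `path` column are assumptions for illustration, not code from this PR):

```python
import pandas as pd


def files_to_parse(available_files, existing_catalog_csv):
    """Return only the files not already present in the catalog.

    Hypothetical helper: on subsequent runs, the expensive
    attribute-parsing step operates on these new files only,
    instead of re-running the pipeline end-to-end.
    """
    catalog = pd.read_csv(existing_catalog_csv)
    known = set(catalog['path'])
    return [f for f in available_files if f not in known]
```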
That's one idea, or you can just check to make sure all of the necessary inputs to the `Builder` are valid. Then you need validation unit tests to match.

Not necessarily. The …

Yes. I didn't realize this was a use case, but if that use case is to be assumed (and it should), then you are correct. My pitch is tightly coupled. So, could you add an …? I am envisioning a 2-option use path:

Does that work? Is that clearer?
It makes sense now :) Many thanks for the feedback!
You're welcome. I hope what I'm suggesting is reasonable. If what you are suggesting is the ability to let users build an entire catalog from scratch and to let users update an existing catalog, then it makes sense to call out those 2 scenarios in the names of the API functions. It will be clearer to the user.
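Calling out the two scenarios in the names could look something like this (method names and internals are illustrative only; nothing here is settled API):

```python
class Builder:
    """Hypothetical sketch with one entry point per scenario."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.entries = []

    def build(self):
        # Scenario 1: build an entire catalog from scratch.
        self.entries = [{'path': p} for p in self.paths]
        return self

    def update(self, existing_entries):
        # Scenario 2: update an existing catalog, parsing only the
        # files it does not already contain.
        known = {e['path'] for e in existing_entries}
        new = [{'path': p} for p in self.paths if p not in known]
        self.entries = list(existing_entries) + new
        return self
```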
Gentle ping @kmpaul. When you get a chance, this is ready for another round of review. |
I still feel like there are steps being done here that don't fit the `init` --> `build` --> `save` form. Do you feel like the `esmcol_data` is not something that other people want to be able to see before saving it to the JSON file?
ecgtools/core.py (outdated):

```python
esmcol_data = OrderedDict()
if esmcat_version is None:
    esmcat_version = '0.0.1'

if cat_id is None:
    cat_id = ''

if description is None:
    description = ''
esmcol_data['esmcat_version'] = esmcat_version
esmcol_data['id'] = cat_id
esmcol_data['description'] = description
esmcol_data['catalog_file'] = catalog_file
if attributes is None:
    attributes = []
    for column in self.df.columns:
        attributes.append({'column_name': column, 'vocabulary': ''})

esmcol_data['attributes'] = attributes
esmcol_data['assets'] = {'column_name': path_column}
if (data_format is not None) and (format_column is not None):
    raise ValueError('data_format and format_column are mutually exclusive')

if data_format is not None:
    esmcol_data['assets']['format'] = data_format
else:
    esmcol_data['assets']['format_column_name'] = format_column
if groupby_attrs is None:
    groupby_attrs = [path_column]

if aggregations is None:
    aggregations = []
esmcol_data['aggregation_control'] = {
    'variable_column_name': variable_column,
    'groupby_attrs': groupby_attrs,
    'aggregations': aggregations,
}

self.esmcol_data = esmcol_data
```
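If the construction of `esmcol_data` above belongs in the `build` step, the `save` step could then reduce to a small writer along these lines (a hypothetical sketch, not the code in this PR):

```python
import json


def save(df, esmcol_data, catalog_csv, collection_json):
    # Write the DataFrame pre-state of the catalog to CSV and the
    # esmcol_data pre-state of the collection to JSON.
    df.to_csv(catalog_csv, index=False)
    with open(collection_json, 'w') as f:
        json.dump(esmcol_data, f, indent=2)
```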
This all seems like a `build` step...?
Fixed it
@andersy005 I like this. I think this is a good public API. I'm happy seeing this merged into the master branch.
Hooray 🎉 @kmpaul & @sherimickelson, thank you for your feedback!
This PR adds a `Builder` class. This class implements most of the functionality needed to parse attributes from an open file/xarray dataset. The missing feature consists of parsing attributes from a filename/filepath template + global attributes, i.e. @sherimickelson's work in NCAR/intake-esm-datastore#70.

@sherimickelson, I added sample files to this repo. I am hoping that we can use them to test your work in NCAR/intake-esm-datastore#70 once it's migrated here.