Add core functionality for parsing attributes from an open file #3
Conversation
Here's a demo notebook.

Note: the latest commit includes the necessary code for creating both the CSV and JSON files.
Looks good
@andersy005: I think this is coming together, but the public API is still a bit confusing. I welcome arguments against this, but it seems to me that the most useful form (from the user's perspective... or mine) of what you are doing would be reflected in the following public API:

- `Builder.__init__()`: The result of constructing the `Builder` should be a valid object that knows (as well as possible) that it will run without error.
- `Builder.build()`: The result of the `Builder.build` method should be the construction of the `Builder.df` and (possibly) `Builder.dict` (?) attributes, described below.
- `Builder.df`: the pre-state of the CSV file, and
- `Builder.dict`: the pre-state of the JSON file. Perhaps this should be named something else?
- `Builder.save()`: Write the `df` and `dict` attributes to files. Perhaps this should be named something else, too?
Does that cover the user behavior you are aiming for?
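For concreteness, the surface described above could be sketched along these lines (only the method and attribute names come from the list; the class internals are entirely hypothetical):

```python
import json
from collections import OrderedDict

import pandas as pd


class Builder:
    """Hypothetical sketch of the proposed init -> build -> save surface."""

    def __init__(self, paths):
        # The constructed object should be valid and "know" (as well as
        # possible) that it will run without error, so validate up front.
        if not paths:
            raise ValueError('at least one input path is required')
        self.paths = list(paths)
        self.df = None
        self.dict = None

    def build(self):
        # Construct the pre-states of the CSV (df) and JSON (dict) files.
        self.df = pd.DataFrame({'path': self.paths})
        self.dict = OrderedDict(catalog_file='catalog.csv')
        return self

    def save(self, csv_path, json_path):
        # Write the df and dict attributes to files.
        self.df.to_csv(csv_path, index=False)
        with open(json_path, 'w') as f:
            json.dump(self.dict, f, indent=2)
```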
@kmpaul, I like the simplicity of the public API suggested above, but I do have a few questions:

1. Could you expand on "it will run without error"? What errors should we check for? I am asking because I think this can be very tricky. For example, if the user provides a …
2. Does this imply that all necessary arguments are handled in `Builder.__init__()`?
3. One of the things I was aiming for was a `Builder` object that behaves as a pipeline, i.e. one that makes it easy to replace intermediate steps if need be. For example, when building a large catalog (for instance, a CMIP6 catalog), you may want to run the pipeline end-to-end the first time only. For subsequent runs, you would only collect the list of available files, compare this list to the list of files in the existing catalog, and only parse attributes of files that don't already appear in the previously built catalog. My understanding of the public API you suggested above is that all the steps are tightly coupled, making it a little tricky to replace intermediate steps when necessary. I could be missing something, though.
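The incremental scenario described above could look roughly like this (the helper name and the `path` column are assumptions for illustration, not code from this PR):

```python
import pandas as pd


def files_to_parse(available_files, existing_catalog_csv):
    """Return only the files not already present in the catalog.

    Hypothetical helper: on subsequent runs, the expensive
    attribute-parsing step operates on these new files only,
    instead of re-running the pipeline end-to-end.
    """
    catalog = pd.read_csv(existing_catalog_csv)
    known = set(catalog['path'])
    return [f for f in available_files if f not in known]
```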
That's one idea, or you can just check to make sure all of the necessary inputs to the `Builder` are valid. Then you need validation unit tests to match.

Not necessarily. The …

Yes. I didn't realize this was a use case, but if that use case is to be assumed (and it should), then you are correct. My pitch is tightly coupled. So, could you add an …? I am envisioning a 2-option use path:

Does that work? Is that clearer?
It makes sense now :) Many thanks for the feedback!
You're welcome. I hope what I'm suggesting is reasonable. If what you are suggesting is the ability to let users build an entire catalog from scratch and to let users update an existing catalog, then it makes sense to call out those 2 scenarios in the names of the API functions. It will be clearer to the user.
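Calling out the two scenarios in the names could look something like this (method names and internals are illustrative only; nothing here is settled API):

```python
class Builder:
    """Hypothetical sketch with one entry point per scenario."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.entries = []

    def build(self):
        # Scenario 1: build an entire catalog from scratch.
        self.entries = [{'path': p} for p in self.paths]
        return self

    def update(self, existing_entries):
        # Scenario 2: update an existing catalog, parsing only the
        # files it does not already contain.
        known = {e['path'] for e in existing_entries}
        new = [{'path': p} for p in self.paths if p not in known]
        self.entries = list(existing_entries) + new
        return self
```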
Gentle ping @kmpaul. When you get a chance, this is ready for another round of review. |
I still feel like there are steps being done here that don't fit the `init` --> `build` --> `save` form. Do you feel like the `esmcol_data` is not something that other people want to be able to see before saving it to the JSON file?
ecgtools/core.py (outdated):

```python
esmcol_data = OrderedDict()
if esmcat_version is None:
    esmcat_version = '0.0.1'

if cat_id is None:
    cat_id = ''

if description is None:
    description = ''
esmcol_data['esmcat_version'] = esmcat_version
esmcol_data['id'] = cat_id
esmcol_data['description'] = description
esmcol_data['catalog_file'] = catalog_file
if attributes is None:
    attributes = []
    for column in self.df.columns:
        attributes.append({'column_name': column, 'vocabulary': ''})

esmcol_data['attributes'] = attributes
esmcol_data['assets'] = {'column_name': path_column}
if (data_format is not None) and (format_column is not None):
    raise ValueError('data_format and format_column are mutually exclusive')

if data_format is not None:
    esmcol_data['assets']['format'] = data_format
else:
    esmcol_data['assets']['format_column_name'] = format_column
if groupby_attrs is None:
    groupby_attrs = [path_column]

if aggregations is None:
    aggregations = []
esmcol_data['aggregation_control'] = {
    'variable_column_name': variable_column,
    'groupby_attrs': groupby_attrs,
    'aggregations': aggregations,
}

self.esmcol_data = esmcol_data
```
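If the construction of `esmcol_data` above belongs in the `build` step, the `save` step could then reduce to a small writer along these lines (a hypothetical sketch, not the code in this PR):

```python
import json


def save(df, esmcol_data, catalog_csv, collection_json):
    # Write the DataFrame pre-state of the catalog to CSV and the
    # esmcol_data pre-state of the collection to JSON.
    df.to_csv(catalog_csv, index=False)
    with open(collection_json, 'w') as f:
        json.dump(esmcol_data, f, indent=2)
```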
This all seems like a `build` step...?
Fixed it
@andersy005 I like this. I think this is a good public API. I'm happy seeing this merged into the master branch.
Hooray 🎉 @kmpaul & @sherimickelson, thank you for your feedback!
This PR adds a `Builder` class. This class implements most of the functionality needed to parse attributes from an open file/xarray dataset. The missing feature consists of parsing attributes from a filename/filepath template + global attributes, i.e. @sherimickelson's work in NCAR/intake-esm-datastore#70.

@sherimickelson, I added sample files to this repo. I am hoping that we can use them to test your work in NCAR/intake-esm-datastore#70 once it's migrated here.