Improve name of output file #117

Closed
danielfromearth opened this issue Jun 5, 2024 · 4 comments · Fixed by #121
Labels
enhancement New feature or request

Comments

@danielfromearth
Collaborator

From discussions with @alexrad71...

Issue

Currently, CONCISE writes the output to a file named with the Collection ID followed by "_merged.nc4". This name is not very useful to many users.

Proposed Change

Add both the Collection's short name and the version number to the output filename.

What would be even better :)

Include information about the start and end granules in the output filename. For instance, the start and end timestamps could be retrieved from CMR for the start and end granules, and then converted to str, and then added to the output filename.

@frankinspace
Member

👍 I think it's a good idea.

For a little additional historical context: there was some discussion around output file naming conventions used for Harmony services back in 2020 (https://wiki.earthdata.nasa.gov/display/HARMONY/Output+File+Naming+Convention). At the time, the general consensus was that the original filename should be preserved as much as possible. Obviously with CONCISE that is not exactly feasible, because the output is a combination of many input files (this is noted on the https://wiki.earthdata.nasa.gov/display/HARMONY/Transformation+service+availability+and+compliance#Transformationserviceavailabilityandcompliance-Servicecompliance page).

As such, as part of this ticket, that wiki page should be updated to specify what the output filename format is.

With respect to start and end timestamps, rather than introduce a dependency on CMR, I wonder if it would be better to just inspect the data in the final output file and find the min/max timestamps in the data itself to use for filename.
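
A minimal sketch of that idea, assuming the merged file's time variable holds numeric offsets in seconds from a known epoch. The function name time_bounds, the epoch, and the missing-value handling are all hypothetical and are not taken from CONCISE's code. In practice the values would be read from the output file with a NetCDF library; here they are passed in as a plain sequence.

```python
from datetime import datetime, timedelta, timezone


def time_bounds(offsets_in_seconds, epoch):
    """Return (earliest, latest) datetimes for second offsets from `epoch`.

    Assumption: missing values are represented as None and are skipped.
    """
    valid = [value for value in offsets_in_seconds if value is not None]
    if not valid:
        raise ValueError("no valid time values found")
    return (
        epoch + timedelta(seconds=min(valid)),
        epoch + timedelta(seconds=max(valid)),
    )


# Hypothetical epoch and values, for illustration only.
epoch = datetime(2024, 1, 1, tzinfo=timezone.utc)
start, end = time_bounds([3600, None, 60, 7200], epoch)
```

The resulting pair could then feed whatever filename pattern is agreed on, without any call to CMR.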

@danielfromearth
Collaborator Author

Looking at the relevant code a bit more, it now seems to me that the short name and version number aren't directly accessible in the harmony.message information or in the granules' data themselves. So, unlike the min/max timestamps (which, as @frankinspace mentioned, can be retrieved from the granules directly), the short name and version number would only be retrievable with a call to CMR during execution of CONCISE's service adapter. Thanks @frankinspace for referencing the Confluence page, because I see now that @bilts also recommended including these bits of information. To implement this cleanly (without a separate call to CMR), would the harmony.message need to be amended to include the short name and version information? And would that be a good change to make?

The only alternatives I can currently think of, i.e., ways to include useful information beyond just the ConceptID and timestamps, are to put the full name of the first granule along with the number of granules, or to put the names of the first and last granules. Having the full granule names would actually be more analogous to the output filenames from other services, such as l2ss, harmony-netcdf-to-zarr, or net2cog.

So, instead of the current naming:

filename = f'{collection}_merged.nc4'

the new naming could look something like one of the following.

With min/max timestamps:

filename = f'{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4'

With the number of granules and the first granule name (note, this is what stitchee is doing currently):

filename = f"{collection}-concatenated_{number_of_granules}_starting_from_{first_url_name}.nc4"

With the first and last granule names:

filename = f"{collection}_concatenated_granules_from_{first_url_name}_to_{last_url_name}.nc4"

Do any of these look like good approaches? Or are there other ideas?
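
One practical detail for the timestamp-based option: isoformat() output contains colons, which are not allowed in filenames on Windows. A hedged sketch of a colon-free variant follows, assuming the datetimes are in UTC. The helper name merged_filename and the compact timestamp format are illustrative assumptions, not an agreed convention.

```python
from datetime import datetime, timezone


def merged_filename(collection, start, end):
    """Build an output name from a collection ID and UTC start/end datetimes."""
    # Assumption: a compact, colon-free timestamp; the trailing "Z" is a literal
    # and is only correct if `start` and `end` really are UTC.
    fmt = "%Y%m%dT%H%M%SZ"
    return f"{collection}_{start.strftime(fmt)}_{end.strftime(fmt)}_merged.nc4"


# Illustrative values only.
name = merged_filename(
    "C2930726639-LARC_CLOUD",
    datetime(2024, 6, 5, 12, 0, tzinfo=timezone.utc),
    datetime(2024, 6, 5, 18, 30, tzinfo=timezone.utc),
)
```

Using an underscore rather than a hyphen between the two timestamps also avoids ambiguity with the hyphen inside the collection ID.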

Also tagging @ank1m, @chris-durbin, and @owenlittlejohns, since this is likely relevant for other and/or future "many-to-one" output services.

@alexrad71

I prefer this naming:
filename = f'{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4'
but with one correction: currently, collection means a collection ID such as C2930726639-LARC_CLOUD, and a typical user will not remember a day later that this ID means TEMPO_O3TOT_L2_V03. So, it would be great to replace the collection ID with the human-readable collection full name.

@ank1m
Collaborator

ank1m commented Jun 18, 2024

Maybe we can start with {first_url_name}, which seems to include the short_name, version, and first datetime already?
Say, f'{first_url_name}_{last_datetime.isoformat()}_{collection}_merged.nc4'?
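
A rough sketch of this variant, assuming the first granule's name can be taken from the final path segment of its URL with the extension removed. The URL, the granule name, and the helper first_url_name below are hypothetical examples, and the colon-free timestamp format is an assumption rather than something settled in this thread.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def first_url_name(url):
    """Return the last path segment of `url` without its file extension."""
    return PurePosixPath(urlparse(url).path).stem


# Hypothetical inputs, for illustration only.
url = "https://example.com/data/TEMPO_O3TOT_L2_V03_20240605T120000Z_S001G01.nc"
collection = "C2930726639-LARC_CLOUD"
last_datetime = datetime(2024, 6, 5, 18, 30, tzinfo=timezone.utc)

filename = (
    f"{first_url_name(url)}_{last_datetime:%Y%m%dT%H%M%SZ}_{collection}_merged.nc4"
)
```

This only yields a readable short name and version if the granule filenames themselves follow such a convention, which may not hold for every collection.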

@danielfromearth danielfromearth added the enhancement New feature or request label Jun 18, 2024
@danielfromearth danielfromearth linked a pull request Jul 5, 2024 that will close this issue