Improve name of output file #117

danielfromearth · 2024-06-05T17:42:23Z

From discussions with @alexrad71...

Issue

Currently, CONCISE writes the output to a file named with the Collection ID and "_merged.nc4", as defined here. This name is not very useful to many users.

Proposed Change

Add both the Collection's short name and the version number to the output filename.

What would be even better :)

Include information about the start and end granules in the output filename. For instance, the start and end timestamps could be retrieved from CMR for the start and end granules, and then converted to str, and then added to the output filename.
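To make the proposal concrete, here is a minimal sketch of how such a filename could be assembled. The helper name `build_output_filename`, its parameters, the `V` prefix, and the example values are illustrative assumptions and not part of CONCISE. It also assumes the inputs are timezone-aware datetimes.

```python
from datetime import datetime, timezone


def build_output_filename(short_name: str, version: str,
                          start: datetime, end: datetime) -> str:
    """Hypothetical helper: combine collection info and a time range into a filename.

    `start` and `end` are assumed to be timezone-aware datetimes.
    """
    # A compact timestamp format avoids ':' characters, which are not
    # allowed in filenames on Windows.
    fmt = "%Y%m%dT%H%M%SZ"
    start_str = start.astimezone(timezone.utc).strftime(fmt)
    end_str = end.astimezone(timezone.utc).strftime(fmt)
    return f"{short_name}_V{version}_{start_str}_{end_str}_merged.nc4"


# Illustrative values only.
example = build_output_filename(
    "TEMPO_O3TOT_L2", "03",
    datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 6, 1, 18, 30, 0, tzinfo=timezone.utc),
)
print(example)  # TEMPO_O3TOT_L2_V03_20240601T120000Z_20240601T183000Z_merged.nc4
```

Where the short name, version, and timestamps come from (CMR, the Harmony message, or the data itself) is left open in this sketch.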


frankinspace · 2024-06-05T19:18:11Z

👍 I think it's a good idea.

For a little additional historical context: there was some discussion in 2020 around the output file naming conventions used for Harmony services (https://wiki.earthdata.nasa.gov/display/HARMONY/Output+File+Naming+Convention). At the time, the general consensus was that the original filename should be preserved as much as possible. Obviously, with CONCISE that is not exactly feasible, because the output is a combination of many input files (this is noted on the https://wiki.earthdata.nasa.gov/display/HARMONY/Transformation+service+availability+and+compliance#Transformationserviceavailabilityandcompliance-Servicecompliance page).

As such, as part of this ticket, that wiki page should be updated to specify what the output filename format is.

With respect to start and end timestamps, rather than introduce a dependency on CMR, I wonder if it would be better to just inspect the data in the final output file and use the min/max timestamps found in the data itself for the filename.
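As a rough sketch of that idea, the bounds could be computed from the time values once they have been read out of the merged file. The function name `time_bounds`, the "seconds since an epoch" encoding, and the example epoch are assumptions for illustration. How the time variable is actually stored will differ between collections, and reading it from the NetCDF file is not shown here.

```python
import math
from datetime import datetime, timedelta, timezone


def time_bounds(time_values, epoch):
    """Return (earliest, latest) datetimes for values given as seconds since `epoch`.

    Missing entries (None or NaN) are skipped; this fill-value handling is an assumption.
    """
    valid = [v for v in time_values if v is not None and not math.isnan(v)]
    if not valid:
        raise ValueError("no valid time values found")
    return (epoch + timedelta(seconds=min(valid)),
            epoch + timedelta(seconds=max(valid)))


# Illustrative values only: offsets in seconds from a made-up epoch.
epoch = datetime(2000, 1, 1, tzinfo=timezone.utc)
earliest, latest = time_bounds([3600.0, float("nan"), 60.0, 7200.0], epoch)
print(earliest.isoformat(), latest.isoformat())
```

This keeps the filename logic independent of CMR, at the cost of needing to know how each collection encodes time.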

danielfromearth · 2024-06-17T15:03:25Z

Looking at the relevant code a bit more, it now seems to me that the short name and version number aren't directly accessible in the harmony.message information or in the granules' data themselves. So, unlike the min/max timestamps (which, as @frankinspace mentioned, can be retrieved from the granules directly), the short name and version number would only be retrievable with a call to CMR during execution of CONCISE's service adapter. Thanks @frankinspace for referencing the Confluence page, because I see now that @bilts also recommended including these bits of information. To implement this cleanly (without a separate call to CMR), would the harmony.message need to be amended to include the short name and version information? And would that be a good change to make?

The only alternatives I can currently think of (i.e., ways to include useful information beyond just the ConceptID and timestamps) are to put the full name of the first granule along with the number of granules, or to put the names of the first and last granules. Having the full granule names would actually be more analogous to the output filenames from other services, such as l2ss, harmony-netcdf-to-zarr, or net2cog.

So, instead of the current naming, which is:

filename = f'{collection}_merged.nc4'

the new naming could look something like this (with min/max timestamps):

filename = f'{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4'

or this (with the number of granules and the first granule's name; note that this is what stitchee is doing currently):

filename = f"{collection}-concatenated_{number_of_granules}_starting_from_{first_url_name}.nc4"

or this (with the first and last granule names):

filename = f"{collection}_concatenated_granules_from_{first_url_name}_to_{last_url_name}.nc4"

Do any of these look like good approaches? Or are there other ideas?
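For comparison, the three candidates can be rendered side by side with placeholder inputs. In this sketch the granule URLs, the `url_name` helper, and the choice to drop the file extension from the granule names are assumptions for illustration; only the collection ID and the f-string patterns come from this thread.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_name(url: str) -> str:
    """Hypothetical helper: file name at the end of a URL, without its extension."""
    return PurePosixPath(urlparse(url).path).stem


collection = "C2930726639-LARC_CLOUD"
# Made-up granule URLs, in the order they would be concatenated.
urls = [
    "https://example.com/data/granule_A.nc",
    "https://example.com/data/granule_B.nc",
    "https://example.com/data/granule_C.nc",
]
# Made-up (min, max) timestamps.
datetimes = (
    datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2024, 6, 1, 18, 30, tzinfo=timezone.utc),
)

first_url_name, last_url_name = url_name(urls[0]), url_name(urls[-1])
number_of_granules = len(urls)

by_time = f"{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4"
by_count = f"{collection}-concatenated_{number_of_granules}_starting_from_{first_url_name}.nc4"
by_names = f"{collection}_concatenated_granules_from_{first_url_name}_to_{last_url_name}.nc4"

for name in (by_time, by_count, by_names):
    print(name)
```

One thing this makes visible: `isoformat()` output contains `:` characters, which are not valid in filenames on Windows, so a compact format such as `strftime("%Y%m%dT%H%M%SZ")` may be a safer choice for the timestamp variant.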

Also tagging @ank1m, @chris-durbin, and @owenlittlejohns, since this is likely relevant for other and/or future "many-to-one" output services.

alexrad71 · 2024-06-17T20:10:58Z

I like this naming more:
filename = f'{collection}_{datetimes[0].isoformat()}-{datetimes[1].isoformat()}_merged.nc4'
but with one correction: currently, collection means C2930726639-LARC_CLOUD, and a typical user will not remember a day later that this collection ID means TEMPO_O3TOT_L2_V03. So, it would be great to replace the collection ID with the human-readable full collection name.

ank1m · 2024-06-18T13:24:39Z

Maybe we can start with {first_url_name}, which seems to include the short_name, version, and first datetime already?
Say, f'{first_url_name}_{last_datetime.isoformat()}_{collection}_merged.nc4'?

danielfromearth added the enhancement New feature or request label Jun 18, 2024

danielfromearth mentioned this issue Jul 5, 2024

Feature/issue 117 improve name of output file #121

Merged

3 tasks

danielfromearth linked a pull request Jul 5, 2024 that will close this issue

jamesfwood closed this as completed in #121 Nov 14, 2024
