provide support for granule wildcard patterns in data downloader #138

Merged 7 commits on Jun 20, 2023

jjmcnelis (Member) commented Jun 14, 2023

Shailen expressed a need for this capability in SWOTCalVal (indirectly, this seems like the most straightforward way to support selective downloading w/o a dedicated UMM field to query on).

I added 2 lines to subscriber/podaac_data_downloader.py so that CMR wildcard matching is supported through the existing options. The change appends the wildcard pattern option to the request parameters whenever the user supplies a granule UR (the -gr option) containing '*' or '?':

        # jmcnelis, 2023/06/14 - provide for wildcards in granuleur-based search
        # Test each character explicitly: `'*' or '?' in s` is always truthy.
        if '*' in cmr_granule or '?' in cmr_granule:
            params.append(('options[GranuleUR][pattern]', 'true'))
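A minimal sketch of the same idea in isolation, for anyone who wants to try it outside the script. The helper names `has_wildcard` and `build_granule_params`, and the `ShortName` / `GranuleUR` parameter keys, are assumptions made for illustration; only the `options[GranuleUR][pattern]` entry comes from the change above. Each membership test is written out explicitly because an expression like `'*' or '?' in s` evaluates to `'*'`, which is truthy for every input.

```python
from urllib.parse import urlencode

# Characters treated as wildcards in a granule UR pattern.
WILDCARD_CHARS = ("*", "?")


def has_wildcard(granule_ur):
    """Return True if the granule UR contains '*' or '?'."""
    return any(ch in granule_ur for ch in WILDCARD_CHARS)


def build_granule_params(short_name, granule_ur):
    """Hypothetical helper: build request parameters as a list of tuples.

    The 'ShortName' and 'GranuleUR' keys are assumed for this sketch.
    """
    params = [("ShortName", short_name), ("GranuleUR", granule_ur)]
    if has_wildcard(granule_ur):
        # Only ask for pattern matching when the user supplied a pattern.
        params.append(("options[GranuleUR][pattern]", "true"))
    return params


params = build_granule_params("SWOTCalVal_GNSS_L2_1.0", "SWOTCalVal_??_GNSS_L2_*")
query = urlencode(params)  # percent-encoded query string
print(query)
```

With an exact granule name (no `*` or `?`), the pattern option is simply left out and the request is unchanged from the previous behavior.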

This supports Shailen's use case where he wants to selectively download granules by campaign (SWOTCalVal). Here are the invocations for two example cases --

  • Download both files, from the WM and T2 campaigns, based on prior knowledge of the filename convention:
$ python subscriber/podaac_data_downloader.py -c SWOTCalVal_GNSS_L2_1.0 -gr 'SWOTCalVal_??_GNSS_L2_*' -d ./data/
[2023-06-14 14:12:12,727] {podaac_data_downloader.py:270} INFO - Found 2 total files to download
[2023-06-14 14:12:19,628] {podaac_data_downloader.py:313} INFO - 2023-06-14 14:12:19.628547 SUCCESS: https://archive.swot.podaac.earthdata.nasa.gov/podaac-swot-ops-cumulus-protected/SWOTCalVal_GNSS_L2_1.0/SWOTCalVal_T2_GNSS_L2_Rec11_20230201T221500_20230201T232230_20230227T220903.nc
[2023-06-14 14:12:21,952] {podaac_data_downloader.py:313} INFO - 2023-06-14 14:12:21.952641 SUCCESS: https://archive.swot.podaac.earthdata.nasa.gov/podaac-swot-ops-cumulus-protected/SWOTCalVal_GNSS_L2_1.0/SWOTCalVal_WM_GNSS_L2_Rec2_20220729T222100_20220730T023300_20230227T211845.nc
[2023-06-14 14:12:21,952] {podaac_data_downloader.py:324} INFO - Downloaded Files: 2
[2023-06-14 14:12:21,952] {podaac_data_downloader.py:325} INFO - Failed Files:     0
[2023-06-14 14:12:21,952] {podaac_data_downloader.py:326} INFO - Skipped Files:    0
[2023-06-14 14:12:22,329] {podaac_data_downloader.py:334} INFO - END
  • Download one granule, from WM campaign, ...:
$ python subscriber/podaac_data_downloader.py -c SWOTCalVal_GNSS_L2_1.0 -gr 'SWOTCalVal_WM_GNSS_L2_*' -d ./data/
[2023-06-14 14:12:29,910] {podaac_data_downloader.py:270} INFO - Found 1 total files to download
[2023-06-14 14:12:35,532] {podaac_data_downloader.py:313} INFO - 2023-06-14 14:12:35.532384 SUCCESS: https://archive.swot.podaac.earthdata.nasa.gov/podaac-swot-ops-cumulus-protected/SWOTCalVal_GNSS_L2_1.0/SWOTCalVal_WM_GNSS_L2_Rec2_20220729T222100_20220730T023300_20230227T211845.nc
[2023-06-14 14:12:35,532] {podaac_data_downloader.py:324} INFO - Downloaded Files: 1
[2023-06-14 14:12:35,532] {podaac_data_downloader.py:325} INFO - Failed Files:     0
[2023-06-14 14:12:35,532] {podaac_data_downloader.py:326} INFO - Skipped Files:    0
[2023-06-14 14:12:35,845] {podaac_data_downloader.py:334} INFO - END
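To preview which names a pattern would select before running the downloader, the two wildcards can be approximated locally. This is a rough sketch, not the CMR implementation: it assumes `*` matches any run of characters and `?` matches exactly one, and it uses the file names from the log output above as stand-ins for granule URs (the real granule UR may differ from the file name). The function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '*' / '?' pattern into a compiled regular expression."""
    escaped = re.escape(pattern)  # escapes '*' as '\*' and '?' as '\?'
    return re.compile(escaped.replace(r"\*", ".*").replace(r"\?", "."))


def matching(pattern, names):
    """Return the names that match the whole pattern."""
    rx = wildcard_to_regex(pattern)
    return [name for name in names if rx.fullmatch(name)]


# File names copied from the download log above.
NAMES = [
    "SWOTCalVal_T2_GNSS_L2_Rec11_20230201T221500_20230201T232230_20230227T220903.nc",
    "SWOTCalVal_WM_GNSS_L2_Rec2_20220729T222100_20220730T023300_20230227T211845.nc",
]

both = matching("SWOTCalVal_??_GNSS_L2_*", NAMES)      # two-character campaign code
wm_only = matching("SWOTCalVal_WM_GNSS_L2_*", NAMES)   # WM campaign only
print(len(both), len(wm_only))
```

The counts agree with the "Found 2 total files" and "Found 1 total files" lines in the two runs.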

This needs further testing by someone besides me.

mike-gangl and others added 2 commits April 28, 2023 10:47
* Issues/91 (podaac#92)

* added citation creation tests and functionality to subscriber and downloader

* added verbose option to create_citation_file command, previously hard coded

* updated changelog (whoops) and fixed regression test:
1. Issue where the citation file now downloaded affected the counts
2. Issue where the logic for determining if a file modified time was changing or not was picking up the new citation file which _always_ gets rewritten to update the 'last accessed' date.

* updated request to include exec_info in warning; fixed issue with params not being a dictionary caused errors

* changed a warning to debug for citation file. fixed test issues

* Enable debug logging during regression tests and set max parallel workflows to 2

* added output to pytest

* fixed test to only look for downloaded data files not citation file due to 'random' cmr errors when creating a citation.

* added mock testing and retry on 503

* added 503 fixes

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* fixed issues where token was not propagated to CMR queries (podaac#95)

* Misc fixes (podaac#101)

* added ".tiff" to default extensions to address podaac#100

* removed 'warning' message on not downloading all data to close podaac#99

* updated help documentation for start/end times to close podaac#79

* added version update, updates to CHANGELOG

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* Update python-app.yml

* updated poetry version 

Version matches build/test versions.

* Issues/98 (podaac#107)

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* added  EDL (not cmr-token) based get, list,delete, refresh token

* updated token regression tests

* updates and tests for subscriber moving to EDL.

* marked tests as regression test

* Update subscriber/podaac_data_downloader.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_data_subscriber.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* added exec info to errors, cleaned up some log statements

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Issues/109 (podaac#111)

* Develop (podaac#103)

* Issues/91 (podaac#92)

* added citation creation tests and functionality to subscriber and downloader

* added verbose option to create_citation_file command, previously hard coded

* updated changelog (whoops) and fixed regression test:
1. Issue where the citation file now downloaded affected the counts
2. Issue where the logic for determining if a file modified time was changing or not was picking up the new citation file which _always_ gets rewritten to update the 'last accessed' date.

* updated request to include exec_info in warning; fixed issue with params not being a dictionary caused errors

* changed a warning to debug for citation file. fixed test issues

* Enable debug logging during regression tests and set max parallel workflows to 2

* added output to pytest

* fixed test to only look for downloaded data files not citation file due to 'random' cmr errors when creating a citation.

* added mock testing and retry on 503

* added 503 fixes

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* fixed issues where token was not propagated to CMR queries (podaac#95)

* Misc fixes (podaac#101)

* added ".tiff" to default extensions to address podaac#100

* removed 'warning' message on not downloading all data to close podaac#99

* updated help documentation for start/end times to close podaac#79

* added version update, updates to CHANGELOG

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* Update python-app.yml

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* updated poetry version 

Version matches build/test versions.

* Update README.md

* Update podaac_data_downloader.py

Fixing for issues 109 - adding capability to download by granule-name

* Update Downloader.md

Fixed the help file

* added changelog entries, regression tests

* added poetry lock cleanup

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>

* added README information and updates (podaac#113)

* fixed pymock issues... again

* Extension regex (podaac#121)

* extend -e option to handle regular expressions (podaac#115)

* Develop into Main (1.12.0) (podaac#114)

* Issues/91 (podaac#92)

* added citation creation tests and functionality to subscriber and downloader

* added verbose option to create_citation_file command, previously hard coded

* updated changelog (whoops) and fixed regression test:
1. Issue where the citation file now downloaded affected the counts
2. Issue where the logic for determining if a file modified time was changing or not was picking up the new citation file which _always_ gets rewritten to update the 'last accessed' date.

* updated request to include exec_info in warning; fixed issue with params not being a dictionary caused errors

* changed a warning to debug for citation file. fixed test issues

* Enable debug logging during regression tests and set max parallel workflows to 2

* added output to pytest

* fixed test to only look for downloaded data files not citation file due to 'random' cmr errors when creating a citation.

* added mock testing and retry on 503

* added 503 fixes

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* fixed issues where token was not propagated to CMR queries (podaac#95)

* Misc fixes (podaac#101)

* added ".tiff" to default extensions to address podaac#100

* removed 'warning' message on not downloading all data to close podaac#99

* updated help documentation for start/end times to close podaac#79

* added version update, updates to CHANGELOG

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* Update python-app.yml

* updated poetry version 

Version matches build/test versions.

* Issues/98 (podaac#107)

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* added  EDL (not cmr-token) based get, list,delete, refresh token

* updated token regression tests

* updates and tests for subscriber moving to EDL.

* marked tests as regression test

* Update subscriber/podaac_data_downloader.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_data_subscriber.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* added exec info to errors, cleaned up some log statements

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Issues/109 (podaac#111)

* Develop (podaac#103)

* Issues/91 (podaac#92)

* added citation creation tests and functionality to subscriber and downloader

* added verbose option to create_citation_file command, previously hard coded

* updated changelog (whoops) and fixed regression test:
1. Issue where the citation file now downloaded affected the counts
2. Issue where the logic for determining if a file modified time was changing or not was picking up the new citation file which _always_ gets rewritten to update the 'last accessed' date.

* updated request to include exec_info in warning; fixed issue with params not being a dictionary caused errors

* changed a warning to debug for citation file. fixed test issues

* Enable debug logging during regression tests and set max parallel workflows to 2

* added output to pytest

* fixed test to only look for downloaded data files not citation file due to 'random' cmr errors when creating a citation.

* added mock testing and retry on 503

* added 503 fixes

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* fixed issues where token was not propagated to CMR queries (podaac#95)

* Misc fixes (podaac#101)

* added ".tiff" to default extensions to address podaac#100

* removed 'warning' message on not downloading all data to close podaac#99

* updated help documentation for start/end times to close podaac#79

* added version update, updates to CHANGELOG

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* Update python-app.yml

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* updated poetry version 

Version matches build/test versions.

* Update README.md

* Update podaac_data_downloader.py

Fixing for issues 109 - adding capability to download by granule-name

* Update Downloader.md

Fixed the help file

* added changelog entries, regression tests

* added poetry lock cleanup

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>

* added README information and updates (podaac#113)

* fixed pymock issues... again

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>

* extend -e option to handle regular expressions

formerly, -e could not handle PTM_\d+ extensions without the user explicitly
calling all of them.

---------

Co-authored-by: mike-gangl <59702631+mike-gangl@users.noreply.github.com>
Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>

* added documentation and tests for regex

* converted defaults to regexes, added gtiff test

---------

Co-authored-by: Peter Mao <peter.mao@gmail.com>
Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>

* closes 118. retries was never hit because range is not end inclusive. (podaac#119)

* closes 118. retries was never hit because range is not end inclusive.

* updated test to catch now-thrown exception

* added --dry-run option, docs, and test cases (podaac#124)

* added --dry-run option, docs, and test cases

* Update subscriber/podaac_data_downloader.py

Added more elegant way of download limit application

Co-authored-by: Stepheny Perez <skorper@users.noreply.github.com>

---------

Co-authored-by: Stepheny Perez <skorper@users.noreply.github.com>

* Issues/70 (podaac#117)

* added code for updating version

* added changelog

* moved version check into __main__ instead of on import of the module

* added sorting of releases from github to find latest release.

* added authenticated (option) access to github API to prevent rate limiting

* separate out auth/token regression tests

* Issues/127 (podaac#128)

* added token sensitivity filter to remove tokens from CMR queries

* added changelog updates

* updated some lingering merge issues (huh?)

* updated regression test

* updated ubuntu versions

* removed 18.04 ubuntu from workflows/actions

* version and documentation updates (podaac#130)

* 1.13.1 changelog and dependency updates

* fixed formatting from unsaved merges

---------

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>
Co-authored-by: Peter Mao <peter.mao@gmail.com>
Co-authored-by: Stepheny Perez <skorper@users.noreply.github.com>
Co-authored-by: Stepheny Perez <skorper@users.noreply.github.com>
@jjmcnelis
Member Author

Here's some evidence that these updates still have the expected outcome after fixes caught by @skorper:

(base) jmcnelis@MT-209219:main  [ ~/subscriber-feature-enhancements/data-subscriber ] 
 02:12:35 $ python subscriber/podaac_data_downloader.py -c SWOTCalVal_GNSS_L2_1.0 -gr 'SWOTCalVal_??_GNSS_L2_*' -d ./data/
[2023-06-14 14:50:16,780] {podaac_data_downloader.py:270} INFO - Found 2 total files to download
[2023-06-14 14:50:23,706] {podaac_data_downloader.py:313} INFO - 2023-06-14 14:50:23.706562 SUCCESS: https://archive.swot.podaac.earthdata.nasa.gov/podaac-swot-ops-cumulus-protected/SWOTCalVal_GNSS_L2_1.0/SWOTCalVal_T2_GNSS_L2_Rec11_20230201T221500_20230201T232230_20230227T220903.nc
[2023-06-14 14:50:25,960] {podaac_data_downloader.py:313} INFO - 2023-06-14 14:50:25.960080 SUCCESS: https://archive.swot.podaac.earthdata.nasa.gov/podaac-swot-ops-cumulus-protected/SWOTCalVal_GNSS_L2_1.0/SWOTCalVal_WM_GNSS_L2_Rec2_20220729T222100_20220730T023300_20230227T211845.nc
[2023-06-14 14:50:25,960] {podaac_data_downloader.py:324} INFO - Downloaded Files: 2
[2023-06-14 14:50:25,960] {podaac_data_downloader.py:325} INFO - Failed Files:     0
[2023-06-14 14:50:25,960] {podaac_data_downloader.py:326} INFO - Skipped Files:    0
[2023-06-14 14:50:26,310] {podaac_data_downloader.py:334} INFO - END


 
(base) jmcnelis@MT-209219:main  [ ~/subscriber-feature-enhancements/data-subscriber ] 
 02:50:26 $ python subscriber/podaac_data_downloader.py -c SWOTCalVal_GNSS_L2_1.0 -gr 'SWOTCalVal_WM_GNSS_L2_*' -d ./data/
[2023-06-14 14:50:37,917] {podaac_data_downloader.py:270} INFO - Found 1 total files to download
[2023-06-14 14:50:43,467] {podaac_data_downloader.py:313} INFO - 2023-06-14 14:50:43.467900 SUCCESS: https://archive.swot.podaac.earthdata.nasa.gov/podaac-swot-ops-cumulus-protected/SWOTCalVal_GNSS_L2_1.0/SWOTCalVal_WM_GNSS_L2_Rec2_20220729T222100_20220730T023300_20230227T211845.nc
[2023-06-14 14:50:43,468] {podaac_data_downloader.py:324} INFO - Downloaded Files: 1
[2023-06-14 14:50:43,468] {podaac_data_downloader.py:325} INFO - Failed Files:     0
[2023-06-14 14:50:43,468] {podaac_data_downloader.py:326} INFO - Skipped Files:    0
[2023-06-14 14:50:43,806] {podaac_data_downloader.py:334} INFO - END

@skorper
Contributor

skorper commented Jun 14, 2023

@jjmcnelis A few more things..

  • Can you please change this PR to point to develop instead of main? New features go into develop and are eventually merged to main when we release.
  • Can you add a line in the changelog? Create a new [unreleased] section
  • Can you add something in the downloader readme about this new feature?

@jjmcnelis jjmcnelis changed the base branch from main to develop June 14, 2023 20:05
@jjmcnelis
Member Author

Thanks for your patience, @skorper. I'm out of my element..

I made edits to CHANGELOG.md, Downloader.md, and the help text for the -gr option inside podaac_data_downloader.py to expand on its use with wildcard patterns. Downloader.md links to the CMR Search API docs describing this wildcard search feature, which works exactly the same way through our tool as it does through the REST API parameters (many more parameters are supported there than just Granule UR, but this one is the most useful to expose to users through the downloader tool, IMO). Let me know if these updates don't meet our standards and I'll take another shot at it right away. Thanks again.
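For comparison, here is a sketch of what an equivalent direct query against the CMR REST API might look like. The endpoint and the lowercase parameter names (`short_name`, `granule_ur[]`, `options[granule_ur][pattern]`) are my understanding of the CMR Search API and should be checked against its documentation; the function name is hypothetical. The code only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed public CMR granule search endpoint (JSON response format).
CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.json"


def cmr_wildcard_url(short_name, granule_ur_pattern):
    """Hypothetical helper: build a granule search URL with pattern matching on."""
    query = [
        ("short_name", short_name),
        ("granule_ur[]", granule_ur_pattern),
        # Ask CMR to treat '*' and '?' in granule_ur as wildcards.
        ("options[granule_ur][pattern]", "true"),
    ]
    return CMR_GRANULE_SEARCH + "?" + urlencode(query)


url = cmr_wildcard_url("SWOTCalVal_GNSS_L2_1.0", "SWOTCalVal_WM_GNSS_L2_*")
print(url)
```

Protected collections may also need an Earthdata Login token, which this sketch leaves out.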

@skorper
Contributor

skorper commented Jun 14, 2023

Thank you @jjmcnelis ! The last thing would be to resolve merge conflicts, then I will approve 🙂
