Cull duplicate dataURIs for MAST in download_products #2497
Conversation
Codecov Report
```diff
@@           Coverage Diff            @@
##             main    #2497    +/-  ##
========================================
+ Coverage   62.86%   62.99%   +0.13%
========================================
  Files         133      133
  Lines       17276    17302      +26
========================================
+ Hits        10860    10899      +39
+ Misses       6416     6403      -13
```
lgtm, it's a simple fix and the issue describes the problem well. But I defer to @ceb8 for final review.

Regarding the unit tests: I think it would be a good idea to have some JWST-specific ones, since JWST ought to be in high demand in the coming months and years. In the meantime, you could perhaps riff off the example in `astroquery/mast/tests/test_mast.py` (lines 476 to 479 at af6b41b) and just add a check that the list of URIs is unique (assuming you can find a data set ID that has non-unique entries that need to be filtered).
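The suggested uniqueness check can be sketched without any MAST fixtures; the URIs below are made up for illustration:

```python
def assert_unique(uris):
    """Fail if any data URI appears more than once in a download list."""
    duplicates = {u for u in uris if uris.count(u) > 1}
    assert not duplicates, f"duplicate dataURIs: {sorted(duplicates)}"

# A culled product list should pass; one repeating an MSA config file should not.
assert_unique(["mast:JWST/a_s3d.fits", "mast:JWST/b_s3d.fits"])
try:
    assert_unique(["mast:JWST/msa.fits", "mast:JWST/msa.fits"])
except AssertionError:
    print("caught duplicate")  # caught duplicate
```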
I suppose it would also be useful to be able to turn off the progress bar, but that is beyond the scope here.
Please add a changelog entry and then this is good to go.
Hi @jdavies-st, this looks good, although I would suggest having the fix under …
Thanks for the feedback! I had thought of putting the … If you run:

```python
from astroquery.mast import Observations

obs_table = Observations.query_criteria(proposal_id="2736", instrument_name="NIRSPEC")
data_products = Observations.get_product_list(obs_table)
```

you'll get a timeout, or it will just hang and fail. One has to break the query up into chunks or just loop through each observation. As an example, there can be 250k products (many duplicates) for a couple dozen NIRSpec observations that include level 3 products. And for scientists, this is going to be the way they will want to query: get me all the data for my `proposal_id`, maybe minus the guide star data. So if I already have to call in a loop, say like the following:

```python
from astroquery.mast import Observations
from astropy.table import unique, vstack

obs_table = Observations.query_criteria(proposal_id="2736", instrument_name="NIRSPEC")
data_products_list = []
for obs in obs_table:
    data_products_list.append(Observations.get_product_list(obs))
data_products = vstack(data_products_list)
```

I will still have lots of duplicates in my final vstacked table, as the duplication occurs between observations rather than within them, especially for an instrument like NIRSpec, where each level 3 product uses the same level 2 products as input. So if I have NIRSpec MOS with 100 spectra and get the level 3 products, then I will get 100 copies of the level 2 products, one for each level 3 observation.

Also, if one is interested in the parent observation of a data product, each of the many copies of a particular file to download will have different parent metadata. So culling early might make people miss data if, later in their script (before download), they select by observation, say the newest observation that was just taken.

So that was my thinking in leaving the duplicate culling to the last moment, before download, as that's really when you want to cull. Just don't make me download this file 600 times. =)

Thoughts? Or do you think it would still be best to cull early?
Hi @jdavies-st, those are good points. Doing the cull during the download step instead of the …

I also want to mention that the hang-up issue you're experiencing when trying to download large numbers of products is on our radar. We're currently trying to find the root cause and will implement a fix so users don't have to do those iterative product queries anymore.
I think that dealing with this duplicate issue is important, but I have a few concerns about this solution, mostly that it breaks the connection between the metadata and the file location. As @jdavies-st mentions above, the metadata is unique for each parent observation even though the file is the same.

Ironically, the download location was designed to be guaranteed unique for each individual observation because a few Hubble products do not have unique names (while being unique files, I believe), a fact which is now causing this problem for JWST.

I have a few ideas for solutions:
I defer to @jaymedina to decide what solution is best, but at minimum I do think the user needs to be warned if they are going to get a download manifest with a different number of rows than they supplied to the download function.
I revoke my approval in favour of the discussion above.
Thanks @ceb8, I'm in favor of option 1, especially if this functionality has already been requested by users. Perhaps also add a warning in the docstring: if users have JWST products in their product list and don't use the flat download directory keyword argument, they may be re-downloading files unintentionally.
@ceb8 I agree that option 1 above is a great solution. It is something I'd like to see solved as well, as one always has to move the files to a flat structure so that the calibration pipeline can be run, though I didn't find a GitHub issue here for that. But I'm not convinced that JWST users will enjoy having another kwarg to control this behavior. It would be nice if it were smart. And if one forgets to add the …

Option 2 would also work very well and would prevent users from having to use …

That said, I think the two options above solve an issue that is separate from the one this PR solves, or at least broader in scope. And this PR is actually compatible with both options 1 and 2. It will be needed if a user forgets …

Happy to add a warning in this PR if that is thought to be needed, though I think that once one is downloading, the resulting products on disk no longer have different metadata: they are the same files. The only place metadata will be lost is in the manifest table of the downloads (and the files won't be found multiple times across multiple subdirs).

Thoughts on the way forward? I've added a test to this PR based on the example in the issue.
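To make the trade-off concrete, a hypothetical flat-download keyword would only change how the local path is built. The names below are illustrative, not the actual astroquery API:

```python
import os

def local_path(base_dir, obs_id, filename, flat=False):
    """Build the on-disk path for a product; flat mode skips the per-observation subdirectory."""
    if flat:
        return os.path.join(base_dir, filename)
    return os.path.join(base_dir, obs_id, filename)

# The same file requested for two observations collapses to one path in flat mode,
# so a duplicate download is naturally avoided.
paths = {local_path("mastDownload", obs, "msa_config.fits", flat=True)
         for obs in ("obs1", "obs2")}
print(len(paths))  # 1
```

With `flat=False` the two observations would yield two distinct paths, which is exactly the duplication being discussed.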
Totally independently, I was about to open an issue about this, as I've just run into it.
@jdavies-st What I meant about metadata was not in the file itself but in the connection between the file and the product list, because the way you do it, the file will appear under a single …

I think your point that it is inadvisable for this to happen at all is a good one, which makes me think that a combination of 3 and 1 is probably correct, with this PR being the immediate fix and the flat directory option then being considered relatively high priority (within the constraints on the MAST team, of course). Thoughts @jaymedina?
Have you considered using hard links? That would allow you to maintain the directory structure while still avoiding downloading and writing duplicate files. |
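The hard-link idea could look roughly like this: download each unique file once, then hard-link it into every per-observation directory it belongs to. This is a sketch with hypothetical paths; `os.link` requires a filesystem that supports hard links, so a copy fallback is included:

```python
import os
import shutil
import tempfile

def place_product(src, dest):
    """Hard-link src to dest, falling back to a real copy if linking fails."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    try:
        os.link(src, dest)
    except OSError:
        shutil.copy2(src, dest)

root = tempfile.mkdtemp()
cache = os.path.join(root, "cache", "msa_config.fits")
os.makedirs(os.path.dirname(cache))
with open(cache, "wb") as f:
    f.write(b"fake FITS data")

# The same MSA config file appears under two observation directories,
# but (where hard links work) only one copy exists on disk.
for obs in ("obs1", "obs2"):
    place_product(cache, os.path.join(root, obs, "msa_config.fits"))

print(os.path.samefile(cache, os.path.join(root, "obs1", "msa_config.fits")))  # True
```

This would preserve both the per-observation directory structure and the per-observation manifest rows, at the cost of slightly surprising on-disk semantics (editing one copy edits them all).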
What do you mean by hard links, @eerovaher?

I just spoke with one of the MAST database handlers to confirm that the reason for the current convention was different Hubble products having the same name, and this was their reply:

> I'm not aware of any duplicated HST pipeline products with the same name. The original reason for the dir structure was to put all files for the same observation in the same directory, which is a 'natural' way to group them. The problem is JWST, which doesn't have such a logical structure, with the same file being used for multiple observations, so it appears in multiple places.

So if this is true, Method 2 could be a general solution. They also suggested going with a directory structure like … This would generalize the file directory structure so it's compatible across missions, and the caching would control duplicates in the auxiliary-files sub-sub-directory. So this would basically be Method 2 with some flair. Let me know your thoughts and if you see potential problems with this convention.

EDIT: Scratch that, I received feedback from others that there were duplicate files seen in the GALEX mission, among others, so Method 2 is probably a no. One other suggestion would be to fiddle with the …
Thanks @ceb8, I'm in full agreement. I will add a warning to the user in this PR.

Btw, if one uses the MAST web portal to pull the same level 3 obs that is in the test in this PR, one gets the same 6 MSA config files in one's basket (due to 2 detectors, 3 nods): the exact same MSA config file listed 6 times, once for each "dataset". But if I hit the download button and look at the resulting curl script, it has culled them to a single download, while keeping the subdir structure. So the MAST portal is doing the same thing that this PR is doing. Confluence.

Of course it would be better not to create the subdirs in the first place, but that's a separate PR and further discussion.
@jaymedina I've added the duplicate culling to …
I believe this PR is complete and ready for final review. |
Looks all good to me, thanks @jdavies-st! Though, before going ahead and merging this I would like to get an approval from either @ceb8 or @jaymedina.
This looks good to me, thanks for the updates! This just needs a commit squash and, I believe, a rebase, but @bsipocz can confirm whether that is the case.
As it currently stands, if a query results in a …
Thanks for approving, @jaymedina! No need to squash or rebase (the commits are logical chunks rather than back and forth, and there are no duplicates or conflicts).
Thanks @jdavies-st!
As a follow-up, I agree with @eerovaher: warnings that a user can do nothing about are not ideal, so please consider using the logger instead.
@eerovaher yes, I agree. Thanks for pointing that out to me. I will make a follow-up PR to fix this. |
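A logger-based version of the duplicate notice might look like the sketch below. Note that astroquery ships its own configured logger, so the plain stdlib `logging` setup and the function name here are illustrative only:

```python
import logging

log = logging.getLogger("astroquery.mast")

def report_culled(n_requested, n_unique):
    """Record how many duplicate products were dropped, without raising a warning."""
    n_dropped = n_requested - n_unique
    if n_dropped > 0:
        log.info("%d of %d requested products were duplicates and will be "
                 "downloaded only once.", n_dropped, n_requested)

logging.basicConfig(level=logging.INFO)
report_culled(600, 1)  # logs that 599 of 600 products were duplicates
```

Unlike a `UserWarning`, this is informational, cannot be escalated to an error by `-W error`, and can be silenced by adjusting the log level.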
Resolves #2496
I'm not precisely sure how to test this in a unit test, as I don't understand the mocking system, and there don't seem to be any JWST unit tests yet.
Regardless, this is what I see when I run the workflow in the issue above locally.
Before this PR:
This PR:
Any further suggestions or existing examples on how to test this in a unit test would be appreciated.