Cull duplicate dataURIs for MAST in download_products #2497
Conversation
Codecov Report
```diff
@@           Coverage Diff            @@
##             main    #2497    +/-  ##
========================================
+ Coverage   62.86%   62.99%   +0.13%
========================================
  Files         133      133
  Lines       17276    17302      +26
========================================
+ Hits        10860    10899      +39
+ Misses       6416     6403      -13
```
lgtm, it's a simple fix and the issue describes the problem well. But I defer to @ceb8 for final review.

Regarding the unit tests: I think it would be a good idea to have some JWST-specific ones, since JWST ought to be in high demand in the coming months and years. In the meantime, you could perhaps riff off the example in `astroquery/mast/tests/test_mast.py` (lines 476 to 479 at af6b41b) and just add a check that the list of URIs is unique (assuming you can find a data set ID that has non-unique entries that need to be filtered).
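The suggested uniqueness check can be sketched without any MAST fixtures; the URIs below are made up for illustration:

```python
def assert_unique(uris):
    """Fail if any data URI appears more than once in a download list."""
    duplicates = {u for u in uris if uris.count(u) > 1}
    assert not duplicates, f"duplicate dataURIs: {sorted(duplicates)}"

# A culled product list should pass; one repeating an MSA config file should not.
assert_unique(["mast:JWST/a_s3d.fits", "mast:JWST/b_s3d.fits"])
try:
    assert_unique(["mast:JWST/msa.fits", "mast:JWST/msa.fits"])
except AssertionError:
    print("caught duplicate")  # caught duplicate
```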
I suppose it would also be useful to be able to turn off the progress bar, but that is beyond the scope here.
Please add a changelog entry and then this is good to go.
Hi @jdavies-st, this looks good, although I would suggest having the fix under …
Thanks for the feedback! I had thought of putting the … If you run:

```python
from astroquery.mast import Observations

obs_table = Observations.query_criteria(proposal_id="2736", instrument_name="NIRSPEC")
data_products = Observations.get_product_list(obs_table)
```

you'll get a timeout, or it will just hang and fail. One has to break the query up into chunks or just loop through each observation. As an example, there can be 250k products (many duplicates) for a couple dozen NIRSpec observations that include level 3 products. And for scientists, this is going to be the way they will want to query: get me all the data for my `proposal_id`, maybe minus the guide star data. So if I already have to call in a loop, say like the following:

```python
from astroquery.mast import Observations
from astropy.table import unique, vstack

obs_table = Observations.query_criteria(proposal_id="2736", instrument_name="NIRSPEC")
data_products_list = []
for obs in obs_table:
    data_products_list.append(Observations.get_product_list(obs))
data_products = vstack(data_products_list)
```

I will still have lots of duplicates in my final vstacked table, as the duplication occurs between observations rather than within them, especially for an instrument like NIRSpec, where each level 3 product uses the same level 2 products as input. So if I have NIRSpec MOS with 100 spectra and get the level 3 products, then I will get 100 copies of the level 2 products, one for each level 3 observation.

Also, if one is interested in the parent observation of a data product, each of the many copies of a particular file to download will have different parent metadata. So culling early might make people miss data if, later in their script (before download), they select by observation, say the newest observation that was just taken.

So that was my thinking in leaving the duplicate culling to the last moment, before download, as that's really when you want to cull. Just don't make me download this file 600 times. =)

Thoughts? Or do you think it would still be best to cull early?
Hi @jdavies-st, those are good points. Doing the cull during the download step instead of the …

I also want to mention that the hang-up issue you're experiencing when trying to download large numbers of products is on our radar. We're currently trying to find the root cause and will implement a fix so users don't have to do those iterative product queries anymore.
I think that dealing with this duplicate issue is important, but I have a few concerns about this solution, mostly that it breaks the connection between the metadata and the file location. As @jdavies-st mentions above, the metadata is unique for each parent observation even though the file is the same.

Ironically, the download location was designed to be guaranteed unique for each individual observation because a few Hubble products do not have unique names (while being unique files, I believe), a fact which is now causing this problem for JWST.

I have a few ideas for solutions:
I defer to @jaymedina to decide what solution is best, but at minimum I do think the user needs to be warned if they are going to get a download manifest with a different number of rows than they supplied to the download function.
I revoke my approval in favour of the discussion above.
Thanks @ceb8, I'm in favor of option 1, especially if this functionality has already been requested by users. Perhaps also add a warning in the docstring: if users have JWST products in their product list and don't use the flat download directory keyword argument, they may be re-downloading files unintentionally.
@ceb8 I agree that option 1 above is a great solution. It is something I'd like to see solved as well, as one always has to move the files to a flat structure so that the calibration pipeline can be run, though I didn't find a GitHub issue here for that. But I'm not convinced that JWST users will enjoy having another kwarg to control this behavior. It would be nice if it were smart. And if one forgets to add the …

Option 2 would also work very well and would prevent users from having to use …

That said, I think the two options above solve an issue that is separate from the one this PR solves, or at least broader in scope. And this PR is actually compatible with both options 1 and 2. It will be needed if a user forgets …

Happy to add a warning in this PR if that is thought to be needed, though I think that once one is downloading, the resulting products on disk no longer have different metadata: they are the same files. The only place metadata will be lost is in the manifest table of the downloads (and the files won't be found multiple times across multiple subdirs).

Thoughts on the way forward? I've added a test to this PR based on the example in the issue.
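To make the trade-off concrete, a hypothetical flat-download keyword would only change how the local path is built. The names below are illustrative, not the actual astroquery API:

```python
import os

def local_path(base_dir, obs_id, filename, flat=False):
    """Build the on-disk path for a product; flat mode skips the per-observation subdirectory."""
    if flat:
        return os.path.join(base_dir, filename)
    return os.path.join(base_dir, obs_id, filename)

# The same file requested for two observations collapses to one path in flat mode,
# so a duplicate download is naturally avoided.
paths = {local_path("mastDownload", obs, "msa_config.fits", flat=True)
         for obs in ("obs1", "obs2")}
print(len(paths))  # 1
```

With `flat=False` the two observations would yield two distinct paths, which is exactly the duplication being discussed.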
Totally independently, I was about to open an issue about this, as I've just run into it.
@jdavies-st What I meant about metadata was not in the file itself but in the connection between the file and the product list, because the way you do it, the file will appear under a single …

I think your point that it is inadvisable for this to happen at all is a good one, which makes me think that a combination of 3 and 1 is probably correct, with this PR being the immediate fix and the flat directory option then being considered relatively high priority (within the constraints on the MAST team, of course). Thoughts @jaymedina?
Have you considered using hard links? That would allow you to maintain the directory structure while still avoiding downloading and writing duplicate files. |
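The hard-link idea could look roughly like this: download each unique file once, then hard-link it into every per-observation directory it belongs to. This is a sketch with hypothetical paths; `os.link` requires a filesystem that supports hard links, so a copy fallback is included:

```python
import os
import shutil
import tempfile

def place_product(src, dest):
    """Hard-link src to dest, falling back to a real copy if linking fails."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    try:
        os.link(src, dest)
    except OSError:
        shutil.copy2(src, dest)

root = tempfile.mkdtemp()
cache = os.path.join(root, "cache", "msa_config.fits")
os.makedirs(os.path.dirname(cache))
with open(cache, "wb") as f:
    f.write(b"fake FITS data")

# The same MSA config file appears under two observation directories,
# but (where hard links work) only one copy exists on disk.
for obs in ("obs1", "obs2"):
    place_product(cache, os.path.join(root, obs, "msa_config.fits"))

print(os.path.samefile(cache, os.path.join(root, "obs1", "msa_config.fits")))  # True
```

This would preserve both the per-observation directory structure and the per-observation manifest rows, at the cost of slightly surprising on-disk semantics (editing one copy edits them all).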
What do you mean by hard links, @eerovaher?

I just spoke with one of the MAST database handlers to confirm that the reason for the current convention was different Hubble products having the same name, and this was their reply:

> I'm not aware of any duplicated HST pipeline products with the same name. The original reason for the dir structure was to put all files for the same observation in the same directory, which is a 'natural' way to group them. The problem is JWST, which doesn't have such a logical structure, with the same file being used for multiple observations, so it appears in multiple places.

So if this is true, Method 2 could be a general solution. They also suggested going with a directory structure like … This would generalize the file directory structure so it's compatible across missions, and the caching would control duplicates in the auxiliary-files sub-sub-directory. So this would basically be Method 2 with some flair. Let me know your thoughts and if you see potential problems with this convention.

EDIT: Scratch that, I received feedback from others that there were duplicate files seen in the GALEX mission, among others, so Method 2 is probably a no. One other suggestion would be to fiddle with the …
Thanks @ceb8, I'm in full agreement. I will add a warning to the user in this PR.

Btw, if one uses the MAST web portal to pull the same level 3 obs that is in the test in this PR, one gets the same 6 MSA config files in one's basket (due to 2 detectors, 3 nods): the exact same MSA config file listed 6 times, once for each "dataset". But if I hit the download button and look at the resulting curl script, it has culled them to a single download, while keeping the subdir structure. So the MAST portal is doing the same thing that this PR is doing. Confluence.

Of course it would be better not to create the subdirs in the first place, but that's a separate PR and further discussion.
@jaymedina I've added the duplicate culling to …
I believe this PR is complete and ready for final review. |
Looks all good to me, thanks @jdavies-st! Though, before going ahead and merging this I would like to get an approval from either @ceb8 or @jaymedina.
This looks good to me, thanks for the updates! This just needs a commit squash and, I believe, a rebase, but @bsipocz can confirm whether that is the case.
As it currently stands, if a query results in a …
Thanks for approving, @jaymedina! No need to squash or rebase (the commits are logical chunks rather than back and forth, and there are no duplicates or conflicts).
Thanks @jdavies-st!
As a follow-up, I agree with @eerovaher: warnings that a user can do nothing about are not ideal, so please consider using the logger instead.
@eerovaher yes, I agree. Thanks for pointing that out to me. I will make a follow-up PR to fix this. |
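A logger-based version of the duplicate notice might look like the sketch below. Note that astroquery ships its own configured logger, so the plain stdlib `logging` setup and the function name here are illustrative only:

```python
import logging

log = logging.getLogger("astroquery.mast")

def report_culled(n_requested, n_unique):
    """Record how many duplicate products were dropped, without raising a warning."""
    n_dropped = n_requested - n_unique
    if n_dropped > 0:
        log.info("%d of %d requested products were duplicates and will be "
                 "downloaded only once.", n_dropped, n_requested)

logging.basicConfig(level=logging.INFO)
report_culled(600, 1)  # logs that 599 of 600 products were duplicates
```

Unlike a `UserWarning`, this is informational, cannot be escalated to an error by `-W error`, and can be silenced by adjusting the log level.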
Resolves #2496
I'm not precisely sure how to test this in a unit test, as I don't understand the mocking system, and there don't seem to be any JWST unit tests yet.
Regardless, this is what I see when I run the workflow in the issue above locally.
Before this PR:
This PR:
Any further suggestions or existing examples on how to test this in a unit test would be appreciated.