Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ESO authentication and download #2681

Merged
merged 33 commits into from
Dec 8, 2023

Conversation

szampier
Copy link
Contributor

@szampier szampier commented Mar 10, 2023

Dear astroquery maintainers, @almicol, @Pharisaeus

this PR is a complete rewrite of the authentication and download part of astroquery.eso.
As you can see in my initial commits, I first tried to reuse ALMA's download_files and astroquery's _download_file, but I ended up re-implementing it from scratch because it doesn't support well our use case:

  • we don't want to send a HEAD request just to get the filename from content-disposition (unnecessary overhead)
  • we don't support range queries / partial downloads (actually we do except for cases where we correct the FITS header upon download)
  • the progress bar used in _download_file creates a too verbose output ( ESO downloads: no 'verbose' option #1357)

Public interface

The public interface remains retrieve_data but with two arguments removed because they are no longer necessary. This method is now quite small, essentially calling find_associated_files, _download_eso_files and _unzip_files in sequence.

Authentication and Authorization

Authentication (login) is no longer necessary for public data.
We switched to token-based authentication and authorization. The token is retrieved when calling login (more specifically in the _authenticate method) and saved in self._auth_info, together with username and password so we can re-authenticate automatically when the token is about to expire. The token normally expires after 8 hours and we re-authenticate 10 minutes before the expiration date.
NOTE: since PyJWT is not a dependency of astroquery, to extract the expiration time from the token we need to "manually" decode it but this is relatively straightforward.

Support for .Z files

The current implementation unfortunately is not able to uncompress .Z files which is a big issue for ESO since most of our raw data are Z-compressed. At ESO we are considering the option of recompressing files upon download on-the-fly from .Z to .gz, which is a better supported format especially on Windows where the gunzip command is normally not available. For now I've re-implemented the uncompression using the system command if available. I know it's not ideal but at least it works, except for Windows which is not a very popular platform among our users anyways.

Test and documentation

  • test_nologin has been fixed since we now support anonymous download
  • test_retrieve_data_and_calib does not pass because of missing calselector (see TODO)
  • test_each_instrument_SgrAstar does not pass for unknown reasons independent from this PR
  • when executing the remote tests, pytest hangs at the end of the tests, I don't know why
  • documentation has been updated to reflect the changes

Solved issues

I believe this PR will solve at least partially the following issues:

@keflavich
Copy link
Contributor

we don't support range queries / partial downloads

Is this an intentional choice? When downloading large files, partial downloads have historically been pretty common for me at least, and it's nice to be able to pick up where you left off.

@szampier
Copy link
Contributor Author

Is this an intentional choice? When downloading large files, partial downloads have historically been pretty common for me at least, and it's nice to be able to pick up where you left off.

apologies, my statement about range queries was wrong and I've deleted it. I'll investigate how to support partial downloads. However we don't always return content-length because in some cases we do stream-processing on the files upon download so the server doesn't know the size of the downloaded file. A notable case of stream-processing is header correction, i.e. when we find issues in the FITS header we store the corrected header in a database and we apply the changes on-the-fly when downloading the file.

@keflavich
Copy link
Contributor

OK, then I think it makes sense to check for content-length. If it is present, then a range query makes sense and partial downloads can be resumed. But if there are streamed corrections that change the content length, partial downloads are not possible.

@bsipocz bsipocz added the eso label Mar 10, 2023
@codecov
Copy link

codecov bot commented Mar 10, 2023

Codecov Report

Attention: 49 lines in your changes are missing coverage. Please review.

Comparison is base (885734b) 66.53% compared to head (9ea4138) 67.28%.
Report is 8 commits behind head on main.

Files Patch % Lines
astroquery/eso/core.py 76.32% 49 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2681      +/-   ##
==========================================
+ Coverage   66.53%   67.28%   +0.74%     
==========================================
  Files         235      235              
  Lines       18114    18136      +22     
==========================================
+ Hits        12052    12202     +150     
+ Misses       6062     5934     -128     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Pharisaeus
Copy link

At ESO we are considering the option of recompressing files upon download on-the-fly from .Z to .gz

Keep in mind this would also mean none of those files would support partial download, because, similarly to header corrections, we would not know the size of the file if we pass it through compress and gzip and stream the response.

@szampier
Copy link
Contributor Author

I've added two small files to the test data but they don't seem to be installed when running the tests on the server. Should I add them to some setup/configuration file? Also I've extended astroquery.utils.mocks.MockResponse to be able to test file download without remote server, I hope it's OK.

@szampier
Copy link
Contributor Author

@keflavich @bsipocz can you please help with the issue I reported in my previous comment? I've added some files in the test data folder but they are not found when running the tests in github. See failed tests in the last couple of commits.

@keflavich
Copy link
Contributor

@szampier yes, you need to add those files under the setup_package.py file, e.g. as in Vizier:
https://github.com/astropy/astroquery/blob/main/astroquery/vizier/setup_package.py

@bsipocz
Copy link
Member

bsipocz commented Oct 9, 2023

@szampier @Pharisaeus - There have been a few reports of bugs with the current ESO module of astroquery, and it reminded me to double-check the planned timeline for swapping the currently used API out. Do you have some ETAs or roadmaps, however far it might be?

@szampier szampier force-pushed the refactor_eso_download branch from 87f1537 to 217af29 Compare October 13, 2023 19:57
@pep8speaks
Copy link

pep8speaks commented Oct 20, 2023

Hello @szampier! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2023-12-08 11:45:50 UTC

@szampier
Copy link
Contributor Author

@bsipocz in the past days I made a few additional changes and I now consider this PR ready for review. There are still two failing checks, one in the cadc module and one because of a missing changelog entry. I'm not sure what to do with the changelog. As written in the initial comment I've reimplemented file download mainly to avoid sending a HEAD request each time we download a file just to get the filename from content-disposition. Files are downloaded to a temp (part) file and moved to the final destination after the download has been completed. Resuming partial downloads using range queries has not been implemented and I don't consider it a critical feature for the moment, also considering that we stream-process about 20% of the archive files. I would prefer to implement this feature in a separate PR.
The bug reports you have mentioned refer to the query part which I didn't change in this PR. Once this PR is merged we will refactor also the query part and address the relevant tickets.
Do you agree to remove the Draft status from this PR?

@szampier szampier marked this pull request as ready for review October 27, 2023 10:55
@jwoillez
Copy link
Member

I tested the PR on macOS and the following works fine:

from astroquery.eso import Eso
eso.login(username="xxxx")  # <== use your own login
eso.retrieve_data("GRAVI.2023-11-24T15:03:58.695")

Both identification and decompression work (with gunzip is available on macOS).

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did one quick review, without trying to run any of the changes or looking into the coverage stats. There are only some superficial comments, some of them are open ended, and a lot are just heads up, fyi-s.

Also, I see one failing test, but that one is spotted on main, too, so it can/should be fixed separately (I expect it to be something trivial on the bookkeeping level, e.g. one instrument will need to be skipped for that test.

And this will need a rebase in order to pick up recent changes that would make the unrelated CI failures gone.

Comment on lines +13 to +14
just to get the original filename. [#1580]
- Restore support for .Z files. [#1818]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: We list the number of the PR in the changelog, not the issues fixed. I'll fix this up when cleaning up the changelog for release

CALSELECTOR_URL = "https://archive.eso.org/calselector/v1/associations"
DOWNLOAD_URL = "https://dataportal.eso.org/dataPortal/file/"
AUTH_URL = "https://www.eso.org/sso/oidc/token"
GUNZIP = "gunzip"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see tests or examples changing this, what would be the use case, windows users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I didn't foresee to change these values, that's why I've defined them as "constants". Regarding gunzip, we might change the handling of .Z files again in the future since we have now the possibility of changing compression on-the-fly upon download so we could return only .gz files to astroquery users.

Comment on lines 631 to 761
def retrieve_data(self, datasets, *, continuation=False, destination=None,
with_calib='none', request_all_objects=False,
unzip=True, request_id=None):
def retrieve_data(self, datasets, *, overwrite=False, destination=None,
with_calib=None, unzip=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there have been quite a number of keywords removed, without deprecation. Given that this is a significant rewrite of the module, as well as changes are upstream, I would think it's OK to break API without warning, though it may then need another half sentence in the changelog.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the only remaining question that I would like to address before merging. Given request_id was featured in an example in the documentation, I'm somewhat inclined towards adding a deprecation warning for all the removed parameters.
But I'm also OK if you say it has to be a clean cut for the users, I would really just like to see it's a decision whether giving warnings or exception and not a side effect of the code refactoring.
(knowing that in practice warnings are mostly ignored by users, with the deprecation warning they could run something that has no effect any more, while with the exception they face the changes straight away).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've restored the original API and added a deprecation warning using deprecated_renamed_argument decorator from astropy, as done for example in irsa/core.py.
BTW I think the since argument should be corrected in irsa/core.py at line 63 and replaced with an array of 3 elements since there are 3 changed arguments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I think the since argument should be corrected in irsa/core.py at line 63 and replaced with an array of 3 elements since there are 3 changed arguments.

Thanks for the heads up, I suppose I was working with the assumption that it will be broadcasted, but I'll have a look.

Btw, what happened with overwrite instead of continuation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, what happened with overwrite instead of continuation?

I've changed it back to the original argument to keep the same API, but the meaning is the same, and indeed the description of this argument has not changed

Force the retrieval of data that are present in the destination directory.

astroquery/eso/core.py Outdated Show resolved Hide resolved
return True
return False

def _download_eso_file(self, file_link: str, destination: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heads up that we're in the middle of adding/upstreaming download functionality to pyvo, in the long term this module may benefit from that, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know, I'm happy to use the upstream version if it supports our requirements.

astroquery/eso/core.py Outdated Show resolved Hide resolved
astroquery/eso/core.py Outdated Show resolved Hide resolved
@szampier szampier force-pushed the refactor_eso_download branch from e16465d to 1b927ee Compare December 7, 2023 09:35
@bsipocz bsipocz added this to the v0.4.7 milestone Dec 7, 2023
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I think this should go in now.

I still think that renaming continuation to overwrite makes sense, as the functionality changed, but that can come (or not) in a follow-up PR, no need to bikeshed this one any further.

@bsipocz bsipocz merged commit 60e6353 into astropy:main Dec 8, 2023
10 checks passed
@bsipocz
Copy link
Member

bsipocz commented Dec 8, 2023

Thank you @szampier!

I haven't triaged the linked issues, let me know which ones to close either because they are resolved or because they won't be supported.

@szampier
Copy link
Contributor Author

Thank you @szampier!

I haven't triaged the linked issues, let me know which ones to close either because they are resolved or because they won't be supported.

The following tickets can be closed:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants