Add support for Google Cloud Storage buckets #1094
Conversation
This commit adds support for GCS hosted buckets storing source data.
This is great! Just 2 minor nits/Q's and that's it from me.
```python
if not expected_size_in_bytes:
    expected_size_in_bytes = download.total_bytes
while not download.finished:
    if progress_indicator and download.bytes_downloaded and download.total_bytes:
```
Is there ever a case where we won't have a `progress_indicator`?
We use it in rally/esrally/mechanic/supplier.py, line 519 in 1bdec4f: `progress = net.Progress("[INFO] Downloading Elasticsearch %s" % self.version)`, and line 410 in 1bdec4f: `progress = net.Progress("[INFO] Downloading data for track %s" % self.track_name, accuracy=1)`.
logger.info("Downloading from S3 bucket [%s] and path [%s] to [%s].", bucket, bucket_path, local_path) | ||
_download_from_s3_bucket(bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator) | ||
logger.info("Downloading from [%s] bucket [%s] and path [%s] to [%s].", blobstore, bucket, bucket_path, local_path) | ||
blob_downloader[blobstore](bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator) |
Entirely up to you, but I find `if` statements easier to reason about than looking up some string->function call magic.
Thanks, used it because it is a pattern we use elsewhere too.
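For illustration, the two styles side by side, as a hedged sketch; the dispatch keys mirror the diff above, but the GCS function name is an assumption:

```python
# dispatch table: blob store name -> download function
blob_downloader = {
    "s3": _download_from_s3_bucket,
    "gs": _download_from_gcs_bucket,  # assumed name for the GCS counterpart
}
blob_downloader[blobstore](bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator)

# equivalent if/elif chain, as suggested in the review
if blobstore == "s3":
    _download_from_s3_bucket(bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator)
elif blobstore == "gs":
    _download_from_gcs_bucket(bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator)
```

The dict keeps the call site unchanged when another blob store is added; the if/elif chain makes the supported schemes explicit at the call site.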
Thanks! I left some comments around the design and we need to add the new dependencies to our notice file.
docs/track.rst (Outdated)
```diff
@@ -225,7 +225,10 @@ The ``corpora`` section contains all document corpora that are used by this track

 Each entry in the ``documents`` list consists of the following properties:

-* ``base-url`` (optional): A http(s) or S3 URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 buckets if access is properly `configured <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
+* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:
```
typo: GS -> GCS
The scheme is `gs`, not `gcs` (i.e. `gs://bucket`). Also, the service itself is called Google Storage (https://cloud.google.com/storage).
While Amazon calls the product "S3", the official name of the equivalent Google service seems to be "Cloud Storage" according to the docs you point to. Should we then refer to it by that name instead of a different acronym? Or is "GS" the official acronym (I did not find it in Google's documentation)?
Standardizing on Google Storage; fixed in c3fa7b9.
docs/track.rst (Outdated)
```rst
* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:

  * S3 according to `docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
  * GS: Either using `client library authentication <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`_ or by presenting an `oath2 token <https://cloud.google.com/storage/docs/authentication>`_ via the ``GOOGLE_AUTH_TOKEN`` environment variable, typically done using: ``export GOOGLE_AUTH_TOKEN=$(gcloud auth print-access-token)``.
```
typo: GS -> GCS
Fixed in c3fa7b9
docs/track.rst (Outdated)
```rst
* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:

  * S3 according to `docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
  * GS: Either using `client library authentication <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`_ or by presenting an `oath2 token <https://cloud.google.com/storage/docs/authentication>`_ via the ``GOOGLE_AUTH_TOKEN`` environment variable, typically done using: ``export GOOGLE_AUTH_TOKEN=$(gcloud auth print-access-token)``.
```
typo: oath2 -> oauth2.
Fixed in c3fa7b9
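For illustration, a sketch of how a bearer token taken from `GOOGLE_AUTH_TOKEN` can be presented to the Cloud Storage JSON API; the helper below is an assumption for illustration, not code from this PR:

```python
import os
import urllib.parse

import requests


def download_gcs_object(bucket, object_name, local_path):
    # hypothetical helper: fetch a single object using an OAuth2 bearer token
    token = os.environ.get("GOOGLE_AUTH_TOKEN")  # e.g. $(gcloud auth print-access-token)
    url = ("https://storage.googleapis.com/storage/v1/b/"
           f"{bucket}/o/{urllib.parse.quote(object_name, safe='')}?alt=media")
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    with requests.get(url, headers=headers, stream=True) as response:
        response.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
```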
```python
# License: Apache 2.0
# transitive dependencies:
# google-crc32c: Apache 2.0
"google-resumable-media==1.1.0",
```
You also need to update the script `create-notice.sh` accordingly. I also wonder whether we should investigate making S3 and GCS optional dependencies and avoid pulling them into core Rally. I'd be ok if we tackle this separately though.
Great catch. Done in 27419b6 (where I also fixed the URL for yarl, which was actually downloading HTML).
As discussed offline, will create an issue for making the S3 dependencies optional.
if url.startswith("s3"): | ||
expected_size_in_bytes = download_s3(url, tmp_data_set_path, expected_size_in_bytes, progress_indicator) | ||
scheme = urllib3.util.parse_url(url).scheme | ||
if scheme in ["s3", "gs"]: |
I wonder whether we should start encapsulating this now that we add a third scheme. We could implement this as a `Downloader` that supports certain schemes, register them upon startup and implement `def supports(self, url: str) -> bool` to check to which `Downloader` to pass a URL, instead of spreading this logic in multiple places. Wdyt?
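A minimal sketch of that idea, assuming a registry of downloaders consulted via `supports()`; class and function names here are illustrative, not from the codebase:

```python
import urllib3


class S3Downloader:
    def supports(self, url: str) -> bool:
        return urllib3.util.parse_url(url).scheme == "s3"

    def download(self, url, local_path, expected_size_in_bytes, progress_indicator):
        ...  # delegate to the S3 download logic


class GcsDownloader:
    def supports(self, url: str) -> bool:
        return urllib3.util.parse_url(url).scheme == "gs"

    def download(self, url, local_path, expected_size_in_bytes, progress_indicator):
        ...  # delegate to the GCS download logic


# registered once at startup
_downloaders = [S3Downloader(), GcsDownloader()]


def downloader_for(url):
    # returns the first downloader that claims support for the URL, else None
    return next((d for d in _downloaders if d.supports(url)), None)
```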
As discussed offline, we'll defer this for now, but certainly look at it again, if we are to extend to another scheme.
and also fix pulling yarl license
Thanks for iterating. LGTM
This commit adds support for GCS hosted buckets storing source data.
While at it, switch utils/net.py tests to pytest.