
Add support for Google Cloud Storage buckets #1094

Merged: 5 commits, Oct 19, 2020
Conversation

dliappis (Contributor):

This commit adds support for GCS hosted buckets storing source data.

While at it, switch utils/net.py tests to pytest.

@dliappis dliappis added enhancement Improves the status quo :Track Management New operations, changes in the track format, track download changes and the like labels Oct 16, 2020
@dliappis dliappis added this to the 2.0.2 milestone Oct 16, 2020
@dliappis dliappis self-assigned this Oct 16, 2020
@hub-cap (Contributor) left a comment:

This is great! Just two minor nits/questions and that's it from me.

if not expected_size_in_bytes:
    expected_size_in_bytes = download.total_bytes
while not download.finished:
    if progress_indicator and download.bytes_downloaded and download.total_bytes:
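The loop above drives a chunked download while updating a progress indicator. A minimal, self-contained sketch of the same pattern, using a hypothetical stand-in object (`FakeDownload`, `download_with_progress`) rather than a real google-resumable-media download:

```python
class FakeDownload:
    """Stand-in for a resumable-media download: exposes the same
    total_bytes / bytes_downloaded / finished attributes the loop polls."""

    def __init__(self, total_bytes, chunk_size):
        self.total_bytes = total_bytes
        self.bytes_downloaded = 0
        self.chunk_size = chunk_size

    @property
    def finished(self):
        return self.bytes_downloaded >= self.total_bytes

    def consume_next_chunk(self):
        self.bytes_downloaded = min(self.bytes_downloaded + self.chunk_size, self.total_bytes)


def download_with_progress(download, expected_size_in_bytes=None, progress_indicator=None):
    # Fall back to the size reported by the service when the caller did not pass one.
    if not expected_size_in_bytes:
        expected_size_in_bytes = download.total_bytes
    while not download.finished:
        download.consume_next_chunk()
        # Same guard as above: only report progress when an indicator was
        # supplied and the byte counters are known and non-zero.
        if progress_indicator and download.bytes_downloaded and download.total_bytes:
            progress_indicator(download.bytes_downloaded, download.total_bytes)
    return expected_size_in_bytes
```

Because the guard checks `progress_indicator` for truthiness, callers can pass `None` and the download still runs, just silently.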
Contributor: Is there ever a case where we won't have a progress_indicator?

Contributor Author:
We use it in

progress = net.Progress("[INFO] Downloading Elasticsearch %s" % self.version)
and
progress = net.Progress("[INFO] Downloading data for track %s" % self.track_name, accuracy=1)
but we want to be able to support a case (in the future?) where it shouldn't be used.

Before:

    logger.info("Downloading from S3 bucket [%s] and path [%s] to [%s].", bucket, bucket_path, local_path)
    _download_from_s3_bucket(bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator)

After:

    logger.info("Downloading from [%s] bucket [%s] and path [%s] to [%s].", blobstore, bucket, bucket_path, local_path)
    blob_downloader[blobstore](bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator)
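The change replaces an S3-specific call with a scheme-keyed lookup table. A minimal sketch of that string-to-function dispatch (the function bodies here are hypothetical placeholders, not Rally's actual download implementations):

```python
def _download_from_s3_bucket(bucket, bucket_path, local_path):
    # Placeholder body: a real implementation would call the boto3 S3 client.
    return f"s3://{bucket}/{bucket_path} -> {local_path}"


def _download_from_gs_bucket(bucket, bucket_path, local_path):
    # Placeholder body: a real implementation would use google-resumable-media.
    return f"gs://{bucket}/{bucket_path} -> {local_path}"


# Map each supported blobstore scheme to its download function.
blob_downloader = {
    "s3": _download_from_s3_bucket,
    "gs": _download_from_gs_bucket,
}


def download(blobstore, bucket, bucket_path, local_path):
    # A single call site covers all blobstores; adding a new one is
    # just another entry in the dict.
    return blob_downloader[blobstore](bucket, bucket_path, local_path)
```

An unknown scheme raises `KeyError`, so callers should validate the scheme beforehand (as the surrounding code does with its `scheme in ["s3", "gs"]` check).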
Contributor: Entirely up to you, but I find if statements easier to reason about than looking up some string->function call magic.

Contributor Author: Thanks; I used it because it is a pattern we use elsewhere too.

@danielmitterdorfer (Member) left a comment:

Thanks! I left some comments around the design and we need to add the new dependencies to our notice file.

docs/track.rst Outdated
@@ -225,7 +225,10 @@ The ``corpora`` section contains all document corpora that are used by this trac

Each entry in the ``documents`` list consists of the following properties:

* ``base-url`` (optional): A http(s) or S3 URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 buckets if access is properly `configured <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:
Member: typo: GS -> GCS

Contributor Author:
The scheme is gs not gcs (i.e. gs://bucket). Also the service itself is called Google Storage (https://cloud.google.com/storage).

Member:
While Amazon calls the product "S3", the official name of the equivalent Google service seems to be "Cloud Storage" according to the docs you point to. Should we then refer to it by that name instead of a different acronym? Or is "GS" the official acronym (I did not find it in Google's documentation)?

@dliappis (Contributor Author), Oct 19, 2020: Standardized on "Google Storage"; fixed in c3fa7b9.

docs/track.rst Outdated
* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:

* S3 according to `docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
* GS: Either using `client library authentication <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`_ or by presenting an `oath2 token <https://cloud.google.com/storage/docs/authentication>`_ via the ``GOOGLE_AUTH_TOKEN`` environment variable, typically done using: ``export GOOGLE_AUTH_TOKEN=$(gcloud auth print-access-token)``.
Member:
typo: GS -> GCS

@dliappis (Contributor Author), Oct 19, 2020:
Fixed in c3fa7b9

docs/track.rst Outdated
* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:

* S3 according to `docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
* GS: Either using `client library authentication <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`_ or by presenting an `oath2 token <https://cloud.google.com/storage/docs/authentication>`_ via the ``GOOGLE_AUTH_TOKEN`` environment variable, typically done using: ``export GOOGLE_AUTH_TOKEN=$(gcloud auth print-access-token)``.
Member:
typo: oath2 -> oauth2.

Contributor Author:
Fixed in c3fa7b9

# License: Apache 2.0
# transitive dependencies:
# google-crc32c: Apache 2.0
"google-resumable-media==1.1.0",
Member: You also need to update the script create-notice.sh accordingly. I also wonder whether we should investigate making S3 and GCS optional dependencies to avoid pulling them into core Rally. I'd be ok if we tackle this separately though.

Contributor Author: Great catch. Done in 27419b6 (where I also fixed the URL for yarl, which was actually downloading HTML). As discussed offline, I will create an issue for making the S3 dependencies optional.

Before:

    if url.startswith("s3"):
        expected_size_in_bytes = download_s3(url, tmp_data_set_path, expected_size_in_bytes, progress_indicator)

After:

    scheme = urllib3.util.parse_url(url).scheme
    if scheme in ["s3", "gs"]:
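The new code routes on the parsed URL scheme instead of a string prefix. The PR uses urllib3.util.parse_url; the same check can be sketched with the standard library's urlparse, which extracts the scheme identically for these URLs:

```python
from urllib.parse import urlparse


def is_blobstore_url(url):
    # S3 and Google Storage URLs are routed to the blobstore downloaders;
    # everything else falls through to the plain HTTP(S) download path.
    return urlparse(url).scheme in ("s3", "gs")
```

Parsing the scheme avoids the prefix-matching pitfall of `url.startswith("s3")`, which would also match an HTTP URL for a host named, say, `s3-mirror.example.com` only by accident of spelling rather than by scheme.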
Member: I wonder whether we should start encapsulating this now that we are adding a third scheme. We could implement this as a Downloader that supports certain schemes, register them on startup, and implement def supports(self, url: str) -> bool to check which Downloader to pass a URL to, instead of spreading this logic in multiple places. Wdyt?

Contributor Author: As discussed offline, we'll defer this for now, but we'll certainly look at it again if we extend to another scheme.

@danielmitterdorfer (Member) left a comment:
Thanks for iterating. LGTM
