Add support for Google Cloud Storage buckets #1094
Conversation
This commit adds support for GCS hosted buckets storing source data.
This is great! Just 2 minor nits/Q's and that's it from me.
```python
if not expected_size_in_bytes:
    expected_size_in_bytes = download.total_bytes
while not download.finished:
    if progress_indicator and download.bytes_downloaded and download.total_bytes:
```
Is there ever a case where we won't have a `progress_indicator`?
We use it in rally/esrally/mechanic/supplier.py, line 519 in 1bdec4f: `progress = net.Progress("[INFO] Downloading Elasticsearch %s" % self.version)`, and line 410 in 1bdec4f: `progress = net.Progress("[INFO] Downloading data for track %s" % self.track_name, accuracy=1)`.
logger.info("Downloading from S3 bucket [%s] and path [%s] to [%s].", bucket, bucket_path, local_path) | ||
_download_from_s3_bucket(bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator) | ||
logger.info("Downloading from [%s] bucket [%s] and path [%s] to [%s].", blobstore, bucket, bucket_path, local_path) | ||
blob_downloader[blobstore](bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator) |
Entirely up to you, but I find `if` statements easier to reason about than looking up some string->function call magic.
Thanks, used it because it is a pattern we use elsewhere too.
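For illustration, the two styles side by side, as a hedged sketch; the dispatch keys mirror the diff above, but the GCS function name is an assumption:

```python
# dispatch table: blob store name -> download function
blob_downloader = {
    "s3": _download_from_s3_bucket,
    "gs": _download_from_gcs_bucket,  # assumed name for the GCS counterpart
}
blob_downloader[blobstore](bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator)

# equivalent if/elif chain, as suggested in the review
if blobstore == "s3":
    _download_from_s3_bucket(bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator)
elif blobstore == "gs":
    _download_from_gcs_bucket(bucket, bucket_path, local_path, expected_size_in_bytes, progress_indicator)
```

The dict keeps the call site unchanged when another blob store is added; the if/elif chain makes the supported schemes explicit at the call site.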
Thanks! I left some comments around the design and we need to add the new dependencies to our notice file.
docs/track.rst (Outdated)
```diff
@@ -225,7 +225,10 @@ The ``corpora`` section contains all document corpora that are used by this track

 Each entry in the ``documents`` list consists of the following properties:

-* ``base-url`` (optional): A http(s) or S3 URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 buckets if access is properly `configured <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
+* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:
```
typo: GS -> GCS
The scheme is `gs`, not `gcs` (i.e. `gs://bucket`). Also, the service itself is called Google Storage (https://cloud.google.com/storage).
While Amazon calls the product "S3", the official name of the equivalent Google service seems to be "Cloud Storage" according to the docs you point to. Should we then refer to it by that name instead of a different acronym? Or is "GS" the official acronym (I did not find it in Google's documentation)?
Standardizing on Google Storage; fixed in c3fa7b9.
docs/track.rst (Outdated)
```rst
* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:

  * S3 according to `docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
  * GS: Either using `client library authentication <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`_ or by presenting an `oath2 token <https://cloud.google.com/storage/docs/authentication>`_ via the ``GOOGLE_AUTH_TOKEN`` environment variable, typically done using: ``export GOOGLE_AUTH_TOKEN=$(gcloud auth print-access-token)``.
```
typo: GS -> GCS
Fixed in c3fa7b9
docs/track.rst (Outdated)
```rst
* ``base-url`` (optional): A http(s), S3 or GS URL that points to the root path where Rally can obtain the corresponding source file. Rally can also download data from private S3 or GS buckets if access is properly configured:

  * S3 according to `docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration>`_.
  * GS: Either using `client library authentication <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`_ or by presenting an `oath2 token <https://cloud.google.com/storage/docs/authentication>`_ via the ``GOOGLE_AUTH_TOKEN`` environment variable, typically done using: ``export GOOGLE_AUTH_TOKEN=$(gcloud auth print-access-token)``.
```
typo: oath2 -> oauth2.
Fixed in c3fa7b9
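For illustration, a sketch of how a bearer token taken from `GOOGLE_AUTH_TOKEN` can be presented to the Cloud Storage JSON API; the helper below is an assumption for illustration, not code from this PR:

```python
import os
import urllib.parse

import requests


def download_gcs_object(bucket, object_name, local_path):
    # hypothetical helper: fetch a single object using an OAuth2 bearer token
    token = os.environ.get("GOOGLE_AUTH_TOKEN")  # e.g. $(gcloud auth print-access-token)
    url = ("https://storage.googleapis.com/storage/v1/b/"
           f"{bucket}/o/{urllib.parse.quote(object_name, safe='')}?alt=media")
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    with requests.get(url, headers=headers, stream=True) as response:
        response.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
```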
```python
# License: Apache 2.0
# transitive dependencies:
# google-crc32c: Apache 2.0
"google-resumable-media==1.1.0",
```
You also need to update the script `create-notice.sh` accordingly. I also wonder whether we should investigate making S3 and GCS optional dependencies and avoid pulling them into core Rally. I'd be ok if we tackle this separately though.
Great catch. Done in 27419b6 (where I also fixed the URL for yarl, which was actually downloading HTML).
As discussed offline, will create an issue for making the S3 dependencies optional.
if url.startswith("s3"): | ||
expected_size_in_bytes = download_s3(url, tmp_data_set_path, expected_size_in_bytes, progress_indicator) | ||
scheme = urllib3.util.parse_url(url).scheme | ||
if scheme in ["s3", "gs"]: |
I wonder whether we should start encapsulating this now that we add a third scheme. We could implement this as a `Downloader` that supports certain schemes, register them upon startup and implement `def supports(self, url: str) -> bool` to check to which `Downloader` to pass a URL, instead of spreading this logic in multiple places. Wdyt?
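A minimal sketch of that idea, assuming a registry of downloaders consulted via `supports()`; class and function names here are illustrative, not from the codebase:

```python
import urllib3


class S3Downloader:
    def supports(self, url: str) -> bool:
        return urllib3.util.parse_url(url).scheme == "s3"

    def download(self, url, local_path, expected_size_in_bytes, progress_indicator):
        ...  # delegate to the S3 download logic


class GcsDownloader:
    def supports(self, url: str) -> bool:
        return urllib3.util.parse_url(url).scheme == "gs"

    def download(self, url, local_path, expected_size_in_bytes, progress_indicator):
        ...  # delegate to the GCS download logic


# registered once at startup
_downloaders = [S3Downloader(), GcsDownloader()]


def downloader_for(url):
    # returns the first downloader that claims support for the URL, else None
    return next((d for d in _downloaders if d.supports(url)), None)
```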
As discussed offline, we'll defer this for now, but certainly look at it again, if we are to extend to another scheme.
and also fix pulling yarl license
Thanks for iterating. LGTM
This commit adds support for GCS hosted buckets storing source data.
While at it, switch utils/net.py tests to pytest.