Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide file hashes in the URLs to avoid unnecessary file downloads (bandwidth saver) #1433

Merged
merged 15 commits into from
Sep 23, 2023
Merged
24 changes: 23 additions & 1 deletion s3_management/manage.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
from packaging.version import parse

import boto3
from botocore.exceptions import NoCredentialsError


S3 = boto3.resource('s3')
Expand Down Expand Up @@ -212,6 +213,23 @@ def normalize_package_version(self: S3IndexType, obj: str) -> str:
def obj_to_package_name(self, obj: str) -> str:
return path.basename(obj).split('-', 1)[0]

def fetch_checksum_from_s3(self, s3_key):
s3_key = s3_key.replace("%2B", "+")
try:
response = CLIENT.get_object_attributes(
Bucket=BUCKET,
Key=s3_key,
ObjectAttributes=['Checksum']
)
checksum = response['Checksum']['ChecksumSHA256']

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to leverage the call in from_S3 to also include this information?

If not, perhaps this should be done there so the returned objects already have the checksum attached and the downstream code can choose to use it or not?

That would isolate the S3 retrieval code to be in one logical place.

Copy link
Contributor Author

@matteius matteius Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to leverage the call in from_S3 to also include this information?

From what I can tell they are different clients/endpoints that serve different purposes.

If not, perhaps this should be done there so the returned objects already have the checksum attached and the downstream code can choose to use it or not?

From my perspective, the from_S3 appears to just be tracking a list of strings which is the object s3_key with List[str] so I am not sure it makes sense to tackle a larger refactor just to add all the additional calls to pre-lookup these checksums and store them in memory until their point of use.

That would isolate the S3 retrieval code to be in one logical place.

Isn't it already isolated to this one file manage.py where the clients are in scope for the whole file?

Maybe I am being dense and don't see the obvious way to accomplish this without spending a couple more hours on it, plus the more complexity I add, well I have no real way of testing it. Looking for any guidance on the right exceptions to catch or not catch above, but I think we need to catch something to cover the case that s3 isn't ready or able to provide the hashes.

Copy link

@thejcannon thejcannon Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so looks like you should be able to do this in from_s3 (which is ideal, because we'll want the checksums for metadata files in #1457 as well:

>>> import boto3
>>> BUCKET=S3.Bucket("<bucketname>")
>>> obj = next(iter(BUCKET.objects.filter(Prefix="<some prefix>")))
>>> response = obj.meta.client.head_object(Bucket=BUCKET.name, Key=obj.key, ChecksumMode="ENABLED")
>>> response["ChecksumSHA256"]
'7Uc3AtM1Dlc4XiQ8A6NkQdV9JMU4liPtz7wOd+RRCTE='

And then for an object without a checksum:

>>> response = obj.meta.client.head_object(Bucket=BUCKET.name, Key=obj.key, ChecksumMode="ENABLED")
>>> list(response)
['ResponseMetadata', 'AcceptRanges', 'LastModified', 'ContentLength', 'ETag', 'ContentType', 'ServerSideEncryption', 'Metadata']

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build is failing with:

Unknown parameter in input: "ChecksumMode", must be one of: Bucket, IfMatch, IfModifiedSince, IfNoneMatch, IfUnmodifiedSince, Key, Range, VersionId, SSECustomerAlgorithm, SSECustomerKey, SSECustomerKeyMD5, RequestPayer, PartNumber```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tagging @malfet again, as these inner comments are notoriously hard to discover. Hope it's no big deal 😇

return checksum
except NoCredentialsError:
print("No AWS credentials found")
return None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just let this propogate? Otherwise it'll just repeatedly print the error and generate a sub-optimal index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that the index page should still generate as it does now even if there is no hash available to provide. I should definitely remove the print statements. I am not sure what exceptions would be raised before the Checksum values get added to the objects.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Did this mean to get commented below? on https://github.com/pytorch/builder/pull/1433/files/b3799a8eb42b56118a4bc0658902875734888721#r1261472864? If not I'm not following the thread here. If the credentials aren't found we really can't know if there's a checksum or not)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean its the same code block in my mind -- I am just saying I don't know what exceptions we want to catch and handle for the case of the checksum isn't available yet -- maybe we don't care about NoCredentialsError but I think we need to handle something

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to assume:

  • The runner has valid credentials to the object
  • The runner has the correct permissions to access the attributes
  • S3 is alive and well and will return good JSON.

Because otherwise if these assumptions aren't held, there's a chance the script generates an index.html without the checksums and we're back to square 1.

So I think the only thing to handle here is "this object has no checksum attribute".

Which means the code looks roughly like:

response = CLIENT.get_object_attributes(
    Bucket=BUCKET,
    Key=s3_key,
    ObjectAttributes=["Checksum"]
)
checksum = response.get("Checksum", {}).get("ChecksumSHA256", None)

without any try/except.

FWIW I tested your code against a work bucket and got TypeError: expected string or bytes-like object. Switching the argument for Bucket to the string bucket name then gives me a valid response.

So probably use BUCKET.name as well.


Here's some local testing...

>>> import boto3
>>> BUCKET=S3.Bucket("<bucketname>")
>>> CLIENT = boto3.client("s3")
>>> response = CLIENT.get_object_attributes(Bucket=BUCKET.name,Key="<object-without-checksum>",ObjectAttributes=['Checksum'])
>>> list(response)
['ResponseMetadata', 'LastModified', 'VersionId']
>>> response.get("Checksum", {}).get("ChecksumSHA256", None)
>>> response = CLIENT.get_object_attributes(Bucket=BUCKET.name,Key="<object-WITH-checksum>",ObjectAttributes=['Checksum'])
>>> list(response)
['ResponseMetadata', 'LastModified', 'Checksum']
>>> response.get("Checksum", {}).get("ChecksumSHA256", None)
'7Uc3AtM1Dlc4XiQ8A6NkQdV9JMU4liPtz7wOd+RRCTE='

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

except Exception as e:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a wide exception. I think the try: block and relevant exceptions likely should be scoped to be explicit about what is expected to fail and what isn't.

Otherwise, a simple bug could manifest as an index not having hashes. When the alternative is an error with a stacktrace given to the user, and no index generated.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

print(f"Unable to retrieve checksum due to {e}")
return None

def to_legacy_html(
self,
subdir: Optional[str]=None
Expand Down Expand Up @@ -255,7 +273,11 @@ def to_simple_package_html(
out.append(' <body>')
out.append(' <h1>Links for {}</h1>'.format(package_name.lower().replace("_","-")))
for obj in sorted(self.gen_file_list(subdir, package_name)):
out.append(f' <a href="/{obj}">{path.basename(obj).replace("%2B","+")}</a><br/>')
checksum = self.fetch_checksum_from_s3(obj)
if checksum:
out.append(f' <a href="/{obj}#sha256={checksum}">{path.basename(obj).replace("%2B","+")}</a><br/>')
else:
out.append(f' <a href="/{obj}">{path.basename(obj).replace("%2B","+")}</a><br/>')
matteius marked this conversation as resolved.
Show resolved Hide resolved
# Adding html footer
out.append(' </body>')
out.append('</html>')
Expand Down