
@TRACER.capture_method causes botocore response body of objects retrieved from S3 to be pre-exhausted #238

Closed
paulalex opened this issue Dec 9, 2020 · 11 comments
Assignees: heitorlessa
Labels: bug (Something isn't working)

paulalex commented Dec 9, 2020

Hi,

I am really loving Powertools; it has been very useful for me recently. However, I have hit an issue since starting to use it in a new serverless application. I am currently using it in a few other applications where it works fine, and I had not noticed this issue before.

This may well be user error, and what I am experiencing may be desired or necessary behaviour. I can change the way my code works so I can still decorate this method, but I think it is worth raising this.

I have provided as much information as I can below.

When a method is decorated with @TRACER.capture_method, and that method retrieves a file from S3 using boto3 and returns the S3 object (a dictionary), the botocore.response.StreamingBody has already been read, meaning there is no data left to read.

Concretely: after adding tracing to a method that retrieves a CSV file from S3 using boto3, the retrieved CSV files had no data when converted to a DataFrame, even though the response metadata was populated and the files were successfully retrieved from S3.

Expected Behavior

The method can be decorated without the object's stream data being consumed.

Current Behavior

See the following code snippet which works if the decorator is removed:

@TRACER.capture_method
def load_file_from_s3(bucket_name, key):
    try:
        obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    except ClientError as exc:
        if exc.response["Error"]["Code"] != "404":
            raise exc

    return obj
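For context on the symptom: a botocore StreamingBody is a one-shot, non-seekable stream, so if anything consumes it before the caller does, the caller sees an empty body. A minimal illustration using io.BytesIO as a stand-in (the CSV payload is made up):

```python
import io

# Stand-in for botocore.response.StreamingBody: once the stream has been
# consumed, a second read() returns b'' (and StreamingBody cannot seek back).
body = io.BytesIO(b"col_a,col_b\n1,2\n")

first = body.read()   # something upstream consumes the body...
second = body.read()  # ...so the eventual caller gets nothing

print(first)   # b'col_a,col_b\n1,2\n'
print(second)  # b''
```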

Environment

  • Powertools version used:
    Latest

Error log output (structured logging):

{
    "level": "ERROR",
    "location": "generate_single_shape:186",
    "message": "No columns to parse from file",
    "timestamp": "2020-12-09 18:41:03,474",
    "service": "",
    "sampling_rate": 0,
    "cold_start": true,
    "function_name": "",
    "function_memory_size": "128",
    "function_arn": "arn:aws:lambda:eu-west-1:",
    "function_request_id": "2b74000d-10a6-48a9-8858-822f1f1f01e5",
    "exception": "Traceback (most recent call last):\n  File \"/var/task/src/x/controller.py\", line 166, in generate_single_shape\n    df_shape = get_shape_dataframe(int(shape_number), \"shape\")\n  File \"/var/task/src/x/controller.py\", line 68, in get_shape_dataframe\n    df = service.s3_shape_obj_to_pandas(file_obj, columns)\n  File \"/var/task/src/forecasting/shape_tool_service.py\", line 85, in s3_shape_obj_to_pandas\n    df = pd.read_csv(csv_file, usecols=columns, index_col=0, skiprows=2)\n  File \"/opt/python/pandas/io/parsers.py\", line 688, in read_csv\n    return _read(filepath_or_buffer, kwds)\n  File \"/opt/python/pandas/io/parsers.py\", line 454, in _read\n    parser = TextFileReader(fp_or_buf, **kwds)\n  File \"/opt/python/pandas/io/parsers.py\", line 948, in __init__\n    self._make_engine(self.engine)\n  File \"/opt/python/pandas/io/parsers.py\", line 1180, in _make_engine\n    self._engine = CParserWrapper(self.f, **self.options)\n  File \"/opt/python/pandas/io/parsers.py\", line 2010, in __init__\n    self._reader = parsers.TextReader(src, **kwds)\n  File \"pandas/_libs/parsers.pyx\", line 540, in pandas._libs.parsers.TextReader.__cinit__\npandas.errors.EmptyDataError: No columns to parse from file",
    "xray_trace_id": "1-5fd11a3a-75ee6c8d49267424492ea927"
}
@paulalex paulalex added bug Something isn't working triage Pending triage from maintainers labels Dec 9, 2020

heitorlessa commented Dec 10, 2020

Hey Paul - Thanks for raising this issue, and also great to hear you find this library helpful!

If I understood this scenario correctly, you're downloading a file from S3 but not reading its contents (.read()) in that function, since you have another function doing that, e.g. pandas turning the file-like object into a DataFrame.

However, when you add the capture_method decorator, the object is read (.read() is called on the obj you return), and this does not happen without the Tracer.

Did I understand this right?

If so, could you also share a snippet on how you're calling load_file_from_s3 fn, and your boto3 version?

From a Powertools perspective, we primarily decorate and call your function when you invoke it; we are unaware of what's inside.
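To illustrate that point, a plain pass-through decorator really is unaware of the wrapped function's internals, and a streaming body inside the return value survives it untouched. A minimal sketch (the fake S3-style response dict is purely illustrative, not the real API):

```python
import functools
import io

def passthrough(fn):
    """A decorator that only forwards the call. It never inspects the
    return value, so a streaming body inside it stays unread."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper

@passthrough
def load_fake_object():
    # Shaped loosely like a get_object response: the body is a one-shot stream
    return {"Body": io.BytesIO(b"col_a,col_b\n1,2\n")}

obj = load_fake_object()
assert obj["Body"].read() == b"col_a,col_b\n1,2\n"  # still intact
```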

This could be something else, but I'm more than happy to try to reproduce it tomorrow with the same boto3 version and snippet you have.

Thank you :)

@heitorlessa heitorlessa added the need-more-information Pending information to continue label Dec 10, 2020

paulalex commented Dec 10, 2020

Hi Heitor,

Yes, that is exactly what I am doing, and the behaviour is as you describe. With the decorator, the output outside of that method is b''. If I dump the body inside the method before returning, I get my data; if I remove the decorator and return the object, I also get the data. So it appears as if something is reading the data.

The versions of boto3 below:

boto3 == 1.16.30
botocore == 1.19.30

Below are the functions involved, although it is worth reiterating that once the object returned from the boto3 call leaves load_file_from_s3, and that function is decorated, there is no data. So for now I have had to remove the decorator from that function.

import io
import os

import boto3
import pandas as pd
from botocore.exceptions import ClientError

# Decorator removed for now, as it does not work with it
def load_file_from_s3(bucket_name, key):
    try:
        obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    except ClientError as exc:
        if exc.response["Error"]["Code"] != "404":
            raise exc
    # Getting the value of the body as CSV is fine up to this point;
    # once returned, it is no longer available to be read
    return obj

def s3_shape_obj_to_pandas(obj, columns):
    """Reads a csv file containing the shape data and returns this as a pandas dataframe"""
    csv_file = io.BytesIO(obj["Body"].read())

    # Read the shape file using Pandas
    # Exception is triggered here because the retrieved data is equal to b''
    df = pd.read_csv(csv_file, usecols=columns, index_col=0, skiprows=2)

    return df

def get_shape_dataframe(shape_number, shape_type):
    """
    Pulls a shape from s3 and returns it as a pandas dataframe object.
    Inputs:
        shape_number: integer in the range 0-6 inclusive
        shape_type: string containing either "shape" or "historic"
    """

    # Read shape data from s3
    bucket_name = os.environ["RESOURCE_BUCKET_NAME"]
    key = service.get_shape_file_name(shape_number, shape_type)
    file_obj = service.load_file_from_s3(bucket_name, key)

    columns = service.get_hh_shape_file_columns()

    if shape_type == "shape":
        df = service.s3_shape_obj_to_pandas(file_obj, columns)
    elif shape_type == "historic":
        df = service.s3_historic_shape_obj_to_pandas(file_obj, columns)
    else:
        raise Exception("Invalid shape type")

    return df

Apologies if this does turn out to be something else, such as a configuration issue. It's just that the application was working absolutely fine until we started a ticket to add Powertools tracing and structured logging.

@heitorlessa

Don't worry @paulalex, it could be something on our side, as I can't see anything immediately odd in that code.

If there is something on our side, or if it takes too long to figure out, I'll push a context manager for Tracer so you can still use Tracer capabilities within your code more easily
-- it's something I've been meaning to do anyway.

I'll take this for a spin tomorrow (5pm here now).

Thanks for sharing all that info

@heitorlessa heitorlessa removed need-more-information Pending information to continue triage Pending triage from maintainers labels Dec 10, 2020
@heitorlessa heitorlessa self-assigned this Dec 11, 2020
@heitorlessa

hey @paulalex - I've managed to replicate it using the snippet you sent last. There's something in the decorator logic and in X-Ray that I can't figure out yet - I'll keep digging.

In the meantime, you can use a context manager, which works as expected:

def load_file_from_s3(bucket_name, key):
    try:
        with tracer.provider.in_subsegment("## load_file_from_s3") as subsegment:
            obj = s3_client.get_object(Bucket=bucket_name, Key=key)
            # you can add annotations and metadata using subsegment.put_annotation, subsegment.put_metadata
            # https://awslabs.github.io/aws-lambda-powertools-python/core/tracer/#escape-hatch-mechanism
    except botocore.exceptions.ClientError as exc:
        if exc.response["Error"]["Code"] != "404":
            raise exc
    # Getting the value of the body as csv is fine up to this point, once returned it is no longer available to be read
    return obj

It's really odd because the response from S3, Content-Length at least, is exactly the same as before. It's almost as if read() was called, and given you can only call it once, it returns an empty response - I might need to create an issue in boto3 or X-Ray.

Things I've tried for the record:

  • Compared responses with and without Tracer decorator on Content-Length, calls to S3, etc.
  • Created dummy decorator to rule out any shenanigans with decorated functions
  • Tested reproducing it locally, or with Tracer disabled (no effect)
  • Tested with and without Pandas and IO Stream (no effect)
  • Tested using csvreader (you know, just in case)
  • Tested with non-CSV file (.txt, .yaml, .json) to no avail too

I'll update here if we find the root cause and here's the source I used to reproduce now without all the boilerplate: https://github.com/heitorlessa/issue238-pt

[screenshot: X-Ray trace from the reproduction]

@paulalex

@heitorlessa Great work, you definitely put a lot of effort in! So it looks like the issue is not with Powertools but with either the X-Ray libraries or boto3?

Thanks for the context manager snippet; I will switch to using this in my code for now. I noticed the same thing: the response length is the same, but it appears as if something has already read the body.

Thanks a lot for all of this!

@heitorlessa

Funnily enough, if you pass the Content-Length to the read() call to force boto3 to read the entire chunk, it works consistently:

file_obj["Body"].read(file_obj["Content-Length"]).decode('utf-8')

So it's definitely not an issue with X-Ray per se, but something to be investigated in boto3 and the IO stream, because at first glance boto3 isn't actually giving back a file-like stream but a modified version of one. I need to dig in, as I'm personally hooked on the problem now, even if it isn't Powertools per se :D

TL;DR: if you pass the Content-Length to read(), it will work with either the decorator or the context manager, unchanged.

  • Before: csv_file = io.BytesIO(obj["Body"].read())
  • After: csv_file = io.BytesIO(obj["Body"].read(obj["Content-Length"]))

I'll do more digging next week when I free up more time, hope that helps ;)

@paulalex

That's great, thanks again, and I hope this doesn't become your weekend!

@to-mc

to-mc commented Dec 12, 2020

Like Heitor, I was also pretty hooked on figuring this one out! Long story short, it's due to the capture_method decorator trying to capture the response and add it to the X-Ray trace metadata. You can disable this behaviour in the decorator with @capture_method(capture_response=False), but by default it's enabled. What this means is that the X-Ray SDK tries to serialize the function's return value, in this case eventually causing .read() to be called on the StreamingBody object.

Ultimately it's expected behaviour: if you want the function's response data in the trace, it has to be serialized. I'll add something to our docs to clarify this before closing the issue, as I can certainly imagine it catching others out. Thanks for helping us figure this out with the detailed bug report @paulalex!
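The mechanism above can be sketched without AWS at all: a decorator that records the return value for trace metadata will exhaust any one-shot stream inside it, while a capture_response=False analogue leaves it untouched. This toy version reads the body explicitly to mimic what serialization does; none of these names are the real Powertools or X-Ray APIs:

```python
import functools
import io

def toy_capture_method(capture_response=True):
    """Toy stand-in for Tracer.capture_method: optionally 'records' the
    return value, the way serializing it into trace metadata would."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            if capture_response:
                # Mimics serialization of the response: consumes the stream
                _recorded = result["Body"].read()
            return result
        return wrapper
    return decorator

def fake_get_object():
    # Shaped loosely like an S3 get_object response; purely illustrative
    return {"Body": io.BytesIO(b"col_a,col_b\n1,2\n")}

captured = toy_capture_method()(fake_get_object)()
assert captured["Body"].read() == b""  # already exhausted by the decorator

uncaptured = toy_capture_method(capture_response=False)(fake_get_object)()
assert uncaptured["Body"].read() == b"col_a,col_b\n1,2\n"  # untouched
```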

@heitorlessa

In simple words, the X-Ray SDK uses the jsonpickle dependency to serialize Python objects into JSON when they're added as trace metadata, e.g. any response from a decorated function.

When running locally this is a no-op, meaning it is simply skipped, which is why it wasn't easy to reproduce; but in Lambda it does run. @cakepietoast brilliantly traced all the calls to reproduce it ;)

@heitorlessa heitorlessa added the pending-release Fix or implementation already in dev waiting to be released label Dec 14, 2020
@michaelbrewer

@heitorlessa a hidden benefit of the capture_response flag ;)

@heitorlessa heitorlessa removed the pending-release Fix or implementation already in dev waiting to be released label Dec 21, 2020
@heitorlessa

Indeed @michaelbrewer, though if I'm honest, the lack of seek in the botocore stream response took me by surprise, as did the upstream side effect of using jsonpickle.

Closing this, as the docs clarification is now available as part of the 1.9.1 bugfix release.
