S3 upload manager retry broken with seekable body #2485

project0 · 2024-02-09T10:36:55Z

Describe the bug

I am not sure if this is a exclusive problem of the upload manager or PutObject in general is affected.

failed to rewind transport stream for retry, request stream is not seekable

The retry mechnanism with putObject is not triggered because the "request" stream is not seekable. Not sure where it is messed up or how to fix it.

Expected Behavior

Retry of upload content is working as expected. The upload manager should be able to produce a full working version in this case.

Current Behavior

It looks like isStreamSeekable is not correctly set on the request object.

Reproduction Steps

I am not sure how to reproduce or provoke the bug

Possible Solution

No response

Additional Information/Context

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2 v1.24.0
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.14.10
github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.15.1
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.26.6
github.com/aws/aws-sdk-go-v2/service/s3 v1.47.5
github.com/aws/smithy-go v1.19.0

Compiler and Version used

1.21.5

Operating System and version

n/a

The text was updated successfully, but these errors were encountered:

RanVaknin · 2024-02-14T23:08:48Z

Hi @project0 ,

Thanks for reaching out. Without reproduction steps its very hard to understand what the problem is. I'm able to supply to uploader a body with bytes.NewReader, and I'm simulating an error that the SDK will treat as retryable.

import (
	"bytes"
	"context"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"io"
	"log"
	"net/http"
)

type CustomTransport struct {
	http.RoundTripper
	Counter int
}

func (c *CustomTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	c.Counter++
	if c.Counter <= 2 {
        // simulating a generic server error (500), which the SDK treats as a retryable error.
		return &http.Response{
			StatusCode: 500,
			Body:       io.NopCloser(bytes.NewBufferString("Simulated failure")),
			Header:     make(http.Header),
		}, nil
	}
	return c.RoundTripper.RoundTrip(req)
}

func main() {
	cfg, err := config.LoadDefaultConfig(
		context.TODO(),
		config.WithRegion("us-east-1"),
		config.WithClientLogMode(aws.LogRetries|aws.LogResponseWithBody),
		config.WithHTTPClient(&http.Client{
			Transport: &CustomTransport{
				RoundTripper: http.DefaultTransport,
			},
		}))

	if err != nil {
		log.Fatalf("unable to load SDK config, %v", err)
	}

	client := s3.NewFromConfig(cfg)

	uploader := manager.NewUploader(client)

	reader := bytes.NewReader([]byte("hello world"))
	_, err = uploader.Upload(context.Background(), &s3.PutObjectInput{
		Bucket: aws.String("testbucket"),
		Key:    aws.String("foo"),
		Body:   reader,
	})
	if err != nil {
		panic(err)
	}
}

This will result in the following logs:

SDK 2024/02/14 14:07:32 DEBUG Response
HTTP/0.0 500 Internal Server Error

Simulated failure
SDK 2024/02/14 14:07:33 DEBUG retrying request S3/PutObject, attempt 2
SDK 2024/02/14 14:07:33 DEBUG Response
HTTP/0.0 500 Internal Server Error

Simulated failure
SDK 2024/02/14 14:07:33 DEBUG retrying request S3/PutObject, attempt 3
SDK 2024/02/14 14:07:34 DEBUG Response
HTTP/1.1 200 OK
Content-Length: 0
Date: Wed, 14 Feb 2024 22:07:35 GMT
Etag: "REDACTED"
Server: AmazonS3
X-Amz-Id-2: REDACTED
X-Amz-Request-Id: REDACTED
X-Amz-Server-Side-Encryption: AES256
X-Amz-Version-Id: wtHqrxdKsYqVsZ54_qBaSwEZfUWbK0I_

As evident in the logs, the SDK retries successfully. Now you mention rewinding the body, I have put a breakpoint in line 155 of retry/middleware.go and RewindStream() does not return this error indicating that the request has a seekable stream.

Without better more isolated reproduction case it will be hard for us to assist you.

Thanks,
Ran~

dmitry-sherlok · 2024-02-15T23:28:20Z

@RanVaknin I can see you're simulating a 500 error in your tests, which implies getting a complete HTTP response back. What about the case of a closed network connection?

We regularly get this kind of errors from our GitHub Actions workers (which I suspect are hosted somewhere in the US), when writing to S3 buckets in ap-southeast-2 region:

Error saving state: failed to upload state: operation error S3: PutObject,
exceeded maximum number of attempts, 1, https response error StatusCode: 0,
RequestID: , HostID: , request send failed, Put
"https://<my_bucket>.s3.ap-southeast-2.amazonaws.com/<my_state_file>?x-id=PutObject":
write tcp 10.1.0.121:54026->52.95.129.26:443: use of closed network
connection

If attempting to retry the upload stream, Terraform produces the following error:

failed to upload state: operation error S3: PutObject, failed to rewind transport stream for retry, request stream is not seekable

stephenc · 2024-02-16T11:42:49Z

Still reproducible in terraform 1.7.3 when I run terraform across 10 modules with terragrunt, at least 1 of the 10 modules will fail with this issue

lucix-aws · 2024-02-19T21:51:37Z

The nature of this issue is inherently complex. I'll do my best to outline the factors at play here.

I will assert though that the specifics of the error (including whether an HTTP round trip actually took place) don't have any bearing on the behavior in question:

The inner "request stream is not seekable" error will surface when the SDK attempts to rewind the input HTTP request body (via the ResetStream() wrapper on smithyhttp.Request) but the internal isStreamSeekable flag in the request is false.
The value of isStreamSeekable is ultimately determined by calls to SetStream on the smithyhttp.Request body. When that happens is mostly irrelevant - for the context of this issue, just know that isStreamSeekable will be set to true when the input body implements io.Seeker, which the value returned by bytes.NewReader() does

What can go wrong here though is that unwittingly or otherwise, a calling agent (that could be the SDK itself or a user) could put middleware before the beginning of the retry loop that elides the "seekability" of the stream. In practice, the only inbuilt middleware that does this is the body payload calculation routine that sigv4 signing depends on. Immediately after the recent auth refactor, there was an issue with the middleware order such that payload calculation did come before the retry loop, which would have deterministically caused this issue to surface. That was fixed here: #2438.

Therefore, the root cause of "request stream is not seekable" is exactly as @project0 has indicated. The crux of the back-and-forth here, though, is that we're not seeing that happening in the SDK's current release state. The code snippet provided by @RanVaknin simulates the exact scenario in which the issue would arise, but the reported behavior isn't observed.

In reality then we can only assume one of the following things are occuring within the terraform project:

it is using older modules, or modules across the boundary of when this PR was merged
terraform is inserting some other middleware into the operation that breaks the seekability of the input as outlined above

This issue tugs at the themes we observed in #2370, where the unfortunate combination of the following:

the built-at-runtime design of the middleware stack
the splitting of operation-critical middleware stack assembly across the service client and core aws module

caused obnoxious cross-module incompatibility issues.

I can leave this open for a while longer, but as of now it appears to me based on everything I've seen and done in-code that the most current set of modules is operating correctly.

[EDIT]: "after the beginning of the retry loop" => "before the beginning of the retry loop"

project0 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 9, 2024

RanVaknin self-assigned this Feb 12, 2024

RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Feb 14, 2024

github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 16, 2024

lucix-aws added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 16, 2024

github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 17, 2024

lucix-aws added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 17, 2024

lucix-aws added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 19, 2024

github-actions bot added closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Feb 25, 2024

github-actions bot closed this as completed Feb 25, 2024

kojoru mentioned this issue Mar 8, 2024

Update AWS SDK versions to the latest hashicorp/terraform#34796

Merged

This was referenced Mar 8, 2024

Backport of Update AWS SDK versions to the latest into v1.7 hashicorp/terraform#34804

Merged

Backport of Update AWS SDK versions to the latest into v1.8 hashicorp/terraform#34805

Merged

jellevandenhooff mentioned this issue Aug 5, 2024

Round of log-resetting changes FiloSottile/sunlight#18

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

S3 upload manager retry broken with seekable body #2485

S3 upload manager retry broken with seekable body #2485

project0 commented Feb 9, 2024 •

edited

Loading

RanVaknin commented Feb 14, 2024

dmitry-sherlok commented Feb 15, 2024 •

edited

Loading

stephenc commented Feb 16, 2024

lucix-aws commented Feb 19, 2024 •

edited

Loading

S3 upload manager retry broken with seekable body #2485

S3 upload manager retry broken with seekable body #2485

Comments

project0 commented Feb 9, 2024 • edited Loading

Describe the bug

Expected Behavior

Current Behavior

Reproduction Steps

Possible Solution

Additional Information/Context

AWS Go SDK V2 Module Versions Used

Compiler and Version used

Operating System and version

RanVaknin commented Feb 14, 2024

dmitry-sherlok commented Feb 15, 2024 • edited Loading

stephenc commented Feb 16, 2024

lucix-aws commented Feb 19, 2024 • edited Loading

project0 commented Feb 9, 2024 •

edited

Loading

dmitry-sherlok commented Feb 15, 2024 •

edited

Loading

lucix-aws commented Feb 19, 2024 •

edited

Loading