
S3 upload manager retry broken with seekable body #2485

Closed
project0 opened this issue Feb 9, 2024 · 4 comments
Assignees
Labels
bug This issue is a bug. closed-for-staleness p2 This is a standard priority issue response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.

Comments


project0 commented Feb 9, 2024

Describe the bug

I am not sure if this is a problem exclusive to the upload manager or if PutObject in general is affected.

failed to rewind transport stream for retry, request stream is not seekable

The retry mechanism with PutObject is not triggered because the "request" stream is not seekable. I am not sure where it goes wrong or how to fix it.

see also hashicorp/terraform#34528.

Relevant code:
https://github.com/hashicorp/terraform/blob/v1.7.3/internal/backend/remote-state/s3/client.go#L200-L205

bytes.NewReader produces a seekable reader, so the problem must be somewhere in the AWS SDK.
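
For reference, the upload pattern in question looks roughly like this (a condensed sketch, not the exact Terraform code; bucket and key names are purely illustrative):

package main

import (
	"bytes"
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}

	// The payload is wrapped in bytes.NewReader, which implements
	// io.ReadSeeker, so the body handed to the upload manager is seekable.
	data := []byte(`{"example":"state"}`)

	uploader := manager.NewUploader(s3.NewFromConfig(cfg))
	_, err = uploader.Upload(context.TODO(), &s3.PutObjectInput{
		Bucket: aws.String("example-bucket"), // illustrative
		Key:    aws.String("example/key"),    // illustrative
		Body:   bytes.NewReader(data),
	})
	if err != nil {
		log.Fatal(err)
	}
}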

Expected Behavior

Retrying the upload works as expected. With a seekable body, the upload manager should be able to rewind the stream and retry successfully.

Current Behavior

It looks like isStreamSeekable is not correctly set on the request object.

see also

https://github.com/aws/smithy-go/blob/daa9e2bc92cf10976d9cac2893b74834ded16150/transport/http/request.go#L95-L106

// Following attempts must ensure the request payload stream starts in a
// rewound state.
if attemptNum > 1 {
	if rewindable, ok := in.Request.(interface{ RewindStream() error }); ok {
		if rewindErr := rewindable.RewindStream(); rewindErr != nil {
			return out, attemptResult, nopRelease, fmt.Errorf(
				"failed to rewind transport stream for retry, %w", rewindErr)
		}
	}
	r.logf(logger, logging.Debug, "retrying request %s/%s, attempt %d",
		service, operation, attemptNum)
}

This results in the retry failing with:

operation error S3: PutObject,
failed to rewind transport stream for retry, request stream is not seekable
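
For illustration, the check that produces this message can be sketched as follows (an illustrative stand-in paraphrasing the linked smithy-go request logic, not the actual type): the rewind fails whenever the seekable flag was never recorded, regardless of what the underlying reader can actually do.

package main

import (
	"bytes"
	"fmt"
	"io"
)

// request is an illustrative stand-in for smithyhttp.Request (not the real
// type): rewinding only succeeds if seekability was recorded when the
// stream was attached.
type request struct {
	stream           io.Reader
	isStreamSeekable bool
	streamStartPos   int64
}

func (r *request) rewindStream() error {
	if r.stream == nil {
		return nil // nothing to rewind
	}
	if !r.isStreamSeekable {
		return fmt.Errorf("request stream is not seekable")
	}
	_, err := r.stream.(io.Seeker).Seek(r.streamStartPos, io.SeekStart)
	return err
}

func main() {
	// The body itself is seekable, but the flag was never set, so the
	// rewind fails with the same message seen in the retry error above.
	r := &request{stream: bytes.NewReader([]byte("payload"))}
	fmt.Println(r.rewindStream()) // request stream is not seekable
}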

Reproduction Steps

I am not sure how to reproduce or provoke the bug.

Possible Solution

No response

Additional Information/Context

see also hashicorp/terraform#34528

A similar issue was reported in #2038, but in this case we have actually provided a proper body implementing io.ReadSeeker.
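
For completeness, a compile-time assertion is a quick way to confirm that claim (a trivial sketch):

package main

import (
	"bytes"
	"io"
)

// Compile-time check: *bytes.Reader implements io.ReadSeeker, so a body
// built with bytes.NewReader is seekable by construction.
var _ io.ReadSeeker = (*bytes.Reader)(nil)

func main() {}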

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2 v1.24.0
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.14.10
github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.15.1
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.26.6
github.com/aws/aws-sdk-go-v2/service/s3 v1.47.5
github.com/aws/smithy-go v1.19.0

Compiler and Version used

1.21.5

Operating System and version

n/a

@project0 project0 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 9, 2024
@RanVaknin RanVaknin self-assigned this Feb 12, 2024
RanVaknin (Contributor) commented

Hi @project0 ,

Thanks for reaching out. Without reproduction steps it's very hard to understand what the problem is. I'm able to supply the uploader a body created with bytes.NewReader, and I'm simulating an error that the SDK will treat as retryable.

package main

import (
	"bytes"
	"context"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"io"
	"log"
	"net/http"
)

type CustomTransport struct {
	http.RoundTripper
	Counter int
}

func (c *CustomTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	c.Counter++
	if c.Counter <= 2 {
        // simulating a generic server error (500), which the SDK treats as a retryable error.
		return &http.Response{
			StatusCode: 500,
			Body:       io.NopCloser(bytes.NewBufferString("Simulated failure")),
			Header:     make(http.Header),
		}, nil
	}
	return c.RoundTripper.RoundTrip(req)
}

func main() {
	cfg, err := config.LoadDefaultConfig(
		context.TODO(),
		config.WithRegion("us-east-1"),
		config.WithClientLogMode(aws.LogRetries|aws.LogResponseWithBody),
		config.WithHTTPClient(&http.Client{
			Transport: &CustomTransport{
				RoundTripper: http.DefaultTransport,
			},
		}))

	if err != nil {
		log.Fatalf("unable to load SDK config, %v", err)
	}

	client := s3.NewFromConfig(cfg)

	uploader := manager.NewUploader(client)

	reader := bytes.NewReader([]byte("hello world"))
	_, err = uploader.Upload(context.Background(), &s3.PutObjectInput{
		Bucket: aws.String("testbucket"),
		Key:    aws.String("foo"),
		Body:   reader,
	})
	if err != nil {
		panic(err)
	}
}

This will result in the following logs:

SDK 2024/02/14 14:07:32 DEBUG Response
HTTP/0.0 500 Internal Server Error

Simulated failure
SDK 2024/02/14 14:07:33 DEBUG retrying request S3/PutObject, attempt 2
SDK 2024/02/14 14:07:33 DEBUG Response
HTTP/0.0 500 Internal Server Error

Simulated failure
SDK 2024/02/14 14:07:33 DEBUG retrying request S3/PutObject, attempt 3
SDK 2024/02/14 14:07:34 DEBUG Response
HTTP/1.1 200 OK
Content-Length: 0
Date: Wed, 14 Feb 2024 22:07:35 GMT
Etag: "REDACTED"
Server: AmazonS3
X-Amz-Id-2: REDACTED
X-Amz-Request-Id: REDACTED
X-Amz-Server-Side-Encryption: AES256
X-Amz-Version-Id: wtHqrxdKsYqVsZ54_qBaSwEZfUWbK0I_

As evident in the logs, the SDK retries successfully. You mention rewinding the body: I put a breakpoint on line 155 of retry/middleware.go, and RewindStream() does not return this error, indicating that the request has a seekable stream.

Without a better, more isolated reproduction case it will be hard for us to assist you.

Thanks,
Ran~

@RanVaknin RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Feb 14, 2024

dmitry-sherlok commented Feb 15, 2024

@RanVaknin I can see you're simulating a 500 error in your tests, which implies getting a complete HTTP response back. What about the case of a closed network connection?

We regularly get this kind of error from our GitHub Actions workers (which I suspect are hosted somewhere in the US) when writing to S3 buckets in the ap-southeast-2 region:

Error saving state: failed to upload state: operation error S3: PutObject,
exceeded maximum number of attempts, 1, https response error StatusCode: 0,
RequestID: , HostID: , request send failed, Put
"https://<my_bucket>.s3.ap-southeast-2.amazonaws.com/<my_state_file>?x-id=PutObject":
write tcp 10.1.0.121:54026->52.95.129.26:443: use of closed network
connection

When attempting to retry the upload, Terraform produces the following error:

failed to upload state: operation error S3: PutObject, failed to rewind transport stream for retry, request stream is not seekable
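
One way to exercise that path with the repro above is to have the custom transport return a transport-level error instead of a 500 response, so the SDK sees a failed send rather than a complete HTTP reply. A minimal sketch, reusing the CustomTransport type from the earlier snippet (with "errors" and "net" added to its imports); whether the SDK then classifies the error as retryable determines if the rewind path is reached:

func (c *CustomTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	c.Counter++
	if c.Counter <= 2 {
		// Simulate a broken connection: no HTTP response at all, only a
		// network error resembling the "use of closed network connection"
		// failures reported above.
		return nil, &net.OpError{
			Op:  "write",
			Net: "tcp",
			Err: errors.New("use of closed network connection"),
		}
	}
	return c.RoundTripper.RoundTrip(req)
}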

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 16, 2024
@lucix-aws lucix-aws added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 16, 2024
stephenc commented

Still reproducible in Terraform 1.7.3. When I run Terraform across 10 modules with Terragrunt, at least 1 of the 10 modules will fail with this issue.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 17, 2024
@lucix-aws lucix-aws added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 17, 2024
lucix-aws (Contributor) commented Feb 19, 2024

The nature of this issue is inherently complex. I'll do my best to outline the factors at play here.

I will assert though that the specifics of the error (including whether an HTTP round trip actually took place) don't have any bearing on the behavior in question:

  • The inner "request stream is not seekable" error will surface when the SDK attempts to rewind the input HTTP request body (via the ResetStream() wrapper on smithyhttp.Request) but the internal isStreamSeekable flag in the request is false.
  • The value of isStreamSeekable is ultimately determined by calls to SetStream on the smithyhttp.Request. When that happens is mostly irrelevant; for the context of this issue, just know that isStreamSeekable will be set to true when the input body implements io.Seeker, which the value returned by bytes.NewReader() does.

What can go wrong here though is that unwittingly or otherwise, a calling agent (that could be the SDK itself or a user) could put middleware before the beginning of the retry loop that elides the "seekability" of the stream. In practice, the only inbuilt middleware that does this is the body payload calculation routine that sigv4 signing depends on. Immediately after the recent auth refactor, there was an issue with the middleware order such that payload calculation did come before the retry loop, which would have deterministically caused this issue to surface. That was fixed here: #2438.
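
To make the "elides seekability" failure mode concrete, here is a small self-contained sketch (illustrative only, not SDK or smithy-go code): once a seekable body is wrapped by a reader that does not forward Seek, a later io.Seeker type assertion fails, which is exactly the condition that leaves the stream treated as non-seekable.

package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	body := bytes.NewReader([]byte("payload"))

	// The raw body is seekable: *bytes.Reader implements io.Seeker.
	_, seekable := io.Reader(body).(io.Seeker)
	fmt.Println(seekable) // true

	// Wrapping it in a reader that does not forward Seek (as a payload-reading
	// middleware hypothetically might) hides seekability from later consumers.
	wrapped := io.NopCloser(body)
	_, seekable = io.Reader(wrapped).(io.Seeker)
	fmt.Println(seekable) // false
}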

Therefore, the root cause of "request stream is not seekable" is exactly as @project0 has indicated. The crux of the back-and-forth here, though, is that we're not seeing that happening in the SDK's current release state. The code snippet provided by @RanVaknin simulates the exact scenario in which the issue would arise, but the reported behavior isn't observed.

In reality, then, we can only assume one of the following things is occurring within the terraform project:

  • it is using older modules, or modules across the boundary of when this PR was merged
  • terraform is inserting some other middleware into the operation that breaks the seekability of the input as outlined above

This issue tugs at the themes we observed in #2370, where the unfortunate combination of the following:

  • the built-at-runtime design of the middleware stack
  • the splitting of operation-critical middleware stack assembly across the service client and core aws module

caused obnoxious cross-module incompatibility issues.

I can leave this open for a while longer, but as of now it appears to me based on everything I've seen and done in-code that the most current set of modules is operating correctly.

[EDIT]: "after the beginning of the retry loop" => "before the beginning of the retry loop"

@lucix-aws lucix-aws added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 19, 2024
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Feb 25, 2024