
S3/SQS source events fails sporadically #20314

Open
Brightside56 opened this issue Apr 17, 2024 · 9 comments
Labels
source: aws_s3 Anything `aws_s3` source related type: bug A code related bug.

Comments

@Brightside56

Brightside56 commented Apr 17, 2024

Problem

I have configured the s3 source and I'm getting the following error from time to time, roughly once every 30-60 minutes:

ERROR source{component_kind="source" component_id=s3 component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Failed to process SQS message. message_id=8983e08a-46b8-48b8-a7f8-b06120861783 error=Failed to fetch s3://xxx/2024/04/16/1713311288-2e57bf4b-6612-4ab8-960e-517f01126830.log.gz: dispatch failure error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing" internal_log_rate_limit=true

However, after being in flight for 10 minutes or so, these messages and their corresponding S3 objects seem to be reprocessed successfully from SQS.

Configuration

Below is the s3 source config:

          sources:
            s3:
              type: aws_s3
              compression: gzip
              region: eu-west-1
              sqs:
                queue_url: https://sqs.eu-west-1.amazonaws.com/xxx/xxx-xxx
                delete_message: true
                client_concurrency: 5

I also have logLevel: aws_smithy_http=info,vector=info, but apart from the above-mentioned error message and the periodic vector: Beep messages, there are no other informative error messages that could explain the root cause of the issue.

Version

0.37.1

@Brightside56 Brightside56 added the type: bug A code related bug. label Apr 17, 2024
@Brightside56 Brightside56 changed the title S3 source events fails sporadically S3/SQS source events fails sporadically Apr 17, 2024
@jszwedko
Member

Hmm, yeah, that is odd. I think that sort of error would be coming from the AWS SDK (similar issue, but not for S3: awslabs/aws-sdk-rust#844). You could try increasing the log level of aws_smithy_http to, say, debug or trace but I realize it'd be pretty chatty and so might not be feasible.
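For reference, the module-level filter the reporter already uses (logLevel in the Helm values, or the VECTOR_LOG environment variable) can be raised only for the AWS SDK crates so the rest of Vector stays at info. A minimal sketch, assuming the Helm-style deployment shown in the config above; the exact module targets are whatever shows up in your logs:

# Hypothetical Helm values snippet; raises only the AWS SDK modules.
logLevel: "info,aws_smithy_http=trace,aws_smithy_runtime=debug"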

@jszwedko jszwedko added the source: aws_s3 Anything `aws_s3` source related label Apr 18, 2024
@hoazgazh

hoazgazh commented May 4, 2024

Same here, has anyone resolved this?

2024-05-04T17:07:15.594521Z INFO vector::topology::running: Running healthchecks.
2024-05-04T17:07:15.595648Z INFO vector: Vector has started. debug="false" version="0.37.1" arch="x86_64" revision="cb6635a 2024-04-09 13:45:06.561412437"
2024-05-04T17:07:15.595990Z INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2024-05-04T17:07:15.609041Z INFO source{component_kind="source" component_id=my_source_id component_type=aws_s3}:lazy_load_identity: aws_smithy_runtime::client::identity::cache::lazy: identity cache miss occurred; added new identity (took 185µs) new_expiration=2024-05-04T17:22:15.607636Z valid_for=899.998602s partition=IdentityCachePartition(2)
2024-05-04T17:07:15.826752Z INFO vector::topology::builder: Healthcheck passed.
2024-05-04T17:07:15.860011Z ERROR source{component_kind="source" component_id=my_source_id component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
2024-05-04T17:07:15.907364Z ERROR source{component_kind="source" component_id=my_source_id component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.

@fpytloun
Contributor

Also happens with the S3 sink:

2024-05-22T09:40:24.026717Z  WARN sink{component_kind="sink" component_id=out_kafka_access_s3 component_type=aws_s3}:request{request_id=23796}: vector::sinks::util::retries: Retrying after error. error=dispatch failure internal_log_rate_limit=true

@seluard

seluard commented Jun 12, 2024

I think all of these are related to how the AWS SDK handles retries and failures for AWS IAM.

#20266
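If the failures really do come from the SDK's credential-refresh path, one thing that can be tried from the Vector side is the source-level auth block: give the credentials provider more time to load, or temporarily switch to static keys to rule the provider chain out. A hedged sketch only; the option names (in particular auth.load_timeout_secs) should be checked against the Vector docs for the version in use:

sources:
  s3:
    type: aws_s3
    region: eu-west-1
    auth:
      # Assumed option: allow the credential provider chain (IMDS/STS) more
      # time to answer before the request is abandoned.
      load_timeout_secs: 30
    sqs:
      queue_url: https://sqs.eu-west-1.amazonaws.com/xxx/xxx-xxx
      delete_message: true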

@Brightside56
Author

> Hmm, yeah, that is odd. I think that sort of error would be coming from the AWS SDK (similar issue, but not for S3: awslabs/aws-sdk-rust#844). You could try increasing the log level of aws_smithy_http to, say, debug or trace but I realize it'd be pretty chatty and so might not be feasible.

I've tried, but the messages weren't any more verbose in this specific area, even with aws_smithy_http=trace.

@biatwc

biatwc commented Oct 22, 2024

Hello, I'm seeing something similar with the S3 source and SQS. It happens every 6 hours or so most of the time, though sometimes there is a 12-hour gap between errors. Vector is on an EC2 instance using an instance profile, and the queue and bucket are in a different account from this EC2 instance. At the bottom are some debug logs from when the error occurs; earlier errors from before debug logging was enabled had different content, like:

error=Failed to fetch s3://<object url>: dispatch failure error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing"
aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving"

When this error happens the queue metrics show messages living on the queue for 200+ seconds, but they do eventually get consumed (at least I think that is what is happening; maybe they are being dropped and events lost, but I do tell Vector not to delete failed messages).
The source is configured like this:

type: aws_s3
region: eu-central-1
sqs:
  queue_url: "{{ queue_url }}"
  delete_failed_message: false

If there is any advice on config that could alleviate this issue, please let me know, or if there is an upcoming release that aims to fix it.
If this continues I'll have to switch to static creds and see how things look using them.
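One configuration-side knob that might soften the "messages in flight for minutes" symptom is the SQS polling options on the source. A hedged sketch, assuming the visibility_timeout_secs and poll_secs options documented for the aws_s3 source; the values are illustrative:

type: aws_s3
region: eu-central-1
sqs:
  queue_url: "{{ queue_url }}"
  delete_failed_message: false
  # Shorter visibility timeout: a message whose S3 fetch failed transiently
  # becomes visible again (and is retried) sooner.
  visibility_timeout_secs: 120
  # Interval between ReceiveMessage long polls.
  poll_secs: 15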

Debug logs:

"2024-10-22T10:48:48.918729Z DEBUG hyper::client::connect::dns: resolving host=""sqs.eu-central-1.amazonaws.com""
"2024-10-22T10:48:48.918694Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime::client::http::body::minimum_throughput: applying minimum upload throughput check future options=MinimumThroughputBodyOptions { minimum_throughput: Throughput { bytes_read: 1, per_time_elapsed: 1s }, grace_period: 5s, check_window: 1s }"
"2024-10-22T10:48:48.918682Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'transmit' phase"
"2024-10-22T10:48:48.918620Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime::client::identity::cache::lazy: loaded identity from cache buffer_time=10s cached_expiration=Some(SystemTime { tv_sec: 1729594138, tv_nsec: 0 }) now=SystemTime { tv_sec: 1729594128, tv_nsec: 918617969 }"
"2024-10-22T10:48:48.918602Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime::client::orchestrator::endpoints: will use endpoint Endpoint { url: ""https://sqs.eu-central-1.amazonaws.com"", headers: {}, properties: {} }"
"2024-10-22T10:48:48.918595Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime::client::orchestrator::endpoints: resolving endpoint endpoint_params=EndpointResolverParams(TypeErasedBox[!Clone]:Params { region: Some(""eu-central-1""), use_dual_stack: false, use_fips: false, endpoint: None }) endpoint_prefix=None"
"2024-10-22T10:48:48.918589Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime::client::orchestrator: beginning attempt #1"
"2024-10-22T10:48:48.918584Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime::client::orchestrator: retry strategy has OKed initial request"
"2024-10-22T10:48:48.918580Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime::client::retries::strategy::standard: no client rate limiter configured, so no token is required for the initial request."
"2024-10-22T10:48:48.918573Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime_api::client::interceptors::context: entering 'before transmit' phase"
"2024-10-22T10:48:48.918562Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime_api::client::interceptors::context: entering 'serialization' phase"
"2024-10-22T10:48:48.918551Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:apply_configuration: aws_smithy_runtime::client::orchestrator: timeout settings for this operation: TimeoutConfig { connect_timeout: Disabled, read_timeout: Disabled, operation_timeout: Disabled, operation_attempt_timeout: Unset }"
"2024-10-22T10:48:48.918519Z ERROR source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code=""failed_fetching_sqs_events"" error_type=""request_failed"" stage=""receiving"" internal_log_rate_limit=true"
"2024-10-22T10:48:48.918515Z ERROR source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] has been suppressed 414 times."
"2024-10-22T10:48:48.918486Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op: aws_smithy_runtime::client::orchestrator: a retry is either unnecessary or not possible, exiting attempt loop"
"2024-10-22T10:48:48.918481Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op: aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=1 max_attempts=1"
"2024-10-22T10:48:48.918468Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'after deserialization' phase"
"2024-10-22T10:48:48.918452Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op:try_attempt:deserialization:deserialize_nonstreaming: aws_sdk_sqs::operation::receive_message: request_id=Some(""2293170b-6dda-521e-8f04-489ae881545b"")"
"2024-10-22T10:48:48.918435Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'deserialization' phase"
"2024-10-22T10:48:48.918414Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'before deserialization' phase"
"2024-10-22T10:48:48.918287Z DEBUG hyper::proto::h1::conn: incoming body completed"
"2024-10-22T10:48:48.918260Z DEBUG hyper::proto::h1::conn: incoming body is content-length (125 bytes)"
"2024-10-22T10:48:48.918251Z DEBUG hyper::proto::h1::io: parsed 6 headers"
"2024-10-22T10:48:48.913970Z DEBUG hyper::proto::h1::io: flushed 2293 bytes"

@kamilkarczewski

Got the same issue. Running Vector on an EC2 instance, reading CloudTrail logs from S3+SQS as a source...
Got 1k of these errors in a minute...

Dec 12 07:26:16 ip-10-0-142-188 vector: 2024-12-12T07:26:16.516860Z ERROR source{component_kind="source" 
component_id=source_s3 component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Failed to process SQS message.
message_id=0a45ffd3-0dd8-40cb-b4e1-1aa41c4bce5d error=Failed to fetch s3://{REDACTED}.json.gz: service error
error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing" internal_log_rate_limit=true
Dec 12 07:26:16 ip-10-0-142-188 vector: 2024-12-12T07:26:16.636846Z ERROR source{component_kind="source"
component_id=source_s3 component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Internal log[Failed to process SQS
message.] is being suppressed to avoid flooding.

@chlunde

chlunde commented Dec 17, 2024

It would be nice if the ERROR message here captured some more information about the cause.
Another issue is that it is constantly flooding the network with DNS lookups and API calls, around 3 Gbit/s. There should be some backoff here.

vector-0 vector 2024-12-17T07:42:42.401349Z  INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
vector-0 vector 2024-12-17T07:42:42.459117Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
vector-0 vector 2024-12-17T07:42:42.460024Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.
vector-0 vector 2024-12-17T07:42:42.494630Z  INFO vector::topology::builder: Healthcheck passed.
vector-0 vector 2024-12-17T07:42:52.459938Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] has been suppressed 5610 times.
vector-0 vector 2024-12-17T07:42:52.459979Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
vector-0 vector 2024-12-17T07:42:52.460620Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.
vector-0 vector 2024-12-17T07:43:02.462471Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] has been suppressed 5839 times.
vector-0 vector 2024-12-17T07:43:02.462511Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true

@chlunde

chlunde commented Dec 17, 2024

With TRACE logging, I found this error:

2024-12-17T08:22:58.980007Z TRACE source{component_kind="source" component_id=sqs_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=6213509}:try_op:try_attempt:deserialization:deserialize_nonstreaming: aws_smithy_runtime::client::orchestrator::http: read HTTP response body response=Response { status: StatusCode(403), headers: Headers { headers: {"x-amzn-requestid": HeaderValue { _private: H0("9e280947-....") }, "x-amzn-query-error": HeaderValue { _private: H0("AccessDenied;Sender") }, "connection": HeaderValue { _private: H0("close") }, "date": HeaderValue { _private: H0("Tue, 17 Dec 2024 08:22:58 GMT") }, "content-type": HeaderValue { _private: H0("application/x-amz-json-1.0") }, "content-length": HeaderValue { _private: H0("349") }} }, body: SdkBody { inner: Once(Some(b"{\"__type\":\"com.amazon.coral.service#AccessDeniedException\",\"Message\":\"User: arn:aws:sts::nnnnn:assumed-role/rrrrrrrrrrrrrr/web-identity-token-1734423722539 is not authorized to perform: sqs:receivemessage on resource: arn:aws:sqs:eu-north-1:nnnn:vector-logs with an explicit deny in a resource-based policy\"}")), retryable: true }, extensions: Extensions { extensions_02x: Extensions, extensions_1x: Extensions } }

It would be nice if:

  • the remote error message could be part of the normal ERROR log statement
  • retries were throttled
  • and/or DNS lookups were cached
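On the policy side, the 403 above means the queue's resource-based policy has an explicit deny matching the assumed role, so that deny statement (or its condition) has to be adjusted; an additional allow alone will not override it. For reference, a minimal allowing statement for the consumer role looks roughly like this CloudFormation-style sketch (resource names and ARNs are placeholders, and the actions shown are typical for an SQS consumer):

# Hypothetical queue policy for the Vector consumer role; adjust actions to your setup.
VectorQueuePolicy:
  Type: AWS::SQS::QueuePolicy
  Properties:
    Queues:
      - !Ref VectorLogsQueue            # placeholder queue resource
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Sid: AllowVectorConsume
          Effect: Allow
          Principal:
            AWS: arn:aws:iam::111111111111:role/vector-logs-role   # placeholder role ARN
          Action:
            - sqs:ReceiveMessage
            - sqs:DeleteMessage
            - sqs:GetQueueAttributes
          Resource: !GetAtt VectorLogsQueue.Arn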
