
S3/SQS source events fails sporadically #20314

Open
Brightside56 opened this issue Apr 17, 2024 · 9 comments
Labels
source: aws_s3 Anything `aws_s3` source related type: bug A code related bug.

Comments

@Brightside56

Brightside56 commented Apr 17, 2024

Problem

I have configured the s3 source and I'm getting the following error from time to time, roughly once every 30-60 minutes:

ERROR source{component_kind="source" component_id=s3 component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Failed to process SQS message. message_id=8983e08a-46b8-48b8-a7f8-b06120861783 error=Failed to fetch s3://xxx/2024/04/16/1713311288-2e57bf4b-6612-4ab8-960e-517f01126830.log.gz: dispatch failure error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing" internal_log_rate_limit=true

However, after being in flight for 10 minutes or so, these messages and their corresponding S3 objects seem to be reprocessed successfully from SQS.

Configuration

Below is the s3 source config:

          sources:
            s3:
              type: aws_s3
              compression: gzip
              region: eu-west-1
              sqs:
                queue_url: https://sqs.eu-west-1.amazonaws.com/xxx/xxx-xxx
                delete_message: true
                client_concurrency: 5

I also have logLevel: aws_smithy_http=info,vector=info, but apart from the above-mentioned error message and the periodic vector: Beep messages, there are no other informative error messages that could explain the root cause of the issue.

Version

0.37.1

@Brightside56 Brightside56 added the type: bug A code related bug. label Apr 17, 2024
@Brightside56 Brightside56 changed the title S3 source events fails sporadically S3/SQS source events fails sporadically Apr 17, 2024
@jszwedko
Member

Hmm, yeah, that is odd. I think that sort of error would be coming from the AWS SDK (similar issue, but not for S3: awslabs/aws-sdk-rust#844). You could try increasing the log level of aws_smithy_http to, say, debug or trace but I realize it'd be pretty chatty and so might not be feasible.
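For reference, the module-level filter the reporter already uses (logLevel in the Helm values, or the VECTOR_LOG environment variable) can be raised only for the AWS SDK crates so the rest of Vector stays at info. A minimal sketch, assuming the Helm-style deployment shown in the config above; the exact module targets are whatever shows up in your logs:

# Hypothetical Helm values snippet; raises only the AWS SDK modules.
logLevel: "info,aws_smithy_http=trace,aws_smithy_runtime=debug"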

@jszwedko jszwedko added the source: aws_s3 Anything `aws_s3` source related label Apr 18, 2024
@hoazgazh

hoazgazh commented May 4, 2024

Same here, has anyone resolved this?

2024-05-04T17:07:15.594521Z INFO vector::topology::running: Running healthchecks.
2024-05-04T17:07:15.595648Z INFO vector: Vector has started. debug="false" version="0.37.1" arch="x86_64" revision="cb6635a 2024-04-09 13:45:06.561412437"
2024-05-04T17:07:15.595990Z INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2024-05-04T17:07:15.609041Z INFO source{component_kind="source" component_id=my_source_id component_type=aws_s3}:lazy_load_identity: aws_smithy_runtime::client::identity::cache::lazy: identity cache miss occurred; added new identity (took 185µs) new_expiration=2024-05-04T17:22:15.607636Z valid_for=899.998602s partition=IdentityCachePartition(2)
2024-05-04T17:07:15.826752Z INFO vector::topology::builder: Healthcheck passed.
2024-05-04T17:07:15.860011Z ERROR source{component_kind="source" component_id=my_source_id component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
2024-05-04T17:07:15.907364Z ERROR source{component_kind="source" component_id=my_source_id component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.

@fpytloun
Contributor

Also happens with the S3 sink:

2024-05-22T09:40:24.026717Z  WARN sink{component_kind="sink" component_id=out_kafka_access_s3 component_type=aws_s3}:request{request_id=23796}: vector::sinks::util::retries: Retrying after error. error=dispatch failure internal_log_rate_limit=true

@seluard

seluard commented Jun 12, 2024

I think all of these are related to how the AWS SDK handles retries and failures for AWS IAM.

#20266
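If the failures really do come from the SDK's credential-refresh path, one thing that can be tried from the Vector side is the source-level auth block: give the credentials provider more time to load, or temporarily switch to static keys to rule the provider chain out. A hedged sketch only; the option names (in particular auth.load_timeout_secs) should be checked against the Vector docs for the version in use:

sources:
  s3:
    type: aws_s3
    region: eu-west-1
    auth:
      # Assumed option: allow the credential provider chain (IMDS/STS) more
      # time to answer before the request is abandoned.
      load_timeout_secs: 30
    sqs:
      queue_url: https://sqs.eu-west-1.amazonaws.com/xxx/xxx-xxx
      delete_message: true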

@Brightside56
Author

> Hmm, yeah, that is odd. I think that sort of error would be coming from the AWS SDK (similar issue, but not for S3: awslabs/aws-sdk-rust#844). You could try increasing the log level of aws_smithy_http to, say, debug or trace but I realize it'd be pretty chatty and so might not be feasible.

I've tried, but the messages weren't any more verbose in this specific area, even with aws_smithy_http=trace.

@biatwc

biatwc commented Oct 22, 2024

Hello, I'm seeing something similar with the S3 source and SQS. It happens every 6 hours or so most of the time, though sometimes there is a 12-hour gap between errors. Vector is on an EC2 instance using an instance profile, and the queue and bucket are in a different account from this EC2 instance. At the bottom are some debug logs from when the error occurs; earlier errors from before debug logging was enabled had different content, like:

error=Failed to fetch s3://<object url>: dispatch failure error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing"
aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving"

When this error happens the queue metrics show messages living on the queue for 200+ seconds, but they do eventually get consumed (at least I think that is what is happening; maybe they are being dropped and events lost, but I do tell Vector not to delete failed messages).
The source is configured like this:

type: aws_s3
region: eu-central-1
sqs:
  queue_url: "{{ queue_url }}"
  delete_failed_message: false

If there is any advice on config that could alleviate this issue, please let me know, or if there is an upcoming release that aims to fix it.
If this continues I'll have to switch to static creds and see how things look using them.
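One configuration-side knob that might soften the "messages in flight for minutes" symptom is the SQS polling options on the source. A hedged sketch, assuming the visibility_timeout_secs and poll_secs options documented for the aws_s3 source; the values are illustrative:

type: aws_s3
region: eu-central-1
sqs:
  queue_url: "{{ queue_url }}"
  delete_failed_message: false
  # Shorter visibility timeout: a message whose S3 fetch failed transiently
  # becomes visible again (and is retried) sooner.
  visibility_timeout_secs: 120
  # Interval between ReceiveMessage long polls.
  poll_secs: 15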

Debug logs:

"2024-10-22T10:48:48.918729Z DEBUG hyper::client::connect::dns: resolving host=""sqs.eu-central-1.amazonaws.com""
"2024-10-22T10:48:48.918694Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime::client::http::body::minimum_throughput: applying minimum upload throughput check future options=MinimumThroughputBodyOptions { minimum_throughput: Throughput { bytes_read: 1, per_time_elapsed: 1s }, grace_period: 5s, check_window: 1s }"
"2024-10-22T10:48:48.918682Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'transmit' phase"
"2024-10-22T10:48:48.918620Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime::client::identity::cache::lazy: loaded identity from cache buffer_time=10s cached_expiration=Some(SystemTime { tv_sec: 1729594138, tv_nsec: 0 }) now=SystemTime { tv_sec: 1729594128, tv_nsec: 918617969 }"
"2024-10-22T10:48:48.918602Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime::client::orchestrator::endpoints: will use endpoint Endpoint { url: ""https://sqs.eu-central-1.amazonaws.com"", headers: {}, properties: {} }"
"2024-10-22T10:48:48.918595Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op:try_attempt: aws_smithy_runtime::client::orchestrator::endpoints: resolving endpoint endpoint_params=EndpointResolverParams(TypeErasedBox[!Clone]:Params { region: Some(""eu-central-1""), use_dual_stack: false, use_fips: false, endpoint: None }) endpoint_prefix=None"
"2024-10-22T10:48:48.918589Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime::client::orchestrator: beginning attempt #1"
"2024-10-22T10:48:48.918584Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime::client::orchestrator: retry strategy has OKed initial request"
"2024-10-22T10:48:48.918580Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime::client::retries::strategy::standard: no client rate limiter configured, so no token is required for the initial request."
"2024-10-22T10:48:48.918573Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime_api::client::interceptors::context: entering 'before transmit' phase"
"2024-10-22T10:48:48.918562Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:try_op: aws_smithy_runtime_api::client::interceptors::context: entering 'serialization' phase"
"2024-10-22T10:48:48.918551Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8349334}:apply_configuration: aws_smithy_runtime::client::orchestrator: timeout settings for this operation: TimeoutConfig { connect_timeout: Disabled, read_timeout: Disabled, operation_timeout: Disabled, operation_attempt_timeout: Unset }"
"2024-10-22T10:48:48.918519Z ERROR source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code=""failed_fetching_sqs_events"" error_type=""request_failed"" stage=""receiving"" internal_log_rate_limit=true"
"2024-10-22T10:48:48.918515Z ERROR source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] has been suppressed 414 times."
"2024-10-22T10:48:48.918486Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op: aws_smithy_runtime::client::orchestrator: a retry is either unnecessary or not possible, exiting attempt loop"
"2024-10-22T10:48:48.918481Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op: aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=1 max_attempts=1"
"2024-10-22T10:48:48.918468Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'after deserialization' phase"
"2024-10-22T10:48:48.918452Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op:try_attempt:deserialization:deserialize_nonstreaming: aws_sdk_sqs::operation::receive_message: request_id=Some(""2293170b-6dda-521e-8f04-489ae881545b"")"
"2024-10-22T10:48:48.918435Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'deserialization' phase"
"2024-10-22T10:48:48.918414Z DEBUG source{component_kind=""source"" component_id=filters_s3_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=8819242}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'before deserialization' phase"
"2024-10-22T10:48:48.918287Z DEBUG hyper::proto::h1::conn: incoming body completed"
"2024-10-22T10:48:48.918260Z DEBUG hyper::proto::h1::conn: incoming body is content-length (125 bytes)"
"2024-10-22T10:48:48.918251Z DEBUG hyper::proto::h1::io: parsed 6 headers"
"2024-10-22T10:48:48.913970Z DEBUG hyper::proto::h1::io: flushed 2293 bytes"

@kamilkarczewski

Got the same issue. Running Vector on an EC2 instance, reading CloudTrail logs from S3+SQS as a source...
Got 1k of these errors in a minute...

Dec 12 07:26:16 ip-10-0-142-188 vector: 2024-12-12T07:26:16.516860Z ERROR source{component_kind="source" 
component_id=source_s3 component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Failed to process SQS message.
message_id=0a45ffd3-0dd8-40cb-b4e1-1aa41c4bce5d error=Failed to fetch s3://{REDACTED}.json.gz: service error
error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing" internal_log_rate_limit=true
Dec 12 07:26:16 ip-10-0-142-188 vector: 2024-12-12T07:26:16.636846Z ERROR source{component_kind="source"
component_id=source_s3 component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Internal log[Failed to process SQS
message.] is being suppressed to avoid flooding.

@chlunde

chlunde commented Dec 17, 2024

It would be nice if the ERROR message here captured some more information about the cause.
Another issue is that it is constantly flooding the network with DNS lookups and API calls, around 3 Gbit/s. There should be some backoff here.

vector-0 vector 2024-12-17T07:42:42.401349Z  INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
vector-0 vector 2024-12-17T07:42:42.459117Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
vector-0 vector 2024-12-17T07:42:42.460024Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.
vector-0 vector 2024-12-17T07:42:42.494630Z  INFO vector::topology::builder: Healthcheck passed.
vector-0 vector 2024-12-17T07:42:52.459938Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] has been suppressed 5610 times.
vector-0 vector 2024-12-17T07:42:52.459979Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
vector-0 vector 2024-12-17T07:42:52.460620Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.
vector-0 vector 2024-12-17T07:43:02.462471Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] has been suppressed 5839 times.
vector-0 vector 2024-12-17T07:43:02.462511Z ERROR source{component_kind="source" component_id=sqs_source component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true

@chlunde

chlunde commented Dec 17, 2024

With TRACE logging, I found this error:

2024-12-17T08:22:58.980007Z TRACE source{component_kind="source" component_id=sqs_source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage sdk_invocation_id=6213509}:try_op:try_attempt:deserialization:deserialize_nonstreaming: aws_smithy_runtime::client::orchestrator::http: read HTTP response body response=Response { status: StatusCode(403), headers: Headers { headers: {"x-amzn-requestid": HeaderValue { _private: H0("9e280947-....") }, "x-amzn-query-error": HeaderValue { _private: H0("AccessDenied;Sender") }, "connection": HeaderValue { _private: H0("close") }, "date": HeaderValue { _private: H0("Tue, 17 Dec 2024 08:22:58 GMT") }, "content-type": HeaderValue { _private: H0("application/x-amz-json-1.0") }, "content-length": HeaderValue { _private: H0("349") }} }, body: SdkBody { inner: Once(Some(b"{\"__type\":\"com.amazon.coral.service#AccessDeniedException\",\"Message\":\"User: arn:aws:sts::nnnnn:assumed-role/rrrrrrrrrrrrrr/web-identity-token-1734423722539 is not authorized to perform: sqs:receivemessage on resource: arn:aws:sqs:eu-north-1:nnnn:vector-logs with an explicit deny in a resource-based policy\"}")), retryable: true }, extensions: Extensions { extensions_02x: Extensions, extensions_1x: Extensions } }

It would be nice if:

  • the remote error message could be part of the normal ERROR log statement
  • retries were throttled
  • and/or DNS lookups were cached
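On the policy side, the 403 above means the queue's resource-based policy has an explicit deny matching the assumed role, so that deny statement (or its condition) has to be adjusted; an additional allow alone will not override it. For reference, a minimal allowing statement for the consumer role looks roughly like this CloudFormation-style sketch (resource names and ARNs are placeholders, and the actions shown are typical for an SQS consumer):

# Hypothetical queue policy for the Vector consumer role; adjust actions to your setup.
VectorQueuePolicy:
  Type: AWS::SQS::QueuePolicy
  Properties:
    Queues:
      - !Ref VectorLogsQueue            # placeholder queue resource
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Sid: AllowVectorConsume
          Effect: Allow
          Principal:
            AWS: arn:aws:iam::111111111111:role/vector-logs-role   # placeholder role ARN
          Action:
            - sqs:ReceiveMessage
            - sqs:DeleteMessage
            - sqs:GetQueueAttributes
          Resource: !GetAtt VectorLogsQueue.Arn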
