aws_s3 object sizes written by sink do not align with batch vector.yaml configuration settings. #21087

Closed
bennettbri62 opened this issue Aug 16, 2024 · 1 comment
Labels
type: bug A code related bug.

Comments

@bennettbri62

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Expectations:

  1. Since batch.max_bytes is set to 104,857,600 and batch.timeout_secs defaults to 300 seconds, each decompressed S3 object should be approximately 104,857,600 bytes.

Results:

  1. Each S3 object, decompressed, is approximately 56,567,699 bytes and contains approximately 55,187 events.
  2. Objects are being written every 45 seconds to 1 minute, so the 300-second batch timeout does not appear to be what triggers each write.

If we add batch.max_events: 1000000 (a sketch of this variant follows the configuration below), we get the same results for S3 object sizes and write intervals, but we also get:
2024-08-16T01:06:31.275065Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="file_in, s3_sink" time_remaining="59 seconds left"
2024-08-16T01:06:36.272169Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="file_in, s3_sink" time_remaining="54 seconds left"
...
2024-08-16T01:07:31.272169Z ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="s3_sink"

Additionally, I've tested replacing the file source with Kafka ingestion, and we see significantly smaller S3 objects being written; it APPEARS that the choice of source influences the sink's batching.
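
For context, a Kafka variant of the source would look roughly like the sketch below (the broker, consumer group, and topic values are hypothetical placeholders; the actual Kafka settings are not included in this report):

sources:
  kafka_in:
    type: kafka
    bootstrap_servers: "broker-1:9092"   # hypothetical broker address
    group_id: "vector-s3-test"           # hypothetical consumer group
    topics:
      - "logs"                           # hypothetical topic name
# the s3_sink would then list kafka_in under inputs instead of file_in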

Configuration

sources:
  file_in:
    type: file
    include:
      - /home/xxxxxxxxxx/vector_logs/log2.txt
sinks:
  s3_sink:
    type: aws_s3
    inputs:
      - file_in
    bucket: (bucket)
    region: "us-xxxx-x"
    compression: "gzip"
    acl: "bucket-owner-full-control"
    encoding:
      codec: "text"
    storage_class: "STANDARD"
    server_side_encryption: "AES256"
    batch:
      max_bytes: 104857600 # 100 MiB
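
The batch.max_events variant mentioned under Problem only changes the batch block of the sink; roughly (a sketch based on the value quoted above):

batch:
  max_bytes: 104857600   # 100 MiB, as above
  max_events: 1000000    # the value used in the test described under Problem
  # timeout_secs is not set, so it stays at its default (300 seconds, per the report above)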

Version

vector 0.40.0 (x86_64-unknown-linux-gnu 1167aa9 2024-07-29 15:08:44.028365803)

Debug Output

I've used -v on the vector command invocation and can see WHEN the S3 objects are being written, but I can't spot what is TRIGGERING each write. If you can advise on what to look for, I'm happy to generate/analyze/post that output.

Example Data

Resulting S3 objects:
2024-08-15 19:18:18 42640656 date=2024-08-161723767488-c3fb1dd1-5685-4315-9fad-b2413daf49f3.log.gz
2024-08-15 19:19:03 42640750 date=2024-08-161723767490-e095afd0-0b62-45d2-9519-5e010ceadd56.log.gz
2024-08-15 19:19:39 42640723 date=2024-08-161723767538-06616895-5088-44e8-9229-f34db6780ba4.log.gz
2024-08-15 19:20:23 42640715 date=2024-08-161723767541-655b9394-a8b1-4663-bd8c-951c615054ee.log.gz
2024-08-15 19:21:00 42640687 date=2024-08-161723767620-d75c5ece-b0ff-40de-a248-81e1a18de25b.log.gz
2024-08-15 19:21:44 42640925 date=2024-08-161723767623-be214765-891f-4f6c-a309-385abd2c9eaa.log.gz
2024-08-15 19:22:21 42640596 date=2024-08-161723767700-0222517e-48aa-4c9d-96ef-6b25d4cc2fb3.log.gz
...

Additional Context

No response

References

#19759 #14416 #10020 #3829

@bennettbri62 added the type: bug label on Aug 16, 2024
@jszwedko
Member

Hi @bennettbri62 ,

Thanks for opening this issue. I think there are a few things happening here:

  • As mentioned in the issue you found ("Vector batch bytes limits are based on in-memory sizing of events", #10020), the batch sizing is frequently off by a sizable margin because it uses the in-memory size of the event, rather than the serialized size, to size the batches. This is a known issue that we haven't been able to address yet (a rough workaround for the mismatch is sketched below).
  • I'm guessing that the object sizes differ between the kafka source and the file source because either the in-memory size of the events differs (more metadata for the kafka source), or consumption from Kafka is slower than from the file source, so you are hitting the batch timeout rather than the batch size limit.

I think this issue is covered by #10020 and so we could close this one as a duplicate, but let me know if you think there is something different happening in your case and I can reopen this one.
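
As a rough workaround sketch only (not a documented fix), the in-memory vs. serialized mismatch implied by the numbers above could be compensated for by over-provisioning batch.max_bytes, for example:

batch:
  # Per the explanation above, max_bytes is compared against the in-memory size of
  # events, which in this report ran roughly 1.85x the serialized size
  # (104,857,600 configured vs ~56,567,699 written). Over-provisioning the limit
  # is a rough workaround sketch, not a guaranteed or documented behavior.
  max_bytes: 188743680   # ~180 MiB in-memory budget, aiming for ~100 MiB serialized
  timeout_secs: 300      # the default, shown explicitly as the time-based backstop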

@jszwedko closed this as not planned on Aug 16, 2024