aws_s3 object sizes written by sink do not align with batch vector.yaml configuration settings. #21087

Closed
bennettbri62 opened this issue Aug 16, 2024 · 1 comment
Labels
type: bug A code related bug.

Comments

@bennettbri62

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Expectations:

  1. Since batch.max_bytes is set to 104,857,600 and batch.timeout_secs defaults to 300 seconds, each decompressed S3 object should be approximately 104,857,600 bytes.

Results:

  1. Each S3 object, decompressed, is approximately 56,567,699 bytes and contains approximately 55,187 events.
  2. Objects are being written every 45 seconds to 1 minute, so the 300-second batch timeout does not appear to be what triggers each write.

If we add batch.max_events: 1000000 (a sketch of this variant follows the configuration below), we get the same results for S3 object sizes and write intervals, but we also get:
2024-08-16T01:06:31.275065Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="file_in, s3_sink" time_remaining="59 seconds left"
2024-08-16T01:06:36.272169Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="file_in, s3_sink" time_remaining="54 seconds left"
...
2024-08-16T01:07:31.272169Z ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="s3_sink"

Additionally, I've tested replacing the file source with Kafka ingestion, and we see significantly smaller S3 objects being written; it APPEARS that the choice of source influences the sink's batching.
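
For context, a Kafka variant of the source would look roughly like the sketch below (the broker, consumer group, and topic values are hypothetical placeholders; the actual Kafka settings are not included in this report):

sources:
  kafka_in:
    type: kafka
    bootstrap_servers: "broker-1:9092"   # hypothetical broker address
    group_id: "vector-s3-test"           # hypothetical consumer group
    topics:
      - "logs"                           # hypothetical topic name
# the s3_sink would then list kafka_in under inputs instead of file_in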

Configuration

sources:
  file_in:
    type: file
    include:
      - /home/xxxxxxxxxx/vector_logs/log2.txt
sinks:
  s3_sink:
    type: aws_s3
    inputs:
      - file_in
    bucket: (bucket)
    region: "us-xxxx-x"
    compression: "gzip"
    acl: "bucket-owner-full-control"
    encoding:
      codec: "text"
    storage_class: "STANDARD"
    server_side_encryption: "AES256"
    batch:
      max_bytes: 104857600 # 100 MiB
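
The batch.max_events variant mentioned under Problem only changes the batch block of the sink; roughly (a sketch based on the value quoted above):

batch:
  max_bytes: 104857600   # 100 MiB, as above
  max_events: 1000000    # the value used in the test described under Problem
  # timeout_secs is not set, so it stays at its default (300 seconds, per the report above)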

Version

vector 0.40.0 (x86_64-unknown-linux-gnu 1167aa9 2024-07-29 15:08:44.028365803)

Debug Output

I've used -v on the vector command invocation and can see WHEN the S3 objects are being written, but I can't spot what is TRIGGERING each write. If you can advise on what to look for, I'm happy to generate/analyze/post that output.

Example Data

Resulting S3 objects:
2024-08-15 19:18:18 42640656 date=2024-08-161723767488-c3fb1dd1-5685-4315-9fad-b2413daf49f3.log.gz
2024-08-15 19:19:03 42640750 date=2024-08-161723767490-e095afd0-0b62-45d2-9519-5e010ceadd56.log.gz
2024-08-15 19:19:39 42640723 date=2024-08-161723767538-06616895-5088-44e8-9229-f34db6780ba4.log.gz
2024-08-15 19:20:23 42640715 date=2024-08-161723767541-655b9394-a8b1-4663-bd8c-951c615054ee.log.gz
2024-08-15 19:21:00 42640687 date=2024-08-161723767620-d75c5ece-b0ff-40de-a248-81e1a18de25b.log.gz
2024-08-15 19:21:44 42640925 date=2024-08-161723767623-be214765-891f-4f6c-a309-385abd2c9eaa.log.gz
2024-08-15 19:22:21 42640596 date=2024-08-161723767700-0222517e-48aa-4c9d-96ef-6b25d4cc2fb3.log.gz
...

Additional Context

No response

References

#19759 #14416 #10020 #3829

@bennettbri62 added the type: bug label on Aug 16, 2024
@jszwedko
Member

Hi @bennettbri62 ,

Thanks for opening this issue. I think there are a few things happening here:

  • As mentioned in the issue you found ("Vector batch bytes limits are based on in-memory sizing of events", #10020), the batch sizing is frequently off by a sizable margin because it uses the in-memory size of the event, rather than the serialized size, to size the batches. This is a known issue that we haven't been able to address yet (a rough workaround for the mismatch is sketched below).
  • I'm guessing that the object sizes differ between the kafka source and the file source because either the in-memory size of the events differs (more metadata for the kafka source), or consumption from Kafka is slower than from the file source, so you are hitting the batch timeout rather than the batch size limit.

I think this issue is covered by #10020 and so we could close this one as a duplicate, but let me know if you think there is something different happening in your case and I can reopen this one.
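
As a rough workaround sketch only (not a documented fix), the in-memory vs. serialized mismatch implied by the numbers above could be compensated for by over-provisioning batch.max_bytes, for example:

batch:
  # Per the explanation above, max_bytes is compared against the in-memory size of
  # events, which in this report ran roughly 1.85x the serialized size
  # (104,857,600 configured vs ~56,567,699 written). Over-provisioning the limit
  # is a rough workaround sketch, not a guaranteed or documented behavior.
  max_bytes: 188743680   # ~180 MiB in-memory budget, aiming for ~100 MiB serialized
  timeout_secs: 300      # the default, shown explicitly as the time-based backstop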

@jszwedko closed this as not planned on Aug 16, 2024