fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050

Open
wants to merge 4 commits into
base: master

Conversation

roykim98

@roykim98 roykim98 commented Dec 18, 2024

Summary

This PR addresses #13193.

This PR handles fingerprinting of compressed files. It also covers a log-rotation case in which an uncompressed active file is monitored and then rotated by a log rotation service into a compressed file with a new inode. Using the uncompressed lines as the source of truth for the fingerprint prevents the rotated file's messages from being processed twice.

BREAKING CHANGE: When sourcing from compressed files, ignored_header_bytes no longer applies to the compressed file's bytes, which would include the magic bytes of the compression header. Instead, it skips bytes of the uncompressed content. Similarly, lines no longer looks for newline delimiters in the compressed content, but in the uncompressed content. Arguably, both current behaviors are bugs, since compressed content has no explicit lines or intentional header aside from the magic bytes.
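
To illustrate the intended behavior, here is a minimal sketch of a fingerprint computed over uncompressed content. It is not the code in this PR: `is_gzip`, `fnv1a`, and `fingerprint_lines` are hypothetical names, FNV-1a is a stand-in for whatever checksum the file source actually uses, and decompression is left to the caller (for example, by wrapping the file in a gzip decoder such as the `flate2` crate's `GzDecoder` before passing it in).

```rust
use std::io::{self, BufRead, Read};

/// Gzip streams begin with the magic bytes 0x1f 0x8b (RFC 1952).
/// A caller could use this to decide whether to wrap the file in a decoder.
fn is_gzip(header: &[u8]) -> bool {
    header.starts_with(&[0x1f, 0x8b])
}

/// FNV-1a 64-bit hash. Assumption: a stand-in checksum for this sketch only.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Fingerprints the first `lines` lines of `reader`, after skipping
/// `ignored_header_bytes` bytes. `reader` is assumed to yield
/// *uncompressed* content, so both settings apply to the logical data
/// rather than to the compressed byte stream.
/// Returns `None` when too few complete lines are available yet.
fn fingerprint_lines<R: BufRead>(
    mut reader: R,
    ignored_header_bytes: u64,
    lines: usize,
) -> io::Result<Option<u64>> {
    // Discard the ignored header from the uncompressed stream.
    let skipped = io::copy(
        &mut reader.by_ref().take(ignored_header_bytes),
        &mut io::sink(),
    )?;
    if skipped < ignored_header_bytes {
        return Ok(None);
    }

    // Collect exactly `lines` newline-terminated lines.
    let mut buf = Vec::new();
    for _ in 0..lines {
        let n = reader.read_until(b'\n', &mut buf)?;
        if n == 0 || buf.last() != Some(&b'\n') {
            return Ok(None);
        }
    }
    Ok(Some(fnv1a(&buf)))
}
```

Under these assumptions, an active file and its rotated, decompressed counterpart produce the same fingerprint as long as their first `lines` lines match, regardless of inode or compression header.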

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Unit tests

cargo test --package file-source --lib -- fingerprinter::test --show-output

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

References

@bits-bot

bits-bot commented Dec 18, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the domain: external docs Anything related to Vector's external, public documentation label Dec 18, 2024
@roykim98 roykim98 force-pushed the roy/fingerprint-rotation branch from ccdfe60 to ae39cc0 Compare December 18, 2024 06:32
@github-actions github-actions bot added the domain: sources Anything related to the Vector's sources label Dec 18, 2024
@roykim98 roykim98 marked this pull request as ready for review December 18, 2024 07:12
@roykim98 roykim98 requested review from a team as code owners December 18, 2024 07:12
Labels
domain: external docs (Anything related to Vector's external, public documentation)
domain: sources (Anything related to the Vector's sources)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

file source: checksum fingerprint is not correct with gzipped files
2 participants