Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle EOF on single line content #33568

Merged
merged 24 commits into from
Nov 30, 2022
Merged

handle EOF on single line content #33568

merged 24 commits into from
Nov 30, 2022

Conversation

aspacca
Copy link
Contributor

@aspacca aspacca commented Nov 3, 2022

Bug

What does this PR do?

Closes #30436

Why is it important?

Checklist

  • My code follows the style guidelines of this project
    - [ ] I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@aspacca aspacca requested a review from a team as a code owner November 3, 2022 07:45
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 3, 2022
@aspacca aspacca self-assigned this Nov 3, 2022
@aspacca aspacca added Team:Cloud-Monitoring Label for the Cloud Monitoring team and removed needs_team Indicates that the issue/PR needs a Team:* label labels Nov 3, 2022
@mergify
Copy link
Contributor

mergify bot commented Nov 3, 2022

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @aspacca? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Nov 3, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-11-30T03:12:03.372+0000

  • Duration: 138 min 22 sec

Test stats 🧪

Test Results
Failed 0
Passed 27570
Skipped 2127
Total 29697

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@aspacca aspacca requested a review from a team as a code owner November 3, 2022 13:48
@aspacca aspacca requested review from fearful-symmetry and leehinman and removed request for a team November 3, 2022 13:48
Copy link
Contributor

@leehinman leehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like some more unit tests. I'm specifically worried about the case where a partial event is written to the file, the reader reads that, then the rest of event is written to the file. The current code will give us one event, because it waits for the separator. I'd like verification that the proposed code does that as well.

@aspacca aspacca marked this pull request as draft November 4, 2022 03:19
@aspacca
Copy link
Contributor Author

aspacca commented Nov 4, 2022

I'd like some more unit tests. I'm specifically worried about the case where a partial event is written to the file, the reader reads that, then the rest of event is written to the file. The current code will give us one event, because it waits for the separator. I'd like verification that the proposed code does that as well.

hello @leehinman, totally agree with that!

I thought that the bug could be handled only in the aws s3 input part (see 26eb38f)

then I saw the test failing and realised that it's more complicated, I've moved the PR to draft

basically we have to find how to handle this and this

the problem is that in the second err check, when we have n = 0, err = io.EOF from the next iteration, we return from the loop, and we fail to collect the buffer here

so we receive an empty message in the input anyway

I'm thinking about changing the approach of the bugfix and do a check in the input if the content contains newline separator at all, if not, let's not attach the linereader, what do you think about it? it's safer but going to be more expensive in terms of performance. a trade off could be to fallback to a non linereader only if zero events were emitted

@leehinman
Copy link
Contributor

I'm thinking about changing the approach of the bugfix and do a check in the input if the content contains newline separator at all, if not, let's not attach the linereader, what do you think about it? it's safer but going to be more expensive in terms of performance. a trade off could be to fallback to a non linereader only if zero events were emitted

I think the root of the problem is the current linereader is expecting data to be appended, so EOF doesn't mean the last data you could ever receive, just last for now. But in the case of s3 input, we actually have all the data, so EOF means the last data you could ever receive.

Could we maybe just have a flag where the line reader will treat EOF like a line separator? That way it would work for data that has no line separators and ones where the first n-1 events have line separators but the nth one doesn't. I'm having a hard time coming up with a good heuristic for determining append vs non-append case, which is why I'm leaning towards a flag.

@mergify
Copy link
Contributor

mergify bot commented Nov 8, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b issue-30436 upstream/issue-30436
git merge upstream/main
git push upstream issue-30436

@aspacca
Copy link
Contributor Author

aspacca commented Nov 9, 2022

@leehinman please, look a the current solution

if you are not satisfied I will add the flag for EOF

my only concern in the last commit I've pushed is that I have to use a TeeReader, and that can be very expensive

@aspacca
Copy link
Contributor Author

aspacca commented Nov 22, 2022

hi @leehinman , did you have the chance to review my changes? thanks

Copy link
Contributor

@leehinman leehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aspacca aspacca added backport-v8.5.0 Automated backport with mergify backport-v8.6.0 Automated backport with mergify labels Nov 30, 2022
@aspacca aspacca merged commit 7b45320 into elastic:main Nov 30, 2022
mergify bot pushed a commit that referenced this pull request Nov 30, 2022
* handle EOF on single line content

* changelog

* fallback to encode_eof if no events in aws-s3 input

* lint

* lint

* collect on EOF in line reader

* remove encode eof

* remove iterN

* fix test

* increase test coverage

* linting

* more linting

* increase coverage

(cherry picked from commit 7b45320)
mergify bot pushed a commit that referenced this pull request Nov 30, 2022
* handle EOF on single line content

* changelog

* fallback to encode_eof if no events in aws-s3 input

* lint

* lint

* collect on EOF in line reader

* remove encode eof

* remove iterN

* fix test

* increase test coverage

* linting

* more linting

* increase coverage

(cherry picked from commit 7b45320)
aspacca pushed a commit that referenced this pull request Nov 30, 2022
* handle EOF on single line content

* changelog

* fallback to encode_eof if no events in aws-s3 input

* lint

* lint

* collect on EOF in line reader

* remove encode eof

* remove iterN

* fix test

* increase test coverage

* linting

* more linting

* increase coverage

(cherry picked from commit 7b45320)

Co-authored-by: Andrea Spacca <andrea.spacca@elastic.co>
aspacca pushed a commit that referenced this pull request Dec 1, 2022
* handle EOF on single line content (#33568)

* handle EOF on single line content

* changelog

* fallback to encode_eof if no events in aws-s3 input

* lint

* lint

* collect on EOF in line reader

* remove encode eof

* remove iterN

* fix test

* increase test coverage

* linting

* more linting

* increase coverage

(cherry picked from commit 7b45320)

* changelog

Co-authored-by: Andrea Spacca <andrea.spacca@elastic.co>
@zmoog zmoog added the backport-7.17 Automated backport to the 7.17 branch with mergify label Apr 1, 2023
mergify bot pushed a commit that referenced this pull request Apr 1, 2023
* handle EOF on single line content

* changelog

* fallback to encode_eof if no events in aws-s3 input

* lint

* lint

* collect on EOF in line reader

* remove encode eof

* remove iterN

* fix test

* increase test coverage

* linting

* more linting

* increase coverage

(cherry picked from commit 7b45320)

# Conflicts:
#	libbeat/reader/readfile/line.go
#	libbeat/reader/readfile/line_test.go
#	x-pack/filebeat/input/awss3/s3_objects.go
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* handle EOF on single line content

* changelog

* fallback to encode_eof if no events in aws-s3 input

* lint

* lint

* collect on EOF in line reader

* remove encode eof

* remove iterN

* fix test

* increase test coverage

* linting

* more linting

* increase coverage
tdancheva added a commit that referenced this pull request Jun 5, 2023
* handle EOF on single line content (#33568)

* handle EOF on single line content

* changelog

* fallback to encode_eof if no events in aws-s3 input

* lint

* lint

* collect on EOF in line reader

* remove encode eof

* remove iterN

* fix test

* increase test coverage

* linting

* more linting

* increase coverage

(cherry picked from commit 7b45320)

# Conflicts:
#	libbeat/reader/readfile/line.go
#	libbeat/reader/readfile/line_test.go
#	x-pack/filebeat/input/awss3/s3_objects.go

* Fix conflicts

* Fix failing test - TestMaxBytesLimit

* Fix #2 failing test - TestMaxBytesLimit

* Fix failing test checks

* Fix linter errors

* Fix typo

* Fix linter errors #2

* Fix linter errors #3

* Fix linter errors #4

* Fix linter errors #5

* Changelog clean up

* Change order of publish event

---------

Co-authored-by: Andrea Spacca <andrea.spacca@elastic.co>
Co-authored-by: Tamara Dancheva <tamara.dancheva@elastic.co>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.5.0 Automated backport with mergify backport-v8.6.0 Automated backport with mergify Team:Cloud-Monitoring Label for the Cloud Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Filebeat] aws-s3 drops data when files do not end with EOL
5 participants