
Add tests and support for truncated files in filestream input #24424

Merged
merged 4 commits into elastic:master from test-filebeat-filestream-trucated-files on Apr 7, 2021

Conversation

kvch
Contributor

@kvch kvch commented Mar 8, 2021

What does this PR do?

Adds support for truncated files and migrates related tests from test_harvester.py. It also adds more tests to cover the following cases:

  • truncation is detected only by the Prospector
  • truncation is detected first by the Harvester then by the Prospector
  • truncation is detected first by the Prospector then by the Harvester
  • file gets truncated when the output is not able to accept events

Why is it important?

Support for stopping reading from truncated files was already implemented. However, the filestream input could not start reading the file again from the beginning.

A new file system event called OpTruncate is added. When the size of a file has shrunk compared to the last time the scanner encountered it, an OpTruncate event is emitted. When the prospector gets this event, the HarvesterGroup restarts the Harvester of the file. Restarting basically means that the new Harvester cancels the previous reader and starts a new one.
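A minimal sketch of the size-comparison idea in Go (the `fsEvent` type, the `"truncate"` op string, and `scanOnce` are illustrative stand-ins, not the actual filestream scanner code):

```go
package main

import (
	"fmt"
	"os"
)

// fsEvent is an illustrative stand-in for the scanner's file system event;
// the real filestream scanner emits its own event type whose Op now includes OpTruncate.
type fsEvent struct {
	op   string
	path string
}

// scanOnce compares the current size of each known file with the size seen
// on the previous scan and emits a "truncate" event when the file shrank.
// It mirrors the idea described above, not the actual implementation.
func scanOnce(prevSizes map[string]int64, paths []string, events chan<- fsEvent) {
	for _, p := range paths {
		fi, err := os.Stat(p)
		if err != nil {
			continue
		}
		if prev, seen := prevSizes[p]; seen && fi.Size() < prev {
			events <- fsEvent{op: "truncate", path: p}
		}
		prevSizes[p] = fi.Size()
	}
}

func main() {
	events := make(chan fsEvent, 8)
	prevSizes := map[string]int64{}

	// First scan records the size; a later scan that finds a smaller file
	// emits the event, and the prospector would restart the harvester.
	scanOnce(prevSizes, []string{"example.log"}, events)
	scanOnce(prevSizes, []string{"example.log"}, events)

	select {
	case ev := <-events:
		fmt.Println("emit", ev.op, "for", ev.path)
	default:
		fmt.Println("no truncation detected")
	}
}
```

In the real input, the prospector reacts to the truncation event by asking the HarvesterGroup to restart the harvester for that source.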

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Mar 8, 2021
@kvch kvch added the Team:Elastic-Agent (Label for the Agent team) label Mar 8, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Mar 8, 2021
@kvch kvch requested a review from urso March 8, 2021 17:15
@elasticmachine
Collaborator

elasticmachine commented Mar 8, 2021

❕ Build Aborted

Either there was a build timeout or someone aborted the build.



Build stats

  • Build Cause: Pull request #24424 updated

  • Start Time: 2021-04-06T08:59:32.471+0000

  • Duration: 183 min 35 sec

  • Commit: f4e4bb1

Test stats 🧪

Test Results

  • Failed: 0
  • Passed: 12368
  • Skipped: 2085
  • Total: 14453

Trends 🧪

[Trend charts: build times and test counts]

Log output

Last 100 lines of log output:

[2021-04-06T11:20:56.453Z] 9.85s call     x-pack/filebeat/tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_034_cisco
[2021-04-06T11:20:56.453Z] 9.74s call     x-pack/filebeat/tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_235_oracle
[2021-04-06T11:20:56.453Z] 9.72s call     x-pack/filebeat/tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_241_oracle
[2021-04-06T11:20:56.453Z] ======================= 323 passed in 1592.98s (0:26:32) =======================
[2021-04-06T11:20:56.453Z] >> python test: Integration Testing Complete
[2021-04-06T11:20:56.813Z] Cleaning up /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23
[2021-04-06T11:20:56.813Z] Client: Docker Engine - Community
[2021-04-06T11:20:56.813Z]  Version:           20.10.3
[2021-04-06T11:20:56.813Z]  API version:       1.41
[2021-04-06T11:20:56.813Z]  Go version:        go1.13.15
[2021-04-06T11:20:56.813Z]  Git commit:        48d30b5
[2021-04-06T11:20:56.813Z]  Built:             Fri Jan 29 14:33:13 2021
[2021-04-06T11:20:56.813Z]  OS/Arch:           linux/amd64
[2021-04-06T11:20:56.813Z]  Context:           default
[2021-04-06T11:20:56.813Z]  Experimental:      true
[2021-04-06T11:20:56.813Z] 
[2021-04-06T11:20:56.813Z] Server: Docker Engine - Community
[2021-04-06T11:20:56.813Z]  Engine:
[2021-04-06T11:20:56.813Z]   Version:          20.10.3
[2021-04-06T11:20:56.813Z]   API version:      1.41 (minimum version 1.12)
[2021-04-06T11:20:56.813Z]   Go version:       go1.13.15
[2021-04-06T11:20:56.813Z]   Git commit:       46229ca
[2021-04-06T11:20:56.813Z]   Built:            Fri Jan 29 14:31:25 2021
[2021-04-06T11:20:56.813Z]   OS/Arch:          linux/amd64
[2021-04-06T11:20:56.813Z]   Experimental:     false
[2021-04-06T11:20:56.813Z]  containerd:
[2021-04-06T11:20:56.813Z]   Version:          1.4.4
[2021-04-06T11:20:56.813Z]   GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
[2021-04-06T11:20:56.813Z]  runc:
[2021-04-06T11:20:56.813Z]   Version:          1.0.0-rc93
[2021-04-06T11:20:56.813Z]   GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
[2021-04-06T11:20:56.813Z]  docker-init:
[2021-04-06T11:20:56.813Z]   Version:          0.19.0
[2021-04-06T11:20:56.813Z]   GitCommit:        de40ad0
[2021-04-06T11:20:56.813Z] Change ownership of all files inside the specific folder from root/root to current user/group
[2021-04-06T11:20:56.813Z] Unable to find image 'alpine:3.4' locally
[2021-04-06T11:20:57.384Z] 3.4: Pulling from library/alpine
[2021-04-06T11:20:57.645Z] c1e54eec4b57: Pulling fs layer
[2021-04-06T11:20:57.904Z] c1e54eec4b57: Download complete
[2021-04-06T11:20:58.165Z] c1e54eec4b57: Pull complete
[2021-04-06T11:20:58.165Z] Digest: sha256:b733d4a32c4da6a00a84df2ca32791bb03df95400243648d8c539e7b4cce329c
[2021-04-06T11:20:58.165Z] Status: Downloaded newer image for alpine:3.4
[2021-04-06T11:21:00.079Z] Change permissions with write access of all files inside the specific folder
[2021-04-06T11:21:01.480Z] Running in /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23/src/github.com/elastic/beats/build
[2021-04-06T11:21:01.776Z] + rm -rf ve
[2021-04-06T11:21:01.776Z] + find . -type d -name vendor -exec rm -r {} ;
[2021-04-06T11:21:02.091Z] + python .ci/scripts/pre_archive_test.py
[2021-04-06T11:21:04.015Z] Copy ./x-pack/filebeat/build into build/x-pack/filebeat/build
[2021-04-06T11:21:04.028Z] Running in /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23/src/github.com/elastic/beats/build
[2021-04-06T11:21:04.044Z] Recording test results
[2021-04-06T11:21:05.436Z] [Checks API] No suitable checks publisher found.
[2021-04-06T11:21:05.776Z] + go clean -modcache
[2021-04-06T11:21:09.385Z] Cleaning up /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23
[2021-04-06T11:21:09.386Z] Client: Docker Engine - Community
[2021-04-06T11:21:09.386Z]  Version:           20.10.3
[2021-04-06T11:21:09.386Z]  API version:       1.41
[2021-04-06T11:21:09.386Z]  Go version:        go1.13.15
[2021-04-06T11:21:09.386Z]  Git commit:        48d30b5
[2021-04-06T11:21:09.386Z]  Built:             Fri Jan 29 14:33:13 2021
[2021-04-06T11:21:09.386Z]  OS/Arch:           linux/amd64
[2021-04-06T11:21:09.386Z]  Context:           default
[2021-04-06T11:21:09.386Z]  Experimental:      true
[2021-04-06T11:21:09.386Z] 
[2021-04-06T11:21:09.386Z] Server: Docker Engine - Community
[2021-04-06T11:21:09.386Z]  Engine:
[2021-04-06T11:21:09.386Z]   Version:          20.10.3
[2021-04-06T11:21:09.386Z]   API version:      1.41 (minimum version 1.12)
[2021-04-06T11:21:09.386Z]   Go version:       go1.13.15
[2021-04-06T11:21:09.386Z]   Git commit:       46229ca
[2021-04-06T11:21:09.386Z]   Built:            Fri Jan 29 14:31:25 2021
[2021-04-06T11:21:09.386Z]   OS/Arch:          linux/amd64
[2021-04-06T11:21:09.386Z]   Experimental:     false
[2021-04-06T11:21:09.386Z]  containerd:
[2021-04-06T11:21:09.386Z]   Version:          1.4.4
[2021-04-06T11:21:09.386Z]   GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
[2021-04-06T11:21:09.386Z]  runc:
[2021-04-06T11:21:09.386Z]   Version:          1.0.0-rc93
[2021-04-06T11:21:09.386Z]   GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
[2021-04-06T11:21:09.386Z]  docker-init:
[2021-04-06T11:21:09.386Z]   Version:          0.19.0
[2021-04-06T11:21:09.386Z]   GitCommit:        de40ad0
[2021-04-06T11:21:09.386Z] Change ownership of all files inside the specific folder from root/root to current user/group
[2021-04-06T11:21:15.965Z] Change permissions with write access of all files inside the specific folder
[2021-04-06T11:21:16.250Z] Running in /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23
[2021-04-06T11:21:21.217Z] + gsutil --version
[2021-04-06T11:21:22.650Z] Masking supported pattern matches of $FILE_CREDENTIAL
[2021-04-06T11:21:22.961Z] + gcloud auth activate-service-account --key-file ****
[2021-04-06T11:21:23.536Z] Activated service account credentials for: [beats-ci-gcs-plugin@elastic-ci-prod.iam.gserviceaccount.com]
[2021-04-06T11:21:23.854Z] + gsutil -m -q cp -a public-read eC1wYWNrL2ZpbGViZWF0LWJ1aWxkZjRlNGJiMWNmZTQxYTVlNTRiMWIzNmIzMDY1ZTJiNTgzMzBlN2E1Mg gs://beats-ci-temp/ci/cache/
[2021-04-06T12:02:05.950Z] Cancelling nested steps due to timeout
[2021-04-06T12:02:05.960Z] Failed in branch filebeat-windows-7-32-windows-7-32-bit
[2021-04-06T12:02:06.044Z] Stage "Packaging" skipped due to earlier failure(s)
[2021-04-06T12:02:06.098Z] Running in /var/lib/jenkins/workspace/Beats_beats_PR-24424/src/github.com/elastic/beats
[2021-04-06T12:02:06.473Z] Running on Jenkins in /var/lib/jenkins/workspace/Beats_beats_PR-24424
[2021-04-06T12:02:06.517Z] [INFO] getVaultSecret: Getting secrets
[2021-04-06T12:02:06.611Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2021-04-06T12:02:07.268Z] + chmod 755 generate-build-data.sh
[2021-04-06T12:02:07.268Z] + ./generate-build-data.sh https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-24424/ https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-24424/runs/15 ABORTED 10954534
[2021-04-06T12:02:07.518Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-24424/runs/15/steps/?limit=10000 -o steps-info.json
[2021-04-06T12:02:08.862Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-24424/runs/15/tests/?status=FAILED -o tests-errors.json


truncatedTestLines := []byte("truncated first line\n")
env.mustWriteLinesToFile(testlogName, truncatedTestLines)
env.waitUntilEventCount(4)

This test makes me wonder: what happens if the file is already bigger than before by the time the prospector finally discovers that it has been truncated?

This has been a common problem in the logs input (where only the harvester detected truncation). I presume that with the prospector restarting the harvester early, the chance of this happening is reduced.

Contributor Author

Yes, it gets a bit better. However, if the output is blocked, filestream waits to publish the previous events. So the Harvester has to wait for the previous one to publish and stop before it can start reading from the beginning. I was thinking about moving the truncation check out-of-band, just like we check whether the file was removed. Maybe we could even introduce a new option close.on_state_change.truncated, so users can choose whether they are willing to lose events in case of truncation just to make sure the system does not get blocked.


> However, if the output is blocked, filestream waits to publish the previous events. So the Harvester has to wait for the previous one to publish and stop before it can start reading from the beginning.

Hmm... One of the goals of the filestream input is to decouple control from the state of the output. Even if the output is blocked, the harvester must be reactive and shut down if it is told to do so. As annoying as it is, once a file is truncated, there is no way for us to recover the content of the file if lines have not yet been read, or if Filebeat is restarted. On truncate we should ensure that the pending state updates are no longer written to the registry... they must get lost. This ensures that we can unlock the registry state immediately and start a new harvester that can begin overwriting the existing state right away.

The invalidation of pending registry state updates can be implemented in the ack handler, or in the update op itself (e.g. using a shared 'alive' atomic bool). For example, the ack handler for the harvester is set up here. Each harvester has its own beat.Client, including its private ACK handling. An atomic bool can be introduced to tell the acker to discard updateOps. The bool would be set to true if a file is truncated, meaning that all pending state is to be discarded.
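A rough sketch of that idea, assuming an illustrative `updateOp` type and a plain map as the store (the real types live in input-logfile); the shared flag is called `discard` here and is set when truncation is detected:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// updateOp stands in for the registry update the acker applies once events
// are ACKed; the real type lives in input-logfile's store.
type updateOp struct {
	key     string
	offset  int64
	discard *int32 // shared with the truncation handler; 1 means "drop this update"
}

func (op updateOp) apply(store map[string]int64) {
	if atomic.LoadInt32(op.discard) == 1 {
		// The file was truncated after these events were read: the stale
		// offsets must not reach the registry anymore.
		return
	}
	store[op.key] = op.offset
}

func main() {
	var discard int32
	store := map[string]int64{}
	pending := []updateOp{
		{key: "file.log", offset: 100, discard: &discard},
		{key: "file.log", offset: 200, discard: &discard},
	}

	// Truncation is detected before the pending events are ACKed:
	atomic.StoreInt32(&discard, 1)

	for _, op := range pending {
		op.apply(store)
	}
	fmt.Println(store) // map[] – the stale offsets were never written
}
```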

Contributor Author

Ok, it can be solved quite easily by running defaultHarvesterGroup.Stop before restart. My reasoning behind not doing so was to be able to flush the events that are already picked up by Filebeat but not yet accepted by the output.


The ack handler is async and might even outlive the harvester or harvester group. The ack handler lives for as long as there are pending events in the queue to be ACKed.
The events that have already been enqueued will still be shipped. They are not removed. At worst only the current event to be published will be discarded, because the output was blocking publishing. But as the truncate operation is racy anyway, I think that is an OK compromise.

Events for a single resource are always ACKed in order. That is, when just calling Stop and then starting a new harvester, we will still see the offset in the registry increase with the old events before it is reset.

Just calling Stop would indeed be good enough. But as the offset is still updated, it is impossible for the prospector to reset the read offset to 0, as there might be pending events still updating the offset on disk. Unlinking the ack handler/updateOp in case of a truncate event allows us to drop pending updates, set the read offset in the persistent store to 0, and start a new harvester. Setting the read offset to 0 in the persisted store improves resiliency in case Filebeat is restarted. If Filebeat is not restarted, everything will be fine. Autodiscovery adding/removing the filestream input for said files will have no impact either.
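A hedged sketch of that ordering; every name below (harvesterGroup, invalidatePendingUpdates, stopHarvester, startHarvester) is a placeholder, not the real HarvesterGroup API:

```go
package main

import "fmt"

// harvesterGroup is a stand-in for the real group in
// filebeat/input/filestream/internal/input-logfile.
type harvesterGroup struct {
	offsets map[string]int64 // persisted read offsets, keyed by source ID
}

func (hg *harvesterGroup) invalidatePendingUpdates(srcID string) {
	fmt.Println("drop pending ACK updates for", srcID)
}

func (hg *harvesterGroup) stopHarvester(srcID string) {
	fmt.Println("cancel old reader for", srcID)
}

func (hg *harvesterGroup) startHarvester(srcID string) {
	fmt.Println("start new reader for", srcID, "at offset", hg.offsets[srcID])
}

// restartOnTruncate shows the ordering discussed above: unlink pending
// registry updates first, reset the persisted offset, then swap harvesters.
func (hg *harvesterGroup) restartOnTruncate(srcID string) {
	hg.invalidatePendingUpdates(srcID)
	hg.offsets[srcID] = 0 // safe even if Filebeat restarts before new events arrive
	hg.stopHarvester(srcID)
	hg.startHarvester(srcID)
}

func main() {
	hg := &harvesterGroup{offsets: map[string]int64{"native::1234-56": 8192}}
	hg.restartOnTruncate("native::1234-56")
}
```

The point of the ordering is that the persisted offset is already 0 before the new harvester starts, so a Filebeat restart at any point does not resurrect the stale offset.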

@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch 3 times, most recently from 01522dd to 042311a Compare March 17, 2021 16:19
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch from 042311a to ebb90dd Compare March 17, 2021 16:47
filebeat/input/filestream/fswatch.go (outdated review thread, resolved)
filebeat/input/filestream/fswatch.go (outdated review thread, resolved)
filebeat/input/filestream/input.go (review thread resolved)
filebeat/input/filestream/internal/input-logfile/store.go (outdated review thread, resolved)
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch 2 times, most recently from 646daa4 to 61e2fe1 Compare March 25, 2021 09:44
filebeat/input/filestream/input.go (outdated review thread, resolved)
ctx, cancelInput := context.WithCancel(context.Background())
env.startInput(ctx, inp)

for len(env.pipeline.clients) != 1 {

This line might get flagged at times when we run go test -race.

Contributor Author

It gets flagged. I could try to refactor it to be able to wait for more clients without a deadlock.
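One way the race could be sidestepped in the test environment is to guard the client list with a mutex and poll through an accessor; this is only a sketch with stand-in types (`mockPipeline`, `waitUntilClients`), not the actual test helpers:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// mockPipeline stands in for the test pipeline. Guarding the slice with a
// mutex lets the test poll the client count without tripping the race detector.
type mockPipeline struct {
	mu      sync.Mutex
	clients []string
}

func (p *mockPipeline) addClient(c string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.clients = append(p.clients, c)
}

func (p *mockPipeline) clientCount() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.clients)
}

// waitUntilClients polls until n clients have connected or the timeout expires.
func waitUntilClients(p *mockPipeline, n int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if p.clientCount() >= n {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}

func main() {
	p := &mockPipeline{}
	go p.addClient("harvester-1")
	fmt.Println("client connected:", waitUntilClients(p, 1, time.Second))
}
```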


resource, err := lock(ctx, hg.store, sourceName)
resource, err := lock(ctx, hg.store, srcID)

Can we move the lock call before the hg.readers.newContext? I hope this will also help with the error handling and the hot loop if we really have to wait for harvesters to be shut down (the worst might be a k8s metadata event that basically asks us to restart all harvesters/inputs at once).

Instead of waiting in a hot loop for 5 seconds we can have a timeout without spending CPU by creating a context like this:

lockCtx := ctx.Cancelation
if mustWait {
    var cancel context.CancelFunc
    lockCtx, cancel = context.WithTimeout(lockCtx, ...)
    defer cancel()
}
resource, err := lock(lockCtx, hg.store, srcID)

The mutex used to protect the resource also has a TryLock and a LockTimeout method. For example, if you don't want to block on startup when the resource is still taken, just use TryLock. Your 'magic' could also be pushed down to the lock helper.
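As an illustration of the non-blocking variants mentioned here, a channel-based stand-in (`resourceLock` below is not the real lock helper; its TryLock/LockTimeout only mimic the methods named above) could look like this:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// resourceLock is a stand-in for the mutex protecting a registry resource.
type resourceLock struct{ ch chan struct{} }

func newResourceLock() *resourceLock {
	l := &resourceLock{ch: make(chan struct{}, 1)}
	l.ch <- struct{}{} // token available => unlocked
	return l
}

// TryLock returns immediately instead of blocking on a resource that is
// still owned by a harvester that has not shut down yet.
func (l *resourceLock) TryLock() bool {
	select {
	case <-l.ch:
		return true
	default:
		return false
	}
}

// LockTimeout waits up to d without burning CPU in a hot loop.
func (l *resourceLock) LockTimeout(d time.Duration) error {
	select {
	case <-l.ch:
		return nil
	case <-time.After(d):
		return errors.New("resource still locked")
	}
}

func (l *resourceLock) Unlock() { l.ch <- struct{}{} }

func main() {
	l := newResourceLock()
	fmt.Println("got lock:", l.TryLock())             // true
	fmt.Println("second try:", l.TryLock())           // false, already taken
	fmt.Println(l.LockTimeout(50 * time.Millisecond)) // times out without spinning
}
```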

Contributor Author

I have removed the timeout for the waiting. Regardless, I will move the locking.

filebeat/input/filestream/internal/input-logfile/input.go (outdated review thread, resolved)
filebeat/input/filestream/internal/input-logfile/store.go (outdated review thread, resolved)
filebeat/input/filestream/internal/input-logfile/store.go (outdated review thread, resolved)
filebeat/input/filestream/internal/input-logfile/store.go (outdated review thread, resolved)
filebeat/input/filestream/prospector.go (review thread resolved)
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch from 61e2fe1 to 3b61d53 Compare March 26, 2021 16:04
@kvch
Copy link
Contributor Author

kvch commented Mar 26, 2021

I am adding needs_docs to make sure resend_on_touch is documented. I will open a separate PR for that once this is merged.

@kvch kvch added the needs_backport (PR is waiting to be backported to other branches) and needs_docs labels Mar 26, 2021
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch 2 times, most recently from 3110809 to 7d1eb83 Compare March 30, 2021 12:03
@kvch kvch requested a review from urso March 30, 2021 13:15
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch 2 times, most recently from 3073316 to ae2e537 Compare April 1, 2021 11:04
@kvch
Copy link
Contributor Author

kvch commented Apr 1, 2021

I have rebased the PR so it only contains up-to-date code.

filebeat/input/filestream/input.go (outdated review thread, resolved)
// If it is stopping itself, it must clean up the bookkeeper.
if ctx.Cancelation.Err() != context.Canceled {
	hg.readers.remove(srcID)
}

shouldn't we always remove the ID on shutdown?

Contributor Author

Yes. But I had to separate it into several cases because calling hg.Stop, hg.readers.remove and hg.Start after one another led to a random order of execution. In some cases it made handling truncation undefined, because first the Prospector stopped the Harvester for a given Source. Stop already removes the ID from the bookkeeper. When the new Harvester was starting for the truncated file, the previous Harvester might still have been shutting down, and as hg.readers.remove was deferred, the newly started Harvester was sometimes cancelled by the previous one.

It is still not perfect. One open question is what happens if the context is cancelled for some reason other than by hg.readers.remove. So far I haven't found a case that would cause a problem with this method. Maybe when the Prospector is stopped I have to delete all running sources from the bookkeeper?


I see. It seems to work for now. Still, we should try to clean this up in a follow-up PR. Shutdown and group source ID handling is not fully intuitive and is a potential source of bugs in the future.

Contributor Author

Ok. It definitely needs more work and we need a bit more time to come up with a clearer solution.

Contributor Author

Also, I hope that when we introduce the copytruncate prospector, most of the complexity of supporting truncation in the default prospector can be moved and refactored. The complexity comes from the fact that we cannot anticipate how files are truncated and rotated. With specialized prospectors, we can handle special cases a bit better.

@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch from 0deeac1 to f4e4bb1 Compare April 6, 2021 08:56
@kvch kvch merged commit 2875faf into elastic:master Apr 7, 2021
kvch added a commit to kvch/beats that referenced this pull request Apr 7, 2021
…c#24424)

(cherry picked from commit 2875faf)
@kvch kvch added the v7.13.0 label and removed the needs_backport (PR is waiting to be backported to other branches) label Apr 7, 2021
kvch added a commit that referenced this pull request Apr 7, 2021
#24959)

(cherry picked from commit 2875faf)