
Add tests and support for truncated files in filestream input #24424

Merged
merged 4 commits into elastic:master from test-filebeat-filestream-trucated-files on Apr 7, 2021

Conversation

kvch
Contributor

@kvch kvch commented Mar 8, 2021

What does this PR do?

Adds support for truncated files and migrates related tests from test_harvester.py. It also adds more tests to cover the following cases:

  • truncation is detected only by the Prospector
  • truncation is detected first by the Harvester then by the Prospector
  • truncation is detected first by the Prospector then by the Harvester
  • file gets truncated when the output is not able to accept events

Why is it important?

Support for stopping reading from truncated files was already implemented. However, the filestream input could not start reading the file again from the beginning.

A new file system event called OpTruncate is added. When the size of a file has shrunk compared to the last time the scanner encountered it, an OpTruncate event is emitted. When the prospector gets this event, the HarvesterGroup restarts the Harvester of the file. Restarting basically means that the new Harvester cancels the previous reader and starts a new one.
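A minimal sketch of the size-comparison idea in Go (the `fsEvent` type, the `"truncate"` op string, and `scanOnce` are illustrative stand-ins, not the actual filestream scanner code):

```go
package main

import (
	"fmt"
	"os"
)

// fsEvent is an illustrative stand-in for the scanner's file system event;
// the real filestream scanner emits its own event type whose Op now includes OpTruncate.
type fsEvent struct {
	op   string
	path string
}

// scanOnce compares the current size of each known file with the size seen
// on the previous scan and emits a "truncate" event when the file shrank.
// It mirrors the idea described above, not the actual implementation.
func scanOnce(prevSizes map[string]int64, paths []string, events chan<- fsEvent) {
	for _, p := range paths {
		fi, err := os.Stat(p)
		if err != nil {
			continue
		}
		if prev, seen := prevSizes[p]; seen && fi.Size() < prev {
			events <- fsEvent{op: "truncate", path: p}
		}
		prevSizes[p] = fi.Size()
	}
}

func main() {
	events := make(chan fsEvent, 8)
	prevSizes := map[string]int64{}

	// First scan records the size; a later scan that finds a smaller file
	// emits the event, and the prospector would restart the harvester.
	scanOnce(prevSizes, []string{"example.log"}, events)
	scanOnce(prevSizes, []string{"example.log"}, events)

	select {
	case ev := <-events:
		fmt.Println("emit", ev.op, "for", ev.path)
	default:
		fmt.Println("no truncation detected")
	}
}
```

In the real input, the prospector reacts to the truncation event by asking the HarvesterGroup to restart the harvester for that source.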

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Mar 8, 2021
@kvch kvch added the Team:Elastic-Agent (Label for the Agent team) label Mar 8, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Mar 8, 2021
@kvch kvch requested a review from urso March 8, 2021 17:15
@elasticmachine
Collaborator

elasticmachine commented Mar 8, 2021

❕ Build Aborted

Either there was a build timeout or someone aborted the build.



Build stats

  • Build Cause: Pull request #24424 updated

  • Start Time: 2021-04-06T08:59:32.471+0000

  • Duration: 183 min 35 sec

  • Commit: f4e4bb1

Test stats 🧪

Test Results

  • Failed: 0
  • Passed: 12368
  • Skipped: 2085
  • Total: 14453

Trends 🧪

[Trend charts: build times and test counts]

Log output

Last 100 lines of log output:

[2021-04-06T11:20:56.453Z] 9.85s call     x-pack/filebeat/tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_034_cisco
[2021-04-06T11:20:56.453Z] 9.74s call     x-pack/filebeat/tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_235_oracle
[2021-04-06T11:20:56.453Z] 9.72s call     x-pack/filebeat/tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_241_oracle
[2021-04-06T11:20:56.453Z] ======================= 323 passed in 1592.98s (0:26:32) =======================
[2021-04-06T11:20:56.453Z] >> python test: Integration Testing Complete
[2021-04-06T11:20:56.813Z] Cleaning up /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23
[2021-04-06T11:20:56.813Z] Client: Docker Engine - Community
[2021-04-06T11:20:56.813Z]  Version:           20.10.3
[2021-04-06T11:20:56.813Z]  API version:       1.41
[2021-04-06T11:20:56.813Z]  Go version:        go1.13.15
[2021-04-06T11:20:56.813Z]  Git commit:        48d30b5
[2021-04-06T11:20:56.813Z]  Built:             Fri Jan 29 14:33:13 2021
[2021-04-06T11:20:56.813Z]  OS/Arch:           linux/amd64
[2021-04-06T11:20:56.813Z]  Context:           default
[2021-04-06T11:20:56.813Z]  Experimental:      true
[2021-04-06T11:20:56.813Z] 
[2021-04-06T11:20:56.813Z] Server: Docker Engine - Community
[2021-04-06T11:20:56.813Z]  Engine:
[2021-04-06T11:20:56.813Z]   Version:          20.10.3
[2021-04-06T11:20:56.813Z]   API version:      1.41 (minimum version 1.12)
[2021-04-06T11:20:56.813Z]   Go version:       go1.13.15
[2021-04-06T11:20:56.813Z]   Git commit:       46229ca
[2021-04-06T11:20:56.813Z]   Built:            Fri Jan 29 14:31:25 2021
[2021-04-06T11:20:56.813Z]   OS/Arch:          linux/amd64
[2021-04-06T11:20:56.813Z]   Experimental:     false
[2021-04-06T11:20:56.813Z]  containerd:
[2021-04-06T11:20:56.813Z]   Version:          1.4.4
[2021-04-06T11:20:56.813Z]   GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
[2021-04-06T11:20:56.813Z]  runc:
[2021-04-06T11:20:56.813Z]   Version:          1.0.0-rc93
[2021-04-06T11:20:56.813Z]   GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
[2021-04-06T11:20:56.813Z]  docker-init:
[2021-04-06T11:20:56.813Z]   Version:          0.19.0
[2021-04-06T11:20:56.813Z]   GitCommit:        de40ad0
[2021-04-06T11:20:56.813Z] Change ownership of all files inside the specific folder from root/root to current user/group
[2021-04-06T11:20:56.813Z] Unable to find image 'alpine:3.4' locally
[2021-04-06T11:20:57.384Z] 3.4: Pulling from library/alpine
[2021-04-06T11:20:57.645Z] c1e54eec4b57: Pulling fs layer
[2021-04-06T11:20:57.904Z] c1e54eec4b57: Download complete
[2021-04-06T11:20:58.165Z] c1e54eec4b57: Pull complete
[2021-04-06T11:20:58.165Z] Digest: sha256:b733d4a32c4da6a00a84df2ca32791bb03df95400243648d8c539e7b4cce329c
[2021-04-06T11:20:58.165Z] Status: Downloaded newer image for alpine:3.4
[2021-04-06T11:21:00.079Z] Change permissions with write access of all files inside the specific folder
[2021-04-06T11:21:01.480Z] Running in /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23/src/github.com/elastic/beats/build
[2021-04-06T11:21:01.776Z] + rm -rf ve
[2021-04-06T11:21:01.776Z] + find . -type d -name vendor -exec rm -r {} ;
[2021-04-06T11:21:02.091Z] + python .ci/scripts/pre_archive_test.py
[2021-04-06T11:21:04.015Z] Copy ./x-pack/filebeat/build into build/x-pack/filebeat/build
[2021-04-06T11:21:04.028Z] Running in /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23/src/github.com/elastic/beats/build
[2021-04-06T11:21:04.044Z] Recording test results
[2021-04-06T11:21:05.436Z] [Checks API] No suitable checks publisher found.
[2021-04-06T11:21:05.776Z] + go clean -modcache
[2021-04-06T11:21:09.385Z] Cleaning up /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23
[2021-04-06T11:21:09.386Z] Client: Docker Engine - Community
[2021-04-06T11:21:09.386Z]  Version:           20.10.3
[2021-04-06T11:21:09.386Z]  API version:       1.41
[2021-04-06T11:21:09.386Z]  Go version:        go1.13.15
[2021-04-06T11:21:09.386Z]  Git commit:        48d30b5
[2021-04-06T11:21:09.386Z]  Built:             Fri Jan 29 14:33:13 2021
[2021-04-06T11:21:09.386Z]  OS/Arch:           linux/amd64
[2021-04-06T11:21:09.386Z]  Context:           default
[2021-04-06T11:21:09.386Z]  Experimental:      true
[2021-04-06T11:21:09.386Z] 
[2021-04-06T11:21:09.386Z] Server: Docker Engine - Community
[2021-04-06T11:21:09.386Z]  Engine:
[2021-04-06T11:21:09.386Z]   Version:          20.10.3
[2021-04-06T11:21:09.386Z]   API version:      1.41 (minimum version 1.12)
[2021-04-06T11:21:09.386Z]   Go version:       go1.13.15
[2021-04-06T11:21:09.386Z]   Git commit:       46229ca
[2021-04-06T11:21:09.386Z]   Built:            Fri Jan 29 14:31:25 2021
[2021-04-06T11:21:09.386Z]   OS/Arch:          linux/amd64
[2021-04-06T11:21:09.386Z]   Experimental:     false
[2021-04-06T11:21:09.386Z]  containerd:
[2021-04-06T11:21:09.386Z]   Version:          1.4.4
[2021-04-06T11:21:09.386Z]   GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
[2021-04-06T11:21:09.386Z]  runc:
[2021-04-06T11:21:09.386Z]   Version:          1.0.0-rc93
[2021-04-06T11:21:09.386Z]   GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
[2021-04-06T11:21:09.386Z]  docker-init:
[2021-04-06T11:21:09.386Z]   Version:          0.19.0
[2021-04-06T11:21:09.386Z]   GitCommit:        de40ad0
[2021-04-06T11:21:09.386Z] Change ownership of all files inside the specific folder from root/root to current user/group
[2021-04-06T11:21:15.965Z] Change permissions with write access of all files inside the specific folder
[2021-04-06T11:21:16.250Z] Running in /var/lib/jenkins/workspace/PR-24424-15-b72851d4-851e-46ef-b634-f149a5ad6f23
[2021-04-06T11:21:21.217Z] + gsutil --version
[2021-04-06T11:21:22.650Z] Masking supported pattern matches of $FILE_CREDENTIAL
[2021-04-06T11:21:22.961Z] + gcloud auth activate-service-account --key-file ****
[2021-04-06T11:21:23.536Z] Activated service account credentials for: [beats-ci-gcs-plugin@elastic-ci-prod.iam.gserviceaccount.com]
[2021-04-06T11:21:23.854Z] + gsutil -m -q cp -a public-read eC1wYWNrL2ZpbGViZWF0LWJ1aWxkZjRlNGJiMWNmZTQxYTVlNTRiMWIzNmIzMDY1ZTJiNTgzMzBlN2E1Mg gs://beats-ci-temp/ci/cache/
[2021-04-06T12:02:05.950Z] Cancelling nested steps due to timeout
[2021-04-06T12:02:05.960Z] Failed in branch filebeat-windows-7-32-windows-7-32-bit
[2021-04-06T12:02:06.044Z] Stage "Packaging" skipped due to earlier failure(s)
[2021-04-06T12:02:06.098Z] Running in /var/lib/jenkins/workspace/Beats_beats_PR-24424/src/github.com/elastic/beats
[2021-04-06T12:02:06.473Z] Running on Jenkins in /var/lib/jenkins/workspace/Beats_beats_PR-24424
[2021-04-06T12:02:06.517Z] [INFO] getVaultSecret: Getting secrets
[2021-04-06T12:02:06.611Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2021-04-06T12:02:07.268Z] + chmod 755 generate-build-data.sh
[2021-04-06T12:02:07.268Z] + ./generate-build-data.sh https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-24424/ https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-24424/runs/15 ABORTED 10954534
[2021-04-06T12:02:07.518Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-24424/runs/15/steps/?limit=10000 -o steps-info.json
[2021-04-06T12:02:08.862Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-24424/runs/15/tests/?status=FAILED -o tests-errors.json


truncatedTestLines := []byte("truncated first line\n")
env.mustWriteLinesToFile(testlogName, truncatedTestLines)
env.waitUntilEventCount(4)

This test makes me wonder: what happens if the file is already bigger than before by the time the prospector finally discovers that it has been truncated?

This has been a common problem in the logs input (where only the harvester detected truncation). I presume that with the prospector restarting the harvester early, the chance of this happening is reduced.

Contributor Author

Yes, it gets a bit better. However, if the output is blocked, filestream waits to publish the previous events. So the Harvester has to wait for the previous one to publish and stop before it can start reading from the beginning. I was thinking about moving the truncation check out-of-band, just like we check whether the file was removed. Maybe we could even introduce a new option close.on_state_change.truncated, so users can choose whether they are willing to lose events in case of truncation just to make sure the system does not get blocked.


> However, if the output is blocked, filestream waits to publish the previous events. So the Harvester has to wait for the previous one to publish and stop before it can start reading from the beginning.

Hmm... One of the goals of the filestream input is to decouple control from the state of the output. Even if the output is blocked, the harvester must be reactive and shut down if it is told to do so. As annoying as it is, once a file is truncated, there is no way for us to recover the content of the file if lines have not yet been read, or if Filebeat is restarted. On truncate we should ensure that the pending state updates are no longer written to the registry... they must get lost. This ensures that we can unlock the registry state immediately and start a new harvester that can begin overwriting the existing state right away.

The invalidation of pending registry state updates can be implemented in the ack handler, or in the update op itself (e.g. using a shared 'alive' atomic bool). For example, the ack handler for the harvester is set up here. Each harvester has its own beat.Client, including its private ACK handling. An atomic bool can be introduced to tell the acker to discard updateOps. The bool would be set to true if a file is truncated, meaning that all pending state is to be discarded.
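A rough sketch of that idea, assuming an illustrative `updateOp` type and a plain map as the store (the real types live in input-logfile); the shared flag is called `discard` here and is set when truncation is detected:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// updateOp stands in for the registry update the acker applies once events
// are ACKed; the real type lives in input-logfile's store.
type updateOp struct {
	key     string
	offset  int64
	discard *int32 // shared with the truncation handler; 1 means "drop this update"
}

func (op updateOp) apply(store map[string]int64) {
	if atomic.LoadInt32(op.discard) == 1 {
		// The file was truncated after these events were read: the stale
		// offsets must not reach the registry anymore.
		return
	}
	store[op.key] = op.offset
}

func main() {
	var discard int32
	store := map[string]int64{}
	pending := []updateOp{
		{key: "file.log", offset: 100, discard: &discard},
		{key: "file.log", offset: 200, discard: &discard},
	}

	// Truncation is detected before the pending events are ACKed:
	atomic.StoreInt32(&discard, 1)

	for _, op := range pending {
		op.apply(store)
	}
	fmt.Println(store) // map[] – the stale offsets were never written
}
```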

Contributor Author

Ok, it can be solved quite easily by running defaultHarvesterGroup.Stop before restart. My reasoning behind not doing so was to be able to flush the events that are already picked up by Filebeat but not yet accepted by the output.


The ack handler is async and might even outlive the harvester or harvester group. The ack handler lives for as long as there are pending events in the queue to be ACKed.
The events that have already been enqueued will still be shipped. They are not removed. At worst only the current event to be published will be discarded, because the output was blocking publishing. But as the truncate operation is racy anyway, I think that is an OK compromise.

Events for a single resource are always ACKed in order. That is, when just calling Stop and then starting a new harvester, we will still see the offset in the registry increase with the old events before it is reset.

Just calling Stop would indeed be good enough. But as the offset is still updated, it is impossible for the prospector to reset the read offset to 0, as there might be pending events still updating the offset on disk. Unlinking the ack handler/updateOp in case of a truncate event allows us to drop pending updates, set the read offset in the persistent store to 0, and start a new harvester. Setting the read offset to 0 in the persisted store improves resiliency in case Filebeat is restarted. If Filebeat is not restarted, everything will be fine. Autodiscovery adding/removing the filestream input for said files will have no impact either.
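A hedged sketch of that ordering; every name below (harvesterGroup, invalidatePendingUpdates, stopHarvester, startHarvester) is a placeholder, not the real HarvesterGroup API:

```go
package main

import "fmt"

// harvesterGroup is a stand-in for the real group in
// filebeat/input/filestream/internal/input-logfile.
type harvesterGroup struct {
	offsets map[string]int64 // persisted read offsets, keyed by source ID
}

func (hg *harvesterGroup) invalidatePendingUpdates(srcID string) {
	fmt.Println("drop pending ACK updates for", srcID)
}

func (hg *harvesterGroup) stopHarvester(srcID string) {
	fmt.Println("cancel old reader for", srcID)
}

func (hg *harvesterGroup) startHarvester(srcID string) {
	fmt.Println("start new reader for", srcID, "at offset", hg.offsets[srcID])
}

// restartOnTruncate shows the ordering discussed above: unlink pending
// registry updates first, reset the persisted offset, then swap harvesters.
func (hg *harvesterGroup) restartOnTruncate(srcID string) {
	hg.invalidatePendingUpdates(srcID)
	hg.offsets[srcID] = 0 // safe even if Filebeat restarts before new events arrive
	hg.stopHarvester(srcID)
	hg.startHarvester(srcID)
}

func main() {
	hg := &harvesterGroup{offsets: map[string]int64{"native::1234-56": 8192}}
	hg.restartOnTruncate("native::1234-56")
}
```

The point of the ordering is that the persisted offset is already 0 before the new harvester starts, so a Filebeat restart at any point does not resurrect the stale offset.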

@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch 3 times, most recently from 01522dd to 042311a Compare March 17, 2021 16:19
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch from 042311a to ebb90dd Compare March 17, 2021 16:47
filebeat/input/filestream/fswatch.go (outdated review thread, resolved)
filebeat/input/filestream/fswatch.go (outdated review thread, resolved)
filebeat/input/filestream/input.go (review thread resolved)
filebeat/input/filestream/internal/input-logfile/store.go (outdated review thread, resolved)
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch 2 times, most recently from 646daa4 to 61e2fe1 Compare March 25, 2021 09:44
filebeat/input/filestream/input.go (outdated review thread, resolved)
ctx, cancelInput := context.WithCancel(context.Background())
env.startInput(ctx, inp)

for len(env.pipeline.clients) != 1 {

This line might get flagged at times when we run go test -race.

Contributor Author

It gets flagged. I could try to refactor it to be able to wait for more clients without a deadlock.
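One way the race could be sidestepped in the test environment is to guard the client list with a mutex and poll through an accessor; this is only a sketch with stand-in types (`mockPipeline`, `waitUntilClients`), not the actual test helpers:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// mockPipeline stands in for the test pipeline. Guarding the slice with a
// mutex lets the test poll the client count without tripping the race detector.
type mockPipeline struct {
	mu      sync.Mutex
	clients []string
}

func (p *mockPipeline) addClient(c string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.clients = append(p.clients, c)
}

func (p *mockPipeline) clientCount() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.clients)
}

// waitUntilClients polls until n clients have connected or the timeout expires.
func waitUntilClients(p *mockPipeline, n int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if p.clientCount() >= n {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}

func main() {
	p := &mockPipeline{}
	go p.addClient("harvester-1")
	fmt.Println("client connected:", waitUntilClients(p, 1, time.Second))
}
```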


resource, err := lock(ctx, hg.store, sourceName)
resource, err := lock(ctx, hg.store, srcID)

Can we move the lock call before the hg.readers.newContext? I hope this will also help with the error handling and the hot loop if we really have to wait for harvesters to be shut down (the worst might be a k8s metadata event that basically asks us to restart all harvesters/inputs at once).

Instead of waiting in a hot loop for 5 seconds we can have a timeout without spending CPU by creating a context like this:

lockCtx := ctx.Cancelation
if mustWait {
    var cancel context.CancelFunc
    lockCtx, cancel = context.WithTimeout(lockCtx, ...)
    defer cancel()
}
resource, err := lock(lockCtx, hg.store, srcID)

The mutex used to protect the resource also has a TryLock and a LockTimeout method. For example, if you don't want to block on startup when the resource is still taken, just use TryLock. Your 'magic' could also be pushed down to the lock helper.
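As an illustration of the non-blocking variants mentioned here, a channel-based stand-in (`resourceLock` below is not the real lock helper; its TryLock/LockTimeout only mimic the methods named above) could look like this:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// resourceLock is a stand-in for the mutex protecting a registry resource.
type resourceLock struct{ ch chan struct{} }

func newResourceLock() *resourceLock {
	l := &resourceLock{ch: make(chan struct{}, 1)}
	l.ch <- struct{}{} // token available => unlocked
	return l
}

// TryLock returns immediately instead of blocking on a resource that is
// still owned by a harvester that has not shut down yet.
func (l *resourceLock) TryLock() bool {
	select {
	case <-l.ch:
		return true
	default:
		return false
	}
}

// LockTimeout waits up to d without burning CPU in a hot loop.
func (l *resourceLock) LockTimeout(d time.Duration) error {
	select {
	case <-l.ch:
		return nil
	case <-time.After(d):
		return errors.New("resource still locked")
	}
}

func (l *resourceLock) Unlock() { l.ch <- struct{}{} }

func main() {
	l := newResourceLock()
	fmt.Println("got lock:", l.TryLock())             // true
	fmt.Println("second try:", l.TryLock())           // false, already taken
	fmt.Println(l.LockTimeout(50 * time.Millisecond)) // times out without spinning
}
```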

Contributor Author

I have removed the timeout for the waiting. Regardless, I will move the locking.

filebeat/input/filestream/internal/input-logfile/input.go (outdated review thread, resolved)
filebeat/input/filestream/internal/input-logfile/store.go (outdated review thread, resolved)
filebeat/input/filestream/internal/input-logfile/store.go (outdated review thread, resolved)
filebeat/input/filestream/internal/input-logfile/store.go (outdated review thread, resolved)
filebeat/input/filestream/prospector.go (review thread resolved)
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch from 61e2fe1 to 3b61d53 Compare March 26, 2021 16:04
@kvch
Copy link
Contributor Author

kvch commented Mar 26, 2021

I am adding needs_docs to make sure resend_on_touch is documented. I will open a separate PR for that once this is merged.

@kvch kvch added the needs_backport (PR is waiting to be backported to other branches) and needs_docs labels Mar 26, 2021
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch 2 times, most recently from 3110809 to 7d1eb83 Compare March 30, 2021 12:03
@kvch kvch requested a review from urso March 30, 2021 13:15
@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch 2 times, most recently from 3073316 to ae2e537 Compare April 1, 2021 11:04
@kvch
Copy link
Contributor Author

kvch commented Apr 1, 2021

I have rebased the PR so it only contains up-to-date code.

filebeat/input/filestream/input.go (outdated review thread, resolved)
// If it is stopping itself, it must clean up the bookkeeper.
if ctx.Cancelation.Err() != context.Canceled {
	hg.readers.remove(srcID)
}

shouldn't we always remove the ID on shutdown?

Contributor Author

Yes. But I had to separate it into several cases because calling hg.Stop, hg.readers.remove and hg.Start after one another led to a random order of execution. In some cases it made handling truncation undefined, because first the Prospector stopped the Harvester for a given Source. Stop already removes the ID from the bookkeeper. When the new Harvester was starting for the truncated file, the previous Harvester might still have been shutting down, and as hg.readers.remove was deferred, the newly started Harvester was sometimes cancelled by the previous one.

It is still not perfect. One open question is what happens if the context is cancelled for some reason other than by hg.readers.remove. So far I haven't found a case that would cause a problem with this method. Maybe when the Prospector is stopped I have to delete all running sources from the bookkeeper?


I see. It seems to work for now. Still, we should try to clean this up in a follow-up PR. Shutdown and group source ID handling is not fully intuitive and is a potential source of bugs in the future.

Contributor Author

Ok. It definitely needs more work and we need a bit more time to come up with a clearer solution.

Contributor Author

Also, I hope that when we introduce the copytruncate prospector, most of the complexity of supporting truncation in the default prospector can be moved and refactored. The complexity comes from the fact that we cannot anticipate how files are truncated and rotated. With specialized prospectors, we can handle special cases a bit better.

@kvch kvch force-pushed the test-filebeat-filestream-trucated-files branch from 0deeac1 to f4e4bb1 Compare April 6, 2021 08:56
@kvch kvch merged commit 2875faf into elastic:master Apr 7, 2021
kvch added a commit to kvch/beats that referenced this pull request Apr 7, 2021
…c#24424)

(cherry picked from commit 2875faf)
@kvch kvch added the v7.13.0 label and removed the needs_backport (PR is waiting to be backported to other branches) label Apr 7, 2021
kvch added a commit that referenced this pull request Apr 7, 2021
#24959)

(cherry picked from commit 2875faf)