[libbeat] Disk queue implementation #21176

faec · 2020-09-18T18:08:54Z

What does this PR do?

This PR implements a new disk-based queue for libbeat.

This is a draft PR: it is ready to start receiving feedback and rolling reviews, but some implementation details are still pending and it isn't ready for checkin.

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Add to the beat configuration:

queue.disk:
  max_size: 1GB

The beat should operate as usual, and the event data should be stored in data/diskqueue while waiting to be ingested.

Release testing

While the preceding is good enough for smoke testing, we should include a few more scenarios for release prep:

Ingest from multiple beats (Filebeat and Metricbeat are essential since that's where we expect the most use, the rest would be nice to include as time allows)
Ingest to both standalone Elasticsearch and Elastic Cloud
Confirm that queued events persist through a beat restart. An easy way to test this is to run a beat initially without a working Elasticsearch so that it fills the disk queue, then restart it with Elasticsearch running. For extra granularity, you can ingest small amounts at a time and view the queue's data files (data/diskqueue/NNN.seg) in a hex editor, where the JSON event data blobs can be recognized visually. That step is optional, since that level of detail belongs more in the automated tests.
Try a wide range for max_size, both small enough that the queue consistently reaches its maximum size (which can be seen by checking the files in data/diskqueue) and large or unbounded (max_size=0).
Run with unbounded size targeting a nearly-full partition (using the queue.disk.path setting so the beat itself isn't on a full partition).
- It's not recommended for many reasons to fill up the main system partition or the partition a beat is running on, but this queue should be able to operate with no problems (beyond the intrinsic throughput decrease) if the partition hosting its data is full.

Since the queue itself operates strictly locally, it is fine to test each of these scenarios in isolation -- we wouldn't gain much practically by testing on the powerset of the preceding conditions, so follow whatever sequence makes things simplest.

libbeat/publisher/queue/diskqueue/serialize.go

libbeat/publisher/queue/diskqueue/state_file.go

libbeat/publisher/queue/diskqueue/config.go

kvch

Awesome PR so far. I've left a few notes for you. Tomorrow I will look at it again with fresh eyes.

kvch

Let's merge it! I assume when tests are added a few bugs might come out, but that is fine.
Awesome PR \o/

elasticmachine · 2020-09-24T14:38:21Z

Pinging @elastic/integrations (Team:Integrations)

fearful-symmetry · 2020-09-24T15:05:20Z

libbeat/publisher/queue/diskqueue/config.go

+func SettingsForUserConfig(config *common.Config) (Settings, error) {
+	userConfig := userConfig{}
+	if err := config.Unpack(&userConfig); err != nil {
+		return Settings{}, err


The error return styles here are inconsistent: sometimes we wrap an error, sometimes we don't. Unless something else is requesting a specific error type, we might want to just use errors.Wrap or fmt.Errorf for everything.

I vote for fmt.Errorf.

The heuristic I aim for is to ask whether the error being returned is understandable in the context of the caller. For example, errors in the deleter loop aren't wrapped, because they are reported (and grouped / wrapped) when they get back to the core loop, and it looks silly to have a message that looks like "couldn't delete segment file: [couldn't delete segment file: [actual error]]". So for low-level helpers I often leave it unwrapped, knowing that the caller is responsible for reporting it comprehensibly.

That said, config.go doesn't seem to follow that convention, and could use a little more verbosity in the messages, so I fixed the calls here :-)

fearful-symmetry · 2020-09-24T15:05:48Z

I think I understand most of this. Good work!

faec · 2020-09-24T18:04:06Z

Updates:

This is nearly ready so it's no longer a draft PR :-)
I added unit tests! Only a couple, to core_loop_test.go. The intention is for there to be many more like these, covering the other possible queue states, but I wanted the initial checkin to at least have a proof of concept showing how the state transitions / representation invariants can be unit tested.


kvch · 2020-09-28T08:04:28Z

libbeat/publisher/queue/diskqueue/core_loop_test.go

+	dq.handleProducerWriteRequest(request)
+
+	// The request inserts 100 bytes into an empty queue, so it should succeed.
+	// We expect:


Nit: describing the expectations in the comment does not provide value, as the messages passed to the t.Error functions tell the same story.

It's technically redundant, but the verbal description is shorter and easier to parse. I also like having a description of what I'm about to check, because the two listed invariants are logically equivalent to the 5 conditions in the code, but a reader seeing the test for the first time only knows what the code actually tests, and not what the author thought they were testing (which isn't always the same). This way if a test fails it's easier to recognize whether the problem is in the package or the test itself (i.e. whether the test is checking the invariants incorrectly, rather than the invariants actually failing).

(The secret long-term plan is to have verbal descriptions like this for every logically distinct state change, and to coalesce them into package documentation so it's easier to see how the pieces fit together.)

Initial implementation of the new libbeat disk queue (cherry picked from commit 2b8fd7c)

* upstream/master:
  feat: prepare release pipelines (elastic#21238)
  Add IP validation to Security module (elastic#21325)
  Fixes for new 7.10 rsa2elk datasets (elastic#21240)
  o365input: Restart after fatal error (elastic#21258)
  Fix panic in cgroups monitoring (elastic#21355)
  Handle multiple upstreams in ingress-controller (elastic#21215)
  [CI] Fix runbld when workspace does not exist (elastic#21350)
  [Filebeat] Fix checkpoint (elastic#21344)
  [CI] Archive build reasons (elastic#21347)
  Add dashboard for pubsub metricset in googlecloud module (elastic#21326)
  [Elastic Agent] Allow embedding of certificate (elastic#21179)
  Adds a default for failure_cache.min_ttl (elastic#21085)
  [libbeat] Disk queue implementation (elastic#21176)

faec added 30 commits April 8, 2020 16:03

Initial disk queue skeleton

e76a41b

Merge branch 'feature-disk-queue' into disk-queue

1d8bf65

Sketching out some top-level disk queue data structures

e8c8128

add queue type registration

97a7ed5

Merge branch 'feature-disk-queue' into disk-queue

f5ad9a2

use new registry helper

a78b85c

connect external user config to the queue's Settings struct

b12020c

Fill out more default settings

67540d6

review comments

94f125c

review comments (add panic to unimplemented functions)

f29a96f

Merge branch 'feature-disk-queue' into disk-queue

7ce01f9

some state file stuff

a04980e

revising code to match new design

26b4248

more state file handling

f30f30b

reading data frames from segments

c312d69

fleshing out segment logic

1a40b06

lots of partial work on reader and writer

ce65718

reworking segments

22ae148

reworking reader code

3f5f8fe

working on writer loop

61fa5d7

Merge branch 'master' into disk-queue-0

0191dc6

fix most build errors

3bf35ff

checksumType -> ChecksumType

50bd450

working on read / write loops

04c9b60

replace filebeat with a queue wrapper for testing

7a2e09a

adapting encoder stuff from the disk spool

132ba8e

add most of the api logic for the reader / writer loops

e73f55f

filling in segment-deletion api

6d2ca31

connect consumer ack endpoints

988cef6

organize, delete dead code

7774dc4