Adding unit tests for publisher output #17460
Conversation
Pinging @elastic/integrations-services (Team:Services)
rand.Seed(time.Now().UnixNano())

wqu := makeWorkQueue()
client := &mockNetworkClient{}
The reason I'm only testing with an outputs.NetworkClient here is that it results in a netClientWorker below (instead of a clientWorker). And it appears that only in the case of a netClientWorker do we cancel the batch if the worker was closed before the batch was published:
beats/libbeat/publisher/pipeline/output.go
Line 120 in 8d43169
batch.Cancelled()
Are we missing a similar check + batch cancellation for clientWorker inside this loop?
beats/libbeat/publisher/pipeline/output.go
Lines 69 to 75 in 8d43169
for batch := range w.qu {
	w.observer.outBatchSend(len(batch.events))
	if err := w.client.Publish(batch); err != nil {
		break
	}
}
For reloading, I would say we're missing the Cancel for the client worker. clientWorker is only used for the console and file outputs, which I don't think are available via Fleet, but better to have the code correct.
Could the reason be that the console and file outputs do not support a 'Stop/Close' method? Maybe we need to touch those as well?
Just noticed the difference is the Connect method. Yeah, clientWorker should also cancel batches in case it is replaced.
Implemented in 1dd819f2ab787081389564534ea927ef17e086a8.
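For illustration, a minimal sketch of what such a cancellation check inside the clientWorker loop could look like; the closed flag and its atomic type are assumptions here and may not match the actual commit:

```go
func (w *clientWorker) run() {
	for batch := range w.qu {
		// Assumed: w.closed is an atomic flag set by Close(), mirroring the
		// check that netClientWorker already performs before publishing.
		if w.closed.Load() {
			batch.Cancelled()
			continue
		}
		w.observer.outBatchSend(len(batch.events))
		if err := w.client.Publish(batch); err != nil {
			break
		}
	}
}
```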
for name, test := range tests {
	t.Run(name, func(t *testing.T) {
		rand.Seed(time.Now().UnixNano())
Tip: print the seed via t.Log and add a CLI flag to configure a static seed. In case randomized testing fails, it should be reproducible.
Implemented in bd38221da24c14ae7797b2a48603688ac7a591d1.
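A self-contained sketch of the idea (the flag and helper names here are illustrative; the PR's actual code defines a SeedFlag variable, shown further below):

```go
package pipeline

import (
	"flag"
	"math/rand"
	"testing"
	"time"
)

var seedFlag = flag.Int64("seed", 0, "Randomization seed (0 means time-based)")

// initSeed seeds math/rand either from the -seed CLI flag or from the clock,
// and logs the value so a failing randomized run can be replayed exactly.
func initSeed(t *testing.T) {
	seed := *seedFlag
	if seed == 0 {
		seed = time.Now().UnixNano()
	}
	t.Logf("reproduce with: go test -run '%s' -seed=%d", t.Name(), seed)
	rand.Seed(seed)
}
```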
		wqu <- batch
	}()
}
wg.Wait()
For randomized testing, testing/quick is your friend.
Implemented in 94b133a94.
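For context, a self-contained sketch of how testing/quick drives a randomized property; the property below is a trivial stand-in, not the PR's real check:

```go
package pipeline

import (
	"testing"
	"testing/quick"
)

func TestQuickSketch(t *testing.T) {
	// testing/quick calls the property with randomized arguments and reports
	// the first input for which it returns false.
	property := func(batchSizes []uint8) bool {
		var total int
		for _, n := range batchSizes {
			total += int(n)
		}
		// Trivial invariant standing in for "all generated events get published".
		return total >= 0
	}
	if err := quick.Check(property, &quick.Config{MaxCount: 50}); err != nil {
		t.Error(err)
	}
}
```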
// Make sure that all events have eventually been published
c := test.client.(interface{ Published() int })
assert.Equal(t, numEvents.Load(), c.Published())
Better to have an assert loop with a timeout instead of a hard sleep.
Implemented in 9b1cf7133.
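A minimal sketch of such a polling helper; the name waitUntilTrue and the timeout value are assumptions for illustration:

```go
package pipeline

import "time"

// waitUntilTrue polls fn until it returns true or the timeout expires.
// Compared to a fixed sleep, it returns as soon as the condition holds and
// only fails after the full timeout has really been exhausted.
func waitUntilTrue(timeout time.Duration, fn func() bool) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if fn() {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}
```

The test can then assert that the published count eventually reaches numEvents.Load() within a generous timeout instead of sleeping a fixed amount.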
	numEvents.Add(len(batch.Events()))

	wqu <- batch
}()
We are only moving batches into the work queue here. Why do we need a goroutine for this? In case the queue/worker blocks, the test will time out (the default test timeout in Go is 10 min).
Yeah, good point. I was trying to simulate as much as possible how the publisher is being used in reality — e.g. multiple FB inputs concurrently trying to publish batches of events. However, the work queue is an unbuffered channel so trying to concurrently send batches to it is pointless 🤦♂. I will remove the goroutine.
Fixed in 33af88a42.
	assert.Equal(t, numEvents.Load(), client.Published())
}

type mockClient struct{ published int }
We run our tests with -race. The way this type is used, we should either use an atomic or protect operations via a mutex.
Fixed in 33391da48.
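One possible shape of the fix, sketched with a mutex and assuming the test file's existing sync and libbeat publisher imports; the real change may differ:

```go
type mockClient struct {
	mu        sync.RWMutex
	published int
}

// Published is safe to call from the test while the worker goroutine is
// still publishing, which matters when running with -race.
func (c *mockClient) Published() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.published
}

func (c *mockClient) Publish(batch publisher.Batch) error {
	c.mu.Lock()
	c.published += len(batch.Events())
	c.mu.Unlock()
	batch.ACK()
	return nil
}
```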
@urso I've addressed all your review feedback. It's my first time using …
(force-pushed from b47a623 to e9e20c5)
@urso I could use some help here with the …

Essentially this means that all batches are being consumed by the worker before … So I tried a couple things (see the last few commits) to try and leave some batches in the queue when …

Related: I can only see this test failing on Travis CI right now, as the …
(force-pushed from cc2d4b3 to 8bb19ee)
// Block publishing
if c.publishLimit > 0 && c.published >= c.publishLimit {
	batch.Retry() // to simulate not acking
	return nil
}
@urso WDYT about this logic to emulate blocking publishing?
This emulation is not blocking, but a failing output. The batch is not acked, but this Retry + return nil will signal that the output is ready to consume another batch.
A blocking simulation would require you to wait for some signal (e.g. via a control channel).
At one point I had a sleep in here between the retry and return. The idea then was that the first client would be closed before the sleep finished.
A control channel is better than a sleep. Once the first client is closed I can close the control channel to remove the block. However, the Retry (before waiting to consume from the control channel) will still be needed, otherwise the final publish count doesn't add up to the expected total number of events.
Implemented blocking with control channel in 508b606.
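A sketch of the control-channel approach being described; the unblock field and the exact flow are illustrative rather than necessarily what 508b606 does:

```go
// Assumed field on mockClient: unblock chan struct{}, closed by the test
// once the first worker has been shut down.
func (c *mockClient) Publish(batch publisher.Batch) error {
	c.mu.Lock()
	limitReached := c.publishLimit > 0 && c.published >= c.publishLimit
	if !limitReached {
		c.published += uint(len(batch.Events()))
	}
	c.mu.Unlock()

	if limitReached {
		// Give the batch back so the final published count still adds up,
		// then genuinely block until the test signals it is safe to continue.
		batch.Retry()
		<-c.unblock
		return nil
	}

	batch.ACK()
	return nil
}
```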
		total := published + client.Published()
		return numEvents.Load() == total
	})
}, &quick.Config{MaxCount: 50})
Keeping MaxCount at 50 seems to keep Travis happy. Before that, the build job was being killed because it was taking too long.
Before introducing quick check, the count was actually 1 :)
This is some kind of stress test. Unfortunately, stress tests don't sit well with Travis; we have had bad performance issues with the queue stress tests as well. I think long term we should not have stress tests run by Travis, but have a separate job running those for even longer. For 'some' simple unit testing, a count of 1 might be OK.
True, good point :)
So would you recommend leaving this at 50, lowering it to 1, or maybe somewhere in between? I ask because, while 50 is working at the moment, I'm worried whether it'll become a source of flakiness. I don't think there's a way to know for sure until Travis runs this several times, though?
Well, it's difficult to find the right value. Maybe set it to 25, so we have some more headroom.
		numEvents.Add(uint(len(batch.Events())))
		wqu <- batch
	}()
}
Instead of creating a goroutine per batch, how about creating a single goroutine that executes the for loop? This would actually be closer to how the work queue is used.
Sure, I will do that. Just out of curiosity, say there are multiple Filebeat inputs — don't they each get their own goroutine sending to the same publisher work queue?
Implemented in 94f3445.
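Roughly the shape that suggestion leads to; wqu and numEvents are the test's own names, while the batches slice is assumed, so this is a fragment rather than a standalone test:

```go
// One feeder goroutine pushes every batch into the (unbuffered) work queue,
// while the worker drains it from the other side; this mirrors how a real
// producer feeds the pipeline more closely than one goroutine per batch.
go func() {
	for _, batch := range batches {
		numEvents.Add(uint(len(batch.Events())))
		wqu <- batch
	}
}()
// The test then waits (with a timeout) until client.Published() matches numEvents.Load().
```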
	SeedFlag = flag.Int64("seed", 0, "Randomization seed")
)

func TestPublish(t *testing.T) {
Maybe we can rename the test? We don't test Publish; the tests seem to check the behavior of the clientWorkers.
Done in be4bf09.
type mockClient struct {
	mu           sync.RWMutex
	publishLimit uint
	published    uint
Maybe we can exert more control by replacing published and publishLimit with a func. The func could decide when to block and when to continue. This way the test logic would be more contained within the test. WDYT?
Implemented mockable publish behavior function in 508b606.
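A sketch of what that func-based mock could look like; the publishFn name and signature are assumptions, and the actual change is the one in 508b606:

```go
type mockClient struct {
	mu        sync.RWMutex
	published uint
	// publishFn decides, per batch, whether to ACK, Retry, or block on a
	// channel, and returns the number of events it considers published.
	publishFn func(batch publisher.Batch) uint
}

func (c *mockClient) Publish(batch publisher.Batch) error {
	n := c.publishFn(batch)
	c.mu.Lock()
	c.published += n
	c.mu.Unlock()
	return nil
}
```

Each test case can then supply its own publishFn, keeping the blocking/retry logic next to the scenario it belongs to.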
Travis CI is green and Jenkins CI failures are unrelated. Merging.
* Adding unit tests for publisher output
* Adding another unit test (TODO)
* Adding unit test for closing worker midway
* Reorganizing imports
* Output PRNG seed + provide flag to specify seed
* Cancel batch with netClient if it is closed
* Use waitUntil loop instead of hard sleep
* Making mockClient threadsafe
* Removing goroutine from happy path unit test
* Using testing/quick
* Increase batch sizes in tests
* Adding sleep to ensure some batches are still at time of close
* Experiment with slightly higher sleep time
* Moving sleep to publish time
* Increase publish latency
* Increasing publish latency again
* Removing publishLatency
* Fix timeout to large value
* Make first client block after publishing X events
* Actually block publishing
* Reduce number of batches to prevent running out of memory
* Bumping up # of batches
* Bumping up # of batches again
* Try different strategy - publish 80% of events
* Cranking up sleep time in publish blocking
* Only publish first 20% of events
* Make sure to return batch for retrying
* Adding debugging statements to see what's happening in Travis
* More robust to race conditions
* Restricting quick iterations to 50 to see if that helps in Travis
* Put entire loop into goroutine
* Renaming tests
* Emulate blocking + mockable publish behavior
* Removing retry and return
* Clarify intent with comment
* Setting # of quick iterations to 25
What does this PR do?
Adds unit tests for the publisher output functionality in libbeat.
Why is it important?
It checks that the publisher output publishes the expected number of events under various scenarios.
Checklist
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Related PRs