Events → BatchMode → Flush doesn't block and wait #476

Open · skyzyx opened this issue Sep 15, 2020 · 6 comments
Labels: client-go, pinned

skyzyx commented Sep 15, 2020

Description

The documentation for the Events.Flush method says:

Flush gives the user a way to manually flush the queue in the foreground. This is also used by watchdog when the timer expires.

I expect “foreground” to be synonymous with “block and wait,” but the client doesn't appear to be doing that. This is the culmination of research that was originally discussed in #464.

Go Version

go version go1.15.2 darwin/amd64

Current behavior

I'm reading thousands of records from one system and pushing them into New Relic via pkg/events in batch mode. The code is very similar to what was posted in #464, but it essentially boils down to something like this:

(I've stripped out some event handling and logging code, so there may be some phantom variables here.)

func main() {
	ctx := context.Background()

	cfg := config.New()
	cfg.InsightsInsertKey = os.Getenv("NEWRELIC_INSIGHTS_INSERT")
	cfg.Compression = config.Compression.Gzip
	cfg.LogLevel = "trace"

	client := events.New(cfg)
	logger := cfg.GetLogger()

	// Start batch events mode
	if err := client.BatchMode(
		ctx,
		nrAccountID,
		events.BatchConfigTimeout(batchTimeout),
	); err != nil {
		log.Fatalln("error starting batch mode:", err)
	}

	// Spawn goroutines...
	wgRawData.Add(1)
	go profileRaw(rawChannel, &client, &cfg)

	// Wait for goroutines to return...
	go func() {
		wgRawData.Wait()
		close(rawChannel)
	}()

	for rawData := range rawChannel {
		spew.Dump(rawData)
	}

	// Force-flush the events when we're done with this batch of data so that
	// the function ends and returns instead of listening indefinitely.
	if err := client.Flush(); err != nil {
		logger.Error(fmt.Sprintf("error flushing event queue: %s", err.Error()))
	}

	logger.Info("==> Events instructed to flush.")
}

// Leverages Go channels to execute the processing concurrently.
func profileRaw(c chan Response, client *events.Events, cfg *config.Config) {
	// Prepare to close the channel
	defer wgRawData.Done()

	ctx := context.Background()
	logger := cfg.GetLogger()

	logger.Info("==> Kicking off the profiler...")

	// Run the profiler
	output, ok := profiler(profilerDuration)
	if !ok {
		logger.Error("Profiler function failed.")
		c <- resp
		return
	}

	logger.Info("==> Parsing profiler data...")

	payloads := strings.Split(output, "\n\n")

	logger.Info("==> Sending data chunks to New Relic as events...")

	for i := range payloads {
		payload := payloads[i]

		// Read the JSON into the appropriate Go struct
		err := json.Unmarshal([]byte(payload), &event)
		if err != nil {
			logger.Error(fmt.Sprintf("[] error unmarshaling JSON into struct: %s %s", err.Error(), payload))
			break
		}

		// ...do stuff to transform raw profiler event data into a flattened
		// structure that the New Relic Events API wants...

		// Queue a custom event
		if err := client.EnqueueEvent(ctx, flatEvent); err != nil {
			logger.Error(fmt.Sprintf("error posting custom event: %s", err.Error()))
		}
	}

	logger.Info("==> Sending events completed.")

	// Return data to the channel, closing it
	c <- resp
}

What happens is that client.Flush() doesn't prevent the program from ending, so main() exits before the queue flush completes (or, more accurately, even starts).

Expected behavior

My expectation is that client.Flush() should block and wait — preventing main() from exiting until the flushed events have responded successfully (or at least have been sent in the first place).

Steps To Reproduce

  1. Create a new events client.
  2. Put it into batch mode.
  3. Set it to flush every second.
  4. Enqueue a few thousand event messages.
  5. Trigger a flush.
  6. Program ends.

Additional Context

I was seeing cases where the events sometimes made it and sometimes did not. The difference appeared to be related to which log level I set on the client. Once I narrowed down the variables, I discovered that it wasn't the log level itself so much as the time it took for those log levels to write data to stdout/stderr.

  • debug and trace wrote a lot more data to stdout/stderr, which slowed the program down enough for the goroutines to complete.

  • info wrote less data, so the program ran faster and the goroutines didn't have enough time to finish (or even start).

A workaround has been to sleep at the end of the main() function.

time.Sleep(5 * time.Second)

But this means it always ends up sleeping either too long or not long enough. Too long, and I miss other live events while I wait for the program to finish sleeping before it starts listening again. Too short, and some of the events I've read never make it to the Events API.

Ideally, the client should intrinsically understand what it needs to block for, then stop blocking once the queue flushing has completed (for whatever definition of "completed" is appropriate — request has been sent, or response has been received).
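
For illustration, here is roughly the call I would like to be able to make at the end of main() above. FlushAndWait does not exist in the client today; the name and signature are purely hypothetical.

// Hypothetical, for illustration only: a blocking variant of Flush that
// returns once every queued event has actually been posted (or the context
// expires). Reuses the client and logger from the example above.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

if err := client.FlushAndWait(ctx); err != nil {
	logger.Error(fmt.Sprintf("error flushing event queue: %s", err.Error()))
}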

References or Related Issues

  • #464

zlesnr (Contributor) commented Sep 18, 2020

Thank you for raising the issue. It looks to me like what we need is a new blocking Shutdown() method that waits until all the messages are sent before returning and prevents new messages from entering the buffer. It looks like the Flush() method only writes a control message to the channel, which triggers the listener routine to perform the actual sending. But that isn't quite the whole story either, since the grabAndConsumeEvents() method also spawns a goroutine to perform the actual sending.

The batchWorker maintains some state about the count, and I'm not sure how that would be handled if the flusher logic were moved out. Also, Flush() doesn't stop execution, so even while a flush is happening, more events can still be written to the buffer, which is probably less than desirable. So perhaps if we took the route of writing a new Shutdown() method, it could terminate the context, but that might not be right either, since the context is passed to the batch worker by the caller.

Perhaps a combination of sync.WaitGroup usage within grabAndConsumeEvents and copying the flush logic into a shutdown method could help.
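
Very roughly, something like this. The type and field names here are invented for illustration and don't match the actual pkg/events internals; consider it a sketch of the approach, not the implementation.

package batchsketch

import (
	"context"
	"sync"
)

// Sketch only: a stand-in for the batch client in pkg/events. The real
// implementation would hang this off the Events struct.
type batchClient struct {
	mu         sync.Mutex
	accepting  bool           // guard against new events once shutdown starts
	wg         sync.WaitGroup // each sending goroutine does wg.Add(1)/wg.Done()
	flushQueue chan struct{}  // the control channel that Flush() writes to
}

// Shutdown stops accepting new events, triggers a final flush, and blocks
// until every in-flight send has finished or ctx is cancelled.
func (c *batchClient) Shutdown(ctx context.Context) error {
	c.mu.Lock()
	c.accepting = false
	c.mu.Unlock()

	// Reuse the existing flush path: signal the batch worker to send
	// whatever is currently buffered.
	c.flushQueue <- struct{}{}

	// Wait for the sending goroutines spawned by grabAndConsumeEvents.
	done := make(chan struct{})
	go func() {
		c.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

EnqueueEvent would then need to check that accepting flag before buffering anything new.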

zlesnr (Contributor) commented Sep 28, 2020

I also wonder if we might be able to leverage the context object to close down the rest of the system while the post completes. I don't expect to be able to look into this in the near term, though.
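
As a generic shape of that idea (not the actual batchWorker code, and omitting the normal size/timer-based sends to focus on the shutdown path): when its context is cancelled, the worker drains and posts whatever is left, then signals a done channel the caller can block on.

package batchsketch

import "context"

// Sketch of a context-driven shutdown; the interface{} events and the post
// callback are placeholders, not the real pkg/events types.
func batchWorkerSketch(ctx context.Context, in <-chan interface{}, post func([]interface{})) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		var buf []interface{}
		for {
			select {
			case ev := <-in:
				buf = append(buf, ev)
			case <-ctx.Done():
				// Drain anything still queued, finish the post, then exit.
				for {
					select {
					case ev := <-in:
						buf = append(buf, ev)
					default:
						if len(buf) > 0 {
							post(buf)
						}
						return
					}
				}
			}
		}
	}()
	return done
}

The caller would then cancel() and block on <-done, which gives Shutdown-like behavior without new exported surface area. The downside, as noted above, is that the context is currently owned by whoever calls BatchMode.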

jthurman42 (Contributor) commented:

@skyzyx This looks like a documentation issue, as Flush was meant to be a lever for manually initiating a send, not to block/wait. (The initial use cases for this were all long-running services that weren't expected to shut down.) Adding the logic @zlesnr suggests in a new func seems reasonable.

stale bot commented Oct 26, 2020

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs.

The stale bot added the stale label on Oct 26, 2020.

skyzyx (Author) commented Oct 27, 2020

Bump.

The stale bot removed the stale label on Oct 27, 2020.
ctrombley added the pinned label on Oct 28, 2020.
jpvajda added this issue to Feature Backlog in NR1 Developer Toolkit Community on May 24, 2021.

kidk (Contributor) commented May 22, 2023

Hi @skyzyx

Were you able to work around this or is it still an active issue?
