Events → BatchMode → Flush doesn't block and wait #476

Open · skyzyx opened this issue Sep 15, 2020 · 6 comments
Labels: client-go, pinned

skyzyx commented Sep 15, 2020

Description

The documentation for the Events.Flush method says:

Flush gives the user a way to manually flush the queue in the foreground. This is also used by watchdog when the timer expires.

I expect “foreground” to be synonymous with “block and wait,” but the client doesn't appear to be doing that. This is the culmination of research that was originally discussed in #464.

Go Version

go version go1.15.2 darwin/amd64

Current behavior

I'm reading thousands of records from one system and pushing them into New Relic via pkg/events in batch mode. The code is very similar to what was posted in #464, but it essentially boils down to something like this:

(I've stripped out some event handling and logging code, so there may be some phantom variables here.)

func main() {
	ctx := context.Background()

	cfg := config.New()
	cfg.InsightsInsertKey = os.Getenv("NEWRELIC_INSIGHTS_INSERT")
	cfg.Compression = config.Compression.Gzip
	cfg.LogLevel = "trace"

	client := events.New(cfg)
	logger := cfg.GetLogger()

	// Start batch events mode
	if err := client.BatchMode(
		ctx,
		nrAccountID,
		events.BatchConfigTimeout(batchTimeout),
	); err != nil {
		log.Fatalln("error starting batch mode:", err)
	}

	// Spawn goroutines...
	wgRawData.Add(1)
	go profileRaw(rawChannel, &client, &cfg)

	// Wait for goroutines to return...
	go func() {
		wgRawData.Wait()
		close(rawChannel)
	}()

	for rawData := range rawChannel {
		spew.Dump(rawData)
	}

	// Force-flush the events when we're done with this batch of data so that
	// the function ends and returns instead of listening indefinitely.
	if err := client.Flush(); err != nil {
		logger.Error(fmt.Sprintf("error flushing event queue: %s", err.Error()))
	}

	logger.Info("==> Events instructed to flush.")
}

// Leverages Go channels to execute the processing concurrently.
func profileRaw(c chan Response, client *events.Events, cfg *config.Config) {
	// Prepare to close the channel
	defer wgRawData.Done()

	ctx := context.Background()
	logger := cfg.GetLogger()

	logger.Info("==> Kicking off the profiler...")

	// Run the profiler
	output, ok := profiler(profilerDuration)
	if !ok {
		logger.Error("Profiler function failed.")
		c <- resp
		return
	}

	logger.Info("==> Parsing profiler data...")

	payloads := strings.Split(output, "\n\n")

	logger.Info("==> Sending data chunks to New Relic as events...")

	for i := range payloads {
		payload := payloads[i]

		// Read the JSON into the appropriate Go struct
		err := json.Unmarshal([]byte(payload), &event)
		if err != nil {
			logger.Error(fmt.Sprintf("[] error unmarshaling JSON into struct: %s %s", err.Error(), payload))
			break
		}

		// ...do stuff to transform raw profiler event data into a flattened
		// structure that the New Relic Events API wants...

		// Queue a custom event
		if err := client.EnqueueEvent(ctx, flatEvent); err != nil {
			logger.Error(fmt.Sprintf("error posting custom event: %s", err.Error()))
		}
	}

	logger.Info("==> Sending events completed.")

	// Return data to the channel, closing it
	c <- resp
}

What happens is that client.Flush() doesn't prevent the program from ending, so main() exits before the queue flush completes (or, more accurately, even starts).

Expected behavior

My expectation is that client.Flush() should block and wait — preventing main() from exiting until the flushed events have responded successfully (or at least have been sent in the first place).

Steps To Reproduce

  1. Create a new events client.
  2. Put it into batch mode.
  3. Set it to flush every second.
  4. Enqueue a few thousand event messages.
  5. Trigger a flush.
  6. Program ends.

Additional Context

I was seeing cases where the events sometimes made it and sometimes did not. The difference appeared to be related to which log level I set on the client. Once I narrowed down the variables, I discovered that it wasn't the log level itself so much as the time it took for those log levels to write data to stdout/stderr.

  • debug and trace wrote a lot more data to stdout/stderr, which slowed the program down enough for the goroutines to complete.

  • info wrote less data, so the program ran faster and the goroutines didn't have enough time to finish (or even start).

A workaround has been to sleep at the end of the main() function.

time.Sleep(5 * time.Second)

But this means it always ends up sleeping either too long or not long enough. Too long, and I miss other live events while I wait for the program to finish sleeping before it starts listening again. Too short, and some of the events I've read never make it to the Events API.

Ideally, the client should intrinsically understand what it needs to block for, then stop blocking once the queue flushing has completed (for whatever definition of "completed" is appropriate — request has been sent, or response has been received).
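
For illustration, here is roughly the call I would like to be able to make at the end of main() above. FlushAndWait does not exist in the client today; the name and signature are purely hypothetical.

// Hypothetical, for illustration only: a blocking variant of Flush that
// returns once every queued event has actually been posted (or the context
// expires). Reuses the client and logger from the example above.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

if err := client.FlushAndWait(ctx); err != nil {
	logger.Error(fmt.Sprintf("error flushing event queue: %s", err.Error()))
}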

References or Related Issues

  • #464

zlesnr (Contributor) commented Sep 18, 2020

Thank you for raising the issue. It looks to me like what we need is a new blocking Shutdown() method that waits until all the messages are sent before returning and prevents new messages from entering the buffer. It looks like the Flush() method only writes a control message to the channel, which triggers the listener routine to perform the actual sending. But that isn't quite the whole story either, since the grabAndConsumeEvents() method also spawns a goroutine to perform the actual sending.

The batchWorker maintains some state about the count, and I'm not sure how that would be handled if the flusher logic were moved out. Also, Flush() doesn't stop execution, so even while a flush is happening, more events can still be written to the buffer, which is probably less than desirable. So perhaps if we took the route of writing a new Shutdown() method, it could terminate the context, but that might not be right either, since the context is passed to the batch worker by the caller.

Perhaps a combination of sync.WaitGroup usage within grabAndConsumeEvents and copying the flush logic into a shutdown method could help.
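
Very roughly, something like this. The type and field names here are invented for illustration and don't match the actual pkg/events internals; consider it a sketch of the approach, not the implementation.

package batchsketch

import (
	"context"
	"sync"
)

// Sketch only: a stand-in for the batch client in pkg/events. The real
// implementation would hang this off the Events struct.
type batchClient struct {
	mu         sync.Mutex
	accepting  bool           // guard against new events once shutdown starts
	wg         sync.WaitGroup // each sending goroutine does wg.Add(1)/wg.Done()
	flushQueue chan struct{}  // the control channel that Flush() writes to
}

// Shutdown stops accepting new events, triggers a final flush, and blocks
// until every in-flight send has finished or ctx is cancelled.
func (c *batchClient) Shutdown(ctx context.Context) error {
	c.mu.Lock()
	c.accepting = false
	c.mu.Unlock()

	// Reuse the existing flush path: signal the batch worker to send
	// whatever is currently buffered.
	c.flushQueue <- struct{}{}

	// Wait for the sending goroutines spawned by grabAndConsumeEvents.
	done := make(chan struct{})
	go func() {
		c.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

EnqueueEvent would then need to check that accepting flag before buffering anything new.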

zlesnr (Contributor) commented Sep 28, 2020

I also wonder if we might be able to leverage the context object to close down the rest of the system while the post completes. I don't expect to be able to look into this in the near term, though.
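
As a generic shape of that idea (not the actual batchWorker code, and omitting the normal size/timer-based sends to focus on the shutdown path): when its context is cancelled, the worker drains and posts whatever is left, then signals a done channel the caller can block on.

package batchsketch

import "context"

// Sketch of a context-driven shutdown; the interface{} events and the post
// callback are placeholders, not the real pkg/events types.
func batchWorkerSketch(ctx context.Context, in <-chan interface{}, post func([]interface{})) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		var buf []interface{}
		for {
			select {
			case ev := <-in:
				buf = append(buf, ev)
			case <-ctx.Done():
				// Drain anything still queued, finish the post, then exit.
				for {
					select {
					case ev := <-in:
						buf = append(buf, ev)
					default:
						if len(buf) > 0 {
							post(buf)
						}
						return
					}
				}
			}
		}
	}()
	return done
}

The caller would then cancel() and block on <-done, which gives Shutdown-like behavior without new exported surface area. The downside, as noted above, is that the context is currently owned by whoever calls BatchMode.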

jthurman42 (Contributor) commented:

@skyzyx This looks like a documentation issue, as Flush was meant to be a lever for manually initiating a send, not to block/wait. (The initial use cases for this were all long-running services that weren't expected to shut down.) Adding the logic @zlesnr suggests in a new func seems reasonable.

stale bot commented Oct 26, 2020

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs.

The stale bot added the stale label on Oct 26, 2020.

skyzyx (Author) commented Oct 27, 2020

Bump.

The stale bot removed the stale label on Oct 27, 2020.
ctrombley added the pinned label on Oct 28, 2020.
jpvajda added this issue to Feature Backlog in NR1 Developer Toolkit Community on May 24, 2021.

kidk (Contributor) commented May 22, 2023

Hi @skyzyx

Were you able to work around this or is it still an active issue?
