"fatal error" when calling Batch Eval with Tags at Volume #398

Closed
fenriskiba opened this issue Sep 8, 2020 · 7 comments · Fixed by #402
Comments

@fenriskiba (Contributor)

We found an issue with the Tag functionality for Batch Evaluations that only shows up at high volumes. Related to #369.

Expected Behavior

Calling postEvaluationBatch with valid flagTags should result in evaluations for every Flag with the provided Tags.

Current Behavior

At volume (we verified with 1 call every 10 milliseconds), we receive a fatal error: concurrent map writes, which crashes Flagr. We believe this is caused by a race condition in the eval cache logic.
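For reference, a minimal load sketch along the lines of what we ran. The port, endpoint path, and payload field names below are assumptions about the batch evaluation API rather than exact details from our setup:

```go
// Hypothetical reproduction sketch; endpoint path, port, and payload fields
// are assumptions, not details confirmed in this issue.
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"
)

func main() {
	// Batch evaluation payload using flagTags (field names assumed).
	body := []byte(`{"entities":[{"entityID":"user-1","entityType":"user"}],"flagTags":["checkout","beta"]}`)

	// Roughly one call every 10 milliseconds, as in our verification.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	for range ticker.C {
		go func() {
			resp, err := http.Post(
				"http://localhost:18000/api/v1/evaluation/batch", // default Flagr port assumed
				"application/json",
				bytes.NewReader(body),
			)
			if err != nil {
				log.Println("request error:", err)
				return
			}
			resp.Body.Close()
		}()
	}
}
```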

Possible Solution

We've identified an additional lock we can add to the GetByTags function in eval_cache.go that should resolve the problem. We are currently preparing some internal stress testing to validate the fix.
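To give a sense of the kind of stress testing we mean, here is a self-contained toy (not the actual eval_cache.go code) that mimics a shared-map write while holding only a read lock. Running it with go test -race reports the race right away, without needing production-level traffic; without -race it can crash with the same fatal error: concurrent map writes.

```go
// eval_cache_race_test.go — a toy reproduction, not Flagr's real code.
package toycache

import (
	"sync"
	"testing"
)

type flag struct{ id uint }

type toyCache struct {
	mu       sync.RWMutex
	tagCache map[string]map[uint]*flag
}

// GetByTags deliberately mimics the racy pattern: it mutates the shared map
// while holding only the read lock.
func (c *toyCache) GetByTags(tags []string) []*flag {
	c.mu.RLock()
	defer c.mu.RUnlock()

	results := map[uint]*flag{}
	for _, t := range tags {
		if c.tagCache[t] == nil {
			c.tagCache[t] = map[uint]*flag{} // shared-map write under RLock
		}
		for id, f := range c.tagCache[t] {
			results[id] = f
		}
	}

	out := make([]*flag, 0, len(results))
	for _, f := range results {
		out = append(out, f)
	}
	return out
}

// `go test -race` flags the concurrent map access here almost immediately.
func TestGetByTagsConcurrent(t *testing.T) {
	c := &toyCache{tagCache: map[string]map[uint]*flag{
		"checkout": {1: {id: 1}},
	}}

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 500; j++ {
				// Mix a seeded tag with a missing one to trigger the lazy write.
				c.GetByTags([]string{"checkout", "beta"})
			}
		}()
	}
	wg.Wait()
}
```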

While working on this, we also discovered a similar issue with the rateLimitPerFlagConsoleLogging function in eval.go, which we are updating as well.

@zhouzhuojie (Collaborator)

Yes, thanks for reporting!

  1. GetByTags - I just noticed that it changes the value while holding an RLock. I think the fix is, instead of modifying the protected tagsCache, to merge the flags into a temporary results variable, which is much safer.
  2. rateLimitPerFlagConsoleLogging may need an RWMutex, but I haven't seen this in production even with a very high volume of traffic; the race only happens when two concurrent reads hit rateLimitMap at the same time while it doesn't yet have the flagID key (both then write the new limiter), so it's rare to reproduce. Still, it's definitely worth adding an RWMutex, or, even simpler, using the newer sync.Map (https://golang.org/src/sync/map.go?s=6882:6965#L189), whose LoadOrStore is convenient for setting up the per-flag rate limiter. A rough sketch of both ideas follows below.
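To be concrete, here is that sketch. The struct and field names are illustrative stand-ins, not the actual eval_cache.go / eval.go definitions, and golang.org/x/time/rate is just an example limiter library:

```go
// Illustrative sketch only; EvalCache, tagCache, and the limiter settings are
// stand-ins, not the real Flagr code.
package evalcache

import (
	"sync"

	"golang.org/x/time/rate"
)

type Flag struct{ ID uint }

type EvalCache struct {
	mapCacheLock sync.RWMutex
	tagCache     map[string]map[uint]*Flag // tag -> flagID -> flag
}

// (1) GetByTags only reads the protected tagCache; all writes go into a
// results map that is local to this call, so holding the read lock is enough.
func (ec *EvalCache) GetByTags(tags []string) []*Flag {
	results := map[uint]*Flag{}

	ec.mapCacheLock.RLock()
	for _, t := range tags {
		for id, f := range ec.tagCache[t] {
			results[id] = f // write goes to the local map, not the shared cache
		}
	}
	ec.mapCacheLock.RUnlock()

	flags := make([]*Flag, 0, len(results))
	for _, f := range results {
		flags = append(flags, f)
	}
	return flags
}

// (2) Per-flag console-log rate limiters via sync.Map: LoadOrStore either
// returns the existing limiter or atomically installs a new one, so two
// requests that both miss the same flagID can no longer race on a plain map.
var rateLimiterMap sync.Map // flagID -> *rate.Limiter

func getFlagRateLimiter(flagID uint) *rate.Limiter {
	limiter, _ := rateLimiterMap.LoadOrStore(flagID, rate.NewLimiter(rate.Limit(100), 100))
	return limiter.(*rate.Limiter)
}
```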

@zhouzhuojie (Collaborator)

I was looking into solutions, and it makes sense to just use atomic.Value to store all the cache maps. As https://golang.org/pkg/sync/atomic/#Value points out, it also reduces lock contention:

The following example shows how to use Value for periodic program config updates and propagation of the changes to worker goroutines.

I can also work on refactoring it after your fix.
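Roughly what I have in mind (illustrative names only, not the final code): rebuild all the maps off to the side and publish them with a single atomic store, so readers never take a lock and never see a half-built map.

```go
// Illustrative sketch of the atomic.Value idea, not the actual refactor.
package evalcacheatomic

import (
	"sync/atomic"
	"time"
)

type Flag struct{ ID uint }

// cacheContainer bundles all lookup maps so they can be swapped as one unit.
type cacheContainer struct {
	idCache  map[uint]*Flag
	tagCache map[string]map[uint]*Flag
}

type EvalCache struct {
	cache atomic.Value // holds a *cacheContainer
}

// reloadMapCache builds a brand-new container and publishes it with a single
// atomic store; in-flight readers keep using the old snapshot until they finish.
func (ec *EvalCache) reloadMapCache(flags []*Flag, tagsByFlag map[uint][]string) {
	c := &cacheContainer{
		idCache:  map[uint]*Flag{},
		tagCache: map[string]map[uint]*Flag{},
	}
	for _, f := range flags {
		c.idCache[f.ID] = f
		for _, tag := range tagsByFlag[f.ID] {
			if c.tagCache[tag] == nil {
				c.tagCache[tag] = map[uint]*Flag{}
			}
			c.tagCache[tag][f.ID] = f
		}
	}
	ec.cache.Store(c)
}

// Readers take no lock: one atomic load, then work on an immutable snapshot.
// (The cache must be populated once before readers call this.)
func (ec *EvalCache) GetByTags(tags []string) []*Flag {
	c := ec.cache.Load().(*cacheContainer)
	results := map[uint]*Flag{}
	for _, t := range tags {
		for id, f := range c.tagCache[t] {
			results[id] = f
		}
	}
	out := make([]*Flag, 0, len(results))
	for _, f := range results {
		out = append(out, f)
	}
	return out
}

// Periodic refresh, in the spirit of the config example in the sync/atomic docs.
func (ec *EvalCache) StartPolling(fetch func() ([]*Flag, map[uint][]string), every time.Duration) {
	flags, tags := fetch()
	ec.reloadMapCache(flags, tags)
	go func() {
		for range time.Tick(every) {
			flags, tags := fetch()
			ec.reloadMapCache(flags, tags)
		}
	}()
}
```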

@fenriskiba (Contributor, Author)

Cool, your first comment is actually what we had already started testing. 😄

The second comment definitely sounds interesting. The priority for us is getting the current issue resolved so we can safely start using Tags, so I think we'll submit the short-term fix.

@zhouzhuojie (Collaborator)

@fenriskiba I was refactoring the code in #402 and thought it might be handy to test out the solutions I mentioned above; hopefully this also fixes the fatal error: concurrent map writes.

@fenriskiba (Contributor, Author)

@zhouzhuojie We did some quick performance testing on our end with that branch, and it definitely seemed to fix that error. Do you have an ETA on when we could see this in a release?

@zhouzhuojie (Collaborator)

1.1.12 is ready. https://hub.docker.com/layers/checkr/flagr/1.1.12/images/sha256-407d7099d6ce7e3632b6d00682a43028d75d3b088600797a833607bd629d1ed5?context=explore

@fenriskiba (Contributor, Author)

I love how quickly you work. Thank you!
