Bug Report: Flaky test TestGatewayBufferingWhileReparenting #13465

Closed
GuptaManan100 opened this issue Jul 10, 2023 · 1 comment · Fixed by #13469

Comments

@GuptaManan100
Member

Overview of the Issue

The test TestGatewayBufferingWhileReparenting is flaky on main.

Reproduction Steps

Run TestGatewayBufferingWhileReparenting repeatedly until it fails.

Binary Version

main

Operating System and Environment details

-

Log Fragments

No response

@GuptaManan100
Member Author

As far as I can tell, the flakiness comes from an underlying issue.

When we run a query, we first use the health check to find the tablets that are serving:

tablets := gw.hc.GetHealthyTabletStats(target)

If we find no such tablets, we eventually use the keyspace event watcher to find out whether the primary is serving:

primary, notServing := kev.PrimaryIsNotServing(target)

Health check information coming from the vttablets is first processed by the health check in the vtgates. The keyspace event watcher registers with the health check for notifications of these changes, but it processes those notifications asynchronously.

This can lead to a situation where the primary tablet becomes non-serving and the health check already reflects the change, but the keyspace event watcher does not yet.
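The window described above can be illustrated with a small, self-contained sketch. It is not Vitess code: `healthCheck`, `watcher`, `setPrimaryServing`, and `drain` are hypothetical names, and the asynchronous processing is modelled with a buffered channel that the watcher drains later.

```go
package main

import "fmt"

// healthCheck is a hypothetical stand-in for the vtgate health check.
// It updates its own state immediately and queues a notification.
type healthCheck struct {
	primaryServing bool
	updates        chan bool // pending notifications for subscribers
}

// setPrimaryServing records the new state and enqueues it for subscribers.
func (hc *healthCheck) setPrimaryServing(serving bool) {
	hc.primaryServing = serving
	hc.updates <- serving
}

// watcher is a hypothetical stand-in for the keyspace event watcher.
// Its view only changes when it processes queued notifications.
type watcher struct {
	primaryServing bool
}

// drain applies every notification that is currently queued.
func (w *watcher) drain(updates <-chan bool) {
	for {
		select {
		case serving := <-updates:
			w.primaryServing = serving
		default:
			return
		}
	}
}

func main() {
	hc := &healthCheck{primaryServing: true, updates: make(chan bool, 8)}
	w := &watcher{primaryServing: true}

	// The primary stops serving: the health check knows at once,
	// the watcher has not processed the notification yet.
	hc.setPrimaryServing(false)
	fmt.Println("before drain: health check =", hc.primaryServing, "watcher =", w.primaryServing)

	// Once the watcher catches up, both views agree again.
	w.drain(hc.updates)
	fmt.Println("after drain:  health check =", hc.primaryServing, "watcher =", w.primaryServing)
}
```

Between `setPrimaryServing` and `drain`, a caller that consults both components sees them disagree, which is the window the query path can fall into.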

In this case, we would infer from the health check that there are no healthy tablets, while the keyspace event watcher would still report a serving primary tablet. Because this state is inconsistent, the loop runs again. However, we do not wait before retrying the same checks, so we can hit the same issue in the next iteration too, and keep hitting it until the keyspace event watcher has processed the health check update. This sometimes causes the error "inconsistent state detected, primary is serving but initially found no available tablet" to surface, which fails the test.
