Bug Report: DiscoverInstance (instance is nil) logs in VTOrc #13112

Closed
GuptaManan100 opened this issue May 18, 2023 · 0 comments · Fixed by #13243

Overview of the Issue

VTOrc sometimes emits spurious logs such as: DiscoverInstance(10.10.10.10:3307) instance is nil in 0.002s (Backend: 0.002s, Instance: 0.000s), error=tablet alias is nil.

I have looked at the code and I know how this happens. Suppose you initially have a vttablet with hostname h1, port p1, and alias a1. The VTOrc backend then has one row in vitess_tablet for this tablet holding all three values (h1, p1, and a1), and one row in database_instance for this tablet holding h1 and p1.

Now suppose this tablet gets evicted by Kubernetes and restarts on a different machine. The tablet's alias remains the same, but the host and port change, say to h2 and p2.

When VTOrc refreshes its information from the topo-server, it sees the new record for the vttablet and tries to insert a row into vitess_tablet with the values h2, p2, and a1. Because there is a uniqueness constraint on alias, the insert replaces the existing row, so the first row is removed automatically. VTOrc also loads the MySQL information for this tablet and populates database_instance with the values h2 and p2. That table does not store the alias, so no uniqueness constraint is violated and both rows, (h1, p1) and (h2, p2), are now in the table.

Next, VTOrc runs the check that decides which tablets it needs to forget. This check looks only at tablet aliases, and since the alias of this tablet didn't change, it concludes there is nothing to forget.
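The sequence so far can be sketched with an in-memory model. This is only an illustration under stated assumptions, not VTOrc's code: the backend type, its two maps, and the functions refreshTablet and forgetByAlias are hypothetical stand-ins for the vitess_tablet and database_instance tables and the refresh and forget steps described above, and the ports 3306 and 3307 are placeholders for p1 and p2.

```go
package main

import "fmt"

// hostPort identifies a MySQL instance by host and port, the only
// identity that database_instance carries in this sketch.
type hostPort struct {
	host string
	port int
}

// backend is a hypothetical in-memory stand-in for the two backend tables.
type backend struct {
	vitessTablet     map[string]hostPort // keyed by alias, so alias is unique
	databaseInstance map[hostPort]bool   // keyed by host and port, no alias stored
}

func newBackend() *backend {
	return &backend{
		vitessTablet:     map[string]hostPort{},
		databaseInstance: map[hostPort]bool{},
	}
}

// refreshTablet mimics a topo refresh: the vitess_tablet row is replaced
// by alias, and a database_instance row is recorded for the host and port.
func (b *backend) refreshTablet(alias string, hp hostPort) {
	b.vitessTablet[alias] = hp
	b.databaseInstance[hp] = true
}

// forgetByAlias mimics an alias-only forget check: it removes data only
// for aliases that are no longer present in the topo-server.
func (b *backend) forgetByAlias(topoAliases map[string]bool) int {
	forgotten := 0
	for alias, hp := range b.vitessTablet {
		if !topoAliases[alias] {
			delete(b.vitessTablet, alias)
			delete(b.databaseInstance, hp)
			forgotten++
		}
	}
	return forgotten
}

func main() {
	b := newBackend()
	b.refreshTablet("a1", hostPort{host: "h1", port: 3306}) // initial state
	b.refreshTablet("a1", hostPort{host: "h2", port: 3307}) // after the restart

	forgotten := b.forgetByAlias(map[string]bool{"a1": true})
	fmt.Printf("forgotten=%d vitess_tablet=%d database_instance=%d\n",
		forgotten, len(b.vitessTablet), len(b.databaseInstance))
}
```

In this model the alias a1 is still present, so nothing is forgotten, while database_instance ends up with two entries and vitess_tablet with one.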

Overall, this sequence of steps leaves a row in the database_instance table that should have been removed and that has no corresponding row in vitess_tablet. ReadOutdatedInstanceKeys picks up this record and tries to refresh its information, but the refresh fails with DiscoverInstance(10.10.10.10:3307) instance is nil in 0.002s (Backend: 0.002s, Instance: 0.000s), error=tablet alias is nil
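One way to picture the stale record is as a database_instance key that no vitess_tablet row points at. The following is a minimal sketch of that comparison, assuming both tables have been read into memory; instanceKey and orphanedInstances are hypothetical names, the port values are placeholders for p1 and p2, and this is not presented as how the linked fix addresses the problem.

```go
package main

import "fmt"

// instanceKey is a hypothetical host and port pair, standing in for the
// identity of a database_instance row.
type instanceKey struct {
	host string
	port int
}

// orphanedInstances returns the database_instance keys that no
// vitess_tablet row (alias -> host and port) refers to, in input order.
func orphanedInstances(tablets map[string]instanceKey, instances []instanceKey) []instanceKey {
	known := make(map[instanceKey]bool, len(tablets))
	for _, key := range tablets {
		known[key] = true
	}
	var orphans []instanceKey
	for _, key := range instances {
		if !known[key] {
			orphans = append(orphans, key)
		}
	}
	return orphans
}

func main() {
	// State after the restart: one tablet row for a1, two instance rows.
	tablets := map[string]instanceKey{"a1": {host: "h2", port: 3307}}
	instances := []instanceKey{
		{host: "h1", port: 3306},
		{host: "h2", port: 3307},
	}
	for _, key := range orphanedInstances(tablets, instances) {
		fmt.Printf("orphaned database_instance row: %s:%d\n", key.host, key.port)
	}
}
```

With the state from the scenario above, only the old (h1, p1) entry is reported, which is the row whose refresh has no tablet alias to work with.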

Reproduction Steps

See the sequence described in the overview above.

Binary Version

main

Operating System and Environment details

all

Log Fragments

No response
