*: tests flaking with "not enough time left on migration lease, terminating for safety" #21444
Comments
@jordanlewis is there a way for us to know whether a teamcity agent was messed up last night? 4 different tests failed with this error, and they were all running preposterously slow. For example,
It's worth noting that the issues you closed occurred on four separate TeamCity agents. Perhaps GCE was having general trouble?
Huh, that changes things a bit. It seems unlikely that the disks in the zone were all busted. If no one has specific ideas, I'd probably wait another couple days and see whether this sort of failure happens again.
That's my best idea too.
Well, it happened again last night on
No luck reproducing in 55 minutes of
One shot in the dark: perhaps another test in the package chews up disk resources. Does it reproduce if you instead stress the whole package?
Still no luck reproducing, not with
Shoot. Next step is to try to repro on an affected TeamCity agent. Let's kick off a few stress builds via TC during working hours next week. If an agent fails, we can (hopefully) manually repro.
I'm stressracing pkg/sql on a preemptible TC agent right now. Hopefully that proves fruitful.
Ok, I think I'm on the right track.
Forgot to mention: the "migration lease expired; terminating for safety" reproduces reliably in about five minutes of
Ah, nice. Is swapping enabled on the machines?
No, it's not. But after about 5m of stressrace running, my SSH connection becomes entirely unresponsive. If I'm lucky, it comes back and tells me that stress found the "migration lease expired" error. So something is throttling aggressively. Perhaps the hypervisor? I'm not sure. I haven't seen an OOM kill for whatever that's worth.
this looks like it's worth moving to 2.1? |
We haven’t seen this in a month, so I’m just going to close it out! |
SHA: https://github.com/cockroachdb/cockroach/commits/3cd0224615394f7606be514ad42419c538e7bd7b
Parameters:
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=478328&tab=buildLog