3.3.7 panic: send on closed channel #9956
Comments
It also happened in my case. I know how to reproduce.
@cfc4n Can you provide a way to reproduce?
Yeah, I'm trying...
@cfc4n If you can't reproduce, can you tell me more about how this happened (e.g. sending writes and panic, etc.)?
It seems to happen randomly multiple times a day on the version shipped in Bionic; I had to set up systemd with Restart=on-failure to make sure it stays up (a minimal drop-in sketch follows this comment).
$ etcd --version
$ apt-cache policy etcd
$ journalctl -u etcd
Since this is a testing environment, we're doing a lot of test reboots, and this happens once in a while too; we get more details but the same panic message.
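As a side note, for anyone relying on the same restart-on-crash workaround until a fixed build lands: a systemd drop-in along the lines of the sketch below is enough. The unit name etcd.service, the drop-in path, and the 5-second delay are assumptions about a distro-packaged install, not details taken from this thread.

```
# /etc/systemd/system/etcd.service.d/override.conf  (assumed unit name and path)
[Service]
Restart=on-failure
RestartSec=5s
```

After creating the file, run systemctl daemon-reload and then systemctl restart etcd so systemd picks up the override. This only papers over the crash; the member still restarts, with whatever brief availability gap that implies.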
So in fact, it seems to happen on all servers at once; here is a log from our syslog server, grepping only the servers in quorum.
/cc @jpbetz
This happens to me as well. I have etcd providing storage for a Kubernetes master (from hyperkube:v1.11.4), and I've installed etcd from Bionic. It just randomly falls over, almost always in under an hour. My etcd version is:
It really feels like I should open an issue on Ubuntu's Launchpad; this seems like a downstream issue.
We have also hit this issue when issuing a hard power-off of a host during failure testing. We had 1332 etcd clusters (3 members each) spread across 111 nodes. We performed a test where we did a hard shutdown of all nodes in one of the zones. We expected all etcd clusters to maintain quorum, as they still had 2 members in the remaining zones. Out of the 1332 clusters, 32 hit this issue at the time the hosts in that zone were powered off; the other 1300 survived as expected. The 32 clusters that hit the issue lost quorum: they had lost 1 member to the power-offs, and 1 other member hit the panic and terminated, leaving just 1 surviving member, so this is not good for HA. The cluster members that got the panic seemed to be spread across a variety of nodes.

We ran a similar test previously, just powering off 3 nodes in a zone, and none of the etcd clusters had any issues, so this seems to be a timing condition that can happen when a host shuts down or crashes and is quite difficult to hit. We are using etcd-operator to manage the etcd clusters. In some cases, from the logs it looks like the original peer gets removed from the cluster (as its node was shut down), and etcd-operator starts another peer, but presumably this peer then dies (as the node it was scheduled on was subsequently shut down).

But I also have cases where the panic occurs before another member is scheduled:

Etcd version:
So, does this panic happen on the new node?
No, the panic is always on one of the existing nodes that was not shut down (although it is caused by one of its peers being shut down).
Does anyone have any thoughts on a potential fix for this? I can recreate it quite easily on our test system. We haven't hit it in prod yet, but it's quite serious if it does occur. I'd be quite happy to have the option to sacrifice the gRPC metrics if it meant we could avoid this panic.
Reviewing this thread, the reproduction steps are unclear to me; can you outline them? Once we can reproduce this issue, we can help resolve it.
I'm not sure how easy it will be to reproduce, as in our case we hit it in something like 32 out of 1300 cases. I'm not sure if having multiple etcds on the same node is contributing, or if you would hit this with fewer etcds on each node. The fact that others have seen it probably means it can happen at lower scale; it's just that running at such scale increases the chances of hitting whatever timing condition is causing it. So in terms of reproduction I think it's just a matter of issuing a hard shutdown, but it seems the window to hit the issue is small. If there's any debug patch or anything else that would help to diagnose, I should be able to run that on our test system.
@mcginne could you try to recreate this in your testing environment at a small scale, such as a single cluster on a single node? Use something to randomly perform the hard shutdown and see if you capture the panic logging. If you can get that far, it would be greatly appreciated and would go a long way toward getting to the bottom of this. Meanwhile, I will take a look at the panic and see if @gyuho has any ideas.
Just so you're aware, I have set up 3 nodes with a single etcd cluster running across them. I have a loop that powers off and on one of the hosts in the hope of recreating the issue at smaller scale. It has currently got through ~50 loops with no issues so far, but I will leave it running.
@mcginne awesome! If anyone is going to recreate this, it will be you! Thanks a lot for doing this, and please keep us posted if you have any questions.
Adding to this, we saw it again. This was a 3-node etcd cluster.
Peer 1's logs:
Peer 2's logs:
Peer 3's logs:
Note that peers 2 and 3 die within a second of one another. Peer 2 is the first to die with the panic, followed by peer 3. Then peer 1 dies 6 minutes later with the same error.
I've run hundreds of loops of powering hosts off and on in my simple recreate setup, and unfortunately haven't been able to reproduce it.
Note we still see this in v3.3.11 as well.
@ticpu Yes, these are two different problems here. However, since I am probably not the only one Googling and ending up in this thread, I will report my findings. The original issue stems from a […]. However, I also encountered this on 3.2.17 as packaged in Ubuntu Bionic. The problem is that Ubuntu still packages and uses an old grpc-go which does not include grpc/grpc-go@22c3f92. I will update https://bugs.launchpad.net/ubuntu/+source/etcd/+bug/1800973 to highlight this.
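To make the failure mode concrete for readers landing here from a search: panic: send on closed channel is the Go runtime's response when one goroutine sends on a channel that another goroutine has already closed. The sketch below is not the grpc-go fix (see grpc/grpc-go@22c3f92 and grpc/grpc-go#2695 for that); it is only a minimal illustration, with made-up names, of the race itself and of the common mitigation of signalling shutdown through a separate done channel rather than closing the data channel underneath a sender.

```go
// Illustrative only: not grpc-go or etcd code, just the general shape of a
// "send on closed channel" race and one common way to avoid it.
package main

import (
	"fmt"
	"sync"
)

// racySender shows the failure mode: the receiving side closes ch while the
// sending goroutine may still be executing "ch <- i", which panics with
// "send on closed channel".
func racySender() {
	ch := make(chan int)
	go func() {
		for i := 0; ; i++ {
			ch <- i // panics once ch is closed below
		}
	}()
	<-ch
	close(ch) // races with the send in the goroutine above
}

// guardedSender signals shutdown through a separate done channel, so the data
// channel is never closed underneath a sender; a send after shutdown is
// simply dropped instead of panicking.
type guardedSender struct {
	ch   chan int
	done chan struct{}
	once sync.Once
}

func (s *guardedSender) send(v int) bool {
	select {
	case s.ch <- v:
		return true
	case <-s.done:
		return false // shutdown requested; drop the value
	}
}

func (s *guardedSender) stop() { s.once.Do(func() { close(s.done) }) }

func main() {
	s := &guardedSender{ch: make(chan int, 1), done: make(chan struct{})}
	fmt.Println(s.send(1)) // true: the buffer has room
	s.stop()
	fmt.Println(s.send(2)) // false: the buffer is full and done is closed
}
```

The underlying guideline is that only the owner of a channel should close it; receivers and shutdown paths signal cancellation out of band, which is what the select on done accomplishes here.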
grpc/grpc-go#2695 is merged. We should update etcd's grpc dependency to fix this bug once grpc-go 1.20 is released.
@xiang90 I will make sure to cover this.
This version includes patch from grpc/grpc-go#2695 which fixes etcd-io#9956. Signed-off-by: André Martins <aanm90@gmail.com>
@hexfusion @xiang90 @aanm grpc-go 1.20 has been released with the fix included [0]. Can we help to get the rebase done? We are hitting this daily on any large-scale etcd deployment. [0] https://github.com/grpc/grpc-go/releases/tag/v1.20.0
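For applications that consume grpc-go directly through Go modules, picking up the fixed release is just a version bump; etcd itself vendors grpc-go, so on the etcd side the equivalent step is the rebase requested above. A hypothetical consumer's go.mod might look like this (the module name is made up, and any etcd client requirements are omitted):

```
module example.com/myapp

go 1.12

// v1.20.0 is the grpc-go release reported above as containing grpc/grpc-go#2695.
require google.golang.org/grpc v1.20.0
```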
Going to review and move this forward for the 3.3 backport.
This is fixed by #10911, correct?
Already fixed.
Hi,
I got a panic on 3.3.7.
It happened today and 10 days ago.
Version:
Cmdline:
Stacktrace: