etcd discovery fails #15705
Could you be so kind as to include the steps you follow to reproduce the issue in one coherent, straightforward listing (including which node runs which command and its environment)? Are you running this locally against VMs, or where are those IPs coming from? If it's VMs, can you include a Vagrant setup for us to reproduce this?
OK, will do. I don't know how to do the Vagrant thing, though.
How do you currently run this? Just spin up some Linux VMs and run commands?
These are EC2 instances running Ubuntu 20.04.
Then let's check the basics first. Are the ports open in the security groups between the instances? Ports 2379 and 2380 are most relevant here.
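For reference, a minimal sketch of how one might verify that from a given node (the hostname below is a placeholder for the peer's address; these are not commands from this thread):

```bash
# Hedged sketch: from one instance, check that the etcd ports on a peer are reachable
nc -zv <peer-host> 2379   # client traffic
nc -zv <peer-host> 2380   # peer/raft traffic

# On the target instance, confirm something is actually listening on those ports
ss -tlnp | grep -E ':(2379|2380)'
```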
Yes, this entire system was fully functional before upgrading etcd to 3.5.7. It was also functioning before I sent you these initial bug reports, albeit with those intermittent errors.
First member on devewt02:
What is in "/dcs/"?
That certainly doesn't look very good.
Next member on devudb01:
This has always worked for me in the past, even with 3.5.7. I'm reaching out to our cloud engineers to see if something changed on the backend since yesterday.
It does not appear that anything in the backend has changed.
I tried starting with 3.5.2, the previously installed version, and now it's hanging. That kind of tells me it's not 3.5.7 but something backend-wise. Let me keep trying from a different angle.
Maybe check what device/disk is mounted under that path and why it's not able to remove anything.
/dcs/etcd is a mount point, so the message is normal.
A mount point to what device? Is this an EBS block volume? Maybe check the syslog for why it's considering itself busy.
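For reference, a hedged sketch of how one might inspect that (the /dcs/etcd path comes from this thread; the commands are illustrative, not the exact ones used):

```bash
# What device and filesystem back the mount point?
findmnt /dcs/etcd
lsblk -f                        # block devices (e.g. EBS volumes) and their mounts

# Which processes still hold files open under the path (i.e. keep it "busy")?
lsof +D /dcs/etcd 2>/dev/null

# Kernel/syslog messages around the time of the failure
journalctl -k --since "1 hour ago"
```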
OK, I got it started. I started etcd on the new member (devudb01) with just:
and Patroni started:
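For reference, this is roughly what adding a member to an existing cluster normally looks like; a hedged sketch only, using the hostnames and data dir mentioned in this thread as placeholders, not the exact commands that were run:

```bash
# On an existing member: register the new peer first
ETCDCTL_API=3 etcdctl --endpoints=http://devewt02:2379 \
  member add devudb01 --peer-urls=http://devudb01:2380

# Then on devudb01: start etcd joining the *existing* cluster
etcd --name devudb01 \
  --data-dir /dcs/etcd \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://devudb01:2380 \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://devudb01:2379 \
  --initial-cluster "devewt02=http://devewt02:2380,devudb01=http://devudb01:2380" \
  --initial-cluster-state existing
```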
Well, I've been able to start some of the system; dev is now available.
3 out of 4 members worked this way; I'm still trying to get the 4th one up.
This is weird:
The same command, repeated, returns different results.
Well, this has been totally bizarre. I'm still having issues: the DCS is simply not being built correctly even though parts of the system are starting up, and only after the second time I issue the command.
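For what it's worth, a hedged sketch of how one might check what actually ended up in the cluster (the endpoint is a placeholder):

```bash
# Membership, per-endpoint health, and a peek at the keys that were written
ETCDCTL_API=3 etcdctl --endpoints=http://devewt02:2379 member list -w table
ETCDCTL_API=3 etcdctl --endpoints=http://devewt02:2379 endpoint health --cluster
ETCDCTL_API=3 etcdctl --endpoints=http://devewt02:2379 get "" --prefix --keys-only | head -50
```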
I've noticed that you've used […]. Also, I've noticed that […]
Wow, just noticed:
Belay this: "I don't understand how that happened, I will try to fix that. It's not in the YAML file like that..."
Fixed:
@pgodfrin-nov what was the fix? Port typo? Can we close the issue?
I think the port typo was an inconsequential error, as the issues persisted whether or not that particular member was configured. I think the appropriate course of action is to review Patroni 3.0.1 and its behavior with etcd v3.5.7, even though the Patroni folks will say […].

For the record, the etcd and Patroni config files (YAML) had zero changes (port typo and all), so of course the gRPC gateway was on. I'm not sure what @CyberDem0n was trying to accomplish. I think Patroni doesn't use etcd in the exact way that etcd expects, which might explain why the port typo made no difference.

Nevertheless, etcd 3.5.7 was NOT initializing in an expected manner, and in fact I couldn't get it to run properly at all. So perhaps there actually is an issue with 3.5.7 and NOT (just) Patroni... I don't know, but I would recommend someone follow up on that.

You may close this issue as far as I'm concerned. I will not be upgrading etcd beyond 3.5.2 for the time being.
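As an aside, since the gRPC gateway came up: a minimal, hedged way to check that the JSON gateway is answering on a member (the endpoint is a placeholder; "Zm9v" is base64 for "foo"):

```bash
# The gRPC gateway expects JSON with base64-encoded byte fields
curl -s -X POST http://127.0.0.1:2379/v3/kv/range \
  -H 'Content-Type: application/json' \
  -d '{"key": "Zm9v"}'
```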
Using 3.5.2 isn't recommended because there was a data inconsistency bug.
are you changing […]?
are you using […]?
I've tried replicating this scenario locally and was able to start a 4-node cluster with the latest 3.5.8:
but if you know all your cluster peer URLs upfront, it's easier to start every member with the same static configuration, as sketched below. I'd recommend experimenting with an etcd cluster locally. etcd is somewhat strict about peer URLs and the state of the data dir on startup, so it can take a couple of tries to get the config right.
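For anyone experimenting locally, a minimal sketch of such a static bootstrap, where every member gets the same --initial-cluster value (names, ports and data dirs are arbitrary examples):

```bash
CLUSTER="infra0=http://127.0.0.1:2380,infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380"

etcd --name infra0 --data-dir /tmp/infra0 \
  --listen-peer-urls http://127.0.0.1:2380 --initial-advertise-peer-urls http://127.0.0.1:2380 \
  --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 \
  --initial-cluster "$CLUSTER" --initial-cluster-state new &

etcd --name infra1 --data-dir /tmp/infra1 \
  --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 \
  --listen-client-urls http://127.0.0.1:12379 --advertise-client-urls http://127.0.0.1:12379 \
  --initial-cluster "$CLUSTER" --initial-cluster-state new &

etcd --name infra2 --data-dir /tmp/infra2 \
  --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 \
  --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 \
  --initial-cluster "$CLUSTER" --initial-cluster-state new &
```

Once all three are up, `etcdctl --endpoints=http://127.0.0.1:2379 member list` should show the full cluster.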
What happened?
Please review comments made in #15700
etcd refuses to initialize and I don't understand why.
My entire DEV system is down and unavailable because I cannot start etcd.
Please help.
What did you expect to happen?
etcd to start
How can we reproduce it (as minimally and precisely as possible)?
see notes in #15700
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
see #15700
Relevant log output