panic due to race between wal rotation and snapshot #14252
Comments
I would start from
According to the logs, it is unclear which specific file is missing; there are .wal files in the folder. Perhaps a specific one is missing or damaged. Is there a way to tell which?
I found the problem. If WAL files rotate (and get purged) faster than the server takes a snapshot, the next restart of the server fails, because the WAL entries needed to replay from the last snapshot are already gone.
It would be nice to solve this problem in the next version.
This seems like a bug. Can you provide minimal reproduction steps?
Yes, a very strange situation. I reproduced it like this, with the following configuration:

name: node1
listen-peer-urls: http://localhost:2002
listen-client-urls: http://localhost:2001
initial-advertise-peer-urls: http://localhost:2002
advertise-client-urls: http://localhost:2001
initial-cluster-token: 'cluster_1'
initial-cluster: node1=http://localhost:2002
initial-cluster-state: 'new'
heartbeat-interval: 250
election-timeout: 1250
max-txn-ops: 12800
auto-compaction-mode: periodic
auto-compaction-retention: 1m
max-wals: 5
snapshot-count: 10000

I run a task that continuously writes records of 200,000 bytes each (roughly like the sketch below). On Windows Server 2019 the situation is a little different: WAL files are not deleted immediately, they keep accumulating until the snapshot is created, but after that the node still fails to restart with the same error. On Debian I could not reproduce this situation.
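Roughly, such a writer can be sketched with clientv3. This is only a sketch, not the exact task from the report: the endpoint matches the advertise-client-urls above, and the key names and the 1000-key rotation are arbitrary.

```go
package main

import (
	"context"
	"crypto/rand"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://localhost:2001"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// One 200,000-byte value, reused for every put to keep the loop simple.
	val := make([]byte, 200_000)
	if _, err := rand.Read(val); err != nil {
		panic(err)
	}

	// Continuously write large records; every put is appended to the WAL,
	// so segments fill up quickly relative to snapshot-count: 10000.
	for i := 0; ; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err := cli.Put(ctx, fmt.Sprintf("/load/key-%d", i%1000), string(val))
		cancel()
		if err != nil {
			fmt.Println("put failed:", err)
		}
	}
}
```

With 200,000-byte puts, a WAL segment (roughly 64 MB by default) fills after a few hundred entries, so with max-wals: 5 the purge comes under pressure long before the 10000-entry snapshot threshold is reached, which is the window the reported race needs.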
Windows is currently in Tier 3 support (https://etcd.io/docs/v3.5/op-guide/supported-platform/#current-support). Feel free to look into the issue; however, it's not unexpected that there might be problems.
The current wal file lock check does not work on windows. Fixes etcd-io#14252 Signed-off-by: mivertft <894972+MIVerTFT@users.noreply.github.com>
Things to check:
I wrote a very simple test. On Windows 10 the file is deleted after step 3, but on Windows Server 2019 only after step 4, and there are no errors at step 3.
The file being deleted after step 4 is expected behavior on Linux as well. I suspect the issue is caused by TryLockFile: it actually fails to lock the file, but lockFile returns nil, so the purge goroutine regards it as a successful lock and then deletes the file. Could you double check it?
Yes, that's right.
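For context, the guard under discussion is a lock check before deletion: the purge loop may only remove a WAL segment after successfully taking its file lock, so a TryLockFile that reports success for a file that is still locked lets a needed segment be deleted. Below is a minimal sketch of that pattern, not the actual etcd purge code; fileutil.TryLockFile and fileutil.PrivateFileMode are the real helpers from go.etcd.io/etcd/client/pkg/v3/fileutil, while purgeOldest and the demo file names are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"

	"go.etcd.io/etcd/client/pkg/v3/fileutil"
)

// purgeOldest removes the oldest files while more than max remain, but only
// when it can take the file lock itself. A segment still locked by another
// holder (the live WAL) stops the purge. This is the guard that a broken
// TryLockFile would bypass.
func purgeOldest(paths []string, max int) {
	sort.Strings(paths)
	for len(paths) > max {
		oldest := paths[0]
		l, err := fileutil.TryLockFile(oldest, os.O_WRONLY, fileutil.PrivateFileMode)
		if err != nil {
			// fileutil.ErrLocked (or another error): leave the file alone.
			return
		}
		if err := os.Remove(oldest); err != nil {
			fmt.Println("remove failed:", err)
		}
		l.Close()
		paths = paths[1:]
	}
}

func main() {
	dir, err := os.MkdirTemp("", "purge-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Create six fake segments and keep the second one locked, similar to how
	// the WAL keeps segments at or after the last snapshot index locked.
	var segs []string
	for i := 0; i < 6; i++ {
		p := filepath.Join(dir, fmt.Sprintf("%016d.wal", i))
		if err := os.WriteFile(p, []byte("x"), 0o600); err != nil {
			panic(err)
		}
		segs = append(segs, p)
	}
	held, err := fileutil.TryLockFile(segs[1], os.O_WRONLY, fileutil.PrivateFileMode)
	if err != nil {
		panic(err)
	}
	defer held.Close()

	// With a working lock check, only the first, unlocked segment is removed
	// and the purge stops at the locked one.
	purgeOldest(segs, 3)

	left, _ := filepath.Glob(filepath.Join(dir, "*.wal"))
	fmt.Println("segments left:", len(left))
}
```

On a platform where the lock check works, the program reports five remaining segments because the purge stops at the locked one; a lock check that silently reports success would let the loop continue past it and delete segments the WAL still needs.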
I am seeing this issue as well: 3.5.4 on Linux/Kubernetes using the official images. We ship our etcd using the Bitnami Helm chart.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Hi, there is a problem on several etcd clusters on Windows Server at a customer site. The etcd.exe process is managed by a Windows service. It looks like this: the cluster is working, but if you restart the etcd service on one of the servers, we get the following error:
{"level":"fatal","ts":"2022-07-21T13:17:36.221+0300","caller":"etcdserver/storage.go:95","msg":"failed to open WAL","error":"wal: file not found","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.readWAL\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/storage.go:95\ngo.etcd.io/etcd/server/v3/etcdserver.restartNode\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/raft.go:481\ngo.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:533\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
What did you expect to happen?
I expect the etcd process to restart normally.
How can we reproduce it?
So far we have not been able to reproduce the error on our test servers.
Etcd version
etcd.exe --version
etcd Version: 3.5.4
etcdctl.exe version
etcdctl version: 3.5.4
Etcd configuration of one of the servers:
name: ck11-apache2.test.local
initial-advertise-peer-urls: http://10.X.X.178:2380
listen-peer-urls: http://10.X.X.178:2380
listen-client-urls: http://10.X.X.178:2379,http://127.0.0.1:2379
advertise-client-urls: http://10.X.X.178:2379
initial-cluster-token: etcd-cluster-123456789
initial-cluster: ck11-apache1.test.local=http://10.X.X.177:2380,ck11-apache2.test.local=http://10.X.X.178:2380,ck11-apache-nlb.test.local=http://10.X.X.179:2381
initial-cluster-state: 'new'
max-txn-ops: 12800
heartbeat-interval: 250
election-timeout: 1250
logger: zap
log-level: info
Etcd debug information
C:\Program Files\etcd>etcdctl -w table endpoint --cluster status
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://10.X.X.179:2379 | 90f53ed17cd0b92f | 3.5.4 | 1.7 GB | false | false | 2 | 2048771 | 2048771 | |
| http://10.X.X.178:2379 | 97a661acdde4fc12 | 3.5.4 | 1.7 GB | true | false | 2 | 2048771 | 2048771 | |
| http://10.X.X.177:2379 | a82db5fe4c233dac | 3.5.4 | 1.7 GB | false | false | 2 | 2048771 | 2048771 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
I really need help with this problem. It is still unclear how to reproduce it or what to focus on.