Nacos 2.0 cluster: Failed operation in LogStorage brings down the whole cluster #7237

Closed
suanyi001 opened this issue Nov 15, 2021 · 7 comments
Labels
kind/question (Category issues related to questions or problems), status/need feedback

Comments

suanyi001 commented Nov 15, 2021

Describe the bug
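While the Nacos cluster was running, one of the pods hit Failed operation in LogStorage, after which the whole cluster went down and could no longer serve requests.

Expected behavior
The cluster keeps running normally.

Actual behavior
The cluster went down several times while running.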

How to Reproduce

  1. One pod in the cluster reports:
org.rocksdb.RocksDBException: While fdatasync: /home/nacos/data/protocol/raft/naming_service_metadata/log/000156.log: Bad file descriptor
	at org.rocksdb.RocksDB.put(Native Method)
	at org.rocksdb.RocksDB.put(RocksDB.java:591)
	at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.saveFirstLogIndex(RocksDBLogStorage.java:291)
	at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.truncatePrefix(RocksDBLogStorage.java:563)
	at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:527)
	at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:496)
	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
	at java.lang.Thread.run(Thread.java:748)
2021-11-13 08:17:21,734 ERROR Fail to truncatePrefix 403.

org.rocksdb.RocksDBException: While fdatasync: /home/nacos/data/protocol/raft/naming_service_metadata/log/000156.log: Bad file descriptor
	at org.rocksdb.RocksDB.deleteRange(Native Method)
	at org.rocksdb.RocksDB.deleteRange(RocksDB.java:1991)
	at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.lambda$truncatePrefixInBackground$2(RocksDBLogStorage.java:584)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2021-11-13 08:17:21,739 ERROR Encountered an error=Status[EIO<1014>: Failed operation in LogStorage] on StateMachine com.alibaba.nacos.core.distributed.raft.NacosStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
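  2. As the cluster keeps running, the other nodes gradually develop problems as well, though without any obvious error, and eventually the cluster becomes unavailable;
  3. Below is the timeline of one Nacos cluster failure:

There are 5 Nacos nodes in total, nacos-0 through nacos-4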
image

Desktop (please complete the following information):

  • OS: ubuntu
  • Version : nacos:2.0.3
  • Module: naming/config
  • SDK: spring-cloud-alibaba-nacos:2021.1
  • K8S: V1.17.9
  • storage: Azurefile

Additional context

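1. Cluster deployment
The Nacos cluster is deployed from the officially provided nacos-K8S template; only the storage part was replaced with our existing cloud storage (Azurefile, a network storage similar to NFS). It runs on a cloud provider's K8S cluster, with 5 pods in total.

2. Regarding JRaft's command log: can writes fail because of network fluctuations, cloud storage performance, or similar causes?

3. Of the content Nacos mounts, files under the Data directory need both read and write access, while files under Logs only need write access. Is that understanding correct?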
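
One way to get a rough feel for question 2 is to measure how long small synced writes take on the mounted volume. The following is only a minimal sketch, not part of Nacos or JRaft: the class name FsyncProbe, the 4 KiB record size, and the round count are hypothetical choices, and FileChannel.force(false) is used as a stand-in for the fdatasync-style flush seen in the stack trace above (the exact system call depends on the JVM and OS).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: times small write-then-flush cycles in a directory.
final class FsyncProbe {

    /** Writes `rounds` 4 KiB records to a temp file in `dir`, flushing each one; returns latencies in ns. */
    static long[] measure(Path dir, int rounds) throws IOException {
        Path probe = Files.createTempFile(dir, "fsync-probe", ".tmp");
        long[] nanos = new long[rounds];
        try (FileChannel ch = FileChannel.open(probe, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            for (int i = 0; i < rounds; i++) {
                buf.clear(); // reset position so the full buffer is written again
                long start = System.nanoTime();
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                ch.force(false); // flush file content (not metadata) to the storage device
                nanos[i] = System.nanoTime() - start;
            }
        } finally {
            Files.deleteIfExists(probe); // never leave the probe file behind
        }
        return nanos;
    }

    public static void main(String[] args) throws IOException {
        // Pass the mounted data directory as the first argument; defaults to the JVM temp dir.
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        long[] nanos = measure(dir, 100);
        long max = 0;
        long sum = 0;
        for (long n : nanos) {
            max = Math.max(max, n);
            sum += n;
        }
        System.out.printf("dir=%s avg=%.3f ms max=%.3f ms%n", dir, sum / 1e6 / nanos.length, max / 1e6);
    }
}
```

Running it once against the network-backed mount and once against a local disk gives a comparison point; an IOException or very large maximum latencies on the mount would be consistent with, though not proof of, the storage being the weak link.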

@longzhihun

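I've run into a similar problem too, and it's a real headache. I can't tell whether it's a problem in Nacos itself; could a maintainer please help pin it down?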

@bizhenchao1201

Looks like this is a common issue; I've run into it here as well.

realJackSun added the kind/question and status/need feedback labels Nov 19, 2021
@stale

stale bot commented Jun 19, 2022

Thanks for your feedback and contribution. But the issue/pull request has not had recent activity more than 180 days. This issue/pull request will be closed if no further activity occurs 7 days later.
We may solve this issue in new version. So can you upgrade to newest version and retry?
If there are still issues or want to contribute again. Please create new issue or pull request again.

stale bot added the expired label Jun 19, 2022
@PeiAlan

PeiAlan commented Jul 13, 2022

To everyone above: has anyone found a solution? My Nacos cluster is also hitting the Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: error 😭

stale bot removed the expired label Jul 13, 2022
@longzhihun

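> To everyone above: has anyone found a solution? My Nacos cluster is also hitting the Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: error 😭

Switching to a different storage backend fixed it for me; give that a try.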

@qq2032554981

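> > To everyone above: has anyone found a solution? My Nacos cluster is also hitting the Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: error 😭
>
> Switching to a different storage backend fixed it for me; give that a try.

What exactly do you mean by switching the storage?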

@suanyi001
Author

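> > > To everyone above: has anyone found a solution? My Nacos cluster is also hitting the Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: error 😭
> >
> > Switching to a different storage backend fixed it for me; give that a try.
>
> What exactly do you mean by switching the storage?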

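Nacos's default logback configuration emits a large volume of debug and info logs, so disk I/O is very high when the logs roll each day. JRaft data also has to be written to disk, which can lead to state machine exceptions. You therefore need to give the Nacos cluster high-performance storage, and you should also check how much log output is produced and redefine the logback file as needed.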
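
To check how much log output is produced, a per-file-group size summary of the logs directory is usually enough to see which logs dominate. The sketch below is an illustration only: the class name LogVolumeReport is hypothetical, and grouping files by the part of the name before the first dot is an assumption about how rolled files are named, not something taken from the Nacos logback configuration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper: sums log file sizes per name prefix in one directory.
final class LogVolumeReport {

    /** Totals the sizes of regular files directly under `dir`, keyed by the file name up to the first '.'. */
    static Map<String, Long> bytesByPrefix(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                String name = p.getFileName().toString();
                int dot = name.indexOf('.');
                // Assumption: rolled files share the prefix of the active file, e.g. x.log and x.log.1
                String key = dot > 0 ? name.substring(0, dot) : name;
                try {
                    totals.merge(key, Files.size(p), Long::sum);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Pass the Nacos logs directory as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        bytesByPrefix(dir).forEach((prefix, bytes) ->
                System.out.printf("%-40s %10.1f MiB%n", prefix, bytes / (1024.0 * 1024.0)));
    }
}
```

The groups with the largest totals are the first candidates for a higher log level or a tighter rolling policy in the logback file.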
