Nacos 2.0 cluster: Failed operation in LogStorage causes the cluster to crash #7237

suanyi001 · 2021-11-15T11:22:51Z

Describe the bug
While the Nacos cluster was running, one of the pods hit "Failed operation in LogStorage", which caused the whole cluster to crash and stop serving requests.

Expected behavior
The cluster runs normally.

Actual behavior
The cluster went down multiple times while running.

How to Reproduce

One of the pods in the cluster reported:

org.rocksdb.RocksDBException: While fdatasync: /home/nacos/data/protocol/raft/naming_service_metadata/log/000156.log: Bad file descriptor
	at org.rocksdb.RocksDB.put(Native Method)
	at org.rocksdb.RocksDB.put(RocksDB.java:591)
	at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.saveFirstLogIndex(RocksDBLogStorage.java:291)
	at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.truncatePrefix(RocksDBLogStorage.java:563)
	at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:527)
	at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:496)
	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
	at java.lang.Thread.run(Thread.java:748)
2021-11-13 08:17:21,734 ERROR Fail to truncatePrefix 403.

org.rocksdb.RocksDBException: While fdatasync: /home/nacos/data/protocol/raft/naming_service_metadata/log/000156.log: Bad file descriptor
	at org.rocksdb.RocksDB.deleteRange(Native Method)
	at org.rocksdb.RocksDB.deleteRange(RocksDB.java:1991)
	at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.lambda$truncatePrefixInBackground$2(RocksDBLogStorage.java:584)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2021-11-13 08:17:21,739 ERROR Encountered an error=Status[EIO<1014>: Failed operation in LogStorage] on StateMachine com.alibaba.nacos.core.distributed.raft.NacosStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.

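As the cluster kept running, the other nodes gradually ran into problems as well, though without any obvious errors, and the cluster eventually became unavailable.
Below is the timing of one Nacos cluster failure:

There are 5 Nacos nodes in total, nacos-0 through nacos-4.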
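
Because the other nodes degrade without obvious errors, it can help to establish which node logged the LogStorage failure first. The following is a minimal sketch, not part of Nacos: it assumes each pod's log lines have already been collected into a dict keyed by node name (the `logs_by_node` shape and the function names are hypothetical), and it relies only on the timestamp format and the error text visible in the log lines quoted above.

```python
from datetime import datetime

# Marker text taken from the error line quoted in this issue.
MARKER = "Failed operation in LogStorage"
# Matches timestamps such as "2021-11-13 08:17:21,739".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def first_failure(lines):
    """Return the timestamp of the first line containing MARKER, or None."""
    for line in lines:
        if MARKER not in line:
            continue
        # The leading timestamp is 23 characters long.
        try:
            return datetime.strptime(line[:23], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # marker found on a line without a leading timestamp
    return None


def failure_order(logs_by_node):
    """List nodes ordered by when each first logged MARKER.

    Nodes that never logged the marker are omitted.
    """
    hits = {node: first_failure(lines) for node, lines in logs_by_node.items()}
    found = [(ts, node) for node, ts in hits.items() if ts is not None]
    return [node for _ts, node in sorted(found)]
```

Calling `failure_order` with the log lines of nacos-0 through nacos-4 returns the node names in the order they first reported the error, which narrows down where to look at the storage mount first.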

Desktop (please complete the following information):

OS: ubuntu
Version : nacos:2.0.3
Module: naming/config
SDK: spring-cloud-alibaba-nacos:2021.1
K8S: V1.17.9
storage: Azurefile

Additional context

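1. Cluster deployment
The Nacos cluster is deployed from the officially provided nacos-K8S template; only the storage part was replaced with our existing cloud storage (Azurefile, a network storage similar to NFS). It runs on a cloud provider's K8S cluster with 5 pods in total.

2. For JRaft's command log, can network fluctuations, cloud storage performance, or similar factors cause operations to fail?

3. Of the content that Nacos mounts, files under the Data directory need read and write access, while files under Logs only need write access. Is that understanding correct?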
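
One way to look into question 2 empirically is to check how the mounted volume behaves under repeated small synced writes, since the RocksDB error above is raised during `fdatasync`. The sketch below is an illustration only: the `probe_sync` name and its parameters are hypothetical, it does not reproduce RocksDB's actual access pattern, and a clean run does not prove the volume is safe for JRaft.

```python
import os
import tempfile
import time


def probe_sync(directory, rounds=20, payload=b"x" * 4096):
    """Write small chunks to a scratch file in `directory`, syncing after each.

    Returns the per-sync latencies in seconds. Any OSError raised by the
    mount (for example EBADF or EIO) propagates to the caller.
    """
    # fdatasync is the call named in the error in this issue; fall back to
    # fsync on platforms where os.fdatasync is not available.
    sync = getattr(os, "fdatasync", os.fsync)
    latencies = []
    fd, path = tempfile.mkstemp(prefix="sync-probe-", dir=directory)
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            start = time.perf_counter()
            sync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        # Always clean up the scratch file, even if a sync failed.
        os.close(fd)
        os.remove(path)
    return latencies
```

Running it against a directory on the network mount and against a local disk, then comparing `max(latencies)`, gives a rough sense of how much slower or less reliable synced writes are on the network storage.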


longzhihun · 2021-11-15T11:26:47Z

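I've run into a similar problem too. It's a real headache, and I can't tell whether it's a problem in Nacos itself. Could someone who knows this area please help pin it down?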

bizhenchao1201 · 2021-11-15T11:38:25Z

Looks like this is a common issue; I've hit it too.

stale · 2022-06-19T14:18:22Z

Thanks for your feedback and contribution. But the issue/pull request has not had recent activity more than 180 days. This issue/pull request will be closed if no further activity occurs 7 days later.
We may solve this issue in new version. So can you upgrade to newest version and retry?
If there are still issues or want to contribute again. Please create new issue or pull request again.

PeiAlan · 2022-07-13T11:06:32Z

To those above: has anyone found a solution? My Nacos cluster is also hitting the Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: error 😭

longzhihun · 2022-07-18T00:33:44Z

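> To those above: has anyone found a solution? My Nacos cluster is also hitting the Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: error 😭

Switching to different storage fixed it for me; you could give that a try.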

qq2032554981 · 2022-08-09T04:48:24Z

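> > To those above: has anyone found a solution? My Nacos cluster is also hitting the Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: error 😭
>
> Switching to different storage fixed it for me; you could give that a try.

What do you mean by switching storage?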

suanyi001 · 2022-08-18T10:39:18Z

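> > > To those above: has anyone found a solution? My Nacos cluster is also hitting the Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: error 😭
> >
> > Switching to different storage fixed it for me; you could give that a try.
>
> What do you mean by switching storage?

Nacos's default logback configuration emits a large volume of DEBUG and INFO logs, so disk I/O is very high when the logs roll over each day. JRaft also has to write its data to disk, so this can lead to state machine exceptions. You therefore need to give the Nacos cluster high-performance storage, and you should also check how much log output is being produced and redefine the logback file as needed.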
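
To act on the advice to check log output size, a small script can total the files under the Nacos log directory and list the largest ones. This is a minimal sketch under assumptions: the `log_sizes` helper is hypothetical, and the log directory path must be supplied by the reader (the issue only shows the data path `/home/nacos/data/...`, so something like `/home/nacos/logs` is a guess to verify against the actual deployment).

```python
import os


def log_sizes(log_dir):
    """Return (total_bytes, entries) for all files under log_dir.

    `entries` is a list of (size_in_bytes, path) tuples, largest first.
    """
    entries = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                entries.append((os.path.getsize(path), path))
            except OSError:
                continue  # file was rotated away or deleted while walking
    entries.sort(reverse=True)
    total = sum(size for size, _path in entries)
    return total, entries
```

Printing the first few entries shows which log files dominate disk usage, which indicates which loggers are worth turning down or capping in the logback file.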

suanyi001 mentioned this issue Nov 15, 2021

Nacos 2.0.3 ERROR_TYPE_STATE_MACHINE #6877

Closed

realJackSun added kind/question Category issues related to questions or problems status/need feedback labels Nov 19, 2021

stale bot added the expired No active for a long time label Jun 19, 2022

stale bot removed the expired No active for a long time label Jul 13, 2022

suanyi001 closed this as completed Aug 18, 2022
