<POD NETWORK DELAY> : cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2 #173

Open
ddd1123 opened this issue Aug 10, 2022 · 16 comments
Labels: difficulty/medium, type/question (Further information is requested)

Comments

@ddd1123

ddd1123 commented Aug 10, 2022

Issue Description

Type: bug report

Describe what happened (or what feature you want)

  1. In chaosblade-box, I obtained the K8s cluster information through the agent and ran a POD NETWORK DELAY experiment.
  2. It failed with: Reason: /opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms: cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2
  3. I tried running /opt/chaosblade/bin/nsexec directly on the corresponding Node host, which also failed with -bash: /opt/chaosblade/bin/nsexec: No such file or directory

Describe what you expected to happen

I would appreciate a solution or some troubleshooting ideas, thanks!

How to reproduce it (as minimally and precisely as possible)

Tell us your environment

K8s: v1.18.18
chaosblade-box: v1.0.1
chaos-agent: v1.0.0
chaos-operator: v1.6.0
chaos-tool: v1.6.0

Anything else we need to know?

Execution details from the machine:
{
  "response": {
    "code": 54000,
    "error": "unexpected status, expected status: `create`, but the real status: `Error`, please wait!",
    "result": {
      "error": "`/opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms`: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2",
      "statuses": [
        {
          "error": "`/opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms`: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2",
          "kind": "pod",
          "state": "Error",
          "success": false
        }
      ],
      "success": false,
      "uid": "2c29793a6c8da651"
    },
    "success": false
  }
}

@ddd1123
Author

ddd1123 commented Aug 10, 2022

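However, I also ran a pod-process-kill experiment:

  1. I entered the wrong signal, so it failed. From the error message, Reason: /opt/chaosblade/bin/nsexec -t 5165 -p -m -- /bin/sh -c kill -128 128: cmd exec failed, err: /bin/sh: line 0: kill: 128: invalid signal specification exit status 1, I noticed that it also uses /opt/chaosblade/bin/nsexec
  2. After I corrected the signal, the pod-process-kill experiment succeeded.

So I suspect the problem above is not related to the error I get when running /opt/chaosblade/bin/nsexec on the node host.
Where does the problem actually lie, then?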

@ddd1123
Author

ddd1123 commented Aug 10, 2022

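Addendum: while running the pod-network-delay experiment:

  1. Because it kept failing, I entered the target pod and ran tc qdisc add dev eth0 root netem delay 100ms 10ms, which also failed: Error: Specified qdisc not found.
  2. From some research, the kernel-modules-extra package may be missing. For various reasons I have not yet managed to install it.
  3. Is this related? If the package is installed successfully, will the problem above go away?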
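
One way to check the kernel side of this is to look for the module that provides the netem queueing discipline, which on Linux is named sch_netem. The helper below is only a sketch and is not part of chaosblade: the function name netem_status is hypothetical, and it merely reports whether the module is already loaded, present on disk according to modinfo, or not found at all. It is meant to be run on the node, since pods share the node's kernel.

```shell
#!/bin/sh
# Hypothetical helper: report the state of the sch_netem kernel module.
netem_status() {
  if grep -q '^sch_netem ' /proc/modules 2>/dev/null; then
    # The module is currently loaded into the running kernel.
    echo "loaded"
  elif modinfo sch_netem >/dev/null 2>&1; then
    # The module exists for this kernel but has not been loaded yet.
    echo "available"
  else
    # Neither loaded nor found by modinfo.
    echo "missing"
  fi
}

netem_status
```

If this prints missing on a CentOS 8.x host, that matches the missing kernel-modules-extra package suspected in this thread. If it prints available, running modprobe sch_netem as root should load the module.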

@ddd1123
Author

ddd1123 commented Aug 11, 2022

Operator log:
Excerpt from while the experiment was running:
time="2022-08-11T01:41:38Z" level=info msg="experiment identifiers: [{{ docker 7420e6a5bff3 centos-tc centos-tc-demo-88d8ff5f8-vg278 192.168.0.4 centos-tc-demo} /opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker 0 chaosblade-tool-ljpcv chaosblade chaosblade-tool}]" experiment=c6c75e78f1a37877
time="2022-08-11T01:41:38Z" level=info msg="execute identifier: {ContainerObjectMeta:{Id: ContainerRuntime:docker ContainerId:7420e6a5bff3 ContainerName:centos-tc PodName:centos-tc-demo-88d8ff5f8-vg278 NodeName:192.168.0.4 Namespace:centos-tc-demo} Command:/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker Error: Code:0 ChaosBladePodName:chaosblade-tool-ljpcv ChaosBladeNamespace:chaosblade ChaosBladeContainerName:chaosblade-tool}" experiment=c6c75e78f1a37877
time="2022-08-11T01:41:38Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-11T01:41:38Z" level=info msg="get err message" command="[/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker]" container=chaosblade-tool err="{"code":63063,"success":false,"error":"/opt/chaosblade/bin/nsexec -t 5165 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2"}" out= podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-11T01:41:38Z" level=error msg="pods/exec: k8s exec failed, err: {"code":63063,"success":false,"error":"/opt/chaosblade/bin/nsexec -t 5165 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2"}\n" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=

@Icesource Icesource added difficulty/medium type/question Further information is requested labels Aug 11, 2022
@Icesource
Contributor

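> 2. kernel-modules-extra

You can install the module with yum install -y kernel-modules-extra. The problem appears to come from tc, the Linux kernel traffic-control tool, inside the pod: the kernel lacks the netem queueing discipline by default, hence the error Error: Specified qdisc not found.

However, installing the module will not necessarily fix RTNETLINK answers: No such file or directory exit status 2. Try installing kernel-modules-extra first and see whether it resolves the problem.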

@tiny-x
Member

tiny-x commented Aug 11, 2022

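> Addendum: while running the pod-network-delay experiment:
>
>   1. Because it kept failing, I entered the target pod and ran tc qdisc add dev eth0 root netem delay 100ms 10ms, which also failed: Error: Specified qdisc not found.
>   2. From some research, the kernel-modules-extra package may be missing. For various reasons I have not yet managed to install it.
>   3. Is this related? If the package is installed successfully, will the problem above go away?

Which OS version are you on, CentOS 8.x? On 8.x this package has to be installed, and the machine must be rebooted after installing it.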

@ddd1123
Author

ddd1123 commented Aug 11, 2022

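> 1. kernel-modules-extra
>
> You can install the module with yum install -y kernel-modules-extra. The problem appears to come from tc, the Linux kernel traffic-control tool, inside the pod: the kernel lacks the netem queueing discipline by default, hence the error Error: Specified qdisc not found.
>
> However, installing the module will not necessarily fix RTNETLINK answers: No such file or directory exit status 2. Try installing kernel-modules-extra first and see whether it resolves the problem.

OK, I will try installing the module again. I did run into problems when I tried before: every yum repository I tried reported that no such package was available.
Also, in another issue I saw someone say that from 1.6.x onward the tc inside the pod is no longer used. Is that true?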

@tiny-x
Member

tiny-x commented Aug 11, 2022

Yes. First confirm your kernel version and distribution version.

@ddd1123
Author

ddd1123 commented Aug 11, 2022

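> > Addendum: while running the pod-network-delay experiment:
> >
> >   1. Because it kept failing, I entered the target pod and ran tc qdisc add dev eth0 root netem delay 100ms 10ms, which also failed: Error: Specified qdisc not found.
> >   2. From some research, the kernel-modules-extra package may be missing. For various reasons I have not yet managed to install it.
> >   3. Is this related? If the package is installed successfully, will the problem above go away?
>
> Which OS version are you on, CentOS 8.x? On 8.x this package has to be installed, and the machine must be rebooted after installing it.

My system is CentOS Linux release 7.6.1810 (Core)
I pulled a Docker image myself and then installed the tc command with yum -y install iproute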

@ddd1123
Author

ddd1123 commented Aug 11, 2022

4.18.0-193.el8.x86_64

@Icesource Icesource self-assigned this Aug 15, 2022
@ddd1123
Author

ddd1123 commented Aug 17, 2022

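Current progress:

  1. My system was changed today. The version is now CentOS Linux release 8.2.2004 (Core)
  2. I then successfully installed the kernel-modules-extra package
  3. The experiment succeeded! However, the recovery phase fails with the error below

Details: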
{
  "response": {
    "code": 54000,
    "error": "unexpected status, expected status: destroy, but the real status: Destroying, please wait!",
    "result": {
      "error": "pods/exec: k8s exec failed, err: command terminated with exit code 126",
      "statuses": [
        {
          "error": "pods/exec: k8s exec failed, err: command terminated with exit code 126",
          "id": "fff09b30e7e8f4a2",
          "kind": "pod",
          "state": "Error",
          "success": false
        }
      ],
      "success": false,
      "uid": "f9e03fa41bc7c31f"
    },
    "success": false
  }
}
Error: Reason: pods/exec: k8s exec failed, err: command terminated with exit code 126
Troubleshooting: the scenario status does not match, please try again later

Log excerpt:
time="2022-08-17T07:16:20Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade status fff09b30e7e8f4a2]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:16:20Z" level=info msg="get output message" command="[/opt/chaosblade/blade status fff09b30e7e8f4a2]" container=chaosblade-tool err= out="{"code":200,"success":true,"result":{"Uid":"fff09b30e7e8f4a2","Command":"cri","SubCommand":"network delay","Flag":" --offset=5 --container-id=3e1db8dce103 --timeout=125 --container-runtime=docker --time=60 --interface=eth0","Status":"Success","Error":"","CreateTime":"2022-08-17T07:13:45.153329761Z","UpdateTime":"2022-08-17T07:13:45.181644002Z"}}" podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:16:20Z" level=error msg="pods/exec: k8s exec failed, err: {"code":200,"success":true,"result":{"Uid":"fff09b30e7e8f4a2","Command":"cri","SubCommand":"network delay","Flag":" --offset=5 --container-id=3e1db8dce103 --timeout=125 --container-runtime=docker --time=60 --interface=eth0","Status":"Success","Error":"","CreateTime":"2022-08-17T07:13:45.153329761Z","UpdateTime":"2022-08-17T07:13:45.181644002Z"}}\n" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=

@ddd1123
Author

ddd1123 commented Aug 17, 2022

Operator log for the recovery-phase error:

time="2022-08-17T07:14:08Z" level=info msg="execute identifier: {ContainerObjectMeta:{Id:fff09b30e7e8f4a2 ContainerRuntime:docker ContainerId:3e1db8dce103 ContainerName:centos-tc-done PodName:centos-tc-done-6b584445b9-g5hnw NodeName:192.168.0.4 Namespace:centos-tc-done} Command: --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker Error: Code:0 ChaosBladePodName:chaosblade-tool-ljpcv ChaosBladeNamespace:chaosblade ChaosBladeContainerName:chaosblade-tool}" experiment=f9e03fa41bc7c31f
time="2022-08-17T07:14:08Z" level=info msg="Exec command in pod" command="[ --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:14:08Z" level=error msg="Invoke exec command error" command="[ --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker]" container=chaosblade-tool err= error="command terminated with exit code 126" out="OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "": executable file not found in $PATH: unknown" podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:14:08Z" level=error msg="pods/exec: k8s exec failed, err: command terminated with exit code 126" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=fff09b30e7e8f4a2

@Icesource
Contributor

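This may be an exception caused by the polling interval configured on the platform side being too short. In practice the experiment is destroyed normally shortly afterwards. You can tell whether it was destroyed by observing the actual behavior.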

@ddd1123
Author

ddd1123 commented Aug 23, 2022

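> This may be an exception caused by the polling interval configured on the platform side being too short. In practice the experiment is destroyed normally shortly afterwards. You can tell whether it was destroyed by observing the actual behavior.

Thanks for the reply.
I tried several times and observed the behavior each time; the experiment was never destroyed. tc qdisc show still lists the rule that was added by tc qdisc add ...

Looking at the error "error": "pods/exec: k8s exec failed, err: command terminated with exit code 126", I suspect the recovery step never managed to enter the target pod. Fault injection can enter it successfully, so it is odd that recovery cannot.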
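
A small sketch of how the leftover rule could be detected in a script, under stated assumptions: has_netem is a hypothetical helper name, and the sample line only illustrates the format that tc qdisc show prints for a netem qdisc; it was not captured from this cluster.

```shell
#!/bin/sh
# Hypothetical helper: reads `tc qdisc show` output on stdin and
# succeeds if a netem qdisc is listed.
has_netem() {
  grep -q '^qdisc netem '
}

# Illustrative line in the format printed by `tc qdisc show dev eth0`
# while a netem delay is attached (not taken from the reporter's system).
sample='qdisc netem 8001: root refcnt 2 limit 1000 delay 100.0ms  10.0ms'

if printf '%s\n' "$sample" | has_netem; then
  echo "netem still attached"
fi
```

Inside the target pod's network namespace, tc qdisc show dev eth0 | has_netem would indicate that the rule is still in place, and tc qdisc del dev eth0 root removes it by hand.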

@Icesource Icesource reopened this Aug 23, 2022
@Icesource
Contributor

Could you check the chaosblade execution log inside the pod? The logs are usually under /opt/chaosblade

@ddd1123
Author

ddd1123 commented Aug 23, 2022

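> Could you check the chaosblade execution log inside the pod? The logs are usually under /opt/chaosblade

Thanks for the reply.

The log is below. The 10:43 entries are the successful execution and the 10:45 entries are the recovery: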
time="2022-08-23 10:43:28.128243385 UTC" level=info msg="create uid: 72e94f5b8b62644b, target: network, scope: pod, action: delay"
time="2022-08-23 10:43:28.142013125 UTC" level=error msg="chaosblade result: []" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b
time="2022-08-23 10:45:39.431496547 UTC" level=info msg="destroy by 72e94f5b8b62644b uid, force-remove: false, target: "
time="2022-08-23 10:45:39.65012422 UTC" level=error msg="unexpected status, expected status: destroyed, but the real status: Running, please wait!" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b
time="2022-08-23 10:45:43.434151464 UTC" level=error msg="chaosblade result: [{pod network delay false Success see resStatus for the error details [{fd98461695b31b38 Error 0 pods/exec: k8s exec failed, err: command terminated with exit code 126 false pod centos-tc/192.168.0.3/centos-tc-5bc68ff56f-f46fl/centos-tc-done/46d20d1c607c/docker}]}]" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b

@zshmmm

zshmmm commented Mar 1, 2023

During a pod network delay experiment, destroying the experiment failed:
`/opt/chaosblade/bin/nsexec -t 11077 -p -n -- /bin/sh -c tc qdisc del dev eth0 root`: cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2

Running /opt/chaosblade/bin/nsexec -t 11077 -p -n -- /bin/sh -c tc qdisc del dev eth0 root in the chaosblade-tool container on the corresponding node fails in the same way. The command to execute has to be wrapped in quotes, after which it runs fine.
Like this:
/opt/chaosblade/bin/nsexec -t 11077 -p -n -- "/bin/sh -c tc qdisc del dev eth0 root"

Could it be that the experiment tool's exec module formats the command it runs incorrectly?
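
The effect of quoting can be reproduced with plain sh, independent of nsexec. This is only a sketch of standard sh -c behavior, with echo standing in for the real command so that nothing on the system is changed. Whether chaosblade's exec path actually runs into this is an assumption that the thread does not confirm.

```shell
#!/bin/sh
# Unquoted: sh -c takes only the first word ("echo") as the command string.
# The remaining words become $0, $1, ... and are never executed.
unquoted=$(/bin/sh -c echo tc qdisc del dev eth0 root)

# Quoted: the whole string is passed to sh -c as a single command string.
quoted=$(/bin/sh -c "echo tc qdisc del dev eth0 root")

printf 'unquoted: [%s]\n' "$unquoted"
printf 'quoted:   [%s]\n' "$quoted"
```

The first line prints empty brackets and the second prints the full tc command. By the same rule, an unquoted /bin/sh -c tc qdisc del dev eth0 root typed at a shell prompt runs tc with no arguments.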
