
RKE will not finish deployment if certain network mounts exist on the target node. #964

Closed
emm-dee opened this issue Oct 16, 2018 · 16 comments



emm-dee commented Oct 16, 2018

RKE version:
v0.1.10

Docker version: (docker version, docker info preferred)
17.0.3

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Ubuntu 16.04

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Bare metal

cluster.yml file:

nodes:
  - address: 10.52.130.8
    user: user
    ssh_key_path: /usr2/user/.ssh/id_rsa
    role: [controlplane,worker,etcd]
  - address: 10.52.130.9
    user: user
    ssh_key_path: /usr2/user/.ssh/id_rsa
    role: [controlplane,worker,etcd]
  - address: 10.52.130.45
    user: user
    ssh_key_path: /usr2/user/.ssh/id_rsa
    role: [controlplane,worker,etcd]

Steps to Reproduce:
rke up

Results:
Failed deployment with the following error:

FATA[0191] [workerPlane] Failed to bring up Worker Plane: Failed to verify healthcheck: Failed to check https://localhost:10250/healthz for service [kubelet] on host [10.52.130.45]: Get https://localhost:10250/healthz: Unable to access the service on localhost:10250. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: + umount /var/lib/docker/aufs/mnt/7c3207c585e17eeb80d061352d13e8d2d6677ed5719c42a1cd41c34d52c911be/host/usr/local/doc 

Findings
The issue turned out to be caused by a network mount at /usr/local/doc on the target nodes. When the user removed the network mount, rke up completed successfully.

Filing this issue because the customer requires various network mounts across their systems, and it would be ideal if the deployment could ignore paths that are externally mounted. (More details and chatter are in internal Slack comms from 10/15/2018.)
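
As a side note, a rough pre-flight check along these lines could flag such mounts before running rke up. This is only a sketch, not something rke itself does; findmnt comes from util-linux and the filesystem-type list is an assumption:

# list network/automount filesystems that a cleanup umount could trip over
findmnt -t nfs,nfs4,autofs,fuse.glusterfs -o TARGET,SOURCE,FSTYPE
# fallback if findmnt is unavailable
mount | grep -E 'type (nfs|nfs4|autofs|fuse\.glusterfs)'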

@skaven81

To add some additional context to this, /usr/local/doc in this case was an NFS-mounted filesystem, but more importantly, it was an automounted NFS filesystem, using the "direct maps" model for mounting. This means that the kernel itself is managing /usr/local/doc (it's not "just" a mountpoint).

@galal-hussein
Contributor

@emm-dee @skaven81 I wasn't able to reproduce the issue on my setup with the latest rke. Here are my steps:

  • create /usr/local/doc on the rke node
  • mount /nfs from the NFS server using autofs and direct maps
  • run rke up on the rke node

rke finished building the cluster successfully and kubelet started correctly. Can you test again with the latest rke and see if you can still reproduce?


skaven81 commented Feb 6, 2019

@galal-hussein did you mount the nfs filesystem at /usr/local/doc? Just having an NFS mount at /nfs isn't going to reproduce the issue. The problem is caused when the installer attempts to umount /usr/local/doc, which fails because the filesystem is in use.

@galal-hussein
Contributor

@skaven81 I tried that. On the NFS server, /etc/exports contains:

/nfs *(rw,sync,no_subtree_check)

on the rke node:

/etc/auto.master

/usr/local  /etc/auto.nfs

and in /etc/auto.nfs

doc   x.x.x.x:/nfs

I was able to verify that /usr/local/doc was automounted from the NFS server:

# mount | grep /usr/local
/etc/auto.nfs on /usr/local type autofs (rw,relatime,fd=6,pgrp=16284,timeout=300,minproto=5,maxproto=5,indirect)

After that I added the rke node to cluster.yml and ran rke up:

INFO[0371] Finished building Kubernetes cluster successfully

Can you share your configuration file? I am not sure what is not working correctly in your setup.

@alena1108

@sangeethah can you give it a try too?

@skaven81

Your NFS configuration is not using direct maps. I suspect the issue arises through the use of direct maps, because direct maps appear in /proc/mounts even when they're not mounted.

Your /etc/auto.master should contain something like this:

/-    auto.direct

And then /etc/auto.direct should contain:

/usr/local/src -rw,noquota       <some filer>/src
/usr/local/doc -rw,noquota       <some filer>/doc

The /- is what enables direct maps.
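
A rough way to see why that matters on a node using the direct-map layout above (output is illustrative): each direct-map key gets its own autofs trigger entry in /proc/mounts even while nothing is mounted on it, which is presumably what the installer's umount step stumbles over.

# autofs trigger mounts created by the direct map
grep autofs /proc/mounts
# expect something like (note the "direct" option):
# /etc/auto.direct /usr/local/doc autofs rw,relatime,fd=...,timeout=300,minproto=5,maxproto=5,direct 0 0
# check whether the NFS filesystem is actually mounted on top right now
grep ' /usr/local/doc ' /proc/mounts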

@sangeethah sangeethah assigned sowmyav27 and unassigned sangeethah Feb 20, 2019
@cjellick

Have we figured out how to reproduce this yet?

@deniseschannon deniseschannon modified the milestones: Backlog, v0.3.0 Apr 8, 2019

aiqs4 commented Apr 14, 2019

Seems to be the same problem with a non-existent glusterfs network mount.

I removed the glusterfs-server but forgot to remove the /etc/fstab entry:

127.0.0.1:home /home/asdf/devel glusterfs defaults,_netdev,nofail 0 0
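
For anyone in the same spot, a quick way to catch leftover network-mount entries like this before running rke up (a sketch; the filesystem-type list is an assumption and findmnt comes from util-linux):

# fstab entries marked as network filesystems
grep _netdev /etc/fstab
# fstab view filtered to common network filesystem types
findmnt --fstab -t nfs,nfs4,glusterfs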


victort commented May 6, 2019

~~I'm experiencing this problem on one cluster but not another (an identical cluster in a different datacenter), and I have no NFS mounts.~~

Argh, never mind. Typo on my part (don't indent kubeproxy: in your cluster.yaml).
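
For anyone hitting the same typo: in cluster.yml, kubeproxy: has to sit at the same indentation level as the other entries under services:, not nested inside one of them. A minimal sketch (keys trimmed, values illustrative, not taken from my actual file):

services:
  kubelet: {}      # other service settings omitted
  kubeproxy: {}    # correct: kubeproxy is a sibling of kubelet under services
# indenting kubeproxy: one level deeper nests it under another service,
# so rke never sees a kubeproxy entry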

@alena1108

@moelsayed can you help validate?

@moelsayed
Contributor

I just tried to reproduce this again. Here is my configuration:

  • /etc/auto.master:
	/-	/etc/auto.share
  • /etc/auto.share:
	/usr/local/doc	-rw,noquota 	x.x.x.x:/share

Related mounts on the node:

/etc/auto.share on /usr/local/doc type autofs (rw,relatime,fd=6,pgrp=21820,timeout=300,minproto=5,maxproto=5,direct)
x.x.x.x:/share on /usr/local/doc type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=y.y.y.y,local_lock=none,addr=x.x.x.x)

Using latest master, cluster provisioning completed successfully.

@moelsayed
Contributor

@emm-dee Are you still having this problem? Is there any additional configuration you can provide to reproduce it?

@skaven81

The key for reproduction might be that the filesystem has to be in use so that it can't be unmounted. Perhaps try opening a separate shell and cd'ing into the NFS filesystem, so that the kernel refuses to unmount it because it is busy/in use.
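
Something like this, roughly (paths assume the /usr/local/doc layout from earlier in the thread):

# shell 1: keep the automounted filesystem busy
cd /usr/local/doc
# shell 2: attempt the unmount that the cleanup step performs
umount /usr/local/doc
# while shell 1 sits inside the mount, this typically fails with:
# umount: /usr/local/doc: target is busy.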

@moelsayed
Contributor

@skaven81 I vim'ed a file in the same location and also cd'ed into the directory. I was still able to provision the cluster successfully.
Can you reproduce it with the latest rc? If you can, would it be possible to share the rke debug log and the kubelet log?

@skaven81

We worked around the issue by dropping the two noted automount points off of our RKE systems a looong time ago. So for all I know the issue is fixed in modern versions of RKE.

I don't have a way of reproducing it anymore without a lot of work to undo our workaround in a test environment.

@alena1108

@skaven81

So for all I know the issue is fixed in modern versions of RKE.

Based on the ^ closing the issue. Please reopen if you see it again.
