
RKE will not finish deployment if certain network mounts exist on the target node. #964

Closed
emm-dee opened this issue Oct 16, 2018 · 16 comments



emm-dee commented Oct 16, 2018

RKE version:
v0.1.10

Docker version: (docker version, docker info preferred)
17.0.3

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Ubuntu 16.04

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Bare metal

cluster.yml file:

nodes:
  - address: 10.52.130.8
    user: user
    ssh_key_path: /usr2/user/.ssh/id_rsa
    role: [controlplane,worker,etcd]
  - address: 10.52.130.9
    user: user
    ssh_key_path: /usr2/user/.ssh/id_rsa
    role: [controlplane,worker,etcd]
  - address: 10.52.130.45
    user: user
    ssh_key_path: /usr2/user/.ssh/id_rsa
    role: [controlplane,worker,etcd]

Steps to Reproduce:
rke up

Results:
Failed deployment with the following error:

FATA[0191] [workerPlane] Failed to bring up Worker Plane: Failed to verify healthcheck: Failed to check https://localhost:10250/healthz for service [kubelet] on host [10.52.130.45]: Get https://localhost:10250/healthz: Unable to access the service on localhost:10250. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: + umount /var/lib/docker/aufs/mnt/7c3207c585e17eeb80d061352d13e8d2d6677ed5719c42a1cd41c34d52c911be/host/usr/local/doc 

Findings
The issue turned out to be caused by a network mount at /usr/local/doc on the target nodes. When the user removed the network mount, rke up completed successfully.

Filing this issue because the customer requires various network mounts across their systems, and it would be ideal if the deployment could ignore paths that are externally mounted. (More details and chatter are in internal Slack comms from 10/15/2018.)
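
As a side note, a rough pre-flight check along these lines could flag such mounts before running rke up. This is only a sketch, not something rke itself does; findmnt comes from util-linux and the filesystem-type list is an assumption:

# list network/automount filesystems that a cleanup umount could trip over
findmnt -t nfs,nfs4,autofs,fuse.glusterfs -o TARGET,SOURCE,FSTYPE
# fallback if findmnt is unavailable
mount | grep -E 'type (nfs|nfs4|autofs|fuse\.glusterfs)'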

@skaven81

To add some additional context to this, /usr/local/doc in this case was an NFS-mounted filesystem, but more importantly, it was an automounted NFS filesystem, using the "direct maps" model for mounting. This means that the kernel itself is managing /usr/local/doc (it's not "just" a mountpoint).

@galal-hussein
Contributor

@emm-dee @skaven81 I wasn't able to reproduce the issue on my setup with the latest rke. Here are my steps:

  • create /usr/local/doc on the rke node
  • mount /nfs from the NFS server using autofs and direct maps
  • run rke up on the rke node

rke finished building the cluster successfully and kubelet started correctly. Can you test again with the latest rke and see if you can still reproduce?


skaven81 commented Feb 6, 2019

@galal-hussein did you mount the nfs filesystem at /usr/local/doc? Just having an NFS mount at /nfs isn't going to reproduce the issue. The problem is caused when the installer attempts to umount /usr/local/doc, which fails because the filesystem is in use.

@galal-hussein
Contributor

@skaven81 I tried that. On the NFS server, /etc/exports contains:

/nfs *(rw,sync,no_subtree_check)

on the rke node:

/etc/auto.master

/usr/local  /etc/auto.nfs

and in /etc/auto.nfs

doc   x.x.x.x:/nfs

I was able to verify that /usr/local/doc was automounted from the NFS server:

# mount | grep /usr/local
/etc/auto.nfs on /usr/local type autofs (rw,relatime,fd=6,pgrp=16284,timeout=300,minproto=5,maxproto=5,indirect)

After that I added the rke node to cluster.yml and ran rke up:

INFO[0371] Finished building Kubernetes cluster successfully

Can you share your configuration file? I am not sure what is not working correctly in your setup.

@alena1108

@sangeethah can you give it a try too?

@skaven81

Your NFS configuration is not using direct maps. I suspect the issue arises through the use of direct maps, because direct maps appear in /proc/mounts even when they're not mounted.

Your /etc/auto.master should contain something like this:

/-    auto.direct

And then /etc/auto.direct should contain:

/usr/local/src -rw,noquota       <some filer>/src
/usr/local/doc -rw,noquota       <some filer>/doc

The /- is what enables direct maps.
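
A rough way to see why that matters on a node using the direct-map layout above (output is illustrative): each direct-map key gets its own autofs trigger entry in /proc/mounts even while nothing is mounted on it, which is presumably what the installer's umount step stumbles over.

# autofs trigger mounts created by the direct map
grep autofs /proc/mounts
# expect something like (note the "direct" option):
# /etc/auto.direct /usr/local/doc autofs rw,relatime,fd=...,timeout=300,minproto=5,maxproto=5,direct 0 0
# check whether the NFS filesystem is actually mounted on top right now
grep ' /usr/local/doc ' /proc/mounts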

@sangeethah sangeethah assigned sowmyav27 and unassigned sangeethah Feb 20, 2019
@cjellick

Have we figured out how to reproduce this yet?

@deniseschannon deniseschannon modified the milestones: Backlog, v0.3.0 Apr 8, 2019

aiqs4 commented Apr 14, 2019

Seems to be the same problem with a non-existent glusterfs network mount.

I removed the glusterfs-server but forgot to remove the /etc/fstab entry:

127.0.0.1:home /home/asdf/devel glusterfs defaults,_netdev,nofail 0 0
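
For anyone in the same spot, a quick way to catch leftover network-mount entries like this before running rke up (a sketch; the filesystem-type list is an assumption and findmnt comes from util-linux):

# fstab entries marked as network filesystems
grep _netdev /etc/fstab
# fstab view filtered to common network filesystem types
findmnt --fstab -t nfs,nfs4,glusterfs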


victort commented May 6, 2019

~~I'm experiencing this problem on one cluster but not another (an identical cluster in a different datacenter), and I have no NFS mounts.~~

Argh, never mind. Typo on my part (don't indent kubeproxy: in your cluster.yaml).
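
For anyone hitting the same typo: in cluster.yml, kubeproxy: has to sit at the same indentation level as the other entries under services:, not nested inside one of them. A minimal sketch (keys trimmed, values illustrative, not taken from my actual file):

services:
  kubelet: {}      # other service settings omitted
  kubeproxy: {}    # correct: kubeproxy is a sibling of kubelet under services
# indenting kubeproxy: one level deeper nests it under another service,
# so rke never sees a kubeproxy entry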

@alena1108

@moelsayed can you help validate?

@moelsayed
Contributor

I just tried to reproduce this again. Here is my configuration:

  • /etc/auto.master:
	/-	/etc/auto.share
  • /etc/auto.share:
	/usr/local/doc	-rw,noquota 	x.x.x.x:/share

Related mounts on the node:

/etc/auto.share on /usr/local/doc type autofs (rw,relatime,fd=6,pgrp=21820,timeout=300,minproto=5,maxproto=5,direct)
x.x.x.x:/share on /usr/local/doc type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=y.y.y.y,local_lock=none,addr=x.x.x.x)

Using latest master, cluster provisioning completed successfully.

@moelsayed
Contributor

@emm-dee Are you still having this problem? Is there any additional configuration you can provide to reproduce it?

@skaven81

The key for reproduction might be that the filesystem has to be in use so that it can't be unmounted. Perhaps try opening a separate shell and cd'ing into the NFS filesystem, so that the kernel refuses to unmount it because it is busy/in use.
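
Something like this, roughly (paths assume the /usr/local/doc layout from earlier in the thread):

# shell 1: keep the automounted filesystem busy
cd /usr/local/doc
# shell 2: attempt the unmount that the cleanup step performs
umount /usr/local/doc
# while shell 1 sits inside the mount, this typically fails with:
# umount: /usr/local/doc: target is busy.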

@moelsayed
Contributor

@skaven81 I vim'ed a file in the same location and also cd'ed into the directory. I was still able to provision the cluster successfully.
Can you reproduce it with the latest rc? If you can, would it be possible to share the rke debug log and the kubelet log?

@skaven81

We worked around the issue by dropping the two noted automount points off of our RKE systems a looong time ago. So for all I know the issue is fixed in modern versions of RKE.

I don't have a way of reproducing it anymore without a lot of work to undo our workaround in a test environment.

@alena1108

@skaven81

So for all I know the issue is fixed in modern versions of RKE.

Based on the ^ closing the issue. Please reopen if you see it again.
