
'kube-node-drainer' does not work with 'kube2iam' in 0.9.9 #1105

Closed
whereisaaron opened this issue Jan 6, 2018 · 47 comments

@whereisaaron
Contributor

whereisaaron commented Jan 6, 2018

I have kube-aws 0.9.9 clusters (us-east-2) with kube-node-drainer enabled and running, but it doesn't actually appear to be working. I don't see any sign of pods draining during ASG termination. The controller doesn't reschedule any pods until well after the instance has terminated and the nodeMonitorGracePeriod has elapsed (which I set to 90s to test). All the pods then get rescheduled at the same instant.

The ASG instances spend 5-10 minutes in the Terminating:Wait state before transitioning to Terminating:Proceed. At that point all connection to the instance drops, but pods only get rescheduled a couple of minutes after that.

A log from a kube-node-drainer pod on a node being terminated is below. It gets a 404 response from the EC2 metadata spot-termination endpoint right up to the point where the connection drops.

The clusters have kube2iam installed; does that interfere with or block AWS API access for kube-node-drainer?

Output of kubectl -n kube-system logs kube-node-drainer-ds-289hr --follow

<=== At this point the instance is "Terminating: Wait" in ASG console
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-08fc5fb751b30d164 /etc/kube-node-drainer/asg
+ sleep 10
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-08fc5fb751b30d164 /etc/kube-node-drainer/asg
+ sleep 10
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-08fc5fb751b30d164 /etc/kube-node-drainer/asg
+ sleep 10
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-08fc5fb751b30d164 /etc/kube-node-drainer/asg
+ sleep 10
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-08fc5fb751b30d164 /etc/kube-node-drainer/asg
+ sleep 10
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-08fc5fb751b30d164 /etc/kube-node-drainer/asg
+ sleep 10
<=== At this point the connection is cut and instance is "Terminating: Proceed" in ASG console
[a minute or so passes]
<=== At this point the instance is terminated, but still showed as 'Ready' in `kubectl get nodes`
[a minute later]
<=== Instance drops from `kubectl get nodes` and all the pods are rescheduled
@whereisaaron
Contributor Author

I rebuilt the cluster without kube2iam and everything worked perfectly! I deliberately rolled a spot instance ASG and watched: kube-node-drainer-asg-status-updater detected the ASG change for ~80 seconds, kube-node-drainer noticed the 'asg' signal, drained everything, and notified AWS.

(The HTTP status code on http://169.254.169.254/latest/meta-data/spot/termination-time didn't seem to work. But I guess that is only for spot fleet instances?)

Is it possible that kube2iam is blocking kube-node-drainer-asg-status-updater from working? Do we need an IAM role and a kube2iam annotation for this?
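If kube2iam is the blocker, the usual fix is the iam.amazonaws.com/role annotation on the pod template. A minimal sketch only - the role name and the Deployment/DaemonSet kinds and names are assumptions, so check the manifests kube-aws actually renders:

# Role name is a placeholder; for kube2iam the role must trust the worker instance role so it can be assumed.
# kube2iam resolves credentials from the pod-level annotation, so it goes on the pod template:
kubectl -n kube-system patch deployment kube-node-drainer-asg-status-updater --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"iam.amazonaws.com/role":"kube-node-drainer"}}}}}'
kubectl -n kube-system patch daemonset kube-node-drainer-ds --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"iam.amazonaws.com/role":"kube-node-drainer"}}}}}'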

Log from kube-node-drainer-asg-status-updater

+ sleep 10
+ jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
+ asg describe-auto-scaling-groups
+ aws --region=us-east-2 autoscaling describe-auto-scaling-groups
+ updated_instances_to_drain=
+ [  ==  ]
+ continue
+ sleep 10
+ jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
+ asg describe-auto-scaling-groups
+ aws --region=us-east-2 autoscaling describe-auto-scaling-groups
+ updated_instances_to_drain=i-057618ec61ea6b39b
+ [ i-057618ec61ea6b39b ==  ]
+ instances_to_drain=i-057618ec61ea6b39b
+ kubectl -n kube-system apply -f -
+ /lib/ld-musl-x86_64.so.1 /opt/bin/hyperkube kubectl -n kube-system apply -f -
+ echo {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "kube-node-drainer-status"}, "data": {"asg": "i-057618ec61ea6b39b"}}
configmap "kube-node-drainer-status" configured
+ sleep 10
+ asg describe-auto-scaling-groups
+ aws --region=us-east-2 autoscaling describe-auto-scaling-groups
+ jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
+ updated_instances_to_drain=i-057618ec61ea6b39b
+ [ i-057618ec61ea6b39b == i-057618ec61ea6b39b ]
+ continue
... XXXX detection lasted about a minute while in 'Terminate: Wait' state
+ sleep 10
+ jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
+ asg describe-auto-scaling-groups
+ aws --region=us-east-2 autoscaling describe-auto-scaling-groups
+ updated_instances_to_drain=i-057618ec61ea6b39b
+ [ i-057618ec61ea6b39b == i-057618ec61ea6b39b ]
+ continue
+ sleep 10
+ asg describe-auto-scaling-groups
+ aws --region=us-east-2 autoscaling describe-auto-scaling-groups
+ jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
+ updated_instances_to_drain=i-057618ec61ea6b39b
+ [ i-057618ec61ea6b39b == i-057618ec61ea6b39b ]
+ continue
+ sleep 10
+ + asg describe-auto-scaling-groups
+ jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
aws --region=us-east-2 autoscaling describe-auto-scaling-groups
+ updated_instances_to_drain=
+ [  == i-057618ec61ea6b39b ]
+ instances_to_drain=
+ echo {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "kube-node-drainer-status"}, "data": {"asg": ""}}
+ kubectl -n kube-system apply -f -
+ /lib/ld-musl-x86_64.so.1 /opt/bin/hyperkube kubectl -n kube-system apply -f -
configmap "kube-node-drainer-status" configured
+ sleep 10
+ + jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
asg describe-auto-scaling-groups
+ aws --region=us-east-2 autoscaling describe-auto-scaling-groups
+ updated_instances_to_drain=
+ [  ==  ]
+ continue
+ sleep 10

Log from kube-node-drainer

+ sleep 10
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-057618ec61ea6b39b /etc/kube-node-drainer/asg
+ sleep 10
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-057618ec61ea6b39b /etc/kube-node-drainer/asg
+ sleep 10
+ curl -o /dev/null -w %{http_code} -sL http://169.254.169.254/latest/meta-data/spot/termination-time
+ http_status=404
+ [ 404 -eq 200 ]
+ [ -e /etc/kube-node-drainer/asg ]
+ grep -q i-057618ec61ea6b39b /etc/kube-node-drainer/asg
+ termination_source=asg
+ break
+ true
+ echo Node is terminating, draining it...
Node is terminating, draining it...
+ kubectl drain --ignore-daemonsets=true --delete-local-data=true --force=true --timeout=60s ip-172-24-11-248.us-east-2.compute.internal
+ /lib/ld-musl-x86_64.so.1 /opt/bin/hyperkube kubectl drain --ignore-daemonsets=true --delete-local-data=true --force=true --timeout=60s ip-172-24-11-248.us-east-2.compute.internal
node "ip-172-24-11-248.us-east-2.compute.internal" cordoned
WARNING: Deleting pods with local storage: stash-backup-c78b5f64d-477d8, kube-node-drainer-ds-m47x8, alertmanager-kube-prometheus-0, kube-prometheus-grafana-54f96fcdd4-dlv5k; Ignoring DaemonSet-managed pods: calico-node-xznng, kube-node-drainer-ds-m47x8, kube-proxy-gwjq7, kube-prometheus-exporter-node-h4qk6
pod "alertmanager-kube-prometheus-0" evicted
pod "prometheus-operator-prometheus-operator-9d75db7c-l2pzq" evicted
pod "nginx-ingress-prod-default-backend-66cf6965d-697xg" evicted
pod "cert-manager-5d7bb78c46-vzjkn" evicted
pod "stash-backup-c78b5f64d-477d8" evicted
pod "kube-prometheus-grafana-54f96fcdd4-dlv5k" evicted
pod "nginx-ingress-prod-controller-7745975c7b-g74vb" evicted
node "ip-172-24-11-248.us-east-2.compute.internal" drained
+ echo All evictable pods are gone
+ [ asg == asg ]
+ echo Notifying AutoScalingGroup that instance i-057618ec61ea6b39b can be shutdown
All evictable pods are gone
Notifying AutoScalingGroup that instance i-057618ec61ea6b39b can be shutdown
+ jq -r .AutoScalingInstances[].AutoScalingGroupName
+ asg describe-auto-scaling-instances --instance-ids i-057618ec61ea6b39b
+ aws --region=us-east-2 autoscaling describe-auto-scaling-instances --instance-ids i-057618ec61ea6b39b
+ ASG_NAME=gnat-Spot1-FJTMHVBZ4EYD-Workers-83SCY2IEP5GV
+ asg+ jq -r .LifecycleHooks[].LifecycleHookName
+ grep -i nodedrainer
 describe-lifecycle-hooks --auto-scaling-group-name gnat-Spot1-FJTMHVBZ4EYD-Workers-83SCY2IEP5GV
+ aws --region=us-east-2 autoscaling describe-lifecycle-hooks --auto-scaling-group-name gnat-Spot1-FJTMHVBZ4EYD-Workers-83SCY2IEP5GV
+ HOOK_NAME=gnat-Spot1-FJTMHVBZ4EYD-WorkersNodeDrainerLH-TJJEA9UD5PX5
+ asg complete-lifecycle-action --lifecycle-action-result CONTINUE --instance-id i-057618ec61ea6b39b --lifecycle-hook-name gnat-Spot1-FJTMHVBZ4EYD-WorkersNodeDrainerLH-TJJEA9UD5PX5 --auto-scaling-group-name gnat-Spot1-FJTMHVBZ4EYD-Workers-83SCY2IEP5GV
+ aws --region=us-east-2 autoscaling complete-lifecycle-action --lifecycle-action-result CONTINUE --instance-id i-057618ec61ea6b39b --lifecycle-hook-name gnat-Spot1-FJTMHVBZ4EYD-WorkersNodeDrainerLH-TJJEA9UD5PX5 --auto-scaling-group-name gnat-Spot1-FJTMHVBZ4EYD-Workers-83SCY2IEP5GV
+ sleep 300

@whereisaaron changed the title from "kube-node-drainer maybe not working with 0.9.9 (RBAC, kube2iam)" to "kube-node-drainer maybe not working with 0.9.9 and kube2iam" on Jan 6, 2018
@whereisaaron changed the title from "kube-node-drainer maybe not working with 0.9.9 and kube2iam" to "'kube-node-drainer' maybe not working with 0.9.9 and kube2iam" on Jan 6, 2018
@whereisaaron
Contributor Author

I confirmed that kube-node-drainer does not work if kube2iam support is enabled. With kube2iam enabled, kube-node-drainer-asg-status-updater can't get credentials: 'Unable to locate credentials. You can configure credentials by running "aws configure".' So it never notices the change to the ASG, and kube-node-drainer never learns that it should drain the node.

I confirmed above that it all works fine when kube2iam is disabled.

I guess we need to create and annotate IAM roles for the node drainer and kube-resources-autosave services if we want them to work in conjunction with kube2iam?

Log from kube-node-drainer-asg-status-updater

+ sleep 10
+ asg describe-auto-scaling-groups
+ aws --region=us-east-2 autoscaling describe-auto-scaling-groups
+ jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
Unable to locate credentials. You can configure credentials by running "aws configure".
+ updated_instances_to_drain=
+ [  ==  ]
+ continue
+ sleep 10
+ jq -r [.AutoScalingGroups[] | select((.Tags[].Key | contains("kube-aws:")) and (.Tags[].Key | contains("kubernetes.io/cluster/gnat"))) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId] | sort | join(",")
+ asg describe-auto-scaling-groups
+ aws --region=us-east-2 autoscaling describe-auto-scaling-groups
Unable to locate credentials. You can configure credentials by running "aws configure".
+ updated_instances_to_drain=
+ [  ==  ]
+ continue
+ sleep 10

@whereisaaron changed the title from "'kube-node-drainer' maybe not working with 0.9.9 and kube2iam" to "'kube-node-drainer' does not work with 'kube2iam' in 0.9.9" on Jan 7, 2018
@tyrannasaurusbanks
Contributor

We're seeing the same issue with the nodeDrainer, thanks for the detailed report @whereisaaron - it helped us zero in on it.
Kube2iam is important for my company going forward, so we're going to dive into this more and hope to contribute a PR back.

@whereisaaron
Contributor Author

Thanks for looking into it @tyrannasaurusbanks! I believe the same issue applies to kube-resources-autosave, as discussed in #912.

@mumoshu said there "...we should manage an IAM role dedicated to kube-resources-autosave which is discovered and assumed to be used by the autosave app via the kube2iam annotation."

So I think we need to add IAM roles to the CloudFormation stack template for each of kube-resources-autosave and kube-node-drainer, and then annotate those pod specs with the kube2iam role annotation.

Also, if you, like me, use the kube2iam --namespace-restrictions option (important for multi-tenant clusters), then we also need to annotate the kube-system namespace with either the specific IAM roles kube-system needs or a fairly broad wildcard.
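For example (role names are placeholders): with --namespace-restrictions enabled, kube2iam reads the iam.amazonaws.com/allowed-roles namespace annotation, which is a JSON array of role patterns.

# Whitelist specific roles for kube-system...
kubectl annotate namespace kube-system --overwrite \
  iam.amazonaws.com/allowed-roles='["kube-node-drainer","kube-resources-autosave"]'
# ...or a broad wildcard (simpler, but weakens the multi-tenant isolation):
# kubectl annotate namespace kube-system --overwrite iam.amazonaws.com/allowed-roles='[".*"]'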

@kiich
Contributor

kiich commented Jan 11, 2018

We've also run into this, and disabling kube2iam indeed removed the error in kube-node-drainer-asg-status-updater. Going forward, though, we would like kube2iam on, so it would be good to get to the bottom of this.

@tyrannasaurusbanks
Contributor

tyrannasaurusbanks commented Jan 11, 2018

cheers for the crash course in kube2iam @whereisaaron, will get cracking!

@whereisaaron
Contributor Author

Hey, sorry @tyrannasaurusbanks, you guys are probably the kube2iam experts - I didn't mean to tell you what you already know! I'm just trying to get a clear picture of the changes needed. I was hoping you'd confirm whether I'd got it right or was confused 😄

@tyrannasaurusbanks
Contributor

Hah! No, we're the opposite - total kube2iam newbs - your proposed changes make sense to me; let's review them in the PR I'm hoping to push today.

@mumoshu
Contributor

mumoshu commented Feb 2, 2018

Thanks everyone - I'm now suspecting this is due to slow responses from kube2iam.
In that case, I wonder if setting a longer timeout / more retries via env vars on kube-node-drainer-asg-status-updater would help.

Update: fixed wrong link
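For anyone who wants to try that, a sketch assuming the updater keeps using the stock AWS CLI (as the trace above shows) and that it runs as a Deployment - these are the standard botocore variables controlling how hard credential lookups against 169.254.169.254 (i.e. kube2iam here) are retried:

kubectl -n kube-system set env deployment/kube-node-drainer-asg-status-updater \
  AWS_METADATA_SERVICE_TIMEOUT=5 \
  AWS_METADATA_SERVICE_NUM_ATTEMPTS=20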

@whereisaaron
Contributor Author

@mumoshu it's not a timing thing, is it? It looks like kube2iam completely blocks the node drainer from AWS credentials because there is no allowed IAM role for the node-drainer Pods: there is no iam.amazonaws.com/role annotation on the Pods, so they have no credentials. The same problem affects kube-resources-autosave. Or is there some special exception for kube-system?

If it is somehow a timing problem, you could also consider switching to kiam, which pre-fetches and caches IAM credentials for the roles annotated on Pods/Namespaces to reduce delays. It's a kube2iam rewrite that tries to improve on things like performance.

@mumoshu
Contributor

mumoshu commented Feb 2, 2018

@whereisaaron AFAIK, kube2iam is configured with --auto-discover-default-role, so it should provide credentials for the node's role to any pod without the annotation. My guess is this is caused by delays like that and/or reliability issues with kube2iam plus the default role. If it doesn't work with timeouts/retries set, perhaps it is a reliability issue.

And yes, kiam would be a good alternative anyway. Also see #1055!
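One quick way to see whether the default role is actually being served to an un-annotated pod is to ask the (intercepted) metadata endpoint which role it offers - the pod name below is just whichever drainer pod you pick:

kubectl -n kube-system exec kube-node-drainer-ds-289hr -- \
  curl -s --max-time 10 http://169.254.169.254/latest/meta-data/iam/security-credentials/
# With --auto-discover-default-role working this should print the node's role name;
# an empty reply or a timeout points at the credential path itself rather than latency.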

@cknowles
Contributor

cknowles commented Feb 2, 2018

Does the node drainer boot before kube2iam? I've found that several applications that boot before it get stuck like this, perhaps due to some startup race condition. I filed an issue a while ago about the same thing. I thought perhaps always starting kube2iam first would solve it, but now I'm thinking there are some additional problems.

@mumoshu
Contributor

mumoshu commented Feb 2, 2018

@c-knowles Thx! Possible. Would recreating pods fix the issue in that case?

@whereisaaron
Contributor Author

@c-knowles @mumoshu the kiam authors specifically cite race conditions in kube2iam leading to wrong permissions, as well as delays/performance, as the reasons they started kiam despite being big kube2iam users/fans.

I think we should work out some proper IAM roles for kube-node-drainer and kube-resources-autosave and assign those explicitly with either kube2iam or kiam (the Pod annotation is the same for both systems: iam.amazonaws.com/role).

It would be ideal to be able to enable namespace role filters, to limit the roles that pods can assume in different namespaces. But I am a little unsure how to identify the minimum roles needed for kube-system and haven't found an example. A bit like RBAC, the developers don't explicitly say what permissions the code needs - you have to reverse-engineer it from the code, or wait for things to break :-)
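The traces in this issue at least pin down what the drainer itself calls, so a starting-point policy for a dedicated role could look like the sketch below. Role and policy names are placeholders; kube-resources-autosave would need its own (S3) permissions on top, and for kube2iam the role also needs a trust policy allowing the worker instance role to assume it.

aws iam put-role-policy --role-name kube-node-drainer --policy-name node-drainer \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLifecycleHooks",
        "autoscaling:CompleteLifecycleAction"
      ],
      "Resource": "*"
    }]
  }'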

@cknowles
Contributor

cknowles commented Feb 6, 2018

@mumoshu unfortunately not; terminating the pod just means the new one goes into crash-loop backoff as well. It seems the cause is some 502s:

wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
wget: server returned error: HTTP/1.1 502 Bad Gateway 

I can confirm though that at least one node in that same cluster has a Running node drainer, but the other nodes were not so lucky. Why those nodes get 502 is the part I don't yet know.

On the healthy node, I exec'd into the pod and ran this:

/ # wget -O - -q http://169.254.169.254/latest/meta-data/
wget: server returned error: HTTP/1.1 502 Bad Gateway

/ # curl http://169.254.169.254/latest/meta-data/
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
hostname
[...]

Seems like some problem with wget + kube2iam?

@mumoshu
Contributor

mumoshu commented Feb 6, 2018

@c-knowles Seems like you have a different problem than the one originally reported in this issue.

Seems like some problem with wget + kube2iam?

Probably.

(Un)fortunately, my colleague @cw-sakamoto (thanks!) spotted that yours can be fixed by replacing wget with curl (isn't that surprising?).

Probably a related issue with wget in BusyBox: http://svn.dd-wrt.com/ticket/5771

@cknowles
Contributor

cknowles commented Feb 6, 2018

Ah ok, I will file another issue for that then.

@kiich
Contributor

kiich commented Feb 6, 2018

@c-knowles Strange, as we've seen the same issue with curl as well, where the call to the AWS metadata endpoint was hanging. Deleting the kube2iam pod and waiting a little made it spring back to life.
As part of this, we've PR'd a liveness probe for it at helm/charts#3400 so we don't have to restart it manually.

@mumoshu
Contributor

mumoshu commented Feb 6, 2018

@kiich Hi, thanks for the info and your work!

Just curious, but was the exact error you saw in your cluster wget: server returned error: HTTP/1.1 502 Bad Gateway?
In this 502 case, the initial wget isn't passing, and it happens even when kube2iam is responsive.

However, what @whereisaaron encountered seems different: the log seems to indicate that wget is passing but kube2iam fails to return creds afterwards. I guess that could be fixed via the PR, but not the 502 one?

@kiich
Contributor

kiich commented Feb 6, 2018

@mumoshu Hi!
Unfortunately we never tried wget, as curl was the one showing the error - actually, I was not getting Bad Gateway; it was just hanging at the command prompt and eventually timing out.

We did find that kube2iam was OOMing/restarting a lot, so restarting the node made it all work (apologies, I said in my previous post that restarting the pod made it work, but it was the node instead).

We never dived in too deep to find the cause; instead we've turned kube2iam off for now and are looking at alternatives.

@kiich
Contributor

kiich commented Feb 6, 2018

Though one thing to point out, as it just came to me: the kube2iam pods were all working and not continuously restarting before.
It was only when we turned kube2iam off via cluster.yaml and rolled the cluster, but did not delete the pods, that this hanging started to happen.

Not sure if that is of any help though!

@mumoshu
Contributor

mumoshu commented Feb 6, 2018

@kiich Thanks for the clarification! As far as I have read the kube2iam code, it doesn't survive node restarts: the iptables rule added by kube2iam doesn't get updated after the kube2iam pod on the node is recreated with a different pod IP. That could be the cause of your hanging issue.
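If anyone wants to check that theory on an affected node: the rule kube2iam installs in --iptables mode is a nat PREROUTING DNAT for 169.254.169.254, so you can compare its target with the current kube2iam pod IP (the app=kube2iam label is an assumption - use whatever selector your manifest has):

sudo iptables -t nat -S PREROUTING | grep 169.254.169.254
kubectl -n kube-system get pods -l app=kube2iam -o wide
# If the DNAT destination no longer matches the running pod's IP, metadata calls
# from pods on that node will hang exactly as described above.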

@kiich
Contributor

kiich commented Feb 6, 2018

@mumoshu Thanks for checking! Yeah that's the conclusion we've come to as well. I'm quite interested in looking into kiam so will be keeping an eye on that PR.

@mumoshu
Contributor

mumoshu commented Feb 6, 2018

Note: considering jtblin/kube2iam#9 and jtblin/kube2iam#126, what we'd probably need is a dedicated pod that monitors the kube2iam pod and adds/removes/updates the iptables rule accordingly. The --iptables option in kube2iam will never be a complete solution to every aspect of the problem.

@mumoshu
Contributor

mumoshu commented Feb 6, 2018

@kiich Thanks for confirming! Glad it wasn't only me who came to that conclusion.

@mumoshu
Contributor

mumoshu commented Feb 6, 2018

@pingles Hi, are you aware of this issue?

In a nutshell, managing iptables from within kube2iam seems like an incomplete solution; we need an external process to monitor it so that the corresponding iptables rule can be added/removed/updated accordingly - no missing route, no compromise.

Is this solved somehow in the kiam community?

@cknowles
Contributor

cknowles commented Feb 6, 2018

@mumoshu kiam has some deferred removal code, but I don't think it will solve most of these issues. I didn't spot anything related to monitoring yet. Both kube2iam and kiam run in host network mode, so the pod IP should always match the node IP and hence survive restarts; both projects use the CoreOS iptables wrapper and call methods which check for duplicates.

@pingles

pingles commented Feb 6, 2018

@c-knowles it's not something I was aware of. Do you mean there's a situation where that rule removal doesn't apply/work, or that it's not sufficient for all cases?

@mumoshu
Contributor

mumoshu commented Feb 6, 2018

@c-knowles Thx for the correction!
I remember I have seen both cases (podIP equal to or different from nodeIP) before - not sure whether it was a temporary issue, my misoperation, or something else - but yes, basically the podIP should match the nodeIP when in hostNetwork.

@kiich Could you confirm that your kube2iam pod reports the node IP as its pod IP?

@kiich
Contributor

kiich commented Feb 6, 2018

@mumoshu Hi, it sure does!

kubectl get pod kube2iam -n kube-system
...
  hostIP: 10.xx.y.zz
  phase: Running
  podIP: 10.xx.y.zz

whereas other pods report hostIP and podIP differently.

@mumoshu
Contributor

mumoshu commented Feb 6, 2018

@kiich Thanks for the confirmation!

Then, sorry, but my previous guess was incorrect, as corrected by @c-knowles. Perhaps there's something going wrong that I haven't experienced yet?
Hopefully your PR will be a workaround.

Anyway, trying kiam would be a good idea ☺️ Thank you for chiming in, @pingles!

@pingles

pingles commented Feb 6, 2018

@mumoshu no worries - I'd be glad to try some tests to see whether we're susceptible to the same issues you're hitting. I'm trying to think up something similar without using kube-aws and node-drainer.

@whereisaaron
Contributor Author

@pingles the problem experienced by me and others is that if a Pod that needs access to an IAM role starts before kube2iam, that Pod cannot get credentials. Even hours after the cluster starts, such Pods still don't get credentials. From others' reports there seems to be a similar issue if you delete/restart nodes and the kube2iam Pods are not the first Pods to start on a node. So basically the test is just this: start a container on a node before the kiam container, then start kiam, then see if your container can get any IAM credentials.

Q1) With kiam, if you have an existing cluster with already running allowed-roles annotated Pods, and then you start kiam, does it have a mechanism to provide credentials to those already running Pods?

Q2) With kiam, if the kiam container is brutally killed without a chance to clean up, does it have a way - on restart - to find and delete any stale routes it added, and/or to reconcile the route table and cached credentials with the Pods already running on the node?

The kube2iam project owner said to @c-knowles: "Yes definitely kube2iam should be started before any pod scheduling is done. I will update the readme to add a note for that". But since our node-drainer is a static Pod too, that is going to be a pretty random race situation. It seems like a significant design problem if kube2iam can't eventually reconcile the route table and credentials with the state of the node.

The kube2iam readme says: "The kube2iam daemon and iptables rule (see below) need to run before all other pods that would require access to AWS resources." I thought this was just to ensure security, so that no Pod could sneak a role before kube2iam started. But it appears to be a structural problem.
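A rough sketch of the ordering test described above (names and image are placeholders; the idea is just to have a pod polling the metadata endpoint from before kube2iam/kiam starts, and to watch whether it ever recovers without being recreated):

# 1. With kube2iam/kiam absent on the node (e.g. its DaemonSet deleted or not yet deployed),
#    start a pod that keeps asking for credentials:
kubectl run creds-test --image=amazonlinux --restart=Never --command -- \
  sh -c 'while true; do curl -sf --max-time 5 \
    http://169.254.169.254/latest/meta-data/iam/security-credentials/ || echo no-creds; \
    sleep 10; done'
# 2. Deploy (or restore) kube2iam/kiam on that node.
# 3. Watch whether the pre-existing pod ever starts getting a role back:
kubectl logs -f creds-test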

@pingles

pingles commented Feb 6, 2018

@whereisaaron interesting, I'll try and get some time tomorrow to try and test it.

I'll try and answer your questions though.

Q1) kiam reuses a lot of the cache code from the k8s.io/client-go library (in the early days we tried writing our own state machine for it and it was a bad decision :). These caches are based on an informer which both watches for resources and periodically syncs (retrieving a full list of all resources). This should ensure that if a pod has already been started when the kiam server starts, it'll retrieve details for it.

we also broke the system into 2 processes: an agent which runs on all nodes running user workloads that should have their metadata api requests intercepted, and a server which runs on a subset of nodes (we run these on our masters that don't run other user-facing things).

the server process is the only one that maintains the caches from the k8s api server and the only process that talks to aws. these together mean that, in general, there's less movement around the server processes than the agents- so by the time an agent starts there's already likely a server process in place. we also do a lot of retries + backoffs when performing rpc against the server to handle when pods aren't yet stored in the cache.

this server process is super important to let us prefetch quickly, and to maintain a high cache hit rate (as well as reducing the number of AWS API calls we make).

however, all this relies on the timeouts of the sdk clients being something sensible.

the agent http server generally has an unlimited timeout up to when the client disconnects, so if the pod can't be found it'll keep retrying until it's found or the client disconnects. we've noticed some client libraries behaving more strictly than others, but in general it's been ok - we encourage teams to ensure they have retries and backoffs around their aws operations that would be requesting credentials so they recover nicely also.

Q2) the agent is what installs itself with iptables and uses a prerouting rule to capture stuff heading out of a particular interface destined for 169.254.169.254 and rewrites it to the agent process IP which, given it's host networked, is the node ip. so, if an agent dies it can restart and carry on processing requests.

I'm not 100% certain of what happens should a pod be started before the kiam agent. my guess is that it'd end up trying to talk to the AWS metadata api and fail to retrieve credentials (our nodes have almost 0 IAM permissions) causing the process to exit and be restarted- in effect it wouldn't succeed until kiam was started.

Hope that helps, happy to answer more here, on the Kubernetes slack (I'm @pingles there too) or you can email me (my GitHub profile has the address).

@whereisaaron
Contributor Author

Thanks for the extensive explanation @pingles! It sounds like using the 'informer' pattern is a smart move that should address the problems we are seeing with kube2iam. The proof is in trying to break it, so yeah, I'd love to hear how your testing goes.

Regarding the agent/server model, that sounds good too, as agent restarts or reboots of nodes with agents don't lose any credential state. I wondered: how do the server replicas manage their state and agent access? Are they active/active load-balanced, or is it a controller-election approach? I was hoping the former, both for scaling the number of agents and because I was thinking it would be better for at least two servers to maintain cached credentials, so that the loss of one server wouldn't cause lots of latency across the whole cluster while the credentials are repopulated into the cache.

Regarding the clients, they ought to be delay-tolerant: either retry with back-off, or else fail health checks or exit so the scheduler can handle it for them. I'm pretty sure our 'node-drainer' tries to fetch credentials every cycle, so there should be no problem there. Containers that get wedged because the first attempt wasn't answered fast enough shouldn't be an issue 😄

@pingles

pingles commented Feb 6, 2018

@whereisaaron the servers are active/active - they all run the same server process, with agents connecting via a k8s Service. Because it uses gRPC there's also some client-side load-balancing performed by the client within the agent process.

All servers maintain a cache of all pods and a credential cache (so we duplicate AWS requests for the same credentials across servers), but it means any server process can serve any request. It keeps things relatively simple and works well.

Checking our Datadog stats right now on our busiest cluster (around 1k pods), the role-name handler has a 95th-percentile response of 54ms (where it checks the role of an attached pod) and the credentials handler is at 88ms.

@whereisaaron
Contributor Author

Thanks @pingles, sounds great, I'd love to try it in action with kube-aws clusters. One total n00b question: when installing kiam, how does one make sure that the kube-system namespace Pods get the roles they need to e.g. create AWS load balancers and EBS volumes and the other AWS functionality rolled into the k8s core? Surely some kube-system Pods need role annotations? Or is there a special bypass in kiam/kube2iam for Pods in the kube-system namespace?

@pingles

pingles commented Feb 7, 2018

@whereisaaron good question. if a pod needs credentials it just needs the iam annotation. for us the k8s master components run on separate master nodes which have the relevant IAM policy granted directly (and so don't go via kiam).

@whereisaaron
Contributor Author

@pingles for kube-aws I think it is the same: the kube-apiserver and kube-scheduler Pods run on a pool of master nodes, and those nodes have a broader IAM policy. However, DaemonSet Pods (e.g. kube-proxy) run on all nodes, including master nodes, so a kiam DaemonSet would - I figure - add a route and intercept the master components' credential requests too? Do you keep DaemonSets off your master nodes somehow? Or, since kube-apiserver and kube-scheduler run with hostNetwork: true, do they bypass kiam and get to AWS that way?

@pingles

pingles commented Feb 8, 2018

@whereisaaron yep- we run api server and scheduler pods as host networked

@cknowles
Contributor

cknowles commented Feb 8, 2018

@pingles thanks for adding so much detail here! Between us it seems we'll be able to work this out pretty soon.

@c-knowles it's not something I was aware of. Do you mean there's a situation where that rule removal doesn't apply/work, or that it's not sufficient for all cases?

To clarify that, I just meant that the defer would only be actioned during normal kiam app shutdown, and that the issues described above by the others would probably not be solved by that defer code.

@mumoshu
Contributor

mumoshu commented Feb 21, 2018

FYI I've just merged the KIAM support into master

@cknowles
Contributor

To finish this off, I think we probably need #1150 with the role for system pods generated.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 23, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 23, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
