
First run during deployment produces 404 error #122

Closed
SharpEdgeMarshall opened this issue Jan 9, 2018 · 15 comments

@SharpEdgeMarshall
Collaborator

Kube2iam version: 0.8.1
Kubernetes version: 1.8.4
CNI: weave 2.0.5

Only during deployment do we experience an error retrieving AWS credentials with boto: the pod crashes and is restarted, and then it works perfectly.

This is the log line from kube2iam during the error

GET /latest/meta-data/iam/security-credentials/ (404) took 3116998023 ns

App error log

botocore.exceptions.NoCredentialsError: Unable to locate credentials
@jrnt30
Collaborator

jrnt30 commented Jan 9, 2018

Your configuration for the DS would be helpful.

  • Are you using a fallback role?
  • Default ARN?
  • What is the general size of the cluster you are working with? Pod count is probably the most useful metric here.
  • Is there a large influx of pods being scheduled simultaneously when you see this or does it happen all the time?
  • Have you tried testing with the latest version which reduces the scope of the Kube2IAM caching to be only the node that you are working on?

If you turn on verbose logging you will get a lot of information that should help you understand whether the pod is being indexed properly prior to the initial call; a short sketch of how to gather that is below.
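
(For anyone else debugging this, a minimal sketch of how to gather those answers from a running cluster; the DaemonSet name/namespace, the name=kube2iam label, and the flag names are assumptions based on the kube2iam README, so adjust them to your deployment.)

# Show the args the kube2iam container runs with (default role, base ARN,
# auto-discovery, verbosity). DaemonSet name and namespace are assumptions.
kubectl -n kube-system get ds kube2iam -o jsonpath='{.spec.template.spec.containers[0].args}'

# Rough pod count, to gauge how much the kube2iam watch has to index.
kubectl get pods --all-namespaces --no-headers | wc -l

# After adding --verbose to the container args, tail the logs while
# reproducing the failing first run (label selector is an assumption).
kubectl -n kube-system logs -l name=kube2iam -f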

@SharpEdgeMarshall
Collaborator Author

SharpEdgeMarshall commented Jan 15, 2018

  • Yes, we are using a default role with no permissions
  • ARN auto-discovery is enabled
  • We are talking about a dev cluster with 2 nodes and 1 master, running about 73 pods including system ones
  • Yes, the Helm chart handles about 9 k8s deployments launching 2 replicas each
  • I've increased the timeout and retries in the boto library (sketch below) and updated Kube2IAM to 0.9.0, and so far it has not happened again.
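
(For reference, a minimal sketch of the boto-side mitigation from the last bullet: botocore honours these environment variables, so they can be set in the container spec without code changes. The values here are illustrative.)

# Give the metadata call (served by kube2iam) more time and more attempts
# before botocore gives up with NoCredentialsError.
export AWS_METADATA_SERVICE_TIMEOUT=5        # seconds per attempt (botocore default: 1)
export AWS_METADATA_SERVICE_NUM_ATTEMPTS=10  # attempts before failing (botocore default: 1)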

@bajacondor

Hello @SharpEdgeMarshall, I'm experiencing strange behavior and hangs with boto3 and kube2iam. Can you share a little more about the code you are trying to deploy in your container(s)? @jrnt30: when you say 'latest version', do you mean 0.9.0? I don't see any code changes there that indicate changes in caching: https://github.com/jtblin/kube2iam/releases/tag/0.9.0. Could you clarify? Thank you.

@jtblin
Owner

jtblin commented Mar 26, 2018

Closing as "not repro"; let me know if you experience this again. 0.10.0 should help with performance as well.

@jtblin jtblin closed this as completed Mar 26, 2018
@roffe

roffe commented May 31, 2018

Observing the same problem: the first start of every deployment that begins by downloading things from S3 via aws-cli always fails; after the first pod restart they come up fine.

Common to all of them is a 404 in the kube2iam logs and the error:

fatal error: Unable to locate credentials

Kube2iam: 0.10.0
Kubernetes: 1.9.6
Kops: 1.9.0

jbk/def➜  kube2iam git:(kube-7) ✗ k logs -n kube-system kube2iam-qsvq2 | grep 100.96.5.38
time="2018-05-31T14:02:04Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2260454199 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=100.96.5.38 res.duration=2260454199 res.status=404
time="2018-05-31T14:02:05Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2359985679 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=100.96.5.38 res.duration=2359985679 res.status=404
time="2018-05-31T14:02:05Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 1548675024 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=100.96.5.38 res.duration=1548675024 res.status=200
time="2018-05-31T14:02:06Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 23565 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=100.96.5.38 res.duration=23565 res.status=200
time="2018-05-31T14:02:06Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-app-iam-role (200) took 78035 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-app-iam-role req.remote=100.96.5.38 res.duration=78035 res.status=200

@roffe

roffe commented May 31, 2018

With verbose logging enabled:

time="2018-05-31T15:08:26Z" level=debug msg="Caches have been synced.  Proceeding with server."
time="2018-05-31T15:08:26Z" level=info msg="Listening on port 8181"
time="2018-05-31T15:08:26Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube2iam-ltvlk pod.namespace=kube-system pod.status.ip=10.0.117.21 pod.status.phase=Running
time="2018-05-31T15:08:26Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube2iam-58hqb pod.namespace=kube-system pod.status.ip=10.0.118.75 pod.status.phase=Running
time="2018-05-31T15:08:26Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube2iam-d86rp pod.namespace=kube-system pod.status.ip=10.0.118.202 pod.status.phase=Running
time="2018-05-31T15:08:27Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube2iam-ltvlk pod.namespace=kube-system pod.status.ip=10.0.117.21 pod.status.phase=Running
time="2018-05-31T15:08:27Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube2iam-d86rp pod.namespace=kube-system pod.status.ip=10.0.118.202 pod.status.phase=Running
time="2018-05-31T15:08:43Z" level=debug msg="Pod OnUpdate" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-8vtwd pod.namespace=default pod.status.ip=100.96.5.38 pod.status.phase=Running
time="2018-05-31T15:08:43Z" level=debug msg="Pod OnAdd" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-hcxzq pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2018-05-31T15:08:43Z" level=debug msg="Pod OnUpdate" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-hcxzq pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2018-05-31T15:08:43Z" level=debug msg="Pod OnUpdate" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-hcxzq pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2018-05-31T15:08:44Z" level=debug msg="Pod OnUpdate" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-8vtwd pod.namespace=default pod.status.ip=100.96.5.38 pod.status.phase=Running
time="2018-05-31T15:08:49Z" level=debug msg="Pod OnUpdate" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-8vtwd pod.namespace=default pod.status.ip=100.96.5.38 pod.status.phase=Running
time="2018-05-31T15:08:49Z" level=debug msg="Pod OnDelete" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-8vtwd pod.namespace=default pod.status.ip=100.96.5.38 pod.status.phase=Running
time="2018-05-31T15:08:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2473049227 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=100.96.7.253 res.duration=2473049227 res.status=404
time="2018-05-31T15:08:59Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2559204447 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=100.96.7.253 res.duration=2559204447 res.status=404
time="2018-05-31T15:08:59Z" level=debug msg="Pod OnUpdate" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-hcxzq pod.namespace=default pod.status.ip=100.96.7.253 pod.status.phase=Running
time="2018-05-31T15:09:00Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 2260242986 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=100.96.7.253 res.duration=2260242986 res.status=200
time="2018-05-31T15:09:00Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 25576 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=100.96.7.253 res.duration=25576 res.status=200
time="2018-05-31T15:09:00Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-app-iam-role (200) took 446768681 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-app-iam-role req.remote=100.96.7.253 res.duration=446768681 res.status=200
time="2018-05-31T15:09:15Z" level=debug msg="Pod OnUpdate" pod.iam.role=my-app-iam-role pod.name=my-app-55b894857d-hcxzq pod.namespace=default pod.status.ip=100.96.7.253 pod.status.phase=Running

@jrnt30
Collaborator

jrnt30 commented Jun 1, 2018

This does seem to be the same type of timing issue that we have seen before. You probably pieced this together already, but in your 2nd example the Kube2IAM watch had not received notification that the Pod had become Running (which is when we get the IP and cache the IP -> role mapping) until after you had made two failing calls.

There are a few events that Kube2IAM registers in your example, but notice that pod.status.ip is empty until the OnUpdate event at time="2018-05-31T15:08:59Z". Your calls to Kube2IAM will not return anything until then.

I don't know how, in its current form, we could give people both things they need:

  • A very fast response as some SDKs have extremely aggressive timeouts for Metadata calls
  • Blocking semantics until the pod updates.

A much bigger solution that has been floated around would be to augment the "Pod based" IAM approach with a "Service Account" based IAM approach. This would potentially solve part of the problem, because we can see the Pod's service account before it ever gets an IP, so we could start the STS handshake and credential caching even if we hadn't received the OnUpdate event with Phase == Running.
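
(For context, the current "Pod based" approach keys off the iam.amazonaws.com/role pod annotation, which kube2iam can only map once it learns the pod's IP. A quick way to check what a pod advertises, reusing a pod name from the example logs above as an illustration:)

# Print the role annotation kube2iam reads for this pod (pod name taken from
# the example logs above; substitute your own pod and namespace).
kubectl -n default get pod my-app-55b894857d-hcxzq -o yaml | grep iam.amazonaws.com/role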

I don't want to "fork" the repo, because I feel @jtblin deserves the credit for the good work he's put into this and for the very crafty idea, but if there aren't any updates on #132 I will probably just start working on and publishing something myself.

@SharpEdgeMarshall
Collaborator Author

Still the same situation here, even after updating k8s and kube2iam to 1.9.8 and 0.10.0 respectively.

@dblackdblack

We're in the exact same situation ... jobs whose first action is to attempt, and fail, to pull settings from S3.

@Evanjt1

Evanjt1 commented Feb 4, 2020

@jrnt30 I'm interested in which direction you ended up going, if any. We ended up mitigating this issue by increasing the BackoffMaxInterval to a large enough duration that the kube2iam pods had enough time to receive and process the pod events and update the cache. We then left it up to individual client applications to set their own metadata service timeouts in case the worst-case latency due to kube2iam wasn't tolerable. That's really just a band-aid though, and we're thinking about doing a "fork and fix" with a proper solution.
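
(For anyone trying the same mitigation, a sketch of what that looks like in the kube2iam container args; the flag names follow the backoff options in the kube2iam README and the durations are purely illustrative, so verify them against your version:)

# kube2iam args (excerpt): keep retrying the IP -> role lookup for longer
# before answering 404, so slow pod events have time to arrive.
--backoff-max-interval=5s
--backoff-max-elapsed-time=30s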

@dblackdblack

dblackdblack commented Feb 4, 2020

we wrote a shell script that runs as part of our run command:

attempt=0
exitCode=1 # assume failure until kube2iam answers 200
response=0
max_attempts=${MAX_ATTEMPTS:-30}
jitter=${JITTER:-5}

while [[ $attempt -lt $max_attempts ]] ; do
  # Jitter requests to break resource contention
  timeout=$((1 + RANDOM % jitter))

  # 169.254.169.254/latest/meta-data is reserved by AWS to access the Metadata API to retrieve IAM credentials.
  #   https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
  # This curl command is hitting the Metadata API which routes through Kube2iam via IP Tables
  #   https://github.com/jtblin/kube2iam#iptables
  # If the request returns a 200 OK, it's implied that we can communicate with Kube2iam.
  response=$(curl --write-out '%{http_code}' --silent --output /dev/null --connect-timeout 2 'http://169.254.169.254/latest/meta-data/iam/security-credentials/')

  if [[ $response == 200 ]] ; then
    exitCode=0
    break
  fi

  echo "Attempt #${attempt}. Failure connecting to Kube2iam! Retrying in ${timeout}s." 1>&2

  sleep $timeout
  attempt=$(( attempt + 1 ))
done

Once kube2iam returns 200 OK, the rest of the run command proceeds. Note that this has implications for local development, since presumably you won't have kube2iam running on developers' laptops when they run these containers.

But the real solution is to switch to kiam, which does not have this kube2iam race condition.

@Evanjt1

Evanjt1 commented Feb 4, 2020

@dblackdblack That's another route we were considering. Thanks for the example script! If kiam doesn't have this issue I will have to take a serious look at switching.

@dblackdblack

dblackdblack commented Feb 4, 2020

With kiam you have to run dedicated machines (more than one, to avoid a single point of failure) just for the kiam server component, so it's not a drop-in replacement for kube2iam; it requires more infrastructure.

@jacksonofalltrades

jacksonofalltrades commented Jan 8, 2021

@dblackdblack

(quotes the shell script and the "switch to kiam" comment above)

When you say you added this to your "run command" do you mean the final script that gets executed for your Docker container? Just want to make sure I attempt to use this the right way. Thanks!

@dblackdblack

All of our Docker images inherit from a single common base image. That base image contains a script named run-server, which is the command Kubernetes specifies as the run command when starting the container. run-server calls out to this wait-for-kube2iam script before proceeding with the rest of its normal steps.
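
(A minimal sketch of that wiring; run-server, wait-for-kube2iam, the install path, and the exec target are placeholders for whatever your base image actually uses:)

#!/bin/sh
# run-server: entrypoint baked into the common base image (names illustrative).
set -e

# Block until kube2iam answers 200 on the metadata endpoint (the script above).
/usr/local/bin/wait-for-kube2iam

# Then hand off to the real workload passed as the container command.
exec "$@"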

Jesse0Michael added a commit to leakytap/aws-cli-kube that referenced this issue Nov 22, 2022
create image from amazon/aws-cli with a run script that will ensure the aws credentials are available before executing the command.

script taken from:
jtblin/kube2iam#122 (comment)

related issue:
jtblin/kube2iam#136