Timing issue causing failures in Pods created by Jobs #136

Open
damomurf opened this issue May 14, 2018 · 12 comments

As per #46, I'm seeing similar issues where credential requests frequently fail with a 404 from kube2iam soon after a Job's pods are created. I'm able to reliably reproduce the issue with a fairly simple Job definition in a couple of different clusters running Kubernetes v0.8.7.

I've tried a number of different kube2iam versions (v0.8.2, v0.8.4, v0.10.0) and all exhibit similar behaviour.

Adding a delay at the beginning of the job's execution can help, but at times even a 5-second sleep appears not to be enough to get around the issue:

(Real IAM role arn replaced to protect the innocent)

Job definition:

apiVersion: batch/v1
kind: Job
metadata:
  name: iam-role-test
spec:
  completions: 50
  parallelism: 5
  backoffLimit: 2
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
    spec:
      restartPolicy: Never
      containers:
      - name: test
        image: governmentpaas/curl-ssl
        command:
        - sh
        - -c 
        - "curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName"

An individual failed Pod's log output appears as follows (curl -v is messing with the formatting, but you get the idea):


*   Trying 169.254.169.254...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET /latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName HTTP/1.1
> Host: 169.254.169.254
> User-Agent: curl/7.55.0
> Accept: application/json
> 
* The requested URL returned error: 404 Not Found
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
curl: (22) The requested URL returned error: 404 Not Found

Some of the 50 job executions fail (the success/failure split seems fairly random), with the curl command simply receiving a 404 Not Found as described above. Adding a sleep 5 prior to the curl command fixes the issue most of the time.
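
For reference, a minimal sketch of that workaround: the same container command from the Job definition above, with the sleep added before the curl call (the 5-second value is the one mentioned and still isn't always sufficient):

        command:
        - sh
        - -c
        # Give kube2iam time to learn about the new pod before requesting credentials.
        - "sleep 5 && curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName"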

I've enabled the --log-level=debug and --debug options; the relevant log entries mentioning the role in question are listed in this gist: https://gist.github.com/damomurf/30468bfc1bd595720cb3c9e44946bc19

Hopefully this provides sufficient detail on the issue, as requested in #46.


damomurf commented May 14, 2018

One other fact that may be helpful: the equivalent deployment (with a while loop to keep invoking the curl command and keep the pods running) does not exhibit the same behaviour:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: iam-role-test-deploy
spec:
  replicas: 50
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
      labels:
        app: iam-role-test
    spec:
      containers:
      - name: test
        image: governmentpaas/curl-ssl
        command:
        - /bin/sh
        - -c
        - "while [ true ]; do curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName; sleep 60; done"


pms1969 commented May 31, 2018

We are seeing very similar behaviour in a CronJob. Time to delve deep I guess.


roffe commented May 31, 2018

Observing the same problem here, but with pods in Deployments: in every deployment that begins by downloading something from S3 via aws-cli, the first start always fails; after the first restart the pods come up fine.

Common to all of them is a 404 in the kube2iam logs and the error:

fatal error: Unable to locate credentials

Kube2iam: 0.10.0
Kubernetes 1.9.6
Kops 1.9.0


roffe commented May 31, 2018

I've posted more info in #122


blimmer commented Jun 4, 2018

I can reproduce this issue 99% of the time with a workload that generates a lot of pods (20-ish) all at once with kube2iam annotations. If I make the pod sleep for 30 seconds before trying to use my IAM role, it seems to work around the problem.

When I only spin up a few pods at a time, I don't see the problem.


szuecs commented Jun 7, 2018

We also see similar problems for all applications that, for example, want to read from S3 at startup.
Our workaround is to use an initContainer that tries to access the resource (a sketch follows below).
The nature of a distributed system like Kubernetes is to have race conditions all over the place.
The question is how to tackle them in general. Prefetching the credentials from AWS would be one way; I am not sure what kube2iam is doing.
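
A minimal sketch of that initContainer approach, reusing the role annotation and curl image from the examples above; this is a pod template fragment, and the retry interval and application container are placeholders:

spec:
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
    spec:
      initContainers:
      # Block pod startup until kube2iam can serve credentials for the annotated role.
      - name: wait-for-credentials
        image: governmentpaas/curl-ssl
        command:
        - sh
        - -c
        - "until curl -sf http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName > /dev/null; do sleep 2; done"
      containers:
      # Hypothetical application container; replace with the real workload.
      - name: app
        image: my-app-image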


blimmer commented Jun 7, 2018

We found this article, which does a good job of explaining the problem and talks about the kiam workaround:

https://medium.com/@pingles/kiam-iterating-for-security-and-reliability-5e793ab93ec3

#132 suggests to our org that maybe we should look for alternatives. We plan on trying out kiam somewhat soon to see if it helps with this problem.


roffe commented Jun 7, 2018

I've found that postStart hooks in k8s make this problem worse with kube2iam, since the pod won't go into a Running state until the postStart hook has executed.

@mikkeloscar (Contributor)

FYI: a discussion has been started in sig-aws about finding a common solution to this problem. See https://docs.google.com/document/d/1rn-v2TNH9k4Oz-VuaueP77ANE5p-5Ua89obK2JaArfg/edit?disco=AAAAB6DI_qM&ts=5b19085a for comparisons between existing projects.


timm088 commented Oct 23, 2018

For clarity, @mikkeloscar is referring to a common solution for implementing and supporting IAM in Kubernetes on AWS (EKS or self-run), not specifically this issue around Jobs and their race condition.

@mikkeloscar (Contributor)

For clarity, @mikkeloscar is referring to a common solution for implementing and supporting IAM in Kubernetes on AWS (EKS or self-run), not specifically this issue around Jobs and their race condition.

Yup. The point of my comment back then was that it's very hard to fix these sorts of race conditions with the architecture of kube2iam. The discussion was started to approach the problem differently.

I've been working on a Proof of Concept (https://github.com/mikkeloscar/kube-aws-iam-controller) to eliminate all of these kinds of race conditions. Since the AWS SDKs handle credentials differently, it currently doesn't work for all of them. It works for Python and Java for now, and I'm working with AWS to add support for Go as well (aws/aws-sdk-go#1993).


WxFang commented Oct 8, 2019

Any update to this issue?

Jesse0Michael added a commit to leakytap/aws-cli-kube that referenced this issue Nov 22, 2022
Create an image from amazon/aws-cli with a run script that ensures the AWS credentials are available before executing the command.

script taken from:
jtblin/kube2iam#122 (comment)

related issue:
jtblin/kube2iam#136
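
The referenced commit describes wrapping the container command in a run script that waits for AWS credentials first. A rough sketch of that idea follows; this is not the actual script from the linked comment, and the AWS_ROLE variable name and retry limit are assumptions:

#!/bin/sh
# Wait for the instance-metadata endpoint (proxied by kube2iam) to serve
# credentials before running the wrapped command. AWS_ROLE and the 30-attempt
# limit are assumed values, not taken from the referenced script.
attempts=0
until curl -sf "http://169.254.169.254/latest/meta-data/iam/security-credentials/${AWS_ROLE}" > /dev/null; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 30 ]; then
    echo "timed out waiting for AWS credentials" >&2
    exit 1
  fi
  sleep 2
done
# Hand off to whatever command the container was asked to run.
exec "$@"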