
environment template is not always working #3498

Closed
mildred opened this issue Nov 3, 2017 · 14 comments

@mildred
Contributor

mildred commented Nov 3, 2017

Nomad version

Nomad v0.6.2

Operating system and Environment details

CoreOS

Issue

When a Nomad job has a task that uses the docker driver with both an env section (env { "KEY" = "value" }) and an environment template, the env section is used to populate the Docker container environment but the environment template is sometimes ignored; other times it works.

Reproduction steps

Run the Nomad job; sometimes the environment variables from the template are missing.

Job file (if appropriate)

The job file is similar to this:

job "test" {
  datacenters = ["staging-sqsc"]

  type = "service"

  constraint {
    attribute = "${meta.sqsc.environment}"
    value     = "staging"
  }

  constraint {
    attribute = "${meta.sqsc.project}"
    value     = "sqsc"
  }

  update {
    stagger = "10s"
    max_parallel = 1
  }

  group "test" {

    count = 1

    task "test" {
      driver = "docker"

      config {
        image = "alpine"
        args = [ "/bin/sh", "-c", "date; env; echo; cat" ]
      }

      resources {
        cpu = 100
        memory = 512
        network {
          mbits = 10
        }
      }

      env {
        "ENV_FROM_ENV" = "ok"
      }

      template {
        destination = "environment"
        env = true
        data = <<TEMPLATE
ENV_FROM_TEMPLATE={{ key "sample-consul-key" }}
TEMPLATE
      }

    }
  }
}

(You need to set the Consul key sample-consul-key and have a Nomad node that matches the constraints.)
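
For reference, a minimal way to exercise this from a shell (a sketch: it assumes local Consul and Nomad agents, that the job file above is saved as test.nomad, and <alloc-id> stands for the allocation ID shown by nomad status test):

$ consul kv put sample-consul-key ok
$ nomad run test.nomad
$ nomad logs <alloc-id> test

Since the task runs env, its stdout should list both ENV_FROM_ENV and ENV_FROM_TEMPLATE; in the failing case only ENV_FROM_ENV shows up.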

@chelseakomlo
Contributor

Thanks for reporting this issue. Do you have more information about when the environment template is and isn't ignored? If you have logs that are relevant to this issue, feel free to add them here.

@dadgar
Contributor

dadgar commented Nov 3, 2017

@mildred Could you please share DEBUG level logs of the client where this occurs.

@mildred
Contributor Author

mildred commented Nov 6, 2017

I have the following Nomad client logs. This occurs even though the env section of the job file is empty and all variables are set via the environment template instead.

Nomad client logs:
    2017/11/06 18:31:04.988262 [INFO] client: task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" failed: Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
    2017/11/06 18:31:04.988317 [INFO] client: Restarting task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" in 16.362625637s
    2017/11/06 18:31:04.990046 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:05.138308 [DEBUG] client: updated allocations at index 4602 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:05.138530 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:07.154924 [DEBUG] http: Request /v1/agent/servers (45.587µs)
    2017/11/06 18:31:07.156368 [DEBUG] http: Request /v1/status/peers (170.845µs)
    2017/11/06 18:31:08.768912 [ERR] http: Request /v1/node/172857b9-3b5f-bd0a-b09d-545ac3ceb6dc?index=6&wait=1m, error: node not found
    2017/11/06 18:31:08.769380 [DEBUG] http: Request /v1/node/172857b9-3b5f-bd0a-b09d-545ac3ceb6dc?index=6&wait=1m (1m2.208570212s)
    2017/11/06 18:31:09.107824 [INFO] client: task "register" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" completed successfully
    2017/11/06 18:31:09.107876 [INFO] client: Restarting task "register" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" in 5.69028103s
    2017/11/06 18:31:09.107836 [INFO] client: task "forward" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" failed: Wait returned exit code 255, signal 0, and error 
    2017/11/06 18:31:09.107968 [INFO] client: Restarting task "forward" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" in 5.69028103s
    2017/11/06 18:31:09.110549 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:09.112986 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:09.337888 [DEBUG] client: updated allocations at index 4603 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:09.337955 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:11.742120 [INFO] client: task "forward" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" failed: Wait returned exit code 255, signal 0, and error 
    2017/11/06 18:31:11.742181 [INFO] client: Restarting task "forward" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" in 5.346224508s
    2017/11/06 18:31:11.742517 [INFO] client: task "register" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" completed successfully
    2017/11/06 18:31:11.742567 [INFO] client: Restarting task "register" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" in 5.346224508s
    2017/11/06 18:31:11.745454 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:11.747612 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:11.938454 [DEBUG] client: updated allocations at index 4604 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:11.938588 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:12 [DEBUG] memberlist: TCP connection from=127.0.0.1:36034
    2017/11/06 18:31:14.819411 [DEBUG] driver.raw_exec: started process with pid: 7231
    2017/11/06 18:31:14.830486 [DEBUG] driver.raw_exec: started process with pid: 7275
    2017/11/06 18:31:14.937910 [DEBUG] client: updated allocations at index 4606 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:14.937994 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:15.137896 [DEBUG] client: updated allocations at index 4607 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:15.137968 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:15.551060 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4588&wait=1m (1m0.939825811s)
    2017/11/06 18:31:16.552631 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4588&wait=1m (673.628µs)
    2017/11/06 18:31:16.943638 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002172154-squarescale-machine-hello/allocations?index=4588&wait=1m (1m1.648431515s)
    2017/11/06 18:31:17.114875 [DEBUG] driver.raw_exec: started process with pid: 7451
    2017/11/06 18:31:17.120173 [DEBUG] driver.raw_exec: started process with pid: 7455
    2017/11/06 18:31:17.156304 [DEBUG] http: Request /v1/agent/servers (34.34µs)
    2017/11/06 18:31:17.157408 [DEBUG] http: Request /v1/status/peers (121.718µs)
    2017/11/06 18:31:17.338751 [DEBUG] client: updated allocations at index 4608 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:17.338891 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:17.554116 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4607&wait=1m (566.876µs)
    2017/11/06 18:31:17.945171 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002172154-squarescale-machine-hello/allocations?index=4588&wait=1m (601.115µs)
    2017/11/06 18:31:21.382134 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin253912072
    2017/11/06 18:31:21.382179 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin253912072]}
    2017/11/06 18:31:21.382192 [DEBUG] driver.docker: using 536870912 bytes memory for deploy
    2017/11/06 18:31:21.382200 [DEBUG] driver.docker: using 100 cpu shares for deploy
    2017/11/06 18:31:21.382237 [DEBUG] driver.docker: binding directories []string{"/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/alloc:/alloc", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/deploy/local:/local", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/deploy/secrets:/secrets", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/squarescale-deploy/terraform:/terraform"} for deploy
    2017/11/06 18:31:21.382261 [DEBUG] driver.docker: networking mode not specified; defaulting to bridge
    2017/11/06 18:31:21.382288 [DEBUG] driver.docker: setting container name to: deploy-bc4e33c0-aed2-bd19-1798-e06216f5876a
    2017/11/06 18:31:21.978340 [INFO] driver.docker: created container 8ce48f04f15f9c4c8b9015d6f0fc757f05ac1e21229d1422f0e54e6882004c6a
    2017/11/06 18:31:22.502623 [INFO] driver.docker: started container 8ce48f04f15f9c4c8b9015d6f0fc757f05ac1e21229d1422f0e54e6882004c6a
    2017/11/06 18:31:22.543870 [WARN] client: error fetching stats of task deploy: stats collection hasn't started yet
    2017/11/06 18:31:22 [DEBUG] memberlist: TCP connection from=127.0.0.1:36054
    2017/11/06 18:31:22.737883 [DEBUG] client: updated allocations at index 4609 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:22.737986 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:23.090949 [INFO] client: task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" failed: Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
 

The task named deploy with alloc bc4e33c0 is the Nomad job with the missing environment variables. At the moment it keeps restarting: the AWS_REGION environment variable is missing, so the Docker container exits with status 1.

nomad inspect deploy shows:

JSON job (nomad inspect):
{
    "Job": {
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${meta.sqsc.environment}",
                "Operand": "=",
                "RTarget": "dev"
            },
            {
                "LTarget": "${meta.sqsc.project}",
                "Operand": "=",
                "RTarget": "sqsc"
            }
        ],
        "CreateIndex": 42,
        "Datacenters": [
            "dev-sqsc"
        ],
        "ID": "deploy",
        "JobModifyIndex": 42,
        "Meta": null,
        "ModifyIndex": 92,
        "Name": "deploy",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "global",
        "Stable": true,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1509728475312274239,
        "TaskGroups": [
            {
                "Constraints": null,
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Name": "deploy",
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 60000000000,
                    "Mode": "delay"
                },
                "Tasks": [
                    {
                        "Artifacts": null,
                        "Config": {
                            "force_pull": false,
                            "image": "095348363195.dkr.ecr.eu-west-1.amazonaws.com/squarescale-deploy:dev-latest",
                            "volumes": [
                                "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/squarescale-deploy/terraform:/terraform"
                            ],
                            "dns_servers": [
                                "172.17.0.1"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "deploy",
                        "Resources": {
                            "CPU": 100,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 512,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "Device": "",
                                    "DynamicPorts": null,
                                    "IP": "",
                                    "MBits": 10,
                                    "ReservedPorts": null
                                }
                            ]
                        },
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeSignal": "",
                                "DestPath": "environment",
                                "EmbeddedTmpl": "\n\nSQSC_ENVIRONMENT=\"staging\"\nSQSC_TFSTATE_BUCKET=\"sqsc-tfstate-staging\"\n\n\nNOMAD_VERSION=\"{{                key \"squarescale/dev/nomad_version\" }}\"\nAWS_REGION=\"{{                   key \"squarescale/dev/aws/region\" }}\"\nAWS_ACCESS_KEY_ID=\"{{            key \"squarescale/dev/aws/access_key_id\" }}\"\nAWS_SECRET_ACCESS_KEY=\"{{        key \"squarescale/dev/aws/secret_access_key\" }}\"\nSQSC_TERRAFORM_SQS_INPUT_URL=\"{{ key \"squarescale/dev/deploy_sqs\" }}\"\n",
                                "Envvars": true,
                                "LeftDelim": "{{",
                                "Perms": "0644",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "VaultGrace": 300000000000
                            },
                            {
                                "ChangeMode": "restart",
                                "ChangeSignal": "",
                                "DestPath": "restart",
                                "EmbeddedTmpl": "          {{keyOrDefault (printf \"restart/%s\" (env \"NOMAD_JOB_NAME\")) \"\"}}\n          {{keyOrDefault (printf \"restart/%s/%s\" (env \"NOMAD_JOB_NAME\") (env \"NOMAD_TASK_NAME\")) \"\"}}\n          {{keyOrDefault (printf \"restart/%s/%s/%s\" (env \"NOMAD_JOB_NAME\") (env \"NOMAD_TASK_NAME\") (env \"NOMAD_ALLOC_INDEX\")) \"\"}}\n          {{keyOrDefault (printf \"restart/%s\" (env \"NOMAD_ALLOC_NAME\")) \"\"}}\n",
                                "Envvars": false,
                                "LeftDelim": "{{",
                                "Perms": "0644",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "VaultGrace": 300000000000
                            }
                        ],
                        "User": "",
                        "Vault": null
                    }
                ],
                "Update": {
                    "AutoRevert": false,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "Stagger": 10000000000
                }
            }
        ],
        "Type": "service",
        "Update": {
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 1,
            "MinHealthyTime": 0,
            "Stagger": 10000000000
        },
        "VaultToken": "",
        "Version": 0
    }
}
 

The Consul keys are available:

$ bin/consul-0.8.3 kv get squarescale/dev/aws/region
eu-west-1
$ bin/consul-0.8.3 kv get squarescale/dev/nomad_version
0.6.2
$ bin/consul-0.8.3 kv get squarescale/dev/aws/access_key_id
xxxxxxxxxxxxxxxxxxxxx
$ bin/consul-0.8.3 kv get squarescale/dev/aws/secret_access_key
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
$ bin/consul-0.8.3 kv get squarescale/dev/deploy_sqs
https://sqs.eu-west-1.amazonaws.com/xxxxxxxxxxxxxxxxxx/terraform-deploy-shanti

The environment template file is being written to disk correctly:

$ nomad fs bc4e33c0 deploy/environment


SQSC_ENVIRONMENT="staging"
SQSC_TFSTATE_BUCKET="sqsc-tfstate-staging"


NOMAD_VERSION="0.6.2"
AWS_REGION="eu-west-1"
AWS_ACCESS_KEY_ID="xxxxxxxxxxx"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
SQSC_TERRAFORM_SQS_INPUT_URL="https://sqs.eu-west-1.amazonaws.com/xxxxxx/terraform-deploy-shanti"

deploy/executor.out shows only repeated lines like this one:

2017/11/06 18:45:03.820637 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin198853221

@mildred changed the title from "environment template is ignored, sometimes, if task has an environment section" to "environment template is not always working" on Nov 7, 2017
@burdandrei
Contributor

burdandrei commented Nov 8, 2017

I just ran into the same problem on 0.7.
I have a job with an env stanza and two templates: one for things from Consul and one for secrets.
I can see the secrets in the rendered file, but not in the environment variables.

@burdandrei
Contributor

@dadgar I have the reproduction steps:
the problem occurs when there are two (or more) templates with env = true.
After one of them gets re-rendered, Nomad restarts the task with the environment variables from only that template.

Here is the job definition:

job "envtest" {
  datacenters = ["staging-us-east-1"]
  region      = "staging-us-east-1"
  type        = "service"

  update {
    stagger      = "10s"
    max_parallel = 1
  }

  group "envtest" {
    count = 1

    task "envtest" {
      driver = "exec"

      constraint {
        attribute = "${attr.kernel.name}"
        value     = "linux"
      }

     
      config {
        command = "sleep"
        args    = [
          "600"
        ]
      }

      template {
        data = <<EOH
# Lines starting with a # are ignored

# Empty lines are also ignored
VLAD="FROG"
EOH

        destination = "local/0.env"
        env         = true
      }

      template {
        data = <<EOH
test1="{{key "test1"}}"
EOH

        destination = "local/1.env"
        env         = true
      }

      template {
        data = <<EOH
test2="{{key "test2"}}"
EOH

        destination = "local/2.env"
        env         = true
      }

      resources {
        cpu    = 20
        memory = 10

        network {
          mbits = 10
        }
      }
    }
  }
}

Now when you first start the job, you receive both the test1 and test2 environment variables.
If you then update the test1 key in Consul, you'll receive only the test1 env var; update test2 and you'll receive only that one.
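
For reference, the re-render can be driven from a shell like this (a sketch: it assumes the job above is saved as envtest.nomad, with the sleep command swapped for something that prints its environment so the variables can actually be observed):

$ consul kv put test1 one
$ consul kv put test2 two
$ nomad run envtest.nomad
# first start: both test1 and test2 are present
$ consul kv put test1 one-updated
# after the template-driven restart: only test1 is present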

@burdandrei
Contributor

If I understand this right, this is happening because of this: https://github.com/hashicorp/nomad/blob/master/client/consul_template.go#L397

The question is what the right solution is: disallow more than one template that injects environment variables, or build the logic to merge them?
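
For what it's worth, the merge option boils down to shell-style semantics over the rendered files: concatenate every env template in order, and let the last assignment win on duplicate keys. A conceptual sketch using the files from the job above (this is not how Nomad implements it):

$ cat local/0.env local/1.env local/2.env > merged.env
$ set -a; . ./merged.env; set +a
$ env | grep -E 'VLAD|test1|test2'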

@mildred
Contributor Author

mildred commented Nov 9, 2017

See the JSON Nomad job in #3498 (comment): it has two templates, but only one of them is an environment template. Still, it doesn't always work.

About the two-environment-templates issue, I don't see a problem with having Nomad restrict jobs to a single environment template. But if it is not difficult, implementing a merge mechanism might be a better solution (no edge case to check when accepting jobs).

@burdandrei
Contributor

This is even more fun, and awful!

@dadgar
Contributor

dadgar commented Nov 10, 2017

@mildred I was never able to reproduce with either of the job files you gave. I was able to reproduce with the description @burdandrei gave.

I have attached a Linux AMD64 build that has the fix. I would love it if you could test it and tell me whether it fixes your issue too.

nomad.zip

@burdandrei
Contributor

Thanks for the super fast fix, @dadgar!

@mildred
Contributor Author

mildred commented Nov 13, 2017

I noticed the problem occurred mostly when I changed a Consul key under restart/deploy, which triggers a change in the restart template (not an environment template); the job then restarts with missing environment variables. Stopping the job completely (nomad stop) and restarting it (nomad run) reran the job correctly in every case I observed.
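
For reference, the trigger is as simple as touching the key that the (non-env) restart template watches (a sketch: the value is arbitrary, only the change matters):

$ consul kv put restart/deploy trigger-$(date +%s)

After the resulting restart, the variables from the environment template are gone.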

I'll check this version and report back if I still have issues.

@mildred
Contributor Author

mildred commented Nov 13, 2017

@dadgar It seems to work well. The cases where this bug happened no longer trigger it. Thank you.

@dadgar
Contributor

dadgar commented Nov 13, 2017

@burdandrei @mildred Thank you for testing and for the super fast feedback!

@github-actions

github-actions bot commented Dec 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2022