
envoyproxy/envoy:v${NOMAD_envoy_version} error="API error (400): invalid tag format" #9887

Closed
bert2002 opened this issue Jan 26, 2021 · 32 comments · Fixed by #16815

@bert2002
Contributor

Nomad version

Nomad v1.0.2 (4c1d4fc6a5823ebc8c3e748daec7b4fda3f11037)

Operating system and Environment details

Issue

After initiating a restart of an allocation, Nomad can no longer find the Envoy sidecar image. I did a couple of restarts earlier today (same config) without running into this problem, and Consul does not report any issues.

Nomad Server logs (if appropriate)

Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.751Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=5fa76958-04fe-e6b4-ce6f-9e4647fee2b8 task=connect-proxy-app-backend reason="Restart within policy" delay=16.074283377s
Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.840Z [ERROR] client.driver_mgr.docker: failed pulling container: driver=docker image_ref=envoyproxy/envoy:v${NOMAD_envoy_version} error="API error (400): invalid tag format"
Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.842Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=c141dfaa-72c5-e589-1f7d-91628e8e501c task=connect-proxy-app-backend-service error="Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format"

Any way to find out what NOMAD_envoy_version is set to?

Cheers,
bert2002

@tgross
Member

tgross commented Jan 26, 2021

Hi @bert2002, from the error message I think what's happening is that it's not getting interpolated at all, as that "API error" should be bubbling up from the driver. Can you share the jobspec? It might help us figure out what's going on there.

@bert2002
Contributor Author

My jobspec has quite a lot of groups, so I will share a limited one.

example.nomad.txt

I had to drain the node; the container then started on another node without any problem. Now, on the same node (where the problem was), containers are working again without any issues.

The only thing I can imagine is a runtime error, or that an external service (Docker Hub, etc.) was not reachable.

Any other ideas?

@tgross
Member

tgross commented Jan 27, 2021

@bert2002 this is the jobspec you shared, and I don't see any setting for the Envoy proxy image. You saw that error without trying to set the image via interpolation?

job "app1" {

  datacenters = ["staging"]
  type        = "service"

  reschedule {
    delay          = "10s"
    delay_function = "exponential"
    max_delay      = "120s"
    unlimited      = true
  }

  #
  # collectd
  #
  group "collectd" {
    count = 1

    network {
      mode = "bridge"
    }

    restart {
      interval = "2m"
      attempts = 8
      delay    = "15s"
      mode     = "delay"
    }

    service {
      name = "collectd"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "redis"
              local_bind_port  = 6379
            }
          }
        }
      }
    }

    task "collectd" {
      driver = "docker"
      leader = true

      config {
        image = var.docker_image_collectd

        mounts = [
          {
            type   = "bind"
            target = "/etc/collectd/collectd.conf"
            source = "/opt/app/collectd/collectd.conf"
          },
          {
            type   = "bind"
            target = "/etc/collectd/collectd.conf.d"
            source = "/opt/app/collectd/collectd.conf.d/"
          },
          {
            type   = "bind"
            target = "/usr/local/lib/collectd"
            source = "/opt/app/collectd/plugins/"
          }
        ]
      }
    }

    task "filebeat" {
      driver = "docker"

      resources {
        memory = 100
        cpu    = 50
      }

      config {
        image = var.docker_image_filebeat

        mounts = [
          {
            type   = "bind"
            target = "/usr/share/filebeat/filebeat.yml"
            source = "/opt/app/filebeat/filebeat.yml"
          }
        ]
      }
    }

  }
}

@lukas-w
Contributor

lukas-w commented Jan 30, 2021

You saw that error without trying to set the image via interpolation?

I ran into that same error today without setting the image (my jobspec is very similar), though unfortunately I can't reproduce it either.

@bert2002
Contributor Author

bert2002 commented Feb 3, 2021

You saw that error without trying to set the image via interpolation?

Yes, that is correct, and it just happened again (on the same node). I drained it again, and it works fine on the other nodes. Is there any more information or logs I can provide?

@bert2002
Contributor Author

bert2002 commented Feb 5, 2021

It just happened again after I tried to restart an alloc (on a different node).

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@bert2002
Contributor Author

I updated to Nomad 1.0.3 and Consul 1.9.3, but unfortunately it is happening again, especially when restarting an allocation manually.

Feb 25, '21 16:22:05 +0800	Driver Failure	Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format

@bert2002
Contributor Author

@tgross is it possible to set NOMAD_envoy_version myself to a specific version?

@empikls

empikls commented Mar 23, 2021

Hey!

I have the same problem. Could you please provide any information on how this can be solved?

@bert2002
Contributor Author

@empikls as a workaround I am using a fixed version. In nomad.hcl I set this meta information:

client {
  enabled = true
  meta {
    "connect.sidecar_image" = "envoyproxy/envoy:v1.16.0@sha256:9e72bbba48041223ccf79ba81754b1bd84a67c6a1db8a9dbff77ea6fc1cb04ea"
  }
}
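
If you would rather pin the image per job instead of per client, I believe the same can be done in the jobspec via the sidecar_task block. This is just a rough sketch; the version tag is an example, so match it to your Consul version:

connect {
  sidecar_service {}

  # Pin the Envoy image for this sidecar only, so the default
  # ${NOMAD_envoy_version} interpolation is never needed.
  sidecar_task {
    config {
      image = "envoyproxy/envoy:v1.16.0"
    }
  }
}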

@empikls

empikls commented Mar 24, 2021

Thanks a lot @bert2002!
I will try that the next time I face the problem.

@gregory112

Had the same problem too.
This bug seems flaky. My deployment worked before (apart from some other problems) and I had never encountered this, until today, with the exact same deployment. Seems like a race condition or something.

2021-07-20T11:22:32Z  Driver            Downloading image
2021-07-20T11:22:37Z  Not Restarting    Exceeded allowed attempts 2 in interval 10m0s and mode is "fail"
2021-07-20T11:22:37Z  Driver Failure    Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-20T11:22:18Z  Driver            Downloading image
2021-07-20T11:22:19Z  Restarting        Task restarting in 10.268449356s
2021-07-20T11:22:18Z  Driver Failure    Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-20T11:22:04Z  Restarting        Task restarting in 11.565177332s

@cgthayer

Same issue for me. It's a test job, so I just re-run it and it starts working again.

Recent Events:
Time                       Type            Description
2021-07-22T17:30:01-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:30:01-07:00  Driver          Downloading image
2021-07-22T17:29:50-07:00  Restarting      Task restarting in 10.714078758s
2021-07-22T17:29:50-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:29:50-07:00  Driver          Downloading image
2021-07-22T17:29:38-07:00  Restarting      Task restarting in 12.196588284s
2021-07-22T17:29:38-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:29:38-07:00  Driver          Downloading image
2021-07-22T17:29:27-07:00  Restarting      Task restarting in 10.856396313s
2021-07-22T17:29:27-07:00  Terminated      Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"

@mr-karan
Contributor

+1 facing the same issue.

@lgfa29
Contributor

lgfa29 commented Oct 15, 2021

@mr-karan (and others facing the issue), would you mind clicking the 👍 in the issue so we can better track common problems?

This issue does seem to be a bit unpredictable, so we don't have an update yet. The workaround from @bert2002 is the best option right now. You can check which Envoy version to use based on your Consul version in the docs.
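
If it helps, you can also ask the local Consul agent directly which Envoy versions it advertises, which should be the same list Nomad uses when resolving ${NOMAD_envoy_version}. Roughly, assuming a local agent on the default port, a Consul version recent enough to report its supported proxies, and jq installed:

# List the Envoy versions the local Consul agent advertises
curl -s http://127.0.0.1:8500/v1/agent/self | jq '.xDS.SupportedProxies.envoy'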

@legege
Contributor

legege commented Oct 25, 2021

I can easily reproduce this by simply restarting the sidecar task via the UI.

@meaty-popsicle

@legege Can confirm, a restart via the UI causes the error.

@mattolson

+1 I am experiencing this as well

@xeroc

xeroc commented Mar 1, 2022

+1, got the same error. It came out of the blue.

@meaty-popsicle

This issue is still present in v1.3.1. After a restart via the UI, the job fails due to an error while pulling the Envoy image.

@vvarga007

What is not clear to me is where the ${NOMAD_envoy_version} variable is set. The documentation says it comes from a Consul query:
"The official upstream Envoy Docker image, where ${NOMAD_envoy_version} is resolved automatically by a query to Consul."
Is this something that I can control via Consul, then?

@devyn

devyn commented Jul 29, 2022

This is happening to me very, very rarely. It's not a big deal because eventually the alloc gets replaced, but it's weird to see it every few days.

@NOBLES5E
Contributor

NOBLES5E commented Dec 7, 2022

Still happening in v1.4.3

@grzybniak

Yup, I found the same issue in 1.4.3.
After I moved the task to a different node, Nomad updated Envoy.

@VladimirZD

It is still happening, and more often than in the last week or two.
Is there any news regarding this?

Thanks

@hiddewie

This happened in v1.4.3 when an allocation was manually scaled from count = 2 to count = 3. The start of the Envoy sidecar failed 3 times (with delay), and then Nomad finally reallocated the job, which succeeded.

This issue causes service interruption because Nomad scaling actions make existing allocations restart without reason. Those restarts also fail, and introduce delays before retrying and reallocating the allocation.

@seanamos

seanamos commented Feb 9, 2023

The more jobs we add, the more frequently we see this. Initially it was maybe once a month, but this is now causing service interruptions and downtime on a weekly basis.

@ovelascoh

+1

We recently experienced this problem when one of our applications running in prod crashed, and we received the error when Nomad was trying to restart the task. We had to issue a restart, and the application worked fine after that. Strangely, we cannot reproduce the error by simply restarting the sidecar task via the UI. The versions we are running are Nomad v1.3.5 and Consul v1.12.4.

@exFalso

exFalso commented Feb 21, 2023

Same issue. For us it happened when manually restarting an alloc with a Connect proxy sidecar.

@ivantopo

ivantopo commented Apr 3, 2023

One more occurrence of this; same as in the previous message, it happened when I manually restarted an allocation from the Nomad UI. Using Nomad 1.4.3.

@jrasell
Member

jrasell commented Apr 6, 2023

Hi everyone, I am currently taking a time-boxed look into this. If anyone has logs from the client where an allocation and task gets into this problem, could you please share them? This would really help dive into this bug, which I have been unable to reproduce so far.

@jrasell
Member

jrasell commented Apr 6, 2023

Hi everyone, I have been able to reproduce this locally and should have a proposed fix ready at some point soon. I will post my reproduction steps below for other readers and testers. Linux is required.

  1. Run a Consul agent via consul agent -dev
  2. Run a Nomad agent via sudo nomad agent -dev-connect
  3. Perform an initial registration of the jobspec detailed below via nomad run <job_file>
  4. Modify the dashboard group in a non-destructive way; I have been changing the restart.attempts value
  5. Perform a job deployment of the new version via nomad run <job_file>
  6. Restart the dashboard allocation via the UI or via nomad alloc restart <alloc_id>
  7. Check the alloc and task events and see the error

JobSpec:

job "countdash" {
  datacenters = ["dc1"]

  group "api" {
    restart {
      attempts = 3
      delay    = "30s"
    }
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
    }

    task "web" {
      resources {
        cpu    = 50
        memory = 52
      }
      driver = "docker"

      config {
        image = "jrasell/counter-api:16616"
      }
    }
  }

  group "dashboard" {
    restart {
      attempts = 3
      delay    = "30s"
    }
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "http"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "dashboard" {
      resources {
        cpu    = 50
        memory = 52
      }
      driver = "docker"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "jrasell/counter-dashboard:16616"
      }
    }
  }
}
