
envoyproxy/envoy:v${NOMAD_envoy_version} error="API error (400): invalid tag format" #9887

Closed
bert2002 opened this issue Jan 26, 2021 · 32 comments · Fixed by #16815

@bert2002
Contributor

Nomad version

Nomad v1.0.2 (4c1d4fc6a5823ebc8c3e748daec7b4fda3f11037)

Operating system and Environment details

Issue

After initiating a restart of an allocation, Nomad can no longer find the Envoy sidecar image. I did a couple of restarts earlier today (same config) without running into this problem, and Consul does not report any issues.

Nomad Server logs (if appropriate)

Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.751Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=5fa76958-04fe-e6b4-ce6f-9e4647fee2b8 task=connect-proxy-app-backend reason="Restart within policy" delay=16.074283377s
Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.840Z [ERROR] client.driver_mgr.docker: failed pulling container: driver=docker image_ref=envoyproxy/envoy:v${NOMAD_envoy_version} error="API error (400): invalid tag format"
Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.842Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=c141dfaa-72c5-e589-1f7d-91628e8e501c task=connect-proxy-app-backend-service error="Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format"

Any way to find out what NOMAD_envoy_version is set to?

Cheers,
bert2002

@tgross
Member

tgross commented Jan 26, 2021

Hi @bert2002, from the error message I think what's happening is that it's not getting interpolated at all, as that "API error" should be bubbling up from the driver. Can you share the jobspec? It might help us figure out what's going on there.

@bert2002
Contributor Author

My jobspec has quite a lot of groups, so I will share a limited one.

example.nomad.txt

I had to drain the node; the container then started on another node without any problem. Now, on the same node (where the problem was), containers are working again without any issues.

The only thing I can imagine is a runtime error, or that an external service (Docker Hub, etc.) was not reachable.

Any other ideas?

@tgross
Member

tgross commented Jan 27, 2021

@bert2002 this is the jobspec you shared, and I don't see any setting for the Envoy proxy image. You saw that error without trying to set the image via interpolation?

job "app1" {

  datacenters = ["staging"]
  type        = "service"

  reschedule {
    delay          = "10s"
    delay_function = "exponential"
    max_delay      = "120s"
    unlimited      = true
  }

  #
  # collectd
  #
  group "collectd" {
    count = 1

    network {
      mode = "bridge"
    }

    restart {
      interval = "2m"
      attempts = 8
      delay    = "15s"
      mode     = "delay"
    }

    service {
      name = "collectd"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "redis"
              local_bind_port  = 6379
            }
          }
        }
      }
    }

    task "collectd" {
      driver = "docker"
      leader = true

      config {
        image = var.docker_image_collectd

        mounts = [
          {
            type   = "bind"
            target = "/etc/collectd/collectd.conf"
            source = "/opt/app/collectd/collectd.conf"
          },
          {
            type   = "bind"
            target = "/etc/collectd/collectd.conf.d"
            source = "/opt/app/collectd/collectd.conf.d/"
          },
          {
            type   = "bind"
            target = "/usr/local/lib/collectd"
            source = "/opt/app/collectd/plugins/"
          }
        ]
      }
    }

    task "filebeat" {
      driver = "docker"

      resources {
        memory = 100
        cpu    = 50
      }

      config {
        image = var.docker_image_filebeat

        mounts = [
          {
            type   = "bind"
            target = "/usr/share/filebeat/filebeat.yml"
            source = "/opt/app/filebeat/filebeat.yml"
          }
        ]
      }
    }

  }
}

@lukas-w
Contributor

lukas-w commented Jan 30, 2021

You saw that error without trying to set the image via interpolation?

I ran into that same error today without setting the image (my jobspec is very similar), though unfortunately I can't reproduce it either.

@bert2002
Contributor Author

bert2002 commented Feb 3, 2021

You saw that error without trying to set the image via interpolation?

Yes, that is correct, and it just happened again (on the same node). I drained it again, and it works fine on the other nodes. Is there any more information or logs I can provide?

@bert2002
Contributor Author

bert2002 commented Feb 5, 2021

It just happened again after I tried to restart an alloc (on a different node).

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@bert2002
Contributor Author

I updated to Nomad 1.0.3 and Consul 1.9.3, but unfortunately it is happening again, especially when restarting an allocation manually.

Feb 25, '21 16:22:05 +0800	Driver Failure	Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format

@bert2002
Contributor Author

@tgross is it possible to set NOMAD_envoy_version myself to a specific version?

@empikls

empikls commented Mar 23, 2021

Hey!

I have the same problem. Could you please provide any information on how this can be solved?

@bert2002
Contributor Author

@empikls as a workaround I am using a fixed version. In nomad.hcl I set this meta information:

client {
  enabled = true
  meta {
    "connect.sidecar_image" = "envoyproxy/envoy:v1.16.0@sha256:9e72bbba48041223ccf79ba81754b1bd84a67c6a1db8a9dbff77ea6fc1cb04ea"
  }
}
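
If you would rather pin the image per job instead of per client, I believe the same can be done in the jobspec via the sidecar_task block. This is just a rough sketch; the version tag is an example, so match it to your Consul version:

connect {
  sidecar_service {}

  # Pin the Envoy image for this sidecar only, so the default
  # ${NOMAD_envoy_version} interpolation is never needed.
  sidecar_task {
    config {
      image = "envoyproxy/envoy:v1.16.0"
    }
  }
}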

@empikls

empikls commented Mar 24, 2021

Thanks a lot @bert2002!
I will try that the next time I face the problem.

@gregory112

Had the same problem too.
This bug seems flaky. My deployment worked before (apart from some other problems) and I had never encountered this, until today, with the exact same deployment. Seems like a race condition or something.

2021-07-20T11:22:32Z  Driver            Downloading image
2021-07-20T11:22:37Z  Not Restarting    Exceeded allowed attempts 2 in interval 10m0s and mode is "fail"
2021-07-20T11:22:37Z  Driver Failure    Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-20T11:22:18Z  Driver            Downloading image
2021-07-20T11:22:19Z  Restarting        Task restarting in 10.268449356s
2021-07-20T11:22:18Z  Driver Failure    Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-20T11:22:04Z  Restarting        Task restarting in 11.565177332s

@cgthayer

Same issue for me. It's a test job, so I just re-run it and it starts working again.

Recent Events:
Time                       Type            Description
2021-07-22T17:30:01-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:30:01-07:00  Driver          Downloading image
2021-07-22T17:29:50-07:00  Restarting      Task restarting in 10.714078758s
2021-07-22T17:29:50-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:29:50-07:00  Driver          Downloading image
2021-07-22T17:29:38-07:00  Restarting      Task restarting in 12.196588284s
2021-07-22T17:29:38-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:29:38-07:00  Driver          Downloading image
2021-07-22T17:29:27-07:00  Restarting      Task restarting in 10.856396313s
2021-07-22T17:29:27-07:00  Terminated      Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"

@mr-karan
Contributor

+1 facing the same issue.

@lgfa29
Contributor

lgfa29 commented Oct 15, 2021

@mr-karan (and others facing the issue), would you mind clicking the 👍 in the issue so we can better track common problems?

This issue does seem to be a bit unpredictable, so we don't have an update yet. The workaround from @bert2002 is the best option right now. You can check which Envoy version to use based on your Consul version in the docs.
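
If it helps, you can also ask the local Consul agent directly which Envoy versions it advertises, which should be the same list Nomad uses when resolving ${NOMAD_envoy_version}. Roughly, assuming a local agent on the default port, a Consul version recent enough to report its supported proxies, and jq installed:

# List the Envoy versions the local Consul agent advertises
curl -s http://127.0.0.1:8500/v1/agent/self | jq '.xDS.SupportedProxies.envoy'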

@legege
Contributor

legege commented Oct 25, 2021

I can easily reproduce this by simply restarting the sidecar task via the UI.

@meaty-popsicle

@legege Can confirm, a restart via the UI causes the error.

@mattolson

+1 I am experiencing this as well

@xeroc

xeroc commented Mar 1, 2022

+1, got the same error. It came out of the blue.

@meaty-popsicle

This issue is still present in v1.3.1. After a restart via the UI, the job fails due to an error while pulling the Envoy image.

@vvarga007

What is not clear to me is where the ${NOMAD_envoy_version} variable is set. The documentation says it comes from a Consul query:
"The official upstream Envoy Docker image, where ${NOMAD_envoy_version} is resolved automatically by a query to Consul."
Is this something that I can control via Consul, then?

@devyn

devyn commented Jul 29, 2022

This is happening to me very, very rarely. It's not a big deal because eventually the alloc gets replaced, but it's weird to see it every few days.

@NOBLES5E
Contributor

NOBLES5E commented Dec 7, 2022

Still happening in v1.4.3

@grzybniak

Yup, I found the same issue in 1.4.3.
After I moved the task to a different node, Nomad updated Envoy.

@VladimirZD

It is still happening, and more often than in the last week or two.
Is there any news regarding this?

Thanks

@hiddewie

This happened in v1.4.3 when an allocation was manually scaled from count = 2 to count = 3. The start of the Envoy sidecar failed 3 times (with delay), and then Nomad finally reallocated the job, which succeeded.

This issue causes service interruption because Nomad scaling actions make existing allocations restart without reason. Those restarts also fail, and introduce delays before retrying and reallocating the allocation.

@seanamos

seanamos commented Feb 9, 2023

The more jobs we add, the more frequently we see this. Initially it was maybe once a month, but this is now causing service interruptions and downtime on a weekly basis.

@ovelascoh

+1

We recently experienced this problem when one of our applications running in prod crashed, and we received the error when Nomad was trying to restart the task. We had to issue a restart, and the application worked fine after that. Strangely, we cannot reproduce the error by simply restarting the sidecar task via the UI. The versions we are running are Nomad v1.3.5 and Consul v1.12.4.

@exFalso

exFalso commented Feb 21, 2023

Same issue. For us it happened when manually restarting an alloc with a Connect proxy sidecar.

@ivantopo

ivantopo commented Apr 3, 2023

One more occurrence of this; same as in the previous message, it happened when I manually restarted an allocation from the Nomad UI. Using Nomad 1.4.3.

@jrasell
Member

jrasell commented Apr 6, 2023

Hi everyone, I am currently taking a time-boxed look into this. If anyone has logs from the client where an allocation and task gets into this problem, could you please share them? This would really help dive into this bug, which I have been unable to reproduce so far.

@jrasell
Member

jrasell commented Apr 6, 2023

Hi everyone, I have been able to reproduce this locally and should have a proposed fix ready at some point soon. I will post my reproduction steps below for other readers and testers. Linux is required.

  1. Run a Consul agent via consul agent -dev
  2. Run a Nomad agent via sudo nomad agent -dev-connect
  3. Perform an initial registration of the jobspec detailed below via nomad run <job_file>
  4. Modify the dashboard group in a non-destructive way; I have been changing the restart.attempts value
  5. Perform a job deployment of the new version via nomad run <job_file>
  6. Restart the dashboard allocation via the UI or via nomad alloc restart <alloc_id>
  7. Check the alloc and task events and see the error

JobSpec:

job "countdash" {
  datacenters = ["dc1"]

  group "api" {
    restart {
      attempts = 3
      delay    = "30s"
    }
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
    }

    task "web" {
      resources {
        cpu    = 50
        memory = 52
      }
      driver = "docker"

      config {
        image = "jrasell/counter-api:16616"
      }
    }
  }

  group "dashboard" {
    restart {
      attempts = 3
      delay    = "30s"
    }
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "http"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "dashboard" {
      resources {
        cpu    = 50
        memory = 52
      }
      driver = "docker"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "jrasell/counter-dashboard:16616"
      }
    }
  }
}
