duplicate allocation is not allowed on every (re)boot #15763

Closed
suikast42 opened this issue Jan 12, 2023 · 20 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/networking type/bug

Comments

@suikast42
Contributor

suikast42 commented Jan 12, 2023

Nomad version

1.4.3

Operating system and Environment details

Ubuntu 22.04

Issue

On every (re)boot, the first allocation fails with "duplicate allocation is not allowed"

Reproduction steps

Reboot the running client.

Expected Result

The allocation should be in the running state.

Actual Result

The allocation first goes through the states run -> failed -> run.

Job file (if appropriate)

job "security" {
  type = "service"
  datacenters = ["nomadder1"]

    reschedule {
      delay          = "10s"
      delay_function = "constant"
      unlimited      = true
    }

  group "keycloak-db" {
     restart {
       attempts = -1
       interval = "5s"
       delay = "5s"
       mode = "delay"
     }
    volume "security_postgres_volume" {
      type      = "host"
      source    = "security_postgres_volume"
      read_only = false
    }

    count = 1
    network {
      mode = "bridge"
      port "db" {
        to = 5432
      }
    }

    service {
      name = "security-postgres"
      port = "5432"
      connect {
        sidecar_service {}
      }
     check {
        name     = "security_postgres_ping"
        type     = "script"
        command  = "pg_isready"
        task     = "security_postgres"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "security_postgres" {
      volume_mount {
        volume      = "security_postgres_volume"
        destination = "/var/lib/postgresql/data/pgdata"
      }
      driver = "docker"
      env {
        POSTGRES_USER        = "keycloak"
        POSTGRES_DB          = "keycloak"
        PGDATA               = "/var/lib/postgresql/data/pgdata"
        POSTGRES_INITDB_ARGS = "--encoding=UTF8"
      }
      config {
        image = "registry.cloud.private/postgres:14.5"
        volumes = [
          "local/initddb.sql:/docker-entrypoint-initdb.d/initddb.sql"
        ]
        ports = ["db"]
      }
      resources {
        cpu    = 1000
        memory = 2048
      }
      template {
        data = <<EOF
           CREATE SCHEMA IF NOT EXISTS keycloak;
         EOF
        destination = "local/initddb.sql"
      }
      template {
              destination = "${NOMAD_SECRETS_DIR}/env.vars"
              env         = true
              change_mode = "restart"
              data        = <<EOF
      {{- with nomadVar "nomad/jobs/security" -}}
        POSTGRES_PASSWORD    = {{.keycloak_db_password}}
      {{- end -}}
      EOF
           }
    }
  }

  group "keycloak-ingress" {
     restart {
       attempts = -1
       interval = "5s"
       delay = "5s"
       mode = "delay"
     }
    volume "ca_cert" {
      type      = "host"
      source    = "ca_cert"
      read_only = true
    }
    count = 1
    network {
      mode = "bridge"
      port "auth" {
        to = 4181
      }
    }
    service {
      name = "forwardauth"
      port = "auth"
      tags = [
        "traefik.enable=true",
        "traefik.http.routers.forwardauth.entrypoints=https",
        "traefik.http.routers.forwardauth.rule= Path(`/_oauth`)",
        "traefik.http.routers.forwardauth.middlewares=traefik-forward-auth",
        "traefik.http.routers.traefik-forward-auth.tls=true",
        "traefik.http.middlewares.traefik-forward-auth.forwardauth.address=http://forwardauth.service.consul:${NOMAD_HOST_PORT_auth}",
        "traefik.http.middlewares.traefik-forward-auth.forwardauth.authResponseHeaders= X-Forwarded-User",
        "traefik.http.middlewares.traefik-forward-auth.forwardauth.authResponseHeadersRegex= ^X-",
        "traefik.http.middlewares.traefik-forward-auth.forwardauth.trustForwardHeader=true",
      #  "traefik.http.middlewares.test-auth.forwardauth.tls.insecureSkipVerify=true"
      ]


    }
      task "await-for-keycloak" {
        driver = "docker"

        config {
          image        = "registry.cloud.private/busybox:1.28"
          command      = "sh"
          args         = ["-c", "echo -n 'Waiting for service keycloak'; until nslookup keycloak.service.consul 2>&1 >/dev/null; do echo '.'; sleep 2; done"]
          #network_mode = "host"
        }

        resources {
          cpu    = 200
          memory = 128
        }

        lifecycle {
          hook    = "prestart"
          sidecar = false
        }
      }
    task "forwardauth" {
      driver = "docker"
      env {
        #        https://brianturchyn.net/traefik-forwardauth-support-with-keycloak/
        #        https://github.com/mesosphere/traefik-forward-auth/issues/36
        #        INSECURE_COOKIE = "1"
        ENCRYPTION_KEY = "45659373957778734945638459467936" #32 character encryption key
        #        SCOPE = "profile email openid" # scope openid is necessary for keycloak...
        SECRET        = "9e7d7b0776f032e3a1996272c2fe22d2"
        PROVIDER_URI  = "https://security.cloud.private/realms/nomadder"
        #        OIDC_ISSUER   = "https://security.cloud.private/realms/nomadder"
        CLIENT_ID     = "ingress"
        LOG_LEVEL     = "debug"
        # Lifetime of cookie 60s
        LIFETIME = "60"

      }
      volume_mount {
        volume      = "ca_cert"
        destination = "/etc/ssl/certs/"
      }
      config {
        image = "registry.cloud.private/mesosphere/traefik-forward-auth:3.1.0"
        ports = ["auth"]
      }
      resources {
        cpu    = 500
        memory = 256
      }
      template {
              destination = "${NOMAD_SECRETS_DIR}/env.vars"
              env         = true
              change_mode = "restart"
              data        = <<EOF
      {{- with nomadVar "nomad/jobs/security" -}}
        CLIENT_SECRET      = {{.keycloak_ingress_secret}}
      {{- end -}}
      EOF
           }
      }
  }

  group "keycloak" {
     restart {
       attempts = -1
       interval = "5s"
       delay = "5s"
       mode = "delay"
     }
    count = 1
    network {
      mode = "bridge"
      port "ui" {
        to = 8080
      }
    }

    service {
      name = "keycloak"
#      port = "ui"
      port = "8080"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "security-postgres"
              local_bind_port  = 5432
            }
          }
        }
      }
      tags = [
        "traefik.enable=true",
        "traefik.consulcatalog.connect=true",
        "traefik.http.routers.keycloak.tls=true",
        "traefik.http.routers.keycloak.rule=Host(`security.cloud.private`)",
      ]


      check {
        name  = "health"
        type  = "http"
        port ="ui"
        path="/health"
        interval = "10s"
        timeout  = "2s"
      }
    }
      task "await-for-security-postgres" {
        driver = "docker"

        config {
          image        = "registry.cloud.private/busybox:1.28"
          command      = "sh"
          args         = ["-c", "echo -n 'Waiting for service security-postgres'; until nslookup security-postgres.service.consul 2>&1 >/dev/null; do echo '.'; sleep 2; done"]
          #network_mode = "host"
        }

        resources {
          cpu    = 200
          memory = 128
        }

        lifecycle {
          hook    = "prestart"
          sidecar = false
        }
      }
    task "keycloak" {
      driver = "docker"
      env {
        KEYCLOAK_ADMIN  = "admin"
        KC_HTTP_ENABLED= "true"
        KC_HOSTNAME_STRICT_HTTPS="false"
        KC_HEALTH_ENABLED= "true"
        KC_HOSTNAME="security.cloud.private"
        KC_PROXY="edge"
        KC_DB                     = "postgres"
        KC_DB_SCHEMA              = "keycloak"
        KC_DB_USERNAME            = "keycloak"
        KC_DB_URL_HOST            = "${NOMAD_UPSTREAM_IP_security_postgres}"
        KC_DB_URL_PORT            = "${NOMAD_UPSTREAM_PORT_security_postgres}"
      }
      config {
        image = "registry.cloud.private/stack/core/keycloak:20.0.2.0"
        ports = ["ui"]
        args = [
          "start", "--import-realm" , "--optimized"
        ]
      }
      resources {
        cpu    = 1000
        memory = 2048
      }
    template {
            destination = "${NOMAD_SECRETS_DIR}/env.vars"
            env         = true
            change_mode = "restart"
            data        = <<EOF
    {{- with nomadVar "nomad/jobs/security" -}}
      KEYCLOAK_ADMIN_PASSWORD      = {{.keycloak_password}}
      KC_DB_PASSWORD               = {{.keycloak_db_password}}
      KC_NOMADDER_CLIENT_SECRET    = {{.keycloak_ingress_secret}}
      KC_NOMADDER_CLIENT_SECRET_GRAFANA    = {{.keycloak_secret_observability_grafana}}
    {{- end -}}
    EOF
         }
    }
  }
}

Nomad Server logs (if appropriate)

2023-01-11T17:54:45.764Z [ERROR] client.alloc_runner: prerun failed: alloc_id=174831a8-3c62-ef72-0f3a-c0e7bac072d9 error="pre-run hook \"network\" failed: failed to configure networking for alloc: failed to configure network: plugin type=\"bridge\" failed (add): failed to allocate for range 0: 172.26.68.38 has been allocated to 174831a8-3c62-ef72-0f3a-c0e7bac072d9, duplicate allocation is not allowed"

nomad job status security

**Allocations**
ID                                    Eval ID                               Node ID                               Node Name  Task Group        Version  Desired  Status   Created               Modified
22403064-1a04-ea05-a9fa-a73d17d8c5b5  dba9d320-e89b-b309-dd97-9ba854f38fdd  afe0c908-8361-d05b-3904-8782c6cbfdc6  worker-01  keycloak-db       3        run      running  2023-01-12T09:50:26Z  2023-01-12T09:50:42Z
5692cb51-8bdd-580e-fe34-79a7f29d8189  dba9d320-e89b-b309-dd97-9ba854f38fdd  afe0c908-8361-d05b-3904-8782c6cbfdc6  worker-01  keycloak          3        run      running  2023-01-12T09:50:26Z  2023-01-12T09:50:41Z
acfab263-015f-2b1f-d850-675a089b2aae  dba9d320-e89b-b309-dd97-9ba854f38fdd  afe0c908-8361-d05b-3904-8782c6cbfdc6  worker-01  keycloak-ingress  3        run      running  2023-01-12T09:50:26Z  2023-01-12T09:50:41Z
77df20cf-3e02-88e4-5aa5-4b4d95f2d4e2  e0b0efdf-a648-0e73-3325-886b75e7f485  afe0c908-8361-d05b-3904-8782c6cbfdc6  worker-01  keycloak-ingress  3        stop     failed   2023-01-11T18:06:06Z  2023-01-12T09:50:40Z
7b8a0b52-586d-7ec7-0942-1087ca06e8de  e0b0efdf-a648-0e73-3325-886b75e7f485  afe0c908-8361-d05b-3904-8782c6cbfdc6  worker-01  keycloak-db       3        stop     failed   2023-01-11T18:06:06Z  2023-01-12T09:50:43Z
9fdcdb64-156b-1d59-d0f1-b289bdb57159  e0b0efdf-a648-0e73-3325-886b75e7f485  afe0c908-8361-d05b-3904-8782c6cbfdc6  worker-01  keycloak          3        stop     failed   2023-01-11T18:06:06Z  2023-01-12T09:50:42Z
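
For context on the error itself: the "duplicate allocation is not allowed" message comes from the CNI bridge plugin's host-local IPAM, which persists IP reservations as files on disk and can be left holding stale entries after an unclean reboot. A minimal sketch of how those reservations could be inspected and, if necessary, cleared by hand (assuming the default data directory /var/lib/cni/networks and Nomad's default bridge network name "nomad"; this is a manual workaround, not a fix):

# List the IPs currently reserved for Nomad's bridge network
ls /var/lib/cni/networks/nomad/
# Each file is named after a reserved IP and records the ID of the
# allocation (container) holding it
cat /var/lib/cni/networks/nomad/172.26.68.38
# With the Nomad client stopped, a stale reservation can be removed so
# the next allocation gets a clean lease (use with care)
sudo systemctl stop nomad
sudo rm /var/lib/cni/networks/nomad/172.26.68.38
sudo systemctl start nomad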
@lgfa29 lgfa29 added theme/networking stage/accepted Confirmed, and intend to work on. No timeline commitment though. labels Jan 25, 2023
@lgfa29
Contributor

lgfa29 commented Jan 25, 2023

Hi @suikast42 👋

I was able to reproduce this with the nomad job init -short -connect example, but I believe this was fixed in #15407, which will be out in the next release of Nomad.
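
For anyone following along, a rough sketch of that reproduction (the generated file name and job name come from the nomad job init example and may differ between Nomad versions):

nomad job init -short -connect      # writes a minimal Connect-enabled example job
nomad job run example.nomad.hcl     # file may be named example.nomad on older versions
sudo reboot                         # reboot the client node
nomad job status example            # after the reboot, check the allocation states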

If you are able to compile from main and test whether the fix works in your case, that would be awesome; otherwise we can wait until the release is out.

I will close this one for now, but let us know if the problem still happens in the new version and we will reopen it 🙂

@lgfa29 lgfa29 closed this as completed Jan 25, 2023
@suikast42
Contributor Author

If you are able to compile from main and test whether the fix works in your case, that would be awesome; otherwise we can wait until the release is out.

Sure, I can do it. Is it enough to check out the main branch and run go build? 😜

With a little introduction to the Nomad build process I can test it.

@suikast42
Contributor Author

If you are able to compile from main and test whether the fix works in your case, that would be awesome; otherwise we can wait until the release is out.

OK, I found it here: https://github.com/hashicorp/nomad/tree/main/contributing.
I will try it out.
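
For reference, a rough sketch of the build steps from that guide (assuming Go and a recent Node.js are installed; target names as documented in the contributing guide):

git clone https://github.com/hashicorp/nomad.git
cd nomad
make bootstrap    # installs the required build tooling
make dev          # builds a development binary into ./bin/nomad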

@suikast42
Contributor Author

I ended up with:
error @percy/cli@1.6.1: The engine "node" is incompatible with this module. Expected version ">=14". Got "12.22.9"
error Found incompatible module.
make: *** [GNUmakefile:358: ember-dist] Error 1

@lgfa29
Contributor

lgfa29 commented Jan 26, 2023

Ah yeah, it can be tricky to build with the UI stuff 😅

I'm generating a custom binary; it will be available at the bottom of the page here: https://github.com/hashicorp/nomad/actions/runs/4011598465

Just a reminder that these are development binaries, so they should not be run in a production environment; make sure they don't point to any production data.

@suikast42
Contributor Author

This is still happening with Nomad 1.5.0.rc1 and Consul 1.15.0.

[screenshot]

@faryon93

faryon93 commented Mar 5, 2023

I can confirm: this happens with Nomad 1.5.0 from the Debian package repo. Same error message as @suikast42.

@lgfa29 please reopen this issue, as the problem seems not to be fixed in v1.5.0

@lgfa29
Contributor

lgfa29 commented Mar 10, 2023

Oh no, sorry to hear that. Re-opening since it's still an issue.

@lgfa29 lgfa29 reopened this Mar 10, 2023
@lgfa29 lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 10, 2023
@suikast42
Contributor Author

I don't know if this helps with solving the issue, but I observe that after a reboot Nomad continues to gather metrics from dead containers until I restart the Nomad service.

[screenshot]

@lgfa29
Contributor

lgfa29 commented Mar 11, 2023

Hum... you mentioned you upgraded to 1.5.0, right? I wonder if this may be related to #16352 🤔

@faryon93

faryon93 commented Mar 11, 2023

@lgfa29 thanks for reopening the issue :)

I think your theory is correct. After disabling the dangling_containers option, as mentioned in #16352 as a workaround, the containers are starting as expected. Nomad 1.5.1 will be released next Monday, so I will test with the dangling_containers feature enabled and get back to you.
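
For anyone who needs the same workaround, a minimal sketch of the Nomad client configuration that disables dangling container GC in the Docker driver (option names per the Docker driver plugin docs; adjust to your own config layout):

plugin "docker" {
  config {
    gc {
      dangling_containers {
        enabled = false
      }
    }
  }
}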

@suikast42
Contributor Author

@lgfa29 thanks for reopening the issue :)

I think your theory is correct. After disabling the dangling_containers option, as mentioned in #16352 as a workaround, the containers are starting as expected. Nomad 1.5.1 will be released next Monday, so I will test with the dangling_containers feature enabled and get back to you.

Me too 😜

@lgfa29
Contributor

lgfa29 commented Mar 13, 2023

Thanks for testing. So yeah, I think that was a different problem. 1.5.1 went out today and it should fix this one.

@faryon93

Just upgraded to Nomad 1.5.1 and rebooting works as expected. @lgfa29, the nomad_init containers are no longer garbage collected, so I think your suspicion was correct.

From my point of view, this issue can be closed :)

@lgfa29
Contributor

lgfa29 commented Mar 14, 2023

Nice, I'm glad it's all working for you now 🙂

I think the original issue was about something else so I will keep this open until @suikast42 can confirm the original problem has been fixed 👍

@suikast42
Contributor Author

I can confirm, too

@suikast42
Contributor Author

But I figured out a new problem, described here: #16453

@lgfa29
Contributor

lgfa29 commented Mar 14, 2023

Cool, thank you for the confirmation. I'm going to close this one.

@lgfa29 lgfa29 closed this as completed Mar 14, 2023
Nomad - Community Issues Triage automation moved this from Needs Triage to Done Mar 14, 2023
@axsuul
Contributor

axsuul commented Apr 3, 2023

I still seem to be having this error on 1.5.2. Here is an example error I'm getting after rebooting the server; many jobs don't start back up:

failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.65.119 has been allocated to b4edbaae-3b20-aa1c-400a-787ceff7c636, duplicate allocation is not allowed

@suikast42
Contributor Author

I still seem to be having this error on 1.5.2. Here is an example error I'm getting after rebooting the server; many jobs don't start back up:

I can confirm. See #16893
