Nomad's native service registration fails if raw_exec and docker drivers are combined in the same task group #13483

Closed
groggemans opened this issue Jun 24, 2022 · 5 comments · Fixed by #13493
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/service-discovery type/bug

Comments

@groggemans
Contributor

groggemans commented Jun 24, 2022

Nomad version

Nomad v1.3.1 (2b054e3)

Operating system and Environment details

Linux + docker

Issue

Native service registration fails if a raw_exec task and a docker task are combined in the same task group.
(using two docker tasks in the same task group isn't an issue)

Reproduction steps

The example job below runs two counting loops in a sh script, one in a container and the other directly on the host.
It includes a dummy port and service registration that are never actually used, but that makes no difference. (The issue was initially discovered with a job that did use the port, but using it is not needed to reproduce the behavior.)
I discovered it with a prestart hook as the raw_exec task, but using an actual long-living task seems to have the same effect; a rough sketch of such a prestart task follows.
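
For reference, a prestart raw_exec task of that kind looks roughly like the sketch below (the task name and command are placeholders; the repro job further down omits the lifecycle block since it is not needed to trigger the bug):

task "setup" {
  driver = "raw_exec"

  # Runs to completion before the other tasks in the group start;
  # drop this block for a normal long-living task.
  lifecycle {
    hook    = "prestart"
    sidecar = false
  }

  config {
    command = "/bin/sh"
    args    = ["-c", "echo preparing && sleep 1"]
  }
}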

Expected Result

The service is registered and can be checked with nomad service list.
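
For example, once the allocation is running, something along these lines would be expected (illustrative output only; the exact formatting may differ between Nomad versions):

$ nomad service list
Service Name  Tags
counter       []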

Actual Result

The service fails to register with the error below:

Task hook failed | task_services: rpc error: service registration insert failed: object missing primary index
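
The same message also shows up as a task event, so it can be found under "Recent Events" in the allocation status (the allocation ID and timestamp below are placeholders, and the layout is approximate):

$ nomad alloc status <alloc-id>
...
Recent Events:
Time    Type              Description
<time>  Task hook failed  task_services: rpc error: service registration insert failed: object missing primary index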

Job file (if appropriate)

job "example" {
  datacenters = ["dc1"]

  group "counter" {
    network {
      port "random" {
      }
    }

    task "raw" {
      driver = "raw_exec"
      config {
        command = "${NOMAD_TASK_DIR}/my-script.sh"
      }
      template {
        destination = "${NOMAD_TASK_DIR}/my-script.sh"
        perms = "755"
        data = <<EOF
#!/bin/sh
i=0

while true; do
  echo "$i"
  i=$((i+1))
  sleep 2
done

EOF
      }
    }

    task "docker" {
      driver = "docker"
      config {
        image = "alpine:3.13.4"
        command = "${NOMAD_TASK_DIR}/my-script.sh"
        ports = ["random"]
      }
      service {
        provider = "nomad"
        name = "counter"
        port = "random"
      }

      template {
        destination = "${NOMAD_TASK_DIR}/my-script.sh"
        perms = "755"
        data = <<EOF
#!/bin/sh
i=0

while true; do
  echo "$i"
  i=$((i+1))
  sleep 2
done

EOF
      }
    }
  }
}
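
To reproduce, save the job spec and run it; the file name example.nomad is just an assumption:

$ nomad job run example.nomad
$ nomad service list    # "counter" never shows up while the bug is present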
@jrasell jrasell self-assigned this Jun 27, 2022
@jrasell jrasell added stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/service-discovery labels Jun 27, 2022
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 27, 2022
@jrasell jrasell moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 27, 2022
@jrasell
Member

jrasell commented Jun 27, 2022

Hi @groggemans and thanks for raising this issue along with a great reproduction. I have been able to confirm this problem locally and have identified the fix required. I will raise a PR shortly with this once I have written a test case to cover this behaviour.

@attachmentgenie
Contributor

attachmentgenie commented Jul 28, 2022

@jrasell is this supposed to be part of 1.3.2? In any case, I am still observing it there with the following job:

job "webapp" {
  datacenters = ["dc1"]

  group "demo" {
    count = 3

    network {
      port "webapp_http" {}
      port "toxiproxy_webapp" {}
    }

    scaling {
      enabled = true
      min     = 1
      max     = 20

      policy {
        cooldown = "20s"

        check "avg_sessions" {
          source = "prometheus"
          query  = "sum(traefik_entrypoint_open_connections{entrypoint=\"webapp\"})/scalar(nomad_nomad_job_summary_running{task_group=\"demo\"})"

          strategy "target-value" {
            target = 5
          }
        }
      }
    }

    task "webapp" {
      driver = "docker"

      config {
        image = "hashicorp/demo-webapp-lb-guide"
        ports = ["webapp_http"]
      }

      env {
        PORT    = "${NOMAD_PORT_webapp_http}"
        NODE_IP = "${NOMAD_IP_webapp_http}"
      }

      resources {
        cpu    = 100
        memory = 16
      }
    }

    task "toxiproxy" {
      driver = "docker"

      lifecycle {
        hook    = "prestart"
        sidecar = true
      }

      config {
        image      = "shopify/toxiproxy:2.1.4"
        entrypoint = ["/entrypoint.sh"]
        ports      = ["toxiproxy_webapp"]

        volumes = [
          "local/entrypoint.sh:/entrypoint.sh",
        ]
      }

      template {
        data = <<EOH
#!/bin/sh

set -ex

/go/bin/toxiproxy -host 0.0.0.0  &

while ! wget --spider -q http://localhost:8474/version; do
  echo "toxiproxy not ready yet"
  sleep 0.2
done

/go/bin/toxiproxy-cli create webapp -l 0.0.0.0:${NOMAD_PORT_toxiproxy_webapp} -u ${NOMAD_ADDR_webapp_http}
/go/bin/toxiproxy-cli toxic add -n latency -t latency -a latency=1000 -a jitter=500 webapp
tail -f /dev/null
        EOH

        destination = "local/entrypoint.sh"
        perms       = "755"
      }

      resources {
        cpu    = 100
        memory = 32
      }

      service {
        name     = "webapp"
        provider = "nomad"
        port     = "toxiproxy_webapp"
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.webapp.entrypoints=webapp",
          "traefik.http.routers.webapp.rule=PathPrefix(`/`)"
        ]
      }
    }
  }
}

@lgfa29
Contributor

lgfa29 commented Sep 2, 2022

is this supposed to be part of 1.3.2?

Ah, sorry about that. Yes, the initial plan was to have it shipped in 1.3.2, but the backport never happened, so it was shipped in 1.3.4.

Apologies for the confusion.

@attachmentgenie
Contributor

I can confirm this now works in 1.3.5.

@github-actions

github-actions bot commented Jan 4, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 4, 2023