
Autoscaler can't connect to UNIX domain socket. #955

Closed
nwmqpa opened this issue Aug 20, 2024 · 2 comments · Fixed by #966

nwmqpa commented Aug 20, 2024

Nomad server version: 1.8.3
Nomad client version: 1.8.3
Nomad autoscaler version: 0.4.5 (with custom plugin)

I've started experimenting with Nomad workload identity and the Nomad autoscaler, and it seems that the patch for #944 isn't working for me.

Relevant log
2024-08-20T13:54:56.590Z [WARN]  internal_plugin.nomad-target: failed to read job scale status, retrying in 10 seconds: Job=service2 error="Get \"http://127.0.0.1/v1/job/service2/scale?namespace=XXXX&region=playground&wait=300000ms\": dial tcp 127.0.0.1:80: connect: connection refused"
2024-08-20T13:54:56.084Z [WARN]  internal_plugin.nomad-target: failed to read job scale status, retrying in 10 seconds: Job=service1 error="Get \"http://127.0.0.1/v1/job/service1/scale?namespace=XXXX&region=playground&wait=300000ms\": dial tcp 127.0.0.1:80: connect: connection refused"
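
For reference, a minimal Go sketch of what I expect the client setup to do with the task API socket (the job name, namespace, and the use of api.DefaultConfig/Jobs().ScaleStatus here are illustrative, not taken from the autoscaler source):

package main

import (
	"fmt"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Task API socket and token exposed via the workload identity; both come
	// from the task environment at runtime.
	cfg := api.DefaultConfig()
	cfg.Address = "unix://" + os.Getenv("NOMAD_SECRETS_DIR") + "/api.sock"
	cfg.SecretID = os.Getenv("NOMAD_TOKEN")

	client, err := api.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	// Mirrors the call failing in the log above (/v1/job/<job>/scale). With a
	// unix:// address this should travel over the socket, not over TCP to
	// 127.0.0.1:80. The job ID and namespace are placeholders.
	status, _, err := client.Jobs().ScaleStatus("service1", &api.QueryOptions{Namespace: "XXXX"})
	if err != nil {
		panic(err)
	}
	fmt.Println(status.JobID, status.JobStopped)
}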
Full job HCL
job "nomad-autoscaler" {
  namespace = "infrastructure"

  priority = 98

  group "nomad-autoscaler" {

    network {
      mode = "bridge"
      port "http" {
        to = 8080
      }
    }

    service {
      name = "nomad-autoscaler"
      port = 8080

      connect {
        sidecar_service {
          proxy {
            transparent_proxy {}
          }
        }

        sidecar_task {
          resources {
            cpu    = 20
            memory = 50
          }
        }
      }

      check {
        expose   = true
        type     = "http"
        port     = "http"
        path     = "/v1/health"
        interval = "3s"
        timeout  = "1s"
      }
    }

    task "nomad-autoscaler" {
      driver = "docker"

      leader = true

      vault {
        role        = "infrastructure-nomad-autoscaler"
        change_mode = "restart"
      }

      identity {
        env = true
      }

      config {
        image   = "nwmqpa/nomad-autoscaler-plugins:v0.2.2"
        command = "nomad-autoscaler"
        ports   = ["http"]
        args = [
          "agent",
          "-config=${NOMAD_TASK_DIR}/config",
          "-config=${NOMAD_SECRETS_DIR}/config",
          "-log-level=INFO"
        ]
      }

      template {
        data        = <<-EOH
          // This file is responsible for the configuration of the nomad autoscaler agent
          
          telemetry {
            prometheus_metrics = true
            disable_hostname   = true
          }
          
          http {
            bind_address = "0.0.0.0"
            bind_port    = 8080
          }
          
          nomad {
            address   = "unix://{{ env (" NOMAD_SECRETS_DIR " | trimSpace) }}/api.sock"
            token     = "{{ env (" NOMAD_TOKEN " | trimSpace) }}"
            namespace = "*"
          }
          
          apm "prometheus" {
            driver = "prometheus"
            config = {
              address              = "http://thanos-query-http-upstream.virtual.consul/thanos"
              header_THANOS-TENANT = "XXXX"
            }
          }
          
          strategy "target-value" {
            driver = "target-value"
          }
          
          strategy "threshold" {
            driver = "threshold"
          }
          
          policy {
            dir = "{{ env (" NOMAD_TASK_DIR " | trimSpace) }}/policies"
          }
          
          plugin_dir = "/opt/nomad-autoscaler/plugins"
          
        EOH
        destination = "${NOMAD_TASK_DIR}/config/config.hcl"
      }

      template {
        data          = <<-EOH
          // This file is responsible for communicating with AWS AutoScalingGroup
          
          target "aws_asg" {
            driver        = "nomad-aws-asg-target"
            config = {
              nomad_namespace = "*"
              nomad_address = "unix://{{ env (" NOMAD_SECRETS_DIR " | trimSpace) }}/api.sock"
              nomad_token   = "{{ env (" NOMAD_TOKEN " | trimSpace) }}"
              
              aws_region = "{{ $x := env (" attr.platform.aws.placement.availability-zone " | trimSpace) }}{{ $length := len $x | subtract 1 }}{{ slice $x 0 $length}}"
            }
          }
          
        EOH
        change_mode   = "signal"
        destination   = "${NOMAD_SECRETS_DIR}/config/aws.hcl"
        change_signal = "SIGHUP"
      }

      template {
        data          = <<-EOH
          // This file is a template for the cluster scaling configuration.
          
          // {{ range ls "nomad_autoscaler/clusters" }}
          // {{ $base_cluster := .Value | parseJSON }}
          // {{ $cluster := keyOrDefault (printf "nomad_autoscaler/clusters/%s/dynamic_config" .Key) "{}" | parseJSON | mergeMap $base_cluster }}
          // {{ $ignore_jobs := $cluster.ignore_exported_jobs | toJSON | parseJSON  }}
          scaling "{{ .Key }}_cluster_scaling" {
            enabled = true
            min     = "{{ $cluster.min_nodes }}"
            max     = "{{ $cluster.max_nodes }}"
          
            policy {
              cooldown            = "5m"
              evaluation_interval = "1m"
          
              check "cpu_usage" {
                  source = "prometheus"
                  group = "cpu-usage"
          
                  query = <<-EOQ
                      clamp_min(
                          (
                              {{ $cluster.reserved_instances }} + count(nomad_client_uptime{node_class='{{ $cluster.node_class }}', node_status='ready', node_scheduling_eligibility='eligible'}) - (
                                  count(
                                      nomad_client_unallocated_cpu{node_class='{{ $cluster.node_class }}', node_scheduling_eligibility='eligible', node_status='ready'} > (
                                          scalar(
                                              max(
                                                  sum(
                                                      nomad_client_allocs_cpu_allocated{exported_job!~'{{ $ignore_jobs.cpu }}'}
                                                  ) by (task_group, alloc_id)
                                              )
                                          )
                                      )
                                  ) or on() vector(0)
                              )
                          ),
                          {{ $cluster.min_nodes }}
                      ) or on() vector({{ $cluster.min_nodes }})
                  EOQ
          
                  strategy "pass-through" {}
              }
          
              check "memory_usage" {
                  source = "prometheus"
                  group = "memory-usage"
          
                  query = <<-EOQ
                      clamp_min(
                          (
                              {{ $cluster.reserved_instances }} + count(nomad_client_uptime{node_class='{{ $cluster.node_class }}', node_status='ready', node_scheduling_eligibility='eligible'}) - (
                                  count(
                                      nomad_client_unallocated_memory{node_class='{{ $cluster.node_class }}', node_scheduling_eligibility='eligible', node_status='ready'} > (
                                          scalar(
                                              max(
                                                  sum(
                                                      nomad_client_allocs_memory_allocated{exported_job!~'{{ $ignore_jobs.memory }}'}
                                                  ) by (task_group, alloc_id)
                                              ) / 1024 / 1024
                                          )
                                      )
                                  ) or on() vector(0)
                              )
                          ),
                          {{ $cluster.min_nodes }}
                      ) or on() vector({{ $cluster.min_nodes }})
                  EOQ
          
                  strategy "pass-through" {}
              }
          
              target "aws_asg" {
                node_class   = "{{ $cluster.node_class }}"
                aws_asg_name = "{{ $cluster.asg_name }}"
          
                dry_run                       = "false"
                node_purge                    = "true"
                node_drain_deadline           = "2m"
                node_drain_ignore_system_jobs = "true"
              }
            }
          }
          
          // {{ end }}
          
        EOH
        change_mode   = "signal"
        destination   = "${NOMAD_TASK_DIR}/policies/asg.hcl"
        change_signal = "SIGHUP"
      }

      template {
        data = <<-EOH
          {{- define "nomad_service_aws_config_key" -}}
            {{- $nomad_job_id := envOrDefault "NOMAD_JOB_PARENT_ID" (env "NOMAD_JOB_ID") -}}
            {{- printf "nomad/%s/%s/config/aws" (env "NOMAD_NAMESPACE") $nomad_job_id -}}
          {{- end -}}
          
          {{- define "aws_environment_variables" -}}
            AWS_ACCESS_KEY_ID={{.access_key }}
            AWS_SECRET_ACCESS_KEY={{.secret_key }}
          {{- end -}}

          {{- if keyExists (executeTemplate "nomad_service_aws_config_key") -}}
            {{- with (key (executeTemplate "nomad_service_aws_config_key") | parseJSON) -}}
              {{- if eq .creds_type "assumed_role" -}}
                {{- with secret .creds_name (printf "ttl=%s" .ttl ) -}}
                  {{ template "aws_environment_variables" .Data }}
                  AWS_SESSION_TOKEN={{.Data.security_token }}
                {{- end -}}
              {{- else -}}
                {{- with secret .creds_name -}}
                  {{- template "aws_environment_variables" .Data -}}
                {{- end -}}
              {{- end -}} 
            {{- end -}}
          {{- end -}}
        EOH

        destination = "${NOMAD_SECRETS_DIR}/aws.env"
        env         = true
      }

      resources {
        cpu    = 250
        memory = 128
      }
    }
  }
}
tgross self-assigned this Sep 3, 2024
tgross added the hcc/jira label Sep 3, 2024
tgross added a commit that referenced this issue Sep 5, 2024
For #944 we fixed the Nomad API package so that it no longer mutated the private
`url` field if previously set, which allowed reusing an `api.Config` object
between clients when a unix domain socket was in use.

However, the autoscaler plugins for Nomad strategy and target don't use the
`api.Config` object we parse directly and instead get a map of string->string
derived from that config so it can be passed over the go-plugin interface. This
mapping did not account for the `Address` field being mutated when unix domain
sockets are in use, so the bug was not actually fixed.

Update the mapping to use the safe `URL()` method on the config, rather than
reading the `Address` field.

Fixes: #955
Ref: hashicorp/nomad#23785
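
In code terms, the mapping change described above amounts to roughly the following sketch (the map key, helper name, and socket path are illustrative, and it assumes the URL() method mentioned in the commit returns a *url.URL):

package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// buildPluginConfig sketches the string map handed to the Nomad target and
// strategy plugins over the go-plugin interface.
func buildPluginConfig(cfg *api.Config) map[string]string {
	m := map[string]string{}

	// Before the fix: read the Address field directly. If the client rewrote
	// it while setting up a unix domain socket, the socket path is lost here.
	// m["nomad_address"] = cfg.Address

	// After the fix: derive the address from URL(), which preserves the
	// configured unix:// form regardless of later mutation of Address.
	if u := cfg.URL(); u != nil {
		m["nomad_address"] = u.String()
	}
	return m
}

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "unix:///secrets/api.sock" // placeholder socket path

	// The autoscaler builds clients from the parsed config before the plugin
	// map is constructed; NewClient is where the Address rewriting can occur.
	if _, err := api.NewClient(cfg); err != nil {
		panic(err)
	}

	fmt.Println(buildPluginConfig(cfg)["nomad_address"])
}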

tgross commented Sep 5, 2024

I've got a fix for this in #966


tgross commented Sep 11, 2024

I've merged #966 and that'll go out in the next version. I don't know when that's scheduled but I'll look into expediting it.
