
Nomad is not ready to serve consistent reads when multiple tasks enter restart loop #15579

Open · Dgotlieb opened this issue Dec 18, 2022 · 3 comments

Comments

@Dgotlieb (Contributor) commented Dec 18, 2022

Nomad version

Nomad v1.4.3 (f464aca)

Operating system and Environment details

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal

Infra resources

10 clients (3 of which are also servers), each with the spec below:

 $ nomad node status <NODE_ID>
Allocated Resources
CPU                Memory          Disk
169698/332800 MHz  91 GiB/126 GiB  1.5 TiB/3.2 TiB

Allocation Resource Utilization
CPU               Memory
77229/332800 MHz  36 GiB/126 GiB

Host Resource Utilization
CPU               Memory          Disk
77324/332800 MHz  47 GiB/126 GiB  291 GiB/3.4 TiB

Issue

I have a job with 780 groups, each containing 2 tasks.
When all of the tasks enter a restart loop (all crashing together), Nomad doesn't handle it well.

Reproduction steps

Run the job file below and force all of its tasks into a restart loop (the sketch below shows one way to do that).
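Any change that makes the task command exit non-zero will trigger the loop. A minimal illustrative sketch, using a hypothetical busybox stand-in rather than the real sensor image:

task "sensor" {
  driver = "docker"

  config {
    # Hypothetical stand-in image/command; any container that exits
    # non-zero right away drives the group into its restart/reschedule loop.
    image   = "busybox:1.35"
    command = "sh"
    args    = ["-c", "exit 1"]
  }
}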

Expected Result

Nomad should handle the restart loop and not get stuck.

Actual Result

  • UI - all tabs are stuck and often show a server error message (screenshot omitted)

  • CLI - no operations work; the CLI and server logs show errors such as:

Unexpected response code: 500 (rpc error: Not ready to serve consistent reads
evaluation reached delivery limit (3)
[ERROR] nomad.raft: failed to take snapshot: error="cannot take snapshot now, wait until the configuration entry at 125550 has been applied (have applied 123884)"
 [ERROR] worker: error waiting for Raft index: worker_id=04faa162-292c-a943-b8ec-ca3b9af37be1 error="timed out after 5s waiting for index=125843" index=125843
error fetching node stats: actual resource usage not present
nomad: memberlist: Refuting a suspect message
nomad: yamux: failed to send ping reply: session shutdown
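
For context, the servers' raft/membership state can be inspected with the standard CLI below; these are generic commands, not captured output from this incident:

 $ nomad server members            # gossip view of the server fleet and current leader
 $ nomad operator raft list-peers  # raft peer set and leadership
 $ nomad agent-info                # local agent stats, including raft applied/commit indices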

Job file

variable "sensors" {
  type = list(object({
    id   = string
    host = string
  }))
}

job "test" {
  type        = "service"
  datacenters = ["dc1"]

  reschedule {
    delay          = "10s"
    delay_function = "exponential"
    max_delay      = "10m"
    unlimited      = true
  }

  constraint {
    // Limit the number of tasks on each node
    distinct_property = "${attr.unique.hostname}"
    value             = "100"
  }

  dynamic "group" {
    for_each = var.sensors
    iterator = sensor
    labels   = ["sensor_${sensor.value.id}"]

    content {
      count = 1

      ephemeral_disk {
        migrate = true
        size    = 16000
        sticky  = true
      }

      network {
        mode = "bridge"

        port "sensor_http" {
          to = 9000
        }
        port "capture_http" {
          to = 9001
        }
        port "capture_grpc" {
          to = 9002
        }
        port "metrics" {
          to = 9003
        }
        port "envoy_metrics" {
          to = 9004
        }
        port "jpegs" {
          to = 9005
        }
        port "predictions" {
          to = 9006
        }
      }

      restart {
        interval = "10m"
        attempts = 10
        delay    = "60s"
        mode     = "delay"
      }

      service {
        name = "sensor"
        port = 9000

        meta {
          metrics_port       = "${NOMAD_HOST_PORT_sensor_http}"
          envoy_metrics_port = "${NOMAD_HOST_PORT_envoy_metrics}"
          logical_node       = "sensor_${sensor.value.id}"
          sensor_id          = "${sensor.value.id}"
        }
      }

      task "sensor" {
        driver = "docker"

        config {
          ports = ["sensor_http"]
          image = "my_image:1.0.0"
          args  = ["service", "-c", "/local/config.json", "--sources-config-url", "file:///local/sources.json"]
          runtime = "nvidia"

          labels {
            service = "sensor"
          }
        }

        template {
          data        = file("sensor.json")
          destination = "local/config.json"
          change_mode = "restart"
        }

        template {
          data        = <<EOH
          {
            "sources": [
              {
                "kind": {
                  "sensor": {
                    "ip": "${sensor.value.host}",
                    "kind": "sensor"
                  }
                },
                "id": "${sensor.value.id}"
              }
            ]
          }
          EOH
          destination = "local/sources.json"
          change_mode = "restart"
        }

        logs {
          max_files     = "2"
          max_file_size = 100
        }

        resources {
          memory     = 500  # MiB
          memory_max = 1000 # MiB
          cpu        = 1000 # in MHZ
        }
      }

      service {
        name = "sensor-storage"
        port = 9002

        meta {
          metrics_port       = "${NOMAD_HOST_PORT_capture_http}"
          envoy_metrics_port = "${NOMAD_HOST_PORT_envoy_metrics}"
          logical_node       = "sensor_${sensor.value.id}"
          exposed_grpc_port = "${NOMAD_HOST_PORT_capture_grpc}"
          sensor_id         = "${sensor.value.id}"
        }
      }

      task "capture_storage" {
        driver = "docker"
        user = "root"

        config {
          ports = ["sensor_http"]
          image = "my_image2:1.0.0"
          args  = ["-c", "/local/config.json"]

          labels {
            service = "capture"
          }
        }

        template {
          data        = file("capture_storage.json")
          destination = "local/config.json"
          change_mode = "restart"
        }

        logs {
          max_files     = "2"
          max_file_size = 100
        }

        resources {
          memory     = 200  # MiB
          memory_max = 1000 # MiB
          cpu        = 250  # in MHZ
        }
      }
    }
  }
}

Server.hcl

server {
    enabled = true

    bootstrap_expect = 3
    server_join {
      retry_join = ["server1", "server2", "server3", "server4"]
      retry_max = 3
      retry_interval = "15s"
    }


    rejoin_after_leave = false
    enabled_schedulers = ["service","batch","system","sysbatch"]
    num_schedulers = 128
    node_gc_threshold = "24h"
    eval_gc_threshold = "30s"
    job_gc_threshold = "4h"
    deployment_gc_threshold = "1h"

    encrypt = ""

    raft_protocol = 3
    default_scheduler_config {
        scheduler_algorithm = "spread"
        memory_oversubscription_enabled = true
        preemption_config {
          service_scheduler_enabled = true
        }
    }

    failover_heartbeat_ttl = "30s"
    heartbeat_grace = "2m"
}

Client.hcl

client {
    enabled = true

    node_class = ""
    no_host_uuid = false

    max_kill_timeout = "30s"
    network_speed = 0
    cpu_total_compute = 0

    gc_interval = "1m"
    gc_disk_usage_threshold = 80
    gc_inode_usage_threshold = 70
    gc_parallel_destroys = 2
    gc_max_allocs = 200

    reserved {
        cpu = 0
        memory = 0
        disk = 0
    }

    meta = {
        "instance_type" = "server"
    }
}

plugin "raw_exec" {
    config {
        enabled = true
    }
}
plugin "docker" {
    config {
        infra_image = "gcr.io/google_containers/pause-amd64:3.1"
        allow_privileged = true
        allow_caps = ["audit_write", "chown", "dac_override", "fowner", "fsetid", "kill", "mknod", "net_bind_service", "setfcap", "setgid", "setpcap", "setuid", "sys_chroot", "net_raw", "sys_time", "sys_ptrace"]
        volumes {
            enabled = true
        }
    }
}
plugin "nvidia-gpu" {
    config {
        enabled = true
        fingerprint_period = "1m"
    }
}

servers = ["server1", "server2"]

server_join {
    retry_join = ["server1", "server2"]
}

Suspected cause

I suspect the size of the rendered job spec, which contains 780 groups, is the issue.
I'm not sure how the job size relates to the raft indices, but many of the errors are raft-related.

Worth Mentioning

  1. I removed the Connect blocks from the groups to reduce the number of containers, health checks, and Consul-related operations.
  2. When the tasks are not restarting, the job takes a few minutes to schedule and then runs well.
  3. All operations, including $ nomad system gc, $ nomad system reconcile summaries, and $ nomad stop <job_id> -purge, throw the errors mentioned above, leaving no way to repair the state (even manually).
  4. I was under the assumption that this would fix it.
  5. I know Nomad was tested in the 2M containers challenge, so in terms of container count I think I'm fine 🙂, but I wonder whether a restart loop across many containers is something that was tested and that the scheduler should be able to handle.

Questions

  1. Is there a limit on the number of groups in a single job spec? If so, what is it?
  2. If I reach this vicious cycle again, is there a way to ask the cluster to stop all scheduling operations (mainly evaluations) - some "super admin" action so that my operator actions take precedence and bypass all existing requests in the queue?
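
One possibly relevant control for question 2 - assuming a Nomad version that supports pausing the evaluation broker, which I believe 1.3.2 and later do - would be something like the sketch below; I have not verified that it helps once the cluster is already in this state:

 $ nomad operator scheduler get-config                           # inspect current scheduler settings
 $ nomad operator scheduler set-config -pause-eval-broker=true   # stop the servers from processing evaluations
 $ nomad operator scheduler set-config -pause-eval-broker=false  # resume scheduling once things settle
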
@shoenig (Member) commented Dec 20, 2022

Thanks for reporting, @Dgotlieb. There is no hard limit on the number of groups that can exist in a job spec; however, as you've found, you may end up overwhelming the servers' ability to process the number of evaluations that get created in a pathological case like all of those tasks entering a crash loop. There's been some work around load shedding of evaluations recently; we'll want to see whether any of that applies to this case or if we need to do something extra.

@Dgotlieb (Contributor, Author) commented Dec 21, 2022

OK. I will also add that even once the job was stopped, I still saw errors in the cluster logs, servers disappearing from the UI/CLI, failures in Consul health checks on ports 4646/4647/4648, and other issues that could take hours to resolve on their own.

@shoenig just a few more points:

  1. With regard to my second question, is there any way to bypass all eval/scheduling requests in the cluster queue to prevent the cluster from entering this vicious cycle?
  2. Can you please explain the relationship between evaluation volume/frequency and raft, since most of the errors are raft-related?
  3. Can you suggest a version/branch to try out?
  4. Is there any way to set a maximum number of parallel restarts (cluster-wide), so that even when thousands of tasks are restarting together, something in the cluster limits those restarts (e.g. a hypothetical server_max_parallel_restarts)?

Thanks

@jamesearl commented

I would also be interested in the answers to the above questions.
